Custom TextStripper / PDGraphicsState Not Reading Color

-A Sat, 26 Jul 2014 17:36:12 -0700

Hello again. I've been trying to figure out an issue that has come up for
me, and in my research I found someone on StackOverflow (
http://stackoverflow.com/questions/10844271/how-to-get-font-color-using-pdfbox)
describing a similar issue where they could not read any colors from a PDF.
The user posted the code, and someone else took it, ran it, and reported
that it worked. The user's approach was different from mine, but alas.


I'm not sure at this point what is going on. I have stepped through each
individual character and checked the PDGraphicsState object, and even when
I am looking at an open file with visibly red text (attached) the debugger
only reports DeviceGray. If I print out the ColorSpace name from the
PDGraphicsState, DeviceGray is what is printed for every character.

I would appreciate it if someone could run the attached text stripper with
the attached PDF file and report back whether it actually prints true
instead of false, as it does for me. Since I saw this occurrence elsewhere,
I'd like to rule out that an IDE setting of some sort may be causing this.

It should be noted that I began using PDFBox with 1.8.5 and had this code
working fine. Yesterday, still on 1.8.5, it was failing. Upgrading to 1.8.6
yielded the same results.

If this is an actual issue, I do not mind attempting to solve it if someone
has a general idea of where to point me, so as to prevent needless meddling
with graphics state objects. Or, if this should be reported, I can do that
as well.

Thanks!

-Aaron




*Previous Message:*



I’ve attached an updated stripper file with the only addition being a main
function to test the class specifically.

When run with the PDF I have also attached, it indeed does not recognize
the red text.

At this point it seems that this issue is solely dependent on PDFBox. I’ll
stay tuned for some insight hopefully. If any other information is needed,
let me know!

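// Imports assume the PDFBox 1.8.x package layout
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDGraphicsState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class IncrementalPDFStripper extends PDFTextStripper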
{

    /**
     * boolean to denote if a parsed file has red text in it
     */
    private boolean hasRed;


    /**
     * IncrementalPDFStripper constructor
     *
     * @throws java.io.IOException
     */
    public IncrementalPDFStripper() throws IOException
    {

        super();

        super.setSortByPosition(true);

        this.hasRed = false;    // initialize to no red

    }

    /**
     * Method to parse a PDF document.
     *
     * @param doc <code>PDDocument</code> of the PDF to be checked for red.
     * @return true if red text was found in the document, false otherwise.
     * @throws IOException
     */
    public boolean containsRed(PDDocument doc) throws IOException
    {


        /**
         * Set hasRed to false in case method is run with same object in memory
         */
        this.hasRed = false;

        /**
         * Get a list of pages within the document
         */
        List<PDPage> pages = doc.getDocumentCatalog().getAllPages();

        // FOR every page in the document
        for (PDPage page : pages) {
            // process the page
            processStream(page, page.getResources(),
                    page.getContents().getStream());
        }

        return hasRed;

    }

    /**
     * Overridden method with simple functionality added to set a flag
     * if a desired color is found.
     *
     * @param textPos <code>TextPosition</code> representing the current
     *                position in the page's text.
     */
    @Override
    protected void processTextPosition(TextPosition textPos)
    {
        try
        {
            PDGraphicsState graphicsState = getGraphicsState();

            // IF the current text contains RED
            if (graphicsState.getNonStrokingColor().getJavaColor().getRed() == 255)
            {
                this.hasRed = true;
            }

        }
        catch (IOException e)
        {
            throw new RuntimeException(e);
        }

    }

    public static void main(String[] args)
    {
        try
        {
            PDDocument doc = PDDocument.load(args[0]);

            IncrementalPDFStripper stripper = new IncrementalPDFStripper();

            System.out.println(stripper.containsRed(doc));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }


}
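
A side note on the check in processTextPosition: it only looks at the red
channel, so once colors are being read correctly it would probably also
flag white or yellow text. Below is a minimal sketch of a stricter test,
using only java.awt.Color so it can be run without PDFBox. The class name
RedCheck, both helper names, and the thresholds are my own assumptions for
illustration; none of them come from PDFBox.

```java
import java.awt.Color;

public class RedCheck
{
    // Hypothetical helper: mirrors the red-channel-only check used in
    // processTextPosition.
    static boolean naiveIsRed(Color c)
    {
        return c.getRed() == 255;
    }

    // Hypothetical stricter check: high red, low green and blue.
    // The thresholds are assumptions and may need tuning per document.
    static boolean strictIsRed(Color c)
    {
        return c.getRed() >= 200 && c.getGreen() <= 60 && c.getBlue() <= 60;
    }

    public static void main(String[] args)
    {
        Color[] samples = { Color.RED, Color.WHITE, Color.YELLOW, Color.BLACK };

        // Print both results side by side for each sample color
        for (Color c : samples)
        {
            System.out.println(c + " naive=" + naiveIsRed(c)
                    + " strict=" + strictIsRed(c));
        }
    }
}
```

This does not explain the DeviceGray result; it only narrows what would
count as red once the graphics state reports the real color.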
