I ran into something related, though my problem was slightly different: I just
wanted to know how many pages were being skipped so I could account for blank
pages.

Let's say you want to process a PDF using PDFTextStripper so your output
will be an ArrayList<Foo> with exactly one element per PDF page.

My solution is:
1 - track the number of PDF pages accounted for so far in a page count
variable
2 - every time we start working on a new page, and one last time after
PDFTextStripper.writeText has finished running, call
PDFTextStripper.getCurrentPageNo() and compare the result to the page count
variable; the gap between the two tells us how many pages (if any) were
skipped
3 - loop N times, where N is the number of skipped pages, and take the
appropriate action once per skipped page (increment the page count variable
by 1 and add to the ArrayList a new Foo object that represents a blank
page).  Then increment the page count variable by 1 once more to account
for the page with content that PDFTextStripper is currently parsing.

So that code (calling getCurrentPageNo(), comparing it to the page count
variable, then looping N times) goes in the overridden
PDFTextStripper.startPage method, and similar code runs after
PDFTextStripper.writeText returns to make sure you don't miss any blank
pages at the end.
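
Here is a rough sketch of just the bookkeeping, kept free of PDFBox types so
it can be read on its own. Foo, PageGapFiller, onStartPage and onFinish are
names I made up for illustration; none of them are part of PDFBox. I'm also
assuming that getCurrentPageNo() is 1-based and that after writeText it
reflects the last page of the document; if it doesn't in your version,
passing the document's page count to onFinish should work the same way.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the per-page result type.
class Foo {
    final boolean blank;
    Foo(boolean blank) { this.blank = blank; }
}

// Hypothetical helper (not part of PDFBox): keeps exactly one Foo per PDF page.
class PageGapFiller {
    private final List<Foo> pages = new ArrayList<>();
    private int pageCount = 0; // pages accounted for so far

    // Meant to be called from an overridden startPage(), passing the value
    // of getCurrentPageNo() (assumed to be 1-based).
    void onStartPage(int currentPageNo) {
        fillBlanks(currentPageNo - 1); // pages skipped before this one
        pages.add(new Foo(false));     // the page with content
        pageCount++;
    }

    // Meant to be called once after writeText() returns, passing the final
    // page number (assumption: the last page of the document).
    void onFinish(int lastPageNo) {
        fillBlanks(lastPageNo);        // trailing pages that were skipped
    }

    // Adds one blank Foo per missing page until pageCount reaches upTo.
    private void fillBlanks(int upTo) {
        while (pageCount < upTo) {
            pages.add(new Foo(true));
            pageCount++;
        }
    }

    List<Foo> pages() { return pages; }
}
```

For a 5-page document where only pages 1 and 3 have content, you would see
onStartPage(1), onStartPage(3), then onFinish(5), and end up with five Foo
objects, three of them marked blank.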

-

That might be helpful in your situation if you are going to run two things
in parallel (1 - run the PDFTextStripper to look at the text, and 2 - run
something else to look at the other content that PDFTextStripper skips).

-Michael Levy


On Tue, Apr 2, 2019 at 4:33 AM Tim Allison <[email protected]> wrote:

> All,
>   I just noticed this in PDFTextStripper's processPages():
>
> if (page.hasContents())
> {
>     processPage(page);
> }
>
> If a page has an embedded file, inline images, annotations etc, but no
> text content, does this mean we're skipping the page by accident?  In
> short, do we need to override processPages in Tika to process every
> page?
>
> Or, does "hasContents()" include anything... whether or not it is
> text-based?
>
> Thank you.
>
>          Best,
>
>                  Tim
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
