This issue came up for me too, although my problem was slightly different: I just wanted to know how many pages were being skipped so that I could deal with blank pages.
Let's say you want to process a PDF using PDFTextStripper so that your output is an ArrayList<Foo> with exactly one element per PDF page. My solution is:

1 - Track the number of PDF pages handled so far with a variable.
2 - Every time we start working on a new page, and then one last time after PDFTextStripper.writeText has finished running, call PDFTextStripper.getCurrentPageNo() and compare the result to the page number variable. The difference tells us how many pages (if any) were skipped.
3 - Loop N times, where N is the number of skipped pages, and take the appropriate action once for each skipped page (increment the page number variable by 1 and add to the ArrayList a new Foo object that represents a blank page). Then increment the page number variable by 1 to account for the page with content that PDFTextStripper is about to parse.

So that code (calling getCurrentPageNo(), comparing it to the page number variable, then iterating N times) is part of the overridden PDFTextStripper.startPage method, and similar code is executed after PDFTextStripper.writeText is done running to make sure you don't miss any blank pages at the end.

That might be helpful in your situation if you are going to run two things in parallel (1 - run the PDFTextStripper to look at the text, and 2 - run something else to look at the other content that PDFTextStripper skips).

-Michael Levy

On Tue, Apr 2, 2019 at 4:33 AM Tim Allison <[email protected]> wrote:

> All,
> I just noticed this in PDFTextStripper's processPages():
>
>     if (page.hasContents())
>     {
>         processPage(page);
>     }
>
> If a page has an embedded file, inline images, annotations etc, but no
> text content, does this mean we're skipping the page by accident? In
> short, do we need to override processPages in Tika to process every
> page?
>
> Or, does "hasContents()" include anything... whether or not it is
> text-based?
>
> Thank you.
>
> Best,
>
> Tim
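
P.S. The bookkeeping described at the top of this message could be sketched roughly as follows. This is only a minimal sketch of the counting logic, not code from PDFBox or Tika: the names PageGapFiller, PageEntry, onPageStart and onFinish are hypothetical, PageEntry stands in for Foo, and no PDFBox types appear so that the logic can be run on its own.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Keeps one entry per PDF page, filling in placeholders for pages the
 * text stripper never reported. No PDFBox types are used here.
 */
final class PageGapFiller {

    /** Hypothetical stand-in for the "Foo" element type. */
    static final class PageEntry {
        final int pageNo;
        final boolean blank;

        PageEntry(int pageNo, boolean blank) {
            this.pageNo = pageNo;
            this.blank = blank;
        }

        @Override
        public String toString() {
            return pageNo + (blank ? ":blank" : ":content");
        }
    }

    private final List<PageEntry> entries = new ArrayList<>();
    private int pagesRecorded = 0; // the "page number variable"

    /** Call at the start of each reported page with its 1-based page number. */
    void onPageStart(int currentPageNo) {
        fillBlanksThrough(currentPageNo - 1); // pages skipped before this one
        pagesRecorded++;
        entries.add(new PageEntry(pagesRecorded, false));
    }

    /** Call once after processing ends, with the last page number reached. */
    void onFinish(int lastPageNo) {
        fillBlanksThrough(lastPageNo); // trailing pages that were skipped
    }

    /** Adds blank placeholders until entries exist for pages 1..pageNo. */
    private void fillBlanksThrough(int pageNo) {
        while (pagesRecorded < pageNo) {
            pagesRecorded++;
            entries.add(new PageEntry(pagesRecorded, true));
        }
    }

    List<PageEntry> entries() {
        return entries;
    }

    public static void main(String[] args) {
        PageGapFiller filler = new PageGapFiller();
        filler.onPageStart(1); // page 1 has content
        filler.onPageStart(4); // pages 2 and 3 were skipped
        filler.onFinish(5);    // page 5 was skipped at the end
        System.out.println(filler.entries());
    }
}
```

In a PDFTextStripper subclass, the idea would be to call onPageStart(getCurrentPageNo()) from the overridden startPage method, and to call onFinish once writeText has returned, passing the page number reached at that point. With the calls in main above, the list ends up with five entries: content for pages 1 and 4, and blank placeholders for pages 2, 3 and 5.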

