Re: Text Extraction with multi-column documents in PDFBox

Ted Dunning Thu, 31 Mar 2011 10:29:08 -0700

Yes.  This use of the native flow works about 50-80% of the time in my
experience.  But it was way too error-prone to depend on and failed
spectacularly for many critical data sources.  Even where it worked, the
results were often not good enough.  For one thing, I needed real text flow
so that I could reliably reverse engineer hyphenation (for text indexing).
I also needed to reliably remove headers, footers, page numbers, article
titles and similar boilerplate across thousands of document sources without
hand engineering each kind of document.
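
For illustration only, one common heuristic for removing boilerplate without
per-source rules is to drop lines that repeat on most pages of a document.
The sketch below is not necessarily the approach used in the work described
here. It assumes the text is already split into pages and lines, and the
class name, method names and the minFraction threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BoilerplateFilter {

    // Collapse digit runs and whitespace so that "Page 3" and "Page 12"
    // compare as the same line.
    static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#").replaceAll("\\s+", " ");
    }

    // Drop every line whose normalized form appears on at least
    // minFraction of the pages (assumed heuristic, e.g. 0.6).
    static List<List<String>> strip(List<List<String>> pages, double minFraction) {
        // Count, for each normalized line, the number of pages it occurs on.
        Map<String, Integer> pageCounts = new HashMap<>();
        for (List<String> page : pages) {
            Set<String> seen = new HashSet<>();
            for (String line : page) {
                String key = normalize(line);
                if (!key.isEmpty() && seen.add(key)) {
                    pageCounts.merge(key, 1, Integer::sum);
                }
            }
        }
        // Keep only the lines that are not repeated across enough pages.
        List<List<String>> result = new ArrayList<>();
        for (List<String> page : pages) {
            List<String> kept = new ArrayList<>();
            for (String line : page) {
                int count = pageCounts.getOrDefault(normalize(line), 0);
                if (count < minFraction * pages.size()) {
                    kept.add(line);
                }
            }
            result.add(kept);
        }
        return result;
    }
}
```

A frequency rule like this catches running heads and page numbers, but it
would miss boilerplate that changes on every page and could wrongly drop body
lines that genuinely repeat, so the threshold would need tuning per corpus.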


On Thu, Mar 31, 2011 at 8:58 AM, Martinez, Mel - 1004 - MITLL <
[email protected]> wrote:

> Ted,
>
> A lot depends on how the PDF file was generated, but in general, so long as
> you leave the 'sort by position' attribute of PDFBox's PDFTextStripper as
> 'false' (the default) then the text extraction will be (mostly) logical and
> not positional.
>
>        PDFTextStripper myStripper = ...
>        // Not actually necessary, since false is the default.
>        myStripper.setSortByPosition(false);
>
> That is, if you have text in two columns on a page, the lines will be
> extracted by article and not cross columns.
>

Sort of.  As I mentioned, the quality across a bunch of data sources was
just not good enough to even contemplate deployment.  Moreover, there was no
way forward to improve the situation.

> SOME PDFs can be (and unfortunately are) generated such that the text
> objects are not logically arranged by article and the extraction still
> messes up.  But in my experience on most documents it does a pretty good
> job, especially those generated from word processors.
>

I was working against documents from publishers.  My results were much worse
than what you have seen, it sounds like.


> The only recurring glitches tend to be where text in headers and footers
> gets inserted and sometimes a floating text box will be inserted in the
> extracted text quite far from where it appears on the page.  But the block
> of text from the box usually will at least be integral and not chopped up.
>

Only sometimes.  The rearrangements in practice are quite capricious.


> The times when you may WANT to sort by position is when parsing text from
> PDFs that are more graphical in nature, such as those generated from
> PowerPoint type documents.   Even then though, it depends a lot on how the
> page is structured.   A bit of testing is usually necessary to figure out
> which setting works best with the particular PDF.
>

And my requirement was that I could not accept any magical knob turning.  My
solution had to work across a huge range of sources.


> As of 1.4 we have a lot of instrumentation that allows you to override /
> customize the demarcation between the following structural points:
>
> Page
> Article
> Paragraph
> Line
> Word
>

That just doesn't really help.  I needed auto-tuning, line unbreaking and
real flow following.
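
As a rough illustration of what line unbreaking involves once the lines are
in true reading order, here is a minimal sketch.  It assumes a naive rule: a
trailing hyphen followed by a lowercase letter is treated as a soft break and
removed.  The class and method names are hypothetical, and this is not the
implementation referred to above.

```java
import java.util.List;

public class LineUnbreaker {

    // Join lines (assumed to be in reading order) into one flowing string.
    static String unbreak(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            int n = out.length();
            // True when the text so far ends in "<letter>-".
            boolean endsWithHyphen = n >= 2
                    && out.charAt(n - 1) == '-'
                    && Character.isLetter(out.charAt(n - 2));
            if (endsWithHyphen) {
                // Lowercase continuation: assume a soft hyphen and drop it.
                // Otherwise keep the hyphen (e.g. "pre-" + "Columbian").
                if (Character.isLowerCase(line.charAt(0))) {
                    out.setLength(n - 1);
                }
            } else if (n > 0) {
                out.append(' '); // ordinary line break becomes a space
            }
            out.append(line);
        }
        return out.toString();
    }
}
```

The naive rule wrongly merges genuine compounds that happen to break at the
hyphen ("well-" + "known" becomes "wellknown"), so a real system would need a
dictionary or corpus statistics to decide.  It also only works if the lines
arrive in correct flow order, which is exactly what fails when columns are
interleaved.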
