Fwd: Text Extraction with multi-column documents in PDFBox

Ted Dunning Wed, 30 Mar 2011 10:08:30 -0700

---------- Forwarded message ----------
From: Ted Dunning <[email protected]>
Date: Wed, Mar 30, 2011 at 10:04 AM
Subject: Re: Text Extraction with multi-column documents in PDFBox
To: Jeremy Barkan <[email protected]>

I haven't looked at that lately so I may be a bit wrong on details, but if
you look at the sample article that I posted, you can see how simply
following any heuristic for generating the flow based on position alone will
not work.  The text inset on the first page, for instance, will get the
columns all confused.  The current heuristics are probably fine for finding
individual lines, but not for splitting lines into columns and then
threading those lines into correct flows and marking those flows as text or
decoration.  Moreover, there are important cues given by font and size that
need to be used.  One such cue is whether the text is in the majority font.
This alone is enough to separate about 90% of the main flow of the document
from other parts of the document (for the journals I examined).  Most of
the remaining 10% can be had from considering geometrical cues in the
context of that initial assignment, but without the original assignment
based on fonts, the geometry isn't really strong enough.
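
To make the majority-font cue concrete, here is a minimal sketch of how such a first pass might look.  It is not the code from the sample article and it is independent of PDFBox: the Fragment type, its fields and both function names are hypothetical, and it assumes the text has already been extracted into fragments that carry a font name and size.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical container for a run of extracted text (not a PDFBox type)."""
    text: str
    font: str
    size: float


def majority_font(fragments):
    """Return the (font, size) pair that covers the most characters, or None."""
    weight = Counter()
    for f in fragments:
        # Weight by character count so long body text outvotes short headings.
        weight[(f.font, round(f.size, 1))] += len(f.text)
    if not weight:
        return None
    return weight.most_common(1)[0][0]


def split_by_majority_font(fragments):
    """First-pass labeling: (main flow candidates, everything else)."""
    key = majority_font(fragments)
    main, other = [], []
    for f in fragments:
        if (f.font, round(f.size, 1)) == key:
            main.append(f)
        else:
            other.append(f)
    return main, other
```

Fragments in the "other" list are not necessarily decoration (italic or bold words inside a paragraph land there too), which is where geometry would have to take over.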

I think that there is more to be done with what I started in that you can
look at how things came out from the first pass and use statistics
describing positions on the page and font/size/position transitions within a
single text type to refine the statistical model of the document.  That
would allow the flow to be recalculated, hopefully handling a few corner
cases more accurately.
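
One way such a second pass might be sketched, using position statistics only: take the lines already labeled as main flow, find the left edges they share, and promote unlabeled lines that start at one of those edges.  The Line type, the tolerance and the min_share threshold are all assumptions made for illustration; font and size transitions, which would matter just as much, are left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical line record; is_main is the first-pass label."""
    text: str
    x: float        # left edge of the line on the page
    is_main: bool


def column_edges(lines, tolerance=2.0, min_share=0.2):
    """Left edges that account for at least min_share of the main-flow lines."""
    # Bucket left edges so that small positioning jitter does not split a column.
    buckets = Counter(round(l.x / tolerance) for l in lines if l.is_main)
    total = sum(buckets.values())
    if total == 0:
        return []
    return [b * tolerance for b, n in buckets.items() if n / total >= min_share]


def refine(lines, tolerance=2.0, min_share=0.2):
    """Promote lines that start at a column edge learned from the first pass."""
    edges = column_edges(lines, tolerance, min_share)
    return [
        Line(l.text, l.x,
             l.is_main or any(abs(l.x - e) <= tolerance for e in edges))
        for l in lines
    ]
```

A page header that happens to start at the column margin would be promoted wrongly by this rule alone, so a real refinement would need the font and transition statistics as well.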

My original goal was to simply remove the boiler-plate from the document and
leave a residue that would allow a high quality retrieval index to be
created.  The final results were nearly good enough to present as a
simplified, text-only surrogate for the document, but not quite.  They were
certainly quite readable, but not very pretty.

On Wed, Mar 30, 2011 at 9:54 AM, Jeremy Barkan <[email protected]> wrote:

> How is what you describe similar to or different from the
> charactersByArticle method of PDFTextStripper?
>
> Thanks so much for your help
>
> Best Regards
>
> Jeremy
>
> *Jeremy Barkan*
>
> Tel: +972 2 6728069
> Mobile: +972 54 6321603
> Skype: jeremy_barkan
>
> *From:* Ted Dunning [mailto:[email protected]]
> *Sent:* 30 March 2011 17:55
> *To:* Jeremy Barkan
> *Subject:* Re: Text Extraction with multi-column documents in PDFBox
>
> Neither.
>
> Never.
>
> It would be very helpful to have it, though.
>
> On Wed, Mar 30, 2011 at 8:52 AM, Jeremy Barkan <[email protected]>
> wrote:
>
> Thanks for getting back to me – I was looking into this kind of algorithm.
>
> Was this merged into PDFBox 1.4 or 1.5 ?
>
> I'm trying to decide whether to implement this myself on top of PDFBox or
> to use what PDFBox has already implemented