---------- Forwarded message ----------
From: Ted Dunning <[email protected]>
Date: Wed, Mar 30, 2011 at 10:04 AM
Subject: Re: Text Extraction with multi-column documents in PDFBox
To: Jeremy Barkan <[email protected]>

I haven't looked at that lately, so I may be a bit wrong on details, but if you look at the sample article that I posted, you can see that simply following any heuristic for generating the flow based on position alone will not work. The text inset on the first page, for instance, will get the columns all confused.

The current heuristics are probably fine for finding individual lines, but not for splitting lines into columns, threading those lines into correct flows, and marking those flows as text or decoration.

Moreover, there are important cues given by font and size that need to be used. One such cue is whether the text is in the majority font. This alone is enough to separate about 90% of the main flow of the document from other parts of the document (for the journals I examined). Most of the remaining 10% can be had from considering geometrical cues in the context of that initial assignment, but without the original assignment based on fonts, the geometry isn't really strong enough.

I think that there is more to be done with what I started: you can look at how things came out from the first pass and use statistics describing positions on the page and font/size/position transitions within a single text type to refine the statistical model of the document. That would allow the flow to be recalculated, hopefully handling a few corner cases more accurately.

My original goal was simply to remove the boilerplate from the document and leave a residue that would allow a high-quality retrieval index to be created. The final results were nearly good enough to present as a simplified, text-only surrogate for the document, but not quite. They were certainly quite readable, but not very pretty.

On Wed, Mar 30, 2011 at 9:54 AM, Jeremy Barkan <[email protected]> wrote:

> How is what you describe similar or different than the charactersByArticle method of PDFTextStripper ?
>
> Thanks so much for your help
>
> Best Regards
>
> Jeremy
>
> *Jeremy Barkan*
> Tel: +972 2 6728069
> Mobile: +972 54 6321603
> Skype: jeremy_barkan
>
> *From:* Ted Dunning [mailto:[email protected]]
> *Sent:* 30 March 2011 17:55
> *To:* Jeremy Barkan
> *Subject:* Re: Text Extraction with multi-column documents in PDFBox
>
> Neither.
>
> Never.
>
> It would be very helpful to have it, though.
>
> On Wed, Mar 30, 2011 at 8:52 AM, Jeremy Barkan <[email protected]> wrote:
>
> Thanks for getting back to me – I was looking into this kind of algorithm.
>
> Was this merged into PDFBox 1.4 or 1.5 ?
>
> I'm trying to decide if to implement this on my own on top of PDFBox or to use what PDFBox would have already implemented
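
For reference, the majority-font cue Ted describes above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code from his work and not anything in PDFBox: the TextRun record and both function names are hypothetical, and it assumes the text runs (with font name and size) have already been extracted by some other means. It weights each (font, size) pair by character count, treats the heaviest pair as the majority font, and splits the runs into main-flow candidates and everything else. The geometric refinement of the remainder is not attempted here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRun:
    """Hypothetical record for one extracted run of text."""
    text: str
    font: str
    size: float


def font_key(run):
    # Round the size so tiny floating-point differences do not split a font.
    return (run.font, round(run.size, 1))


def majority_font(runs):
    """Return the (font, size) pair covering the most characters, or None."""
    weights = Counter()
    for run in runs:
        weights[font_key(run)] += len(run.text)
    if not weights:
        return None
    return weights.most_common(1)[0][0]


def split_by_majority_font(runs):
    """Split runs into (main-flow candidates, everything else)."""
    major = majority_font(runs)
    main, other = [], []
    for run in runs:
        (main if font_key(run) == major else other).append(run)
    return main, other


if __name__ == "__main__":
    # Made-up sample data: a running header, two body runs, and an inset.
    runs = [
        TextRun("Journal of Examples, Vol. 1", "Helvetica", 8.0),
        TextRun("Body text of the first column, which runs on for a while.",
                "Times-Roman", 10.0),
        TextRun("An inset quote set in larger type", "Helvetica-Bold", 14.0),
        TextRun("Body text of the second column continues the main flow.",
                "Times-Roman", 10.0),
    ]
    main, other = split_by_majority_font(runs)
    for run in main:
        print("MAIN ", run.text)
    for run in other:
        print("OTHER", run.text)
```

Weighting by character count rather than by number of runs is a design choice in this sketch: short headers and captions often appear as many small runs, and counting characters keeps them from outvoting the body text.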
