Re: Text Extraction with multi-column documents in PDFBox

Ted Dunning Thu, 31 Mar 2011 14:00:27 -0700

Exactly.

That is why I reverted to looking at how the text sits on the page.  My
approaches would fall apart for wide classes of documents as well.  For
instance, mono-font documents kill the "body font" technique that I use.
Image-only, OCR'ed documents are also a problem, since they rarely have good
location or font information.
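
To make the "body font" idea concrete, here is a minimal sketch of one way
such a heuristic could work. It is not the actual code referred to above and
it is not tied to the PDFBox API. It assumes you have already extracted text
runs as (font_name, font_size, text) tuples; the names `body_font` and
`is_body_text` are hypothetical. The font covering the most characters is
treated as the body font, and everything else (headings, captions, page
furniture) is treated as non-body text.

```python
from collections import Counter


def _font_key(font_name, font_size):
    # Round the size so tiny floating-point differences do not split one font.
    return (font_name, round(font_size, 1))


def body_font(runs):
    """Return the (font_name, font_size) pair that covers the most characters.

    `runs` is an assumed input format: an iterable of
    (font_name, font_size, text) tuples from some PDF text extractor.
    Returns None when there are no runs.
    """
    counts = Counter()
    for font_name, font_size, text in runs:
        # Weight by non-whitespace characters so long paragraphs
        # outweigh short headings set in a larger font.
        counts[_font_key(font_name, font_size)] += sum(
            1 for ch in text if not ch.isspace()
        )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def is_body_text(run, body):
    """True if the run is set in the detected body font."""
    font_name, font_size, _ = run
    return _font_key(font_name, font_size) == body


if __name__ == "__main__":
    sample = [
        ("Helvetica-Bold", 14.0, "Introduction"),
        ("Times-Roman", 10.0, "This is the first paragraph of the article."),
        ("Times-Roman", 10.0, "This is the second paragraph."),
        ("Courier", 8.0, "Page 3"),
    ]
    body = body_font(sample)
    print(body)
    print([text for (name, size, text) in sample
           if is_body_text((name, size, text), body)])
```

This also shows why mono-font documents defeat the technique: if every run
uses the same font and size, every run matches the body font and the
heuristic cannot separate headings or captions from the main text.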


On Thu, Mar 31, 2011 at 12:46 PM, Martinez, Mel - 1004 - MITLL wrote:

> If you have some foreknowledge about the structure of a given corpus of
> documents, you may be able to write some custom code that figures things
> out, but otherwise, PDF in general is simply not designed for that purpose.
