Exactly. That is why I reverted to looking at how the text sits on the page. My approaches would fall apart for wide classes of documents as well. For instance, mono-font documents kill the "body font" technique that I use. Image-only, OCR'ed documents are also a problem, since they rarely have good location or font information.
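For illustration only, here is a rough sketch of what a "body font" heuristic could look like. It is an assumption about the general idea, not the exact technique described above: it supposes that text spans have already been extracted as hypothetical (text, font name, font size) tuples, treats the font/size pair that covers the most characters as the body font, and sets everything else aside as probable headings, captions, or page furniture.

```python
from collections import Counter

def body_font(spans):
    """Return the (font, size) pair that covers the most characters.

    `spans` is a hypothetical list of (text, font_name, font_size) tuples;
    how they are extracted from the PDF is outside this sketch.
    """
    weight = Counter()
    for text, font, size in spans:
        weight[(font, size)] += len(text.strip())
    return weight.most_common(1)[0][0] if weight else None

def split_body(spans):
    """Split span texts into (body, other) by comparing against the body font."""
    body = body_font(spans)
    main = [t for t, f, s in spans if (f, s) == body]
    other = [t for t, f, s in spans if (f, s) != body]
    return main, other

# Made-up sample data, purely for demonstration.
spans = [
    ("Quarterly Report", "Helvetica-Bold", 18.0),
    ("Revenue grew in every region during the quarter.", "Times-Roman", 10.0),
    ("Costs were flat compared with the previous year.", "Times-Roman", 10.0),
    ("Page 3", "Helvetica", 8.0),
]

main, other = split_body(spans)
print(body_font(spans))
print(other)
```

When every span shares one font and size, `other` comes back empty, which is the mono-font failure mode mentioned above: the heuristic has nothing left to distinguish headings from body text.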
On Thu, Mar 31, 2011 at 12:46 PM, Martinez, Mel - 1004 - MITLL <[email protected]> wrote:

> If you have some foreknowledge about the structure of a given corpus of
> documents, you may be able to write some custom code that figures things
> out, but otherwise, PDF in general is simply not designed for that purpose.
