Steve,

Steve Garcia wrote
> Am trying to pull table data out of PDF files that contain non tabular
> text as well as the tables.  I've successfully parsed the non tabled text
> using PdfTextExtractor.GetTextFromPage(), but the resulting text stream is
> empty at each table location.

The text in the tables cannot be extracted without OCR.

The text in the tables is drawn using Type 3 fonts with an ad-hoc encoding,
i.e. the first glyph drawn on the page is encoded as 0, the next glyph that
differs from it as 1, and so on; a glyph that recurs reuses its earlier code.

E.g. on page 11 the first text drawn is "B6 Summary (Official Form 6 -
Summary) (12/14)" and is encoded as 00, 01, 02, 03, 04, 05, 05, 06, 07, 08,
02, 09, 0A, 0B, 0B, 0C, 0D, 0C, 06, 0E, 02, 0F, ...
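
To illustrate the scheme, here is a minimal Python sketch that assigns codes
in order of first appearance. It is only a model of the encoding I see in the
file, not code from iText or from the tool that produced the PDF; the function
name adhoc_encode is made up for this example.

```python
def adhoc_encode(text):
    """Assign each distinct character the next free code, in order of
    first appearance (hypothetical model of the ad-hoc encoding)."""
    codes = {}      # character -> code, filled as new characters appear
    encoded = []
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes)   # next unused code
        encoded.append(codes[ch])
    return encoded, codes


text = "B6 Summary (Official Form 6 - Summary) (12/14)"
encoded, table = adhoc_encode(text)

# The first 22 codes match the sequence quoted above.
print(", ".join(f"{c:02X}" for c in encoded[:22]))
# prints: 00, 01, 02, 03, 04, 05, 05, 06, 07, 08, 02, 09, 0A, 0B, 0B, 0C, 0D, 0C, 06, 0E, 02, 0F
```

Note that the sketch knows the table only because it started from the plain
text. A text extractor sees just the codes, and the same code can stand for a
different glyph in another font or on another page.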

Furthermore, the font has no mapping to Unicode.

Thus, automated text extraction without some kind of OCR is impossible.

Regards,   Michael



