> I have tried using a utility called pdftotext. It does a pretty good job
> when invoked with the -layout switch. That switch preserves the document
> layout. However, pdftotext produces garbage characters for some fonts, it
> seems.
There are at least three ways of including text in a PDF:

1) Standard encoding: The bytes in the PDF that draw the text conform to some known encoding: WinAnsi, UTF-16, whatever. PDF->Text programs have no trouble with this sort of text.

2) Custom encoding: A byte[s]->characters mapping was generated for this PDF. This is relatively common in subsetted fonts. The first character used might be 0x01, the next 0x02, and so on, regardless of what those characters might be. PDF->Text programs must exert a little extra effort to decipher this kind of text. Some do, some don't. It's hard to tell from your description whether or not "pdftotext" does.

3) Glyph indexes: The bytes that draw the text directly index 'glyphs' in the font. These indexes may have been modified in the case of a subsetted font. The only way to extract information from this sort of text is through OCR (Optical Character Recognition).

4) Paths: There isn't any actual text in the PDF, just curves and straight lines. Illustrator can convert text to paths, and I'm sure there are other programs out there with the same capability. This results in a larger file, but you can do Cool Things to paths that you can't do with regular text. OCR is the only way to get information out of this kind of "text"... and because the paths have often been through processes to do Cool Things, the OCR can have trouble with them... depending on what was done.

5) Images: A pixel map of the text. OCR is again the only hope. Some scanned PDFs and company logos are images.

6) Our 3 weapons are fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms.

--Mark Storer
  Senior Software Engineer
  Cardiff Software

#include <disclaimer>
typedef std::Disclaimer<Cardiff> DisCard;
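To make case 2 (custom encoding) a bit more concrete, here is a small Python sketch of why an extractor that ignores the per-font mapping produces garbage. Everything in it is hypothetical: the mapping table, the byte values, and the function names are made up for illustration and are not taken from pdftotext, iText, or any real PDF.

```python
# Hypothetical subset-font mapping: codes were handed out in order of first
# use while writing the word "Hello" (H=0x01, e=0x02, l=0x03, o=0x04).
CUSTOM_ENCODING = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}


def decode_with_mapping(raw: bytes, mapping: dict) -> str:
    """Translate each byte through the per-font mapping.

    Codes missing from the mapping become U+FFFD (the replacement character).
    """
    return "".join(mapping.get(code, "\ufffd") for code in raw)


def decode_naively(raw: bytes) -> str:
    """Pretend the bytes are in a standard encoding (Latin-1 here)."""
    return raw.decode("latin-1")


# The bytes that would sit in the page content to draw "Hello".
drawn = bytes([0x01, 0x02, 0x03, 0x03, 0x04])

print(decode_with_mapping(drawn, CUSTOM_ENCODING))  # Hello
print(repr(decode_naively(drawn)))  # '\x01\x02\x03\x03\x04'
```

The naive result is a run of control characters, which is roughly what "garbage characters for some fonts" looks like when a tool skips the mapping step.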
