> I have tried using a utility called pdftotext. It does a pretty good job
> when invoked with the -layout switch. That switch preserves the document
> layout. However, pdftotext produces garbage characters for some fonts, it
> seems.

There are at least three ways of including text in a PDF:

1) Standard encoding: The bytes in the PDF that draw the text conform to some 
known encoding: WinAnsi, UTF-16, whatever.  PDF->Text programs have no trouble 
with this sort of text.

2) Custom encoding: A byte[s]->characters mapping was generated for this PDF.  
This is relatively common in subsetted fonts.  The first character used might 
be 0x01, the next 0x02, and so on, regardless of what those characters might 
be.  PDF->Text programs must exert a little extra effort to decipher this kind 
of text.  Some do, some don't.  It's hard to tell whether or not "pdftotext" 
does based on your description.

3) Glyph indexes: The bytes that draw the text directly index 'glyphs' in the 
font.  These indexes may have been modified in the case of a subsetted font.  
The only way to extract information from this sort of text is through OCR 
(Optical Character Recognition).

4) Paths: There isn't any actual text in the PDF, just curves and straight 
lines.  Illustrator can convert text to paths, and I'm sure there are other 
programs out there with the same capability.  This results in a larger file, 
but you can do Cool Things to paths that you can't with regular text.  OCR is 
the only way to get information out of this kind of "text"... and because they 
have often been through processes to do Cool Things, the OCR can have trouble 
with it... depending on what was done.

5) Images: A pixel map of the text.  OCR is again the only hope.  Some scanned 
PDFs and company logos are images.

6) Our 3 weapons are fear, surprise, ruthless efficiency, an almost fanatical 
devotion to the Pope, and nice red uniforms.
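
To make case 2 (custom encoding) a bit more concrete, here is a rough Python sketch of why a naive extractor sees garbage.  It is only an illustration, not how pdftotext or any other tool actually works: the function names are made up for this example, and the "font" is just a dictionary.  In a real PDF the reverse mapping, when the producer bothered to include one, typically lives in something like the font's ToUnicode CMap.

```python
def build_subset_encoding(text):
    """Assign codes 0x01, 0x02, ... to characters in order of first use.

    Hypothetical helper: mimics what a subsetting PDF producer might do.
    Only works for up to 255 distinct characters (single-byte codes).
    """
    char_to_code = {}
    for ch in text:
        if ch not in char_to_code:
            char_to_code[ch] = len(char_to_code) + 1
    return char_to_code


def encode(text, char_to_code):
    """Turn text into the bytes that would be written to the PDF."""
    return bytes(char_to_code[ch] for ch in text)


def decode(data, code_to_char):
    """Recover text using the custom mapping; unknown codes become U+FFFD."""
    return "".join(code_to_char.get(b, "\ufffd") for b in data)


text = "Hello"
char_to_code = build_subset_encoding(text)   # H=1, e=2, l=3, o=4
raw = encode(text, char_to_code)             # bytes as stored in the file

# An extractor that assumes a standard encoding gets control characters.
naive = raw.decode("latin-1")

# An extractor that inverts the custom mapping gets the real text back.
code_to_char = {code: ch for ch, code in char_to_code.items()}
recovered = decode(raw, code_to_char)

print(repr(naive))
print(recovered)
```

The stored bytes carry no hint of the original characters on their own.  Whether a given tool recovers the text depends on whether the mapping is in the file and whether the tool bothers to read it.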

--Mark Storer
  Senior Software Engineer
  Cardiff Software

#include <disclaimer>
typedef std::Disclaimer<Cardiff> DisCard;

