Text Extraction and Fonts

Hannes Carl Meyer Sat, 29 Jan 2011 13:24:48 -0800

Hi,

I'm using PDFBox to extract text from various PDFs.
Since these PDFs are from good ol' Germany and written in German, they contain
lots of nice umlauts (ä, ö, ü, etc.).


On some PDFs the extraction of umlauts fails.

From my first analysis, I suspect this happens because I don't have the
particular font used by those PDFs.

Is it necessary to have a font installed and loaded into PDFBox to perform a
proper extraction?

Another interesting point: if I open the PDF documents that I can't extract
umlauts from in Adobe Reader and search for an umlaut that is displayed
properly, the search fails. Manually extracting the text via copy & paste
from the PDF fails as well.
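
To narrow down which documents are affected, a check along the following
lines could be run over the extracted text. This is only a minimal sketch: it
assumes the text is already available as a String (for example from
PDFTextStripper), and the class ExtractionCheck and its methods are
hypothetical helpers, not part of PDFBox. It counts characters that usually
point to a missing glyph-to-Unicode mapping rather than proving one.

```java
// Hypothetical helper, not part of PDFBox: flags characters that often
// indicate a failed glyph-to-Unicode mapping in extracted text.
final class ExtractionCheck {

    // German umlauts and sharp s, written as escapes to avoid source-encoding issues.
    private static final String UMLAUTS = "\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC\u00DF";

    // Counts code points that should not appear in ordinary German text.
    static long countSuspicious(String text) {
        return text.codePoints().filter(ExtractionCheck::isSuspicious).count();
    }

    static boolean isSuspicious(int cp) {
        // Ordinary line breaks and tabs are expected in extracted text.
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false;
        }
        // The Unicode replacement character (U+FFFD) marks an undecodable glyph.
        if (cp == 0xFFFD) {
            return true;
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED;
    }

    // True if at least one umlaut (or sharp s) survived the extraction.
    static boolean containsUmlaut(String text) {
        return text.codePoints().anyMatch(cp -> UMLAUTS.indexOf(cp) >= 0);
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for extracted page text.
        String good = "Gr\u00FC\u00DFe aus M\u00FCnchen";
        String bad = "Gr\uF0FC\uFFFDe aus M\u0001nchen";
        System.out.println("good: " + countSuspicious(good) + " suspicious");
        System.out.println("bad:  " + countSuspicious(bad) + " suspicious");
    }
}
```

A document whose text contains no umlauts at all but a nonzero count of
suspicious characters would be a candidate for the problem described above.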

Thanks & Regards

Hannes
