Hi, I'm using PDFBox to extract text from various PDFs. Since these PDFs are from good ol' Germany and written in German, they contain lots of nice umlauts (ä, ö, ü, etc.).
On some PDFs the extraction of umlauts fails. From my first analysis, I suspect it is because I don't have the font used by those particular PDFs. Is it necessary to have a font installed and loaded into PDFBox to perform a proper extraction?

Another interesting point: if I open the PDF documents I can't extract umlauts from in Adobe Reader and search for an umlaut that is displayed properly, the search fails. Manually extracting the text via copy & paste from the PDF fails as well.

Thanks & Regards,
Hannes

