Hi Andreas, thank you very much for your reply!
The problem occurs, for example, on this document:
https://www.sparkasse-hildesheim.de/pdf/vertragsbedingungen/057_produktbedingungen_spk_cards.pdf

I'm using the latest version of PDFBox, 1.4.0!

Do you know a tool to debug a given PDF? Maybe you could take a look at the PDF linked above.

Regards
Hannes

On Sun, Jan 30, 2011 at 4:18 PM, Andreas Lehmkuehler <[email protected]> wrote:

> Hi,
>
> On 29.01.2011 22:24, Hannes Carl Meyer wrote:
>
>> Hi,
>>
>> I'm using PDFBox to extract text from various PDFs.
>> Since these PDFs are from good ol' Germany and written in German, they
>> contain lots of nice umlauts (ä, ö, ü etc.).
>>
>> On some PDFs the extraction of umlauts fails.
>>
>> From my first analysis, I suspect this is because I don't have the
>> particular PDF's font.
>>
>> Is it necessary to have a font installed and loaded into PDFBox to
>> perform a proper extraction?
>>
>> Another interesting point: if I open the PDF documents from which I can't
>> extract umlauts in Adobe Reader and search for an umlaut that is
>> displayed properly, the search fails. Manually extracting the text from
>> the PDF via copy & paste fails as well.
>>
> Without having a look at the PDF, it's hard to say what the reason for
> the described issue may be. There are different possibilities:
>
> 1.) the font isn't embedded and the substitution made by PDFBox doesn't
> fit 100%
> 2.) the font is an embedded subset of a TrueType font, which will be
> substituted with another font due to an issue concerning font subsets
> (see [1] for further info), and that may lead to the same effect as 1.
> 3.) the PDF uses so-called CIDs (character IDs) without a suitable mapping
> to Unicode
> 4.) the PDF uses a Type 3 font without a suitable mapping to Unicode
> 5.) you're using wrong parameters for the extraction
> 6.) you're using an editor with limited capabilities concerning text
> encoding
> 7.) there is still an issue with PDFBox
>
> Following your last comment, cases 3. or 4. are most likely.
>
> BTW, what version of PDFBox are you using?
>
> BR
> Andreas Lehmkühler
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-490
>
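
To tell the editor-encoding case apart from the missing-Unicode-mapping cases (3. and 4.) in the list above, a small check on the extracted string might help. The following is only a rough sketch in plain Java and does not use PDFBox at all; the class name `UmlautCheck` and both method names are made up for illustration. It assumes that a missing mapping shows up as a replacement character, a private-use code point, or a stray control character in the extracted text, which may not hold for every PDF.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class UmlautCheck {

    // Simulates an editor that reads UTF-8 output as ISO-8859-1:
    // each umlaut turns into two wrong characters, but no information is lost.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Lists code points that usually should not appear in extracted prose:
    // U+FFFD, private-use code points, and control characters other than
    // tab, line feed, and carriage return.
    static List<String> suspicious(String extracted) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < extracted.length(); ) {
            int cp = extracted.codePointAt(i);
            boolean whitespace = cp == '\n' || cp == '\r' || cp == '\t';
            boolean privateUse = Character.getType(cp) == Character.PRIVATE_USE;
            if (cp == 0xFFFD || privateUse
                    || (Character.isISOControl(cp) && !whitespace)) {
                hits.add(String.format("index %d: U+%04X", i, cp));
            }
            i += Character.charCount(cp);
        }
        return hits;
    }
}
```

If `suspicious(...)` reports nothing for the text returned by PDFBox but the umlauts still look wrong in the editor, the output encoding or the editor is the more likely culprit. If it does report hits where umlauts should be, the text was probably already wrong at extraction time, which would point towards a font without a usable Unicode mapping.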

