Re: Text Extraction and Fonts

Andreas Lehmkuehler Sun, 30 Jan 2011 09:32:15 -0800

Hi,


On 30.01.2011 17:20, Hannes Carl Meyer wrote:

Hi Andreas,

thank you very much for your reply!

The problem occurs for example on this document
https://www.sparkasse-hildesheim.de/pdf/vertragsbedingungen/057_produktbedingungen_spk_cards.pdf

I'm using the latest version of PDFBox, 1.4.0!

Hmm, I can confirm your issue and it seems to be case 7., the second case 6. ;-)
It works fine with the current trunk (we recently made some improvements).

Do you know a tool to debug a given PDF? Maybe you could have a look at the
PDF shown above.

To determine which fonts are used, just have a look at the PDF properties. The
Acrobat Reader and other tools provide those properties.

Use the PDFDebugger [1], which comes with PDFBox, to walk through a PDF on a
logical level.



[1] http://pdfbox.apache.org/commandlineutilities/PDFDebugger.html
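
When reading font names in the PDF properties or in the PDFDebugger, an embedded subset can usually be recognized by its name: the PDF specification prefixes subset font names with a tag of six uppercase letters and a plus sign. A minimal sketch of that check follows. The class name, the method names and the sample font names are made up for illustration and are not part of PDFBox.

```java
import java.util.regex.Pattern;

class SubsetTagSketch {

    // Subset font names look like "ABCDEF+RealName": six uppercase letters, then '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+.+");

    // Returns true if the font name carries a subset tag.
    static boolean isSubsetName(String fontName) {
        return fontName != null && SUBSET_TAG.matcher(fontName).matches();
    }

    // Strips the seven-character tag ("ABCDEF+") if present, otherwise returns the name unchanged.
    static String baseName(String fontName) {
        return isSubsetName(fontName) ? fontName.substring(7) : fontName;
    }

    public static void main(String[] args) {
        // Hypothetical font names as they might appear in the document properties.
        for (String name : new String[] {"ABCDEF+Arial", "Helvetica"}) {
            System.out.println(name + " -> subset: " + isSubsetName(name)
                    + ", base: " + baseName(name));
        }
    }
}
```

A subset tag only says that the font is embedded partially; it does not say whether a usable mapping to Unicode exists.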

On Sun, Jan 30, 2011 at 4:18 PM, Andreas Lehmkuehler <[email protected]> wrote:

Hi,

On 29.01.2011 22:24, Hannes Carl Meyer wrote:

  Hi,


I'm using PDFBox to extract text from various PDFs.
Since these PDFs are from good ol' Germany and written in German, they contain
lots of nice umlauts (ä, ö, ü etc.).

On some PDFs the extraction of umlauts fails.

From my first analysis, I could imagine it is somehow because I don't have
the font used by the particular PDF.

Is it necessary to have a font installed and loaded into PDFBox to perform a
proper extraction?

Another interesting point: if I open one of the PDF documents I can't extract
umlauts from in Adobe Reader and search for an umlaut that is displayed
properly, the search fails. Manually extracting the text via copy & paste
from the PDF fails as well.

Without having the PDF at hand, it's hard to say what may be the reason
for the described issue. There are different possibilities:

1.) the font isn't embedded and the substitution made by PDFBox doesn't fit
100%
2.) the font is an embedded subset of a TrueType font, which will be
substituted with another font due to an issue concerning font subsets (see
[1] for further info), and that may lead to the same effect as 1.
3.) the PDF uses so-called CIDs (character IDs) without a suitable mapping
to Unicode
4.) the PDF uses a Type 3 font without a suitable mapping to Unicode
5.) you're using wrong parameters for the extraction
6.) you're using an editor with limited capabilities concerning text
encoding
6.) there is still an issue with PDFBox

Given your last comment, cases 3. or 4. are the most likely.
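
To illustrate why a missing mapping to Unicode breaks extraction while the page still displays fine, here is a toy sketch. It is not how PDFBox works internally; the class name, the character codes and the map are invented for the example. The viewer only needs the code to pick a glyph, whereas text extraction and search need the code translated to Unicode.

```java
import java.util.HashMap;
import java.util.Map;

class ToUnicodeSketch {

    // Translates character codes to text using a code-to-Unicode map.
    // Codes without an entry become the replacement character U+FFFD.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical codes for the word "Gruesse" written with u-umlaut and sharp s.
        int[] codes = {1, 2, 4, 5, 3};

        // Incomplete mapping: the codes for the umlaut and the sharp s are missing.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(1, "G");
        toUnicode.put(2, "r");
        toUnicode.put(3, "e");
        System.out.println(decode(codes, toUnicode));

        // Complete mapping: the same codes now decode to the expected text.
        toUnicode.put(4, "\u00FC"); // u-umlaut
        toUnicode.put(5, "\u00DF"); // sharp s
        System.out.println(decode(codes, toUnicode));
    }
}
```

With the incomplete map, only the plain letters survive, which matches the symptom of umlauts being displayed correctly but not being searchable or copyable.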

BTW, what version of PDFBox are you using?

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-490


BR
Andreas Lehmkühler
