Re: Question mark in the extracted text

Villu Ruusmann Thu, 04 Feb 2010 13:13:52 -0800

Hello there,

> I'm using the text extraction of the Apache PDFBox 0.8.0 library.
> Unfortunately, the text extraction is replacing some symbols and
> letters with '?'.


Without having seen the PDF file, my guess is that the affected
characters use a font that PDFBox 0.8.0 does not fully support. The
rules for translating bytes to character codes can be embedded in the
font program itself, and PDFBox cannot yet parse/interpret every type
of font program. When it cannot resolve a character, it falls back to
"?" instead.
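
To illustrate the kind of fallback I mean, here is a minimal sketch in
plain Java. It does not use the PDFBox API or reproduce its internals;
the class GlyphFallbackSketch, the decode method and the toUnicode map
are all hypothetical names standing in for whatever translation table a
font would normally provide.

```java
import java.util.HashMap;
import java.util.Map;

public class GlyphFallbackSketch {

    // Translate raw byte codes to text using a (hypothetical) font table.
    // Any code the table does not cover is emitted as '?'.
    static String decode(byte[] codes, Map<Integer, Character> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF;                 // treat the byte as unsigned
            Character ch = toUnicode.get(code);  // null when there is no mapping
            out.append(ch != null ? ch : '?');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed example table: only two codes are known.
        Map<Integer, Character> toUnicode = new HashMap<>();
        toUnicode.put(0x01, 'H');
        toUnicode.put(0x02, 'i');

        // 0x03 is deliberately left unmapped to trigger the fallback.
        byte[] codes = {0x01, 0x02, 0x03};
        System.out.println(decode(codes, toUnicode)); // prints "Hi?"
    }
}
```

If the extractor cannot read the font's table at all, every code looks
unmapped, which would explain a whole run of "?" in the output.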

Hopefully the upcoming PDFBox 1.0.0 release will handle such fonts better.


VR
