Re: Question mark in the extracted text

Iain Clapham Thu, 04 Feb 2010 15:37:26 -0800

I get this a lot with "obscure" fonts - I would love to improve the font handling but worry that the project is not well controlled and any effort in this direction would be wasted.

Who is producing 1.0.0, and WHEN?


iaincc

Villu Ruusmann wrote:

Hello there,

I'm using the text extraction of the Apache PDFBox 0.8.0 library.
Unfortunately, the text extraction is replacing some signs and letters by
'?'.

Without having seen the PDF file, I guess that the problem is that the
"faulty" characters depend on a font which is not properly supported
by PDFBox 0.8.0 (the translation rules from bytes to character codes
could be embedded into the font program; PDFBox does not know yet how
to parse/interpret all types of font programs, so it bails out with a
"?" instead).
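
To illustrate the kind of fallback described above, here is a minimal sketch in plain Java. It is not PDFBox code: the class FallbackDecoder and its methods are hypothetical names made up for this example, and the map simply stands in for whatever byte-to-character rules a font program might carry. Any code without a known mapping comes out as "?".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from PDFBox.
final class FallbackDecoder {
    // Stand-in for a font's translation rules: byte code -> character.
    private final Map<Integer, Character> codeToChar = new HashMap<>();

    // Registers a known translation rule.
    void addMapping(int code, char ch) {
        codeToChar.put(code, ch);
    }

    // Decodes each byte; codes with no known mapping are replaced by '?'.
    String decode(byte[] codes) {
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // treat the byte as an unsigned value
            Character ch = codeToChar.get(code);
            out.append(ch != null ? ch.charValue() : '?');
        }
        return out.toString();
    }
}
```

With only the code 0x01 mapped to 'A', decoding the bytes 0x01, 0x02 would give "A?". In a real extractor the missing entries would presumably correspond to the parts of a font program that cannot be parsed yet.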

Hopefully the upcoming PDFBox 1.0.0 release is a bit more savvy in this regard.


VR
