Hi,

I'm using the text extraction of the Apache PDFBox 0.8.0 library.
Unfortunately, the text extraction replaces some characters and letters with '?'.

The PDF file contains German text. I extracted the text with the ExtractText.java example from the PDFBox package.

Here is an example:
input text:
"Front: Weiß Hochglanz, Korpus: Noce Dekor, Griff: Metall chrom glänzend, B ca. 234 cm 4425678 394.-** Hochschrank B/H/T ca. 35/179/29 cm 10060786 175,-** "
PDFBox output text:
"Front: Weiß Hochglanz, Korpus: Noce Dekor, Griff: Metall chrom gl?nzend, ? ca? ??? cm ???????? 394.-**
H?hschrank  ?H? ca??????cm  ???? ??- **"

I would be pleased if you could help me with this problem and suggest some ways to make it work.

Cheers,
Christian
