Hello Stan,

I'm trying to evaluate different options to extract text from PDF files, PDFBox 
(v.8) being one of them.
My experience is:
– what you get depends on the way the PDF file was created (mine are TeX based, 
and there is a wide variety),
– you may need to post-process the extracted text,
– PDFBox is not perfect, but among the best for this job.

Here is an example (not Romanian though, but similar I suppose):
Białynicki (Polish ł) works correctly
Świȩcicka comes out as S´wie¸cicka
Here "´" and "¸" are the non-combining (spacing) equivalents of the combining 
diacritics. To get it right, one would have to replace the non-combining 
diacritics with their combining counterparts, and probably then apply Unicode 
normalisation to turn each base-plus-diacritic combination into a single 
character. By the way, you might also have to look out for ligatures 
(e.g. ff ffi fi fl).
And beware: these are the best possible results I found. With other PDFs, you 
might lose diacritic characters completely (both base and decoration), get the 
diacritic signs reversed (probably only some of them), or scattered over the 
respective line with no reference to the decorated character (you might have 
picked up one of those before your "„").
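
To illustrate the replacement idea above, here is a minimal sketch in Java 
(chosen because PDFBox is a Java library). The class name DiacriticRepair and 
the two mapping tables are hypothetical and not part of PDFBox. It assumes the 
stray diacritic directly follows its base letter, as in the S´wie¸cicka 
example, and it covers only a handful of diacritics and ligatures:

```java
import java.text.Normalizer;

public class DiacriticRepair {

    // Spacing diacritics paired with their combining counterparts.
    // Only a few are listed; extend the table for your own documents.
    private static final String[][] SPACING_TO_COMBINING = {
        {"\u00B4", "\u0301"}, // acute accent
        {"\u00B8", "\u0327"}, // cedilla
        {"\u02D8", "\u0306"}, // breve
        {"\u02DB", "\u0328"}, // ogonek
    };

    // Latin ligature code points and their plain-letter expansions.
    private static final String[][] LIGATURES = {
        {"\uFB00", "ff"}, {"\uFB01", "fi"}, {"\uFB02", "fl"},
        {"\uFB03", "ffi"}, {"\uFB04", "ffl"},
    };

    public static String repair(String extracted) {
        String s = extracted;
        // Step 1: turn each spacing diacritic into the combining form,
        // so that it attaches to the letter in front of it.
        for (String[] pair : SPACING_TO_COMBINING) {
            s = s.replace(pair[0], pair[1]);
        }
        // Step 2: expand ligature characters into ordinary letters.
        for (String[] pair : LIGATURES) {
            s = s.replace(pair[0], pair[1]);
        }
        // Step 3: NFC composes base letter + combining mark into a single
        // code point wherever Unicode defines one.
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // The example from this mail, written with escapes for "´" and "¸".
        System.out.println(repair("S\u00B4wie\u00B8cicka"));
    }
}
```

This would not help with the worse cases described above (diacritics lost, 
reversed, or scattered over the line), and a blanket replacement also converts 
any spacing accent that was meant to stand on its own.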

Cheers
Thomas


On 19.12.2009 at 00:09, Stan Ioan-Eugen wrote:

> Hello,
> 
> I'm having some difficulties using pdfbox. It does not behave how I expect
> and I don't know the problem. I'm trying to build a pdf translation app using
> a translating engine. The idea is upload pdf, click button, get pdf
> translated. The problem is that pdfbox messes up the characters. I tried the
> ReplaceString.java application on a Romanian newspaper pdf, trying to replace
> a string. Pdfbox seems to mess up the diacritics. After the replace, the newly
> created PDF file shows as follows:
> 
> ́„ instead of „
> ́” instead of ”
> (the leading quote should not be there, Romanian quotation is like „quoted
> text” )
> ^fi instead of î (i circumflex)
> ~ and another character which did not display (displayed as an empty box)
> instead of ă (a breve, I guess).
…
> -- 
> -stan ioan-eugen
