Hello Stan,

I'm currently evaluating different options for extracting text from PDF files, PDFBox (v.8) being one of them. My experience so far:

– what you get depends on the way the PDF file was created (mine are TeX-based, and there is a wide variety),
– you may need to post-process the extracted text,
– PDFBox is not perfect, but it is among the best for this job.
Here is an example (not Romanian, but I suppose it is similar):

Białynicki (Polish ł) works correctly
Świȩcicka comes out as S´wie¸cicka

Here "´" and "¸" are the non-combining (spacing) equivalents of the combining diacritics. To get it right, one would have to apply a general replacement of non-combining diacritics with combining ones (and probably a Unicode normalisation step to replace the resulting combinations with single precomposed characters). By the way, you might also have to look out for ligatures (e.g. ff, ffi, fi, fl).

And beware: these are the best results I have found. With other PDFs, you might lose accented characters completely (both base and decoration), get the diacritic signs reversed (probably only some of them), or find them scattered over the respective line with no reference to the decorated character (you might have picked up one of those before your "„").

Cheers
Thomas

On 19.12.2009 at 00:09, Stan Ioan-Eugen wrote:

> Hello,
>
> I'm having some difficulties using pdfbox. It does not behave how I expect
> and I don't know what the problem is. I'm trying to build a pdf translation
> app using a translation engine. The idea is: upload a pdf, click a button,
> get the pdf translated. The problem is that pdfbox messes up the characters.
> I tried the ReplaceString.java application on a Romanian newspaper pdf,
> trying to replace a string. Pdfbox seems to mess up the diacritics. After
> the replacement, the newly created PDF file shows the following:
>
> ́„ instead of „
> ́” instead of ”
> (the leading quote should not be there; Romanian quotation is like „quoted
> text”)
> ^fi instead of î (i circumflex)
> ~ and another character which did not display (displayed as an empty box)
> instead of ă (a grave i guess).
> …
> --
> -stan ioan-eugen
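
The post-processing described in the reply above (replace spacing diacritics with their combining counterparts, normalise, and expand ligatures) could be sketched roughly as follows. This is only a minimal sketch and not part of PDFBox: the class `DiacriticFixer` and its methods are hypothetical names, the mapping table covers only a handful of marks, and it assumes the stray diacritic directly follows its base letter, as in the "S´wie¸cicka" example. Cases where the mark precedes the letter or is scattered over the line are not handled. Only `java.text.Normalizer` from the standard library is used.

```java
import java.text.Normalizer;

public class DiacriticFixer {

    // Hypothetical helper: map a spacing (non-combining) diacritic to its
    // combining counterpart; any other character is returned unchanged.
    static char toCombining(char c) {
        switch (c) {
            case '\u00B4': return '\u0301'; // acute accent
            case '\u00B8': return '\u0327'; // cedilla
            case '\u02DB': return '\u0328'; // ogonek
            case '\u00A8': return '\u0308'; // diaeresis
            case '\u02D8': return '\u0306'; // breve
            case '\u02C7': return '\u030C'; // caron
            default:       return c;
        }
    }

    // Replace spacing diacritics that directly follow a letter with combining
    // marks, then compose base letter + mark into one character (NFC).
    static String fixDiacritics(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean afterLetter = i > 0 && Character.isLetter(text.charAt(i - 1));
            sb.append(afterLetter ? toCombining(c) : c);
        }
        return Normalizer.normalize(sb.toString(), Normalizer.Form.NFC);
    }

    // Expand the Latin ligature code points U+FB00..U+FB06 into plain letters
    // via compatibility normalisation (NFKC), leaving all other text untouched.
    static String expandLigatures(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\uFB00' && c <= '\uFB06') {
                sb.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The extracted form from the example, written with escapes.
        String extracted = "S\u00B4wie\u00B8cicka";
        System.out.println(fixDiacritics(extracted));
        System.out.println(expandLigatures("e\uFB03cient \uFB01le"));
    }
}
```

Applied to the example from the mail, `fixDiacritics` turns "S´wie¸cicka" back into "Świȩcicka". The ligature step is restricted to the U+FB00 to U+FB06 range on purpose, since running NFKC over the whole text would also rewrite other compatibility characters. ASCII look-alikes such as "^" and "~" are deliberately left out of the table because they also occur as ordinary characters.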

