Hi Villu,

> Hello there,
> 
>> To get it right one would have to use a general replacement of non-combining 
>> into combining diacritics (and probably a normalisation process for unicode 
>> to replace combinations by single characters). By the way, you might also 
>> have to look out for ligatures (e.g. ff ffi fi fl).
> 
> The need for text post-processing depends on the class you're using for the 
> job.
> 
> Class org.apache.pdfbox.util.PDFTextStripper does it for you, because
> all text is filtered through
> org.apache.pdfbox.util.TextNormalize#normalizeDiac(String)/#normalizePres(String)
> before it is exposed to the application programmer via methods like
> PDFTextStripper#writeString(String). However, bear in mind that
> TextNormalize relies on the external ICU4J library: if it is not
> properly installed, the original string is returned unchanged.
> 
> Other classes such as org.apache.pdfbox.pdfviewer.PageDrawer do not do
> it for you. For example, when overriding
> PageDrawer#processTextPosition(TextPosition) with the intent of
> capturing the text before it is painted, you must filter it through
> TextNormalize manually to get the "correct" characters.
> 
This is interesting. I use PDFBox as a command-line tool on my Mac:

    java org.apache.pdfbox.ExtractText -encoding UTF-8 file.pdf file.txt

Is there a way to activate some post-processing if I do it this way?
Or shouldn't it be included automatically?
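
For reference, here is a rough sketch of the kind of Unicode normalisation discussed above, applied to the extracted text file as a separate step. It uses only the JDK's java.text.Normalizer and is not PDFBox's TextNormalize. The class name PostNormalize is made up for illustration, and I am assuming NFKC is an acceptable form for this purpose:

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox.
class PostNormalize {

    // NFKC replaces compatibility characters such as the ligatures
    // ff, fi, fl, ffi by their plain letters, and composes a base letter
    // followed by a combining diacritic into a single character
    // where one exists.
    static String normalize(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PostNormalize file.txt");
            return;
        }
        // Read the extracted text as UTF-8, matching "-encoding UTF-8".
        String raw = new String(Files.readAllBytes(Paths.get(args[0])),
                                StandardCharsets.UTF_8);
        // Write UTF-8 regardless of the platform default encoding.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.print(normalize(raw));
        out.flush();
    }
}
```

It could be run as "java PostNormalize file.txt > file-normalized.txt". Note that NFKC also folds other compatibility characters (superscript digits, for instance), and it does not join a stand-alone spacing accent to a neighbouring letter, so that case would still need its own replacement rules.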

All the best
Thomas

