Hi Villu,
> Hello there,
>
>> To get it right one would have to use a general replacement of non-combining
>> into combining diacritics (and probably a normalisation process for unicode
>> to replace combinations by single characters). By the way, you might also
>> have to look out for ligatures (e.g. ff ffi fi fl).
>
> The need for text post-processing depends on the class you're using for the
> job.
>
> Class org.apache.pdfbox.util.PDFTextStripper does it for you, because
> all texts are filtered through
> org.apache.pdfbox.util.TextNormalize#normalizeDiac(String)/#normalizePres(String)
> before they are exposed to the application programmer via methods like
> PDFTextStripper#writeString(String). However, it must be borne in mind
> that TextNormalize relies on external ICU4J dependency - if it is not
> properly installed, then the original string is returned unchanged.
>
> Other classes such as org.apache.pdfbox.pdfviewer.PageDrawer do not do
> it for you. For example, when overriding
> PageDrawer#processTextPosition(TextPosition) with the intent of
> capturing the text before it is painted, you must filter it through
> TextNormalize manually to get the "correct" characters.
>

This is interesting. I use PDFBox as a command line tool on my Mac:

java org.apache.pdfbox.ExtractText -encoding UTF-8 file.pdf file.txt

Is there a way to activate some post-processing if I do it this way? Or
shouldn't it be included automatically?

All the best
Thomas
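P.S. To make the kind of post-processing discussed above concrete, here is a minimal sketch. It does not use PDFBox's TextNormalize or ICU4J; it relies only on the JDK's java.text.Normalizer. The class name DiacriticFix, the method fix, and the small mapping table are hypothetical, and the sketch assumes the spacing diacritic follows its base letter in the extracted text, which may not hold for every PDF.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox.
final class DiacriticFix {

    // Illustrative (incomplete) table: spacing diacritic -> combining diacritic.
    private static final char[][] SPACING_TO_COMBINING = {
        {'\u00B4', '\u0301'}, // acute accent
        {'\u00A8', '\u0308'}, // diaeresis
        {'\u02C6', '\u0302'}, // circumflex
        {'\u02DC', '\u0303'}, // tilde
    };

    static String fix(String text) {
        // Step 1: replace non-combining diacritics by combining ones.
        // Assumption: the diacritic comes directly after its base letter.
        for (char[] pair : SPACING_TO_COMBINING) {
            text = text.replace(pair[0], pair[1]);
        }
        // Step 2: NFKC composes base + combining mark into a single
        // character where one exists, and also expands compatibility
        // characters such as the ligatures ff, ffi, fi, fl.
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(fix("u\u00A8ber"));
        System.out.println(fix("\uFB01le"));
    }
}
```

Note that NFKC also rewrites other compatibility characters (superscripts, for instance), so plain NFC plus an explicit ligature table might be the safer choice if that matters.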

