Hello there,

>> The need for text post-processing depends on the class you're using for the
>> job.
>>
>> Class org.apache.pdfbox.util.PDFTextStripper does it for you, because
>> all texts are filtered through
>> org.apache.pdfbox.util.TextNormalize#normalizeDiac(String)/#normalizePres(String)
>> before they are exposed to the application programmer via methods like
>> PDFTextStripper#writeString(String). However, it must be borne in mind
>> that TextNormalize relies on external ICU4J dependency - if it is not
>> properly installed, then the original string is returned unchanged.
>>
> This is interesting. I use PDFBox as a command line tool on my Mac:
> java org.apache.pdfbox.ExtractText -encoding UTF-8 file.pdf file.txt
> Is there a way to activate some post-processing if I do it this way?
> Or shouldn't it be included automatically?
>
The command-line application org.apache.pdfbox.ExtractText uses class
org.apache.pdfbox.util.PDFTextStripper internally. So, in principle, no
extra text post-processing should be needed, provided the ICU4J dependency
is properly installed.

Since the PDFBox JAR comes in many flavours, it is hard for me to tell
whether your installation already includes ICU4J. I guess the easiest
solution would be to download the ICU4J 3.8 JAR manually and append it to
your command-line application's classpath. You can find the said JAR, for
example, here:

http://www.jarvana.com/jarvana/browse/com/ibm/icu/icu4j/3.8/

VR
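P.S. If you are unsure whether ICU4J is actually visible on your classpath, a tiny check like the one below may help. It is only a sketch: the class name Icu4jCheck is made up for illustration, and com.ibm.icu.text.Normalizer is simply a class I would expect to find in the ICU4J JAR. I am assuming, not asserting, that it is the one TextNormalize depends on.

```java
public class Icu4jCheck {

    // Returns true if the named class can be located on the current classpath.
    // The class is looked up without being initialized.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, Icu4jCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this class ships in the ICU4J JAR.
        String icuClass = "com.ibm.icu.text.Normalizer";
        String verdict = isOnClasspath(icuClass) ? "found" : "NOT found";
        System.out.println(icuClass + ": " + verdict);
    }
}
```

Compile it and run it with exactly the same classpath you use for ExtractText. If it reports "NOT found", the ICU4J JAR is not being picked up and the extracted text is most likely returned without normalization.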

