Text extraction with PDFTextStripper is system file.encoding dependent.
Override does not work.
-----------------------------------------------------------------------------------------------
Key: PDFBOX-561
URL: https://issues.apache.org/jira/browse/PDFBOX-561
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 0.8.0-incubator, 0.7.3
Reporter: d ferbas
Text extraction depends on the JVM's file.encoding setting. The "override"
new PDFTextStripper("utf-8") (available since version 0.8.0) has no effect.
If a PDF file contains critical characters (for example, bullets), the
extracted string differs depending on the JVM's default encoding.
It must be possible to set the encoding used for extraction, so that results
are the same regardless of the default system encoding.
Sample file: http://e-nnovation.at/downloads/blindtext_mit_bullets_unsigned.pdf
Bullets #3 to #8 are extracted differently with utf-8 than with cp1252.
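
To illustrate why the result changes with the encoding, here is a minimal sketch that uses only the JDK, not PDFBox. It assumes the affected characters are ordinary bullets (U+2022), which may not match the sample file exactly. The class and method names are made up for the example. It encodes the same string through a Writer bound to an explicitly named charset, which is the behaviour expected from the override:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class ExplicitEncodingDemo {
    // Stand-in for extracted text (assumption); U+2022 is the bullet character.
    static final String SAMPLE = "\u2022 item";

    // Encodes text through a Writer bound to an explicitly named charset,
    // so the bytes do not depend on the JVM default (file.encoding).
    static byte[] encode(String text, String charsetName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, Charset.forName(charsetName))) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Formats bytes as lowercase hex so the difference is visible.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("utf-8 : " + hex(encode(SAMPLE, "UTF-8")));        // e280a2206974656d
        System.out.println("cp1252: " + hex(encode(SAMPLE, "windows-1252"))); // 95206974656d
    }
}
```

The bullet takes three bytes in utf-8 and one byte in cp1252, so any code path that falls back to the default encoding produces output that depends on the platform.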
Note that the file.encoding setting only takes effect when it is passed at JVM
startup (-Dfile.encoding=utf-8). Calling System.setProperty(..) at runtime does
not work.
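
The startup-only behaviour can be checked with a small JDK-only sketch. It relies on the JVM fixing its default charset once and not re-reading the property afterwards, which is what I would expect on common JVMs but have not verified on every version. The class and method names are hypothetical:

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Returns true if changing file.encoding at runtime leaves the
    // JVM default charset untouched.
    static boolean defaultCharsetSurvivesSetProperty() {
        Charset before = Charset.defaultCharset();
        // Pick a value that differs from the current default.
        String other = before.name().equals("UTF-8") ? "ISO-8859-1" : "UTF-8";
        String saved = System.getProperty("file.encoding");
        try {
            System.setProperty("file.encoding", other);
            return before.equals(Charset.defaultCharset());
        } finally {
            // Restore the property so the demo has no lasting side effect.
            if (saved != null) {
                System.setProperty("file.encoding", saved);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("unchanged after setProperty: "
                + defaultCharsetSurvivesSetProperty());
    }
}
```

This is why the workaround has to be applied on the command line, and why an encoding parameter honoured by the stripper itself is needed.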