Text extraction with PDFTextStripper is system file.encoding dependent. 
Override does not work.
-----------------------------------------------------------------------------------------------

                 Key: PDFBOX-561
                 URL: https://issues.apache.org/jira/browse/PDFBOX-561
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.8.0-incubator, 0.7.3
            Reporter: d ferbas


Text extraction depends on the JVM's file.encoding setting. The "override" 
constructor new PDFTextStripper("utf-8") (available since version 0.8.0) has 
no effect. If a PDF file contains critical characters, the extracted string 
differs depending on the JVM's default encoding. 
It must be possible to set the encoding used for extraction, so that the 
results are the same regardless of the default system encoding.
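
To illustrate the kind of difference involved, here is a minimal sketch that 
uses only the Java standard library, not PDFBox. It encodes the same string 
through a Writer bound to an explicitly named charset, which is the behaviour 
the override is expected to provide. The class name ExplicitEncodingSketch, 
the helper encode(), and the choice of U+2022 (BULLET) as the test character 
are assumptions made for illustration; the report does not say which 
characters in the sample file are affected.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class ExplicitEncodingSketch {

    // Hypothetical helper: encode text through a Writer bound to a named
    // charset, so the JVM default (file.encoding) is never consulted.
    static byte[] encode(String text, String charsetName) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, Charset.forName(charsetName));
        writer.write(text);
        writer.close(); // flushes the encoder into the byte stream
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // U+2022 BULLET is an assumed example of a non-ASCII character.
        String bullet = "\u2022";
        System.out.println("UTF-8 byte count:        " + encode(bullet, "UTF-8").length);
        System.out.println("windows-1252 byte count: " + encode(bullet, "windows-1252").length);
    }
}
```

The same character yields different bytes under the two charsets, so any 
code path that silently falls back to the default charset produces output 
that varies from system to system.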

Sample file: http://e-nnovation.at/downloads/blindtext_mit_bullets_unsigned.pdf
Bullets #3 to #8 are extracted differently under utf-8 and cp1252.

Be aware that the file.encoding setting only takes effect if it is passed when 
starting the JVM (-Dfile.encoding=utf-8). Calling System.setProperty(..) 
afterwards does not work.
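
A small sketch of why the runtime change has no effect, assuming a common 
JVM where the default charset is determined once and then cached. The class 
name DefaultCharsetSketch and the method defaultSurvivesRuntimeChange() are 
hypothetical names chosen for this example.

```java
import java.nio.charset.Charset;

public class DefaultCharsetSketch {

    // Returns true if changing file.encoding at runtime left the JVM's
    // default charset untouched. The original property value is restored.
    static boolean defaultSurvivesRuntimeChange(String newEncoding) {
        Charset before = Charset.defaultCharset(); // resolved and cached here at the latest
        String saved = System.getProperty("file.encoding");
        try {
            System.setProperty("file.encoding", newEncoding);
            return before.equals(Charset.defaultCharset());
        } finally {
            if (saved != null) {
                System.setProperty("file.encoding", saved);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Unchanged after System.setProperty: "
                + defaultSurvivesRuntimeChange("ISO-8859-1"));
    }
}
```

To compare both encodings for the sample file, the JVM therefore has to be 
started twice, once with each -Dfile.encoding value.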
