Christof Luick created TIKA-960:
-----------------------------------

             Summary: Duplicate letters in text extracted from PDF files
                 Key: TIKA-960
                 URL: https://issues.apache.org/jira/browse/TIKA-960
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.2
         Environment: Windows 7, Oracle JRE 1.6.0_32, 64-Bit Server VM
            Reporter: Christof Luick


When I extract the text from a PDF (fussball.pdf, see link below) with 
Tika 1.2, the text extractor returns duplicated letters for some words. For 
example, the string "SCHULE/KINDERGRUPPE" is returned as 
"SSCCHHUULLEE//KKIINNDDEERRGGRRUUPPPPEE".

The file "fussball.pdf" can be found at:
http://www.pixelschleuder.de/misc/fussball.pdf

I used the following command line for text extraction:
java -jar tika-app-1.2.jar -t -eUTF-8 fussball.pdf > test.utf-8
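
The symptom described above can be checked for mechanically. The following 
is only a minimal sketch, assuming every character of the affected string is 
emitted exactly twice, as in the example from this report. The class and 
method names (DoubledTextCheck, doubleEachChar, containsDoubled) are 
hypothetical helpers and not part of Tika.

```java
public class DoubledTextCheck {

    // Returns the input with every character repeated twice,
    // e.g. "AB" becomes "AABB". (Hypothetical helper, not a Tika API.)
    public static String doubleEachChar(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c).append(c);
        }
        return sb.toString();
    }

    // True if the extracted text contains the character-doubled
    // form of the string that was expected.
    public static boolean containsDoubled(String extracted, String expected) {
        return extracted.contains(doubleEachChar(expected));
    }

    public static void main(String[] args) {
        // Strings taken from the report; in practice "extracted" would be
        // the contents of the output file (test.utf-8).
        String expected = "SCHULE/KINDERGRUPPE";
        String extracted = "SSCCHHUULLEE//KKIINNDDEERRGGRRUUPPPPEE";
        System.out.println(containsDoubled(extracted, expected)); // prints "true"
    }
}
```

A check of this kind only detects the exact doubling pattern shown here. It 
says nothing about why the extractor emits each letter twice.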



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
