[jira] [Created] (TIKA-4231) Parsing Arabic PDF is returning bad data

Aamir (Jira) Fri, 29 Mar 2024 09:22:16 -0700

Aamir created TIKA-4231:
---------------------------

             Summary: Parsing Arabic PDF is returning bad data
                 Key: TIKA-4231
                 URL: https://issues.apache.org/jira/browse/TIKA-4231
             Project: Tika
          Issue Type: Bug
    Affects Versions: 2.6.0
         Environment: Java 18, with the Maven dependency 
tika-parsers-standard-package 
(https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)

            Reporter: Aamir
         Attachments: arabic.pdf, arabic.txt

Attached is a PDF containing Arabic text. 
When parsed with Tika 2.6.0 (which uses PDFBox for PDF parsing), it produces gibberish characters. 

The extracted text is also attached as arabic.txt. 

Most other Arabic PDFs parse fine; only this one produces this output. 
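
For triage, here is a rough sketch of one way to tell a "good" extraction from a "gibberish" one: measure what fraction of the letters in the extracted text belong to the Arabic Unicode script. It uses only the JDK and is not part of Tika or PDFBox. The class name ArabicRatio and the method arabicLetterRatio are hypothetical names chosen for this example, and a low ratio is only a hint that the PDF's font-to-Unicode mapping may be broken, not proof.

```java
import java.lang.Character.UnicodeScript;
import java.nio.file.Files;
import java.nio.file.Paths;

class ArabicRatio {

    /**
     * Returns the fraction of letters in the text whose Unicode script is
     * Arabic. Returns 0.0 when the text contains no letters at all.
     */
    static double arabicLetterRatio(String text) {
        long letters = text.codePoints()
                .filter(Character::isLetter)
                .count();
        if (letters == 0) {
            return 0.0;
        }
        long arabic = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> UnicodeScript.of(cp) == UnicodeScript.ARABIC)
                .count();
        return (double) arabic / letters;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ArabicRatio <extracted-text-file>");
            return;
        }
        // Files.readString decodes the file as UTF-8 (Java 11+).
        String text = Files.readString(Paths.get(args[0]));
        System.out.printf("Arabic letter ratio: %.3f%n", arabicLetterRatio(text));
    }
}
```

Running this on arabic.txt and on the text extracted from one of the PDFs that parses correctly would give two numbers to compare; a large gap between them would support the suspicion that this particular file is being decoded incorrectly.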



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
