Aamir created TIKA-4231:
---------------------------

Summary: Parsing Arabic PDF is returning bad data
Key: TIKA-4231
URL: https://issues.apache.org/jira/browse/TIKA-4231
Project: Tika
Issue Type: Bug
Affects Versions: 2.6.0
Environment: I am using Java 18 with the Maven dependency tika-parsers-standard-package (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
Reporter: Aamir
Attachments: arabic.pdf, arabic.txt

Attached is a PDF containing Arabic text. When parsed with Tika 2.6.0 (which uses PDFBox for PDF parsing), it produces gibberish characters. The extracted text is attached as arabic.txt. Most other Arabic PDFs parse fine, but this one produces this output.

--
This message was sent by Atlassian Jira (v8.20.10#820010)