[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

Tim Allison (Jira) Wed, 03 Apr 2024 14:19:05 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833745#comment-17833745
 ]


Tim Allison commented on TIKA-4231:
-----------------------------------

Some PDFs have problems with their Unicode mappings or other glyph issues. 
Such files can render well even though the underlying electronic text is 
junk. In those cases, OCR is the best option.

I haven’t looked at this PDF, so I don’t know whether that is the case for 
this one.
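
One rough way to tell whether a file falls into the junk-text case described 
above is to measure how much of the extracted text is actually in the Arabic 
script. The following is only a sketch under stated assumptions: the class 
name ArabicTextCheck, its methods, and the 0.5 threshold are hypothetical 
and are not part of Tika. It uses only the Java standard library and takes 
the already-extracted text as a plain String.

```java
public class ArabicTextCheck {

    /**
     * Returns the fraction of letters in the text whose Unicode script is
     * Arabic. Non-letter code points (digits, spaces, punctuation,
     * private-use code points) are ignored. Returns 0.0 if there are no
     * letters at all.
     */
    public static double arabicLetterRatio(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue;
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
                arabic++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    /**
     * Hypothetical heuristic: text expected to be Arabic is treated as junk
     * when fewer than minRatio of its letters are in the Arabic script.
     */
    public static boolean looksLikeJunk(String text, double minRatio) {
        return arabicLetterRatio(text) < minRatio;
    }

    public static void main(String[] args) {
        String extracted = args.length > 0 ? args[0] : "";
        // 0.5 is an arbitrary illustrative threshold, not a tuned value.
        if (looksLikeJunk(extracted, 0.5)) {
            System.out.println("Text does not look like Arabic; consider OCR.");
        } else {
            System.out.println("Text looks like Arabic.");
        }
    }
}
```

A check like this only makes sense when the document is known to be mostly 
Arabic. Mixed-language files would need a different threshold or a different 
test.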

> Parsing Arabic PDF is returning bad data
> ----------------------------------------
>
>                 Key: TIKA-4231
>                 URL: https://issues.apache.org/jira/browse/TIKA-4231
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.6.0, 2.9.1
>         Environment: I am using Java 18 and the Maven dependency 
> tika-parsers-standard-package 
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
>  
>            Reporter: Aamir
>            Priority: Major
>         Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
>
> Attached is a PDF with Arabic text in it. 
> When parsed using Tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text file, which contains the parsed text, is also attached. 
> Most other Arabic PDFs parse fine, but this one produces this output. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
