[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833807#comment-17833807
 ] 

Tilman Hausherr commented on TIKA-4231:
---

Yes, it is text, but the PDF uses a feature that we don't support. Instead 
of giving each glyph its own Unicode mapping, it stores the text for 
extraction at a separate level.
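
As a rough illustration of that distinction, here is a toy model in Python. It is not PDFBox or Tika code. It assumes the feature in question is the "ActualText" entry named elsewhere in this thread (PDFBOX-3248), and the names MarkedSpan, extract_per_glyph and extract_with_actual_text are hypothetical, made up for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

REPLACEMENT = "\ufffd"  # U+FFFD, emitted when a glyph has no usable mapping


@dataclass
class MarkedSpan:
    """Toy stand-in for a run of glyphs inside a marked-content span."""
    glyph_codes: List[int]             # codes of the glyphs drawn on the page
    actual_text: Optional[str] = None  # replacement text for the whole span


def extract_per_glyph(span: MarkedSpan, to_unicode: Dict[int, str]) -> str:
    """Map every glyph code on its own; unknown codes become U+FFFD."""
    return "".join(to_unicode.get(code, REPLACEMENT) for code in span.glyph_codes)


def extract_with_actual_text(span: MarkedSpan, to_unicode: Dict[int, str]) -> str:
    """Prefer the span-level replacement text when the span carries one."""
    if span.actual_text is not None:
        return span.actual_text
    return extract_per_glyph(span, to_unicode)


# A span whose glyphs have no per-glyph mapping but which carries span-level text.
span = MarkedSpan(glyph_codes=[17, 42, 99], actual_text="سلام")
print(repr(extract_per_glyph(span, {})))
print(repr(extract_with_actual_text(span, {})))
```

An extractor that only implements the first function sees nothing but unmapped glyphs, while a reader that honours the span-level text recovers the word.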

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
> Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18 and the Maven dependency
> tika-parsers-standard-package
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
>
> Reporter: Aamir
> Priority: Major
> Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
>
> Attached is a PDF with Arabic text in it.
> When parsed using Tika version 2.6.0 or 2.9.1, it produces gibberish
> characters.
> The generated text file, which contains the parsed text, is also attached.
> Most other Arabic PDFs parse fine, but this one gives this output.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833745#comment-17833745
 ] 

Tim Allison commented on TIKA-4231:
---

On some PDFs, there can be problems with Unicode mappings and other glyph 
issues. Some of these files render well, but the underlying electronic 
text is junk. In those cases, OCR is the best option.

I haven’t looked at this PDF and don’t know whether that is the case for this 
one.
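
One way to decide whether a given extraction falls into that category is to check how much of the extracted text is in the expected script. The Python sketch below is only a heuristic: the function names and the 0.5 threshold are assumptions for this example, not part of Tika or PDFBox, and it assumes the document is expected to be mostly Arabic.

```python
# Unicode ranges that hold Arabic letters and their presentation forms.
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFC),  # Arabic Presentation Forms-B, excluding U+FEFF (the BOM)
)


def is_arabic(ch: str) -> bool:
    """Return True if the character lies in one of the Arabic ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_RANGES)


def needs_ocr(text: str, min_arabic_share: float = 0.5) -> bool:
    """Guess whether extracted text is too damaged to use.

    min_arabic_share is an arbitrary cut-off chosen for this sketch.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True  # nothing was extracted at all
    arabic = sum(1 for ch in visible if is_arabic(ch))
    return arabic / len(visible) < min_arabic_share


print(needs_ocr("سلام عليكم"))
print(needs_ocr("\ufffd\ufffd\ufffd a"))
```

The check can miss files whose glyphs are mapped to the wrong Arabic characters, so it is a first filter rather than proof that the text is good.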



[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-03 Thread Aamir (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833740#comment-17833740
 ] 

Aamir commented on TIKA-4231:
-

Why use OCR? This is text, not images. 



[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-02 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833385#comment-17833385
 ] 

Tilman Hausherr commented on TIKA-4231:
---

No, this is not being worked on. You'll have to use OCR.



[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-02 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833344#comment-17833344
 ] 

Tim Allison commented on TIKA-4231:
---

If you run Poppler's pdftotext against the file or copy and paste out of Adobe 
Reader into a text file, do you get higher quality text?



[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-02 Thread Aamir (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833329#comment-17833329
 ] 

Aamir commented on TIKA-4231:
-

Is this issue being worked on? Any updates please?



[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832293#comment-17832293
 ] 

Aamir commented on TIKA-4231:
-

No, this doesn't look better. Actually, I would say that it looks worse than 
before.



[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832291#comment-17832291
 ] 

Tilman Hausherr commented on TIKA-4231:
---

I have attached an extraction made with PDFBox 2.0.31 (arabic-pdfbox.txt). 
Is this better or not? I've added a BOM and removed the 00 bytes. In the Tika 
extraction there are many "ef bf bd" byte sequences instead, which is the 
UTF-8 encoding of the replacement character �.

A possible explanation for why Adobe Reader works better is that this file uses 
the "ActualText" feature, which PDFBox doesn't support (PDFBOX-3248).

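The byte-level observations in this comment can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function names and the sample bytes are made up for illustration and are not taken from the attached files.

```python
import codecs

REPLACEMENT = "\ufffd"  # U+FFFD; its UTF-8 encoding is the bytes ef bf bd


def replacement_ratio(raw: bytes) -> float:
    """Return the share of decoded characters that are U+FFFD."""
    text = raw.decode("utf-8", errors="replace")
    if not text:
        return 0.0
    return text.count(REPLACEMENT) / len(text)


def clean_for_viewing(raw: bytes) -> bytes:
    """Drop 00 bytes and prepend a UTF-8 BOM if one is not already present."""
    stripped = raw.replace(b"\x00", b"")
    if stripped.startswith(codecs.BOM_UTF8):
        return stripped
    return codecs.BOM_UTF8 + stripped


# Hypothetical sample: two replacement characters, one Arabic word, one 00 byte.
sample = b"\xef\xbf\xbd\xef\xbf\xbd" + "سلام".encode("utf-8") + b"\x00"
print(replacement_ratio(sample))
print(clean_for_viewing(sample).hex(" "))
```

Removing 00 bytes is safe for UTF-8 text because no multi-byte UTF-8 sequence contains a zero byte. A high replacement ratio only says that characters were lost; it does not say why.
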


[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832289#comment-17832289
 ] 

Aamir commented on TIKA-4231:
-

The problem persists with 2.9.1.
I am also updating the versions in this ticket to make it clear that the 
latest version has the issue too.

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
> Affects Versions: 2.6.0
> Environment: I am using Java 18 and the Maven dependency
> tika-parsers-standard-package
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
>
> Reporter: Aamir
> Priority: Major
> Attachments: arabic.pdf, arabic.txt
>
>
> Attached is a PDF with Arabic text in it.
> When parsed using Tika version 2.6.0, it produces gibberish characters.
> The generated text file, which contains the parsed text, is also attached.
> Most other Arabic PDFs parse fine, but this one gives this output.





[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832284#comment-17832284
 ] 

Tilman Hausherr commented on TIKA-4231:
---

This doesn't change my argument. The latest version is 2.9.1; please try with 
that one.



[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832260#comment-17832260
 ] 

Aamir commented on TIKA-4231:
-

Sorry, I meant tika-parsers-standard-package 2.6.0

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
> Affects Versions: 2.6.0
> Environment: I am using Java 18 and the Maven dependency
> tika-parsers-standard-package
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
>
> Reporter: Aamir
> Priority: Major
> Attachments: arabic.pdf, arabic.txt
>
>
> Attached is a PDF with Arabic text in it.
> When parsed using PDFBox version 2.6.0, it produces gibberish characters.
> The generated text file, which contains the parsed text, is also attached.
> Most other Arabic PDFs parse fine, but this one gives this output.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832258#comment-17832258
 ] 

Tilman Hausherr commented on TIKA-4231:
---

The current Tika version is 2.9.1, soon to be 2.9.2. There is no "PDFBox 
version 2.6.0".
