[jira] [Comment Edited] (PDFBOX-4531) Extraction of Arabic PDF has incorrect ordering of normalized ligatures

Tilman Hausherr (Jira) Sat, 25 Feb 2023 02:05:55 -0800


    [ https://issues.apache.org/jira/browse/PDFBOX-4531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691713#comment-17691713 ]


Tilman Hausherr edited comment on PDFBOX-4531 at 2/25/23 10:04 AM:
-------------------------------------------------------------------

I don't know if I kept any Hebrew test documents. I only remember the Arabic 
documents: two of them because of their names, and the third because of its 
short number.

I changed the patch so that BOTH constants are checked and reran the test to 
see whether any more documents are extracted differently, but none are.

If any of you has a Hebrew document with ligatures (preferably short), I can 
add it to my test corpus.

To [~komedani] - should any of your work be added to his work? Or should I 
commit his work first?


was (Author: tilman):
I don't know if I kept any hebrew test documents. I only remember the arab 
documents because of their name for two and because the short number for the 
third one.

I changed the patch so that BOTH constants are checked and repeated the test to 
see what happens, i.e. if more documents are differently extracted, but no.

If any of you has a hebrew document with ligatures (preferably short) I can add 
it to my test corpus.

To [~komedani] ni - should any of your work be added to his work? Or should I 
commit his work first?

> Extraction of Arabic PDF has incorrect ordering of normalized ligatures
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-4531
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4531
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.15
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: Arabic, regression
>         Attachments: FES-GGArabisch-p112.pdf, PDFBOX-4531-reduced.pdf, 
> PDFBOX-679-toobig.pdf, RAND_PE122z1.arabic.pdf, artikel1_20_arab.pdf, 
> bidi-ligature-1.pdf, bidi-ligature-2.pdf, bidi-ligature.patch, diff-output.zip
>
>
> As reported by Elias Peterson in the mailing list:
> {quote}
> I think I'm seeing some issues concerning the handling of the Arabic 
> lam-with-alef ligature.  I'm attempting to process the PDF here:
> https://www.rand.org/content/dam/rand/pubs/perspectives/PE100/PE122/RAND_PE122z1.arabic.pdf
> When I run the ExtractText command with 2.0.15 I get the following:
> $ java -jar pdfbox-app-2.0.15.jar ExtractText -encoding UTF-8 
> RAND_PE122z1.arabic.pdf output.txt
> $ head output.txt
> C O R P O R A T I O N
> منظور تحليلي
> رؤى خبير بشأن قضايا السياسات اآلنية
> االتفاق مع إيران
> األيام التي تلي
> ...
> The issue being with the last two lines in the above snippet where my 
> understanding is that the ligature لا  was normalized but that the two 
> letters that compose it are in the wrong order.  I was thinking that 
> PDFBOX-684 sounded similar, and running the same PDF through 1.8.16 I see the 
> ligature is normalized in the way I think is expected (although the 
> interspersed English-language words are backwards here).
> $ java -jar pdfbox-app-1.8.16.jar ExtractText -encoding UTF-8 
> RAND_PE122z1.arabic.pdf output.txt
> ...
> $ head output.txt
> N O I T A R O P R O C
> منظور تحليلي
> رؤى خبير بشأن قضايا السياسات الآنية
> الاتفاق مع إيران
> الأيام التي تلي
> ...
> {quote}
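
For illustration only, here is a minimal sketch of one way the swapped order 
described above can arise. It is not taken from the PDFBox source or from 
bidi-ligature.patch. It assumes the text arrives in visual (left-to-right 
glyph) order and contains U+FEFB (ARABIC LIGATURE LAM WITH ALEF ISOLATED 
FORM). The class and method names are hypothetical. If the ligature is 
decomposed before the run is reversed into logical order, the lam and alef 
end up swapped; reversing first and decomposing afterwards keeps them in 
order.

```java
import java.text.Normalizer;

public class LamAlefOrder {

    // Hypothetical input: the word U+0627 U+FEFB U+062A U+0641 U+0627 U+0642
    // stored in visual order, i.e. reversed, with the lam-alef ligature intact.
    static final String VISUAL = "\u0642\u0627\u0641\u062A\uFEFB\u0627";

    /** NFKC decomposes U+FEFB into U+0644 (lam) followed by U+0627 (alef). */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** Decompose while still in visual order, then reverse: lam and alef swap. */
    static String decomposeThenReverse(String visual) {
        return reverse(decompose(visual));
    }

    /** Reverse into logical order first, then decompose: order is preserved. */
    static String reverseThenDecompose(String visual) {
        return decompose(reverse(visual));
    }

    /** Print code points so the result is readable without bidi rendering. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(codePoints(decomposeThenReverse(VISUAL)));
        System.out.println(codePoints(reverseThenDecompose(VISUAL)));
    }
}
```

With this input, the first method yields alef, alef, lam, ... (the sequence 
seen in the 2.0.15 output quoted above), and the second yields alef, lam, 
alef, ... (the sequence seen in the 1.8.16 output).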



