[
https://issues.apache.org/jira/browse/PDFBOX-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690866#comment-17690866
]
Mohamed M NourElDin commented on PDFBOX-5487:
---------------------------------------------
Hi [~tilman], I have just created a new pull request for this one too.
[PR#155 PDFBOX-5487: Remove all space characters if contained within the
adjacent letters|https://github.com/apache/pdfbox/pull/155]
I still want to run this fix on the test set that you shared in PDFBOX-4531, but
in the meantime here is an explanation of the issue in the attached PDF:
* There is a space character at the left of the last word in the first line
(last word because Arabic is written from right to left).
* This space actually overlaps with the adjacent Arabic letter 'ة'.
* When sorting is enabled, this space gets shifted into the middle of the word,
between the last and second-to-last letters (i.e. 'ية' becomes 'ي ة').
* The same issue occurs in the first word on the 9th line from the
bottom of the first page ('فضلا' becomes 'ف ضلا').
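To make the mechanism above concrete, here is a minimal, self-contained sketch. It is not the code from PR#155 and it does not use any PDFBox API: the {{Glyph}} class, the {{dropContainedSpaces}} and {{readRightToLeft}} methods, and the coordinates in the usage note are all hypothetical, chosen only for illustration. It assumes a single line of glyphs described by a left edge and a width, and treats a space as spurious when its horizontal extent lies entirely inside a letter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not taken from PDFBox or from PR#155.
final class OverlappingSpaceSketch {

    // Simplified stand-in for a positioned glyph on one text line.
    static final class Glyph {
        final String text;
        final float x;      // left edge
        final float width;  // horizontal extent

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        float right() {
            return x + width;
        }

        boolean isSpace() {
            return text.trim().isEmpty();
        }
    }

    // Keeps every glyph except spaces that sit fully inside some letter.
    static List<Glyph> dropContainedSpaces(List<Glyph> line) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : line) {
            if (!(g.isSpace() && isInsideSomeLetter(g, line))) {
                kept.add(g);
            }
        }
        return kept;
    }

    private static boolean isInsideSomeLetter(Glyph space, List<Glyph> line) {
        for (Glyph other : line) {
            if (!other.isSpace()
                    && other.x <= space.x
                    && space.right() <= other.right()) {
                return true;
            }
        }
        return false;
    }

    // Sorts by position, largest x first, to mimic right-to-left reading order.
    static String readRightToLeft(List<Glyph> line) {
        List<Glyph> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x).reversed());
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

With made-up coordinates such as 'ي' at x=20 (width 8), 'ة' at x=10 (width 10) and a space at x=14 (width 3), sorting by position places the space between the two letters, while filtering first with {{dropContainedSpaces}} leaves the word intact. A space that lies between two words is not contained in any letter, so it is kept.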
I have attached the extracted text before and after the fix, as well as
some screenshots drawn by the {{DrawPrintTextLocations}} utility to illustrate the
problem. I can also share a Python script that draws directly onto the PDF
file for debugging.
Pre-change: [^Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR
(withoutFixes).txt]
Post-change: [^Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.txt]
Regarding the *meld[123].png* images, the issues highlighted in
* *green* should be fixed by *PR#155*
* *red* and *blue* should be fixed by *PR#154*
Thanks
> extra whitespaces when extracting Arabic text
> ---------------------------------------------
>
> Key: PDFBOX-5487
> URL: https://issues.apache.org/jira/browse/PDFBOX-5487
> Project: PDFBox
> Issue Type: Bug
> Reporter: Fatemeh Elyasi
> Priority: Major
> Labels: Arabic
> Attachments: Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR
> (withoutFixes).txt, Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.pdf,
> Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.txt, PDFBOX-5487_
> اعلامية.png, PDFBOX-5487_ وفضلا.png, arabtest.pdf, meld1.png, meld2.png,
> meld3.png, screenshot-1.png
>
>
> Trying to extract text from an Arabic PDF. You may notice that some
> whitespaces are extracted in the wrong place.
> Example:
> Original word: العالمية
> Extracted word: العالمي ة
>
> Pdf is attached, the example word is on the first line.