[jira] [Commented] (PDFBOX-3096) Lack of Bidi (Arabic / Hebrew) test reordering in text extracted with PDFbox

TOMER MAHLIN (JIRA) Tue, 10 Nov 2015 04:24:53 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998478#comment-14998478
 ]


TOMER MAHLIN commented on PDFBOX-3096:
--------------------------------------

Thanks for the pointer. Conceptually, the issue discussed in PDFBOX-2252 does 
look the same. I will need to test PDFBox more thoroughly and also look at the 
code. Up front, my concerns are:

1. Even though there is a "standard" for the Adobe binary format, in practice 
many different tools are used to generate / author PDF content. Those tools 
differ significantly in how they handle Bidi text, which can affect how the 
data is presented (and thus how it is extracted). 

2. A Bidi engine compliant with the UBA (http://unicode.org/reports/tr9/) 
should be used to resolve the issue. The Java (Oracle JDK) Bidi engine (even 
if it is currently used to address the problem) is very rudimentary and offers 
only limited support.
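
To make the second concern concrete, here is a minimal sketch of what the JDK 
engine exposes. java.text.Bidi resolves embedding levels for a logical-order 
string, and Bidi.reorderVisually turns those levels into visual order. The 
class name BidiSketch and the helper logicalToVisual are made up for 
illustration; this is not PDFBox code. It ignores mirrored characters and 
supplementary (non-BMP) characters, and it only goes from logical to visual 
order, whereas text extraction needs the opposite direction.

```java
import java.text.Bidi;

class BidiSketch {
    /**
     * Reorders one logical-order paragraph into visual (left-to-right display) order.
     * Hypothetical helper for illustration only; works on UTF-16 chars, so BMP text only.
     */
    static String logicalToVisual(String logical) {
        if (logical.isEmpty()) {
            return logical;
        }
        // Base direction is taken from the first strong character, defaulting to LTR.
        Bidi bidi = new Bidi(logical, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        int n = logical.length();
        byte[] levels = new byte[n];
        Character[] chars = new Character[n];
        for (int i = 0; i < n; i++) {
            levels[i] = (byte) bidi.getLevelAt(i); // resolved embedding level per char
            chars[i] = logical.charAt(i);
        }
        // Reorders the chars array in place according to the resolved levels.
        Bidi.reorderVisually(levels, 0, chars, 0, n);
        StringBuilder visual = new StringBuilder(n);
        for (Character c : chars) {
            visual.append(c);
        }
        return visual.toString();
    }

    public static void main(String[] args) {
        // Latin text followed by Hebrew alef, bet, gimel in logical order.
        String logical = "abc \u05D0\u05D1\u05D2";
        String visual = logicalToVisual(logical);
        System.out.println(visual.equals("abc \u05D2\u05D1\u05D0"));
    }
}
```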

Running several tests on different PDF files will certainly make things 
clearer. 

> Lack of Bidi (Arabic / Hebrew) test reordering in text extracted with PDFbox
> ----------------------------------------------------------------------------
>
>                 Key: PDFBOX-3096
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3096
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: TOMER MAHLIN
>         Attachments: PDFBox_HebrewExtractedText.PNG
>
>
> Rendering rules for Bidi (Arabic / Hebrew) text differ between regular 
> Windows / Android / iOS environments and the Adobe environment. Adobe expects 
> text to appear in visual Bidi layout, while modern systems work with logical 
> Bidi layout. 
> When text is extracted from a PDF file it should be converted / normalized to 
> logical Bidi layout. 
> Example:
> Assuming capital letters stand for Bidi letters.
> 1. In the Adobe document you see: CBA
> 2. When you extract the content and display it in Notepad (or a web browser 
> or any similar tool) you will see ABC, while the expectation is to see CBA. 
> For real text containing both Hebrew and English (or Arabic and English) 
> characters, the resulting display is completely ruined after text extraction. 
> Moreover, even if we ignore the display and focus on text manipulation 
> (search, comparison, concatenation etc.), those operations will fail when the 
> same text authored in Notepad is compared with text extracted from a PDF file. 
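
As a rough illustration of the comparison problem described in the issue, the 
sketch below assumes the extracted string arrives in visual order while the 
typed string is in logical order. The class VisualOrderDemo and the helper 
reversePureRtlLine are hypothetical names, not part of PDFBox. Plain reversal 
only works when the whole line is a single right-to-left run; for mixed text 
it scrambles the Latin part, which is why a proper UBA-based approach is 
needed.

```java
class VisualOrderDemo {
    /**
     * Hypothetical helper: naive visual-to-logical conversion by reversing the line.
     * Only valid when the entire line is one right-to-left run of BMP characters.
     */
    static String reversePureRtlLine(String visual) {
        return new StringBuilder(visual).reverse().toString();
    }

    public static void main(String[] args) {
        // Hebrew alef, bet, gimel as typed in an editor (logical order).
        String typed = "\u05D0\u05D1\u05D2";
        // The same word as it might be extracted in visual order (assumption).
        String extracted = "\u05D2\u05D1\u05D0";

        // Comparison and search fail on the raw extracted text.
        System.out.println(typed.equals(extracted));           // false
        System.out.println(extracted.contains("\u05D0\u05D1")); // false

        // For a pure right-to-left line, reversal restores logical order.
        System.out.println(reversePureRtlLine(extracted).equals(typed)); // true

        // For mixed text, the same reversal also reverses the Latin letters.
        String mixedVisual = "abc \u05D2\u05D1\u05D0";
        System.out.println(reversePureRtlLine(mixedVisual).equals("\u05D0\u05D1\u05D2 cba")); // true
    }
}
```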



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

