[jira] [Commented] (PDFBOX-2409) got the wrong result from Arabic text extraction

John Hewson (JIRA) Sat, 25 Oct 2014 15:19:28 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184299#comment-14184299
 ]


John Hewson commented on PDFBOX-2409:
-------------------------------------

I found the problem: it was in TextPosition#insertDiacritic. The original 
author tried to insert combining diacritics before their base characters, on 
the assumption that this is how presentation order works. It is not. 
Combining characters always follow their base character, irrespective of 
whether the string is RTL and whether it is in presentation or logical 
order.
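
To illustrate the ordering rule, here is a minimal sketch using only the JDK's 
java.text.Normalizer. It is not the PDFBox code: the class and helper names 
(DiacriticOrderDemo, baseThenMark, markThenBase, nfc) are made up for this 
example. It shows that ALEF followed by MADDAH ABOVE composes under NFC, while 
the same two characters in the reverse order do not.

```java
import java.text.Normalizer;

public class DiacriticOrderDemo {
    // Hypothetical helper: the combining mark is appended after its base character.
    static String baseThenMark(char base, char mark) {
        return new StringBuilder().append(base).append(mark).toString();
    }

    // Hypothetical helper: the mistaken order, with the mark placed before the base.
    static String markThenBase(char base, char mark) {
        return new StringBuilder().append(mark).append(base).toString();
    }

    // Canonical composition (NFC) only combines a mark with a preceding base.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        char alef = '\u0627';   // ARABIC LETTER ALEF (base character)
        char maddah = '\u0653'; // ARABIC MADDAH ABOVE (combining mark)

        String correct = baseThenMark(alef, maddah);
        String wrong = markThenBase(alef, maddah);

        // Base then mark composes to the single code point U+0622.
        System.out.println(nfc(correct).length()); // prints 1
        // Mark then base has no preceding base to attach to, so nothing composes.
        System.out.println(nfc(wrong).length());   // prints 2
    }
}
```

The direction of the script plays no part here: the sequence is stored 
base-first, and the renderer decides where the mark is drawn.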

If you try out the latest version from trunk, my last commit should fix the 
problem.

> got the wrong result from Arabic text extraction
> ------------------------------------------------
>
>                 Key: PDFBOX-2409
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2409
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 2.0.0
>         Environment: Ubuntu 14.04 64bit
> java version "1.8.0_20"
>            Reporter: EugenePig
>             Fix For: 2.0.0
>
>         Attachments: THESSALONIANS.line - golden.txt, THESSALONIANS.pdf, 
> THESSALONIANS.txt, THESSALONIANS_win7_firefox.jpg, TextEdit-Arial.png, 
> adobe-utf8.txt, jahewson.mac.png
>
>
> java -jar pdfbox-app-1.8.7.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> Please compare THESSALONIANS.txt.jpg with THESSALONIANS.pdf. There are a lot 
> of differences. I just marked a few differences with red circles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
