[ https://issues.apache.org/jira/browse/PDFBOX-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184299#comment-14184299 ]
John Hewson commented on PDFBOX-2409:
-------------------------------------
I found the problem: it was in TextPosition#insertDiacritic. The original
author tried to insert combining diacritics before their base characters,
apparently on the assumption that this is how presentation order works.
It is not. A combining character always follows its base character,
regardless of whether the string is RTL and whether it is in presentation
or logical order.
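
To illustrate the ordering rule, here is a minimal standalone sketch. It is
not the PDFBox code or the actual fix; the class and method names
(CombiningOrderDemo, attachAfterBase, isCombiningMark, nfcLength) are made up
for this example. It only uses java.text.Normalizer and Character.getType
from the JDK to show that a mark placed after its base composes under NFC,
while the same mark placed before the base does not.

```java
import java.text.Normalizer;

// Illustrative only: not PDFBox code. All names here are hypothetical.
class CombiningOrderDemo {

    // Appends a combining mark directly after its base character.
    static String attachAfterBase(String base, String mark) {
        return base + mark;
    }

    // True if the code point is a combining mark (general category Mn, Mc or Me).
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Length in UTF-16 units of the NFC-normalized string.
    static int nfcLength(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).length();
    }

    public static void main(String[] args) {
        String acute = "\u0301";                     // COMBINING ACUTE ACCENT
        String good = attachAfterBase("e", acute);   // base first, then mark
        String bad = acute + "e";                    // mark wrongly placed first

        // Base + mark composes to a single precomposed character under NFC.
        System.out.println(nfcLength(good));         // prints 1
        // Mark + base cannot compose: the mark has no preceding base here.
        System.out.println(nfcLength(bad));          // prints 2

        // Arabic harakat are combining marks too; Arabic letters are not.
        System.out.println(isCombiningMark(0x064E)); // ARABIC FATHA: true
        System.out.println(isCombiningMark(0x0628)); // ARABIC LETTER BEH: false
    }
}
```

The same holds for Arabic text: a fatha (U+064E) belongs after the letter it
modifies in the extracted string, whatever direction the line is drawn in.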
If you try out the latest version from trunk, my last commit should fix the
problem.
> got the wrong result from Arabic text extraction
> ------------------------------------------------
>
> Key: PDFBOX-2409
> URL: https://issues.apache.org/jira/browse/PDFBOX-2409
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7, 2.0.0
> Environment: Ubuntu 14.04 64bit
> java version "1.8.0_20"
> Reporter: EugenePig
> Fix For: 2.0.0
>
> Attachments: THESSALONIANS.line - golden.txt, THESSALONIANS.pdf,
> THESSALONIANS.txt, THESSALONIANS_win7_firefox.jpg, TextEdit-Arial.png,
> adobe-utf8.txt, jahewson.mac.png
>
>
> java -jar pdfbox-app-1.8.7.jar ExtractText -sort -encoding UTF-8 THESSALONIANS.pdf
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -sort -encoding UTF-8 THESSALONIANS.pdf
> Please compare THESSALONIANS.txt.jpg with THESSALONIANS.pdf. There are a lot
> of differences; I have marked only a few of them with red circles.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)