[
https://issues.apache.org/jira/browse/PDFBOX-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186684#comment-14186684
]
EugenePig commented on PDFBOX-2409:
-----------------------------------
It’s my fault: I uploaded the wrong golden sample, one that had not been
normalized first. The correct sample is THESSALONIANS.Sample.txt, which I just
uploaded. There are two differences between THESSALONIANS.Sample.jpg and
THESSALONIANS-comment-14184298.jpg, but they have the same root cause.
Below I have pasted part of section 8.6 of the book “Unicode Explained”.
===========================================================================================
8.6.6. Spacing Diacritic Marks
When a combining diacritic mark is applied to a space character, we get the
diacritic itself as a visible character. Alternatively, we might use a
character that itself represents a spacing diacritic mark, often called
"spacing clones" of diacritic marks. Such characters appear, for historical
reasons, in different blocks, such as Latin-1 Supplement and Spacing Modifier
Letters.
Starting from Unicode 4.1, the recommendation is to apply a combining
diacritic mark to a no-break space U+00A0 rather than space U+0020. The reason
is "potential conflicts with the handling of sequences of U+0020 space
characters in contexts like XML." However, the formal definitions still
define decompositions using the space. For example, the acute accent ´ (U+00B4)
is by definition compatibility equivalent to a two-character sequence
consisting of a space U+0020 and a combining acute accent U+0301.
Spacing diacritic marks do not have much use. Sometimes we might wish to
mention a diacritic in text, such as "the acute ´ has varying shapes." More
often, the spacing diacritic marks are used mistakenly (or questionably) as
replacements for more appropriate characters (e.g., the acute as an apostrophe).
Some Basic Latin (ASCII) characters are historically derived from diacritic
marks but are now treated as characters on their own. For example, the tilde ~
(U+007E) is not treated as a spacing clone of the combining tilde U+0303; that
would in fact be odd, since the tilde has a rather different appearance.
Instead, there is a separate character, small tilde (U+02DC), which is by
definition compatibility equivalent to U+0020 U+0303.
==========================================================================================
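The decompositions quoted above can be checked with the JDK's own normalizer.
The following is only a minimal sketch: the class name SpacingCloneDemo and
the helpers hex and nfkd are made up for illustration and are not part of
PDFBox.

```java
import java.text.Normalizer;

public class SpacingCloneDemo {
    // Format each UTF-16 code unit as U+XXXX so invisible characters show up.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Compatibility decomposition (NFKD) of the input string.
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        System.out.println(hex(nfkd("\u00B4"))); // acute accent
        System.out.println(hex(nfkd("\u02DC"))); // small tilde
        System.out.println(hex(nfkd("~")));      // ASCII tilde
    }
}
```

The first two lines printed should show a space U+0020 followed by the
combining mark (U+0301 and U+0303 respectively), while the ASCII tilde stays
U+007E, which matches the book's description.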
In our sample, U+FC62 and U+FEAE are at the same position, so they overlap.
That is how I know U+FC62 is a spacing diacritic mark. In a PDF file, a
renderer can handle this well, but in a plain text file I have to remove the
space. So if you compare THESSALONIANS.Sample.txt and
THESSALONIANS-UTF16-comment-14184298.txt in binary mode, you will find the
difference.
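To illustrate the kind of cleanup described here, below is a rough sketch that
normalizes the text and then drops a space that only serves as the carrier for
a following combining mark. It assumes NFKC as the normalization form (the
comment does not say which form was used), it only handles characters in the
Basic Multilingual Plane, and the names SpacingMarkCleanup and
stripSpaceBeforeMarks are hypothetical. This is not how PDFBox does it.

```java
import java.text.Normalizer;

public class SpacingMarkCleanup {
    // True for the Unicode general categories Mn, Me and Mc.
    static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    // Normalize (NFKC is an assumption), then remove every U+0020 that is
    // immediately followed by a combining mark.
    static String stripSpaceBeforeMarks(String text) {
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFKC);
        StringBuilder out = new StringBuilder(normalized.length());
        for (int i = 0; i < normalized.length(); i++) {
            char c = normalized.charAt(i);
            boolean markFollows = i + 1 < normalized.length()
                    && isCombiningMark(normalized.charAt(i + 1));
            if (c == ' ' && markFollows) {
                continue; // skip the carrier space, keep the mark itself
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+FC62 is the character from the sample discussed above.
        String cleaned = stripSpaceBeforeMarks("\uFC62");
        for (int i = 0; i < cleaned.length(); i++) {
            System.out.printf("U+%04X%n", (int) cleaned.charAt(i));
        }
    }
}
```

Ordinary spaces between words are left alone, because they are not followed by
a combining mark. Note that a mark left without its space will attach to
whatever character precedes it in the text, which may or may not be what the
golden sample expects.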
> got the wrong result from Arabic text extraction
> ------------------------------------------------
>
> Key: PDFBOX-2409
> URL: https://issues.apache.org/jira/browse/PDFBOX-2409
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7, 2.0.0
> Environment: Ubuntu 14.04 64bit
> java version "1.8.0_20"
> Reporter: EugenePig
> Fix For: 2.0.0
>
> Attachments: THESSALONIANS-UTF16-comment-14184298.txt,
> THESSALONIANS-comment-14184298.jpg, THESSALONIANS-comment-14184298.txt,
> THESSALONIANS.Sample.jpg, THESSALONIANS.Sample.txt, THESSALONIANS.line -
> golden.txt, THESSALONIANS.pdf, THESSALONIANS.txt,
> THESSALONIANS_win7_firefox.jpg, TextEdit-Arial.png, adobe-utf8.txt,
> jahewson.mac.png
>
>
> java -jar pdfbox-app-1.8.7.jar ExtractText -sort -encoding UTF-8
> THESSALONIANS.pdf
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -sort -encoding UTF-8
> THESSALONIANS.pdf
> Please compare THESSALONIANS.txt.jpg with THESSALONIANS.pdf. There are a lot
> of differences. I just marked a few differences with red circles.