[jira] [Commented] (PDFBOX-2409) got the wrong result from Arabic text extraction

EugenePig (JIRA) Tue, 28 Oct 2014 03:48:03 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186684#comment-14186684
 ]


EugenePig commented on PDFBOX-2409:
-----------------------------------

It’s my fault. I uploaded the wrong golden sample; it had not been normalized 
first. The correct sample is THESSALONIANS.Sample.txt, which I just uploaded. 
There are two differences between THESSALONIANS.Sample.jpg and 
THESSALONIANS-comment-14184298.jpg, but they have the same root cause. 
To explain it, I have pasted part of section 8.6 of the book “Unicode Explained”.

===========================================================================================
8.6.6. Spacing Diacritic Marks
When a combining diacritic mark is applied to a space character, we get the 
diacritic itself as a visible character. Alternatively, we might use a 
character that itself represents a spacing diacritic mark, often called 
"spacing clones" of diacritic marks. Such characters appear, for historical 
reasons, in different blocks, such as Latin-1 Supplement and Spacing Modifier 
Letters.
Starting from Unicode 4.1, the recommendation is to apply a combining 
diacritic mark to a no-break space U+00A0 rather than space U+0020. The reason 
is "potential conflicts with the handling of sequences of U+0020 space 
characters in contexts like XML." However, the formal definitions still 
define decompositions using the space. For example, the acute accent ´ (U+00B4) 
is by definition compatibility equivalent to a two-character sequence 
consisting of a space U+0020 and a combining acute accent U+0301.
Spacing diacritic marks do not have much use. Sometimes we might wish to 
mention a diacritic in text, such as "the acute ´ has varying shapes." More 
often, the spacing diacritic marks are used mistakenly (or questionably) as 
replacements for more appropriate characters (e.g., the acute as an apostrophe).
Some Basic Latin (ASCII) characters are historically derived from diacritic 
marks but are now treated as characters on their own. For example, the tilde ~ 
(U+007E) is not treated as a spacing clone of the combining tilde U+0303; that 
would in fact be odd, since the tilde has a rather different appearance. 
Instead, there is a separate character, small tilde ˜ (U+02DC), which is by 
definition compatibility equivalent to U+0020 U+0303.
==========================================================================================
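
The two compatibility equivalences quoted above can be checked with the JDK's 
java.text.Normalizer. This is only a small sketch for illustration; the class 
and method names (SpacingCloneDemo, nfkd, codePoints) are made up here and are 
not part of PDFBox or of the book.

```java
import java.text.Normalizer;

class SpacingCloneDemo {
    // Compatibility decomposition (NFKD) of the input string.
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    // Renders each code point as U+XXXX, separated by single spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Acute accent: space followed by combining acute accent.
        System.out.println(codePoints(nfkd("\u00B4")));
        // Small tilde: space followed by combining tilde.
        System.out.println(codePoints(nfkd("\u02DC")));
        // ASCII tilde: has no decomposition and stays as it is.
        System.out.println(codePoints(nfkd("~")));
    }
}
```

The ASCII tilde is included to show the contrast described in the excerpt: it 
is a character on its own and is not decomposed.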

In our sample, U+FC62 and U+FEAE are drawn at the same position, so they 
overlap. That is how I know U+FC62 is used as a spacing diacritic mark. In a 
PDF file, a renderer can deal with this well, but in a plain text file I have 
to remove the space. So if you compare THESSALONIANS.Sample.txt and 
THESSALONIANS-UTF16-comment-14184298.txt in binary mode, you will find the 
difference.
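
One possible way to do the "remove the space" step is sketched below. It is an 
assumption about the approach, not the procedure actually used to produce the 
golden sample, and the names (SpacingMarkStripper, stripSpacingCarrier, 
isCombiningMark) are hypothetical. The idea: decompose each code point with 
NFKD, and if the result is a space followed only by combining marks, keep the 
marks and drop the space so that they attach to the preceding character.

```java
import java.text.Normalizer;

class SpacingMarkStripper {
    // True for the general categories Mn, Mc and Me (combining marks).
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Replaces every spacing diacritic mark by its combining marks only.
    // All other code points are copied through unchanged.
    static String stripSpacingCarrier(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String original = new String(Character.toChars(cp));
            String decomposed = Normalizer.normalize(original, Normalizer.Form.NFKD);
            // A spacing clone decomposes to U+0020 plus one or more marks.
            boolean spacingClone = decomposed.length() > 1
                && decomposed.charAt(0) == ' '
                && decomposed.codePoints().skip(1)
                       .allMatch(SpacingMarkStripper::isCombiningMark);
            out.append(spacingClone ? decomposed.substring(1) : original);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "e" followed by a spacing acute becomes "e" plus combining acute.
        String result = stripSpacingCarrier("e\u00B4");
        result.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Ordinary spaces between words are untouched, because only code points whose 
own decomposition starts with a space are rewritten. Whether this matches the 
expected output for U+FC62 in THESSALONIANS.pdf would need to be checked 
against the attached sample files.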


> got the wrong result from Arabic text extraction
> ------------------------------------------------
>
>                 Key: PDFBOX-2409
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2409
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 2.0.0
>         Environment: Ubuntu 14.04 64bit
> java version "1.8.0_20"
>            Reporter: EugenePig
>             Fix For: 2.0.0
>
>         Attachments: THESSALONIANS-UTF16-comment-14184298.txt, 
> THESSALONIANS-comment-14184298.jpg, THESSALONIANS-comment-14184298.txt, 
> THESSALONIANS.Sample.jpg, THESSALONIANS.Sample.txt, THESSALONIANS.line - 
> golden.txt, THESSALONIANS.pdf, THESSALONIANS.txt, 
> THESSALONIANS_win7_firefox.jpg, TextEdit-Arial.png, adobe-utf8.txt, 
> jahewson.mac.png
>
>
> java -jar pdfbox-app-1.8.7.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> Please compare THESSALONIANS.txt.jpg with THESSALONIANS.pdf. There are a lot 
> of differences. I just marked a few differences with red circles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
