[jira] [Comment Edited] (PDFBOX-3975) ExtractText converts some diacritics to combining forms that don't get combined

Matthew Self (JIRA) Sun, 22 Oct 2017 18:32:13 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214506#comment-16214506
 ]


Matthew Self edited comment on PDFBOX-3975 at 10/23/17 1:31 AM:
----------------------------------------------------------------

After reading more about combining marks 
(http://unicode.org/faq/char_combmark.html) I see that my suggestion is based 
on an incorrect assumption.  There are many valid accented characters that 
don't have a precomposed form and can only be represented by the base character 
plus a combining diacritic.  So, the fact that Normalizer.normalize() doesn't 
convert a pair into a single character does not mean that the pair is not a 
valid combination.
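
To illustrate the point, here is a minimal sketch using only the standard 
java.text.Normalizer API.  The class and method names are made up for this 
example and are not part of PDFBox.  It checks whether NFC normalization folds 
a base character plus U+0302 into a single code point:

```java
import java.text.Normalizer;

public class NfcCompositionDemo {
    // Hypothetical helper (not a PDFBox method): true when NFC folds
    // base + combining mark into exactly one code point.
    static boolean composesToSingleCodePoint(String base, String combiningMark) {
        String nfc = Normalizer.normalize(base + combiningMark, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length()) == 1;
    }

    public static void main(String[] args) {
        String circumflex = "\u0302"; // COMBINING CIRCUMFLEX ACCENT

        // "e" + U+0302 has a precomposed form (U+00EA).
        System.out.println(composesToSingleCodePoint("e", circumflex));
        // "x" + U+0302 is a valid sequence but has no precomposed form.
        System.out.println(composesToSingleCodePoint("x", circumflex));
        // A space followed by U+0302 also stays as two code points.
        System.out.println(composesToSingleCodePoint(" ", circumflex));
    }
}
```

Both the "x" case and the space case stay as two code points, so the length of 
the normalized result alone cannot separate a legitimate accented letter from a 
mark that was attached to the wrong base.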

So, back to the original issue, it seems that the correct solution to prevent 
U+005E (CIRCUMFLEX ACCENT) being turned into U+0302 (COMBINING CIRCUMFLEX 
ACCENT) in this particular PDF file is to tighten up the overlap detection so 
that combineDiacritic() is not called in this case at all.  It seems that there 
is no reliable way to reject the potential combination of a base character and 
diacritic mark based only on the characters.  A COMBINING CIRCUMFLEX ACCENT 
could in theory be applied to any base character.
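
As a rough sketch of what "tightening the overlap detection" could mean, the 
following compares horizontal extents and only treats a mark as a diacritic 
when most of its width lies over the base glyph.  Everything here is an 
assumption for illustration: the Extent type, the method names, and the 0.5 
threshold are hypothetical, and this is not how PDFBox's actual overlap test 
is written.

```java
public class OverlapCheckSketch {
    // Hypothetical stand-in for a glyph's horizontal extent; not a PDFBox type.
    static final class Extent {
        final float left;
        final float right;

        Extent(float left, float right) {
            this.left = left;
            this.right = right;
        }

        float width() {
            return right - left;
        }
    }

    // Fraction of the mark's width that lies inside the base glyph's extent.
    static float overlapFraction(Extent base, Extent mark) {
        float overlap = Math.min(base.right, mark.right) - Math.max(base.left, mark.left);
        if (overlap <= 0f || mark.width() <= 0f) {
            return 0f;
        }
        return overlap / mark.width();
    }

    // Merge only when the mark sits mostly over the base (threshold is an assumption).
    static boolean shouldMergeAsDiacritic(Extent base, Extent mark, float minFraction) {
        return overlapFraction(base, mark) >= minFraction;
    }

    public static void main(String[] args) {
        // A "^" that only grazes the right edge of the preceding space.
        Extent space = new Extent(100f, 103f);
        Extent caret = new Extent(102.5f, 107.5f);
        System.out.println(shouldMergeAsDiacritic(space, caret, 0.5f));

        // An accent positioned squarely over its base letter.
        Extent letter = new Extent(100f, 105f);
        Extent accent = new Extent(100.5f, 104.5f);
        System.out.println(shouldMergeAsDiacritic(letter, accent, 0.5f));
    }
}
```

With a check along these lines, a slight overlap such as the one suspected in 
this PDF would fall below the threshold and the "^" would stay as U+005E.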


was (Author: mself):
After reading more about combining marks 
(http://unicode.org/faq/char_combmark.html) I see that my suggestion is based 
on an incorrect assumption.  There are many valid accented characters that 
don't have a combined form and can only be represented by the base character 
plus a combining diacritic.  So, the fact that Normalizer.Form.NFC() doesn't 
convert a pair into a single character does not mean that it is not a valid 
combination.

So, back to the original issue, it seems that the correct solution to prevent 
U+005E (CIRCUMFLEX ACCENT) being turned into U+0302 (COMBINING CIRCUMFLEX 
ACCENT) in this particular PDF file is to tighten up the overlap detection so 
that combineDiacritic() is not called in this case at all.  It seems that there 
is no reliable way to reject the potential combination of a base character and 
diacritic mark based only on the characters.  A COMBINING CIRCUMFLEX ACCENT 
could in theory be applied to any base character.

> ExtractText converts some diacritics to combining forms that don't get 
> combined
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3975
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3975
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>            Reporter: Matthew Self
>
> When I use ExtractText on the file 
> http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf,
> there is an issue with the "^" character on page 15.
> The extracted text is "special characters ( * ! & } ̂  % and so on ) . )".
> Note that the extracted "^" character is U+0302 (COMBINING CIRCUMFLEX ACCENT) 
> when it ought to be plain old U+005E (CIRCUMFLEX ACCENT).
> I believe that what is happening is the original U+005E character is being 
> converted to U+0302 by the DIACRITICS map in TextPosition.java:
>         map.put(0x005e, "\u0302");
> This is probably because the character slightly overlaps the preceding space 
> character.  But this combining diacritic can't then be combined with the 
> space character, so the extracted text contains the combining character 
> instead of the original.
> One solution would be to tighten up the detection of overlaps so that 
> combineDiacritic() is not called in this instance.
> Another (perhaps more robust) solution would be to verify in 
> combineDiacritic() that the call to Normalizer.normalize() actually does 
> combine the combining form of the diacritic with the previous character.  If 
> the result of calling Normalizer.normalize() has more than one character in 
> it, then the diacritic must not have been combined with the previous 
> character.  In that case, the diacritic should not be merged.
> The goal would be for the extracted text to never contain combining 
> characters that failed to combine.
> P.S.  Thank you for the great library of PDFBox!



