[
https://issues.apache.org/jira/browse/PDFBOX-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-3975:
------------------------------------
Priority: Minor (was: Major)
> ExtractText converts some diacritics to combining forms that don't get
> combined
> -------------------------------------------------------------------------------
>
> Key: PDFBOX-3975
> URL: https://issues.apache.org/jira/browse/PDFBOX-3975
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.7
> Reporter: Matthew Self
> Priority: Minor
> Labels: diacritics
>
> When I use ExtractText on the file
> http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf,
> there is an issue with the "^" character on page 15.
> The extracted text is "special characters ( * ! & } ̂ % and so on ) . )".
> Note that the extracted "^" character is U+0302 (COMBINING CIRCUMFLEX ACCENT)
> when it ought to be plain old U+005E (CIRCUMFLEX ACCENT).
> I believe that what is happening is the original U+005E character is being
> converted to U+0302 by the DIACRITICS map in TextPosition.java:
> map.put(0x005e, "\u0302");
> This is probably because the character slightly overlaps the preceding space
> character. But the combining diacritic cannot be combined with a space
> character, so the extracted text contains the combining character instead of
> the original.
> One solution would be to tighten up the detection of overlaps so that
> combineDiacritic() is not called in this instance.
> Another (perhaps more robust) solution would be to verify in
> combineDiacritic() that the call to Normalizer.normalize() actually does
> combine the combining form of the diacritic with the previous character. If
> the result of calling Normalizer.normalize() has more than one character in
> it, then the diacritic must not have been combined with the previous
> character. In that case, the diacritic should not be merged.
> The goal would be for the extracted text to never contain combining
> characters that failed to combine.
> P.S. Thank you for the great PDFBox library!
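
The second proposal in the description (verify that Normalizer.normalize() really merged the diacritic) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual PDFBox code: the class DiacriticCheck and the methods combines() and mergeOrKeep() are hypothetical names, and the base text is assumed to be a single code point. Only java.text.Normalizer from the JDK is used.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of PDFBox.
class DiacriticCheck {

    // Returns true if NFC normalization merges base + combining mark
    // into exactly one code point (assumes base is a single code point).
    static boolean combines(String base, String combining) {
        String composed = Normalizer.normalize(base + combining, Normalizer.Form.NFC);
        return composed.codePointCount(0, composed.length()) == 1;
    }

    // If the combining form merges with the previous character, return the
    // merged result; otherwise keep the previous character followed by the
    // original (non-combining) character, e.g. U+005E instead of U+0302.
    static String mergeOrKeep(String previous, String original, String combining) {
        String composed = Normalizer.normalize(previous + combining, Normalizer.Form.NFC);
        if (composed.codePointCount(0, composed.length()) == 1) {
            return composed;
        }
        return previous + original;
    }

    public static void main(String[] args) {
        // "e" followed by COMBINING CIRCUMFLEX ACCENT has a precomposed form.
        System.out.println(combines("e", "\u0302"));
        // A space followed by the same mark has none, so nothing is merged.
        System.out.println(combines(" ", "\u0302"));
        // The fallback keeps the plain U+005E after the space.
        System.out.println(mergeOrKeep(" ", "^", "\u0302").equals(" ^"));
    }
}
```

With a check like this, a "^" that merely overlaps a space would stay U+005E, while a genuine accent over a letter would still be merged. Whether the check belongs in combineDiacritic() itself or in its caller is left open here.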