[
https://issues.apache.org/jira/browse/PDFBOX-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-3975:
------------------------------------
Attachment: PDF_32000_2008-p23-reduced2.pdf
PDF_32000_2008-p23-reduced1.pdf
Attached two reduced files, one with text like the original and the other with
more space. So you're right, it is the overlap detection. However, I won't work
on this issue because it's really about the heuristics (which are never
perfect) and the problem doesn't really have an impact IMHO.
If you would like to make a change, I can test it with my additional test files
(copyrighted files are not in the repository). If not, I'll just close this. I
set a label so we can find it more easily if the topic comes up again.
> ExtractText converts some diacritics to combining forms that don't get
> combined
> -------------------------------------------------------------------------------
>
> Key: PDFBOX-3975
> URL: https://issues.apache.org/jira/browse/PDFBOX-3975
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.7
> Reporter: Matthew Self
> Priority: Minor
> Labels: diacritics
> Attachments: PDF_32000_2008-p23-reduced1.pdf,
> PDF_32000_2008-p23-reduced2.pdf
>
>
> When I use ExtractText on the file
> http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf,
> there is an issue with the "^" character on page 15.
> The extracted text is "special characters ( * ! & } ̂ % and so on ) . )".
> Note that the extracted "^" character is U+0302 (COMBINING CIRCUMFLEX ACCENT)
> when it ought to be plain old U+005E (CIRCUMFLEX ACCENT).
> I believe that what is happening is the original U+005E character is being
> converted to U+0302 by the DIACRITICS map in TextPosition.java:
> map.put(0x005e, "\u0302");
> This is probably because the character slightly overlaps the preceding space
> character. But this combining diacritic can't be combined with the space
> character, so the extracted text contains the combining character instead of
> the original.
> One solution would be to tighten up the detection of overlaps so that
> combineDiacritic() is not called in this instance.
> Another (perhaps more robust) solution would be to verify in
> combineDiacritic() that the call to Normalizer.normalize() actually does
> combine the combining form of the diacritic with the previous character. If
> the result of calling Normalizer.normalize() has more than one character in
> it, then the diacritic must not have been combined with the previous
> character. In that case, the diacritic should not be merged.
> The goal would be for the extracted text to never contain combining
> characters that failed to combine.
> P.S. Thank you for the great library of PDFBox!
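The second suggestion in the quoted description (check that Normalizer.normalize() really did combine the diacritic with the previous character) could be sketched roughly as below. This is only an illustration under assumptions, not the PDFBox code: the class DiacriticCheck and the methods combines() and mergeOrKeep() are hypothetical names, the base is assumed to be a single code point, and NFC is assumed as the normalization form.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: illustrates the check proposed in the issue.
class DiacriticCheck {

    // True if base + combining mark normalizes (NFC) to one single code point.
    // Assumes 'base' holds exactly one code point.
    static boolean combines(String base, String combiningMark) {
        String composed = Normalizer.normalize(base + combiningMark, Normalizer.Form.NFC);
        return composed.codePointCount(0, composed.length()) == 1;
    }

    // Merge only when the mark really combines; otherwise keep the original
    // (non-combining) character so no stray combining mark reaches the output.
    static String mergeOrKeep(String base, String original, String combiningMark) {
        if (combines(base, combiningMark)) {
            return Normalizer.normalize(base + combiningMark, Normalizer.Form.NFC);
        }
        return base + original;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0302 composes to U+00EA
        System.out.println(combines("e", "\u0302"));
        // space followed by U+0302 does not compose
        System.out.println(combines(" ", "\u0302"));
        // falls back to the plain U+005E after the space
        System.out.println(mergeOrKeep(" ", "^", "\u0302"));
    }
}
```

One trade-off of this check: a base letter that has no precomposed form with the mark (for example "q" followed by U+0302) would also count as "not combined", so a real change would have to decide whether that case should keep the combining form.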