[jira] [Updated] (PDFBOX-3975) ExtractText converts some diacritics to combining forms that don't get combined

Tilman Hausherr (JIRA) Mon, 23 Oct 2017 10:18:09 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tilman Hausherr updated PDFBOX-3975:
------------------------------------
    Attachment: PDF_32000_2008-p23-reduced2.pdf
                PDF_32000_2008-p23-reduced1.pdf

Attached two reduced files, one with text like the original and the other with 
more space. So you're right, it is the overlap detection. However, I won't work 
on this issue because it's really about the heuristics (which are never 
perfect) and the problem doesn't really have an impact, IMHO.

If you would like to make a change, I can test it with my additional test files 
(copyrighted files are not in the repository). If not, I'll just close this. I 
set a label so we can find it more easily if the topic comes up again.

> ExtractText converts some diacritics to combining forms that don't get 
> combined
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3975
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3975
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>            Reporter: Matthew Self
>            Priority: Minor
>              Labels: diacritics
>         Attachments: PDF_32000_2008-p23-reduced1.pdf, 
> PDF_32000_2008-p23-reduced2.pdf
>
>
> When I use ExtractText on the file 
> http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf,
>  there is an issue with the "^" character on page 15.
> The extracted text is "special characters ( * ! & } ̂  % and so on ) . )".
> Note that the extracted "^" character is U+0302 (COMBINING CIRCUMFLEX ACCENT) 
> when it ought to be plain old U+005E (CIRCUMFLEX ACCENT).
> I believe that what is happening is the original U+005E character is being 
> converted to U+0302 by the DIACRITICS map in TextPosition.java:
>         map.put(0x005e, "\u0302");
> This is probably because the character slightly overlaps the preceding space 
> character.  But then this combining diacritic can't be combined with a space 
> character, so the extracted text contains the combining character instead of 
> the original.
> One solution would be to tighten up the detection of overlaps so that 
> combineDiacritic() is not called in this instance.
> Another (perhaps more robust) solution would be to verify in 
> combineDiacritic() that the call to Normalizer.normalize() actually does 
> combine the combining form of the diacritic with the previous character.  If 
> the result of calling Normalizer.normalize() has more than one character in 
> it, then the diacritic must not have been combined with the previous 
> character.  In that case, the diacritic should not be merged.
> The goal would be for the extracted text to never contain combining 
> characters that failed to combine.
> P.S.  Thank you for the great library of PDFBox!
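
The second suggestion in the quoted report (checking whether
Normalizer.normalize() really merged the diacritic into the previous
character) could be sketched roughly as below. This is only an illustration,
not PDFBox code: the class DiacriticCheck and the helper combinesWith() are
hypothetical names, and the sketch assumes the base text is a single code
point. Only java.text.Normalizer from the JDK is used.

```java
import java.text.Normalizer;

// Hypothetical illustration; this is not the code in TextPosition.java.
public class DiacriticCheck {

    // Returns true if NFC normalization merges the combining mark into the
    // base, i.e. the result is exactly one code point.
    // Assumption: base holds a single code point.
    static boolean combinesWith(String base, String combiningMark) {
        String composed =
                Normalizer.normalize(base + combiningMark, Normalizer.Form.NFC);
        return composed.codePointCount(0, composed.length()) == 1;
    }

    public static void main(String[] args) {
        String circumflex = "\u0302"; // COMBINING CIRCUMFLEX ACCENT

        // "e" + U+0302 composes to U+00EA, so merging would be safe.
        System.out.println(combinesWith("e", circumflex));

        // space + U+0302 has no precomposed form, so the mark stays separate;
        // in this case the original U+005E would be kept instead.
        System.out.println(combinesWith(" ", circumflex));
    }
}
```

With a check like this, a caller would fall back to the original character
whenever combinesWith() returns false, so the extracted text would not end up
with a stray combining character.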


