[ 
https://issues.apache.org/jira/browse/PDFBOX-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin LeFebvre updated PDFBOX-444:
-----------------------------------

    Component/s: Text extraction

> Incorrect Diacritic Merging/Placement
> -------------------------------------
>
>                 Key: PDFBOX-444
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-444
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Justin LeFebvre
>         Attachments: 03_2_SSL-sorted.txt, 03_2_SSL-unsorted.txt, 03_2_SSL.pdf
>
>
> While looking at the spacing issue reported in PDFBOX-77, I found a separate 
> issue with the placement of diacritic characters in the file 03_2_SSL.pdf, 
> which I have attached here. 
> The issue is that separate TextPositions are used to render a character 
> and its diacritic. For example, with the -sort option enabled, the word 
> "Änderung" is extracted as
> And¨ erung
> The diacritic should sit over the A, not after the d. Without -sort, the 
> word is extracted as
> ¨Anderung
> This is still not correct: the A and the diacritic should be merged so that 
> they take up one character's width of space. This occurs throughout the 
> document. 
> Currently, PDFBox does merge diacritic characters, but it assumes that the 
> TextPosition for the diacritic occurs after the TextPosition it is supposed 
> to be merged with, whereas in this file the diacritic TextPosition comes 
> first. 
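
To illustrate the kind of order-independent merge the report asks for, here is a minimal sketch. It is not PDFBox code: the Glyph class, the COMBINING table, and the overlaps and merge helpers are all hypothetical names made up for this example, and the horizontal-overlap test is an assumed heuristic. A spacing diacritic is attached to whichever neighbouring glyph it overlaps, whether that glyph comes before or after it, and the result is composed with java.text.Normalizer.

```java
import java.text.Normalizer;
import java.util.List;
import java.util.Map;

class DiacriticMergeSketch {
    // Hypothetical stand-in for a positioned glyph; NOT PDFBox's TextPosition.
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Spacing diacritics mapped to combining marks (small illustrative subset).
    static final Map<String, String> COMBINING = Map.of(
            "\u00A8", "\u0308",   // diaeresis
            "\u00B4", "\u0301",   // acute accent
            "\u0060", "\u0300");  // grave accent

    // Assumed heuristic: two glyphs belong together if their x-extents overlap.
    static boolean overlaps(Glyph a, Glyph b) {
        return a.x < b.x + b.width && b.x < a.x + a.width;
    }

    // Merges each diacritic into an overlapping neighbour, checking the
    // following glyph first and the preceding glyph second.
    static String merge(List<Glyph> glyphs) {
        int n = glyphs.size();
        StringBuilder[] out = new StringBuilder[n];
        for (int i = 0; i < n; i++) {
            out[i] = new StringBuilder(glyphs.get(i).text);
        }
        for (int i = 0; i < n; i++) {
            Glyph mark = glyphs.get(i);
            String combining = COMBINING.get(mark.text);
            if (combining == null) {
                continue; // not a diacritic
            }
            for (int j : new int[] {i + 1, i - 1}) {
                if (j < 0 || j >= n) {
                    continue;
                }
                Glyph base = glyphs.get(j);
                if (!COMBINING.containsKey(base.text) && overlaps(mark, base)) {
                    out[j].append(combining); // attach the mark to the base
                    out[i].setLength(0);      // drop the standalone diacritic
                    break;
                }
            }
        }
        StringBuilder joined = new StringBuilder();
        for (StringBuilder part : out) {
            joined.append(part);
        }
        // NFC composes base + combining mark into one code point where possible.
        return Normalizer.normalize(joined, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Diacritic positioned over the A but emitted before it, as in the report.
        List<Glyph> glyphs = List.of(
                new Glyph("\u00A8", 10f, 4f),
                new Glyph("A", 9f, 7f),
                new Glyph("n", 16f, 5f));
        System.out.println(merge(glyphs));
    }
}
```

A diacritic that overlaps neither neighbour is left untouched, so a standalone mark is not attached to an unrelated character. A real fix would presumably also need to compare vertical positions and handle the reordering that -sort introduces, which this sketch does not attempt.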

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
