[ https://issues.apache.org/jira/browse/PDFBOX-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Justin LeFebvre updated PDFBOX-444:
-----------------------------------

    Component/s: Text extraction

> Incorrect Diacritic Merging/Placement
> -------------------------------------
>
>                 Key: PDFBOX-444
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-444
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Justin LeFebvre
>         Attachments: 03_2_SSL-sorted.txt, 03_2_SSL-unsorted.txt, 03_2_SSL.pdf
>
>
> While looking at the spacing issue found in PDFBOX-77, I found a separate
> issue with the placement of diacritic characters in the file 03_2_SSL.pdf,
> which I have attached here.
>
> The issue is that separate TextPositions are used to render a character and
> its diacritic. For example, the word
> And¨ erung
> should have its diacritic over the A and not after the d. This is the output
> when the -sort option is enabled. Without -sort, the produced word looks
> like this:
> ¨Anderung
> This is still not correct, because the A and the diacritic should be merged
> to take up one character's width of space. This occurs throughout the
> document.
>
> Currently, PDFBox does handle merging of diacritic characters, but it assumes
> that the TextPosition for the diacritic occurs after the TextPosition it is
> supposed to be merged with. In this file the diacritic TextPosition comes
> beforehand.
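To illustrate the behaviour the report asks for, here is a minimal sketch of an order-independent merge: a diacritic is attached to whichever neighbouring character it overlaps horizontally, whether that character comes before or after it. This is not PDFBox code. `Glyph`, `mergeDiacritics`, the small table of spacing diacritics, and the horizontal-overlap test are all assumptions made for illustration. Only `java.text.Normalizer` is a real JDK API.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DiacriticMergeSketch {

    /** Hypothetical stand-in for a positioned piece of text; NOT the PDFBox TextPosition class. */
    static final class Glyph {
        final String text;
        final float x;      // left edge
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        float right() {
            return x + width;
        }
    }

    // Assumed mapping from spacing diacritics to their combining equivalents.
    private static final Map<Character, Character> COMBINING = new HashMap<>();
    static {
        COMBINING.put('\u00A8', '\u0308'); // diaeresis
        COMBINING.put('\u00B4', '\u0301'); // acute
        COMBINING.put('\u0060', '\u0300'); // grave
        COMBINING.put('\u02C6', '\u0302'); // circumflex
        COMBINING.put('\u02DC', '\u0303'); // tilde
    }

    static boolean isDiacritic(Glyph g) {
        return g.text.length() == 1 && COMBINING.containsKey(g.text.charAt(0));
    }

    /** True when the horizontal extents of the two glyphs intersect. */
    static boolean overlaps(Glyph a, Glyph b) {
        return a.x < b.right() && b.x < a.right();
    }

    /** Composes base + mark (NFC) and keeps the base glyph's position and width. */
    static Glyph merge(Glyph base, Glyph mark) {
        String decomposed = base.text + COMBINING.get(mark.text.charAt(0));
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        return new Glyph(composed, base.x, base.width);
    }

    /** Merges each diacritic into an overlapping neighbour, before or after it. */
    static String mergeDiacritics(List<Glyph> glyphs) {
        List<Glyph> out = new ArrayList<>();
        Glyph pendingMark = null; // diacritic still waiting for a following base
        for (Glyph g : glyphs) {
            if (isDiacritic(g)) {
                Glyph prev = out.isEmpty() ? null : out.get(out.size() - 1);
                if (pendingMark == null && prev != null && !isDiacritic(prev) && overlaps(prev, g)) {
                    // Diacritic follows its base: the case already handled per the report.
                    out.set(out.size() - 1, merge(prev, g));
                } else {
                    if (pendingMark != null) {
                        out.add(pendingMark); // earlier mark found no base; emit unchanged
                    }
                    pendingMark = g;
                }
            } else {
                if (pendingMark != null) {
                    if (overlaps(g, pendingMark)) {
                        // Diacritic precedes its base: the case described in the report.
                        g = merge(g, pendingMark);
                    } else {
                        out.add(pendingMark);
                    }
                    pendingMark = null;
                }
                out.add(g);
            }
        }
        if (pendingMark != null) {
            out.add(pendingMark);
        }
        StringBuilder sb = new StringBuilder();
        for (Glyph g : out) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: the diaeresis is emitted first but sits over the A.
        List<Glyph> word = Arrays.asList(
                new Glyph("\u00A8", 0f, 5f), new Glyph("A", 0f, 7f),
                new Glyph("n", 7f, 5f), new Glyph("d", 12f, 5f),
                new Glyph("e", 17f, 5f), new Glyph("r", 22f, 4f),
                new Glyph("u", 26f, 5f), new Glyph("n", 31f, 5f),
                new Glyph("g", 36f, 5f));
        System.out.println(mergeDiacritics(word));
    }
}
```

For the sample word above, the leading diaeresis and the A compose into a single Ä that occupies the A's width, so the result is Änderung. A diacritic that overlaps neither neighbour is passed through unchanged rather than guessed at.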