Re: [jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction

Jeremias Maerki Sat, 28 Feb 2009 09:26:25 -0800

Brian,

you state here that you've applied a patch by one Ken Glidden. I cannot
find any post or submission from a person with that name on the PDFBox
mailing lists. So I'm concerned about the legal trail here. Can you
explain that, please? Thank you.


On 18.02.2009 22:36:01 Brian Carrier (JIRA) wrote:
> 
>      [ https://issues.apache.org/jira/browse/PDFBOX-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Brian Carrier resolved PDFBOX-430.
> ----------------------------------
> 
>     Resolution: Fixed
> 
> Fixed with patch by Ken Glidden that merges a single diacritic text chunk 
> into the previous text chunk if they overlap.  Note that this will not solve 
> problems where the diacritic comes much later than the text chunk it overlays, 
> but we have not observed PDF files like that.
> 
> Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
> Sending        trunk/src/main/java/org/apache/pdfbox/util/TextPosition.java
> Sending        trunk/test/input/Acrobat9.pdf-sorted.txt
> Sending        trunk/test/input/Acrobat9.pdf.txt
> Transmitting file data ....
> Committed revision 745665.
> 
> 
> 
> > Incorrect diacritic placement in text extraction
> > ------------------------------------------------
> >
> >                 Key: PDFBOX-430
> >                 URL: https://issues.apache.org/jira/browse/PDFBOX-430
> >             Project: PDFBox
> >          Issue Type: Bug
> >            Reporter: Brian Carrier
> >
> > Some PDF files store diacritics (accents over characters) as separate text 
> > elements. The PDF files essentially have a chunk of text and then back up 
> > and place the diacritic over one of the characters in the chunk of text. 
> > With text extraction, the current design does not allow the diacritic to be 
> > placed over a character in the chunk and instead it is placed after the 
> > chunk. 
> > The debug-diac2.pdf file in PDFBOX-429 shows this problem. 
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.




Jeremias Maerki
