[ https://issues.apache.org/jira/browse/PDFBOX-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-662:
--------------------------------------
Fix Version/s: 1.4.0
> PDFTextStripper character suppression
> -------------------------------------
>
> Key: PDFBOX-662
> URL: https://issues.apache.org/jira/browse/PDFBOX-662
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.0.0
> Environment: any
> Reporter: Mel Martinez
> Fix For: 1.4.0
>
>
> When parsing the file posted as an example for PDFBOX-659, I noticed that
> numerous characters were missing from the extracted text.
> They are being suppressed in the
> PDFTextStripper.processTextPosition(TextPosition) method, in a section that is
> meant to filter out duplicate characters found in some MS Word-generated
> documents.
> The problem is that the filter is overzealous (in the case of this document)
> and matches real characters against other real characters in the text.
> Example:
> This is some text that has the letter 'e' in it multiple times.
> The filter might incorrectly match one of the later 'e's to an earlier 'e'
> (for example, the one at the end of 'some'), resulting in the extracted text:
> This is some text that has the letter 'e' in it multiple tims.
> From what I can tell, this is because it is using the raw, padded coordinates
> rather than resolved coordinates.
> The example PDF document (see PDFBOX-659) has pages that use both positive
> and negative raw coordinates that, on my cursory inspection, don't always
> resolve to the same offset point.
> The suppression test logic compares TextPosition elements that seem to have
> different offsets, possibly due to different amounts of padding. Thus the
> 'overlap' that it detects is wrong. It's not comparing apples to apples.
> The document renders perfectly in Acrobat, so I believe we are not handling
> the coordinates correctly.
> A workaround is to disable the filtering by calling
> PDFTextStripper.setSuppressDuplicateOverlappingText(boolean)
> with false, but that only hides the fact that the logic is wrong.
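
The failure mode described above can be illustrated with a small, self-contained sketch. This is not PDFBox's code: the Glyph class, the fixed advance, the one-third-width tolerance, and the offset values are all assumptions made for illustration. It only shows how a position-based duplicate filter can drop a real character when two text runs are compared in coordinate frames whose offsets have not been reconciled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from PDFBox.
class DuplicateFilterSketch {

    // Minimal stand-in for a positioned character (not PDFBox's TextPosition).
    static final class Glyph {
        final char ch;
        final float x;
        final float y;
        final float width;

        Glyph(char ch, float x, float y, float width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Assumed fixed glyph width and advance, chosen to keep the arithmetic exact.
    static final float ADVANCE = 6.0f;

    // Places each character of text at startX + i * ADVANCE on the line y = 0.
    static List<Glyph> layout(String text, float startX) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), startX + i * ADVANCE, 0.0f, ADVANCE));
        }
        return glyphs;
    }

    // Drops a glyph if the same character was already kept at nearly the same position.
    static String suppressDuplicates(List<Glyph> glyphs) {
        Map<Character, List<Glyph>> kept = new HashMap<>();
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            List<Glyph> sameChar = kept.computeIfAbsent(g.ch, k -> new ArrayList<>());
            float tolerance = g.width / 3.0f; // assumed tolerance
            boolean duplicate = false;
            for (Glyph s : sameChar) {
                if (Math.abs(s.x - g.x) < tolerance && Math.abs(s.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                sameChar.add(g);
                out.append(g.ch);
            }
        }
        return out.toString();
    }

    // Filters two runs; the second run's raw x positions are shifted by secondRunOffset.
    static String extract(float secondRunOffset) {
        List<Glyph> glyphs = new ArrayList<>(layout("the ", 0.0f));
        // The second run starts at a negative raw x, in a frame with its own offset.
        glyphs.addAll(layout("times", -ADVANCE + secondRunOffset));
        return suppressDuplicates(glyphs);
    }

    public static void main(String[] args) {
        System.out.println(extract(0.0f));        // raw coordinates, offset ignored
        System.out.println(extract(5 * ADVANCE)); // resolved coordinates, offset applied
    }
}
```

With the offset ignored, the 'e' of "times" lands on the same raw position as the 'e' of "the" and is dropped, so the program prints "the tims". With the offset applied before comparison, nothing overlaps and it prints "the times". A glyph that really is drawn twice at nearly the same place would still be suppressed in either case.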