[ https://issues.apache.org/jira/browse/PDFBOX-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15026520#comment-15026520 ]
Tilman Hausherr commented on PDFBOX-3130:
-----------------------------------------

The box is the blue rectangle. The red rectangle is an artificial guide used for text extraction, to decide what's on the same line. It is usually about the height (from the baseline) of a non-capital glyph (e.g. "a"). The descent is the "Descent" element.

> Recent regression in PDFTextStripper, text getting garbled
> ----------------------------------------------------------
>
>                 Key: PDFBOX-3130
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3130
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Fred Andrews
>            Assignee: Tilman Hausherr
>             Fix For: 2.0.0
>
>         Attachments: garbled text 2-marked-1-capheight.png, garbled text
> 2-marked-1.png, garbled text 2.pdf, garbled text.pdf
>
>
> Text extraction using PrintTextLocations is getting garbled characters in the
> attached snippet.
> For this file it is getting one string of "2O(Er4env vqeheurosriAurseirueeass
> ss/Ct:7:rh adaliaargynse csr eadc+cit6e l1ipc te+2en 6d9c1)9e 91 2933"
> This test case is about as small as I could make it and still show the
> problem; when I reduced the file to just one line of text, then the text came
> through correctly.
> This problem shows up in RC2 and the latest development build. I believe it
> was OK in the development build from Nov 4.
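
The idea of a vertical guide used to decide which glyphs share a line can be illustrated with a small stand-alone sketch. This is not PDFBox code: the class LineGuideSketch, the Glyph type and the sameLine method are hypothetical names made up for illustration, and the overlap rule shown is only an assumption about how such a guide could be applied. Each glyph is reduced to its baseline and a guide height (for example, roughly the height of a lowercase "a"), with y growing upward as in PDF user space.

```java
public class LineGuideSketch {

    /** Hypothetical glyph record: baseline position and guide height above it. */
    public static final class Glyph {
        final double baselineY;   // y of the baseline (y grows upward)
        final double guideHeight; // height of the guide rectangle above the baseline

        public Glyph(double baselineY, double guideHeight) {
            this.baselineY = baselineY;
            this.guideHeight = guideHeight;
        }

        /** Top edge of the guide rectangle. */
        double top() {
            return baselineY + guideHeight;
        }
    }

    /**
     * Assumed rule: two glyphs are treated as being on the same line
     * when their guide rectangles overlap vertically.
     */
    public static boolean sameLine(Glyph a, Glyph b) {
        return a.baselineY <= b.top() && b.baselineY <= a.top();
    }

    public static void main(String[] args) {
        Glyph a = new Glyph(100.0, 5.0); // guide spans 100..105
        Glyph b = new Glyph(102.0, 5.0); // guide spans 102..107, overlaps a
        Glyph c = new Glyph(112.0, 5.0); // guide spans 112..117, clear of a

        System.out.println(sameLine(a, b)); // true
        System.out.println(sameLine(a, c)); // false
    }
}
```

In this sketch the guide height directly controls grouping: the taller the guide, the more readily glyphs from closely spaced lines are judged to overlap.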