[
https://issues.apache.org/jira/browse/PDFBOX-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604558#comment-14604558
]
John Hewson edited comment on PDFBOX-2843 at 6/28/15 6:49 AM:
--------------------------------------------------------------
This fixes the issue with PrintTextLocations not working. Note that overriding
processTextPosition was never correct and 1.8 shouldn't have been doing it this
way. PrintTextLocations is really misusing these APIs.
If you're after individual characters in 2.0 then subclass PDFStreamEngine
instead and override the showGlyph method. This will get you much better
results. Alternatively, if you want PDFBox to have analysed the text into lines
first, then subclass PDFTextStripper, override writePage() and call
getCharactersByArticle(), then iterate over those.
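As a rough illustration of the second approach, here is a minimal sketch that
assumes the 2.0 package layout (org.apache.pdfbox.text). The class name
GlyphWidthStripper and its getReport() accessor are made up for this example
and are not part of PDFBox. It records the same values the reporter was
printing, but reads them inside writePage() rather than processTextPosition():

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical example class, not part of PDFBox.
public class GlyphWidthStripper extends PDFTextStripper
{
    // One entry per TextPosition seen so far (illustrative field).
    private final List<String> report = new ArrayList<>();

    public GlyphWidthStripper() throws IOException
    {
        super();
    }

    @Override
    protected void writePage() throws IOException
    {
        // Deliberately does not call super.writePage(): this sketch only
        // collects measurements and does not produce the usual text output.
        for (List<TextPosition> article : getCharactersByArticle())
        {
            for (TextPosition text : article)
            {
                report.add(text.getUnicode()
                        + " - Width of Space=" + text.getWidthOfSpace()
                        + " - width=" + text.getWidth());
            }
        }
    }

    // Illustrative accessor for the collected lines.
    public List<String> getReport()
    {
        return report;
    }
}
```

To try it, create an instance, call getText(document) on a loaded PDDocument
and then read getReport(). Because writePage() is overridden without calling
the superclass, the string returned by getText() should not be relied on here.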
If you're just after a string which contains the page's text, then use
ExtractText instead.
> widthOfSpace() appears wrong in TextPosition
> --------------------------------------------
>
> Key: PDFBOX-2843
> URL: https://issues.apache.org/jira/browse/PDFBOX-2843
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Environment: JDK 8 on Windows 7
> Reporter: Richard Wolfgram
> Attachments: Hello World.pdf, StripperTest18.java, StripperTest20.java
>
>
> When using the following override of a PDFTextStripper method I am getting a
> large difference in values for TextPosition.getWidthOfSpace() between version
> 1.8.6 and pdfbox-2.0.0-20150611.100833-1423:
> {code}
> @Override
> protected void processTextPosition(TextPosition textPos)
> {
>     float spaceWidth = textPos.getWidthOfSpace();
>     float width = textPos.getWidth();
>     System.out.println(textPos.getCharacter() + " - Width of Space=" +
>             spaceWidth + " - width=" + width);
>     builder.append(textPos.getCharacter());
> }
> {code}
> In 1.8.6 the average character width is around 5 and the space width is around 2.5.
> In 2.0 the average character width is around 5 and the space width is around 27.