[ 
https://issues.apache.org/jira/browse/PDFBOX-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604558#comment-14604558
 ] 

John Hewson edited comment on PDFBOX-2843 at 6/28/15 6:48 AM:
--------------------------------------------------------------

This fixes the issue with PrintTextLocations not working. Note that overriding 
processTextPosition was never correct and 1.8 shouldn't have been doing it this 
way. PrintTextLocations is really misusing these APIs.

If you're after individual characters in 2.0 then subclass PDFStreamEngine 
instead and override the showGlyph method. This will get you much better 
results. Alternatively, if you want PDFBox to have analysed the text into lines 
first, then subclass PDFTextStripper, override writePage() and call 
getCharactersByArticle(), then iterate over those.
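
A rough sketch of the second suggestion, assuming the PDFBox 2.0 API (org.apache.pdfbox.text.PDFTextStripper, TextPosition.getUnicode(), PDDocument.load(File)). The class name ArticleCharacterLister and its getReport() accessor are made up for illustration and are not part of PDFBox or of PrintTextLocations. It is not the committed fix, only one way the override could look:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical example class: collects one report line per TextPosition.
class ArticleCharacterLister extends PDFTextStripper
{
    private final List<String> report = new ArrayList<>();

    ArticleCharacterLister() throws IOException
    {
        super();
    }

    // Called once per page, after the stripper has gathered the page's text.
    // The default text output is deliberately skipped (no super.writePage()).
    @Override
    protected void writePage() throws IOException
    {
        for (List<TextPosition> article : getCharactersByArticle())
        {
            for (TextPosition text : article)
            {
                report.add(text.getUnicode()
                        + " - Width of Space=" + text.getWidthOfSpace()
                        + " - width=" + text.getWidth());
            }
        }
    }

    List<String> getReport()
    {
        return report;
    }

    public static void main(String[] args) throws IOException
    {
        // args[0] is the path to a PDF, e.g. the attached "Hello World.pdf"
        try (PDDocument document = PDDocument.load(new File(args[0])))
        {
            ArticleCharacterLister lister = new ArticleCharacterLister();
            lister.getText(document); // drives page processing, which calls writePage()
            for (String line : lister.getReport())
            {
                System.out.println(line);
            }
        }
    }
}
```

Because writePage() is overridden without calling super, the String returned by getText() is not useful here. The per-character data ends up in the report list instead.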



> widthOfSpace() appears wrong in TextPosition
> --------------------------------------------
>
>                 Key: PDFBOX-2843
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2843
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>         Environment: JDK 8 on Windows 7
>            Reporter: Richard Wolfgram
>         Attachments: Hello World.pdf, StripperTest18.java, StripperTest20.java
>
>
> When using the following override method of PDFTextStripper I am getting a 
> large difference in the values of TextPosition.getWidthOfSpace() between 
> version 1.8.6 and pdfbox-2.0.0-20150611.100833-1423:
> {code}
> @Override
> protected void processTextPosition(TextPosition textPos)
> {
>     float spaceWidth = textPos.getWidthOfSpace();
>     float width = textPos.getWidth();
>     System.out.println(textPos.getCharacter() + " - Width of Space="
>             + spaceWidth + " - width=" + width);
>     builder.append(textPos.getCharacter());
> }
> {code}
> In 1.8.6 the average character width is around 5 and the space width is around 2.5.
> In 2.0 the average character width is around 5 and the space width is around 27.


