[ 
https://issues.apache.org/jira/browse/PDFBOX-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605081#comment-14605081
 ] 

Richard Wolfgram commented on PDFBOX-2843:
------------------------------------------

Never mind, I figured it out. I took a look at PDFTextStreamEngine and added 
the following operators in the constructor of my own engine:

        addOperator(new BeginText());
        addOperator(new Concatenate());
        addOperator(new DrawObject()); // special text version
        addOperator(new EndText());
        addOperator(new SetGraphicsStateParameters());
        addOperator(new Save());
        addOperator(new Restore());
        addOperator(new NextLine());
        addOperator(new SetCharSpacing());
        addOperator(new MoveText());
        addOperator(new MoveTextSetLeading());
        addOperator(new SetFontAndSize());
        addOperator(new ShowText());
        addOperator(new ShowTextAdjusted());
        addOperator(new SetTextLeading());
        addOperator(new SetMatrix());
        addOperator(new SetTextRenderingMode());
        addOperator(new SetTextRise());
        addOperator(new SetWordSpacing());
        addOperator(new SetTextHorizontalScaling());
        addOperator(new ShowTextLine());
        addOperator(new ShowTextLineAndSpace());

From here I should be able to work out exactly which operators I need for my 
use case.  I am processing the text of each page: converting rotated text, 
experimenting with space widths, and then comparing the results against 
dictionaries for multiple languages to improve text stripping.
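
The dictionary comparison described above could be sketched roughly as 
follows. This is not PDFBox code and not necessarily how the commenter 
implemented it: Glyph, layout, joinGlyphs, dictionaryScore and bestSpaceWidth 
are hypothetical names, Glyph merely stands in for the x position and width 
one would read from a TextPosition, and the 0.5 gap factor is an assumed 
tolerance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SpaceWidthSketch {

    /** Hypothetical stand-in for the values read from a TextPosition. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Test helper: lays out a string with fixed glyph width; ' ' becomes a bare gap. */
    static List<Glyph> layout(String text, float glyphWidth, float spaceGap) {
        List<Glyph> glyphs = new ArrayList<Glyph>();
        float x = 0f;
        for (char c : text.toCharArray()) {
            if (c == ' ') {
                x += spaceGap;          // no glyph emitted, only horizontal movement
            } else {
                glyphs.add(new Glyph(String.valueOf(c), x, glyphWidth));
                x += glyphWidth;
            }
        }
        return glyphs;
    }

    /** Inserts a space wherever the gap between glyphs exceeds spaceWidth * factor. */
    static String joinGlyphs(List<Glyph> glyphs, float spaceWidth, float factor) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null && g.x - (prev.x + prev.width) > spaceWidth * factor) {
                sb.append(' ');
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    /** Fraction of whitespace-separated words that appear in the dictionary. */
    static double dictionaryScore(String line, Set<String> dictionary) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] words = trimmed.split("\\s+");
        int hits = 0;
        for (String w : words) {
            if (dictionary.contains(w.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return (double) hits / words.length;
    }

    /** Returns the candidate space width whose segmentation scores best. */
    static float bestSpaceWidth(List<Glyph> glyphs, float[] candidates,
                                float factor, Set<String> dictionary) {
        float best = candidates[0];
        double bestScore = -1.0;
        for (float candidate : candidates) {
            double score = dictionaryScore(joinGlyphs(glyphs, candidate, factor), dictionary);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<String>(Arrays.asList("hi", "there"));
        // Glyph width 5 and a gap of 2.5, mirroring the magnitudes reported in the issue.
        List<Glyph> glyphs = layout("hi there", 5f, 2.5f);
        System.out.println(joinGlyphs(glyphs, 27f, 0.5f));
        System.out.println(joinGlyphs(glyphs, 2.5f, 0.5f));
        System.out.println(bestSpaceWidth(glyphs, new float[] {27f, 2.5f}, 0.5f, dictionary));
    }
}
```

With a space width near 27 the sample line stays fused into one token, while 
a value near 2.5 separates the two words, so the dictionary score prefers the 
smaller value. Real input would need one dictionary per language and some 
handling of punctuation, which this sketch leaves out.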

> widthOfSpace() appears wrong in TextPosition
> --------------------------------------------
>
>                 Key: PDFBOX-2843
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2843
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>         Environment: JDK 8 on Windows 7
>            Reporter: Richard Wolfgram
>         Attachments: Hello World.pdf, StripperTest18.java, StripperTest20.java
>
>
> When overriding processTextPosition in PDFTextStripper as shown below, I 
> get a large difference in the values of TextPosition.getWidthOfSpace() 
> between version 1.8.6 and pdfbox-2.0.0-20150611.100833-1423:
> {code}
> @Override
>  protected void processTextPosition(TextPosition textPos)
>  {
>     float spaceWidth = textPos.getWidthOfSpace();
>     float width = textPos.getWidth();
>     System.out.println(textPos.getCharacter() + " - Width of Space=" + 
> spaceWidth + " - width=" + width);
>     builder.append(textPos.getCharacter());
>  }
> {code}
> In 1.8.6, the average character width is around 5 and the space width is 
> around 2.5.
> In 2.0, the average character width is around 5 and the space width is 
> around 27.


