Tilman Hausherr created PDFBOX-3042:
---------------------------------------

             Summary: Bad space calculation in text extraction
                 Key: PDFBOX-3042
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3042
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.0
            Reporter: Tilman Hausherr
            Assignee: Tilman Hausherr
             Fix For: 2.0.0


Some debug output from the attached reduced file:

2.0:
{code}

spaceWidthText: 0.25
fontSizeText: 12.0
horizontalScalingText: 1.0
textRenderingMatrix.getScalingFactorX(): 12.0, textRenderingMatrix: [12.0,0.0,0.0,12.0,100.0,700.0]
ctm.getScalingFactorX(): 1.0
spaceWidthDisplay: 36.0

String[100.0,91.0 fs=12.0 xscale=12.0 height=7.8808603 space=36.0 width=8.003998]B
{code}


1.8:
{code}
spaceWidthText: 0.25
fontSizeText: 12.0
horizontalScalingText: 1.0
textMatrix.getXScale(): 1.0, textMatrix: [[1.0,0.0,0.0][0.0,1.0,0.0][100.0,700.0,1.0]]
ctm.getXScale(): 1.0
spaceWidthDisp: 3.0

String[100.0,91.0 fs=12.0 xscale=12.0 height=7.884 space=3.0 width=8.003998]B
{code}

The stream content is:
{code}
1 0 0 1 0 0 cm
n
BT
/F12 12 Tf
1 0 0 1 100 700 Tm
(B) Tj
ET
{code}

The cause is somewhat similar to PDFBOX-3019: a factor is applied twice. In 2.0, 
fontSize is already folded into the "parameters" Matrix object, which is used to 
calculate "textRenderingMatrix", so multiplying by fontSizeText again counts it 
twice. In 1.8, textStateParameters is set similarly, but it is not used in the 
calculation of spaceWidthDisp.
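
To illustrate the double factor, here is a minimal sketch that reproduces the 
numbers from the debug output. It is not the PDFBox code: the class and method 
names are hypothetical, and the composition (parameters x Tm x CTM) is assumed 
from the PDF specification's definition of the text rendering matrix.

```java
import java.util.Arrays;

public class TextRenderingMatrixSketch {

    // Affine matrix stored as {a, b, c, d, e, f}, meaning [a b 0; c d 0; e f 1].
    // Row-vector convention, so multiply(m, n) applies m first, then n.
    static double[] multiply(double[] m, double[] n) {
        return new double[] {
            m[0] * n[0] + m[1] * n[2],
            m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2],
            m[2] * n[1] + m[3] * n[3],
            m[4] * n[0] + m[5] * n[2] + n[4],
            m[4] * n[1] + m[5] * n[3] + n[5]
        };
    }

    // Hypothetical helper: the "parameters" matrix already carries fontSize
    // (text rise is assumed to be 0 here).
    static double[] textRenderingMatrix(double fontSize, double hScaling,
                                        double[] tm, double[] ctm) {
        double[] parameters = {fontSize * hScaling, 0, 0, fontSize, 0, 0};
        return multiply(multiply(parameters, tm), ctm);
    }

    // Length of the first row, used here as the horizontal scaling factor.
    static double scalingFactorX(double[] m) {
        return Math.hypot(m[0], m[1]);
    }

    public static void main(String[] args) {
        double spaceWidthText = 0.25;
        double fontSizeText = 12.0;
        double horizontalScalingText = 1.0;
        double[] tm = {1, 0, 0, 1, 100, 700};   // 1 0 0 1 100 700 Tm
        double[] ctm = {1, 0, 0, 1, 0, 0};      // 1 0 0 1 0 0 cm

        double[] trm = textRenderingMatrix(fontSizeText, horizontalScalingText, tm, ctm);
        System.out.println(Arrays.toString(trm)); // [12.0, 0.0, 0.0, 12.0, 100.0, 700.0]

        // fontSize enters twice: explicitly, and again through trm.
        double doubled = spaceWidthText * fontSizeText * horizontalScalingText
                * scalingFactorX(trm) * scalingFactorX(ctm);
        // fontSize enters once, through trm only.
        double once = spaceWidthText * scalingFactorX(trm) * scalingFactorX(ctm);

        System.out.println(doubled); // 36.0, as in the 2.0 output
        System.out.println(once);    // 3.0, as in the 1.8 output
    }
}
```

Dropping the explicit fontSizeText factor is only one way to count the font 
size once; the sketch does not claim to be the fix that was committed.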

The problem was discovered because the text extraction results differed.

The problem did not appear in PDFBOX-3019 because fontSizeText was 1 there.
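
A small sketch of why a font size of 1 hides the error, assuming identity text 
and transformation matrices so that the text rendering matrix scales by 
fontSize alone. The class and method names are hypothetical, not PDFBox API.

```java
public class FontSizeOneSketch {

    // fontSize applied explicitly and again via the text rendering matrix.
    static double countedTwice(double spaceWidthText, double fontSize) {
        double trmScaleX = fontSize; // assumption: Tm and CTM are identity
        return spaceWidthText * fontSize * trmScaleX;
    }

    // fontSize applied a single time.
    static double countedOnce(double spaceWidthText, double fontSize) {
        return spaceWidthText * fontSize;
    }

    public static void main(String[] args) {
        double spaceWidthText = 0.25;
        for (double fontSize : new double[] {1.0, 12.0}) {
            System.out.println("fontSize=" + fontSize
                    + " twice=" + countedTwice(spaceWidthText, fontSize)
                    + " once=" + countedOnce(spaceWidthText, fontSize));
        }
        // fontSize=1.0 twice=0.25 once=0.25
        // fontSize=12.0 twice=36.0 once=3.0
    }
}
```

The two results differ by a factor of fontSize, so they agree only when the 
font size is 1.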

The fix also solves the problem I mentioned at the end of PDFBOX-3038.


