Tilman Hausherr created PDFBOX-3042:
---------------------------------------
Summary: Bad space calculation in text extraction
Key: PDFBOX-3042
URL: https://issues.apache.org/jira/browse/PDFBOX-3042
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Fix For: 2.0.0
Some debug output from the attached reduced file:
2.0:
{code}
spaceWidthText: 0.25
fontSizeText: 12.0
horizontalScalingText: 1.0
textRenderingMatrix.getScalingFactorX(): 12.0, textRenderingMatrix: [12.0,0.0,0.0,12.0,100.0,700.0]
ctm.getScalingFactorX(): 1.0
spaceWidthDisplay: 36.0
String[100.0,91.0 fs=12.0 xscale=12.0 height=7.8808603 space=36.0 width=8.003998]B
{code}
1.8:
{code}
spaceWidthText: 0.25
fontSizeText: 12.0
horizontalScalingText: 1.0
textMatrix.getXScale(): 1.0, textMatrix: [[1.0,0.0,0.0][0.0,1.0,0.0][100.0,700.0,1.0]]
ctm.getXScale(): 1.0
spaceWidthDisp: 3.0
String[100.0,91.0 fs=12.0 xscale=12.0 height=7.884 space=3.0 width=8.003998]B
{code}
The stream content is:
{code}
1 0 0 1 0 0 cm
n
BT
/F12 12 Tf
1 0 0 1 100 700 Tm
(B) Tj
ET
{code}
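To make the 2.0 matrix in the debug output easier to follow, here is a minimal sketch of how a text rendering matrix is composed for this stream, following the PDF text-space formula (parameters matrix x text matrix x CTM). The class and method names are hypothetical and this is not PDFBox code; java.awt.geom.AffineTransform merely stands in for the PDFBox Matrix class.

```java
import java.awt.geom.AffineTransform;

// Illustrative sketch only: class and method names are hypothetical, not PDFBox API.
final class TextRenderingMatrixSketch {

    // Builds params x Tm x CTM (PDF row-vector order), where params holds
    // fontSize * horizontalScaling, fontSize and the text rise.
    static AffineTransform textRenderingMatrix(double fontSize, double horizontalScaling,
                                               double rise, AffineTransform tm,
                                               AffineTransform ctm) {
        AffineTransform params = new AffineTransform(
                fontSize * horizontalScaling, 0, 0, fontSize, 0, rise);
        // AffineTransform uses column vectors, so the transform that is
        // applied first to a point must be concatenated last.
        AffineTransform trm = new AffineTransform(ctm);
        trm.concatenate(tm);
        trm.concatenate(params);
        return trm;
    }

    public static void main(String[] args) {
        AffineTransform ctm = new AffineTransform(1, 0, 0, 1, 0, 0);    // 1 0 0 1 0 0 cm
        AffineTransform tm = new AffineTransform(1, 0, 0, 1, 100, 700); // 1 0 0 1 100 700 Tm
        AffineTransform trm = textRenderingMatrix(12, 1, 0, tm, ctm);   // /F12 12 Tf
        // Prints the x scale, y scale and translation of the composed matrix.
        System.out.println(trm.getScaleX() + " " + trm.getScaleY() + " "
                + trm.getTranslateX() + " " + trm.getTranslateY());
    }
}
```

With the values from the stream, the x scale of the composed matrix is 12.0, which matches the textRenderingMatrix [12.0,0.0,0.0,12.0,100.0,700.0] in the 2.0 output: the font size is already part of its scaling factor.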
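The cause is somewhat similar to PDFBOX-3019: a factor is applied twice. In 2.0,
the fontSize is already calculated into the "parameters" Matrix object, which
is used to calculate "textRenderingMatrix". As a result,
textRenderingMatrix.getScalingFactorX() already contains fontSizeText (12.0
here), and the debug output is consistent with fontSizeText being multiplied in
a second time: 0.25 * 12 * 12 = 36 instead of 0.25 * 12 = 3. In 1.8,
textStateParameters is set similarly, but it is not used in the calculation of
spaceWidthDisp, which relies on textMatrix.getXScale() (1.0 here).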
The problem was discovered because text extraction results differed between
1.8 and 2.0.
The problem did not appear in PDFBOX-3019 because fontSizeText was 1 there.
The fix also solves the problem I mentioned at the end of PDFBOX-3038.
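
The arithmetic behind the two "space" values can be reproduced with a small sketch. It assumes that the display space width is the plain product of the factors printed in the debug output; that assumption matches both logged results, but the class and method names below are hypothetical and the code is not taken from PDFBox.

```java
// Illustrative sketch only: names are hypothetical, and the formula is an
// assumption inferred from the debug output, not copied from PDFBox.
final class SpaceWidthSketch {

    // Product of the factors shown in the debug output.
    static float spaceWidthDisplay(float spaceWidthText, float fontSizeText,
                                   float horizontalScalingText, float matrixScaleX,
                                   float ctmScaleX) {
        return spaceWidthText * fontSizeText * horizontalScalingText
                * matrixScaleX * ctmScaleX;
    }

    public static void main(String[] args) {
        // 2.0: the matrix scale (12.0) already contains the font size,
        // so the font size enters the product twice.
        float v20 = spaceWidthDisplay(0.25f, 12f, 1f, 12f, 1f);
        // 1.8: the text matrix scale is 1.0, so the font size enters once.
        float v18 = spaceWidthDisplay(0.25f, 12f, 1f, 1f, 1f);
        System.out.println("2.0: " + v20 + ", 1.8: " + v18);
    }
}
```

The two results are 36.0 and 3.0, the same values as spaceWidthDisplay in 2.0 and spaceWidthDisp in 1.8. With fontSizeText equal to 1 the two products coincide, which is why the double factor stays hidden in that case.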