[ https://issues.apache.org/jira/browse/PDFBOX-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978790#comment-14978790 ]
ASF subversion and git services commented on PDFBOX-3042:
---------------------------------------------------------
Commit 1711070 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1711070 ]
PDFBOX-3042: don't multiply with horizontalScalingText, as this has already
been done before
> Bad space calculation in text extraction
> ----------------------------------------
>
> Key: PDFBOX-3042
> URL: https://issues.apache.org/jira/browse/PDFBOX-3042
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Labels: regression
> Fix For: 2.0.0
>
> Attachments: PDFBOX-3042-003177-p2-reduced.pdf,
> PDFBOX-3042-003177-p2.pdf
>
>
> Some debug output from attached reduced file:
> 2.0:
> {code}
> spaceWidthText: 0.25
> fontSizeText: 12.0
> horizontalScalingText: 1.0
> textRenderingMatrix.getScalingFactorX(): 12.0, textRenderingMatrix: [12.0,0.0,0.0,12.0,100.0,700.0]
> ctm.getScalingFactorX(): 1.0
> spaceWidthDisplay: 36.0
> String[100.0,91.0 fs=12.0 xscale=12.0 height=7.8808603 space=36.0 width=8.003998]B
> {code}
> 1.8:
> {code}
> spaceWidthText: 0.25
> fontSizeText: 12.0
> horizontalScalingText: 1.0
> textMatrix.getXScale(): 1.0, textMatrix: [[1.0,0.0,0.0][0.0,1.0,0.0][100.0,700.0,1.0]]
> ctm.getXScale(): 1.0
> spaceWidthDisp: 3.0
> String[100.0,91.0 fs=12.0 xscale=12.0 height=7.884 space=3.0 width=8.003998]B
> {code}
> stream content is
> {code}
> 1 0 0 1 0 0 cm
> n
> BT
> /F12 12 Tf
> 1 0 0 1 100 700 Tm
> (B) Tj
> ET
> {code}
> The cause is similar to PDFBOX-3019: a factor is applied twice. In 2.0, the
> font size is already folded into the "parameters" Matrix object, which is
> used to calculate "textRenderingMatrix", so multiplying by fontSizeText again
> gives a space width of 36.0 instead of 3.0. In 1.8, textStateParameters is
> set similarly, but it is not used in the calculation of spaceWidthDisp.
> The problem was discovered because text extraction produced different
> results in 1.8 and 2.0.
> The problem did not show up in PDFBOX-3019 because fontSizeText was 1 there.
> The fix also solves the problem I mentioned at the end of PDFBOX-3038.
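
The arithmetic behind the two debug outputs can be reproduced with a short
Java sketch. It assumes that the space width is the plain product of the five
logged factors; the class and method names are hypothetical and are not part
of the PDFBox API.

```java
// Hypothetical sketch, not PDFBox code: reproduces the numbers in the debug output.
final class SpaceWidthSketch {
    // Values copied from the debug output of the reduced file.
    static final float SPACE_WIDTH_TEXT = 0.25f;
    static final float FONT_SIZE_TEXT = 12.0f;
    static final float HORIZONTAL_SCALING_TEXT = 1.0f;
    static final float CTM_SCALING_X = 1.0f;

    // Assumption: the display space width is the product of all logged factors.
    static float spaceWidth(float matrixScalingX) {
        return SPACE_WIDTH_TEXT * FONT_SIZE_TEXT * HORIZONTAL_SCALING_TEXT
                * matrixScalingX * CTM_SCALING_X;
    }

    // Same product without the explicit font size factor, for the case where
    // the matrix scaling already contains the font size.
    static float withoutExplicitFontSize(float matrixScalingX) {
        return SPACE_WIDTH_TEXT * HORIZONTAL_SCALING_TEXT
                * matrixScalingX * CTM_SCALING_X;
    }

    public static void main(String[] args) {
        // 1.8: textMatrix.getXScale() is 1.0, the font size is applied once.
        System.out.println(spaceWidth(1.0f));               // 3.0
        // 2.0: textRenderingMatrix.getScalingFactorX() is 12.0, so the
        // font size ends up in the product twice.
        System.out.println(spaceWidth(12.0f));              // 36.0
        // Dropping the duplicate factor brings the 2.0 value back to 3.0.
        System.out.println(withoutExplicitFontSize(12.0f)); // 3.0
    }
}
```

The last call only illustrates the effect of removing a duplicated factor; it
is not meant to mirror the exact change made in commit 1711070.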