[ 
https://issues.apache.org/jira/browse/PDFBOX-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franken updated PDFBOX-5420:
----------------------------
    Description: 
*Given*

The PDF uses the cm operator to scale the current transformation matrix by a
factor of 0.02834933. The font size is then set to 282 using the Tf operator.

!image-2022-04-23-14-46-34-929.png|width=389,height=84!

 

*Error Description*

When PDFTextStripper is used to extract the text from that PDF, the
TextPosition objects it produces carry the wrong font size of 282pt. The
correct font size would be 10pt. The miscalculation occurs because
PDFTextStripper does not scale the text size by the current transformation
matrix.

 

 *Proposed fix*

In LegacyPDFStreamEngine.java, the bug can be fixed in the showGlyph method.
There, fontSizeInPt should be calculated using the following code:
{code:java}
processTextPosition(new TextPosition(pageRotation, pageSize.getWidth(),
        pageSize.getHeight(), translatedTextRenderingMatrix, nextX, nextY,
        Math.abs(dyDisplay), dxDisplay,
        Math.abs(spaceWidthDisplay), unicodeMapping, new int[] { code }, font,
        fontSize,
        // scale the Tf font size by both the text matrix and the CTM
        (int)(fontSize * textMatrix.getScalingFactorX() *
                getGraphicsState().getCurrentTransformationMatrix().getScalingFactorX())));{code}
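
The arithmetic behind the proposed expression can be checked without PDFBox. The following is a minimal sketch, not PDFBox code: the class EffectiveFontSize and its method are hypothetical names introduced for illustration, and the text-matrix scale of 1.0 is an assumption, since the report does not state the text matrix of the sample file.

```java
final class EffectiveFontSize {
    private EffectiveFontSize() {
    }

    /**
     * Multiplies the Tf font size by the horizontal scale of the text matrix
     * and of the current transformation matrix (CTM), mirroring the product
     * used in the proposed fix.
     */
    public static float effectiveFontSize(float fontSize, float textMatrixScaleX,
            float ctmScaleX) {
        return fontSize * textMatrixScaleX * ctmScaleX;
    }

    public static void main(String[] args) {
        float tfFontSize = 282f;          // from the report (Tf operator)
        float ctmScaleX = 0.02834933f;    // from the report (cm operator)
        float textMatrixScaleX = 1.0f;    // ASSUMPTION: not stated in the report

        float effective = effectiveFontSize(tfFontSize, textMatrixScaleX, ctmScaleX);

        System.out.println("effective size: " + effective);
        // The proposed fix casts to int, which truncates toward zero.
        System.out.println("truncated:      " + (int) effective);
        // Rounding to the nearest integer is shown only for comparison.
        System.out.println("rounded:        " + Math.round(effective));
    }
}
```

Because the text-matrix scale here is only a placeholder, the result is not expected to match the 10pt quoted above. With these inputs the product is about 7.99, so the (int) cast yields 7 while Math.round yields 8. If a nearest-integer size is wanted, rounding may be worth considering in place of the cast.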
*Further remarks*

To make the error easy to triage, I attached a unit test and a sample file. The
sample was manually edited to remove all unnecessary data and repaired with
qpdf. However, I trimmed only the content stream; other objects in the PDF are
still present, so the file is fairly large. As I mainly program in Kotlin, I
attached the original Kotlin version of the test that I used to debug the
issue. A Java version is also attached.

  was:
*Given*

The PDF uses the cm operator to scale the current transformation matrix by a
factor of 0.03. The font size is then set to 282 using the Tf operator.

!image-2022-04-23-14-46-34-929.png|width=389,height=84!

 

*Error Description*

When PDFTextStripper is used to extract the text from that PDF, the
TextPosition objects it produces carry the wrong font size of 282pt. The
correct font size would be 10pt. The miscalculation occurs because
PDFTextStripper does not scale the text size by the current transformation
matrix.

 

 *Proposed fix*

In LegacyPDFStreamEngine.java, the bug can be fixed in the showGlyph method.
There, fontSizeInPt should be calculated using the following code:
{code:java}
processTextPosition(new TextPosition(pageRotation, pageSize.getWidth(),
        pageSize.getHeight(), translatedTextRenderingMatrix, nextX, nextY,
        Math.abs(dyDisplay), dxDisplay,
        Math.abs(spaceWidthDisplay), unicodeMapping, new int[] { code }, font,
        fontSize,
        // scale the Tf font size by both the text matrix and the CTM
        (int)(fontSize * textMatrix.getScalingFactorX() *
                getGraphicsState().getCurrentTransformationMatrix().getScalingFactorX())));{code}
*Further remarks*

To make the error easy to triage, I attached a unit test and a sample file. The
sample was manually edited to remove all unnecessary data and repaired with
qpdf. However, I trimmed only the content stream; other objects in the PDF are
still present, so the file is fairly large. As I mainly program in Kotlin, I
attached the original Kotlin version of the test that I used to debug the
issue. A Java version is also attached.


> PDFTextStripper does not use cm to infer correct font size
> ----------------------------------------------------------
>
>                 Key: PDFBOX-5420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5420
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Franken
>            Priority: Minor
>         Attachments: TextStripperTest.kt, 
> TextStripperUsesTransformationMatrix.java, ec_2.fixed.pdf, 
> image-2022-04-23-14-46-34-929.png


