[jira] [Commented] (PDFBOX-3464) character height 3 times higher than expected

Tilman Hausherr (JIRA) Wed, 17 Aug 2016 10:17:00 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15424939#comment-15424939
 ]


Tilman Hausherr commented on PDFBOX-3464:
-----------------------------------------

In LegacyPDFStreamEngine.java (formerly PDFTextStreamEngine.java), find this 
block of code:
{code}
        // sometimes the bbox has very high values, but CapHeight is OK
        PDFontDescriptor fontDescriptor = font.getFontDescriptor();
        if (fontDescriptor != null)
        {
            float capHeight = fontDescriptor.getCapHeight();
            if (capHeight != 0 && (capHeight < glyphHeight || glyphHeight == 0))
            {
                glyphHeight = capHeight;
            }
        }

{code}
Change it to this:
{code}
        // sometimes the bbox has very high values, but CapHeight is OK
        PDFontDescriptor fontDescriptor = font.getFontDescriptor();
        if (fontDescriptor != null)
        {
            float capHeight = fontDescriptor.getCapHeight();
            if (capHeight != 0 && (capHeight < glyphHeight || glyphHeight == 0))
            {
                glyphHeight = capHeight;
            }
            // PDFBOX-3464: Sometimes even CapHeight has a very high value,
            // but Ascent and Descent are OK
            float ascent = fontDescriptor.getAscent();
            float descent = fontDescriptor.getDescent();
            if (ascent > 0 && descent < 0 &&
                    ((ascent - descent) / 2 < glyphHeight || glyphHeight == 0))
            {
                glyphHeight = (ascent - descent) / 2;
            }
        }
{code}
The new code adds a second check: if ascent is positive and descent is 
negative, and the height is either 0 or larger than (ascent - descent) / 2, 
then (ascent - descent) / 2 is used as the height. You seem to be doing your 
own calculations, so you should use similar logic there.
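
For code that does its own calculations, the same checks could be pulled into 
a small helper. This is only a sketch under assumptions, not part of PDFBox: 
the class name GlyphHeightSketch and the method pickGlyphHeight are 
hypothetical, and the caller is assumed to pass the raw font descriptor values 
(CapHeight, Ascent, Descent) together with whatever initial height it derived 
from the bounding box, all in the same units.

```java
// Hypothetical helper, not part of PDFBox: mirrors the two fallback checks above.
final class GlyphHeightSketch {
    private GlyphHeightSketch() {
    }

    /**
     * Picks the smallest plausible glyph height from the available metrics.
     *
     * @param initialHeight height the caller derived from the font bounding box (may be 0)
     * @param capHeight     CapHeight from the font descriptor (0 if missing)
     * @param ascent        Ascent from the font descriptor (expected to be positive)
     * @param descent       Descent from the font descriptor (expected to be negative)
     */
    static float pickGlyphHeight(float initialHeight, float capHeight,
                                 float ascent, float descent) {
        float glyphHeight = initialHeight;

        // The bounding box is sometimes far too large; prefer CapHeight if it is smaller.
        if (capHeight != 0 && (capHeight < glyphHeight || glyphHeight == 0)) {
            glyphHeight = capHeight;
        }

        // PDFBOX-3464: CapHeight can be too large as well; fall back to
        // (ascent - descent) / 2 when both values look sane and the result is smaller.
        if (ascent > 0 && descent < 0) {
            float fromAscentDescent = (ascent - descent) / 2;
            if (fromAscentDescent < glyphHeight || glyphHeight == 0) {
                glyphHeight = fromAscentDescent;
            }
        }
        return glyphHeight;
    }
}
```

With made-up values for illustration, pickGlyphHeight(3000f, 2800f, 900f, -200f) 
returns 550: CapHeight first lowers the height to 2800, then 
(900 - (-200)) / 2 = 550 replaces it.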

However, one build test (TestTextStripper) might fail (I can't tell for sure, 
because I use more test files than are in the repository). If that happens, 
you need to copy the changed .txt result files (not the diff files, which only 
show the differences between expected and received output) from PDFBox 
reactor\pdfbox\target\test-output to PDFBox 
reactor\pdfbox\src\test\resources\input .

> character height 3 times higher than expected
> ---------------------------------------------
>
>                 Key: PDFBOX-3464
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3464
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>            Priority: Minor
>         Attachments: screenshot-1.png, screenshot.png, subnode.docx.pdf
>
>
> The issue is basically the same as PDFBOX-2749, but the wrong sample was 
> attached to it by mistake. The correct PDF is attached here.
> The core of the problem is that the font height for this specific font is 
> determined incorrectly; please see the code with comments below.
> The issue was reproduced on PDFBox 1.8.4, but as we tested before, we get 
> the same result on versions 1.8.9 and 2.0.
> {code}
> public class Extractor extends PDFTextStripper {
> //<...CUT...>
>     protected void writePage() throws IOException {
>         //charactersByArticle was inherited from base class
>         for (List<TextPosition> textList : charactersByArticle) {
>             Iterator textIter = textList.iterator();
> //<...CUT...>
>             while (textIter.hasNext()) {
>                 TextPosition position = (TextPosition) textIter.next();
> //<...CUT...>
>                 PDFontDescriptor fontDescriptor = position.getFont().getFontDescriptor();
> //<...CUT...>
>                 float yscale = position.getTextPos().getYScale();
>                 float asc = Math.abs(fontDescriptor.getAscent() / 1000 * yscale);
>                 float rh = Math.abs(fontDescriptor.getFontBoundingBox().getUpperRightY() / 1000 * yscale);
>                 float desc = Math.abs(fontDescriptor.getDescent() / 1000 * yscale);
>                 float capHeight = Math.abs(fontDescriptor.getCapHeight() / 1000 * yscale);
>                 if (capHeight == 0)
>                     capHeight = position.getHeight();
>                 float h = (rh + Math.max(Math.max(capHeight, position.getHeight()), asc)) / 2;
> //"h" evaluates to 37.39 (should be between 11 and 12)
> //"desc" evaluates to 2.664
> //"capHeight" evaluates to 37.39
> //"position.getHeight()" evaluates to 33.48
> {code}



