Tilman Hausherr created PDFBOX-3038:
---------------------------------------

             Summary: Text extraction shows glyphs with zero height
                 Key: PDFBOX-3038
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3038
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.0
            Reporter: Tilman Hausherr
             Fix For: 2.0.0


This happens with file 001033.pdf:
2.0:
{code}
String[108.0,663.6 fs=6.96 xscale=6.96 height=0.0 space=12.1104 width=3.4800034]1
String[144.0,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.996994]I
String[147.417,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=4.5]n
String[152.337,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.25] 
String[154.88701,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.501999]t
String[157.809,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=4.5]h
String[162.729,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=3.9960022]e
String[167.145,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.25] 
{code}


1.8:
{code}
String[108.0,663.6 fs=6.96 xscale=6.96 height=4.57272 space=1.74 width=3.4800034]1
String[144.0,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.996994]I
String[147.417,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=4.5]n
String[152.337,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.25] 
String[154.88701,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.501999]t
String[157.809,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=4.5]h
String[162.729,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=3.9960022]e
String[167.145,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.25] 
{code}


The font has an empty bbox:
{code}
def
/FontBBox {0 0 0 0}
{code}

1.8 had this code to get the height (in PDSimpleFont):
{code}
PDRectangle fontBBox = desc.getFontBoundingBox();
if (fontBBox != null)
{
    retval = fontBBox.getHeight() / 2;
}
if( retval == 0 )
{
    retval = desc.getCapHeight();
}
if( retval == 0 )
{
    retval = desc.getAscent();
}
if( retval == 0 )
{
    retval = desc.getXHeight();
    if (retval > 0)
    {
        retval -= desc.getDescent();
    }
}
{code}

2.0 has only this:
{code}
float glyphHeight = font.getBoundingBox().getHeight() / 2;
{code}
So 2.0 takes the height from the font itself, and has no Plan B.

Getting the BBox from the font descriptor instead of from the font itself yields correct heights (and better text extraction).
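
For illustration, here is a minimal standalone sketch of the kind of fallback chain the 1.8 excerpt implements. It is not the actual PDFBox code and not necessarily the fix that should be committed. The class name GlyphHeightFallback and the method glyphHeight are hypothetical, and the metrics are passed as plain floats rather than read from a real font descriptor. The value 657 in main is only inferred from the 1.8 output (5.913 / 9.0 * 1000), assuming the height is scaled by font size / 1000; it was not read from 001033.pdf.

```java
// Hypothetical sketch: mirrors the fallback order of the 1.8 excerpt,
// using plain floats instead of PDFBox's font descriptor.
final class GlyphHeightFallback
{
    /**
     * Returns a glyph height in glyph space units (1/1000 em).
     * Tries half the bounding box height first, then cap height,
     * then ascent, then x-height minus descent.
     */
    static float glyphHeight(float bboxHeight, float capHeight,
                             float ascent, float xHeight, float descent)
    {
        float retval = bboxHeight / 2;
        if (retval == 0)
        {
            retval = capHeight;
        }
        if (retval == 0)
        {
            retval = ascent;
        }
        if (retval == 0)
        {
            retval = xHeight;
            if (retval > 0)
            {
                // descent is normally negative, so this adds its magnitude
                retval -= descent;
            }
        }
        return retval;
    }

    public static void main(String[] args)
    {
        // Empty bbox ({0 0 0 0}) with an assumed cap height of 657 units
        float units = glyphHeight(0f, 657f, 0f, 0f, 0f);
        // Assumed scaling to text space for a 9.0 pt font
        float fontSize = 9.0f;
        System.out.println("height=" + (units / 1000f * fontSize));
    }
}
```

With an empty bbox the first step yields 0, so the chain falls through to the cap height and the scaled result is approximately 5.913, in line with the 1.8 output above. With the 2.0 one-liner the same font yields 0.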


