[ 
https://issues.apache.org/jira/browse/PDFBOX-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-3038.
-------------------------------------
    Resolution: Fixed
      Assignee: Tilman Hausherr

Setting to resolved:
- Text extraction for this file is now identical between 1.8 and 2.0.
- Space width output is not identical. I'll think about this and open 
another issue if useful.
- The same problem may exist for other font types, but I prefer not to change 
anything before I have a test file.
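
For reference, the space width difference can be checked with simple arithmetic on the output quoted in the issue description: in both samples the 2.0 "space" value divided by the 1.8 value comes out equal to the font size (20.25 / 2.25 = 9.0 and 12.1104 / 1.74 = 6.96). The sketch below only reproduces that division. The class and method names are made up for illustration, and this is an observation on two samples, not a diagnosis of the cause.

```java
// Illustrative only: compares the "space" values printed by 2.0 and 1.8
// for the same glyph and reports their ratio.
final class SpaceWidthRatio {

    // Ratio of the 2.0 space value to the 1.8 space value.
    static float ratio(float space20, float space18) {
        return space20 / space18;
    }

    public static void main(String[] args) {
        // Values copied from the quoted output (fs=9.0 and fs=6.96).
        System.out.println(ratio(20.25f, 2.25f));
        System.out.println(ratio(12.1104f, 1.74f));
    }
}
```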

> Text extraction shows glyphs with zero height
> ---------------------------------------------
>
>                 Key: PDFBOX-3038
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3038
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>              Labels: regression
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX-3038-001033-p2.pdf
>
>
> This happens with file 001033.pdf:
> 2.0:
> {code}
> String[108.0,663.6 fs=6.96 xscale=6.96 height=0.0 space=12.1104 width=3.4800034]1
> String[144.0,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.996994]I
> String[147.417,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=4.5]n
> String[152.337,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.25] 
> String[154.88701,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.501999]t
> String[157.809,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=4.5]h
> String[162.729,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=3.9960022]e
> String[167.145,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.25] 
> {code}
> 1.8:
> {code}
> String[108.0,663.6 fs=6.96 xscale=6.96 height=4.57272 space=1.74 width=3.4800034]1
> String[144.0,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.996994]I
> String[147.417,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=4.5]n
> String[152.337,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.25] 
> String[154.88701,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.501999]t
> String[157.809,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=4.5]h
> String[162.729,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=3.9960022]e
> String[167.145,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.25] 
> {code}
> The font has an empty bbox:
> {code}
> def
> /FontBBox {0 0 0 0}
> {code}
> 1.8 had this code to get the height (in PDSimpleFont):
> {code}
>                 PDRectangle fontBBox = desc.getFontBoundingBox();
>                 if (fontBBox != null)
>                 {
>                     retval = fontBBox.getHeight() / 2;
>                 }
>                 if( retval == 0 )
>                 {
>                     retval = desc.getCapHeight();
>                 }
>                 if( retval == 0 )
>                 {
>                     retval = desc.getAscent();
>                 }
>                 if( retval == 0 )
>                 {
>                     retval = desc.getXHeight();
>                     if (retval > 0)
>                     {
>                         retval -= desc.getDescent();
>                     }
>                 }
> {code}
> 2.0 has only this:
> {code}
> float glyphHeight = font.getBoundingBox().getHeight() / 2;
> {code}
> So 2.0 takes the height from the font itself only, and has no Plan B.
> Getting the BBox from the font descriptor instead gives correct heights (and 
> a better text extraction).
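
The 1.8 fallback chain quoted above can be sketched as a standalone method. This is a minimal sketch, not the committed fix: `GlyphHeightFallback`, `Metrics` and its fields are hypothetical stand-ins for the font descriptor values, and the numbers in `main` are illustrative, not taken from 001033.pdf.

```java
// Standalone sketch of the 1.8 height lookup: try half the bounding box
// height first, then fall back to cap height, ascent, and x-height.
final class GlyphHeightFallback {

    // Hypothetical holder for the font descriptor values used by the chain.
    static final class Metrics {
        final Float bboxHeight; // null models a missing font bounding box
        final float capHeight;
        final float ascent;
        final float xHeight;
        final float descent;

        Metrics(Float bboxHeight, float capHeight, float ascent,
                float xHeight, float descent) {
            this.bboxHeight = bboxHeight;
            this.capHeight = capHeight;
            this.ascent = ascent;
            this.xHeight = xHeight;
            this.descent = descent;
        }
    }

    // Mirrors the order of checks in the quoted 1.8 code.
    static float height(Metrics m) {
        float retval = 0;
        if (m.bboxHeight != null) {
            retval = m.bboxHeight / 2;
        }
        if (retval == 0) {
            retval = m.capHeight;
        }
        if (retval == 0) {
            retval = m.ascent;
        }
        if (retval == 0) {
            retval = m.xHeight;
            if (retval > 0) {
                retval -= m.descent;
            }
        }
        return retval;
    }

    public static void main(String[] args) {
        // An empty bbox ({0 0 0 0}) has height 0, so the cap height is used.
        Metrics emptyBBox = new Metrics(0f, 657f, 0f, 0f, 0f);
        System.out.println(height(emptyBBox));
    }
}
```

With an empty bbox the first check yields 0 and the chain moves on to the next non-zero value, which is the Plan B that the single-line 2.0 code lacks.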


