[ 
https://issues.apache.org/jira/browse/PDFBOX-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983241#comment-14983241
 ] 

John Hewson commented on PDFBOX-3075:
-------------------------------------

{quote}
And as you say there, you need to rethink the font height so it works with 
PDFTextStripper. My changes made it through the test cases, so I think the 
stripper can't be that dependent on the actual text height. It uses the font's 
bounding box height, not the font.getHeight(int code) that gives you a specific 
glyph height.
{quote}

Indeed, PDFont.getHeight() is never called by PDFTextStripper. The height is 
calculated in PDFTextStreamEngine as a function of the font's bbox - again, 
incorrectly.

{quote}
Furthermore, not all font types have glyphs defined. Could be wrong 
behavior, but in those cases you could only approximate the height. My patch 
gave me a unified font height in the 1000 em system so I could make accurate 
calculations on the position and height of glyphs.
{quote}

Logical height is the same for all glyphs in a font and doesn't require any 
glyph metrics to calculate. The font size is available in the PDF graphics 
state.

It's important to separate the two kinds of glyph dimensions, logical and 
visual. getWidth() returns the logical width (i.e. the advance width) of a 
glyph. Correspondingly, in PDFTextStripper we want to work with the logical 
height of a glyph too. This is the same for all glyphs in a given font and 
should be equal to the font size in the graphics state multiplied by the text 
matrix (TM, I think) and the CTM. It might be necessary to take the font's 
matrix into account too. Anyway, all the information is available in PDFBox.
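
To make the idea concrete, here is a minimal sketch of that calculation. It is 
not PDFBox code: the class and method names are hypothetical, it uses 
java.awt.geom.AffineTransform in place of PDFBox's own matrix type, and it 
assumes the font matrix can be ignored (which, as noted above, may not hold for 
every font).

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical helper, not part of PDFBox. */
public final class LogicalGlyphHeight {
    private LogicalGlyphHeight() {
    }

    /**
     * Returns the device-space length of a vertical text-space vector whose
     * length is the font size. The result is the same for every glyph in the
     * font because no glyph metrics are involved.
     */
    public static double logicalHeight(double fontSize,
                                       AffineTransform textMatrix,
                                       AffineTransform ctm) {
        // Text space maps to device space by applying the text matrix first
        // and the CTM second; concatenate() applies its argument first.
        AffineTransform textToDevice = new AffineTransform(ctm);
        textToDevice.concatenate(textMatrix);

        // deltaTransform ignores translation, which is what we want for a size.
        Point2D up = textToDevice.deltaTransform(new Point2D.Double(0, fontSize), null);
        return Math.hypot(up.getX(), up.getY());
    }

    public static void main(String[] args) {
        // Assumed example values: 12 pt font, identity text matrix, page scaled by 2.
        double height = logicalHeight(12, new AffineTransform(),
                AffineTransform.getScaleInstance(2, 2));
        System.out.println("logical height = " + height);
    }
}
```

Measuring the length of the transformed vector rather than reading a single 
matrix entry keeps the result meaningful for rotated text.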

{quote}
I've been running many tests on these functions, but I would like to 
contribute back because the help I've gotten from PDFBOX is great. When it 
comes to the width advance, it's pretty accurate as long as I make small changes 
when we have vertical text and text that is written from right to left. But 
we've solved those too.
{quote}

We have accurate calculations for those metrics too; they're just not being 
used in PDFTextStripper.

{quote}
Should all fonts have glyphs?
{quote}

Yes, but that might not stop such things from occurring. As far as logical 
height is concerned, it doesn't matter.

{quote}
So what do you recommend that I do going forward? I would like to build my 
solution on PDFBOX and I have time allotted by my company to contribute code 
back to PDFBOX when our work requires changes in the PDFBOX engine.
{quote}

That's great; we've had some people interested in working on this recently. 
Check out my reply to PDFBOX-3056, where I explain what the problem is, and 
PDFBOX-3062, where I add some more details. Note that the latter issue is still 
open.

> Changed to the getHeight function for fonts so it will return a more accurate 
> height
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3075
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3075
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Daniel Persson
>            Priority: Minor
>              Labels: github-import
>             Fix For: 2.0.0
>
>         Attachments: get_height.patch
>
>
> The getHeight in the fonts gave back approximated heights and in some cases 
> only height the first time the function was called. Tried to clean up the 
> functions and return a more accurate height for each glyph.


