[
https://issues.apache.org/jira/browse/PDFBOX-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983241#comment-14983241
]
John Hewson commented on PDFBOX-3075:
-------------------------------------
{quote}
And as you say there you need to rethink the font height so it works with
PDFTextStripper. My changes made it through the test cases, so I think the
stripper can't be that dependent on the actual text height. It uses the font's
bounding box height, not font.getHeight(int code), which gives you a specific
glyph height.
{quote}
Indeed, PDFont.getHeight() is never called by PDFTextStripper. The height is
calculated in PDFTextStreamEngine as a function of the font's bbox - again,
incorrectly.
{quote}
Furthermore, not all font types have glyphs defined. That could be wrong
behavior, but in those cases you can only approximate the height. My patch
gave me a unified font height in the 1000 em system so I could make accurate
calculations on the position and height of glyphs.
{quote}
Logical height is the same for all glyphs in a font and doesn't require any
glyph metrics to calculate. The font size is available in the PDF graphics
state.
It's important to separate the two kinds of glyph dimensions, logical and
visual. getWidth() returns the logical width (i.e. the advance width) of a
glyph. Correspondingly, in PDFTextStripper we want to work with the logical
height of a glyph too. This is the same for all glyphs in a given font and
should be equal to the font size in the graphics state, scaled by the text
matrix (TM, I think) and the CTM. It might be necessary to take the font's
matrix into account too. Anyway, all the information is available in PDFBox.
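To make the idea concrete, here is a rough sketch of that calculation using
only java.awt.geom, not the PDFBox API. The class and method names are
hypothetical, the two matrices are passed in as plain AffineTransform objects,
and the font matrix, text rise and horizontal scaling are deliberately left
out, so treat it as an illustration of the arithmetic rather than what
PDFTextStripper should literally do.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical helper, not part of PDFBox.
public class LogicalGlyphHeight {

    /**
     * Returns the device-space length of the vector (0, fontSize) after it
     * has been transformed by the text matrix and then by the CTM.
     * Assumption: the font matrix is ignored here.
     */
    public static double logicalHeight(double fontSize,
                                       AffineTransform textMatrix,
                                       AffineTransform ctm) {
        // concatenate() makes textMatrix apply first, then ctm.
        AffineTransform combined = new AffineTransform(ctm);
        combined.concatenate(textMatrix);

        // deltaTransform() ignores translation, which is what we want for
        // a height: only scale, rotation and shear matter.
        Point2D up = combined.deltaTransform(new Point2D.Double(0, fontSize), null);
        return Math.hypot(up.getX(), up.getY());
    }

    public static void main(String[] args) {
        // Example values only: 12 pt font, text matrix scaling by 2,
        // page rotated by 90 degrees.
        AffineTransform tm = AffineTransform.getScaleInstance(2, 2);
        AffineTransform ctm = AffineTransform.getRotateInstance(Math.PI / 2);
        System.out.println(logicalHeight(12, tm, ctm));
    }
}
```

Note that no glyph metrics appear anywhere: the result depends only on the
font size and the two matrices, so it is the same for every glyph in the font.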
{quote}
I've been running many tests on these functions, and I would like to
contribute back because the help I've gotten from PDFBOX is great. When it
comes to the width advance it's pretty accurate as long as I make small changes
when we have vertical text and text that is written from right to left. But
we've solved those too.
{quote}
We have accurate calculations for those metrics too, they're just not being
used in PDFTextStripper.
{quote}
Should all fonts have glyphs?
{quote}
Yes, but that might stop such things from occurring. As far as logical height
is concerned, it doesn't matter.
{quote}
So what do you recommend that I do going forward? I would like to build my
solution on PDFBOX and I have time allotted by my company to contribute code
back to PDFBOX when our work requires changes in the PDFBOX engine.
{quote}
That's great, we've had some people interested in working on this recently.
Check out my reply to PDFBOX-3056, where I explain what the problem is and
PDFBOX-3062 where I add some more details. Note that the latter issue is still
open.
> Changed to the getHeight function for fonts so it will return a more accurate
> height
> ------------------------------------------------------------------------------------
>
> Key: PDFBOX-3075
> URL: https://issues.apache.org/jira/browse/PDFBOX-3075
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Daniel Persson
> Priority: Minor
> Labels: github-import
> Fix For: 2.0.0
>
> Attachments: get_height.patch
>
>
> The getHeight in the fonts gave back approximated heights and in some cases
> only height the first time the function was called. Tried to clean up the
> functions and return a more accurate height for each glyph.