[
https://issues.apache.org/jira/browse/PDFBOX-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118963#comment-13118963
]
Arun K. M commented on PDFBOX-577:
----------------------------------
I would be very grateful if you could post your code for drawing the bounding
boxes, if possible (I assume you use the PDF-to-image routines), and share any
other hints or directions you may have. I have been thinking of intercepting
the graphics context, recording all the "pixels" actually drawn on the canvas
for each character, and then computing the bounding box from them. That is a
bit extreme, I know, but I am also wondering whether that information is
already available, since you have started down this path. I am interested in
detecting superscripts and subscripts and emitting the corresponding Unicode
characters in PDF-to-text output for better extracts. Any hints, guidance,
pointers or experimental code are welcome. I am new to PDFBox and still
learning it, and I think that highly accurate bounding box computations would
be well worthwhile. Thanks!
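
To make the pixel-recording idea concrete, here is a minimal sketch of the
scanning step only. It assumes each character has already been rendered on its
own into a transparent TYPE_INT_ARGB image; how to obtain that image from
PDFBox is not shown. InkBounds is a hypothetical helper name, not a PDFBox
class, and it uses only java.awt.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of PDFBox): finds the "ink" extent of an image. */
final class InkBounds {

    /**
     * Returns the tightest rectangle enclosing every pixel whose alpha is
     * non-zero, or null if the image is completely transparent.
     * Assumption: the image has an alpha channel (e.g. TYPE_INT_ARGB) and
     * started out fully transparent before the glyph was drawn.
     */
    static Rectangle of(BufferedImage img) {
        int minX = Integer.MAX_VALUE;
        int minY = Integer.MAX_VALUE;
        int maxX = -1;
        int maxY = -1;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int alpha = img.getRGB(x, y) >>> 24; // top byte of ARGB
                if (alpha != 0) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // nothing was drawn
        }
        // +1 because min and max are both inclusive pixel coordinates.
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }
}
```

Calling InkBounds.of(image) once per rendered character would give a box in
image pixels; mapping it back to page coordinates would still need the render
scale and offset, which this sketch leaves out.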
> TextPosition should expose its bounding box
> -------------------------------------------
>
> Key: PDFBOX-577
> URL: https://issues.apache.org/jira/browse/PDFBOX-577
> Project: PDFBox
> Issue Type: Improvement
> Reporter: Villu Ruusmann
> Attachments:
> 0001-PDFont.java-Add-methods-to-retreive-the-Ascent-and-D.patch,
> AFM-getHeight.png, AFM-getUpperRightY.png
>
>
> It does not seem to be possible to calculate the bounding box of a
> TextPosition.
> IIUC, TextPosition#getY is the baseline of the text and
> TextPosition#getHeight is the absolute height of the text. When I subtract
> the latter from the former I get a top line, but this is only correct if the
> text does not contain descender characters.
> Below is a screenshot (AFM-getHeight.png) which shows the bounding boxes of
> TextPositions calculated as {#getX, #getY - #getHeight, #getWidth,
> #getHeight} and painted in random colors. For example, the bounding boxes of
> parentheses are severely misplaced, which makes line-by-line text
> extraction impossible.
> Right now I've solved the problem by tweaking AFM FontMetrics code so that it
> returns BoundingBox#getUpperRightY instead of BoundingBox#getHeight when
> queried via PDSimpleFont#getFontHeight(byte[], int, int). Another screenshot
> (AFM-getUpperRightY.png) shows how this restores the previously broken text
> extraction ability.
> It seems like a good idea to rework TextPosition so that it would be aware of
> its bounding box:
> *) Replace methods PDSimpleFont#getFontWidth(byte[], int, int) and
> PDSimpleFont#getFontHeight(byte[], int, int) with a single method
> PDSimpleFont#getFontBoundingBox(byte[], int, int)
> *) Replace the constructor TextPosition(Matrix, Matrix) with
> TextPosition(Matrix, BoundingBox)
> *) Add new methods TextPosition#getBoundingBox,
> TextPosition#getBoundingBoxDir. This shouldn't affect existing application
> clients, because TextPosition#getY and TextPosition#getHeight remain in place.
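
The descender problem in the quoted description can be illustrated with a small
sketch. It assumes y grows downward (as the "#getY - #getHeight" top line in
the issue implies) and that the ascent and descent of the glyph are known as
positive distances from the baseline. GlyphBoxSketch and its methods are
hypothetical names for illustration, not PDFBox API.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical illustration (not PDFBox API) of the two box calculations. */
final class GlyphBoxSketch {

    /**
     * The calculation described in the issue: the full glyph height
     * (ascent + descent) is subtracted from the baseline, so the box sits
     * too high by exactly the descent.
     */
    static Rectangle2D.Float naive(float x, float baselineY, float width,
                                   float ascent, float descent) {
        float height = ascent + descent;
        return new Rectangle2D.Float(x, baselineY - height, width, height);
    }

    /**
     * Baseline-aware variant: only the ascent lies above the baseline, and
     * the box extends below the baseline by the descent.
     */
    static Rectangle2D.Float baselineAware(float x, float baselineY, float width,
                                           float ascent, float descent) {
        return new Rectangle2D.Float(x, baselineY - ascent, width, ascent + descent);
    }
}
```

For a glyph with no descender (descent 0) the two methods agree, which matches
the observation in the issue that the simple calculation is only correct for
text without descender characters.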