[ 
https://issues.apache.org/jira/browse/PDFBOX-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434771#comment-15434771
 ] 

Daniel Persson commented on PDFBOX-3464:
----------------------------------------

Just a thought.

Could it be because of the UPM (units per em) square?

"With the knowledge that your font is using a 1000, 1024, or 2048 UPM, you need 
to set up the drawing of your glyphs to ensure that all aspects of your 
typeface fit adequately into that UPM square."

All of the scaling in your code assumes a UPM square of 1000, but this font 
might be using a 2048 square instead?
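
To make the idea concrete, here is a minimal sketch of what I mean, assuming the 
font's units-per-em value can be obtained from somewhere (how to get it out of 
PDFBox is not shown). The class UpmScaling, its method names, and the sample 
numbers are all hypothetical and only illustrate the arithmetic; this is not 
PDFBox API.

```java
// Sketch only: UpmScaling and its methods are hypothetical helpers, not PDFBox API.
final class UpmScaling {

    /** Converts a metric given in font design units to text space, using the font's own UPM. */
    static float scaleMetric(float fontUnits, float unitsPerEm, float yScale) {
        if (unitsPerEm <= 0) {
            throw new IllegalArgumentException("unitsPerEm must be positive");
        }
        return Math.abs(fontUnits / unitsPerEm * yScale);
    }

    /** Factor by which a value is overstated when 1000 is assumed instead of the real UPM. */
    static float overestimateFactor(float unitsPerEm) {
        return unitsPerEm / 1000f;
    }

    public static void main(String[] args) {
        float yScale = 12f;            // hypothetical vertical scale (font size)
        float capHeightUnits = 1434f;  // hypothetical cap height in design units

        // Same metric, scaled once with the hard-coded 1000 and once with 2048.
        float assumed1000 = scaleMetric(capHeightUnits, 1000f, yScale);
        float actual2048 = scaleMetric(capHeightUnits, 2048f, yScale);

        System.out.println("assuming 1000 UPM: " + assumed1000);
        System.out.println("using 2048 UPM:    " + actual2048);
        System.out.println("overestimate:      " + overestimateFactor(2048f));
    }
}
```

Note that 2048 / 1000 is only about a factor of 2, so even if this guess is 
right it may be just part of the explanation for a height that is 3 times too 
large.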

> character height 3 times higher than expected
> ---------------------------------------------
>
>                 Key: PDFBOX-3464
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3464
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>            Priority: Minor
>         Attachments: notHelped.png, nowItsHelped.png, screenshot-1.png, 
> screenshot.png, subnode.docx.pdf
>
>
> The issue is basically the same as PDFBOX-2749, but the wrong sample was 
> attached to it by mistake. The correct PDF is attached here.
> The core of the problem is that the font height for this specific font is 
> determined incorrectly; please see the code with comments below.
> The issue was reproduced on PDFBox 1.8.4, but as we tested before, we get the 
> same result on versions 1.8.9 and 2.0.
> {code}
> public class Extractor extends PDFTextStripper {
>     //<...CUT...>
>     protected void writePage() throws IOException {
>         // charactersByArticle was inherited from base class
>         for (List<TextPosition> textList : charactersByArticle) {
>             Iterator textIter = textList.iterator();
>             //<...CUT...>
>             while (textIter.hasNext()) {
>                 TextPosition position = (TextPosition) textIter.next();
>                 //<...CUT...>
>                 PDFontDescriptor fontDescriptor = position.getFont().getFontDescriptor();
>                 //<...CUT...>
>                 float yscale = position.getTextPos().getYScale();
>                 float asc = Math.abs(fontDescriptor.getAscent() / 1000 * yscale);
>                 float rh = Math.abs(fontDescriptor.getFontBoundingBox().getUpperRightY() / 1000 * yscale);
>                 float desc = Math.abs(fontDescriptor.getDescent() / 1000 * yscale);
>                 float capHeight = Math.abs(fontDescriptor.getCapHeight() / 1000 * yscale);
>                 if (capHeight == 0)
>                     capHeight = position.getHeight();
>                 float h = (rh + Math.max(Math.max(capHeight, position.getHeight()), asc)) / 2;
>                 // "h" evaluates to 37.39 (should be between 11 and 12)
>                 // "desc" evaluates to 2.664
>                 // "capHeight" evaluates to 37.39
>                 // "position.getHeight()" evaluates to 33.48
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
