[ https://issues.apache.org/jira/browse/PDFBOX-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435380#comment-15435380 ]
Daniel Persson commented on PDFBOX-3464:
----------------------------------------

I also took a look at the supplied PDF, and our tool, which uses PDFBox, extracts the correct height after normalizing the fonts. Both fonts have an EM square of 2048.

> character height 3 times higher than expected
> ---------------------------------------------
>
>                 Key: PDFBOX-3464
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3464
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>            Priority: Minor
>         Attachments: notHelped.png, nowItsHelped.png, screenshot-1.png,
> screenshot.png, subnode.docx.pdf
>
>
> The issue is basically the same as PDFBOX-2749, but the wrong sample was
> attached to it by mistake. The correct PDF is attached here.
> The core of the problem is that the font height for this specific font is
> determined incorrectly; please see the code with comments below.
> The issue was reproduced on PDFBox 1.8.4, and as we tested before, we get
> the same result on versions 1.8.9 and 2.0.
> {code}
> public class Extractor extends PDFTextStripper {
>     //<...CUT...>
>     protected void writePage() throws IOException {
>         for (List<TextPosition> textList : charactersByArticle) {
>             //charactersByArticle was inherited from base class
>             Iterator textIter = textList.iterator();
>             //<...CUT...>
>             while (textIter.hasNext()) {
>                 TextPosition position = (TextPosition) textIter.next();
>                 //<...CUT...>
>                 PDFontDescriptor fontDescriptor = position.getFont().getFontDescriptor();
>                 //<...CUT...>
>                 float yscale = position.getTextPos().getYScale();
>                 float asc = Math.abs(fontDescriptor.getAscent() / 1000 * yscale);
>                 float rh = Math.abs(fontDescriptor.getFontBoundingBox().getUpperRightY() / 1000 * yscale);
>                 float desc = Math.abs(fontDescriptor.getDescent() / 1000 * yscale);
>                 float capHeight = Math.abs(fontDescriptor.getCapHeight() / 1000 * yscale);
>                 if (capHeight == 0)
>                     capHeight = position.getHeight();
>                 float h = (rh + Math.max(Math.max(capHeight, position.getHeight()), asc)) / 2;
>                 //"h" evaluates to 37.39 (should be between 11 and 12)
>                 //"desc" evaluates to 2.664
>                 //"capHeight" evaluates to 37.39
>                 //"position.getHeight()" evaluates to 33.48
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
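
To illustrate the normalization mentioned in the comment at the top of this message, here is a minimal sketch that scales a font metric by the font's units-per-em instead of a hard-coded 1000. The class `FontUnits` and the method `toTextSpace` are hypothetical names, not PDFBox API, and the numbers in `main` are made up for illustration. Whether the descriptor values in the attached PDF are really expressed on a 2048-unit EM square is an assumption; the thread only states that both fonts have an EM square of 2048.

```java
// Hypothetical helper for illustration only; not part of PDFBox.
final class FontUnits {

    private FontUnits() {
    }

    /**
     * Scales a metric given in font design units to text space.
     *
     * @param fontUnits  metric in font design units (e.g. a cap height)
     * @param unitsPerEm size of the font's EM square (1000 or 2048 are common)
     * @param yScale     vertical scale applied to one em
     * @return absolute size of the metric after scaling
     */
    static float toTextSpace(float fontUnits, float unitsPerEm, float yScale) {
        if (unitsPerEm <= 0) {
            throw new IllegalArgumentException("unitsPerEm must be positive");
        }
        return Math.abs(fontUnits / unitsPerEm * yScale);
    }

    public static void main(String[] args) {
        float yScale = 12f;           // illustrative value
        float capHeightUnits = 1400f; // illustrative value, assumed to be on a 2048-unit EM square

        // Same division as the quoted code: assumes 1000 units per em.
        System.out.println("assuming 1000: " + toTextSpace(capHeightUnits, 1000f, yScale));

        // Normalized by the EM square mentioned in the comment.
        System.out.println("assuming 2048: " + toTextSpace(capHeightUnits, 2048f, yScale));
    }
}
```

If a metric is stored on a 2048-unit EM square but divided by 1000, the result comes out 2.048 times too large. That alone may not account for the full factor reported in the issue, so treat this as one possible contributor rather than a confirmed diagnosis.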