Pavel Misurkin created PDFBOX-2584:
--------------------------------------

             Summary: Text extraction reports zero character widths
                 Key: PDFBOX-2584
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2584
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.8, 2.0.0
            Reporter: Pavel Misurkin

We are using the PDFBox API to get the positions of characters within a document and have found a problem with one document: text extraction returns the correct text but sets every character's width to zero.

The code is pretty simple:

{code}
File input = new File("stip_2c.pdf");
PDDocument document = PDDocument.load(input);
PDFTextStripper extractor = new PDFTextStripper();
Writer output = new StringWriter();
extractor.writeText(document, output);
{code}

We then examine the extractor's charactersByArticle member for the character widths:

- in version 1.8.4, where we found the issue, all character widths were zero
- in version 1.8.8, all character widths were zero except for whitespace. See the new validation added in 1.8.8, in pdfbox-1.8.8-src\pdfbox\src\main\java\org\apache\pdfbox\util\PDFStreamEngine.java, line 369:
{code}
if (spaceWidthText == 0)
{
    spaceWidthText = 1.0f; // if could not find font, use a generic value
}
{code}
- in version 2.0.0, the problem still exists
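
To show what the 1.8.8 check quoted above does in isolation, here is a minimal standalone Java sketch. It does not call PDFBox: the class name SpaceWidthFallback and the method effectiveSpaceWidth are hypothetical and exist only for this illustration. It assumes the font lookup yields a space width of 0 for this document, which would be consistent with whitespace being the only characters reported with a non-zero width in 1.8.8, but that has not been confirmed.

```java
final class SpaceWidthFallback {

    // Hypothetical helper mirroring the check quoted from PDFStreamEngine.java (1.8.8):
    // a space width of exactly 0 is replaced by the generic value 1.0f,
    // any other value is passed through unchanged.
    static float effectiveSpaceWidth(float spaceWidthText) {
        if (spaceWidthText == 0) {
            spaceWidthText = 1.0f; // generic value used when the font gave no width
        }
        return spaceWidthText;
    }

    public static void main(String[] args) {
        // Assumed case for the failing document: the font reports no space width.
        System.out.println("font reports 0     -> " + effectiveSpaceWidth(0f));
        // A font that reports a usable width is left alone.
        System.out.println("font reports 0.25  -> " + effectiveSpaceWidth(0.25f));
    }
}
```

The fallback only touches the space width, so under this assumption the widths of all other characters would stay at zero, which matches what is observed in 1.8.8.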