Hi,

this issue is solved in the current trunk; see [1] for further details.

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-1481

On 09.05.2012 20:15, 叶严杰 wrote:
URL for the PDF file: http://www.aclweb.org/anthology-new/P/P02/P02-1046.pdf

On Thu, May 10, 2012 at 1:26 AM, 叶严杰 <[email protected]> wrote:

I tried to extract text from a PDF with PDFBox using stripper.getText (see the code below). The PDF is attached as a file, and the error output is included below. Is there any way to solve this bug?

Regards

*Code*

    public void read() {
        PDDocument document = null;
        FileInputStream is = null;
        try {
            is = new FileInputStream(file);
            PDFParser parser = new PDFParser(is);
            parser.parse();
            document = parser.getPDDocument();
            PDFTextStripper stripper = new PDFTextStripper();
            content = stripper.getText(document);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (is != null) {
                try {
                    is.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

*Bug Info*

    Exception in thread "main" java.lang.NumberFormatException: For input string: "dup"
        at java.lang.NumberFormatException.forInputString(Unknown Source)
        at java.lang.Integer.parseInt(Unknown Source)
        at java.lang.Integer.parseInt(Unknown Source)
        at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
        at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
        at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
        at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
        at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
        at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:242)
        at get.read(get.java:33)
        at get.main(get.java:60)
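
For readers hitting the same trace: the exception shows Integer.parseInt receiving the token "dup" inside PDType1Font.getEncodingFromFont. Encoding entries in an embedded Type 1 font program normally have the form "dup <code> /<glyphName> put", so the trace suggests a token was read from a position where a character code was expected. The sketch below is only an illustration of tolerant parsing for such entries. It is not the actual PDFBox code and not the fix applied in PDFBOX-1481; the class name Type1EncodingSketch and the method parseEncoding are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, self-contained sketch; not part of PDFBox.
public class Type1EncodingSketch {

    /**
     * Scans whitespace-separated tokens for entries of the form
     * "dup <code> /<glyphName> put" and returns a code -> glyph name map.
     * Anything that does not match this shape is skipped instead of
     * letting Integer.parseInt throw.
     */
    static Map<Integer, String> parseEncoding(String encodingText) {
        Map<Integer, String> encoding = new LinkedHashMap<>();
        String[] tokens = encodingText.trim().split("\\s+");
        for (int i = 0; i + 3 < tokens.length; i++) {
            if (!tokens[i].equals("dup")) {
                continue;
            }
            String code = tokens[i + 1];
            String name = tokens[i + 2];
            if (!name.startsWith("/") || !tokens[i + 3].equals("put")) {
                continue; // not a complete "dup code /name put" entry
            }
            try {
                encoding.put(Integer.parseInt(code), name.substring(1));
                i += 3; // consume the whole entry
            } catch (NumberFormatException e) {
                // The token after "dup" is not a number (for example another
                // "dup"); skip it and keep scanning from the next token.
            }
        }
        return encoding;
    }

    public static void main(String[] args) {
        // The second line deliberately contains a stray "dup".
        String sample = "dup 32 /space put\ndup dup 65 /A put\ndup 66 /B put";
        System.out.println(parseEncoding(sample)); // prints {32=space, 65=A, 66=B}
    }
}
```

Skipping malformed entries trades strictness for robustness: text extraction can proceed with a partial encoding rather than aborting on the first unexpected token.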

