Hi

This issue is fixed in the current trunk; see [1] for further details.

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-1481
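
For readers who want to see what the exception in the quoted trace means: Integer.parseInt was handed the token "dup" inside PDType1Font.getEncodingFromFont. Type 1 font programs usually list their encoding entries as "dup <code> /<glyphname> put", so one plausible reading (an assumption, not confirmed by this thread) is that an entry in the embedded font did not follow that shape. The sketch below is not PDFBox code and not the actual fix in [1]; the class and method names (EncodingLineSketch, parseEncodingLines) are hypothetical. It only illustrates how such entries can be read tolerantly with the plain JDK, skipping anything that does not match.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT how PDFBox implements it.
public class EncodingLineSketch {

    // Reads lines shaped like "dup <code> /<glyphname> put" and returns
    // code -> glyph name. Lines with any other shape are skipped instead
    // of letting a NumberFormatException escape.
    static Map<Integer, String> parseEncodingLines(String[] lines) {
        Map<Integer, String> codes = new LinkedHashMap<Integer, String>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            // Require at least four tokens: dup, code, /name, put.
            if (tokens.length < 4
                    || !"dup".equals(tokens[0])
                    || !tokens[2].startsWith("/")) {
                continue;
            }
            try {
                int code = Integer.parseInt(tokens[1]);
                codes.put(code, tokens[2].substring(1));
            } catch (NumberFormatException e) {
                // The code position held something like "dup"; skip the entry.
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        String[] lines = {
            "dup 65 /A put",
            "dup dup /B put",   // malformed: "dup" where the code should be
            "dup 67 /C put"
        };
        System.out.println(parseEncodingLines(lines));
    }
}
```

With the sample input above, only the two well-formed entries end up in the map. If upgrading is possible, moving to a build that contains the fix from [1] remains the real solution.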

On 09.05.2012 at 20:15, 叶严杰 wrote:
URL for the PDF file:
http://www.aclweb.org/anthology-new/P/P02/P02-1046.pdf

On Thu, May 10, 2012 at 1:26 AM, 叶严杰 <[email protected]> wrote:

I tried to extract text from a PDF with PDFBox using stripper.getText (see
the code below).
The PDF is attached as a file, and the error output is included below.
Is there any way to solve this bug?

Regards

*Code*
     public void read()
     {
         PDDocument document = null;
         FileInputStream is = null;
         try {
             is = new FileInputStream(file);
             PDFParser parser = new PDFParser(is);
             parser.parse();
             document = parser.getPDDocument();
             PDFTextStripper stripper = new PDFTextStripper();
             content = stripper.getText(document);
         } catch (FileNotFoundException e) {
             e.printStackTrace();
         } catch (IOException e) {
             e.printStackTrace();
         } finally {
             if (is != null) {
                 try {
                     is.close();
                 } catch (IOException e) {
                     e.printStackTrace();
                 }
             }
             if (document != null) {
                 try {
                     document.close();
                 } catch (IOException e) {
                     e.printStackTrace();
                 }
             }
         }
     }

*Bug Info*
Exception in thread "main" java.lang.NumberFormatException: For input string: "dup"
     at java.lang.NumberFormatException.forInputString(Unknown Source)
     at java.lang.Integer.parseInt(Unknown Source)
     at java.lang.Integer.parseInt(Unknown Source)
     at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
     at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
     at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
     at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
     at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
     at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:242)
     at get.read(get.java:33)
     at get.main(get.java:60)


