[ https://issues.apache.org/jira/browse/PDFBOX-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224478#comment-17224478 ]
Yauheni Salopiy commented on PDFBOX-3451:
-----------------------------------------

Hi [~tilman], [~msahyoun],

Thank you!

Best Regards,
Yauheni Salopiy

> IOException at org.apache.pdfbox.pdfparser.BaseParser.readLong
> --------------------------------------------------------------
>
>                 Key: PDFBOX-3451
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3451
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.1, 2.0.2
>            Reporter: Yauheni Salopiy
>            Priority: Major
>              Labels: WK
>         Attachments: PDFBOX-3451_LOG.txt, att3x1l.pdf, att3x1l.txt
>
>
> Apache Tika 1.14-SNAPSHOT (PDFBox 2.0.2) throws the following exception on text
> extraction from a valid PDF document:
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@5b529706
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.extractText(DocumentsTextExtractor.java:44)
>     at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.main(DocumentsTextExtractor.java:134)
> Caused by: java.io.IOException: Error: Expected a long type at offset 9003008, instead got '???3??~?????~???'
>     at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1350)
>     at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1278)
>     at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:739)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:721)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:652)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:612)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:215)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:840)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:780)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:130)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 6 more
> Caused by: java.lang.NumberFormatException: For input string: "???3??~?????~???"
>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>     at java.lang.Long.parseLong(Long.java:589)
>     at java.lang.Long.parseLong(Long.java:631)
>     at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1345)
>     ... 17 more
> Please find the failing document and the log with the stack trace in the attachments.
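
For readers following the trace, the exception chain it shows (a NumberFormatException from Long.parseLong, wrapped in an IOException that reports the offset and the offending text) can be illustrated with a minimal, self-contained sketch. This is not the PDFBox source: the class ReadLongSketch, its readLong(String, long) signature, and the sample token are hypothetical, chosen only to mirror the message format seen in the trace.

```java
import java.io.IOException;

// Hypothetical stand-in for the parsing step named in the trace; NOT PDFBox code.
class ReadLongSketch {

    // Parses a token as a long. On failure, wraps the NumberFormatException
    // in an IOException that reports the offset and the offending text,
    // mirroring the message format seen in the stack trace.
    static long readLong(String token, long offset) throws IOException {
        try {
            return Long.parseLong(token);
        } catch (NumberFormatException e) {
            throw new IOException("Error: Expected a long type at offset "
                    + offset + ", instead got '" + token + "'", e);
        }
    }

    public static void main(String[] args) {
        try {
            // Assumed sample input: non-numeric text where a number was expected.
            readLong("?3~", 9003008L);
        } catch (IOException e) {
            System.out.println(e.getMessage());
            System.out.println("cause: " + e.getCause().getClass().getName());
        }
    }
}
```

Under these assumptions, the "Caused by" pair in the trace would correspond to the parser reading bytes at offset 9003008 that are not a decimal number, which is why the inner exception is a NumberFormatException rather than an I/O failure.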