[ https://issues.apache.org/jira/browse/PDFBOX-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405584#comment-15405584 ]
Yauheni Salopiy commented on PDFBOX-3452:
-----------------------------------------

Hi [~tilman],

Thank you for the investigation. Is it possible to make PDFBox more forgiving in such cases? I'm asking because I can open this PDF document with Acrobat Reader DC, though I can confirm that the other PDF readers I tried weren't able to open it.

Thank you in advance!

Best regards,
Yauheni Salopiy

> IOException at org.apache.pdfbox.pdfparser.BaseParser.readStringNumber
> ----------------------------------------------------------------------
>
>                 Key: PDFBOX-3452
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3452
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.2
>            Reporter: Yauheni Salopiy
>              Labels: WK
>         Attachments: 95s-0316-rpt0242-21-appendix-16-f-vol177.pdf, PDFBOX-3452_LOG.txt
>
>
> Apache Tika 1.14-SNAPSHOT (PDF Box 2.0.2) throws following exception on text extraction from valid PDF document:
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6c25e6c4
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>       at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.extractText(DocumentsTextExtractor.java:44)
>       at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.main(DocumentsTextExtractor.java:134)
> Caused by: java.io.IOException: Number '???·???????Wk®)i?v' is getting too long, stop reading at offset 266260
>       at org.apache.pdfbox.pdfparser.BaseParser.readStringNumber(BaseParser.java:1379)
>       at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1341)
>       at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1278)
>       at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:739)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:721)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:652)
>       at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:612)
>       at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:215)
>       at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:840)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:780)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:130)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 6 more
> Please, find failing document and log with StackTrace in attachments.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org