[ https://issues.apache.org/jira/browse/TIKA-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070055#comment-16070055 ]
Nick Burch commented on TIKA-2407:
----------------------------------

You'd be best off reporting this to the Apache PDFBox project, which is the library that Tika uses to process PDF files. That's the right place to get this fixed, or to get a more appropriate error thrown.

You can report it under the PDFBOX project here: https://issues.apache.org/jira/projects/PDFBOX

> Tika crashed while parsing corrupt PDF
> --------------------------------------
>
>                 Key: TIKA-2407
>                 URL: https://issues.apache.org/jira/browse/TIKA-2407
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Jorge Spinsanti
>         Attachments: IOException.pdf
>
>
> Tika throws an exception when trying to parse a corrupt PDF file to extract text
> content (see attached file):
> {code}
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@d71dc5e
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 16 more
> Caused by: java.io.IOException: Error reading stream, expected='endstream' actual='' at offset 116070
>     at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1013)
>     at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:781)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:742)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:673)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:633)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:241)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1132)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1066)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:141)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 23 more
> {code}
> Can you throw a specific exception to allow better error handling? Something
> like BadInputException or WrongFileException?
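
Until a more specific exception exists, a caller could inspect the cause chain of the wrapped exception to tell a corrupt input apart from other failures. The following is only a rough sketch under stated assumptions: the class name `CauseInspector`, both helper methods, and the message-prefix heuristic are hypothetical and not part of Tika or PDFBox, and the PDFBox message text may change between versions. It uses a plain `Exception` as a stand-in for the `TikaException` in the trace so that it runs without any dependencies.

```java
import java.io.IOException;

public class CauseInspector {

    /** Walks the cause chain and returns the first IOException found, or null if there is none. */
    public static IOException findIOException(Throwable t) {
        // Bound the walk so an unusual cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            if (t instanceof IOException) {
                return (IOException) t;
            }
        }
        return null;
    }

    /**
     * Heuristic (an assumption, not a documented contract): treat an IOException whose
     * message starts with "Error reading stream" as a sign of a corrupt input file.
     */
    public static boolean looksLikeCorruptInput(Throwable t) {
        IOException io = findIOException(t);
        return io != null
                && io.getMessage() != null
                && io.getMessage().startsWith("Error reading stream");
    }

    public static void main(String[] args) {
        // Stand-in for the wrapped exception in the report; a real caller would
        // catch the exception thrown by the parser instead of building one here.
        Exception wrapped = new Exception(
                "TIKA-198: Illegal IOException",
                new IOException("Error reading stream, expected='endstream' actual='' at offset 116070"));

        if (looksLikeCorruptInput(wrapped)) {
            System.out.println("Input looks corrupt; skipping file");
        } else {
            System.out.println("Unrelated failure; rethrowing would be appropriate");
        }
    }
}
```

Matching on message text is brittle, which is part of why a dedicated exception type raised by the underlying library would be the cleaner fix.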