[jira] [Commented] (TIKA-617) Series of exceptions from PDFBox
[ https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040819#comment-14040819 ]

Tyler Palsulich commented on TIKA-617:
--------------------------------------

Hi,

Are you still having this issue? Do you have the/a PDF which caused this exception?

Thanks!
Tyler

Series of exceptions from PDFBox

Key: TIKA-617
URL: https://issues.apache.org/jira/browse/TIKA-617
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.10
Reporter: Erik Hetzner

Hi,

I am getting the following exception from PDFBox. Thank you! (If I should file these upstream at PDFBox first, please let me know.)

{noformat}
$ java -jar tika-app-1.0-SNAPSHOT.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf /dev/null
ERROR - Stop reading corrupt stream
INFO - unsupported/disabled operation: f24.481
INFO - unsupported/disabled operation: ree)n.
WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
    at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
INFO - unsupported/disabled operation: i-
INFO - unsupported/disabled operation: R4%
INFO - unsupported/disabled operation: )
INFO - unsupported/disabled operation: Re.8
INFO - unsupported/disabled operation: e.
INFO - unsupported/disabled operation: FE)-
WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
    at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
INFO - unsupported/disabled operation: R3%
INFO - unsupported/disabled operation: T
Exception in thread main org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5809fdee
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at
[jira] [Commented] (TIKA-617) Series of exceptions from PDFBox
[ https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040968#comment-14040968 ]

Erik Hetzner commented on TIKA-617:
-----------------------------------

The URL containing the PDF is listed in the above comment.

Trying it with 1.5 gives different errors and generates an incomplete XML file:

{noformat}
java -jar tika-app-1.5.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf /dev/null
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
Exception in thread main org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:122)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)
Caused by: java.io.IOException
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:336)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:248)
    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:183)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.init(PDFStreamParser.java:107)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:106)
    ... 7 more
Caused by: java.util.zip.DataFormatException: invalid distance too far back
    at java.util.zip.Inflater.inflateBytes(Native Method)
    at java.util.zip.Inflater.inflate(Inflater.java:259)
    at java.util.zip.Inflater.inflate(Inflater.java:280)
    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
    ... 18 more
{noformat}
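For anyone reading the 1.5 trace above who has not met `java.util.zip.DataFormatException` before: it is what the JDK's `Inflater` throws when Flate (zlib) data is damaged. Below is a minimal, JDK-only sketch of decoding such a stream while keeping whatever bytes were recovered before the error. It is not PDFBox's `FlateFilter` code; the class name `FlateProbe`, the `Result` holder, and the method names are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; not part of Tika or PDFBox.
public class FlateProbe {

    /** Bytes recovered so far, plus an error description (null if the stream decoded cleanly). */
    public static final class Result {
        public final byte[] data;
        public final String error;

        Result(byte[] data, String error) {
            this.data = data;
            this.error = error;
        }
    }

    /** Compresses input in zlib format, only to produce test data for the demo. */
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates as much as possible and reports corruption instead of throwing. */
    public static Result inflateLeniently(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        String error = null;
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                out.write(buf, 0, n);
                // No output, not finished, and nothing left to read: the stream was cut short.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    error = "truncated stream";
                    break;
                }
            }
        } catch (DataFormatException e) {
            // Same exception type as the root cause in the trace above.
            error = "corrupt stream: " + e.getMessage();
        } finally {
            inflater.end();
        }
        return new Result(out.toByteArray(), error);
    }

    public static void main(String[] args) {
        byte[] text = "some page content, some page content".getBytes(StandardCharsets.UTF_8);

        Result clean = inflateLeniently(deflate(text));
        System.out.println("clean:   " + clean.data.length + " bytes, error=" + clean.error);

        // Bytes that are not a valid zlib header at all.
        Result garbage = inflateLeniently(new byte[] {0x00, 0x01, 0x02, 0x03});
        System.out.println("garbage: " + garbage.data.length + " bytes, error=" + garbage.error);
    }
}
```

Whether returning partial output is acceptable depends on the caller. For text extraction it can mean a truncated page rather than no page at all, which may or may not be what you want.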
[jira] [Commented] (TIKA-617) Series of exceptions from PDFBox
[ https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041025#comment-14041025 ]

Tim Allison commented on TIKA-617:
----------------------------------

Confirmed still a problem with both classic (sequential) and newer NonSequentialParser in Tika trunk with PDFBox 1.8.6. Please open an issue in PDFBox if you haven't done so already. Thank you!

Found same issue here (although Adobe couldn't read this one either without serious problems): http://digitalcorpora.org/corp/nps/files/govdocs1/898/898385.pdf