[ https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041025#comment-14041025 ]

Tim Allison commented on TIKA-617:
----------------------------------

Confirmed: this is still a problem with both the classic (sequential) parser
and the newer NonSequentialParser in Tika trunk with PDFBox 1.8.6.  Please open
an issue in PDFBox if you haven't done so already.  Thank you!

I found the same issue with the following file (although Adobe couldn't read
this one without serious problems either):
http://digitalcorpora.org/corp/nps/files/govdocs1/898/898385.pdf

> Series of exceptions from PDFBox
> --------------------------------
>
>                 Key: TIKA-617
>                 URL: https://issues.apache.org/jira/browse/TIKA-617
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Erik Hetzner
>
> Hi,
> I am getting the following exception from PDFBox. Thank you!
> (If I should file these upstream at PDFBox first, please let me know.)
> {noformat}
> $ java -jar tika-app-1.0-SNAPSHOT.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf > /dev/null
> ERROR - Stop reading corrupt stream
> INFO - unsupported/disabled operation: f24.481
> INFO - unsupported/disabled operation: ree)n.
> WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
>       at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
>       at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>       at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> INFO - unsupported/disabled operation: i-
> INFO - unsupported/disabled operation: R4%
> INFO - unsupported/disabled operation: )
> INFO - unsupported/disabled operation: Re.8
> INFO - unsupported/disabled operation: e.
> INFO - unsupported/disabled operation: FE)-
> WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
>       at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
>       at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>       at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> INFO - unsupported/disabled operation: R3%
> INFO - unsupported/disabled operation: T
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5809fdee
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> Caused by: java.lang.RuntimeException: java.io.IOException: Error: Expected operator 'ID' actual='I8'
>       at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>       at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       ... 5 more
> Caused by: java.io.IOException: Error: Expected operator 'ID' actual='I8'
>       at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:382)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:175)
>       ... 15 more
> {noformat}
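
For readers unfamiliar with this failure mode: the ClassCastException in the
trace above is consistent with an operator handler casting its first operand
to an array type without checking it first, so that a corrupt content stream
supplying an integer operand fails the cast. The sketch below only illustrates
that pattern. It is not PDFBox code: Operand, IntOperand, ArrayOperand,
uncheckedSize and checkedSize are hypothetical stand-ins, and what
ShowTextGlyph.java:44 actually does is an assumption based on the exception
message alone.

```java
import java.util.Collections;
import java.util.List;

public class OperandCheckSketch {
    // Hypothetical stand-ins for a PDF object hierarchy (not the PDFBox classes).
    public interface Operand {}

    public static final class IntOperand implements Operand {
        final int value;
        public IntOperand(int value) { this.value = value; }
    }

    public static final class ArrayOperand implements Operand {
        final List<Operand> items;
        public ArrayOperand(List<Operand> items) { this.items = items; }
    }

    // Unchecked cast: throws ClassCastException when the first operand
    // is not an ArrayOperand, as can happen with a corrupt stream.
    public static int uncheckedSize(List<Operand> arguments) {
        ArrayOperand array = (ArrayOperand) arguments.get(0);
        return array.items.size();
    }

    // Defensive variant: returns -1 instead of throwing when the first
    // operand is missing or has an unexpected type.
    public static int checkedSize(List<Operand> arguments) {
        if (arguments.isEmpty() || !(arguments.get(0) instanceof ArrayOperand)) {
            return -1;
        }
        return ((ArrayOperand) arguments.get(0)).items.size();
    }

    public static void main(String[] args) {
        // An integer where an array is expected, as in the reported trace.
        List<Operand> corrupt = Collections.<Operand>singletonList(new IntOperand(24));
        try {
            uncheckedSize(corrupt);
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed: " + e.getClass().getSimpleName());
        }
        System.out.println("checked result: " + checkedSize(corrupt));
    }
}
```

Whether skipping a malformed operand is the right behavior for a real parser
is a separate design question; the sketch only shows why the cast fails and
one way a handler could avoid throwing.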



--
This message was sent by Atlassian JIRA
(v6.2#6252)
