[jira] [Commented] (TIKA-617) Series of exceptions from PDFBox

2014-06-23 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040819#comment-14040819
 ] 

Tyler Palsulich commented on TIKA-617:
--

Hi, 

Are you still having this issue? Do you have the PDF (or any PDF) that 
triggers this exception? Thanks!

Tyler

 Series of exceptions from PDFBox
 

 Key: TIKA-617
 URL: https://issues.apache.org/jira/browse/TIKA-617
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Erik Hetzner

 Hi,
 I am getting the following exception from PDFBox. Thank you!
 (If I should file these upstream at PDFBox first, please let me know.)
 {noformat}
 $ java -jar tika-app-1.0-SNAPSHOT.jar 
 http://www.arb.ca.gov/research/apr/past/01-340.pdf > /dev/null
 ERROR - Stop reading corrupt stream
 INFO - unsupported/disabled operation: f24.481
 INFO - unsupported/disabled operation: ree)n.
 WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot 
 be cast to org.apache.pdfbox.cos.COSArray
 java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast 
 to org.apache.pdfbox.cos.COSArray
   at 
 org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
   at 
 org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
   at 
 org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
   at 
 org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
 INFO - unsupported/disabled operation: i-
 INFO - unsupported/disabled operation: R4%
 INFO - unsupported/disabled operation: )
 INFO - unsupported/disabled operation: Re.8
 INFO - unsupported/disabled operation: e.
 INFO - unsupported/disabled operation: FE)-
 WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot 
 be cast to org.apache.pdfbox.cos.COSArray
 java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast 
 to org.apache.pdfbox.cos.COSArray
   at 
 org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
   at 
 org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
   at 
 org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
   at 
 org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
 INFO - unsupported/disabled operation: R3%
 INFO - unsupported/disabled operation: T
 Exception in thread "main" org.apache.tika.exception.TikaException: 
 Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5809fdee
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
   at 

[jira] [Commented] (TIKA-617) Series of exceptions from PDFBox

2014-06-23 Thread Erik Hetzner (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040968#comment-14040968
 ] 

Erik Hetzner commented on TIKA-617:
---

The URL of the PDF is listed in the issue description. Running it with Tika 
1.5 gives different errors and generates an incomplete XML file:

{noformat}
java -jar tika-app-1.5.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf 
> /dev/null
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to 
extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:122)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)
Caused by: java.io.IOException
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:336)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:248)
at 
org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:183)
at 
org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:107)
at 
org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
at 
org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at 
org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at 
org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
at 
org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
at 
org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:106)
... 7 more
Caused by: java.util.zip.DataFormatException: invalid distance too far back
at java.util.zip.Inflater.inflateBytes(Native Method)
at java.util.zip.Inflater.inflate(Inflater.java:259)
at java.util.zip.Inflater.inflate(Inflater.java:280)
at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
... 18 more
{noformat}
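
For context, the DataFormatException at the bottom of this trace is thrown by java.util.zip.Inflater, which FlateFilter calls according to the stack trace. The sketch below is not PDFBox's code. It is a minimal, JDK-only illustration of how a corrupt deflate stream surfaces as a DataFormatException, and of one way a caller might stop reading and keep whatever was decoded up to that point. The class and method names (CorruptFlateDemo, deflate, inflateLeniently) are hypothetical, and the corruption here is a damaged zlib header, which is only one of several ways a stream can be broken.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; it only uses java.util.zip from the JDK.
final class CorruptFlateDemo {

    /** Compresses the input with zlib/deflate. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /**
     * Inflates the input. If the stream is corrupt, logs the error, stops
     * reading, and returns the bytes decoded so far instead of throwing.
     */
    static byte[] inflateLeniently(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated stream: nothing more can be decoded
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            // Same situation as the FlateFilter ERROR lines in the log above.
            System.err.println("stop reading corrupt stream: " + e.getMessage());
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "hello hello hello".getBytes(StandardCharsets.UTF_8);
        byte[] packed = deflate(original);
        System.out.println("intact:  " + inflateLeniently(packed).length + " bytes");

        packed[0] = 0; // damage the zlib header
        System.out.println("corrupt: " + inflateLeniently(packed).length + " bytes");
    }
}
```

Returning partial output is a design choice for the sketch: it matches the "incomplete XML file" behaviour described above, whereas a stricter caller would rethrow the exception.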

[jira] [Commented] (TIKA-617) Series of exceptions from PDFBox

2014-06-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041025#comment-14041025
 ] 

Tim Allison commented on TIKA-617:
--

Confirmed: this is still a problem in Tika trunk with PDFBox 1.8.6, using 
both the classic (sequential) parser and the newer NonSequentialParser. Please 
open an issue in PDFBox if you haven't done so already. Thank you!

I found the same issue with the file below (although Adobe could not read it 
without serious problems either):
http://digitalcorpora.org/corp/nps/files/govdocs1/898/898385.pdf
