[ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636523#comment-15636523
 ] 

Tim Allison commented on TIKA-2153:
-----------------------------------

My initial diagnosis was wrong.  This is not a pre-parse stream exception.  I 
should have read the stacktrace more carefully; sorry about that.


The issue here is that ParsingEmbeddedDocumentExtractor catches only 
TikaExceptions.  An IOException thrown while handling an embedded document 
therefore propagates and causes the overall parse to fail.

This file is handled "correctly" by the RecursiveParserWrapper.  The stacktrace 
for the offending embedded file is stored in the appropriate Metadata object 
(offset 123), and the overall parse succeeds.
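
To illustrate the record-and-continue behavior described above, here is a minimal 
sketch in plain Java.  It does not use the Tika API: EmbeddedParser, 
EXCEPTION_KEY and parseAll are hypothetical names, and a Map stands in for a 
Metadata object.  It only shows the idea of storing the stacktrace of a failing 
embedded file with that file's metadata while the overall parse carries on.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipException;

public class RecordAndContinueSketch {

    /** Hypothetical stand-in for a parser that handles one embedded document. */
    interface EmbeddedParser {
        String parse(String name) throws IOException;
    }

    /** Hypothetical metadata key; not a real Tika property name. */
    static final String EXCEPTION_KEY = "embedded-exception";

    /** Parses every embedded document; a failure is recorded, not rethrown. */
    static List<Map<String, String>> parseAll(List<String> names, EmbeddedParser parser) {
        List<Map<String, String>> results = new ArrayList<>();
        for (String name : names) {
            // One map per embedded document, standing in for a Metadata object.
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("name", name);
            try {
                metadata.put("content", parser.parse(name));
            } catch (IOException e) {
                // Keep the stacktrace with the offending document and move on.
                StringWriter trace = new StringWriter();
                e.printStackTrace(new PrintWriter(trace, true));
                metadata.put(EXCEPTION_KEY, trace.toString());
            }
            results.add(metadata);
        }
        return results;
    }

    public static void main(String[] args) {
        EmbeddedParser parser = name -> {
            if (name.endsWith(".bin")) {
                throw new ZipException("invalid stored block lengths");
            }
            return "text of " + name;
        };
        for (Map<String, String> m : parseAll(Arrays.asList("slide1.xml", "oleObject1.bin"), parser)) {
            System.out.println(m.get("name") + " failed=" + m.containsKey(EXCEPTION_KEY));
        }
    }
}
```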

Some options to handle this:
1) add a catch for IOException in ParsingEmbeddedDocumentExtractor
2) wrap the IOExceptions thrown by MimeTypes.detect() into a TikaException
3) other options?
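
A rough sketch of how options 1 and 2 differ, again in plain Java rather than 
against the real Tika classes.  ParseException is a hypothetical stand-in for 
TikaException, and EmbeddedStep, parseEmbeddedCatchingBoth and detectWrapping 
are made-up names for illustration only; the actual change would have to be 
made in ParsingEmbeddedDocumentExtractor or around MimeTypes.detect().

```java
import java.io.IOException;
import java.util.zip.ZipException;

public class EmbeddedFailureOptions {

    /** Hypothetical stand-in for TikaException. */
    static class ParseException extends Exception {
        ParseException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical stand-in for the detect-and-parse step on one embedded document. */
    interface EmbeddedStep {
        void run() throws IOException, ParseException;
    }

    /**
     * Option 1: the extractor catches IOException as well as the parse
     * exception, so a broken embedded document is skipped instead of
     * aborting the parse of the container.
     */
    static boolean parseEmbeddedCatchingBoth(EmbeddedStep step) {
        try {
            step.run();
            return true;
        } catch (IOException | ParseException e) {
            // A real implementation would log or record the failure here.
            return false;
        }
    }

    /**
     * Option 2: the detection step wraps any IOException in the parse
     * exception, so an extractor that catches only the parse exception
     * still handles it.
     */
    static void detectWrapping(EmbeddedStep detect) throws ParseException {
        try {
            detect.run();
        } catch (IOException e) {
            throw new ParseException("Failed to detect type of embedded document", e);
        }
    }

    public static void main(String[] args) {
        EmbeddedStep broken = () -> {
            throw new ZipException("invalid stored block lengths");
        };

        System.out.println("option 1 parsed: " + parseEmbeddedCatchingBoth(broken));

        try {
            detectWrapping(broken);
        } catch (ParseException e) {
            System.out.println("option 2 wrapped: " + e.getCause().getClass().getName());
        }
    }
}
```

Option 1 keeps the change local to the extractor.  Option 2 changes what callers 
of the detection step see, which could affect code that currently relies on 
catching the IOException directly.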

> TaggedIOException on a valid Powerpoint file
> --------------------------------------------
>
>                 Key: TIKA-2153
>                 URL: https://issues.apache.org/jira/browse/TIKA-2153
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>         Environment: Windows 7 x64, JVM 1.8.0_101
>            Reporter: Seva Alekseyev
>         Attachments: tika_2153_unzipping.png
>
>
> On the following Powerpoint file, which opens fine with Powerpoint:
> https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
> the Tika parser throws the following error:
> org.apache.tika.io.TaggedIOException: invalid stored block lengths
>       at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>       at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
>       at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
>       at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
>       at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>       at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>       at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
>       at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>       at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>       at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>       at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>       at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>       at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>       at java.io.FilterInputStream.read(FilterInputStream.java:107)
>       at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
>       ... 13 more
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>       at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>       at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>       at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>       at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>       at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>       ... 19 more
> Could be similar to #2130.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
