[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-11-04 Thread Sharath Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15635502#comment-15635502
 ] 

Sharath Kumar commented on TIKA-2146:
-

What would be the action plan for this? Is this going to be supported in Tika or not?

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 1.11
> Environment: Windows 7
> Reporter: Sharath Kumar
> Attachments: Test bug.doc, This is password protected.doc
>
>
> When I try to parse a protected MS Word document, I am unable to extract the
> content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>   at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-11-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636015#comment-15636015
 ] 

Tim Allison commented on TIKA-2146:
---

bq. Sadly that means, unless someone volunteers to add the support to POI, that 
having the password won't actually help...

I think Nick summed it up.  If you can open an issue on POI and find someone to 
add the support, then it will happen.  I regret that we can't fix this at the 
Tika level, and I'd look into it at the POI level, but this is an area that is 
beyond my comfort zone.


[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-11-04 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636102#comment-15636102
 ] 

Nick Burch commented on TIKA-2146:
--

My guess is it's about 2-3 weeks of work at the POI level to add support for 
this. Unless you've got a handy intern or some budget, it looks unlikely it'll 
be fixed soon...

However, it's probably only 2-3 hours of work reading through the published 
.DOC file format specs from Microsoft to find out how encrypted word documents 
are marked as such in the file. You probably want 
https://msdn.microsoft.com/en-us/library/office/gg615596(v=office.14).aspx then 
https://msdn.microsoft.com/en-us/library/office/cc313153(v=office.12).aspx . 
Once someone has found that out, it's only a few minutes' work to add the check 
and throw a more helpful exception.
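
A rough sketch of what such a check might look like follows. It rests on an assumption that should be verified against the spec linked above: that in the Word 97+ binary format the FibBase header at the start of the WordDocument stream carries a 16-bit flag word at byte offset 10, and that bit 0x0100 of that word is fEncrypted. The names DocEncryptionCheck and EncryptedDocException are hypothetical and are not POI or Tika classes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper, not part of POI or Tika. */
final class DocEncryptionCheck {
    // Assumed FibBase layout (to be checked against [MS-DOC]):
    // wIdent at offset 0, 16-bit flag word at offset 10, little-endian.
    static final int WORD_MAGIC = 0xA5EC;
    static final int FLAGS_OFFSET = 10;
    static final int F_ENCRYPTED = 0x0100;

    /** Hypothetical exception type, used here only for illustration. */
    static final class EncryptedDocException extends Exception {
        EncryptedDocException(String message) {
            super(message);
        }
    }

    /** @param wordDocumentStream the leading bytes of the "WordDocument" OLE2 stream */
    static boolean isEncrypted(byte[] wordDocumentStream) {
        if (wordDocumentStream == null || wordDocumentStream.length < FLAGS_OFFSET + 2) {
            throw new IllegalArgumentException("too short to hold a FibBase header");
        }
        ByteBuffer buf = ByteBuffer.wrap(wordDocumentStream).order(ByteOrder.LITTLE_ENDIAN);
        int wIdent = buf.getShort(0) & 0xFFFF;
        if (wIdent != WORD_MAGIC) {
            throw new IllegalArgumentException("not a Word 97+ binary header");
        }
        int flags = buf.getShort(FLAGS_OFFSET) & 0xFFFF;
        return (flags & F_ENCRYPTED) != 0;
    }

    /** Fails early with a readable message instead of a later array index error. */
    static void requireReadable(byte[] wordDocumentStream) throws EncryptedDocException {
        if (isEncrypted(wordDocumentStream)) {
            throw new EncryptedDocException(
                    "Document is encrypted or password protected; text cannot be extracted");
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] header = new byte[32];
        header[0] = (byte) 0xEC; // wIdent low byte
        header[1] = (byte) 0xA5; // wIdent high byte
        header[11] = 0x01;       // sets bit 0x0100 of the flag word
        System.out.println("encrypted: " + isEncrypted(header));
    }
}
```

In POI itself the flag would more likely be read where the FIB is already parsed, before SectionTable is constructed; the standalone byte-array version here only illustrates the idea.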


[jira] [Created] (TIKA-2159) Handle pre-parse embedded object exceptions uniformly and more robustly

2016-11-04 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2159:
-

 Summary: Handle pre-parse embedded object exceptions uniformly and more robustly
 Key: TIKA-2159
 URL: https://issues.apache.org/jira/browse/TIKA-2159
 Project: Tika
 Issue Type: Bug
 Components: parser
 Reporter: Tim Allison
 Priority: Minor


When an embedded document is parsed and causes an exception, we're currently 
catching that and swallowing it in ParsingEmbeddedDocumentExtractor (the 
default) or reporting it in the RecursiveParserWrapper by storing the 
stacktrace in the Metadata of the embedded document.

However, if there's an exception during detection on the embedded stream or on 
getting the stream _before_ the stream hits the parser, we aren't handling that 
uniformly or robustly across parsers.
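
One way to picture "uniform" handling is a single wrapper that covers all three stages (getting the stream, detection, parsing) and records which stage failed. The sketch below is only an illustration under stated assumptions: EmbeddedStages, StreamSource, StreamHandler, and the two metadata keys are hypothetical names, and a plain Map stands in for Tika's Metadata object.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; none of these types exist in Tika. */
final class EmbeddedStages {
    // Hypothetical metadata keys, chosen for this example only.
    static final String STAGE_KEY = "embedded.exception.stage";
    static final String TRACE_KEY = "embedded.exception.trace";

    /** Supplies the embedded stream; may fail before any parser is involved. */
    interface StreamSource {
        InputStream open() throws IOException;
    }

    /** Stand-in for a detection step or a parse step. */
    interface StreamHandler {
        void handle(InputStream in) throws Exception;
    }

    /**
     * Runs open, detect, and parse under one catch block so that a failure at
     * any stage is recorded the same way and never aborts the container parse.
     */
    static boolean process(StreamSource source, StreamHandler detector,
                           StreamHandler parser, Map<String, String> metadata) {
        String stage = "open";
        try (InputStream in = source.open()) {
            stage = "detect";
            detector.handle(in);
            stage = "parse";
            parser.handle(in);
            return true;
        } catch (Exception e) {
            StringWriter trace = new StringWriter();
            e.printStackTrace(new PrintWriter(trace, true));
            metadata.put(STAGE_KEY, stage);
            metadata.put(TRACE_KEY, trace.toString());
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        boolean ok = process(
                () -> { throw new IOException("invalid stored block lengths"); },
                in -> { },
                in -> { },
                metadata);
        System.out.println(ok + " " + metadata.get(STAGE_KEY));
    }
}
```

Recording the stage alongside the stack trace is a design choice made for this sketch; it lets a caller distinguish a broken container entry from a parser bug without reading the trace.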





[jira] [Commented] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636308#comment-15636308
 ] 

Tim Allison commented on TIKA-2153:
---

This is one instance of a larger problem that we should try to handle more 
uniformly and robustly.

> TaggedIOException on a valid Powerpoint file
> 
>
> Key: TIKA-2153
> URL: https://issues.apache.org/jira/browse/TIKA-2153
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: tika_2153_unzipping.png
>
>
> On the following Powerpoint file, which opens fine with Powerpoint:
> https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
> the Tika parser throws the following error:
> org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
>   at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
>   at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
>   at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>   at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>   at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
>   ... 13 more
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 19 more
> Could be similar to #2130.





[jira] [Created] (TIKA-2160) POIXMLException from NullPointerException on a valid Word file

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2160:


 Summary: POIXMLException from NullPointerException on a valid Word file
 Key: TIKA-2160
 URL: https://issues.apache.org/jira/browse/TIKA-2160
 Project: Tika
 Issue Type: Bug
 Components: parser
 Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
 Reporter: Seva Alekseyev


On the attached Word file, which opens fine with Word (albeit with no text), 
the Tika parser throws the following error:

org.apache.poi.POIXMLException: java.lang.NullPointerException
    at org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:130)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:208)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
    at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: java.lang.NullPointerException
    at org.apache.poi.xwpf.usermodel.AbstractXWPFSDT.<init>(AbstractXWPFSDT.java:37)
    at org.apache.poi.xwpf.usermodel.XWPFSDT.<init>(XWPFSDT.java:38)
    at org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:124)
    ... 9 more






Re: tika-python

2016-11-04 Thread Mattmann, Chris A (3010)
Thanks Jörg, I appreciate it.
I’ll check out:

https://github.com/TalmarGrosskotz/teacher-shelf.git

And get back to you.

++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-502
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 

On 11/4/16, 3:52 AM, "Jörg Bilert"  wrote:

Wow,

thank you for your quick answer. I think I might have given you the wrong 
impression. I think your code would work perfectly if I knew how to use 
it properly. :)

Nevertheless, I added a link to my GitHub repository to give you a short 
impression of the project I am planning and the first steps in Python 3 
and Tika I have taken so far.

I would be glad for any help you could give me on how to use the 
different parsers (or the parser for different filetypes).

Thank you in advance,

Jörg


On 04.11.2016 at 04:34, Mattmann, Chris A (3010) wrote:
> Dear Jorg,
>
> Thank you much for sending this. I have been meaning to reply to your prior
> emails on the same subject. Yes it will work for other file types. Can you give
> me an example file and upload it in a Github issue of a file it’s not working for?
> I can take a look.
>
> Cheers,
> Chris
>
>
>
> On 11/3/16, 5:15 PM, "Jörg Bilert"  wrote:
>
>  Hello Mr Mattmann,
>  
>  I have just been looking into your Python wrapper for Tika and I like
>  it a lot.
>  But there is one thing I just don't see. According to the Apache Tika
>  website, Tika supports a lot of file formats (even audio and video). But I
>  don't know how to parse them in Python. ODT and PDF work fine, as in
>  the sample code on your GitHub page.
>  
>  Could you give me a clue where to start to handle other file types?
>  
>  Yours, Jörg Bilert
>  
>





[jira] [Updated] (TIKA-2160) POIXMLException from NullPointerException on a valid Word file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2160:
-
Attachment: test_16022016081053.docx






[jira] [Commented] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636523#comment-15636523
 ] 

Tim Allison commented on TIKA-2153:
---

My initial diagnosis was wrong.  This is not a pre-parse stream exception.  If 
I had read the stacktrace more carefully...doh, and sorry.  


The issue here is that we're catching only TikaExceptions in the 
ParsingEmbeddedDocumentExtractor.  IOExceptions of embedded documents are 
causing the overall parse to fail.

This file is handled "correctly" by the RecursiveParserWrapper.  The stacktrace 
for the offending embedded file is stored in the appropriate Metadata object 
(offset 123), and the overall parse succeeds.

Some options to handle this:
1) add a catch for IOException in ParsingEmbeddedDocumentExtractor
2) wrap the IOExceptions thrown by MimeTypes.detect() into a TikaException
3) other options?
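
To make option 2 concrete, here is a minimal self-contained sketch. It does not use Tika classes: ParseFailure is a hypothetical stand-in for TikaException, Detector for the detection call, and parseAll for an extractor loop that (like the current behaviour described above) catches only the domain exception. Wrapping the IOException at the detection boundary lets that existing catch handle it, so one bad embedded document no longer fails the whole parse.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; these are not Tika classes. */
final class EmbeddedLoop {
    /** Stand-in for TikaException. */
    static final class ParseFailure extends Exception {
        ParseFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Stand-in for a detector that reads the stream and may hit an IOException. */
    interface Detector {
        String detect(byte[] data) throws IOException;
    }

    /** Option 2: translate the IOException into the domain exception at the boundary. */
    static String detectOrWrap(Detector detector, byte[] data) throws ParseFailure {
        try {
            return detector.detect(data);
        } catch (IOException e) {
            throw new ParseFailure("detection failed on embedded document", e);
        }
    }

    /** Container loop that catches only ParseFailure, yet survives detection errors. */
    static List<String> parseAll(Detector detector, List<byte[]> embedded,
                                 List<Throwable> failures) {
        List<String> types = new ArrayList<>();
        for (byte[] doc : embedded) {
            try {
                types.add(detectOrWrap(detector, doc));
            } catch (ParseFailure e) {
                failures.add(e.getCause()); // keep the original IOException for reporting
            }
        }
        return types;
    }

    public static void main(String[] args) {
        Detector detector = data -> {
            if (data.length == 0) {
                throw new IOException("invalid stored block lengths");
            }
            return "application/octet-stream";
        };
        List<Throwable> failures = new ArrayList<>();
        List<String> types = parseAll(
                detector, Arrays.asList(new byte[] {1}, new byte[0], new byte[] {2}), failures);
        System.out.println(types.size() + " detected, " + failures.size() + " failed");
    }
}
```

Option 1 would instead add a second catch clause for IOException inside the loop. The trade-off is that option 1 also hides genuine I/O failures on the container stream, while option 2 only converts errors raised at the one call site that is wrapped.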






[jira] [Commented] (TIKA-2160) POIXMLException from NullPointerException on a valid Word file

2016-11-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636607#comment-15636607
 ] 

Tim Allison commented on TIKA-2160:
---

Opened, fixed and resolved: https://bz.apache.org/bugzilla/show_bug.cgi?id=60341






[jira] [Commented] (TIKA-2158) NullPointerException on a valid Word file

2016-11-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636669#comment-15636669
 ] 

Tim Allison commented on TIKA-2158:
---

Opened, fixed, resolved: https://bz.apache.org/bugzilla/show_bug.cgi?id=60342

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2158
> URL: https://issues.apache.org/jira/browse/TIKA-2158
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: RTOP_Template01112015063856.docx
>
>
> On the attached Word file, which opens fine with Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at 
> org.apache.poi.xwpf.usermodel.XWPFSDTContentCell.<init>(XWPFSDTContentCell.java:49)
>   at org.apache.poi.xwpf.usermodel.XWPFSDTCell.<init>(XWPFSDTCell.java:35)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFTableRow.getTableICells(XWPFTableRow.java:147)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:359)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:111)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Resolved] (TIKA-2157) HSLFException on a valid Powerpoint file

2016-11-04 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2157.
---
Resolution: Fixed

> HSLFException on a valid Powerpoint file
> 
>
> Key: TIKA-2157
> URL: https://issues.apache.org/jira/browse/TIKA-2157
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: CRADA 2-09 K Subbarao.ppt
>
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
> parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> incorrect data check
>   at org.apache.poi.hslf.blip.PICT.getData(PICT.java:120)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: incorrect data check
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.PICT.read(PICT.java:133)
>   at org.apache.poi.hslf.blip.PICT.getData(PICT.java:116)
>   ... 6 more





[jira] [Created] (TIKA-2161) TaggedIOException from EOFException on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2161:


 Summary: TaggedIOException from EOFException on a valid Powerpoint 
file
 Key: TIKA-2161
 URL: https://issues.apache.org/jira/browse/TIKA-2161
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
parser throws the following error:

org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at java.nio.file.Files.copy(Files.java:2908)
at java.nio.file.Files.copy(Files.java:3027)
at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at 
org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 22 more





[jira] [Updated] (TIKA-2161) TaggedIOException from EOFException on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2161:
-
Attachment: Erik-LymeChipBranchSeminar.ppt

> TaggedIOException from EOFException on a valid Powerpoint file
> --
>
> Key: TIKA-2161
> URL: https://issues.apache.org/jira/browse/TIKA-2161
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Erik-LymeChipBranchSeminar.ppt
>
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
> parser throws the following error:
> org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at java.nio.file.Files.copy(Files.java:2908)
>   at java.nio.file.Files.copy(Files.java:3027)
>   at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
>   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
>   at 
> org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
>   at 
> org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
>   at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
>   at 
> org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 22 more





[jira] [Commented] (TIKA-2155) IndexOutOfBoundsException on a valid Excel file

2016-11-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636855#comment-15636855
 ] 

Tim Allison commented on TIKA-2155:
---

Opened: https://bz.apache.org/bugzilla/show_bug.cgi?id=60343

> IndexOutOfBoundsException on a valid Excel file
> ---
>
> Key: TIKA-2155
> URL: https://issues.apache.org/jira/browse/TIKA-2155
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Copy of [corrupted Unicode text].xlsx
>
>
> On the attached Excel file, which opens fine with Excel, the Tika parser 
> throws the following error:
> java.lang.IndexOutOfBoundsException: Index: 65535, Size: 251
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> org.apache.poi.xssf.model.StylesTable.getStyleAt(StylesTable.java:421)
>   at 
> org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.startElement(XSSFSheetXMLHandler.java:281)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.startElement(XSSFExcelExtractorDecorator.java:345)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:509)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractXMLDocumentParser.emptyElement(AbstractXMLDocumentParser.java:182)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:356)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2786)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:195)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
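
The trace above shows the sheet XML referencing style index 65535 while the styles list holds only 251 entries, so the unchecked ArrayList.get call fails. One generic way to tolerate such a reference is a bounds-checked lookup that falls back to a default. The sketch below shows that pattern in isolation. It is an assumption about one possible approach, not the fix tracked in the linked POI bug, and the names SafeStyleLookup and getOrDefault are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the POI or Tika implementation.
class SafeStyleLookup {

    // Returns the element at index, or fallback when the index is out of range.
    static <T> T getOrDefault(List<T> styles, int index, T fallback) {
        if (index < 0 || index >= styles.size()) {
            return fallback;
        }
        return styles.get(index);
    }

    public static void main(String[] args) {
        List<String> styles = Arrays.asList("General", "0.00", "d-mmm-yy");
        // An in-range index returns the stored style.
        System.out.println(getOrDefault(styles, 1, "General"));
        // An out-of-range index such as 65535 returns the fallback instead of throwing.
        System.out.println(getOrDefault(styles, 65535, "General"));
    }
}
```

Whether silently falling back is acceptable depends on the caller; a parser might prefer to log the bad index so the malformed file can still be diagnosed.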





[jira] [Updated] (TIKA-2162) "Unknown compression method" on a Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2162:
-
Attachment: DECAY.ppt

> "Unknown compression method" on a Powerpoint file
> -
>
> Key: TIKA-2162
> URL: https://issues.apache.org/jira/browse/TIKA-2162
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: DECAY.ppt
>
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
> parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> unknown compression method
>   at org.apache.poi.hslf.blip.EMF.getData(EMF.java:91)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: unknown compression method
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.EMF.getData(EMF.java:85)
>   ... 6 more





[jira] [Created] (TIKA-2162) "Unknown compression method" on a Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2162:


 Summary: "Unknown compression method" on a Powerpoint file
 Key: TIKA-2162
 URL: https://issues.apache.org/jira/browse/TIKA-2162
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: DECAY.ppt

On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
unknown compression method
at org.apache.poi.hslf.blip.EMF.getData(EMF.java:91)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: unknown compression method
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.EMF.getData(EMF.java:85)
... 6 more





[jira] [Created] (TIKA-2163) POIXMLException from ClassCastException on a valid Word template

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2163:


 Summary: POIXMLException from ClassCastException on a valid Word 
template
 Key: TIKA-2163
 URL: https://issues.apache.org/jira/browse/TIKA-2163
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: ChronologicalResume.dotx

On the attached Word template, which opens fine with Word, the Tika parser 
throws the following error:

org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
at 
org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:65)
at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:601)
at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:613)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:156)
at 
org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
at 
org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
at 
org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at 
org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:57)
at 
org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:60)
... 10 more
Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart 
cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
at 
org.apache.poi.xwpf.usermodel.XWPFHeaderFooter.<init>(XWPFHeaderFooter.java:74)
at org.apache.poi.xwpf.usermodel.XWPFHeader.<init>(XWPFHeader.java:54)
... 16 more





[jira] [Updated] (TIKA-2163) POIXMLException from ClassCastException on a valid Word template

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2163:
-
Attachment: ChronologicalResume.dotx

> POIXMLException from ClassCastException on a valid Word template
> 
>
> Key: TIKA-2163
> URL: https://issues.apache.org/jira/browse/TIKA-2163
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: ChronologicalResume.dotx
>
>
> On the attached Word template, which opens fine with Word, the Tika parser 
> throws the following error:
> org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
>   at 
> org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:65)
>   at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:601)
>   at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:613)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:156)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:57)
>   at 
> org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:60)
>   ... 10 more
> Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart 
> cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>   at 
> org.apache.poi.xwpf.usermodel.XWPFHeaderFooter.<init>(XWPFHeaderFooter.java:74)
>   at org.apache.poi.xwpf.usermodel.XWPFHeader.<init>(XWPFHeader.java:54)
>   ... 16 more





[jira] [Commented] (TIKA-2157) HSLFException on a valid Powerpoint file

2016-11-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636973#comment-15636973
 ] 

Hudson commented on TIKA-2157:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1132 (See 
[https://builds.apache.org/job/Tika-trunk/1132/])
TIKA-2157 -- handle zip exception in embedded stream (tallison: rev 
75fa1386b95ccf1bc7fb9a9f60811636baace05e)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
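
The commit above is described only as handling a zip exception in an embedded stream. As a rough illustration of that general idea, and not the actual HSLFExtractor change, the sketch below uses only java.util.zip: each compressed blob is inflated independently, and a failure on one of them (a damaged checksum or a truncated stream) is recorded as a warning instead of aborting the rest. The class and method names (TolerantEmbeddedStreams, inflateAll, deflate) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical illustration; not the Tika or POI implementation.
class TolerantEmbeddedStreams {

    // Inflates one zlib-compressed blob. Bad data surfaces as an IOException
    // (ZipException for corrupt data, EOFException for truncated data).
    static byte[] inflate(byte[] compressed) throws IOException {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    // Inflates every blob, skipping (and noting) the ones that fail.
    static List<byte[]> inflateAll(List<byte[]> blobs, List<String> warnings) {
        List<byte[]> results = new ArrayList<>();
        for (int i = 0; i < blobs.size(); i++) {
            try {
                results.add(inflate(blobs.get(i)));
            } catch (IOException e) {
                warnings.add("embedded stream " + i + " skipped: " + e);
            }
        }
        return results;
    }

    // Helper for the demo: zlib-compresses a byte array.
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] good = deflate("slide picture bytes".getBytes(StandardCharsets.UTF_8));
        byte[] corrupt = good.clone();
        corrupt[corrupt.length - 1] ^= 0x55;  // damage the trailing checksum
        byte[] truncated = Arrays.copyOf(good, good.length / 2);

        List<String> warnings = new ArrayList<>();
        List<byte[]> ok = inflateAll(Arrays.asList(good, corrupt, truncated), warnings);
        System.out.println(ok.size() + " stream(s) extracted, " + warnings.size() + " skipped");
        warnings.forEach(System.out::println);
    }
}
```

Catching per stream keeps the text and the intact pictures available even when a single embedded blob is unreadable, which matches the reports in this thread of files that open fine in Powerpoint.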


> HSLFException on a valid Powerpoint file
> 
>
> Key: TIKA-2157
> URL: https://issues.apache.org/jira/browse/TIKA-2157
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: CRADA 2-09 K Subbarao.ppt
>
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
> parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> incorrect data check
>   at org.apache.poi.hslf.blip.PICT.getData(PICT.java:120)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: incorrect data check
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.PICT.read(PICT.java:133)
>   at org.apache.poi.hslf.blip.PICT.getData(PICT.java:116)
>   ... 6 more





[jira] [Commented] (TIKA-2163) POIXMLException from ClassCastException on a valid Word template

2016-11-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636985#comment-15636985
 ] 

Tim Allison commented on TIKA-2163:
---

We need to figure out how to process glossaryDocument relationships 
correctly...  

Thank you for opening this issue.
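
For context, an OOXML file is a zip package whose parts are linked by relationship files. As a hedged sketch of where a glossary document shows up, the code below scans word/_rels/document.xml.rels for a relationship whose Type ends in /glossaryDocument and returns its Target. It uses only the JDK. The class name, the sample package built in samplePackage, and the assumption that the relationships part sits at that path are illustrative, and this is not how POI or Tika handle the issue.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration; not the POI or Tika implementation.
class GlossaryRelationshipFinder {

    // Assumed location of the main document's relationships part.
    static final String RELS_PART = "word/_rels/document.xml.rels";

    // Returns the Target of every relationship whose Type ends in "/glossaryDocument".
    static List<String> findGlossaryTargets(byte[] ooxmlPackage) {
        List<String> targets = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(ooxmlPackage))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!RELS_PART.equals(entry.getName())) {
                    continue;
                }
                // Copy the entry first so the XML parser cannot close the zip stream.
                ByteArrayOutputStream xml = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = zip.read(buf)) != -1) {
                    xml.write(buf, 0, n);
                }
                // Code handling untrusted files would also harden the parser (e.g. disable DTDs).
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml.toByteArray()));
                NodeList rels = doc.getElementsByTagNameNS("*", "Relationship");
                for (int i = 0; i < rels.getLength(); i++) {
                    Element rel = (Element) rels.item(i);
                    if (rel.getAttribute("Type").endsWith("/glossaryDocument")) {
                        targets.add(rel.getAttribute("Target"));
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not read package", e);
        }
        return targets;
    }

    // Builds a tiny in-memory package containing only a relationships part (sample data).
    static byte[] samplePackage() {
        String base = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/";
        String rels = "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
                + "<Relationship Id=\"rId1\" Type=\"" + base + "header\" Target=\"header1.xml\"/>"
                + "<Relationship Id=\"rId2\" Type=\"" + base + "glossaryDocument\" Target=\"glossary/document.xml\"/>"
                + "</Relationships>";
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bos)) {
            zip.putNextEntry(new ZipEntry(RELS_PART));
            zip.write(rels.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(findGlossaryTargets(samplePackage()));
    }
}
```

Running a check like this against a failing template is one way to confirm that the file actually carries a glossary part before digging into the parser itself.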

> POIXMLException from ClassCastException on a valid Word template
> 
>
> Key: TIKA-2163
> URL: https://issues.apache.org/jira/browse/TIKA-2163
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: ChronologicalResume.dotx
>
>
> On the attached Word template, which opens fine with Word, the Tika parser 
> throws the following error:
> org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
>   at 
> org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:65)
>   at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:601)
>   at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:613)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:156)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:57)
>   at 
> org.apache.poi.POIXMLFactory.createDocumentPart(POIXMLFactory.java:60)
>   ... 10 more
> Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart 
> cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>   at 
> org.apache.poi.xwpf.usermodel.XWPFHeaderFooter.<init>(XWPFHeaderFooter.java:74)
>   at org.apache.poi.xwpf.usermodel.XWPFHeader.<init>(XWPFHeader.java:54)
>   ... 16 more





[jira] [Commented] (TIKA-2157) HSLFException on a valid Powerpoint file

2016-11-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637017#comment-15637017
 ] 

Hudson commented on TIKA-2157:
--

SUCCESS: Integrated in Jenkins build tika-2.x #170 (See 
[https://builds.apache.org/job/tika-2.x/170/])
TIKA-2157 - handle zip exception in embedded file (tallison: rev 
2d5189186668166cdc7109e9096f9c6f4dcc5e6b)
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java


> HSLFException on a valid Powerpoint file
> 
>
> Key: TIKA-2157
> URL: https://issues.apache.org/jira/browse/TIKA-2157
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: CRADA 2-09 K Subbarao.ppt
>
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
> parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> incorrect data check
>   at org.apache.poi.hslf.blip.PICT.getData(PICT.java:120)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: incorrect data check
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.PICT.read(PICT.java:133)
>   at org.apache.poi.hslf.blip.PICT.getData(PICT.java:116)
>   ... 6 more





[jira] [Created] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2164:


 Summary: HSLFException from ZipException "invalid stored block 
lengths" on a valid Powerpoint file
 Key: TIKA-2164
 URL: https://issues.apache.org/jira/browse/TIKA-2164
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

  was:
On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Attachment: Research Forum 2013.3.ppt

> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Research Forum 2013.3.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the other file emits a similar error "invalid block type".

  was:
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Research Forum 2013.3.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the other file emits a similar error "invalid block type".





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached file emits a similar error "invalid block type".

  was:
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the other file emits a similar error "invalid block type".


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Research Forum 2013.3.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached file emits a similar error "invalid block type".





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Attachment: Jankovic final Retreat 2002.PPT

> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Research Forum 2013.3.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached file emits a similar error "invalid block type".





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.

  was:
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached file emits a similar error "invalid block type".


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Research Forum 2013.3.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Attachment: paperfigures.ppt

> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Research Forum 2013.3.ppt,
> paperfigures.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". 
Something is wrong with ZIP in Powerpoints.

  was:
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Research Forum 2013.3.ppt,
> paperfigures.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.
> EDIT3: the attached "paperfigures" file emits "invalid distance too far 
> back". Something is wrong with ZIP in Powerpoints.
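
The failures listed above all surface from java.util.zip.InflaterInputStream, which the trace shows being read inside WMF.getData. As a standalone illustration only (this is not Tika or POI code, and the names InflateProbe, deflate, and inflateOrNull are hypothetical), the sketch below shows how the JDK inflater reacts to a valid zlib stream, to bytes that are not zlib data, and to a truncated stream. It says nothing about why the attached files fail; it only shows where this class of exception comes from and one way a caller might contain it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper class for illustration; not part of Tika or POI.
final class InflateProbe {

    /** Compresses the given bytes into a zlib stream using the JDK deflater. */
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        } catch (IOException e) {
            // Writing to an in-memory buffer is not expected to fail.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /**
     * Inflates a complete zlib stream. Returns null when the data is corrupt
     * or truncated, instead of letting the exception abort the caller.
     */
    static byte[] inflateOrNull(byte[] compressed) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            // ZipException (bad data) and EOFException (truncated stream)
            // are both subclasses of IOException.
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] data = "picture bytes picture bytes picture bytes".getBytes(StandardCharsets.UTF_8);
        byte[] packed = deflate(data);
        byte[] garbage = "not zlib".getBytes(StandardCharsets.US_ASCII);
        byte[] truncated = Arrays.copyOf(packed, packed.length / 2);

        System.out.println("valid stream inflated: " + (inflateOrNull(packed) != null));
        System.out.println("garbage inflated:      " + (inflateOrNull(garbage) != null));
        System.out.println("truncated inflated:    " + (inflateOrNull(truncated) != null));
    }
}
```

A caller that extracts several embedded pictures could use a wrapper of this shape to skip the one unreadable picture and continue with the rest of the document, rather than failing the whole parse.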





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". 
Something is wrong with ZIP in Powerpoints.
EDIT4: in "Lab meeting", it's "Unexpected end of ZLIB input stream"

  was:
On the following Powerpoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine with Powerpoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
invalid stored block lengths
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". 
Something is wrong with ZIP in Powerpoints.


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Lab Meeting.ppt, 
> Research Forum 2013.3.ppt, paperfigures.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.
> EDIT3: the attached "paperfigures" file emits "invalid distance too far 
> back". Something is wrong with ZIP in Powerpoints.
> EDIT4: in "Lab meeting", it's "Unexpected end o

[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Attachment: Lab Meeting.ppt

> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Lab Meeting.ppt, 
> Research Forum 2013.3.ppt, paperfigures.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.
> EDIT3: the attached "paperfigures" file emits "invalid distance too far 
> back". Something is wrong with ZIP in Powerpoints.





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Attachment: suba.ppt

> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Lab Meeting.ppt, 
> Research Forum 2013.3.ppt, paperfigures.ppt, suba.ppt
>
>
> On the following Powerpoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine with Powerpoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
> invalid stored block lengths
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>   ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.
> EDIT3: the attached "paperfigures" file emits "invalid distance too far 
> back". Something is wrong with ZIP in Powerpoints.
> EDIT4: in "Lab meeting", it's "Unexpected end of ZLIB input stream"
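
Since the HSLFException in the trace wraps the underlying java.util.zip exception, it may be easier to group these reports by the innermost message ("invalid stored block lengths", "invalid block type", and so on) than by the outer exception. A small helper for that, offered as a sketch only; CauseChain and rootCause are hypothetical names, not part of Tika or POI:

```java
import java.util.zip.ZipException;

// Hypothetical helper class for illustration; not part of Tika or POI.
final class CauseChain {

    /** Follows getCause() until the innermost throwable is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for a parser exception wrapping a zlib failure.
        Throwable inner = new ZipException("invalid stored block lengths");
        Throwable outer = new RuntimeException("wrapper", inner);

        Throwable root = rootCause(outer);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

Logging the root cause class and message for each failing file gives one line per file that can be compared directly across the attachments.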





[jira] [Updated] (TIKA-2164) HSLFException from ZipException "invalid stored block lengths" on a valid Powerpoint file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2164:
-
Description: 
On the following PowerPoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine in PowerPoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more

EDIT: the attached "Research Forum 2013.3" file produces a similar error, "invalid block type".
EDIT2: the attached "Jankovic final Retreat 2002" file produces a similar "invalid literal/length code" error.
EDIT3: the attached "paperfigures" file produces "invalid distance too far back". All of these messages come from zlib (java.util.zip) decompression, so something is going wrong when inflating compressed data inside PowerPoint files.
EDIT4: for "Lab Meeting", the message is "Unexpected end of ZLIB input stream".
The attached "suba" file fails with the same "invalid distance too far back" message, but wrapped in a different exception.

  was:
On the following PowerPoint file:

https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt

which opens fine in PowerPoint, the Tika parser throws the following error:

org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
    ... 6 more

EDIT: the attached "Research forum" file emits a similar error "invalid block 
type".
EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar "invalid 
literal/length code" error.
EDIT3: the attached "paperfigures" file emits "invalid distance too far back". 
Something is wrong with ZIP in Powerpoints.
EDIT4: in "Lab meeting", it's "Unexpected end of ZLIB input stream"


> HSLFException from ZipException "invalid stored block lengths" on a valid 
> Powerpoint file
> -
>
> Key: TIKA-2164
> URL: https://issues.apache.org/jira/browse/TIKA-2164
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jankovic final Retreat 2002.PPT, Lab Meeting.ppt, 
> Research Forum 2013.3.ppt, paperfigures.ppt, suba.ppt
>
>
> On the following PowerPoint file:
> https://dl.dropboxusercontent.com/u/92341073/TCM%202012_DR_5.ppt
> which opens fine in PowerPoint, the Tika parser throws the following error:
> org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: invalid stored block lengths
>     at org.apache.poi.hslf.blip.WMF.getData(WMF.java:64)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:324)
>     at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>     at java.io.FilterInputStream.read(FilterInputStream.java:107)
>     at org.apache.poi.hslf.blip.WMF.getData(WMF.java:58)
>     ... 6 more
> EDIT: the attached "Research forum" file emits a similar error "invalid block 
> type".
> EDIT2: the attached "Jankovic final Retreat 2002" file emits a similar 
> "invalid literal/length code" error.
> E

[jira] [Created] (TIKA-2165) NegativeArraySizeException on a valid Word file

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2165:


 Summary: NegativeArraySizeException on a valid Word file
 Key: TIKA-2165
 URL: https://issues.apache.org/jira/browse/TIKA-2165
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached file, which opens with Word, the Tika parser throws an error:

java.lang.NegativeArraySizeException
    at org.apache.poi.hwpf.model.Ffn.<init>(Ffn.java:79)
    at org.apache.poi.hwpf.model.FontTable.<init>(FontTable.java:66)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:344)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)





[jira] [Updated] (TIKA-2165) NegativeArraySizeException on a valid Word file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2165:
-
Attachment: file108.doc

> NegativeArraySizeException on a valid Word file
> ---
>
> Key: TIKA-2165
> URL: https://issues.apache.org/jira/browse/TIKA-2165
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: file108.doc
>
>
> On the attached file, which opens with Word, the Tika parser throws an error:
> java.lang.NegativeArraySizeException
>     at org.apache.poi.hwpf.model.Ffn.<init>(Ffn.java:79)
>     at org.apache.poi.hwpf.model.FontTable.<init>(FontTable.java:66)
>     at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:344)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)





[jira] [Created] (TIKA-2166) TaggedIOException from a ZipException on a valid Word file

2016-11-04 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2166:


 Summary: TaggedIOException from a ZipException on a valid Word file
 Key: TIKA-2166
 URL: https://issues.apache.org/jira/browse/TIKA-2166
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached file, which opens with Word, Tika throws:

org.apache.tika.io.TaggedIOException: invalid block type
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)
    at org.gagravarr.tika.OggDetector.detect(OggDetector.java:68)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
    at gov.nih.niaid.fscanner.Extract.ExtractContents(Extract.java:63)
    at gov.nih.niaid.temp.Main.main(Main.java:68)
Caused by: org.apache.tika.io.TaggedIOException: invalid block type
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)
    ... 12 more
Caused by: java.util.zip.ZipException: invalid block type
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    ... 16 more
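
The trace above nests the zlib failure two levels deep, under two TaggedIOException wrappers. When triaging many such files it can help to log only the innermost cause. The following is a minimal sketch of that idea using plain JDK classes. The class name RootCause and the stand-in exception chain are hypothetical, and the plain IOExceptions merely imitate the shape of the wrappers in the report rather than using Tika's own classes.

```java
import java.io.IOException;
import java.util.zip.ZipException;

public class RootCause {

    /** Follows getCause() links until the innermost throwable is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // The second condition guards against a self-referential cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in chain shaped like the report: two wrappers around a ZipException.
        Throwable chain = new IOException("invalid block type",
                new IOException("invalid block type",
                        new ZipException("invalid block type")));
        Throwable root = rootCause(chain);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

Grouping failures by the root cause's class and message would, under these assumptions, put this report next to the "invalid block type" case mentioned in TIKA-2164.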





[jira] [Updated] (TIKA-2166) TaggedIOException from a ZipException on a valid Word file

2016-11-04 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2166:
-
Attachment: AMSMIC briefing doc.docx

> TaggedIOException from a ZipException on a valid Word file
> --
>
> Key: TIKA-2166
> URL: https://issues.apache.org/jira/browse/TIKA-2166
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: AMSMIC briefing doc.docx
>
>
> On the attached file, which opens with Word, Tika throws:
> org.apache.tika.io.TaggedIOException: invalid block type
>     at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)
>     at org.gagravarr.tika.OggDetector.detect(OggDetector.java:68)
>     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>     at gov.nih.niaid.fscanner.Extract.ExtractContents(Extract.java:63)
>     at gov.nih.niaid.temp.Main.main(Main.java:68)
> Caused by: org.apache.tika.io.TaggedIOException: invalid block type
>     at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)
>     ... 12 more
> Caused by: java.util.zip.ZipException: invalid block type
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>     at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>     ... 16 more


