Marc Prud'hommeaux created TIKA-2564:
----------------------------------------

             Summary: Tika client cannot extract files from embedded archive formats
                 Key: TIKA-2564
                 URL: https://issues.apache.org/jira/browse/TIKA-2564
             Project: Tika
          Issue Type: Bug
         Environment: Mac OS 10.13.3 (17D47)

17:42 ext$ java -version
java version "9.0.1"
Java(TM) SE Runtime Environment (build 9.0.1+11)
Java HotSpot(TM) 64-Bit Server VM (build 9.0.1+11, mixed mode)

17:42 ext$ uname -a
Darwin bix.local 17.4.0 Darwin Kernel Version 17.4.0: Sun Dec 17 09:19:54 PST 2017; root:xnu-4570.41.2~1/RELEASE_X86_64 x86_64
            Reporter: Marc Prud'hommeaux


This may be related to TIKA-2395.

When trying to extract the files from tika/tika-parsers/src/test/resources/test-documents/test-documents.tgz:

% coursier launch org.apache.tika:tika-app:1.17 --main org.apache.tika.cli.TikaCLI -- --extract test-documents.tgz

I see the exception:

Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.CompressorParser@62628e78
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:205)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:486)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at coursier.cli.qR.a(Unknown Source)
	at coursier.cli.qQ.j(Unknown Source)
	at coursier.cli.qW.a(Unknown Source)
	at d.h.a.c(Unknown Source)
	at b.b.c_(Unknown Source)
	at d.b.d.E.g(Unknown Source)
	at d.b.e.aW.g(Unknown Source)
	at d.b.f.b.aa.a(Unknown Source)
	at coursier.cli.qQ.b(Unknown Source)
	at coursier.cli.Q.b(Unknown Source)
	at b.J.c_(Unknown Source)
	at d.F.h(Unknown Source)
	at b.F.a(Unknown Source)
	at coursier.cli.Coursier.main(Unknown Source)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at coursier.Bootstrap.main(Bootstrap.java:428)
Caused by: java.io.IOException: mark/reset not supported
	at java.base/java.io.InputStream.reset(InputStream.java:474)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:444)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
	at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:1045)
	at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:222)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	... 28 more

However, I can browse the document fine using:

% coursier launch org.apache.tika:tika-app:1.17 --main org.apache.tika.cli.TikaCLI -- test-documents.tgz

This issue affects: test-documents.rar, test-documents.tar.Z, test-documents.tbz2, and test-documents.tgz.

But it does not affect test-documents.7z, test-documents.cab, test-documents.ddf, test-documents.dmg, test-documents.tar, or test-documents.zip.

This makes me suspect that it has something to do with extracting files from packages that are embedded in other archive parsers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
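
The "Caused by" part of the trace above ("mark/reset not supported") can be reproduced outside Tika with plain JDK streams. The following is only a minimal sketch under an assumption I have not verified against the Tika source: that a detector calls mark()/reset() on the decompressed stream it receives. The class MarkResetSketch and its methods peekFirstByte, ensureMarkSupported and gzip are hypothetical names for illustration, not Tika APIs, and wrapping in BufferedInputStream is one possible workaround rather than the project's fix.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MarkResetSketch {

    /** Hypothetical stand-in for a detector: reads one byte, then rewinds. */
    public static int peekFirstByte(InputStream in) throws IOException {
        in.mark(8);
        int first = in.read();
        in.reset(); // throws IOException when the stream cannot rewind
        return first;
    }

    /** Wraps the stream only when it cannot mark/reset on its own. */
    public static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Helper that builds a small gzip payload in memory. */
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("hello".getBytes(StandardCharsets.UTF_8));

        // A decompressing stream does not support mark/reset, so peeking fails.
        try (InputStream raw = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            peekFirstByte(raw);
        } catch (IOException e) {
            System.out.println("raw stream failed: " + e.getMessage());
        }

        // The same stream wrapped in a buffer can be peeked at and rewound.
        try (InputStream wrapped = ensureMarkSupported(
                new GZIPInputStream(new ByteArrayInputStream(compressed)))) {
            System.out.println("first byte via wrapped stream: " + peekFirstByte(wrapped));
        }
    }
}
```

If the assumption holds, it would be consistent with the pattern above: the failing files (.tgz, .tbz2, .tar.Z) all pass through a decompressing stream before the embedded document is detected, while formats such as .tar and .zip do not.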