[ https://issues.apache.org/jira/browse/TIKA-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Burchard updated TIKA-3261:
--------------------------------
    Description:
I've tried to parse a file using both 1.20 and 1.24.1. The file appears valid when I view it in my text editor and seems to simply be a tab-delimited table with a mix of Hebrew and Latin characters. In 1.20 I see an exception thrown, and in 1.24.1 I get JSON metadata back with no content.

My command line:
curl -X PUT --upload-file /tmp/choke.txt http://localhost:9998/rmeta/text

1.24.1 Result:
[{"Content-Type":"application/octet-stream","X-Parsed-By":"org.apache.tika.parser.EmptyParser","X-TIKA:embedded_depth":"0","X-TIKA:parse_time_millis":"10"}]

1.20 Result:
INFO Starting Apache Tika 1.20 server
INFO Setting the server's publish address to be http://localhost:9998/
INFO Logging initialized @1704ms to org.eclipse.jetty.util.log.Slf4jLog
INFO jetty-9.4.z-SNAPSHOT; built: 2018-08-30T13:59:14.071Z; git: 27208684755d94a92186989f695db2d7b21ebc51; jvm 8.0.6.10 - pwa6480sr6fp10-20200408_01(SR6 FP10)
INFO Started ServerConnector@7b09f799{HTTP/1.1,[http/1.1]}{localhost:9998}
INFO Started @2085ms
WARN Empty contextPath
INFO Started o.e.j.s.h.ContextHandler@-405fdc63{/,null,AVAILABLE}
INFO Started Apache Tika server at http://localhost:9998/
INFO rmeta/text (autodetecting type)
WARN rmeta/text: Text extraction failed (null)
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@74f007b
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:224)
    at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:401)
    at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:144)
    at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:121)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:90)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
    at java.lang.reflect.Method.invoke(Method.java:508)
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1340)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1242)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:503)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
    at java.lang.Thread.run(Thread.java:820)
Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
    at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:127)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 37 more

  was:
I've tried to parse a file using both 1.20 and 1.24.1. The file appears valid when I view it in my text editor and seems to simply be a tab-delimited table with a mix of Hebrew and Latin characters. In 1.20 I see an exception thrown, and in 1.24.1 I get JSON metadata back with no content.
My command line:
curl -X PUT --upload-file /tmp/choke.txt http://localhost:9998/rmeta/text

1.24.1 Result:
[{"Content-Type":"application/octet-stream","X-Parsed-By":"org.apache.tika.parser.EmptyParser","X-TIKA:embedded_depth":"0","X-TIKA:parse_time_millis":"10"}]

1.20 Result:
INFO Starting Apache Tika 1.20 server
INFO Setting the server's publish address to be http://localhost:9998/
INFO Logging initialized @1704ms to org.eclipse.jetty.util.log.Slf4jLog
INFO jetty-9.4.z-SNAPSHOT; built: 2018-08-30T13:59:14.071Z; git: 27208684755d94a92186989f695db2d7b21ebc51; jvm 8.0.6.10 - pwa6480sr6fp10-20200408_01(SR6 FP10)
INFO Started ServerConnector@7b09f799{HTTP/1.1,[http/1.1]}{localhost:9998}
INFO Started @2085ms
WARN Empty contextPath
INFO Started o.e.j.s.h.ContextHandler@-405fdc63{/,null,AVAILABLE}
INFO Started Apache Tika server at http://localhost:9998/
INFO rmeta/text (autodetecting type)
WARN rmeta/text: Text extraction failed (null)
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@74f007b
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:224)
    at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:401)
    at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:144)
    at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:121)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:90)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
    at java.lang.reflect.Method.invoke(Method.java:508)
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1340)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1242)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:503)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
    at java.lang.Thread.run(Thread.java:820)
Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
    at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:127)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 37 more

> Text file is parsed by "EmptyParser" but the file does contain what looks like valid text
> -----------------------------------------------------------------------------------------
>
> Key: TIKA-3261
> URL: https://issues.apache.org/jira/browse/TIKA-3261
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.20, 1.24.1
> Environment: Tika is running on Windows 10 for my test machine, and Windows 2016 for the production machine. Reproducible on both. The Linux command line I used is just SLES on WSL, so it has no bearing here.
>
> (having a problem attaching the file, Jira is giving me a 'missing token' error so I'll try again after creation of the Jira issue)
> Reporter: Josh Burchard
> Priority: Major
> Attachments: choke.zip
>
>
> I've tried to parse a file using both 1.20 and 1.24.1. The file appears valid when I view it in my text editor and seems to simply be a tab-delimited table with a mix of Hebrew and Latin characters. In 1.20 I see an exception thrown, and in 1.24.1 I get JSON metadata back with no content.
>
> My command line:
> curl -X PUT --upload-file /tmp/choke.txt http://localhost:9998/rmeta/text
>
> 1.24.1 Result:
> [{"Content-Type":"application/octet-stream","X-Parsed-By":"org.apache.tika.parser.EmptyParser","X-TIKA:embedded_depth":"0","X-TIKA:parse_time_millis":"10"}]
>
> 1.20 Result:
> INFO Starting Apache Tika 1.20 server
> INFO Setting the server's publish address to be http://localhost:9998/
> INFO Logging initialized @1704ms to org.eclipse.jetty.util.log.Slf4jLog
> INFO jetty-9.4.z-SNAPSHOT; built: 2018-08-30T13:59:14.071Z; git: 27208684755d94a92186989f695db2d7b21ebc51; jvm 8.0.6.10 - pwa6480sr6fp10-20200408_01(SR6 FP10)
> INFO Started ServerConnector@7b09f799{HTTP/1.1,[http/1.1]}{localhost:9998}
> INFO Started @2085ms
> WARN Empty contextPath
> INFO Started o.e.j.s.h.ContextHandler@-405fdc63{/,null,AVAILABLE}
> INFO Started Apache Tika server at http://localhost:9998/
> INFO rmeta/text (autodetecting type)
> WARN rmeta/text: Text extraction failed (null)
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@74f007b
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:224)
>     at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:401)
>     at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:144)
>     at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:121)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:90)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
>     at java.lang.reflect.Method.invoke(Method.java:508)
>     at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
>     at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
>     at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
>     at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
>     at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>     at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>     at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>     at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>     at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1340)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1242)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.Server.handle(Server.java:503)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>     at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
>     at java.lang.Thread.run(Thread.java:820)
> Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
>     at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:127)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 37 more

--
This message was sent by Atlassian Jira (v8.3.4#803005)