[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126429#comment-15126429 ]
Tim Allison commented on TIKA-1845:
-----------------------------------

Looks like there is no trouble with the tika-app with straight extraction or with recursive json -J.

If you use the -z option, you'll see the attachment is a wmf file; it looks like a blank, black banner. There may be items in the metadata, though, that are sensitive (name of original author and organization).

As I look more closely at the stacktrace, the detector is returning "null" in the AutoDetectParser with tika-server, but not with tika-app. I can trigger this stacktrace in our unit tests. Yay!

So, I think TIKA-1010 made this problem visible, but it looks like something is going wrong with the detector in tika-server's AutoDetectParser.

> Unable to extract content from certain RTFs using tika-server versions since 1.5
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-1845
>                 URL: https://issues.apache.org/jira/browse/TIKA-1845
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.6, 1.9, 1.11
>         Environment: Windows
>            Reporter: Ian Williams
>
> I have some patient letters that are RTF documents. When I extract the text from these documents using tika-server-1.5.jar, it works fine.
> However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 1.11), it fails with the stack trace and error shown below.
> I can provide a sample RTF that is failing.
> I wondered whether the error might be related to the following change that was introduced in 1.6?:
>   * Made RTFParser's list handling slightly more robust against corrupt list metadata (TIKA-1305)
> It's possible that there is some issue with the RTF documents, but they are real patient letters and they open in Microsoft Word without any problems.
> Many thanks
> Ian
> Steps to reproduce issue
> ====================
> 1. HTTP PUT to Tika server using curl:
> C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf http://localhost:9998/tika --header "Content-Type: application/rtf" --header "Accept: text/plain"
> --> this works fine when running tika-server-1.5.jar, but fails with tika-server-1.6.jar
> 2. Screen capture from the server:
> INFO: Starting Apache Tika 1.9 server
> Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
> INFO: Setting the server's publish address to be http://localhost:9998/
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: jetty-8.y.z-SNAPSHOT
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Started SelectChannelConnector@localhost:9998
> Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
> INFO: Started
> Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: tika (application/rtf)
> Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
>     at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
>     at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
>     at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
>     at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
>     at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
>     at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
>     at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>     at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
>     at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>     at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>     at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>     at org.eclipse.jetty.server.Server.handle(Server.java:370)
>     at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>     at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
>     at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
>     at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
>     at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>     at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>     at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
>     at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>     at java.lang.Thread.run(Unknown Source)
> Caused by: java.lang.NullPointerException
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:103)
>     at org.apache.tika.parser.rtf.RTFEmbObjHandler.extractObj(RTFEmbObjHandler.java:230)
>     at org.apache.tika.parser.rtf.RTFEmbObjHandler.handleCompletedObject(RTFEmbObjHandler.java:198)
>     at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1357)
>     at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:456)
>     at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:439)
>     at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:86)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>     ... 34 more
> Feb 01, 2016 2:26:25 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)