TNEF parsing unstable
---------------------

                 Key: TIKA-835
                 URL: https://issues.apache.org/jira/browse/TIKA-835
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.0
         Environment: CentOS 4.x/5.x/6.x, Java 6
            Reporter: Rob Tulloh

We are seeing problems in Solr with Tika throwing exceptions while parsing TNEF content. Sometimes we see an OutOfMemoryError (OOM) like this:

{noformat}
SEVERE: java.lang.OutOfMemoryError: Java heap space
    at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:50)
    at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
    at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1478)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
{noformat}

Other times, we see errors like this one:

{noformat}
Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
    at org.apache.poi.util.LittleEndian.readUShort(LittleEndian.java:302)
    at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:53)
    at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
    at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 26 more
{noformat}

I am able to reproduce these failures with tika-app-1.0.jar. I cannot share the content because it is proprietary.

The OOM error is particularly problematic: it crashes Solr, and our document indexing pipeline backs up while it waits for Solr to restart.

Please see also Solr ticket https://issues.apache.org/jira/browse/SOLR-2990, which contains the original posting of the problem and some details of the environment where the tests are being performed.
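
Both traces fail inside the TNEFAttribute constructor only a few lines apart (TNEFAttribute.java:50 and :53). That would be consistent with a length field from a malformed stream being used to size a buffer before any sanity check, but this is an assumption drawn from the line numbers alone and has not been confirmed against the POI source. As a minimal sketch of the kind of guard that would turn both symptoms into an ordinary IOException, the following uses a hypothetical record layout (4-byte little-endian length followed by the payload). The class name, method names, and the 1 MiB cap are illustrative and are not part of POI or Tika.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class BoundedRecordReader {
    // Hypothetical upper bound on one record's payload; tune for real data.
    static final int MAX_RECORD_BYTES = 1 << 20;

    // Reads a 4-byte little-endian int, failing cleanly on a truncated stream.
    static int readIntLE(InputStream in) throws IOException {
        int b0 = in.read(), b1 = in.read(), b2 = in.read(), b3 = in.read();
        if ((b0 | b1 | b2 | b3) < 0) {
            throw new EOFException("truncated length field");
        }
        return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
    }

    // Reads one length-prefixed record, validating the length before allocating.
    static byte[] readRecord(InputStream in) throws IOException {
        int length = readIntLE(in);
        if (length < 0 || length > MAX_RECORD_BYTES) {
            // A corrupt length is rejected here instead of exhausting the heap.
            throw new IOException("implausible record length: " + length);
        }
        byte[] data = new byte[length];
        // readFully throws EOFException if the stream ends before 'length' bytes.
        new DataInputStream(in).readFully(data);
        return data;
    }
}
```

With a check of this kind, a corrupt length is reported as a recoverable parse failure for the one document rather than an OutOfMemoryError that takes down the whole Solr process.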