TNEF parsing unstable
---------------------

                 Key: TIKA-835
                 URL: https://issues.apache.org/jira/browse/TIKA-835
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.0
         Environment: CentOS 4.x/5.x/6.x, Java 6
            Reporter: Rob Tulloh


We are seeing problems in Solr with Tika throwing exceptions. Sometimes we see 
an OutOfMemoryError (OOM) like this:

{noformat}
SEVERE: java.lang.OutOfMemoryError: Java heap space
        at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:50)
        at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
        at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
        at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1478)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
{noformat}

Other times, we see errors like this one:

{noformat}
Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
        at org.apache.poi.util.LittleEndian.readUShort(LittleEndian.java:302)
        at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:53)
        at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
        at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
        at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 26 more
{noformat}

I can reproduce these failures with tika-app-1.0.jar, but I cannot share the 
input files because they are proprietary. The OOM error is particularly 
problematic: it crashes Solr, and our document indexing pipeline backs up 
while it waits for Solr to restart. See also Solr ticket 
https://issues.apache.org/jira/browse/SOLR-2990, which contains the original 
report of the problem and some details of the environment in which the tests 
are being performed.
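
Both traces fail inside the TNEFAttribute constructor, one while allocating 
and one while reading from the stream. One plausible reading, not confirmed 
against the POI source, is that a length field taken from a malformed file is 
trusted before the data is read. The sketch below only illustrates that kind 
of guard. BoundedAttributeReader, readLengthPrefixed and MAX_ATTRIBUTE_LENGTH 
are hypothetical names and values, not part of POI or Tika, and the layout 
(4-byte little-endian length followed by the payload) is an assumption made 
for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not POI or Tika code.
final class BoundedAttributeReader {
    // Assumed upper bound; a real parser would choose its own limit.
    static final int MAX_ATTRIBUTE_LENGTH = 16 * 1024 * 1024;

    /** Reads a 4-byte little-endian length, then that many payload bytes. */
    static byte[] readLengthPrefixed(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        // DataInputStream reads big-endian, so swap bytes for little-endian input.
        int length = Integer.reverseBytes(data.readInt());
        // Reject implausible lengths before allocating, so a corrupt field
        // produces an IOException instead of an OutOfMemoryError.
        if (length < 0 || length > MAX_ATTRIBUTE_LENGTH) {
            throw new IOException("Implausible attribute length: " + length);
        }
        byte[] payload = new byte[length];
        // Throws EOFException if the stream is shorter than the declared length.
        data.readFully(payload);
        return payload;
    }

    public static void main(String[] args) throws IOException {
        byte[] good = {3, 0, 0, 0, 0x61, 0x62, 0x63};
        System.out.println(readLengthPrefixed(new ByteArrayInputStream(good)).length);

        // Declared length 0x7FFFFFFF with no payload behind it.
        byte[] corrupt = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, 0x7F};
        try {
            readLengthPrefixed(new ByteArrayInputStream(corrupt));
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a check of this kind, a bad length surfaces as an ordinary parse 
exception for that one document rather than exhausting the heap of the whole 
Solr process.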


