[ 
https://issues.apache.org/jira/browse/TIKA-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177545#comment-13177545
 ] 

Nick Burch commented on TIKA-835:
---------------------------------

Without a file, it's going to be very hard for us to identify what's wrong. (It 
could well be that we're mis-reading the previous attribute, and then finding 
junk where the next one should be.)
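
To illustrate that failure mode, here is a minimal sketch, not the actual POI 
code. It assumes the usual TNEF attribute layout (2-byte id, 2-byte type, 
4-byte little-endian length, data, 2-byte checksum) and that the level byte 
has already been consumed. The class name, the helper methods and the 
MAX_ATTRIBUTE_LENGTH cap are all hypothetical. If the reader has drifted off 
an attribute boundary, the "length" is just junk bytes, so a sanity check 
before allocating turns a likely OutOfMemoryError into an ordinary exception:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class TnefAttributeSketch {
    // Hypothetical cap for illustration; a real limit would need tuning.
    static final int MAX_ATTRIBUTE_LENGTH = 16 * 1024 * 1024;

    // Reads an unsigned 16-bit little-endian value, failing on end of stream.
    static int readUShortLE(InputStream in) throws IOException {
        int b0 = in.read();
        int b1 = in.read();
        if ((b0 | b1) < 0) {
            throw new EOFException("buffer underrun");
        }
        return b0 | (b1 << 8);
    }

    // Reads a signed 32-bit little-endian value.
    static int readIntLE(InputStream in) throws IOException {
        int lo = readUShortLE(in);
        int hi = readUShortLE(in);
        return lo | (hi << 16);
    }

    // Reads one attribute (after its level byte) and returns the data bytes.
    static byte[] readAttribute(InputStream in) throws IOException {
        readUShortLE(in);            // attribute id (unused in this sketch)
        readUShortLE(in);            // attribute type (unused in this sketch)
        int length = readIntLE(in);  // junk here if we are off the boundary
        if (length < 0 || length > MAX_ATTRIBUTE_LENGTH) {
            throw new IOException("Implausible attribute length: " + length);
        }
        byte[] data = new byte[length];
        new DataInputStream(in).readFully(data); // EOFException if truncated
        readUShortLE(in);            // checksum (not verified in this sketch)
        return data;
    }

    public static void main(String[] args) throws IOException {
        // A well-formed attribute carrying the three data bytes 'a', 'b', 'c'.
        byte[] good = {1, 0, 2, 0, 3, 0, 0, 0, 'a', 'b', 'c', 0, 0};
        System.out.println(readAttribute(new ByteArrayInputStream(good)).length);
    }
}
```

With a check like this, a misaligned read fails fast with an IOException 
instead of trying to allocate an enormous array. It would not fix the 
underlying mis-read, which still needs a sample file to track down.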

Alas, the TNEF format doesn't have nearly as much public documentation as most 
of the other Microsoft formats, so reverse engineering is often needed (which 
needs sample files to work against).

Finally, this is a POI bug, so we should move the discussion of how you can 
identify the problem parts of your file over to POI.
                
> TNEF parsing unstable
> ---------------------
>
>                 Key: TIKA-835
>                 URL: https://issues.apache.org/jira/browse/TIKA-835
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>         Environment: CentOS 4.x/5.x/6.x 
> Java 6
>            Reporter: Rob Tulloh
>
> We are seeing problems in Solr with Tika throwing exceptions. Sometimes we 
> see an OOM like this:
> {noformat}
> SEVERE: java.lang.OutOfMemoryError: Java heap space
>         at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:50)
>         at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
>         at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
>         at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
>         at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
>         at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
>         at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
>         at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1478)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>         at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> {noformat}
> Other times, we see errors like this one:
> {noformat}
> Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
>         at org.apache.poi.util.LittleEndian.readUShort(LittleEndian.java:302)
>         at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:53)
>         at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
>         at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
>         at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
>         at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
>         at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
>         at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
>         at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>         ... 26 more
> {noformat}
> I am able to reproduce these failures with tika-app-1.0.jar. I am not able to 
> share the content as the content is proprietary in nature. The OOM error is 
> particularly problematic as it crashes Solr and causes our document indexing 
> pipeline to get congested while it waits for Solr to restart. Please see also 
> Solr ticket https://issues.apache.org/jira/browse/SOLR-2990 as this contains 
> the original posting of the problem and some details of our environment where 
> the tests are being performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
