
Have you asked on Tika's mailing list?
You may also want to watch https://issues.apache.org/jira/browse/SOLR-2901


> From: Wayne W <waynemailingli...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Saturday, January 14, 2012 2:53 AM
> Subject: Solr - Tika(?) memory leak
> We're using Solr running on Tomcat with 1 GB of heap in production, and
> of late we've been having a huge number of OutOfMemory errors. From
> what I can tell, this is coming from the Tika extraction of the
> content. I've processed the Java heap dump using a memory analyzer and
> it's pretty clear, at least as to which class is involved. It seems like
> a leak to me, as we don't parse any files larger than 20 MB, and these
> objects are taking up ~700 MB.
> I've attached 2 screenshots from the tool (not sure if you receive
> attachments).
> But to summarize:
>
> Class                                            Number of objects   Used heap size   Retained heap size
> org.apache.xmlbeans.impl.store.Xob$ElementXObj             838,993       80,533,728          604,606,040
> org.apache.poi.openxml4j.opc.ZipPackage                          2              112           87,009,848
> char[]                                                         587       32,216,960           38,216,950
>
> We're really desperate to find a solution to this - any ideas or help
> is greatly appreciated.
> Wayne

