Hi,

I'm trying to index a text file (~150 MB) using Solr Cell/Tika. The curl command aborts with a java.lang.OutOfMemoryError (this is the tail of the Tomcat 6.0.26 error page):

*****************************************************************
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:215)
        at java.lang.StringBuilder.toString(StringBuilder.java:430)
        at org.apache.solr.handler.extraction.SolrContentHandler.newDocument(SolrContentHandler.java:124)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:119)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:125)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:237)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:619)
*****************************************************************
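For reference, the upload is a plain Solr Cell extract call along these lines (the id and file name below are placeholders, not my literal command):

*****************************************************************
# illustrative invocation; core URL, literal.id and file name are placeholders
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
     -F "myfile=@bigfile.txt"
*****************************************************************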
AFAIK Tika keeps the whole extracted text in RAM and posts it to Solr as one single string (hence the StringBuilder.toString in the trace above). I'm running the JVM with -Xmx1024M and the default Solr config:

*****************************************************************
<mainIndex>
  <!-- options specific to the main on-disk lucene index -->
  <useCompoundFile>false</useCompoundFile>
  <ramBufferSizeMB>32</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
  ...
</mainIndex>

<requestDispatcher handleSelect="true" >
  <!-- Make sure your system has some authentication before enabling
       remote streaming! -->
  <requestParsers enableRemoteStreaming="true"
                  multipartUploadLimitInKB="2048000" />
  ...
*****************************************************************

Is there a way to make Solr/Tika flush its buffers while indexing a file, instead of building the whole document in memory first? Sizing the heap to match the largest file I might ever index doesn't seem like a good solution. Did I miss a configuration option, or do I have to modify the Java code?

I just found http://osdir.com/ml/tika-dev.lucene.apache.org/2009-02/msg00020.html and I'm wondering whether there is a solution yet.

Carina
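PS: In the meantime I'm considering working around it on the client side by splitting the file and indexing each piece as its own document, so no single extracted string has to fit in the heap. A rough sketch (chunk size, ids and URL are made up; with GNU split, -C keeps lines intact so words don't get cut in half):

*****************************************************************
# split the file into pieces of at most 10 MB on line boundaries (GNU split)
split -C 10M bigfile.txt chunk_

# post every piece as a separate document; the chunk file name doubles as id
for f in chunk_*; do
  curl "http://localhost:8983/solr/update/extract?literal.id=$f" -F "myfile=@$f"
done

# commit once at the end
curl "http://localhost:8983/solr/update" -H "Content-Type: text/xml" \
     --data-binary "<commit/>"
*****************************************************************

The obvious downside is that one logical file becomes many Solr documents, so searches that need to treat the file as a unit would have to stitch the chunks back together at query time.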