Hi Erick, Shawn,

Thank you for your reply.
Luckily, on just the second attempt my 13GB SOLR XML (more than a million
docs) went into SOLR without any problem, and I uploaded another two sets
of 1.2 million+ docs without any hassle. I will try smaller XMLs next time,
as well as the auto-commit suggestion.

Best Rgds,
Mark.

On Thu, Apr 1, 2010 at 6:18 PM, Shawn Smith <sh...@thena.net> wrote:

> The error might be that your http client doesn't handle really large
> files (32-bit overflow in the Content-Length header?) or something in
> your network is killing your long-lived socket? Solr can definitely
> accept a 13GB xml document.
>
> I've uploaded large files into Solr successfully, including recently a
> 12GB XML input file with ~4 million documents. My Solr instance had
> 2GB of memory and it took about 2 hours. Solr streamed the XML in
> nicely. I had to jump through a couple of hoops, but in my case it
> was easier than writing a tool to split up my 12GB XML file...
>
> 1. I tried to use curl to do the upload, but it didn't handle files
> that large. For my quick and dirty testing, netcat (nc) did the
> trick--it doesn't buffer the file in memory and it doesn't overflow
> the Content-Length header. Plus I could pipe the data through pv to
> get a progress bar and estimated time of completion. Not recommended
> for production!
>
> FILE=documents.xml
> SIZE=$(stat --format %s $FILE)
> (echo "POST /solr/update HTTP/1.1
> Host: localhost:8983
> Content-Type: text/xml
> Content-Length: $SIZE
> " ; cat $FILE ) | pv -s $SIZE | nc localhost 8983
>
> 2. Indexing seemed to use less memory if I configured Solr to auto
> commit periodically in solrconfig.xml. This is what I used:
>
> <updateHandler class="solr.DirectUpdateHandler2">
>   <autoCommit>
>     <maxDocs>25000</maxDocs>   <!-- maximum uncommitted docs before
>                                     autocommit triggered -->
>     <maxTime>300000</maxTime>  <!-- 5 minutes, maximum time (in ms) after
>                                     adding a doc before an autocommit is
>                                     triggered -->
>   </autoCommit>
> </updateHandler>
>
> Shawn
>
> On Thu, Apr 1, 2010 at 10:10 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Don't do that. For many reasons <G>. By trying to batch so many docs
> > together, you're just *asking* for trouble. Quite apart from whether
> > it'll work once, having *any* HTTP-based protocol work reliably with
> > 13G is fragile...
> >
> > For instance, I don't want to have to know whether the XML parsing in
> > SOLR parses the entire document into memory before processing or
> > not. But I sure don't want my application to change behavior if SOLR
> > changes its mind and wants to process the other way. My perfectly
> > working application (assuming an event-driven parser) could
> > suddenly start requiring over 13G of memory... Oh my aching head!
> >
> > Your specific error might even be dependent upon GCing, which will
> > cause it to break differently, sometimes, maybe...
> >
> > So do break things up and transmit multiple documents. It'll save you
> > a world of hurt.
> >
> > HTH
> > Erick
> >
> > On Thu, Apr 1, 2010 at 4:34 AM, Mark Fletcher
> > <mark.fletcher2...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> For the first time I tried uploading a huge input SOLR xml having
> >> about 1.2 million *docs* (13GB in size).
> >> After some time I get the following exception:
> >>
> >> The server encountered an internal error ([was class
> >> java.net.SocketTimeoutException] Read timed out
> >> java.lang.RuntimeException: [was class java.net.SocketTimeoutException]
> >> Read timed out
> >> at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> >> at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> >> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> >> at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> >> at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:279)
> >> at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:138)
> >> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
> >> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> >> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> >> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> >> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> >> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> >> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> >> at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> >> at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> >> at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> >> at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >> at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >> at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
> >> at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
> >> at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
> >> at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
> >> at java.lang.Thread.run(Thread.java:619)
> >> Caused by: java.net.SocketTimeoutException: Read timed out
> >> ...
> >>
> >> Was the file I tried to upload too big, and should I try reducing its
> >> size?
> >>
> >> Thanks and Rgds,
> >> Mark.
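
For anyone who wants to follow Erick's advice and split a big file rather
than post it whole, the splitter Shawn mentions doesn't have to be
elaborate. Here is a rough, untested Python sketch; the file names
("documents.xml", "chunk_NNNN.xml") and the batch size are placeholders,
and it assumes the input is the usual flat <add><doc>...</doc>...</add>
layout:

#!/usr/bin/env python
# Stream a huge Solr <add> file and rewrite it as smaller chunk files
# of BATCH docs each. File names and batch size are placeholders.
import xml.etree.ElementTree as etree

INFILE = "documents.xml"
BATCH = 25000  # docs per chunk

def write_chunk(docs, n):
    out = open("chunk_%04d.xml" % n, "wb")
    out.write(b"<add>\n")
    for doc in docs:
        out.write(etree.tostring(doc))
    out.write(b"\n</add>\n")
    out.close()

docs, chunk = [], 0
# iterparse fires an "end" event as each element closes, so the whole
# 13GB file never has to be held in memory at once.
for event, elem in etree.iterparse(INFILE):
    if elem.tag == "doc":
        docs.append(elem)
        if len(docs) == BATCH:
            chunk += 1
            write_chunk(docs, chunk)
            for d in docs:
                d.clear()  # drop the field data we already wrote out
            docs = []
if docs:
    write_chunk(docs, chunk + 1)

Each chunk can then be posted the normal way, e.g.

curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary @chunk_0001.xml

with a final <commit/> at the end (or let the autoCommit settings Shawn
showed above take care of it).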