Hi Erick, Shawn,

Thank you both for your replies.

Luckily, on just the second attempt my 13GB SOLR XML (more than a million
docs) went into SOLR without any problem, and I then uploaded another two
sets of 1.2 million+ docs each without any hassle.

Next time I will try splitting the input into smaller XML files (a rough
sketch of what I have in mind is below) and will also try the auto-commit
suggestion.
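
An untested sketch of the splitter, assuming the input is one big <add>
element full of <doc>s (the file name and batch size are just examples):

  # Untested sketch: split a huge Solr <add> file into smaller batches.
  # Streams the input with iterparse so the 13GB file is never held in memory.
  import xml.etree.ElementTree as ET

  INPUT = "documents.xml"   # the big input file (illustrative name)
  BATCH = 25000             # docs per output file (illustrative)

  def write_batch(docs, n):
      add = ET.Element("add")
      for d in docs:
          add.append(d)
      ET.ElementTree(add).write("batch-%04d.xml" % n, encoding="utf-8")

  docs, n = [], 0
  context = ET.iterparse(INPUT, events=("start", "end"))
  event, root = next(context)          # the outer <add> element
  for event, elem in context:
      if event == "end" and elem.tag == "doc":
          docs.append(elem)
          root.clear()                 # drop parsed children; the refs held in docs keep them alive
          if len(docs) >= BATCH:
              write_batch(docs, n)
              docs, n = [], n + 1
  if docs:
      write_batch(docs, n)

Each batch-NNNN.xml could then be posted to /solr/update in the usual way.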

Best Rgds,
Mark.

On Thu, Apr 1, 2010 at 6:18 PM, Shawn Smith <sh...@thena.net> wrote:

> The error might be that your http client doesn't handle really large
> files (32-bit overflow in the Content-Length header?) or something in
> your network is killing your long-lived socket?  Solr can definitely
> accept a 13GB xml document.
>
> I've uploaded large files into Solr successfully, including recently a
> 12GB XML input file with ~4 million documents.  My Solr instance had
> 2GB of memory and it took about 2 hours.  Solr streamed the XML in
> nicely.  I had to jump through a couple of hoops, but in my case it
> was easier than writing a tool to split up my 12GB XML file...
>
> 1. I tried to use curl to do the upload, but it didn't handle files
> that large.  For my quick and dirty testing, netcat (nc) did the
> trick--it doesn't buffer the file in memory and it doesn't overflow
> the Content-Length header.  Plus I could pipe the data through pv to
> get a progress bar and estimated time of completion.  Not recommended
> for production!
>
>  FILE=documents.xml
>  SIZE=$(stat --format %s $FILE)
>  (echo "POST /solr/update HTTP/1.1
>  Host: localhost:8983
>  Content-Type: text/xml
>  Content-Length: $SIZE
>  " ; cat $FILE ) | pv -s $SIZE | nc localhost 8983
>
> 2. Indexing seemed to use less memory if I configured Solr to auto
> commit periodically in solrconfig.xml.  This is what I used:
>
>    <updateHandler class="solr.DirectUpdateHandler2">
>        <autoCommit>
>            <maxDocs>25000</maxDocs> <!-- maximum uncommitted docs before an autocommit is triggered -->
>            <maxTime>300000</maxTime> <!-- 5 minutes; maximum time (in ms) after adding a doc before an autocommit is triggered -->
>        </autoCommit>
>    </updateHandler>
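>
> (If nc feels too hacky, the same streaming idea works from a short script.
> Here is a rough, untested Python sketch using the old httplib API; host,
> port, path and file name are the same placeholders as above:)
>
>  # Untested sketch: stream a large file to /solr/update without buffering
>  # it in memory, as an alternative to the nc trick in (1).
>  # (Python 2's httplib; the equivalent on Python 3 is http.client.)
>  import os, httplib
>
>  FILE = "documents.xml"
>  size = os.path.getsize(FILE)
>
>  conn = httplib.HTTPConnection("localhost", 8983)
>  conn.putrequest("POST", "/solr/update")
>  conn.putheader("Content-Type", "text/xml")
>  conn.putheader("Content-Length", str(size))
>  conn.endheaders()
>  with open(FILE, "rb") as f:
>      while True:
>          chunk = f.read(1 << 20)   # 1 MB at a time
>          if not chunk:
>              break
>          conn.send(chunk)
>  resp = conn.getresponse()
>  print(resp.status)
>  print(resp.read())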
>
> Shawn
>
> On Thu, Apr 1, 2010 at 10:10 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
> > Don't do that. For many reasons <G>. By trying to batch so many docs
> > together, you're just *asking* for trouble. Quite apart from whether it'll
> > work once, expecting *any* HTTP-based protocol to work reliably with 13G is
> > fragile...
> >
> > For instance, I don't want to have to know whether the XML parsing in
> > SOLR reads the entire document into memory before processing or
> > not. But I sure don't want my application to change behavior if SOLR
> > changes its mind and wants to process it the other way. My perfectly
> > working application (assuming an event-driven parser) could
> > suddenly start requiring over 13G of memory... Oh my aching head!
> >
> > Your specific error might even be dependent upon GCing, which will
> > cause it to break differently, sometimes, maybe......
> >
> > So do break things up and transmit multiple documents. It'll save you
> > a world of hurt.
> >
> > HTH
> > Erick
> >
> > On Thu, Apr 1, 2010 at 4:34 AM, Mark Fletcher
> > <mark.fletcher2...@gmail.com>wrote:
> >
> >> Hi,
> >>
> >> For the first time I tried uploading a huge input SOLR xml containing about
> >> 1.2 million *docs* (13GB in size). After some time I got the following
> >> exception:
> >>
> >> The server encountered an internal error ([was class java.net.SocketTimeoutException] Read timed out
> >> java.lang.RuntimeException: [was class java.net.SocketTimeoutException] Read timed out
> >>  at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> >>  at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> >>  at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> >>  at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> >>  at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:279)
> >>  at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:138)
> >>  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
> >>  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> >>  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> >>  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> >>  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> >>  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> >>  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> >>  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> >>  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> >>  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> >>  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >>  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >>  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
> >>  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
> >>  at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
> >>  at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
> >>  at java.lang.Thread.run(Thread.java:619)
> >> Caused by: java.net.SocketTimeoutException: Read timed out
> >> ...
> >>
> >> Was the file I tried to upload too big, and should I try reducing its
> >> size?
> >>
> >> Thanks and Rgds,
> >> Mark.
> >>
> >
>
