[ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660496#action_12660496 ]
Ryan McKinley commented on SOLR-906:
------------------------------------

| One problem with the current implementation is that it writes everything to a local buffer and then uploads the whole content in one go. So essentially we are wasting time until your 40K docs are written into this huge XML. Another issue is that this XML has to fit in memory. We need to fix the CommonsHttpSolrServer first. It must stream the docs.

Really?! Are you saying that [RequestEntity#getContentLength()|http://hc.apache.org/httpclient-3.x/apidocs/org/apache/commons/httpclient/methods/RequestEntity.html#getContentLength()] does not behave as advertised? This implementation returns -1 for the content length, and that tells the connection to use chunked encoding to transmit the request entity.

Where do you get the 40K number? Is it from the log? If so, that is the expected behavior -- the server continually processes documents until it reaches the end of the stream. That may be 1 document, that may be 1M docs... If you are filling up a Collection<SolrInputDocument> with 40K docs and then sending it, of course it is going to hold on to 40K docs at once.

| We can enhance the SolrServer API by adding a method SolrServer#add(Iterator<SolrInputDocument> docs). Then CommonsHttpSolrServer can start writing the documents as and when you produce them. We also have the advantage of not storing the huge list of docs in memory.

I'm not following... with the StreamingHttpSolrServer, you can send documents one at a time, and each document starts sending as soon as it can. There is a BlockingQueue<UpdateRequest> that holds all UpdateRequests that come through the 'request' method. A BlockingQueue only holds a fixed number of items and will block before adding anything beyond that limit.

| Another enhancement is using a different format (SOLR-865). It uses the javabin format, which can be extremely fast compared to XML, and the payload can be reduced substantially.

That is a different issue altogether. It relates to having something different running on the server. Once that is in, this should be able to leverage it as well...

> Buffered / Streaming SolrServer implementation
> ----------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, CommonsHttpSolrServer.add(SolrInputDocument) is less than optimal: it makes a new request for each document.
>
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open HTTP connection.
>
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680
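
To make the chunked-encoding point above concrete, here is a minimal sketch (not the SOLR-906 patch itself) of an HttpClient 3.x RequestEntity that streams an <add> request. The class name and the assumption that docs arrive as pre-serialized XML strings are illustrative only:

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Iterator;

import org.apache.commons.httpclient.methods.RequestEntity;

/**
 * Sketch: a RequestEntity that streams an <add> request instead of
 * buffering it. Returning -1 from getContentLength() tells HttpClient 3.x
 * to send the body with chunked transfer encoding, so the full XML is
 * never built up in memory.
 */
public class StreamingAddRequestEntity implements RequestEntity {

  // Assumption for the sketch: docs arrive as pre-serialized <doc> XML.
  private final Iterator<String> docs;

  public StreamingAddRequestEntity(Iterator<String> docs) {
    this.docs = docs;
  }

  public boolean isRepeatable() {
    return false; // the iterator can only be consumed once
  }

  public String getContentType() {
    return "text/xml; charset=utf-8";
  }

  public long getContentLength() {
    return -1; // unknown length => HttpClient uses chunked encoding
  }

  public void writeRequest(OutputStream out) throws IOException {
    Writer writer = new OutputStreamWriter(out, "UTF-8");
    writer.write("<add>");
    while (docs.hasNext()) {
      writer.write(docs.next()); // each doc hits the wire as it is produced
    }
    writer.write("</add>");
    writer.flush();
  }
}
{code}

Wiring it up amounts to calling post.setRequestEntity(new StreamingAddRequestEntity(docs)) on a PostMethod before HttpClient.executeMethod(post); since the length is unknown, nothing larger than the writer's buffer is ever held in memory on the client side.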
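
And a simplified sketch of the bounded BlockingQueue pattern described for StreamingHttpSolrServer. The class name, queue size, and UpdateRequest stand-in are illustrative, not taken from the attached patch:

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Sketch of the bounded-queue pattern: request() blocks once the queue is
 * full, so the client never holds more than a fixed number of pending
 * updates in memory while a background runner drains them onto the open
 * connection.
 */
public class BoundedUpdateQueueSketch {

  /** Stand-in for SolrJ's UpdateRequest, to keep the sketch self-contained. */
  static class UpdateRequest {
    final String docXml;
    UpdateRequest(String docXml) { this.docXml = docXml; }
  }

  // put() blocks when 10 requests are already pending (size is illustrative).
  private final BlockingQueue<UpdateRequest> queue =
      new ArrayBlockingQueue<UpdateRequest>(10);

  /** Producer side: called per document; blocks if the queue is full. */
  public void request(UpdateRequest req) throws InterruptedException {
    queue.put(req);
  }

  /** Consumer side: a daemon thread drains the queue onto one connection. */
  public void startRunner() {
    Thread runner = new Thread(new Runnable() {
      public void run() {
        try {
          while (true) {
            send(queue.take()); // take() blocks until a request arrives
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // runner shut down
        }
      }
    });
    runner.setDaemon(true);
    runner.start();
  }

  private void send(UpdateRequest req) {
    // The real StreamingHttpSolrServer writes to the chunked HTTP stream;
    // left abstract here.
    System.out.println("sending " + req.docXml);
  }
}
{code}

Because the queue is bounded, a producer that outruns the connection simply blocks in request() rather than accumulating documents, which is how the client avoids ever holding 40K docs at once.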