[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Ryan McKinley (JIRA) Sun, 04 Jan 2009 10:00:12 -0800

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660607#action_12660607 ]


Ryan McKinley commented on SOLR-906:
------------------------------------

Are you looking at the patch or just brainstorming how this could be 
implemented?

{panel}
I am referring to the client code. The method in UpdateRequest:

public Collection<ContentStream> getContentStreams() throws IOException {
    return ClientUtils.toContentStreams( getXML(), ClientUtils.TEXT_XML );
}

This means that the getXML() method actually constructs a huge String that 
holds the entire XML. That is not very good if we are writing out a very large 
number of docs.
{panel}

This is not how the patch works. For starters, it never calls 
getContentStreams() for UpdateRequest. It opens a single connection and 
continually writes the XML for each request to it. Rather than calling 
getXML(), the patch adds a method, writeXml( Writer ), that writes directly to 
the open buffer.
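
To make the difference concrete, here is a minimal sketch. SketchUpdateRequest 
is a hypothetical stand-in, not the UpdateRequest from the patch, and the XML 
layout and escaping are simplified assumptions. The point is only that 
writeXml( Writer ) emits one document at a time, so nothing forces the whole 
payload into a single String.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an update request; NOT the real SolrJ class. */
class SketchUpdateRequest {
    private final List<Map<String, String>> docs;

    SketchUpdateRequest(List<Map<String, String>> docs) {
        this.docs = docs;
    }

    /** Streams the add command to the writer, one document at a time. */
    void writeXml(Writer out) throws IOException {
        out.write("<add>");
        for (Map<String, String> doc : docs) {
            out.write("<doc>");
            for (Map.Entry<String, String> field : doc.entrySet()) {
                out.write("<field name=\"" + escape(field.getKey()) + "\">");
                out.write(escape(field.getValue()));
                out.write("</field>");
            }
            out.write("</doc>");
        }
        out.write("</add>");
    }

    /** Builds the entire payload in memory; shown only for contrast. */
    String getXML() {
        StringWriter buffer = new StringWriter();
        try {
            writeXml(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    /** Simplified escaping (an assumption, not Solr's own XML utility). */
    private static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```

A caller holding an open connection would wrap its output stream in a Writer 
and pass that to writeXml(), instead of calling getXML() and sending the 
resulting String.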

{panel}
I am suggesting that CommonsHttpSolrServer has scope for improvement. Instead 
of building that String in memory we can just start streaming it to the server. 
The OutputStream can be passed on to UpdateRequest so that it can write the XML 
right into the stream. That way there is effectively streaming on both ends.
{panel}
The CommonsHttpSolrServer is fine, but you are right that each UpdateRequest 
*may* want to write the content directly to the open stream.  The ContentStream 
interface gives us all that control.  One thing to note is that if you do not 
specify the length, Commons HttpClient will use chunked encoding.

But I think adding the StreamingUpdateSolrServer resolves that for everyone.  
Users have either option.

{panel}
 One drawback of a StreamingHttpSolrServer is that it ends up opening multiple 
connections for uploading the documents
{panel}
Nonsense -- that is exactly what this avoids.  Each thread opens a single 
connection and writes everything it takes from the queue to it.  You can 
configure how many threads you want emptying the queue; each one will open its 
own connection.
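
A rough sketch of that model, under stated assumptions: QueueDrainSketch and 
drain() are hypothetical names, a StringWriter stands in for each open HTTP 
connection, and a sentinel object tells each worker when to stop. It is not 
the code from the patch, only an illustration of "N threads emptying one 
queue, one connection per thread".

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch; NOT the StreamingHttpSolrServer from the patch. */
class QueueDrainSketch {
    // Unique sentinel instance, compared by identity so no real doc can match it.
    private static final String STOP = new String("STOP");

    /**
     * Drains the docs with threadCount workers. Each worker owns exactly one
     * StringWriter (standing in for one open connection) for its lifetime.
     * Returns what each simulated connection received.
     */
    static List<String> drain(List<String> docs, int threadCount) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(docs);
        for (int i = 0; i < threadCount; i++) {
            queue.add(STOP); // one stop marker per worker, behind all docs
        }

        List<StringWriter> connections = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            StringWriter connection = new StringWriter();
            connections.add(connection);
            Thread worker = new Thread(() -> {
                connection.write("<stream>"); // "open" the connection once
                try {
                    while (true) {
                        String doc = queue.take();
                        if (doc == STOP) {
                            break;
                        }
                        connection.write(doc); // keep writing to the same connection
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                connection.write("</stream>");
            });
            workers.add(worker);
            worker.start();
        }

        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while draining", e);
            }
        }

        List<String> received = new ArrayList<>();
        for (StringWriter connection : connections) {
            received.add(connection.toString());
        }
        return received;
    }
}
```

With threadCount set to 1 this degenerates to a single connection carrying 
every document; raising it trades more connections for more parallelism.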

{panel}
Another enhancement: we can add one (or more) extra threads in the server to 
make the call to UpdateRequestProcessor.processAdd().
{panel}
That opens a whole can of worms...  perhaps better discussed on java-dev.  For 
now I think sticking to 1 thread per request is a good model.  If you want 
multiple threads running on the server, use multiple connections (the thread 
count is even an argument in the StreamingHttpSolrServer).

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, 
> SOLR-906-StreamingHttpSolrServer.patch, 
> SOLR-906-StreamingHttpSolrServer.patch, 
> SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( 
> SolrInputDocument ) is less than optimal.  This makes a new request for each 
> document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to 
> a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
