[jira] [Comment Edited] (SOLR-7535) Add UpdateStream to Streaming API and Streaming Expression

Joel Bernstein (JIRA) Tue, 29 Dec 2015 05:01:25 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073867#comment-15073867
 ]


Joel Bernstein edited comment on SOLR-7535 at 12/29/15 12:59 PM:
-----------------------------------------------------------------

[~gerlowskija], the patch looks good. 

Three comments


1) I'd like to limit the changes in the patch to the UpdateStream if possible. 
It looks like the UpdateStream is extending CloudSolrStream which pushed some 
changes into CloudSolrStream. Let's have the UpdateStream extend TupleStream 
for now. In another ticket we can look at moving some shared methods to the 
TupleStream class to eliminate code duplication.

2) Let's remove the commit following the EOF tuple. The UpdateStream is likely 
to be run in parallel, which means dozens of workers will be committing at the 
same time. We can add a CommitStream, which would not be run in parallel, that 
will commit after a number of updates or after it sees the EOF tuple.

We'll implement the CommitStream in a different ticket. For now we can rely on 
autoCommits to commit and explicitly commit in the test cases.

The pseudo code below shows a CommitStream wrapping an UpdateStream which is 
wrapped by a ParallelStream.
{code}
commit(collection1,
       parallel(
           update(collection1, search(collection2...))
       ),
       100000)
{code}
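
To make the commit behaviour concrete, here is a minimal, self-contained sketch of the idea. It is not the CommitStream itself (that is left for the other ticket) and it does not use the SolrJ classes: tuples are plain maps, and TupleSource, Committer and the "batchIndexed" field are hypothetical names invented for illustration. It assumes each non-EOF tuple either carries a count of indexed documents or stands for a single update, and it only commits at EOF when something is still pending.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative stand-in only; none of these types are the real SolrJ API.
class CommitSketch {

    /** Hypothetical EOF marker tuple. */
    static final Map<String, Object> EOF = Map.of("EOF", true);

    /** Hypothetical upstream stream, e.g. the parallel update stream. */
    interface TupleSource { Map<String, Object> read(); }

    /** Hypothetical hook that issues the actual commit. */
    interface Committer { void commit(); }

    private final TupleSource source;
    private final Committer committer;
    private final long commitEvery;
    private long pending = 0; // documents indexed since the last commit

    CommitSketch(TupleSource source, Committer committer, long commitEvery) {
        if (commitEvery < 1) throw new IllegalArgumentException("commitEvery must be >= 1");
        this.source = source;
        this.committer = committer;
        this.commitEvery = commitEvery;
    }

    /** Passes tuples through, committing every commitEvery documents and at EOF. */
    Map<String, Object> read() {
        Map<String, Object> tuple = source.read();
        if (tuple.containsKey("EOF")) {
            if (pending > 0) { // flush whatever is left
                committer.commit();
                pending = 0;
            }
            return tuple;
        }
        // Use the count if the tuple has one, otherwise treat it as one update.
        Object count = tuple.get("batchIndexed");
        pending += (count instanceof Number) ? ((Number) count).longValue() : 1;
        if (pending >= commitEvery) {
            committer.commit();
            pending = 0;
        }
        return tuple;
    }

    public static void main(String[] args) {
        Iterator<Map<String, Object>> counts = List.<Map<String, Object>>of(
                Map.of("batchIndexed", 60),
                Map.of("batchIndexed", 60),
                Map.of("batchIndexed", 30)).iterator();
        int[] commits = {0};
        CommitSketch stream = new CommitSketch(
                () -> counts.hasNext() ? counts.next() : EOF,
                () -> commits[0]++,
                100);
        while (stream.read() != EOF) {
            // drain the stream
        }
        System.out.println("commits: " + commits[0]);
    }
}
```

Because only this single wrapper commits, the parallel workers underneath never issue commits of their own.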


3) We'll want to implement batching. So we'll need to add a batch size 
parameter to the UpdateStream. Then we'll send the updates in a batch to the 
CloudSolrClient. After each batch the read() method should return a Tuple with 
the number of documents indexed in the batch. This Tuple can be used by the 
CommitStream to commit every X records and can be returned to the client, which 
will ensure that we don't get client timeouts due to inactivity.

So each call to UpdateStream.read() will read a batch of docs from the 
sourceStream, index the batch, and return a Tuple with the count.
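
As a rough illustration of that read() contract, here is a self-contained sketch. It deliberately avoids the real SolrJ types: the source stream is a plain Iterator, tuples are maps, DocSink is a hypothetical stand-in for sending a batch to the CloudSolrClient, and the "batchIndexed" field name is an assumption, not something the patch defines.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative stand-in only; none of these types are the real SolrJ API.
class BatchingUpdateSketch {

    /** Hypothetical EOF marker tuple. */
    static final Map<String, Object> EOF = Map.of("EOF", true);

    /** Hypothetical indexing target; in Solr this role belongs to CloudSolrClient. */
    interface DocSink { void addBatch(List<Map<String, Object>> docs); }

    private final Iterator<Map<String, Object>> source;
    private final DocSink sink;
    private final int batchSize;

    BatchingUpdateSketch(Iterator<Map<String, Object>> source, DocSink sink, int batchSize) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        this.source = source;
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /** Reads up to batchSize docs, sends them as one batch, and returns the count. */
    Map<String, Object> read() {
        List<Map<String, Object>> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && source.hasNext()) {
            batch.add(source.next());
        }
        if (batch.isEmpty()) {
            return EOF; // source exhausted, nothing left to index
        }
        sink.addBatch(batch);
        Map<String, Object> result = new HashMap<>();
        result.put("batchIndexed", batch.size()); // assumed field name
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            docs.add(Map.<String, Object>of("id", i));
        }
        BatchingUpdateSketch stream = new BatchingUpdateSketch(
                docs.iterator(),
                batch -> System.out.println("sent " + batch.size() + " docs"),
                2);
        Map<String, Object> tuple;
        while ((tuple = stream.read()) != EOF) {
            System.out.println(tuple);
        }
    }
}
```

With five source docs and a batch size of 2, the sketch returns count tuples of 2, 2 and 1 before EOF, so the caller sees a tuple after every batch rather than waiting for the whole source to drain.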

 


> Add UpdateStream to Streaming API and Streaming Expression
> ----------------------------------------------------------
>
>                 Key: SOLR-7535
>                 URL: https://issues.apache.org/jira/browse/SOLR-7535
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java, SolrJ
>            Reporter: Joel Bernstein
>            Priority: Minor
>         Attachments: SOLR-7535.patch
>
>
> The ticket adds an UpdateStream implementation to the Streaming API and 
> streaming expressions. The UpdateStream will wrap a TupleStream and send the 
> Tuples it reads to a SolrCloud collection to be indexed.
> This will allow users to pull data from different Solr Cloud collections, 
> merge and transform the streams and send the transformed data to another Solr 
> Cloud collection.


