[jira] [Commented] (SOLR-7535) Add UpdateStream to Streaming API and Streaming Expression

Joel Bernstein (JIRA) Sun, 03 Jan 2016 13:50:57 -0800

    [ https://issues.apache.org/jira/browse/SOLR-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080537#comment-15080537 ]


Joel Bernstein commented on SOLR-7535:
--------------------------------------

After some more thought, I'm thinking of adding a buffer="true/false" parameter 
to the UpdateStream. If buffer="true" then the UpdateStream will first write 
each batch to local disk. During the buffering phase each tuple will return the 
"buffered" count. When all the records have been buffered, each call to read() 
will index one batch from disk and return the "indexed" count.

I believe we're going to need this buffering approach when indexing large 
amounts of data from a large number of shards. For example, with 10 workers and 
20 shards with 3 replicas we could expect well over 10 million records per 
second being exported from the shards. Indexing will be much, much slower so 
the exporting shards will be blocked for minutes at a time, causing timeouts. 
Buffering to local disk should be able to keep up, even with compression. 

If buffer="false" then the UpdateStream will directly update the way that it 
does now.  This will work fine for smaller loads.
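
To make the two modes concrete, here is a minimal sketch of the read() 
behavior described above. It is not taken from the attached patches: the 
class name, the use of plain strings as records, the Consumer standing in for 
the SolrCloud update call, the temp-file layout and the choice to report 
running totals are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch only; none of these names come from the SOLR-7535 patch.
class BufferingUpdateSketch {
    private final Iterator<List<String>> source;  // upstream batches (stand-in for the wrapped stream)
    private final Consumer<List<String>> indexer; // stand-in for sending one batch to be indexed
    private final boolean buffer;                 // mirrors the proposed buffer="true/false" parameter
    private final Path spillDir;                  // local directory holding buffered batches
    private final Deque<Path> spilled = new ArrayDeque<>();
    private long buffered = 0;
    private long indexed = 0;

    BufferingUpdateSketch(Iterator<List<String>> source,
                          Consumer<List<String>> indexer,
                          boolean buffer) throws IOException {
        this.source = source;
        this.indexer = indexer;
        this.buffer = buffer;
        this.spillDir = Files.createTempDirectory("update-buffer");
    }

    // Returns {"buffered": n} or {"indexed": n} (running totals, an assumption),
    // or null once everything has been indexed.
    Map<String, Long> read() throws IOException {
        if (source.hasNext()) {
            List<String> batch = source.next();
            if (!buffer) {
                return index(batch); // buffer="false": index directly
            }
            // buffer="true": write the batch to disk, one record per line
            // (records must not contain newlines in this simplified format).
            Path file = Files.createTempFile(spillDir, "batch-", ".txt");
            Files.write(file, batch);
            spilled.add(file);
            buffered += batch.size();
            return Collections.singletonMap("buffered", buffered);
        }
        // Upstream is exhausted: each call now indexes one buffered batch.
        Path file = spilled.poll();
        if (file == null) {
            return null;
        }
        List<String> batch = Files.readAllLines(file);
        Files.delete(file);
        return index(batch);
    }

    private Map<String, Long> index(List<String> batch) {
        indexer.accept(batch);
        indexed += batch.size();
        return Collections.singletonMap("indexed", indexed);
    }

    public static void main(String[] args) throws IOException {
        List<String> sink = new ArrayList<>();
        Iterator<List<String>> batches =
            Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c")).iterator();
        BufferingUpdateSketch stream = new BufferingUpdateSketch(batches, sink::addAll, true);
        for (Map<String, Long> t = stream.read(); t != null; t = stream.read()) {
            System.out.println(t);
        }
    }
}
```

The point of the split is that the upstream iterator is drained at disk-write 
speed, so the exporting side is never held open while the slower indexing 
step runs.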

> Add UpdateStream to Streaming API and Streaming Expression
> ----------------------------------------------------------
>
>                 Key: SOLR-7535
>                 URL: https://issues.apache.org/jira/browse/SOLR-7535
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java, SolrJ
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>            Priority: Minor
>         Attachments: SOLR-7535.patch, SOLR-7535.patch, SOLR-7535.patch, 
> SOLR-7535.patch, SOLR-7535.patch, SOLR-7535.patch
>
>
> The ticket adds an UpdateStream implementation to the Streaming API and 
> streaming expressions. The UpdateStream will wrap a TupleStream and send the 
> Tuples it reads to a SolrCloud collection to be indexed.
> This will allow users to pull data from different Solr Cloud collections, 
> merge and transform the streams and send the transformed data to another Solr 
> Cloud collection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
