[ 
https://issues.apache.org/jira/browse/SOLR-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076403#comment-15076403
 ] 

Dennis Gove commented on SOLR-7535:
-----------------------------------

+1 on fault tolerance as well.

1) I think the expected behavior of all streams is that the EOF tuple could 
contain extra metadata about the stream that is only known at the end. This 
allows clients (or other streams) to know that this metadata didn't come 
from a real document but is just EOF metadata. If there are streams which don't 
handle a non-empty EOF tuple I think those streams should be corrected. 
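
To make the expectation in 1) concrete, here is a minimal sketch of a consumer that treats a non-empty EOF tuple as stream metadata rather than as a document. It deliberately does not use the real SolrJ classes: SimpleTuple, drain() and sampleStream() are hypothetical stand-ins invented for illustration, and the rowCount field is an assumed example of end-of-stream metadata.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class EofTupleSketch {

    /** Hypothetical stand-in for a tuple: a field map plus an EOF marker. */
    static final class SimpleTuple {
        final Map<String, Object> fields;
        final boolean eof;

        SimpleTuple(Map<String, Object> fields, boolean eof) {
            this.fields = fields;
            this.eof = eof;
        }
    }

    /**
     * Reads tuples until the EOF tuple is seen. Real documents are appended
     * to docs; the fields of the EOF tuple are returned separately so they
     * are never mistaken for a document.
     */
    static Map<String, Object> drain(Iterator<SimpleTuple> stream,
                                     List<Map<String, Object>> docs) {
        while (stream.hasNext()) {
            SimpleTuple t = stream.next();
            if (t.eof) {
                return t.fields; // may be empty or may carry metadata
            }
            docs.add(t.fields);
        }
        // Stream ended without an EOF tuple: report no metadata.
        return Collections.<String, Object>emptyMap();
    }

    /** Two documents followed by an EOF tuple with assumed metadata. */
    static Iterator<SimpleTuple> sampleStream() {
        return Arrays.asList(
            new SimpleTuple(Collections.<String, Object>singletonMap("id", 1), false),
            new SimpleTuple(Collections.<String, Object>singletonMap("id", 2), false),
            new SimpleTuple(Collections.<String, Object>singletonMap("rowCount", 2), true)
        ).iterator();
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        Map<String, Object> eofMetadata = drain(sampleStream(), docs);
        System.out.println("docs=" + docs.size() + " eofMetadata=" + eofMetadata);
    }
}
```

A stream that wraps another stream would do the same check and either pass the EOF fields through or merge its own metadata into them.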

2) I think you're correct about the ParallelStream and how it operates. I don't 
see a way for the ParallelStream, as currently implemented, to interact with 
the raw tuples coming out from a call to another stream's read() method. I.e., it 
does depend on doing the partitioning at the source and cannot do it in the 
middle of a data pipeline. It'd be a nice feature to be able to take a single 
stream of data and split it out onto N streams across N workers.

Here's an example of a pipeline I'd like to be able to create with a 
ParallelStream but currently cannot seem to. Essentially, do something with the 
data then split it off to workers to perform the expensive operations and 
then bring them back together (I hope the ascii art shows properly). 

{code}
                                  / --- worker1 --- rollup --- sort ---\
sourceA ---\                     /----- worker2 --- rollup --- sort ----\
            ----------- join ---<------ worker3 --- rollup --- sort -----> --- mergesort ---\
sourceB ---/                     \----- worker4 --- rollup --- sort ----/                    >--- join ---- output
                                  \ --- worker5 --- rollup --- sort ---/         sourceC ---/
{code}

My understanding is that the parallelization must be done at the start of the 
pipeline and cannot be done in the middle of the pipeline.

Maybe a new stream is required that can split streams off to workers.
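
As a rough illustration of what such a splitting stream might do, the sketch below hash-partitions tuples from a single upstream source into N per-worker buckets, so that all tuples with the same key land on the same worker (which is what a downstream rollup would need). This is not an existing Solr class or API: SplitSketch, split() and the use of plain maps as tuples are assumptions made purely for illustration, and a real implementation would stream tuples to workers instead of buffering them in lists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class SplitSketch {

    /**
     * Assigns each tuple to one of numWorkers partitions based on the hash
     * of its partition key. Tuples sharing a key value always end up in the
     * same partition.
     */
    static List<List<Map<String, Object>>> split(Iterable<Map<String, Object>> tuples,
                                                 String partitionKey,
                                                 int numWorkers) {
        if (numWorkers <= 0) {
            throw new IllegalArgumentException("numWorkers must be positive");
        }
        List<List<Map<String, Object>>> partitions = new ArrayList<>();
        for (int i = 0; i < numWorkers; i++) {
            partitions.add(new ArrayList<Map<String, Object>>());
        }
        for (Map<String, Object> tuple : tuples) {
            // floorMod keeps the index non-negative for negative hash codes;
            // Objects.hashCode maps a missing key (null) to partition 0.
            int worker = Math.floorMod(Objects.hashCode(tuple.get(partitionKey)), numWorkers);
            partitions.get(worker).add(tuple);
        }
        return partitions;
    }

    /** Builds a small hypothetical tuple with a group key and a value. */
    static Map<String, Object> tuple(int group, int value) {
        Map<String, Object> t = new HashMap<>();
        t.put("group", group);
        t.put("value", value);
        return t;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> joined = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            joined.add(tuple(i % 4, i)); // stands in for the output of the first join
        }
        List<List<Map<String, Object>>> perWorker = split(joined, "group", 3);
        for (int w = 0; w < perWorker.size(); w++) {
            System.out.println("worker" + (w + 1) + " receives " + perWorker.get(w).size() + " tuples");
        }
    }
}
```

Each bucket would then feed one worker's rollup and sort, with a mergesort bringing the worker outputs back together as in the diagram.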

> Add UpdateStream to Streaming API and Streaming Expression
> ----------------------------------------------------------
>
>                 Key: SOLR-7535
>                 URL: https://issues.apache.org/jira/browse/SOLR-7535
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java, SolrJ
>            Reporter: Joel Bernstein
>            Priority: Minor
>         Attachments: SOLR-7535.patch, SOLR-7535.patch
>
>
> The ticket adds an UpdateStream implementation to the Streaming API and 
> streaming expressions. The UpdateStream will wrap a TupleStream and send the 
> Tuples it reads to a SolrCloud collection to be indexed.
> This will allow users to pull data from different Solr Cloud collections, 
> merge and transform the streams and send the transformed data to another Solr 
> Cloud collection.


