[ 
https://issues.apache.org/jira/browse/SOLR-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076381#comment-15076381
 ] 

Jason Gerlowski commented on SOLR-7535:
---------------------------------------

+1 for adding fault tolerance, but under a separate JIRA ticket.  
This probably needs to be thought through across the board.

Additionally, just wanted to put some updates on my progress here.  I sat down 
this morning to work on tests for UpdateStream.  The simple cases all seem to 
work fine.

I _did_ run into two issues, though, when I tried to write a test that combined 
ParallelStream and UpdateStream (i.e. parallel(a, update(b...))):

1.) Currently, if UpdateStream reaches EOF mid-batch, it sends out an EOF tuple 
that also contains a "docsUploaded" field.  But ParallelStream currently 
swallows this tuple and spits out a completely clean EOF tuple.  (It seems a 
few Stream types expect that EOF tuples don't have any substantive fields).

This shouldn't be hard to fix.  I can just change UpdateStream to emit EOF 
after emitting a tuple with the partial batch, i.e. instead of {{{EOF:true 
docsUploaded:3}}}, just return {{{docsUploaded:3}}} followed by {{{EOF:true}}}.
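
A minimal sketch of that ordering, using plain Java maps rather than the real 
Solr Tuple and TupleStream classes.  The class name, method names, and batch 
size are hypothetical; only the "docsUploaded" and "EOF" field names come from 
the description above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for UpdateStream's batching logic; not the Solr API.
class BatchSummarySketch {

    // Builds a summary tuple reporting how many docs were sent in one batch.
    static Map<String, Object> summary(int docsUploaded) {
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("docsUploaded", docsUploaded);
        return tuple;
    }

    // Returns the tuples a wrapping stream would see for docCount input docs.
    static List<Map<String, Object>> summarize(int docCount, int batchSize) {
        List<Map<String, Object>> out = new ArrayList<>();
        int pending = 0;
        for (int i = 0; i < docCount; i++) {
            pending++;
            if (pending == batchSize) {
                out.add(summary(pending)); // full batch flushed
                pending = 0;
            }
        }
        if (pending > 0) {
            out.add(summary(pending)); // partial batch gets its own tuple
        }
        Map<String, Object> eof = new LinkedHashMap<>();
        eof.put("EOF", true); // EOF tuple carries no other fields
        out.add(eof);
        return out;
    }

    public static void main(String[] args) {
        // prints [{docsUploaded=4}, {docsUploaded=3}, {EOF=true}]
        System.out.println(summarize(7, 4));
    }
}
```

With the count in its own tuple, a wrapper that replaces the EOF tuple with a 
clean one no longer drops any information.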

2.) ParallelStream works by providing {{partitionKeys}} to the underlying 
searches.  However, this doesn't work with UpdateStream, which goes to the 
/update handler, not the /search handler.  Since there's no partitioning, the 
same update gets run twice, putting two copies of the docs in the collection 
used by update().
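
To illustrate the duplication, here is a rough sketch that counts how many 
documents get indexed when every worker either sees the whole input or only 
its own partition.  The hash-modulo rule below is an assumption made for 
illustration and is not taken from the Solr source; all names are hypothetical.

```java
import java.util.List;

// Hypothetical model of N workers each running the same update; not Solr code.
class PartitionSketch {

    // Assumed partitioning rule: a key belongs to exactly one worker.
    static boolean ownedBy(String key, int workerId, int numWorkers) {
        return Math.floorMod(key.hashCode(), numWorkers) == workerId;
    }

    // Counts docs indexed across all workers, with or without partitioning.
    static int totalIndexed(List<String> ids, int numWorkers, boolean partitioned) {
        int total = 0;
        for (int worker = 0; worker < numWorkers; worker++) {
            for (String id : ids) {
                if (!partitioned || ownedBy(id, worker, numWorkers)) {
                    total++; // this worker would send the doc to the collection
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("a", "b", "c");
        System.out.println(totalIndexed(ids, 2, false)); // prints 6
        System.out.println(totalIndexed(ids, 2, true));  // prints 3
    }
}
```

Without a partition filter reaching the stream that feeds update(), each of 
the two workers indexes all three docs, which matches the duplicates seen in 
the test.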



I didn't really anticipate running into any major problems in using 
ParallelStream with UpdateStream, but it looks to me like ParallelStream is 
only really appropriate for wrapping searches, not updates.  (This reminds me a 
bit of Dennis' comments above about ReadStreams and WriteStreams).  Am I 
interpreting this incorrectly?

Running out of the house now, but I'll be back shortly to look at this again.  
Sorry if my notes above are a bit rough.  I'm jotting them down half so I 
remember where I was, and I haven't really thought through things as well as I 
would've liked yet.

> Add UpdateStream to Streaming API and Streaming Expression
> ----------------------------------------------------------
>
>                 Key: SOLR-7535
>                 URL: https://issues.apache.org/jira/browse/SOLR-7535
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java, SolrJ
>            Reporter: Joel Bernstein
>            Priority: Minor
>         Attachments: SOLR-7535.patch, SOLR-7535.patch
>
>
> The ticket adds an UpdateStream implementation to the Streaming API and 
> streaming expressions. The UpdateStream will wrap a TupleStream and send the 
> Tuples it reads to a SolrCloud collection to be indexed.
> This will allow users to pull data from different SolrCloud collections, 
> merge and transform the streams and send the transformed data to another 
> SolrCloud collection.



