[ https://issues.apache.org/jira/browse/SOLR-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076381#comment-15076381 ]
Jason Gerlowski commented on SOLR-7535:
---------------------------------------

+1 for adding fault tolerance, but for doing so under a separate JIRA ticket. This is something that probably needs to be thought about across the board.

Additionally, I just wanted to post an update on my progress here. I sat down this morning to work on tests for UpdateStream. The simple cases all seem to work fine. I _did_ run into two issues, though, when I tried to write a test that combined ParallelStream and UpdateStream (i.e. parallel(a, update(b... ):

1.) Currently, if UpdateStream reaches EOF mid-batch, it sends out an EOF tuple that also contains a "docsUploaded" field. But ParallelStream currently swallows this tuple and spits out a completely clean EOF tuple. (It seems a few Stream types expect that EOF tuples don't have any substantive fields.) This shouldn't be hard to fix. I can just change UpdateStream to emit EOF after emitting a tuple with the partial batch, i.e. instead of {{{EOF:true docsUploaded:3}}}, just return {{{docsUploaded:3}}} followed by {{{EOF:true}}}.

2.) ParallelStream works by providing {{partitionKeys}} to the underlying searches. However, this doesn't work with UpdateStream, which goes to the /update handler, not the /search handler. Since there's no partitioning, each worker runs the same update, so with two workers the update runs twice and puts two copies of the docs in the collection used by update().

I didn't really anticipate running into any major problems in using ParallelStream with UpdateStream, but it looks to me like ParallelStream is only really appropriate for wrapping searches, not updates. (This reminds me a bit of Dennis' comments above about ReadStreams and WriteStreams.) Am I interpreting this incorrectly?

Running out of the house now, but I'll be back shortly to look at this again. Sorry if my notes above are a bit rough. I'm jotting them down half so I remember where I was, and I haven't really thought things through as well as I would've liked yet.
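A minimal sketch of the fix proposed in item 1, for illustration only. It is not the code in SOLR-7535.patch: the class name {{UpdateBatchSketch}} and the method {{emit}} are hypothetical, tuples are modeled as plain maps rather than Solr's Tuple class, and nothing is actually sent to a collection. It also assumes that every flushed batch produces its own {{docsUploaded}} tuple. The point is only the ordering: a partial batch is reported in its own tuple, and the EOF tuple that follows carries no other fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for UpdateStream's batching logic (assumption, not Solr code).
final class UpdateBatchSketch {

    // Returns the tuples a wrapping stream would read for docCount source documents.
    static List<Map<String, Object>> emit(int docCount, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Map<String, Object>> out = new ArrayList<>();
        int pending = 0;
        for (int i = 0; i < docCount; i++) {
            pending++;
            if (pending == batchSize) {
                // Full batch: report it and start a new one.
                out.add(summary(pending));
                pending = 0;
            }
        }
        if (pending > 0) {
            // Source ended mid-batch: report the partial batch in its own tuple.
            out.add(summary(pending));
        }
        // The EOF tuple stays clean, so wrappers that replace it lose nothing.
        Map<String, Object> eof = new LinkedHashMap<>();
        eof.put("EOF", true);
        out.add(eof);
        return out;
    }

    private static Map<String, Object> summary(int uploaded) {
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("docsUploaded", uploaded);
        return tuple;
    }
}
```

Calling {{UpdateBatchSketch.emit(7, 4)}} yields three tuples: one reporting 4 docs, one reporting the remaining 3, and a final tuple containing only the EOF marker.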
> Add UpdateStream to Streaming API and Streaming Expression
> ----------------------------------------------------------
>
>                 Key: SOLR-7535
>                 URL: https://issues.apache.org/jira/browse/SOLR-7535
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java, SolrJ
>            Reporter: Joel Bernstein
>            Priority: Minor
>         Attachments: SOLR-7535.patch, SOLR-7535.patch
>
>
> The ticket adds an UpdateStream implementation to the Streaming API and
> streaming expressions. The UpdateStream will wrap a TupleStream and send the
> Tuples it reads to a SolrCloud collection to be indexed.
> This will allow users to pull data from different Solr Cloud collections,
> merge and transform the streams and send the transformed data to another Solr
> Cloud collection.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)