> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 36
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line36>
> >
> >     This paragraph seems contradictory -- does Spark guarantee ordering or 
> > not? And what do you mean by "is not emphasized in the document"?
> >     
> >     My understanding is that Spark's transformation operators must be 
> > side-effect-free, so the order in which batches are processed is 
> > irrelevant. When one batch depends on the output of a previous batch (e.g. 
> > a window-based operation), Spark Streaming guarantees that the correct 
> > previous batch is used as input to the subsequent batch (which is 
> > effectively ordering, even if some of the execution may actually happen in 
> > parallel).
> >     
> >     I'm not sure about ordering of output operations (which may have 
> > side-effects).
> >     
> >     Another thing -- I believe Spark Streaming requires transformation 
> > operators to be deterministic. Is that true? If so, it would be worth 
> > mentioning, because 
> >     that may make it unsuitable for nondeterministic processing, e.g. a 
> > randomized machine learning algorithm. Samza has no such requirement.
> >     
> >     "Spark Streaming supports at-least once messaging semantics": you say 
> > below that Spark Streaming may lose messages if the receiver task fails. If 
> > this is the case, the guarantee is neither at-least-once nor at-most-once, 
> > but more like zero-or-more-times.
> 
> Yan Fang wrote:
>     When I say "is not emphasized in the document", I mean that I could not 
> find relevant documentation. From my tests, the order of messages within one 
> DStream seems to be guaranteed, but if you combine several DStreams during 
> processing, no order is guaranteed. 
>     
>     You are right: transformation operations are side-effect-free, and output 
> operations (should) have the side-effects. Also, transformations are only 
> executed after an output operation is called (because of lazy evaluation). 
>     
>     So I am being a little conservative about the order of messages in Spark 
> Streaming, in case I write something wrong.
>     
>     Yes, transformation operators must be deterministic, because the 
> operations are only applied to a deterministic stream. I will mention that.
>     
>     They do lose data, and they are working on that: 
> https://issues.apache.org/jira/browse/SPARK-1730, 
> https://issues.apache.org/jira/browse/SPARK-1647. The Kafka situation is a 
> little unusual: because of the consumer offset, it does not lose data, but it 
> processes too many messages in the first interval (say, 2s) when you bring 
> the receiver back up after the failure. Maybe I should mention that as well?

"not emphasized": maybe say that in Spark, since messages are processed in 
batches by side-effect-free operators, the exact ordering of messages is not 
important in Spark.
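To make that concrete: since transformations are side-effect-free and applied 
batch-at-a-time, each batch's result is independent of when (or in what order) 
the batches are executed. A toy Python model, where a DStream is just a list 
of micro-batches (this is not the Spark API, only an illustration):

```python
# Model a DStream as a list of micro-batches (each batch is a list of records).
dstream = [[1, 2, 3], [4, 5], [6]]

def transform(batch):
    # Side-effect-free transformation: the output depends only on the input
    # batch, so batches could be executed in any order (or in parallel) and
    # each batch's result would be identical.
    return [x * 10 for x in batch]

# Processing the batches in order...
in_order = [transform(b) for b in dstream]

# ...gives the same per-batch results as processing them in reverse order.
reversed_results = [transform(b) for b in reversed(dstream)]
assert in_order == list(reversed(reversed_results))
```

Order within each batch is preserved here, but nothing constrains the order 
in which batches themselves are processed.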

Good find on the data loss issues, I'd suggest linking to SPARK-1647. I don't 
understand the issue with Kafka. When it comes back after a failure, does it 
start consuming from the latest offset, or some older offset?


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 42
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line42>
> >
> >     Does this state DStream provide any key-value access or other query 
> > model? If it's just a stream of records, that would imply that every time a 
> > batch of input records is processed, the stream processor also needs to 
> > consume the entire state DStream. That's fine if the state is small, but 
> > with a large amount of state (multiple GB), it would probably get very 
> > inefficient. If this is true, it would further support our "Samza is good 
> > if you have lots of state" story.
> >     
> >     Also: you don't mention anything about stream joins in this comparison. 
> > I see Spark has a join operator -- do you know what it does? Does it just 
> > take one batch from each input stream and join within those batches? Or can 
> > you do joins across a longer window, or against a table?
> >     
> >     Since joins typically involve large amounts of state, they are worth 
> > highlighting as an area where Samza may be stronger.
> >     
> >     "Everytime updateStateByKey is applied, you will get a new state 
> > DStream": presumably you get a new DStream once per batch, not for every 
> > single message within a batch?
> 
> Yan Fang wrote:
>     AFAIK, there are no other access methods; I will update this when I know 
> more. It's a little interesting in Spark Streaming: it seems to only update 
> the state of keys that appear in the current time interval (because 
> updateStateByKey is only called once per time interval). So maybe there is no 
> concern about "consuming the entire state DStream"; instead, the concern is 
> "how can I change the previous state and other keys' state?". I am asking the 
> community about this now.
>     
>     "join" is a little tricky. You can join two DStreams in the same time 
> interval, meaning that, you can join two batches received from the same time 
> interval but can not join two DStreams that have different time intervals, 
> such as a realtime batch and a window batch.
>     
>     Yes, once per batch, not per message. I will emphasize this.
> 
> Yan Fang wrote:
>     Update:
>     
>     My statement above -- that the state is only updated for keys that 
> appear in the current time interval -- is wrong.
>     
>     You were right: every time, the stream processor needs to consume the 
> entire state DStream, so Spark Streaming is inefficient when the state is 
> very big. They are working on this: 
> https://issues.apache.org/jira/browse/SPARK-2365. I will mention this in the 
> updated version of the doc.

Ok, good. I'd suggest explicitly mentioning the join limitation, as well.
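The SPARK-2365 inefficiency is easy to see in a toy model of updateStateByKey 
semantics (plain Python, not the Spark API): each interval produces a brand-new 
state, and building it visits every existing key, even keys with no new input.

```python
def update_state_by_key(state, batch, update_fn):
    """Toy model of updateStateByKey semantics: each batch yields a *new*
    state map, built by visiting every existing key plus every key in the
    batch -- so the cost per interval grows with the total state size,
    not with the batch size."""
    new_values = {}
    for key, value in batch:
        new_values.setdefault(key, []).append(value)
    all_keys = set(state) | set(new_values)  # the entire state is consumed
    return {k: update_fn(new_values.get(k, []), state.get(k)) for k in all_keys}

# A running count per key, as in the classic stateful word-count example.
def running_count(new_vals, old_count):
    return (old_count or 0) + sum(new_vals)

state = {}
for batch in [[("a", 1), ("b", 1)], [("a", 1)]]:
    state = update_state_by_key(state, batch, running_count)

# "b" is revisited in the second interval even though it received no input.
assert state == {"a": 2, "b": 1}
```

With multiple gigabytes of state, visiting every key on every interval is 
exactly the inefficiency that makes large state a better fit for Samza.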


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 50
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line50>
> >
> >     "send them to executors" -> "sending them to executors"
> >     
> >     "the parallelism is simply accomplished by normal RDD operations, such 
> > as map, reduceByKey, reduceByWindow" -- this is unclear. How are these 
> > operations parallelized? I assume each RDD batch is processed sequentially 
> > by a single processing task, but different batches can be processed on 
> > different machines. Is that right?
> >     
> >     If your input stream is partitioned (i.e. there are multiple 
> > receivers), does each batch include only messages from a single partition, 
> > or are all the partitions combined into a single batch? Can you repartition 
> > a stream, e.g. when grouping on a field, to ensure that all messages with 
> > the same value in that field get grouped together? (akin to the shuffle 
> > phase in MapReduce)
> >     
> >     How does partitioning play into the ordering guarantees? Even if each 
> > partition is ordered, there typically isn't a deterministic ordering of 
> > messages across partitions; how does this interact with Spark's determinism 
> > requirement for operators?
> 
> Yan Fang wrote:
>     "How are these operations parallelized? I assume each RDD batch is 
> processed sequentially by a single processing task, but different batches can 
> be processed on different machines. Is that right?"
>       I am not quite sure; I will update when I know.
>     
>     "If your input stream is partitioned (i.e. there are multiple receivers), 
> does each batch include only messages from a single partition, or are all the 
> partitions combined into a single batch? Can you repartition a stream, e.g. 
> when grouping on a field, to ensure that all messages with the same value in 
> that field get grouped together? (akin to the shuffle phase in MapReduce)"
>     
>     Actually, when you partition the stream, every partition is a DStream, 
> and everything you do afterwards is based on DStreams. So if you want a 
> "whole" stream containing all the partitions, you use "union" to put them 
> together into one DStream. "Group by" is then effectively provided by 
> "reduceByKey". Spark itself has groupByKey, but it is not supported in Spark 
> Streaming.
>     
>     Yes, you are right: there is no deterministic ordering of messages 
> across partitions. I am not quite sure about "Spark's determinism 
> requirement for operators" -- what does that mean?

"Spark's determinism requirement for operators": I mean that a Spark operator 
is required to always produce the same output when given the same input. 
Question is: if an operator is given the same set of input messages in a 
different order, is it still required to produce the same output? If yes, and 
if the order of input messages to an operator is not guaranteed, then that puts 
a limitation on what kinds of operator you can implement.

What I'm getting at here: if you want to write a custom Spark operator, it 
seems that you have to obey a lot of constraints in order to satisfy the 
framework's assumptions. IMHO Samza gives you a lot more freedom to implement 
the logic that you want.
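Yan's union-then-reduceByKey description can be modelled in plain Python 
(again, not the real API), and the model also illustrates the determinism 
point: with no ordering guarantee across partitions, an operator only produces 
deterministic output if, like this reduce, it is insensitive to input order.

```python
def union(*partition_batches):
    # Combine one batch from each partition DStream into a single batch.
    # The interleaving of partitions is not guaranteed.
    return [record for batch in partition_batches for record in batch]

def reduce_by_key(batch, fn):
    # Fold all values sharing a key into one value, akin to the shuffle +
    # reduce phase in MapReduce.
    result = {}
    for key, value in batch:
        result[key] = fn(result[key], value) if key in result else value
    return result

p0 = [("a", 1), ("b", 2)]  # batch from partition 0
p1 = [("a", 3)]            # batch from partition 1

# Because addition is commutative and associative, the output is the same
# regardless of the (unguaranteed) order in which partitions are combined.
add = lambda x, y: x + y
assert reduce_by_key(union(p0, p1), add) == reduce_by_key(union(p1, p0), add)
```

An operator whose output depended on the interleaving (e.g. "first value 
wins") would violate the determinism assumption under this model.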


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 66
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line66>
> >
> >     "reprocess the same data from since data was processed" -- sentence 
> > seems to have too many words in it?
> 
> Yan Fang wrote:
>     I think you know what I mean here... I'm just not sure how to phrase it 
> less verbosely... *_*

I know what you mean, it's just copyediting to make it easy for readers to 
understand. Here's a suggestion:

"When a Samza job recovers from a failure, it's possible that it will process 
some data more than once. This happens because the job restarts at the last 
checkpoint, and any messages that had been processed between that checkpoint 
and the failure are processed again. The amount of reprocessed data can be 
minimized by setting a small checkpoint interval."
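The recovery behaviour in that paragraph can be sketched as a toy model 
(hypothetical function and names, not the actual Samza checkpointing API):

```python
def process_with_restart(messages, checkpoint_interval, fail_at):
    """Toy model of checkpoint-based recovery: on failure, restart from the
    last checkpointed offset, so messages processed between the checkpoint
    and the failure are processed again (at-least-once semantics)."""
    processed = []
    checkpoint = 0
    offset = 0
    while offset < len(messages):
        if offset == fail_at:
            offset = checkpoint   # crash: resume from the last checkpoint
            fail_at = None        # only fail once in this model
            continue
        processed.append(messages[offset])
        offset += 1
        if offset % checkpoint_interval == 0:
            checkpoint = offset   # commit a checkpoint
    return processed

# With a checkpoint every 2 messages and a crash before message index 3,
# the message at index 2 is processed twice; a smaller checkpoint interval
# bounds how much is reprocessed.
out = process_with_restart(["m0", "m1", "m2", "m3"], 2, 3)
assert out == ["m0", "m1", "m2", "m2", "m3"]
```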


- Martin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/#review47895
-----------------------------------------------------------


On July 15, 2014, 6:15 p.m., Yan Fang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23358/
> -----------------------------------------------------------
> 
> (Updated July 15, 2014, 6:15 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Comparison of Spark Streaming and Samza
> 
> 
> Diffs
> -----
> 
>   docs/learn/documentation/0.7.0/comparisons/spark-streaming.md PRE-CREATION 
>   docs/learn/documentation/0.7.0/comparisons/storm.md 4a21094 
>   docs/learn/documentation/0.7.0/index.html 149ff2b 
> 
> Diff: https://reviews.apache.org/r/23358/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Yan Fang
> 
>
