> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 36
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line36>
> >
> >     This paragraph seems contradictory -- does Spark guarantee ordering or 
> > not? And what do you mean by "is not emphasized in the document"?
> >     
> >     My understanding is that Spark's transformation operators must be 
> > side-effect-free, so the order in which batches are processed is 
> > irrelevant. When one batch depends on the output of a previous batch (e.g. 
> > a window-based operation), Spark Streaming guarantees that the correct 
> > previous batch is used as input to the subsequent batch (which is 
> > effectively ordering, even if some of the execution may actually happen in 
> > parallel).
> >     
> >     I'm not sure about ordering of output operations (which may have 
> > side-effects).
> >     
> >     Another thing -- I believe Spark Streaming requires transformation 
> > operators to be deterministic. Is that true? If so, it would be worth 
> > mentioning, because that may make it unsuitable for nondeterministic 
> > processing, e.g. a randomized machine learning algorithm. Samza has no 
> > such requirement.
> >     
> >     "Spark Streaming supports at-least once messaging semantics": you say 
> > below that Spark Streaming may lose messages if the receiver task fails. If 
> > this is the case, the guarantee is neither at-least-once nor at-most-once, 
> > but more like zero-or-more-times.
> 
> Yan Fang wrote:
>     When I say "is not emphasized in the document", I mean that I could not 
> find any relevant documentation. From my tests, the message order within one 
> DStream seems to be guaranteed, but if you combine several DStreams during 
> processing, no order is guaranteed. 
>     
>     You are right: transformation operations are side-effect-free, and 
> output operations (should) have the side effects. All transformation 
> operations only happen after an output operation is called, because of lazy 
> evaluation. 
>     
>     So I am a little conservative about message ordering in Spark 
> Streaming, in case I write something wrong.
>     
>     Yes, transformation operators are required to be deterministic, because 
> you only apply the operations to a deterministic stream. I will mention that.
>     
>     They do lose data, and they are working on that: 
> https://issues.apache.org/jira/browse/SPARK-1730, 
> https://issues.apache.org/jira/browse/SPARK-1647. The Kafka situation is a 
> little odd: because of the consumer offset, it does not lose data, but it 
> processes too many messages in the first interval (say, 2s) when you bring 
> the receiver back up after the failure. Maybe I should mention that as well?
> 
> Martin Kleppmann wrote:
>     "not emphasized": maybe say that in Spark, since messages are processed 
> in batches by side-effect-free operators, the exact ordering of messages is 
> not important in Spark.
>     
>     Good find on the data loss issues; I'd suggest linking to SPARK-1647. I 
> don't understand the issue with Kafka. When it comes back after a failure, 
> does it start consuming from the latest offset, or some older offset?

In terms of Kafka, when Spark Streaming restarts, it resumes from the offset 
at which it failed. That means that if Spark Streaming is using Kafka as the 
input stream, it will not lose data in a receiver/driver failure scenario. 
However, since many unprocessed messages accumulate in Kafka (it does not 
consume any data during the failure period), it will consume all of the 
unprocessed messages in the first interval. After that, it returns to the 
normal situation, consuming at the same rate at which the data arrives. There 
is now a patch, https://issues.apache.org/jira/browse/SPARK-1341, to control 
the rate.

But it does lose data for sure if it is using Flume or Twitter as the input 
stream.
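
For illustration, here is a minimal sketch of capping the consumption rate. 
It assumes the "spark.streaming.receiver.maxRate" setting that the SPARK-1341 
patch introduces; the app name, batch interval, rate value, ZooKeeper 
address, group id, and topic are all placeholders:

    // Cap how many records per second each receiver pushes into a batch,
    // so that the first interval after a restart is not flooded with the
    // backlogged Kafka messages. Assumes the setting from SPARK-1341;
    // the ZooKeeper address, group id, and topic are placeholders.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf()
      .setAppName("kafka-restart-example")
      .set("spark.streaming.receiver.maxRate", "10000") // records/sec per receiver

    val ssc = new StreamingContext(conf, Seconds(2))
    val lines = KafkaUtils.createStream(
      ssc, "zk-host:2181", "example-group", Map("example-topic" -> 1))

    lines.print()
    ssc.start()
    ssc.awaitTermination()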


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 50
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line50>
> >
> >     "send them to executors" -> "sending them to executors"
> >     
> >     "the parallelism is simply accomplished by normal RDD operations, such 
> > as map, reduceByKey, reduceByWindow" -- this is unclear. How are these 
> > operations parallelized? I assume each RDD batch is processed sequentially 
> > by a single processing task, but different batches can be processed on 
> > different machines. Is that right?
> >     
> >     If your input stream is partitioned (i.e. there are multiple 
> > receivers), does each batch include only messages from a single partition, 
> > or are all the partitions combined into a single batch? Can you repartition 
> > a stream, e.g. when grouping on a field, to ensure that all messages with 
> > the same value in that field get grouped together? (akin to the shuffle 
> > phase in MapReduce)
> >     
> >     How does partitioning play into the ordering guarantees? Even if each 
> > partition is ordered, there typically isn't a deterministic ordering of 
> > messages across partitions; how does this interact with Spark's determinism 
> > requirement for operators?
> 
> Yan Fang wrote:
>     "How are these operations parallelized? I assume each RDD batch is 
> processed sequentially by a single processing task, but different batches can 
> be processed on different machines. Is that right?"
>       I am not quite sure; I will update this when I know.
>     
>     "If your input stream is partitioned (i.e. there are multiple receivers), 
> does each batch include only messages from a single partition, or are all the 
> partitions combined into a single batch? Can you repartition a stream, e.g. 
> when grouping on a field, to ensure that all messages with the same value in 
> that field get grouped together? (akin to the shuffle phase in MapReduce)"
>     
>     Actually, when you partition the stream, every partition is its own 
> DStream, and whatever you want to do operates on that DStream. So when you 
> want a "whole" stream containing all the partitions, you use "union" to put 
> them together into a single DStream. A "group by" is then effectively 
> provided by "reduceByKey" (core Spark has groupByKey, but it is not 
> supported in Spark Streaming); a sketch appears at the end of this mail.
>     
>     Yes, you are right: there is no deterministic ordering of messages 
> across partitions. I am not quite sure about "Spark's determinism 
> requirement for operators" -- what does this mean?
> 
> Martin Kleppmann wrote:
>     "Spark's determinism requirement for operators": I mean that a Spark 
> operator is required to always produce the same output when given the same 
> input. Question is: if an operator is given the same set of input messages in 
> a different order, is it still required to produce the same output? If yes, 
> and if the order of input messages to an operator is not guaranteed, then 
> that puts a limitation on what kinds of operator you can implement.
>     
>     What I'm getting at here: if you want to write a custom Spark operator, 
> it seems that you have to obey a lot of constraints in order to satisfy the 
> framework's assumptions. IMHO Samza gives you a lot more freedom to implement 
> the logic that you want.

I am not sure if this answers your question. In Spark Streaming, when the 
same set of input messages arrives in a different order, it is treated as 
different input data, so the output will not be the same.
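
As a small illustration of why this matters (plain Scala, no Spark APIs; 
purely a hypothetical example): a reduce function that is not commutative 
gives a different result when the same set of inputs arrives in a different 
order, so such an operator is only safe where input order is guaranteed.

    // String concatenation is associative but not commutative, so the
    // result depends on the order in which the same elements arrive.
    val inOrder   = Seq("a", "b", "c").reduce(_ + _) // "abc"
    val reordered = Seq("c", "a", "b").reduce(_ + _) // "cab"
    assert(inOrder != reordered)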

I agree with you that Spark has a lot of constraints and that Samza is more 
flexible. But Spark Streaming gives users the option to reuse their batch 
processing code. 
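
As a concrete version of the union/reduceByKey pattern mentioned above, here 
is a minimal sketch. The two socket sources, hosts, port, and the word-count 
logic are illustrative assumptions, not taken from the document under review:

    // Combine several receiver DStreams with union, then "group by" key
    // via reduceByKey. The hosts, port, and counting logic are made up.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(
      new SparkConf().setAppName("union-example"), Seconds(2))

    val partA = ssc.socketTextStream("host-a", 9999) // one partition/receiver
    val partB = ssc.socketTextStream("host-b", 9999) // another partition/receiver

    val whole = partA.union(partB) // one DStream covering all the partitions
    val counts = whole.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()
    ssc.start()
    ssc.awaitTermination()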


- Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/#review47895
-----------------------------------------------------------


On July 15, 2014, 6:15 p.m., Yan Fang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23358/
> -----------------------------------------------------------
> 
> (Updated July 15, 2014, 6:15 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Comparison of Spark Streaming and Samza
> 
> 
> Diffs
> -----
> 
>   docs/learn/documentation/0.7.0/comparisons/spark-streaming.md PRE-CREATION 
>   docs/learn/documentation/0.7.0/comparisons/storm.md 4a21094 
>   docs/learn/documentation/0.7.0/index.html 149ff2b 
> 
> Diff: https://reviews.apache.org/r/23358/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Yan Fang
> 
>
