> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 36
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line36>
> >
> >     This paragraph seems contradictory -- does Spark guarantee ordering or 
> > not? And what do you mean with "is not emphasized in the document"?
> >     
> >     My understanding is that Spark's transformation operators must be 
> > side-effect-free, so the order in which batches are processed is 
> > irrelevant. When one batch depends on the output of a previous batch (e.g. 
> > a window-based operation), Spark Streaming guarantees that the correct 
> > previous batch is used as input to the subsequent batch (which is 
> > effectively ordering, even if some of the execution may actually happen in 
> > parallel).
> >     
> >     I'm not sure about ordering of output operations (which may have 
> > side-effects).
> >     
> >     Another thing -- I believe Spark Streaming requires transformation 
> > operators to be deterministic. Is that true? If so, it would be worth 
> > mentioning, because 
> >     that may make it unsuitable for nondeterministic processing, e.g. a 
> > randomized machine learning algorithm. Samza has no such requirement.
> >     
> >     "Spark Streaming supports at-least once messaging semantics": you say 
> > below that Spark Streaming may lose messages if the receiver task fails. If 
> > this is the case, the guarantee is neither at-least-once nor at-most-once, 
> > but more like zero-or-more-times.

When I say "is not emphasized in the document", mean that I could not find 
relevant documents. From my test, the messages order in one DStream seems 
guaranteed. But if you combine some DStreams in the process, no order is 
guaranteed. 

You are right: transformation operations are side-effect-free, and output 
operations are the ones that (should) have side effects. Also, all 
transformation operations only actually happen after output operations are 
called, because of lazy evaluation.
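
To make this concrete, here is a rough sketch (the socket source and the word 
count are just my illustrative example, not from the comparison doc): the 
transformations only build up the DStream lineage, and nothing runs until an 
output operation is registered and the context is started.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("LazyExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations: side-effect-free and lazy. These only build up the
    // DStream lineage; no processing happens here.
    val words  = lines.flatMap(_.split(" "))
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    // Output operation: this is where side effects live. A DStream with no
    // output operation is never computed at all.
    counts.print()

    ssc.start()            // execution actually begins here
    ssc.awaitTermination()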

So I am being a little conservative about message ordering in Spark Streaming, 
in case I write something wrong.

yes for "transformation operators to be deterministic". Because you only apply 
the operations in a deterministic stream. Will mention that.
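
For example (my own hypothetical case, continuing the sketch above): since 
Spark recovers a lost partition by recomputing it from the lineage, a 
nondeterministic operator can silently produce different results after 
recovery.

    import scala.util.Random

    // Hypothetical nondeterministic transformation: attach a random score to
    // each word. If a partition is lost and recomputed from the lineage, the
    // recomputed scores will differ from the originals, which breaks Spark's
    // recomputation-based fault-tolerance model.
    val scored = words.map(word => (word, Random.nextDouble()))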

They do lose data, and are working on fixing that: 
https://issues.apache.org/jira/browse/SPARK-1730, 
https://issues.apache.org/jira/browse/SPARK-1647. The Kafka case is a little 
odd: because of the consumer offset, it does not lose data, but when you bring 
the receiver back up after a failure it may process too many messages in the 
first interval, say the first 2s. Maybe I should mention that as well?


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 44
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line44>
> >
> >     Typo: "intermedia"
> >     
> >     Is this periodic writing of state to HDFS basically a form of 
> > checkpointing? If so, it might be worth calling it such.
> >     
> >     The state management page of the docs also calls out that checkpointing 
> > the entire task state is inefficient if the state is large. Could do a 
> > cross-reference.

Yes. To quote the mailing list: "After every checkpointing interval, the latest 
state RDD is stored to HDFS in its entirety. Along with that, the series of 
DStream transformations that was setup with the streaming context is also 
stored into HDFS (the whole DAG of DStream objects is serialized and saved)."
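
For reference, turning this on is a one-liner (the HDFS path here is a made-up 
placeholder):

    // Tell Spark Streaming to periodically checkpoint the state RDDs and the
    // serialized DStream DAG to HDFS.
    ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")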


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 42
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line42>
> >
> >     Does this state DStream provide any key-value access or other query 
> > model? If it's just a stream of records, that would imply that every time a 
> > batch of input records is processed, the stream processor also needs to 
> > consume the entire state DStream. That's fine if the state is small, but 
> > with a large amount of state (multiple GB), it would probably get very 
> > inefficient. If this is true, it would further support our "Samza is good 
> > if you have lots of state" story.
> >     
> >     Also: you don't mention anything about stream joins in this comparison. 
> > I see Spark has a join operator -- do you know what it does? Does it just 
> > take one batch from each input stream and join within those batches? Or can 
> > you do joins across a longer window, or against a table?
> >     
> >     Since joins typically involve large amounts of state, they are worth 
> > highlighting as an area where Samza may be stronger.
> >     
> >     "Everytime updateStateByKey is applied, you will get a new state 
> > DStream": presumably you get a new DStream once per batch, not for every 
> > single message within a batch?

AFAIK, there are no other access methods; I will update this when I know more. 
It is a little interesting in Spark Streaming: it seems to only update the 
state of keys that appear in the current time interval (because 
updateStateByKey is only called once per time interval). So maybe there is no 
concern about "consuming the entire state DStream"; instead, the concern is 
"how can I change the previous state, and the state of other keys?". I am 
asking the community about this now.
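
A minimal sketch of the API as I understand it (the running word count reuses 
the hypothetical pairs stream from the earlier sketch; a checkpoint directory 
is required for stateful operations):

    // updateStateByKey is invoked once per batch interval. For each key, the
    // update function receives the values seen in this batch plus the previous
    // state, and returns the new state. The result is a new state DStream per
    // batch, not per message.
    val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + newValues.sum)

    ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints") // required for stateful ops
    val runningCounts = pairs.updateStateByKey[Int](updateFunc)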

"join" is a little tricky. You can join two DStreams in the same time interval, 
meaning that, you can join two batches received from the same time interval but 
can not join two DStreams that have different time intervals, such as a 
realtime batch and a window batch.
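
A sketch of the case that does work (the click/impression streams are 
hypothetical; both DStreams come from the same StreamingContext, so they share 
its batch interval):

    import org.apache.spark.streaming.StreamingContext._ // pair-DStream ops such as join
    import org.apache.spark.streaming.dstream.DStream

    // Two keyed DStreams with the same batch interval. join matches up the two
    // RDDs produced in each interval, so records are only joined against
    // records that arrived in the same batch.
    val clicks: DStream[(String, String)] =
      ssc.socketTextStream("localhost", 9999).map(l => (l.split(",")(0), l))
    val impressions: DStream[(String, String)] =
      ssc.socketTextStream("localhost", 9998).map(l => (l.split(",")(0), l))

    val joined: DStream[(String, (String, String))] = clicks.join(impressions)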

Yes, once per batch, not for single message. will emphasize this.


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 50
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line50>
> >
> >     "send them to executors" -> "sending them to executors"
> >     
> >     "the parallelism is simply accomplished by normal RDD operations, such 
> > as map, reduceByKey, reduceByWindow" -- this is unclear. How are these 
> > operations parallelized? I assume each RDD batch is processed sequentially 
> > by a single processing task, but different batches can be processed on 
> > different machines. Is that right?
> >     
> >     If your input stream is partitioned (i.e. there are multiple 
> > receivers), does each batch include only messages from a single partition, 
> > or are all the partitions combined into a single batch? Can you repartition 
> > a stream, e.g. when grouping on a field, to ensure that all messages with 
> > the same value in that field get grouped together? (akin to the shuffle 
> > phase in MapReduce)
> >     
> >     How does partitioning play into the ordering guarantees? Even if each 
> > partition is ordered, there typically isn't a deterministic ordering of 
> > messages across partitions; how does this interact with Spark's determinism 
> > requirement for operators?

"How are these operations parallelized? I assume each RDD batch is processed 
sequentially by a single processing task, but different batches can be 
processed on different machines. Is that right?"
  I am not quite sure; I will update this when I know.

"If your input stream is partitioned (i.e. there are multiple receivers), does 
each batch include only messages from a single partition, or are all the 
partitions combined into a single batch? Can you repartition a stream, e.g. 
when grouping on a field, to ensure that all messages with the same value in 
that field get grouped together? (akin to the shuffle phase in MapReduce)"

Actually, when you partition the stream, every partition is its own DStream, 
and whatever you want to do is then based on those DStreams. So if you want a 
"whole" stream containing all the partitions, you use "union" to put them all 
together into one DStream. "Group by" is then provided, to some extent, by 
"reduceByKey". Core Spark has groupByKey, but it is not supported in Spark 
Streaming.
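
Roughly like this (the Kafka parameters are made-up placeholders; KafkaUtils 
comes from the spark-streaming-kafka module):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // One receiver (and hence one DStream) per consumer, then union them into
    // a single DStream; reduceByKey then shuffles by key, playing the role of
    // "group by".
    val numReceivers = 3
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(kafkaStreams)
    val counts = unified
      .map { case (_, message) => (message, 1) }
      .reduceByKey(_ + _)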

Yes, you are right: there is no deterministic ordering of messages across 
partitions. I am not quite sure about "Spark's determinism requirement for 
operators", though. What does this mean?


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 86
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line86>
> >
> >     Is this really true? I get the impression that with Spark Streaming you 
> > build an entire processing graph with a DSL API, and deploy that entire 
> > graph as one unit. The communication between the nodes in that graph (in 
> > the form of DStreams) is provided by the framework. In that way it looks 
> > similar to Storm.
> >     
> >     Samza is totally different -- each job is just a message-at-a-time 
> > processor, and there is no framework support for topologies. Output of a 
> > processing task always needs to go back to a message broker (e.g. Kafka).
> >     
> >     A positive consequence of Samza's design is that a job's output can be 
> > consumed by multiple unrelated jobs, potentially run by different teams, 
> > and those jobs are isolated from each other through Kafka's buffering. That 
> > is not the case with Storm's (and Spark Streaming's?) framework-internal 
> > streams.
> >     
> >     Although a Storm/Spark job could in principle write its output to a 
> > message broker, the framework doesn't really make this easy. It seems that 
> > Storm/Spark aren't intended to be used in a way where one topology's output is 
> > another topology's input. By contrast, in Samza, that mode of usage is 
> > standard.

Yes, I should update this part to emphasize the difference.


- Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/#review47895
-----------------------------------------------------------


On July 15, 2014, 6:15 p.m., Yan Fang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23358/
> -----------------------------------------------------------
> 
> (Updated July 15, 2014, 6:15 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Comparison of Spark Streaming and Samza
> 
> 
> Diffs
> -----
> 
>   docs/learn/documentation/0.7.0/comparisons/spark-streaming.md PRE-CREATION 
>   docs/learn/documentation/0.7.0/comparisons/storm.md 4a21094 
>   docs/learn/documentation/0.7.0/index.html 149ff2b 
> 
> Diff: https://reviews.apache.org/r/23358/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Yan Fang
> 
>
