Re: Review Request 23358: SAMZA-225

Yan Fang Thu, 17 Jul 2014 21:38:25 -0700


> On July 16, 2014, 8:36 p.m., Martin Kleppmann wrote:
> > docs/learn/documentation/0.7.0/comparisons/spark-streaming.md, line 42
> > <https://reviews.apache.org/r/23358/diff/3/?file=632630#file632630line42>
> >
> >     Does this state DStream provide any key-value access or other query 
> > model? If it's just a stream of records, that would imply that every time a 
> > batch of input records is processed, the stream processor also needs to 
> > consume the entire state DStream. That's fine if the state is small, but 
> > with a large amount of state (multiple GB), it would probably get very 
> > inefficient. If this is true, it would further support our "Samza is good 
> > if you have lots of state" story.
> >     
> >     Also: you don't mention anything about stream joins in this comparison. 
> > I see Spark has a join operator -- do you know what it does? Does it just 
> > take one batch from each input stream and join within those batches? Or can 
> > you do joins across a longer window, or against a table?
> >     
> >     Since joins typically involve large amounts of state, they are worth 
> > highlighting as an area where Samza may be stronger.
> >     
> >     "Everytime updateStateByKey is applied, you will get a new state 
> > DStream": presumably you get a new DStream once per batch, not for every 
> > single message within a batch?
> 
> Yan Fang wrote:
>     AFAIK, no other methods. will update when I know. It's a little 
> interesting in Spark Streaming. Seems it only updates the state of the keys 
> when the keys appear in this time interval. (because updateStateByKey only is 
> called every time interval). So maybe there is not concern of "consume the 
> entire state DStream", instead, the concern is "how can I change the previous 
> state and other key's state". asking this in the community now.
>     
>     "join" is a little tricky. You can join two DStreams in the same time 
> interval only: that is, you can join two batches received in the same time 
> interval, but you cannot join two DStreams that have different time 
> intervals, such as a real-time batch and a window batch.
>     
>     Yes, once per batch, not per message. Will emphasize this.

Update:

The following statement is wrong: "Seems it only updates the state of the keys 
when the keys appear in this time interval. (because updateStateByKey only is 
called every time interval).".

You were right. On every batch, the stream processor needs to consume the 
entire state DStream, so Spark Streaming becomes inefficient when the state is 
very large. They are working on this: 
https://issues.apache.org/jira/browse/SPARK-2365
I will mention this in the updated version of the doc.
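
To illustrate the point, here is a rough sketch in plain Python of what one 
batch interval looks like under this behaviour. It is not the Spark API: the 
names update_state_by_key and running_count are made up for illustration, and 
it assumes the update function is invoked for every key held in the state, 
including keys that received no new records in the batch.

```python
from collections import defaultdict


def update_state_by_key(state, batch, update_fn):
    """Simulate one batch interval of a stateful per-key update.

    state:     dict mapping key -> previous state
    batch:     list of (key, value) pairs received in this interval
    update_fn: called as update_fn(new_values, old_state); returns the
               new state, or None to drop the key
    Returns (new_state, number_of_update_fn_calls).
    """
    # Group this interval's records by key.
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)

    new_state = {}
    calls = 0
    # Visit every key in the old state as well as every key in the batch,
    # so the cost grows with the size of the state, not of the batch.
    for key in set(state) | set(grouped):
        calls += 1
        result = update_fn(grouped.get(key, []), state.get(key))
        if result is not None:
            new_state[key] = result
    return new_state, calls


def running_count(new_values, old_count):
    # Keys with no new records still pass through with an empty list.
    return (old_count or 0) + len(new_values)


state = {"a": 5, "b": 2, "c": 7}
batch = [("a", 1), ("d", 1), ("a", 1)]
state, calls = update_state_by_key(state, batch, running_count)
print(state, calls)
```

Only "a" and "d" appear in the batch, yet the update function runs four times, 
once for each of "a", "b", "c" and "d". With multiple GB of state, that full 
pass on every interval is where the inefficiency comes from.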

- Yan

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23358/#review47895
-----------------------------------------------------------

On July 15, 2014, 6:15 p.m., Yan Fang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/23358/
> -----------------------------------------------------------
> 
> (Updated July 15, 2014, 6:15 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Comparison of Spark Streaming and Samza
> 
> 
> Diffs
> -----
> 
>   docs/learn/documentation/0.7.0/comparisons/spark-streaming.md PRE-CREATION 
>   docs/learn/documentation/0.7.0/comparisons/storm.md 4a21094 
>   docs/learn/documentation/0.7.0/index.html 149ff2b 
> 
> Diff: https://reviews.apache.org/r/23358/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Yan Fang
> 
>
