Re: Idempotent count

2015-03-18 Thread Arush Kharbanda
Hi Binh,

The checkpoint stores the state as well as the unprocessed data. The state is
a subset of the records that you have aggregated so far.

The Spark Streaming programming guide is a good reference for checkpointing:

http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#checkpointing
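
To make "state" a bit more concrete, here is a rough sketch in plain Python
(no Spark needed). The thread does not say which operation builds the state;
this assumes an updateStateByKey-style update function that receives the new
values for a key plus the previous state for that key. The helper names
(step, update_count, update_or_expire) are made up for illustration. As far
as I understand the Spark behavior, returning None from the update function
drops the key, which is one way to keep the state from growing forever; the
expiry policy below is only an example.

```python
def update_count(new_values, running):
    """Fold one batch's values for a key into that key's running total."""
    return (running or 0) + sum(new_values)


def update_or_expire(new_values, running):
    """Hypothetical policy: drop a key once a batch brings no new values for it."""
    if not new_values:
        return None  # None means "forget this key"
    return (running or 0) + sum(new_values)


def step(state, batch, update):
    """Simulate one micro-batch: group values by key, then update every known key."""
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    next_state = {}
    # Keys already in the state are updated too, with an empty list of new values.
    for key in set(state) | set(grouped):
        result = update(grouped.get(key, []), state.get(key))
        if result is not None:
            next_state[key] = result
    return next_state


first = step({}, [("a", 1), ("b", 1), ("a", 1)], update_count)
kept_all = step(first, [("a", 1)], update_count)
expired = step(first, [("a", 1)], update_or_expire)
print(sorted(kept_all.items()))
print(sorted(expired.items()))
```

The point is that the state holds one aggregate per key rather than every raw
record, and its size depends on how many keys the update function keeps alive.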



-- 

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Idempotent count

2015-03-18 Thread Binh Nguyen Van
Hi Arush,

Thank you for answering!
When you say checkpoints hold metadata and data, what is the data? Is it
the data that is pulled from the input source, or is it the state?
If it is the state, is it all of the records that I have aggregated
since the beginning or only a subset of them? How can I limit the size of
the state that is kept in the checkpoint?

Thank you
-Binh


Re: Idempotent count

2015-03-18 Thread Arush Kharbanda
Hi

Yes, Spark Streaming is capable of stateful stream processing. Whether or
not a job keeps state is a way of classifying stream processing.
Checkpoints hold both metadata and data.

Thanks


On Wed, Mar 18, 2015 at 4:00 AM, Binh Nguyen Van binhn...@gmail.com wrote:

 Hi all,

 I am new to Spark, so please forgive me if my question is stupid.
 I am trying to use Spark Streaming in an application that reads data
 from a queue (Kafka), does some aggregation (sum, count, ...), and
 then persists the result to an external storage system (MySQL, VoltDB, ...).

 From my understanding of Spark Streaming, there are two ways
 of doing aggregation:

- Stateless: I don't keep state and just apply new delta
values to the external system. From my understanding, doing it this way I
may end up overcounting when there is a failure and replay.
- Stateful: Use checkpoints to keep state and blindly save the new state
to the external system. Doing it this way I get a correct aggregation result, but
I have to keep data in two places (state and external system).
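
The two approaches above can be sketched in plain Python, with a dict standing
in for the external store. This is only a simulation under simple assumptions:
a failure is modeled as the same batch being written twice, and the function
names (write_deltas, write_totals) are made up for illustration, not part of
any Spark API.

```python
from collections import Counter


def write_deltas(store, batch):
    """Stateless: add this batch's counts to whatever is already stored."""
    for key, n in Counter(batch).items():
        store[key] = store.get(key, 0) + n


def write_totals(store, checkpoint, batch):
    """Stateful: rebuild totals from the last checkpoint, then overwrite the store."""
    totals = Counter(checkpoint)
    totals.update(batch)
    for key, n in totals.items():
        store[key] = n  # overwriting with the same total is harmless on replay
    return dict(totals)


batch = ["a", "b", "a"]

stateless_store = {}
write_deltas(stateless_store, batch)
write_deltas(stateless_store, batch)  # replay of the same batch after a failure

stateful_store = {}
checkpoint = {}  # state as of the last completed batch
write_totals(stateful_store, checkpoint, batch)
new_state = write_totals(stateful_store, checkpoint, batch)  # replay from the same checkpoint

print(stateless_store)
print(stateful_store)
```

Replaying the batch doubles the counts in the delta version, while the
overwrite version ends with the same totals because the write is idempotent.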

 My questions are:

- Is my understanding of stateless and stateful aggregation correct?
If not, please correct me!
- For stateful aggregation, what does Spark Streaming keep when
it saves a checkpoint?

 Please kindly help!

 Thanks
 -Binh



