Hi Binh,

The checkpoint stores the state as well as the unprocessed data. The state is only a subset of the records that you have aggregated so far.
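
On limiting the size of the state: one common approach is to have the update function drop keys that have gone quiet. The sketch below is plain Python with no Spark dependency, so it can be run as is. `update_state` has the same shape as the function passed to `updateStateByKey` (new values for a key, previous state), and returning `None` from it is how a key is removed from the state. The state layout, the `MAX_IDLE_BATCHES` threshold and the `run_batch` driver are assumptions made up for illustration; they are not Spark settings or APIs.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical state layout: (running_sum, running_count, idle_batches).
State = Tuple[int, int, int]

MAX_IDLE_BATCHES = 3  # assumed eviction threshold, not a Spark setting


def update_state(new_values: List[int], state: Optional[State]) -> Optional[State]:
    """Same shape as an updateStateByKey function:
    (values seen for a key in this batch, previous state) -> new state.
    Returning None asks for the key to be dropped from the state."""
    total, count, idle = state if state is not None else (0, 0, 0)
    if new_values:
        # Fold the new values into the aggregate and reset the idle counter.
        return (total + sum(new_values), count + len(new_values), 0)
    idle += 1
    if idle >= MAX_IDLE_BATCHES:
        return None  # evict: the key no longer takes space in the state
    return (total, count, idle)


def run_batch(state: Dict[str, State], batch: List[Tuple[str, int]]) -> Dict[str, State]:
    """Tiny stand-in for one micro-batch, for illustration only (not Spark code)."""
    grouped: Dict[str, List[int]] = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    next_state: Dict[str, State] = {}
    # The update function runs for every known key, even with no new values.
    for key in set(state) | set(grouped):
        updated = update_state(grouped.get(key, []), state.get(key))
        if updated is not None:
            next_state[key] = updated
    return next_state


if __name__ == "__main__":
    state: Dict[str, State] = {}
    batches = [[("a", 1), ("b", 2)], [("a", 3)], [], [("a", 5)]]
    for batch in batches:
        state = run_batch(state, batch)
        print(state)
```

In a PySpark job the same `update_state` function would be passed to `updateStateByKey` on a DStream of (key, value) pairs, after setting a checkpoint directory with `ssc.checkpoint(...)`. Only the keys that survive the update function are carried forward, so evicting idle keys is what keeps the checkpointed state from growing without bound.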

The Spark Streaming programming guide is a good reference for checkpointing:
http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#checkpointing

On Wed, Mar 18, 2015 at 12:52 PM, Binh Nguyen Van <binhn...@gmail.com> wrote:

> Hi Arush,
>
> Thank you for answering!
> When you say checkpoints hold metadata and Data, what is the Data? Is it
> the Data that is pulled from the input source or is it the state?
> If it is state, then is it the same number of records that I aggregated
> since the beginning or only a subset of it? How can I limit the size of
> the state that is kept in the checkpoint?
>
> Thank you
> -Binh
>
> On Tue, Mar 17, 2015 at 11:47 PM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
>> Hi
>>
>> Yes spark streaming is capable of stateful stream processing. With or
>> without state is a way of classifying state.
>> Checkpoints hold metadata and Data.
>>
>> Thanks
>>
>> On Wed, Mar 18, 2015 at 4:00 AM, Binh Nguyen Van <binhn...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am new to Spark so please forgive me if my question is stupid.
>>> I am trying to use Spark-Streaming in an application that reads data
>>> from a queue (Kafka), does some aggregation (sum, count..) and
>>> then persists the result to an external storage system (MySQL, VoltDB...)
>>>
>>> From my understanding of Spark-Streaming, I have two ways
>>> of doing aggregation:
>>>
>>> - Stateless: I don't have to keep state and just apply new delta
>>> values to the external system. From my understanding, doing it this
>>> way I may end up over counting when there is a failure and replay.
>>> - Stateful: Use checkpoint to keep state and blindly save new state
>>> to the external system. Doing it this way I have a correct aggregation
>>> result but I have to keep data in two places (state and external
>>> system)
>>>
>>> My questions are:
>>>
>>> - Is my understanding of Stateless and Stateful aggregation
>>> correct? If not please correct me!
>>> - For the Stateful aggregation, what does Spark-Streaming keep when
>>> it saves a checkpoint?
>>>
>>> Please kindly help!
>>>
>>> Thanks
>>> -Binh

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com