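Hi,

Yes, Spark Streaming is capable of stateful stream processing. Stateful versus stateless is a way of classifying the processing: a stateful transformation carries data across batches, a stateless one does not. Checkpoints hold both metadata (the configuration, the DStream operations, and batches that are queued but not yet complete) and data (the generated RDDs, which stateful transformations depend on).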
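To make the trade-off in the question concrete, here is a minimal sketch in plain Python (not Spark code) of the two strategies when a batch is replayed after a failure. The function names, the dict standing in for the external store, and the sample batches are all hypothetical, and the "checkpoint" is just a saved copy of the running totals rather than Spark's actual checkpoint format.

```python
from collections import defaultdict


def aggregate(batch):
    """Sum the values per key within one micro-batch."""
    totals = defaultdict(int)
    for key, value in batch:
        totals[key] += value
    return dict(totals)


def apply_deltas(store, batch):
    """Stateless style: add this batch's sums to whatever the store holds."""
    for key, delta in aggregate(batch).items():
        store[key] = store.get(key, 0) + delta


def update_state(state, batch):
    """Stateful style: fold the batch into the running totals, returning new state."""
    new_state = dict(state)
    for key, delta in aggregate(batch).items():
        new_state[key] = new_state.get(key, 0) + delta
    return new_state


def overwrite(store, state):
    """Blindly save the current totals; writing the same state twice is harmless."""
    store.update(state)


# Hypothetical input: two micro-batches of (key, value) pairs.
batches = [[("a", 1), ("b", 2)], [("a", 3)]]

# Stateless: the second batch is replayed after a failure and counted twice.
stateless_store = {}
apply_deltas(stateless_store, batches[0])
apply_deltas(stateless_store, batches[1])
apply_deltas(stateless_store, batches[1])  # replay

# Stateful: state is saved after batch 1, so the replay restarts from it.
stateful_store = {}
checkpoint = update_state({}, batches[0])
overwrite(stateful_store, checkpoint)
overwrite(stateful_store, update_state(checkpoint, batches[1]))
overwrite(stateful_store, update_state(checkpoint, batches[1]))  # replay

print(stateless_store)  # "a" is over-counted
print(stateful_store)   # totals are unaffected by the replay
```

The stateful path stays correct because the write to the store is an overwrite of a total recomputed from saved state, so repeating it changes nothing; the stateless path would need some other form of deduplication to get the same guarantee.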
Thanks

On Wed, Mar 18, 2015 at 4:00 AM, Binh Nguyen Van <binhn...@gmail.com> wrote:

> Hi all,
>
> I am new to Spark so please forgive me if my question is stupid.
> I am trying to use Spark-Streaming in an application that reads data
> from a queue (Kafka), does some aggregation (sum, count..) and
> then persists the result to an external storage system (MySQL, VoltDB...)
>
> From my understanding of Spark-Streaming, I can have two ways
> of doing aggregation:
>
> - Stateless: I don't have to keep state and just apply new delta
>   values to the external system. From my understanding, doing it this way I
>   may end up over-counting when there is a failure and replay.
> - Stateful: Use checkpoint to keep state and blindly save new state
>   to the external system. Doing it this way I have the correct aggregation
>   result but I have to keep data in two places (state and external system)
>
> My questions are:
>
> - Is my understanding of Stateless and Stateful aggregation correct?
>   If not please correct me!
> - For the Stateful aggregation, what does Spark-Streaming keep when
>   it saves a checkpoint?
>
> Please kindly help!
>
> Thanks
> -Binh

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com