Hi all, I am new to Spark, so please forgive me if my question is stupid. I am trying to use Spark Streaming in an application that reads data from a queue (Kafka), does some aggregation (sum, count, ...), and then persists the result to an external storage system (MySQL, VoltDB, ...).
From my understanding of Spark Streaming, there are two ways of doing aggregation:
- Stateless: I don't keep any state and just apply new delta values to the external system. My understanding is that this way I may end up over-counting when there is a failure and a replay.
- Stateful: Use a checkpoint to keep state and blindly save the new state to the external system. This way I get correct aggregation results, but I have to keep the data in two places (the state and the external system).

My questions are:
- Is my understanding of stateless and stateful aggregation correct? If not, please correct me!
- For stateful aggregation, what does Spark Streaming keep when it saves a checkpoint?

Please kindly help! Thanks
-Binh
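P.S. To make the over-counting concern concrete, here is a plain-Python sketch of what I mean (not Spark code; the dicts standing in for the external store and the checkpointed state, and the function names, are my own toy inventions):

```python
import copy

def apply_deltas(store, batch):
    """Stateless approach: push raw deltas straight to the external store.
    Replaying the same batch after a failure double-counts it."""
    for key, value in batch:
        store[key] = store.get(key, 0) + value

def run_batch(store, state, batch):
    """Stateful approach: checkpoint the state, fold the batch into it,
    then blindly overwrite the external store with the new totals.
    The overwrite is idempotent, so replaying a batch is harmless."""
    checkpoint = copy.deepcopy(state)  # stands in for Spark's checkpoint
    for key, value in batch:
        state[key] = state.get(key, 0) + value
    store.update(state)                # overwrite, not increment
    return checkpoint

batch = [("a", 1), ("a", 2)]

# Stateless: a replayed batch is counted twice.
store1 = {}
apply_deltas(store1, batch)
apply_deltas(store1, batch)            # simulated failure + replay

# Stateful: restore state from the checkpoint, replay the batch,
# and the overwritten store ends up with the same correct total.
store2, state = {}, {}
ckpt = run_batch(store2, state, batch)
state = ckpt                           # simulated recovery from checkpoint
run_batch(store2, state, batch)        # replay changes nothing
```

After this runs, store1["a"] is 6 (over-counted) while store2["a"] stays at the correct 3, which is the difference between the two approaches as I understand them.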