Hi all,

I am new to Spark, so please forgive me if my question is stupid.
I am trying to use Spark Streaming in an application that reads data
from a queue (Kafka), does some aggregation (sum, count, ...), and
then persists the result to an external storage system (MySQL, VoltDB, ...).

From my understanding of Spark Streaming, I can do the aggregation
in two ways:

   - Stateless: I don't keep any state and just apply the new delta values
   to the external system. From my understanding, doing it this way I may
   end up over-counting when there is a failure and a replay.
   - Stateful: use checkpointing to keep the state in Spark and blindly save
   the new state to the external system. Doing it this way I get correct
   aggregation results, but I have to keep the data in two places (the
   Spark state and the external system). I put a rough sketch of both
   approaches right after this list.
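
To make sure I am describing the two approaches correctly, here is a rough
sketch of what I have in mind. It is only an illustration: the socket stream
is a stand-in for the real Kafka source, the checkpoint directory is just an
example path, and saveDelta/saveState are placeholders for the writes to
MySQL/VoltDB, not real APIs.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object AggregationSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("aggregation-sketch")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint("hdfs:///tmp/checkpoint") // required for the stateful case

        // Stand-in for the Kafka source: a stream of (key, 1L) pairs.
        val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

        // Stateless: aggregate only within the batch and push the delta out.
        events.reduceByKey(_ + _).foreachRDD { rdd =>
          rdd.foreachPartition { partition =>
            // saveDelta stands for something like "UPDATE ... SET count = count + ?".
            partition.foreach { case (key, delta) => saveDelta(key, delta) }
          }
        }

        // Stateful: keep the running total in Spark's state and overwrite the external row.
        val totals = events.updateStateByKey[Long] { (newValues, running) =>
          Some(running.getOrElse(0L) + newValues.sum)
        }
        totals.foreachRDD { rdd =>
          rdd.foreachPartition { partition =>
            // saveState stands for an idempotent "UPSERT key -> total".
            partition.foreach { case (key, total) => saveState(key, total) }
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }

      // Placeholders for the external storage calls (MySQL, VoltDB, ...).
      def saveDelta(key: String, delta: Long): Unit = ???
      def saveState(key: String, total: Long): Unit = ???
    }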

My questions are:

   - Is my understanding of stateless and stateful aggregation correct? If
   not, please correct me!
   - For stateful aggregation, what exactly does Spark Streaming save when
   it writes a checkpoint?

Please kindly help!

Thanks
-Binh
