Solution 2 sounds better to me. You aren't always going to have graceful shutdowns.
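For what it's worth, the replay-and-filter step in Solution #2 is easy to get right. A minimal sketch in plain Python (the event shape and function names here are hypothetical, not from Ofir's application; the actual filter would run inside the Spark job over the replayed Kafka stream):

```python
from datetime import datetime, timezone

def floor_to_hour(ts):
    """Truncate a UTC epoch timestamp (seconds) to the start of its hour."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.replace(minute=0, second=0, microsecond=0).timestamp()

def events_to_replay(events, last_complete_hour):
    """Keep only events whose hour is strictly after the persisted
    last-complete-hour; anything at or before it was already
    aggregated by the old application and must not be double-counted."""
    return [e for e in events if floor_to_hour(e["ts"]) > last_complete_hour]
```

The key invariant is that the persisted last-complete-hour marker is only advanced once the hour's aggregation has been durably written to Cassandra, so replaying from the start of the Kafka retention period and filtering is always safe.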
On Mon, Sep 14, 2015 at 1:49 PM, Ofir Kerker <ofir.ker...@gmail.com> wrote:
> Hi,
> My Spark Streaming application consumes messages (events) from Kafka every
> 10 seconds using the direct stream approach, aggregates them into hourly
> aggregations (to answer analytics questions like "How many users from Paris
> visited page X between 8PM and 9PM?"), and saves the data to Cassandra.
>
> I was wondering if there is a good practice for handling a code change in a
> Spark Streaming application that uses stateful transformations
> (updateStateByKey, for example), because the new application code will not
> be able to use the data that was checkpointed by the former application.
> I have thought of a few solutions for this issue and was hoping some of you
> have experience with such a case and can suggest other solutions or give
> feedback on mine:
> *Solution #1*: On a graceful shutdown, in addition to the current Kafka
> offsets, persist the current aggregated data into Cassandra tables
> (separate from the regular aggregation tables) so it can be read easily
> when the new application starts, in order to build the initial state.
> *Solution #2*: When an hour is "complete" (i.e. no more events with that
> hour's timestamp are expected), record the last complete hour somewhere
> persistent (a DB / shared file). This will allow me, when the new
> application starts, to read all the events from Kafka from the beginning of
> the retention period (the last X hours) and ignore events with a timestamp
> smaller than or equal to the last complete hour.
>
> I'll be happy to get your feedback!
>
> Thanks,
> Ofir
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-application-code-change-and-stateful-transformations-tp24692.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.