Hi,

My Spark Streaming application consumes messages (events) from Kafka every 10 seconds using the direct stream approach, aggregates these messages into hourly aggregations (to answer analytics questions like: "How many users from Paris visited page X between 8 PM and 9 PM?"), and saves the data to Cassandra.
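To make the shape of the aggregation concrete, here is a minimal sketch of the per-batch roll-up in plain Scala (no Spark or Cassandra), assuming events carry a user id, a city, a page, and an epoch-millisecond timestamp; `Event`, `hourBucket`, and `hourlyUserCounts` are illustrative names, not from the actual application.

```scala
// Illustrative event shape (assumed, not from the real application).
final case class Event(userId: String, city: String, page: String, tsMillis: Long)

// Truncate an epoch-millisecond timestamp to the start of its hour.
def hourBucket(tsMillis: Long): Long = tsMillis - (tsMillis % 3600000L)

// Distinct-user count per (city, page, hour) -- the kind of row the
// application would upsert into its Cassandra aggregation tables each batch.
def hourlyUserCounts(events: Seq[Event]): Map[(String, String, Long), Int] =
  events
    .groupBy(e => (e.city, e.page, hourBucket(e.tsMillis)))
    .map { case (key, evs) => key -> evs.map(_.userId).distinct.size }
```

In the real job the same grouping would run per micro-batch on a DStream, with `updateStateByKey` carrying the partial hourly counts forward between batches.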
I was wondering if there's a good practice for handling a code change in a Spark Streaming application that uses stateful transformations (updateStateByKey, for example), because the new application code will not be able to use the data that was checkpointed by the former application. I have thought of a few solutions for this issue and was hoping some of you have experience with such a case and can suggest other solutions or give feedback on mine:

*Solution #1*: On a graceful shutdown, in addition to the current Kafka offsets, persist the current aggregated data into Cassandra tables (separate from the regular aggregation tables) in a form that can be read back easily when the new application starts, in order to build its initial state.

*Solution #2*: When an hour is "complete" (i.e. no more events with a timestamp in that hour are expected), record the last complete hour somewhere persistent (a DB or a shared file). When the new application starts, this lets me re-read all the events in Kafka from the beginning of the retention period (the last X hours) and ignore events with a timestamp smaller than or equal to the last complete hour.

I'll be happy to get your feedback!

Thanks,
Ofir

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-application-code-change-and-stateful-transformations-tp24692.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
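A minimal sketch of the bookkeeping behind Solution #2, in plain Scala (no Spark or Kafka), assuming epoch-millisecond timestamps; `Event`, `hourBucket`, and `replayFilter` are illustrative names. For Solution #1, it is worth noting that `updateStateByKey` has an overload accepting an `initialRDD` parameter, so the aggregates persisted to Cassandra at shutdown could seed the new application's state directly.

```scala
// Illustrative event: only the timestamp matters for the replay filter.
final case class Event(userId: String, tsMillis: Long)

// Truncate an epoch-millisecond timestamp to the start of its hour.
def hourBucket(tsMillis: Long): Long = tsMillis - (tsMillis % 3600000L)

// On restart: drop every replayed event whose hour is already covered by
// the persisted last-complete-hour marker, and keep the rest.
def replayFilter(events: Seq[Event], lastCompleteHour: Long): Seq[Event] =
  events.filter(e => hourBucket(e.tsMillis) > lastCompleteHour)
```

In the real job this predicate would be applied to each batch of the replayed direct stream until the consumer catches up past the marker.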