Thanks, Cody! The second solution is safer but seems wasteful :/ I'll try to optimize it by keeping, in addition to the 'last-complete-hour', the corresponding Kafka offsets that bound the incomplete data, so that in the worst case only the last couple of hours need to be fast-forwarded through.
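Roughly, the bookkeeping I have in mind looks like the sketch below (names and the storage format are just illustrative; in the real app this state would live in Cassandra and the resulting offsets would be handed to the direct stream as its starting position):

```python
import json

def save_checkpoint(store, last_complete_hour, boundary_offsets):
    """Persist the last fully aggregated hour and, per partition,
    the first offset whose event falls after that hour."""
    store["recovery"] = json.dumps({
        "last_complete_hour": last_complete_hour,  # e.g. "2015-09-14T20"
        "offsets": boundary_offsets,               # {partition: offset}
    })

def starting_offsets(store, earliest_retained):
    """On restart, fast-forward to the saved boundary offsets instead of
    replaying the whole retention window; fall back to the earliest
    retained offsets if no checkpoint exists."""
    raw = store.get("recovery")
    if raw is None:
        return earliest_retained
    saved = json.loads(raw)["offsets"]
    # Never start before what Kafka still retains.
    return {p: max(saved.get(p, 0), lo) for p, lo in earliest_retained.items()}

store = {}
save_checkpoint(store, "2015-09-14T20", {"0": 1200, "1": 980})
print(starting_offsets(store, {"0": 100, "1": 1000}))
# {'0': 1200, '1': 1000}
```

Note the `max` against the retained range: if the app was down longer than the retention period, the saved offsets may already have expired, so partition "1" above starts at 1000 rather than the stale 980.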
On Mon, Sep 14, 2015 at 22:14 Cody Koeninger <c...@koeninger.org> wrote:

> Solution 2 sounds better to me. You aren't always going to have graceful
> shutdowns.
>
> On Mon, Sep 14, 2015 at 1:49 PM, Ofir Kerker <ofir.ker...@gmail.com> wrote:
>
>> Hi,
>> My Spark Streaming application consumes messages (events) from Kafka every
>> 10 seconds using the direct stream approach, aggregates them into hourly
>> aggregations (to answer analytics questions like: "How many users from
>> Paris visited page X between 8PM and 9PM"), and saves the data to Cassandra.
>>
>> I was wondering whether there is a good practice for handling a code change
>> in a Spark Streaming application that uses stateful transformations
>> (updateStateByKey, for example), because the new application code will not
>> be able to use the data that was checkpointed by the former application.
>> I have thought of a few solutions to this issue and was hoping some of you
>> have experience with such a case and can suggest other solutions or give
>> feedback on mine:
>> *Solution #1*: On a graceful shutdown, in addition to the current Kafka
>> offsets, persist the current aggregated data into Cassandra tables
>> (separate from the regular aggregation tables) so that it can easily be
>> read back when the new application starts in order to build the initial
>> state.
>> *Solution #2*: When an hour is "complete" (i.e. no more events with that
>> hour's timestamp are expected), record the last-complete-hour somewhere
>> persistent (a DB or shared file). When the new application starts, this
>> allows reading all the events in Kafka from the beginning of the retention
>> period (the last X hours) and ignoring events with a timestamp smaller than
>> or equal to the last-complete-hour.
>>
>> I'll be happy to get your feedback!
>>
>> Thanks,
>> Ofir
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-application-code-change-and-stateful-transformations-tp24692.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
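For reference, the filtering step of Solution #2 in the quoted message boils down to something like this (a pure-Python sketch; the event shape and hour-string format are assumptions, and in the actual job this would be a filter on the replayed stream):

```python
from datetime import datetime, timezone

def is_new_event(event_ts, last_complete_hour):
    """Keep only events strictly after the last fully aggregated hour.

    event_ts is a Unix timestamp (seconds, UTC); last_complete_hour is an
    hour bucket formatted so that string comparison matches time order.
    """
    hour = datetime.fromtimestamp(event_ts, tz=timezone.utc).strftime("%Y-%m-%dT%H")
    return hour > last_complete_hour

last = "2015-09-14T20"
# Three events replayed from Kafka: one in the 20:00 hour (already
# aggregated), two after it (still need processing).
events = [1442261000, 1442264500, 1442268100]
fresh = [e for e in events if is_new_event(e, last)]
```

Using a zero-padded `YYYY-MM-DDTHH` bucket means plain string comparison respects chronological order, so no timestamp parsing is needed on the stored value.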