Thanks Cody!
The 2nd solution is safer but seems wasteful :/
I'll try to optimize it by keeping in addition to the 'last-complete-hour'
the corresponding offsets that bound the incomplete data to try and
fast-forward only the last couple of hours in the worst case.

On Mon, Sep 14, 2015 at 22:14 Cody Koeninger <c...@koeninger.org> wrote:

> Solution 2 sounds better to me.  You aren't always going to have graceful
> shutdowns.
>
> On Mon, Sep 14, 2015 at 1:49 PM, Ofir Kerker <ofir.ker...@gmail.com>
> wrote:
>
>> Hi,
>> My Spark Streaming application consumes messages (events) from Kafka every
>> 10 seconds using the direct stream approach and aggregates these messages
>> into hourly aggregations (to answer analytics questions like: "How many
>> users from Paris visited page X between 8PM to 9PM") and save the data to
>> Cassandra.
>>
>> I was wondering if there's a good practice for handling a code change in a
>> Spark Streaming applications that uses stateful transformations
>> (updateStateByKey for example) because the new application code will not
>> be
>> able to use the data that was checkpointed by the former application.
>> I have thought of a few solutions for this issue and was hoping some of
>> you
>> have some experience with such case and can suggest other solutions or
>> feedback my suggested solutions:
>> *Solution #1*: On a graceful shutdown, in addition to the current Kafka
>> offsets, persist the current aggregated data into Cassandra tables
>> (different than the regular aggregation tables) that would allow reading
>> them easily when the new application starts in order to build the initial
>> state.
>> *Solution #2*: When an hour is "complete" (i.e not expecting more events
>> with the timestamp of this hour), update somewhere persistent (DB / shared
>> file) the last-complete-hour. This will allow me, when the new application
>> starts, to read all the events from Kafka from the beginning of retention
>> period (last X hours) and ignore events from timestamp smaller or equal
>> than
>> the last-complete-hour.
>>
>> I'll be happy to get your feedback!
>>
>> Thanks,
>> Ofir
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-application-code-change-and-stateful-transformations-tp24692.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

Reply via email to