Re: flink snapshotting fault-tolerance

Abhishek R. Singh Thu, 19 May 2016 11:36:52 -0700

I was wondering how checkpoints can be async? Because your state is constantly 
mutating. You probably need versioned state, or immutable data structs?


-Abhishek-

> On May 19, 2016, at 11:14 AM, Paris Carbone <par...@kth.se> wrote:
> 
> Hi Stavros,
> 
> Currently, rollback failure recovery in Flink works in the pipeline level, 
> not in the task level (see Millwheel [1]). It further builds on repayable 
> stream logs (i.e. Kafka), thus, there is no need for 3pc or backup in the 
> pipeline sources. You can also check this presentation [2] which explains the 
> basic concepts more in detail I hope. Mind that many upcoming optimisation 
> opportunities are going to be addressed in the not so long-term Flink roadmap.
> 
> Paris
> 
> [1] 
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf
>  
> <http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf>
> [2] 
> http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha
>  
> <http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha>
> 
>  
> <http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha>
> 
>  
> <http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha>
>> On 19 May 2016, at 19:43, Stavros Kontopoulos <st.kontopou...@gmail.com 
>> <mailto:st.kontopou...@gmail.com>> wrote:
>> 
>> Cool thnx. So if a checkpoint expires the pipeline will block or fail in 
>> total or only the specific task related to the operator (running along with 
>> the checkpoint task) or nothing happens?
>> 
>> On Tue, May 17, 2016 at 3:49 PM, Robert Metzger <rmetz...@apache.org 
>> <mailto:rmetz...@apache.org>> wrote:
>> Hi Stravos,
>> 
>> I haven't implemented our checkpointing mechanism and I didn't participate 
>> in the design decisions while implementing it, so I can not compare it in 
>> detail to other approaches.
>> 
>> From a "does it work perspective": Checkpoints are only confirmed if all 
>> parallel subtasks successfully created a valid snapshot of the state. So if 
>> there is a failure in the checkpointing mechanism, no valid checkpoint will 
>> be created. The system will recover from the last valid checkpoint.
>> There is a timeout for checkpoints. So if a barrier doesn't pass through the 
>> system for a certain period of time, the checkpoint is cancelled. The 
>> default timeout is 10 minutes.
>> 
>> Regards,
>> Robert
>> 
>> 
>> On Mon, May 16, 2016 at 1:22 PM, Stavros Kontopoulos 
>> <st.kontopou...@gmail.com <mailto:st.kontopou...@gmail.com>> wrote:
>> Hi,
>> 
>> I was looking into the flink snapshotting algorithm details also mentioned 
>> here:
>> http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
>>  
>> <http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/>
>> https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
>>  
>> <https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/>
>> http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_s6MCWSuDf2zSnEeD66LszDoLx0jt64++0kBOKTjkAv7w%40mail.gmail.com%3E
>>  
>> <http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_s6MCWSuDf2zSnEeD66LszDoLx0jt64++0kBOKTjkAv7w%40mail.gmail.com%3E>
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/About-exactly-once-question-td2545.html
>>  
>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/About-exactly-once-question-td2545.html>
>> 
>> From other sources i understand that it assumes no failures to work for 
>> message delivery or for example a process hanging for ever:
>> https://en.wikipedia.org/wiki/Snapshot_algorithm 
>> <https://en.wikipedia.org/wiki/Snapshot_algorithm>
>> https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/
>>  
>> <https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/>
>> 
>> So my understanding (maybe wrong) is that this is a solution which seems not 
>> to address the fault tolerance issue in a strong manner like for example if 
>> it was to use a 3pc protocol for local state propagation and global 
>> agreement. I know the latter is not efficient just mentioning it for 
>> comparison. 
>> 
>> How the algorithm behaves in practical terms under the presence of its own 
>> failures (this is a background process collecting partial states)? Are there 
>> timeouts for reaching a barrier?
>> 
>> PS. have not looked deep into the code details yet, planning to.
>> 
>> Best,
>> Stavros
>> 
>> 
>> 
>

Re: flink snapshotting fault-tolerance

Reply via email to