Re: Question: Determining Total Recovery Time

2020-02-26 Thread Zhijiang
. 26 (Wed.) 22:29 To:Morgan Geldenhuys Cc:Timo Walther ; user Subject:Re: Question: Determining Total Recovery Time Hi Morgan, doing it in a very general way sure is challenging. I'd assume that your idea of using the buffer usage has some shortcomings (which I don't know), but I also

Re: Question: Determining Total Recovery Time

2020-02-26 Thread Arvid Heise
Hi Morgan, doing it in a very general way sure is challenging. I'd assume that your idea of using the buffer usage has some shortcomings (which I don't know), but I also think it's a good starting point. Have you checked the PoolUsage metrics? [1] You could use them to detect the bottleneck and

Re: Question: Determining Total Recovery Time

2020-02-21 Thread Arvid Heise
Hi Morgan, sorry for the late reply. In general, that should work. You need to ensure that the same task is processing the same record though. Local copy needs to be state or else the last message would be lost upon restart. Performance will take a hit but if that is significant depends on the re

Re: Question: Determining Total Recovery Time

2020-02-11 Thread Morgan Geldenhuys
Thanks for the advice, i will look into it. Had a quick think about another simple solution but we would need a hook into the checkpoint process from the task/operator perspective, which I haven't looked into yet. It would work like this: - The sink operators (?) would keep a local copy of th

Re: Question: Determining Total Recovery Time

2020-02-10 Thread Arvid Heise
Hi Morgan, as Timo pointed out, there is no general solution, but in your setting, you could look at the consumer lag of the input topic after a crash. Lag would spike until all tasks restarted and reprocessing begins. Offsets are only committed on checkpoints though by default. Best, Arvid On

Re: Question: Determining Total Recovery Time

2020-02-04 Thread Timo Walther
Hi Morgan, as far as I know this is not possible mostly because measuring "till the point when the system catches up to the last message" is very pipeline/connector dependent. Some sources might need to read from the very beginning, some just continue from the latest checkpointed offset. Mea

Question: Determining Total Recovery Time

2020-02-03 Thread Morgan Geldenhuys
Community, I am interested in determining the total time to recover for a Flink application after experiencing a partial failure. Let's assume a pipeline consisting of Kafka -> Flink -> Kafka with Exactly-Once guarantees enabled. Taking a look at the documentation (https://ci.apache.org/pro