Re: Question: Determining Total Recovery Time

Morgan Geldenhuys Tue, 11 Feb 2020 02:25:13 -0800

Thanks for the advice, i will look into it.

Had a quick think about another simple solution but we would need a hookinto the checkpoint process from the task/operator perspective, which Ihaven't looked into yet. It would work like this:

- The sink operators (?) would keep a local copy of the last messageprocessed (or digest?), the current timestamp, and a boolean valueindicating whether or not the system is in recovery or not.- While not in recovery, update the local copy and timestamp with eachnew event processed.- When a failure is detected and the taskmanagers are notified torollback, we use the hook into this process to switch the boolean valueto true.- While true, it compares each new message with the last one processedbefore the recovery process was initiated.- When a match is found, the difference between the previous and currenttimestamp is calculated and outputted as a custom metric and the booleanis reset to false.

From here, the mean total recovery time could be calculated across theoperators. Not sure how it would impact on performance, but i doubt itwould be significant. We would need to ensure exactly once so that themessage would be guaranteed to be seen again. thoughts?


On 11.02.20 08:57, Arvid Heise wrote:

Hi Morgan,

as Timo pointed out, there is no general solution, but in yoursetting, you could look at the consumer lag of the input topic after acrash. Lag would spike until all tasks restarted and reprocessingbegins. Offsets are only committed on checkpoints though by default.


Best,

Arvid

On Tue, Feb 4, 2020 at 12:32 PM Timo Walther <twal...@apache.org<mailto:twal...@apache.org>> wrote:


    Hi Morgan,

    as far as I know this is not possible mostly because measuring
    "till the
    point when the system catches up to the last message" is very
    pipeline/connector dependent. Some sources might need to read from
    the
    very beginning, some just continue from the latest checkpointed
    offset.

    Measure things like that (e.g. for experiments) might require
    collecting
    own metrics as part of your pipeline definition.

    Regards,
    Timo


    On 03.02.20 12:20, Morgan Geldenhuys wrote:
    > Community,
    >
    > I am interested in determining the total time to recover for a
    Flink
    > application after experiencing a partial failure. Let's assume a
    > pipeline consisting of Kafka -> Flink -> Kafka with Exactly-Once
    > guarantees enabled.
    >
    > Taking a look at the documentation
    >
    
(https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/metrics.html),

    > one of the metrics which can be gathered is /recoveryTime/.
    However, as
    > far as I can tell this is only the time taken for the system to
    go from
    > an inconsistent state back into a consistent state, i.e.
    restarting the
    > job. Is there any way of measuring the amount of time taken from
    the
    > point when the failure occurred till the point when the system
    catches
    > up to the last message that was processed before the outage?
    >
    > Thank you very much in advance!
    >
    > Regards,
    > Morgan.

Re: Question: Determining Total Recovery Time

Reply via email to