Re: Decompose failure recovery time
Hi Piotr,

Thanks a lot. I will try your suggestion and see what happens.

Regards,
Zhinan
Re: Decompose failure recovery time
Hi Zhinan,

It's hard to say, but my guess is that it takes that long for the tasks to respond to cancellation, which consists of a couple of steps. If a task is currently busy processing something, it has to respond to interruption (`java.lang.Thread#interrupt`). If it takes 30 seconds for a task to react to the interruption and clean up its resources, that can cause problems and there is very little that Flink can do.

If you want to debug it further, I would suggest collecting stack traces during cancellation (or, even better, profiling the code during cancellation). This would help you answer the question of what the task threads are busy with.

Probably not a solution, but I'm mentioning it just in case: you can shorten the `task.cancellation.timeout` period. By default it's 180s. After that, the whole TaskManager will be killed. If you have spare TaskManagers, or you can restart them very quickly, lowering this timeout might help to some extent (in exchange for a dirty shutdown, without cleaning up the resources).

Piotrek
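To make the interruption point concrete, here is a minimal Java sketch of a processing loop that reacts promptly to cancellation and keeps its cleanup short. This is not Flink API; the class and helper names are placeholders for whatever user code runs inside the task thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Sketch of a processing loop that reacts promptly to interruption.
 * Class and helper names are placeholders, not Flink APIs.
 */
public class InterruptAwareLoop implements Runnable {

    private final BlockingQueue<String> input = new LinkedBlockingQueue<>();

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Blocking calls like take() throw InterruptedException as soon as
                // the thread is interrupted, so cancellation is not stuck behind
                // a long poll or a busy loop that never checks the flag.
                String record = input.take();
                process(record);
            }
        } catch (InterruptedException e) {
            // Restore the flag so outer layers also see the cancellation.
            Thread.currentThread().interrupt();
        } finally {
            // Keep cleanup short: a slow close() here is exactly what makes
            // the CANCELING -> CANCELED transition take tens of seconds.
            closeResourcesQuickly();
        }
    }

    private void process(String record) {
        // placeholder for per-record work
    }

    private void closeResourcesQuickly() {
        // placeholder: close connections, files, etc. with bounded timeouts
    }
}
```

If the loop instead swallows the interrupt, or blocks on a call that ignores interruption (for example a third-party client without timeouts), the task can sit in CANCELING for tens of seconds, which would match the roughly 30s observed in the logs.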
Re: Decompose failure recovery time
Hi Piotr,

Thanks a lot for your help. Yes, I now realize that I can only approximate the time for [1] and [3], and measure [2] by monitoring the uptime and downtime metrics provided by Flink.

My problem now is that the time in [2] can be up to 40s, and I wonder why it takes so long to restart the job. The log shows that switching all operator instances from CANCELING to CANCELED takes around 30s; do you have any idea why?

Many thanks.

Regards,
Zhinan
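For reference, the job-level availability metrics mentioned above can also be pulled over the REST API. The sketch below polls them with plain Java and simply prints the raw JSON with a timestamp. It assumes the standard availability metric names (`uptime`, `downtime`, `restartingTime`, `fullRestarts`) and the default REST address `http://localhost:8081`; check which of these your Flink version actually exposes, and use a JSON library in a real script.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;

public class AvailabilityMetricsProbe {
    public static void main(String[] args) throws Exception {
        // args[0] = job id; REST address and metric names are assumptions, adjust as needed.
        String jobId = args[0];
        String url = "http://localhost:8081/jobs/" + jobId
                + "/metrics?get=uptime,downtime,restartingTime,fullRestarts";
        HttpClient client = HttpClient.newHttpClient();
        while (true) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            // Prints something like: 2020-08-20T10:00:00Z [{"id":"downtime","value":"0"}, ...]
            System.out.println(Instant.now() + " " + body);
            Thread.sleep(1000);
        }
    }
}
```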
Re: Decompose failure recovery time
Hi,

> I want to decompose the recovery time into different parts, say
> (1) the time to detect the failure,
> (2) the time to restart the job,
> (3) and the time to restore the checkpoint.

1. Maybe I'm missing something, but as far as I can tell, Flink cannot help you with that. The time to detect the failure would be the time between when the failure occurred and when the JobManager becomes aware of it. If we could reliably measure when the first one happened, we could immediately trigger failover; you are interested in this exactly because there is no reliable way to detect the failure immediately. You could approximate it by analysing the logs.

2. Maybe there are some metrics that you could use; if not, you can use the REST API [1] to monitor the job status. Again, you could also do it by analysing the logs.

3. In the future this might be measurable using the REST API (similar to point 2), but currently there is no way to do it that way. There is a ticket for that [2]. I think currently the only way to do it is by analysing the logs.

If you just need to do this once, I would analyse the logs manually. If you want to do it many times or monitor this continuously, I would write a simple script (Python?) that mixes REST API calls for point 2 with log analysis.

Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs
[2] https://issues.apache.org/jira/browse/FLINK-17012

On Tue, 18 Aug 2020 at 04:07, Zhinan Cheng wrote:

> Hi all,
>
> I am working on measuring the failure recovery time of Flink, and I
> want to decompose the recovery time into different parts, say the time
> to detect the failure, the time to restart the job, and the time to
> restore the checkpoint.
>
> Unfortunately, I cannot find any information in the Flink docs about
> this. Is there any way that Flink provides for this? Otherwise,
> how can I solve this?
>
> Thanks a lot for your help.
>
> Regards,
> Juno
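As a rough starting point for the script idea in point 2 above, here is a minimal Java sketch that polls the REST API for the job state and timestamps every transition (e.g. RUNNING -> RESTARTING -> RUNNING), which gives an approximation of the restart time. It assumes the default REST address `http://localhost:8081` and extracts the `state` field naively; a real script would use a JSON library.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;

public class JobStateProbe {
    public static void main(String[] args) throws Exception {
        // args[0] = job id; the REST address is an assumption, adjust for your setup.
        String jobId = args[0];
        URI uri = URI.create("http://localhost:8081/jobs/" + jobId);
        HttpClient client = HttpClient.newHttpClient();
        String lastState = "";
        while (true) {
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            String state = extractState(body);
            if (!state.equals(lastState)) {
                // Each printed line marks a state transition with its wall-clock time.
                System.out.println(Instant.now() + "  " + lastState + " -> " + state);
                lastState = state;
            }
            Thread.sleep(200);
        }
    }

    // Naive extraction of the "state" field from the job-details JSON.
    private static String extractState(String json) {
        int key = json.indexOf("\"state\"");
        if (key < 0) {
            return "UNKNOWN";
        }
        int firstQuote = json.indexOf('"', key + "\"state\"".length());
        int secondQuote = json.indexOf('"', firstQuote + 1);
        return json.substring(firstQuote + 1, secondQuote);
    }
}
```

Run it before injecting the failure and take the restart time as the gap between the timestamps of the transitions out of and back into RUNNING; the detection and checkpoint-restore parts would still come from the logs, as discussed above.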