Re: Checkpoint error - "The job has failed"
Oh, interesting. Yeah, could be. We'll update to v1.12 soon. Thanks, Robert and Yun!

On Wed, Apr 28, 2021 at 1:30 AM Yun Tang wrote:
> [...]
Re: Checkpoint error - "The job has failed"
Hi Dan,

You can check the "Fix Versions" field of FLINK-16753 [1]: the bug is fixed as of 1.11.3, so 1.11.1 does not include the fix.

[1] https://issues.apache.org/jira/browse/FLINK-16753

Best,
Yun Tang

From: Dan Hill
Sent: Tuesday, April 27, 2021 7:50
To: Yun Tang
Cc: Robert Metzger; user
Subject: Re: Checkpoint error - "The job has failed"
[...]
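Yun's two replies hinge on how Jira's "Fix Versions" field works. The fix is listed per release line (1.10.3 and 1.11.3 in this thread), so 1.11.1 can be numerically newer than 1.10.3 and still lack it. The Python sketch below illustrates that check. `contains_fix` is a hypothetical helper, not part of Flink or Jira. It assumes, without confirmation from this thread, that any release line newer than every listed line also contains the fix. It only handles plain numeric versions such as 1.11.1.

```python
from __future__ import annotations


def parse_version(version: str) -> tuple[int, ...]:
    """Split a plain numeric version such as '1.11.3' into (1, 11, 3)."""
    return tuple(int(part) for part in version.split("."))


def contains_fix(version: str, fix_versions: list[str]) -> bool:
    """Return True if `version` includes a fix shipped in `fix_versions`.

    Hypothetical helper. Assumption: each entry in `fix_versions` is the first
    fixed release on its (major, minor) line, and any line newer than every
    listed line also contains the fix.
    """
    if not fix_versions:
        return False  # nothing listed, so nothing is known to be fixed
    current = parse_version(version)
    release_line = current[:2]  # e.g. (1, 11) for 1.11.1
    fixes = [parse_version(f) for f in fix_versions]

    # If the fix shipped on this release line, only the patch level matters.
    for fix in fixes:
        if fix[:2] == release_line:
            return current >= fix

    # Otherwise, assume it is fixed only on lines newer than every listed line.
    return all(release_line > fix[:2] for fix in fixes)


if __name__ == "__main__":
    # The two fix versions mentioned in this thread for FLINK-16753.
    thread_fix_versions = ["1.10.3", "1.11.3"]
    for v in ["1.10.3", "1.11.1", "1.11.3", "1.12.0"]:
        print(v, contains_fix(v, thread_fix_versions))
    # prints: 1.10.3 True / 1.11.1 False / 1.11.3 True / 1.12.0 True
```

Under the stated assumption, 1.12.0 counts as fixed, which matches the plan to move to v1.12. Check the ticket's actual "Fix Versions" before relying on that result.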
Re: Checkpoint error - "The job has failed"
Hey Yun and Robert,

I'm using Flink v1.11.1.

Robert, I'll send you a separate email with the logs.

On Mon, Apr 26, 2021 at 12:46 AM Yun Tang wrote:
> [...]
Re: Checkpoint error - "The job has failed"
Hi Dan,

I think you might be using an older version of Flink; this problem was fixed by FLINK-16753 [1] as of Flink 1.10.3.

[1] https://issues.apache.org/jira/browse/FLINK-16753

Best,
Yun Tang

From: Robert Metzger
Sent: Monday, April 26, 2021 14:46
To: Dan Hill
Cc: user
Subject: Re: Checkpoint error - "The job has failed"
[...]
Re: Checkpoint error - "The job has failed"
Hi Dan,

Could you send me the JobManager logs so I can take a look as well? (They will also tell me which Flink version you are using.)

On Mon, Apr 26, 2021 at 7:20 AM Dan Hill wrote:
> [...]
Checkpoint error - "The job has failed"
My Flink job failed to checkpoint with a "The job has failed" error. The logs contained no other recent errors. I kept hitting the error even after canceling and restarting the jobs. When I restarted my JobManager and TaskManager, the error went away.

What error am I hitting? It looks like there is bad state that lives outside the scope of a job.

How often do people restart their JobManagers and TaskManagers to deal with errors like this?