[jira] [Updated] (FLINK-11665) Flink fails to remove JobGraph from ZK even though it reports it did

Bashar Abdul Jawad (JIRA) Tue, 19 Feb 2019 15:48:53 -0800


[ https://issues.apache.org/jira/browse/FLINK-11665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Bashar Abdul Jawad updated FLINK-11665:
---------------------------------------
    Description: 
We have recently seen the following issue with Flink 1.5.5:

Given Flink Job ID 1d24cad26843dcebdfca236d5e3ad82a: 

1- A job is activated successfully and the job graph added to ZK:
{code:java}
Added SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null) to ZooKeeper.
{code}
2- The job is deactivated. Flink reports that the job graph has been 
successfully removed from ZK, and the blob is deleted from the blob server (in 
this case S3):
{code:java}
Removed job graph 1d24cad26843dcebdfca236d5e3ad82a from ZooKeeper.
{code}
3- The JM is later restarted. For some reason Flink attempts to recover the job 
that it earlier reported as removed from ZK, but since the blob has already 
been deleted, the JM goes into a crash loop. The only way to recover is to 
manually remove the job graph entry from ZK. On restart the logs show:
{code:java}
Recovered SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null).    
{code}
and
{code:java}
org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
 The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error 
Code: NoSuchKey; Request ID: 1BCDFD83FC4546A2), S3 Extended Request ID: 
OzZtMbihzCm1LKy99s2+rgUMxyll/xYmL6ouMvU2eo30wuDbUmj/DAWoTCs9pNNCLft0FWqbhTo= 
(Path: 
s3://blam-state-staging/flink/default/blob/job_1d24cad26843dcebdfca236d5e3ad82a/blob_p-c51b25cc0b20351f6e32a628bb6e674ee48a273e-ccfa96b0fd795502897c73714185dde3)
{code}

My question is: under what circumstances would this happen? It seems to happen 
very infrequently, but since the consequence is severe (JM crash loop) we'd 
like to understand how it could happen.

This all seems a little similar to 
https://issues.apache.org/jira/browse/FLINK-9575 but that issue is reported 
as fixed in Flink 1.5.2 and we are already on Flink 1.5.5.

  was:
We recently have seen the following issue with Flink 1.5.5:

Given Flink Job ID 1d24cad26843dcebdfca236d5e3ad82a: 

1- A job is activated successfully and the job graph added to ZK:
{code:java}
Added SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null) to ZooKeeper.
{code}
2- Job is deactivated, Flink reports that the job graph has been successfully 
removed from ZK and the blob is deleted from the blob server (in this case S3):
{code:java}
Removed job graph 1d24cad26843dcebdfca236d5e3ad82a from ZooKeeper.
{code}
3- JM is later restarted, Flink for some reason attempts to recover the job 
that it reported earlier it has removed from ZK but since the blob has already 
been deleted the JM goes into a crash loop. The only way to recover it to 
manually remove the job graph entry from ZK:
{code:java}
Recovered SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null).    
{code}
and
{code:java}
org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
 The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error 
Code: NoSuchKey; Request ID: 1BCDFD83FC4546A2), S3 Extended Request ID: 
OzZtMbihzCm1LKy99s2+rgUMxyll/xYmL6ouMvU2eo30wuDbUmj/DAWoTCs9pNNCLft0FWqbhTo= 
(Path: 
s3://blam-state-staging/flink/default/blob/job_1d24cad26843dcebdfca236d5e3ad82a/blob_p-c51b25cc0b20351f6e32a628bb6e674ee48a273e-ccfa96b0fd795502897c73714185dde3)
{code}

My question is under what circumstances would this happen? this seems to happen 
very infrequently but since the consequence is severe (JM crash loop) we'd like 
to understand how it would happen.

This  all seems a little similar to 
https://issues.apache.org/jira/browse/FLINK-9575 but that issue is reported 
fixed in Flink 1.5.2 and we are already on Flink 1.5.5


> Flink fails to remove JobGraph from ZK even though it reports it did
> --------------------------------------------------------------------
>
>                 Key: FLINK-11665
>                 URL: https://issues.apache.org/jira/browse/FLINK-11665
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.5.5
>            Reporter: Bashar Abdul Jawad
>            Priority: Major
>
> We have recently seen the following issue with Flink 1.5.5:
> Given Flink Job ID 1d24cad26843dcebdfca236d5e3ad82a: 
> 1- A job is activated successfully and the job graph added to ZK:
> {code:java}
> Added SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null) to ZooKeeper.
> {code}
> 2- The job is deactivated. Flink reports that the job graph has been 
> successfully removed from ZK, and the blob is deleted from the blob server 
> (in this case S3):
> {code:java}
> Removed job graph 1d24cad26843dcebdfca236d5e3ad82a from ZooKeeper.
> {code}
> 3- The JM is later restarted. For some reason Flink attempts to recover the 
> job that it earlier reported as removed from ZK, but since the blob has 
> already been deleted, the JM goes into a crash loop. The only way to recover 
> is to manually remove the job graph entry from ZK. On restart the logs show:
> {code:java}
> Recovered SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null).  
> {code}
> and
> {code:java}
> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The specified key does not exist. (Service: Amazon S3; Status Code: 404; 
> Error Code: NoSuchKey; Request ID: 1BCDFD83FC4546A2), S3 Extended Request ID: 
> OzZtMbihzCm1LKy99s2+rgUMxyll/xYmL6ouMvU2eo30wuDbUmj/DAWoTCs9pNNCLft0FWqbhTo= 
> (Path: 
> s3://blam-state-staging/flink/default/blob/job_1d24cad26843dcebdfca236d5e3ad82a/blob_p-c51b25cc0b20351f6e32a628bb6e674ee48a273e-ccfa96b0fd795502897c73714185dde3)
> {code}
> My question is: under what circumstances would this happen? It seems to 
> happen very infrequently, but since the consequence is severe (JM crash loop) 
> we'd like to understand how it could happen.
> This all seems a little similar to 
> https://issues.apache.org/jira/browse/FLINK-9575 but that issue is reported 
> as fixed in Flink 1.5.2 and we are already on Flink 1.5.5.
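
For the manual recovery mentioned in step 3, it may help to know where the job graph entry is expected to live in ZK. The sketch below only assembles that path; it does not talk to ZooKeeper. The class JobGraphZNodePath and its methods are hypothetical and not part of Flink. The default values are assumptions about the settings high-availability.zookeeper.path.root, high-availability.cluster-id and high-availability.zookeeper.path.jobgraphs, so check them against the flink-conf.yaml of the affected cluster.

```java
// Hypothetical helper (not part of Flink): assembles the ZooKeeper path under
// which a submitted job graph entry is expected, given the HA path settings.
final class JobGraphZNodePath {

    // Assumed defaults for high-availability.zookeeper.path.root,
    // high-availability.cluster-id and high-availability.zookeeper.path.jobgraphs.
    static final String DEFAULT_ROOT = "/flink";
    static final String DEFAULT_CLUSTER_ID = "/default";
    static final String DEFAULT_JOBGRAPHS = "/jobgraphs";

    private JobGraphZNodePath() {}

    // Builds <root>/<cluster-id>/<jobgraphs>/<jobId> from explicit settings.
    static String forJob(String root, String clusterId, String jobGraphsPath, String jobId) {
        // Job IDs in the logs above are 32 lowercase hex characters.
        if (jobId == null || !jobId.matches("[0-9a-f]{32}")) {
            throw new IllegalArgumentException("Not a 32-character hex job ID: " + jobId);
        }
        return segment(root) + segment(clusterId) + segment(jobGraphsPath) + "/" + jobId;
    }

    // Same as above, using the assumed default settings.
    static String forJob(String jobId) {
        return forJob(DEFAULT_ROOT, DEFAULT_CLUSTER_ID, DEFAULT_JOBGRAPHS, jobId);
    }

    // Normalizes a configured value to one leading slash and no trailing slash,
    // so "flink/", "/flink" and "flink" are all treated alike.
    private static String segment(String value) {
        String trimmed = value.replaceAll("^/+|/+$", "");
        return trimmed.isEmpty() ? "" : "/" + trimmed;
    }
}
```

With the assumed defaults, JobGraphZNodePath.forJob("1d24cad26843dcebdfca236d5e3ad82a") yields /flink/default/jobgraphs/1d24cad26843dcebdfca236d5e3ad82a. That node can then be inspected with the ZooKeeper CLI and, once confirmed stale, removed with a recursive delete (rmr in ZooKeeper 3.4, deleteall in 3.5 and later).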



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
