Re: Duplicate copies of job in Flink UI/API

2021-09-09 Thread Chesnay Schepler

I'm afraid there's no real workaround.

If the information for completed jobs isn't important to you then 
setting jobstore.expiration-time to a low value can reduce the impact, 
or setting jobstore.max-capacity to 0 would prevent any completed job 
from being displayed.


Beyond that I can't think of anything.

We will track the issue in this ticket: 
https://issues.apache.org/jira/browse/FLINK-20195


I will also file a related ticket that the job archiving fails because, 
again, we archive multiple jobs with the same job ID.



On 09/09/2021 15:15, Peter Westermann wrote:


Thanks Chesnay. You are understanding this correctly. Your explanation 
makes sense to me.


Is there anything we can do to prevent this? At least for us, most 
times a leader election happens, the leader doesn’t actually change 
because the jobmanager is still healthy.


Thanks,

Peter


*From: *Chesnay Schepler 
*Date: *Thursday, September 9, 2021 at 9:11 AM
*To: *Peter Westermann , Piotr Nowojski 
, user@flink.apache.org 

*Subject: *Re: Duplicate copies of job in Flink UI/API

Just to double-check that I'm understanding things correctly:

You have a job with HA, then Zookeeper breaks down, the job gets 
suspended, ZK comes back online, and the _same_ JobManager becomes the 
leader?


If so, then I can explain why this happens and hopefully reproduce it.

In short, when a job is suspended (or terminates in any other way) 
then information about the job is stored in a data-structure.


This is used by the REST API (and thus, UI) to query completed jobs.

For _running_ jobs we query the JobMaster (a component within the 
JobManager responsible for that job) instead.


When listing all jobs we query all jobs from the data-structure for 
finished jobs, _and_ all JobMasters for running jobs. The core 
assumption here is that for a given ID only one of these can return 
something.


So what ends up happening is that when the job is suspended it is 
written to the data-structure, and then another JobMaster is started 
for the same job, and when listing all jobs we can now end up asking 
for the same job from multiple sources.


This is a somewhat unusual scenario because usually when a job is 
suspended another JobManager becomes the leader (where this wouldn't 
occur because the data-structure isn't shared across JobManagers).


On 09/09/2021 13:37, Peter Westermann wrote:

Hi Piotr,

Jobmanager logs are attached to this email. The only thing that
jumps out to me is this:

09/08/2021 09:02:26.240 -0400 ERROR
org.apache.flink.runtime.history.FsJobArchivist Failed to archive job.

  java.io.IOException: File already
exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb

This happened days after the Flink update – and not just once.
Across all our Flink clusters I’ve seen this 3 times. The cause
for the jobmanager leadership loss in this case was a deployment
of our zookeeper cluster that lead to a brief connection loss. The
new leader election is expected.

Thanks,

Peter

*From: *Piotr Nowojski 
<mailto:pnowoj...@apache.org>
*Date: *Thursday, September 9, 2021 at 12:39 AM
*To: *Peter Westermann 
<mailto:no.westerm...@genesys.com>
*Cc: *user@flink.apache.org <mailto:user@flink.apache.org>
     <mailto:user@flink.apache.org>
*Subject: *Re: Duplicate copies of job in Flink UI/API

Hi Peter,

Can you provide relevant JobManager logs? And can you write down
what steps have you taken before the failure happened? Did this
failure occur during upgrading Flink, or after the upgrade etc.

Best,

Piotrek

śr., 8 wrz 2021 o 16:11 Peter Westermann
mailto:no.westerm...@genesys.com>>
napisał(a):

We recently upgraded from Flink 1.12.4 to 1.12.5 and are
seeing some weird behavior after a change in jobmanager
leadership: We’re seeing two copies of the same job, one of
those is in SUSPENDED state and has a start time of zero.
Here’s the output from the /jobs/overview endpoint:

{

  "jobs": [{

    "jid": "2db4ee6397151a1109d1ca05188a4cbb",

    "name": "analytics-flink-v1",

    "state": "RUNNING",

    "start-time": 1631106146284,

    "end-time": -1,

    "duration": 2954642,

    "last-modification": 1631106152322,

    "tasks": {

  "total": 112,

  "created": 0,

  "scheduled": 0,

  "deploying": 0,

  "running": 112,

  "finished": 0,

  "canceling": 0,

  "canceled": 0,

  "failed": 0,

  &quo

Re: Duplicate copies of job in Flink UI/API

2021-09-09 Thread Peter Westermann
Thanks Chesnay. You are understanding this correctly. Your explanation makes 
sense to me.
Is there anything we can do to prevent this? At least for us, most times a 
leader election happens, the leader doesn’t actually change because the 
jobmanager is still healthy.

Thanks,
Peter




From: Chesnay Schepler 
Date: Thursday, September 9, 2021 at 9:11 AM
To: Peter Westermann , Piotr Nowojski 
, user@flink.apache.org 
Subject: Re: Duplicate copies of job in Flink UI/API
Just to double-check that I'm understanding things correctly:

You have a job with HA, then Zookeeper breaks down, the job gets suspended, ZK 
comes back online, and the _same_ JobManager becomes the leader?

If so, then I can explain why this happens and hopefully reproduce it.

In short, when a job is suspended (or terminates in any other way) then 
information about the job is stored in a data-structure.
This is used by the REST API (and thus, UI) to query completed jobs.
For _running_ jobs we query the JobMaster (a component within the JobManager 
responsible for that job) instead.

When listing all jobs we query all jobs from the data-structure for finished 
jobs, _and_ all JobMasters for running jobs. The core assumption here is that 
for a given ID only one of these can return something.

So what ends up happening is that when the job is suspended it is written to 
the data-structure, and then another JobMaster is started for the same job, and 
when listing all jobs we can now end up asking for the same job from multiple 
sources.

This is a somewhat unusual scenario because usually when a job is suspended 
another JobManager becomes the leader (where this wouldn't occur because the 
data-structure isn't shared across JobManagers).



On 09/09/2021 13:37, Peter Westermann wrote:
Hi Piotr,

Jobmanager logs are attached to this email. The only thing that jumps out to me 
is this:

09/08/2021 09:02:26.240 -0400 ERROR 
org.apache.flink.runtime.history.FsJobArchivist Failed to archive job.
  java.io.IOException: File already 
exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb

This happened days after the Flink update – and not just once. Across all our 
Flink clusters I’ve seen this 3 times. The cause for the jobmanager leadership 
loss in this case was a deployment of our zookeeper cluster that lead to a 
brief connection loss. The new leader election is expected.

Thanks,
Peter


From: Piotr Nowojski <mailto:pnowoj...@apache.org>
Date: Thursday, September 9, 2021 at 12:39 AM
To: Peter Westermann 
<mailto:no.westerm...@genesys.com>
Cc: user@flink.apache.org<mailto:user@flink.apache.org> 
<mailto:user@flink.apache.org>
Subject: Re: Duplicate copies of job in Flink UI/API
Hi Peter,

Can you provide relevant JobManager logs? And can you write down what steps 
have you taken before the failure happened? Did this failure occur during 
upgrading Flink, or after the upgrade etc.

Best,
Piotrek

śr., 8 wrz 2021 o 16:11 Peter Westermann 
mailto:no.westerm...@genesys.com>> napisał(a):
We recently upgraded from Flink 1.12.4 to 1.12.5 and are seeing some weird 
behavior after a change in jobmanager leadership: We’re seeing two copies of 
the same job, one of those is in SUSPENDED state and has a start time of zero. 
Here’s the output from the /jobs/overview endpoint:
{
  "jobs": [{
"jid": "2db4ee6397151a1109d1ca05188a4cbb",
"name": "analytics-flink-v1",
"state": "RUNNING",
"start-time": 1631106146284,
"end-time": -1,
"duration": 2954642,
"last-modification": 1631106152322,
"tasks": {
  "total": 112,
  "created": 0,
  "scheduled": 0,
  "deploying": 0,
  "running": 112,
  "finished": 0,
  "canceling": 0,
  "canceled": 0,
  "failed": 0,
  "reconciling": 0
}
  }, {
"jid": "2db4ee6397151a1109d1ca05188a4cbb",
"name": "analytics-flink-v1",
"state": "SUSPENDED",
"start-time": 0,
"end-time": -1,
"duration": 1631105900760,
"last-modification": 0,
"tasks": {
  "total": 0,
  "created": 0,
  "scheduled": 0,
  "deploying": 0,
  "running": 0,
  "finished": 0,
  "canceling": 0,
  "canceled": 0,
  "failed": 0,
  "reconciling": 0
}
  }]
}

Has anyone seen this behavior before?

Thanks,
Peter




Re: Duplicate copies of job in Flink UI/API

2021-09-09 Thread Chesnay Schepler

Just to double-check that I'm understanding things correctly:

You have a job with HA, then Zookeeper breaks down, the job gets 
suspended, ZK comes back online, and the _same_ JobManager becomes the 
leader?


If so, then I can explain why this happens and hopefully reproduce it.

In short, when a job is suspended (or terminates in any other way) then 
information about the job is stored in a data-structure.

This is used by the REST API (and thus, UI) to query completed jobs.
For _running_ jobs we query the JobMaster (a component within the 
JobManager responsible for that job) instead.


When listing all jobs we query all jobs from the data-structure for 
finished jobs, _and_ all JobMasters for running jobs. The core 
assumption here is that for a given ID only one of these can return 
something.


So what ends up happening is that when the job is suspended it is 
written to the data-structure, and then another JobMaster is started for 
the same job, and when listing all jobs we can now end up asking for the 
same job from multiple sources.


This is a somewhat unusual scenario because usually when a job is 
suspended another JobManager becomes the leader (where this wouldn't 
occur because the data-structure isn't shared across JobManagers).




On 09/09/2021 13:37, Peter Westermann wrote:


Hi Piotr,

Jobmanager logs are attached to this email. The only thing that jumps 
out to me is this:


09/08/2021 09:02:26.240 -0400 ERROR 
org.apache.flink.runtime.history.FsJobArchivist Failed to archive job.


  java.io.IOException: File already 
exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb


This happened days after the Flink update – and not just once. Across 
all our Flink clusters I’ve seen this 3 times. The cause for the 
jobmanager leadership loss in this case was a deployment of our 
zookeeper cluster that lead to a brief connection loss. The new leader 
election is expected.


Thanks,

Peter

*From: *Piotr Nowojski 
*Date: *Thursday, September 9, 2021 at 12:39 AM
*To: *Peter Westermann 
*Cc: *user@flink.apache.org 
*Subject: *Re: Duplicate copies of job in Flink UI/API

Hi Peter,

Can you provide relevant JobManager logs? And can you write down what 
steps have you taken before the failure happened? Did this 
failure occur during upgrading Flink, or after the upgrade etc.


Best,

Piotrek

śr., 8 wrz 2021 o 16:11 Peter Westermann <mailto:no.westerm...@genesys.com>> napisał(a):


We recently upgraded from Flink 1.12.4 to 1.12.5 and are seeing
some weird behavior after a change in jobmanager leadership: We’re
seeing two copies of the same job, one of those is in SUSPENDED
state and has a start time of zero. Here’s the output from the
/jobs/overview endpoint:

{

  "jobs": [{

    "jid": "2db4ee6397151a1109d1ca05188a4cbb",

    "name": "analytics-flink-v1",

    "state": "RUNNING",

    "start-time": 1631106146284,

    "end-time": -1,

    "duration": 2954642,

    "last-modification": 1631106152322,

    "tasks": {

  "total": 112,

  "created": 0,

  "scheduled": 0,

  "deploying": 0,

  "running": 112,

  "finished": 0,

  "canceling": 0,

  "canceled": 0,

  "failed": 0,

  "reconciling": 0

    }

  }, {

    "jid": "2db4ee6397151a1109d1ca05188a4cbb",

    "name": "analytics-flink-v1",

    "state": "SUSPENDED",

    "start-time": 0,

    "end-time": -1,

    "duration": 1631105900760,

    "last-modification": 0,

    "tasks": {

  "total": 0,

  "created": 0,

  "scheduled": 0,

  "deploying": 0,

  "running": 0,

  "finished": 0,

  "canceling": 0,

  "canceled": 0,

  "failed": 0,

  "reconciling": 0

    }

  }]

}

Has anyone seen this behavior before?

Thanks,

Peter





Re: Duplicate copies of job in Flink UI/API

2021-09-09 Thread Peter Westermann
Hi Piotr,

Jobmanager logs are attached to this email. The only thing that jumps out to me 
is this:

09/08/2021 09:02:26.240 -0400 ERROR 
org.apache.flink.runtime.history.FsJobArchivist Failed to archive job.
  java.io.IOException: File already 
exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb

This happened days after the Flink update – and not just once. Across all our 
Flink clusters I’ve seen this 3 times. The cause for the jobmanager leadership 
loss in this case was a deployment of our zookeeper cluster that lead to a 
brief connection loss. The new leader election is expected.

Thanks,
Peter


From: Piotr Nowojski 
Date: Thursday, September 9, 2021 at 12:39 AM
To: Peter Westermann 
Cc: user@flink.apache.org 
Subject: Re: Duplicate copies of job in Flink UI/API
Hi Peter,

Can you provide relevant JobManager logs? And can you write down what steps 
have you taken before the failure happened? Did this failure occur during 
upgrading Flink, or after the upgrade etc.

Best,
Piotrek

śr., 8 wrz 2021 o 16:11 Peter Westermann 
mailto:no.westerm...@genesys.com>> napisał(a):
We recently upgraded from Flink 1.12.4 to 1.12.5 and are seeing some weird 
behavior after a change in jobmanager leadership: We’re seeing two copies of 
the same job, one of those is in SUSPENDED state and has a start time of zero. 
Here’s the output from the /jobs/overview endpoint:
{
  "jobs": [{
"jid": "2db4ee6397151a1109d1ca05188a4cbb",
"name": "analytics-flink-v1",
"state": "RUNNING",
"start-time": 1631106146284,
"end-time": -1,
"duration": 2954642,
"last-modification": 1631106152322,
"tasks": {
  "total": 112,
  "created": 0,
  "scheduled": 0,
  "deploying": 0,
  "running": 112,
  "finished": 0,
  "canceling": 0,
  "canceled": 0,
  "failed": 0,
  "reconciling": 0
}
  }, {
"jid": "2db4ee6397151a1109d1ca05188a4cbb",
"name": "analytics-flink-v1",
"state": "SUSPENDED",
"start-time": 0,
"end-time": -1,
"duration": 1631105900760,
"last-modification": 0,
"tasks": {
  "total": 0,
  "created": 0,
  "scheduled": 0,
  "deploying": 0,
  "running": 0,
  "finished": 0,
  "canceling": 0,
  "canceled": 0,
  "failed": 0,
  "reconciling": 0
}
  }]
}

Has anyone seen this behavior before?

Thanks,
Peter
09/08/2021 09:02:31.015 -0400 INFO 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot 
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb 
with allocation id fbddb90b669081bd9907c835f1906a79.
09/08/2021 09:02:31.015 -0400 INFO 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot 
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb 
with allocation id ee62d44923180b0ac66e10ed170f0af3.
09/08/2021 09:02:31.015 -0400 INFO 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot 
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb 
with allocation id 13d0e72b41883dfb84e866645b07dc92.
09/08/2021 09:02:31.015 -0400 INFO 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot 
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb 
with allocation id ea4064056de27327a2037f4d71aa9e5c.
09/08/2021 09:02:31.015 -0400 INFO 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot 
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb 
with allocation id cfcc6014e93f09884ad5f61e4a108e8d.
09/08/2021 09:02:31.014 -0400 INFO 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot 
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb 
with allocation id 3dd4d17bf50232c95f178d6a235f2dc1.
09/08/2021 09:02:31.014 -0400 INFO 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot 
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb 
with allocation id 13a7257a9aedd2ce2ca5267f93586763.
09/08/2021 09:02:31.014 -0400 INFO 
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot 
[SlotRequestId{f9b28f7479a60f286846fd9d5f1f4e8e}] and profile 
ResourceProfile{UNKNOWN} with allocation id fbddb90b669081bd9907c835f1906a79 
from resource manager.
09/08/2021 09:02:31.014 -0400 INFO 
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot 
[SlotRequestId{1df0f0709f439fc647c995a36

Re: Duplicate copies of job in Flink UI/API

2021-09-08 Thread Piotr Nowojski
Hi Peter,

Can you provide relevant JobManager logs? And can you write down what steps
have you taken before the failure happened? Did this failure occur during
upgrading Flink, or after the upgrade etc.

Best,
Piotrek

śr., 8 wrz 2021 o 16:11 Peter Westermann 
napisał(a):

> We recently upgraded from Flink 1.12.4 to 1.12.5 and are seeing some weird
> behavior after a change in jobmanager leadership: We’re seeing two copies
> of the same job, one of those is in SUSPENDED state and has a start time of
> zero. Here’s the output from the /jobs/overview endpoint:
>
> {
>
>   "jobs": [{
>
> "jid": "2db4ee6397151a1109d1ca05188a4cbb",
>
> "name": "analytics-flink-v1",
>
> "state": "RUNNING",
>
> "start-time": 1631106146284,
>
> "end-time": -1,
>
> "duration": 2954642,
>
> "last-modification": 1631106152322,
>
> "tasks": {
>
>   "total": 112,
>
>   "created": 0,
>
>   "scheduled": 0,
>
>   "deploying": 0,
>
>   "running": 112,
>
>   "finished": 0,
>
>   "canceling": 0,
>
>   "canceled": 0,
>
>   "failed": 0,
>
>   "reconciling": 0
>
> }
>
>   }, {
>
> "jid": "2db4ee6397151a1109d1ca05188a4cbb",
>
> "name": "analytics-flink-v1",
>
> "state": "SUSPENDED",
>
> "start-time": 0,
>
> "end-time": -1,
>
> "duration": 1631105900760,
>
> "last-modification": 0,
>
> "tasks": {
>
>   "total": 0,
>
>   "created": 0,
>
>   "scheduled": 0,
>
>   "deploying": 0,
>
>   "running": 0,
>
>   "finished": 0,
>
>   "canceling": 0,
>
>   "canceled": 0,
>
>   "failed": 0,
>
>   "reconciling": 0
>
> }
>
>   }]
>
> }
>
>
>
> Has anyone seen this behavior before?
>
>
>
> Thanks,
>
> Peter
>


Duplicate copies of job in Flink UI/API

2021-09-08 Thread Peter Westermann
We recently upgraded from Flink 1.12.4 to 1.12.5 and are seeing some weird 
behavior after a change in jobmanager leadership: We’re seeing two copies of 
the same job, one of those is in SUSPENDED state and has a start time of zero. 
Here’s the output from the /jobs/overview endpoint:
{
  "jobs": [{
"jid": "2db4ee6397151a1109d1ca05188a4cbb",
"name": "analytics-flink-v1",
"state": "RUNNING",
"start-time": 1631106146284,
"end-time": -1,
"duration": 2954642,
"last-modification": 1631106152322,
"tasks": {
  "total": 112,
  "created": 0,
  "scheduled": 0,
  "deploying": 0,
  "running": 112,
  "finished": 0,
  "canceling": 0,
  "canceled": 0,
  "failed": 0,
  "reconciling": 0
}
  }, {
"jid": "2db4ee6397151a1109d1ca05188a4cbb",
"name": "analytics-flink-v1",
"state": "SUSPENDED",
"start-time": 0,
"end-time": -1,
"duration": 1631105900760,
"last-modification": 0,
"tasks": {
  "total": 0,
  "created": 0,
  "scheduled": 0,
  "deploying": 0,
  "running": 0,
  "finished": 0,
  "canceling": 0,
  "canceled": 0,
  "failed": 0,
  "reconciling": 0
}
  }]
}

Has anyone seen this behavior before?

Thanks,
Peter