Fabio Wanner created FLINK-32552:
------------------------------------
Summary: Mixed up Flink session job deployments
Key: FLINK-32552
URL: https://issues.apache.org/jira/browse/FLINK-32552
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Reporter: Fabio Wanner
*Context*
As part of our end-to-end tests, we regularly deploy all of our Flink session
jobs to a staging environment. Some of the jobs are bundled together in one
Helm chart and are therefore deployed at the same time. There are around 40
individual Flink jobs, all running on the same Flink session cluster. Each e2e
test run gets its own session cluster. The problems described below occur
rarely (in roughly 1 of 50 runs).
*Problem*
Occasionally the operator seems to "mix up" the deployments. This is visible in
the Flink cluster logs, where multiple {{Received JobGraph submission '<JOB NAME>'
(<JOB_ID>)}} entries appear for different jobs with the same job ID. This results
in errors such as {{DuplicateJobSubmissionException}} or
{{ClassNotFoundException}}.
It is also visible in the FlinkSessionJob resource: status.jobStatus.jobName does
not match the expected name of the job being deployed (the job name is passed
to the application as an argument).
So far we have been unable to reliably reproduce the error.
*Details*
The following lines show the status of 3 jobs from the viewpoint of the Flink
cluster dashboard and the FlinkSessionJob resource:
*aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615*
Apache Flink Dashboard:
* State: Restarting
* ID: a7d36f3881f943a00000000000000002
* Exceptions: Cannot load user class: aelps.pipelines.aletsch.smc.SMCUrlMapper
FlinkSessionJob Resource:
* State: RUNNING
* jobId: a1221c743367497b0000000000000002
* uid: a1221c74-3367-497b-ad2f-8793ab23919d
*aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615*
Apache Flink Dashboard:
* State: -
* ID: -
FlinkSessionJob Resource:
* State: UPGRADING
* jobId: -
* uid: a7d36f38-81f9-43a0-898f-19b950430e9d
Flink K8s Operator:
* Exceptions: DuplicateJobSubmissionException: Job has already been submitted.
*aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615*
Apache Flink Dashboard:
* State: Running
* ID: e692c2dfaa18441c0000000000000002
* Exceptions: -
FlinkSessionJob Resource:
* State: RUNNING
* jobId: e692c2dfaa18441c0000000000000002
* uid: e692c2df-aa18-441c-a352-88aefa9a3017
As we can see, the *aletsch_smc* job is supposedly running according to the
FlinkSessionJob resource, but it is crash-looping in the cluster, where its job
ID matches the uid of the *aletsch_mat* resource. Meanwhile, *aletsch_mat* is
not running at all. The following logs also show some suspicious entries: there
are several {{Received JobGraph submission}} entries from different jobs with
the same job ID.
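In the values listed above, each jobId starts with the first 16 hex digits of the corresponding resource uid, followed by a 16-digit suffix. Assuming that pattern holds in general (it is inferred from this report only, not taken from the operator source), a minimal sketch like the following can tell which resource a job ID seen in the dashboard belongs to. The helper names are hypothetical and only serve as illustration:

```python
import uuid
from typing import Dict, Optional


def expected_job_id_prefix(resource_uid: str) -> str:
    """Upper 64 bits of the resource uid as 16 hex digits (assumed jobId prefix)."""
    return uuid.UUID(resource_uid).hex[:16]


def owner_of(job_id: str, resources: Dict[str, str]) -> Optional[str]:
    """Return the name of the resource whose uid prefix matches job_id, if any."""
    for name, uid in resources.items():
        if job_id.startswith(expected_job_id_prefix(uid)):
            return name
    return None


def find_mismatches(dashboard: Dict[str, str], resources: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each job name to the resource its job ID actually points to, if different."""
    result = {}
    for name, job_id in dashboard.items():
        owner = owner_of(job_id, resources)
        if owner != name:
            result[name] = owner
    return result


# uids of the three FlinkSessionJob resources from this report.
RESOURCES = {
    "aletsch_smc": "a1221c74-3367-497b-ad2f-8793ab23919d",
    "aletsch_mat": "a7d36f38-81f9-43a0-898f-19b950430e9d",
    "aletsch_wp_wafer": "e692c2df-aa18-441c-a352-88aefa9a3017",
}

# Job IDs shown in the Flink dashboard (aletsch_mat has none).
DASHBOARD = {
    "aletsch_smc": "a7d36f3881f943a00000000000000002",
    "aletsch_wp_wafer": "e692c2dfaa18441c0000000000000002",
}

if __name__ == "__main__":
    for job, owner in find_mismatches(DASHBOARD, RESOURCES).items():
        print(f"{job} runs under a job ID derived from {owner}")
```

With the data from this report, only *aletsch_smc* is flagged, and its job ID points to the *aletsch_mat* resource.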
*Logs*
The logs are filtered by the 3 jobIds from above.
JobID: a7d36f3881f943a00000000000000002
{code:bash}
Flink Cluster
...
2023-07-06 10:23:50,552 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
2023-07-06 10:23:50 file:
'/tmp/tm_10.0.11.159:6122-e9fadc/blobStorage/job_a7d36f3881f943a00000000000000002/blob_p-40c7a30adef8868254191d2cf2dbc4cb7ab46f0d-8a02a0583d91c5e8e6c94f378aa444c2'
(valid JAR)
2023-07-06 10:23:50,522 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=4}]
2023-07-06 10:23:50,522 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=3}]
2023-07-06 10:23:50,522 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=2}]
2023-07-06 10:23:50,522 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=1}]
2023-07-06 10:23:50,512 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a7d36f3881f943a00000000000000002) switched from state RESTARTING to RUNNING.
2023-07-06 10:23:48,979 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Clearing resource requirements of job a7d36f3881f943a00000000000000002
2023-07-06 10:23:48,853 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=1}]
2023-07-06 10:23:48,853 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=2}]
2023-07-06 10:23:48,853 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=3}]
2023-07-06 10:23:48 file:
'/tmp/tm_10.0.11.159:6122-e9fadc/blobStorage/job_a7d36f3881f943a00000000000000002/blob_p-40c7a30adef8868254191d2cf2dbc4cb7ab46f0d-8a02a0583d91c5e8e6c94f378aa444c2'
(valid JAR)
2023-07-06 10:23:48,661 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
2023-07-06 10:23:48,583 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=4}]
2023-07-06 10:23:48,583 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=3}]
2023-07-06 10:23:48,583 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=2}]
2023-07-06 10:23:48,582 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=1}]
2023-07-06 10:23:48,573 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a7d36f3881f943a00000000000000002) switched from state RESTARTING to RUNNING.
2023-07-06 10:23:47,562 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615'
(a7d36f3881f943a00000000000000002).
2023-07-06 10:23:47,518 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Clearing resource requirements of job a7d36f3881f943a00000000000000002
2023-07-06 10:23:47,517 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=1}]
2023-07-06 10:23:47,517 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=2}]
2023-07-06 10:23:47,516 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=3}]
2023-07-06 10:23:47,463 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Submitting Job with JobId=a7d36f3881f943a00000000000000002.
2023-07-06 10:23:47,463 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Job a7d36f3881f943a00000000000000002 is submitted.
2023-07-06 10:23:47,104 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a7d36f3881f943a00000000000000002) switched from state RUNNING to RESTARTING.
2023-07-06 10:23:46,804 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Offer
reserved slots to the leader of job a7d36f3881f943a00000000000000002.
2023-07-06 10:23:46,804 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Establish
JobManager connection for job a7d36f3881f943a00000000000000002.
2023-07-06 10:23:46,799 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful
registration at job manager
akka.tcp://[email protected]:6123/user/rpc/jobmanager_2 for job
a7d36f3881f943a00000000000000002.
2023-07-06 10:23:46,577 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot
request 221b24b50413805c9e35d7620b8a00b8 for job
a7d36f3881f943a00000000000000002 from resource manager with leader id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:46,577 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot
request 49d3c8cd1080bd38c0144c3d3cc597cd for job
a7d36f3881f943a00000000000000002 from resource manager with leader id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:46,577 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot
request 819f34cc8957066478fb4b3549367d24 for job
a7d36f3881f943a00000000000000002 from resource manager with leader id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:46,574 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job
a7d36f3881f943a00000000000000002 for job leader monitoring.
2023-07-06 10:23:46,570 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot
request 36802a7de1487f3fb1b6a3b509bd5e20 for job
a7d36f3881f943a00000000000000002 from resource manager with leader id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:46,560 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a7d36f3881f943a00000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=4}]
2023-07-06 10:23:46,556 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registered job manager
[email protected]://[email protected]:6123/user/rpc/jobmanager_2
for job a7d36f3881f943a00000000000000002.
2023-07-06 10:23:46,528 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering job manager
[email protected]://[email protected]:6123/user/rpc/jobmanager_2
for job a7d36f3881f943a00000000000000002.
2023-07-06 10:23:46,480 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a7d36f3881f943a00000000000000002) switched from state CREATED to RUNNING.
2023-07-06 10:23:46,476 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Starting execution of job
'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
(a7d36f3881f943a00000000000000002) under job master id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:46,466 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Using failover strategy
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@62877000
for aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a7d36f3881f943a00000000000000002).
2023-07-06 10:23:46,079 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Running initialization on master for job
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a7d36f3881f943a00000000000000002).
2023-07-06 10:23:46,059 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] -
Found 0 checkpoints in
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a7d36f3881f943a00000000000000002-config-map'}.
2023-07-06 10:23:46,051 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] -
Recovering checkpoints from
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a7d36f3881f943a00000000000000002-config-map'}.
2023-07-06 10:23:46,006 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Using restart back off time strategy
ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000,
maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000,
jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a7d36f3881f943a00000000000000002).
2023-07-06 10:23:45,987 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Initializing job
'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
(a7d36f3881f943a00000000000000002).
2023-07-06 10:23:45,966 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615'
(a7d36f3881f943a00000000000000002).
2023-07-06 10:23:45,965 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615'
(a7d36f3881f943a00000000000000002).
2023-07-06 10:23:45,915 INFO
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Added
JobGraph(jobId: a7d36f3881f943a00000000000000002) to
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
2023-07-06 10:23:45,859 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting
job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
(a7d36f3881f943a00000000000000002).
2023-07-06 10:23:45,857 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
(a7d36f3881f943a00000000000000002).
2023-07-06 10:23:45,705 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Submitting Job with JobId=a7d36f3881f943a00000000000000002.
2023-07-06 10:23:45,705 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Job a7d36f3881f943a00000000000000002 is submitted.
2023-07-06 10:23:45,705 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Submitting Job with JobId=a7d36f3881f943a00000000000000002.
2023-07-06 10:23:45,705 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Job a7d36f3881f943a00000000000000002 is submitted.
2023-07-06 10:23:45,705 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Submitting Job with JobId=a7d36f3881f943a00000000000000002.
2023-07-06 10:23:45,705 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Job a7d36f3881f943a00000000000000002 is submitted.
Flink Operator
2023-07-06 10:26:25 2023-07-06 08:26:25,792
o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
a7d36f3881f943a00000000000000002 to session cluster.
2023-07-06 10:25:05,163 o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
a7d36f3881f943a00000000000000002 to session cluster.
2023-07-06 10:24:24,553 o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
a7d36f3881f943a00000000000000002 to session cluster.
2023-07-06 10:24:03,850 o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
a7d36f3881f943a00000000000000002 to session cluster.
2023-07-06 10:23:53,094 o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
a7d36f3881f943a00000000000000002 to session cluster.
2023-07-06 10:23:47,346 o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
a7d36f3881f943a00000000000000002 to session cluster.
2023-07-06 10:23:45,372 o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-mat-staging-e5730831] Submitting job:
a7d36f3881f943a00000000000000002 to session cluster.
{code}
JobID: a1221c743367497b0000000000000002
{code:bash}
Flink Cluster
2023-07-06 11:23:48,062 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 1 for job a1221c743367497b0000000000000002 (48548 bytes,
checkpointDuration=107 ms, finalizationTime=33 ms).
2023-07-06 11:23:47,937 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 1 (type=CheckpointType{name='Checkpoint',
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688635427922 for job
a1221c743367497b0000000000000002.
2023-07-06 10:23:48,567 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Offer
reserved slots to the leader of job a1221c743367497b0000000000000002.
2023-07-06 10:23:48,567 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Establish
JobManager connection for job a1221c743367497b0000000000000002.
2023-07-06 10:23:48,567 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful
registration at job manager
akka.tcp://[email protected]:6123/user/rpc/jobmanager_7 for job
a1221c743367497b0000000000000002.
2023-07-06 10:23:48,009 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot
request cae6932e2409d5fece3f6b4636e3c71a for job
a1221c743367497b0000000000000002 from resource manager with leader id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:48,003 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot
request 8a57f3ecff07d300aebb33f6b3545aed for job
a1221c743367497b0000000000000002 from resource manager with leader id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:48,003 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot
request 7a4a0cfd16eec4a1cb043cce5f989db0 for job
a1221c743367497b0000000000000002 from resource manager with leader id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:48,002 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job
a1221c743367497b0000000000000002 for job leader monitoring.
2023-07-06 10:23:48,002 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot
request 92cbc64513fa703e4acf28bbb3088a58 for job
a1221c743367497b0000000000000002 from resource manager with leader id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:48,999 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job a1221c743367497b0000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=4}]
2023-07-06 10:23:47,998 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registered job manager
[email protected]://[email protected]:6123/user/rpc/jobmanager_7
for job a1221c743367497b0000000000000002.
2023-07-06 10:23:47,953 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering job manager
[email protected]://[email protected]:6123/user/rpc/jobmanager_7
for job a1221c743367497b0000000000000002.
2023-07-06 10:23:47,922 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a1221c743367497b0000000000000002) switched from state CREATED to RUNNING.
2023-07-06 10:23:47,887 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Starting execution of job
'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
(a1221c743367497b0000000000000002) under job master id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:47,887 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Using failover strategy
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@2222ba4d
for aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a1221c743367497b0000000000000002).
2023-07-06 10:23:47,880 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Running initialization on master for job
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a1221c743367497b0000000000000002).
2023-07-06 10:23:47,872 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] -
Found 0 checkpoints in
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a1221c743367497b0000000000000002-config-map'}.
2023-07-06 10:23:47,867 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] -
Recovering checkpoints from
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-a1221c743367497b0000000000000002-config-map'}.
2023-07-06 10:23:47,832 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Using restart back off time strategy
ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000,
maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000,
jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for
aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615
(a1221c743367497b0000000000000002).
2023-07-06 10:23:47,832 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Initializing job
'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
(a1221c743367497b0000000000000002).
2023-07-06 10:23:47,820 INFO
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Added
JobGraph(jobId: a1221c743367497b0000000000000002) to
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
2023-07-06 10:23:47,780 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting
job 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
(a1221c743367497b0000000000000002).
2023-07-06 10:23:47,776 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
(a1221c743367497b0000000000000002).
2023-07-06 10:23:47,668 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Submitting Job with JobId=a1221c743367497b0000000000000002.
2023-07-06 10:23:47,668 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Job a1221c743367497b0000000000000002 is submitted.
Flink Operator
2023-07-06 10:23:48,007 o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-smc-staging-e5730831] Submitted job:
a1221c743367497b0000000000000002 to session cluster.
2023-07-06 10:23:47,505 o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-smc-staging-e5730831] Submitting job:
a1221c743367497b0000000000000002 to session cluster.
2023-07-06 10:23:45,416 o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-smc-staging-e5730831] Submitting job:
a1221c743367497b0000000000000002 to session cluster.
{code}
JobID: e692c2dfaa18441c0000000000000002
{code:bash}
Flink Cluster
2023-07-06 11:23:48,004 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 1 for job e692c2dfaa18441c0000000000000002 (8194 bytes,
checkpointDuration=125 ms, finalizationTime=28 ms).
2023-07-06 11:23:47,867 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 1 (type=CheckpointType{name='Checkpoint',
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1688635427851 for job
e692c2dfaa18441c0000000000000002.
2023-07-06 10:23:48,568 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Offer
reserved slots to the leader of job e692c2dfaa18441c0000000000000002.
2023-07-06 10:23:48,568 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Establish
JobManager connection for job e692c2dfaa18441c0000000000000002.
2023-07-06 10:23:48,568 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Successful
registration at job manager
akka.tcp://[email protected]:6123/user/rpc/jobmanager_6 for job
e692c2dfaa18441c0000000000000002.
2023-07-06 10:23:48,002 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot
request 5e5a0e55fac280bf31abf29a20bce684 for job
e692c2dfaa18441c0000000000000002 from resource manager with leader id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:48,002 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot
request 1cdbce54f4376a1df86430f97dab6858 for job
e692c2dfaa18441c0000000000000002 from resource manager with leader id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:48,002 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot
request 352db7288d0e4d1775d5f52dd14c769d for job
e692c2dfaa18441c0000000000000002 from resource manager with leader id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:48,001 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job
e692c2dfaa18441c0000000000000002 for job leader monitoring.
2023-07-06 10:23:48,000 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot
request bffed3e4a4c8573049a4119bd7e15f19 for job
e692c2dfaa18441c0000000000000002 from resource manager with leader id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:48,998 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job e692c2dfaa18441c0000000000000002:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=4}]
2023-07-06 10:23:47,998 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registered job manager
[email protected]://[email protected]:6123/user/rpc/jobmanager_6
for job e692c2dfaa18441c0000000000000002.
2023-07-06 10:23:47,953 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering job manager
[email protected]://[email protected]:6123/user/rpc/jobmanager_6
for job e692c2dfaa18441c0000000000000002.
2023-07-06 10:23:47,851 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615
(e692c2dfaa18441c0000000000000002) switched from state CREATED to RUNNING.
2023-07-06 10:23:47,845 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Starting execution of job
'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615'
(e692c2dfaa18441c0000000000000002) under job master id
aaa9331f70b07a195b5f09d57d1b40c5.
2023-07-06 10:23:47,844 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Using failover strategy
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@7eeab246
for aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615
(e692c2dfaa18441c0000000000000002).
2023-07-06 10:23:47,834 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Running initialization on master for job
aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615
(e692c2dfaa18441c0000000000000002).
2023-07-06 10:23:47,825 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] -
Found 0 checkpoints in
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-e692c2dfaa18441c0000000000000002-config-map'}.
2023-07-06 10:23:47,813 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] -
Recovering checkpoints from
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-e692c2dfaa18441c0000000000000002-config-map'}.
2023-07-06 10:23:47,782 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Using restart back off time strategy
ExponentialDelayRestartBackoffTimeStrategy(initialBackoffMS=1000,
maxBackoffMS=300000, backoffMultiplier=2.0, resetBackoffThresholdMS=3600000,
jitterFactor=0.5, currentBackoffMS=1000, lastFailureTimestamp=0) for
aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615
(e692c2dfaa18441c0000000000000002).
2023-07-06 10:23:47,781 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Initializing job
'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615'
(e692c2dfaa18441c0000000000000002).
2023-07-06 10:23:47,774 INFO
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Added
JobGraph(jobId: e692c2dfaa18441c0000000000000002) to
KubernetesStateHandleStore{configMapName='flink-cluster-aelps-staging-e5730831-cluster-config-map'}.
2023-07-06 10:23:47,703 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting
job 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615'
(e692c2dfaa18441c0000000000000002).
2023-07-06 10:23:47,702 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615'
(e692c2dfaa18441c0000000000000002).
2023-07-06 10:23:47,650 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Submitting Job with JobId=e692c2dfaa18441c0000000000000002.
2023-07-06 10:23:47,650 INFO
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor [] -
Job e692c2dfaa18441c0000000000000002 is submitted.
Flink Operator
2023-07-06 10:23:47,973 o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitted job:
e692c2dfaa18441c0000000000000002 to session cluster.
2023-07-06 10:23:47,505 o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitting job:
e692c2dfaa18441c0000000000000002 to session cluster.
2023-07-06 10:23:45,374 o.a.f.k.o.s.AbstractFlinkService [INFO
][aelps-staging/aletsch-wp-wafer-staging-e5730831] Submitting job:
e692c2dfaa18441c0000000000000002 to session cluster.
{code}
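The duplicate submissions in the Flink cluster logs can also be surfaced automatically. The following is a minimal sketch, assuming the logs are available as plain text and that the {{Received JobGraph submission}} entries have the shape shown in this report (possibly wrapped across lines). The function names and the shortened sample are illustrative, not part of Flink or the operator:

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Matches "Received JobGraph submission '<JOB NAME>' (<JOB_ID>)",
# tolerating line wraps between the words.
SUBMISSION = re.compile(
    r"Received\s+JobGraph\s+submission\s+'([^']+)'\s+\(([0-9a-f]{32})\)"
)


def names_per_job_id(log_text: str) -> Dict[str, Set[str]]:
    """Collect every job name that was submitted under each job ID."""
    names: Dict[str, Set[str]] = defaultdict(set)
    for job_name, job_id in SUBMISSION.findall(log_text):
        names[job_id].add(job_name)
    return dict(names)


def conflicting_job_ids(log_text: str) -> Dict[str, List[str]]:
    """Return only the job IDs that were used by more than one job name."""
    return {
        job_id: sorted(names)
        for job_id, names in names_per_job_id(log_text).items()
        if len(names) > 1
    }


# Shortened excerpt of the submission entries from this report.
SAMPLE = """
2023-07-06 10:23:47,702 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615'
(e692c2dfaa18441c0000000000000002).
2023-07-06 10:23:45,966 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_wp_wafer_e5730831db8092adb12f5189c4c895ef3a268615'
(a7d36f3881f943a00000000000000002).
2023-07-06 10:23:45,965 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_mat_e5730831db8092adb12f5189c4c895ef3a268615'
(a7d36f3881f943a00000000000000002).
2023-07-06 10:23:45,857 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received
JobGraph submission 'aletsch_smc_e5730831db8092adb12f5189c4c895ef3a268615'
(a7d36f3881f943a00000000000000002).
"""

if __name__ == "__main__":
    for job_id, names in conflicting_job_ids(SAMPLE).items():
        print(job_id, "was submitted by:", ", ".join(names))
```

On the excerpt above, only job ID a7d36f3881f943a00000000000000002 is reported, with all three job names attached to it.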