Hi Piotr,
The JobManager logs are attached to this email. The only thing that jumps out
at me is this:
09/08/2021 09:02:26.240 -0400 ERROR
org.apache.flink.runtime.history.FsJobArchivist Failed to archive job.
java.io.IOException: File already
exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb
This happened days after the Flink update, and not just once: across all our
Flink clusters I’ve seen it three times. The JobManager leadership loss in
this case was caused by a deployment of our ZooKeeper cluster that led to a
brief connection loss, so the new leader election itself is expected.
Thanks,
Peter
From: Piotr Nowojski <[email protected]>
Date: Thursday, September 9, 2021 at 12:39 AM
To: Peter Westermann <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: Duplicate copies of job in Flink UI/API
Hi Peter,
Can you provide the relevant JobManager logs? And can you write down what
steps you took before the failure happened? Did the failure occur during the
Flink upgrade, or after it?
Best,
Piotrek
On Wed, Sep 8, 2021 at 16:11, Peter Westermann <[email protected]> wrote:
We recently upgraded from Flink 1.12.4 to 1.12.5 and are seeing some weird
behavior after a change in JobManager leadership: two copies of the same job
show up, one of which is in SUSPENDED state and has a start time of zero.
Here’s the output from the /jobs/overview endpoint:
{
  "jobs": [{
    "jid": "2db4ee6397151a1109d1ca05188a4cbb",
    "name": "analytics-flink-v1",
    "state": "RUNNING",
    "start-time": 1631106146284,
    "end-time": -1,
    "duration": 2954642,
    "last-modification": 1631106152322,
    "tasks": {
      "total": 112,
      "created": 0,
      "scheduled": 0,
      "deploying": 0,
      "running": 112,
      "finished": 0,
      "canceling": 0,
      "canceled": 0,
      "failed": 0,
      "reconciling": 0
    }
  }, {
    "jid": "2db4ee6397151a1109d1ca05188a4cbb",
    "name": "analytics-flink-v1",
    "state": "SUSPENDED",
    "start-time": 0,
    "end-time": -1,
    "duration": 1631105900760,
    "last-modification": 0,
    "tasks": {
      "total": 0,
      "created": 0,
      "scheduled": 0,
      "deploying": 0,
      "running": 0,
      "finished": 0,
      "canceling": 0,
      "canceled": 0,
      "failed": 0,
      "reconciling": 0
    }
  }]
}
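To check other clusters for the same symptom, here is a minimal sketch that groups the entries of a /jobs/overview response by "jid" and reports any job ID listed more than once. The function name find_duplicate_jobs and the trimmed sample are illustrative assumptions, not part of Flink; fetching the JSON from the REST endpoint is left out.

```python
import json
from collections import defaultdict


def find_duplicate_jobs(overview):
    """Group the entries of a /jobs/overview response by job ID and
    return only the IDs that are listed more than once."""
    states_by_jid = defaultdict(list)
    for job in overview.get("jobs", []):
        states_by_jid[job["jid"]].append(job["state"])
    return {jid: states for jid, states in states_by_jid.items() if len(states) > 1}


# Trimmed copy of the response above (only the fields the check needs).
sample = json.loads("""
{"jobs": [
  {"jid": "2db4ee6397151a1109d1ca05188a4cbb", "state": "RUNNING", "start-time": 1631106146284},
  {"jid": "2db4ee6397151a1109d1ca05188a4cbb", "state": "SUSPENDED", "start-time": 0}
]}
""")

duplicates = find_duplicate_jobs(sample)
for jid, states in duplicates.items():
    print(jid, states)
```

For the response above this reports the single job ID with the states RUNNING and SUSPENDED, in the order they are listed.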
Has anyone seen this behavior before?
Thanks,
Peter
09/08/2021 09:02:31.015 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb
with allocation id fbddb90b669081bd9907c835f1906a79.
09/08/2021 09:02:31.015 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb
with allocation id ee62d44923180b0ac66e10ed170f0af3.
09/08/2021 09:02:31.015 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb
with allocation id 13d0e72b41883dfb84e866645b07dc92.
09/08/2021 09:02:31.015 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb
with allocation id ea4064056de27327a2037f4d71aa9e5c.
09/08/2021 09:02:31.015 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb
with allocation id cfcc6014e93f09884ad5f61e4a108e8d.
09/08/2021 09:02:31.014 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb
with allocation id 3dd4d17bf50232c95f178d6a235f2dc1.
09/08/2021 09:02:31.014 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb
with allocation id 13a7257a9aedd2ce2ca5267f93586763.
09/08/2021 09:02:31.014 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot
[SlotRequestId{f9b28f7479a60f286846fd9d5f1f4e8e}] and profile
ResourceProfile{UNKNOWN} with allocation id fbddb90b669081bd9907c835f1906a79
from resource manager.
09/08/2021 09:02:31.014 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot
[SlotRequestId{1df0f0709f439fc647c995a36d8c60a7}] and profile
ResourceProfile{UNKNOWN} with allocation id ee62d44923180b0ac66e10ed170f0af3
from resource manager.
09/08/2021 09:02:31.014 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot
[SlotRequestId{2a441ce02b709ffacdcc319d715239d8}] and profile
ResourceProfile{UNKNOWN} with allocation id 13d0e72b41883dfb84e866645b07dc92
from resource manager.
09/08/2021 09:02:31.014 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot
[SlotRequestId{9e3606665fc9b08e63a1cb399863534b}] and profile
ResourceProfile{UNKNOWN} with allocation id ea4064056de27327a2037f4d71aa9e5c
from resource manager.
09/08/2021 09:02:31.014 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot
[SlotRequestId{802408aafe4e015591902c107cecef02}] and profile
ResourceProfile{UNKNOWN} with allocation id cfcc6014e93f09884ad5f61e4a108e8d
from resource manager.
09/08/2021 09:02:31.014 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot
[SlotRequestId{e0a0f561e532d36e64feafee9eae8100}] and profile
ResourceProfile{UNKNOWN} with allocation id 3dd4d17bf50232c95f178d6a235f2dc1
from resource manager.
09/08/2021 09:02:31.013 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot
[SlotRequestId{4cfc987782b4aed47bb5d98394601db6}] and profile
ResourceProfile{UNKNOWN} with allocation id 13a7257a9aedd2ce2ca5267f93586763
from resource manager.
09/08/2021 09:02:31.013 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Request slot
with profile ResourceProfile{UNKNOWN} for job 2db4ee6397151a1109d1ca05188a4cbb
with allocation id 2e742af203d1be847d06c643f3984b54.
09/08/2021 09:02:31.013 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Requesting new slot
[SlotRequestId{0f25b099a7dd972a69c31532798e19ab}] and profile
ResourceProfile{UNKNOWN} with allocation id 2e742af203d1be847d06c643f3984b54
from resource manager.
09/08/2021 09:02:31.013 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
JobManager successfully registered at ResourceManager, leader id:
b0f998b265508bb4cd715749318a4ce4.
09/08/2021 09:02:31.013 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Registered
job manager
038f7d4cea297118551aed586a338a49://0aaaf7660b7f4414af376af4f91f9500:50001/user/rpc/jobmanager_4
for job 2db4ee6397151a1109d1ca05188a4cbb.
09/08/2021 09:02:31.004 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Registering
job manager
038f7d4cea297118551aed586a338a49://0aaaf7660b7f4414af376af4f91f9500:50001/user/rpc/jobmanager_4
for job 2db4ee6397151a1109d1ca05188a4cbb.
09/08/2021 09:02:31.004 -0400 INFO
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService Starting
DefaultLeaderRetrievalService with
ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/2db4ee6397151a1109d1ca05188a4cbb/job_manager_lock'}.
09/08/2021 09:02:31.004 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Resolved ResourceManager address, beginning registration
09/08/2021 09:02:31.004 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Connecting to ResourceManager
akka.ssl.tcp://0aaaf7660b7f4414af376af4f91f9500:50001/user/rpc/resourcemanager_0(b0f998b265508bb4cd715749318a4ce4)
09/08/2021 09:02:31.002 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot
request, no ResourceManager connected. Adding as pending request
[SlotRequestId{f9b28f7479a60f286846fd9d5f1f4e8e}]
09/08/2021 09:02:31.001 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot
request, no ResourceManager connected. Adding as pending request
[SlotRequestId{1df0f0709f439fc647c995a36d8c60a7}]
09/08/2021 09:02:31.001 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot
request, no ResourceManager connected. Adding as pending request
[SlotRequestId{2a441ce02b709ffacdcc319d715239d8}]
09/08/2021 09:02:31.001 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot
request, no ResourceManager connected. Adding as pending request
[SlotRequestId{9e3606665fc9b08e63a1cb399863534b}]
09/08/2021 09:02:31.001 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot
request, no ResourceManager connected. Adding as pending request
[SlotRequestId{802408aafe4e015591902c107cecef02}]
09/08/2021 09:02:31.000 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot
request, no ResourceManager connected. Adding as pending request
[SlotRequestId{e0a0f561e532d36e64feafee9eae8100}]
09/08/2021 09:02:31.000 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot
request, no ResourceManager connected. Adding as pending request
[SlotRequestId{4cfc987782b4aed47bb5d98394601db6}]
09/08/2021 09:02:31.000 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Cannot serve slot
request, no ResourceManager connected. Adding as pending request
[SlotRequestId{0f25b099a7dd972a69c31532798e19ab}]
09/08/2021 09:02:30.997 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Starting scheduling with scheduling strategy
[org.apache.flink.runtime.scheduler.strategy.PipelinedRegionSchedulingStrategy]
09/08/2021 09:02:30.996 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Starting execution of job analytics-flink-v1 (2db4ee6397151a1109d1ca05188a4cbb)
under job master id 9ed35c1fb07f037fed21d31d35cc4abf.
09/08/2021 09:02:30.996 -0400 INFO
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService Starting
DefaultLeaderRetrievalService with
ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/resource_manager_lock'}.
09/08/2021 09:02:30.992 -0400 INFO
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl JobManager runner for
job analytics-flink-v1 (2db4ee6397151a1109d1ca05188a4cbb) was granted
leadership with session id ed21d31d-35cc-4abf-9ed3-5c1fb07f037f at
akka.ssl.tcp://0aaaf7660b7f4414af376af4f91f9500:50001/user/rpc/jobmanager_4.
09/08/2021 09:02:30.990 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Using failover strategy
org.apache.flink.runtime.executiongraph.failover.flip1.RestartAllFailoverStrategy@7e7f0b53
for analytics-flink-v1 (2db4ee6397151a1109d1ca05188a4cbb).
09/08/2021 09:02:30.346 -0400 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Trying to
retrieve checkpoint 318603.
09/08/2021 09:02:29.697 -0400 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Trying to
retrieve checkpoint 318602.
09/08/2021 09:02:28.861 -0400 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Trying to
retrieve checkpoint 318601.
09/08/2021 09:02:27.469 -0400 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Trying to
retrieve checkpoint 318600.
09/08/2021 09:02:27.469 -0400 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Trying to
fetch 4 checkpoints from storage.
09/08/2021 09:02:27.469 -0400 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Found 4
checkpoints in
ZooKeeperStateHandleStore{namespace='analytics-flink/analytics-flink-v1/3069/checkpoints/2db4ee6397151a1109d1ca05188a4cbb'}.
09/08/2021 09:02:27.454 -0400 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Recovering
checkpoints from
ZooKeeperStateHandleStore{namespace='analytics-flink/analytics-flink-v1/3069/checkpoints/2db4ee6397151a1109d1ca05188a4cbb'}.
09/08/2021 09:02:27.452 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Using application-defined state backend:
RocksDBStateBackend{checkpointStreamBackend=File State Backend (checkpoints:
's3p://inin-prod-aps1-analytics/analytics-flink/analytics-flink-v1/3069/checkpoints/HASH',
savepoints:
's3p://inin-prod-aps1-analytics/analytics-flink/savepoints/analytics-flink-v1',
asynchronous: TRUE, fileStateThreshold: 1048576), localRocksDbDirectories=null,
enableIncrementalCheckpointing=TRUE, numberOfTransferThreads=2,
writeBatchSize=2097152}
09/08/2021 09:02:27.451 -0400 INFO
org.apache.flink.contrib.streaming.state.RocksDBStateBackend Using
application-defined options factory: AnalyticsRocksOptionsFactory
[baseline=FLASH_SSD_OPTIMIZED, compressionType=ZSTD_COMPRESSION].
09/08/2021 09:02:27.451 -0400 INFO
org.apache.flink.contrib.streaming.state.RocksDBStateBackend Using predefined
options: FLASH_SSD_OPTIMIZED.
09/08/2021 09:02:27.451 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Using job/cluster config to configure application-defined state backend:
RocksDBStateBackend{checkpointStreamBackend=File State Backend (checkpoints:
's3p://inin-prod-aps1-analytics/analytics-flink/analytics-flink-v1/3069/checkpoints/HASH',
savepoints: 'null', asynchronous: UNDEFINED, fileStateThreshold: -1),
localRocksDbDirectories=null, enableIncrementalCheckpointing=UNDEFINED,
numberOfTransferThreads=-1, writeBatchSize=-1}
09/08/2021 09:02:27.449 -0400 INFO org.apache.flink.runtime.util.ZooKeeperUtils
Initialized DefaultCompletedCheckpointStore in
'ZooKeeperStateHandleStore{namespace='analytics-flink/analytics-flink-v1/3069/checkpoints/2db4ee6397151a1109d1ca05188a4cbb'}'
with /checkpoints/2db4ee6397151a1109d1ca05188a4cbb.
09/08/2021 09:02:27.446 -0400 INFO
org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology Built 1
pipelined regions in 0 ms
09/08/2021 09:02:27.441 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Successfully ran initialization on master in 0 ms.
09/08/2021 09:02:27.441 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Running initialization on master for job analytics-flink-v1
(2db4ee6397151a1109d1ca05188a4cbb).
09/08/2021 09:02:27.440 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Using restart back off time strategy
FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=2147483647,
backoffTimeMS=1000) for analytics-flink-v1 (2db4ee6397151a1109d1ca05188a4cbb).
09/08/2021 09:02:27.430 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Initializing job analytics-flink-v1 (2db4ee6397151a1109d1ca05188a4cbb).
09/08/2021 09:02:27.430 -0400 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService Starting RPC endpoint for
org.apache.flink.runtime.jobmaster.JobMaster at
akka://flink/user/rpc/jobmanager_4 .
09/08/2021 09:02:27.429 -0400 INFO
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService Starting
DefaultLeaderElectionService with
ZooKeeperLeaderElectionDriver{leaderPath='/leader/2db4ee6397151a1109d1ca05188a4cbb/job_manager_lock'}.
09/08/2021 09:02:26.535 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Registering
TaskManager with ResourceID 10.105.236.109:50004-1e912c
(akka.ssl.tcp://f32954a85e3a78e12ad552dafb6935b7:50004/user/rpc/taskmanager_0)
at ResourceManager
09/08/2021 09:02:26.331 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Registering
TaskManager with ResourceID 10.105.244.116:50004-6095c5
(akka.ssl.tcp://40949cf374b33869a2d6b1e0fd532b7c:50004/user/rpc/taskmanager_0)
at ResourceManager
09/08/2021 09:02:26.283 -0400 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService Starting RPC endpoint for
org.apache.flink.runtime.dispatcher.StandaloneDispatcher at
akka://flink/user/rpc/dispatcher_3 .
09/08/2021 09:02:26.281 -0400 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
Successfully recovered 1 persisted job graphs.
09/08/2021 09:02:26.281 -0400 INFO
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore Recovered
JobGraph(jobId: 2db4ee6397151a1109d1ca05188a4cbb).
09/08/2021 09:02:26.241 -0400 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher Could not archive
completed job analytics-flink-v1(2db4ee6397151a1109d1ca05188a4cbb) to the
history server.
java.io.IOException: File already
exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb
at c.f.p.h.s.PrestoS3FileSystem.create(PrestoS3FileSystem.java:357)
at o.a.h.fs.FileSystem.create(FileSystem.java:1169)
at o.a.h.fs.FileSystem.create(FileSystem.java:1149)
at o.a.h.fs.FileSystem.create(FileSystem.java:1038)
at o.a.f.f.s.c.HadoopFileSystem.create(HadoopFileSystem.java:154)
at o.a.f.f.s.c.HadoopFileSystem.create(HadoopFileSystem.java:37)
at
o.a.f.c.f.PluginFileSystemFactory$ClassLoaderFixingFileSystem.create(PluginFileSystemFactory.java:170)
at o.a.f.r.h.FsJobArchivist.archiveJob(FsJobArchivist.java:73)
at
o.a.f.r.d.JsonResponseHistoryServerArchivist.lambda$archiveExecutionGraph$0(JsonResponseHistoryServerArchivist.java:57)
at
o.a.f.u.f.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
... 4 common frames omitted
Wrapped by: j.l.RuntimeException: java.io.IOException: File already
exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb
at o.a.f.u.ExceptionUtils.rethrow(ExceptionUtils.java:316)
at
o.a.f.u.f.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:51)
at j.u.c.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
... 3 common frames omitted
Wrapped by: j.u.c.CompletionException: java.lang.RuntimeException:
java.io.IOException: File already
exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb
at j.u.c.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at j.u.c.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at j.u.c.CompletableFuture$AsyncRun.run(CompletableFuture.java:1643)
at j.u.c.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at j.u.c.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
09/08/2021 09:02:26.240 -0400 ERROR
org.apache.flink.runtime.history.FsJobArchivist Failed to archive job.
java.io.IOException: File already
exists:s3p://flink-s3-bucket/history/2db4ee6397151a1109d1ca05188a4cbb
at c.f.p.h.s.PrestoS3FileSystem.create(PrestoS3FileSystem.java:357)
at o.a.h.fs.FileSystem.create(FileSystem.java:1169)
at o.a.h.fs.FileSystem.create(FileSystem.java:1149)
at o.a.h.fs.FileSystem.create(FileSystem.java:1038)
at o.a.f.f.s.c.HadoopFileSystem.create(HadoopFileSystem.java:154)
at o.a.f.f.s.c.HadoopFileSystem.create(HadoopFileSystem.java:37)
at
o.a.f.c.f.PluginFileSystemFactory$ClassLoaderFixingFileSystem.create(PluginFileSystemFactory.java:170)
at o.a.f.r.h.FsJobArchivist.archiveJob(FsJobArchivist.java:73)
at
o.a.f.r.d.JsonResponseHistoryServerArchivist.lambda$archiveExecutionGraph$0(JsonResponseHistoryServerArchivist.java:57)
at o.a.f.u.f.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
at j.u.c.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
at j.u.c.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at j.u.c.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
09/08/2021 09:02:26.227 -0400 INFO
org.apache.flink.runtime.jobmanager.ZooKeeperJobGraphStoreWatcher Stopping
ZooKeeperJobGraphStoreWatcher
09/08/2021 09:02:26.212 -0400 INFO
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore Stopping
DefaultJobGraphStore.
09/08/2021 09:02:26.211 -0400 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher Stopped dispatcher
akka.ssl.tcp://1dd8e04affb77f1da7ab5f7c9202b570:50001/user/rpc/dispatcher_1.
09/08/2021 09:02:26.211 -0400 INFO
org.apache.flink.runtime.rest.handler.legacy.backpressure.BackPressureRequestCoordinator
Shutting down back pressure request coordinator.
09/08/2021 09:02:26.201 -0400 INFO
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore Released job graph
2db4ee6397151a1109d1ca05188a4cbb from
ZooKeeperStateHandleStore{namespace='analytics-flink/analytics-flink-v1/3069/jobgraphs'}.
09/08/2021 09:02:26.192 -0400 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher Job
2db4ee6397151a1109d1ca05188a4cbb reached terminal state SUSPENDED.
09/08/2021 09:02:26.191 -0400 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver Closing
ZooKeeperLeaderElectionDriver{leaderPath='/leader/2db4ee6397151a1109d1ca05188a4cbb/job_manager_lock'}
09/08/2021 09:02:26.191 -0400 INFO
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService Stopping
DefaultLeaderElectionService.
09/08/2021 09:02:26.190 -0400 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
Trying to recover job with job id 2db4ee6397151a1109d1ca05188a4cbb.
09/08/2021 09:02:26.190 -0400 INFO
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore Retrieved job ids
[2db4ee6397151a1109d1ca05188a4cbb] from
ZooKeeperStateHandleStore{namespace='analytics-flink/analytics-flink-v1/3069/jobgraphs'}
09/08/2021 09:02:26.189 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Registering
TaskManager with ResourceID 10.105.217.197:50004-6176ad
(akka.ssl.tcp://0a11cfda1c06f5500298d05e94d88c13:50004/user/rpc/taskmanager_0)
at ResourceManager
09/08/2021 09:02:26.182 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Stopping SlotPool.
09/08/2021 09:02:26.182 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Close ResourceManager connection fb0bf924810087592ad931aecb0387b1: Stopping
JobMaster for job analytics-flink-v1(2db4ee6397151a1109d1ca05188a4cbb)..
09/08/2021 09:02:26.182 -0400 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
Recover all persisted job graphs.
09/08/2021 09:02:26.182 -0400 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess Start
SessionDispatcherLeaderProcess.
09/08/2021 09:02:26.181 -0400 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl Suspending SlotPool.
09/08/2021 09:02:26.178 -0400 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter Shutting down.
09/08/2021 09:02:26.164 -0400 WARN org.apache.flink.metrics.MetricGroup Name
collision: Group already contains a Metric with the name 'taskSlotsTotal'.
Metric will not be reported.[jobmanager, 10.105.221.188]
09/08/2021 09:02:26.164 -0400 WARN org.apache.flink.metrics.MetricGroup Name
collision: Group already contains a Metric with the name 'taskSlotsAvailable'.
Metric will not be reported.[jobmanager, 10.105.221.188]
09/08/2021 09:02:26.163 -0400 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl Starting
the SlotManager.
09/08/2021 09:02:26.163 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
ResourceManager
akka.ssl.tcp://0aaaf7660b7f4414af376af4f91f9500:50001/user/rpc/resourcemanager_0
was granted leadership with fencing token b0f998b265508bb4cd715749318a4ce4
09/08/2021 09:02:26.161 -0400 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper was reconnected. Leader election can be restarted.
09/08/2021 09:02:26.161 -0400 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
09/08/2021 09:02:26.160 -0400 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
09/08/2021 09:02:26.160 -0400 INFO
org.apache.flink.runtime.jobmanager.ZooKeeperJobGraphStoreWatcher ZooKeeper
connection RECONNECTED. Changes to the submitted job graphs are monitored again.
09/08/2021 09:02:26.160 -0400 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper was reconnected. Leader election can be restarted.
09/08/2021 09:02:26.160 -0400 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper was reconnected. Leader election can be restarted.
09/08/2021 09:02:26.159 -0400 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper was reconnected. Leader election can be restarted.
09/08/2021 09:02:26.159 -0400 INFO
org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager
State change: RECONNECTED
09/08/2021 09:02:26.159 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Session
establishment complete on server zkeeper-2/10.105.219.52:2181, sessionid =
0x30000016c7d978a, negotiated timeout = 10000
09/08/2021 09:02:26.158 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket
connection established to zkeeper-2/10.105.219.52:2181, initiating session
09/08/2021 09:02:26.157 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening
socket connection to server zkeeper-2/10.105.219.52:2181
09/08/2021 09:02:25.274 -0400 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
09/08/2021 09:02:25.273 -0400 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
09/08/2021 09:02:25.273 -0400 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper was reconnected. Leader election can be restarted.
09/08/2021 09:02:25.273 -0400 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper was reconnected. Leader election can be restarted.
09/08/2021 09:02:25.273 -0400 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper was reconnected. Leader election can be restarted.
09/08/2021 09:02:25.273 -0400 INFO
org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager
State change: RECONNECTED
09/08/2021 09:02:25.273 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Session
establishment complete on server zkeeper-1/10.105.253.30:2181, sessionid =
0x30000016c7d9789, negotiated timeout = 10000
09/08/2021 09:02:25.271 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket
connection established to zkeeper-1/10.105.253.30:2181, initiating session
09/08/2021 09:02:25.271 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening
socket connection to server zkeeper-1/10.105.253.30:2181
09/08/2021 09:02:25.120 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket error
occurred: zkeeper-3/10.105.233.83:2181: Connection refused
09/08/2021 09:02:25.119 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening
socket connection to server zkeeper-3/10.105.233.83:2181
09/08/2021 09:02:25.027 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket error
occurred: zkeeper-3/10.105.233.83:2181: Connection refused
09/08/2021 09:02:25.026 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening
socket connection to server zkeeper-3/10.105.233.83:2181
09/08/2021 09:02:23.738 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Unable to
read additional data from server sessionid 0x30000016c7d978a, likely server has
closed socket, closing socket connection and attempting reconnect
09/08/2021 09:02:23.737 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket
connection established to zkeeper-1/10.105.253.30:2181, initiating session
09/08/2021 09:02:23.737 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening
socket connection to server zkeeper-1/10.105.253.30:2181
09/08/2021 09:02:23.646 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Unable to
read additional data from server sessionid 0x30000016c7d9789, likely server has
closed socket, closing socket connection and attempting reconnect
09/08/2021 09:02:23.645 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket
connection established to zkeeper-2/10.105.219.52:2181, initiating session
09/08/2021 09:02:23.644 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening
socket connection to server zkeeper-2/10.105.219.52:2181
09/08/2021 09:02:23.507 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Unable to
read additional data from server sessionid 0x30000016c7d9789, likely server has
closed socket, closing socket connection and attempting reconnect
09/08/2021 09:02:23.506 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket
connection established to zkeeper-1/10.105.253.30:2181, initiating session
09/08/2021 09:02:23.505 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening
socket connection to server zkeeper-1/10.105.253.30:2181
09/08/2021 09:02:22.879 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Unable to
read additional data from server sessionid 0x30000016c7d978a, likely server has
closed socket, closing socket connection and attempting reconnect
09/08/2021 09:02:22.878 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Socket
connection established to zkeeper-2/10.105.219.52:2181, initiating session
09/08/2021 09:02:22.877 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Opening
socket connection to server zkeeper-2/10.105.219.52:2181
09/08/2021 09:02:22.742 -0400 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore Suspending
09/08/2021 09:02:22.725 -0400 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Closing
ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/resource_manager_lock'}.
09/08/2021 09:02:22.725 -0400 INFO
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService Stopping
DefaultLeaderRetrievalService.
09/08/2021 09:02:22.724 -0400 INFO org.apache.flink.runtime.jobmaster.JobMaster
Stopping the JobMaster for job
analytics-flink-v1(2db4ee6397151a1109d1ca05188a4cbb).
09/08/2021 09:02:22.679 -0400 WARN
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper suspended. The contender https://10.105.245.207:8081 no
longer participates in the leader election.
09/08/2021 09:02:22.679 -0400 WARN
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver
Connection to ZooKeeper suspended. Can no longer retrieve the leader from
ZooKeeper.
09/08/2021 09:02:22.679 -0400 WARN
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver
Connection to ZooKeeper suspended. Can no longer retrieve the leader from
ZooKeeper.
09/08/2021 09:02:22.679 -0400 WARN
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver
Connection to ZooKeeper suspended. Can no longer retrieve the leader from
ZooKeeper.
09/08/2021 09:02:22.679 -0400 WARN
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper suspended. The contender LeaderContender:
JobManagerRunnerImpl no longer participates in the leader election.
09/08/2021 09:02:22.679 -0400 WARN
org.apache.flink.runtime.jobmanager.ZooKeeperJobGraphStoreWatcher ZooKeeper
connection SUSPENDING. Changes to the submitted job graphs are not monitored
(temporarily).
09/08/2021 09:02:22.677 -0400 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher Stopping all currently
running jobs of dispatcher
akka.ssl.tcp://1dd8e04affb77f1da7ab5f7c9202b570:50001/user/rpc/dispatcher_1.
09/08/2021 09:02:22.677 -0400 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher Stopping dispatcher
akka.ssl.tcp://1dd8e04affb77f1da7ab5f7c9202b570:50001/user/rpc/dispatcher_1.
09/08/2021 09:02:22.677 -0400 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
Stopping SessionDispatcherLeaderProcess.
09/08/2021 09:02:22.677 -0400 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl Suspending
the SlotManager.
09/08/2021 09:02:22.676 -0400 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver Closing
ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/2db4ee6397151a1109d1ca05188a4cbb/job_manager_lock'}.
09/08/2021 09:02:22.676 -0400 INFO
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService Stopping
DefaultLeaderRetrievalService.
09/08/2021 09:02:22.676 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
ResourceManager
akka.ssl.tcp://1dd8e04affb77f1da7ab5f7c9202b570:50001/user/rpc/resourcemanager_0
was revoked leadership. Clearing fencing token.
09/08/2021 09:02:22.676 -0400 WARN
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper suspended. The contender LeaderContender:
DefaultDispatcherRunner no longer participates in the leader election.
09/08/2021 09:02:22.676 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Closing
TaskExecutor connection 10.105.217.197:50004-6176ad because: ResourceManager
leader changed to new address null
09/08/2021 09:02:22.675 -0400 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager Closing
TaskExecutor connection 10.105.236.109:50004-1e912c because: ResourceManager
leader changed to new address null
09/08/2021 09:02:22.675 -0400 WARN
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper suspended. The contender LeaderContender:
StandaloneResourceManager no longer participates in the leader election.
09/08/2021 09:02:22.675 -0400 INFO
org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager
State change: SUSPENDED
09/08/2021 09:02:22.673 -0400 WARN
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver
Connection to ZooKeeper suspended. Can no longer retrieve the leader from
ZooKeeper.
09/08/2021 09:02:22.673 -0400 WARN
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver
Connection to ZooKeeper suspended. Can no longer retrieve the leader from
ZooKeeper.
09/08/2021 09:02:22.673 -0400 WARN
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper suspended. The contender LeaderContender:
StandaloneResourceManager no longer participates in the leader election.
09/08/2021 09:02:22.673 -0400 WARN
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper suspended. The contender https://10.105.221.188:8081 no
longer participates in the leader election.
09/08/2021 09:02:22.673 -0400 WARN
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver
Connection to ZooKeeper suspended. The contender LeaderContender:
DefaultDispatcherRunner no longer participates in the leader election.
09/08/2021 09:02:22.673 -0400 INFO
org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager
State change: SUSPENDED
09/08/2021 09:02:22.574 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Unable to
read additional data from server sessionid 0x30000016c7d978a, likely server has
closed socket, closing socket connection and attempting reconnect
09/08/2021 09:02:22.573 -0400 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn Unable to
read additional data from server sessionid 0x30000016c7d9789, likely server has
closed socket, closing socket connection and attempting reconnect
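The log above is listed newest-first, so the ZooKeeper connection loss at 09:02:22 sits at the bottom and the slot requests at 09:02:31 at the top. The following sketch puts the records in chronological order. It assumes that every record starts with a timestamp in the format seen above and that wrapped lines and stack frames belong to the record before them. The names split_records and oldest_first are hypothetical helpers.

```python
import re

# A record starts with a timestamp such as "09/08/2021 09:02:26.240 -0400 ".
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4} ")


def split_records(lines):
    """Group lines into records; continuation lines (wrapped text,
    stack frames) are attached to the record that precedes them."""
    records = []
    for line in lines:
        if RECORD_START.match(line) or not records:
            records.append([line])
        else:
            records[-1].append(line)
    return records


def oldest_first(lines):
    """Reverse the record order of a newest-first log while keeping
    the lines inside each record in their original order."""
    ordered = []
    for record in reversed(split_records(lines)):
        ordered.extend(record)
    return ordered


example = [
    "09/08/2021 09:02:26.241 -0400 INFO",
    "org.apache.flink.runtime.dispatcher.StandaloneDispatcher Could not archive",
    "09/08/2021 09:02:26.240 -0400 ERROR",
    "org.apache.flink.runtime.history.FsJobArchivist Failed to archive job.",
]
for line in oldest_first(example):
    print(line)
```

Reversing whole records rather than sorting by timestamp keeps entries that share the same millisecond in their relative order and avoids parsing the dates at all.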