@Till Rohrmann, Thanks for your clear explanation
------------------ Original Message ------------------
From: "Till Rohrmann" <[email protected]>;
Date: Tuesday, August 24, 2021, 8:51 PM
To: "yanjie" <[email protected]>;
Cc: "user" <[email protected]>;
Subject: Re: AdaptiveScheduler stopped without exception
Hi Yanjie,
The observed exception in the logs is just a side effect of the shut down
procedure. It is a bug that shutting down the Dispatcher will result in a fatal
exception coming from the ApplicationDispatcherBootstrap. I've created a
ticket in order to fix it [1].
The true reason for stopping the SessionDispatcherLeaderProcess is that
the DefaultDispatcherRunner lost its leadership. Unfortunately, we don't
log this event at the INFO level. If you enable the DEBUG log level, you should see
it. What happens when the Dispatcher loses leadership is that the Dispatcher
component will be stopped. I will improve the logging of the
DefaultDispatcherRunner to better state when it gains and loses leadership [2].
I hope this will make the logs easier to understand.
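In the meantime, the leadership events can already be surfaced by raising the log level for the affected packages. A minimal sketch, assuming the log4j-console.properties layout that Flink's Docker images ship with (the logger names below are illustrative additions, not existing entries):

```properties
# Illustrative additions: DEBUG logging for the dispatcher runner and the
# Kubernetes leader-election code, so leadership grant/revoke events
# show up in the jobmanager log.
logger.dispatcherrunner.name = org.apache.flink.runtime.dispatcher.runner
logger.dispatcherrunner.level = DEBUG
logger.kubernetesha.name = org.apache.flink.kubernetes.highavailability
logger.kubernetesha.level = DEBUG
```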
In the second job manager log, it is effectively the same. Just with the
difference that first the ResourceManager loses its leadership. It seems as if
the cause for the leadership loss could be that 172.18.0.1:443 is no longer
reachable (probably the K8s API server).
[1] https://issues.apache.org/jira/browse/FLINK-23946
[2] https://issues.apache.org/jira/browse/FLINK-23947
Cheers,
Till
On Tue, Aug 24, 2021 at 9:56 AM yanjie <[email protected]> wrote:
Hi all,
I run an Application Cluster on Azure K8s. The job works fine for a while,
then the jobmanager catches an exception:
org.apache.flink.util.FlinkException: AdaptiveScheduler is being stopped.
    at org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.closeAsync(AdaptiveScheduler.java:415) ~[flink-dist_2.11-1.13.0.jar:1.13.0]
    at org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:962) ~[flink-dist_2.11-1.13.0.jar:1.13.0]
    at org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:926) ~[flink-dist_2.11-1.13.0.jar:1.13.0]
    at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:398) ~[flink-dist_2.11-1.13.0.jar:1.13.0]
    ... (remainder omitted)
There is no other exception before this one. The jobmanager then executes its
stopping steps and shuts down. Because no other exception precedes it, I don't
know why 'AdaptiveScheduler is being stopped'.
My questions:
1. What causes this issue (flink-jobmanager-1593852-jgwjt.log)?
2. Is this exception caused by a network issue (as encountered in
flink-jobmanager-1593852-kr22z.log)?
3. Why doesn't the first jobmanager (flink-jobmanager-1593852-jgwjt) throw any
exception beforehand?
Logs:
The attached log files contain the jobmanager's and taskmanager's logs. I
configure K8s HA with the jobmanager's parallelism=1 (the issue recurs whether
the jobmanager's parallelism is set to 1 or 2).
flink-jobmanager-1593852-jgwjt.log:
Works fine until '2021-08-23 05:08:25'.
flink-jobmanager-1593852-kr22z.log:
Starts at '2021-08-23 05:08:35' and restores my job; it works fine for a while.
Then at '2021-08-23 14:24:15' the jobmanager appears to hit a network issue
(maybe an Azure K8s network issue that prevents Flink from operating on the
ConfigMap, so it loses leadership after the K8s HA lease duration). At
'2021-08-23 14:24:32' this jobmanager catches the 'AdaptiveScheduler is being
stopped' exception again, and then shuts down.
flink--taskexecutor-0-flink-taskmanager-1593852-56dfcd95bc-hvnps.log:
Contains the taskmanager's logs from the beginning to '2021-08-23 09:15:24',
covering the first jobmanager's (flink-jobmanager-1593852-jgwjt) life cycle.
Background:
Deployment & Configuration
I follow this page:
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/#deploy-application-cluster
to deploy an Application Cluster that runs my job, and I add configurations for
high availability on Kubernetes and enable reactive scheduler mode.
The attached YAML files contain the 'flink-config', 'flink-jobmanager', and
'flink-taskmanager' configurations.
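For context, a minimal sketch of the flink-conf.yaml entries such a setup typically combines (the cluster id and storage path below are placeholders, not values taken from the attached files):

```yaml
# Kubernetes HA services (Flink 1.13 factory class) -- placeholder values
kubernetes.cluster-id: flink-1593852
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: <durable-storage-uri>/flink/ha   # e.g. a blob-store path
# Reactive mode runs on top of the adaptive scheduler
scheduler-mode: reactive
```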
Other experiences
In a previous test, when deploying my Flink job on the Azure K8s cluster, I
encountered a network issue once. That issue prevents the leader jobmanager
from renewing the ConfigMap for a while, so the standby jobmanager is elected
as leader; when the previous leader's network recovers, it learns it is no
longer the leader and shuts down. Because of K8s's default configuration
'backoffLimit=6', my Flink job is eventually removed.
I am fixing this issue by increasing the K8s HA timeout configurations, as the
official documentation describes:
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/#advanced-high-availability-kubernetes-options
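The options on that page that govern how quickly leadership is lost are the lease and renewal timeouts (defaults are 15 s / 15 s / 5 s). The larger values below are illustrative, not the ones from my deployment; they should satisfy lease-duration >= renew-deadline > retry-period:

```yaml
# Illustrative values: tolerate longer K8s API-server hiccups before
# the leader gives up its lease.
high-availability.kubernetes.leader-election.lease-duration: 60 s
high-availability.kubernetes.leader-election.renew-deadline: 60 s
high-availability.kubernetes.leader-election.retry-period: 10 s
```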
My analysis:
Both jobmanagers' log files contain the same exception: 'AdaptiveScheduler is
being stopped'. The first jobmanager doesn't print any exception before it,
while the second jobmanager does print a network exception, which suggests the
stop was caused by a network issue.
I did encounter a network issue in the previous test, and the fix for that is
still in progress; maybe this exception is also caused by a network issue.
The reason I raised this question is that the first jobmanager doesn't print
any information beforehand, and I wonder why that happens.