[
https://issues.apache.org/jira/browse/YARN-11396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18032450#comment-18032450
]
zeekling commented on YARN-11396:
---------------------------------
In the kill-AM scenario, the value of user.getTotalApplications() is not 0.
Therefore, either this modification is incorrect, or the root cause has not
been located and fixed. I suspect this is a JobManager bug instead.
> Used resource of user may be incorrect when flink's job manager retry
> ----------------------------------------------------------------------
>
> Key: YARN-11396
> URL: https://issues.apache.org/jira/browse/YARN-11396
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.3, 2.10.1, 3.2.4, 3.3.4
> Reporter: Li Kang
> Priority: Minor
> Attachments: YARN-11396.001.patch, image-2022-12-14-14-37-09-463.png
>
>
> When a Flink job runs on YARN 2.10.1 with the capacity scheduler, the used
> resource of the user is incorrect after the job manager fails and a new
> attempt is started.
> To reproduce this issue:
> 1. Create a capacity_test queue. The queue resources are as follows:
> {code:java}
> Queue State: RUNNING
> Used Capacity: <memory:4096, vCores:4> (84.7%)
> Configured Capacity: <memory:0, vCores:0>
> Configured Max Capacity: unlimited
> Effective Capacity: <memory:20479, vCores:4> (4.0%)
> Effective Max Capacity: <memory:512000, vCores:118> (100.0%)
> Absolute Used Capacity: 3.4%
> Absolute Configured Capacity: 4.0%
> Absolute Configured Max Capacity: 100.0%
> Used Resources: <memory:4096, vCores:4>
> {code}
> 2. Submit a Flink job to YARN with a parallelism of 10 and a container
> resource of 1c, 1024m.
> {code:java}
> flink run -m yarn-cluster -yjm 1024 -ytm 1024 -parallelism 10 -yqu
> capacity_test /cloud/service/flink/examples/streaming/WindowJoin.ja {code}
> Because the user's max resource in this queue is 4c, 10g, this job can only
> run 5 containers. At this moment, the used resource of this user is as follows:
> ||User Name||Max Resource||Weight||Used Resource||Max AM Resource||Used AM Resource||Schedulable Apps||Non-Schedulable Apps||
> |hadoop|*<memory:20480, vCores:4>*|1.0|<memory:5120, vCores:5>|<memory:10240, vCores:2>|<memory:2048, vCores:2>|2|
> 3. Run kill -9 on the job manager process. YARN then removes this
> application attempt, and the user is removed from UsersManager as well.
> In LeafQueue#removeApplicationAttempt, when the user's total number of
> applications reaches 0, the user is removed from usersManager.
> {code:java}
> private void removeApplicationAttempt(
>     FiCaSchedulerApp application, String userName) {
>   try {
>     writeLock.lock();
>     //...
>     user.finishApplication(wasActive);
>     // Once the user has no applications left, its User object is dropped.
>     if (user.getTotalApplications() == 0) {
>       usersManager.removeUser(application.getUser());
>     }
>     //...
>   } finally {
>     writeLock.unlock();
>   }
> }
> {code}
> 4. A new job manager attempt is then started, so the User object for hadoop
> is recreated and its used resource is initialized to 0. At the same time,
> because the Flink job sets
> ApplicationSubmissionContextProto#keep_containers_across_application_attempts
> to true, the old containers keep running, and their resources are not
> counted in the recreated user. So the used resource of the user is incorrect,
> and the real used resource exceeds the max resource, like this:
> !image-2022-12-14-14-37-09-463.png|width=1192,height=532!
>
>
>
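The two positions in this thread can be compared with a small, self-contained toy model. It is only a sketch: the class UserTrackingSketch, its User record, and the method usedAfterAmRetry are hypothetical names invented for illustration and are not the Hadoop classes. It assumes that a user's usage lives on the per-user record and that the record is dropped when the application count reaches 0, as in the quoted snippet. Under that assumption, the usage is lost only when the failed attempt was the user's last application in the queue.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-user accounting in a queue. NOT the real Hadoop classes. */
public class UserTrackingSketch {

    /** Hypothetical stand-in for the scheduler's per-user record. */
    static final class User {
        int totalApplications;
        int usedMemoryMb;
    }

    /** Hypothetical stand-in for the users manager: user name -> record. */
    private final Map<String, User> users = new HashMap<>();

    private User getUserAndAddIfAbsent(String name) {
        return users.computeIfAbsent(name, n -> new User());
    }

    void addApplicationAttempt(String userName) {
        getUserAndAddIfAbsent(userName).totalApplications++;
    }

    void allocateContainer(String userName, int memoryMb) {
        getUserAndAddIfAbsent(userName).usedMemoryMb += memoryMb;
    }

    /** Mirrors the quoted check: drop the user once it has no applications left. */
    void removeApplicationAttempt(String userName) {
        User user = getUserAndAddIfAbsent(userName);
        user.totalApplications--;
        if (user.totalApplications == 0) {
            users.remove(userName); // usage tracked on this record is discarded
        }
    }

    int usedMemoryMb(String userName) {
        User user = users.get(userName);
        return user == null ? 0 : user.usedMemoryMb;
    }

    /**
     * Simulates an AM failure followed by a retry while 5 containers of
     * 1024 MB keep running. otherApps is the number of other applications
     * the same user still has in the queue. Returns the used memory the
     * model reports after the retry.
     */
    static int usedAfterAmRetry(int otherApps) {
        UserTrackingSketch queue = new UserTrackingSketch();
        for (int i = 0; i < otherApps; i++) {
            queue.addApplicationAttempt("hadoop");
        }
        queue.addApplicationAttempt("hadoop");     // first attempt
        for (int i = 0; i < 5; i++) {
            queue.allocateContainer("hadoop", 1024);
        }
        queue.removeApplicationAttempt("hadoop");  // AM killed, containers kept
        queue.addApplicationAttempt("hadoop");     // retry attempt
        return queue.usedMemoryMb("hadoop");
    }

    public static void main(String[] args) {
        System.out.println("no other apps: " + usedAfterAmRetry(0));
        System.out.println("one other app: " + usedAfterAmRetry(1));
    }
}
```

In this model, usedAfterAmRetry(0) reports 0 MB even though 5120 MB of containers are still running, which matches the reported symptom, while usedAfterAmRetry(1) keeps the 5120 MB because the count never reaches 0. Whether the real scheduler reaches the zero-count branch in the kill-AM scenario is exactly the point under dispute in the comment above, and the sketch does not settle it.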
--
This message was sent by Atlassian Jira
(v8.20.10#820010)