[ 
https://issues.apache.org/jira/browse/YARN-11396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kang Lee updated YARN-11396:
----------------------------
    Description: 
When running a Flink job on YARN 2.10.1 with the capacity scheduler, the user's 
used resource is reported incorrectly after the job manager fails and a new 
attempt is started. 
Steps to reproduce:
  1. Create a capacity_test queue. The queue resources are as follows:
{code:java}
Queue State:    RUNNING
Used Capacity:    <memory:4096, vCores:4> (84.7%)
Configured Capacity:    <memory:0, vCores:0>
Configured Max Capacity:    unlimited
Effective Capacity:    <memory:20479, vCores:4> (4.0%)
Effective Max Capacity:    <memory:512000, vCores:118> (100.0%)
Absolute Used Capacity:    3.4%
Absolute Configured Capacity:    4.0%
Absolute Configured Max Capacity:    100.0%
Used Resources:    <memory:4096, vCores:4>
{code}
  2. Submit a Flink job to YARN with parallelism 10 and a container resource 
of 1 vCore and 1024 MB.
{code:java}
flink run -m yarn-cluster -yjm 1024 -ytm 1024 -parallelism 10 -yqu capacity_test /cloud/service/flink/examples/streaming/WindowJoin.ja
{code}
 Because the user's max resource in this queue is 4c, 10g, the job can run only 
5 containers. At this point, the user's used resource is as follows:
||User Name||Max Resource||Weight||Used Resource||Max AM Resource||Used AM Resource||Schedulable Apps||Non-Schedulable Apps||
|hadoop|*<memory:20480, vCores:4>*|1.0|<memory:5120, vCores:5>|<memory:10240, vCores:2>|<memory:2048, vCores:2>|2|

  3. Kill the job manager process with kill -9. YARN removes this application 
attempt, and the user is removed from UsersManager as well.

   In LeafQueue#removeApplicationAttempt, when the user's total application 
count reaches 0, the user is removed from usersManager.
{code:java}
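private void removeApplicationAttempt(
    FiCaSchedulerApp application, String userName) {
  try {
    writeLock.lock();
    //...
    user.finishApplication(wasActive);
    // Once the user has no applications left in this queue, its User object
    // (and the used-resource accounting it holds) is dropped.
    if (user.getTotalApplications() == 0) {
      usersManager.removeUser(application.getUser());
    }
    //...
  } finally {
    writeLock.unlock();
  }
}
{code}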
  4. A new job manager attempt is started, so the User object for hadoop is 
recreated with its used resource initialized to 0. Meanwhile, the Flink job 
sets 
ApplicationSubmissionContextProto#keep_containers_across_application_attempts 
to true, so the old containers keep running, but their resources are not 
counted in the recreated User. As a result, the user's used resource is 
incorrect, and the real used resource can exceed the max resource, as shown 
here:

!image-2022-12-14-14-37-09-463.png|width=1192,height=532!
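
The accounting gap described in steps 3 and 4 can be sketched with a small, 
self-contained model. This is not Hadoop code: the class UserAccountingSketch, 
its methods, and the assumption that the 5 containers are 1 job manager plus 4 
task managers of 1024 MB each are all hypothetical and only illustrate how 
dropping and recreating the per-user record loses the usage of containers that 
are kept across attempts.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified, hypothetical model of per-user accounting in a leaf queue. */
class UserAccountingSketch {

  /** Stand-in for the scheduler's per-user record (not the real Hadoop class). */
  static final class User {
    int totalApplications;
    long usedMemoryMb;
  }

  private final Map<String, User> users = new HashMap<>();

  private User getUserAndAddIfAbsent(String name) {
    return users.computeIfAbsent(name, n -> new User());
  }

  void addApplicationAttempt(String userName) {
    getUserAndAddIfAbsent(userName).totalApplications++;
  }

  void allocateContainer(String userName, long memoryMb) {
    getUserAndAddIfAbsent(userName).usedMemoryMb += memoryMb;
  }

  void releaseContainer(String userName, long memoryMb) {
    getUserAndAddIfAbsent(userName).usedMemoryMb -= memoryMb;
  }

  /** Mirrors the quoted removal logic: drop the user once it has no applications. */
  void removeApplicationAttempt(String userName) {
    User user = getUserAndAddIfAbsent(userName);
    user.totalApplications--;
    if (user.totalApplications == 0) {
      // The usage of containers that are still running is discarded here.
      users.remove(userName);
    }
  }

  long trackedMemoryMb(String userName) {
    User user = users.get(userName);
    return user == null ? 0 : user.usedMemoryMb;
  }

  /** Returns {tracked before failure, tracked after retry, actually running}, in MB. */
  static long[] simulate() {
    UserAccountingSketch queue = new UserAccountingSketch();
    queue.addApplicationAttempt("hadoop");
    for (int i = 0; i < 5; i++) {
      queue.allocateContainer("hadoop", 1024); // assumed: 1 job manager + 4 task managers
    }
    long before = queue.trackedMemoryMb("hadoop");

    // Job manager is killed: its container is released, the attempt is removed,
    // and the 4 task manager containers are kept across attempts.
    queue.releaseContainer("hadoop", 1024);
    long keptRunning = queue.trackedMemoryMb("hadoop");
    queue.removeApplicationAttempt("hadoop");

    // New attempt: the user record is recreated from zero, then the new
    // job manager container is allocated.
    queue.addApplicationAttempt("hadoop");
    queue.allocateContainer("hadoop", 1024);
    long after = queue.trackedMemoryMb("hadoop");

    return new long[] {before, after, keptRunning + 1024};
  }

  public static void main(String[] args) {
    long[] r = simulate();
    System.out.println("tracked before failure: " + r[0] + " MB");
    System.out.println("tracked after retry:    " + r[1] + " MB");
    System.out.println("actually running:       " + r[2] + " MB");
  }
}
```

In this model the tracked usage after the retry covers only the new job manager 
container, while the kept containers still occupy the queue, so a limit check 
against the tracked value would allow the user to exceed its max resource.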

 

 

 


> Used resource of user may be incorrect  when flink's job manager retry
> ----------------------------------------------------------------------
>
>                 Key: YARN-11396
>                 URL: https://issues.apache.org/jira/browse/YARN-11396
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.10.1
>            Reporter: Kang Lee
>            Priority: Minor
>         Attachments: image-2022-12-14-14-37-09-463.png
>
>


