[ 
https://issues.apache.org/jira/browse/YARN-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmytro Kabakchei updated YARN-4698:
-----------------------------------
    Description: 
We noticed that on our cluster there are negative values in RM UI counters:
- Containers Running: -19
- Memory Used: -38GB
- Vcores Used: -19

After checking the RM logs, we found that the following events had occurred:
- Assigned container: 67019 times
- Released container: 67019 times
- Invalid container released: 19 times

Some related log records can be found in the "Example.log-cut" attachment.

After some investigation, we concluded that there is some kind of race
condition affecting a container that was scheduled to be killed but
completed successfully before the kill.
There is also a patch that possibly mitigates the effects of the issue but
does not solve the original problem (see mitigating2.5.1.diff).
Unfortunately, the cluster and all other logs are lost, because the report was
made about a year ago but was not submitted properly. We also do not know
whether the issue exists in other versions.

  was:
We noticed that on our cluster there are negative values in RM UI counters:
-Containers Running: -19
-Memory Used: -38GB
-Vcores Used: -19

After checking the RM logs, we found that the following events had occurred:
- Assigned container: 67019 times
- Released container: 67019 times
- Invalid container released: 19 times

Some related log records can be found in the "Example.log-cut" attachment.

After some investigation, we concluded that there is some kind of race
condition affecting a container that was scheduled to be killed but
completed successfully before the kill.
There is also a patch that possibly mitigates the effects of the issue but
does not solve the original problem (see mitigating2.5.1.diff).
Unfortunately, the cluster and all other logs are lost, because the report was
made about a year ago but was not submitted properly. We also do not know
whether the issue exists in other versions.


> Negative value in RM UI counters due to double container release
> ----------------------------------------------------------------
>
>                 Key: YARN-4698
>                 URL: https://issues.apache.org/jira/browse/YARN-4698
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler, resourcemanager
>    Affects Versions: 2.5.1
>            Reporter: Dmytro Kabakchei
>            Priority: Minor
>         Attachments: Example.log-cut, mitigating2.5.1.diff
>
>
> We noticed that on our cluster there are negative values in RM UI counters:
> - Containers Running: -19
> - Memory Used: -38GB
> - Vcores Used: -19
> After checking the RM logs, we found that the following events had occurred:
> - Assigned container: 67019 times
> - Released container: 67019 times
> - Invalid container released: 19 times
> Some related log records can be found in the "Example.log-cut" attachment.
> After some investigation, we concluded that there is some kind of race
> condition affecting a container that was scheduled to be killed but
> completed successfully before the kill.
> There is also a patch that possibly mitigates the effects of the issue but
> does not solve the original problem (see mitigating2.5.1.diff).
> Unfortunately, the cluster and all other logs are lost, because the report
> was made about a year ago but was not submitted properly. We also do not
> know whether the issue exists in other versions.
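
The numbers in the report are consistent with each duplicate release
decrementing the counters a second time. The toy model below is only a sketch
of that accounting and is not taken from the YARN code base or from
mitigating2.5.1.diff. The class and function names are hypothetical, and the
per-container size of 2 GB and 1 vcore is an assumption inferred from
-38GB / -19 containers. It replays 67019 assignments, 67019 releases, and 19
repeated releases, once with an unguarded release and once with a release
that ignores containers that are no longer tracked.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCounters:
    """Hypothetical stand-in for the RM UI counters (not a YARN class)."""
    containers_running: int = 0
    memory_used_gb: int = 0
    vcores_used: int = 0
    live: set = field(default_factory=set)  # ids of containers still tracked

    def assign(self, cid, memory_gb=2, vcores=1):
        # Assumed container size: 2 GB and 1 vcore (inferred, not confirmed).
        self.live.add(cid)
        self.containers_running += 1
        self.memory_used_gb += memory_gb
        self.vcores_used += vcores

    def release_unguarded(self, cid, memory_gb=2, vcores=1):
        # Decrements on every call, even if the container was already released.
        self.live.discard(cid)
        self.containers_running -= 1
        self.memory_used_gb -= memory_gb
        self.vcores_used -= vcores

    def release_guarded(self, cid, memory_gb=2, vcores=1):
        # Decrements only the first time a tracked container is released.
        if cid not in self.live:
            return False
        self.live.remove(cid)
        self.containers_running -= 1
        self.memory_used_gb -= memory_gb
        self.vcores_used -= vcores
        return True


def replay(guarded, assigned=67019, duplicates=19):
    """Replay the event counts from the report against the toy counters."""
    counters = ToyCounters()
    for cid in range(assigned):
        counters.assign(cid)
    release = counters.release_guarded if guarded else counters.release_unguarded
    for cid in range(assigned):
        release(cid)
    # Model the suspected race: the same container is released again,
    # e.g. once on completion and once on kill.
    for cid in range(duplicates):
        release(cid)
    return (counters.containers_running,
            counters.memory_used_gb,
            counters.vcores_used)


if __name__ == "__main__":
    print("unguarded:", replay(guarded=False))
    print("guarded:  ", replay(guarded=True))
```

Under these assumptions the unguarded replay ends at -19 containers, -38 GB,
and -19 vcores, matching the reported UI values, while the guarded replay
ends at zero. A guard of this kind would only hide the symptom; it would not
explain why the same container is released twice.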



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
