[jira] [Updated] (YARN-7131) FSDownload.unpack should determine the type of resource by reading the header bytes

2017-08-30 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-7131:
-
Description: 
Currently, there are naive string checks to determine if a resource is of a 
particular type (jar, zip, tar.gz) 

There can be cases where this does not work - e.g., the user decides to split 
up a large zip resource as file1.zip.001, file1.zip.002.

Instead, FSDownload.unpack should read the file header bytes to determine the 
file type.

  was:
Currently, there are naive string checks to determine if a resource of a 
particular type (jar, zip, tar.gz) 

There can be cases where this does not work - e.g., the user decides to split 
up a large zip resource as file1.zip.001, file1.zip.002.

Instead, FSDownload.unpack should read the file header bytes to determine the 
file type.


> FSDownload.unpack should determine the type of resource by reading the 
> header bytes
> 
>
> Key: YARN-7131
> URL: https://issues.apache.org/jira/browse/YARN-7131
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>
> Currently, there are naive string checks to determine if a resource is of a 
> particular type (jar, zip, tar.gz) 
> There can be cases where this does not work - e.g., the user decides to split 
> up a large zip resource as file1.zip.001, file1.zip.002.
> Instead, FSDownload.unpack should read the file header bytes to determine the 
> file type.
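
For reference, the kind of header-byte ("magic number") check the description asks for can be sketched as below. This is only an illustration, not the FSDownload patch itself; the class name ArchiveTypeSniffer and the set of recognized signatures are assumptions for the example (zip and jar share the "PK" signature, gzip starts with 0x1F 0x8B, and POSIX tar carries "ustar" at offset 257).

{code:title=ArchiveTypeSniffer.java|borderStyle=solid}
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ArchiveTypeSniffer {

  enum ArchiveType { ZIP_OR_JAR, GZIP, TAR, UNKNOWN }

  /** Decide the archive type from the leading bytes instead of the file name. */
  static ArchiveType detect(String path) throws IOException {
    byte[] header = new byte[262];               // enough to reach the tar "ustar" magic
    int total = 0;
    try (InputStream in = new FileInputStream(path)) {
      int n;
      while (total < header.length
          && (n = in.read(header, total, header.length - total)) != -1) {
        total += n;                              // handle partial reads
      }
    }
    if (total >= 2 && header[0] == 0x50 && header[1] == 0x4B) {
      return ArchiveType.ZIP_OR_JAR;             // "PK": zip and jar share this signature
    }
    if (total >= 2 && (header[0] & 0xFF) == 0x1F && (header[1] & 0xFF) == 0x8B) {
      return ArchiveType.GZIP;                   // gzip magic, covers tar.gz
    }
    if (total >= 262 && Arrays.equals(Arrays.copyOfRange(header, 257, 262),
        "ustar".getBytes(StandardCharsets.US_ASCII))) {
      return ArchiveType.TAR;                    // POSIX tar magic at offset 257
    }
    return ArchiveType.UNKNOWN;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(detect(args[0]));
  }
}
{code}

With content-based detection, a chunk such as file1.zip.001 that actually begins with the "PK" signature is still recognized as a zip, regardless of its suffix.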






[jira] [Updated] (YARN-7131) FSDownload.unpack should determine the type of resource by reading the header bytes

2017-08-30 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-7131:
-
Description: 
Currently, there are naive string checks to determine if a resource of a 
particular type (jar, zip, tar.gz) 

There can be cases where this does not work - e.g., the user decides to split 
up a large zip resource as file1.zip.001, file1.zip.002.

Instead, FSDownload.unpack should read the file header bytes to determine the 
file type.

  was:
Currently, there are naive string checks to determine if a resource of a 
particular type (jar, zip, tar.gz) 

There can be cases where this does not work - e.g., the user decides to split 
up a large zip resource as {file1}.zip.001, {file1}.zip.002.

Instead, FSDownload.unpack should read the file header bytes to determine the 
file type.


> FSDownload.unpack should determine the type of resource by reading the 
> header bytes
> 
>
> Key: YARN-7131
> URL: https://issues.apache.org/jira/browse/YARN-7131
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>
> Currently, there are naive string checks to determine if a resource of a 
> particular type (jar, zip, tar.gz) 
> There can be cases where this does not work - e.g., the user decides to split 
> up a large zip resource as file1.zip.001, file1.zip.002.
> Instead, FSDownload.unpack should read the file header bytes to determine the 
> file type.






[jira] [Updated] (YARN-7131) FSDownload.unpack should determine the type of resource by reading the header bytes

2017-08-30 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-7131:
-
Component/s: nodemanager

> FSDownload.unpack should determine the type of resource by reading the 
> header bytes
> 
>
> Key: YARN-7131
> URL: https://issues.apache.org/jira/browse/YARN-7131
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>
> Currently, there are naive string checks to determine if a resource is of a 
> particular type (jar, zip, tar.gz) 
> There can be cases where this does not work - e.g., the user decides to split 
> up a large zip resource as file1.zip.001, file1.zip.002.
> Instead, FSDownload.unpack should read the file header bytes to determine the 
> file type.






[jira] [Created] (YARN-7131) FSDownload.unpack should determine the type of resource by reading the header bytes

2017-08-30 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-7131:


 Summary: FSDownload.unpack should determine the type of 
resource by reading the header bytes
 Key: YARN-7131
 URL: https://issues.apache.org/jira/browse/YARN-7131
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Brook Zhou
Assignee: Brook Zhou


Currently, there are naive string checks to determine if a resource of a 
particular type (jar, zip, tar.gz) 

There can be cases where this does not work - e.g., the user decides to split 
up a large zip resource as {file1}.zip.001, {file1}.zip.002.

Instead, FSDownload.unpack should read the file header bytes to determine the 
file type.






[jira] [Updated] (YARN-7098) LocalizerRunner should immediately send heartbeat response LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING

2017-08-28 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-7098:
-
Attachment: YARN-7098.patch

> LocalizerRunner should immediately send heartbeat response 
> LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING
> 
>
> Key: YARN-7098
> URL: https://issues.apache.org/jira/browse/YARN-7098
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>Priority: Minor
> Attachments: YARN-7098.patch
>
>
> Currently, the following can happen:
> 1. ContainerLocalizer heartbeats to ResourceLocalizationService.
> 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner 
> for the localizerId (containerId). Goes into {code:java}return 
> localizer.processHeartbeat(status.getResources());{code}
> 3. Container receives kill event, goes from LOCALIZING -> KILLING. The 
> LocalizerRunner is removed from LocalizerTracker, since the privLocalizers 
> lock is now free.
> 4. Since check (2) passed, LocalizerRunner sends heartbeat response with 
> LocalizerStatus.LIVE and the next file to download.
> What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) 
> happened before the heartbeat response in (4). This saves the container from 
> potentially downloading an extra resource due to the one extra LIVE heartbeat 
> which will end up being deleted anyway.






[jira] [Updated] (YARN-7098) LocalizerRunner should immediately send heartbeat response LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING

2017-08-28 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-7098:
-
Description: 
Currently, the following can happen:

1. ContainerLocalizer heartbeats to ResourceLocalizationService.
2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner 
for the localizerId (containerId). Goes into {code:java}return 
localizer.processHeartbeat(status.getResources());{code}
3. Container receives kill event, goes from LOCALIZING -> KILLING. The 
LocalizerRunner is removed from LocalizerTracker, since the privLocalizers lock 
is now free.
4. Since check (2) passed, LocalizerRunner sends heartbeat response with 
LocalizerStatus.LIVE and the next file to download.

What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) 
happened before the heartbeat response in (4). This saves the container from 
potentially downloading an extra resource due to the one extra LIVE heartbeat 
which will end up being deleted anyway.

  was:
Currently, the following can happen:

1. ContainerLocalizer heartbeats to ResourceLocalizationService.
2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner 
for the localizerId (containerId). Starts executing 
3. Container receives kill event, goes from LOCALIZING -> KILLING. The 
LocalizerRunner is removed from LocalizerTracker.
4. Since check (2) passed, LocalizerRunner sends heartbeat response with 
LocalizerStatus.LIVE and the next file to download.

What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) 
happened before the heartbeat response in (4). This saves the container from 
potentially downloading an extra resource which will end up being deleted 
anyway.


> LocalizerRunner should immediately send heartbeat response 
> LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING
> 
>
> Key: YARN-7098
> URL: https://issues.apache.org/jira/browse/YARN-7098
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>Priority: Minor
>
> Currently, the following can happen:
> 1. ContainerLocalizer heartbeats to ResourceLocalizationService.
> 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner 
> for the localizerId (containerId). Goes into {code:java}return 
> localizer.processHeartbeat(status.getResources());{code}
> 3. Container receives kill event, goes from LOCALIZING -> KILLING. The 
> LocalizerRunner is removed from LocalizerTracker, since the privLocalizers 
> lock is now free.
> 4. Since check (2) passed, LocalizerRunner sends heartbeat response with 
> LocalizerStatus.LIVE and the next file to download.
> What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) 
> happened before the heartbeat response in (4). This saves the container from 
> potentially downloading an extra resource due to the one extra LIVE heartbeat 
> which will end up being deleted anyway.






[jira] [Updated] (YARN-7098) LocalizerRunner should immediately send heartbeat response LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING

2017-08-28 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-7098:
-
Description: 
Currently, the following can happen:

1. ContainerLocalizer heartbeats to ResourceLocalizationService.
2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner 
for the localizerId (containerId). Starts executing 
3. Container receives kill event, goes from LOCALIZING -> KILLING. The 
LocalizerRunner is removed from LocalizerTracker.
4. Since check (2) passed, LocalizerRunner sends heartbeat response with 
LocalizerStatus.LIVE and the next file to download.

What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) 
happened before the heartbeat response in (4). This saves the container from 
potentially downloading an extra resource which will end up being deleted 
anyway.

  was:
Currently, the following can happen:

1. ContainerLocalizer heartbeats to ResourceLocalizationService.
2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner 
for the localizerId (containerId).
3. Container receives kill event, goes from LOCALIZING -> KILLING. The 
LocalizerRunner is not removed from LocalizerTracker due to locking.
4. Since check (2) passed, LocalizerRunner sends heartbeat response with 
LocalizerStatus.LIVE and the next file to download.

What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) 
happened before the heartbeat response in (4). This saves the container from 
potentially downloading an extra resource which will end up being deleted 
anyway.


> LocalizerRunner should immediately send heartbeat response 
> LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING
> 
>
> Key: YARN-7098
> URL: https://issues.apache.org/jira/browse/YARN-7098
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>Priority: Minor
>
> Currently, the following can happen:
> 1. ContainerLocalizer heartbeats to ResourceLocalizationService.
> 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner 
> for the localizerId (containerId). Starts executing 
> 3. Container receives kill event, goes from LOCALIZING -> KILLING. The 
> LocalizerRunner is removed from LocalizerTracker.
> 4. Since check (2) passed, LocalizerRunner sends heartbeat response with 
> LocalizerStatus.LIVE and the next file to download.
> What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) 
> happened before the heartbeat response in (4). This saves the container from 
> potentially downloading an extra resource which will end up being deleted 
> anyway.






[jira] [Updated] (YARN-7098) LocalizerRunner should immediately send heartbeat response LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING

2017-08-28 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-7098:
-
Description: 
Currently, the following can happen:

1. ContainerLocalizer heartbeats to ResourceLocalizationService.
2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner 
for the localizerId (containerId).
3. Container receives kill event, goes from LOCALIZING -> KILLING. The 
LocalizerRunner is not removed from LocalizerTracker due to locking.
4. Since check (2) passed, LocalizerRunner sends heartbeat response with 
LocalizerStatus.LIVE and the next file to download.

What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) 
happened before the heartbeat response in (4). This saves the container from 
potentially downloading an extra resource which will end up being deleted 
anyway.

  was:
Currently, the following can happen:

1. ContainerLocalizer heartbeats to ResourceLocalizationService.
2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner 
for the localizerId (containerId).
3. Container receives kill event, goes from LOCALIZING -> KILLING. The 
LocalizerRunner for the localizerId is removed from LocalizerTracker.
4. Since check (2) passed, LocalizerRunner sends heartbeat response with 
LocalizerStatus.LIVE and the next file to download.

What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) 
happened before the heartbeat response in (4). This saves the container from 
potentially downloading an extra resource which will end up being deleted 
anyway.


> LocalizerRunner should immediately send heartbeat response 
> LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING
> 
>
> Key: YARN-7098
> URL: https://issues.apache.org/jira/browse/YARN-7098
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>Priority: Minor
>
> Currently, the following can happen:
> 1. ContainerLocalizer heartbeats to ResourceLocalizationService.
> 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner 
> for the localizerId (containerId).
> 3. Container receives kill event, goes from LOCALIZING -> KILLING. The 
> LocalizerRunner is not removed from LocalizerTracker due to locking.
> 4. Since check (2) passed, LocalizerRunner sends heartbeat response with 
> LocalizerStatus.LIVE and the next file to download.
> What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) 
> happened before the heartbeat response in (4). This saves the container from 
> potentially downloading an extra resource which will end up being deleted 
> anyway.






[jira] [Created] (YARN-7098) LocalizerRunner should immediately send heartbeat response LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING

2017-08-24 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-7098:


 Summary: LocalizerRunner should immediately send heartbeat 
response LocalizerStatus.DIE when the Container transitions from LOCALIZING to 
KILLING
 Key: YARN-7098
 URL: https://issues.apache.org/jira/browse/YARN-7098
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Brook Zhou
Assignee: Brook Zhou
Priority: Minor


Currently, the following can happen:

1. ContainerLocalizer heartbeats to ResourceLocalizationService.
2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner 
for the localizerId (containerId).
3. Container receives kill event, goes from LOCALIZING -> KILLING. The 
LocalizerRunner for the localizerId is removed from LocalizerTracker.
4. Since check (2) passed, LocalizerRunner sends heartbeat response with 
LocalizerStatus.LIVE and the next file to download.

What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) 
happened before the heartbeat response in (4). This saves the container from 
potentially downloading an extra resource which will end up being deleted 
anyway.






[jira] [Commented] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise

2017-07-28 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105798#comment-16105798
 ] 

Brook Zhou commented on YARN-6870:
--

Findbugs warnings are from existing code not in this patch.

> ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a 
> float, which is imprecise
> ---
>
> Key: YARN-6870
> URL: https://issues.apache.org/jira/browse/YARN-6870
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api, nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
> Attachments: YARN-6870-v0.patch, YARN-6870-v1.patch, 
> YARN-6870-v2.patch, YARN-6870-v3.patch
>
>
> We have seen issues on our clusters where the current way of computing CPU 
> usage is having float-arithmetic inaccuracies (the bug is still there in 
> trunk)
> Simple program to illustrate:
> {code:title=Bar.java|borderStyle=solid}
>   public static void main(String[] args) throws Exception {
> float result = 0.0f;
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result += (float) 4 / (float)18;
>   } else {
> result += (float) 2 / (float)18;
>   }
> }
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result -= (float) 4 / (float)18;
>   } else {
> result -= (float) 2 / (float)18;
>   } 
> }
> System.out.println(result);
>   }
> {code}
> // Printed
> 4.4703484E-8
> 2017-04-12 05:43:24,014 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current 
> CPU Allocation: [0.891], Requested CPU Allocation: [0.]
> There are a few places with this issue:
> 1. ResourceUtilization.java - set/getCPU both use float. When 
> ContainerScheduler calls 
> ContainersMonitor.increase/decreaseResourceUtilization, this may lead to 
> issues.
> 2. AllocationBasedResourceUtilizationTracker.java  - hasResourcesAvailable 
> uses float as well for CPU computation.
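
One direction discussed for a fix is to keep the CPU bookkeeping in integer units and only convert to a float at the edges. The milli-vcore representation below is only an assumption for illustration (it is not the attached patch); it mirrors the 2/18 and 4/18 fractions from Bar.java above and shows that the same add/subtract sequence cancels exactly in integer arithmetic.

{code:title=CpuMillisSketch.java|borderStyle=solid}
public class CpuMillisSketch {
  public static void main(String[] args) {
    final int totalMillis = 18_000;   // node CPU capacity expressed as milli-vcores
    int usedMillis = 0;
    for (int i = 0; i < 7; i++) {
      usedMillis += (i == 6) ? 4_000 : 2_000;   // 4/18 and 2/18 of capacity
    }
    for (int i = 0; i < 7; i++) {
      usedMillis -= (i == 6) ? 4_000 : 2_000;
    }
    System.out.println(usedMillis);                       // 0, exactly
    System.out.println((float) usedMillis / totalMillis); // 0.0 when a ratio is needed
  }
}
{code}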






[jira] [Updated] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise

2017-07-28 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-6870:
-
Attachment: YARN-6870-v3.patch

> ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a 
> float, which is imprecise
> ---
>
> Key: YARN-6870
> URL: https://issues.apache.org/jira/browse/YARN-6870
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api, nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
> Attachments: YARN-6870-v0.patch, YARN-6870-v1.patch, 
> YARN-6870-v2.patch, YARN-6870-v3.patch
>
>
> We have seen issues on our clusters where the current way of computing CPU 
> usage is having float-arithmetic inaccuracies (the bug is still there in 
> trunk)
> Simple program to illustrate:
> {code:title=Bar.java|borderStyle=solid}
>   public static void main(String[] args) throws Exception {
> float result = 0.0f;
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result += (float) 4 / (float)18;
>   } else {
> result += (float) 2 / (float)18;
>   }
> }
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result -= (float) 4 / (float)18;
>   } else {
> result -= (float) 2 / (float)18;
>   } 
> }
> System.out.println(result);
>   }
> {code}
> // Printed
> 4.4703484E-8
> 2017-04-12 05:43:24,014 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current 
> CPU Allocation: [0.891], Requested CPU Allocation: [0.]
> There are a few places with this issue:
> 1. ResourceUtilization.java - set/getCPU both use float. When 
> ContainerScheduler calls 
> ContainersMonitor.increase/decreaseResourceUtilization, this may lead to 
> issues.
> 2. AllocationBasedResourceUtilizationTracker.java  - hasResourcesAvailable 
> uses float as well for CPU computation.






[jira] [Updated] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise

2017-07-28 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-6870:
-
Attachment: YARN-6870-v2.patch

Thanks, made the change.

> ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a 
> float, which is imprecise
> ---
>
> Key: YARN-6870
> URL: https://issues.apache.org/jira/browse/YARN-6870
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api, nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
> Attachments: YARN-6870-v0.patch, YARN-6870-v1.patch, 
> YARN-6870-v2.patch
>
>
> We have seen issues on our clusters where the current way of computing CPU 
> usage is having float-arithmetic inaccuracies (the bug is still there in 
> trunk)
> Simple program to illustrate:
> {code:title=Bar.java|borderStyle=solid}
>   public static void main(String[] args) throws Exception {
> float result = 0.0f;
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result += (float) 4 / (float)18;
>   } else {
> result += (float) 2 / (float)18;
>   }
> }
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result -= (float) 4 / (float)18;
>   } else {
> result -= (float) 2 / (float)18;
>   } 
> }
> System.out.println(result);
>   }
> {code}
> // Printed
> 4.4703484E-8
> 2017-04-12 05:43:24,014 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current 
> CPU Allocation: [0.891], Requested CPU Allocation: [0.]
> There are a few places with this issue:
> 1. ResourceUtilization.java - set/getCPU both use float. When 
> ContainerScheduler calls 
> ContainersMonitor.increase/decreaseResourceUtilization, this may lead to 
> issues.
> 2. AllocationBasedResourceUtilizationTracker.java  - hasResourcesAvailable 
> uses float as well for CPU computation.






[jira] [Updated] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise

2017-07-28 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-6870:
-
Attachment: YARN-6870-v1.patch

> ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a 
> float, which is imprecise
> ---
>
> Key: YARN-6870
> URL: https://issues.apache.org/jira/browse/YARN-6870
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api, nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
> Attachments: YARN-6870-v0.patch, YARN-6870-v1.patch
>
>
> We have seen issues on our clusters where the current way of computing CPU 
> usage is having float-arithmetic inaccuracies (the bug is still there in 
> trunk)
> Simple program to illustrate:
> {code:title=Bar.java|borderStyle=solid}
>   public static void main(String[] args) throws Exception {
> float result = 0.0f;
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result += (float) 4 / (float)18;
>   } else {
> result += (float) 2 / (float)18;
>   }
> }
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result -= (float) 4 / (float)18;
>   } else {
> result -= (float) 2 / (float)18;
>   } 
> }
> System.out.println(result);
>   }
> {code}
> // Printed
> 4.4703484E-8
> 2017-04-12 05:43:24,014 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current 
> CPU Allocation: [0.891], Requested CPU Allocation: [0.]
> There are a few places with this issue:
> 1. ResourceUtilization.java - set/getCPU both use float. When 
> ContainerScheduler calls 
> ContainersMonitor.increase/decreaseResourceUtilization, this may lead to 
> issues.
> 2. AllocationBasedResourceUtilizationTracker.java  - hasResourcesAvailable 
> uses float as well for CPU computation.






[jira] [Updated] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise

2017-07-26 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-6870:
-
Attachment: YARN-6870-v0.patch

Attached patch against trunk.

> ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a 
> float, which is imprecise
> ---
>
> Key: YARN-6870
> URL: https://issues.apache.org/jira/browse/YARN-6870
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api, nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
> Attachments: YARN-6870-v0.patch
>
>
> We have seen issues on our clusters where the current way of computing CPU 
> usage is having float-arithmetic inaccuracies (the bug is still there in 
> trunk)
> Simple program to illustrate:
> {code:title=Bar.java|borderStyle=solid}
>   public static void main(String[] args) throws Exception {
> float result = 0.0f;
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result += (float) 4 / (float)18;
>   } else {
> result += (float) 2 / (float)18;
>   }
> }
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result -= (float) 4 / (float)18;
>   } else {
> result -= (float) 2 / (float)18;
>   } 
> }
> System.out.println(result);
>   }
> {code}
> // Printed
> 4.4703484E-8
> 2017-04-12 05:43:24,014 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current 
> CPU Allocation: [0.891], Requested CPU Allocation: [0.]
> There are a few places with this issue:
> 1. ResourceUtilization.java - set/getCPU both use float. When 
> ContainerScheduler calls 
> ContainersMonitor.increase/decreaseResourceUtilization, this may lead to 
> issues.
> 2. AllocationBasedResourceUtilizationTracker.java  - hasResourcesAvailable 
> uses float as well for CPU computation.






[jira] [Comment Edited] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise

2017-07-26 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102135#comment-16102135
 ] 

Brook Zhou edited comment on YARN-6870 at 7/26/17 10:07 PM:


[~asuresh] Sure. For the scope of this patch, I want to check with you whether 
we need to change the ResourceUtilization proto and downstream objects to use 
integral values instead of [0,1.0], or whether we only need to fix 
AllocationBasedResourceUtilizationTracker.hasResourcesAvailable/ContainerScheduler.hasSufficientResources
 to convert from float to int for cpu calculations.

Based on what I see, only the latter is the actual bug. The former is nice to 
have, but currently doesn't impact this JIRA. Thoughts?


was (Author: brookz):
[~asuresh] Sure. For the scope of this patch, I want to check with you whether 
we need to change the ResourceUtilization proto and downstream objects to use 
integral values instead of [0,1.0], or whether we only need to fix 
AllocationBasedResourceUtilizationTracker.hasResourcesAvailable/ContainerScheduler.hasResourcesAvailable
 to convert from float to int for cpu calculations.

Based on what I see, only the latter is the actual bug. The former is nice to 
have, but currently doesn't impact this JIRA. Thoughts?

> ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a 
> float, which is imprecise
> ---
>
> Key: YARN-6870
> URL: https://issues.apache.org/jira/browse/YARN-6870
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api, nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>
> We have seen issues on our clusters where the current way of computing CPU 
> usage is having float-arithmetic inaccuracies (the bug is still there in 
> trunk)
> Simple program to illustrate:
> {code:title=Bar.java|borderStyle=solid}
>   public static void main(String[] args) throws Exception {
> float result = 0.0f;
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result += (float) 4 / (float)18;
>   } else {
> result += (float) 2 / (float)18;
>   }
> }
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result -= (float) 4 / (float)18;
>   } else {
> result -= (float) 2 / (float)18;
>   } 
> }
> System.out.println(result);
>   }
> {code}
> // Printed
> 4.4703484E-8
> 2017-04-12 05:43:24,014 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current 
> CPU Allocation: [0.891], Requested CPU Allocation: [0.]
> There are a few places with this issue:
> 1. ResourceUtilization.java - set/getCPU both use float. When 
> ContainerScheduler calls 
> ContainersMonitor.increase/decreaseResourceUtilization, this may lead to 
> issues.
> 2. AllocationBasedResourceUtilizationTracker.java  - hasResourcesAvailable 
> uses float as well for CPU computation.






[jira] [Commented] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise

2017-07-26 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102135#comment-16102135
 ] 

Brook Zhou commented on YARN-6870:
--

[~asuresh] Sure. For the scope of this patch, I want to check with you whether 
we need to change the ResourceUtilization proto and downstream objects to use 
integral values instead of [0,1.0], or whether we only need to fix 
AllocationBasedResourceUtilizationTracker.hasResourcesAvailable/ContainerScheduler.hasResourcesAvailable
 to convert from float to int for cpu calculations.

Based on what I see, only the latter is the actual bug. The former is nice to 
have, but currently doesn't impact this JIRA. Thoughts?

> ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a 
> float, which is imprecise
> ---
>
> Key: YARN-6870
> URL: https://issues.apache.org/jira/browse/YARN-6870
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api, nodemanager
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>
> We have seen issues on our clusters where the current way of computing CPU 
> usage is having float-arithmetic inaccuracies (the bug is still there in 
> trunk)
> Simple program to illustrate:
> {code:title=Bar.java|borderStyle=solid}
>   public static void main(String[] args) throws Exception {
> float result = 0.0f;
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result += (float) 4 / (float)18;
>   } else {
> result += (float) 2 / (float)18;
>   }
> }
> for (int i = 0; i < 7; i++) {
>   if (i == 6) {
> result -= (float) 4 / (float)18;
>   } else {
> result -= (float) 2 / (float)18;
>   } 
> }
> System.out.println(result);
>   }
> {code}
> // Printed
> 4.4703484E-8
> 2017-04-12 05:43:24,014 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current 
> CPU Allocation: [0.891], Requested CPU Allocation: [0.]
> There are a few places with this issue:
> 1. ResourceUtilization.java - set/getCPU both use float. When 
> ContainerScheduler calls 
> ContainersMonitor.increase/decreaseResourceUtilization, this may lead to 
> issues.
> 2. AllocationBasedResourceUtilizationTracker.java  - hasResourcesAvailable 
> uses float as well for CPU computation.






[jira] [Created] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise

2017-07-25 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-6870:


 Summary: ResourceUtilization/ContainersMonitorImpl is calculating 
CPU utilization as a float, which is imprecise
 Key: YARN-6870
 URL: https://issues.apache.org/jira/browse/YARN-6870
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api, nodemanager
Reporter: Brook Zhou
Assignee: Brook Zhou


We have seen issues on our clusters where the current way of computing CPU 
usage is having float-arithmetic inaccuracies (the bug is still there in trunk)

Simple program to illustrate:
{code:title=Bar.java|borderStyle=solid}
  public static void main(String[] args) throws Exception {
float result = 0.0f;
for (int i = 0; i < 7; i++) {
  if (i == 6) {
result += (float) 4 / (float)18;
  } else {
result += (float) 2 / (float)18;
  }
}
for (int i = 0; i < 7; i++) {
  if (i == 6) {
result -= (float) 4 / (float)18;
  } else {
result -= (float) 2 / (float)18;
  } 
}
System.out.println(result);
  }
{code}
// Printed
4.4703484E-8


2017-04-12 05:43:24,014 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current CPU 
Allocation: [0.891], Requested CPU Allocation: [0.]

There are a few places with this issue:
1. ResourceUtilization.java - set/getCPU both use float. When 
ContainerScheduler calls 
ContainersMonitor.increase/decreaseResourceUtilization, this may lead to issues.

2. AllocationBasedResourceUtilizationTracker.java  - hasResourcesAvailable uses 
float as well for CPU computation.






[jira] [Commented] (YARN-5472) WIN_MAX_PATH logic is off by one

2016-08-08 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412694#comment-15412694
 ] 

Brook Zhou commented on YARN-5472:
--

Yes, this is something I have verified on our NMs running on Windows versions 
up to Server 2012 R2 - the max path limitation without long-path prefixing is 
259 characters.

> WIN_MAX_PATH logic is off by one
> 
>
> Key: YARN-5472
> URL: https://issues.apache.org/jira/browse/YARN-5472
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
> Environment: Windows
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>Priority: Minor
> Attachments: YARN-5472-v0.patch
>
>
> The following check is incorrect in DefaultContainerExecutor:
> if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > 
> WIN_MAX_PATH)
> should be >=, as the max path is defined as "D:\some 256-character path 
> string" on Windows platforms.
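
For clarity, the boundary behavior can be sketched as below. The value 260 is the classic Windows MAX_PATH and is only assumed here; it may not be exactly what DefaultContainerExecutor defines. The point is the comparison: a path whose length equals the limit already fails, so the check needs >= rather than >.

{code:title=WinPathCheckSketch.java|borderStyle=solid}
public class WinPathCheckSketch {
  static final int WIN_MAX_PATH = 260;   // assumed value for the sketch

  static boolean tooLongForWindows(String path) {
    return path.length() >= WIN_MAX_PATH;   // >= is the fix; > lets the boundary case through
  }

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder("D:\\");
    while (sb.length() < WIN_MAX_PATH) {
      sb.append('x');                       // build a path of exactly 260 characters
    }
    System.out.println(tooLongForWindows(sb.toString())); // true with >=, false with >
  }
}
{code}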






[jira] [Updated] (YARN-5472) WIN_MAX_PATH logic is off by one

2016-08-08 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-5472:
-
Attachment: YARN-5472-v0.patch

> WIN_MAX_PATH logic is off by one
> 
>
> Key: YARN-5472
> URL: https://issues.apache.org/jira/browse/YARN-5472
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
> Environment: Windows
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>Priority: Minor
> Attachments: YARN-5472-v0.patch
>
>
> The following check is incorrect in DefaultContainerExecutor:
> if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > 
> WIN_MAX_PATH)
> should be >=, as the max path is defined as "D:\some 256-character path 
> string" on Windows platforms.






[jira] [Updated] (YARN-5472) WIN_MAX_PATH logic is off by one

2016-08-03 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-5472:
-
Description: 
The following check is incorrect in DefaultContainerExecutor:

if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > 
WIN_MAX_PATH)

should be >=, as the max path is defined as "D:\some 256-character path 
string" on Windows platforms.

  was:
The following check is incorrect in DefaultContainerExecutor:

if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > 
WIN_MAX_PATH)

should be >=, as the max path is defined as "D:\some 256-character path 
string" 


> WIN_MAX_PATH logic is off by one
> 
>
> Key: YARN-5472
> URL: https://issues.apache.org/jira/browse/YARN-5472
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
> Environment: Windows
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>Priority: Minor
>
> The following check is incorrect in DefaultContainerExecutor:
> if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > 
> WIN_MAX_PATH)
> should be >=, as the max path is defined as "D:\some 256-character path 
> string" on Windows platforms.






[jira] [Updated] (YARN-5472) WIN_MAX_PATH logic is off by one

2016-08-03 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-5472:
-
Affects Version/s: 2.6.0

> WIN_MAX_PATH logic is off by one
> 
>
> Key: YARN-5472
> URL: https://issues.apache.org/jira/browse/YARN-5472
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
> Environment: Windows
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>Priority: Minor
>
> The following check is incorrect in DefaultContainerExecutor:
> if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > 
> WIN_MAX_PATH)
> should be >=, as the max path is defined as "D:\some 256-character path 
> string" 






[jira] [Updated] (YARN-5472) WIN_MAX_PATH logic is off by one

2016-08-03 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-5472:
-
Description: 
The following check is incorrect in DefaultContainerExecutor:

if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > 
WIN_MAX_PATH)

should be >=, as the max path is defined as "D:\some 256-character path 
string" 

  was:
The following check is incorrect:

if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > 
WIN_MAX_PATH)

should be >=, as the max path is defined as "D:\some 256-character path 
string" 


> WIN_MAX_PATH logic is off by one
> 
>
> Key: YARN-5472
> URL: https://issues.apache.org/jira/browse/YARN-5472
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
> Environment: Windows
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>Priority: Minor
>
> The following check is incorrect in DefaultContainerExecutor:
> if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > 
> WIN_MAX_PATH)
> should be >=, as the max path is defined as "D:\some 256-character path 
> string" 






[jira] [Created] (YARN-5472) WIN_MAX_PATH logic is off by one

2016-08-03 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-5472:


 Summary: WIN_MAX_PATH logic is off by one
 Key: YARN-5472
 URL: https://issues.apache.org/jira/browse/YARN-5472
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: Windows
Reporter: Brook Zhou
Assignee: Brook Zhou
Priority: Minor


The following check is incorrect:

if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > 
WIN_MAX_PATH)

should be >=, as the max path is defined as "D:\some 256-character path 
string" 






[jira] [Commented] (YARN-5451) Container localizers that hang are not cleaned up

2016-08-02 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405060#comment-15405060
 ] 

Brook Zhou commented on YARN-5451:
--

Is this because the ContainerLocalizer is launched in a separate process from 
LCE with a timeOutInterval of 0?

> Container localizers that hang are not cleaned up
> -
>
> Key: YARN-5451
> URL: https://issues.apache.org/jira/browse/YARN-5451
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>
> I ran across an old, rogue process on one of our nodes.  It apparently was a 
> container localizer that somehow entered an infinite loop during startup.  
> The NM never cleaned up this broken localizer, so it happily ran forever.  
> The NM needs to do a better job of tracking localizers, including killing 
> them if they appear to be hung/broken.






[jira] [Created] (YARN-4840) Add option to upload files recursively from container directory

2016-03-18 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-4840:


 Summary: Add option to upload files recursively from container 
directory
 Key: YARN-4840
 URL: https://issues.apache.org/jira/browse/YARN-4840
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Affects Versions: 2.8.0
Reporter: Brook Zhou
Priority: Minor
 Fix For: 2.8.0


It may be useful to allow users to aggregate their logs recursively from 
container directories.





[jira] [Commented] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files

2016-03-15 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196359#comment-15196359
 ] 

Brook Zhou commented on YARN-4818:
--

You're right, it was some issue with our FileSystem implementation.

I will cancel this. Sorry about the confusion.

> AggregatedLogFormat.LogValue.write() incorrectly truncates files
> 
>
> Key: YARN-4818
> URL: https://issues.apache.org/jira/browse/YARN-4818
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>  Labels: log-aggregation
> Fix For: 2.8.0
>
> Attachments: YARN-4818-v0.patch
>
>
> AggregatedLogFormat.LogValue.write() currently has a bug where it only writes 
> in blocks of the buffer size (65535). This is because 
> FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length 
> bytes remaining. In cases where the file size is not an exact multiple of 
> 65535 bytes, the remaining bytes are truncated.
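
Whatever the root cause turned out to be (the issue was later resolved as Invalid), the robust shape for such a copy loop is to honor the byte count that read() returns and stop only at -1. A generic, self-contained sketch, not the AggregatedLogFormat code:

{code:title=StreamCopy.java|borderStyle=solid}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public final class StreamCopy {
  static long copy(InputStream in, OutputStream out) throws IOException {
    byte[] buf = new byte[65535];
    long total = 0;
    int n;
    while ((n = in.read(buf)) != -1) {   // -1 means end of stream; otherwise 0..buf.length bytes
      out.write(buf, 0, n);              // write only the bytes actually read
      total += n;
    }
    return total;
  }
}
{code}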





[jira] [Resolved] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files

2016-03-15 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou resolved YARN-4818.
--
Resolution: Invalid

> AggregatedLogFormat.LogValue.write() incorrectly truncates files
> 
>
> Key: YARN-4818
> URL: https://issues.apache.org/jira/browse/YARN-4818
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>  Labels: log-aggregation
> Fix For: 2.8.0
>
> Attachments: YARN-4818-v0.patch
>
>
> AggregatedLogFormat.LogValue.write() currently has a bug where it only writes 
> in blocks of the buffer size (65535). This is because 
> FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length 
> bytes remaining. In cases where the file size is not an exact multiple of 
> 65535 bytes, the remaining bytes are truncated.





[jira] [Updated] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files

2016-03-15 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-4818:
-
Component/s: nodemanager

> AggregatedLogFormat.LogValue.write() incorrectly truncates files
> 
>
> Key: YARN-4818
> URL: https://issues.apache.org/jira/browse/YARN-4818
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>  Labels: log-aggregation
> Fix For: 2.8.0
>
> Attachments: YARN-4818-v0.patch
>
>
> AggregatedLogFormat.LogValue.write() currently has a bug where it only writes 
> in blocks of the buffer size (65535). This is because 
> FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length 
> bytes remaining. In cases where the file size is not an exact multiple of 
> 65535 bytes, the remaining bytes are truncated.





[jira] [Updated] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files

2016-03-15 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-4818:
-
Labels: log-aggregation  (was: )

> AggregatedLogFormat.LogValue.write() incorrectly truncates files
> 
>
> Key: YARN-4818
> URL: https://issues.apache.org/jira/browse/YARN-4818
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>  Labels: log-aggregation
> Fix For: 2.8.0
>
> Attachments: YARN-4818-v0.patch
>
>
> AggregatedLogFormat.LogValue.write() currently has a bug where it only writes 
> in blocks of the buffer size (65535). This is because 
> FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length 
> bytes remaining. In cases where the file size is not an exact multiple of 
> 65535 bytes, the remaining bytes are truncated.





[jira] [Updated] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files

2016-03-15 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-4818:
-
Attachment: YARN-4818-v0.patch

> AggregatedLogFormat.LogValue.write() incorrectly truncates files
> 
>
> Key: YARN-4818
> URL: https://issues.apache.org/jira/browse/YARN-4818
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Brook Zhou
>Assignee: Brook Zhou
> Fix For: 2.8.0
>
> Attachments: YARN-4818-v0.patch
>
>
> AggregatedLogFormat.LogValue.write() currently has a bug where it only writes 
> in blocks of the buffer size (65535). This is because 
> FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length 
> bytes remaining. In cases where the file size is not an exact multiple of 
> 65535 bytes, the remaining bytes are truncated.





[jira] [Updated] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files

2016-03-14 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-4818:
-
Summary: AggregatedLogFormat.LogValue.write() incorrectly truncates files  
(was: AggregatedLogFormat.LogValue writes only in blocks of buffer size)

> AggregatedLogFormat.LogValue.write() incorrectly truncates files
> 
>
> Key: YARN-4818
> URL: https://issues.apache.org/jira/browse/YARN-4818
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Brook Zhou
>Assignee: Brook Zhou
> Fix For: 2.8.0
>
>
> AggregatedLogFormat.LogValue.write() currently has a bug where it only writes 
> in blocks of the buffer size (65535). This is because 
> FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length 
> bytes remaining. In cases where the file size is not an exact multiple of 
> 65535 bytes, the remaining bytes are truncated.





[jira] [Updated] (YARN-4818) AggregatedLogFormat.LogValue writes only in blocks of buffer size

2016-03-14 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-4818:
-
Description: AggregatedLogFormat.LogValue.write() currently has a bug where 
it only writes in blocks of the buffer size (65535). This is because 
FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length 
bytes remaining. In cases where the file size is not an exact multiple of 65535 
bytes, the remaining bytes are truncated.  (was: 
AggregatedLogFormat.LogValue.write() currently has a bug where it only writes 
in blocks of the buffer size (65535). This is because 
FileInputStream.read(byte[] buf) returns -1 if there are less than 65535 bytes 
remaining. In cases where the file is less than 65535 bytes, 0 bytes are 
written.)

> AggregatedLogFormat.LogValue writes only in blocks of buffer size
> -
>
> Key: YARN-4818
> URL: https://issues.apache.org/jira/browse/YARN-4818
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Brook Zhou
>Assignee: Brook Zhou
> Fix For: 2.8.0
>
>
> AggregatedLogFormat.LogValue.write() currently has a bug where it only writes 
> in blocks of the buffer size (65535). This is because 
> FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length 
> bytes remaining. In cases where the file size is not an exact multiple of 
> 65535 bytes, the remaining bytes are truncated.





[jira] [Created] (YARN-4818) AggregatedLogFormat.LogValue writes only in blocks of buffer size

2016-03-14 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-4818:


 Summary: AggregatedLogFormat.LogValue writes only in blocks of 
buffer size
 Key: YARN-4818
 URL: https://issues.apache.org/jira/browse/YARN-4818
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Brook Zhou
Assignee: Brook Zhou
 Fix For: 2.8.0


AggregatedLogFormat.LogValue.write() currently has a bug where it only writes 
in blocks of the buffer size (65535). This is because 
FileInputStream.read(byte[] buf) returns -1 if there are less than 65535 bytes 
remaining. In cases where the file is less than 65535 bytes, 0 bytes are 
written.





[jira] [Updated] (YARN-4677) RMNodeResourceUpdateEvent update from scheduler can lead to race condition

2016-03-04 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-4677:
-
Issue Type: Sub-task  (was: Improvement)
Parent: YARN-914

> RMNodeResourceUpdateEvent update from scheduler can lead to race condition
> --
>
> Key: YARN-4677
> URL: https://issues.apache.org/jira/browse/YARN-4677
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, resourcemanager, scheduler
>Affects Versions: 2.7.1
>Reporter: Brook Zhou
>Assignee: Junping Du
>
> When a node is in the decommissioning state, there is a time window between 
> completedContainer() and the RMNodeResourceUpdateEvent being handled in 
> scheduler.nodeUpdate (YARN-3223). 
> So if a scheduling effort happens within this window, a new container can 
> still be allocated on this node. The even worse case is when a scheduling 
> effort happens after the RMNodeResourceUpdateEvent is sent out but before it 
> is propagated to the SchedulerNode - then the total resource is lower than 
> the used resource and the available resource becomes negative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4677) RMNodeResourceUpdateEvent update from scheduler can lead to race condition

2016-02-22 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157240#comment-15157240
 ] 

Brook Zhou commented on YARN-4677:
--

Hi [~djp], I currently have no plan to work on it, and would appreciate it if 
you could. Thanks.

> RMNodeResourceUpdateEvent update from scheduler can lead to race condition
> --
>
> Key: YARN-4677
> URL: https://issues.apache.org/jira/browse/YARN-4677
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: graceful, resourcemanager, scheduler
>Affects Versions: 2.7.1
>Reporter: Brook Zhou
>
> When a node is in the decommissioning state, there is a time window between 
> completedContainer() and the RMNodeResourceUpdateEvent being handled in 
> scheduler.nodeUpdate (YARN-3223). 
> So if a scheduling effort happens within this window, a new container could 
> still get allocated on this node. An even worse case is if a scheduling 
> effort happens after the RMNodeResourceUpdateEvent is sent out but before it 
> is propagated to the SchedulerNode - then the total resource is lower than 
> the used resource and the available resource is negative. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2016-02-22 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157237#comment-15157237
 ] 

Brook Zhou commented on YARN-3223:
--

Thanks [~djp] for the kind reviews and followups.

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch, YARN-3223-v5.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2016-02-10 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15142170#comment-15142170
 ] 

Brook Zhou commented on YARN-3223:
--

Hi [~djp], can you please review?

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch, YARN-3223-v5.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4677) RMNodeResourceUpdateEvent update from scheduler can lead to race condition

2016-02-06 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-4677:


 Summary: RMNodeResourceUpdateEvent update from scheduler can lead 
to race condition
 Key: YARN-4677
 URL: https://issues.apache.org/jira/browse/YARN-4677
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: graceful, resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Brook Zhou


When a node is in the decommissioning state, there is a time window between 
completedContainer() and the RMNodeResourceUpdateEvent being handled in 
scheduler.nodeUpdate (YARN-3223). 

So if a scheduling effort happens within this window, a new container could 
still get allocated on this node. An even worse case is if a scheduling effort 
happens after the RMNodeResourceUpdateEvent is sent out but before it is 
propagated to the SchedulerNode - then the total resource is lower than the 
used resource and the available resource is negative. 
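
As a concrete, purely hypothetical illustration of the negative-available case: 
a DECOMMISSIONING node has a total of 8 GB with 6 GB used, so the resource 
update asks to shrink the total down to 6 GB. If the scheduler places another 
2 GB container before that update reaches the SchedulerNode, used grows to 
8 GB; once the total is then set to 6 GB, available = total - used = 
6 GB - 8 GB = -2 GB.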



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2016-02-05 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v4.patch

Latest patch, with the same scheduler code applied to the Fifo/Fair schedulers, 
plus tests.

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2016-02-05 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v4.patch

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2016-02-05 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v4.patch

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2016-02-05 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: (was: YARN-3223-v4.patch)

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2016-02-05 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: (was: YARN-3223-v4.patch)

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2016-02-05 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v4.patch

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2016-02-05 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: (was: YARN-3223-v4.patch)

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2016-02-05 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v5.patch

Fixed the checkstyle errors.

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch, YARN-3223-v5.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2016-01-22 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113307#comment-15113307
 ] 

Brook Zhou commented on YARN-3223:
--

Spoke offline with Junping, we will move forward with async approach in 
general. I will move any remaining to-dos to a separate JIRA.

Going back to my previous point, 
[YARN-4344|https://issues.apache.org/jira/browse/YARN-4344] seems to have 
removed the dependency on the RMNode.getTotalCapability() call inside the 
scheduler. Instead, the scheduler directly uses 
SchedulerNode.getTotalResource() for updating clusterResource on 
add/removeNode. In that case, we can simplify the scheduler's nodeUpdate 
change to:

{code:title=CapacityScheduler.java|borderStyle=solid}
private synchronized void nodeUpdate(RMNode nm) {
  ...
+ // When the node is decommissioning, shrink its total resource (and the
+ // queue/cluster accounting) down to what is currently in use, so no new
+ // containers can be allocated on it.
+ if (nm.getState() == NodeState.DECOMMISSIONING) {
+   this.updateNodeAndQueueResource(nm, ResourceOption.newInstance(
+       getSchedulerNode(nm.getNodeID()).getUsedResource(), 0));
+ }
  ...
}
{code}

At this point RMNodeImpl has already saved the originalTotalCapability of the 
node. This will also immediately update the SchedulerNode resources, which 
keeps scheduling consistent. The cost of locking should be minimal since the 
function just performs a few updates. This should resolve the issues you have 
brought up. Do you agree? 

Otherwise, I will just keep what I have in v3.patch and upload another patch 
with the same nodeUpdate code for Fifo and Fair schedulers, then create another 
JIRA to track the possible scheduler inconsistencies.

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2016-01-07 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088097#comment-15088097
 ] 

Brook Zhou commented on YARN-3223:
--

Thanks [~djp] for the feedback. 

The scenarios mentioned are indeed problematic. I think the proposal would end 
up making some changes to SchedulerNode and adding more complexity there. It 
could end up being too much overhead in terms of maintaining more variables, 
and it would still not solve the issues entirely because the system would 
still be only eventually consistent. 

Since CapacityScheduler.nodeUpdate is already synchronized, if we eliminate 
the asynchronous RMNodeResourceUpdateEvent and just directly modify the 
decommissioning SchedulerNode using updateNodeAndQueueResource, we guarantee 
the SchedulerNode's consistency. 

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2015-12-16 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061124#comment-15061124
 ] 

Brook Zhou commented on YARN-3223:
--

The test breaks are unrelated to the patch.

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4002) make ResourceTrackerService.nodeHeartbeat more concurrent

2015-12-03 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038392#comment-15038392
 ] 

Brook Zhou commented on YARN-4002:
--

If this is not currently being worked on, I will assign it to myself.

> make ResourceTrackerService.nodeHeartbeat more concurrent
> -
>
> Key: YARN-4002
> URL: https://issues.apache.org/jira/browse/YARN-4002
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Hong Zhiguo
>Assignee: Hong Zhiguo
>Priority: Critical
> Attachments: YARN-4002-v0.patch
>
>
> We have multiple RPC threads to handle NodeHeartbeatRequest from NMs. By 
> design the method ResourceTrackerService.nodeHeartbeat should be concurrent 
> enough to scale for large clusters.
> But we have a "BIG" lock in NodesListManager.isValidNode which I think is 
> unnecessary.
> First, the fields "includes" and "excludes" of HostsFileReader are only 
> updated on "refresh nodes".  All RPC threads handling node heartbeats are 
> only readers.  So an RWLock could be used to allow concurrent access by RPC 
> threads.
> Second, since the fields "includes" and "excludes" of HostsFileReader are 
> always updated by "reference assignment", which is atomic in Java, the 
> reader-side lock could just be skipped.
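
For reference, a minimal, hypothetical sketch of the "reference assignment" 
idea described above (this is not the HostsFileReader/NodesListManager code; 
the class and method names, and the empty-list-means-allow-all behaviour, are 
assumptions). Readers take a snapshot of the current reference and never lock; 
the refresh path builds a new set and publishes it with a single volatile 
assignment:

{code}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public final class HostListSketch {
  // Published atomically by reference assignment; volatile makes the new
  // reference visible to the heartbeat RPC threads.
  private volatile Set<String> includes = Collections.emptySet();

  // Called only on "refresh nodes": build the new set off to the side,
  // then publish it with one assignment.
  public void refresh(Set<String> newIncludes) {
    this.includes = Collections.unmodifiableSet(new HashSet<>(newIncludes));
  }

  // Called from heartbeat handler threads; no reader-side lock. Take a
  // snapshot so both checks see the same version of the list.
  public boolean isValid(String host) {
    Set<String> snapshot = includes;
    return snapshot.isEmpty() || snapshot.contains(host);
  }
}
{code}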



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4002) make ResourceTrackerService.nodeHeartbeat more concurrent

2015-12-02 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-4002:
-
Attachment: YARN-4002-v0.patch

Added a patch for this.

> make ResourceTrackerService.nodeHeartbeat more concurrent
> -
>
> Key: YARN-4002
> URL: https://issues.apache.org/jira/browse/YARN-4002
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Hong Zhiguo
>Assignee: Hong Zhiguo
>Priority: Critical
> Attachments: YARN-4002-v0.patch
>
>
> We have multiple RPC threads to handle NodeHeartbeatRequest from NMs. By 
> design the method ResourceTrackerService.nodeHeartbeat should be concurrent 
> enough to scale for large clusters.
> But we have a "BIG" lock in NodesListManager.isValidNode which I think is 
> unnecessary.
> First, the fields "includes" and "excludes" of HostsFileReader are only 
> updated on "refresh nodes".  All RPC threads handling node heartbeats are 
> only readers.  So an RWLock could be used to allow concurrent access by RPC 
> threads.
> Second, since the fields "includes" and "excludes" of HostsFileReader are 
> always updated by "reference assignment", which is atomic in Java, the 
> reader-side lock could just be skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-11-30 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v3.patch

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch, YARN-3223-v3.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-11-30 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: 0001-YARN-3223-resource-update.patch

Since completedContainer() is often called multiple times, e.g. from 
nodeUpdate(), I moved the triggering of the RMNodeResourceUpdateEvent directly 
into nodeUpdate() when a node is decommissioning. If this is OK, I will add 
similar code to the Fifo/Fair schedulers.
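
Roughly, the trigger would look like the following (a sketch of the idea only, 
not the attached patch):

{code}
// Inside the scheduler's nodeUpdate(RMNode nm): when the node is
// decommissioning, ask the RM to shrink the node's total resource down to
// what is currently in use, so nothing new can be placed on it.
if (nm.getState() == NodeState.DECOMMISSIONING) {
  this.rmContext.getDispatcher().getEventHandler().handle(
      new RMNodeResourceUpdateEvent(nm.getNodeID(),
          ResourceOption.newInstance(
              getSchedulerNode(nm.getNodeID()).getUsedResource(), 0)));
}
{code}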

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-11-30 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: (was: 0001-YARN-3223-resource-update.patch)

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2015-11-20 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018487#comment-15018487
 ] 

Brook Zhou commented on YARN-3223:
--

Makes sense. 

One thing that I'm not sure about - RMNodeImpl does not directly know the 
amount of usedResource needed to trigger an RMNodeResourceUpdateEvent. I can 
use rmNode.context.getScheduler().getSchedulerNode(rmNode.getNodeID()).getUsedResource(), 
but I'm not sure if adding that dependency on the scheduler is okay.

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2015-11-05 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992234#comment-14992234
 ] 

Brook Zhou commented on YARN-3223:
--

Unit tests that failed were not affected by the patch. They may be related to 
[YARN-2634|https://issues.apache.org/jira/browse/YARN-2634].

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-11-04 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v2.patch

Updated patch based on feedback. The checkstyle error about the 
CapacityScheduler.java file length is still there.

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2015-10-28 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978999#comment-14978999
 ] 

Brook Zhou commented on YARN-3223:
--

Thanks [~leftnoteasy] and [~djp] for the review.

bq. Suggest to use CapacityScheduler#updateNodeAndQueueResource to update 
resources, we need to update queue's resource, cluster metrics as well.
That makes sense. I'm currently setting the SchedulerNode's usedResource equal 
to totalResource and keeping totalResource the same. If we use that function, 
it means totalResource should be set equal to usedResource, and on 
recommission we should just revert to the original totalResource? I like your 
way better.

bq. When async scheduling enabled, we need to make sure decommissioing node's 
total resource is updated so no new container will be allocated on these nodes.
Even if async scheduling is enabled, we will update the total resource on the 
NODE_UPDATE event to equal the current usedResource, so the async scheduling 
thread will not allocate containers to the node.

bq.  RMNode itself (RMNode.getState()) is already include the necessary info, 
so the boolean parameter sounds like redundant
Agreed. I will let the scheduler decide the current state directly using that 
function.

bq.  I think we need separated test case to cover resource update during NM 
decommissioning 
Yes, that is definitely going to be added. I just wanted to see if my general 
ideas were okay with the community. Thanks!


> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-10-21 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v1.patch

Fixed the whitespace/checkstyle issues.

The only remaining checkstyle issue is 
"CapacityScheduler.java:1: File length is 2,009 lines"

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2015-10-08 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949223#comment-14949223
 ] 

Brook Zhou commented on YARN-3223:
--

Ah okay, sorry about that, will do. 

It seems to be passing test-patch on my local trunk repo, so I will update with 
submit patch.

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-10-01 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v0.patch

I changed the implementation to add a flag to NodeUpdateSchedulerEvent 
indicating isDecommissioning, which will update the SchedulerNode's 
usedResource to be equal to totalResource.

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-09-24 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: (was: YARN-3223-v1.patch)

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-09-11 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: (was: YARN-3223-v0.1.patch)

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-09-11 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v1.patch

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v1.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2015-09-11 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741678#comment-14741678
 ] 

Brook Zhou commented on YARN-3223:
--

Applied YARN-3212-v5.1.patch first.

With the YARN-3223-v1.patch changes, test-patch results passed.

| Vote | Subsystem | Runtime | Comment
| 0 | pre-patch | 42m 50s | Pre-patch trunk compilation is healthy.
| +1 | @author | 0m 0s | The patch does not contain any @author tags.
| +1 | tests included | 0m 0s | The patch appears to include 1 new or modified test files.
| +1 | javac | 11m 12s | There were no new javac warning messages.
| +1 | javadoc | 28m 15s | There were no new javadoc warning messages.
| +1 | release audit | 0m 59s | The applied patch does not increase the total number of release audit warnings.
| +1 | checkstyle | 4m 35s | There were no new checkstyle issues.
| +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace.
| +1 | install | 4m 9s | mvn install still works.
| +1 | eclipse:eclipse | 1m 29s | The patch built with eclipse:eclipse.
| +1 | findbugs | 7m 27s | The patch does not introduce any new Findbugs (version 3.0.0) warnings.
| | | 100m 58s |


> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v1.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-666) [Umbrella] Support rolling upgrades in YARN

2015-09-09 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou reassigned YARN-666:
---

Assignee: Brook Zhou

> [Umbrella] Support rolling upgrades in YARN
> ---
>
> Key: YARN-666
> URL: https://issues.apache.org/jira/browse/YARN-666
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.0.4-alpha
>Reporter: Siddharth Seth
>Assignee: Brook Zhou
> Fix For: 2.6.0
>
> Attachments: YARN_Rolling_Upgrades.pdf, YARN_Rolling_Upgrades_v2.pdf
>
>
> Jira to track changes required in YARN to allow rolling upgrades, including 
> documentation and possible upgrade routes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-09-04 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: (was: YARN-3223-v0.patch)

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.1.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-09-01 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: (was: YARN-3223-v0.1.patch)

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-09-01 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v0.1.patch

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.1.patch, YARN-3223-v0.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, 
> keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-08-18 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v0.1.patch

Contains tests and formatting changes.

 Resource update during NM graceful decommission
 ---

 Key: YARN-3223
 URL: https://issues.apache.org/jira/browse/YARN-3223
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Affects Versions: 2.7.1
Reporter: Junping Du
Assignee: Brook Zhou
 Attachments: YARN-3223-v0.1.patch, YARN-3223-v0.patch


 During NM graceful decommission, we should handle resource updates properly, 
 including: make RMNode keep track of the old resource for possible rollback, 
 keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-08-12 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v0.patch

Just a patch for review.

 Resource update during NM graceful decommission
 ---

 Key: YARN-3223
 URL: https://issues.apache.org/jira/browse/YARN-3223
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Affects Versions: 2.7.1
Reporter: Junping Du
Assignee: Varun Saxena
 Attachments: YARN-3223-v0.patch


 During NM graceful decommission, we should handle resource updates properly, 
 including: make RMNode keep track of the old resource for possible rollback, 
 keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2015-07-24 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14640845#comment-14640845
 ] 

Brook Zhou commented on YARN-3223:
--

Hi, I'm interested in whether this is still active. If not, I have an 
implementation of this that I would like to get reviewed. 

 Resource update during NM graceful decommission
 ---

 Key: YARN-3223
 URL: https://issues.apache.org/jira/browse/YARN-3223
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Reporter: Junping Du
Assignee: Varun Saxena

 During NM graceful decommission, we should handle resource updates properly, 
 including: make RMNode keep track of the old resource for possible rollback, 
 keep available resource at 0, and update used resource when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)