[jira] [Updated] (YARN-7131) FSDownload.unpack should determine the type of resource by reading the header bytes
[ https://issues.apache.org/jira/browse/YARN-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-7131: - Description: Currently, there are naive string checks to determine if a resource is of a particular type (jar, zip, tar.gz) There can be cases where this does not work - e.g., the user decides to split up a large zip resource as file1.zip.001, file1.zip.002. Instead, FSDownload.unpack should read the file header bytes to determine the file type. was: Currently, there are naive string checks to determine if a resource of a particular type (jar, zip, tar.gz) There can be cases where this does not work - e.g., the user decides to split up a large zip resource as file1.zip.001, file1.zip.002. Instead, FSDownload.unpack should read the file header bytes to determine the file type. > FSDownload.unpack should read determine the type of resource by reading the > header bytes > > > Key: YARN-7131 > URL: https://issues.apache.org/jira/browse/YARN-7131 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou > > Currently, there are naive string checks to determine if a resource is of a > particular type (jar, zip, tar.gz) > There can be cases where this does not work - e.g., the user decides to split > up a large zip resource as file1.zip.001, file1.zip.002. > Instead, FSDownload.unpack should read the file header bytes to determine the > file type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7131) FSDownload.unpack should determine the type of resource by reading the header bytes
[ https://issues.apache.org/jira/browse/YARN-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-7131: - Description: Currently, there are naive string checks to determine if a resource of a particular type (jar, zip, tar.gz) There can be cases where this does not work - e.g., the user decides to split up a large zip resource as file1.zip.001, file1.zip.002. Instead, FSDownload.unpack should read the file header bytes to determine the file type. was: Currently, there are naive string checks to determine if a resource of a particular type (jar, zip, tar.gz) There can be cases where this does not work - e.g., the user decides to split up a large zip resource as {file1}.zip.001, {file1}.zip.002. Instead, FSDownload.unpack should read the file header bytes to determine the file type. > FSDownload.unpack should read determine the type of resource by reading the > header bytes > > > Key: YARN-7131 > URL: https://issues.apache.org/jira/browse/YARN-7131 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou > > Currently, there are naive string checks to determine if a resource of a > particular type (jar, zip, tar.gz) > There can be cases where this does not work - e.g., the user decides to split > up a large zip resource as file1.zip.001, file1.zip.002. > Instead, FSDownload.unpack should read the file header bytes to determine the > file type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7131) FSDownload.unpack should determine the type of resource by reading the header bytes
[ https://issues.apache.org/jira/browse/YARN-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-7131: - Component/s: nodemanager > FSDownload.unpack should read determine the type of resource by reading the > header bytes > > > Key: YARN-7131 > URL: https://issues.apache.org/jira/browse/YARN-7131 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou > > Currently, there are naive string checks to determine if a resource is of a > particular type (jar, zip, tar.gz) > There can be cases where this does not work - e.g., the user decides to split > up a large zip resource as file1.zip.001, file1.zip.002. > Instead, FSDownload.unpack should read the file header bytes to determine the > file type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7131) FSDownload.unpack should determine the type of resource by reading the header bytes
Brook Zhou created YARN-7131: Summary: FSDownload.unpack should determine the type of resource by reading the header bytes Key: YARN-7131 URL: https://issues.apache.org/jira/browse/YARN-7131 Project: Hadoop YARN Issue Type: Improvement Reporter: Brook Zhou Assignee: Brook Zhou Currently, there are naive string checks to determine if a resource is of a particular type (jar, zip, tar.gz). There can be cases where this does not work - e.g., the user decides to split up a large zip resource as file1.zip.001, file1.zip.002. Instead, FSDownload.unpack should read the file header bytes to determine the file type.
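The header-byte detection proposed in YARN-7131 can be sketched as follows. This is a hypothetical illustration, not the actual FSDownload patch: the class and method names are invented, and a real fix would live in FSDownload.unpack and also handle split parts like file1.zip.001 by sniffing the first segment. The magic numbers themselves are standard: ZIP (and therefore JAR) files begin with the local-file-header signature "PK\003\004", and gzip streams (tar.gz) begin with 0x1f 0x8b.

```java
import java.util.Arrays;

// Hypothetical sketch: classify an archive by its leading "magic" bytes
// instead of by its file-name suffix.
public class ArchiveTypeSniffer {
  // ZIP local file header signature: 'P' 'K' 0x03 0x04 (JARs share it).
  private static final byte[] ZIP_MAGIC = {0x50, 0x4b, 0x03, 0x04};
  // gzip member header: 0x1f 0x8b.
  private static final byte[] GZIP_MAGIC = {(byte) 0x1f, (byte) 0x8b};

  public static String sniff(byte[] header) {
    if (startsWith(header, ZIP_MAGIC)) {
      return "zip";   // a JAR is a ZIP; inspect entries to tell them apart
    }
    if (startsWith(header, GZIP_MAGIC)) {
      return "gzip";
    }
    return "unknown"; // caller could fall back to the suffix-based checks
  }

  private static boolean startsWith(byte[] data, byte[] prefix) {
    if (data.length < prefix.length) {
      return false;
    }
    return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
  }

  public static void main(String[] args) {
    byte[] zipHeader = {0x50, 0x4b, 0x03, 0x04, 0x14, 0x00};
    System.out.println(sniff(zipHeader)); // prints "zip"
  }
}
```

With this approach the resource file1.zip.001 from the description would be classified by its first four bytes, regardless of the numeric suffix the user appended.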
[jira] [Updated] (YARN-7098) LocalizerRunner should immediately send heartbeat response LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING
[ https://issues.apache.org/jira/browse/YARN-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-7098: - Attachment: YARN-7098.patch > LocalizerRunner should immediately send heartbeat response > LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING > > > Key: YARN-7098 > URL: https://issues.apache.org/jira/browse/YARN-7098 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou >Priority: Minor > Attachments: YARN-7098.patch > > > Currently, the following can happen: > 1. ContainerLocalizer heartbeats to ResourceLocalizationService. > 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner > for the localizerId (containerId). Goes into {code:java}return > localizer.processHeartbeat(status.getResources());{code} > 3. Container receives kill event, goes from LOCALIZING -> KILLING. The > LocalizerRunner is removed from LocalizerTracker, since the privLocalizers > lock is now free. > 4. Since check (2) passed, LocalizerRunner sends heartbeat response with > LocalizerStatus.LIVE and the next file to download. > What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) > happened before the heartbeat response in (4). This saves the container from > potentially downloading an extra resource due to the one extra LIVE heartbeat > which will end up being deleted anyway. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7098) LocalizerRunner should immediately send heartbeat response LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING
[ https://issues.apache.org/jira/browse/YARN-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-7098: - Description: Currently, the following can happen: 1. ContainerLocalizer heartbeats to ResourceLocalizationService. 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner for the localizerId (containerId). Goes into {code:java}return localizer.processHeartbeat(status.getResources());{code} 3. Container receives kill event, goes from LOCALIZING -> KILLING. The LocalizerRunner is removed from LocalizerTracker, since the privLocalizers lock is now free. 4. Since check (2) passed, LocalizerRunner sends heartbeat response with LocalizerStatus.LIVE and the next file to download. What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) happened before the heartbeat response in (4). This saves the container from potentially downloading an extra resource due to the one extra LIVE heartbeat which will end up being deleted anyway. was: Currently, the following can happen: 1. ContainerLocalizer heartbeats to ResourceLocalizationService. 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner for the localizerId (containerId). Starts executing 3. Container receives kill event, goes from LOCALIZING -> KILLING. The LocalizerRunner is removed from LocalizerTracker. 4. Since check (2) passed, LocalizerRunner sends heartbeat response with LocalizerStatus.LIVE and the next file to download. What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) happened before the heartbeat response in (4). This saves the container from potentially downloading an extra resource which will end up being deleted anyway. 
> LocalizerRunner should immediately send heartbeat response > LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING > > > Key: YARN-7098 > URL: https://issues.apache.org/jira/browse/YARN-7098 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou >Priority: Minor > > Currently, the following can happen: > 1. ContainerLocalizer heartbeats to ResourceLocalizationService. > 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner > for the localizerId (containerId). Goes into {code:java}return > localizer.processHeartbeat(status.getResources());{code} > 3. Container receives kill event, goes from LOCALIZING -> KILLING. The > LocalizerRunner is removed from LocalizerTracker, since the privLocalizers > lock is now free. > 4. Since check (2) passed, LocalizerRunner sends heartbeat response with > LocalizerStatus.LIVE and the next file to download. > What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) > happened before the heartbeat response in (4). This saves the container from > potentially downloading an extra resource due to the one extra LIVE heartbeat > which will end up being deleted anyway. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7098) LocalizerRunner should immediately send heartbeat response LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING
[ https://issues.apache.org/jira/browse/YARN-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-7098: - Description: Currently, the following can happen: 1. ContainerLocalizer heartbeats to ResourceLocalizationService. 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner for the localizerId (containerId). Starts executing 3. Container receives kill event, goes from LOCALIZING -> KILLING. The LocalizerRunner is removed from LocalizerTracker. 4. Since check (2) passed, LocalizerRunner sends heartbeat response with LocalizerStatus.LIVE and the next file to download. What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) happened before the heartbeat response in (4). This saves the container from potentially downloading an extra resource which will end up being deleted anyway. was: Currently, the following can happen: 1. ContainerLocalizer heartbeats to ResourceLocalizationService. 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner for the localizerId (containerId). 3. Container receives kill event, goes from LOCALIZING -> KILLING. The LocalizerRunner is not removed from LocalizerTracker due to locking. 4. Since check (2) passed, LocalizerRunner sends heartbeat response with LocalizerStatus.LIVE and the next file to download. What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) happened before the heartbeat response in (4). This saves the container from potentially downloading an extra resource which will end up being deleted anyway. > LocalizerRunner should immediately send heartbeat response > LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING > > > Key: YARN-7098 > URL: https://issues.apache.org/jira/browse/YARN-7098 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou >Priority: Minor > > Currently, the following can happen: > 1. 
ContainerLocalizer heartbeats to ResourceLocalizationService. > 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner > for the localizerId (containerId). Starts executing > 3. Container receives kill event, goes from LOCALIZING -> KILLING. The > LocalizerRunner is removed from LocalizerTracker. > 4. Since check (2) passed, LocalizerRunner sends heartbeat response with > LocalizerStatus.LIVE and the next file to download. > What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) > happened before the heartbeat response in (4). This saves the container from > potentially downloading an extra resource which will end up being deleted > anyway. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7098) LocalizerRunner should immediately send heartbeat response LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING
[ https://issues.apache.org/jira/browse/YARN-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-7098: - Description: Currently, the following can happen: 1. ContainerLocalizer heartbeats to ResourceLocalizationService. 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner for the localizerId (containerId). 3. Container receives kill event, goes from LOCALIZING -> KILLING. The LocalizerRunner is not removed from LocalizerTracker due to locking. 4. Since check (2) passed, LocalizerRunner sends heartbeat response with LocalizerStatus.LIVE and the next file to download. What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) happened before the heartbeat response in (4). This saves the container from potentially downloading an extra resource which will end up being deleted anyway. was: Currently, the following can happen: 1. ContainerLocalizer heartbeats to ResourceLocalizationService. 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner for the localizerId (containerId). 3. Container receives kill event, goes from LOCALIZING -> KILLING. The LocalizerRunner for the localizerId is removed from LocalizerTracker. 4. Since check (2) passed, LocalizerRunner sends heartbeat response with LocalizerStatus.LIVE and the next file to download. What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) happened before the heartbeat response in (4). This saves the container from potentially downloading an extra resource which will end up being deleted anyway. > LocalizerRunner should immediately send heartbeat response > LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING > > > Key: YARN-7098 > URL: https://issues.apache.org/jira/browse/YARN-7098 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou >Priority: Minor > > Currently, the following can happen: > 1. 
ContainerLocalizer heartbeats to ResourceLocalizationService. > 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner > for the localizerId (containerId). > 3. Container receives kill event, goes from LOCALIZING -> KILLING. The > LocalizerRunner is not removed from LocalizerTracker due to locking. > 4. Since check (2) passed, LocalizerRunner sends heartbeat response with > LocalizerStatus.LIVE and the next file to download. > What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) > happened before the heartbeat response in (4). This saves the container from > potentially downloading an extra resource which will end up being deleted > anyway. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7098) LocalizerRunner should immediately send heartbeat response LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING
Brook Zhou created YARN-7098: Summary: LocalizerRunner should immediately send heartbeat response LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING Key: YARN-7098 URL: https://issues.apache.org/jira/browse/YARN-7098 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Brook Zhou Assignee: Brook Zhou Priority: Minor Currently, the following can happen: 1. ContainerLocalizer heartbeats to ResourceLocalizationService. 2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner for the localizerId (containerId). 3. Container receives kill event, goes from LOCALIZING -> KILLING. The LocalizerRunner for the localizerId is removed from LocalizerTracker. 4. Since check (2) passed, LocalizerRunner sends heartbeat response with LocalizerStatus.LIVE and the next file to download. What should happen here is that (4) sends a LocalizerStatus.DIE, since (3) happened before the heartbeat response in (4). This saves the container from potentially downloading an extra resource which will end up being deleted anyway. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
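The four-step race above can be condensed into a small sketch. This is a hypothetical model, not the actual nodemanager code: the class, map, and method names are invented, and the real LocalizerRunner/LocalizerTracker interaction involves more state. The essential point of the proposed fix is that runner membership is re-checked when the heartbeat response is produced, so a kill that removes the runner between steps (2) and (4) yields DIE instead of LIVE.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: answer DIE if the runner was removed (container went
// LOCALIZING -> KILLING) before the heartbeat response is built, so the
// localizer stops downloading resources that would only be deleted.
public class LocalizerTrackerSketch {
  public enum LocalizerStatus { LIVE, DIE }

  private final ConcurrentMap<String, Boolean> runners =
      new ConcurrentHashMap<>();

  public void register(String localizerId) {
    runners.put(localizerId, Boolean.TRUE);
  }

  // Kill path: frees the runner entry (step 3 in the description).
  public void cleanup(String localizerId) {
    runners.remove(localizerId);
  }

  // Heartbeat path: membership is checked again here, at response time,
  // closing the window between the initial lookup and the response.
  public LocalizerStatus processHeartbeat(String localizerId) {
    if (!runners.containsKey(localizerId)) {
      return LocalizerStatus.DIE;
    }
    return LocalizerStatus.LIVE; // ...and hand out the next resource
  }

  public static void main(String[] args) {
    LocalizerTrackerSketch t = new LocalizerTrackerSketch();
    t.register("container_1");
    System.out.println(t.processHeartbeat("container_1")); // prints LIVE
    t.cleanup("container_1");                              // kill arrives
    System.out.println(t.processHeartbeat("container_1")); // prints DIE
  }
}
```

The ConcurrentMap stands in for the privLocalizers map mentioned in the description; in the real code the same effect requires ordering the removal and the response under the privLocalizers lock.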
[jira] [Commented] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise
[ https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105798#comment-16105798 ] Brook Zhou commented on YARN-6870: -- Findbugs warnings are from existing code not in this patch. > ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a > float, which is imprecise > --- > > Key: YARN-6870 > URL: https://issues.apache.org/jira/browse/YARN-6870 > Project: Hadoop YARN > Issue Type: Bug > Components: api, nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou > Attachments: YARN-6870-v0.patch, YARN-6870-v1.patch, > YARN-6870-v2.patch, YARN-6870-v3.patch > > > We have seen issues on our clusters where the current way of computing CPU > usage is having float-arithmetic inaccuracies (the bug is still there in > trunk) > Simple program to illustrate: > {code:title=Bar.java|borderStyle=solid} > public static void main(String[] args) throws Exception { > float result = 0.0f; > for (int i = 0; i < 7; i++) { > if (i == 6) { > result += (float) 4 / (float)18; > } else { > result += (float) 2 / (float)18; > } > } > for (int i = 0; i < 7; i++) { > if (i == 6) { > result -= (float) 4 / (float)18; > } else { > result -= (float) 2 / (float)18; > } > } > System.out.println(result); > } > {code} > // Printed > 4.4703484E-8 > 2017-04-12 05:43:24,014 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current > CPU Allocation: [0.891], Requested CPU Allocation: [0.] > There are a few places with this issue: > 1. ResourceUtilization.java - set/getCPU both use float. When > ContainerScheduler calls > ContainersMonitor.increase/decreaseResourceUtilization, this may lead to > issues. > 2. AllocationBasedResourceUtilizationTracker.java - hasResourcesAvailable > uses float as well for CPU computation. 
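The drift in the quoted Bar.java program comes from accumulating rounded float quotients: adding and then subtracting the same seven fractions of 18 does not return exactly to zero. The sketch below reproduces that demo and contrasts it with the integer bookkeeping the issue argues for. The integer variant is illustrative only (not the attached patch): it tracks CPU shares as integral ticks, which round-trip exactly.

```java
// Reproduces the float drift from the JIRA demo, then shows the same
// bookkeeping done in integral ticks (illustrative sketch, not the patch).
public class CpuAccounting {
  // Accumulate, then remove, seven CPU shares as float fractions of 18.
  public static float floatDrift() {
    float result = 0.0f;
    for (int i = 0; i < 7; i++) {
      result += (i == 6 ? 4f : 2f) / 18f;
    }
    for (int i = 0; i < 7; i++) {
      result -= (i == 6 ? 4f : 2f) / 18f;
    }
    return result; // tiny nonzero residue (the JIRA demo printed 4.4703484E-8)
  }

  // Same shares tracked as integer ticks out of 18: exact round trip.
  public static int tickDrift() {
    int ticks = 0;
    for (int i = 0; i < 7; i++) {
      ticks += (i == 6 ? 4 : 2);
    }
    for (int i = 0; i < 7; i++) {
      ticks -= (i == 6 ? 4 : 2);
    }
    return ticks; // always exactly 0
  }

  public static void main(String[] args) {
    System.out.println(floatDrift());
    System.out.println(tickDrift()); // prints 0
  }
}
```

This is why the residual float allocation in the quoted log line never frees fully, and why hasResourcesAvailable-style checks are safer when CPU is compared in integral units.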
[jira] [Updated] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise
[ https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-6870: - Attachment: YARN-6870-v3.patch > ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a > float, which is imprecise > --- > > Key: YARN-6870 > URL: https://issues.apache.org/jira/browse/YARN-6870 > Project: Hadoop YARN > Issue Type: Bug > Components: api, nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou > Attachments: YARN-6870-v0.patch, YARN-6870-v1.patch, > YARN-6870-v2.patch, YARN-6870-v3.patch > > > We have seen issues on our clusters where the current way of computing CPU > usage is having float-arithmetic inaccuracies (the bug is still there in > trunk) > Simple program to illustrate: > {code:title=Bar.java|borderStyle=solid} > public static void main(String[] args) throws Exception { > float result = 0.0f; > for (int i = 0; i < 7; i++) { > if (i == 6) { > result += (float) 4 / (float)18; > } else { > result += (float) 2 / (float)18; > } > } > for (int i = 0; i < 7; i++) { > if (i == 6) { > result -= (float) 4 / (float)18; > } else { > result -= (float) 2 / (float)18; > } > } > System.out.println(result); > } > {code} > // Printed > 4.4703484E-8 > 2017-04-12 05:43:24,014 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current > CPU Allocation: [0.891], Requested CPU Allocation: [0.] > There are a few places with this issue: > 1. ResourceUtilization.java - set/getCPU both use float. When > ContainerScheduler calls > ContainersMonitor.increase/decreaseResourceUtilization, this may lead to > issues. > 2. AllocationBasedResourceUtilizationTracker.java - hasResourcesAvailable > uses float as well for CPU computation. 
[jira] [Updated] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise
[ https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-6870: - Attachment: YARN-6870-v2.patch Thanks, made the change. > ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a > float, which is imprecise > --- > > Key: YARN-6870 > URL: https://issues.apache.org/jira/browse/YARN-6870 > Project: Hadoop YARN > Issue Type: Bug > Components: api, nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou > Attachments: YARN-6870-v0.patch, YARN-6870-v1.patch, > YARN-6870-v2.patch > > > We have seen issues on our clusters where the current way of computing CPU > usage is having float-arithmetic inaccuracies (the bug is still there in > trunk) > Simple program to illustrate: > {code:title=Bar.java|borderStyle=solid} > public static void main(String[] args) throws Exception { > float result = 0.0f; > for (int i = 0; i < 7; i++) { > if (i == 6) { > result += (float) 4 / (float)18; > } else { > result += (float) 2 / (float)18; > } > } > for (int i = 0; i < 7; i++) { > if (i == 6) { > result -= (float) 4 / (float)18; > } else { > result -= (float) 2 / (float)18; > } > } > System.out.println(result); > } > {code} > // Printed > 4.4703484E-8 > 2017-04-12 05:43:24,014 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current > CPU Allocation: [0.891], Requested CPU Allocation: [0.] > There are a few places with this issue: > 1. ResourceUtilization.java - set/getCPU both use float. When > ContainerScheduler calls > ContainersMonitor.increase/decreaseResourceUtilization, this may lead to > issues. > 2. AllocationBasedResourceUtilizationTracker.java - hasResourcesAvailable > uses float as well for CPU computation. 
[jira] [Updated] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise
[ https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-6870: - Attachment: YARN-6870-v1.patch > ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a > float, which is imprecise > --- > > Key: YARN-6870 > URL: https://issues.apache.org/jira/browse/YARN-6870 > Project: Hadoop YARN > Issue Type: Bug > Components: api, nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou > Attachments: YARN-6870-v0.patch, YARN-6870-v1.patch > > > We have seen issues on our clusters where the current way of computing CPU > usage is having float-arithmetic inaccuracies (the bug is still there in > trunk) > Simple program to illustrate: > {code:title=Bar.java|borderStyle=solid} > public static void main(String[] args) throws Exception { > float result = 0.0f; > for (int i = 0; i < 7; i++) { > if (i == 6) { > result += (float) 4 / (float)18; > } else { > result += (float) 2 / (float)18; > } > } > for (int i = 0; i < 7; i++) { > if (i == 6) { > result -= (float) 4 / (float)18; > } else { > result -= (float) 2 / (float)18; > } > } > System.out.println(result); > } > {code} > // Printed > 4.4703484E-8 > 2017-04-12 05:43:24,014 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current > CPU Allocation: [0.891], Requested CPU Allocation: [0.] > There are a few places with this issue: > 1. ResourceUtilization.java - set/getCPU both use float. When > ContainerScheduler calls > ContainersMonitor.increase/decreaseResourceUtilization, this may lead to > issues. > 2. AllocationBasedResourceUtilizationTracker.java - hasResourcesAvailable > uses float as well for CPU computation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise
[ https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-6870: - Attachment: YARN-6870-v0.patch Attached patch against trunk. > ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a > float, which is imprecise > --- > > Key: YARN-6870 > URL: https://issues.apache.org/jira/browse/YARN-6870 > Project: Hadoop YARN > Issue Type: Bug > Components: api, nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou > Attachments: YARN-6870-v0.patch > > > We have seen issues on our clusters where the current way of computing CPU > usage is having float-arithmetic inaccuracies (the bug is still there in > trunk) > Simple program to illustrate: > {code:title=Bar.java|borderStyle=solid} > public static void main(String[] args) throws Exception { > float result = 0.0f; > for (int i = 0; i < 7; i++) { > if (i == 6) { > result += (float) 4 / (float)18; > } else { > result += (float) 2 / (float)18; > } > } > for (int i = 0; i < 7; i++) { > if (i == 6) { > result -= (float) 4 / (float)18; > } else { > result -= (float) 2 / (float)18; > } > } > System.out.println(result); > } > {code} > // Printed > 4.4703484E-8 > 2017-04-12 05:43:24,014 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current > CPU Allocation: [0.891], Requested CPU Allocation: [0.] > There are a few places with this issue: > 1. ResourceUtilization.java - set/getCPU both use float. When > ContainerScheduler calls > ContainersMonitor.increase/decreaseResourceUtilization, this may lead to > issues. > 2. AllocationBasedResourceUtilizationTracker.java - hasResourcesAvailable > uses float as well for CPU computation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise
[ https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102135#comment-16102135 ] Brook Zhou edited comment on YARN-6870 at 7/26/17 10:07 PM: [~asuresh] Sure. For the scope of this patch, I want to check with you whether we need to change the ResourceUtilization proto and downstream objects to use integral values instead of [0,1.0], or whether we only need to fix AllocationBasedResourceUtilizationTracker.hasResourcesAvailable/ContainerScheduler.hasSufficientResources to convert from float to int for cpu calculations. Based on what I see, only the latter is the actual bug. The former is nice to have, but currently doesn't impact this JIRA. Thoughts? was (Author: brookz): [~asuresh] Sure. For the scope of this patch, I want to check with you whether we need to change the ResourceUtilization proto and downstream objects to use integral values instead of [0,1.0], or whether we only need to fix AllocationBasedResourceUtilizationTracker.hasResourcesAvailable/ContainerScheduler.hasResourcesAvailable to convert from float to int for cpu calculations. Based on what I see, only the latter is the actual bug. The former is nice to have, but currently doesn't impact this JIRA. Thoughts? 
> ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a > float, which is imprecise > --- > > Key: YARN-6870 > URL: https://issues.apache.org/jira/browse/YARN-6870 > Project: Hadoop YARN > Issue Type: Bug > Components: api, nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou > > We have seen issues on our clusters where the current way of computing CPU > usage is having float-arithmetic inaccuracies (the bug is still there in > trunk) > Simple program to illustrate: > {code:title=Bar.java|borderStyle=solid} > public static void main(String[] args) throws Exception { > float result = 0.0f; > for (int i = 0; i < 7; i++) { > if (i == 6) { > result += (float) 4 / (float)18; > } else { > result += (float) 2 / (float)18; > } > } > for (int i = 0; i < 7; i++) { > if (i == 6) { > result -= (float) 4 / (float)18; > } else { > result -= (float) 2 / (float)18; > } > } > System.out.println(result); > } > {code} > // Printed > 4.4703484E-8 > 2017-04-12 05:43:24,014 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current > CPU Allocation: [0.891], Requested CPU Allocation: [0.] > There are a few places with this issue: > 1. ResourceUtilization.java - set/getCPU both use float. When > ContainerScheduler calls > ContainersMonitor.increase/decreaseResourceUtilization, this may lead to > issues. > 2. AllocationBasedResourceUtilizationTracker.java - hasResourcesAvailable > uses float as well for CPU computation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise
[ https://issues.apache.org/jira/browse/YARN-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102135#comment-16102135 ] Brook Zhou commented on YARN-6870: -- [~asuresh] Sure. For the scope of this patch, I want to check with you whether we need to change the ResourceUtilization proto and downstream objects to use integral values instead of [0,1.0], or whether we only need to fix AllocationBasedResourceUtilizationTracker.hasResourcesAvailable/ContainerScheduler.hasResourcesAvailable to convert from float to int for cpu calculations. Based on what I see, only the latter is the actual bug. The former is nice to have, but currently doesn't impact this JIRA. Thoughts? > ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a > float, which is imprecise > --- > > Key: YARN-6870 > URL: https://issues.apache.org/jira/browse/YARN-6870 > Project: Hadoop YARN > Issue Type: Bug > Components: api, nodemanager >Reporter: Brook Zhou >Assignee: Brook Zhou > > We have seen issues on our clusters where the current way of computing CPU > usage is having float-arithmetic inaccuracies (the bug is still there in > trunk) > Simple program to illustrate: > {code:title=Bar.java|borderStyle=solid} > public static void main(String[] args) throws Exception { > float result = 0.0f; > for (int i = 0; i < 7; i++) { > if (i == 6) { > result += (float) 4 / (float)18; > } else { > result += (float) 2 / (float)18; > } > } > for (int i = 0; i < 7; i++) { > if (i == 6) { > result -= (float) 4 / (float)18; > } else { > result -= (float) 2 / (float)18; > } > } > System.out.println(result); > } > {code} > // Printed > 4.4703484E-8 > 2017-04-12 05:43:24,014 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current > CPU Allocation: [0.891], Requested CPU Allocation: [0.] > There are a few places with this issue: > 1. 
ResourceUtilization.java - set/getCPU both use float. When > ContainerScheduler calls > ContainersMonitor.increase/decreaseResourceUtilization, this may lead to > issues. > 2. AllocationBasedResourceUtilizationTracker.java - hasResourcesAvailable > uses float as well for CPU computation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise
Brook Zhou created YARN-6870: Summary: ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise Key: YARN-6870 URL: https://issues.apache.org/jira/browse/YARN-6870 Project: Hadoop YARN Issue Type: Bug Components: api, nodemanager Reporter: Brook Zhou Assignee: Brook Zhou We have seen issues on our clusters where the current way of computing CPU usage is having float-arithmetic inaccuracies (the bug is still there in trunk) Simple program to illustrate: {code:title=Bar.java|borderStyle=solid} public static void main(String[] args) throws Exception { float result = 0.0f; for (int i = 0; i < 7; i++) { if (i == 6) { result += (float) 4 / (float)18; } else { result += (float) 2 / (float)18; } } for (int i = 0; i < 7; i++) { if (i == 6) { result -= (float) 4 / (float)18; } else { result -= (float) 2 / (float)18; } } System.out.println(result); } {code} // Printed 4.4703484E-8 2017-04-12 05:43:24,014 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current CPU Allocation: [0.891], Requested CPU Allocation: [0.] There are a few places with this issue: 1. ResourceUtilization.java - set/getCPU both use float. When ContainerScheduler calls ContainersMonitor.increase/decreaseResourceUtilization, this may lead to issues. 2. AllocationBasedResourceUtilizationTracker.java - hasResourcesAvailable uses float as well for CPU computation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
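[Editor's note] The float drift described in YARN-6870 disappears if CPU utilization is tracked in integral units. A minimal, self-contained sketch - the milli-vcore representation and all names here are illustrative, not YARN's actual API; only the float-vs-int contrast is the point:

```java
// Contrast the float accumulation from the JIRA with an integer equivalent.
// Repeated add/subtract of the same float quotient leaves a tiny residue;
// the same bookkeeping in integer "milli-vcores" returns exactly to zero.
public class CpuAccounting {
    // Float-based tracker, mirroring the program quoted in the issue.
    static float floatDrift() {
        float result = 0.0f;
        for (int i = 0; i < 7; i++) result += (i == 6 ? 4f : 2f) / 18f;
        for (int i = 0; i < 7; i++) result -= (i == 6 ? 4f : 2f) / 18f;
        return result; // tiny nonzero residue (4.4703484E-8 in the report)
    }

    // Integer-based tracker: each share expressed in milli-vcores.
    static int intDrift() {
        int result = 0;
        for (int i = 0; i < 7; i++) result += (i == 6 ? 4000 : 2000) / 18;
        for (int i = 0; i < 7; i++) result -= (i == 6 ? 4000 : 2000) / 18;
        return result; // exactly 0: integer add/subtract is exact
    }

    public static void main(String[] args) {
        System.out.println(floatDrift() != 0.0f); // true: residue remains
        System.out.println(intDrift());           // 0
    }
}
```

This is why converting from float to int for the cpu comparison (the fix discussed above for hasResourcesAvailable) removes the spurious "Not enough cpu" rejections.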
[jira] [Commented] (YARN-5472) WIN_MAX_PATH logic is off by one
[ https://issues.apache.org/jira/browse/YARN-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412694#comment-15412694 ] Brook Zhou commented on YARN-5472: -- Yes, this is something I have verified on our NMs running on Windows versions up to Server 2012 R2 - the max path limitation without using long-path prefixing is 259 characters. > WIN_MAX_PATH logic is off by one > > > Key: YARN-5472 > URL: https://issues.apache.org/jira/browse/YARN-5472 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 > Environment: Windows >Reporter: Brook Zhou >Assignee: Brook Zhou >Priority: Minor > Attachments: YARN-5472-v0.patch > > > The following check is incorrect in DefaultContainerExecutor: > if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > > WIN_MAX_PATH) > should be >=, as the max path is defined as "D:\some 256-character path > string" on Windows platforms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5472) WIN_MAX_PATH logic is off by one
[ https://issues.apache.org/jira/browse/YARN-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-5472: - Attachment: YARN-5472-v0.patch > WIN_MAX_PATH logic is off by one > > > Key: YARN-5472 > URL: https://issues.apache.org/jira/browse/YARN-5472 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 > Environment: Windows >Reporter: Brook Zhou >Assignee: Brook Zhou >Priority: Minor > Attachments: YARN-5472-v0.patch > > > The following check is incorrect in DefaultContainerExecutor: > if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > > WIN_MAX_PATH) > should be >=, as the max path is defined as "D:\some 256-character path > string" on Windows platforms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5472) WIN_MAX_PATH logic is off by one
[ https://issues.apache.org/jira/browse/YARN-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-5472: - Description: The following check is incorrect in DefaultContainerExecutor: if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > WIN_MAX_PATH) should be >=, as the max path is defined as "D:\some 256-character path string" on Windows platforms. was: The following check is incorrect in DefaultContainerExecutor: if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > WIN_MAX_PATH) should be >=, as the max path is defined as "D:\some 256-character path string" > WIN_MAX_PATH logic is off by one > > > Key: YARN-5472 > URL: https://issues.apache.org/jira/browse/YARN-5472 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 > Environment: Windows >Reporter: Brook Zhou >Assignee: Brook Zhou >Priority: Minor > > The following check is incorrect in DefaultContainerExecutor: > if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > > WIN_MAX_PATH) > should be >=, as the max path is defined as "D:\some 256-character path > string" on Windows platforms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5472) WIN_MAX_PATH logic is off by one
[ https://issues.apache.org/jira/browse/YARN-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-5472: - Affects Version/s: 2.6.0 > WIN_MAX_PATH logic is off by one > > > Key: YARN-5472 > URL: https://issues.apache.org/jira/browse/YARN-5472 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 > Environment: Windows >Reporter: Brook Zhou >Assignee: Brook Zhou >Priority: Minor > > The following check is incorrect in DefaultContainerExecutor: > if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > > WIN_MAX_PATH) > should be >=, as the max path is defined as "D:\some 256-character path > string" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5472) WIN_MAX_PATH logic is off by one
[ https://issues.apache.org/jira/browse/YARN-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-5472: - Description: The following check is incorrect in DefaultContainerExecutor: if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > WIN_MAX_PATH) should be >=, as the max path is defined as "D:\some 256-character path string" was: The following check is incorrect: if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > WIN_MAX_PATH) should be >=, as the max path is defined as "D:\some 256-character path string" > WIN_MAX_PATH logic is off by one > > > Key: YARN-5472 > URL: https://issues.apache.org/jira/browse/YARN-5472 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Environment: Windows >Reporter: Brook Zhou >Assignee: Brook Zhou >Priority: Minor > > The following check is incorrect in DefaultContainerExecutor: > if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > > WIN_MAX_PATH) > should be >=, as the max path is defined as "D:\some 256-character path > string" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-5472) WIN_MAX_PATH logic is off by one
Brook Zhou created YARN-5472: Summary: WIN_MAX_PATH logic is off by one Key: YARN-5472 URL: https://issues.apache.org/jira/browse/YARN-5472 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Windows Reporter: Brook Zhou Assignee: Brook Zhou Priority: Minor The following check is incorrect: if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > WIN_MAX_PATH) should be >=, as the max path is defined as "D:\some 256-character path string" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
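[Editor's note] The boundary case in YARN-5472 is easy to see in isolation. A hedged sketch - the WIN_MAX_PATH value of 260 (Windows MAX_PATH including the trailing NUL, hence 259 usable characters) and the surrounding names are assumptions for illustration; only the comparison operator matches the patch:

```java
// Demonstrates the off-by-one: with ">" a path of exactly WIN_MAX_PATH
// characters slips through even though Windows rejects it; ">=" catches it.
public class PathCheck {
    // Stand-in for Hadoop's Shell.WINDOWS; hardcoded for the demo.
    static final boolean WINDOWS = true;
    // Assumed value: MAX_PATH on Windows is 260 including the terminating
    // NUL, so the longest usable path without the \\?\ prefix is 259 chars.
    static final int WIN_MAX_PATH = 260;

    // Buggy check, as in DefaultContainerExecutor before the patch.
    static boolean tooLongBuggy(String p) { return WINDOWS && p.length() >  WIN_MAX_PATH; }
    // Fixed check: >= flags the 260-character boundary case too.
    static boolean tooLongFixed(String p) { return WINDOWS && p.length() >= WIN_MAX_PATH; }

    public static void main(String[] args) {
        String path260 = "D:\\" + "a".repeat(257); // exactly 260 characters
        System.out.println(tooLongBuggy(path260)); // false: the bug misses it
        System.out.println(tooLongFixed(path260)); // true
    }
}
```

A 259-character path passes both checks, which matches the verified 259-character limit noted in the comment above.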
[jira] [Commented] (YARN-5451) Container localizers that hang are not cleaned up
[ https://issues.apache.org/jira/browse/YARN-5451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405060#comment-15405060 ] Brook Zhou commented on YARN-5451: -- Is this because the ContainerLocalizer is launched in a separate process from LCE with a timeOutInterval of 0? > Container localizers that hang are not cleaned up > - > > Key: YARN-5451 > URL: https://issues.apache.org/jira/browse/YARN-5451 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: Jason Lowe > > I ran across an old, rogue process on one of our nodes. It apparently was a > container localizer that somehow entered an infinite loop during startup. > The NM never cleaned up this broken localizer, so it happily ran forever. > The NM needs to do a better job of tracking localizers, including killing > them if they appear to be hung/broken. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-4840) Add option to upload files recursively from container directory
Brook Zhou created YARN-4840: Summary: Add option to upload files recursively from container directory Key: YARN-4840 URL: https://issues.apache.org/jira/browse/YARN-4840 Project: Hadoop YARN Issue Type: Improvement Components: log-aggregation Affects Versions: 2.8.0 Reporter: Brook Zhou Priority: Minor Fix For: 2.8.0 It may be useful to allow users to aggregate their logs recursively from container directories. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files
[ https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196359#comment-15196359 ] Brook Zhou commented on YARN-4818: -- You're right, it was some issue with our FileSystem implementation. I will cancel this. Sorry about the confusion. > AggregatedLogFormat.LogValue.write() incorrectly truncates files > > > Key: YARN-4818 > URL: https://issues.apache.org/jira/browse/YARN-4818 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Brook Zhou >Assignee: Brook Zhou > Labels: log-aggregation > Fix For: 2.8.0 > > Attachments: YARN-4818-v0.patch > > > AggregatedLogFormat.LogValue.write() currently has a bug where it only writes > in blocks of the buffer size (65535). This is because > FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length > bytes remaining. In cases where the file size is not an exact multiple of > 65535 bytes, the remaining bytes are truncated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files
[ https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou resolved YARN-4818. -- Resolution: Invalid > AggregatedLogFormat.LogValue.write() incorrectly truncates files > > > Key: YARN-4818 > URL: https://issues.apache.org/jira/browse/YARN-4818 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Brook Zhou >Assignee: Brook Zhou > Labels: log-aggregation > Fix For: 2.8.0 > > Attachments: YARN-4818-v0.patch > > > AggregatedLogFormat.LogValue.write() currently has a bug where it only writes > in blocks of the buffer size (65535). This is because > FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length > bytes remaining. In cases where the file size is not an exact multiple of > 65535 bytes, the remaining bytes are truncated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files
[ https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-4818: - Component/s: nodemanager > AggregatedLogFormat.LogValue.write() incorrectly truncates files > > > Key: YARN-4818 > URL: https://issues.apache.org/jira/browse/YARN-4818 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Brook Zhou >Assignee: Brook Zhou > Labels: log-aggregation > Fix For: 2.8.0 > > Attachments: YARN-4818-v0.patch > > > AggregatedLogFormat.LogValue.write() currently has a bug where it only writes > in blocks of the buffer size (65535). This is because > FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length > bytes remaining. In cases where the file size is not an exact multiple of > 65535 bytes, the remaining bytes are truncated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files
[ https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-4818: - Labels: log-aggregation (was: ) > AggregatedLogFormat.LogValue.write() incorrectly truncates files > > > Key: YARN-4818 > URL: https://issues.apache.org/jira/browse/YARN-4818 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Brook Zhou >Assignee: Brook Zhou > Labels: log-aggregation > Fix For: 2.8.0 > > Attachments: YARN-4818-v0.patch > > > AggregatedLogFormat.LogValue.write() currently has a bug where it only writes > in blocks of the buffer size (65535). This is because > FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length > bytes remaining. In cases where the file size is not an exact multiple of > 65535 bytes, the remaining bytes are truncated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files
[ https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-4818: - Attachment: YARN-4818-v0.patch > AggregatedLogFormat.LogValue.write() incorrectly truncates files > > > Key: YARN-4818 > URL: https://issues.apache.org/jira/browse/YARN-4818 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0 >Reporter: Brook Zhou >Assignee: Brook Zhou > Fix For: 2.8.0 > > Attachments: YARN-4818-v0.patch > > > AggregatedLogFormat.LogValue.write() currently has a bug where it only writes > in blocks of the buffer size (65535). This is because > FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length > bytes remaining. In cases where the file size is not an exact multiple of > 65535 bytes, the remaining bytes are truncated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files
[ https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-4818: - Summary: AggregatedLogFormat.LogValue.write() incorrectly truncates files (was: AggregatedLogFormat.LogValue writes only in blocks of buffer size) > AggregatedLogFormat.LogValue.write() incorrectly truncates files > > > Key: YARN-4818 > URL: https://issues.apache.org/jira/browse/YARN-4818 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0 >Reporter: Brook Zhou >Assignee: Brook Zhou > Fix For: 2.8.0 > > > AggregatedLogFormat.LogValue.write() currently has a bug where it only writes > in blocks of the buffer size (65535). This is because > FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length > bytes remaining. In cases where the file size is not an exact multiple of > 65535 bytes, the remaining bytes are truncated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4818) AggregatedLogFormat.LogValue writes only in blocks of buffer size
[ https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-4818: - Description: AggregatedLogFormat.LogValue.write() currently has a bug where it only writes in blocks of the buffer size (65535). This is because FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length bytes remaining. In cases where the file size is not an exact multiple of 65535 bytes, the remaining bytes are truncated. (was: AggregatedLogFormat.LogValue.write() currently has a bug where it only writes in blocks of the buffer size (65535). This is because FileInputStream.read(byte[] buf) returns -1 if there are less than 65535 bytes remaining. In cases where the file is less than 65535 bytes, 0 bytes are written.) > AggregatedLogFormat.LogValue writes only in blocks of buffer size > - > > Key: YARN-4818 > URL: https://issues.apache.org/jira/browse/YARN-4818 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0 >Reporter: Brook Zhou >Assignee: Brook Zhou > Fix For: 2.8.0 > > > AggregatedLogFormat.LogValue.write() currently has a bug where it only writes > in blocks of the buffer size (65535). This is because > FileInputStream.read(byte[] buf) returns -1 if there are less than buf.length > bytes remaining. In cases where the file size is not an exact multiple of > 65535 bytes, the remaining bytes are truncated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4818) AggregatedLogFormat.LogValue writes only in blocks of buffer size
Brook Zhou created YARN-4818: Summary: AggregatedLogFormat.LogValue writes only in blocks of buffer size Key: YARN-4818 URL: https://issues.apache.org/jira/browse/YARN-4818 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.0 Reporter: Brook Zhou Assignee: Brook Zhou Fix For: 2.8.0 AggregatedLogFormat.LogValue.write() currently has a bug where it only writes in blocks of the buffer size (65535). This is because FileInputStream.read(byte[] buf) returns -1 if there are less than 65535 bytes remaining. In cases where the file is less than 65535 bytes, 0 bytes are written. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
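[Editor's note] For reference (and consistent with this issue later being resolved Invalid), java.io.InputStream.read(byte[]) returns the number of bytes actually read and -1 only at end of stream, so a copy loop that honors the return value never truncates. A minimal sketch - names are illustrative, not the actual LogValue.write() code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class CopyLoop {
    // Standard copy loop: read(byte[]) may fill the buffer only partially;
    // it returns the count actually read, and -1 only at end of stream.
    static long copy(InputStream in, ByteArrayOutputStream out, int bufSize)
            throws IOException {
        byte[] buf = new byte[bufSize];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) { // -1 means EOF, not "short read"
            out.write(buf, 0, n);          // write only the bytes read
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // 65535-byte buffer against a 70000-byte source: the 4465-byte
        // tail must not be dropped.
        byte[] src = new byte[70000];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(src), out, 65535);
        System.out.println(copied); // 70000
    }
}
```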
[jira] [Updated] (YARN-4677) RMNodeResourceUpdateEvent update from scheduler can lead to race condition
[ https://issues.apache.org/jira/browse/YARN-4677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-4677: - Issue Type: Sub-task (was: Improvement) Parent: YARN-914 > RMNodeResourceUpdateEvent update from scheduler can lead to race condition > -- > > Key: YARN-4677 > URL: https://issues.apache.org/jira/browse/YARN-4677 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Brook Zhou >Assignee: Junping Du > > When a node is in decommissioning state, there is time window between > completedContainer() and RMNodeResourceUpdateEvent get handled in > scheduler.nodeUpdate (YARN-3223). > So if a scheduling effort happens within this window, the new container could > still get allocated on this node. Even worse case is if scheduling effort > happen after RMNodeResourceUpdateEvent sent out but before it is propagated > to SchedulerNode - then the total resource is lower than used resource and > available resource is a negative value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4677) RMNodeResourceUpdateEvent update from scheduler can lead to race condition
[ https://issues.apache.org/jira/browse/YARN-4677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157240#comment-15157240 ] Brook Zhou commented on YARN-4677: -- Hi [~djp], I currently have no plan, would appreciate if you could work on it. Thanks. > RMNodeResourceUpdateEvent update from scheduler can lead to race condition > -- > > Key: YARN-4677 > URL: https://issues.apache.org/jira/browse/YARN-4677 > Project: Hadoop YARN > Issue Type: Improvement > Components: graceful, resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Brook Zhou > > When a node is in decommissioning state, there is time window between > completedContainer() and RMNodeResourceUpdateEvent get handled in > scheduler.nodeUpdate (YARN-3223). > So if a scheduling effort happens within this window, the new container could > still get allocated on this node. Even worse case is if scheduling effort > happen after RMNodeResourceUpdateEvent sent out but before it is propagated > to SchedulerNode - then the total resource is lower than used resource and > available resource is a negative value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157237#comment-15157237 ] Brook Zhou commented on YARN-3223: -- Thanks [~djp] for the kind reviews and followups. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch, YARN-3223-v5.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15142170#comment-15142170 ] Brook Zhou commented on YARN-3223: -- Hi [~djp], can you please review? > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch, YARN-3223-v5.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4677) RMNodeResourceUpdateEvent update from scheduler can lead to race condition
Brook Zhou created YARN-4677: Summary: RMNodeResourceUpdateEvent update from scheduler can lead to race condition Key: YARN-4677 URL: https://issues.apache.org/jira/browse/YARN-4677 Project: Hadoop YARN Issue Type: Improvement Components: graceful, resourcemanager, scheduler Affects Versions: 2.7.1 Reporter: Brook Zhou When a node is in the decommissioning state, there is a time window between completedContainer() and the RMNodeResourceUpdateEvent being handled in scheduler.nodeUpdate (YARN-3223). So if a scheduling effort happens within this window, a new container could still get allocated on this node. An even worse case is if a scheduling effort happens after the RMNodeResourceUpdateEvent is sent out but before it is propagated to the SchedulerNode - then the total resource is lower than the used resource and the available resource is a negative value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v4.patch Latest patch with the same scheduler code applied to Fifo/Fair Schedulers and tests. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v4.patch > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v4.patch > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: (was: YARN-3223-v4.patch) > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: (was: YARN-3223-v4.patch) > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v4.patch > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: (was: YARN-3223-v4.patch) > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v5.patch Fixed the checkstyle errors. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch, YARN-3223-v4.patch, YARN-3223-v5.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113307#comment-15113307 ] Brook Zhou commented on YARN-3223: -- Spoke offline with Junping; we will move forward with the async approach in general. I will move any remaining to-dos to a separate JIRA. Going back to my previous point, [YARN-4344|https://issues.apache.org/jira/browse/YARN-4344] seems to have removed the dependency on the RMNode.getTotalCapability() call inside the scheduler. Instead, the scheduler directly uses SchedulerNode.getTotalResource() for updating clusterResource on add/removeNode. In that case, we can reduce the scheduler's nodeUpdate change to:
{code:title=CapacityScheduler.java|borderStyle=solid}
private synchronized void nodeUpdate(RMNode nm) {
  ...
+ if (nm.getState() == NodeState.DECOMMISSIONING) {
+   this.updateNodeAndQueueResource(nm, ResourceOption.newInstance(
+       getSchedulerNode(nm.getNodeID()).getUsedResource(), 0));
+ }
  ...
}
{code}
At this point RMNodeImpl has already saved the originalTotalCapability of the node. This also immediately updates the SchedulerNode resources, which keeps scheduling consistent. The cost of locking should be minimal since the function only performs a few updates. This should resolve the issues you have brought up. Do you agree? Otherwise, I will keep what I have in v3.patch, upload another patch with the same nodeUpdate code for the Fifo and Fair schedulers, and create another JIRA to track the possible scheduler inconsistencies. 
> Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088097#comment-15088097 ] Brook Zhou commented on YARN-3223: -- Thanks [~djp] for the feedback. The scenarios mentioned are indeed problematic. I think the proposal would end up making changes to SchedulerNode and adding more complexity there. It could become too much overhead in terms of maintaining more variables, and would still not solve the issues entirely, since the system would remain only eventually consistent. Since CapacityScheduler.nodeUpdate is already synchronized, if we eliminate the asynchronous RMNodeResourceUpdateEvent and directly modify the decommissioning SchedulerNode using updateNodeAndQueueResource, we guarantee the SchedulerNode's consistency. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
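A minimal sketch of the consistency argument in the comment above, using illustrative names rather than the real CapacityScheduler API: because nodeUpdate is synchronized, shrinking the node's total inside it means no allocation can observe a stale total, whereas posting an asynchronous RMNodeResourceUpdateEvent would leave a window in which the dispatcher thread has not yet applied the change.

```java
// Illustrative sketch only (not the actual CapacityScheduler): the resource
// shrink and any allocation both run under the same scheduler lock, so the
// scheduler can never allocate against the pre-decommission total.
class TinyScheduler {
    private int nodeTotalMB = 8192;
    private int nodeUsedMB = 2048;

    synchronized void nodeUpdate(boolean decommissioning) {
        if (decommissioning) {
            // Direct, in-lock update: total := used, so available becomes 0.
            nodeTotalMB = nodeUsedMB;
        }
        // ... container allocation would happen here, already seeing the
        // shrunken total. An async event would instead be applied later by
        // a dispatcher thread, outside this critical section.
    }

    synchronized int availableMB() { return nodeTotalMB - nodeUsedMB; }
}
```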
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061124#comment-15061124 ] Brook Zhou commented on YARN-3223: -- The test breaks are unrelated to this patch. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4002) make ResourceTrackerService.nodeHeartbeat more concurrent
[ https://issues.apache.org/jira/browse/YARN-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15038392#comment-15038392 ] Brook Zhou commented on YARN-4002: -- If this is currently not being worked on, I will assign it to myself. > make ResourceTrackerService.nodeHeartbeat more concurrent > - > > Key: YARN-4002 > URL: https://issues.apache.org/jira/browse/YARN-4002 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Hong Zhiguo >Assignee: Hong Zhiguo >Priority: Critical > Attachments: YARN-4002-v0.patch > > > We have multiple RPC threads to handle NodeHeartbeatRequest from NMs. By > design the method ResourceTrackerService.nodeHeartbeat should be concurrent > enough to scale for large clusters. > But we have a "BIG" lock in NodesListManager.isValidNode which I think is > unnecessary. > First, the fields "includes" and "excludes" of HostsFileReader are only > updated on "refresh nodes". All RPC threads handling node heartbeats are > only readers. So a RWLock could be used to allow concurrent access by RPC > threads. > Second, since the fields "includes" and "excludes" of HostsFileReader are > always updated by "reference assignment", which is atomic in Java, the reader > side lock could just be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4002) make ResourceTrackerService.nodeHeartbeat more concurrent
[ https://issues.apache.org/jira/browse/YARN-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-4002: - Attachment: YARN-4002-v0.patch Added a patch for this. > make ResourceTrackerService.nodeHeartbeat more concurrent > - > > Key: YARN-4002 > URL: https://issues.apache.org/jira/browse/YARN-4002 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Hong Zhiguo >Assignee: Hong Zhiguo >Priority: Critical > Attachments: YARN-4002-v0.patch > > > We have multiple RPC threads to handle NodeHeartbeatRequest from NMs. By > design the method ResourceTrackerService.nodeHeartbeat should be concurrent > enough to scale for large clusters. > But we have a "BIG" lock in NodesListManager.isValidNode which I think is > unnecessary. > First, the fields "includes" and "excludes" of HostsFileReader are only > updated on "refresh nodes". All RPC threads handling node heartbeats are > only readers. So a RWLock could be used to allow concurrent access by RPC > threads. > Second, since the fields "includes" and "excludes" of HostsFileReader are > always updated by "reference assignment", which is atomic in Java, the reader > side lock could just be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
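The locking scheme proposed in the YARN-4002 description can be sketched as follows. The HostsSnapshot class here is hypothetical, not the real HostsFileReader: a ReadWriteLock lets many heartbeat-handling threads read concurrently while "refresh nodes" takes the exclusive write lock; and, as the description notes, since the sets are swapped by a single reference assignment (atomic in Java), declaring the fields volatile would even let readers skip locking entirely.

```java
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the includes/excludes check (not the actual Hadoop
// NodesListManager/HostsFileReader code).
class HostsSnapshot {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Replaced wholesale on refresh; never mutated in place.
    private Set<String> includes = Set.of();
    private Set<String> excludes = Set.of();

    boolean isValidNode(String host) {
        lock.readLock().lock();  // many heartbeat RPC threads may hold this at once
        try {
            return (includes.isEmpty() || includes.contains(host))
                && !excludes.contains(host);
        } finally {
            lock.readLock().unlock();
        }
    }

    void refresh(Set<String> newIncludes, Set<String> newExcludes) {
        lock.writeLock().lock();  // exclusive: blocks readers only briefly
        try {
            includes = newIncludes;
            excludes = newExcludes;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The second optimization in the description would replace the read lock with `volatile` fields: each reader would capture the two references into locals once and test against those, accepting that a concurrent refresh may be observed as a slightly stale but internally consistent snapshot.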
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v3.patch > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch, YARN-3223-v3.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: 0001-YARN-3223-resource-update.patch Since completedContainer is often called multiple times like from nodeUpdate(), I moved the trigger of RMNodeResourceUpdateEvent directly into nodeUpdate() when a node is decommissioning. If this is ok, I will add similar code to Fifo/Fair schedulers. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: (was: 0001-YARN-3223-resource-update.patch) > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018487#comment-15018487 ] Brook Zhou commented on YARN-3223: -- Makes sense. One thing that I'm not sure about - RMNodeImpl does not directly know the amount of usedResource needed to trigger an RMNodeResourceUpdateEvent. I can use rmNode.context.getScheduler().getSchedulerNode(rmNode.getNodeID()).getUsedResource(), but I'm not sure if adding that dependency on the scheduler is okay. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992234#comment-14992234 ] Brook Zhou commented on YARN-3223: -- The unit tests that failed were not affected by this patch. They may be related to [YARN-2634|https://issues.apache.org/jira/browse/YARN-2634]. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v2.patch Updated patch based on feedback. Checkstyle errors about CapacityScheduler.java file length still there. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, > YARN-3223-v2.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978999#comment-14978999 ] Brook Zhou commented on YARN-3223: -- Thanks [~leftnoteasy], [~djp] for the review. bq. Suggest to use CapacityScheduler#updateNodeAndQueueResource to update resources, we need to update queue's resource, cluster metrics as well. That makes sense. I'm currently setting SchedulerNode's usedResource equal to totalResource and keeping totalResource the same. If we use that function, it means totalResource should instead be set equal to usedResource, and on recommission we should revert back to the original totalResource? I like your way better. bq. When async scheduling enabled, we need to make sure decommissioing node's total resource is updated so no new container will be allocated on these nodes. Even if async scheduling is enabled, since we update the total resource on the NODE_UPDATE event to equal the current usedResource, the async scheduling thread will not allocate containers to the node. bq. RMNode itself (RMNode.getState()) is already include the necessary info, so the boolean parameter sounds like redundant Agreed. I will let the scheduler decide the current state directly using that function. bq. I think we need separated test case to cover resource update during NM decommissioning Yes, that will definitely be added. I just wanted to see if my general ideas were okay with the community. Thanks! 
> Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v1.patch Fixed the whitespace/checkstyle issues. The only remaining checkstyle issue is "CapacityScheduler.java:1: File length is 2,009 lines" > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949223#comment-14949223 ] Brook Zhou commented on YARN-3223: -- Ah okay, sorry about that, will do. It seems to pass test-patch on my local trunk repo, so I will update via Submit Patch. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v0.patch I changed the implementation to add a flag to NodeUpdateSchedulerEvent indicating isDecommissioning, which will update the SchedulerNode's usedResource to be equal to totalResource. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: (was: YARN-3223-v1.patch) > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: (was: YARN-3223-v0.1.patch) > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v1.patch > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v1.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741678#comment-14741678 ] Brook Zhou commented on YARN-3223: -- Applied YARN-3212-v5.1.patch first. With the YARN-3223-v1.patch changes, test-patch results passed.
| Vote | Subsystem | Runtime | Comment |
| 0 | pre-patch | 42m 50s | Pre-patch trunk compilation is healthy. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | javac | 11m 12s | There were no new javac warning messages. |
| +1 | javadoc | 28m 15s | There were no new javadoc warning messages. |
| +1 | release audit | 0m 59s | The applied patch does not increase the total number of release audit warnings. |
| +1 | checkstyle | 4m 35s | There were no new checkstyle issues. |
| +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| +1 | install | 4m 9s | mvn install still works. |
| +1 | eclipse:eclipse | 1m 29s | The patch built with eclipse:eclipse. |
| +1 | findbugs | 7m 27s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| | | 100m 58s | |
> Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v1.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-666) [Umbrella] Support rolling upgrades in YARN
[ https://issues.apache.org/jira/browse/YARN-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou reassigned YARN-666: --- Assignee: Brook Zhou > [Umbrella] Support rolling upgrades in YARN > --- > > Key: YARN-666 > URL: https://issues.apache.org/jira/browse/YARN-666 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.0.4-alpha >Reporter: Siddharth Seth >Assignee: Brook Zhou > Fix For: 2.6.0 > > Attachments: YARN_Rolling_Upgrades.pdf, YARN_Rolling_Upgrades_v2.pdf > > > Jira to track changes required in YARN to allow rolling upgrades, including > documentation and possible upgrade routes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: (was: YARN-3223-v0.patch) > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.1.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: (was: YARN-3223-v0.1.patch) > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v0.1.patch > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.1.patch, YARN-3223-v0.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v0.1.patch Contains tests, formatting changes. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.1 >Reporter: Junping Du >Assignee: Brook Zhou > Attachments: YARN-3223-v0.1.patch, YARN-3223-v0.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-3223: - Attachment: YARN-3223-v0.patch Just a patch for review. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Junping Du >Assignee: Varun Saxena > Attachments: YARN-3223-v0.patch > > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission
[ https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640845#comment-14640845 ] Brook Zhou commented on YARN-3223: -- Hi, I'm interested in whether this is still active. If not, I have an implementation of this that I would like to get reviewed. > Resource update during NM graceful decommission > --- > > Key: YARN-3223 > URL: https://issues.apache.org/jira/browse/YARN-3223 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Junping Du >Assignee: Varun Saxena > > During NM graceful decommission, we should handle resource update properly, > include: make RMNode keep track of old resource for possible rollback, keep > available resource to 0 and used resource get updated when > container finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)