[jira] [Commented] (YARN-9606) Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient
[ https://issues.apache.org/jira/browse/YARN-9606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419159#comment-17419159 ] Peter Bacsko commented on YARN-9606:

Thanks [~BilwaST] for the backport, committed to branch-3.3.

> Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient
> Key: YARN-9606
> URL: https://issues.apache.org/jira/browse/YARN-9606
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Fix For: 3.4.0
> Attachments: YARN-9606-001.patch, YARN-9606-002.patch, YARN-9606-branch-3.3-v2.patch, YARN-9606-branch-3.3.v1.patch, YARN-9606.003.patch, YARN-9606.004.patch, YARN-9606.005.patch, YARN-9606.006.patch, YARN-9606.007.patch, YARN-9606.008.patch
>
> The "yarn logs" command fails for running containers:
> {quote}
> Unable to fetch log files list
> Exception in thread "main" java.io.IOException: com.sun.jersey.api.client.ClientHandlerException: javax.net.ssl.SSLHandshakeException: Error while authenticating with endpoint: [https://vm2:65321/ws/v1/node/containers/container_e05_1559802125016_0001_01_08/logs]
> at org.apache.hadoop.yarn.client.cli.LogsCLI.getContainerLogFiles(LogsCLI.java:543)
> at org.apache.hadoop.yarn.client.cli.LogsCLI.getMatchedContainerLogFiles(LogsCLI.java:1338)
> at org.apache.hadoop.yarn.client.cli.LogsCLI.getMatchedOptionForRunningApp(LogsCLI.java:1514)
> at org.apache.hadoop.yarn.client.cli.LogsCLI.fetchContainerLogs(LogsCLI.java:1052)
> at org.apache.hadoop.yarn.client.cli.LogsCLI.runCommand(LogsCLI.java:367)
> at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:152)
> at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:399)
> {quote}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9606) Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient
[ https://issues.apache.org/jira/browse/YARN-9606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418683#comment-17418683 ] Peter Bacsko commented on YARN-9606:

[~BilwaST] yeah, sorry, I completely forgot about it. I'll commit it tomorrow.
[jira] [Created] (YARN-10958) Use correct configuration for Group service init in CSMappingPlacementRule
Peter Bacsko created YARN-10958:
Summary: Use correct configuration for Group service init in CSMappingPlacementRule
Key: YARN-10958
URL: https://issues.apache.org/jira/browse/YARN-10958
Project: Hadoop YARN
Issue Type: Bug
Reporter: Peter Bacsko

There is a potential problem in {{CSMappingPlacementRule.java}}:
{noformat}
    if (groups == null) {
      groups = Groups.getUserToGroupsMappingService(conf);
    }
{noformat}
The problem is that we're supposed to pass {{scheduler.getConf()}}. The "conf" object is the configuration of the capacity scheduler, which does not include the property that selects the group service provider. Therefore, the current code works only by chance, because the group mapping service is already initialized at this point. See the original fix in YARN-10053.

We also need a unit test to verify this. Idea:
# Create a Configuration object in which the property "hadoop.security.group.mapping" refers to an existing test implementation.
# Add a new method to {{Groups}} which nulls out the singleton instance, e.g. {{Groups.reset()}}.
# Create a mock CapacityScheduler where {{getConf()}} and {{getConfiguration()}} contain different settings for "hadoop.security.group.mapping". Since {{getConf()}} is the service config, it should return the config object created in step #1.
# Create an instance of {{CSMappingPlacementRule}} with a single primary group rule.
# Run the placement evaluation.
# Expected: the returned queue matches what is supposed to come from the test group mapping service ("testuser" --> "testqueue").
# Modify "hadoop.security.group.mapping" in the config object created in step #1.
# Call {{Groups.refresh()}}, which changes the group mapping ("testuser" --> "testqueue2"). This requires that the test group mapping service implement {{GroupMappingServiceProvider.cacheGroupsRefresh()}}.
# Create a new instance of {{CSMappingPlacementRule}}.
# Run the placement evaluation again.
# Expected: with the same user, the target queue has changed.
This looks convoluted, but these steps make sure that:
# {{CSMappingPlacementRule}} will force the initialization of groups.
# We select the correct configuration for group service init.
# We don't create a new {{Groups}} instance if the singleton is initialized, so we cover the original problem described in YARN-10597.
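The "works by chance" behavior can be modeled without Hadoop: {{Groups}} is a process-wide singleton, so whichever configuration reaches it first wins, and later callers passing the wrong configuration appear to work. The sketch below is a minimal, self-contained toy model (names mirror the Hadoop ones, but this is illustrative, not the actual Hadoop code):

```java
import java.util.HashMap;
import java.util.Map;

public class GroupsInitDemo {
    // Toy model of the Groups singleton: the first configuration it sees
    // selects the group mapping provider; later calls ignore their argument.
    static class Groups {
        private static Groups instance;
        private final String provider;

        private Groups(String provider) { this.provider = provider; }

        static Groups getUserToGroupsMappingService(Map<String, String> conf) {
            if (instance == null) {
                instance = new Groups(
                    conf.getOrDefault("hadoop.security.group.mapping", "default"));
            }
            return instance; // conf is silently ignored once initialized
        }

        static void reset() { instance = null; } // the proposed test hook

        String provider() { return provider; }
    }

    public static void main(String[] args) {
        Map<String, String> serviceConf = new HashMap<>();
        serviceConf.put("hadoop.security.group.mapping", "TestGroupMapping");
        Map<String, String> schedulerConf = new HashMap<>(); // lacks the property

        // The service config initializes the singleton first...
        Groups.getUserToGroupsMappingService(serviceConf);
        // ...so a later call with the wrong (scheduler) config "works by chance".
        System.out.println(
            Groups.getUserToGroupsMappingService(schedulerConf).provider());

        // With a fresh singleton, the scheduler config yields the wrong provider.
        Groups.reset();
        System.out.println(
            Groups.getUserToGroupsMappingService(schedulerConf).provider());
    }
}
```

This is also why the proposed test needs a {{Groups.reset()}}-style hook: without it, the singleton initialized by an earlier test masks which configuration was actually passed.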
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403793#comment-17403793 ] Peter Bacsko commented on YARN-10848:

Thanks for the comment [~prabhujoseph], so you're saying that this is by design? If it is intentional, then we should probably close this JIRA. But at first, this behavior seemed really weird to me.

> Vcore allocation problem with DefaultResourceCalculator
> Key: YARN-10848
> URL: https://issues.apache.org/jira/browse/YARN-10848
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, capacityscheduler
> Reporter: Peter Bacsko
> Assignee: Minni Mittal
> Priority: Major
> Labels: pull-request-available
> Attachments: TestTooManyContainers.java
> Time Spent: 20m
> Remaining Estimate: 0h
>
> If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating containers even if we run out of vcores.
> CS checks the available resources at two places. The first check is {{CapacityScheduler.allocateContainerOnSingleNode()}}:
> {noformat}
> if (calculator.computeAvailableContainers(Resources
>     .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>     minimumAllocation) <= 0) {
>   LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient "
>       + "available or preemptible resource for minimum allocation");
> {noformat}
> The second, which is more important, is located in {{RegularContainerAllocator.assignContainer()}}:
> {noformat}
> if (!Resources.fitsIn(rc, capability, totalResource)) {
>   LOG.warn("Node : " + node.getNodeID()
>       + " does not have sufficient resource for ask : " + pendingAsk
>       + " node total capability : " + node.getTotalResource());
>   // Skip this locality request
>   ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
>       activitiesManager, node, application, schedulerKey,
>       ActivityDiagnosticConstant.NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST
>           + getResourceDiagnostics(capability, totalResource),
>       ActivityLevel.NODE);
>   return ContainerAllocation.LOCALITY_SKIPPED;
> }
> {noformat}
> Here, {{rc}} is the resource calculator instance; the other two values are:
> {noformat}
> Resource capability = pendingAsk.getPerAllocationResource();
> Resource available = node.getUnallocatedResource();
> {noformat}
> There is a repro unit test attached to this case which demonstrates the problem. The root cause is that we pass the resource calculator to {{Resources.fitsIn()}}. Instead, we should use an overridden version, just like in {{FSAppAttempt.assignContainer()}}:
> {noformat}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
>   // Inform the application of the new container for this request
>   RMContainer allocatedContainer =
>       allocate(type, node, schedulerKey, pendingAsk, reservedContainer);
> {noformat}
> In CS, if we switch to DominantResourceCalculator OR use {{Resources.fitsIn()}} without the calculator in {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}).
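The difference between the two {{fitsIn}} variants can be sketched without the Hadoop classes. Below is a minimal, self-contained toy model (the names echo {{Resources.fitsIn()}}, but this is illustrative, not the actual Hadoop implementation): a memory-only check keeps saying "fits" long after the node's vcores are exhausted, while a check over every dimension stops allocation.

```java
public class FitsInDemo {
    // Toy resource type: one memory and one vcore dimension.
    record Resource(long memoryMB, int vcores) {}

    // Models Resources.fitsIn(rc, ...) with DefaultResourceCalculator:
    // only the memory dimension is compared.
    static boolean fitsInMemoryOnly(Resource ask, Resource avail) {
        return ask.memoryMB() <= avail.memoryMB();
    }

    // Models the calculator-free Resources.fitsIn(...):
    // every dimension must fit.
    static boolean fitsInAllDimensions(Resource ask, Resource avail) {
        return ask.memoryMB() <= avail.memoryMB()
            && ask.vcores() <= avail.vcores();
    }

    public static void main(String[] args) {
        // A 768 GB / 32-vcore node after 32 containers of <2 GB, 1 vcore> each:
        // plenty of memory left, but zero vcores.
        Resource unallocated = new Resource(768 * 1024 - 32 * 2048, 0);
        Resource ask = new Resource(2048, 1);

        System.out.println(fitsInMemoryOnly(ask, unallocated));    // keeps allocating
        System.out.println(fitsInAllDimensions(ask, unallocated)); // vcores exhausted
    }
}
```

With the memory-only check, allocation on this node only stops once memory runs out, i.e. after roughly 768 GB / 2 GB = 384 containers on 32 cores.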
[jira] [Commented] (YARN-10576) Update Capacity Scheduler documentation about JSON-based placement mapping
[ https://issues.apache.org/jira/browse/YARN-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17401337#comment-17401337 ] Peter Bacsko commented on YARN-10576:

Thanks [~bteke], I'm not sure I'll have time to work on this anymore, so you can take it over if you want.

> Update Capacity Scheduler documentation about JSON-based placement mapping
> Key: YARN-10576
> URL: https://issues.apache.org/jira/browse/YARN-10576
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-10576-001.patch
>
> The weight mode and AQC also affect how the new placement engine in CS works, and the documentation has to reflect that.
> Certain statements in the documentation are no longer valid, for example:
> * create flag: "Only applies to managed queue parents" - there is no ManagedParentQueue in weight mode.
> * "The nested rules primaryGroupUser and secondaryGroupUser expects the parent queues to exist, ie. they cannot be created automatically". This only applies to the legacy absolute/percentage mode.
> Find all statements that mention possible limitations and fix them if necessary.
[jira] [Commented] (YARN-9907) Make YARN Service AM RPC port configurable
[ https://issues.apache.org/jira/browse/YARN-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390495#comment-17390495 ] Peter Bacsko commented on YARN-9907:

[~tarunparimi] [~prabhujoseph] to me this looks the same as YARN-10439. Can it be closed as a duplicate?

> Make YARN Service AM RPC port configurable
> Key: YARN-9907
> URL: https://issues.apache.org/jira/browse/YARN-9907
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn-native-services
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
> Attachments: YARN-9907.001.patch
>
> YARN Service AM uses a random ephemeral port for the ClientAMService RPC. In environments where firewalls block unnecessary ports by default, it is useful to have a configuration that specifies the port range, similar to what we have for MapReduce ({{yarn.app.mapreduce.am.job.client.port-range}}).
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1732#comment-1732 ] Peter Bacsko commented on YARN-10848:

[~minni31] the problem is that if you have a node with a lot of memory, CS keeps allocating containers even if there are no more vcores available. Imagine a 32-core server with 768 GB of RAM. With a container size of 2 GB, this means that 384 containers can run in parallel, potentially overloading the node. This might be a slightly artificial scenario, but it can happen.

IMO whether a container "fits in" or not should depend on both values. It's OK to use only one for the fairness calculation, but as I pointed out above, Fair Scheduler does not allow such an allocation if the "Fair" policy is used in the queue. If this was done intentionally, I'm wondering what the thought process behind it was.
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377446#comment-17377446 ] Peter Bacsko commented on YARN-10848:

[~minni31] sure, you can take it and I can review the patch if you upload one.
[jira] [Assigned] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reassigned YARN-10848:
Assignee: Minni Mittal
[jira] [Commented] (YARN-10849) Clarify testcase documentation for TestServiceAM#testContainersReleasedWhenPreLaunchFails
[ https://issues.apache.org/jira/browse/YARN-10849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376630#comment-17376630 ] Peter Bacsko commented on YARN-10849:

[~snemeth] patch v2 seems to undo what v1 introduced.

> Clarify testcase documentation for TestServiceAM#testContainersReleasedWhenPreLaunchFails
> Key: YARN-10849
> URL: https://issues.apache.org/jira/browse/YARN-10849
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Minor
> Attachments: YARN-10849.001.patch, YARN-10849.002.patch
>
> There's a small comment added to the testcase org.apache.hadoop.yarn.service.TestServiceAM#testContainersReleasedWhenPreLaunchFails:
> {code}
> // Test to verify that the containers are released and the
> // component instance is added to the pending queue when building the launch
> // context fails.
> {code}
> However, it was not clear to me why building the "launch context" would fail.
> While the test passes, it throws an exception that tells the story:
> {code}
> 2021-07-06 18:31:04,438 ERROR [pool-275-thread-1] containerlaunch.ContainerLaunchService (ContainerLaunchService.java:run(122)) - [COMPINSTANCE compa-0 : container_1625589063422_0001_01_01]: Failed to launch container.
> java.lang.IllegalArgumentException: Can not create a Path from a null string
> at org.apache.hadoop.fs.Path.checkPathArg(Path.java:164)
> at org.apache.hadoop.fs.Path.<init>(Path.java:180)
> at org.apache.hadoop.yarn.service.provider.tarball.TarballProviderService.processArtifact(TarballProviderService.java:39)
> at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:144)
> at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:107)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> This exception is thrown because the id of the Artifact object is unset (null); TarballProviderService.processArtifact verifies this and does not allow such artifacts.
> The aim of this jira is to add a clarifying comment or javadoc to this method.
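The failure mode in the stack trace above boils down to a null-argument guard. A minimal, self-contained toy model (illustrative only; the real check lives in Hadoop's {{Path.checkPathArg}}, and the null comes from the test's unset Artifact id):

```java
public class NullPathDemo {
    // Toy model of Path's null-string guard, mirroring the message
    // seen in the exception above.
    static void checkPathArg(String path) {
        if (path == null) {
            throw new IllegalArgumentException(
                "Can not create a Path from a null string");
        }
    }

    public static void main(String[] args) {
        String artifactId = null; // the test's Artifact has no id set
        try {
            checkPathArg(artifactId); // models new Path(<unset artifact id>)
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```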
[jira] [Updated] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10848:
Attachment: TestTooManyContainers.java
[jira] [Updated] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10848:
Attachment: (was: TestTooManyContainers.java)
[jira] [Updated] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10848: Attachment: TestTooManyContainers.java > Vcore allocation problem with DefaultResourceCalculator > --- > > Key: YARN-10848 > URL: https://issues.apache.org/jira/browse/YARN-10848 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Priority: Major > Attachments: TestTooManyContainers.java > > > If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating > containers even if we run out of vcores. > CS checks the available resources at two places. The first check is > {{CapacityScheduler.allocateContainerOnSingleNode()}}: > {noformat} > if (calculator.computeAvailableContainers(Resources > .add(node.getUnallocatedResource(), > node.getTotalKillableResources()), > minimumAllocation) <= 0) { > LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient " > + "available or preemptible resource for minimum allocation"); > {noformat} > The second, which is more important, is located in > {{RegularContainerAllocator.assignContainer()}}: > {noformat} > if (!Resources.fitsIn(rc, capability, totalResource)) { > LOG.warn("Node : " + node.getNodeID() > + " does not have sufficient resource for ask : " + pendingAsk > + " node total capability : " + node.getTotalResource()); > // Skip this locality request > ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation( > activitiesManager, node, application, schedulerKey, > ActivityDiagnosticConstant. 
> NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST > + getResourceDiagnostics(capability, totalResource), > ActivityLevel.NODE); > return ContainerAllocation.LOCALITY_SKIPPED; > } > {noformat} > Here, {{rc}} is the resource calculator instance, the other two values are: > {noformat} > Resource capability = pendingAsk.getPerAllocationResource(); > Resource available = node.getUnallocatedResource(); > {noformat} > There is a repro unit test attached to this case, which can demonstrate the > problem. The root cause is that we pass the resource calculator to > {{Resources.fitsIn()}}. Instead, we should use an overloaded version, just > like in {{FSAppAttempt.assignContainer()}}: > {noformat} >// Can we allocate a container on this node? > if (Resources.fitsIn(capability, available)) { > // Inform the application of the new container for this request > RMContainer allocatedContainer = > allocate(type, node, schedulerKey, pendingAsk, > reservedContainer); > {noformat} > In CS, if we switch to DominantResourceCalculator OR use > {{Resources.fitsIn()}} without the calculator in > {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit > test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10848: Summary: Vcore allocation problem with DefaultResourceCalculator (was: Vcore usage problem with Default/DominantResourceCalculator) > Vcore allocation problem with DefaultResourceCalculator > --- > > Key: YARN-10848 > URL: https://issues.apache.org/jira/browse/YARN-10848 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Priority: Major > > If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating > containers even if we run out of vcores. > CS checks the available resources at two places. The first check is > {{CapacityScheduler.allocateContainerOnSingleNode()}}: > {noformat} > if (calculator.computeAvailableContainers(Resources > .add(node.getUnallocatedResource(), > node.getTotalKillableResources()), > minimumAllocation) <= 0) { > LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient " > + "available or preemptible resource for minimum allocation"); > {noformat} > The second, which is more important, is located in > {{RegularContainerAllocator.assignContainer()}}: > {noformat} > if (!Resources.fitsIn(rc, capability, totalResource)) { > LOG.warn("Node : " + node.getNodeID() > + " does not have sufficient resource for ask : " + pendingAsk > + " node total capability : " + node.getTotalResource()); > // Skip this locality request > ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation( > activitiesManager, node, application, schedulerKey, > ActivityDiagnosticConstant. 
> NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST > + getResourceDiagnostics(capability, totalResource), > ActivityLevel.NODE); > return ContainerAllocation.LOCALITY_SKIPPED; > } > {noformat} > Here, {{rc}} is the resource calculator instance, the other two values are: > {noformat} > Resource capability = pendingAsk.getPerAllocationResource(); > Resource available = node.getUnallocatedResource(); > {noformat} > There is a repro unit test attached to this case, which can demonstrate the > problem. The root cause is that we pass the resource calculator to > {{Resources.fitsIn()}}. Instead, we should use an overloaded version, just > like in {{FSAppAttempt.assignContainer()}}: > {noformat} >// Can we allocate a container on this node? > if (Resources.fitsIn(capability, available)) { > // Inform the application of the new container for this request > RMContainer allocatedContainer = > allocate(type, node, schedulerKey, pendingAsk, > reservedContainer); > {noformat} > In CS, if we switch to DominantResourceCalculator OR use > {{Resources.fitsIn()}} without the calculator in > {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit > test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10848) Vcore usage problem with Default/DominantResourceCalculator
Peter Bacsko created YARN-10848: --- Summary: Vcore usage problem with Default/DominantResourceCalculator Key: YARN-10848 URL: https://issues.apache.org/jira/browse/YARN-10848 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, capacityscheduler Reporter: Peter Bacsko If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating containers even if we run out of vcores. CS checks the available resources at two places. The first check is {{CapacityScheduler.allocateContainerOnSingleNode()}}: {noformat} if (calculator.computeAvailableContainers(Resources .add(node.getUnallocatedResource(), node.getTotalKillableResources()), minimumAllocation) <= 0) { LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient " + "available or preemptible resource for minimum allocation"); {noformat} The second, which is more important, is located in {{RegularContainerAllocator.assignContainer()}}: {noformat} if (!Resources.fitsIn(rc, capability, totalResource)) { LOG.warn("Node : " + node.getNodeID() + " does not have sufficient resource for ask : " + pendingAsk + " node total capability : " + node.getTotalResource()); // Skip this locality request ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation( activitiesManager, node, application, schedulerKey, ActivityDiagnosticConstant. NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST + getResourceDiagnostics(capability, totalResource), ActivityLevel.NODE); return ContainerAllocation.LOCALITY_SKIPPED; } {noformat} Here, {{rc}} is the resource calculator instance, the other two values are: {noformat} Resource capability = pendingAsk.getPerAllocationResource(); Resource available = node.getUnallocatedResource(); {noformat} There is a repro unit test attached to this case, which can demonstrate the problem. The root cause is that we pass the resource calculator to {{Resources.fitsIn()}}. 
Instead, we should use an overloaded version, just like in {{FSAppAttempt.assignContainer()}}: {noformat} // Can we allocate a container on this node? if (Resources.fitsIn(capability, available)) { // Inform the application of the new container for this request RMContainer allocatedContainer = allocate(type, node, schedulerKey, pendingAsk, reservedContainer); {noformat} In CS, if we switch to DominantResourceCalculator OR use {{Resources.fitsIn()}} without the calculator in {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10843) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
[ https://issues.apache.org/jira/browse/YARN-10843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10843: Component/s: capacityscheduler capacity scheduler > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > - part II > -- > > Key: YARN-10843 > URL: https://issues.apache.org/jira/browse/YARN-10843 > Project: Hadoop YARN > Issue Type: Task > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Priority: Major > Labels: fs2cs > > Remaining tasks for fs2cs converter. > Phase I was completed under YARN-9698. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10843) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
[ https://issues.apache.org/jira/browse/YARN-10843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10843: Labels: fs2cs (was: ) > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > - part II > -- > > Key: YARN-10843 > URL: https://issues.apache.org/jira/browse/YARN-10843 > Project: Hadoop YARN > Issue Type: Task >Reporter: Peter Bacsko >Priority: Major > Labels: fs2cs > > Remaining tasks for fs2cs converter. > Phase I was completed under YARN-9698. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10843) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
[ https://issues.apache.org/jira/browse/YARN-10843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10843: Description: Remaining tasks for fs2cs converter. Phase I was completed under YARN-9698. > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > - part II > -- > > Key: YARN-10843 > URL: https://issues.apache.org/jira/browse/YARN-10843 > Project: Hadoop YARN > Issue Type: Task >Reporter: Peter Bacsko >Priority: Major > > Remaining tasks for fs2cs converter. > Phase I was completed under YARN-9698. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9698. Fix Version/s: 3.4.0 Resolution: Fixed > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > > > Key: YARN-9698 > URL: https://issues.apache.org/jira/browse/YARN-9698 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Weiwei Yang >Priority: Major > Labels: fs2cs > Fix For: 3.4.0 > > Attachments: FS-CS Migration.pdf > > > We see some users want to migrate from Fair Scheduler to Capacity Scheduler, > this Jira is created as an umbrella to track all related efforts for the > migration, the scope contains > * Bug fixes > * Add missing features > * Migration tools that help to generate CS configs based on FS, validate > configs etc > * Documents > this is part of CS component, the purpose is to make the migration process > smooth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375649#comment-17375649 ] Peter Bacsko commented on YARN-9698: Remaining subtasks have been moved under YARN-10843. Closing this. Thanks for everyone's contribution. > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > > > Key: YARN-9698 > URL: https://issues.apache.org/jira/browse/YARN-9698 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Weiwei Yang >Priority: Major > Labels: fs2cs > Attachments: FS-CS Migration.pdf > > > We see some users want to migrate from Fair Scheduler to Capacity Scheduler, > this Jira is created as an umbrella to track all related efforts for the > migration, the scope contains > * Bug fixes > * Add missing features > * Migration tools that help to generate CS configs based on FS, validate > configs etc > * Documents > this is part of CS component, the purpose is to make the migration process > smooth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10171) Add support for increment-allocation of custom resource types
[ https://issues.apache.org/jira/browse/YARN-10171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10171: Parent Issue: YARN-10843 (was: YARN-9698) > Add support for increment-allocation of custom resource types > - > > Key: YARN-10171 > URL: https://issues.apache.org/jira/browse/YARN-10171 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Qi Zhu >Priority: Minor > > The FairScheduler's {{yarn.resource-types.memory-mb.increment-allocation}} > and {{yarn.resource-types.vcores.increment-allocation}} configs are converted > to the {{yarn.scheduler.minimum-allocation-*}} configs, which is fine for the > vcores and memory. > In case of custom resource types like GPU if > {{yarn.resource-types.gpu.increment-allocation}} is set, then CS will not be > aware of that. We don't have a {{yarn.scheduler.minimum-allocation-gpu}} > setting for this purpose, but {{yarn.resource-types.gpu.min-allocation}} is > respected by the {{ResourceCalculator}} through the > {{ResourceUtils#getResourceInformationMapFromConfig}} which would provide us > with the same behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10758) Mixed mode: Allow relative and absolute mode in the same queue hierarchy
[ https://issues.apache.org/jira/browse/YARN-10758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10758: Parent Issue: YARN-10843 (was: YARN-9698) > Mixed mode: Allow relative and absolute mode in the same queue hierarchy > > > Key: YARN-10758 > URL: https://issues.apache.org/jira/browse/YARN-10758 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > > Fair Scheduler supports mixed mode for maximum capacity. An example scenario > of such configuration: > {noformat} > root.a.capacity [memory-mb=7268, vcores=8]{noformat} > {noformat} > root.a.a1.capacity 50{noformat} > {noformat} > root.a.a2.capacity 50{noformat} > Capacity Scheduler already permits using weight mode and relative/percentage > mode in the same hierarchy, however, the absolute mode and relative mode are > mutually exclusive. This improvement is a natural extension of CS to lift > this limitation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10693) Add document for YARN-10623 auto refresh queue conf in cs.
[ https://issues.apache.org/jira/browse/YARN-10693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10693: Parent Issue: YARN-10843 (was: YARN-9698) > Add document for YARN-10623 auto refresh queue conf in cs. > -- > > Key: YARN-10693 > URL: https://issues.apache.org/jira/browse/YARN-10693 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10693.001.patch, YARN-10693.002.patch, > YARN-10693.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9936) Support vector of capacity percentages in Capacity Scheduler configuration
[ https://issues.apache.org/jira/browse/YARN-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9936: --- Parent Issue: YARN-10843 (was: YARN-9698) > Support vector of capacity percentages in Capacity Scheduler configuration > -- > > Key: YARN-9936 > URL: https://issues.apache.org/jira/browse/YARN-9936 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Zoltan Siegl >Assignee: Andras Gyori >Priority: Major > Attachments: Capacity Scheduler support of “vector of resources > percentage”.pdf > > > Currently, the Capacity Scheduler queue configuration supports two ways to > set queue capacity. > * In percentage of all available resources as a float ( eg. 25.0 ) means 25% > of the resources of its parent queue for all resource types equally (eg. 25% > of all memory, 25% of all CPU cores, and 25% of all available GPU in the > cluster). The percentages of all queues have to add up to 100%. > * In an absolute amount of resources ( e.g. > memory=4GB,vcores=20,yarn.io/gpu=4 ). The amount of all resources in the > queues has to be less than or equal to all resources in the > cluster.{color:#de350b}Actually, the above is not supported, we only support > memory and vcores now in absolute mode, we should extend in {color}YARN-10503. > Apart from these two already existing ways, there is a demand to add capacity > percentage of each available resource type separately. (eg. > {{memory=20%,vcores=40%,yarn.io/gpu=100%}}). > At the same time, a similar concept should be included with queues > maximum-capacity as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
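The per-resource-type percentage syntax proposed in YARN-9936 is essentially a small parsing problem. Below is a hypothetical sketch of such a parser; the property shape `memory=20%,vcores=40%,yarn.io/gpu=100%` comes from the issue text, but the class and method names are invented for illustration and are not a shipped YARN API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CapacityVectorParser {
    // Parse "memory=20%,vcores=40%,yarn.io/gpu=100%" into resource -> percentage.
    // Illustrative only: real CS config parsing would also need validation
    // against registered resource types and range checks.
    static Map<String, Double> parse(String spec) {
        Map<String, Double> vector = new LinkedHashMap<>();
        for (String part : spec.split(",")) {
            String[] kv = part.trim().split("=");
            if (kv.length != 2 || !kv[1].trim().endsWith("%")) {
                throw new IllegalArgumentException("expected name=NN%: " + part);
            }
            String value = kv[1].trim();
            vector.put(kv[0].trim(),
                Double.parseDouble(value.substring(0, value.length() - 1)));
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(parse("memory=20%,vcores=40%,yarn.io/gpu=100%"));
        // {memory=20.0, vcores=40.0, yarn.io/gpu=100.0}
    }
}
```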
[jira] [Updated] (YARN-10049) FIFOOrderingPolicy Improvements
[ https://issues.apache.org/jira/browse/YARN-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10049: Parent Issue: YARN-10843 (was: YARN-9698) > FIFOOrderingPolicy Improvements > --- > > Key: YARN-10049 > URL: https://issues.apache.org/jira/browse/YARN-10049 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Manikandan R >Assignee: Manikandan R >Priority: Major > Attachments: YARN-10049.001.patch, YARN-10049.002.patch, > YARN-10049.003.patch > > > FIFOPolicy of FS does the following comparisons in addition to app priority > comparison: > 1. Using Start time > 2. Using Name > Scope of this jira is to achieve the same comparisons in FIFOOrderingPolicy > of CS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9975) Support proxy acl user for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9975: --- Parent Issue: YARN-10843 (was: YARN-9698) > Support proxy acl user for CapacityScheduler > > > Key: YARN-9975 > URL: https://issues.apache.org/jira/browse/YARN-9975 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > As commented in https://issues.apache.org/jira/browse/YARN-9698. > I will open a new jira for the proxy user feature. > The background is that we have long running sql thriftserver for many users: > {quote}{{user->sql proxy-> sql thriftserver}}{quote} > But we do not have keytab for all users on 'sql proxy'. We just use a super > user like 'sql_prc' to submit the 'sql thriftserver' application. To support > this we should change the scheduler to support proxy user acl -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9904) Investigate how resource allocation configuration could be more consistent in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9904: --- Parent Issue: YARN-10843 (was: YARN-9698) > Investigate how resource allocation configuration could be more consistent in > CapacityScheduler > --- > > Key: YARN-9904 > URL: https://issues.apache.org/jira/browse/YARN-9904 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Gergely Pollák >Priority: Major > > It would be nice if everywhere where a capacity can be defined could be > defined the same way: > * With fixed amounts (eg 1GB memory, 8 vcores, 3 GPU) > * With percentages > ** Percentage of all resources (eg 10% of all memory, vcore, GPU) > ** Percentage per resource type (eg 10% memory, 25% vcore, 50% GPU) > We need to determine all configuration options where capacities can be > defined, and see if it is possible to extend the configuration, or if it > makes sense in that case. > The outcome is a proposal for all the configurations which could/should be > changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9892) Capacity scheduler: support DRF ordering policy on queue level
[ https://issues.apache.org/jira/browse/YARN-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9892: --- Parent Issue: YARN-10843 (was: YARN-9698) > Capacity scheduler: support DRF ordering policy on queue level > -- > > Key: YARN-9892 > URL: https://issues.apache.org/jira/browse/YARN-9892 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Peter Bacsko >Assignee: Manikandan R >Priority: Major > Attachments: YARN-9892-003.patch, YARN-9892.001.patch, > YARN-9892.002.patch > > > Capacity scheduler does not support DRF (Dominant Resource Fairness) ordering > policy on queue level. Only "fifo" and "fair" are accepted for > {{yarn.scheduler.capacity..ordering-policy}}. > DRF can only be used globally if > {{yarn.scheduler.capacity.resource-calculator}} is set to > DominantResourceCalculator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9748) Allow capacity-scheduler configuration on HDFS and support reload from HDFS
[ https://issues.apache.org/jira/browse/YARN-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9748: --- Parent Issue: YARN-10843 (was: YARN-9698) > Allow capacity-scheduler configuration on HDFS and support reload from HDFS > --- > > Key: YARN-9748 > URL: https://issues.apache.org/jira/browse/YARN-9748 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler, capacityscheduler >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > Improvement: > Support auto reload from hdfs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9759) Document Queue and App Ordering Policy for CS
[ https://issues.apache.org/jira/browse/YARN-9759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9759: --- Parent Issue: YARN-10843 (was: YARN-9698) > Document Queue and App Ordering Policy for CS > - > > Key: YARN-9759 > URL: https://issues.apache.org/jira/browse/YARN-9759 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > Documentation of below properties for CapacityScheduler are missing > Ordering policy inside a parent queue to sort queues: > yarn.scheduler.capacity..ordering-policy = utilization, > priority-utilization > Ordering policy inside a leaf queue to sort apps: > yarn.scheduler.capacity..ordering-policy = fifo , fair -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-7621: --- Parent Issue: YARN-10843 (was: YARN-9698) > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: zhoukang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference of queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler. > FairScheduler needs queue path but CapacityScheduler needs queue name. There > is no doubt of the correctness of queue definition for CapacityScheduler > because it does not allow duplicate leaf queue names, but it's hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with queue path for CapacityScheduler to make the interface clearer and > the scheduler switch smooth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9709) When we expanding queue list the scheduler page will not show any applications
[ https://issues.apache.org/jira/browse/YARN-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9709: --- Parent Issue: YARN-10843 (was: YARN-9698) > When we expanding queue list the scheduler page will not show any applications > -- > > Key: YARN-9709 > URL: https://issues.apache.org/jira/browse/YARN-9709 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9709.001.patch, list1.png, list3.png > > > When we expanding queue list the scheduler page will not show any > applications.But it works well in FairScheduler. > !list1.png! > !list3.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9700) Docs about how to migrate from FS to CS config
[ https://issues.apache.org/jira/browse/YARN-9700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9700: --- Parent Issue: YARN-10843 (was: YARN-9698) > Docs about how to migrate from FS to CS config > -- > > Key: YARN-9700 > URL: https://issues.apache.org/jira/browse/YARN-9700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: docs >Reporter: Wanqiang Ji >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10843) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
Peter Bacsko created YARN-10843: --- Summary: [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II Key: YARN-10843 URL: https://issues.apache.org/jira/browse/YARN-10843 Project: Hadoop YARN Issue Type: Task Reporter: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10780) Optimise retrieval of configured node labels in CS queues
[ https://issues.apache.org/jira/browse/YARN-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369025#comment-17369025 ] Peter Bacsko commented on YARN-10780: - +1 Thanks [~gandras], latest patch LGTM. Committed to trunk. > Optimise retrieval of configured node labels in CS queues > - > > Key: YARN-10780 > URL: https://issues.apache.org/jira/browse/YARN-10780 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10780.001.patch, YARN-10780.002.patch, > YARN-10780.003.patch, YARN-10780.004.patch, YARN-10780.005.patch > > > CapacitySchedulerConfiguration#getConfiguredNodeLabels scales poorly with > respect to queue numbers (it's O(n*m), where n is the number of queues and m > is the number of properties set by each queue). During CS reinit, the node > labels are often queried, however looking at the code: > {code:java} > for (Entry stringStringEntry : this) { > e = stringStringEntry; > String key = e.getKey(); > if (key.startsWith(getQueuePrefix(queuePath) + ACCESSIBLE_NODE_LABELS > + DOT)) { > // Find in > // .accessible-node-labels..property > int labelStartIdx = > key.indexOf(ACCESSIBLE_NODE_LABELS) > + ACCESSIBLE_NODE_LABELS.length() + 1; > int labelEndIndx = key.indexOf('.', labelStartIdx); > String labelName = key.substring(labelStartIdx, labelEndIndx); > configuredNodeLabels.add(labelName); > } > } > {code} > This method iterates through ALL properties set in the configuration. For > example in case of initialising 2500 queues, each having at least 2 > properties: > 2500 * 5000 ~= over 12 million iterations + additional properties > There are some ways to resolve this issue while keeping backward > compatibility: > # Create a property like the original accessible-node-labels, which contains > predefined labels. 
If it is set, then getConfiguredNodeLabels get the value > of this property, otherwise it falls back to the old logic. I think > accessible-node-labels are not used for this purpose (though I have a feeling > that it should have been). > # Collect node labels for all queues at the beginning of parseQueue and only > iterate through the properties once. This will increase the space complexity > in exchange of not requiring intervention from user's perspective. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
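Option #2 above (collecting labels for all queues in a single pass over the properties) could look roughly like the sketch below. This is not the actual patch: the class and method names are made up for illustration, and the configuration is modeled as a plain `Map<String, String>` rather than Hadoop's `Configuration`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NodeLabelCollector {
    static final String PREFIX = "yarn.scheduler.capacity.";
    static final String ACCESSIBLE_NODE_LABELS = "accessible-node-labels";

    // Single pass over ALL configuration entries: O(m) instead of O(n*m),
    // grouping the label names it finds by queue path.
    static Map<String, Set<String>> collectNodeLabels(Map<String, String> conf) {
        Map<String, Set<String>> labelsByQueue = new HashMap<>();
        String marker = "." + ACCESSIBLE_NODE_LABELS + ".";
        for (Map.Entry<String, String> configEntry : conf.entrySet()) {
            String key = configEntry.getKey();
            int idx = key.indexOf(marker);
            if (!key.startsWith(PREFIX) || idx < 0) {
                continue;
            }
            // Key shape assumed:
            // yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.<property>
            String queuePath = key.substring(PREFIX.length(), idx);
            int labelStart = idx + marker.length();
            int labelEnd = key.indexOf('.', labelStart);
            if (labelEnd < 0) {
                continue; // no per-label property suffix, e.g. the label list itself
            }
            String label = key.substring(labelStart, labelEnd);
            labelsByQueue.computeIfAbsent(queuePath, q -> new HashSet<>()).add(label);
        }
        return labelsByQueue;
    }
}
```

The result can then be handed to each queue during parseQueue, so no queue ever re-scans the full property set.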
[jira] [Commented] (YARN-10780) Optimise retrieval of configured node labels in CS queues
[ https://issues.apache.org/jira/browse/YARN-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367259#comment-17367259 ] Peter Bacsko commented on YARN-10780: - [~gandras] looks good, could you take care of the checkstyle problems? > Optimise retrieval of configured node labels in CS queues > - > > Key: YARN-10780 > URL: https://issues.apache.org/jira/browse/YARN-10780 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10780.001.patch, YARN-10780.002.patch, > YARN-10780.003.patch, YARN-10780.004.patch > > > CapacitySchedulerConfiguration#getConfiguredNodeLabels scales poorly with > respect to queue numbers (its O(n*m), where n is the number of queues and m > is the number of properties set by each queue). During CS reinit, the node > labels are often queried, however looking at the code: > {code:java} > for (Entry stringStringEntry : this) { > e = stringStringEntry; > String key = e.getKey(); > if (key.startsWith(getQueuePrefix(queuePath) + ACCESSIBLE_NODE_LABELS > + DOT)) { > // Find in > // .accessible-node-labels..property > int labelStartIdx = > key.indexOf(ACCESSIBLE_NODE_LABELS) > + ACCESSIBLE_NODE_LABELS.length() + 1; > int labelEndIndx = key.indexOf('.', labelStartIdx); > String labelName = key.substring(labelStartIdx, labelEndIndx); > configuredNodeLabels.add(labelName); > } > } > {code} > This method iterates through ALL properties set in the configuration. For > example in case of initialising 2500 queues, each having at least 2 > properties: > 2500 * 5000 ~= over 12 million iteration + additional properties > There are some ways to resolve this issue while keeping backward > compatibility: > # Create a property like the original accessible-node-labels, which contains > predefined labels. If it is set, then getConfiguredNodeLabels get the value > of this property, otherwise it falls back to the old logic. 
I think > accessible-node-labels are not used for this purpose (though I have a feeling > that it should have been). > # Collect node labels for all queues at the beginning of parseQueue and only > iterate through the properties once. This will increase the space complexity > in exchange for not requiring intervention from the user's perspective.
[jira] [Updated] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10796: Attachment: YARN-10796-003.patch > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch, > YARN-10796-003.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
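A rough model of why the 0% queue in the scenario above cannot grow (simplified arithmetic only; the real CapacityScheduler user-limit computation also accounts for active users, minimum-user-limit-percent and pending resources, and the method and parameter names here are purely illustrative):

```java
public class UserLimitSketch {
    // Simplified model of a user's headroom in an auto-created leaf queue:
    // roughly queue-capacity * user-limit-factor, capped by the queue's
    // max capacity. All capacities are fractions of the parent's resources.
    static double approxUserLimitGb(double parentGb, double queueCapacity,
                                    double userLimitFactor, double maxCapacity) {
        double guaranteed = parentGb * queueCapacity;
        return Math.min(guaranteed * userLimitFactor, parentGb * maxCapacity);
    }
}
```

With root.dynamic at 8 GB (50% of the 16 GB cluster), a 40% template queue gets min(3.2 GB * 4, 8 GB) = 8 GB of headroom, while the 0% queue gets min(0 * 4, 8 GB) = 0: the user-limit-factor multiplies a zero guarantee, so max-capacity never comes into play.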
[jira] [Comment Edited] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355804#comment-17355804 ] Peter Bacsko edited comment on YARN-10796 at 6/2/21, 3:32 PM: -- [~gandras] this is a valid concern. Question is, do we accept how it worked before and say "yeah, that's another way of working"? Are there clusters built on the fact that a 0% queue cannot scale out properly, despite the max-capacity setting? Honestly, I don't know. Maybe some people got used to the improper behavior and expect it to work that way, which does happen in real life. In my view, even a zero capacity queue should be able to occupy the cluster if nothing else is used, provided max-capacity is set appropriately. So I would not go for a new property. was (Author: pbacsko): [~gandras] this is a valid concern. Question is, do we accept how it worked before and say "yeah, that's another way of working"? Are there clusters built on the fact that a 0% queue cannot scale out properly, despite the max-capacity setting? Honestly, I don't know. Maybe some people got used to the improper behavior and expect it to work that way, which does happen in real life. TIn my view, even a zero capacity queue should be able to occupy the cluster if nothing else is used, provided max-capacity is set appropriately. So I would not go for a new property. > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. 
> Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible.
[jira] [Commented] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355804#comment-17355804 ] Peter Bacsko commented on YARN-10796: - [~gandras] this is a valid concern. Question is, do we accept how it worked before and say "yeah, that's another way of working"? Are there clusters built on the fact that a 0% queue cannot scale out properly, despite the max-capacity setting? Honestly, I don't know. Maybe some people got used to the improper behavior and expect it to work that way, which does happen in real life. That said, even a zero capacity queue should be able to occupy the cluster if nothing else is used, provided max-capacity is set appropriately. So I would not go for a new property. > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. 
> Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible.
[jira] [Comment Edited] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355804#comment-17355804 ] Peter Bacsko edited comment on YARN-10796 at 6/2/21, 3:28 PM: -- [~gandras] this is a valid concern. Question is, do we accept how it worked before and say "yeah, that's another way of working"? Are there clusters built on the fact that a 0% queue cannot scale out properly, despite the max-capacity setting? Honestly, I don't know. Maybe some people got used to the improper behavior and expect it to work that way, which does happen in real life. TIn my view, even a zero capacity queue should be able to occupy the cluster if nothing else is used, provided max-capacity is set appropriately. So I would not go for a new property. was (Author: pbacsko): [~gandras] this is a valid concern. Question is, do we accept how it worked before and say "yeah, that's another way of working"? Are there clusters built on the fact that a 0% queue cannot scale out properly, despite the max-capacity setting? Honestly, I don't know. Maybe some people got used to the improper behavior and expect it to work that way, which does happen in real life. That said, even a zero capacity queue should be able to occupy the cluster if nothing else is used, provided max-capacity is set appropriately. So I would not go for a new property. > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. 
> Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible.
[jira] [Commented] (YARN-10780) Optimise retrieval of configured node labels in CS queues
[ https://issues.apache.org/jira/browse/YARN-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355796#comment-17355796 ] Peter Bacsko commented on YARN-10780: - Ok, I went through the patch. I'm not saying that I have 100% full understanding, but for the most part, I get it. Some comments/questions: 1. {{ConfiguredNodeLabels}}: it has a no-arg constructor, which is used once. But as I can see, that doesn't do anything and the real thing occurs when it's called with a Configuration object (init / reinit). Also, in the constructor CapacitySchedulerQueueManager, we have a conf object, can't we just pass that? 2. In {{AbstractCSQueue}}, we always call {{getConfiguredLabels()}}, I think it's simpler to directly reference the labels object. I can also see that descendant classes reference it. In order to be consistent, you might consider making it protected or package private, every other variable seems to follow this convention. 3. If I understand correctly, {{CapacitySchedulerConfiguration.getConfiguredNodeLabelsByQueue()}} runs once and only once, right? I mean once per init/reinit. 4. Nit: variable "stringStringEntry", can we have a better name for this? Like "configEntry". 5. I'd be a bit more aggressive with immutable Sets. {{getLabelsByQueue()}} should return {{ImmutableSet.of(labels)}}. {{CapacitySchedulerConfiguration.getConfiguredNodeLabels(String)}} always constructs a new set, so that's OK. > Optimise retrieval of configured node labels in CS queues > - > > Key: YARN-10780 > URL: https://issues.apache.org/jira/browse/YARN-10780 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10780.001.patch, YARN-10780.002.patch > > > CapacitySchedulerConfiguration#getConfiguredNodeLabels scales poorly with > respect to queue numbers (its O(n*m), where n is the number of queues and m > is the number of properties set by each queue). 
During CS reinit, the node > labels are often queried, however looking at the code: > {code:java} > for (Entry stringStringEntry : this) { > e = stringStringEntry; > String key = e.getKey(); > if (key.startsWith(getQueuePrefix(queuePath) + ACCESSIBLE_NODE_LABELS > + DOT)) { > // Find in > // .accessible-node-labels..property > int labelStartIdx = > key.indexOf(ACCESSIBLE_NODE_LABELS) > + ACCESSIBLE_NODE_LABELS.length() + 1; > int labelEndIndx = key.indexOf('.', labelStartIdx); > String labelName = key.substring(labelStartIdx, labelEndIndx); > configuredNodeLabels.add(labelName); > } > } > {code} > This method iterates through ALL properties set in the configuration. For > example in case of initialising 2500 queues, each having at least 2 > properties: > 2500 * 5000 ~= over 12 million iterations + additional properties > There are some ways to resolve this issue while keeping backward > compatibility: > # Create a property like the original accessible-node-labels, which contains > predefined labels. If it is set, then getConfiguredNodeLabels gets the value > of this property, otherwise it falls back to the old logic. I think > accessible-node-labels are not used for this purpose (though I have a feeling > that it should have been). > # Collect node labels for all queues at the beginning of parseQueue and only > iterate through the properties once. This will increase the space complexity > in exchange for not requiring intervention from the user's perspective.
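Point 5 of the review above (handing out immutable sets from the label cache) could be sketched as follows. The class and method names are illustrative, not the actual ConfiguredNodeLabels class from the patch, and JDK `Collections.unmodifiableSet` stands in for Guava's `ImmutableSet`:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NodeLabelCache {
    private final Map<String, Set<String>> labelsByQueue = new HashMap<>();

    void addLabel(String queuePath, String label) {
        labelsByQueue.computeIfAbsent(queuePath, q -> new HashSet<>()).add(label);
    }

    // Hand out a defensive, unmodifiable copy so callers cannot mutate
    // the scheduler's internal label state between reinits.
    Set<String> getLabelsByQueue(String queuePath) {
        Set<String> labels =
            labelsByQueue.getOrDefault(queuePath, Collections.emptySet());
        return Collections.unmodifiableSet(new HashSet<>(labels));
    }
}
```

Any attempt to add to or remove from the returned set throws UnsupportedOperationException, which surfaces accidental mutation at the call site instead of silently corrupting shared state.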
[jira] [Commented] (YARN-10780) Optimise retrieval of configured node labels in CS queues
[ https://issues.apache.org/jira/browse/YARN-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355775#comment-17355775 ] Peter Bacsko commented on YARN-10780: - There are a lot of NPE at {{serviceStop()}}, [~gandras] could you check those? In the meantime, I'll review the changes. > Optimise retrieval of configured node labels in CS queues > - > > Key: YARN-10780 > URL: https://issues.apache.org/jira/browse/YARN-10780 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10780.001.patch, YARN-10780.002.patch > > > CapacitySchedulerConfiguration#getConfiguredNodeLabels scales poorly with > respect to queue numbers (its O(n*m), where n is the number of queues and m > is the number of properties set by each queue). During CS reinit, the node > labels are often queried, however looking at the code: > {code:java} > for (Entry stringStringEntry : this) { > e = stringStringEntry; > String key = e.getKey(); > if (key.startsWith(getQueuePrefix(queuePath) + ACCESSIBLE_NODE_LABELS > + DOT)) { > // Find in > // .accessible-node-labels..property > int labelStartIdx = > key.indexOf(ACCESSIBLE_NODE_LABELS) > + ACCESSIBLE_NODE_LABELS.length() + 1; > int labelEndIndx = key.indexOf('.', labelStartIdx); > String labelName = key.substring(labelStartIdx, labelEndIndx); > configuredNodeLabels.add(labelName); > } > } > {code} > This method iterates through ALL properties set in the configuration. For > example in case of initialising 2500 queues, each having at least 2 > properties: > 2500 * 5000 ~= over 12 million iteration + additional properties > There are some ways to resolve this issue while keeping backward > compatibility: > # Create a property like the original accessible-node-labels, which contains > predefined labels. If it is set, then getConfiguredNodeLabels get the value > of this property, otherwise it falls back to the old logic. 
I think > accessible-node-labels are not used for this purpose (though I have a feeling > that it should have been). > # Collect node labels for all queues at the beginning of parseQueue and only > iterate through the properties once. This will increase the space complexity > in exchange for not requiring intervention from the user's perspective.
[jira] [Commented] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355771#comment-17355771 ] Peter Bacsko commented on YARN-10796: - Thanks [~bteke], this makes sense. > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
[jira] [Commented] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355720#comment-17355720 ] Peter Bacsko commented on YARN-10796: - [~bteke], [~gandras], [~snemeth] could you review this please? > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
[jira] [Updated] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10796: Attachment: YARN-10796-002.patch > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
[jira] [Updated] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10796: Issue Type: Bug (was: Task) > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
[jira] [Updated] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10796: Attachment: YARN-10796-001.patch > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Task > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
[jira] [Updated] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10796: Summary: Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0% (was: Capacity Scheduler: dynamic queue cannot scale out properly if it's capacity is 0%) > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Task > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
[jira] [Updated] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if it's capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10796: Description: If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it cannot properly scale even if it's max-capacity and the parent's max-capacity would allow it. Example: {noformat} Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) Container allocation size: 1G / 1 vcore root.dynamic Effective Capacity: ( 50.0%) Effective Max Capacity: (100.0%) Template: Capacity: 40% Max Capacity: 100% User Limit Factor: 4 {noformat} leaf-queue-template.capacity = 40% leaf-queue-template.maximum-capacity = 100% leaf-queue-template.maximum-am-resource-percent = 50% leaf-queue-template.minimum-user-limit-percent =100% leaf-queue-template.user-limit-factor = 4 "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. Let's assume there are running containers in these dynamic queues (MR sleep jobs): root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) This scenario will result in an underutilized cluster. There will be approx 18% unused capacity. On the other hand, it's still possible to submit a new application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% utilization is possible. was: If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it cannot properly scale even if it's max-capacity and the parent's max-capacity would allow it. 
Example: {noformat} Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) Container allocation size: 1G / 1 vcore Root.dynamic Effective Capacity: ( 50.0%) Effective Max Capacity: (100.0%) Template: Capacity: 40% Max Capacity: 100% User Limit Factor: 4 {noformat} leaf-queue-template.capacity = 40% leaf-queue-template.maximum-capacity = 100% leaf-queue-template.maximum-am-resource-percent = 50% leaf-queue-template.minimum-user-limit-percent =100% leaf-queue-template.user-limit-factor = 4 "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. Let's assume there are running containers in these dynamic queues (MR sleep jobs): root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) This scenario will result in an underutilized cluster. There will be approx 18% unused capacity. On the other hand, it's still possible to submit a new application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% utilization is possible. > Capacity Scheduler: dynamic queue cannot scale out properly if it's capacity > is 0% > -- > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Task > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. 
> Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issue
[jira] [Created] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if it's capacity is 0%
Peter Bacsko created YARN-10796: --- Summary: Capacity Scheduler: dynamic queue cannot scale out properly if it's capacity is 0% Key: YARN-10796 URL: https://issues.apache.org/jira/browse/YARN-10796 Project: Hadoop YARN Issue Type: Task Components: capacity scheduler, capacityscheduler Reporter: Peter Bacsko Assignee: Peter Bacsko If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it cannot properly scale even if it's max-capacity and the parent's max-capacity would allow it. Example: {noformat} Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) Container allocation size: 1G / 1 vcore Root.dynamic Effective Capacity: ( 50.0%) Effective Max Capacity: (100.0%) Template: Capacity: 40% Max Capacity: 100% User Limit Factor: 4 {noformat} leaf-queue-template.capacity = 40% leaf-queue-template.maximum-capacity = 100% leaf-queue-template.maximum-am-resource-percent = 50% leaf-queue-template.minimum-user-limit-percent =100% leaf-queue-template.user-limit-factor = 4 "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. Let's assume there are running containers in these dynamic queues (MR sleep jobs): root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) This scenario will result in an underutilized cluster. There will be approx 18% unused capacity. On the other hand, it's still possible to submit a new application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% utilization is possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10771) Add cluster metric for size of SchedulerEventQueue and RMEventQueue
[ https://issues.apache.org/jira/browse/YARN-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349248#comment-17349248 ] Peter Bacsko commented on YARN-10771: - Just one thing: {{import com.google.common.util.concurrent.ThreadFactoryBuilder;}} Use the shaded import starting with "org.apache.thirdparty". > Add cluster metric for size of SchedulerEventQueue and RMEventQueue > --- > > Key: YARN-10771 > URL: https://issues.apache.org/jira/browse/YARN-10771 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: chaosju >Assignee: chaosju >Priority: Major > Attachments: YARN-10763.001.patch, YARN-10771.002.patch, > YARN-10771.003.patch, YARN-10771.004.patch > > > Add cluster metric for size of Scheduler event queue and RM event queue, This > lets us know the load of the RM and convenient monitoring the metrics. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9606) Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient
[ https://issues.apache.org/jira/browse/YARN-9606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349177#comment-17349177 ] Peter Bacsko commented on YARN-9606: [~BilwaST] I can't apply this patch to branch-3.3, even using "git apply \-3" fails. Could you upload a branch-3.3 version? > Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient > -- > > Key: YARN-9606 > URL: https://issues.apache.org/jira/browse/YARN-9606 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9606-001.patch, YARN-9606-002.patch, > YARN-9606.003.patch, YARN-9606.004.patch, YARN-9606.005.patch, > YARN-9606.006.patch, YARN-9606.007.patch, YARN-9606.008.patch > > > Yarn logs fails for running containers > > > {quote} > > > > Unable to fetch log files list > Exception in thread "main" java.io.IOException: > com.sun.jersey.api.client.ClientHandlerException: > javax.net.ssl.SSLHandshakeException: Error while authenticating with > endpoint: > [https://vm2:65321/ws/v1/node/containers/container_e05_1559802125016_0001_01_08/logs] > at > org.apache.hadoop.yarn.client.cli.LogsCLI.getContainerLogFiles(LogsCLI.java:543) > at > org.apache.hadoop.yarn.client.cli.LogsCLI.getMatchedContainerLogFiles(LogsCLI.java:1338) > at > org.apache.hadoop.yarn.client.cli.LogsCLI.getMatchedOptionForRunningApp(LogsCLI.java:1514) > at > org.apache.hadoop.yarn.client.cli.LogsCLI.fetchContainerLogs(LogsCLI.java:1052) > at org.apache.hadoop.yarn.client.cli.LogsCLI.runCommand(LogsCLI.java:367) > at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:152) > at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:399) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349104#comment-17349104 ] Peter Bacsko commented on YARN-10779: - [~snemeth] please review this. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-002.patch, > YARN-10779-003.patch, YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
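The proposed solution in the description can be sketched in plain Java. The property name and the `java.util.Properties`-based lookup below are placeholders for illustration only; they are not the yarn-site.xml key or the Hadoop `Configuration` API that the actual patch introduces:

```java
import java.util.Locale;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the proposal: gate the forced lowercase conversion
// of application tags on a configuration flag that is read once and cached.
public class TagConversionSketch {

    // Placeholder key -- the real property would be defined in YarnConfiguration.
    static final String FORCE_LOWERCASE = "yarn.application.tag.force-lowercase";

    private final boolean forceLowercase; // loaded once, then cached

    TagConversionSketch(Properties conf) {
        // Default "true" preserves the existing lowercase behavior.
        this.forceLowercase = Boolean.parseBoolean(
                conf.getProperty(FORCE_LOWERCASE, "true"));
    }

    Set<String> convertTags(Set<String> tags) {
        Set<String> result = new TreeSet<>();
        for (String tag : tags) {
            result.add(forceLowercase ? tag.toLowerCase(Locale.ROOT) : tag);
        }
        return result;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(FORCE_LOWERCASE, "false");
        // With the flag off, a "userid" tag keeps its original case:
        System.out.println(new TagConversionSketch(conf)
                .convertTags(Set.of("UserId:Alice"))); // [UserId:Alice]
    }
}
```

Defaulting the flag to the old behavior is the key design point: existing clusters see no change unless an operator opts out of the conversion.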
[jira] [Reopened] (YARN-9606) Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient
[ https://issues.apache.org/jira/browse/YARN-9606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reopened YARN-9606: > Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient > -- > > Key: YARN-9606 > URL: https://issues.apache.org/jira/browse/YARN-9606 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9606-001.patch, YARN-9606-002.patch, > YARN-9606.003.patch, YARN-9606.004.patch, YARN-9606.005.patch, > YARN-9606.006.patch, YARN-9606.007.patch, YARN-9606.008.patch > > > Yarn logs fails for running containers > > > {quote} > > > > Unable to fetch log files list > Exception in thread "main" java.io.IOException: > com.sun.jersey.api.client.ClientHandlerException: > javax.net.ssl.SSLHandshakeException: Error while authenticating with > endpoint: > [https://vm2:65321/ws/v1/node/containers/container_e05_1559802125016_0001_01_08/logs] > at > org.apache.hadoop.yarn.client.cli.LogsCLI.getContainerLogFiles(LogsCLI.java:543) > at > org.apache.hadoop.yarn.client.cli.LogsCLI.getMatchedContainerLogFiles(LogsCLI.java:1338) > at > org.apache.hadoop.yarn.client.cli.LogsCLI.getMatchedOptionForRunningApp(LogsCLI.java:1514) > at > org.apache.hadoop.yarn.client.cli.LogsCLI.fetchContainerLogs(LogsCLI.java:1052) > at org.apache.hadoop.yarn.client.cli.LogsCLI.runCommand(LogsCLI.java:367) > at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:152) > at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:399) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349064#comment-17349064 ] Peter Bacsko commented on YARN-10779: - [~gandras] this is a site property, not a CS property. CS re-initialization does not affect YARN as a whole, only CS. It requires to restart Resource Manager. ZK-based activation/deactivation also only refreshes scheduler-based settings, queues, ACLs, supergroup settings, etc. See {{org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive()}}. The site configuration is loaded once in {{ResourceManager.main()}}. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-002.patch, > YARN-10779-003.patch, YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. 
[jira] [Updated] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10779: Attachment: YARN-10779-003.patch > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-002.patch, > YARN-10779-003.patch, YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10779: Attachment: YARN-10779-002.patch > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-002.patch, > YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10505) Extend the maximum-capacity property to support Fair Scheduler migration
[ https://issues.apache.org/jira/browse/YARN-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10505: Description: The property root.users.maximum-capacity could mean the following things: * Parent Percentage: maximum capacity relative to its parent. If it’s set to 50, then it means that the capacity is capped with respect to the parent. * Cluster Percentage: maximum capacity expressed as a percentage of the overall cluster capacity. Note that Fair Scheduler supports the following settings: * Single percentage (cluster) * Two percentages (cluster) * Absolute resources It is recommended that all three formats are supported for maximum-capacity after introducing weight mode. was: The property root.users.maximum-capacity could mean the following things: * Parent Percentage: maximum capacity relative to its parent. If it’s set to 50, then it means that the capacity is capped with respect to the parent. * Cluster Percentage: maximum capacity expressed as a percentage of the overall cluster capacity. Note that Fair Scheduler supports the following settings: * Single percentage (absolute) * Two percentages (absolute) * Absolute resources It is recommended that all three formats are supported for maximum-capacity after introducing weight mode. > Extend the maximum-capacity property to support Fair Scheduler migration > > > Key: YARN-10505 > URL: https://issues.apache.org/jira/browse/YARN-10505 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Priority: Major > > The property root.users.maximum-capacity could mean the following things: > * Parent Percentage: maximum capacity relative to its parent. If it’s set to > 50, then it means that the capacity is capped with respect to the parent. > * Cluster Percentage: maximum capacity expressed as a percentage of the > overall cluster capacity. 
> > Note that Fair Scheduler supports the following settings: > * Single percentage (cluster) > * Two percentages (cluster) > * Absolute resources > > It is recommended that all three formats are supported for maximum-capacity > after introducing weight mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
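For reference, the three Fair Scheduler formats listed above look roughly like this in a fair-scheduler.xml allocation file (queue names and values are made up for illustration; check the Fair Scheduler documentation for the exact accepted syntax):

```xml
<allocations>
  <!-- Single percentage of the cluster -->
  <queue name="users">
    <maxResources>50%</maxResources>
  </queue>
  <!-- Two percentages: memory and cpu as separate shares of the cluster -->
  <queue name="batch">
    <maxResources>40% memory, 80% cpu</maxResources>
  </queue>
  <!-- Absolute resources -->
  <queue name="adhoc">
    <maxResources>8192 mb, 8 vcores</maxResources>
  </queue>
</allocations>
```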
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348213#comment-17348213 ] Peter Bacsko commented on YARN-10779: - TODO: new property needs a "." character after the prefix variable. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348202#comment-17348202 ] Peter Bacsko commented on YARN-10779: - Ok, uploaded patch v1, I forgot to update {{yarn-default.xml}}, that will happen in the next patch. [~gandras] [~snemeth] care to review? > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10779: Attachment: YARN-10779-001.patch > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348147#comment-17348147 ] Peter Bacsko edited comment on YARN-10779 at 5/20/21, 8:34 AM: --- "It is a possibility that Node A sends this PB message to Node B, which means that this configuration must be set on Node B in order to take effect" What you're saying is true, but this applies to the whole Hadoop ecosystem, not just to this particular setting. Imagine having different container executors on different nodes, or security enabled/disabled, that would wreak havoc. Proper configuration management is mandatory and expected. was (Author: pbacsko): "It is a possibility that Node A sends this PB message to Node B, which means that this configuration must be set on Node B in order to take effect" What you're saying is true, but this applies to the whole Hadoop ecosystem, not just to this particular setting. Imagine having different container executors on different nodes, or security enabled/disabled, that would wreak havoc. Proper configuration management is mandatory and assumed. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. 
> Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not.
[jira] [Comment Edited] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348147#comment-17348147 ] Peter Bacsko edited comment on YARN-10779 at 5/20/21, 8:33 AM: --- "It is a possibility that Node A sends this PB message to Node B, which means that this configuration must be set on Node B in order to take effect" What you're saying is true, but this applies to the whole Hadoop ecosystem, not just to this particular setting. Imagine having different container executors on different nodes, or security enabled/disabled, that would wreak havoc. Proper configuration management is mandatory and assumed. was (Author: pbacsko): "It is a possibility that Node A sends this PB message to Node B, which means that this configuration must be set on Node B in order to take effect" What you're saying is true, but this applies to the whole Hadoop ecosystem, not just to this particular setting. Imagine having different container executors on different nodes, or security enabled/disabled, that would wreak havoc. Proper configuration management is a mandatory and assumed. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. 
> Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
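The forced conversion quoted above, together with the proposed configuration switch, can be sketched outside Hadoop as follows. This is an illustration only: the class name and the boolean flag are hypothetical stand-ins for the cached {{Configuration}} property the ticket proposes, not the actual YARN API, and {{StringUtils.toLowerCase}} is mirrored with a {{Locale.ENGLISH}} lower-casing.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/**
 * Hadoop-free sketch of the proposed behavior: application tags are
 * lower-cased only when a (hypothetical) switch is enabled. In the real
 * patch this flag would come from a Configuration object that loads
 * yarn-site.xml and is cached inside the PBImpl class.
 */
class TagConversionSketch {
  // Stand-in for the cached Configuration lookup proposed in the ticket.
  private final boolean forceLowercase;

  TagConversionSketch(boolean forceLowercase) {
    this.forceLowercase = forceLowercase;
  }

  Set<String> setApplicationTags(Set<String> tags) {
    Set<String> applicationTags = new TreeSet<>();
    for (String tag : tags) {
      // Hadoop's StringUtils.toLowerCase uses Locale.ENGLISH; mirrored here.
      applicationTags.add(
          forceLowercase ? tag.toLowerCase(Locale.ENGLISH) : tag);
    }
    return applicationTags;
  }
}
```

With the flag disabled, a "userid" tag keeps its original casing, which is the behavior the ticket asks for.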
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348147#comment-17348147 ] Peter Bacsko commented on YARN-10779: - "It is a possibility that Node A sends this PB message to Node B, which means that this configuration must be set on Node B in order to take effect" What you're saying is true, but this applies to the whole Hadoop ecosystem, not just to this particular setting. Imagine having different container executors on different nodes, or security enabled/disabled, that would wreak havoc. Proper configuration management is mandatory and assumed. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348090#comment-17348090 ] Peter Bacsko commented on YARN-10779: - [~gandras] these classes are locally instantiated adapter classes that mostly use generated ProtoBuf classes as an input. They are not sent over the wire. We don't directly work on ProtoBuf classes, but on these PBImpl classes, which are the implementations of the non-PBImpl classes like {{ApplicationSubmissionContext}}. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10725) Backport YARN-10120 to branch-3.3
[ https://issues.apache.org/jira/browse/YARN-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347659#comment-17347659 ] Peter Bacsko commented on YARN-10725: - Ok, +1 committed to branch-3.3. > Backport YARN-10120 to branch-3.3 > - > > Key: YARN-10725 > URL: https://issues.apache.org/jira/browse/YARN-10725 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10120-branch-3.3.patch, > YARN-10725-branch-3.3.patch, YARN-10725-branch-3.3.v2.patch, > YARN-10725-branch-3.3.v3.patch, YARN-10725-branch-3.3.v4.patch, > YARN-10725-branch-3.3.v5.patch, image-2021-04-05-16-48-57-034.png, > image-2021-04-05-16-50-55-238.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10725) Backport YARN-10120 to branch-3.3
[ https://issues.apache.org/jira/browse/YARN-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10725: Fix Version/s: 3.3.1 > Backport YARN-10120 to branch-3.3 > - > > Key: YARN-10725 > URL: https://issues.apache.org/jira/browse/YARN-10725 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Fix For: 3.3.1 > > Attachments: YARN-10120-branch-3.3.patch, > YARN-10725-branch-3.3.patch, YARN-10725-branch-3.3.v2.patch, > YARN-10725-branch-3.3.v3.patch, YARN-10725-branch-3.3.v4.patch, > YARN-10725-branch-3.3.v5.patch, image-2021-04-05-16-48-57-034.png, > image-2021-04-05-16-50-55-238.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10725) Backport YARN-10120 to branch-3.3
[ https://issues.apache.org/jira/browse/YARN-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347632#comment-17347632 ] Peter Bacsko commented on YARN-10725: - No, I'll fix during commit. Just one more thing: this is a backport, but what should be the commit message? The original? Or "YARN-10725. Backport YARN-10120 to branch-3.3". I think usually we keep the original commit message unless some heavy changes were necessary during the backport (too many conflicts, whatever). > Backport YARN-10120 to branch-3.3 > - > > Key: YARN-10725 > URL: https://issues.apache.org/jira/browse/YARN-10725 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10120-branch-3.3.patch, > YARN-10725-branch-3.3.patch, YARN-10725-branch-3.3.v2.patch, > YARN-10725-branch-3.3.v3.patch, YARN-10725-branch-3.3.v4.patch, > YARN-10725-branch-3.3.v5.patch, image-2021-04-05-16-48-57-034.png, > image-2021-04-05-16-50-55-238.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10725) Backport YARN-10120 to branch-3.3
[ https://issues.apache.org/jira/browse/YARN-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347625#comment-17347625 ] Peter Bacsko commented on YARN-10725: - [~BilwaST] so you're saying that v5 is good to go? I don't have too much context, if everything is fine, I can commit it. > Backport YARN-10120 to branch-3.3 > - > > Key: YARN-10725 > URL: https://issues.apache.org/jira/browse/YARN-10725 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10120-branch-3.3.patch, > YARN-10725-branch-3.3.patch, YARN-10725-branch-3.3.v2.patch, > YARN-10725-branch-3.3.v3.patch, YARN-10725-branch-3.3.v4.patch, > YARN-10725-branch-3.3.v5.patch, image-2021-04-05-16-48-57-034.png, > image-2021-04-05-16-50-55-238.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347617#comment-17347617 ] Peter Bacsko commented on YARN-10779: - [~BilwaST] [~jhung] you recently made modifications around here, do you think it's a viable approach? cc [~sunilg] [~snemeth]. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347481#comment-17347481 ] Peter Bacsko commented on YARN-10779: - Uploaded a POC without tests. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10779: Summary: Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl (was: Add option to disable lowecase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl) > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10779) Add option to disable lowecase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10779: Attachment: YARN-10779-POC.patch > Add option to disable lowecase conversion in GetApplicationsRequestPBImpl and > ApplicationSubmissionContextPBImpl > > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10779) Add option to disable lowecase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
Peter Bacsko created YARN-10779: --- Summary: Add option to disable lowecase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl Key: YARN-10779 URL: https://issues.apache.org/jira/browse/YARN-10779 Project: Hadoop YARN Issue Type: Task Components: resourcemanager Reporter: Peter Bacsko Assignee: Peter Bacsko In both {{GetApplicationsRequestPBImpl}} and {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase conversion: {noformat} checkTags(tags); // Convert applicationTags to lower case and add this.applicationTags = new TreeSet<>(); for (String tag : tags) { this.applicationTags.add(StringUtils.toLowerCase(tag)); } } {noformat} However, we encountered some cases where this is not desirable for "userid" tags. Proposed solution: since both classes are pretty low-level and can be often instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should be cached inside them. A new property should be created which tells whether lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10258) Add metrics for 'ApplicationsRunning' in NodeManager
[ https://issues.apache.org/jira/browse/YARN-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347419#comment-17347419 ] Peter Bacsko commented on YARN-10258: - Backported to branch-3.3. > Add metrics for 'ApplicationsRunning' in NodeManager > > > Key: YARN-10258 > URL: https://issues.apache.org/jira/browse/YARN-10258 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.3 >Reporter: ANANDA G B >Assignee: ANANDA G B >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10258-001.patch, YARN-10258-002.patch, > YARN-10258-003.patch, YARN-10258-005.patch, YARN-10258-006.patch, > YARN-10258-007.patch, YARN-10258-008.patch, YARN-10258-009.patch, > YARN-10258-010.patch, YARN-10258-branch-3.3.001.patch, YARN-10258_004.patch > > > Add metrics for 'ApplicationsRunning' in NodeManagers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10258) Add metrics for 'ApplicationsRunning' in NodeManager
[ https://issues.apache.org/jira/browse/YARN-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10258: Fix Version/s: 3.3.1 > Add metrics for 'ApplicationsRunning' in NodeManager > > > Key: YARN-10258 > URL: https://issues.apache.org/jira/browse/YARN-10258 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.3 >Reporter: ANANDA G B >Assignee: ANANDA G B >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10258-001.patch, YARN-10258-002.patch, > YARN-10258-003.patch, YARN-10258-005.patch, YARN-10258-006.patch, > YARN-10258-007.patch, YARN-10258-008.patch, YARN-10258-009.patch, > YARN-10258-010.patch, YARN-10258-branch-3.3.001.patch, YARN-10258_004.patch > > > Add metrics for 'ApplicationsRunning' in NodeManagers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10258) Add metrics for 'ApplicationsRunning' in NodeManager
[ https://issues.apache.org/jira/browse/YARN-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10258: Attachment: YARN-10258-branch-3.3.001.patch > Add metrics for 'ApplicationsRunning' in NodeManager > > > Key: YARN-10258 > URL: https://issues.apache.org/jira/browse/YARN-10258 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.3 >Reporter: ANANDA G B >Assignee: ANANDA G B >Priority: Minor > Fix For: 3.4.0 > > Attachments: YARN-10258-001.patch, YARN-10258-002.patch, > YARN-10258-003.patch, YARN-10258-005.patch, YARN-10258-006.patch, > YARN-10258-007.patch, YARN-10258-008.patch, YARN-10258-009.patch, > YARN-10258-010.patch, YARN-10258-branch-3.3.001.patch, YARN-10258_004.patch > > > Add metrics for 'ApplicationsRunning' in NodeManagers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-10258) Add metrics for 'ApplicationsRunning' in NodeManager
[ https://issues.apache.org/jira/browse/YARN-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reopened YARN-10258: - > Add metrics for 'ApplicationsRunning' in NodeManager > > > Key: YARN-10258 > URL: https://issues.apache.org/jira/browse/YARN-10258 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.3 >Reporter: ANANDA G B >Assignee: ANANDA G B >Priority: Minor > Fix For: 3.4.0 > > Attachments: YARN-10258-001.patch, YARN-10258-002.patch, > YARN-10258-003.patch, YARN-10258-005.patch, YARN-10258-006.patch, > YARN-10258-007.patch, YARN-10258-008.patch, YARN-10258-009.patch, > YARN-10258-010.patch, YARN-10258_004.patch > > > Add metrics for 'ApplicationsRunning' in NodeManagers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10258) Add metrics for 'ApplicationsRunning' in NodeManager
[ https://issues.apache.org/jira/browse/YARN-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346396#comment-17346396 ] Peter Bacsko commented on YARN-10258: - +1 Thanks for the patch [~gb.ana...@gmail.com] and [~BilwaST] / [~zhuqi] for the review, committed to trunk. > Add metrics for 'ApplicationsRunning' in NodeManager > > > Key: YARN-10258 > URL: https://issues.apache.org/jira/browse/YARN-10258 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.3 >Reporter: ANANDA G B >Assignee: ANANDA G B >Priority: Minor > Attachments: YARN-10258-001.patch, YARN-10258-002.patch, > YARN-10258-003.patch, YARN-10258-005.patch, YARN-10258-006.patch, > YARN-10258-007.patch, YARN-10258-008.patch, YARN-10258-009.patch, > YARN-10258-010.patch, YARN-10258_004.patch > > > Add metrics for 'ApplicationsRunning' in NodeManagers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10258) Add metrics for 'ApplicationsRunning' in NodeManager
[ https://issues.apache.org/jira/browse/YARN-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346154#comment-17346154 ] Peter Bacsko commented on YARN-10258: - [~BilwaST] yes, I'll check this out. > Add metrics for 'ApplicationsRunning' in NodeManager > > > Key: YARN-10258 > URL: https://issues.apache.org/jira/browse/YARN-10258 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.3 >Reporter: ANANDA G B >Assignee: ANANDA G B >Priority: Minor > Attachments: YARN-10258-001.patch, YARN-10258-002.patch, > YARN-10258-003.patch, YARN-10258-005.patch, YARN-10258-006.patch, > YARN-10258-007.patch, YARN-10258-008.patch, YARN-10258-009.patch, > YARN-10258-010.patch, YARN-10258_004.patch > > > Add metrics for 'ApplicationsRunning' in NodeManagers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10763) Add the number of containers assigned per second metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346098#comment-17346098 ] Peter Bacsko commented on YARN-10763: - +1 Thanks [~chaosju] for the patch and [~zhuqi] for the review. Committed to trunk. > Add the number of containers assigned per second metrics to ClusterMetrics > --- > > Key: YARN-10763 > URL: https://issues.apache.org/jira/browse/YARN-10763 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: chaosju >Assignee: chaosju >Priority: Minor > Attachments: YARN-10763.001.patch, YARN-10763.002.patch, > YARN-10763.003.patch, YARN-10763.004.patch, YARN-10763.005.patch, > YARN-10763.006.patch, YARN-10763.007.patch, YARN-10763.008.patch, > screenshot-1.png > > > It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for > measuring cluster throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10763) Add the number of containers assigned per second metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10763: Summary: Add the number of containers assigned per second metrics to ClusterMetrics (was: add the speed of containers assigned metrics to ClusterMetrics) > Add the number of containers assigned per second metrics to ClusterMetrics > --- > > Key: YARN-10763 > URL: https://issues.apache.org/jira/browse/YARN-10763 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: chaosju >Assignee: chaosju >Priority: Minor > Attachments: YARN-10763.001.patch, YARN-10763.002.patch, > YARN-10763.003.patch, YARN-10763.004.patch, YARN-10763.005.patch, > YARN-10763.006.patch, YARN-10763.007.patch, YARN-10763.008.patch, > screenshot-1.png > > > It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for > measuring cluster throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344626#comment-17344626 ] Peter Bacsko commented on YARN-9698: The number of subtasks under this JIRA just keeps growing. The converter called fs2cs is mostly complete. It's not perfect, but it's working. Although new additions are constantly coming, I don't see the point of keeping this particular ticket open, otherwise it will never be closed. I suggest creating a new "Phase II" JIRA and moving the current subtasks under that. Then we can mark this as Fix Version = 3.4.0. [~gandras], [~snemeth], [~zhuqi] opinions? > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > > > Key: YARN-9698 > URL: https://issues.apache.org/jira/browse/YARN-9698 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Weiwei Yang >Priority: Major > Labels: fs2cs > Attachments: FS-CS Migration.pdf > > > We see some users want to migrate from Fair Scheduler to Capacity Scheduler, > this Jira is created as an umbrella to track all related efforts for the > migration, the scope contains > * Bug fixes > * Add missing features > * Migration tools that help to generate CS configs based on FS, validate > configs etc > * Documents > this is part of CS component, the purpose is to make the migration process > smooth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10759) Encapsulate queue config modes
[ https://issues.apache.org/jira/browse/YARN-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344573#comment-17344573 ] Peter Bacsko commented on YARN-10759: - Thanks [~gandras] for the patch. I just have one thing to note. I can see that {{allowZeroCapacitySum}} has been moved to {{AbstractCSQueue}}, although it's really something which is meant for {{ParentQueue}}. I assume this is because the new code is easier to read and no type checks and casts are necessary. Is that correct? I'm wondering if this can cause problems. Because right now, this logic only runs inside {{ParentQueue}}: {noformat} // We also allow children's percent sum = 0 under the following // conditions // - Parent uses weight mode // - Parent uses percent mode, and parent has // (capacity=0 OR allowZero) if (parentCapacityType == QueueCapacityType.PERCENT) { if ((Math.abs(queueCapacities.getCapacity(nodeLabel)) > PRECISION) && (!allowZeroCapacitySum)) { throw new IOException( "Illegal" + " capacity sum of " + childrenPctSum + " for children of queue " + queueName + " for label=" + nodeLabel + ". It is set to 0, but parent percent != 0, and " + "doesn't allow children capacity to set to 0"); } } } {noformat} But after this refactor, leaf queues will have this property too with it being set to "false". Although there are no unit test failures, we need to double check if this extra boolean flag on leafs can have any impact on the existing code.
> Encapsulate queue config modes > -- > > Key: YARN-10759 > URL: https://issues.apache.org/jira/browse/YARN-10759 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10759.001.patch, YARN-10759.002.patch, > YARN-10759.003.patch, YARN-10759.004.patch > > > Capacity Scheduler queues have three modes: > * relative/percentage > * weight > * absolute > Most of them have their own: > * validation logic > * config setting logic > * effective capacity calculation logic > These logics can be easily extracted and encapsulated in separate config mode > classes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
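The concern above can be boiled down to a minimal, Hadoop-free sketch. All names here are simplified stand-ins, not the real Capacity Scheduler classes: the point is that once {{allowZeroCapacitySum}} lives on the common base class, leaf queues carry the flag too, even though only the parent-queue percent-mode validation ever reads it.

```java
import java.io.IOException;

// Simplified stand-in for AbstractCSQueue: the flag now exists on every
// queue and defaults to false, including on leaf queues.
abstract class QueueSketch {
  protected boolean allowZeroCapacitySum = false;
}

// Simplified stand-in for a leaf queue: it carries the flag but never
// consults it; this is exactly the situation to double-check.
class LeafQueueSketch extends QueueSketch {
}

// Simplified stand-in for ParentQueue, the only place the flag is read.
class ParentQueueSketch extends QueueSketch {
  static final double PRECISION = 1e-6;

  /**
   * Mirrors the percent-mode check quoted above: a zero children sum is
   * only legal if the parent capacity is also 0 or the flag allows it.
   */
  void validateChildrenPctSum(double parentCapacity, double childrenPctSum)
      throws IOException {
    if (Math.abs(childrenPctSum) < PRECISION
        && Math.abs(parentCapacity) > PRECISION
        && !allowZeroCapacitySum) {
      throw new IOException("Illegal capacity sum of " + childrenPctSum
          + ": parent percent != 0 and zero children capacity not allowed");
    }
  }
}
```

Since the validation is only invoked from the parent-queue path, the unused default-false flag on leaves should be behavior-neutral, which is what the review asks to verify.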
[jira] [Comment Edited] (YARN-10759) Encapsulate queue config modes
[ https://issues.apache.org/jira/browse/YARN-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344573#comment-17344573 ] Peter Bacsko edited comment on YARN-10759 at 5/14/21, 12:54 PM: Thanks [~gandras] for the patch. I just have one thing to note. I can see that {{allowZeroCapacitySum}} has been moved to {{AbstractCSQueue}}, although it's really something which is meant for {{ParentQueue}}. I assume this is because the new code is easier to read and no type checks and casts are necessary. Is that correct? I'm wondering if this can cause problems. Because right now, this logic only runs inside {{ParentQueue}}: {noformat} // We also allow children's percent sum = 0 under the following // conditions // - Parent uses weight mode // - Parent uses percent mode, and parent has // (capacity=0 OR allowZero) if (parentCapacityType == QueueCapacityType.PERCENT) { if ((Math.abs(queueCapacities.getCapacity(nodeLabel)) > PRECISION) && (!allowZeroCapacitySum)) { throw new IOException( "Illegal" + " capacity sum of " + childrenPctSum + " for children of queue " + queueName + " for label=" + nodeLabel + ". It is set to 0, but parent percent != 0, and " + "doesn't allow children capacity to set to 0"); } } } {noformat} But after this refactor, leaf queues will have this property too with it being set to "false". Although there are no unit test failures, we need to double check if this extra boolean flag on leafs can have any impact on the existing code. was (Author: pbacsko): Thanks [~gandras] for the patch. I just have one to note. I can see that {{allowZeroCapacitySum}} has been moved to {{AbstractCSQueue}}, although it's really something which is meant for {{ParentQueue}}. I assume this is because the new code is easier to read and no type checks and casts are necessary. Is that correct? I'm wondering if this can cause problems. 
Because right now, this logic only runs inside {{ParentQueue}}: {noformat} // We also allow children's percent sum = 0 under the following // conditions // - Parent uses weight mode // - Parent uses percent mode, and parent has // (capacity=0 OR allowZero) if (parentCapacityType == QueueCapacityType.PERCENT) { if ((Math.abs(queueCapacities.getCapacity(nodeLabel)) > PRECISION) && (!allowZeroCapacitySum)) { throw new IOException( "Illegal" + " capacity sum of " + childrenPctSum + " for children of queue " + queueName + " for label=" + nodeLabel + ". It is set to 0, but parent percent != 0, and " + "doesn't allow children capacity to set to 0"); } } } {noformat} But after this refactor, leaf queues will have this property too with it being set to "false". Although there are no unit test failures, we need to double check if this extra boolean flag on leafs can have any impact on the existing code. > Encapsulate queue config modes > -- > > Key: YARN-10759 > URL: https://issues.apache.org/jira/browse/YARN-10759 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10759.001.patch, YARN-10759.002.patch, > YARN-10759.003.patch, YARN-10759.004.patch > > > Capacity Scheduler queues have three modes: > * relative/percentage > * weight > * absolute > Most of them have their own: > * validation logic > * config setting logic > * effective capacity calculation logic > These logics can be easily extracted and encapsulated in separate config mode > clas
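The encapsulation the ticket proposes can be sketched roughly as follows. This is a hypothetical illustration, not the actual patch: the names QueueCapacityConfigMode, PercentageCapacityMode and WeightCapacityMode are assumptions, and only the percent-mode zero-sum check quoted in the comment above is modeled.

```java
import java.io.IOException;

// Hypothetical sketch: each config mode encapsulates its own validation,
// so the queue classes need no mode-specific branches or casts.
interface QueueCapacityConfigMode {
    // Validates the capacity sum of a parent's children for one node label.
    void validateChildrenSum(float childrenSum, float parentCapacity,
                             boolean allowZeroCapacitySum) throws IOException;
}

class PercentageCapacityMode implements QueueCapacityConfigMode {
    private static final float PRECISION = 1e-5f;

    @Override
    public void validateChildrenSum(float childrenSum, float parentCapacity,
                                    boolean allowZeroCapacitySum)
            throws IOException {
        // Mirrors the quoted check: children may sum to 0 only if the
        // parent's capacity is 0 or zero sums are explicitly allowed.
        if (Math.abs(childrenSum) < PRECISION
                && Math.abs(parentCapacity) > PRECISION
                && !allowZeroCapacitySum) {
            throw new IOException("Illegal capacity sum of 0 for children"
                + " while parent percent != 0 and zero sums are not allowed");
        }
    }
}

class WeightCapacityMode implements QueueCapacityConfigMode {
    @Override
    public void validateChildrenSum(float childrenSum, float parentCapacity,
                                    boolean allowZeroCapacitySum) {
        // Weight mode always allows a zero sum, so there is nothing to check.
    }
}
```

With this shape, allowZeroCapacitySum stays a parameter of the parent-side validation instead of becoming state on every queue, which sidesteps the leaf-queue concern raised in the comment.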
[jira] [Commented] (YARN-10763) add the speed of containers assigned metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344518#comment-17344518 ] Peter Bacsko commented on YARN-10763: - Thanks [~chaosju] just a final update, very minor things: 1. "Containers assigned in last second" --> missing "the": "Containers assigned in *the* last second" 2. Comment is not necessary, purpose of the executor is trivial: {noformat} /** * The executor service that count containers assigned in last second. * */ {noformat} 3. Nit: space after if {noformat} if(INSTANCE != null && INSTANCE.getAssignCounterExecutor() != null) { INSTANCE.getAssignCounterExecutor().shutdownNow(); } {noformat} I have no further comments. > add the speed of containers assigned metrics to ClusterMetrics > --- > > Key: YARN-10763 > URL: https://issues.apache.org/jira/browse/YARN-10763 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: chaosju >Assignee: chaosju >Priority: Minor > Attachments: YARN-10763.001.patch, YARN-10763.002.patch, > YARN-10763.003.patch, YARN-10763.004.patch, YARN-10763.005.patch, > YARN-10763.006.patch, YARN-10763.007.patch, screenshot-1.png > > > It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for > measuring cluster throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
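Putting the review rounds in this thread together, the counter design could be sketched like this. All names here (AssignmentRateMetric, assignCounterExecutor, ContainerAssignmentCounterThread) are assumptions modeled on the review suggestions, not the actual YARN-10763 patch: an AtomicInteger counts assignments, and a single-threaded scheduled executor samples and resets it once per second with getAndSet(0).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a "containers assigned per second" metric.
class AssignmentRateMetric {
    private final AtomicInteger numContainersAssigned = new AtomicInteger(0);
    private volatile int numContainersAssignedPerSecond = 0;
    private final ScheduledExecutorService assignCounterExecutor =
        Executors.newSingleThreadScheduledExecutor(
            r -> new Thread(r, "ContainerAssignmentCounterThread"));

    AssignmentRateMetric() {
        // Once per second, publish the count and reset it atomically.
        assignCounterExecutor.scheduleAtFixedRate(
            () -> numContainersAssignedPerSecond =
                      numContainersAssigned.getAndSet(0),
            1, 1, TimeUnit.SECONDS);
    }

    void incrNumContainerAssigned() {
        numContainersAssigned.incrementAndGet();
    }

    int getNumContainersAssignedPerSecond() {
        return numContainersAssignedPerSecond;
    }

    void shutdown() {
        // shutdownNow() stops the sampling thread immediately.
        assignCounterExecutor.shutdownNow();
    }
}
```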
[jira] [Comment Edited] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343250#comment-17343250 ] Peter Bacsko edited comment on YARN-10571 at 5/12/21, 1:36 PM: --- Thanks [~gandras], committed to trunk. Also thanks [~shuzirra], [~zhuqi] for the review. was (Author: pbacsko): Thanks [~gandras], committed to trunk. > Refactor dynamic queue handling logic > - > > Key: YARN-10571 > URL: https://issues.apache.org/jira/browse/YARN-10571 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10571.001.patch, YARN-10571.002.patch, > YARN-10571.003.patch, YARN-10571.004.patch > > > As per YARN-10506 we have introduced an other mode for auto queue creation > and a new class, which handles it. We should move the old, managed queue > related logic to CSAutoQueueHandler as well, and do additional cleanup > regarding queue management. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343250#comment-17343250 ] Peter Bacsko commented on YARN-10571: - Thanks [~gandras], committed to trunk. > Refactor dynamic queue handling logic > - > > Key: YARN-10571 > URL: https://issues.apache.org/jira/browse/YARN-10571 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10571.001.patch, YARN-10571.002.patch, > YARN-10571.003.patch, YARN-10571.004.patch > > > As per YARN-10506 we have introduced an other mode for auto queue creation > and a new class, which handles it. We should move the old, managed queue > related logic to CSAutoQueueHandler as well, and do additional cleanup > regarding queue management. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343227#comment-17343227 ] Peter Bacsko commented on YARN-10571: - +1 LGTM > Refactor dynamic queue handling logic > - > > Key: YARN-10571 > URL: https://issues.apache.org/jira/browse/YARN-10571 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10571.001.patch, YARN-10571.002.patch, > YARN-10571.003.patch, YARN-10571.004.patch > > > As per YARN-10506 we have introduced an other mode for auto queue creation > and a new class, which handles it. We should move the old, managed queue > related logic to CSAutoQueueHandler as well, and do additional cleanup > regarding queue management. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10763) add the speed of containers assigned metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343152#comment-17343152 ] Peter Bacsko commented on YARN-10763: - Some further comments: 1. {{private volatile AtomicLong numContainersAssigned = new AtomicLong(0);}} - this doesn't have to be "volatile". 2. "containersAssignedCounter" should be renamed to something else. If this is an executor, then use the name "assignCounterExecutor" or something like that. 3. Also, "containersAssignedCounter" should not be a static field. It's part of an object which just happens to be a singleton. 4. Initialize "containersAssignedCounter" in the constructor. 5. Use something else for the thread name instead of "ContainersAssigned_Counter", for example, "ContainerAssignmentCounterThread". 6. "numContainersAssignedLast" should be int and not long. Therefore, you don't need the type cast at {{numContainerAssignedPerSecond.set((int)n);}}. 7. Also, it would be best to reset the counter "numContainersAssigned" to 0. Now it's an AtomicLong and I think it's unlikely to overflow, but if we're just counting in an interval, it makes sense to reset it. Also, it does not have to be long, just int. So this is preferred: {{numContainersAssignedLast = numContainersAssigned.getAndSet(0)}}. 8. {{containersAssignedCounter.shutdown();}} --> use {{shutdownNow()}} to immediately stop the executor. 9. The test case: {noformat} @Test public void testClusterMetrics() throws Exception { assert(metrics != null); Assert.assertTrue(!metrics.numContainerAssignedPerSecond.changed()); metrics.incrNumContainerAssigned(); Thread.sleep(2000); Assert.assertEquals(metrics.getnumContainerAssignedPerSecond(), 1L); } {noformat} The first assertion is the default Java "assert", which only takes effect with the "-ea" command line switch. Just remove it. 
Also, instead of a fixed {{Thread.sleep()}}, use: {noformat} GenericTestUtils.waitFor(new Supplier<Boolean>() { @Override public Boolean get() { return metrics.getnumContainerAssignedPerSecond() == 1; } }, 500, 5000); {noformat} > add the speed of containers assigned metrics to ClusterMetrics > --- > > Key: YARN-10763 > URL: https://issues.apache.org/jira/browse/YARN-10763 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: chaosju >Assignee: chaosju >Priority: Minor > Attachments: YARN-10763.001.patch, YARN-10763.002.patch, > YARN-10763.003.patch, YARN-10763.004.patch, screenshot-1.png > > > It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for > measuring cluster throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
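The polling suggestion can be illustrated with a minimal stand-in. This is a simplified assumption -- Hadoop's real GenericTestUtils.waitFor has the same shape but richer error handling -- shown to make the point that polling beats a fixed Thread.sleep(): the test finishes as soon as the condition holds, and fails loudly on timeout instead of asserting too early.

```java
import java.util.function.Supplier;

// Minimal stand-in for a waitFor-style test helper (assumed API, not
// Hadoop's actual class): poll a condition every intervalMs, give up
// after timeoutMs.
final class WaitUtil {
    static void waitFor(Supplier<Boolean> check, long intervalMs,
                        long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!check.get()) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException(
                    "Timed out waiting for condition");
            }
            Thread.sleep(intervalMs);
        }
    }
}
```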
[jira] [Commented] (YARN-10763) add the speed of containers assigned metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343139#comment-17343139 ] Peter Bacsko commented on YARN-10763: - [~chaosju] I added you to the list of contributors so you can assign the JIRA to yourself. Are the unit test failures related? The number of failed tests looks huge. > add the speed of containers assigned metrics to ClusterMetrics > --- > > Key: YARN-10763 > URL: https://issues.apache.org/jira/browse/YARN-10763 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: chaosju >Assignee: chaosju >Priority: Minor > Attachments: YARN-10763.001.patch, YARN-10763.002.patch, > YARN-10763.003.patch, YARN-10763.004.patch, screenshot-1.png > > > It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for > measuring cluster throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10763) add the speed of containers assigned metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reassigned YARN-10763: --- Assignee: chaosju > add the speed of containers assigned metrics to ClusterMetrics > --- > > Key: YARN-10763 > URL: https://issues.apache.org/jira/browse/YARN-10763 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: chaosju >Assignee: chaosju >Priority: Minor > Attachments: YARN-10763.001.patch, YARN-10763.002.patch, > YARN-10763.003.patch, YARN-10763.004.patch, screenshot-1.png > > > It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for > measuring cluster throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9615: --- Fix Version/s: 3.3.1 > Add dispatcher metrics to RM > > > Key: YARN-9615 > URL: https://issues.apache.org/jira/browse/YARN-9615 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-9615-branch-3.3-001.patch, YARN-9615.001.patch, > YARN-9615.002.patch, YARN-9615.003.patch, YARN-9615.004.patch, > YARN-9615.005.patch, YARN-9615.006.patch, YARN-9615.007.patch, > YARN-9615.008.patch, YARN-9615.009.patch, YARN-9615.010.patch, > YARN-9615.011.patch, YARN-9615.011.patch, YARN-9615.poc.patch, > image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, > screenshot-1.png > > > It'd be good to have counts/processing times for each event type in RM async > dispatcher and scheduler async dispatcher. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342768#comment-17342768 ] Peter Bacsko commented on YARN-9615: Committed to branch-3.3. > Add dispatcher metrics to RM > > > Key: YARN-9615 > URL: https://issues.apache.org/jira/browse/YARN-9615 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-9615-branch-3.3-001.patch, YARN-9615.001.patch, > YARN-9615.002.patch, YARN-9615.003.patch, YARN-9615.004.patch, > YARN-9615.005.patch, YARN-9615.006.patch, YARN-9615.007.patch, > YARN-9615.008.patch, YARN-9615.009.patch, YARN-9615.010.patch, > YARN-9615.011.patch, YARN-9615.011.patch, YARN-9615.poc.patch, > image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, > screenshot-1.png > > > It'd be good to have counts/processing times for each event type in RM async > dispatcher and scheduler async dispatcher. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342764#comment-17342764 ] Peter Bacsko commented on YARN-9615: OK, the raw type stuff can be ignored, this is the same on trunk, {{EventTypeMetrics}} is not parameterized. > Add dispatcher metrics to RM > > > Key: YARN-9615 > URL: https://issues.apache.org/jira/browse/YARN-9615 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9615-branch-3.3-001.patch, YARN-9615.001.patch, > YARN-9615.002.patch, YARN-9615.003.patch, YARN-9615.004.patch, > YARN-9615.005.patch, YARN-9615.006.patch, YARN-9615.007.patch, > YARN-9615.008.patch, YARN-9615.009.patch, YARN-9615.010.patch, > YARN-9615.011.patch, YARN-9615.011.patch, YARN-9615.poc.patch, > image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, > screenshot-1.png > > > It'd be good to have counts/processing times for each event type in RM async > dispatcher and scheduler async dispatcher. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10505) Extend the maximum-capacity property to support Fair Scheduler migration
[ https://issues.apache.org/jira/browse/YARN-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342641#comment-17342641 ] Peter Bacsko commented on YARN-10505: - I would change the naming a bit. "Relative Percentage" -> "Parent Percentage" "Absolute Percentage" -> "Cluster Percentage" I think it's much clearer. Absolute/relative in this context it's hard to grasp or even sounds like an oxymoron. > Extend the maximum-capacity property to support Fair Scheduler migration > > > Key: YARN-10505 > URL: https://issues.apache.org/jira/browse/YARN-10505 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Priority: Major > > The property root.users.maximum-capacity could mean the following things: > * Relative Percentage: maximum capacity relative to its parent. If it’s set > to 50, then it means that the capacity is capped with respect to the parent. > * Absolute Percentage: maximum capacity expressed as a percentage of the > overall cluster capacity. > > Note that Fair Scheduler supports the following settings: > * Single percentage (absolute) > * Two percentages (absolute) > * Absolute resources > > It is recommended that all three formats are supported for maximum-capacity > after introducing weight mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10505) Extend the maximum-capacity property to support Fair Scheduler migration
[ https://issues.apache.org/jira/browse/YARN-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342641#comment-17342641 ] Peter Bacsko edited comment on YARN-10505 at 5/11/21, 3:14 PM: --- I would change the naming a bit. "Relative Percentage" -> "Parent Percentage" "Absolute Percentage" -> "Cluster Percentage" I think it's much clearer. Absolute/relative in this context is hard to grasp or even sounds like an oxymoron. was (Author: pbacsko): I would change the naming a bit. "Relative Percentage" -> "Parent Percentage" "Absolute Percentage" -> "Cluster Percentage" I think it's much clearer. Absolute/relative in this context it's hard to grasp or even sounds like an oxymoron. > Extend the maximum-capacity property to support Fair Scheduler migration > > > Key: YARN-10505 > URL: https://issues.apache.org/jira/browse/YARN-10505 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Priority: Major > > The property root.users.maximum-capacity could mean the following things: > * Relative Percentage: maximum capacity relative to its parent. If it’s set > to 50, then it means that the capacity is capped with respect to the parent. > * Absolute Percentage: maximum capacity expressed as a percentage of the > overall cluster capacity. > > Note that Fair Scheduler supports the following settings: > * Single percentage (absolute) > * Two percentages (absolute) > * Absolute resources > > It is recommended that all three formats are supported for maximum-capacity > after introducing weight mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995
[ https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342555#comment-17342555 ] Peter Bacsko commented on YARN-10642: - Committed to branch-3.1 too. Closing. > Race condition: AsyncDispatcher can get stuck by the changes introduced in > YARN-8995 > > > Key: YARN-10642 > URL: https://issues.apache.org/jira/browse/YARN-10642 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: zhengchenyu >Assignee: zhengchenyu >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: MockForDeadLoop.java, YARN-10642-branch-3.1.001.patch, > YARN-10642-branch-3.2.001.patch, YARN-10642-branch-3.2.002.patch, > YARN-10642-branch-3.3.001.patch, YARN-10642.001.patch, YARN-10642.002.patch, > YARN-10642.003.patch, YARN-10642.004.patch, YARN-10642.005.patch, > deadloop.png, debugfornode.png, put.png, take.png > > > In our cluster, the ResourceManager got stuck twice within twenty days and the YARN client > couldn't submit applications. I captured jstack output the second time, which revealed the > reason. > Analyzing all the jstack dumps, I found many threads blocked because they could not acquire > LinkedBlockingQueue.putLock. (Note: for reasons of space, the > analytical process is omitted.) > The root cause is one thread holding the putLock the whole time: > printEventQueueDetails calls forEachRemaining, which then holds the putLock and > takeLock, so the AsyncDispatcher gets stuck. 
> {code} > Thread 6526 (IPC Server handler 454 on default port 8030): > State: RUNNABLE > Blocked count: 29988 > Waited count: 2035029 > Stack: > > java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926) > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215) > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75) > > org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) > > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432) > > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040) > 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958) > java.security.AccessController.doPrivileged(Native Method) > {code} > I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in > LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and take() > are called from different threads. > YARN-8995 introduced the printEventQueueDetails method, and > "eventQueue.stream().collect" calls the forEachRemaining method. > Why? "put.png" shows how put("a") works, and "take.png" shows > how take() works. Special note: a removed Node points to itself to help GC! > The key code is in forEachRemaining: LBQSpliterator uses > forEachRemaining to visit every Node, but after reading an item's value from a Node it > releases the lock. If take() is called at that moment, > the variable 'p' in forEachRemaining may end up pointing to a Node that points to itself, > and forEachRemaining falls into a dead loop. You can see it in "deadloop.png". > A simple unit test reproduces the problem when forEachRemaining runs more slowly than > take(); the unit test is MockForDeadLoop.java. > Debugging MockForDeadLoop.java, I saw a Node pointing to itself; see "debugfornode.png"
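One way to avoid the spliterator hazard described above is to snapshot the queue instead of streaming it live. This is a hedged sketch, not necessarily the actual YARN-10642 fix: LinkedBlockingQueue.toArray() holds both the putLock and takeLock for the entire copy, so it never traverses a removed, self-linked Node, and the counting then runs on a plain array.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

// Hypothetical sketch of a safe printEventQueueDetails-style summary:
// snapshot first, then group, so no lock is held during the grouping
// and no live spliterator traversal happens.
final class QueueDetails {
    static Map<String, Long> countByClass(LinkedBlockingQueue<?> queue) {
        Object[] snapshot = queue.toArray(); // consistent copy under full lock
        return Arrays.stream(snapshot)
            .collect(Collectors.groupingBy(
                e -> e.getClass().getSimpleName(), Collectors.counting()));
    }
}
```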
[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995
[ https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10642: Attachment: YARN-10642-branch-3.1.001.patch > Race condition: AsyncDispatcher can get stuck by the changes introduced in > YARN-8995 > > > Key: YARN-10642 > URL: https://issues.apache.org/jira/browse/YARN-10642 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: zhengchenyu >Assignee: zhengchenyu >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: MockForDeadLoop.java, YARN-10642-branch-3.1.001.patch, > YARN-10642-branch-3.2.001.patch, YARN-10642-branch-3.2.002.patch, > YARN-10642-branch-3.3.001.patch, YARN-10642.001.patch, YARN-10642.002.patch, > YARN-10642.003.patch, YARN-10642.004.patch, YARN-10642.005.patch, > deadloop.png, debugfornode.png, put.png, take.png > > > In our cluster, the ResourceManager got stuck twice within twenty days and the YARN client > couldn't submit applications. I captured jstack output the second time, which revealed the > reason. > Analyzing all the jstack dumps, I found many threads blocked because they could not acquire > LinkedBlockingQueue.putLock. (Note: for reasons of space, the > analytical process is omitted.) > The root cause is one thread holding the putLock the whole time: > printEventQueueDetails calls forEachRemaining, which then holds the putLock and > takeLock, so the AsyncDispatcher gets stuck. 
> {code} > Thread 6526 (IPC Server handler 454 on default port 8030): > State: RUNNABLE > Blocked count: 29988 > Waited count: 2035029 > Stack: > > java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926) > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215) > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75) > > org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) > > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432) > > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040) > 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958) > java.security.AccessController.doPrivileged(Native Method) > {code} > I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in > LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and take() > are called from different threads. > YARN-8995 introduced the printEventQueueDetails method, and > "eventQueue.stream().collect" calls the forEachRemaining method. > Why? "put.png" shows how put("a") works, and "take.png" shows > how take() works. Special note: a removed Node points to itself to help GC! > The key code is in forEachRemaining: LBQSpliterator uses > forEachRemaining to visit every Node, but after reading an item's value from a Node it > releases the lock. If take() is called at that moment, > the variable 'p' in forEachRemaining may end up pointing to a Node that points to itself, > and forEachRemaining falls into a dead loop. You can see it in "deadloop.png". > A simple unit test reproduces the problem when forEachRemaining runs more slowly than > take(); the unit test is MockForDeadLoop.java. > Debugging MockForDeadLoop.java, I saw a Node pointing to itself; see "debugfornode.png". > Environment: > OS: CentOS Linux release 7.5.1804 (Core) > JDK: jdk1.8.0_281
[jira] [Reopened] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995
[ https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reopened YARN-10642: - > Race condition: AsyncDispatcher can get stuck by the changes introduced in > YARN-8995 > > > Key: YARN-10642 > URL: https://issues.apache.org/jira/browse/YARN-10642 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: zhengchenyu >Assignee: zhengchenyu >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, > YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, > YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, > YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, > put.png, take.png > > > In our cluster, the ResourceManager got stuck twice within twenty days and the YARN client > couldn't submit applications. I captured jstack output the second time, which revealed the > reason. > Analyzing all the jstack dumps, I found many threads blocked because they could not acquire > LinkedBlockingQueue.putLock. (Note: for reasons of space, the > analytical process is omitted.) > The root cause is one thread holding the putLock the whole time: > printEventQueueDetails calls forEachRemaining, which then holds the putLock and > takeLock, so the AsyncDispatcher gets stuck. 
> {code} > Thread 6526 (IPC Server handler 454 on default port 8030): > State: RUNNABLE > Blocked count: 29988 > Waited count: 2035029 > Stack: > > java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926) > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215) > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75) > > org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) > > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432) > > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040) > 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958) > java.security.AccessController.doPrivileged(Native Method) > {code} > I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in > LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and take() > are called from different threads. > YARN-8995 introduced the printEventQueueDetails method, and > "eventQueue.stream().collect" calls the forEachRemaining method. > Why? "put.png" shows how put("a") works, and "take.png" shows > how take() works. Special note: a removed Node points to itself to help GC! > The key code is in forEachRemaining: LBQSpliterator uses > forEachRemaining to visit every Node, but after reading an item's value from a Node it > releases the lock. If take() is called at that moment, > the variable 'p' in forEachRemaining may end up pointing to a Node that points to itself, > and forEachRemaining falls into a dead loop. You can see it in "deadloop.png". > A simple unit test reproduces the problem when forEachRemaining runs more slowly than > take(); the unit test is MockForDeadLoop.java. > Debugging MockForDeadLoop.java, I saw a Node pointing to itself; see "debugfornode.png". > Environment: > OS: CentOS Linux release 7.5.1804 (Core) > JDK: jdk1.8.0_281 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org