[jira] [Commented] (YARN-9606) Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient
[ https://issues.apache.org/jira/browse/YARN-9606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419159#comment-17419159 ] Peter Bacsko commented on YARN-9606:

Thanks [~BilwaST] for the backport, committed to branch-3.3.

> Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient
> Key: YARN-9606
> URL: https://issues.apache.org/jira/browse/YARN-9606
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Fix For: 3.4.0
> Attachments: YARN-9606-001.patch, YARN-9606-002.patch, YARN-9606-branch-3.3-v2.patch, YARN-9606-branch-3.3.v1.patch, YARN-9606.003.patch, YARN-9606.004.patch, YARN-9606.005.patch, YARN-9606.006.patch, YARN-9606.007.patch, YARN-9606.008.patch
>
> The "yarn logs" command fails for running containers:
> {quote}
> Unable to fetch log files list
> Exception in thread "main" java.io.IOException: com.sun.jersey.api.client.ClientHandlerException: javax.net.ssl.SSLHandshakeException: Error while authenticating with endpoint: [https://vm2:65321/ws/v1/node/containers/container_e05_1559802125016_0001_01_08/logs]
> at org.apache.hadoop.yarn.client.cli.LogsCLI.getContainerLogFiles(LogsCLI.java:543)
> at org.apache.hadoop.yarn.client.cli.LogsCLI.getMatchedContainerLogFiles(LogsCLI.java:1338)
> at org.apache.hadoop.yarn.client.cli.LogsCLI.getMatchedOptionForRunningApp(LogsCLI.java:1514)
> at org.apache.hadoop.yarn.client.cli.LogsCLI.fetchContainerLogs(LogsCLI.java:1052)
> at org.apache.hadoop.yarn.client.cli.LogsCLI.runCommand(LogsCLI.java:367)
> at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:152)
> at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:399)
> {quote}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9606) Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient
[ https://issues.apache.org/jira/browse/YARN-9606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418683#comment-17418683 ] Peter Bacsko commented on YARN-9606:

[~BilwaST] yeah, sorry, I completely forgot about it. I'll commit it tomorrow.
[jira] [Created] (YARN-10958) Use correct configuration for Group service init in CSMappingPlacementRule
Peter Bacsko created YARN-10958:
Summary: Use correct configuration for Group service init in CSMappingPlacementRule
Key: YARN-10958
URL: https://issues.apache.org/jira/browse/YARN-10958
Project: Hadoop YARN
Issue Type: Bug
Reporter: Peter Bacsko

There is a potential problem in {{CSMappingPlacementRule.java}}:
{noformat}
    if (groups == null) {
      groups = Groups.getUserToGroupsMappingService(conf);
    }
{noformat}
The problem is that we're supposed to pass {{scheduler.getConf()}}. The "conf" object is the configuration of the capacity scheduler, which does not include the property that selects the group service provider. Therefore, the current code works only by chance, because the group mapping service is already initialized at this point. See the original fix in YARN-10053.

We also need a unit test to verify this. Idea:
# Create a Configuration object in which the property "hadoop.security.group.mapping" refers to an existing test implementation.
# Add a new method to {{Groups}} which nulls out the singleton instance, e.g. {{Groups.reset()}}.
# Create a mock CapacityScheduler where {{getConf()}} and {{getConfiguration()}} contain different settings for "hadoop.security.group.mapping". Since {{getConf()}} is the service config, it should return the config object created in step #1.
# Create an instance of {{CSMappingPlacementRule}} with a single primary group rule.
# Run the placement evaluation.
# Expected: the returned queue matches what is supposed to come from the test group mapping service ("testuser" --> "testqueue").
# Modify "hadoop.security.group.mapping" in the config object created in step #1.
# Call {{Groups.refresh()}}, which changes the group mapping ("testuser" --> "testqueue2"). This requires that the test group mapping service implement {{GroupMappingServiceProvider.cacheGroupsRefresh()}}.
# Create a new instance of {{CSMappingPlacementRule}}.
# Run the placement evaluation again.
# Expected: with the same user, the target queue has changed.
This looks convoluted, but these steps make sure that:
# {{CSMappingPlacementRule}} will force the initialization of groups.
# We select the correct configuration for group service init.
# We don't create a new {{Groups}} instance if the singleton is initialized, so we cover the original problem described in YARN-10597.
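The "works by chance" behavior can be modeled without Hadoop: {{Groups}} is a process-wide singleton, so whichever configuration reaches it first wins, and later callers passing the wrong configuration appear to work. The sketch below is a minimal, self-contained toy model (names mirror the Hadoop ones, but this is illustrative, not the actual Hadoop code):

```java
import java.util.HashMap;
import java.util.Map;

public class GroupsInitDemo {
    // Toy model of the Groups singleton: the first configuration it sees
    // selects the group mapping provider; later calls ignore their argument.
    static class Groups {
        private static Groups instance;
        private final String provider;

        private Groups(String provider) { this.provider = provider; }

        static Groups getUserToGroupsMappingService(Map<String, String> conf) {
            if (instance == null) {
                instance = new Groups(
                    conf.getOrDefault("hadoop.security.group.mapping", "default"));
            }
            return instance; // conf is silently ignored once initialized
        }

        static void reset() { instance = null; } // the proposed test hook

        String provider() { return provider; }
    }

    public static void main(String[] args) {
        Map<String, String> serviceConf = new HashMap<>();
        serviceConf.put("hadoop.security.group.mapping", "TestGroupMapping");
        Map<String, String> schedulerConf = new HashMap<>(); // lacks the property

        // The service config initializes the singleton first...
        Groups.getUserToGroupsMappingService(serviceConf);
        // ...so a later call with the wrong (scheduler) config "works by chance".
        System.out.println(
            Groups.getUserToGroupsMappingService(schedulerConf).provider());

        // With a fresh singleton, the scheduler config yields the wrong provider.
        Groups.reset();
        System.out.println(
            Groups.getUserToGroupsMappingService(schedulerConf).provider());
    }
}
```

This is also why the proposed test needs a {{Groups.reset()}}-style hook: without it, the singleton initialized by an earlier test masks which configuration was actually passed.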
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403793#comment-17403793 ] Peter Bacsko commented on YARN-10848:

Thanks for the comment [~prabhujoseph], so you're saying that this is by design? If it is intentional, then we should probably close this JIRA. But at first, this behavior seemed really weird to me.

> Vcore allocation problem with DefaultResourceCalculator
> Key: YARN-10848
> URL: https://issues.apache.org/jira/browse/YARN-10848
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, capacityscheduler
> Reporter: Peter Bacsko
> Assignee: Minni Mittal
> Priority: Major
> Labels: pull-request-available
> Attachments: TestTooManyContainers.java
> Time Spent: 20m
> Remaining Estimate: 0h
>
> If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating containers even if we run out of vcores.
> CS checks the available resources at two places. The first check is {{CapacityScheduler.allocateContainerOnSingleNode()}}:
> {noformat}
> if (calculator.computeAvailableContainers(Resources
>     .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>     minimumAllocation) <= 0) {
>   LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient "
>       + "available or preemptible resource for minimum allocation");
> {noformat}
> The second, which is more important, is located in {{RegularContainerAllocator.assignContainer()}}:
> {noformat}
> if (!Resources.fitsIn(rc, capability, totalResource)) {
>   LOG.warn("Node : " + node.getNodeID()
>       + " does not have sufficient resource for ask : " + pendingAsk
>       + " node total capability : " + node.getTotalResource());
>   // Skip this locality request
>   ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
>       activitiesManager, node, application, schedulerKey,
>       ActivityDiagnosticConstant.NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST
>           + getResourceDiagnostics(capability, totalResource),
>       ActivityLevel.NODE);
>   return ContainerAllocation.LOCALITY_SKIPPED;
> }
> {noformat}
> Here, {{rc}} is the resource calculator instance; the other two values are:
> {noformat}
> Resource capability = pendingAsk.getPerAllocationResource();
> Resource available = node.getUnallocatedResource();
> {noformat}
> There is a repro unit test attached to this case which demonstrates the problem. The root cause is that we pass the resource calculator to {{Resources.fitsIn()}}. Instead, we should use an overridden version, just like in {{FSAppAttempt.assignContainer()}}:
> {noformat}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
>   // Inform the application of the new container for this request
>   RMContainer allocatedContainer =
>       allocate(type, node, schedulerKey, pendingAsk, reservedContainer);
> {noformat}
> In CS, if we switch to DominantResourceCalculator OR use {{Resources.fitsIn()}} without the calculator in {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}).
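The difference between the two {{fitsIn}} variants can be sketched without the Hadoop classes. Below is a minimal, self-contained toy model (the names echo {{Resources.fitsIn()}}, but this is illustrative, not the actual Hadoop implementation): a memory-only check keeps saying "fits" long after the node's vcores are exhausted, while a check over every dimension stops allocation.

```java
public class FitsInDemo {
    // Toy resource type: one memory and one vcore dimension.
    record Resource(long memoryMB, int vcores) {}

    // Models Resources.fitsIn(rc, ...) with DefaultResourceCalculator:
    // only the memory dimension is compared.
    static boolean fitsInMemoryOnly(Resource ask, Resource avail) {
        return ask.memoryMB() <= avail.memoryMB();
    }

    // Models the calculator-free Resources.fitsIn(...):
    // every dimension must fit.
    static boolean fitsInAllDimensions(Resource ask, Resource avail) {
        return ask.memoryMB() <= avail.memoryMB()
            && ask.vcores() <= avail.vcores();
    }

    public static void main(String[] args) {
        // A 768 GB / 32-vcore node after 32 containers of <2 GB, 1 vcore> each:
        // plenty of memory left, but zero vcores.
        Resource unallocated = new Resource(768 * 1024 - 32 * 2048, 0);
        Resource ask = new Resource(2048, 1);

        System.out.println(fitsInMemoryOnly(ask, unallocated));    // keeps allocating
        System.out.println(fitsInAllDimensions(ask, unallocated)); // vcores exhausted
    }
}
```

With the memory-only check, allocation on this node only stops once memory runs out, i.e. after roughly 768 GB / 2 GB = 384 containers on 32 cores.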
[jira] [Commented] (YARN-10576) Update Capacity Scheduler documentation about JSON-based placement mapping
[ https://issues.apache.org/jira/browse/YARN-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17401337#comment-17401337 ] Peter Bacsko commented on YARN-10576:

Thanks [~bteke], I'm not sure I'll have time to work on this anymore, so you can take it over if you want.

> Update Capacity Scheduler documentation about JSON-based placement mapping
> Key: YARN-10576
> URL: https://issues.apache.org/jira/browse/YARN-10576
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-10576-001.patch
>
> The weight mode and AQC also affect how the new placement engine in CS works, and the documentation has to reflect that.
> Certain statements in the documentation are no longer valid, for example:
> * create flag: "Only applies to managed queue parents" - there is no ManagedParentQueue in weight mode.
> * "The nested rules primaryGroupUser and secondaryGroupUser expects the parent queues to exist, ie. they cannot be created automatically". This only applies to the legacy absolute/percentage mode.
> Find all statements that mention possible limitations and fix them if necessary.
[jira] [Commented] (YARN-9907) Make YARN Service AM RPC port configurable
[ https://issues.apache.org/jira/browse/YARN-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390495#comment-17390495 ] Peter Bacsko commented on YARN-9907:

[~tarunparimi] [~prabhujoseph] to me this looks the same as YARN-10439. Can it be closed as a duplicate?

> Make YARN Service AM RPC port configurable
> Key: YARN-9907
> URL: https://issues.apache.org/jira/browse/YARN-9907
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn-native-services
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
> Attachments: YARN-9907.001.patch
>
> YARN Service AM uses a random ephemeral port for the ClientAMService RPC. In environments where firewalls block unnecessary ports by default, it is useful to have a configuration that specifies the port range, similar to what we have for MapReduce ({{yarn.app.mapreduce.am.job.client.port-range}}).
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1732#comment-1732 ] Peter Bacsko commented on YARN-10848:

[~minni31] the problem is that if you have a node with a lot of memory, CS keeps allocating containers even if there are no more vcores available. Imagine a 32-core server with 768 GB of RAM. With a container size of 2 GB, this means that 384 containers can run in parallel, potentially overloading the node. This might be a slightly artificial scenario, but it can happen.

IMO whether a container "fits in" or not should depend on both values. It's OK to use only one for the fairness calculation, but as I pointed out above, Fair Scheduler does not allow such an allocation if the "Fair" policy is used in the queue. If this was done intentionally, I'm wondering what the thought process behind it was.
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377446#comment-17377446 ] Peter Bacsko commented on YARN-10848:

[~minni31] sure, you can take it and I can review the patch if you upload one.
[jira] [Assigned] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reassigned YARN-10848:
Assignee: Minni Mittal
[jira] [Commented] (YARN-10849) Clarify testcase documentation for TestServiceAM#testContainersReleasedWhenPreLaunchFails
[ https://issues.apache.org/jira/browse/YARN-10849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376630#comment-17376630 ] Peter Bacsko commented on YARN-10849:

[~snemeth] patch v2 seems to undo what v1 introduced.

> Clarify testcase documentation for TestServiceAM#testContainersReleasedWhenPreLaunchFails
> Key: YARN-10849
> URL: https://issues.apache.org/jira/browse/YARN-10849
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Minor
> Attachments: YARN-10849.001.patch, YARN-10849.002.patch
>
> There's a small comment added to the testcase org.apache.hadoop.yarn.service.TestServiceAM#testContainersReleasedWhenPreLaunchFails:
> {code}
> // Test to verify that the containers are released and the
> // component instance is added to the pending queue when building the launch
> // context fails.
> {code}
> However, it was not clear to me why building the "launch context" would fail.
> While the test passes, it throws an exception that tells the story:
> {code}
> 2021-07-06 18:31:04,438 ERROR [pool-275-thread-1] containerlaunch.ContainerLaunchService (ContainerLaunchService.java:run(122)) - [COMPINSTANCE compa-0 : container_1625589063422_0001_01_01]: Failed to launch container.
> java.lang.IllegalArgumentException: Can not create a Path from a null string
> at org.apache.hadoop.fs.Path.checkPathArg(Path.java:164)
> at org.apache.hadoop.fs.Path.<init>(Path.java:180)
> at org.apache.hadoop.yarn.service.provider.tarball.TarballProviderService.processArtifact(TarballProviderService.java:39)
> at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:144)
> at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:107)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> This exception is thrown because the id of the Artifact object is unset (null); TarballProviderService.processArtifact verifies this and does not allow such artifacts.
> The aim of this jira is to add a clarifying comment or javadoc to this method.
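The failure mode in the stack trace above boils down to a null-argument guard. A minimal, self-contained toy model (illustrative only; the real check lives in Hadoop's {{Path.checkPathArg}}, and the null comes from the test's unset Artifact id):

```java
public class NullPathDemo {
    // Toy model of Path's null-string guard, mirroring the message
    // seen in the exception above.
    static void checkPathArg(String path) {
        if (path == null) {
            throw new IllegalArgumentException(
                "Can not create a Path from a null string");
        }
    }

    public static void main(String[] args) {
        String artifactId = null; // the test's Artifact has no id set
        try {
            checkPathArg(artifactId); // models new Path(<unset artifact id>)
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```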
[jira] [Updated] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10848:
Attachment: TestTooManyContainers.java
[jira] [Updated] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10848:
Attachment: (was: TestTooManyContainers.java)
[jira] [Updated] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10848: Attachment: TestTooManyContainers.java > Vcore allocation problem with DefaultResourceCalculator > --- > > Key: YARN-10848 > URL: https://issues.apache.org/jira/browse/YARN-10848 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Priority: Major > Attachments: TestTooManyContainers.java > > > If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating > containers even if we run out of vcores. > CS checks the available resources at two places. The first check is > {{CapacityScheduler.allocateContainerOnSingleNode()}}: > {noformat} > if (calculator.computeAvailableContainers(Resources > .add(node.getUnallocatedResource(), > node.getTotalKillableResources()), > minimumAllocation) <= 0) { > LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient " > + "available or preemptible resource for minimum allocation"); > {noformat} > The second, which is more important, is located in > {{RegularContainerAllocator.assignContainer()}}: > {noformat} > if (!Resources.fitsIn(rc, capability, totalResource)) { > LOG.warn("Node : " + node.getNodeID() > + " does not have sufficient resource for ask : " + pendingAsk > + " node total capability : " + node.getTotalResource()); > // Skip this locality request > ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation( > activitiesManager, node, application, schedulerKey, > ActivityDiagnosticConstant. 
> NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST > + getResourceDiagnostics(capability, totalResource), > ActivityLevel.NODE); > return ContainerAllocation.LOCALITY_SKIPPED; > } > {noformat} > Here, {{rc}} is the resource calculator instance, the other two values are: > {noformat} > Resource capability = pendingAsk.getPerAllocationResource(); > Resource available = node.getUnallocatedResource(); > {noformat} > There is a repro unit test attached to this case, which can demonstrate the > problem. The root cause is that we pass the resource calculator to > {{Resources.fitsIn()}}. Instead, we should use an overloaded version, just > like in {{FSAppAttempt.assignContainer()}}: > {noformat} >// Can we allocate a container on this node? > if (Resources.fitsIn(capability, available)) { > // Inform the application of the new container for this request > RMContainer allocatedContainer = > allocate(type, node, schedulerKey, pendingAsk, > reservedContainer); > {noformat} > In CS, if we switch to DominantResourceCalculator OR use > {{Resources.fitsIn()}} without the calculator in > {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit > test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10848: Summary: Vcore allocation problem with DefaultResourceCalculator (was: Vcore usage problem with Default/DominantResourceCalculator) > Vcore allocation problem with DefaultResourceCalculator > --- > > Key: YARN-10848 > URL: https://issues.apache.org/jira/browse/YARN-10848 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Priority: Major > > If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating > containers even if we run out of vcores. > CS checks the available resources at two places. The first check is > {{CapacityScheduler.allocateContainerOnSingleNode()}}: > {noformat} > if (calculator.computeAvailableContainers(Resources > .add(node.getUnallocatedResource(), > node.getTotalKillableResources()), > minimumAllocation) <= 0) { > LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient " > + "available or preemptible resource for minimum allocation"); > {noformat} > The second, which is more important, is located in > {{RegularContainerAllocator.assignContainer()}}: > {noformat} > if (!Resources.fitsIn(rc, capability, totalResource)) { > LOG.warn("Node : " + node.getNodeID() > + " does not have sufficient resource for ask : " + pendingAsk > + " node total capability : " + node.getTotalResource()); > // Skip this locality request > ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation( > activitiesManager, node, application, schedulerKey, > ActivityDiagnosticConstant. 
> NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST > + getResourceDiagnostics(capability, totalResource), > ActivityLevel.NODE); > return ContainerAllocation.LOCALITY_SKIPPED; > } > {noformat} > Here, {{rc}} is the resource calculator instance, the other two values are: > {noformat} > Resource capability = pendingAsk.getPerAllocationResource(); > Resource available = node.getUnallocatedResource(); > {noformat} > There is a repro unit test attached to this case, which can demonstrate the > problem. The root cause is that we pass the resource calculator to > {{Resources.fitsIn()}}. Instead, we should use an overloaded version, just > like in {{FSAppAttempt.assignContainer()}}: > {noformat} >// Can we allocate a container on this node? > if (Resources.fitsIn(capability, available)) { > // Inform the application of the new container for this request > RMContainer allocatedContainer = > allocate(type, node, schedulerKey, pendingAsk, > reservedContainer); > {noformat} > In CS, if we switch to DominantResourceCalculator OR use > {{Resources.fitsIn()}} without the calculator in > {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit > test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10848) Vcore usage problem with Default/DominantResourceCalculator
Peter Bacsko created YARN-10848: --- Summary: Vcore usage problem with Default/DominantResourceCalculator Key: YARN-10848 URL: https://issues.apache.org/jira/browse/YARN-10848 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, capacityscheduler Reporter: Peter Bacsko If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating containers even if we run out of vcores. CS checks the available resources at two places. The first check is {{CapacityScheduler.allocateContainerOnSingleNode()}}: {noformat} if (calculator.computeAvailableContainers(Resources .add(node.getUnallocatedResource(), node.getTotalKillableResources()), minimumAllocation) <= 0) { LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient " + "available or preemptible resource for minimum allocation"); {noformat} The second, which is more important, is located in {{RegularContainerAllocator.assignContainer()}}: {noformat} if (!Resources.fitsIn(rc, capability, totalResource)) { LOG.warn("Node : " + node.getNodeID() + " does not have sufficient resource for ask : " + pendingAsk + " node total capability : " + node.getTotalResource()); // Skip this locality request ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation( activitiesManager, node, application, schedulerKey, ActivityDiagnosticConstant. NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST + getResourceDiagnostics(capability, totalResource), ActivityLevel.NODE); return ContainerAllocation.LOCALITY_SKIPPED; } {noformat} Here, {{rc}} is the resource calculator instance, the other two values are: {noformat} Resource capability = pendingAsk.getPerAllocationResource(); Resource available = node.getUnallocatedResource(); {noformat} There is a repro unit test attached to this case, which can demonstrate the problem. The root cause is that we pass the resource calculator to {{Resources.fitsIn()}}. 
Instead, we should use an overloaded version, just like in {{FSAppAttempt.assignContainer()}}: {noformat} // Can we allocate a container on this node? if (Resources.fitsIn(capability, available)) { // Inform the application of the new container for this request RMContainer allocatedContainer = allocate(type, node, schedulerKey, pendingAsk, reservedContainer); {noformat} In CS, if we switch to DominantResourceCalculator OR use {{Resources.fitsIn()}} without the calculator in {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10843) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
[ https://issues.apache.org/jira/browse/YARN-10843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10843: Component/s: capacityscheduler capacity scheduler > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > - part II > -- > > Key: YARN-10843 > URL: https://issues.apache.org/jira/browse/YARN-10843 > Project: Hadoop YARN > Issue Type: Task > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Priority: Major > Labels: fs2cs > > Remaining tasks for fs2cs converter. > Phase I was completed under YARN-9698. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10843) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
[ https://issues.apache.org/jira/browse/YARN-10843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10843: Labels: fs2cs (was: ) > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > - part II > -- > > Key: YARN-10843 > URL: https://issues.apache.org/jira/browse/YARN-10843 > Project: Hadoop YARN > Issue Type: Task >Reporter: Peter Bacsko >Priority: Major > Labels: fs2cs > > Remaining tasks for fs2cs converter. > Phase I was completed under YARN-9698. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10843) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
[ https://issues.apache.org/jira/browse/YARN-10843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10843: Description: Remaining tasks for fs2cs converter. Phase I was completed under YARN-9698. > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > - part II > -- > > Key: YARN-10843 > URL: https://issues.apache.org/jira/browse/YARN-10843 > Project: Hadoop YARN > Issue Type: Task >Reporter: Peter Bacsko >Priority: Major > > Remaining tasks for fs2cs converter. > Phase I was completed under YARN-9698. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9698. Fix Version/s: 3.4.0 Resolution: Fixed > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > > > Key: YARN-9698 > URL: https://issues.apache.org/jira/browse/YARN-9698 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Weiwei Yang >Priority: Major > Labels: fs2cs > Fix For: 3.4.0 > > Attachments: FS-CS Migration.pdf > > > We see some users want to migrate from Fair Scheduler to Capacity Scheduler, > this Jira is created as an umbrella to track all related efforts for the > migration, the scope contains > * Bug fixes > * Add missing features > * Migration tools that help to generate CS configs based on FS, validate > configs etc > * Documents > this is part of CS component, the purpose is to make the migration process > smooth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375649#comment-17375649 ] Peter Bacsko commented on YARN-9698: Remaining subtasks have been moved under YARN-10843. Closing this. Thanks for everyone's contribution. > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > > > Key: YARN-9698 > URL: https://issues.apache.org/jira/browse/YARN-9698 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Weiwei Yang >Priority: Major > Labels: fs2cs > Attachments: FS-CS Migration.pdf > > > We see some users want to migrate from Fair Scheduler to Capacity Scheduler, > this Jira is created as an umbrella to track all related efforts for the > migration, the scope contains > * Bug fixes > * Add missing features > * Migration tools that help to generate CS configs based on FS, validate > configs etc > * Documents > this is part of CS component, the purpose is to make the migration process > smooth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10171) Add support for increment-allocation of custom resource types
[ https://issues.apache.org/jira/browse/YARN-10171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10171: Parent Issue: YARN-10843 (was: YARN-9698) > Add support for increment-allocation of custom resource types > - > > Key: YARN-10171 > URL: https://issues.apache.org/jira/browse/YARN-10171 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Qi Zhu >Priority: Minor > > The FairScheduler's {{yarn.resource-types.memory-mb.increment-allocation}} > and {{yarn.resource-types.vcores.increment-allocation}} configs are converted > to the {{yarn.scheduler.minimum-allocation-*}} configs, which is fine for the > vcores and memory. > In case of custom resource types like GPU if > {{yarn.resource-types.gpu.increment-allocation}} is set, then CS will not be > aware of that. We don't have a {{yarn.scheduler.minimum-allocation-gpu}} > setting for this purpose, but {{yarn.resource-types.gpu.min-allocation}} is > respected by the {{ResourceCalculator}} through the > {{ResourceUtils#getResourceInformationMapFromConfig}} which would provide us > with the same behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10758) Mixed mode: Allow relative and absolute mode in the same queue hierarchy
[ https://issues.apache.org/jira/browse/YARN-10758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10758: Parent Issue: YARN-10843 (was: YARN-9698) > Mixed mode: Allow relative and absolute mode in the same queue hierarchy > > > Key: YARN-10758 > URL: https://issues.apache.org/jira/browse/YARN-10758 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > > Fair Scheduler supports mixed mode for maximum capacity. An example scenario > of such configuration: > {noformat} > root.a.capacity [memory-mb=7268, vcores=8]{noformat} > {noformat} > root.a.a1.capacity 50{noformat} > {noformat} > root.a.a2.capacity 50{noformat} > Capacity Scheduler already permits using weight mode and relative/percentage > mode in the same hierarchy, however, the absolute mode and relative mode are > mutually exclusive. This improvement is a natural extension of CS to lift > this limitation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10693) Add document for YARN-10623 auto refresh queue conf in cs.
[ https://issues.apache.org/jira/browse/YARN-10693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10693: Parent Issue: YARN-10843 (was: YARN-9698) > Add document for YARN-10623 auto refresh queue conf in cs. > -- > > Key: YARN-10693 > URL: https://issues.apache.org/jira/browse/YARN-10693 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10693.001.patch, YARN-10693.002.patch, > YARN-10693.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9936) Support vector of capacity percentages in Capacity Scheduler configuration
[ https://issues.apache.org/jira/browse/YARN-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9936: --- Parent Issue: YARN-10843 (was: YARN-9698) > Support vector of capacity percentages in Capacity Scheduler configuration > -- > > Key: YARN-9936 > URL: https://issues.apache.org/jira/browse/YARN-9936 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Zoltan Siegl >Assignee: Andras Gyori >Priority: Major > Attachments: Capacity Scheduler support of “vector of resources > percentage”.pdf > > > Currently, the Capacity Scheduler queue configuration supports two ways to > set queue capacity. > * In percentage of all available resources as a float ( eg. 25.0 ) means 25% > of the resources of its parent queue for all resource types equally (eg. 25% > of all memory, 25% of all CPU cores, and 25% of all available GPU in the > cluster). The percentages of all queues have to add up to 100%. > * In an absolute amount of resources ( e.g. > memory=4GB,vcores=20,yarn.io/gpu=4 ). The amount of all resources in the > queues has to be less than or equal to all resources in the > cluster.{color:#de350b}Actually, the above is not supported, we only support > memory and vcores now in absolute mode, we should extend in {color}YARN-10503. > Apart from these two already existing ways, there is a demand to add capacity > percentage of each available resource type separately. (eg. > {{memory=20%,vcores=40%,yarn.io/gpu=100%}}). > At the same time, a similar concept should be included with queues > maximum-capacity as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
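The per-resource-type percentage syntax proposed in YARN-9936 is essentially a small parsing problem. Below is a hypothetical sketch of such a parser; the property shape `memory=20%,vcores=40%,yarn.io/gpu=100%` comes from the issue text, but the class and method names are invented for illustration and are not a shipped YARN API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CapacityVectorParser {
    // Parse "memory=20%,vcores=40%,yarn.io/gpu=100%" into resource -> percentage.
    // Illustrative only: real CS config parsing would also need validation
    // against registered resource types and range checks.
    static Map<String, Double> parse(String spec) {
        Map<String, Double> vector = new LinkedHashMap<>();
        for (String part : spec.split(",")) {
            String[] kv = part.trim().split("=");
            if (kv.length != 2 || !kv[1].trim().endsWith("%")) {
                throw new IllegalArgumentException("expected name=NN%: " + part);
            }
            String value = kv[1].trim();
            vector.put(kv[0].trim(),
                Double.parseDouble(value.substring(0, value.length() - 1)));
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(parse("memory=20%,vcores=40%,yarn.io/gpu=100%"));
        // {memory=20.0, vcores=40.0, yarn.io/gpu=100.0}
    }
}
```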
[jira] [Updated] (YARN-10049) FIFOOrderingPolicy Improvements
[ https://issues.apache.org/jira/browse/YARN-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10049: Parent Issue: YARN-10843 (was: YARN-9698) > FIFOOrderingPolicy Improvements > --- > > Key: YARN-10049 > URL: https://issues.apache.org/jira/browse/YARN-10049 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Manikandan R >Assignee: Manikandan R >Priority: Major > Attachments: YARN-10049.001.patch, YARN-10049.002.patch, > YARN-10049.003.patch > > > FIFOPolicy of FS does the following comparisons in addition to app priority > comparison: > 1. Using Start time > 2. Using Name > Scope of this jira is to achieve the same comparisons in FIFOOrderingPolicy > of CS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9975) Support proxy acl user for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9975: --- Parent Issue: YARN-10843 (was: YARN-9698) > Support proxy acl user for CapacityScheduler > > > Key: YARN-9975 > URL: https://issues.apache.org/jira/browse/YARN-9975 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > As commented in https://issues.apache.org/jira/browse/YARN-9698. > I will open a new jira for the proxy user feature. > The background is that we have long running sql thriftserver for many users: > {quote}{{user->sql proxy-> sql thriftserver}}{quote} > But we do not have keytab for all users on 'sql proxy'. We just use a super > user like 'sql_prc' to submit the 'sql thriftserver' application. To support > this we should change the scheduler to support proxy user acl -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9904) Investigate how resource allocation configuration could be more consistent in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9904: --- Parent Issue: YARN-10843 (was: YARN-9698) > Investigate how resource allocation configuration could be more consistent in > CapacityScheduler > --- > > Key: YARN-9904 > URL: https://issues.apache.org/jira/browse/YARN-9904 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Gergely Pollák >Priority: Major > > It would be nice if everywhere where a capacity can be defined could be > defined the same way: > * With fixed amounts (eg 1GB memory, 8 vcores, 3 GPU) > * With percentages > ** Percentage of all resources (eg 10% of all memory, vcore, GPU) > ** Percentage per resource type (eg 10% memory, 25% vcore, 50% GPU) > We need to determine all configuration options where capacities can be > defined, and see if it is possible to extend the configuration, or if it > makes sense in that case. > The outcome is a proposal for all the configurations which could/should be > changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9892) Capacity scheduler: support DRF ordering policy on queue level
[ https://issues.apache.org/jira/browse/YARN-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9892: --- Parent Issue: YARN-10843 (was: YARN-9698) > Capacity scheduler: support DRF ordering policy on queue level > -- > > Key: YARN-9892 > URL: https://issues.apache.org/jira/browse/YARN-9892 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Peter Bacsko >Assignee: Manikandan R >Priority: Major > Attachments: YARN-9892-003.patch, YARN-9892.001.patch, > YARN-9892.002.patch > > > Capacity scheduler does not support DRF (Dominant Resource Fairness) ordering > policy on queue level. Only "fifo" and "fair" are accepted for > {{yarn.scheduler.capacity..ordering-policy}}. > DRF can only be used globally if > {{yarn.scheduler.capacity.resource-calculator}} is set to > DominantResourceCalculator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9748) Allow capacity-scheduler configuration on HDFS and support reload from HDFS
[ https://issues.apache.org/jira/browse/YARN-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9748: --- Parent Issue: YARN-10843 (was: YARN-9698) > Allow capacity-scheduler configuration on HDFS and support reload from HDFS > --- > > Key: YARN-9748 > URL: https://issues.apache.org/jira/browse/YARN-9748 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler, capacityscheduler >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > > Improvement: > Support auto reload from hdfs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9759) Document Queue and App Ordering Policy for CS
[ https://issues.apache.org/jira/browse/YARN-9759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9759: --- Parent Issue: YARN-10843 (was: YARN-9698) > Document Queue and App Ordering Policy for CS > - > > Key: YARN-9759 > URL: https://issues.apache.org/jira/browse/YARN-9759 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > Documentation of below properties for CapacityScheduler are missing > Ordering policy inside a parent queue to sort queues: > yarn.scheduler.capacity..ordering-policy = utilization, > priority-utilization > Ordering policy inside a leaf queue to sort apps: > yarn.scheduler.capacity..ordering-policy = fifo , fair -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-7621: --- Parent Issue: YARN-10843 (was: YARN-9698) > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: zhoukang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference of queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler. > FairScheduler needs queue path but CapacityScheduler needs queue name. There > is no doubt of the correctness of queue definition for CapacityScheduler > because it does not allow duplicate leaf queue names, but it's hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with queue path for CapacityScheduler to make the interface clearer and > the scheduler switch smooth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9709) When we expanding queue list the scheduler page will not show any applications
[ https://issues.apache.org/jira/browse/YARN-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9709: --- Parent Issue: YARN-10843 (was: YARN-9698) > When we expanding queue list the scheduler page will not show any applications > -- > > Key: YARN-9709 > URL: https://issues.apache.org/jira/browse/YARN-9709 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9709.001.patch, list1.png, list3.png > > > When we expanding queue list the scheduler page will not show any > applications.But it works well in FairScheduler. > !list1.png! > !list3.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9700) Docs about how to migrate from FS to CS config
[ https://issues.apache.org/jira/browse/YARN-9700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9700: --- Parent Issue: YARN-10843 (was: YARN-9698) > Docs about how to migrate from FS to CS config > -- > > Key: YARN-9700 > URL: https://issues.apache.org/jira/browse/YARN-9700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: docs >Reporter: Wanqiang Ji >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10843) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
Peter Bacsko created YARN-10843: --- Summary: [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II Key: YARN-10843 URL: https://issues.apache.org/jira/browse/YARN-10843 Project: Hadoop YARN Issue Type: Task Reporter: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10780) Optimise retrieval of configured node labels in CS queues
[ https://issues.apache.org/jira/browse/YARN-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369025#comment-17369025 ] Peter Bacsko commented on YARN-10780: - +1 Thanks [~gandras], latest patch LGTM. Committed to trunk. > Optimise retrieval of configured node labels in CS queues > - > > Key: YARN-10780 > URL: https://issues.apache.org/jira/browse/YARN-10780 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10780.001.patch, YARN-10780.002.patch, > YARN-10780.003.patch, YARN-10780.004.patch, YARN-10780.005.patch > > > CapacitySchedulerConfiguration#getConfiguredNodeLabels scales poorly with > respect to queue numbers (it's O(n*m), where n is the number of queues and m > is the number of properties set by each queue). During CS reinit, the node > labels are often queried, however looking at the code: > {code:java} > for (Entry stringStringEntry : this) { > e = stringStringEntry; > String key = e.getKey(); > if (key.startsWith(getQueuePrefix(queuePath) + ACCESSIBLE_NODE_LABELS > + DOT)) { > // Find in > // .accessible-node-labels..property > int labelStartIdx = > key.indexOf(ACCESSIBLE_NODE_LABELS) > + ACCESSIBLE_NODE_LABELS.length() + 1; > int labelEndIndx = key.indexOf('.', labelStartIdx); > String labelName = key.substring(labelStartIdx, labelEndIndx); > configuredNodeLabels.add(labelName); > } > } > {code} > This method iterates through ALL properties set in the configuration. For > example in case of initialising 2500 queues, each having at least 2 > properties: > 2500 * 5000 ~= over 12 million iterations + additional properties > There are some ways to resolve this issue while keeping backward > compatibility: > # Create a property like the original accessible-node-labels, which contains > predefined labels. 
If it is set, then getConfiguredNodeLabels get the value > of this property, otherwise it falls back to the old logic. I think > accessible-node-labels are not used for this purpose (though I have a feeling > that it should have been). > # Collect node labels for all queues at the beginning of parseQueue and only > iterate through the properties once. This will increase the space complexity > in exchange of not requiring intervention from user's perspective. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
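Option #2 above (collecting labels for all queues in a single pass over the properties) could look roughly like the sketch below. This is not the actual patch: the class and method names are made up for illustration, and the configuration is modeled as a plain `Map<String, String>` rather than Hadoop's `Configuration`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NodeLabelCollector {
    static final String PREFIX = "yarn.scheduler.capacity.";
    static final String ACCESSIBLE_NODE_LABELS = "accessible-node-labels";

    // Single pass over ALL configuration entries: O(m) instead of O(n*m),
    // grouping the label names it finds by queue path.
    static Map<String, Set<String>> collectNodeLabels(Map<String, String> conf) {
        Map<String, Set<String>> labelsByQueue = new HashMap<>();
        String marker = "." + ACCESSIBLE_NODE_LABELS + ".";
        for (Map.Entry<String, String> configEntry : conf.entrySet()) {
            String key = configEntry.getKey();
            int idx = key.indexOf(marker);
            if (!key.startsWith(PREFIX) || idx < 0) {
                continue;
            }
            // Key shape assumed:
            // yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.<property>
            String queuePath = key.substring(PREFIX.length(), idx);
            int labelStart = idx + marker.length();
            int labelEnd = key.indexOf('.', labelStart);
            if (labelEnd < 0) {
                continue; // no per-label property suffix, e.g. the label list itself
            }
            String label = key.substring(labelStart, labelEnd);
            labelsByQueue.computeIfAbsent(queuePath, q -> new HashSet<>()).add(label);
        }
        return labelsByQueue;
    }
}
```

The result can then be handed to each queue during parseQueue, so no queue ever re-scans the full property set.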
[jira] [Commented] (YARN-10780) Optimise retrieval of configured node labels in CS queues
[ https://issues.apache.org/jira/browse/YARN-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367259#comment-17367259 ] Peter Bacsko commented on YARN-10780: - [~gandras] looks good, could you take care of the checkstyle problems? > Optimise retrieval of configured node labels in CS queues > - > > Key: YARN-10780 > URL: https://issues.apache.org/jira/browse/YARN-10780 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10780.001.patch, YARN-10780.002.patch, > YARN-10780.003.patch, YARN-10780.004.patch > > > CapacitySchedulerConfiguration#getConfiguredNodeLabels scales poorly with > respect to queue numbers (its O(n*m), where n is the number of queues and m > is the number of properties set by each queue). During CS reinit, the node > labels are often queried, however looking at the code: > {code:java} > for (Entry stringStringEntry : this) { > e = stringStringEntry; > String key = e.getKey(); > if (key.startsWith(getQueuePrefix(queuePath) + ACCESSIBLE_NODE_LABELS > + DOT)) { > // Find in > // .accessible-node-labels..property > int labelStartIdx = > key.indexOf(ACCESSIBLE_NODE_LABELS) > + ACCESSIBLE_NODE_LABELS.length() + 1; > int labelEndIndx = key.indexOf('.', labelStartIdx); > String labelName = key.substring(labelStartIdx, labelEndIndx); > configuredNodeLabels.add(labelName); > } > } > {code} > This method iterates through ALL properties set in the configuration. For > example in case of initialising 2500 queues, each having at least 2 > properties: > 2500 * 5000 ~= over 12 million iteration + additional properties > There are some ways to resolve this issue while keeping backward > compatibility: > # Create a property like the original accessible-node-labels, which contains > predefined labels. If it is set, then getConfiguredNodeLabels get the value > of this property, otherwise it falls back to the old logic. 
I think > accessible-node-labels are not used for this purpose (though I have a feeling > that it should have been). > # Collect node labels for all queues at the beginning of parseQueue and only > iterate through the properties once. This will increase the space complexity > in exchange for not requiring intervention from the user's perspective.
[jira] [Updated] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10796: Attachment: YARN-10796-003.patch > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch, > YARN-10796-003.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
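A rough model of why the 0% queue in the scenario above cannot grow (simplified arithmetic only; the real CapacityScheduler user-limit computation also accounts for active users, minimum-user-limit-percent and pending resources, and the method and parameter names here are purely illustrative):

```java
public class UserLimitSketch {
    // Simplified model of a user's headroom in an auto-created leaf queue:
    // roughly queue-capacity * user-limit-factor, capped by the queue's
    // max capacity. All capacities are fractions of the parent's resources.
    static double approxUserLimitGb(double parentGb, double queueCapacity,
                                    double userLimitFactor, double maxCapacity) {
        double guaranteed = parentGb * queueCapacity;
        return Math.min(guaranteed * userLimitFactor, parentGb * maxCapacity);
    }
}
```

With root.dynamic at 8 GB (50% of the 16 GB cluster), a 40% template queue gets min(3.2 GB * 4, 8 GB) = 8 GB of headroom, while the 0% queue gets min(0 * 4, 8 GB) = 0: the user-limit-factor multiplies a zero guarantee, so max-capacity never comes into play.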
[jira] [Comment Edited] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355804#comment-17355804 ] Peter Bacsko edited comment on YARN-10796 at 6/2/21, 3:32 PM: -- [~gandras] this is a valid concern. Question is, do we accept how it worked before and say "yeah, that's another way of working"? Are there clusters built on the fact that a 0% queue cannot scale out properly, despite the max-capacity setting? Honestly, I don't know. Maybe some people got used to the improper behavior and expect it to work that way, which does happen in real life. In my view, even a zero capacity queue should be able to occupy the cluster if nothing else is used, provided max-capacity is set appropriately. So I would not go for a new property. was (Author: pbacsko): [~gandras] this is a valid concern. Question is, do we accept how it worked before and say "yeah, that's another way of working"? Are there clusters built on the fact that a 0% queue cannot scale out properly, despite the max-capacity setting? Honestly, I don't know. Maybe some people got used to the improper behavior and expect it to work that way, which does happen in real life. TIn my view, even a zero capacity queue should be able to occupy the cluster if nothing else is used, provided max-capacity is set appropriately. So I would not go for a new property. > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. 
> Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible.
[jira] [Commented] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355804#comment-17355804 ] Peter Bacsko commented on YARN-10796: - [~gandras] this is a valid concern. Question is, do we accept how it worked before and say "yeah, that's another way of working"? Are there clusters built on the fact that a 0% queue cannot scale out properly, despite the max-capacity setting? Honestly, I don't know. Maybe some people got used to the improper behavior and expect it to work that way, which does happen in real life. That said, even a zero capacity queue should be able to occupy the cluster if nothing else is used, provided max-capacity is set appropriately. So I would not go for a new property. > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. 
> Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible.
[jira] [Comment Edited] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355804#comment-17355804 ] Peter Bacsko edited comment on YARN-10796 at 6/2/21, 3:28 PM: -- [~gandras] this is a valid concern. Question is, do we accept how it worked before and say "yeah, that's another way of working"? Are there clusters built on the fact that a 0% queue cannot scale out properly, despite the max-capacity setting? Honestly, I don't know. Maybe some people got used to the improper behavior and expect it to work that way, which does happen in real life. TIn my view, even a zero capacity queue should be able to occupy the cluster if nothing else is used, provided max-capacity is set appropriately. So I would not go for a new property. was (Author: pbacsko): [~gandras] this is a valid concern. Question is, do we accept how it worked before and say "yeah, that's another way of working"? Are there clusters built on the fact that a 0% queue cannot scale out properly, despite the max-capacity setting? Honestly, I don't know. Maybe some people got used to the improper behavior and expect it to work that way, which does happen in real life. That said, even a zero capacity queue should be able to occupy the cluster if nothing else is used, provided max-capacity is set appropriately. So I would not go for a new property. > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. 
> Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible.
[jira] [Commented] (YARN-10780) Optimise retrieval of configured node labels in CS queues
[ https://issues.apache.org/jira/browse/YARN-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355796#comment-17355796 ] Peter Bacsko commented on YARN-10780: - Ok, I went through the patch. I'm not saying that I have 100% full understanding, but for the most part, I get it. Some comments/questions: 1. {{ConfiguredNodeLabels}}: it has a no-arg constructor, which is used once. But as I can see, that doesn't do anything and the real thing occurs when it's called with a Configuration object (init / reinit). Also, in the constructor CapacitySchedulerQueueManager, we have a conf object, can't we just pass that? 2. In {{AbstractCSQueue}}, we always call {{getConfiguredLabels()}}, I think it's simpler to directly reference the labels object. I can also see that descendant classes reference it. In order to be consistent, you might consider making it protected or package private, every other variable seems to follow this convention. 3. If I understand correctly, {{CapacitySchedulerConfiguration.getConfiguredNodeLabelsByQueue()}} runs once and only once, right? I mean once per init/reinit. 4. Nit: variable "stringStringEntry", can we have a better name for this? Like "configEntry". 5. I'd be a bit more aggressive with immutable Sets. {{getLabelsByQueue()}} should return {{ImmutableSet.of(labels)}}. {{CapacitySchedulerConfiguration.getConfiguredNodeLabels(String)}} always constructs a new set, so that's OK. > Optimise retrieval of configured node labels in CS queues > - > > Key: YARN-10780 > URL: https://issues.apache.org/jira/browse/YARN-10780 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10780.001.patch, YARN-10780.002.patch > > > CapacitySchedulerConfiguration#getConfiguredNodeLabels scales poorly with > respect to queue numbers (its O(n*m), where n is the number of queues and m > is the number of properties set by each queue). 
During CS reinit, the node > labels are often queried, however looking at the code: > {code:java} > for (Entry stringStringEntry : this) { > e = stringStringEntry; > String key = e.getKey(); > if (key.startsWith(getQueuePrefix(queuePath) + ACCESSIBLE_NODE_LABELS > + DOT)) { > // Find in > // .accessible-node-labels..property > int labelStartIdx = > key.indexOf(ACCESSIBLE_NODE_LABELS) > + ACCESSIBLE_NODE_LABELS.length() + 1; > int labelEndIndx = key.indexOf('.', labelStartIdx); > String labelName = key.substring(labelStartIdx, labelEndIndx); > configuredNodeLabels.add(labelName); > } > } > {code} > This method iterates through ALL properties set in the configuration. For > example in case of initialising 2500 queues, each having at least 2 > properties: > 2500 * 5000 ~= over 12 million iterations + additional properties > There are some ways to resolve this issue while keeping backward > compatibility: > # Create a property like the original accessible-node-labels, which contains > predefined labels. If it is set, then getConfiguredNodeLabels gets the value > of this property, otherwise it falls back to the old logic. I think > accessible-node-labels are not used for this purpose (though I have a feeling > that it should have been). > # Collect node labels for all queues at the beginning of parseQueue and only > iterate through the properties once. This will increase the space complexity > in exchange for not requiring intervention from the user's perspective.
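Point 5 of the review above (handing out immutable sets from the label cache) could be sketched as follows. The class and method names are illustrative, not the actual ConfiguredNodeLabels class from the patch, and JDK `Collections.unmodifiableSet` stands in for Guava's `ImmutableSet`:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NodeLabelCache {
    private final Map<String, Set<String>> labelsByQueue = new HashMap<>();

    void addLabel(String queuePath, String label) {
        labelsByQueue.computeIfAbsent(queuePath, q -> new HashSet<>()).add(label);
    }

    // Hand out a defensive, unmodifiable copy so callers cannot mutate
    // the scheduler's internal label state between reinits.
    Set<String> getLabelsByQueue(String queuePath) {
        Set<String> labels =
            labelsByQueue.getOrDefault(queuePath, Collections.emptySet());
        return Collections.unmodifiableSet(new HashSet<>(labels));
    }
}
```

Any attempt to add to or remove from the returned set throws UnsupportedOperationException, which surfaces accidental mutation at the call site instead of silently corrupting shared state.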
[jira] [Commented] (YARN-10780) Optimise retrieval of configured node labels in CS queues
[ https://issues.apache.org/jira/browse/YARN-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355775#comment-17355775 ] Peter Bacsko commented on YARN-10780: - There are a lot of NPE at {{serviceStop()}}, [~gandras] could you check those? In the meantime, I'll review the changes. > Optimise retrieval of configured node labels in CS queues > - > > Key: YARN-10780 > URL: https://issues.apache.org/jira/browse/YARN-10780 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10780.001.patch, YARN-10780.002.patch > > > CapacitySchedulerConfiguration#getConfiguredNodeLabels scales poorly with > respect to queue numbers (its O(n*m), where n is the number of queues and m > is the number of properties set by each queue). During CS reinit, the node > labels are often queried, however looking at the code: > {code:java} > for (Entry stringStringEntry : this) { > e = stringStringEntry; > String key = e.getKey(); > if (key.startsWith(getQueuePrefix(queuePath) + ACCESSIBLE_NODE_LABELS > + DOT)) { > // Find in > // .accessible-node-labels..property > int labelStartIdx = > key.indexOf(ACCESSIBLE_NODE_LABELS) > + ACCESSIBLE_NODE_LABELS.length() + 1; > int labelEndIndx = key.indexOf('.', labelStartIdx); > String labelName = key.substring(labelStartIdx, labelEndIndx); > configuredNodeLabels.add(labelName); > } > } > {code} > This method iterates through ALL properties set in the configuration. For > example in case of initialising 2500 queues, each having at least 2 > properties: > 2500 * 5000 ~= over 12 million iteration + additional properties > There are some ways to resolve this issue while keeping backward > compatibility: > # Create a property like the original accessible-node-labels, which contains > predefined labels. If it is set, then getConfiguredNodeLabels get the value > of this property, otherwise it falls back to the old logic. 
I think > accessible-node-labels are not used for this purpose (though I have a feeling > that it should have been). > # Collect node labels for all queues at the beginning of parseQueue and only > iterate through the properties once. This will increase the space complexity > in exchange for not requiring intervention from the user's perspective.
[jira] [Commented] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355771#comment-17355771 ] Peter Bacsko commented on YARN-10796: - Thanks [~bteke], this makes sense. > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
[jira] [Commented] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355720#comment-17355720 ] Peter Bacsko commented on YARN-10796: - [~bteke], [~gandras], [~snemeth] could you review this please? > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
[jira] [Updated] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10796: Attachment: YARN-10796-002.patch > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch, YARN-10796-002.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
[jira] [Updated] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10796: Issue Type: Bug (was: Task) > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
[jira] [Updated] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10796: Attachment: YARN-10796-001.patch > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Task > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10796-001.patch > > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
[jira] [Updated] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10796: Summary: Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0% (was: Capacity Scheduler: dynamic queue cannot scale out properly if it's capacity is 0%) > Capacity Scheduler: dynamic queue cannot scale out properly if its capacity > is 0% > - > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Task > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. > Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. 
[jira] [Updated] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if it's capacity is 0%
[ https://issues.apache.org/jira/browse/YARN-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10796: Description: If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it cannot properly scale even if it's max-capacity and the parent's max-capacity would allow it. Example: {noformat} Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) Container allocation size: 1G / 1 vcore root.dynamic Effective Capacity: ( 50.0%) Effective Max Capacity: (100.0%) Template: Capacity: 40% Max Capacity: 100% User Limit Factor: 4 {noformat} leaf-queue-template.capacity = 40% leaf-queue-template.maximum-capacity = 100% leaf-queue-template.maximum-am-resource-percent = 50% leaf-queue-template.minimum-user-limit-percent =100% leaf-queue-template.user-limit-factor = 4 "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. Let's assume there are running containers in these dynamic queues (MR sleep jobs): root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) This scenario will result in an underutilized cluster. There will be approx 18% unused capacity. On the other hand, it's still possible to submit a new application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% utilization is possible. was: If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it cannot properly scale even if it's max-capacity and the parent's max-capacity would allow it. 
Example: {noformat} Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) Container allocation size: 1G / 1 vcore Root.dynamic Effective Capacity: ( 50.0%) Effective Max Capacity: (100.0%) Template: Capacity: 40% Max Capacity: 100% User Limit Factor: 4 {noformat} leaf-queue-template.capacity = 40% leaf-queue-template.maximum-capacity = 100% leaf-queue-template.maximum-am-resource-percent = 50% leaf-queue-template.minimum-user-limit-percent =100% leaf-queue-template.user-limit-factor = 4 "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. Let's assume there are running containers in these dynamic queues (MR sleep jobs): root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) This scenario will result in an underutilized cluster. There will be approx 18% unused capacity. On the other hand, it's still possible to submit a new application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% utilization is possible. > Capacity Scheduler: dynamic queue cannot scale out properly if it's capacity > is 0% > -- > > Key: YARN-10796 > URL: https://issues.apache.org/jira/browse/YARN-10796 > Project: Hadoop YARN > Issue Type: Task > Components: capacity scheduler, capacityscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it > cannot properly scale even if it's max-capacity and the parent's max-capacity > would allow it. 
> Example: > {noformat} > Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) > Container allocation size: 1G / 1 vcore > root.dynamic > Effective Capacity: ( 50.0%) > Effective Max Capacity: (100.0%) > Template: > Capacity: 40% > Max Capacity: 100% > User Limit Factor: 4 > {noformat} > leaf-queue-template.capacity = 40% > leaf-queue-template.maximum-capacity = 100% > leaf-queue-template.maximum-am-resource-percent = 50% > leaf-queue-template.minimum-user-limit-percent =100% > leaf-queue-template.user-limit-factor = 4 > "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. > Let's assume there are running containers in these dynamic queues (MR sleep > jobs): > root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) > root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) > This scenario will result in an underutilized cluster. There will be approx > 18% unused capacity. On the other hand, it's still possible to submit a new > application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% > utilization is possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issue
[jira] [Created] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if it's capacity is 0%
Peter Bacsko created YARN-10796: --- Summary: Capacity Scheduler: dynamic queue cannot scale out properly if it's capacity is 0% Key: YARN-10796 URL: https://issues.apache.org/jira/browse/YARN-10796 Project: Hadoop YARN Issue Type: Task Components: capacity scheduler, capacityscheduler Reporter: Peter Bacsko Assignee: Peter Bacsko If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it cannot properly scale even if it's max-capacity and the parent's max-capacity would allow it. Example: {noformat} Cluster Capacity: 16 GB / 16cpu (2 nodes, each with 8 GB / 8 cpu ) Container allocation size: 1G / 1 vcore Root.dynamic Effective Capacity: ( 50.0%) Effective Max Capacity: (100.0%) Template: Capacity: 40% Max Capacity: 100% User Limit Factor: 4 {noformat} leaf-queue-template.capacity = 40% leaf-queue-template.maximum-capacity = 100% leaf-queue-template.maximum-am-resource-percent = 50% leaf-queue-template.minimum-user-limit-percent =100% leaf-queue-template.user-limit-factor = 4 "root.dynamic" has a maximum capacity of 100% and a capacity of 50%. Let's assume there are running containers in these dynamic queues (MR sleep jobs): root.dynamic.user1 = 1 AM + 3 container (capacity = 40%) root.dynamic.user2 = 1 AM + 3 container (capacity = 40%) root.dynamic.user3 = 1 AM + 15 container (capacity = 0%) This scenario will result in an underutilized cluster. There will be approx 18% unused capacity. On the other hand, it's still possible to submit a new application to root.dynamic.user1 or root.dynamic.user2 and reaching a 100% utilization is possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10771) Add cluster metric for size of SchedulerEventQueue and RMEventQueue
[ https://issues.apache.org/jira/browse/YARN-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349248#comment-17349248 ] Peter Bacsko commented on YARN-10771: - Just one thing: {{import com.google.common.util.concurrent.ThreadFactoryBuilder;}} Use the shaded import starting with "org.apache.thirdparty". > Add cluster metric for size of SchedulerEventQueue and RMEventQueue > --- > > Key: YARN-10771 > URL: https://issues.apache.org/jira/browse/YARN-10771 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: chaosju >Assignee: chaosju >Priority: Major > Attachments: YARN-10763.001.patch, YARN-10771.002.patch, > YARN-10771.003.patch, YARN-10771.004.patch > > > Add cluster metric for size of Scheduler event queue and RM event queue, This > lets us know the load of the RM and convenient monitoring the metrics. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9606) Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient
[ https://issues.apache.org/jira/browse/YARN-9606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349177#comment-17349177 ] Peter Bacsko commented on YARN-9606: [~BilwaST] I can't apply this patch to branch-3.3, even using "git apply \-3" fails. Could you upload a branch-3.3 version? > Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient > -- > > Key: YARN-9606 > URL: https://issues.apache.org/jira/browse/YARN-9606 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9606-001.patch, YARN-9606-002.patch, > YARN-9606.003.patch, YARN-9606.004.patch, YARN-9606.005.patch, > YARN-9606.006.patch, YARN-9606.007.patch, YARN-9606.008.patch > > > Yarn logs fails for running containers > > > {quote} > > > > Unable to fetch log files list > Exception in thread "main" java.io.IOException: > com.sun.jersey.api.client.ClientHandlerException: > javax.net.ssl.SSLHandshakeException: Error while authenticating with > endpoint: > [https://vm2:65321/ws/v1/node/containers/container_e05_1559802125016_0001_01_08/logs] > at > org.apache.hadoop.yarn.client.cli.LogsCLI.getContainerLogFiles(LogsCLI.java:543) > at > org.apache.hadoop.yarn.client.cli.LogsCLI.getMatchedContainerLogFiles(LogsCLI.java:1338) > at > org.apache.hadoop.yarn.client.cli.LogsCLI.getMatchedOptionForRunningApp(LogsCLI.java:1514) > at > org.apache.hadoop.yarn.client.cli.LogsCLI.fetchContainerLogs(LogsCLI.java:1052) > at org.apache.hadoop.yarn.client.cli.LogsCLI.runCommand(LogsCLI.java:367) > at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:152) > at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:399) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349104#comment-17349104 ] Peter Bacsko commented on YARN-10779: - [~snemeth] please review this. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-002.patch, > YARN-10779-003.patch, YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
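The proposed solution in the description can be sketched in plain Java. The property name and the `java.util.Properties`-based lookup below are placeholders for illustration only; they are not the yarn-site.xml key or the Hadoop `Configuration` API that the actual patch introduces:

```java
import java.util.Locale;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the proposal: gate the forced lowercase conversion
// of application tags on a configuration flag that is read once and cached.
public class TagConversionSketch {

    // Placeholder key -- the real property would be defined in YarnConfiguration.
    static final String FORCE_LOWERCASE = "yarn.application.tag.force-lowercase";

    private final boolean forceLowercase; // loaded once, then cached

    TagConversionSketch(Properties conf) {
        // Default "true" preserves the existing lowercase behavior.
        this.forceLowercase = Boolean.parseBoolean(
                conf.getProperty(FORCE_LOWERCASE, "true"));
    }

    Set<String> convertTags(Set<String> tags) {
        Set<String> result = new TreeSet<>();
        for (String tag : tags) {
            result.add(forceLowercase ? tag.toLowerCase(Locale.ROOT) : tag);
        }
        return result;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(FORCE_LOWERCASE, "false");
        // With the flag off, a "userid" tag keeps its original case:
        System.out.println(new TagConversionSketch(conf)
                .convertTags(Set.of("UserId:Alice"))); // [UserId:Alice]
    }
}
```

Defaulting the flag to the old behavior is the key design point: existing clusters see no change unless an operator opts out of the conversion.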
[jira] [Reopened] (YARN-9606) Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient
[ https://issues.apache.org/jira/browse/YARN-9606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reopened YARN-9606: > Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient > -- > > Key: YARN-9606 > URL: https://issues.apache.org/jira/browse/YARN-9606 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9606-001.patch, YARN-9606-002.patch, > YARN-9606.003.patch, YARN-9606.004.patch, YARN-9606.005.patch, > YARN-9606.006.patch, YARN-9606.007.patch, YARN-9606.008.patch > > > Yarn logs fails for running containers > > > {quote} > > > > Unable to fetch log files list > Exception in thread "main" java.io.IOException: > com.sun.jersey.api.client.ClientHandlerException: > javax.net.ssl.SSLHandshakeException: Error while authenticating with > endpoint: > [https://vm2:65321/ws/v1/node/containers/container_e05_1559802125016_0001_01_08/logs] > at > org.apache.hadoop.yarn.client.cli.LogsCLI.getContainerLogFiles(LogsCLI.java:543) > at > org.apache.hadoop.yarn.client.cli.LogsCLI.getMatchedContainerLogFiles(LogsCLI.java:1338) > at > org.apache.hadoop.yarn.client.cli.LogsCLI.getMatchedOptionForRunningApp(LogsCLI.java:1514) > at > org.apache.hadoop.yarn.client.cli.LogsCLI.fetchContainerLogs(LogsCLI.java:1052) > at org.apache.hadoop.yarn.client.cli.LogsCLI.runCommand(LogsCLI.java:367) > at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:152) > at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:399) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349064#comment-17349064 ] Peter Bacsko commented on YARN-10779: - [~gandras] this is a site property, not a CS property. CS re-initialization does not affect YARN as a whole, only CS. It requires to restart Resource Manager. ZK-based activation/deactivation also only refreshes scheduler-based settings, queues, ACLs, supergroup settings, etc. See {{org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive()}}. The site configuration is loaded once in {{ResourceManager.main()}}. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-002.patch, > YARN-10779-003.patch, YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. 
[jira] [Updated] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10779: Attachment: YARN-10779-003.patch > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-002.patch, > YARN-10779-003.patch, YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10779: Attachment: YARN-10779-002.patch > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-002.patch, > YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10505) Extend the maximum-capacity property to support Fair Scheduler migration
[ https://issues.apache.org/jira/browse/YARN-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10505: Description: The property root.users.maximum-capacity could mean the following things: * Parent Percentage: maximum capacity relative to its parent. If it’s set to 50, then it means that the capacity is capped with respect to the parent. * Cluster Percentage: maximum capacity expressed as a percentage of the overall cluster capacity. Note that Fair Scheduler supports the following settings: * Single percentage (cluster) * Two percentages (cluster) * Absolute resources It is recommended that all three formats are supported for maximum-capacity after introducing weight mode. was: The property root.users.maximum-capacity could mean the following things: * Parent Percentage: maximum capacity relative to its parent. If it’s set to 50, then it means that the capacity is capped with respect to the parent. * Cluster Percentage: maximum capacity expressed as a percentage of the overall cluster capacity. Note that Fair Scheduler supports the following settings: * Single percentage (absolute) * Two percentages (absolute) * Absolute resources It is recommended that all three formats are supported for maximum-capacity after introducing weight mode. > Extend the maximum-capacity property to support Fair Scheduler migration > > > Key: YARN-10505 > URL: https://issues.apache.org/jira/browse/YARN-10505 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Priority: Major > > The property root.users.maximum-capacity could mean the following things: > * Parent Percentage: maximum capacity relative to its parent. If it’s set to > 50, then it means that the capacity is capped with respect to the parent. > * Cluster Percentage: maximum capacity expressed as a percentage of the > overall cluster capacity. 
> > Note that Fair Scheduler supports the following settings: > * Single percentage (cluster) > * Two percentages (cluster) > * Absolute resources > > It is recommended that all three formats are supported for maximum-capacity > after introducing weight mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
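For reference, the three Fair Scheduler formats listed above look roughly like this in a fair-scheduler.xml allocation file (queue names and values are made up for illustration; check the Fair Scheduler documentation for the exact accepted syntax):

```xml
<allocations>
  <!-- Single percentage of the cluster -->
  <queue name="users">
    <maxResources>50%</maxResources>
  </queue>
  <!-- Two percentages: memory and cpu as separate shares of the cluster -->
  <queue name="batch">
    <maxResources>40% memory, 80% cpu</maxResources>
  </queue>
  <!-- Absolute resources -->
  <queue name="adhoc">
    <maxResources>8192 mb, 8 vcores</maxResources>
  </queue>
</allocations>
```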
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348213#comment-17348213 ] Peter Bacsko commented on YARN-10779: - TODO: new property needs a "." character after the prefix variable. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348202#comment-17348202 ] Peter Bacsko commented on YARN-10779: - Ok, uploaded patch v1, I forgot to update {{yarn-default.xml}}, that will happen in the next patch. [~gandras] [~snemeth] care to review? > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10779: Attachment: YARN-10779-001.patch > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-001.patch, YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348147#comment-17348147 ] Peter Bacsko edited comment on YARN-10779 at 5/20/21, 8:34 AM: --- "It is a possibility that Node A sends this PB message to Node B, which means that this configuration must be set on Node B in order to take effect" What you're saying is true, but this applies to the whole Hadoop ecosystem, not just to this particular setting. Imagine having different container executors on different nodes, or security enabled/disabled, that would wreak havoc. Proper configuration management is mandatory and expected. was (Author: pbacsko): "It is a possibility that Node A sends this PB message to Node B, which means that this configuration must be set on Node B in order to take effect" What you're saying is true, but this applies to the whole Hadoop ecosystem, not just to this particular setting. Imagine having different container executors on different nodes, or security enabled/disabled, that would wreak havoc. Proper configuration management is mandatory and assumed. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. 
> Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not.
[jira] [Comment Edited] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348147#comment-17348147 ] Peter Bacsko edited comment on YARN-10779 at 5/20/21, 8:33 AM: --- "It is a possibility that Node A sends this PB message to Node B, which means that this configuration must be set on Node B in order to take effect" What you're saying is true, but this applies to the whole Hadoop ecosystem, not just to this particular setting. Imagine having different container executors on different nodes, or security enabled/disabled, that would wreak havoc. Proper configuration management is mandatory and assumed. was (Author: pbacsko): "It is a possibility that Node A sends this PB message to Node B, which means that this configuration must be set on Node B in order to take effect" What you're saying is true, but this applies to the whole Hadoop ecosystem, not just to this particular setting. Imagine having different container executors on different nodes, or security enabled/disabled, that would wreak havoc. Proper configuration management is a mandatory and assumed. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. 
> Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
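The forced conversion quoted above, together with the proposed configuration switch, can be sketched outside Hadoop as follows. This is an illustration only: the class name and the boolean flag are hypothetical stand-ins for the cached {{Configuration}} property the ticket proposes, not the actual YARN API, and {{StringUtils.toLowerCase}} is mirrored with a {{Locale.ENGLISH}} lower-casing.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/**
 * Hadoop-free sketch of the proposed behavior: application tags are
 * lower-cased only when a (hypothetical) switch is enabled. In the real
 * patch this flag would come from a Configuration object that loads
 * yarn-site.xml and is cached inside the PBImpl class.
 */
class TagConversionSketch {
  // Stand-in for the cached Configuration lookup proposed in the ticket.
  private final boolean forceLowercase;

  TagConversionSketch(boolean forceLowercase) {
    this.forceLowercase = forceLowercase;
  }

  Set<String> setApplicationTags(Set<String> tags) {
    Set<String> applicationTags = new TreeSet<>();
    for (String tag : tags) {
      // Hadoop's StringUtils.toLowerCase uses Locale.ENGLISH; mirrored here.
      applicationTags.add(
          forceLowercase ? tag.toLowerCase(Locale.ENGLISH) : tag);
    }
    return applicationTags;
  }
}
```

With the flag disabled, a "userid" tag keeps its original casing, which is the behavior the ticket asks for.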
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348147#comment-17348147 ] Peter Bacsko commented on YARN-10779: - "It is a possibility that Node A sends this PB message to Node B, which means that this configuration must be set on Node B in order to take effect" What you're saying is true, but this applies to the whole Hadoop ecosystem, not just to this particular setting. Imagine having different container executors on different nodes, or security enabled/disabled, that would wreak havoc. Proper configuration management is mandatory and assumed. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348090#comment-17348090 ] Peter Bacsko commented on YARN-10779: - [~gandras] these classes are locally instantiated adapter classes that mostly use generated ProtoBuf classes as an input. They are not sent over the wire. We don't directly work on ProtoBuf classes, but on these PBImpl classes, which are the implementations of the non-PBImpl classes like {{ApplicationSubmissionContext}}. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10725) Backport YARN-10120 to branch-3.3
[ https://issues.apache.org/jira/browse/YARN-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347659#comment-17347659 ] Peter Bacsko commented on YARN-10725: - Ok, +1 committed to branch-3.3. > Backport YARN-10120 to branch-3.3 > - > > Key: YARN-10725 > URL: https://issues.apache.org/jira/browse/YARN-10725 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10120-branch-3.3.patch, > YARN-10725-branch-3.3.patch, YARN-10725-branch-3.3.v2.patch, > YARN-10725-branch-3.3.v3.patch, YARN-10725-branch-3.3.v4.patch, > YARN-10725-branch-3.3.v5.patch, image-2021-04-05-16-48-57-034.png, > image-2021-04-05-16-50-55-238.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10725) Backport YARN-10120 to branch-3.3
[ https://issues.apache.org/jira/browse/YARN-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10725: Fix Version/s: 3.3.1 > Backport YARN-10120 to branch-3.3 > - > > Key: YARN-10725 > URL: https://issues.apache.org/jira/browse/YARN-10725 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Fix For: 3.3.1 > > Attachments: YARN-10120-branch-3.3.patch, > YARN-10725-branch-3.3.patch, YARN-10725-branch-3.3.v2.patch, > YARN-10725-branch-3.3.v3.patch, YARN-10725-branch-3.3.v4.patch, > YARN-10725-branch-3.3.v5.patch, image-2021-04-05-16-48-57-034.png, > image-2021-04-05-16-50-55-238.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10725) Backport YARN-10120 to branch-3.3
[ https://issues.apache.org/jira/browse/YARN-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347632#comment-17347632 ] Peter Bacsko commented on YARN-10725: - No, I'll fix during commit. Just one more thing: this is a backport, but what should be the commit message? The original? Or "YARN-10725. Backport YARN-10120 to branch-3.3". I think usually we keep the original commit message unless some heavy changes were necessary during the backport (too many conflicts, whatever). > Backport YARN-10120 to branch-3.3 > - > > Key: YARN-10725 > URL: https://issues.apache.org/jira/browse/YARN-10725 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10120-branch-3.3.patch, > YARN-10725-branch-3.3.patch, YARN-10725-branch-3.3.v2.patch, > YARN-10725-branch-3.3.v3.patch, YARN-10725-branch-3.3.v4.patch, > YARN-10725-branch-3.3.v5.patch, image-2021-04-05-16-48-57-034.png, > image-2021-04-05-16-50-55-238.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10725) Backport YARN-10120 to branch-3.3
[ https://issues.apache.org/jira/browse/YARN-10725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347625#comment-17347625 ] Peter Bacsko commented on YARN-10725: - [~BilwaST] so you're saying that v5 is good to go? I don't have too much context, if everything is fine, I can commit it. > Backport YARN-10120 to branch-3.3 > - > > Key: YARN-10725 > URL: https://issues.apache.org/jira/browse/YARN-10725 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10120-branch-3.3.patch, > YARN-10725-branch-3.3.patch, YARN-10725-branch-3.3.v2.patch, > YARN-10725-branch-3.3.v3.patch, YARN-10725-branch-3.3.v4.patch, > YARN-10725-branch-3.3.v5.patch, image-2021-04-05-16-48-57-034.png, > image-2021-04-05-16-50-55-238.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347617#comment-17347617 ] Peter Bacsko commented on YARN-10779: - [~BilwaST] [~jhung] you recently made modifications around here, do you think it's a viable approach? cc [~sunilg] [~snemeth]. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347481#comment-17347481 ] Peter Bacsko commented on YARN-10779: - Uploaded a POC without tests. > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10779: Summary: Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl (was: Add option to disable lowecase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl) > Add option to disable lowercase conversion in GetApplicationsRequestPBImpl > and ApplicationSubmissionContextPBImpl > - > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10779) Add option to disable lowecase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
[ https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10779: Attachment: YARN-10779-POC.patch > Add option to disable lowecase conversion in GetApplicationsRequestPBImpl and > ApplicationSubmissionContextPBImpl > > > Key: YARN-10779 > URL: https://issues.apache.org/jira/browse/YARN-10779 > Project: Hadoop YARN > Issue Type: Task > Components: resourcemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10779-POC.patch > > > In both {{GetApplicationsRequestPBImpl}} and > {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase > conversion: > {noformat} > checkTags(tags); > // Convert applicationTags to lower case and add > this.applicationTags = new TreeSet<>(); > for (String tag : tags) { > this.applicationTags.add(StringUtils.toLowerCase(tag)); > } > } > {noformat} > However, we encountered some cases where this is not desirable for "userid" > tags. > Proposed solution: since both classes are pretty low-level and can be often > instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should > be cached inside them. A new property should be created which tells whether > lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10779) Add option to disable lowecase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
Peter Bacsko created YARN-10779: --- Summary: Add option to disable lowecase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl Key: YARN-10779 URL: https://issues.apache.org/jira/browse/YARN-10779 Project: Hadoop YARN Issue Type: Task Components: resourcemanager Reporter: Peter Bacsko Assignee: Peter Bacsko In both {{GetApplicationsRequestPBImpl}} and {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase conversion: {noformat} checkTags(tags); // Convert applicationTags to lower case and add this.applicationTags = new TreeSet<>(); for (String tag : tags) { this.applicationTags.add(StringUtils.toLowerCase(tag)); } } {noformat} However, we encountered some cases where this is not desirable for "userid" tags. Proposed solution: since both classes are pretty low-level and can be often instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should be cached inside them. A new property should be created which tells whether lowercase conversion should occur or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10258) Add metrics for 'ApplicationsRunning' in NodeManager
[ https://issues.apache.org/jira/browse/YARN-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347419#comment-17347419 ] Peter Bacsko commented on YARN-10258: - Backported to branch-3.3. > Add metrics for 'ApplicationsRunning' in NodeManager > > > Key: YARN-10258 > URL: https://issues.apache.org/jira/browse/YARN-10258 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.3 >Reporter: ANANDA G B >Assignee: ANANDA G B >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10258-001.patch, YARN-10258-002.patch, > YARN-10258-003.patch, YARN-10258-005.patch, YARN-10258-006.patch, > YARN-10258-007.patch, YARN-10258-008.patch, YARN-10258-009.patch, > YARN-10258-010.patch, YARN-10258-branch-3.3.001.patch, YARN-10258_004.patch > > > Add metrics for 'ApplicationsRunning' in NodeManagers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10258) Add metrics for 'ApplicationsRunning' in NodeManager
[ https://issues.apache.org/jira/browse/YARN-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10258: Fix Version/s: 3.3.1 > Add metrics for 'ApplicationsRunning' in NodeManager > > > Key: YARN-10258 > URL: https://issues.apache.org/jira/browse/YARN-10258 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.3 >Reporter: ANANDA G B >Assignee: ANANDA G B >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10258-001.patch, YARN-10258-002.patch, > YARN-10258-003.patch, YARN-10258-005.patch, YARN-10258-006.patch, > YARN-10258-007.patch, YARN-10258-008.patch, YARN-10258-009.patch, > YARN-10258-010.patch, YARN-10258-branch-3.3.001.patch, YARN-10258_004.patch > > > Add metrics for 'ApplicationsRunning' in NodeManagers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10258) Add metrics for 'ApplicationsRunning' in NodeManager
[ https://issues.apache.org/jira/browse/YARN-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10258: Attachment: YARN-10258-branch-3.3.001.patch > Add metrics for 'ApplicationsRunning' in NodeManager > > > Key: YARN-10258 > URL: https://issues.apache.org/jira/browse/YARN-10258 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.3 >Reporter: ANANDA G B >Assignee: ANANDA G B >Priority: Minor > Fix For: 3.4.0 > > Attachments: YARN-10258-001.patch, YARN-10258-002.patch, > YARN-10258-003.patch, YARN-10258-005.patch, YARN-10258-006.patch, > YARN-10258-007.patch, YARN-10258-008.patch, YARN-10258-009.patch, > YARN-10258-010.patch, YARN-10258-branch-3.3.001.patch, YARN-10258_004.patch > > > Add metrics for 'ApplicationsRunning' in NodeManagers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-10258) Add metrics for 'ApplicationsRunning' in NodeManager
[ https://issues.apache.org/jira/browse/YARN-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reopened YARN-10258: - > Add metrics for 'ApplicationsRunning' in NodeManager > > > Key: YARN-10258 > URL: https://issues.apache.org/jira/browse/YARN-10258 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.3 >Reporter: ANANDA G B >Assignee: ANANDA G B >Priority: Minor > Fix For: 3.4.0 > > Attachments: YARN-10258-001.patch, YARN-10258-002.patch, > YARN-10258-003.patch, YARN-10258-005.patch, YARN-10258-006.patch, > YARN-10258-007.patch, YARN-10258-008.patch, YARN-10258-009.patch, > YARN-10258-010.patch, YARN-10258_004.patch > > > Add metrics for 'ApplicationsRunning' in NodeManagers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10258) Add metrics for 'ApplicationsRunning' in NodeManager
[ https://issues.apache.org/jira/browse/YARN-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346396#comment-17346396 ] Peter Bacsko commented on YARN-10258: - +1 Thanks for the patch [~gb.ana...@gmail.com] and [~BilwaST] / [~zhuqi] for the review, committed to trunk. > Add metrics for 'ApplicationsRunning' in NodeManager > > > Key: YARN-10258 > URL: https://issues.apache.org/jira/browse/YARN-10258 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.3 >Reporter: ANANDA G B >Assignee: ANANDA G B >Priority: Minor > Attachments: YARN-10258-001.patch, YARN-10258-002.patch, > YARN-10258-003.patch, YARN-10258-005.patch, YARN-10258-006.patch, > YARN-10258-007.patch, YARN-10258-008.patch, YARN-10258-009.patch, > YARN-10258-010.patch, YARN-10258_004.patch > > > Add metrics for 'ApplicationsRunning' in NodeManagers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10258) Add metrics for 'ApplicationsRunning' in NodeManager
[ https://issues.apache.org/jira/browse/YARN-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346154#comment-17346154 ] Peter Bacsko commented on YARN-10258: - [~BilwaST] yes, I'll check this out. > Add metrics for 'ApplicationsRunning' in NodeManager > > > Key: YARN-10258 > URL: https://issues.apache.org/jira/browse/YARN-10258 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.1.3 >Reporter: ANANDA G B >Assignee: ANANDA G B >Priority: Minor > Attachments: YARN-10258-001.patch, YARN-10258-002.patch, > YARN-10258-003.patch, YARN-10258-005.patch, YARN-10258-006.patch, > YARN-10258-007.patch, YARN-10258-008.patch, YARN-10258-009.patch, > YARN-10258-010.patch, YARN-10258_004.patch > > > Add metrics for 'ApplicationsRunning' in NodeManagers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10763) Add the number of containers assigned per second metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346098#comment-17346098 ] Peter Bacsko commented on YARN-10763: - +1 Thanks [~chaosju] for the patch and [~zhuqi] for the review. Committed to trunk. > Add the number of containers assigned per second metrics to ClusterMetrics > --- > > Key: YARN-10763 > URL: https://issues.apache.org/jira/browse/YARN-10763 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: chaosju >Assignee: chaosju >Priority: Minor > Attachments: YARN-10763.001.patch, YARN-10763.002.patch, > YARN-10763.003.patch, YARN-10763.004.patch, YARN-10763.005.patch, > YARN-10763.006.patch, YARN-10763.007.patch, YARN-10763.008.patch, > screenshot-1.png > > > It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for > measuring cluster throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10763) Add the number of containers assigned per second metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10763: Summary: Add the number of containers assigned per second metrics to ClusterMetrics (was: add the speed of containers assigned metrics to ClusterMetrics) > Add the number of containers assigned per second metrics to ClusterMetrics > --- > > Key: YARN-10763 > URL: https://issues.apache.org/jira/browse/YARN-10763 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: chaosju >Assignee: chaosju >Priority: Minor > Attachments: YARN-10763.001.patch, YARN-10763.002.patch, > YARN-10763.003.patch, YARN-10763.004.patch, YARN-10763.005.patch, > YARN-10763.006.patch, YARN-10763.007.patch, YARN-10763.008.patch, > screenshot-1.png > > > It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for > measuring cluster throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344626#comment-17344626 ] Peter Bacsko commented on YARN-9698: The number of subtasks under this JIRA just keeps growing. The converter called fs2cs is mostly complete. It's not perfect, but it's working. Although new additions are constantly coming, I don't see the point of keeping this particular ticket open, otherwise it will never be closed. I suggest creating a new "Phase II" JIRA and moving the current subtasks under that. Then we can mark this as Fix Version = 3.4.0. [~gandras], [~snemeth], [~zhuqi] opinions? > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > > > Key: YARN-9698 > URL: https://issues.apache.org/jira/browse/YARN-9698 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Weiwei Yang >Priority: Major > Labels: fs2cs > Attachments: FS-CS Migration.pdf > > > We see some users want to migrate from Fair Scheduler to Capacity Scheduler, > this Jira is created as an umbrella to track all related efforts for the > migration, the scope contains > * Bug fixes > * Add missing features > * Migration tools that help to generate CS configs based on FS, validate > configs etc > * Documents > this is part of CS component, the purpose is to make the migration process > smooth. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10759) Encapsulate queue config modes
[ https://issues.apache.org/jira/browse/YARN-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344573#comment-17344573 ] Peter Bacsko commented on YARN-10759: - Thanks [~gandras] for the patch. I just have one thing to note. I can see that {{allowZeroCapacitySum}} has been moved to {{AbstractCSQueue}}, although it's really something which is meant for {{ParentQueue}}. I assume this is because the new code is easier to read and no type checks and casts are necessary. Is that correct? I'm wondering if this can cause problems. Because right now, this logic only runs inside {{ParentQueue}}: {noformat} // We also allow children's percent sum = 0 under the following // conditions // - Parent uses weight mode // - Parent uses percent mode, and parent has // (capacity=0 OR allowZero) if (parentCapacityType == QueueCapacityType.PERCENT) { if ((Math.abs(queueCapacities.getCapacity(nodeLabel)) > PRECISION) && (!allowZeroCapacitySum)) { throw new IOException( "Illegal" + " capacity sum of " + childrenPctSum + " for children of queue " + queueName + " for label=" + nodeLabel + ". It is set to 0, but parent percent != 0, and " + "doesn't allow children capacity to set to 0"); } } } {noformat} But after this refactor, leaf queues will have this property too with it being set to "false". Although there are no unit test failures, we need to double check if this extra boolean flag on leafs can have any impact on the existing code.
> Encapsulate queue config modes > -- > > Key: YARN-10759 > URL: https://issues.apache.org/jira/browse/YARN-10759 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10759.001.patch, YARN-10759.002.patch, > YARN-10759.003.patch, YARN-10759.004.patch > > > Capacity Scheduler queues have three modes: > * relative/percentage > * weight > * absolute > Most of them have their own: > * validation logic > * config setting logic > * effective capacity calculation logic > These logics can be easily extracted and encapsulated in separate config mode > classes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
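The concern above can be boiled down to a minimal, Hadoop-free sketch. All names here are simplified stand-ins, not the real Capacity Scheduler classes: the point is that once {{allowZeroCapacitySum}} lives on the common base class, leaf queues carry the flag too, even though only the parent-queue percent-mode validation ever reads it.

```java
import java.io.IOException;

// Simplified stand-in for AbstractCSQueue: the flag now exists on every
// queue and defaults to false, including on leaf queues.
abstract class QueueSketch {
  protected boolean allowZeroCapacitySum = false;
}

// Simplified stand-in for a leaf queue: it carries the flag but never
// consults it; this is exactly the situation to double-check.
class LeafQueueSketch extends QueueSketch {
}

// Simplified stand-in for ParentQueue, the only place the flag is read.
class ParentQueueSketch extends QueueSketch {
  static final double PRECISION = 1e-6;

  /**
   * Mirrors the percent-mode check quoted above: a zero children sum is
   * only legal if the parent capacity is also 0 or the flag allows it.
   */
  void validateChildrenPctSum(double parentCapacity, double childrenPctSum)
      throws IOException {
    if (Math.abs(childrenPctSum) < PRECISION
        && Math.abs(parentCapacity) > PRECISION
        && !allowZeroCapacitySum) {
      throw new IOException("Illegal capacity sum of " + childrenPctSum
          + ": parent percent != 0 and zero children capacity not allowed");
    }
  }
}
```

Since the validation is only invoked from the parent-queue path, the unused default-false flag on leaves should be behavior-neutral, which is what the review asks to verify.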
[jira] [Comment Edited] (YARN-10759) Encapsulate queue config modes
[ https://issues.apache.org/jira/browse/YARN-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344573#comment-17344573 ] Peter Bacsko edited comment on YARN-10759 at 5/14/21, 12:54 PM: Thanks [~gandras] for the patch. I just have one thing to note. I can see that {{allowZeroCapacitySum}} has been moved to {{AbstractCSQueue}}, although it's really something which is meant for {{ParentQueue}}. I assume this is because the new code is easier to read and no type checks and casts are necessary. Is that correct? I'm wondering if this can cause problems. Because right now, this logic only runs inside {{ParentQueue}}: {noformat} // We also allow children's percent sum = 0 under the following // conditions // - Parent uses weight mode // - Parent uses percent mode, and parent has // (capacity=0 OR allowZero) if (parentCapacityType == QueueCapacityType.PERCENT) { if ((Math.abs(queueCapacities.getCapacity(nodeLabel)) > PRECISION) && (!allowZeroCapacitySum)) { throw new IOException( "Illegal" + " capacity sum of " + childrenPctSum + " for children of queue " + queueName + " for label=" + nodeLabel + ". It is set to 0, but parent percent != 0, and " + "doesn't allow children capacity to set to 0"); } } } {noformat} But after this refactor, leaf queues will have this property too with it being set to "false". Although there are no unit test failures, we need to double check if this extra boolean flag on leafs can have any impact on the existing code. was (Author: pbacsko): Thanks [~gandras] for the patch. I just have one to note. I can see that {{allowZeroCapacitySum}} has been moved to {{AbstractCSQueue}}, although it's really something which is meant for {{ParentQueue}}. I assume this is because the new code is easier to read and no type checks and casts are necessary. Is that correct? I'm wondering if this can cause problems. 
Because right now, this logic only runs inside {{ParentQueue}}: {noformat} // We also allow children's percent sum = 0 under the following // conditions // - Parent uses weight mode // - Parent uses percent mode, and parent has // (capacity=0 OR allowZero) if (parentCapacityType == QueueCapacityType.PERCENT) { if ((Math.abs(queueCapacities.getCapacity(nodeLabel)) > PRECISION) && (!allowZeroCapacitySum)) { throw new IOException( "Illegal" + " capacity sum of " + childrenPctSum + " for children of queue " + queueName + " for label=" + nodeLabel + ". It is set to 0, but parent percent != 0, and " + "doesn't allow children capacity to set to 0"); } } } {noformat} But after this refactor, leaf queues will have this property too with it being set to "false". Although there are no unit test failures, we need to double check if this extra boolean flag on leafs can have any impact on the existing code. > Encapsulate queue config modes > -- > > Key: YARN-10759 > URL: https://issues.apache.org/jira/browse/YARN-10759 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10759.001.patch, YARN-10759.002.patch, > YARN-10759.003.patch, YARN-10759.004.patch > > > Capacity Scheduler queues have three modes: > * relative/percentage > * weight > * absolute > Most of them have their own: > * validation logic > * config setting logic > * effective capacity calculation logic > These logics can be easily extracted and encapsulated in separate config mode > clas
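The encapsulation the ticket proposes can be sketched roughly as follows. This is a hypothetical illustration, not the actual patch: the names QueueCapacityConfigMode, PercentageCapacityMode and WeightCapacityMode are assumptions, and only the percent-mode zero-sum check quoted in the comment above is modeled.

```java
import java.io.IOException;

// Hypothetical sketch: each config mode encapsulates its own validation,
// so the queue classes need no mode-specific branches or casts.
interface QueueCapacityConfigMode {
    // Validates the capacity sum of a parent's children for one node label.
    void validateChildrenSum(float childrenSum, float parentCapacity,
                             boolean allowZeroCapacitySum) throws IOException;
}

class PercentageCapacityMode implements QueueCapacityConfigMode {
    private static final float PRECISION = 1e-5f;

    @Override
    public void validateChildrenSum(float childrenSum, float parentCapacity,
                                    boolean allowZeroCapacitySum)
            throws IOException {
        // Mirrors the quoted check: children may sum to 0 only if the
        // parent's capacity is 0 or zero sums are explicitly allowed.
        if (Math.abs(childrenSum) < PRECISION
                && Math.abs(parentCapacity) > PRECISION
                && !allowZeroCapacitySum) {
            throw new IOException("Illegal capacity sum of 0 for children"
                + " while parent percent != 0 and zero sums are not allowed");
        }
    }
}

class WeightCapacityMode implements QueueCapacityConfigMode {
    @Override
    public void validateChildrenSum(float childrenSum, float parentCapacity,
                                    boolean allowZeroCapacitySum) {
        // Weight mode always allows a zero sum, so there is nothing to check.
    }
}
```

With this shape, allowZeroCapacitySum stays a parameter of the parent-side validation instead of becoming state on every queue, which sidesteps the leaf-queue concern raised in the comment.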
[jira] [Commented] (YARN-10763) add the speed of containers assigned metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344518#comment-17344518 ] Peter Bacsko commented on YARN-10763: - Thanks [~chaosju] just a final update, very minor things: 1. "Containers assigned in last second" --> missing "the": "Containers assigned in *the* last second" 2. Comment is not necessary, purpose of the executor is trivial: {noformat} /** * The executor service that count containers assigned in last second. * */ {noformat} 3. Nit: space after if {noformat} if(INSTANCE != null && INSTANCE.getAssignCounterExecutor() != null) { INSTANCE.getAssignCounterExecutor().shutdownNow(); } {noformat} I have no further comments. > add the speed of containers assigned metrics to ClusterMetrics > --- > > Key: YARN-10763 > URL: https://issues.apache.org/jira/browse/YARN-10763 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: chaosju >Assignee: chaosju >Priority: Minor > Attachments: YARN-10763.001.patch, YARN-10763.002.patch, > YARN-10763.003.patch, YARN-10763.004.patch, YARN-10763.005.patch, > YARN-10763.006.patch, YARN-10763.007.patch, screenshot-1.png > > > It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for > measuring cluster throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
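Putting the review rounds in this thread together, the counter design could be sketched like this. All names here (AssignmentRateMetric, assignCounterExecutor, ContainerAssignmentCounterThread) are assumptions modeled on the review suggestions, not the actual YARN-10763 patch: an AtomicInteger counts assignments, and a single-threaded scheduled executor samples and resets it once per second with getAndSet(0).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a "containers assigned per second" metric.
class AssignmentRateMetric {
    private final AtomicInteger numContainersAssigned = new AtomicInteger(0);
    private volatile int numContainersAssignedPerSecond = 0;
    private final ScheduledExecutorService assignCounterExecutor =
        Executors.newSingleThreadScheduledExecutor(
            r -> new Thread(r, "ContainerAssignmentCounterThread"));

    AssignmentRateMetric() {
        // Once per second, publish the count and reset it atomically.
        assignCounterExecutor.scheduleAtFixedRate(
            () -> numContainersAssignedPerSecond =
                      numContainersAssigned.getAndSet(0),
            1, 1, TimeUnit.SECONDS);
    }

    void incrNumContainerAssigned() {
        numContainersAssigned.incrementAndGet();
    }

    int getNumContainersAssignedPerSecond() {
        return numContainersAssignedPerSecond;
    }

    void shutdown() {
        // shutdownNow() stops the sampling thread immediately.
        assignCounterExecutor.shutdownNow();
    }
}
```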
[jira] [Comment Edited] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343250#comment-17343250 ] Peter Bacsko edited comment on YARN-10571 at 5/12/21, 1:36 PM: --- Thanks [~gandras], committed to trunk. Also thanks [~shuzirra], [~zhuqi] for the review. was (Author: pbacsko): Thanks [~gandras], committed to trunk. > Refactor dynamic queue handling logic > - > > Key: YARN-10571 > URL: https://issues.apache.org/jira/browse/YARN-10571 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10571.001.patch, YARN-10571.002.patch, > YARN-10571.003.patch, YARN-10571.004.patch > > > As per YARN-10506 we have introduced an other mode for auto queue creation > and a new class, which handles it. We should move the old, managed queue > related logic to CSAutoQueueHandler as well, and do additional cleanup > regarding queue management. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343250#comment-17343250 ] Peter Bacsko commented on YARN-10571: - Thanks [~gandras], committed to trunk. > Refactor dynamic queue handling logic > - > > Key: YARN-10571 > URL: https://issues.apache.org/jira/browse/YARN-10571 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10571.001.patch, YARN-10571.002.patch, > YARN-10571.003.patch, YARN-10571.004.patch > > > As per YARN-10506 we have introduced an other mode for auto queue creation > and a new class, which handles it. We should move the old, managed queue > related logic to CSAutoQueueHandler as well, and do additional cleanup > regarding queue management. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343227#comment-17343227 ] Peter Bacsko commented on YARN-10571: - +1 LGTM > Refactor dynamic queue handling logic > - > > Key: YARN-10571 > URL: https://issues.apache.org/jira/browse/YARN-10571 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10571.001.patch, YARN-10571.002.patch, > YARN-10571.003.patch, YARN-10571.004.patch > > > As per YARN-10506 we have introduced an other mode for auto queue creation > and a new class, which handles it. We should move the old, managed queue > related logic to CSAutoQueueHandler as well, and do additional cleanup > regarding queue management. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10763) add the speed of containers assigned metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343152#comment-17343152 ] Peter Bacsko commented on YARN-10763: - Some further comments: 1. {{private volatile AtomicLong numContainersAssigned = new AtomicLong(0);}} - this doesn't have to be "volatile". 2. "containersAssignedCounter" should be renamed to something else. If this is an executor, then use the name "assignCounterExecutor" or something like that. 3. Also, "containersAssignedCounter" should not be a static field. It's part of an object which just happens to be a singleton. 4. Initialize "containersAssignedCounter" in the constructor. 5. Use something else for the thread name instead of "ContainersAssigned_Counter", for example, "ContainerAssignmentCounterThread". 6. "numContainersAssignedLast" should be int and not long. Therefore, you don't need the type cast at {{numContainerAssignedPerSecond.set((int)n);}}. 7. Also, it would be best to reset the counter "numContainersAssigned" to 0. Now it's an AtomicLong and I think it's unlikely to overflow, but if we're just counting in an interval, it makes sense to reset it. Also, it does not have to be long, just int. So this is preferred: {{numContainersAssignedLast = numContainersAssigned.getAndSet(0)}}. 8. {{containersAssignedCounter.shutdown();}} --> use {{shutdownNow()}} to immediately stop the executor. 9. The test case: {noformat} @Test public void testClusterMetrics() throws Exception { assert(metrics != null); Assert.assertTrue(!metrics.numContainerAssignedPerSecond.changed()); metrics.incrNumContainerAssigned(); Thread.sleep(2000); Assert.assertEquals(metrics.getnumContainerAssignedPerSecond(), 1L); } {noformat} The first assertion is the default Java "assert", which only takes effect with the "-ea" command line switch. Just remove it. 
Also, instead of a fixed {{Thread.sleep()}}, use: {noformat} GenericTestUtils.waitFor(new Supplier<Boolean>() { @Override public Boolean get() { return metrics.getnumContainerAssignedPerSecond() == 1; } }, 500, 5000); {noformat} > add the speed of containers assigned metrics to ClusterMetrics > --- > > Key: YARN-10763 > URL: https://issues.apache.org/jira/browse/YARN-10763 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: chaosju >Assignee: chaosju >Priority: Minor > Attachments: YARN-10763.001.patch, YARN-10763.002.patch, > YARN-10763.003.patch, YARN-10763.004.patch, screenshot-1.png > > > It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for > measuring cluster throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
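The polling suggestion can be illustrated with a minimal stand-in. This is a simplified assumption -- Hadoop's real GenericTestUtils.waitFor has the same shape but richer error handling -- shown to make the point that polling beats a fixed Thread.sleep(): the test finishes as soon as the condition holds, and fails loudly on timeout instead of asserting too early.

```java
import java.util.function.Supplier;

// Minimal stand-in for a waitFor-style test helper (assumed API, not
// Hadoop's actual class): poll a condition every intervalMs, give up
// after timeoutMs.
final class WaitUtil {
    static void waitFor(Supplier<Boolean> check, long intervalMs,
                        long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!check.get()) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException(
                    "Timed out waiting for condition");
            }
            Thread.sleep(intervalMs);
        }
    }
}
```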
[jira] [Commented] (YARN-10763) add the speed of containers assigned metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343139#comment-17343139 ] Peter Bacsko commented on YARN-10763: - [~chaosju] I added you to the list of contributors so you can assign the JIRA to yourself. Are the unit test failures related? The number of failed tests looks huge. > add the speed of containers assigned metrics to ClusterMetrics > --- > > Key: YARN-10763 > URL: https://issues.apache.org/jira/browse/YARN-10763 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: chaosju >Assignee: chaosju >Priority: Minor > Attachments: YARN-10763.001.patch, YARN-10763.002.patch, > YARN-10763.003.patch, YARN-10763.004.patch, screenshot-1.png > > > It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for > measuring cluster throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10763) add the speed of containers assigned metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reassigned YARN-10763: --- Assignee: chaosju > add the speed of containers assigned metrics to ClusterMetrics > --- > > Key: YARN-10763 > URL: https://issues.apache.org/jira/browse/YARN-10763 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: chaosju >Assignee: chaosju >Priority: Minor > Attachments: YARN-10763.001.patch, YARN-10763.002.patch, > YARN-10763.003.patch, YARN-10763.004.patch, screenshot-1.png > > > It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for > measuring cluster throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9615: --- Fix Version/s: 3.3.1 > Add dispatcher metrics to RM > > > Key: YARN-9615 > URL: https://issues.apache.org/jira/browse/YARN-9615 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-9615-branch-3.3-001.patch, YARN-9615.001.patch, > YARN-9615.002.patch, YARN-9615.003.patch, YARN-9615.004.patch, > YARN-9615.005.patch, YARN-9615.006.patch, YARN-9615.007.patch, > YARN-9615.008.patch, YARN-9615.009.patch, YARN-9615.010.patch, > YARN-9615.011.patch, YARN-9615.011.patch, YARN-9615.poc.patch, > image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, > screenshot-1.png > > > It'd be good to have counts/processing times for each event type in RM async > dispatcher and scheduler async dispatcher. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342768#comment-17342768 ] Peter Bacsko commented on YARN-9615: Committed to branch-3.3. > Add dispatcher metrics to RM > > > Key: YARN-9615 > URL: https://issues.apache.org/jira/browse/YARN-9615 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-9615-branch-3.3-001.patch, YARN-9615.001.patch, > YARN-9615.002.patch, YARN-9615.003.patch, YARN-9615.004.patch, > YARN-9615.005.patch, YARN-9615.006.patch, YARN-9615.007.patch, > YARN-9615.008.patch, YARN-9615.009.patch, YARN-9615.010.patch, > YARN-9615.011.patch, YARN-9615.011.patch, YARN-9615.poc.patch, > image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, > screenshot-1.png > > > It'd be good to have counts/processing times for each event type in RM async > dispatcher and scheduler async dispatcher. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342764#comment-17342764 ] Peter Bacsko commented on YARN-9615: OK, the raw type stuff can be ignored, this is the same on trunk, {{EventTypeMetrics}} is not parameterized. > Add dispatcher metrics to RM > > > Key: YARN-9615 > URL: https://issues.apache.org/jira/browse/YARN-9615 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Hung >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9615-branch-3.3-001.patch, YARN-9615.001.patch, > YARN-9615.002.patch, YARN-9615.003.patch, YARN-9615.004.patch, > YARN-9615.005.patch, YARN-9615.006.patch, YARN-9615.007.patch, > YARN-9615.008.patch, YARN-9615.009.patch, YARN-9615.010.patch, > YARN-9615.011.patch, YARN-9615.011.patch, YARN-9615.poc.patch, > image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, > screenshot-1.png > > > It'd be good to have counts/processing times for each event type in RM async > dispatcher and scheduler async dispatcher. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10505) Extend the maximum-capacity property to support Fair Scheduler migration
[ https://issues.apache.org/jira/browse/YARN-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342641#comment-17342641 ] Peter Bacsko commented on YARN-10505: - I would change the naming a bit. "Relative Percentage" -> "Parent Percentage" "Absolute Percentage" -> "Cluster Percentage" I think it's much clearer. Absolute/relative in this context it's hard to grasp or even sounds like an oxymoron. > Extend the maximum-capacity property to support Fair Scheduler migration > > > Key: YARN-10505 > URL: https://issues.apache.org/jira/browse/YARN-10505 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Priority: Major > > The property root.users.maximum-capacity could mean the following things: > * Relative Percentage: maximum capacity relative to its parent. If it’s set > to 50, then it means that the capacity is capped with respect to the parent. > * Absolute Percentage: maximum capacity expressed as a percentage of the > overall cluster capacity. > > Note that Fair Scheduler supports the following settings: > * Single percentage (absolute) > * Two percentages (absolute) > * Absolute resources > > It is recommended that all three formats are supported for maximum-capacity > after introducing weight mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10505) Extend the maximum-capacity property to support Fair Scheduler migration
[ https://issues.apache.org/jira/browse/YARN-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342641#comment-17342641 ] Peter Bacsko edited comment on YARN-10505 at 5/11/21, 3:14 PM: --- I would change the naming a bit. "Relative Percentage" -> "Parent Percentage" "Absolute Percentage" -> "Cluster Percentage" I think it's much clearer. Absolute/relative in this context is hard to grasp or even sounds like an oxymoron. was (Author: pbacsko): I would change the naming a bit. "Relative Percentage" -> "Parent Percentage" "Absolute Percentage" -> "Cluster Percentage" I think it's much clearer. Absolute/relative in this context it's hard to grasp or even sounds like an oxymoron. > Extend the maximum-capacity property to support Fair Scheduler migration > > > Key: YARN-10505 > URL: https://issues.apache.org/jira/browse/YARN-10505 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Priority: Major > > The property root.users.maximum-capacity could mean the following things: > * Relative Percentage: maximum capacity relative to its parent. If it’s set > to 50, then it means that the capacity is capped with respect to the parent. > * Absolute Percentage: maximum capacity expressed as a percentage of the > overall cluster capacity. > > Note that Fair Scheduler supports the following settings: > * Single percentage (absolute) > * Two percentages (absolute) > * Absolute resources > > It is recommended that all three formats are supported for maximum-capacity > after introducing weight mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995
[ https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342555#comment-17342555 ] Peter Bacsko commented on YARN-10642: - Committed to branch-3.1 too. Closing. > Race condition: AsyncDispatcher can get stuck by the changes introduced in > YARN-8995 > > > Key: YARN-10642 > URL: https://issues.apache.org/jira/browse/YARN-10642 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: zhengchenyu >Assignee: zhengchenyu >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: MockForDeadLoop.java, YARN-10642-branch-3.1.001.patch, > YARN-10642-branch-3.2.001.patch, YARN-10642-branch-3.2.002.patch, > YARN-10642-branch-3.3.001.patch, YARN-10642.001.patch, YARN-10642.002.patch, > YARN-10642.003.patch, YARN-10642.004.patch, YARN-10642.005.patch, > deadloop.png, debugfornode.png, put.png, take.png > > > In our cluster, the ResourceManager got stuck twice within twenty days and the YARN client > couldn't submit applications. I captured jstack output the second time, which revealed the > reason. > Analyzing all the jstack dumps, I found many threads blocked because they could not acquire > LinkedBlockingQueue.putLock. (Note: for reasons of space, the > analytical process is omitted.) > The root cause is one thread holding the putLock the whole time: > printEventQueueDetails calls forEachRemaining, which then holds the putLock and > takeLock, so the AsyncDispatcher gets stuck. 
> {code} > Thread 6526 (IPC Server handler 454 on default port 8030): > State: RUNNABLE > Blocked count: 29988 > Waited count: 2035029 > Stack: > > java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926) > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215) > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75) > > org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) > > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432) > > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040) > 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958) > java.security.AccessController.doPrivileged(Native Method) > {code} > I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in > LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and take() > are called from different threads. > YARN-8995 introduced the printEventQueueDetails method, and > "eventQueue.stream().collect" calls the forEachRemaining method. > Why? "put.png" shows how put("a") works, and "take.png" shows > how take() works. Special note: a removed Node points to itself to help GC! > The key code is in forEachRemaining: LBQSpliterator uses > forEachRemaining to visit every Node, but after reading an item's value from a Node it > releases the lock. If take() is called at that moment, > the variable 'p' in forEachRemaining may end up pointing to a Node that points to itself, > and forEachRemaining falls into a dead loop. You can see it in "deadloop.png". > A simple unit test reproduces the problem when forEachRemaining runs more slowly than > take(); the unit test is MockForDeadLoop.java. > Debugging MockForDeadLoop.java, I saw a Node pointing to itself; see "debugfornode.png"
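One way to avoid the spliterator hazard described above is to snapshot the queue instead of streaming it live. This is a hedged sketch, not necessarily the actual YARN-10642 fix: LinkedBlockingQueue.toArray() holds both the putLock and takeLock for the entire copy, so it never traverses a removed, self-linked Node, and the counting then runs on a plain array.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

// Hypothetical sketch of a safe printEventQueueDetails-style summary:
// snapshot first, then group, so no lock is held during the grouping
// and no live spliterator traversal happens.
final class QueueDetails {
    static Map<String, Long> countByClass(LinkedBlockingQueue<?> queue) {
        Object[] snapshot = queue.toArray(); // consistent copy under full lock
        return Arrays.stream(snapshot)
            .collect(Collectors.groupingBy(
                e -> e.getClass().getSimpleName(), Collectors.counting()));
    }
}
```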
[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995
[ https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10642: Attachment: YARN-10642-branch-3.1.001.patch > Race condition: AsyncDispatcher can get stuck by the changes introduced in > YARN-8995 > > > Key: YARN-10642 > URL: https://issues.apache.org/jira/browse/YARN-10642 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: zhengchenyu >Assignee: zhengchenyu >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: MockForDeadLoop.java, YARN-10642-branch-3.1.001.patch, > YARN-10642-branch-3.2.001.patch, YARN-10642-branch-3.2.002.patch, > YARN-10642-branch-3.3.001.patch, YARN-10642.001.patch, YARN-10642.002.patch, > YARN-10642.003.patch, YARN-10642.004.patch, YARN-10642.005.patch, > deadloop.png, debugfornode.png, put.png, take.png > > > In our cluster, the ResourceManager got stuck twice within twenty days and the YARN client > couldn't submit applications. I captured jstack output the second time, which revealed the > reason. > Analyzing all the jstack dumps, I found many threads blocked because they could not acquire > LinkedBlockingQueue.putLock. (Note: for reasons of space, the > analytical process is omitted.) > The root cause is one thread holding the putLock the whole time: > printEventQueueDetails calls forEachRemaining, which then holds the putLock and > takeLock, so the AsyncDispatcher gets stuck. 
> {code} > Thread 6526 (IPC Server handler 454 on default port 8030): > State: RUNNABLE > Blocked count: 29988 > Waited count: 2035029 > Stack: > > java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926) > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215) > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75) > > org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) > > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432) > > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040) > 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958) > java.security.AccessController.doPrivileged(Native Method) > {code} > I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in > LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and take() > are called from different threads. > YARN-8995 introduced the printEventQueueDetails method, and > "eventQueue.stream().collect" calls the forEachRemaining method. > Why? "put.png" shows how put("a") works, and "take.png" shows > how take() works. Special note: a removed Node points to itself to help GC! > The key code is in forEachRemaining: LBQSpliterator uses > forEachRemaining to visit every Node, but after reading an item's value from a Node it > releases the lock. If take() is called at that moment, > the variable 'p' in forEachRemaining may end up pointing to a Node that points to itself, > and forEachRemaining falls into a dead loop. You can see it in "deadloop.png". > A simple unit test reproduces the problem when forEachRemaining runs more slowly than > take(); the unit test is MockForDeadLoop.java. > Debugging MockForDeadLoop.java, I saw a Node pointing to itself; see "debugfornode.png". > Environment: > OS: CentOS Linux release 7.5.1804 (Core) > JDK: jdk1.8.0_281
[jira] [Reopened] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995
[ https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reopened YARN-10642: - > Race condition: AsyncDispatcher can get stuck by the changes introduced in > YARN-8995 > > > Key: YARN-10642 > URL: https://issues.apache.org/jira/browse/YARN-10642 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: zhengchenyu >Assignee: zhengchenyu >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, > YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, > YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, > YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, > put.png, take.png > > > In our cluster, the ResourceManager got stuck twice within twenty days and the YARN client > couldn't submit applications. I captured jstack output the second time, which revealed the > reason. > Analyzing all the jstack dumps, I found many threads blocked because they could not acquire > LinkedBlockingQueue.putLock. (Note: for reasons of space, the > analytical process is omitted.) > The root cause is one thread holding the putLock the whole time: > printEventQueueDetails calls forEachRemaining, which then holds the putLock and > takeLock, so the AsyncDispatcher gets stuck. 
> {code} > Thread 6526 (IPC Server handler 454 on default port 8030): > State: RUNNABLE > Blocked count: 29988 > Waited count: 2035029 > Stack: > > java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926) > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215) > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75) > > org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) > > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432) > > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040) > 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958) > java.security.AccessController.doPrivileged(Native Method) > {code} > I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in > LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and take() > are called from different threads. > YARN-8995 introduced the printEventQueueDetails method, and > "eventQueue.stream().collect" calls the forEachRemaining method. > Why? "put.png" shows how put("a") works, and "take.png" shows > how take() works. Special note: a removed Node points to itself to help GC! > The key code is in forEachRemaining: LBQSpliterator uses > forEachRemaining to visit every Node, but after reading an item's value from a Node it > releases the lock. If take() is called at that moment, > the variable 'p' in forEachRemaining may end up pointing to a Node that points to itself, > and forEachRemaining falls into a dead loop. You can see it in "deadloop.png". > A simple unit test reproduces the problem when forEachRemaining runs more slowly than > take(); the unit test is MockForDeadLoop.java. > Debugging MockForDeadLoop.java, I saw a Node pointing to itself; see "debugfornode.png". > Environment: > OS: CentOS Linux release 7.5.1804 (Core) > JDK: jdk1.8.0_281 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org