[jira] [Created] (YARN-10958) Use correct configuration for Group service init in CSMappingPlacementRule
Peter Bacsko created YARN-10958:
---

Summary: Use correct configuration for Group service init in CSMappingPlacementRule
Key: YARN-10958
URL: https://issues.apache.org/jira/browse/YARN-10958
Project: Hadoop YARN
Issue Type: Bug
Reporter: Peter Bacsko

There is a potential problem in {{CSMappingPlacementRule.java}}:

{noformat}
if (groups == null) {
  groups = Groups.getUserToGroupsMappingService(conf);
}
{noformat}

The problem is that we are supposed to pass {{scheduler.getConf()}}. The "conf" object is the configuration for Capacity Scheduler, which does not include the property that selects the group service provider. Therefore, the current code only works by chance, because the Group mapping service is already initialized at this point. See the original fix in YARN-10053.

A unit test is also needed to verify this. Idea:
# Create a Configuration object in which the property "hadoop.security.group.mapping" refers to an existing test implementation.
# Add a new method to {{Groups}} which nulls out the singleton instance, eg. {{Groups.reset()}}.
# Create a mock CapacityScheduler where {{getConf()}} and {{getConfiguration()}} contain different settings for "hadoop.security.group.mapping". Since {{getConf()}} is the service config, this should return the config object created in step #1.
# Create an instance of {{CSMappingPlacementRule}} with a single primary group rule.
# Run the placement evaluation.
# Expected: the returned queue matches what is supposed to come from the test group mapping service ("testuser" --> "testqueue").
# Modify "hadoop.security.group.mapping" in the config object created in step #1.
# Call {{Groups.refresh()}}, which changes the group mapping ("testuser" --> "testqueue2"). This requires that the test group mapping service implement {{GroupMappingServiceProvider.cacheGroupsRefresh()}}.
# Create a new instance of {{CSMappingPlacementRule}}.
# Run the placement evaluation again.
# Expected: with the same user, the target queue has changed.

This looks convoluted, but these steps make sure that:
# {{CSMappingPlacementRule}} will force the initialization of groups.
# We select the correct configuration for group service init.
# We don't create a new {{Groups}} instance if the singleton is initialized, so we cover the original problem described in YARN-10597.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
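The singleton behavior the test idea above relies on can be sketched in plain Java. This is a minimal, self-contained sketch: the {{Conf}} and {{Groups}} classes here are simplified stand-ins, not the real Hadoop classes, but the lazy-init pattern mirrors {{Groups.getUserToGroupsMappingService(conf)}} and shows why passing the wrong config currently "works by chance" (only the first caller's config matters):

```java
import java.util.HashMap;

public class GroupServiceSketch {
    // Stand-in for org.apache.hadoop.conf.Configuration.
    static class Conf extends HashMap<String, String> { }

    static class Groups {
        private static Groups instance;
        final String provider;

        private Groups(Conf conf) {
            this.provider = conf.getOrDefault("hadoop.security.group.mapping", "none");
        }

        // Mirrors Groups.getUserToGroupsMappingService(conf): only the FIRST
        // caller's config is used; later configs are silently ignored.
        static synchronized Groups getUserToGroupsMappingService(Conf conf) {
            if (instance == null) {
                instance = new Groups(conf);
            }
            return instance;
        }

        // Step 2 of the test idea: null out the singleton.
        static synchronized void reset() {
            instance = null;
        }
    }

    public static void main(String[] args) {
        Conf serviceConf = new Conf();   // analogue of scheduler.getConf()
        serviceConf.put("hadoop.security.group.mapping", "TestGroupMapping");
        Conf csConf = new Conf();        // capacity-scheduler config: property missing

        // If the service is already initialized, passing the wrong config is harmless.
        Groups g1 = Groups.getUserToGroupsMappingService(serviceConf);
        Groups g2 = Groups.getUserToGroupsMappingService(csConf);
        System.out.println(g1.provider + " " + (g1 == g2));

        // But after reset(), initializing with the CS config picks the wrong provider.
        Groups.reset();
        Groups g3 = Groups.getUserToGroupsMappingService(csConf);
        System.out.println(g3.provider);
    }
}
```

This is exactly why the unit test must reset the singleton first: without {{Groups.reset()}}, the bug is masked by earlier initialization.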
[jira] [Created] (YARN-10848) Vcore usage problem with Default/DominantResourceCalculator
Peter Bacsko created YARN-10848:
---

Summary: Vcore usage problem with Default/DominantResourceCalculator
Key: YARN-10848
URL: https://issues.apache.org/jira/browse/YARN-10848
Project: Hadoop YARN
Issue Type: Bug
Components: capacity scheduler, capacityscheduler
Reporter: Peter Bacsko

If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating containers even if we run out of vcores.

CS checks the available resources at two places. The first check is {{CapacityScheduler.allocateContainerOnSingleNode()}}:

{noformat}
if (calculator.computeAvailableContainers(Resources
    .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
    minimumAllocation) <= 0) {
  LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient "
      + "available or preemptible resource for minimum allocation");
{noformat}

The second, which is more important, is located in {{RegularContainerAllocator.assignContainer()}}:

{noformat}
if (!Resources.fitsIn(rc, capability, totalResource)) {
  LOG.warn("Node : " + node.getNodeID()
      + " does not have sufficient resource for ask : " + pendingAsk
      + " node total capability : " + node.getTotalResource());
  // Skip this locality request
  ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
      activitiesManager, node, application, schedulerKey,
      ActivityDiagnosticConstant.
          NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST
          + getResourceDiagnostics(capability, totalResource),
      ActivityLevel.NODE);
  return ContainerAllocation.LOCALITY_SKIPPED;
}
{noformat}

Here, {{rc}} is the resource calculator instance, and the other two values are:

{noformat}
Resource capability = pendingAsk.getPerAllocationResource();
Resource available = node.getUnallocatedResource();
{noformat}

There is a repro unit test attached to this case which can demonstrate the problem. The root cause is that we pass the resource calculator to {{Resources.fitsIn()}}. Instead, we should use an overridden version, just like in {{FSAppAttempt.assignContainer()}}:

{noformat}
// Can we allocate a container on this node?
if (Resources.fitsIn(capability, available)) {
  // Inform the application of the new container for this request
  RMContainer allocatedContainer =
      allocate(type, node, schedulerKey, pendingAsk, reservedContainer);
{noformat}

In CS, if we switch to DominantResourceCalculator OR use {{Resources.fitsIn()}} without the calculator in {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}).
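The difference between the two {{fitsIn}} overloads can be illustrated with a simplified sketch (the {{Resource}} type and method names below are stand-ins, not the actual Hadoop classes): a DefaultResourceCalculator-style check compares memory only, so an ask can "fit" on a node whose vcores are exhausted, while a per-resource check rejects it.

```java
public class FitsInSketch {
    // Simplified stand-in for org.apache.hadoop.yarn.api.records.Resource.
    record Resource(long memoryMb, int vcores) { }

    // Analogue of Resources.fitsIn(rc, capability, available) when rc is a
    // DefaultResourceCalculator: only the memory dimension is compared.
    static boolean fitsInMemoryOnly(Resource ask, Resource avail) {
        return ask.memoryMb() <= avail.memoryMb();
    }

    // Analogue of the calculator-free Resources.fitsIn(capability, available):
    // every resource dimension must fit.
    static boolean fitsInAllResources(Resource ask, Resource avail) {
        return ask.memoryMb() <= avail.memoryMb() && ask.vcores() <= avail.vcores();
    }

    public static void main(String[] args) {
        Resource ask = new Resource(1024, 1);
        Resource noVcoresLeft = new Resource(4096, 0); // memory free, vcores exhausted

        System.out.println(fitsInMemoryOnly(ask, noVcoresLeft));   // true: container wrongly accepted
        System.out.println(fitsInAllResources(ask, noVcoresLeft)); // false: allocation correctly skipped
    }
}
```

This is the same behavioral gap the attached repro test exercises: the memory-only comparison is what lets CS keep allocating past the vcore limit.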
[jira] [Resolved] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YARN-9698.
---
Fix Version/s: 3.4.0
Resolution: Fixed

> [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
> ----------------------------------------------------------------------------
>
> Key: YARN-9698
> URL: https://issues.apache.org/jira/browse/YARN-9698
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler
> Reporter: Weiwei Yang
> Priority: Major
> Labels: fs2cs
> Fix For: 3.4.0
>
> Attachments: FS-CS Migration.pdf
>
> We see that some users want to migrate from Fair Scheduler to Capacity
> Scheduler. This Jira is created as an umbrella to track all related efforts
> for the migration; the scope contains:
> * Bug fixes
> * Adding missing features
> * Migration tools that help to generate CS configs based on FS, validate
>   configs, etc.
> * Documents
> This is part of the CS component; the purpose is to make the migration
> process smooth.
[jira] [Created] (YARN-10843) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
Peter Bacsko created YARN-10843:
---

Summary: [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
Key: YARN-10843
URL: https://issues.apache.org/jira/browse/YARN-10843
Project: Hadoop YARN
Issue Type: Task
Reporter: Peter Bacsko
[jira] [Created] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
Peter Bacsko created YARN-10796:
---

Summary: Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
Key: YARN-10796
URL: https://issues.apache.org/jira/browse/YARN-10796
Project: Hadoop YARN
Issue Type: Task
Components: capacity scheduler, capacityscheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko

If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it cannot properly scale out even if its max-capacity and the parent's max-capacity would allow it.

Example:

{noformat}
Cluster Capacity: 16 GB / 16 vcores (2 nodes, each with 8 GB / 8 vcores)
Container allocation size: 1 GB / 1 vcore

Root.dynamic Effective Capacity: ( 50.0%)
Root.dynamic Effective Max Capacity: (100.0%)

Template:
  Capacity: 40%
  Max Capacity: 100%
  User Limit Factor: 4
{noformat}

leaf-queue-template.capacity = 40%
leaf-queue-template.maximum-capacity = 100%
leaf-queue-template.maximum-am-resource-percent = 50%
leaf-queue-template.minimum-user-limit-percent = 100%
leaf-queue-template.user-limit-factor = 4

"root.dynamic" has a maximum capacity of 100% and a capacity of 50%. Let's assume there are running containers in these dynamic queues (MR sleep jobs):

root.dynamic.user1 = 1 AM + 3 containers (capacity = 40%)
root.dynamic.user2 = 1 AM + 3 containers (capacity = 40%)
root.dynamic.user3 = 1 AM + 15 containers (capacity = 0%)

This scenario results in an underutilized cluster: there will be approximately 18% unused capacity. On the other hand, it's still possible to submit a new application to root.dynamic.user1 or root.dynamic.user2 and reach 100% utilization.
[jira] [Created] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
Peter Bacsko created YARN-10779:
---

Summary: Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
Key: YARN-10779
URL: https://issues.apache.org/jira/browse/YARN-10779
Project: Hadoop YARN
Issue Type: Task
Components: resourcemanager
Reporter: Peter Bacsko
Assignee: Peter Bacsko

In both {{GetApplicationsRequestPBImpl}} and {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase conversion:

{noformat}
checkTags(tags);
// Convert applicationTags to lower case and add
this.applicationTags = new TreeSet<>();
for (String tag : tags) {
  this.applicationTags.add(StringUtils.toLowerCase(tag));
}
{noformat}

However, we encountered some cases where this is not desirable for "userid" tags.

Proposed solution: since both classes are fairly low-level and are often instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should be cached inside them. A new property should be created which tells whether the lowercase conversion should occur or not.
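The proposed change could look roughly like the sketch below. Note that the property name and the {{convertTags}} helper are hypothetical, invented here for illustration; the actual property name and wiring (reading it from the cached {{Configuration}}) would be decided in the patch.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TagConversionSketch {
    // HYPOTHETICAL property name, for illustration only.
    static final String FORCE_LOWERCASE_TAGS = "yarn.application.tags.force-lowercase";

    // Replacement for the unconditional loop in the PBImpl classes: the
    // lowercase conversion happens only when the flag is enabled.
    static Set<String> convertTags(Iterable<String> tags, boolean forceLowercase) {
        Set<String> result = new TreeSet<>();
        for (String tag : tags) {
            // The real code uses Hadoop's StringUtils.toLowerCase(tag).
            result.add(forceLowercase ? tag.toLowerCase() : tag);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(convertTags(List.of("UserId=Alice"), true));  // converted
        System.out.println(convertTags(List.of("UserId=Alice"), false)); // preserved
    }
}
```

With the flag defaulting to true, existing behavior is preserved and only clusters that opt out keep the original tag casing.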
[jira] [Resolved] (YARN-8786) LinuxContainerExecutor fails sporadically in create_local_dirs
[ https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YARN-8786.
---
Resolution: Fixed

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --------------------------------------------------------------
>
> Key: YARN-8786
> URL: https://issues.apache.org/jira/browse/YARN-8786
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Jon Bender
> Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: Container container_1530684675517_516620_01_020846 transitioned from SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source, it's traceable to errors here:
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
> and ultimately to
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code
> / errno is swallowed so we don't have more details. We tend to see this when
> many containers start at the same time for the same application on a host,
> and suspect it may be related to some race conditions around those shared
> directories between containers for the same application.
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO nodemanager.NMAuditLogger: USER=root IP=<> OPERATION=Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1530684675517_559126 CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO nodemanager.NMAuditLogger: USER=root IP=<> OPERATION=Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1530684675517_559126 CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1530684675517_559126 CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Two containers for the same application start in quick succession, followed
> by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon, but I don't expect this to be fixed by the
> upgrade; the only major JIRAs that affected the executor since 3.0.0 seem
> unrelated
> ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8]
> and
> [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56])
[jira] [Resolved] (YARN-10643) Fix the race condition introduced by YARN-8995.
[ https://issues.apache.org/jira/browse/YARN-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YARN-10643.
---
Resolution: Duplicate

> Fix the race condition introduced by YARN-8995.
> -----------------------------------------------
>
> Key: YARN-10643
> URL: https://issues.apache.org/jira/browse/YARN-10643
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.3.0, 3.2.1
> Reporter: Qi Zhu
> Assignee: zhengchenyu
> Priority: Critical
> Attachments: YARN-10643.001.patch
>
> The race condition introduced by -YARN-8995.-
> The problem has been raised in YARN-10221 and also in YARN-10642.
> I think we should fix it in a hurry.
> I will help fix it.
[jira] [Created] (YARN-10631) Document AM-preemption related changes (YARN-9537 and YARN-10625)
Peter Bacsko created YARN-10631:
---

Summary: Document AM-preemption related changes (YARN-9537 and YARN-10625)
Key: YARN-10631
URL: https://issues.apache.org/jira/browse/YARN-10631
Project: Hadoop YARN
Issue Type: Task
Reporter: Peter Bacsko
Assignee: Peter Bacsko

Preemption-related changes were introduced in YARN-9537 and YARN-10625. These also introduce new properties which are not documented for Fair Scheduler. Extend the documentation with these enhancements.
[jira] [Created] (YARN-10625) FairScheduler: add global flag to disable AM-preemption
Peter Bacsko created YARN-10625:
---

Summary: FairScheduler: add global flag to disable AM-preemption
Key: YARN-10625
URL: https://issues.apache.org/jira/browse/YARN-10625
Project: Hadoop YARN
Issue Type: Improvement
Components: fairscheduler
Affects Versions: 3.3.0
Reporter: Peter Bacsko
Assignee: Peter Bacsko

YARN-9537 added a feature to disable AM preemption on a per-queue basis. This is a nice enhancement, but it's very inconvenient if the cluster has a lot of queues, or if queues are dynamically created/deleted regularly (static queue configuration changes). It's a legitimate use case to have AM preemption turned off completely. To make this easier, add a property which acts as a global flag for this feature.
[jira] [Created] (YARN-10620) fs2cs: parentQueue for certain placement rules are not set during conversion
Peter Bacsko created YARN-10620:
---

Summary: fs2cs: parentQueue for certain placement rules are not set during conversion
Key: YARN-10620
URL: https://issues.apache.org/jira/browse/YARN-10620
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko
[jira] [Created] (YARN-10599) fs2cs should generate new "auto-queue-creation-v2.enabled" properties for all parents
Peter Bacsko created YARN-10599:
---

Summary: fs2cs should generate new "auto-queue-creation-v2.enabled" properties for all parents
Key: YARN-10599
URL: https://issues.apache.org/jira/browse/YARN-10599
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Peter Bacsko
Assignee: Peter Bacsko
[jira] [Created] (YARN-10593) Fix incorrect string comparison in GpuDiscoverer
Peter Bacsko created YARN-10593:
---

Summary: Fix incorrect string comparison in GpuDiscoverer
Key: YARN-10593
URL: https://issues.apache.org/jira/browse/YARN-10593
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Peter Bacsko
Assignee: Peter Bacsko

The following comparison in {{GpuDiscoverer}} is invalid:

{noformat}
binaryPath = configuredBinaryFile;
// If path exists but file name is incorrect don't execute the file
String fileName = binaryPath.getName();
if (DEFAULT_BINARY_NAME.equals(fileName)) {   <--- inverse condition needed
  String msg = String.format("Please check the configuration value of"
      + " %s. It should point to an %s binary.",
      YarnConfiguration.NM_GPU_PATH_TO_EXEC, DEFAULT_BINARY_NAME);
  throwIfNecessary(new YarnException(msg), config);
  LOG.warn(msg);
}
{noformat}

Obviously, it should be the other way around: we should log a warning or throw an exception if the file names *differ*, not when they are equal. Consider adding a unit test for this.
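A minimal sketch of the corrected logic, with the validation extracted into a testable helper. This is illustrative only: it assumes {{DEFAULT_BINARY_NAME}} is "nvidia-smi" (as in {{GpuDiscoverer}}) and replaces the throw/log machinery with a returned message so the condition itself can be unit tested.

```java
public class GpuBinaryCheckSketch {
    // Assumed value, matching GpuDiscoverer's expected GPU binary.
    static final String DEFAULT_BINARY_NAME = "nvidia-smi";

    // Corrected check: complain only when the configured file name DIFFERS
    // from the expected binary name (the current code has this inverted).
    static String validate(String configuredFileName) {
        if (!DEFAULT_BINARY_NAME.equals(configuredFileName)) {
            return "Please check the configured path. It should point to a "
                + DEFAULT_BINARY_NAME + " binary.";
        }
        return null; // names match, nothing to report
    }

    public static void main(String[] args) {
        System.out.println(validate(DEFAULT_BINARY_NAME));    // null: valid name
        System.out.println(validate("some-other-binary"));     // warning message
    }
}
```

A unit test for this is then a pair of one-line assertions on matching and non-matching names.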
[jira] [Created] (YARN-10577) Automatically convert placement rules in fs2cs
Peter Bacsko created YARN-10577:
---

Summary: Automatically convert placement rules in fs2cs
Key: YARN-10577
URL: https://issues.apache.org/jira/browse/YARN-10577
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko

Currently, users have to use the "\-m" or "\-\-convert-placement-rules" switch to convert the placement rules from FS. Initially, we converted to the old mapping rule format, which has serious limitations, so we disabled the automatic conversion. With the new JSON-based format and placement engine, this conversion should happen automatically.
[jira] [Created] (YARN-10576) Update Capacity Scheduler about JSON-based placement mapping
Peter Bacsko created YARN-10576:
---

Summary: Update Capacity Scheduler about JSON-based placement mapping
Key: YARN-10576
URL: https://issues.apache.org/jira/browse/YARN-10576
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko

The weight mode and AQC also affect how the new placement engine in CS works. Certain statements in the documentation are no longer valid, for example:
* create flag: "Only applies to managed queue parents" - there is no ManagedParentQueue in weight mode.
* "The nested rules primaryGroupUser and secondaryGroupUser expects the parent queues to exist, ie. they cannot be created automatically". This only applies to the legacy absolute/percentage mode.

Find all statements that mention possible limitations and fix them if necessary.
[jira] [Created] (YARN-10573) Enhance placement rule conversion in fs2cs in weight mode
Peter Bacsko created YARN-10573:
---

Summary: Enhance placement rule conversion in fs2cs in weight mode
Key: YARN-10573
URL: https://issues.apache.org/jira/browse/YARN-10573
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko

If we're using weight mode, we have much more freedom when it comes to placement rules. In YARN-10525, weight conversion became the default in {{fs2cs}}. This also means that we can support nested rules properly and queues can be created under {{root}}. Therefore, a lot of warnings and validations inside {{QueuePlacementConverter}} are not necessary and are only relevant if the user chose percentage-based conversion on the command line.
[jira] [Created] (YARN-10570) Remove "experimental" warning message from fs2cs
Peter Bacsko created YARN-10570:
---

Summary: Remove "experimental" warning message from fs2cs
Key: YARN-10570
URL: https://issues.apache.org/jira/browse/YARN-10570
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko

Although the {{fs2cs}} tool has been in constant development, it has been used and tested by a group of people, so let's remove the following message:

{{WARNING: This feature is experimental and not intended for production use!}}
[jira] [Created] (YARN-10563) Fix dependency exclusion problem in poms
Peter Bacsko created YARN-10563:
---

Summary: Fix dependency exclusion problem in poms
Key: YARN-10563
URL: https://issues.apache.org/jira/browse/YARN-10563
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko
[jira] [Created] (YARN-10515) Fix flaky test TestCapacitySchedulerAutoQueueCreation.testDynamicAutoQueueCreationWithTags
Peter Bacsko created YARN-10515:
---

Summary: Fix flaky test TestCapacitySchedulerAutoQueueCreation.testDynamicAutoQueueCreationWithTags
Key: YARN-10515
URL: https://issues.apache.org/jira/browse/YARN-10515
Project: Hadoop YARN
Issue Type: Bug
Reporter: Peter Bacsko
Assignee: Peter Bacsko

The testcase TestCapacitySchedulerAutoQueueCreation.testDynamicAutoQueueCreationWithTags sometimes fails with the following error:

{noformat}
org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize queues
  at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:174)
  at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:884)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1296)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:339)
  at org.apache.hadoop.yarn.server.resourcemanager.MockRM.serviceInit(MockRM.java:1018)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
  at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:158)
  at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:134)
  at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:130)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAutoQueueCreation$5.<init>(TestCapacitySchedulerAutoQueueCreation.java:873)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAutoQueueCreation.testDynamicAutoQueueCreationWithTags(TestCapacitySchedulerAutoQueueCreation.java:873)
{noformat}

We have to reset queue metrics before running this test to make sure it passes.
[jira] [Created] (YARN-10507) Add the capability to fs2cs to write the converted placement rules inside capacity-scheduler.xml
Peter Bacsko created YARN-10507:
---

Summary: Add the capability to fs2cs to write the converted placement rules inside capacity-scheduler.xml
Key: YARN-10507
URL: https://issues.apache.org/jira/browse/YARN-10507
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko

Currently, the fs2cs tool generates a separate {{mapping-rules.json}} file when it converts the placement rules. However, we also support having the JSON inlined inside {{capacity-scheduler.xml}}. Add a command line switch so that we can choose the desired output.
[jira] [Resolved] (YARN-10103) Capacity scheduler: add support for create=true/false per mapping rule
[ https://issues.apache.org/jira/browse/YARN-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YARN-10103.
---
Resolution: Won't Do

> Capacity scheduler: add support for create=true/false per mapping rule
> ----------------------------------------------------------------------
>
> Key: YARN-10103
> URL: https://issues.apache.org/jira/browse/YARN-10103
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Priority: Major
> Labels: fs2cs
>
> You can't ask Capacity Scheduler for a mapping to create a queue if it
> doesn't exist.
> For example, this mapping would use the first rule if the queue exists. If it
> doesn't, then it proceeds to the next rule:
> {{u:%user:%primary_group.%user:create=false;u:%user%:root.default}}
> Let's say user "alice" belongs to the "admins" group. It would first try to
> map to {{root.admins.alice}}. But if the queue doesn't exist, then it places
> the application into {{root.default}}.
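The fallback semantics described above can be sketched as follows. The {{Rule}} type and the {{place}} helper are hypothetical stand-ins for the real mapping-rule engine: a rule with create=false only matches when its resolved target queue already exists, otherwise evaluation falls through to the next rule.

```java
import java.util.List;
import java.util.Set;

public class MappingRuleSketch {
    // Simplified: targetQueue is assumed already resolved (e.g. %primary_group.%user).
    record Rule(String targetQueue, boolean create) { }

    // Returns the first applicable target queue, or null if no rule matches.
    static String place(List<Rule> rules, Set<String> existingQueues) {
        for (Rule rule : rules) {
            // create=true may create the queue; create=false requires it to exist.
            if (rule.create() || existingQueues.contains(rule.targetQueue())) {
                return rule.targetQueue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Analogue of u:%user:%primary_group.%user:create=false;u:%user%:root.default
        // for user "alice" in group "admins":
        List<Rule> rules = List.of(
            new Rule("root.admins.alice", false),
            new Rule("root.default", true));

        System.out.println(place(rules, Set.of("root.admins.alice", "root.default")));
        System.out.println(place(rules, Set.of("root.default"))); // queue missing -> fallback
    }
}
```

The first evaluation lands in root.admins.alice; the second falls through to root.default, which is exactly the behavior the rule string in the description asks for.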
[jira] [Created] (YARN-10486) FS-CS converter: handle case when weight=0
Peter Bacsko created YARN-10486:
---

Summary: FS-CS converter: handle case when weight=0
Key: YARN-10486
URL: https://issues.apache.org/jira/browse/YARN-10486
Project: Hadoop YARN
Issue Type: Sub-task
Components: yarn
Reporter: Peter Bacsko
Assignee: Peter Bacsko

We can encounter an ArithmeticException if there is a single queue, or there are multiple queues, under a parent with zero weight.
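A sketch of the failure mode and a possible guard. The conversion of sibling FS weights to CS capacity percentages divides by the sum of the siblings' weights, which blows up when every weight is 0. The even-split fallback below is just one illustrative policy, not necessarily what the converter should do:

```java
import java.util.List;

public class WeightConversionSketch {
    // Converts one queue's weight to a capacity percentage among its siblings.
    // Without the sum == 0 guard, an all-zero parent would divide by zero.
    static double toCapacityPercent(int weight, List<Integer> siblingWeights) {
        int sum = siblingWeights.stream().mapToInt(Integer::intValue).sum();
        if (sum == 0) {
            // Illustrative fallback: split the parent's capacity evenly.
            return 100.0 / siblingWeights.size();
        }
        return 100.0 * weight / sum;
    }

    public static void main(String[] args) {
        System.out.println(toCapacityPercent(2, List.of(2, 2))); // 50.0
        System.out.println(toCapacityPercent(0, List.of(0, 0))); // 50.0 instead of an exception
    }
}
```

Note that with integer arithmetic the unguarded version throws ArithmeticException outright, while a floating-point version would silently produce NaN; either way the converter needs an explicit zero-sum branch.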
[jira] [Created] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
Peter Bacsko created YARN-10460: --- Summary: Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail Key: YARN-10460 URL: https://issues.apache.org/jira/browse/YARN-10460 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, test Reporter: Peter Bacsko Assignee: Peter Bacsko In our downstream build environment, we're using JUnit 4.13. Recently, we discovered a truly weird test failure in TestNodeStatusUpdater. The problem is that timeout handling has changed in Junit 4.13. See the difference between these two snippets: 4.12 {noformat} @Override public void evaluate() throws Throwable { CallableStatement callable = new CallableStatement(); FutureTask task = new FutureTask(callable); threadGroup = new ThreadGroup("FailOnTimeoutGroup"); Thread thread = new Thread(threadGroup, task, "Time-limited test"); thread.setDaemon(true); thread.start(); callable.awaitStarted(); Throwable throwable = getResult(task, thread); if (throwable != null) { throw throwable; } } {noformat} 4.13 {noformat} @Override public void evaluate() throws Throwable { CallableStatement callable = new CallableStatement(); FutureTask task = new FutureTask(callable); ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup"); Thread thread = new Thread(threadGroup, task, "Time-limited test"); try { thread.setDaemon(true); thread.start(); callable.awaitStarted(); Throwable throwable = getResult(task, thread); if (throwable != null) { throw throwable; } } finally { try { thread.join(1); } catch (InterruptedException e) { Thread.currentThread().interrupt(); } try { threadGroup.destroy(); < This } catch (IllegalThreadStateException e) { // If a thread from the group is still alive, the ThreadGroup cannot be destroyed. // Swallow the exception to keep the same behavior prior to this change. } } } {noformat} The change comes from [https://github.com/junit-team/junit4/pull/1517]. 
Unfortunately, destroying the thread group causes an issue because there are all sorts of object caching in the IPC layer. The exception is: {noformat} java.lang.IllegalThreadStateException at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867) at java.lang.Thread.init(Thread.java:402) at java.lang.Thread.init(Thread.java:349) at java.lang.Thread.(Thread.java:675) at java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613) at com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163) at java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136) at org.apache.hadoop.ipc.Client.call(Client.java:1458) at org.apache.hadoop.ipc.Client.call(Client.java:1405) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy81.startContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576) {noformat} Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} is stored as long as they're needed. 
But since the backing thread group was destroyed in the previous test, it's no longer possible to create new threads. A quick workaround is to stop the clients and completely clear the {{ClientCache}} in {{ProtobufRpcEngine}} before each testcase. I tried this and it solves the problem, but it feels hacky. Not sure if there is a better approach. -- This message was sent by Atlassian Jira (v8.3.4#803005) ---
[jira] [Created] (YARN-10454) Add applicationName policy
Peter Bacsko created YARN-10454: --- Summary: Add applicationName policy Key: YARN-10454 URL: https://issues.apache.org/jira/browse/YARN-10454 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
Peter Bacsko created YARN-10447: --- Summary: TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing Key: YARN-10447 URL: https://issues.apache.org/jira/browse/YARN-10447 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Peter Bacsko Assignee: Peter Bacsko YARN-9784 fixed some concurrency related issues in {{TestLeafQueue}}, but not all of them. Occasionally it's still possible to receive an exception from Mockito, and the following two stack traces can be observed in the console:

{noformat}
org.mockito.exceptions.misusing.WrongTypeOfReturnValue:
Integer cannot be returned by isMultiNodePlacementEnabled()
isMultiNodePlacementEnabled() should return boolean
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies -
   - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.
{noformat}

and

{noformat}
2020-09-22 14:44:52,584 INFO [main] capacity.TestUtils (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail=
2020-09-22 14:44:52,585 INFO [main] capacity.TestUtils (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail=
Exception in thread "ActivitiesManager thread."
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Boolean
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
It's probably best to disable the ActivitiesManager thread entirely in this test class; there is no need for it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (YARN-10424) Adapt existing AppName and UserGroupMapping unittests to ensure backwards compatibility
[ https://issues.apache.org/jira/browse/YARN-10424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-10424. - Resolution: Fixed > Adapt existing AppName and UserGroupMapping unittests to ensure backwards > compatibility > --- > > Key: YARN-10424 > URL: https://issues.apache.org/jira/browse/YARN-10424 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10424.001.patch, YARN-10424.002.patch, > YARN-10424.003.patch > > > The class {{UserGroupMappingPlacementRule}} and > {{AppNameMappingPlacementRule}} will disappear. In order to ensure backwards > compatibility when the configuration is defined in the legacy format, > {{TestAppNameMappingPlacementRule}} and {{TestUserGroupMappingPlacementRule}} > should be adapted to use the new evaluator logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10413) Change fs2cs to generate mapping rules in the new format
Peter Bacsko created YARN-10413: --- Summary: Change fs2cs to generate mapping rules in the new format Key: YARN-10413 URL: https://issues.apache.org/jira/browse/YARN-10413 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10387) Implement logic which returns MappingRule objects based on mapping rules
Peter Bacsko created YARN-10387: --- Summary: Implement logic which returns MappingRule objects based on mapping rules Key: YARN-10387 URL: https://issues.apache.org/jira/browse/YARN-10387 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10386) Create new JSON schema for Placement Rules
Peter Bacsko created YARN-10386: --- Summary: Create new JSON schema for Placement Rules Key: YARN-10386 URL: https://issues.apache.org/jira/browse/YARN-10386 Project: Hadoop YARN Issue Type: Sub-task Components: capacity scheduler, capacityscheduler Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10330) Add missing test scenarios to TestUserGroupMappingPlacementRule
Peter Bacsko created YARN-10330: --- Summary: Add missing test scenarios to TestUserGroupMappingPlacementRule Key: YARN-10330 URL: https://issues.apache.org/jira/browse/YARN-10330 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, capacityscheduler, test Reporter: Peter Bacsko Assignee: Peter Bacsko After running {{TestUserGroupMappingPlacementRule}} with EclEmma, it turned out that there are at least 8-10 missing scenarios that are not covered. Since we're planning to enhance mapping rule logic with extra features, it is crucial to have good coverage so that we can verify backward compatibility. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10325) Document max-parallel-apps for Capacity Scheduler
Peter Bacsko created YARN-10325: --- Summary: Document max-parallel-apps for Capacity Scheduler Key: YARN-10325 URL: https://issues.apache.org/jira/browse/YARN-10325 Project: Hadoop YARN Issue Type: Sub-task Components: capacity scheduler, capacityscheduler Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10316) FS-CS converter: convert userMaxApps, maxRunningApps settings
Peter Bacsko created YARN-10316: --- Summary: FS-CS converter: convert userMaxApps, maxRunningApps settings Key: YARN-10316 URL: https://issues.apache.org/jira/browse/YARN-10316 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko In YARN-9930, support for maximum running applications (called "max parallel apps") has been introduced. The converter now can handle the following settings in {{fair-scheduler.xml}}:
* {{maxRunningApps}} per user
* {{maxRunningApps}} per queue
* {{userMaxAppsDefault}}
* {{queueMaxAppsDefault}}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
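For reference, these Fair Scheduler settings map to the max-parallel-apps properties on the Capacity Scheduler side introduced by YARN-9930. The fragment below is illustrative: the queue path "root.users", the user "alice", and the values are assumptions, not converter output.

```xml
<!-- capacity-scheduler.xml: illustrative conversion targets -->
<property>
  <!-- cluster-wide default for the per-queue limit -->
  <name>yarn.scheduler.capacity.max-parallel-apps</name>
  <value>100</value>
</property>
<property>
  <!-- per-queue limit; "root.users" is a hypothetical queue path -->
  <name>yarn.scheduler.capacity.root.users.max-parallel-apps</name>
  <value>10</value>
</property>
<property>
  <!-- default limit applied per user -->
  <name>yarn.scheduler.capacity.user.max-parallel-apps</name>
  <value>5</value>
</property>
<property>
  <!-- limit for one specific user; "alice" is hypothetical -->
  <name>yarn.scheduler.capacity.user.alice.max-parallel-apps</name>
  <value>5</value>
</property>
```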
[jira] [Resolved] (YARN-9888) Capacity scheduler: add support for default maxRunningApps limit per user
[ https://issues.apache.org/jira/browse/YARN-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9888. Resolution: Duplicate This feature will be implemented in YARN-9930. Closing this as duplicate. > Capacity scheduler: add support for default maxRunningApps limit per user > - > > Key: YARN-9888 > URL: https://issues.apache.org/jira/browse/YARN-9888 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler > Reporter: Peter Bacsko > Assignee: Peter Bacsko > Priority: Major > > Fair scheduler has the setting {{userMaxAppsDefault}} which limits how many > running applications each user can have. > Capacity scheduler lacks this feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (YARN-9887) Capacity scheduler: add support for limiting maxRunningApps per user
[ https://issues.apache.org/jira/browse/YARN-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9887. Resolution: Duplicate Closing this as duplicate. Implementation is tracked under YARN-9930.
> Capacity scheduler: add support for limiting maxRunningApps per user
>
> Key: YARN-9887
> URL: https://issues.apache.org/jira/browse/YARN-9887
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacity scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
>
> Fair Scheduler supports limiting the number of applications that a particular
> user can submit:
> {noformat}
> <user name="...">
>   <maxRunningApps>10</maxRunningApps>
> </user>
> {noformat}
> Capacity Scheduler does not have an exact equivalent.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10283) Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used
Peter Bacsko created YARN-10283: --- Summary: Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used Key: YARN-10283 URL: https://issues.apache.org/jira/browse/YARN-10283 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko Recently we've been investigating a scenario where applications submitted to a lower priority queue could not get scheduled because a higher priority queue in the same hierarchy could not satisfy the allocation request. Both queues belonged to the same partition. If we disabled node labels, the problem disappeared. The problem is that {{RegularContainerAllocator}} always allocated a container for the request, even if it should not have.

*Example:*
* Cluster total resources: 3 nodes, 15GB, 24 vcores
* Partition "shared" was created with 2 nodes
* "root.lowprio" (priority = 20) and "root.highprio" (priority = 40) were added to the partition
* Both queues have a limit of
* Using DominantResourceCalculator

Setup: Submit distributed shell application to highprio with switches "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per container.

Chain of events:
1. Queue is filled with containers until it reaches usage
2. A node update event is pushed to CS from a node which is part of the partition
3. {{AbstractCSQueue.canAssignToQueue()}} returns true because it's smaller than the current limit resource
4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an allocated container for
5. But we can't commit the resource request because we would have 9 vcores in total, violating the limit.

The problem is that we always try to assign a container for the same application in each heartbeat from "highprio". Applications in "lowprio" cannot make progress.

*Problem:* {{RegularContainerAllocator.assignContainer()}} does not handle this case well.
We only reject allocation if this condition is satisfied:
{noformat}
if (rmContainer == null && reservationsContinueLooking
    && node.getLabels().isEmpty()) {
{noformat}
But if we have node labels, we succeed with the allocation if there's room for a container. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (YARN-10158) FS-CS converter: convert property yarn.scheduler.fair.update-interval-ms
[ https://issues.apache.org/jira/browse/YARN-10158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-10158. - Resolution: Won't Do > FS-CS converter: convert property yarn.scheduler.fair.update-interval-ms > > > Key: YARN-10158 > URL: https://issues.apache.org/jira/browse/YARN-10158 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10257) FS-CS converter: check deprecated increment properties for mem/vcores and fix DRF check
Peter Bacsko created YARN-10257: --- Summary: FS-CS converter: check deprecated increment properties for mem/vcores and fix DRF check Key: YARN-10257 URL: https://issues.apache.org/jira/browse/YARN-10257 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko Two issues have been discovered during fs2cs testing:

1. The values of two properties are not checked:
{{yarn.scheduler.increment-allocation-mb}}
{{yarn.scheduler.increment-allocation-vcores}}
Although these two are marked as deprecated, they're still in use and must be handled.

2. The following piece of code is incorrect - the default scheduling policy can be different from DRF, which is a problem if DRF is used everywhere:
{code}
private boolean isDrfUsed(FairScheduler fs) {
  FSQueue rootQueue = fs.getQueueManager().getRootQueue();
  AllocationConfiguration allocConf = fs.getAllocationConfiguration();
  String defaultPolicy = allocConf.getDefaultSchedulingPolicy().getName();

  if (DominantResourceFairnessPolicy.NAME.equals(defaultPolicy)) {
    return true;
  } else {
    return isDrfUsedOnQueueLevel(rootQueue);
  }
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10234) FS-CS converter: don't enable auto-create queue property for root
Peter Bacsko created YARN-10234: --- Summary: FS-CS converter: don't enable auto-create queue property for root Key: YARN-10234 URL: https://issues.apache.org/jira/browse/YARN-10234 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko The auto-create-child-queue property should not be enabled for root, otherwise it causes an exception inside the capacity scheduler:
{noformat}
2020-04-14 09:48:54,117 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2020-04-14 09:48:54,117 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:772)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
        at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:636)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: java.io.IOException: Failed to re-init queues : null
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:489)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:430)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:761)
        ... 6 more
Caused by: java.lang.ClassCastException
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
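As a sketch of the constraint (the queue path "root.users" below is illustrative, and the property name follows the documented legacy auto-queue-creation syntax): the flag belongs on a managed parent queue, never on root itself.

```xml
<!-- Valid: auto-create enabled on a managed parent queue -->
<property>
  <name>yarn.scheduler.capacity.root.users.auto-create-child-queue.enabled</name>
  <value>true</value>
</property>

<!-- Invalid: the converter must not emit this for root;
     it triggers the reinitialization failure shown above -->
<!--
<property>
  <name>yarn.scheduler.capacity.root.auto-create-child-queue.enabled</name>
  <value>true</value>
</property>
-->
```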
[jira] [Created] (YARN-10226) NPE when using %primary_group queue mapping
Peter Bacsko created YARN-10226: --- Summary: NPE when using %primary_group queue mapping Key: YARN-10226 URL: https://issues.apache.org/jira/browse/YARN-10226 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko If we use the following queue mapping: {{u:%user:%primary_group}} then we get an NPE inside ResourceManager:
{noformat}
2020-04-06 11:59:13,883 ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart(881)) - Failed to load/recover state
java.lang.NullPointerException
        at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.getQueue(CapacitySchedulerQueueManager.java:138)
        at org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule.getContextForPrimaryGroup(UserGroupMappingPlacementRule.java:163)
        at org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule.getPlacementForUser(UserGroupMappingPlacementRule.java:118)
        at org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule.getPlacementForApp(UserGroupMappingPlacementRule.java:227)
        at org.apache.hadoop.yarn.server.resourcemanager.placement.PlacementManager.placeApplication(PlacementManager.java:67)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.placeApplication(RMAppManager.java:827)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:378)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:367)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:594)
        ...
{noformat}
We need to check whether the parent queue is null in {{UserGroupMappingPlacementRule.getContextForPrimaryGroup()}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10198) [managedParent].%primary_group placement doesn't work after YARN-9868
Peter Bacsko created YARN-10198: --- Summary: [managedParent].%primary_group placement doesn't work after YARN-9868 Key: YARN-10198 URL: https://issues.apache.org/jira/browse/YARN-10198 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko YARN-9868 introduced an unnecessary check if we have the following placement rule: [managedParentQueue].%primary_group Here, {{%primary_group}} is expected to be created if it doesn't exist. However, there is this validation code which is not necessary:
{noformat}
} else if (mapping.getQueue().equals(PRIMARY_GROUP_MAPPING)) {
  if (this.queueManager
      .getQueue(groups.getGroups(user).get(0)) != null) {
    return getPlacementContext(mapping, groups.getGroups(user).get(0));
  } else {
    return null;
  }
{noformat}
We should revert this part to the original version:
{noformat}
} else if (mapping.queue.equals(PRIMARY_GROUP_MAPPING)) {
  return getPlacementContext(mapping, groups.getGroups(user).get(0));
}
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10197) FS-CS converter: fix emitted ordering policy string and max-am-resource percent value
Peter Bacsko created YARN-10197: --- Summary: FS-CS converter: fix emitted ordering policy string and max-am-resource percent value Key: YARN-10197 URL: https://issues.apache.org/jira/browse/YARN-10197 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10193) FS-CS converter: fix incorrect capacity conversion
Peter Bacsko created YARN-10193: --- Summary: FS-CS converter: fix incorrect capacity conversion Key: YARN-10193 URL: https://issues.apache.org/jira/browse/YARN-10193 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko Conversion of capacities is incorrect if the total doesn't add up exactly to 100.00%. The loop condition must be fixed:
{noformat}
for (int i = 0; i < children.size() - 2; i++) {
{noformat}
The testcase needs to be fixed too:
{noformat}
assertEquals("root.default capacity", "33.333",
    csConfig.get(PREFIX + "root.default.capacity"));
assertEquals("root.admins capacity", "33.333",
    csConfig.get(PREFIX + "root.admins.capacity"));
assertEquals("root.users capacity", "66.667",
    csConfig.get(PREFIX + "root.users.capacity"));
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
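The underlying rounding pitfall can be illustrated outside the converter. The following standalone sketch (not the converter's actual code, and using equal weights rather than the queues above) shows the usual remedy: round every child's percentage except the last, then let the last child absorb the remainder so the siblings sum to exactly 100%.

```java
import java.util.Arrays;

public class CapacitySplit {
    // Convert raw weights to percentages (3 decimal places) that sum to
    // exactly 100: round all children but the last, then give the last
    // child whatever remains.
    static double[] toPercentages(double[] weights) {
        double total = Arrays.stream(weights).sum();
        double[] pct = new double[weights.length];
        double assigned = 0.0;
        for (int i = 0; i < weights.length - 1; i++) {
            pct[i] = Math.round(weights[i] / total * 100_000.0) / 1000.0;
            assigned += pct[i];
        }
        // Last child absorbs the rounding remainder.
        pct[weights.length - 1] = Math.round((100.0 - assigned) * 1000.0) / 1000.0;
        return pct;
    }

    public static void main(String[] args) {
        // Three equal children: naive rounding would give 3 x 33.333 = 99.999.
        double[] pct = toPercentages(new double[] {1, 1, 1});
        System.out.println(Arrays.toString(pct)); // [33.333, 33.333, 33.334]
    }
}
```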
[jira] [Created] (YARN-10191) FS-CS converter: call System.exit() for every code path in main()
Peter Bacsko created YARN-10191: --- Summary: FS-CS converter: call System.exit() for every code path in main() Key: YARN-10191 URL: https://issues.apache.org/jira/browse/YARN-10191 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko Note that we don't always call {{System.exit()}} on the happy path scenario in the converter:
{code:java}
public static void main(String[] args) {
  try {
    FSConfigToCSConfigArgumentHandler fsConfigConversionArgumentHandler =
        new FSConfigToCSConfigArgumentHandler();
    int exitCode = fsConfigConversionArgumentHandler.parseAndConvert(args);
    if (exitCode != 0) {
      LOG.error(FATAL, "Error while starting FS configuration conversion, " +
          "see previous error messages for details!");
      System.exit(exitCode);
    }
  } catch (Throwable t) {
    LOG.error(FATAL, "Error while starting FS configuration conversion!", t);
    System.exit(-1);
  }
}
{code}
This is a mistake. If there's any non-daemon thread hanging around which was started by either FS or CS, the tool will never terminate. We must call {{System.exit()}} in every case to make sure that it never blocks at the end. -- This message was sent by Atlassian Jira (v8.3.4#803005)
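The hang described above can be reproduced without YARN at all. This standalone sketch (the class and worker are hypothetical, standing in for a lingering scheduler thread) shows why: any leftover non-daemon thread keeps the JVM alive after main() returns, so the happy path needs an explicit System.exit() too.

```java
public class NonDaemonHang {
    // Starts a worker like the ones FS/CS services may leave behind.
    static Thread startLingeringWorker() {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000); // simulates a lingering background task
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(false); // non-daemon: JVM waits for it on normal return
        worker.start();
        return worker;
    }

    public static void main(String[] args) {
        Thread worker = startLingeringWorker();
        System.out.println("conversion finished, worker alive: " + worker.isAlive());
        // A plain return from main() would block JVM shutdown on the
        // non-daemon worker for up to 60 seconds; System.exit() does not.
        System.exit(0);
    }
}
```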
[jira] [Created] (YARN-10175) FS-CS converter: only convert placement rules if a cmd line switch is defined
Peter Bacsko created YARN-10175: --- Summary: FS-CS converter: only convert placement rules if a cmd line switch is defined Key: YARN-10175 URL: https://issues.apache.org/jira/browse/YARN-10175 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko In the current form, the conversion of FS placement rules to CS mapping rules has a lot of feature gaps and doesn't work properly. The output is good as a starting point but sometimes it causes CS to throw an exception. Until a proper resolution is implemented, it's better to disable this by default and introduce a command line switch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10158) FS-CS converter: convert property yarn.scheduler.fair.update-interval-ms
Peter Bacsko created YARN-10158: --- Summary: FS-CS converter: convert property yarn.scheduler.fair.update-interval-ms Key: YARN-10158 URL: https://issues.apache.org/jira/browse/YARN-10158 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10157) FS-CS converter: initPropertyActions() is not called without rules file
Peter Bacsko created YARN-10157: --- Summary: FS-CS converter: initPropertyActions() is not called without rules file Key: YARN-10157 URL: https://issues.apache.org/jira/browse/YARN-10157 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko The method {{FSConfigToCSConfigRuleHandler.initPropertyActions()}} should be invoked even if we don't use the rule file. Otherwise the rule handler will not initialize actions to WARNING. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10142) Distributed shell: add support for localization visibility
[ https://issues.apache.org/jira/browse/YARN-10142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-10142. - Resolution: Duplicate > Distributed shell: add support for localization visibility > -- > > Key: YARN-10142 > URL: https://issues.apache.org/jira/browse/YARN-10142 > Project: Hadoop YARN > Issue Type: Improvement > Components: distributed-shell >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > The localization is now hard coded in DistributedShell: > {noformat} > FileStatus scFileStatus = fs.getFileStatus(dst); > LocalResource scRsrc = > LocalResource.newInstance( > URL.fromURI(dst.toUri()), > LocalResourceType.FILE, LocalResourceVisibility.APPLICATION, > scFileStatus.getLen(), scFileStatus.getModificationTime()); > localResources.put(fileDstPath, scRsrc); > {noformat} > However, sometimes it's useful if you have the possibility to change this to > PRIVATE/PUBLIC for testing purposes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10147) FPGA plugin can't find the localized aocx file
Peter Bacsko created YARN-10147: --- Summary: FPGA plugin can't find the localized aocx file Key: YARN-10147 URL: https://issues.apache.org/jira/browse/YARN-10147 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Peter Bacsko Assignee: Peter Bacsko There's a bug in the FPGA plugin which is intended to find the localized "aocx" file:
{noformat}
...
if (localizedResources != null) {
  Optional<Path> aocxPath = localizedResources
      .keySet()
      .stream()
      .filter(path -> matchesIpid(path, id))
      .findFirst();
  if (aocxPath.isPresent()) {
    ipFilePath = aocxPath.get().toUri().toString();
    LOG.debug("Found: " + ipFilePath);
  }
} else {
  LOG.warn("Localized resource is null!");
}
return ipFilePath;
}

private boolean matchesIpid(Path p, String id) {
  return p.getName().toLowerCase().equals(id.toLowerCase())
      && p.getName().endsWith(".aocx");
}
{noformat}
The method {{matchesIpid()}} works incorrectly: the {{id}} argument is the expected filename, but without the extension. Therefore the {{equals()}} comparison will always be false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
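The bug and one possible fix can be demonstrated on plain strings (the class and method names below are hypothetical, and the logic operates on file names rather than Hadoop Path objects): strip the ".aocx" extension before comparing against the extensionless id.

```java
public class AocxMatcher {
    // Mirrors the broken logic: "id" carries no extension, so the full-name
    // equality and the ".aocx" suffix check can never both hold.
    static boolean matchesIpidBroken(String fileName, String id) {
        return fileName.toLowerCase().equals(id.toLowerCase())
            && fileName.endsWith(".aocx");
    }

    // One possible fix: compare the base name, without the extension.
    static boolean matchesIpidFixed(String fileName, String id) {
        if (!fileName.toLowerCase().endsWith(".aocx")) {
            return false;
        }
        String base = fileName.substring(0, fileName.length() - ".aocx".length());
        return base.equalsIgnoreCase(id);
    }

    public static void main(String[] args) {
        System.out.println(matchesIpidBroken("matrix_mul.aocx", "matrix_mul")); // false
        System.out.println(matchesIpidFixed("matrix_mul.aocx", "matrix_mul"));  // true
    }
}
```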
[jira] [Created] (YARN-10142) Distributed shell: add support for localization visibility
Peter Bacsko created YARN-10142: --- Summary: Distributed shell: add support for localization visibility Key: YARN-10142 URL: https://issues.apache.org/jira/browse/YARN-10142 Project: Hadoop YARN Issue Type: Improvement Reporter: Peter Bacsko Assignee: Peter Bacsko The localization is now hard coded in DistributedShell: {noformat} FileStatus scFileStatus = fs.getFileStatus(dst); LocalResource scRsrc = LocalResource.newInstance( URL.fromURI(dst.toUri()), LocalResourceType.FILE, LocalResourceVisibility.APPLICATION, scFileStatus.getLen(), scFileStatus.getModificationTime()); localResources.put(fileDstPath, scRsrc); {noformat} However, sometimes it's useful if you have the possibility to change this to PRIVATE/PUBLIC for testing purposes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10135) FS-CS converter tool: issue warning on dynamic auto-create mapping rules
Peter Bacsko created YARN-10135: --- Summary: FS-CS converter tool: issue warning on dynamic auto-create mapping rules Key: YARN-10135 URL: https://issues.apache.org/jira/browse/YARN-10135 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko The converter tool should issue a warning whenever the conversion results in mapping rules similar to these:
{{u:%user:[managedParentQueueName].[queueName]}}
{{u:%user:[managedParentQueueName].%user}}
{{u:%user:[managedParentQueueName].%primary_group}}
{{u:%user:[managedParentQueueName].%secondary_group}}
{{u:%user:%primary_group.%user}}
{{u:%user:%secondary_group.%user}}
{{u:%user:[managedParentQueuePath].%user}}
The reason is that right now it's not fully clear how we'll handle a case like "u:%user:%primary_group.%user", where "%primary_group.%user" might result in something like "users.john". In the case of "u:%user:[managedParentQueuePath].%user", the [managedParentQueuePath] is the result of a full path from Fair Scheduler, therefore it's not going to be a leaf queue. The user might be required to do some fine tuning and adjust the property "auto-create-child-queues". We should display a warning about these additional steps. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10105) FS-CS converter: separator between mapping rules should be comma
Peter Bacsko created YARN-10105: --- Summary: FS-CS converter: separator between mapping rules should be comma Key: YARN-10105 URL: https://issues.apache.org/jira/browse/YARN-10105 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko A converted configuration throws this error: {noformat} 2020-01-27 03:35:35,007 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to standby state 2020-01-27 03:35:35,008 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager java.lang.IllegalArgumentException: Illegal queue mapping u:%user:%user;u:%user:root.users.%user;u:%user:root.default at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.getQueueMappings(CapacitySchedulerConfiguration.java:1113) at org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule.initialize(UserGroupMappingPlacementRule.java:244) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getUserGroupMappingPlacementRule(CapacityScheduler.java:671) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updatePlacementRules(CapacityScheduler.java:712) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:753) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:361) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:426) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) {noformat} Mapping rules should be separated by a "," character, not by a semicolon. 
[jira] [Created] (YARN-10104) FS-CS converter: dryRun requires either -p or -o
Peter Bacsko created YARN-10104: --- Summary: FS-CS converter: dryRun requires either -p or -o Key: YARN-10104 URL: https://issues.apache.org/jira/browse/YARN-10104 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko The "-d" / "--dry-run" switch doesn't work properly. You still have to define either "-p" or "-o", which is not how the tool is supposed to work (i.e., in dry-run mode it shouldn't need to generate any output after the conversion).
[jira] [Created] (YARN-10103) Capacity scheduler: add support for create=true/false per mapping rule
Peter Bacsko created YARN-10103: --- Summary: Capacity scheduler: add support for create=true/false per mapping rule Key: YARN-10103 URL: https://issues.apache.org/jira/browse/YARN-10103 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Currently you can't ask Capacity Scheduler to create a queue for a mapping if the queue doesn't exist. With the proposed flag, a mapping would use the first rule if the queue exists; if it doesn't, it proceeds to the next rule. Example: {{u:%user:%primary_group.%user:create=false;u:%user:root.default}} Let's say user "alice" belongs to the "admins" group. The first rule would try to map to {{root.admins.alice}}, but if that queue doesn't exist, the application is placed into {{root.default}} instead.
[jira] [Created] (YARN-10102) Capacity scheduler: add support for combined %specified mapping
Peter Bacsko created YARN-10102: --- Summary: Capacity scheduler: add support for combined %specified mapping Key: YARN-10102 URL: https://issues.apache.org/jira/browse/YARN-10102 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko To reduce the gap between Fair Scheduler and Capacity Scheduler, it's reasonable to have a {{%specified}} mapping. This would be equivalent to the {{}} placement rule in FS, that is, use the queue that comes in with the application submission context.
[jira] [Created] (YARN-10099) FS-CS converter: handle allow-undeclared-pools and user-as-default queue properly
Peter Bacsko created YARN-10099: --- Summary: FS-CS converter: handle allow-undeclared-pools and user-as-default queue properly Key: YARN-10099 URL: https://issues.apache.org/jira/browse/YARN-10099 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko Based on the latest documentation, there are two important properties that are ignored if we have placement rules:
||Property||Explanation||
|yarn.scheduler.fair.allow-undeclared-pools|If this is true, new queues can be created at application submission time, whether because they are specified as the application's queue by the submitter or because they are placed there by the user-as-default-queue property. If this is false, any time an app would be placed in a queue that is not specified in the allocations file, it is placed in the "default" queue instead. Defaults to true. *If a queue placement policy is given in the allocations file, this property is ignored.*|
|yarn.scheduler.fair.user-as-default-queue|Whether to use the username associated with the allocation as the default queue name, in the event that a queue name is not specified. If this is set to "false" or unset, all jobs have a shared default queue, named "default". Defaults to true. *If a queue placement policy is given in the allocations file, this property is ignored.*|
Right now these settings affect the conversion regardless of the placement rules.
[jira] [Created] (YARN-10085) FS-CS converter: remove mixed ordering policy check
Peter Bacsko created YARN-10085: --- Summary: FS-CS converter: remove mixed ordering policy check Key: YARN-10085 URL: https://issues.apache.org/jira/browse/YARN-10085 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko When YARN-9892 gets committed, this part will become unnecessary:
{noformat}
// Validate ordering policy
if (queueConverter.isDrfPolicyUsedOnQueueLevel()) {
  if (queueConverter.isFifoOrFairSharePolicyUsed()) {
    throw new ConversionException(
        "DRF ordering policy cannot be used together with fifo/fair");
  } else {
    capacitySchedulerConfig.set(
        CapacitySchedulerConfiguration.RESOURCE_CALCULATOR_CLASS,
        DominantResourceCalculator.class.getCanonicalName());
  }
}
{noformat}
We will be able to freely mix fifo/fair/drf, so let's get rid of this strict check and also rewrite {{FSQueueConverter.emitOrderingPolicy()}}.
[jira] [Created] (YARN-10082) FS-CS converter: disable terminal placement rule checking
Peter Bacsko created YARN-10082: --- Summary: FS-CS converter: disable terminal placement rule checking Key: YARN-10082 URL: https://issues.apache.org/jira/browse/YARN-10082 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko Before YARN-8967, the {{QueuePlacementRule}} class had a method called {{isTerminal()}}. However, sometimes this method was hard-coded to return false, accepting such configurations as: {noformat} {noformat} This is because {{NestedUserQueue.isTerminal()}} always returns {{false}}. After YARN-8967, the behavior is different: this configuration is not accepted because {{QueuePlacementPolicy.fromXml()}} calculates the list of terminal rules differently: https://github.com/apache/hadoop/blob/5257f50abb71905ef3068fd45541d00ce9e8f355/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementPolicy.java#L176-L183 In order to migrate existing configurations that were created before YARN-8967, we need a new switch (at least in migration mode) in FS to turn off this validation, otherwise the tool will not be able to migrate these configs and the following exception will be thrown: {noformat} ~$ ./yarn fs2cs -y /tmp/yarn-site.xml -f /tmp/fair-scheduler.xml -o /tmp WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 20/01/13 05:48:20 INFO converter.FSConfigToCSConfigConverter: Output directory for yarn-site.xml and capacity-scheduler.xml is: /tmp 20/01/13 05:48:20 INFO converter.FSConfigToCSConfigConverter: Conversion rules file is not defined, using default conversion config! 20/01/13 05:48:21 INFO converter.FSConfigToCSConfigConverter: Using explicitly defined fair-scheduler.xml WARNING: This feature is experimental and not intended for production use! 
20/01/13 05:48:21 INFO conf.Configuration: resource-types.xml not found 20/01/13 05:48:21 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'. 20/01/13 05:48:21 INFO security.YarnAuthorizationProvider: org.apache.hadoop.yarn.security.ConfiguredYarnAuthorizer is instantiated. 20/01/13 05:48:21 INFO scheduler.AbstractYarnScheduler: Minimum allocation = 20/01/13 05:48:21 INFO scheduler.AbstractYarnScheduler: Maximum allocation = 20/01/13 05:48:21 INFO placement.PlacementFactory: Creating PlacementRule implementation: class org.apache.hadoop.yarn.server.resourcemanager.placement.SpecifiedPlacementRule 20/01/13 05:48:21 INFO placement.PlacementFactory: Creating PlacementRule implementation: class org.apache.hadoop.yarn.server.resourcemanager.placement.UserPlacementRule 20/01/13 05:48:21 INFO fair.AllocationFileLoaderService: Loading allocation file file:/tmp/fair-scheduler.xml 20/01/13 05:48:22 INFO placement.PlacementFactory: Creating PlacementRule implementation: class org.apache.hadoop.yarn.server.resourcemanager.placement.SpecifiedPlacementRule 20/01/13 05:48:22 INFO placement.PlacementFactory: Creating PlacementRule implementation: class org.apache.hadoop.yarn.server.resourcemanager.placement.UserPlacementRule 20/01/13 05:48:22 INFO placement.PlacementFactory: Creating PlacementRule implementation: class org.apache.hadoop.yarn.server.resourcemanager.placement.DefaultPlacementRule 20/01/13 05:48:22 INFO placement.PlacementFactory: Creating PlacementRule implementation: class org.apache.hadoop.yarn.server.resourcemanager.placement.DefaultPlacementRule 20/01/13 05:48:22 INFO service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler failed in state INITED java.io.IOException: Failed to initialize FairScheduler at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1438) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1479) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSConfigToCSConfigConverter.convert(FSConfigToCSConfigConverter.java:206) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSConfigToCSConfigConverter.convert(FSConfigToCSConfigConverter.java:101) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSConfigToCSConfigArgumentHandler.parseAndConvert(FSConfigToCSConfigArgumentHandler.java:116) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSConfigToCSConfigConverterMain.main(FSConfigToCSConfigConverterMain.java:44) Caused by: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: Rules after rule 2 in queue placement policy can never
[jira] [Created] (YARN-10067) Add dry-run feature to FS-CS converter tool
Peter Bacsko created YARN-10067: --- Summary: Add dry-run feature to FS-CS converter tool Key: YARN-10067 URL: https://issues.apache.org/jira/browse/YARN-10067 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko Add a "-d" / "--dry-run" switch to the tool. The purpose of this would be to inform the user whether a conversion is possible and, if it is, whether there are any warnings.
[jira] [Created] (YARN-10019) container-executor: misc improvements in child process and regarding exec() calls
Peter Bacsko created YARN-10019: --- Summary: container-executor: misc improvements in child process and regarding exec() calls Key: YARN-10019 URL: https://issues.apache.org/jira/browse/YARN-10019 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Peter Bacsko Assignee: Peter Bacsko There are a couple of improvements that we can make in container-executor regarding how we exit from child processes and how we handle failed exec() calls: 1. If we're in the child code path and we detect an erroneous condition, the usual way out is to simply call {{_exit()}}. Normal {{exit()}} belongs in the parent. Calling {{_exit()}} prevents the stdio buffers from being flushed twice and ensures that any cleanup logic registered with {{atexit()}} or {{on_exit()}} runs only once. 2. There's code like {{if (execlp(script_file_dest, script_file_dest, NULL) != 0) ...}} which is not necessary. Exec functions are not supposed to return; if one does, it's definitely an error, so there's no need to check the return value.
[jira] [Created] (YARN-10018) container-executor: possible -1 return value of fork() is not always checked
Peter Bacsko created YARN-10018: --- Summary: container-executor: possible -1 return value of fork() is not always checked Key: YARN-10018 URL: https://issues.apache.org/jira/browse/YARN-10018 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Peter Bacsko Assignee: Peter Bacsko There are some places in container-executor's native code where the {{fork()}} call is not handled properly. This operation can fail with -1, but sometimes the if branch that validates success is missing. Also, at one location, the return value is stored in an {{int}} instead of a {{pid_t}}. It's better to fix that as well.
[jira] [Resolved] (YARN-9891) Capacity scheduler: enhance capacity / maximum-capacity setting
[ https://issues.apache.org/jira/browse/YARN-9891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9891. Resolution: Duplicate > Capacity scheduler: enhance capacity / maximum-capacity setting > --- > > Key: YARN-9891 > URL: https://issues.apache.org/jira/browse/YARN-9891 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Peter Bacsko >Priority: Major > > Capacity Scheduler does not support two percentage values for capacity and > maximum-capacity settings. So, you can't do something like this: > {{yarn.scheduler.capacity.root.users.john.maximum-capacity=memory-mb=50.0%, > vcores=50.0%}} > It's possible to use absolute resources, but not two separate percentages > (which expresses capacity as a percentage of the overall cluster resource). > Such a configuration is accepted in Fair Scheduler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-9922) Fix JavaDoc errors introduced by YARN-9699
Peter Bacsko created YARN-9922: -- Summary: Fix JavaDoc errors introduced by YARN-9699 Key: YARN-9922 URL: https://issues.apache.org/jira/browse/YARN-9922 Project: Hadoop YARN Issue Type: Sub-task Components: capacity scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko
[jira] [Created] (YARN-9893) Capacity scheduler: enhance leaf-queue-template capacity / maximum-capacity setting
Peter Bacsko created YARN-9893: -- Summary: Capacity scheduler: enhance leaf-queue-template capacity / maximum-capacity setting Key: YARN-9893 URL: https://issues.apache.org/jira/browse/YARN-9893 Project: Hadoop YARN Issue Type: Sub-task Components: capacity scheduler Reporter: Peter Bacsko Capacity Scheduler does not support two percentage values for leaf queue capacity and maximum-capacity settings. So, you can't do something like this: {{yarn.scheduler.capacity.root.users.john.leaf-queue-template.capacity=memory-mb=50.0%, vcores=50.0%}} On top of that, it's not even possible to define absolute resources: {{yarn.scheduler.capacity.root.users.john.leaf-queue-template.capacity=memory-mb=16384, vcores=8}} Only a single percentage value is accepted. This makes it nearly impossible to properly convert a similar setting from Fair Scheduler, where such a configuration is valid and accepted ({{}}).
[jira] [Created] (YARN-9892) Capacity scheduler: support DRF ordering policy on queue level
Peter Bacsko created YARN-9892: -- Summary: Capacity scheduler: support DRF ordering policy on queue level Key: YARN-9892 URL: https://issues.apache.org/jira/browse/YARN-9892 Project: Hadoop YARN Issue Type: Sub-task Components: capacity scheduler Reporter: Peter Bacsko Capacity scheduler does not support DRF (Dominant Resource Fairness) ordering policy on queue level. Only "fifo" and "fair" are accepted for {{yarn.scheduler.capacity..ordering-policy}}. DRF can only be used globally if {{yarn.scheduler.capacity.resource-calculator}} is set to DominantResourceCalculator.
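The global workaround mentioned above can be sketched as a capacity-scheduler.xml fragment (assuming the standard fully-qualified class name of DominantResourceCalculator):

```xml
<!-- Cluster-wide DRF: applies to every queue, since per-queue
     ordering-policy only accepts "fifo" and "fair" today. -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```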
[jira] [Created] (YARN-9891) Capacity scheduler: enhance capacity / maximum-capacity setting
Peter Bacsko created YARN-9891: -- Summary: Capacity scheduler: enhance capacity / maximum-capacity setting Key: YARN-9891 URL: https://issues.apache.org/jira/browse/YARN-9891 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Capacity Scheduler does not support two percentage values for capacity and maximum-capacity settings. So, you can't do something like this: {{yarn.scheduler.capacity.root.users.john.maximum-capacity=memory-mb=50.0%, vcores=50.0%}} It's possible to use absolute resources, but not two separate percentages (which expresses capacity as a percentage of the overall cluster resource). Such a configuration is accepted in Fair Scheduler.
[jira] [Created] (YARN-9888) Capacity scheduler: add support for default maxRunningApps limit per user
Peter Bacsko created YARN-9888: -- Summary: Capacity scheduler: add support for default maxRunningApps limit per user Key: YARN-9888 URL: https://issues.apache.org/jira/browse/YARN-9888 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Fair scheduler has the setting {{}} which limits how many running applications each user can have. Capacity scheduler lacks this feature.
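The Fair Scheduler tag referenced above was stripped from the email; the setting matching this description is presumably {{userMaxAppsDefault}} in the allocation file, along these lines (the limit 5 is illustrative):

```xml
<?xml version="1.0"?>
<!-- Fair Scheduler allocation file: cap each user at 5 concurrently
     running applications unless a per-user limit overrides it. -->
<allocations>
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
```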
[jira] [Created] (YARN-9887) Capacity scheduler: add support for limiting maxRunningApps per user
Peter Bacsko created YARN-9887: -- Summary: Capacity scheduler: add support for limiting maxRunningApps per user Key: YARN-9887 URL: https://issues.apache.org/jira/browse/YARN-9887 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Fair Scheduler supports limiting the number of applications that a particular user can submit: {noformat} 10 {noformat} Capacity Scheduler does not have an exact equivalent.
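The {{noformat}} block above lost its XML tags (only the number 10 survives); the Fair Scheduler snippet it shows is presumably a per-user limit along these lines, with "alice" as an illustrative user name:

```xml
<allocations>
  <!-- Limit the user "alice" to 10 concurrently running applications. -->
  <user name="alice">
    <maxRunningApps>10</maxRunningApps>
  </user>
</allocations>
```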
[jira] [Resolved] (YARN-9717) Add more logging to container-executor about issues with directory creation or permissions
[ https://issues.apache.org/jira/browse/YARN-9717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9717. Resolution: Won't Fix > Add more logging to container-executor about issues with directory creation > or permissions > -- > > Key: YARN-9717 > URL: https://issues.apache.org/jira/browse/YARN-9717 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Peter Bacsko >Priority: Major > > During some downstream testing we bumped into some problems with the > container executor where an extra logging would be quite helpful when local > files and directories could not be created (container-executor.c:1810). > The most important log line could be the following: > There's a function called create_container_directories in > container-executor.c. > We should place a log line like this: > Before we're calling: > We have: > {code:java} > if (mkdirs(container_dir, perms) == 0) { > result = 0; > } > {code} > We could add an else statement and add the following log, if creating the > directory was not successful: > {code:java} > fprintf(LOGFILE, "Failed to create directory: %s, user: %s", container_dir, > user); > {code} > This way, CE at least prints the directory itself if we have any permission > issue while trying to create a subdirectory or file under it. > If we want to be very precise, some logging into the mkdirs function could > also be added as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-9841) Capacity scheduler: add support for combined %user + %primary_group mapping
Peter Bacsko created YARN-9841: -- Summary: Capacity scheduler: add support for combined %user + %primary_group mapping Key: YARN-9841 URL: https://issues.apache.org/jira/browse/YARN-9841 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko Right now in CS, using {{%primary_group}} with a parent queue is only possible this way: {{u:%user:parentqueue.%primary_group}} Looking at https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/placement/UserGroupMappingPlacementRule.java, we cannot do something like: {{u:%user:%primary_group.%user}} Fair Scheduler supports a nested rule where such a placement/mapping rule is possible. This improvement would reduce this feature gap.
[jira] [Created] (YARN-9840) Capacity scheduler: add support for Secondary Group user mapping
Peter Bacsko created YARN-9840: -- Summary: Capacity scheduler: add support for Secondary Group user mapping Key: YARN-9840 URL: https://issues.apache.org/jira/browse/YARN-9840 Project: Hadoop YARN Issue Type: Improvement Reporter: Peter Bacsko Assignee: Peter Bacsko Currently, Capacity Scheduler only supports primary group rule mapping like this: {{u:%user:%primary_group}} Fair scheduler already supports secondary group placement rule. Let's add this to CS to reduce the feature gap. Class of interest: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/placement/UserGroupMappingPlacementRule.java
[jira] [Created] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch
Peter Bacsko created YARN-9833: -- Summary: Race condition when DirectoryCollection.checkDirs() runs during container launch Key: YARN-9833 URL: https://issues.apache.org/jira/browse/YARN-9833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.2.0 Reporter: Peter Bacsko Assignee: Peter Bacsko During endurance testing, we found a race condition that causes an empty {{localDirs}} being passed to container-executor. The problem is that {{DirectoryCollection.checkDirs()}} clears three collections:
{code:java}
this.writeLock.lock();
try {
  localDirs.clear();
  errorDirs.clear();
  fullDirs.clear();
  ...
{code}
This happens in a critical section guarded by a write lock. When we start a container, we retrieve the local dirs by calling {{dirsHandler.getLocalDirs()}}, which in turn invokes {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
{code:java}
List<String> getGoodDirs() {
  this.readLock.lock();
  try {
    return Collections.unmodifiableList(localDirs);
  } finally {
    this.readLock.unlock();
  }
}
{code}
So we're also in a critical section guarded by the lock. But {{Collections.unmodifiableList()}} only returns a _view_ of the collection, not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be scheduled to run and immediately clear {{localDirs}}. This caused weird behaviour in container-executor, which exited with error code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES). Therefore we can't just return a view, we must return a copy with {{ImmutableList.copyOf()}}. Credits to [~snemeth] for analyzing and determining the root cause.
[jira] [Created] (YARN-9749) TestAppLogAggregatorImpl#testDFSQuotaExceeded fails on trunk
Peter Bacsko created YARN-9749: -- Summary: TestAppLogAggregatorImpl#testDFSQuotaExceeded fails on trunk Key: YARN-9749 URL: https://issues.apache.org/jira/browse/YARN-9749 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Peter Bacsko Assignee: Adam Antal TestAppLogAggregatorImpl#testDFSQuotaExceeded currently fails on trunk. It was most likely introduced by YARN-9676 (resetting HEAD to the previous commit and then re-running the test passes). {noformat} [INFO] Running org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.781 s <<< FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl [ERROR] testDFSQuotaExceeded(org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl) Time elapsed: 2.361 s <<< FAILURE! java.lang.AssertionError: The set of paths for deletion are not the same as expected: actual size: 0 vs expected size: 1 at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl.verifyFilesToDelete(TestAppLogAggregatorImpl.java:344) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl.access$000(TestAppLogAggregatorImpl.java:82) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl$1.answer(TestAppLogAggregatorImpl.java:330) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl$1.answer(TestAppLogAggregatorImpl.java:319) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:39) at org.mockito.internal.handler.MockHandlerImpl.handle(MockHandlerImpl.java:96) at org.mockito.internal.handler.NullResultGuardian.handle(NullResultGuardian.java:29) at 
org.mockito.internal.handler.InvocationNotifierHandler.handle(InvocationNotifierHandler.java:35) at org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:61) at org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:49) at org.mockito.internal.creation.bytebuddy.MockMethodInterceptor$DispatcherDefaultingToRealMethod.interceptSuperCallable(MockMethodInterceptor.java:108) at org.apache.hadoop.yarn.server.nodemanager.DeletionService$MockitoMock$1879282050.delete(Unknown Source) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregationPostCleanUp(AppLogAggregatorImpl.java:556) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:476) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl.testDFSQuotaExceeded(TestAppLogAggregatorImpl.java:469) ... {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-9473) [Umbrella] Support Vector Engine ( a new accelerator hardware) based on pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-9473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9473. Resolution: Fixed Fix Version/s: 3.3.0 > [Umbrella] Support Vector Engine ( a new accelerator hardware) based on > pluggable device framework > -- > > Key: YARN-9473 > URL: https://issues.apache.org/jira/browse/YARN-9473 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Zhankun Tang >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0 > > > As the heterogeneous computation trend rises, new acceleration hardware like > GPU, FPGA is used to satisfy various requirements. > And a new hardware Vector Engine (VE) which released by NEC company is > another example. The VE is like GPU but has different characteristics. It's > suitable for machine learning and HPC due to better memory bandwidth and no > PCIe bottleneck. > Please Check here for more VE details: > [https://www.nextplatform.com/2017/11/22/deep-dive-necs-aurora-vector-engine/] > [https://www.hotchips.org/hc30/2conf/2.14_NEC_vector_NEC_SXAurora_TSUBASA_HotChips30_finalb.pdf] > As we know, YARN-8851 is a pluggable device framework which provides an easy > way to develop a plugin for such new accelerators. This JIRA proposes to > develop a new VE plugin based on that framework and be implemented as current > GPU's "NvidiaGPUPluginForRuntimeV2" plugin. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-9660) Enhance documentation of Docker on YARN support
Peter Bacsko created YARN-9660: -- Summary: Enhance documentation of Docker on YARN support Key: YARN-9660 URL: https://issues.apache.org/jira/browse/YARN-9660 Project: Hadoop YARN Issue Type: Bug Components: documentation, nodemanager Reporter: Peter Bacsko Right now, using Docker on YARN has some hard requirements. If these requirements are not met, then launching the containers will fail and an error message will be printed. Depending on how familiar the user is with Docker, it might or might not be easy for them to understand what went wrong and how to fix the underlying problem. It would be important to explicitly document these requirements along with the error messages. #1: CGroups handler cannot be systemd If the docker daemon runs with the systemd cgroups handler, we receive the following error upon launching a container: {noformat} Container id: container_1561638268473_0006_01_02 Exit code: 7 Exception message: Launch container failed Shell error output: /usr/bin/docker-current: Error response from daemon: cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice". See '/usr/bin/docker-current run --help'. Shell output: main : command provided 4 main : run as user is johndoe main : requested yarn user is johndoe {noformat} Solution: switch to cgroupfs. Doing so can be OS-specific, but we can document a {{systemctl}} example. #2: {{/bin/bash}} must be present on the {{$PATH}} inside the container Some smaller images like "busybox" or "alpine" do not have {{/bin/bash}}. It's because all commands under {{/bin}} are linked to {{/bin/busybox}} and there's only {{/bin/sh}}. 
If we try to use this kind of image, we'll see the following error message: {noformat} Container id: container_1561638268473_0015_01_02 Exit code: 7 Exception message: Launch container failed Shell error output: /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused "exec: \"bash\": executable file not found in $PATH". Shell output: main : command provided 4 main : run as user is johndoe main : requested yarn user is johndoe {noformat} #3: {{find}} command must be available on the {{$PATH}} It seems obvious that we have the {{find}} command, but even very popular images like {{fedora}} require that we install it separately. If we don't have {{find}} available, then {{launch_container.sh}} fails with: {noformat} [2019-07-01 03:51:25.053]Container exited with a non-zero exit code 127. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : /tmp/hadoop-systest/nm-local-dir/usercache/systest/appcache/application_1561638268473_0017/container_1561638268473_0017_01_02/launch_container.sh: line 44: find: command not found Last 4096 bytes of stderr.txt : [2019-07-01 03:51:25.053]Container exited with a non-zero exit code 127. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : /tmp/hadoop-systest/nm-local-dir/usercache/systest/appcache/application_1561638268473_0017/container_1561638268473_0017_01_02/launch_container.sh: line 44: find: command not found Last 4096 bytes of stderr.txt : {noformat}
[jira] [Created] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage
Peter Bacsko created YARN-9622: -- Summary: All testcase fails in TestTimelineReaderWebServicesHBaseStorage Key: YARN-9622 URL: https://issues.apache.org/jira/browse/YARN-9622 Project: Hadoop YARN Issue Type: Bug Components: timelineserver, timelineservice Reporter: Peter Bacsko When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, the result is the following: {noformat} [ERROR] Failures: [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 Response from server should have been Not Found [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 Response from server should have been Not Found [ERROR] TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 Response from server should have been Bad Request [ERROR] Errors: [ERROR] TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] 
TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] 
TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunApps:1984->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunAppsNotPresent:2235->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRuns:488->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunsMetricsToRetrieve:616->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlows:918->verifyFlowEntites:2349->AbstractTimelineReaderHBase
[jira] [Created] (YARN-9621) Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1
Peter Bacsko created YARN-9621: -- Summary: Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1 Key: YARN-9621 URL: https://issues.apache.org/jira/browse/YARN-9621 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.1.2 Reporter: Peter Bacsko The test case {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} fails consistently on branch-3.1. I believe the failure was introduced by YARN-9253. {noformat} testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager) Time elapsed: 24.636 s <<< FAILURE! java.lang.AssertionError: expected:<1> but was:<2> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) {noformat}
[jira] [Created] (YARN-9595) FPGA plugin: NullPointerException in FpgaNodeResourceUpdateHandler.updateConfiguredResource()
Peter Bacsko created YARN-9595: -- Summary: FPGA plugin: NullPointerException in FpgaNodeResourceUpdateHandler.updateConfiguredResource() Key: YARN-9595 URL: https://issues.apache.org/jira/browse/YARN-9595 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Peter Bacsko Assignee: Peter Bacsko YARN-9264 accidentally introduced a bug in FpgaDiscoverer. Sometimes {{currentFpgaInfo}} is not set, resulting in an NPE being thrown: {noformat} 2019-06-03 05:14:50,157 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaNodeResourceUpdateHandler.updateConfiguredResource(FpgaNodeResourceUpdateHandler.java:54) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.updateConfiguredResourcesViaPlugins(NodeStatusUpdaterImpl.java:358) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceInit(NodeStatusUpdaterImpl.java:190) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:459) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:869) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:942) {noformat} The problem is that in {{FpgaDiscoverer}}, we don't set {{currentFpgaInfo}} if the following condition is true: {noformat} if (allowed == null || allowed.equalsIgnoreCase( YarnConfiguration.AUTOMATICALLY_DISCOVER_GPU_DEVICES)) { return list; } else if (allowed.matches("(\\d,)*\\d")){ ... {noformat} The solution is simple: {{currentFpgaInfo}} should always be initialized, just as before. 
Unit tests should be enhanced to verify that it's set properly.
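The fix can be illustrated with a standalone sketch. Note that the class and field names below are simplified stand-ins for the real {{FpgaDiscoverer}} code (the "auto" string stands in for the {{YarnConfiguration}} constant, and the device type is reduced to a minor number); the point is that every branch assigns {{currentFpgaInfo}} before returning, so the update handler can never observe {{null}}:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for FpgaDiscoverer; not the actual Hadoop class.
public class FpgaDiscovererSketch {
    public static class FpgaDevice {
        final int minor;
        public FpgaDevice(int minor) { this.minor = minor; }
    }

    private List<FpgaDevice> currentFpgaInfo;

    public List<FpgaDevice> discover(String allowed, List<FpgaDevice> detected) {
        List<FpgaDevice> list = new ArrayList<>(detected);
        if (allowed == null || allowed.equalsIgnoreCase("auto")) {
            // The bug was here: the early return skipped the assignment below,
            // leaving currentFpgaInfo null. Assign it on this path too.
            currentFpgaInfo = Collections.unmodifiableList(list);
            return list;
        } else if (allowed.matches("(\\d,)*\\d")) {
            // Keep only the devices whose minor numbers are listed in "allowed".
            List<FpgaDevice> filtered = new ArrayList<>();
            for (FpgaDevice d : list) {
                if (("," + allowed + ",").contains("," + d.minor + ",")) {
                    filtered.add(d);
                }
            }
            currentFpgaInfo = Collections.unmodifiableList(filtered);
            return filtered;
        }
        currentFpgaInfo = Collections.emptyList();
        return Collections.emptyList();
    }

    public List<FpgaDevice> getCurrentFpgaInfo() {
        return currentFpgaInfo;
    }
}
```

After this change, {{getCurrentFpgaInfo()}} cannot return {{null}} once discovery has run, regardless of which branch was taken.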
[jira] [Created] (YARN-9552) FairScheduler: NODE_UPDATE can cause a NoSuchElementException
Peter Bacsko created YARN-9552: -- Summary: FairScheduler: NODE_UPDATE can cause a NoSuchElementException Key: YARN-9552 URL: https://issues.apache.org/jira/browse/YARN-9552 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Peter Bacsko Assignee: Peter Bacsko We observed a race condition inside YARN with the following stack trace: {noformat} 18/11/07 06:45:09.559 SchedulerEventDispatcher:Event Processor ERROR EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher java.util.NoSuchElementException at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036) at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1373) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:353) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1094) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:961) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1183) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:748) {noformat} This is basically the same as the one described in YARN-7382, but the root cause is 
different. When we create an application attempt, we create an {{FSAppAttempt}} object. This contains an {{AppSchedulingInfo}} which contains a set of {{SchedulerRequestKey}}. Initially, this set is empty and only initialized a bit later on a separate thread during a state transition: {noformat} 2019-05-07 15:58:02,659 INFO [RM StateStore dispatcher] recovery.RMStateStore (RMStateStore.java:transition(239)) - Storing info for app: application_1557237478804_0001 2019-05-07 15:58:02,684 INFO [RM Event dispatcher] rmapp.RMAppImpl (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED 2019-05-07 15:58:02,690 INFO [SchedulerEventDispatcher:Event Processor] fair.FairScheduler (FairScheduler.java:addApplication(490)) - Accepted application application_1557237478804_0001 from user: bacskop, in queue: root.bacskop, currently num of applications: 1 2019-05-07 15:58:02,698 INFO [RM Event dispatcher] rmapp.RMAppImpl (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change from SUBMITTED to ACCEPTED on event = APP_ACCEPTED 2019-05-07 15:58:02,731 INFO [RM Event dispatcher] resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(434)) - Registering app attempt : appattempt_1557237478804_0001_01 2019-05-07 15:58:02,732 INFO [RM Event dispatcher] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_01 State change from NEW to SUBMITTED on event = START 2019-05-07 15:58:02,746 INFO [SchedulerEventDispatcher:Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:(207)) - *** In the constructor of SchedulerApplicationAttempt 2019-05-07 15:58:02,747 INFO [SchedulerEventDispatcher:Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:(230)) - *** Contents of appSchedulingInfo: [] 2019-05-07 15:58:02,752 INFO [SchedulerEventDispatcher:Event Processor] 
fair.FairScheduler (FairScheduler.java:addApplicationAttempt(546)) - Added Application Attempt appattempt_1557237478804_0001_01 to scheduler from user: bacskop 2019-05-07 15:58:02,756 INFO [RM Event dispatcher] scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(257)) - *** Adding scheduler key: SchedulerRequestKey{priority=0, allocationRequestId=-1, containerToUpdate=null} for attempt: appattempt_1557237478804_0001_01 2019-05-07 15:58:02,759 INFO [RM Event dispatcher] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_01 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED 2019-05-07 15:58:02,892
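The failing call chain boils down to calling {{first()}} on an empty {{ConcurrentSkipListSet}} — the scheduler-key set that {{AppSchedulingInfo}} holds before the attempt's first resource request is registered. The actual fix for the race belongs in the scheduler's state handling; the standalone sketch below (plain JDK collections, not the Hadoop classes) only reproduces the failure mode and shows one defensive access pattern:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListSet;

public class EmptySetRace {
    // first() on an empty ConcurrentSkipListSet throws NoSuchElementException —
    // this is what surfaces from AppSchedulingInfo.getNextPendingAsk() when a
    // NODE_UPDATE is processed before the attempt's scheduler keys are added.
    public static boolean firstThrowsWhenEmpty() {
        try {
            new ConcurrentSkipListSet<Integer>().first();
            return false;
        } catch (NoSuchElementException expected) {
            return true;
        }
    }

    // A race-tolerant "peek": the weakly consistent iterator returns null for
    // an empty set instead of throwing, so a caller can treat "no key yet" as
    // "no pending ask" rather than crashing the event dispatcher.
    public static Integer firstOrNull(ConcurrentSkipListSet<Integer> keys) {
        Iterator<Integer> it = keys.iterator();
        return it.hasNext() ? it.next() : null;
    }
}
```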
[jira] [Resolved] (YARN-9446) TestMiniMRClientCluster.testRestart is flaky
[ https://issues.apache.org/jira/browse/YARN-9446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9446. Resolution: Won't Do Closing this as Won't Do - related Hadoop JIRA (HADOOP-16238) should fix this problem. > TestMiniMRClientCluster.testRestart is flaky > > > Key: YARN-9446 > URL: https://issues.apache.org/jira/browse/YARN-9446 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Minor > > The testcase {{TestMiniMRClientCluster.testRestart}} sometimes fails with > this error: > {noformat} > 2019-04-04 11:21:31,896 INFO [main] service.AbstractService > (AbstractService.java:noteFailure(273)) - Service > org.apache.hadoop.yarn.server.resourcemanager.AdminService failed in state > STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.net.BindException: Problem binding to [test-host:35491] > java.net.BindException: Address already in use; For more details see: > http://wiki.apache.org/hadoop/BindException > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.net.BindException: Problem binding to [test-host:35491] > java.net.BindException: Address already in use; For more details see: > http://wiki.apache.org/hadoop/BindException > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:138) > at > org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) > at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.startServer(AdminService.java:178) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceStart(AdminService.java:165) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1244) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:355) > at > org.apache.hadoop.yarn.server.MiniYARNCluster.access$300(MiniYARNCluster.java:127) > at > org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:493) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:312) > at > org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceStart(MiniMRYarnCluster.java:210) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapred.MiniMRYarnClusterAdapter.restart(MiniMRYarnClusterAdapter.java:73) > at > org.apache.hadoop.mapred.TestMiniMRClientCluster.testRestart(TestMiniMRClientCluster.java:114) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){noformat} > The solution is to set the socket option SO_REUSEADDR, which is implemented in > HADOOP-16238.
[jira] [Created] (YARN-9476) Create unit tests for VE plugin
Peter Bacsko created YARN-9476: -- Summary: Create unit tests for VE plugin Key: YARN-9476 URL: https://issues.apache.org/jira/browse/YARN-9476 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko
[jira] [Created] (YARN-9477) Investigate device discovery mechanisms
Peter Bacsko created YARN-9477: -- Summary: Investigate device discovery mechanisms Key: YARN-9477 URL: https://issues.apache.org/jira/browse/YARN-9477 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko
[jira] [Created] (YARN-9475) Add basic VE plugin
Peter Bacsko created YARN-9475: -- Summary: Add basic VE plugin Key: YARN-9475 URL: https://issues.apache.org/jira/browse/YARN-9475 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Peter Bacsko
[jira] [Created] (YARN-9461) TestRMWebServicesDelegationTokenAuthentication.testCancelledDelegationToken fails with HTTP 400
Peter Bacsko created YARN-9461: -- Summary: TestRMWebServicesDelegationTokenAuthentication.testCancelledDelegationToken fails with HTTP 400 Key: YARN-9461 URL: https://issues.apache.org/jira/browse/YARN-9461 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, test Reporter: Peter Bacsko Assignee: Peter Bacsko The test {{TestRMWebServicesDelegationTokenAuthentication.testCancelledDelegationToken}} sometimes fails with the following error: {noformat} java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8088/ws/v1/cluster/delegation-token at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication.cancelDelegationToken(TestRMWebServicesDelegationTokenAuthentication.java:462) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication.testCancelledDelegationToken(TestRMWebServicesDelegationTokenAuthentication.java:283) {noformat} The problem is that, for whatever reason, Jetty seems to execute the token cancellation REST call twice. The first request returns HTTP 200 OK, but the second one fails with HTTP 400 Bad Request. The {{MockRM}} instance is static. Something in this class could be the problem; it turned out that using separate {{MockRM}} instances solves the flakiness.
[jira] [Created] (YARN-9446) TestMiniMRClientCluster.testRestart is flaky
Peter Bacsko created YARN-9446: -- Summary: TestMiniMRClientCluster.testRestart is flaky Key: YARN-9446 URL: https://issues.apache.org/jira/browse/YARN-9446 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Peter Bacsko Assignee: Peter Bacsko The testcase {{TestMiniMRClientCluster.testRestart}} sometimes fails with this error: {noformat} 2019-04-04 11:21:31,896 INFO [main] service.AbstractService (AbstractService.java:noteFailure(273)) - Service org.apache.hadoop.yarn.server.resourcemanager.AdminService failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [test-host:35491] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [test-host:35491] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:138) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.startServer(AdminService.java:178) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceStart(AdminService.java:165) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1244) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:355) at 
org.apache.hadoop.yarn.server.MiniYARNCluster.access$300(MiniYARNCluster.java:127) at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:493) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:312) at org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceStart(MiniMRYarnCluster.java:210) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.mapred.MiniMRYarnClusterAdapter.restart(MiniMRYarnClusterAdapter.java:73) at org.apache.hadoop.mapred.TestMiniMRClientCluster.testRestart(TestMiniMRClientCluster.java:114) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){noformat} The solution is to set the socket option SO_REUSEADDR, which is implemented in HADOOP-16238.
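The effect of SO_REUSEADDR can be demonstrated with plain JDK sockets. The sketch below is not Hadoop code — it just shows the option that HADOOP-16238 enables on the RPC server socket, which lets a restarted server rebind its port even while the previous socket lingers in TIME_WAIT (the situation the MiniYARNCluster restart runs into):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class ReuseAddrDemo {
    // Binds a server socket with SO_REUSEADDR enabled. The option must be set
    // on the unbound socket, i.e. before bind() is called.
    public static ServerSocket bindReusable(int port) throws IOException {
        ServerSocket server = new ServerSocket(); // unbound
        server.setReuseAddress(true);             // set before bind()
        server.bind(new InetSocketAddress(port));
        return server;
    }

    public static void main(String[] args) throws IOException {
        // Port 0 asks the OS for an ephemeral port.
        try (ServerSocket s = bindReusable(0)) {
            System.out.println("bound with SO_REUSEADDR=" + s.getReuseAddress());
        }
    }
}
```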
[jira] [Resolved] (YARN-9436) Flaky test testApplicationLifetimeMonitor
[ https://issues.apache.org/jira/browse/YARN-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9436. Resolution: Duplicate > Flaky test testApplicationLifetimeMonitor > - > > Key: YARN-9436 > URL: https://issues.apache.org/jira/browse/YARN-9436 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > In our test environment, we occasionally encounter this failure: > {noformat} > 2019-04-03 12:49:32 [INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor > 2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, > Time elapsed: 215.535 s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor > 2019-04-03 12:53:08 [ERROR] > testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor) > Time elapsed: 34.244 s <<< FAILURE! > 2019-04-03 12:53:08 java.lang.AssertionError: Application killed before > lifetime value > 2019-04-03 12:53:08 at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218) > 2019-04-03 12:53:08 > {noformat} > The root cause is the condition here: > {noformat} > Assert.assertTrue("Application killed before lifetime value", > totalTimeRun > maxLifetime); > {noformat} > However, there are two problems with this condition: > 1. Logically it's not correct. In fact, since the app should be killed after > 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to > some asynchronicity and rounding, most of the time {{totalTimeRun}} ends up > being 31. > 2. 
Sometimes the application is killed fast enough and {{totalTimeRun}} is > 30, but this is correct, because in {{setUpCSQueue}} we set the queue > lifetime: > {noformat} > csConf.setMaximumLifetimePerQueue( > CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime); > csConf.setDefaultLifetimePerQueue( > CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime); > {noformat} > A more appropriate condition is: > {noformat} > Assert.assertTrue("Application killed before lifetime value", > totalTimeRun >= maxLifetime); > {noformat} > The assertion message in the next line is also misleading: > {noformat} > Assert.assertTrue( > "Application killed before lifetime value " + totalTimeRun, > totalTimeRun < maxLifetime + 10L); > {noformat} > If it is false, it means that the application is killed _after_ 40 seconds, > which exceeds both the app's lifetime (40s) and that of the queue (30s). > {noformat} > Assert.assertTrue( > "Application killed after queue/app lifetime value: " + > totalTimeRun, > totalTimeRun < maxLifetime + 10L); > {noformat} > We can even be stricter, since we expect a kill almost immediately after > 30 seconds: > {noformat} > Assert.assertTrue( > "Application killed too late: " + totalTimeRun, > totalTimeRun < maxLifetime + 2L); > {noformat} > where we allow a 2-second tolerance.
[jira] [Created] (YARN-9436) Flaky test testApplicationLifetimeMonitor
Peter Bacsko created YARN-9436: -- Summary: Flaky test testApplicationLifetimeMonitor Key: YARN-9436 URL: https://issues.apache.org/jira/browse/YARN-9436 Project: Hadoop YARN Issue Type: Bug Components: scheduler, test Reporter: Peter Bacsko Assignee: Peter Bacsko In our test environment, we occasionally encounter this failure: {noformat} 2019-04-03 12:49:32 [INFO] Running org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor 2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.535 s <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor 2019-04-03 12:53:08 [ERROR] testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor) Time elapsed: 34.244 s <<< FAILURE! 2019-04-03 12:53:08 java.lang.AssertionError: Application killed before lifetime value 2019-04-03 12:53:08 at org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218) 2019-04-03 12:53:08 {noformat} The root cause is the condition here: {noformat} Assert.assertTrue("Application killed before lifetime value", totalTimeRun > maxLifetime); {noformat} However, there are two problems with this condition: 1. Logically it's not correct. In fact, since the app should be killed after 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to some asynchronicity and rounding, most of the time {{totalTimeRun}} ends up being 31. 2. 
Sometimes the application is killed fast enough and {{totalTimeRun}} is 30, but this is correct, because in {{setUpCSQueue}} we set the queue lifetime: {noformat} csConf.setMaximumLifetimePerQueue( CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime); csConf.setDefaultLifetimePerQueue( CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime); {noformat} A more appropriate condition is: {noformat} Assert.assertTrue("Application killed before lifetime value", totalTimeRun >= maxLifetime); {noformat} The assertion message in the next line is also misleading: {noformat} Assert.assertTrue( "Application killed before lifetime value " + totalTimeRun, totalTimeRun < maxLifetime + 10L); {noformat} If it is false, it means that the application is killed _after_ 40 seconds, which exceeds both the app's lifetime (40s) and that of the queue (30s). {noformat} Assert.assertTrue( "Application killed after queue/app lifetime value: " + totalTimeRun, totalTimeRun < maxLifetime + 10L); {noformat} We can even be stricter, since we expect a kill almost immediately after 30 seconds: {noformat} Assert.assertTrue( "Application killed too late: " + totalTimeRun, totalTimeRun < maxLifetime + 2L); {noformat} where we allow a 2-second tolerance.
[jira] [Resolved] (YARN-9264) [Umbrella] Follow-up on IntelOpenCL FPGA plugin
[ https://issues.apache.org/jira/browse/YARN-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9264. Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.3.0 > [Umbrella] Follow-up on IntelOpenCL FPGA plugin > --- > > Key: YARN-9264 > URL: https://issues.apache.org/jira/browse/YARN-9264 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0 > > > The Intel FPGA resource type support was released in Hadoop 3.1.0. > Right now the plugin implementation has some deficiencies that need to be > fixed. This JIRA lists all problems that need to be resolved.
[jira] [Created] (YARN-9270) Minor cleanup in TestFpgaDiscoverer
Peter Bacsko created YARN-9270: -- Summary: Minor cleanup in TestFpgaDiscoverer Key: YARN-9270 URL: https://issues.apache.org/jira/browse/YARN-9270 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Let's do some cleanup in this class.
* {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split into 5 separate tests, because it tests 5 different scenarios.
* remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a {{Function<String, String> envProvider = System::getenv}} in the plugin class, plus a setter method which allows the test to modify {{envProvider}}. Much simpler and more straightforward.
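The env-provider indirection suggested above can be sketched like this (class and method names are hypothetical, not the actual plugin code; production keeps using {{System::getenv}}, while a test injects its own lookup):

```java
import java.util.function.Function;

// Sketch of a testable environment lookup: instead of hacking the real
// process environment, the plugin asks an injectable Function for variables.
public class EnvAwarePluginSketch {
    // Defaults to the real environment in production.
    private Function<String, String> envProvider = System::getenv;

    // Visible for testing: lets a unit test replace the environment lookup.
    void setEnvProvider(Function<String, String> provider) {
        this.envProvider = provider;
    }

    // Example consumer: resolve the aocl binary from an SDK root variable.
    String getAoclPath() {
        String dir = envProvider.apply("ALTERAOCLSDKROOT");
        return dir == null ? "aocl" : dir + "/bin/aocl";
    }
}
```

A test then calls {{setEnvProvider(name -> ...)}} with a map-backed lambda and never touches the real environment.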
[jira] [Created] (YARN-9269) Minor cleanup in FpgaResourceAllocator
Peter Bacsko created YARN-9269: -- Summary: Minor cleanup in FpgaResourceAllocator Key: YARN-9269 URL: https://issues.apache.org/jira/browse/YARN-9269 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Some things that we observed:
* {{addFpga()}} - we check for duplicate devices, but we don't print any error/warning if there is one.
* {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is this method even needed? We already receive an {{FpgaDevice}} instance in {{updateFpga()}}, which I believe is the same one that we're looking up.
* the variable name {{IPIDpreference}} is confusing
* {{availableFpga}} / {{usedFpgaByRequestor}} are instances of {{LinkedHashMap}}. What's the rationale behind this? Wouldn't a simple {{HashMap}} suffice?
* {{usedFpgaByRequestor}} should be renamed, the naming is a bit unclear
* {{allowedFpgas}} should be an immutable list
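The first bullet (warn on duplicates instead of dropping them silently) can be sketched as follows; this is a hypothetical standalone helper, not the actual {{FpgaResourceAllocator}} code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of duplicate-device handling in a registration method: report the
// duplicate and reject it rather than ignoring it without a trace.
public class FpgaRegistrySketch {
    // aliasDevName -> FPGA type (e.g. "acl0" -> "IntelOpenCL")
    private final Map<String, String> devices = new HashMap<>();

    /** Returns false and warns when the device was already registered. */
    boolean addFpga(String aliasDevName, String type) {
        if (devices.containsKey(aliasDevName)) {
            System.err.println(
                "WARN: duplicate FPGA device ignored: " + aliasDevName);
            return false;
        }
        devices.put(aliasDevName, type);
        return true;
    }
}
```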
[jira] [Created] (YARN-9268) Various fixes are needed in FpgaDevice
Peter Bacsko created YARN-9268: -- Summary: Various fixes are needed in FpgaDevice Key: YARN-9268 URL: https://issues.apache.org/jira/browse/YARN-9268 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Need to fix the following in the class {{FpgaDevice}}:
* It implements {{Comparable}}, but not {{Comparable<FpgaDevice>}}, so we have a raw type warning. {{compareTo()}} also returns 0 in every case. There is no natural ordering among FPGA devices; perhaps "acl0" comes before "acl1", but this seems too forced and unnecessary. We think this class should not implement {{Comparable}} at all, at least not like that.
* It stores unnecessary fields: devName, busNum, temperature, power usage. For one, these are never needed in the code. Secondly, temperature and power usage change constantly. It's pointless to store these in this POJO.
* serialVersionUID is 1L - let's generate a proper number for it
* Use int instead of Integer - don't allow nulls. If major/minor uniquely identify the card, then let's demand them in the constructor and don't store Integers that can be null.
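Putting the bullets together, the suggested shape of the POJO could look like this (a hypothetical sketch, not the actual {{FpgaDevice}} class: primitive major/minor required in the constructor, no {{Comparable}}, no volatile readings such as temperature or power):

```java
import java.util.Objects;

// Sketch of a minimal FPGA device value object: identity is type + major/minor,
// nothing nullable, nothing that changes over the device's lifetime.
public final class FpgaDeviceSketch {
    private final String type; // e.g. "acl0"
    private final int major;   // device major number, required
    private final int minor;   // device minor number, required

    public FpgaDeviceSketch(String type, int major, int minor) {
        this.type = Objects.requireNonNull(type);
        this.major = major;
        this.minor = minor;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FpgaDeviceSketch)) return false;
        FpgaDeviceSketch other = (FpgaDeviceSketch) o;
        return major == other.major && minor == other.minor
            && type.equals(other.type);
    }

    @Override
    public int hashCode() {
        return Objects.hash(type, major, minor);
    }
}
```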
[jira] [Created] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl
Peter Bacsko created YARN-9267: -- Summary: Various fixes are needed in FpgaResourceHandlerImpl Key: YARN-9267 URL: https://issues.apache.org/jira/browse/YARN-9267 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Fix some problems in FpgaResourceHandlerImpl:
* preStart() does not reconfigure the card with the same IP - we see this as a problem. If you recompile the FPGA application, you must rename the aocx file, because otherwise the card will not be reprogrammed. Suggestion: instead of storing a Node<->IPID mapping, store a Node<->IPID hash (like the SHA-256 of the localized file).
* Switch to slf4j from Apache Commons Logging
* Remove some unused imports
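The hashing suggestion can be sketched with a small helper (hypothetical, not the actual {{FpgaResourceHandlerImpl}} code): keying the bookkeeping on a content hash of the localized aocx file means a recompiled bitstream produces a different key and triggers reprogramming, even if the file name is unchanged.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// Sketch of content-based IP identification: SHA-256 over the bitstream bytes.
public class IpFileHashSketch {
    static String sha256Hex(byte[] content) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Convenience overload for a localized aocx file.
    static String sha256Hex(Path aocxFile) throws Exception {
        return sha256Hex(Files.readAllBytes(aocxFile));
    }
}
```

Two builds of the same application then map to different hashes, so the "same file name, different bitstream" case is detected.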
[jira] [Created] (YARN-9265) FPGA plugin fails to recognize Intel PAC card
Peter Bacsko created YARN-9265: -- Summary: FPGA plugin fails to recognize Intel PAC card Key: YARN-9265 URL: https://issues.apache.org/jira/browse/YARN-9265 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.1.0 Reporter: Peter Bacsko The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card). There are two major issues.
Problem #1
The output of aocl diagnose:
{noformat}
Device Name: acl0
Package Pat: /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
Vendor: Intel Corp

Physical Dev Name   Status   Information
pac_a10_f20         Passed   PAC Arria 10 Platform (pac_a10_f20)
                             PCIe 08:00.0
                             FPGA temperature = 79 degrees C.

DIAGNOSTIC_PASSED

Call "aocl diagnose " to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices
{noformat}
This generates the following error message:
{noformat}
2019-01-25 06:46:02,834 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin: Using FPGA vendor plugin: org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
2019-01-25 06:46:02,943 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer: Trying to diagnose FPGA information ...
2019-01-25 06:46:03,085 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule: Using traffic control bandwidth handler
2019-01-25 06:46:03,108 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl: Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
2019-01-25 06:46:03,139 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl: FPGA Plugin bootstrap success.
2019-01-25 06:46:03,247 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: Couldn't find (?i)bus:slot.func\s=\s.*, pattern
2019-01-25 06:46:03,248 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
2019-01-25 06:46:03,251 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: Failed to get major-minor number from reading /dev/pac_a10_f30
2019-01-25 06:46:03,252 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems! org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: No FPGA devices detected!
{noformat}
Problem #2
The plugin assumes that the file name under {{/dev}} can be derived from the "Physical Dev Name". This is not always the case. For example, it thinks that the device file is {{/dev/pac_a10_f30}}, but the actual file is {{/dev/intel-fpga-port.0}}.
[jira] [Created] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin
Peter Bacsko created YARN-9266: -- Summary: Various fixes are needed in IntelFpgaOpenclPlugin Key: YARN-9266 URL: https://issues.apache.org/jira/browse/YARN-9266 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Problems identified in this class:
* InnerShellExecutor ignores the timeout parameter
* configureIP() uses printStackTrace() instead of logging
* configureIP() does not log the output of aocl if the exit code != 0
* parseDiagnoseInfo() is too heavyweight - it should be in its own class for better testability
* downloadIP() uses contains() for the file name check - this can really surprise users in some cases (e.g. you want to use hello.aocx, but hello2.aocx also matches)
* the method name downloadIP() is misleading - it actually tries to find the file. Everything is downloaded (localized) at this point.
* @VisibleForTesting methods should be package private
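The contains() pitfall from the list above can be demonstrated in a few lines (a hypothetical helper, not the actual plugin code; the loose check mimics the described behavior, the exact check is the suggested fix):

```java
// Sketch of the file-name matching pitfall: a substring check accepts
// "hello2.aocx" when the user asked for "hello.aocx"; exact comparison doesn't.
public class IpFileMatcherSketch {
    // Current-style check: match if the candidate contains the requested base name.
    static boolean looseMatch(String candidate, String wanted) {
        return candidate.contains(wanted.replace(".aocx", ""));
    }

    // Suggested check: the localized file name must match exactly.
    static boolean exactMatch(String candidate, String wanted) {
        return candidate.equals(wanted);
    }
}
```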
[jira] [Created] (YARN-9264) [Umbrella] Follow-up on IntelOpenCL FPGA plugin
Peter Bacsko created YARN-9264: -- Summary: [Umbrella] Follow-up on IntelOpenCL FPGA plugin Key: YARN-9264 URL: https://issues.apache.org/jira/browse/YARN-9264 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.1.1 Reporter: Peter Bacsko The Intel FPGA resource type support was released in Hadoop 3.1.0. Right now the plugin implementation has some deficiencies that need to be fixed. This JIRA lists all problems that need to be resolved.
[jira] [Created] (YARN-9011) Race condition during decommissioning
Peter Bacsko created YARN-9011: -- Summary: Race condition during decommissioning Key: YARN-9011 URL: https://issues.apache.org/jira/browse/YARN-9011 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.1.1 Reporter: Peter Bacsko Assignee: Antal Bálint Steinbach During internal testing, we found a nasty race condition which occurs during decommissioning.
Node manager, incorrect behaviour:
{noformat}
2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 hostname:node-6.hostname.com
{noformat}
Node manager, expected behaviour:
{noformat}
2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: DECOMMISSIONING node-6.hostname.com:8041 is ready to be decommissioned
{noformat}
Note the two different messages from the RM ("Disallowed NodeManager" vs "DECOMMISSIONING").
The problem is that {{ResourceTrackerService}} can see an inconsistent state of nodes while they're being updated:
{noformat}
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219} exclude:{node-6.hostname.com}
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully decommission node node-6.hostname.com:8041 with state RUNNING
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: node-6.hostname.com
2018-06-18 21:00:17,576 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node node-6.hostname.com:8041 in DECOMMISSIONING.
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn IP=172.26.22.115 OPERATION=refreshNodes TARGET=AdminService RESULT=SUCCESS
2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve original total capability:
2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
{noformat}
When the decommissioning succeeds, there is no output logged from {{ResourceTrackerService}}.
[jira] [Created] (YARN-9008) Extend YARN distributed shell with file localization feature
Peter Bacsko created YARN-9008: -- Summary: Extend YARN distributed shell with file localization feature Key: YARN-9008 URL: https://issues.apache.org/jira/browse/YARN-9008 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.1.1, 2.9.1 Reporter: Peter Bacsko Assignee: Peter Bacsko YARN distributed shell is a very handy tool to test various features of YARN. However, it lacks support for file localization - that is, you cannot define files on the command line that you wish to be localized remotely. Such support would be extremely useful in certain scenarios.
[jira] [Created] (YARN-6715) NodeHealthScriptRunner does not handle non-zero exit codes properly
Peter Bacsko created YARN-6715: -- Summary: NodeHealthScriptRunner does not handle non-zero exit codes properly Key: YARN-6715 URL: https://issues.apache.org/jira/browse/YARN-6715 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Peter Bacsko There is a bug in NodeHealthScriptRunner. The {{FAILED_WITH_EXIT_CODE}} case is incorrect:
{noformat}
void reportHealthStatus(HealthCheckerExitStatus status) {
  long now = System.currentTimeMillis();
  switch (status) {
    case SUCCESS:
      setHealthStatus(true, "", now);
      break;
    case TIMED_OUT:
      setHealthStatus(false, NODE_HEALTH_SCRIPT_TIMED_OUT_MSG);
      break;
    case FAILED_WITH_EXCEPTION:
      setHealthStatus(false, exceptionStackTrace);
      break;
    case FAILED_WITH_EXIT_CODE:
      setHealthStatus(true, "", now);
      break;
    case FAILED:
      setHealthStatus(false, shexec.getOutput());
      break;
  }
}
{noformat}
This case also lacks unit test coverage.
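The intended behavior can be sketched as follows (a hypothetical standalone reproduction, not the actual NodeHealthScriptRunner code): a non-zero exit code from the health script should mark the node unhealthy, instead of reporting it healthy as the snippet above does.

```java
// Minimal sketch of the corrected switch: every failure outcome, including
// FAILED_WITH_EXIT_CODE, marks the node unhealthy and keeps the report text.
enum HealthCheckerExitStatus {
    SUCCESS, TIMED_OUT, FAILED_WITH_EXCEPTION, FAILED_WITH_EXIT_CODE, FAILED
}

class HealthReporterSketch {
    boolean healthy;
    String report = "";

    void reportHealthStatus(HealthCheckerExitStatus status, String output) {
        switch (status) {
            case SUCCESS:
                healthy = true;
                report = "";
                break;
            case FAILED_WITH_EXIT_CODE: // previously marked the node healthy
            case TIMED_OUT:
            case FAILED_WITH_EXCEPTION:
            case FAILED:
                healthy = false;
                report = output;
                break;
        }
    }
}
```

A unit test for the missing case then simply feeds {{FAILED_WITH_EXIT_CODE}} and asserts the node is reported unhealthy.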