[jira] [Created] (YARN-10958) Use correct configuration for Group service init in CSMappingPlacementRule
Peter Bacsko created YARN-10958:
---

Summary: Use correct configuration for Group service init in CSMappingPlacementRule
Key: YARN-10958
URL: https://issues.apache.org/jira/browse/YARN-10958
Project: Hadoop YARN
Issue Type: Bug
Reporter: Peter Bacsko

There is a potential problem in {{CSMappingPlacementRule.java}}:

{noformat}
if (groups == null) {
  groups = Groups.getUserToGroupsMappingService(conf);
}
{noformat}

The problem is that we are supposed to pass {{scheduler.getConf()}}. The "conf" object is the configuration for Capacity Scheduler, which does not include the property that selects the group service provider. Therefore, the current code only works by chance, because the Group mapping service is already initialized at this point. See the original fix in YARN-10053.

A unit test is also needed to verify this. Idea:
# Create a Configuration object in which the property "hadoop.security.group.mapping" refers to an existing test implementation.
# Add a new method to {{Groups}} which nulls out the singleton instance, eg. {{Groups.reset()}}.
# Create a mock CapacityScheduler where {{getConf()}} and {{getConfiguration()}} contain different settings for "hadoop.security.group.mapping". Since {{getConf()}} is the service config, this should return the config object created in step #1.
# Create an instance of {{CSMappingPlacementRule}} with a single primary group rule.
# Run the placement evaluation.
# Expected: the returned queue matches what is supposed to come from the test group mapping service ("testuser" --> "testqueue").
# Modify "hadoop.security.group.mapping" in the config object created in step #1.
# Call {{Groups.refresh()}}, which changes the group mapping ("testuser" --> "testqueue2"). This requires that the test group mapping service implement {{GroupMappingServiceProvider.cacheGroupsRefresh()}}.
# Create a new instance of {{CSMappingPlacementRule}}.
# Run the placement evaluation again.
# Expected: with the same user, the target queue has changed.

This looks convoluted, but these steps make sure that:
# {{CSMappingPlacementRule}} will force the initialization of groups.
# We select the correct configuration for group service init.
# We don't create a new {{Groups}} instance if the singleton is initialized, so we cover the original problem described in YARN-10597.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
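The singleton behavior the test idea above relies on can be sketched in plain Java. This is a minimal, self-contained sketch: the {{Conf}} and {{Groups}} classes here are simplified stand-ins, not the real Hadoop classes, but the lazy-init pattern mirrors {{Groups.getUserToGroupsMappingService(conf)}} and shows why passing the wrong config currently "works by chance" (only the first caller's config matters):

```java
import java.util.HashMap;

public class GroupServiceSketch {
    // Stand-in for org.apache.hadoop.conf.Configuration.
    static class Conf extends HashMap<String, String> { }

    static class Groups {
        private static Groups instance;
        final String provider;

        private Groups(Conf conf) {
            this.provider = conf.getOrDefault("hadoop.security.group.mapping", "none");
        }

        // Mirrors Groups.getUserToGroupsMappingService(conf): only the FIRST
        // caller's config is used; later configs are silently ignored.
        static synchronized Groups getUserToGroupsMappingService(Conf conf) {
            if (instance == null) {
                instance = new Groups(conf);
            }
            return instance;
        }

        // Step 2 of the test idea: null out the singleton.
        static synchronized void reset() {
            instance = null;
        }
    }

    public static void main(String[] args) {
        Conf serviceConf = new Conf();   // analogue of scheduler.getConf()
        serviceConf.put("hadoop.security.group.mapping", "TestGroupMapping");
        Conf csConf = new Conf();        // capacity-scheduler config: property missing

        // If the service is already initialized, passing the wrong config is harmless.
        Groups g1 = Groups.getUserToGroupsMappingService(serviceConf);
        Groups g2 = Groups.getUserToGroupsMappingService(csConf);
        System.out.println(g1.provider + " " + (g1 == g2));

        // But after reset(), initializing with the CS config picks the wrong provider.
        Groups.reset();
        Groups g3 = Groups.getUserToGroupsMappingService(csConf);
        System.out.println(g3.provider);
    }
}
```

This is exactly why the unit test must reset the singleton first: without {{Groups.reset()}}, the bug is masked by earlier initialization.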
[jira] [Created] (YARN-10848) Vcore usage problem with Default/DominantResourceCalculator
Peter Bacsko created YARN-10848:
---

Summary: Vcore usage problem with Default/DominantResourceCalculator
Key: YARN-10848
URL: https://issues.apache.org/jira/browse/YARN-10848
Project: Hadoop YARN
Issue Type: Bug
Components: capacity scheduler, capacityscheduler
Reporter: Peter Bacsko

If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating containers even if we run out of vcores.

CS checks the available resources at two places. The first check is {{CapacityScheduler.allocateContainerOnSingleNode()}}:

{noformat}
if (calculator.computeAvailableContainers(Resources
    .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
    minimumAllocation) <= 0) {
  LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient "
      + "available or preemptible resource for minimum allocation");
{noformat}

The second, which is more important, is located in {{RegularContainerAllocator.assignContainer()}}:

{noformat}
if (!Resources.fitsIn(rc, capability, totalResource)) {
  LOG.warn("Node : " + node.getNodeID()
      + " does not have sufficient resource for ask : " + pendingAsk
      + " node total capability : " + node.getTotalResource());
  // Skip this locality request
  ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
      activitiesManager, node, application, schedulerKey,
      ActivityDiagnosticConstant.
          NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST
          + getResourceDiagnostics(capability, totalResource),
      ActivityLevel.NODE);
  return ContainerAllocation.LOCALITY_SKIPPED;
}
{noformat}

Here, {{rc}} is the resource calculator instance, and the other two values are:

{noformat}
Resource capability = pendingAsk.getPerAllocationResource();
Resource available = node.getUnallocatedResource();
{noformat}

There is a repro unit test attached to this case which can demonstrate the problem. The root cause is that we pass the resource calculator to {{Resources.fitsIn()}}. Instead, we should use an overridden version, just like in {{FSAppAttempt.assignContainer()}}:

{noformat}
// Can we allocate a container on this node?
if (Resources.fitsIn(capability, available)) {
  // Inform the application of the new container for this request
  RMContainer allocatedContainer =
      allocate(type, node, schedulerKey, pendingAsk, reservedContainer);
{noformat}

In CS, if we switch to DominantResourceCalculator OR use {{Resources.fitsIn()}} without the calculator in {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}).
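The difference between the two {{fitsIn}} overloads can be illustrated with a simplified sketch (the {{Resource}} type and method names below are stand-ins, not the actual Hadoop classes): a DefaultResourceCalculator-style check compares memory only, so an ask can "fit" on a node whose vcores are exhausted, while a per-resource check rejects it.

```java
public class FitsInSketch {
    // Simplified stand-in for org.apache.hadoop.yarn.api.records.Resource.
    record Resource(long memoryMb, int vcores) { }

    // Analogue of Resources.fitsIn(rc, capability, available) when rc is a
    // DefaultResourceCalculator: only the memory dimension is compared.
    static boolean fitsInMemoryOnly(Resource ask, Resource avail) {
        return ask.memoryMb() <= avail.memoryMb();
    }

    // Analogue of the calculator-free Resources.fitsIn(capability, available):
    // every resource dimension must fit.
    static boolean fitsInAllResources(Resource ask, Resource avail) {
        return ask.memoryMb() <= avail.memoryMb() && ask.vcores() <= avail.vcores();
    }

    public static void main(String[] args) {
        Resource ask = new Resource(1024, 1);
        Resource noVcoresLeft = new Resource(4096, 0); // memory free, vcores exhausted

        System.out.println(fitsInMemoryOnly(ask, noVcoresLeft));   // true: container wrongly accepted
        System.out.println(fitsInAllResources(ask, noVcoresLeft)); // false: allocation correctly skipped
    }
}
```

This is the same behavioral gap the attached repro test exercises: the memory-only comparison is what lets CS keep allocating past the vcore limit.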
[jira] [Resolved] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YARN-9698.
---
Fix Version/s: 3.4.0
Resolution: Fixed

> [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
> ----------------------------------------------------------------------------
>
> Key: YARN-9698
> URL: https://issues.apache.org/jira/browse/YARN-9698
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler
> Reporter: Weiwei Yang
> Priority: Major
> Labels: fs2cs
> Fix For: 3.4.0
>
> Attachments: FS-CS Migration.pdf
>
> We see that some users want to migrate from Fair Scheduler to Capacity
> Scheduler. This Jira is created as an umbrella to track all related efforts
> for the migration; the scope contains:
> * Bug fixes
> * Adding missing features
> * Migration tools that help to generate CS configs based on FS, validate
>   configs, etc.
> * Documents
> This is part of the CS component; the purpose is to make the migration
> process smooth.
[jira] [Created] (YARN-10843) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
Peter Bacsko created YARN-10843:
---

Summary: [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler - part II
Key: YARN-10843
URL: https://issues.apache.org/jira/browse/YARN-10843
Project: Hadoop YARN
Issue Type: Task
Reporter: Peter Bacsko
[jira] [Created] (YARN-10796) Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
Peter Bacsko created YARN-10796:
---

Summary: Capacity Scheduler: dynamic queue cannot scale out properly if its capacity is 0%
Key: YARN-10796
URL: https://issues.apache.org/jira/browse/YARN-10796
Project: Hadoop YARN
Issue Type: Task
Components: capacity scheduler, capacityscheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko

If we have a dynamic queue (AutoCreatedLeafQueue) with capacity = 0%, then it cannot properly scale out even if its max-capacity and the parent's max-capacity would allow it.

Example:

{noformat}
Cluster Capacity: 16 GB / 16 vcores (2 nodes, each with 8 GB / 8 vcores)
Container allocation size: 1 GB / 1 vcore

Root.dynamic Effective Capacity: ( 50.0%)
Root.dynamic Effective Max Capacity: (100.0%)

Template:
  Capacity: 40%
  Max Capacity: 100%
  User Limit Factor: 4
{noformat}

leaf-queue-template.capacity = 40%
leaf-queue-template.maximum-capacity = 100%
leaf-queue-template.maximum-am-resource-percent = 50%
leaf-queue-template.minimum-user-limit-percent = 100%
leaf-queue-template.user-limit-factor = 4

"root.dynamic" has a maximum capacity of 100% and a capacity of 50%. Let's assume there are running containers in these dynamic queues (MR sleep jobs):

root.dynamic.user1 = 1 AM + 3 containers (capacity = 40%)
root.dynamic.user2 = 1 AM + 3 containers (capacity = 40%)
root.dynamic.user3 = 1 AM + 15 containers (capacity = 0%)

This scenario results in an underutilized cluster: there will be approximately 18% unused capacity. On the other hand, it's still possible to submit a new application to root.dynamic.user1 or root.dynamic.user2 and reach 100% utilization.
[jira] [Created] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
Peter Bacsko created YARN-10779:
---

Summary: Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl
Key: YARN-10779
URL: https://issues.apache.org/jira/browse/YARN-10779
Project: Hadoop YARN
Issue Type: Task
Components: resourcemanager
Reporter: Peter Bacsko
Assignee: Peter Bacsko

In both {{GetApplicationsRequestPBImpl}} and {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase conversion:

{noformat}
checkTags(tags);
// Convert applicationTags to lower case and add
this.applicationTags = new TreeSet<>();
for (String tag : tags) {
  this.applicationTags.add(StringUtils.toLowerCase(tag));
}
{noformat}

However, we encountered some cases where this is not desirable for "userid" tags.

Proposed solution: since both classes are fairly low-level and are often instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should be cached inside them. A new property should be created which tells whether the lowercase conversion should occur or not.
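The proposed change could look roughly like the sketch below. Note that the property name and the {{convertTags}} helper are hypothetical, invented here for illustration; the actual property name and wiring (reading it from the cached {{Configuration}}) would be decided in the patch.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TagConversionSketch {
    // HYPOTHETICAL property name, for illustration only.
    static final String FORCE_LOWERCASE_TAGS = "yarn.application.tags.force-lowercase";

    // Replacement for the unconditional loop in the PBImpl classes: the
    // lowercase conversion happens only when the flag is enabled.
    static Set<String> convertTags(Iterable<String> tags, boolean forceLowercase) {
        Set<String> result = new TreeSet<>();
        for (String tag : tags) {
            // The real code uses Hadoop's StringUtils.toLowerCase(tag).
            result.add(forceLowercase ? tag.toLowerCase() : tag);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(convertTags(List.of("UserId=Alice"), true));  // converted
        System.out.println(convertTags(List.of("UserId=Alice"), false)); // preserved
    }
}
```

With the flag defaulting to true, existing behavior is preserved and only clusters that opt out keep the original tag casing.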
[jira] [Resolved] (YARN-8786) LinuxContainerExecutor fails sporadically in create_local_dirs
[ https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YARN-8786.
---
Resolution: Fixed

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --------------------------------------------------------------
>
> Key: YARN-8786
> URL: https://issues.apache.org/jira/browse/YARN-8786
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Jon Bender
> Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: Container container_1530684675517_516620_01_020846 transitioned from SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source, it's traceable to errors here:
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
> and ultimately to
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code
> / errno is swallowed so we don't have more details. We tend to see this when
> many containers start at the same time for the same application on a host,
> and suspect it may be related to some race conditions around those shared
> directories between containers for the same application.
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO nodemanager.NMAuditLogger: USER=root IP=<> OPERATION=Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1530684675517_559126 CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO nodemanager.NMAuditLogger: USER=root IP=<> OPERATION=Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1530684675517_559126 CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1530684675517_559126 CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Two containers for the same application start in quick succession, followed
> by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon, but I don't expect this to be fixed by the
> upgrade; the only major JIRAs that affected the executor since 3.0.0 seem
> unrelated
> ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8]
> and
> [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56])
[jira] [Resolved] (YARN-10643) Fix the race condition introduced by YARN-8995.
[ https://issues.apache.org/jira/browse/YARN-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YARN-10643.
---
Resolution: Duplicate

> Fix the race condition introduced by YARN-8995.
> -----------------------------------------------
>
> Key: YARN-10643
> URL: https://issues.apache.org/jira/browse/YARN-10643
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.3.0, 3.2.1
> Reporter: Qi Zhu
> Assignee: zhengchenyu
> Priority: Critical
> Attachments: YARN-10643.001.patch
>
> The race condition introduced by -YARN-8995.-
> The problem has been raised in YARN-10221 and also in YARN-10642.
> I think we should fix it in a hurry.
> I will help fix it.
[jira] [Created] (YARN-10631) Document AM-preemption related changes (YARN-9537 and YARN-10625)
Peter Bacsko created YARN-10631:
---

Summary: Document AM-preemption related changes (YARN-9537 and YARN-10625)
Key: YARN-10631
URL: https://issues.apache.org/jira/browse/YARN-10631
Project: Hadoop YARN
Issue Type: Task
Reporter: Peter Bacsko
Assignee: Peter Bacsko

Preemption-related changes were introduced in YARN-9537 and YARN-10625. These also introduce new properties which are not documented for Fair Scheduler. Extend the documentation with these enhancements.
[jira] [Created] (YARN-10625) FairScheduler: add global flag to disable AM-preemption
Peter Bacsko created YARN-10625:
---

Summary: FairScheduler: add global flag to disable AM-preemption
Key: YARN-10625
URL: https://issues.apache.org/jira/browse/YARN-10625
Project: Hadoop YARN
Issue Type: Improvement
Components: fairscheduler
Affects Versions: 3.3.0
Reporter: Peter Bacsko
Assignee: Peter Bacsko

YARN-9537 added a feature to disable AM preemption on a per-queue basis. This is a nice enhancement, but it's very inconvenient if the cluster has a lot of queues, or if queues are dynamically created/deleted regularly (static queue configuration changes). It's a legitimate use case to have AM preemption turned off completely. To make this easier, add a property which acts as a global flag for this feature.
[jira] [Created] (YARN-10620) fs2cs: parentQueue for certain placement rules are not set during conversion
Peter Bacsko created YARN-10620:
---

Summary: fs2cs: parentQueue for certain placement rules are not set during conversion
Key: YARN-10620
URL: https://issues.apache.org/jira/browse/YARN-10620
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko
[jira] [Created] (YARN-10599) fs2cs should generate new "auto-queue-creation-v2.enabled" properties for all parents
Peter Bacsko created YARN-10599:
---

Summary: fs2cs should generate new "auto-queue-creation-v2.enabled" properties for all parents
Key: YARN-10599
URL: https://issues.apache.org/jira/browse/YARN-10599
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Peter Bacsko
Assignee: Peter Bacsko
[jira] [Created] (YARN-10593) Fix incorrect string comparison in GpuDiscoverer
Peter Bacsko created YARN-10593:
---

Summary: Fix incorrect string comparison in GpuDiscoverer
Key: YARN-10593
URL: https://issues.apache.org/jira/browse/YARN-10593
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Peter Bacsko
Assignee: Peter Bacsko

The following comparison in {{GpuDiscoverer}} is invalid:

{noformat}
binaryPath = configuredBinaryFile;
// If path exists but file name is incorrect don't execute the file
String fileName = binaryPath.getName();
if (DEFAULT_BINARY_NAME.equals(fileName)) {   <--- inverse condition needed
  String msg = String.format("Please check the configuration value of"
      + " %s. It should point to an %s binary.",
      YarnConfiguration.NM_GPU_PATH_TO_EXEC, DEFAULT_BINARY_NAME);
  throwIfNecessary(new YarnException(msg), config);
  LOG.warn(msg);
}
{noformat}

Obviously, it should be the other way around: we should log a warning or throw an exception if the file names *differ*, not when they are equal. Consider adding a unit test for this.
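A minimal sketch of the corrected logic, with the validation extracted into a testable helper. This is illustrative only: it assumes {{DEFAULT_BINARY_NAME}} is "nvidia-smi" (as in {{GpuDiscoverer}}) and replaces the throw/log machinery with a returned message so the condition itself can be unit tested.

```java
public class GpuBinaryCheckSketch {
    // Assumed value, matching GpuDiscoverer's expected GPU binary.
    static final String DEFAULT_BINARY_NAME = "nvidia-smi";

    // Corrected check: complain only when the configured file name DIFFERS
    // from the expected binary name (the current code has this inverted).
    static String validate(String configuredFileName) {
        if (!DEFAULT_BINARY_NAME.equals(configuredFileName)) {
            return "Please check the configured path. It should point to a "
                + DEFAULT_BINARY_NAME + " binary.";
        }
        return null; // names match, nothing to report
    }

    public static void main(String[] args) {
        System.out.println(validate(DEFAULT_BINARY_NAME));    // null: valid name
        System.out.println(validate("some-other-binary"));     // warning message
    }
}
```

A unit test for this is then a pair of one-line assertions on matching and non-matching names.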
[jira] [Created] (YARN-10577) Automatically convert placement rules in fs2cs
Peter Bacsko created YARN-10577:
---

Summary: Automatically convert placement rules in fs2cs
Key: YARN-10577
URL: https://issues.apache.org/jira/browse/YARN-10577
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko

Currently, users have to use the "\-m" or "\-\-convert-placement-rules" switch to convert the placement rules from FS. Initially, we converted to the old mapping rule format, which has serious limitations, so we disabled the automatic conversion. With the new JSON-based format and placement engine, this conversion should happen automatically.
[jira] [Created] (YARN-10576) Update Capacity Scheduler about JSON-based placement mapping
Peter Bacsko created YARN-10576:
---

Summary: Update Capacity Scheduler about JSON-based placement mapping
Key: YARN-10576
URL: https://issues.apache.org/jira/browse/YARN-10576
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko

The weight mode and AQC also affect how the new placement engine in CS works. Certain statements in the documentation are no longer valid, for example:
* create flag: "Only applies to managed queue parents" - there is no ManagedParentQueue in weight mode.
* "The nested rules primaryGroupUser and secondaryGroupUser expects the parent queues to exist, ie. they cannot be created automatically". This only applies to the legacy absolute/percentage mode.

Find all statements that mention possible limitations and fix them if necessary.
[jira] [Created] (YARN-10573) Enhance placement rule conversion in fs2cs in weight mode
Peter Bacsko created YARN-10573:
---

Summary: Enhance placement rule conversion in fs2cs in weight mode
Key: YARN-10573
URL: https://issues.apache.org/jira/browse/YARN-10573
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko

If we're using weight mode, we have much more freedom when it comes to placement rules. In YARN-10525, weight conversion became the default in {{fs2cs}}. This also means that we can support nested rules properly and queues can be created under {{root}}. Therefore, a lot of warnings and validations inside {{QueuePlacementConverter}} are not necessary and are only relevant if the user chose percentage-based conversion on the command line.
[jira] [Created] (YARN-10570) Remove "experimental" warning message from fs2cs
Peter Bacsko created YARN-10570:
---

Summary: Remove "experimental" warning message from fs2cs
Key: YARN-10570
URL: https://issues.apache.org/jira/browse/YARN-10570
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko

Although the {{fs2cs}} tool has been in constant development, it has been used and tested by a group of people, so let's remove the following message:

{{WARNING: This feature is experimental and not intended for production use!}}
[jira] [Created] (YARN-10563) Fix dependency exclusion problem in poms
Peter Bacsko created YARN-10563:
---

Summary: Fix dependency exclusion problem in poms
Key: YARN-10563
URL: https://issues.apache.org/jira/browse/YARN-10563
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko
[jira] [Created] (YARN-10515) Fix flaky test TestCapacitySchedulerAutoQueueCreation.testDynamicAutoQueueCreationWithTags
Peter Bacsko created YARN-10515:
---

Summary: Fix flaky test TestCapacitySchedulerAutoQueueCreation.testDynamicAutoQueueCreationWithTags
Key: YARN-10515
URL: https://issues.apache.org/jira/browse/YARN-10515
Project: Hadoop YARN
Issue Type: Bug
Reporter: Peter Bacsko
Assignee: Peter Bacsko

The testcase TestCapacitySchedulerAutoQueueCreation.testDynamicAutoQueueCreationWithTags sometimes fails with the following error:

{noformat}
org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize queues
  at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:174)
  at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:884)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1296)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:339)
  at org.apache.hadoop.yarn.server.resourcemanager.MockRM.serviceInit(MockRM.java:1018)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
  at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:158)
  at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:134)
  at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:130)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAutoQueueCreation$5.<init>(TestCapacitySchedulerAutoQueueCreation.java:873)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAutoQueueCreation.testDynamicAutoQueueCreationWithTags(TestCapacitySchedulerAutoQueueCreation.java:873)
{noformat}

We have to reset queue metrics before running this test to make sure it passes.
[jira] [Created] (YARN-10507) Add the capability to fs2cs to write the converted placement rules inside capacity-scheduler.xml
Peter Bacsko created YARN-10507:
---

Summary: Add the capability to fs2cs to write the converted placement rules inside capacity-scheduler.xml
Key: YARN-10507
URL: https://issues.apache.org/jira/browse/YARN-10507
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko

Currently, the fs2cs tool generates a separate {{mapping-rules.json}} file when it converts the placement rules. However, we also support having the JSON inlined inside {{capacity-scheduler.xml}}. Add a command line switch so that we can choose the desired output.
[jira] [Resolved] (YARN-10103) Capacity scheduler: add support for create=true/false per mapping rule
[ https://issues.apache.org/jira/browse/YARN-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YARN-10103.
---
Resolution: Won't Do

> Capacity scheduler: add support for create=true/false per mapping rule
> ----------------------------------------------------------------------
>
> Key: YARN-10103
> URL: https://issues.apache.org/jira/browse/YARN-10103
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Priority: Major
> Labels: fs2cs
>
> You can't ask Capacity Scheduler for a mapping to create a queue if it
> doesn't exist.
> For example, this mapping would use the first rule if the queue exists. If it
> doesn't, then it proceeds to the next rule:
> {{u:%user:%primary_group.%user:create=false;u:%user%:root.default}}
> Let's say user "alice" belongs to the "admins" group. It would first try to
> map to {{root.admins.alice}}. But if the queue doesn't exist, then it places
> the application into {{root.default}}.
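The fallback semantics described above can be sketched as follows. The {{Rule}} type and the {{place}} helper are hypothetical stand-ins for the real mapping-rule engine: a rule with create=false only matches when its resolved target queue already exists, otherwise evaluation falls through to the next rule.

```java
import java.util.List;
import java.util.Set;

public class MappingRuleSketch {
    // Simplified: targetQueue is assumed already resolved (e.g. %primary_group.%user).
    record Rule(String targetQueue, boolean create) { }

    // Returns the first applicable target queue, or null if no rule matches.
    static String place(List<Rule> rules, Set<String> existingQueues) {
        for (Rule rule : rules) {
            // create=true may create the queue; create=false requires it to exist.
            if (rule.create() || existingQueues.contains(rule.targetQueue())) {
                return rule.targetQueue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Analogue of u:%user:%primary_group.%user:create=false;u:%user%:root.default
        // for user "alice" in group "admins":
        List<Rule> rules = List.of(
            new Rule("root.admins.alice", false),
            new Rule("root.default", true));

        System.out.println(place(rules, Set.of("root.admins.alice", "root.default")));
        System.out.println(place(rules, Set.of("root.default"))); // queue missing -> fallback
    }
}
```

The first evaluation lands in root.admins.alice; the second falls through to root.default, which is exactly the behavior the rule string in the description asks for.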
[jira] [Created] (YARN-10486) FS-CS converter: handle case when weight=0
Peter Bacsko created YARN-10486:
---

Summary: FS-CS converter: handle case when weight=0
Key: YARN-10486
URL: https://issues.apache.org/jira/browse/YARN-10486
Project: Hadoop YARN
Issue Type: Sub-task
Components: yarn
Reporter: Peter Bacsko
Assignee: Peter Bacsko

We can encounter an ArithmeticException if there is a single queue, or there are multiple queues, under a parent with zero weight.
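A sketch of the failure mode and a possible guard. The conversion of sibling FS weights to CS capacity percentages divides by the sum of the siblings' weights, which blows up when every weight is 0. The even-split fallback below is just one illustrative policy, not necessarily what the converter should do:

```java
import java.util.List;

public class WeightConversionSketch {
    // Converts one queue's weight to a capacity percentage among its siblings.
    // Without the sum == 0 guard, an all-zero parent would divide by zero.
    static double toCapacityPercent(int weight, List<Integer> siblingWeights) {
        int sum = siblingWeights.stream().mapToInt(Integer::intValue).sum();
        if (sum == 0) {
            // Illustrative fallback: split the parent's capacity evenly.
            return 100.0 / siblingWeights.size();
        }
        return 100.0 * weight / sum;
    }

    public static void main(String[] args) {
        System.out.println(toCapacityPercent(2, List.of(2, 2))); // 50.0
        System.out.println(toCapacityPercent(0, List.of(0, 0))); // 50.0 instead of an exception
    }
}
```

Note that with integer arithmetic the unguarded version throws ArithmeticException outright, while a floating-point version would silently produce NaN; either way the converter needs an explicit zero-sum branch.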
[jira] [Created] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
Peter Bacsko created YARN-10460: --- Summary: Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail Key: YARN-10460 URL: https://issues.apache.org/jira/browse/YARN-10460 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, test Reporter: Peter Bacsko Assignee: Peter Bacsko In our downstream build environment, we're using JUnit 4.13. Recently, we discovered a truly weird test failure in TestNodeStatusUpdater. The problem is that timeout handling has changed in Junit 4.13. See the difference between these two snippets: 4.12 {noformat} @Override public void evaluate() throws Throwable { CallableStatement callable = new CallableStatement(); FutureTask task = new FutureTask(callable); threadGroup = new ThreadGroup("FailOnTimeoutGroup"); Thread thread = new Thread(threadGroup, task, "Time-limited test"); thread.setDaemon(true); thread.start(); callable.awaitStarted(); Throwable throwable = getResult(task, thread); if (throwable != null) { throw throwable; } } {noformat} 4.13 {noformat} @Override public void evaluate() throws Throwable { CallableStatement callable = new CallableStatement(); FutureTask task = new FutureTask(callable); ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup"); Thread thread = new Thread(threadGroup, task, "Time-limited test"); try { thread.setDaemon(true); thread.start(); callable.awaitStarted(); Throwable throwable = getResult(task, thread); if (throwable != null) { throw throwable; } } finally { try { thread.join(1); } catch (InterruptedException e) { Thread.currentThread().interrupt(); } try { threadGroup.destroy(); < This } catch (IllegalThreadStateException e) { // If a thread from the group is still alive, the ThreadGroup cannot be destroyed. // Swallow the exception to keep the same behavior prior to this change. } } } {noformat} The change comes from [https://github.com/junit-team/junit4/pull/1517]. 
Unfortunately, destroying the thread group causes an issue because there are all sorts of object caching in the IPC layer. The exception is: {noformat} java.lang.IllegalThreadStateException at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867) at java.lang.Thread.init(Thread.java:402) at java.lang.Thread.init(Thread.java:349) at java.lang.Thread.(Thread.java:675) at java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613) at com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163) at java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136) at org.apache.hadoop.ipc.Client.call(Client.java:1458) at org.apache.hadoop.ipc.Client.call(Client.java:1405) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy81.startContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576) {noformat} Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} is stored as long as they're needed. 
But since the backing thread group was destroyed in the previous test, it's no longer possible to create new threads. A quick workaround is to stop the clients and completely clear the {{ClientCache}} in {{ProtobufRpcEngine}} before each testcase. I tried this and it solves the problem, but it feels hacky. Not sure if there is a better approach. -- This message was sent by Atlassian Jira (v8.3.4#803005) ---
[jira] [Created] (YARN-10454) Add applicationName policy
Peter Bacsko created YARN-10454: --- Summary: Add applicationName policy Key: YARN-10454 URL: https://issues.apache.org/jira/browse/YARN-10454 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
Peter Bacsko created YARN-10447: --- Summary: TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing Key: YARN-10447 URL: https://issues.apache.org/jira/browse/YARN-10447 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Peter Bacsko Assignee: Peter Bacsko YARN-9784 fixed some concurrency related issues in {{TestLeafQueue}}, but not all of them. Occasionally it's still possible to receive an exception from Mockito, and the following two stack traces can be observed in the console:

{noformat}
org.mockito.exceptions.misusing.WrongTypeOfReturnValue:
Integer cannot be returned by isMultiNodePlacementEnabled()
isMultiNodePlacementEnabled() should return boolean
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies -
   - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.
{noformat}

and

{noformat}
2020-09-22 14:44:52,584 INFO [main] capacity.TestUtils (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail=
2020-09-22 14:44:52,585 INFO [main] capacity.TestUtils (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail=
Exception in thread "ActivitiesManager thread."
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Boolean
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
It's probably best to disable the ActivitiesManager thread entirely in this test class; there is no need for it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (YARN-10424) Adapt existing AppName and UserGroupMapping unittests to ensure backwards compatibility
[ https://issues.apache.org/jira/browse/YARN-10424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-10424. - Resolution: Fixed > Adapt existing AppName and UserGroupMapping unittests to ensure backwards > compatibility > --- > > Key: YARN-10424 > URL: https://issues.apache.org/jira/browse/YARN-10424 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10424.001.patch, YARN-10424.002.patch, > YARN-10424.003.patch > > > The class {{UserGroupMappingPlacementRule}} and > {{AppNameMappingPlacementRule}} will disappear. In order to ensure backwards > compatibility when the configuration is defined in the legacy format, > {{TestAppNameMappingPlacementRule}} and {{TestUserGroupMappingPlacementRule}} > should be adapted to use the new evaluator logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10413) Change fs2cs to generate mapping rules in the new format
Peter Bacsko created YARN-10413: --- Summary: Change fs2cs to generate mapping rules in the new format Key: YARN-10413 URL: https://issues.apache.org/jira/browse/YARN-10413 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10387) Implement logic which returns MappingRule objects based on mapping rules
Peter Bacsko created YARN-10387: --- Summary: Implement logic which returns MappingRule objects based on mapping rules Key: YARN-10387 URL: https://issues.apache.org/jira/browse/YARN-10387 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10386) Create new JSON schema for Placement Rules
Peter Bacsko created YARN-10386: --- Summary: Create new JSON schema for Placement Rules Key: YARN-10386 URL: https://issues.apache.org/jira/browse/YARN-10386 Project: Hadoop YARN Issue Type: Sub-task Components: capacity scheduler, capacityscheduler Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10330) Add missing test scenarios to TestUserGroupMappingPlacementRule
Peter Bacsko created YARN-10330: --- Summary: Add missing test scenarios to TestUserGroupMappingPlacementRule Key: YARN-10330 URL: https://issues.apache.org/jira/browse/YARN-10330 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, capacityscheduler, test Reporter: Peter Bacsko Assignee: Peter Bacsko After running {{TestUserGroupMappingPlacementRule}} with EclEmma, it turned out that there are at least 8-10 missing scenarios that are not covered. Since we're planning to enhance mapping rule logic with extra features, it is crucial to have good coverage so that we can verify backward compatibility. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10325) Document max-parallel-apps for Capacity Scheduler
Peter Bacsko created YARN-10325: --- Summary: Document max-parallel-apps for Capacity Scheduler Key: YARN-10325 URL: https://issues.apache.org/jira/browse/YARN-10325 Project: Hadoop YARN Issue Type: Sub-task Components: capacity scheduler, capacityscheduler Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10316) FS-CS converter: convert userMaxApps, maxRunningApps settings
Peter Bacsko created YARN-10316: --- Summary: FS-CS converter: convert userMaxApps, maxRunningApps settings Key: YARN-10316 URL: https://issues.apache.org/jira/browse/YARN-10316 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko In YARN-9930, support for maximum running applications (called "max parallel apps") has been introduced. The converter now can handle the following settings in {{fair-scheduler.xml}}:
* {{maxRunningApps}} per user
* {{maxRunningApps}} per queue
* {{userMaxAppsDefault}}
* {{queueMaxAppsDefault}}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
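For reference, these Fair Scheduler settings map to the max-parallel-apps properties on the Capacity Scheduler side introduced by YARN-9930. The fragment below is illustrative: the queue path "root.users", the user "alice", and the values are assumptions, not converter output.

```xml
<!-- capacity-scheduler.xml: illustrative conversion targets -->
<property>
  <!-- cluster-wide default for the per-queue limit -->
  <name>yarn.scheduler.capacity.max-parallel-apps</name>
  <value>100</value>
</property>
<property>
  <!-- per-queue limit; "root.users" is a hypothetical queue path -->
  <name>yarn.scheduler.capacity.root.users.max-parallel-apps</name>
  <value>10</value>
</property>
<property>
  <!-- default limit applied per user -->
  <name>yarn.scheduler.capacity.user.max-parallel-apps</name>
  <value>5</value>
</property>
<property>
  <!-- limit for one specific user; "alice" is hypothetical -->
  <name>yarn.scheduler.capacity.user.alice.max-parallel-apps</name>
  <value>5</value>
</property>
```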
[jira] [Resolved] (YARN-9888) Capacity scheduler: add support for default maxRunningApps limit per user
[ https://issues.apache.org/jira/browse/YARN-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9888. Resolution: Duplicate This feature will be implemented in YARN-9930. Closing this as duplicate. > Capacity scheduler: add support for default maxRunningApps limit per user > - > > Key: YARN-9888 > URL: https://issues.apache.org/jira/browse/YARN-9888 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler > Reporter: Peter Bacsko > Assignee: Peter Bacsko > Priority: Major > > Fair scheduler has the setting {{userMaxAppsDefault}} which limits how many > running applications each user can have. > Capacity scheduler lacks this feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (YARN-9887) Capacity scheduler: add support for limiting maxRunningApps per user
[ https://issues.apache.org/jira/browse/YARN-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9887. Resolution: Duplicate Closing this as duplicate. Implementation is tracked under YARN-9930.
> Capacity scheduler: add support for limiting maxRunningApps per user
>
> Key: YARN-9887
> URL: https://issues.apache.org/jira/browse/YARN-9887
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacity scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
>
> Fair Scheduler supports limiting the number of applications that a particular
> user can submit:
> {noformat}
> <user name="...">
>   <maxRunningApps>10</maxRunningApps>
> </user>
> {noformat}
> Capacity Scheduler does not have an exact equivalent.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10283) Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used
Peter Bacsko created YARN-10283: --- Summary: Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used Key: YARN-10283 URL: https://issues.apache.org/jira/browse/YARN-10283 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko Recently we've been investigating a scenario where applications submitted to a lower priority queue could not get scheduled because a higher priority queue in the same hierarchy could not satisfy the allocation request. Both queues belonged to the same partition. If we disabled node labels, the problem disappeared. The problem is that {{RegularContainerAllocator}} always allocated a container for the request, even if it should not have.

*Example:*
* Cluster total resources: 3 nodes, 15GB, 24 vcores
* Partition "shared" was created with 2 nodes
* "root.lowprio" (priority = 20) and "root.highprio" (priority = 40) were added to the partition
* Both queues have a limit of
* Using DominantResourceCalculator

Setup: Submit distributed shell application to highprio with switches "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per container.

Chain of events:
1. Queue is filled with containers until it reaches usage
2. A node update event is pushed to CS from a node which is part of the partition
3. {{AbstractCSQueue.canAssignToQueue()}} returns true because it's smaller than the current limit resource
4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an allocated container for
5. But we can't commit the resource request because we would have 9 vcores in total, violating the limit.

The problem is that we always try to assign a container for the same application in each heartbeat from "highprio". Applications in "lowprio" cannot make progress.

*Problem:* {{RegularContainerAllocator.assignContainer()}} does not handle this case well.
We only reject allocation if this condition is satisfied:
{noformat}
if (rmContainer == null && reservationsContinueLooking
    && node.getLabels().isEmpty()) {
{noformat}
But if we have node labels, we succeed with the allocation if there's room for a container. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (YARN-10158) FS-CS converter: convert property yarn.scheduler.fair.update-interval-ms
[ https://issues.apache.org/jira/browse/YARN-10158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-10158. - Resolution: Won't Do > FS-CS converter: convert property yarn.scheduler.fair.update-interval-ms > > > Key: YARN-10158 > URL: https://issues.apache.org/jira/browse/YARN-10158 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10257) FS-CS converter: check deprecated increment properties for mem/vcores and fix DRF check
Peter Bacsko created YARN-10257: --- Summary: FS-CS converter: check deprecated increment properties for mem/vcores and fix DRF check Key: YARN-10257 URL: https://issues.apache.org/jira/browse/YARN-10257 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko Two issues have been discovered during fs2cs testing:

1. The values of two properties are not checked:
{{yarn.scheduler.increment-allocation-mb}}
{{yarn.scheduler.increment-allocation-vcores}}
Although these two are marked as deprecated, they're still in use and must be handled.

2. The following piece of code is incorrect - the default scheduling policy can be different from DRF, which is a problem if DRF is used everywhere:
{code}
private boolean isDrfUsed(FairScheduler fs) {
  FSQueue rootQueue = fs.getQueueManager().getRootQueue();
  AllocationConfiguration allocConf = fs.getAllocationConfiguration();
  String defaultPolicy = allocConf.getDefaultSchedulingPolicy().getName();

  if (DominantResourceFairnessPolicy.NAME.equals(defaultPolicy)) {
    return true;
  } else {
    return isDrfUsedOnQueueLevel(rootQueue);
  }
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10234) FS-CS converter: don't enable auto-create queue property for root
Peter Bacsko created YARN-10234: --- Summary: FS-CS converter: don't enable auto-create queue property for root Key: YARN-10234 URL: https://issues.apache.org/jira/browse/YARN-10234 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko The auto-create-child-queue property should not be enabled for root, otherwise it causes an exception inside the capacity scheduler:
{noformat}
2020-04-14 09:48:54,117 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2020-04-14 09:48:54,117 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:772)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
        at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:636)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: java.io.IOException: Failed to re-init queues : null
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:489)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:430)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:761)
        ... 6 more
Caused by: java.lang.ClassCastException
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
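As a sketch of the constraint (the queue path "root.users" below is illustrative, and the property name follows the documented legacy auto-queue-creation syntax): the flag belongs on a managed parent queue, never on root itself.

```xml
<!-- Valid: auto-create enabled on a managed parent queue -->
<property>
  <name>yarn.scheduler.capacity.root.users.auto-create-child-queue.enabled</name>
  <value>true</value>
</property>

<!-- Invalid: the converter must not emit this for root;
     it triggers the reinitialization failure shown above -->
<!--
<property>
  <name>yarn.scheduler.capacity.root.auto-create-child-queue.enabled</name>
  <value>true</value>
</property>
-->
```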
[jira] [Created] (YARN-10226) NPE when using %primary_group queue mapping
Peter Bacsko created YARN-10226: --- Summary: NPE when using %primary_group queue mapping Key: YARN-10226 URL: https://issues.apache.org/jira/browse/YARN-10226 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko If we use the following queue mapping: {{u:%user:%primary_group}} then we get an NPE inside ResourceManager:
{noformat}
2020-04-06 11:59:13,883 ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart(881)) - Failed to load/recover state
java.lang.NullPointerException
        at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.getQueue(CapacitySchedulerQueueManager.java:138)
        at org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule.getContextForPrimaryGroup(UserGroupMappingPlacementRule.java:163)
        at org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule.getPlacementForUser(UserGroupMappingPlacementRule.java:118)
        at org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule.getPlacementForApp(UserGroupMappingPlacementRule.java:227)
        at org.apache.hadoop.yarn.server.resourcemanager.placement.PlacementManager.placeApplication(PlacementManager.java:67)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.placeApplication(RMAppManager.java:827)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:378)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:367)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:594)
        ...
{noformat}
We need to check whether the parent queue is null in {{UserGroupMappingPlacementRule.getContextForPrimaryGroup()}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10198) [managedParent].%primary_group placement doesn't work after YARN-9868
Peter Bacsko created YARN-10198: --- Summary: [managedParent].%primary_group placement doesn't work after YARN-9868 Key: YARN-10198 URL: https://issues.apache.org/jira/browse/YARN-10198 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko YARN-9868 introduced an unnecessary check if we have the following placement rule: [managedParentQueue].%primary_group Here, {{%primary_group}} is expected to be created if it doesn't exist. However, there is this validation code which is not necessary:
{noformat}
} else if (mapping.getQueue().equals(PRIMARY_GROUP_MAPPING)) {
  if (this.queueManager
      .getQueue(groups.getGroups(user).get(0)) != null) {
    return getPlacementContext(mapping, groups.getGroups(user).get(0));
  } else {
    return null;
  }
{noformat}
We should revert this part to the original version:
{noformat}
} else if (mapping.queue.equals(PRIMARY_GROUP_MAPPING)) {
  return getPlacementContext(mapping, groups.getGroups(user).get(0));
}
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10197) FS-CS converter: fix emitted ordering policy string and max-am-resource percent value
Peter Bacsko created YARN-10197: --- Summary: FS-CS converter: fix emitted ordering policy string and max-am-resource percent value Key: YARN-10197 URL: https://issues.apache.org/jira/browse/YARN-10197 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10193) FS-CS converter: fix incorrect capacity conversion
Peter Bacsko created YARN-10193: --- Summary: FS-CS converter: fix incorrect capacity conversion Key: YARN-10193 URL: https://issues.apache.org/jira/browse/YARN-10193 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko Conversion of capacities is incorrect if the total doesn't add up exactly to 100.00%. The loop condition must be fixed:
{noformat}
for (int i = 0; i < children.size() - 2; i++) {
{noformat}
The testcase needs to be fixed too:
{noformat}
assertEquals("root.default capacity", "33.333",
    csConfig.get(PREFIX + "root.default.capacity"));
assertEquals("root.admins capacity", "33.333",
    csConfig.get(PREFIX + "root.admins.capacity"));
assertEquals("root.users capacity", "66.667",
    csConfig.get(PREFIX + "root.users.capacity"));
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
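The underlying rounding pitfall can be illustrated outside the converter. The following standalone sketch (not the converter's actual code, and using equal weights rather than the queues above) shows the usual remedy: round every child's percentage except the last, then let the last child absorb the remainder so the siblings sum to exactly 100%.

```java
import java.util.Arrays;

public class CapacitySplit {
    // Convert raw weights to percentages (3 decimal places) that sum to
    // exactly 100: round all children but the last, then give the last
    // child whatever remains.
    static double[] toPercentages(double[] weights) {
        double total = Arrays.stream(weights).sum();
        double[] pct = new double[weights.length];
        double assigned = 0.0;
        for (int i = 0; i < weights.length - 1; i++) {
            pct[i] = Math.round(weights[i] / total * 100_000.0) / 1000.0;
            assigned += pct[i];
        }
        // Last child absorbs the rounding remainder.
        pct[weights.length - 1] = Math.round((100.0 - assigned) * 1000.0) / 1000.0;
        return pct;
    }

    public static void main(String[] args) {
        // Three equal children: naive rounding would give 3 x 33.333 = 99.999.
        double[] pct = toPercentages(new double[] {1, 1, 1});
        System.out.println(Arrays.toString(pct)); // [33.333, 33.333, 33.334]
    }
}
```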
[jira] [Created] (YARN-10191) FS-CS converter: call System.exit() for every code path in main()
Peter Bacsko created YARN-10191: --- Summary: FS-CS converter: call System.exit() for every code path in main() Key: YARN-10191 URL: https://issues.apache.org/jira/browse/YARN-10191 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko Note that we don't always call {{System.exit()}} on the happy path scenario in the converter:
{code:java}
public static void main(String[] args) {
  try {
    FSConfigToCSConfigArgumentHandler fsConfigConversionArgumentHandler =
        new FSConfigToCSConfigArgumentHandler();
    int exitCode = fsConfigConversionArgumentHandler.parseAndConvert(args);
    if (exitCode != 0) {
      LOG.error(FATAL, "Error while starting FS configuration conversion, " +
          "see previous error messages for details!");
      System.exit(exitCode);
    }
  } catch (Throwable t) {
    LOG.error(FATAL, "Error while starting FS configuration conversion!", t);
    System.exit(-1);
  }
}
{code}
This is a mistake. If there's any non-daemon thread hanging around which was started by either FS or CS, the tool will never terminate. We must call {{System.exit()}} in every case to make sure that it never blocks at the end. -- This message was sent by Atlassian Jira (v8.3.4#803005)
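The hang described above can be reproduced without YARN at all. This standalone sketch (the class and worker are hypothetical, standing in for a lingering scheduler thread) shows why: any leftover non-daemon thread keeps the JVM alive after main() returns, so the happy path needs an explicit System.exit() too.

```java
public class NonDaemonHang {
    // Starts a worker like the ones FS/CS services may leave behind.
    static Thread startLingeringWorker() {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000); // simulates a lingering background task
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(false); // non-daemon: JVM waits for it on normal return
        worker.start();
        return worker;
    }

    public static void main(String[] args) {
        Thread worker = startLingeringWorker();
        System.out.println("conversion finished, worker alive: " + worker.isAlive());
        // A plain return from main() would block JVM shutdown on the
        // non-daemon worker for up to 60 seconds; System.exit() does not.
        System.exit(0);
    }
}
```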
[jira] [Created] (YARN-10175) FS-CS converter: only convert placement rules if a cmd line switch is defined
Peter Bacsko created YARN-10175: --- Summary: FS-CS converter: only convert placement rules if a cmd line switch is defined Key: YARN-10175 URL: https://issues.apache.org/jira/browse/YARN-10175 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko In the current form, the conversion of FS placement rules to CS mapping rules has a lot of feature gaps and doesn't work properly. The output is good as a starting point but sometimes it causes CS to throw an exception. Until a proper resolution is implemented, it's better to disable this by default and introduce a command line switch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10158) FS-CS converter: convert property yarn.scheduler.fair.update-interval-ms
Peter Bacsko created YARN-10158: --- Summary: FS-CS converter: convert property yarn.scheduler.fair.update-interval-ms Key: YARN-10158 URL: https://issues.apache.org/jira/browse/YARN-10158 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10157) FS-CS converter: initPropertyActions() is not called without rules file
Peter Bacsko created YARN-10157: --- Summary: FS-CS converter: initPropertyActions() is not called without rules file Key: YARN-10157 URL: https://issues.apache.org/jira/browse/YARN-10157 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko The method {{FSConfigToCSConfigRuleHandler.initPropertyActions()}} should be invoked even if we don't use the rule file. Otherwise the rule handler will not initialize actions to WARNING. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10142) Distributed shell: add support for localization visibility
[ https://issues.apache.org/jira/browse/YARN-10142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-10142. - Resolution: Duplicate > Distributed shell: add support for localization visibility > -- > > Key: YARN-10142 > URL: https://issues.apache.org/jira/browse/YARN-10142 > Project: Hadoop YARN > Issue Type: Improvement > Components: distributed-shell >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > The localization is now hard coded in DistributedShell: > {noformat} > FileStatus scFileStatus = fs.getFileStatus(dst); > LocalResource scRsrc = > LocalResource.newInstance( > URL.fromURI(dst.toUri()), > LocalResourceType.FILE, LocalResourceVisibility.APPLICATION, > scFileStatus.getLen(), scFileStatus.getModificationTime()); > localResources.put(fileDstPath, scRsrc); > {noformat} > However, sometimes it's useful if you have the possibility to change this to > PRIVATE/PUBLIC for testing purposes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10147) FPGA plugin can't find the localized aocx file
Peter Bacsko created YARN-10147: --- Summary: FPGA plugin can't find the localized aocx file Key: YARN-10147 URL: https://issues.apache.org/jira/browse/YARN-10147 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Peter Bacsko Assignee: Peter Bacsko There's a bug in the FPGA plugin which is intended to find the localized "aocx" file:
{noformat}
...
if (localizedResources != null) {
  Optional<Path> aocxPath = localizedResources
      .keySet()
      .stream()
      .filter(path -> matchesIpid(path, id))
      .findFirst();
  if (aocxPath.isPresent()) {
    ipFilePath = aocxPath.get().toUri().toString();
    LOG.debug("Found: " + ipFilePath);
  }
} else {
  LOG.warn("Localized resource is null!");
}
return ipFilePath;
}

private boolean matchesIpid(Path p, String id) {
  return p.getName().toLowerCase().equals(id.toLowerCase())
      && p.getName().endsWith(".aocx");
}
{noformat}
The method {{matchesIpid()}} works incorrectly: the {{id}} argument is the expected filename, but without the extension. Therefore the {{equals()}} comparison will always be false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
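The bug and one possible fix can be demonstrated on plain strings (the class and method names below are hypothetical, and the logic operates on file names rather than Hadoop Path objects): strip the ".aocx" extension before comparing against the extensionless id.

```java
public class AocxMatcher {
    // Mirrors the broken logic: "id" carries no extension, so the full-name
    // equality and the ".aocx" suffix check can never both hold.
    static boolean matchesIpidBroken(String fileName, String id) {
        return fileName.toLowerCase().equals(id.toLowerCase())
            && fileName.endsWith(".aocx");
    }

    // One possible fix: compare the base name, without the extension.
    static boolean matchesIpidFixed(String fileName, String id) {
        if (!fileName.toLowerCase().endsWith(".aocx")) {
            return false;
        }
        String base = fileName.substring(0, fileName.length() - ".aocx".length());
        return base.equalsIgnoreCase(id);
    }

    public static void main(String[] args) {
        System.out.println(matchesIpidBroken("matrix_mul.aocx", "matrix_mul")); // false
        System.out.println(matchesIpidFixed("matrix_mul.aocx", "matrix_mul"));  // true
    }
}
```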
[jira] [Created] (YARN-10142) Distributed shell: add support for localization visibility
Peter Bacsko created YARN-10142: --- Summary: Distributed shell: add support for localization visibility Key: YARN-10142 URL: https://issues.apache.org/jira/browse/YARN-10142 Project: Hadoop YARN Issue Type: Improvement Reporter: Peter Bacsko Assignee: Peter Bacsko The localization is now hard coded in DistributedShell: {noformat} FileStatus scFileStatus = fs.getFileStatus(dst); LocalResource scRsrc = LocalResource.newInstance( URL.fromURI(dst.toUri()), LocalResourceType.FILE, LocalResourceVisibility.APPLICATION, scFileStatus.getLen(), scFileStatus.getModificationTime()); localResources.put(fileDstPath, scRsrc); {noformat} However, sometimes it's useful if you have the possibility to change this to PRIVATE/PUBLIC for testing purposes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10135) FS-CS converter tool: issue warning on dynamic auto-create mapping rules
Peter Bacsko created YARN-10135: --- Summary: FS-CS converter tool: issue warning on dynamic auto-create mapping rules Key: YARN-10135 URL: https://issues.apache.org/jira/browse/YARN-10135 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko The converter tool should issue a warning whenever the conversion results in mapping rules similar to these:
{{u:%user:[managedParentQueueName].[queueName]}}
{{u:%user:[managedParentQueueName].%user}}
{{u:%user:[managedParentQueueName].%primary_group}}
{{u:%user:[managedParentQueueName].%secondary_group}}
{{u:%user:%primary_group.%user}}
{{u:%user:%secondary_group.%user}}
{{u:%user:[managedParentQueuePath].%user}}
The reason is that right now it's not fully clear how we'll handle a case like "u:%user:%primary_group.%user", where "%primary_group.%user" might result in something like "users.john". In the case of "u:%user:[managedParentQueuePath].%user", the [managedParentQueuePath] is the result of a full path from Fair Scheduler, therefore it's not going to be a leaf queue. The user might be required to do some fine tuning and adjust the property "auto-create-child-queues". We should display a warning about these additional steps. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (YARN-10105) FS-CS converter: separator between mapping rules should be comma
Peter Bacsko created YARN-10105: --- Summary: FS-CS converter: separator between mapping rules should be comma Key: YARN-10105 URL: https://issues.apache.org/jira/browse/YARN-10105 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko A converted configuration throws this error: {noformat} 2020-01-27 03:35:35,007 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to standby state 2020-01-27 03:35:35,008 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager java.lang.IllegalArgumentException: Illegal queue mapping u:%user:%user;u:%user:root.users.%user;u:%user:root.default at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.getQueueMappings(CapacitySchedulerConfiguration.java:1113) at org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule.initialize(UserGroupMappingPlacementRule.java:244) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getUserGroupMappingPlacementRule(CapacityScheduler.java:671) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updatePlacementRules(CapacityScheduler.java:712) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:753) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:361) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:426) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) {noformat} Mapping rules should be separated by a "," character, not by a semicolon. 
[jira] [Created] (YARN-10104) FS-CS converter: dryRun requires either -p or -o
Peter Bacsko created YARN-10104: --- Summary: FS-CS converter: dryRun requires either -p or -o Key: YARN-10104 URL: https://issues.apache.org/jira/browse/YARN-10104 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko The "-d" / "--dry-run" switch doesn't work properly. You still have to define either "-p" or "-o", which is not how the tool is supposed to work (i.e., in dry-run mode it shouldn't need to generate any output after the conversion).
[jira] [Created] (YARN-10103) Capacity scheduler: add support for create=true/false per mapping rule
Peter Bacsko created YARN-10103: --- Summary: Capacity scheduler: add support for create=true/false per mapping rule Key: YARN-10103 URL: https://issues.apache.org/jira/browse/YARN-10103 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Currently you can't ask Capacity Scheduler to create a queue for a mapping if the queue doesn't exist. With the proposed flag, a mapping would use the first rule if the queue exists; if it doesn't, it proceeds to the next rule. Example: {{u:%user:%primary_group.%user:create=false;u:%user:root.default}} Let's say user "alice" belongs to the "admins" group. The first rule would try to map to {{root.admins.alice}}, but if that queue doesn't exist, the application is placed into {{root.default}} instead.
[jira] [Created] (YARN-10102) Capacity scheduler: add support for combined %specified mapping
Peter Bacsko created YARN-10102: --- Summary: Capacity scheduler: add support for combined %specified mapping Key: YARN-10102 URL: https://issues.apache.org/jira/browse/YARN-10102 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko To reduce the gap between Fair Scheduler and Capacity Scheduler, it's reasonable to have a {{%specified}} mapping. This would be equivalent to the {{}} placement rule in FS, that is, use the queue that comes in with the application submission context.
[jira] [Created] (YARN-10099) FS-CS converter: handle allow-undeclared-pools and user-as-default queue properly
Peter Bacsko created YARN-10099: --- Summary: FS-CS converter: handle allow-undeclared-pools and user-as-default queue properly Key: YARN-10099 URL: https://issues.apache.org/jira/browse/YARN-10099 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko Based on the latest documentation, there are two important properties that are ignored if we have placement rules:
||Property||Explanation||
|yarn.scheduler.fair.allow-undeclared-pools|If this is true, new queues can be created at application submission time, whether because they are specified as the application's queue by the submitter or because they are placed there by the user-as-default-queue property. If this is false, any time an app would be placed in a queue that is not specified in the allocations file, it is placed in the "default" queue instead. Defaults to true. *If a queue placement policy is given in the allocations file, this property is ignored.*|
|yarn.scheduler.fair.user-as-default-queue|Whether to use the username associated with the allocation as the default queue name, in the event that a queue name is not specified. If this is set to "false" or unset, all jobs have a shared default queue, named "default". Defaults to true. *If a queue placement policy is given in the allocations file, this property is ignored.*|
Right now these settings affect the conversion regardless of the placement rules.
[jira] [Created] (YARN-10085) FS-CS converter: remove mixed ordering policy check
Peter Bacsko created YARN-10085: --- Summary: FS-CS converter: remove mixed ordering policy check Key: YARN-10085 URL: https://issues.apache.org/jira/browse/YARN-10085 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko When YARN-9892 gets committed, this part will become unnecessary:
{noformat}
// Validate ordering policy
if (queueConverter.isDrfPolicyUsedOnQueueLevel()) {
  if (queueConverter.isFifoOrFairSharePolicyUsed()) {
    throw new ConversionException(
        "DRF ordering policy cannot be used together with fifo/fair");
  } else {
    capacitySchedulerConfig.set(
        CapacitySchedulerConfiguration.RESOURCE_CALCULATOR_CLASS,
        DominantResourceCalculator.class.getCanonicalName());
  }
}
{noformat}
We will be able to freely mix fifo/fair/drf, so let's get rid of this strict check and also rewrite {{FSQueueConverter.emitOrderingPolicy()}}.
[jira] [Created] (YARN-10082) FS-CS converter: disable terminal placement rule checking
Peter Bacsko created YARN-10082: --- Summary: FS-CS converter: disable terminal placement rule checking Key: YARN-10082 URL: https://issues.apache.org/jira/browse/YARN-10082 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko Before YARN-8967, the {{QueuePlacementRule}} class had a method called {{isTerminal()}}. However, sometimes this method was hard-coded to return false, accepting such configurations as: {noformat} {noformat} This is because {{NestedUserQueue.isTerminal()}} always returns {{false}}. After YARN-8967, the behavior is different: this configuration is not accepted because {{QueuePlacementPolicy.fromXml()}} calculates the list of terminal rules differently: https://github.com/apache/hadoop/blob/5257f50abb71905ef3068fd45541d00ce9e8f355/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementPolicy.java#L176-L183 In order to migrate existing configurations that were created before YARN-8967, we need a new switch (at least in migration mode) in FS to turn off this validation, otherwise the tool will not be able to migrate these configs and the following exception will be thrown: {noformat} ~$ ./yarn fs2cs -y /tmp/yarn-site.xml -f /tmp/fair-scheduler.xml -o /tmp WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 20/01/13 05:48:20 INFO converter.FSConfigToCSConfigConverter: Output directory for yarn-site.xml and capacity-scheduler.xml is: /tmp 20/01/13 05:48:20 INFO converter.FSConfigToCSConfigConverter: Conversion rules file is not defined, using default conversion config! 20/01/13 05:48:21 INFO converter.FSConfigToCSConfigConverter: Using explicitly defined fair-scheduler.xml WARNING: This feature is experimental and not intended for production use! 
20/01/13 05:48:21 INFO conf.Configuration: resource-types.xml not found 20/01/13 05:48:21 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'. 20/01/13 05:48:21 INFO security.YarnAuthorizationProvider: org.apache.hadoop.yarn.security.ConfiguredYarnAuthorizer is instantiated. 20/01/13 05:48:21 INFO scheduler.AbstractYarnScheduler: Minimum allocation = 20/01/13 05:48:21 INFO scheduler.AbstractYarnScheduler: Maximum allocation = 20/01/13 05:48:21 INFO placement.PlacementFactory: Creating PlacementRule implementation: class org.apache.hadoop.yarn.server.resourcemanager.placement.SpecifiedPlacementRule 20/01/13 05:48:21 INFO placement.PlacementFactory: Creating PlacementRule implementation: class org.apache.hadoop.yarn.server.resourcemanager.placement.UserPlacementRule 20/01/13 05:48:21 INFO fair.AllocationFileLoaderService: Loading allocation file file:/tmp/fair-scheduler.xml 20/01/13 05:48:22 INFO placement.PlacementFactory: Creating PlacementRule implementation: class org.apache.hadoop.yarn.server.resourcemanager.placement.SpecifiedPlacementRule 20/01/13 05:48:22 INFO placement.PlacementFactory: Creating PlacementRule implementation: class org.apache.hadoop.yarn.server.resourcemanager.placement.UserPlacementRule 20/01/13 05:48:22 INFO placement.PlacementFactory: Creating PlacementRule implementation: class org.apache.hadoop.yarn.server.resourcemanager.placement.DefaultPlacementRule 20/01/13 05:48:22 INFO placement.PlacementFactory: Creating PlacementRule implementation: class org.apache.hadoop.yarn.server.resourcemanager.placement.DefaultPlacementRule 20/01/13 05:48:22 INFO service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler failed in state INITED java.io.IOException: Failed to initialize FairScheduler at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1438) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1479) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSConfigToCSConfigConverter.convert(FSConfigToCSConfigConverter.java:206) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSConfigToCSConfigConverter.convert(FSConfigToCSConfigConverter.java:101) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSConfigToCSConfigArgumentHandler.parseAndConvert(FSConfigToCSConfigArgumentHandler.java:116) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSConfigToCSConfigConverterMain.main(FSConfigToCSConfigConverterMain.java:44) Caused by: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: Rules after rule 2 in queue placement policy can never
[jira] [Created] (YARN-10067) Add dry-run feature to FS-CS converter tool
Peter Bacsko created YARN-10067: --- Summary: Add dry-run feature to FS-CS converter tool Key: YARN-10067 URL: https://issues.apache.org/jira/browse/YARN-10067 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko Add a "-d" / "--dry-run" switch to the tool. The purpose of this would be to inform the user whether a conversion is possible and, if it is, whether there are any warnings.
[jira] [Created] (YARN-10019) container-executor: misc improvements in child process and regarding exec() calls
Peter Bacsko created YARN-10019: --- Summary: container-executor: misc improvements in child process and regarding exec() calls Key: YARN-10019 URL: https://issues.apache.org/jira/browse/YARN-10019 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Peter Bacsko Assignee: Peter Bacsko There are a couple of improvements that we can make in container-executor regarding how we exit from child processes and how we handle failed exec() calls: 1. If we're in the child code path and we detect an erroneous condition, the usual way out is to simply call {{_exit()}}. Normal {{exit()}} belongs in the parent. Calling {{_exit()}} prevents the stdio buffers from being flushed twice and ensures that any cleanup logic registered with {{atexit()}} or {{on_exit()}} runs only once. 2. There's code like {{if (execlp(script_file_dest, script_file_dest, NULL) != 0) ...}} which is not necessary. Exec functions are not supposed to return; if one does, it's definitely an error, so there's no need to check the return value.
[jira] [Created] (YARN-10018) container-executor: possible -1 return value of fork() is not always checked
Peter Bacsko created YARN-10018: --- Summary: container-executor: possible -1 return value of fork() is not always checked Key: YARN-10018 URL: https://issues.apache.org/jira/browse/YARN-10018 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Peter Bacsko Assignee: Peter Bacsko There are some places in container-executor's native code where the {{fork()}} call is not handled properly. This operation can fail with -1, but sometimes the if branch that validates success is missing. Also, at one location, the return value is stored in an {{int}} instead of a {{pid_t}}. It's better to fix that as well.
[jira] [Resolved] (YARN-9891) Capacity scheduler: enhance capacity / maximum-capacity setting
[ https://issues.apache.org/jira/browse/YARN-9891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9891. Resolution: Duplicate > Capacity scheduler: enhance capacity / maximum-capacity setting > --- > > Key: YARN-9891 > URL: https://issues.apache.org/jira/browse/YARN-9891 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Peter Bacsko >Priority: Major > > Capacity Scheduler does not support two percentage values for capacity and > maximum-capacity settings. So, you can't do something like this: > {{yarn.scheduler.capacity.root.users.john.maximum-capacity=memory-mb=50.0%, > vcores=50.0%}} > It's possible to use absolute resources, but not two separate percentages > (which expresses capacity as a percentage of the overall cluster resource). > Such a configuration is accepted in Fair Scheduler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-9922) Fix JavaDoc errors introduced by YARN-9699
Peter Bacsko created YARN-9922: -- Summary: Fix JavaDoc errors introduced by YARN-9699 Key: YARN-9922 URL: https://issues.apache.org/jira/browse/YARN-9922 Project: Hadoop YARN Issue Type: Sub-task Components: capacity scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko
[jira] [Created] (YARN-9893) Capacity scheduler: enhance leaf-queue-template capacity / maximum-capacity setting
Peter Bacsko created YARN-9893: -- Summary: Capacity scheduler: enhance leaf-queue-template capacity / maximum-capacity setting Key: YARN-9893 URL: https://issues.apache.org/jira/browse/YARN-9893 Project: Hadoop YARN Issue Type: Sub-task Components: capacity scheduler Reporter: Peter Bacsko Capacity Scheduler does not support two percentage values for leaf queue capacity and maximum-capacity settings. So, you can't do something like this: {{yarn.scheduler.capacity.root.users.john.leaf-queue-template.capacity=memory-mb=50.0%, vcores=50.0%}} On top of that, it's not even possible to define absolute resources: {{yarn.scheduler.capacity.root.users.john.leaf-queue-template.capacity=memory-mb=16384, vcores=8}} Only a single percentage value is accepted. This makes it nearly impossible to properly convert a similar setting from Fair Scheduler, where such a configuration is valid and accepted ({{}}).
[jira] [Created] (YARN-9892) Capacity scheduler: support DRF ordering policy on queue level
Peter Bacsko created YARN-9892: -- Summary: Capacity scheduler: support DRF ordering policy on queue level Key: YARN-9892 URL: https://issues.apache.org/jira/browse/YARN-9892 Project: Hadoop YARN Issue Type: Sub-task Components: capacity scheduler Reporter: Peter Bacsko Capacity scheduler does not support DRF (Dominant Resource Fairness) ordering policy on queue level. Only "fifo" and "fair" are accepted for {{yarn.scheduler.capacity..ordering-policy}}. DRF can only be used globally if {{yarn.scheduler.capacity.resource-calculator}} is set to DominantResourceCalculator.
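The global workaround mentioned above can be sketched as a capacity-scheduler.xml fragment (assuming the standard fully-qualified class name of DominantResourceCalculator):

```xml
<!-- Cluster-wide DRF: applies to every queue, since per-queue
     ordering-policy only accepts "fifo" and "fair" today. -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```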
[jira] [Created] (YARN-9891) Capacity scheduler: enhance capacity / maximum-capacity setting
Peter Bacsko created YARN-9891: -- Summary: Capacity scheduler: enhance capacity / maximum-capacity setting Key: YARN-9891 URL: https://issues.apache.org/jira/browse/YARN-9891 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Capacity Scheduler does not support two percentage values for capacity and maximum-capacity settings. So, you can't do something like this: {{yarn.scheduler.capacity.root.users.john.maximum-capacity=memory-mb=50.0%, vcores=50.0%}} It's possible to use absolute resources, but not two separate percentages (which expresses capacity as a percentage of the overall cluster resource). Such a configuration is accepted in Fair Scheduler.
[jira] [Created] (YARN-9888) Capacity scheduler: add support for default maxRunningApps limit per user
Peter Bacsko created YARN-9888: -- Summary: Capacity scheduler: add support for default maxRunningApps limit per user Key: YARN-9888 URL: https://issues.apache.org/jira/browse/YARN-9888 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Fair scheduler has the setting {{}} which limits how many running applications each user can have. Capacity scheduler lacks this feature.
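The Fair Scheduler tag referenced above was stripped from the email; the setting matching this description is presumably {{userMaxAppsDefault}} in the allocation file, along these lines (the limit 5 is illustrative):

```xml
<?xml version="1.0"?>
<!-- Fair Scheduler allocation file: cap each user at 5 concurrently
     running applications unless a per-user limit overrides it. -->
<allocations>
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
```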
[jira] [Created] (YARN-9887) Capacity scheduler: add support for limiting maxRunningApps per user
Peter Bacsko created YARN-9887: -- Summary: Capacity scheduler: add support for limiting maxRunningApps per user Key: YARN-9887 URL: https://issues.apache.org/jira/browse/YARN-9887 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Fair Scheduler supports limiting the number of applications that a particular user can submit: {noformat} 10 {noformat} Capacity Scheduler does not have an exact equivalent.
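The {{noformat}} block above lost its XML tags (only the number 10 survives); the Fair Scheduler snippet it shows is presumably a per-user limit along these lines, with "alice" as an illustrative user name:

```xml
<allocations>
  <!-- Limit the user "alice" to 10 concurrently running applications. -->
  <user name="alice">
    <maxRunningApps>10</maxRunningApps>
  </user>
</allocations>
```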
[jira] [Resolved] (YARN-9717) Add more logging to container-executor about issues with directory creation or permissions
[ https://issues.apache.org/jira/browse/YARN-9717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9717. Resolution: Won't Fix > Add more logging to container-executor about issues with directory creation > or permissions > -- > > Key: YARN-9717 > URL: https://issues.apache.org/jira/browse/YARN-9717 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Peter Bacsko >Priority: Major > > During some downstream testing we bumped into some problems with the > container executor where an extra logging would be quite helpful when local > files and directories could not be created (container-executor.c:1810). > The most important log line could be the following: > There's a function called create_container_directories in > container-executor.c. > We should place a log line like this: > Before we're calling: > We have: > {code:java} > if (mkdirs(container_dir, perms) == 0) { > result = 0; > } > {code} > We could add an else statement and add the following log, if creating the > directory was not successful: > {code:java} > fprintf(LOGFILE, "Failed to create directory: %s, user: %s", container_dir, > user); > {code} > This way, CE at least prints the directory itself if we have any permission > issue while trying to create a subdirectory or file under it. > If we want to be very precise, some logging into the mkdirs function could > also be added as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-9841) Capacity scheduler: add support for combined %user + %primary_group mapping
Peter Bacsko created YARN-9841: -- Summary: Capacity scheduler: add support for combined %user + %primary_group mapping Key: YARN-9841 URL: https://issues.apache.org/jira/browse/YARN-9841 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko Right now in CS, using {{%primary_group}} with a parent queue is only possible this way: {{u:%user:parentqueue.%primary_group}} Looking at https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/placement/UserGroupMappingPlacementRule.java, we cannot do something like: {{u:%user:%primary_group.%user}} Fair Scheduler supports a nested rule where such a placement/mapping rule is possible. This improvement would reduce this feature gap.
[jira] [Created] (YARN-9840) Capacity scheduler: add support for Secondary Group user mapping
Peter Bacsko created YARN-9840: -- Summary: Capacity scheduler: add support for Secondary Group user mapping Key: YARN-9840 URL: https://issues.apache.org/jira/browse/YARN-9840 Project: Hadoop YARN Issue Type: Improvement Reporter: Peter Bacsko Assignee: Peter Bacsko Currently, Capacity Scheduler only supports primary group rule mapping like this: {{u:%user:%primary_group}} Fair scheduler already supports secondary group placement rule. Let's add this to CS to reduce the feature gap. Class of interest: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/placement/UserGroupMappingPlacementRule.java
[jira] [Created] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch
Peter Bacsko created YARN-9833: -- Summary: Race condition when DirectoryCollection.checkDirs() runs during container launch Key: YARN-9833 URL: https://issues.apache.org/jira/browse/YARN-9833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.2.0 Reporter: Peter Bacsko Assignee: Peter Bacsko During endurance testing, we found a race condition that causes an empty {{localDirs}} being passed to container-executor. The problem is that {{DirectoryCollection.checkDirs()}} clears three collections:
{code:java}
this.writeLock.lock();
try {
  localDirs.clear();
  errorDirs.clear();
  fullDirs.clear();
  ...
{code}
This happens in a critical section guarded by a write lock. When we start a container, we retrieve the local dirs by calling {{dirsHandler.getLocalDirs()}}, which in turn invokes {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
{code:java}
List<String> getGoodDirs() {
  this.readLock.lock();
  try {
    return Collections.unmodifiableList(localDirs);
  } finally {
    this.readLock.unlock();
  }
}
{code}
So we're also in a critical section guarded by the lock. But {{Collections.unmodifiableList()}} only returns a _view_ of the collection, not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be scheduled to run and immediately clear {{localDirs}}. This caused weird behaviour in container-executor, which exited with error code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES). Therefore we can't just return a view, we must return a copy with {{ImmutableList.copyOf()}}. Credits to [~snemeth] for analyzing and determining the root cause.
[jira] [Created] (YARN-9749) TestAppLogAggregatorImpl#testDFSQuotaExceeded fails on trunk
Peter Bacsko created YARN-9749: -- Summary: TestAppLogAggregatorImpl#testDFSQuotaExceeded fails on trunk Key: YARN-9749 URL: https://issues.apache.org/jira/browse/YARN-9749 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Peter Bacsko Assignee: Adam Antal TestAppLogAggregatorImpl#testDFSQuotaExceeded currently fails on trunk. It was most likely introduced by YARN-9676 (resetting HEAD to the previous commit and then re-running the test passes). {noformat} [INFO] Running org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.781 s <<< FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl [ERROR] testDFSQuotaExceeded(org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl) Time elapsed: 2.361 s <<< FAILURE! java.lang.AssertionError: The set of paths for deletion are not the same as expected: actual size: 0 vs expected size: 1 at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl.verifyFilesToDelete(TestAppLogAggregatorImpl.java:344) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl.access$000(TestAppLogAggregatorImpl.java:82) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl$1.answer(TestAppLogAggregatorImpl.java:330) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl$1.answer(TestAppLogAggregatorImpl.java:319) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:39) at org.mockito.internal.handler.MockHandlerImpl.handle(MockHandlerImpl.java:96) at org.mockito.internal.handler.NullResultGuardian.handle(NullResultGuardian.java:29) at 
org.mockito.internal.handler.InvocationNotifierHandler.handle(InvocationNotifierHandler.java:35) at org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:61) at org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:49) at org.mockito.internal.creation.bytebuddy.MockMethodInterceptor$DispatcherDefaultingToRealMethod.interceptSuperCallable(MockMethodInterceptor.java:108) at org.apache.hadoop.yarn.server.nodemanager.DeletionService$MockitoMock$1879282050.delete(Unknown Source) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregationPostCleanUp(AppLogAggregatorImpl.java:556) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:476) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestAppLogAggregatorImpl.testDFSQuotaExceeded(TestAppLogAggregatorImpl.java:469) ... {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-9473) [Umbrella] Support Vector Engine ( a new accelerator hardware) based on pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-9473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9473. Resolution: Fixed Fix Version/s: 3.3.0 > [Umbrella] Support Vector Engine ( a new accelerator hardware) based on > pluggable device framework > -- > > Key: YARN-9473 > URL: https://issues.apache.org/jira/browse/YARN-9473 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Zhankun Tang >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0 > > > As the heterogeneous computation trend rises, new acceleration hardware like > GPU, FPGA is used to satisfy various requirements. > And a new hardware Vector Engine (VE) which released by NEC company is > another example. The VE is like GPU but has different characteristics. It's > suitable for machine learning and HPC due to better memory bandwidth and no > PCIe bottleneck. > Please Check here for more VE details: > [https://www.nextplatform.com/2017/11/22/deep-dive-necs-aurora-vector-engine/] > [https://www.hotchips.org/hc30/2conf/2.14_NEC_vector_NEC_SXAurora_TSUBASA_HotChips30_finalb.pdf] > As we know, YARN-8851 is a pluggable device framework which provides an easy > way to develop a plugin for such new accelerators. This JIRA proposes to > develop a new VE plugin based on that framework and be implemented as current > GPU's "NvidiaGPUPluginForRuntimeV2" plugin. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-9660) Enhance documentation of Docker on YARN support
Peter Bacsko created YARN-9660: -- Summary: Enhance documentation of Docker on YARN support Key: YARN-9660 URL: https://issues.apache.org/jira/browse/YARN-9660 Project: Hadoop YARN Issue Type: Bug Components: documentation, nodemanager Reporter: Peter Bacsko Right now, using Docker on YARN has some hard requirements. If these requirements are not met, then launching the containers will fail and an error message will be printed. Depending on how familiar the user is with Docker, it might or might not be easy for them to understand what went wrong and how to fix the underlying problem. It would be important to explicitly document these requirements along with the error messages. #1: CGroups handler cannot be systemd If the docker daemon runs with the systemd cgroups handler, we receive the following error upon launching a container: {noformat} Container id: container_1561638268473_0006_01_02 Exit code: 7 Exception message: Launch container failed Shell error output: /usr/bin/docker-current: Error response from daemon: cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice". See '/usr/bin/docker-current run --help'. Shell output: main : command provided 4 main : run as user is johndoe main : requested yarn user is johndoe {noformat} Solution: switch to cgroupfs. Doing so can be OS-specific, but we can document a {{systemctl}} example. #2: {{/bin/bash}} must be present on the {{$PATH}} inside the container Some smaller images like "busybox" or "alpine" do not have {{/bin/bash}}. It's because all commands under {{/bin}} are linked to {{/bin/busybox}} and there's only {{/bin/sh}}. 
If we try to use this kind of image, we'll see the following error message: {noformat} Container id: container_1561638268473_0015_01_02 Exit code: 7 Exception message: Launch container failed Shell error output: /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused "exec: \"bash\": executable file not found in $PATH". Shell output: main : command provided 4 main : run as user is johndoe main : requested yarn user is johndoe {noformat} #3: {{find}} command must be available on the {{$PATH}} It seems obvious that we have the {{find}} command, but even very popular images like {{fedora}} require that we install it separately. If we don't have {{find}} available, then {{launch_container.sh}} fails with: {noformat} [2019-07-01 03:51:25.053]Container exited with a non-zero exit code 127. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : /tmp/hadoop-systest/nm-local-dir/usercache/systest/appcache/application_1561638268473_0017/container_1561638268473_0017_01_02/launch_container.sh: line 44: find: command not found Last 4096 bytes of stderr.txt : [2019-07-01 03:51:25.053]Container exited with a non-zero exit code 127. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : /tmp/hadoop-systest/nm-local-dir/usercache/systest/appcache/application_1561638268473_0017/container_1561638268473_0017_01_02/launch_container.sh: line 44: find: command not found Last 4096 bytes of stderr.txt : {noformat}
[jira] [Created] (YARN-9622) All testcase fails in TestTimelineReaderWebServicesHBaseStorage
Peter Bacsko created YARN-9622: -- Summary: All testcase fails in TestTimelineReaderWebServicesHBaseStorage Key: YARN-9622 URL: https://issues.apache.org/jira/browse/YARN-9622 Project: Hadoop YARN Issue Type: Bug Components: timelineserver, timelineservice Reporter: Peter Bacsko When you try to run all tests from TestTimelineReaderWebServicesHBaseStorage, the result is the following: {noformat} [ERROR] Failures: [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetAppNotPresent:->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 Response from server should have been Not Found [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunNotPresent:2192->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 Response from server should have been Not Found [ERROR] TestTimelineReaderWebServicesHBaseStorage.testUIDNotProperlyEscaped:905->AbstractTimelineReaderHBaseTestBase.verifyHttpResponse:140 Response from server should have been Bad Request [ERROR] Errors: [ERROR] TestTimelineReaderWebServicesHBaseStorage.testForFlowAppsPagination:2375->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testForFlowRunAppsPagination:2420->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testForFlowRunsPagination:2465->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGenericEntitiesForPagination:2272->verifyEntitiesForPagination:2288->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetApp:1024->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetAppWithoutFlowInfo:1064->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetAppsMetricsRange:2516->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] 
TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesByUID:662->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesConfigFilters:1263->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesDataToRetrieve:1154->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesEventFilters:1640->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesInfoFilters:1380->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricFilters:1494->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesMetricsTimeRange:1820->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesRelationFilters:1696->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntitiesWithoutFlowInfo:1130->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntityDataToRetrieve:1905->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetEntityWithoutFlowInfo:1113->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowApps:2047->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsFilters:2153->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowAppsNotPresent:2253->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] 
TestTimelineReaderWebServicesHBaseStorage.testGetFlowRun:443->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunApps:1984->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunAppsNotPresent:2235->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRuns:488->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlowRunsMetricsToRetrieve:616->AbstractTimelineReaderHBaseTestBase.getResponse:129 » IO [ERROR] TestTimelineReaderWebServicesHBaseStorage.testGetFlows:918->verifyFlowEntites:2349->AbstractTimelineReaderHBase
[jira] [Created] (YARN-9621) Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1
Peter Bacsko created YARN-9621: -- Summary: Test failure TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1 Key: YARN-9621 URL: https://issues.apache.org/jira/browse/YARN-9621 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.1.2 Reporter: Peter Bacsko The test case {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} fails consistently on branch-3.1. I believe the failure was introduced by YARN-9253. {noformat} testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager) Time elapsed: 24.636 s <<< FAILURE! java.lang.AssertionError: expected:<1> but was:<2> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) {noformat}
[jira] [Created] (YARN-9595) FPGA plugin: NullPointerException in FpgaNodeResourceUpdateHandler.updateConfiguredResource()
Peter Bacsko created YARN-9595: -- Summary: FPGA plugin: NullPointerException in FpgaNodeResourceUpdateHandler.updateConfiguredResource() Key: YARN-9595 URL: https://issues.apache.org/jira/browse/YARN-9595 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Peter Bacsko Assignee: Peter Bacsko YARN-9264 accidentally introduced a bug in FpgaDiscoverer. Sometimes {{currentFpgaInfo}} is not set, resulting in an NPE being thrown: {noformat} 2019-06-03 05:14:50,157 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaNodeResourceUpdateHandler.updateConfiguredResource(FpgaNodeResourceUpdateHandler.java:54) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.updateConfiguredResourcesViaPlugins(NodeStatusUpdaterImpl.java:358) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceInit(NodeStatusUpdaterImpl.java:190) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:459) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:869) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:942) {noformat} The problem is that in {{FpgaDiscoverer}}, we don't set {{currentFpgaInfo}} if the following condition is true: {noformat} if (allowed == null || allowed.equalsIgnoreCase( YarnConfiguration.AUTOMATICALLY_DISCOVER_GPU_DEVICES)) { return list; } else if (allowed.matches("(\\d,)*\\d")){ ... {noformat} The solution is simple: {{currentFpgaInfo}} should always be initialized, just as before. 
Unit tests should be enhanced to verify that it's set properly.
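The fix can be illustrated with a standalone sketch. Note that the class and field names below are simplified stand-ins for the real {{FpgaDiscoverer}} code (the "auto" string stands in for the {{YarnConfiguration}} constant, and the device type is reduced to a minor number); the point is that every branch assigns {{currentFpgaInfo}} before returning, so the update handler can never observe {{null}}:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for FpgaDiscoverer; not the actual Hadoop class.
public class FpgaDiscovererSketch {
    public static class FpgaDevice {
        final int minor;
        public FpgaDevice(int minor) { this.minor = minor; }
    }

    private List<FpgaDevice> currentFpgaInfo;

    public List<FpgaDevice> discover(String allowed, List<FpgaDevice> detected) {
        List<FpgaDevice> list = new ArrayList<>(detected);
        if (allowed == null || allowed.equalsIgnoreCase("auto")) {
            // The bug was here: the early return skipped the assignment below,
            // leaving currentFpgaInfo null. Assign it on this path too.
            currentFpgaInfo = Collections.unmodifiableList(list);
            return list;
        } else if (allowed.matches("(\\d,)*\\d")) {
            // Keep only the devices whose minor numbers are listed in "allowed".
            List<FpgaDevice> filtered = new ArrayList<>();
            for (FpgaDevice d : list) {
                if (("," + allowed + ",").contains("," + d.minor + ",")) {
                    filtered.add(d);
                }
            }
            currentFpgaInfo = Collections.unmodifiableList(filtered);
            return filtered;
        }
        currentFpgaInfo = Collections.emptyList();
        return Collections.emptyList();
    }

    public List<FpgaDevice> getCurrentFpgaInfo() {
        return currentFpgaInfo;
    }
}
```

After this change, {{getCurrentFpgaInfo()}} cannot return {{null}} once discovery has run, regardless of which branch was taken.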
[jira] [Created] (YARN-9552) FairScheduler: NODE_UPDATE can cause a NoSuchElementException
Peter Bacsko created YARN-9552: -- Summary: FairScheduler: NODE_UPDATE can cause a NoSuchElementException Key: YARN-9552 URL: https://issues.apache.org/jira/browse/YARN-9552 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Peter Bacsko Assignee: Peter Bacsko We observed a race condition inside YARN with the following stack trace: {noformat} 18/11/07 06:45:09.559 SchedulerEventDispatcher:Event Processor ERROR EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher java.util.NoSuchElementException at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036) at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1373) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:353) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1094) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:961) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1183) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:748) {noformat} This is basically the same as the one described in YARN-7382, but the root cause is 
different. When we create an application attempt, we create an {{FSAppAttempt}} object. This contains an {{AppSchedulingInfo}} which contains a set of {{SchedulerRequestKey}}. Initially, this set is empty and only initialized a bit later on a separate thread during a state transition: {noformat} 2019-05-07 15:58:02,659 INFO [RM StateStore dispatcher] recovery.RMStateStore (RMStateStore.java:transition(239)) - Storing info for app: application_1557237478804_0001 2019-05-07 15:58:02,684 INFO [RM Event dispatcher] rmapp.RMAppImpl (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED 2019-05-07 15:58:02,690 INFO [SchedulerEventDispatcher:Event Processor] fair.FairScheduler (FairScheduler.java:addApplication(490)) - Accepted application application_1557237478804_0001 from user: bacskop, in queue: root.bacskop, currently num of applications: 1 2019-05-07 15:58:02,698 INFO [RM Event dispatcher] rmapp.RMAppImpl (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change from SUBMITTED to ACCEPTED on event = APP_ACCEPTED 2019-05-07 15:58:02,731 INFO [RM Event dispatcher] resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(434)) - Registering app attempt : appattempt_1557237478804_0001_01 2019-05-07 15:58:02,732 INFO [RM Event dispatcher] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_01 State change from NEW to SUBMITTED on event = START 2019-05-07 15:58:02,746 INFO [SchedulerEventDispatcher:Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:(207)) - *** In the constructor of SchedulerApplicationAttempt 2019-05-07 15:58:02,747 INFO [SchedulerEventDispatcher:Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:(230)) - *** Contents of appSchedulingInfo: [] 2019-05-07 15:58:02,752 INFO [SchedulerEventDispatcher:Event Processor] 
fair.FairScheduler (FairScheduler.java:addApplicationAttempt(546)) - Added Application Attempt appattempt_1557237478804_0001_01 to scheduler from user: bacskop 2019-05-07 15:58:02,756 INFO [RM Event dispatcher] scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(257)) - *** Adding scheduler key: SchedulerRequestKey{priority=0, allocationRequestId=-1, containerToUpdate=null} for attempt: appattempt_1557237478804_0001_01 2019-05-07 15:58:02,759 INFO [RM Event dispatcher] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_01 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED 2019-05-07 15:58:02,892
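The failing call chain boils down to calling {{first()}} on an empty {{ConcurrentSkipListSet}} — the scheduler-key set that {{AppSchedulingInfo}} holds before the attempt's first resource request is registered. The actual fix for the race belongs in the scheduler's state handling; the standalone sketch below (plain JDK collections, not the Hadoop classes) only reproduces the failure mode and shows one defensive access pattern:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListSet;

public class EmptySetRace {
    // first() on an empty ConcurrentSkipListSet throws NoSuchElementException —
    // this is what surfaces from AppSchedulingInfo.getNextPendingAsk() when a
    // NODE_UPDATE is processed before the attempt's scheduler keys are added.
    public static boolean firstThrowsWhenEmpty() {
        try {
            new ConcurrentSkipListSet<Integer>().first();
            return false;
        } catch (NoSuchElementException expected) {
            return true;
        }
    }

    // A race-tolerant "peek": the weakly consistent iterator returns null for
    // an empty set instead of throwing, so a caller can treat "no key yet" as
    // "no pending ask" rather than crashing the event dispatcher.
    public static Integer firstOrNull(ConcurrentSkipListSet<Integer> keys) {
        Iterator<Integer> it = keys.iterator();
        return it.hasNext() ? it.next() : null;
    }
}
```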
[jira] [Resolved] (YARN-9446) TestMiniMRClientCluster.testRestart is flaky
[ https://issues.apache.org/jira/browse/YARN-9446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9446. Resolution: Won't Do Closing this as Won't Do - related Hadoop JIRA (HADOOP-16238) should fix this problem. > TestMiniMRClientCluster.testRestart is flaky > > > Key: YARN-9446 > URL: https://issues.apache.org/jira/browse/YARN-9446 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Minor > > The testcase {{TestMiniMRClientCluster.testRestart}} sometimes fails with > this error: > {noformat} > 2019-04-04 11:21:31,896 INFO [main] service.AbstractService > (AbstractService.java:noteFailure(273)) - Service > org.apache.hadoop.yarn.server.resourcemanager.AdminService failed in state > STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.net.BindException: Problem binding to [test-host:35491] > java.net.BindException: Address already in use; For more details see: > http://wiki.apache.org/hadoop/BindException > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.net.BindException: Problem binding to [test-host:35491] > java.net.BindException: Address already in use; For more details see: > http://wiki.apache.org/hadoop/BindException > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:138) > at > org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) > at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.startServer(AdminService.java:178) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceStart(AdminService.java:165) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1244) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:355) > at > org.apache.hadoop.yarn.server.MiniYARNCluster.access$300(MiniYARNCluster.java:127) > at > org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:493) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:312) > at > org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceStart(MiniMRYarnCluster.java:210) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapred.MiniMRYarnClusterAdapter.restart(MiniMRYarnClusterAdapter.java:73) > at > org.apache.hadoop.mapred.TestMiniMRClientCluster.testRestart(TestMiniMRClientCluster.java:114) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){noformat} > The solution is to set the socket option SO_REUSEADDR, which is implemented in > HADOOP-16238.
[jira] [Created] (YARN-9476) Create unit tests for VE plugin
Peter Bacsko created YARN-9476: -- Summary: Create unit tests for VE plugin Key: YARN-9476 URL: https://issues.apache.org/jira/browse/YARN-9476 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko
[jira] [Created] (YARN-9477) Investigate device discovery mechanisms
Peter Bacsko created YARN-9477: -- Summary: Investigate device discovery mechanisms Key: YARN-9477 URL: https://issues.apache.org/jira/browse/YARN-9477 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko
[jira] [Created] (YARN-9475) Add basic VE plugin
Peter Bacsko created YARN-9475: -- Summary: Add basic VE plugin Key: YARN-9475 URL: https://issues.apache.org/jira/browse/YARN-9475 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Peter Bacsko
[jira] [Created] (YARN-9461) TestRMWebServicesDelegationTokenAuthentication.testCancelledDelegationToken fails with HTTP 400
Peter Bacsko created YARN-9461: -- Summary: TestRMWebServicesDelegationTokenAuthentication.testCancelledDelegationToken fails with HTTP 400 Key: YARN-9461 URL: https://issues.apache.org/jira/browse/YARN-9461 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, test Reporter: Peter Bacsko Assignee: Peter Bacsko The test {{TestRMWebServicesDelegationTokenAuthentication.testCancelledDelegationToken}} sometimes fails with the following error: {noformat} java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8088/ws/v1/cluster/delegation-token at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication.cancelDelegationToken(TestRMWebServicesDelegationTokenAuthentication.java:462) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication.testCancelledDelegationToken(TestRMWebServicesDelegationTokenAuthentication.java:283) {noformat} The problem is that, for whatever reason, Jetty seems to execute the token cancellation REST call twice. The first request returns HTTP 200 OK, but the second one fails with HTTP 400 Bad Request. The {{MockRM}} instance is static. Something in this class could be the problem; it turned out that using separate {{MockRM}} instances solves the flakiness.
[jira] [Created] (YARN-9446) TestMiniMRClientCluster.testRestart is flaky
Peter Bacsko created YARN-9446: -- Summary: TestMiniMRClientCluster.testRestart is flaky Key: YARN-9446 URL: https://issues.apache.org/jira/browse/YARN-9446 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Peter Bacsko Assignee: Peter Bacsko The testcase {{TestMiniMRClientCluster.testRestart}} sometimes fails with this error: {noformat} 2019-04-04 11:21:31,896 INFO [main] service.AbstractService (AbstractService.java:noteFailure(273)) - Service org.apache.hadoop.yarn.server.resourcemanager.AdminService failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [test-host:35491] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [test-host:35491] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:138) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.startServer(AdminService.java:178) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceStart(AdminService.java:165) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1244) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:355) at 
org.apache.hadoop.yarn.server.MiniYARNCluster.access$300(MiniYARNCluster.java:127) at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:493) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:312) at org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceStart(MiniMRYarnCluster.java:210) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.mapred.MiniMRYarnClusterAdapter.restart(MiniMRYarnClusterAdapter.java:73) at org.apache.hadoop.mapred.TestMiniMRClientCluster.testRestart(TestMiniMRClientCluster.java:114) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){noformat} The solution is to set the socket option SO_REUSEADDR, which is implemented in HADOOP-16238.
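The effect of SO_REUSEADDR can be demonstrated with plain JDK sockets. The sketch below is not Hadoop code — it just shows the option that HADOOP-16238 enables on the RPC server socket, which lets a restarted server rebind its port even while the previous socket lingers in TIME_WAIT (the situation the MiniYARNCluster restart runs into):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class ReuseAddrDemo {
    // Binds a server socket with SO_REUSEADDR enabled. The option must be set
    // on the unbound socket, i.e. before bind() is called.
    public static ServerSocket bindReusable(int port) throws IOException {
        ServerSocket server = new ServerSocket(); // unbound
        server.setReuseAddress(true);             // set before bind()
        server.bind(new InetSocketAddress(port));
        return server;
    }

    public static void main(String[] args) throws IOException {
        // Port 0 asks the OS for an ephemeral port.
        try (ServerSocket s = bindReusable(0)) {
            System.out.println("bound with SO_REUSEADDR=" + s.getReuseAddress());
        }
    }
}
```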
[jira] [Resolved] (YARN-9436) Flaky test testApplicationLifetimeMonitor
[ https://issues.apache.org/jira/browse/YARN-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9436. Resolution: Duplicate > Flaky test testApplicationLifetimeMonitor > - > > Key: YARN-9436 > URL: https://issues.apache.org/jira/browse/YARN-9436 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > In our test environment, we occasionally encounter this failure: > {noformat} > 2019-04-03 12:49:32 [INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor > 2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, > Time elapsed: 215.535 s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor > 2019-04-03 12:53:08 [ERROR] > testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor) > Time elapsed: 34.244 s <<< FAILURE! > 2019-04-03 12:53:08 java.lang.AssertionError: Application killed before > lifetime value > 2019-04-03 12:53:08 at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218) > 2019-04-03 12:53:08 > {noformat} > The root cause is the condition here: > {noformat} > Assert.assertTrue("Application killed before lifetime value", > totalTimeRun > maxLifetime); > {noformat} > However, there are two problems with this condition: > 1. Logically it's not correct. In fact, since the app should be killed after > 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to > some asynchronicity and rounding, most of the time {{totalTimeRun}} ends up > being 31. > 2. 
Sometimes the application is killed fast enough and {{totalTimeRun}} is > 30, but this is correct, because in {{setUpCSQueue}} we set the queue > lifetime: > {noformat} > csConf.setMaximumLifetimePerQueue( > CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime); > csConf.setDefaultLifetimePerQueue( > CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime); > {noformat} > A more appropriate condition is: > {noformat} > Assert.assertTrue("Application killed before lifetime value", > totalTimeRun >= maxLifetime); > {noformat} > The assertion message in the next line is also misleading: > {noformat} > Assert.assertTrue( > "Application killed before lifetime value " + totalTimeRun, > totalTimeRun < maxLifetime + 10L); > {noformat} > If it is false, it means that the application is killed _after_ 40 seconds, > which exceeds both the app's lifetime (40s) and that of the queue (30s). > {noformat} > Assert.assertTrue( > "Application killed after queue/app lifetime value: " + > totalTimeRun, > totalTimeRun < maxLifetime + 10L); > {noformat} > We can even be stricter, since we expect a kill almost immediately after > 30 seconds: > {noformat} > Assert.assertTrue( > "Application killed too late: " + totalTimeRun, > totalTimeRun < maxLifetime + 2L); > {noformat} > where we allow a 2-second tolerance.
[jira] [Created] (YARN-9436) Flaky test testApplicationLifetimeMonitor
Peter Bacsko created YARN-9436: -- Summary: Flaky test testApplicationLifetimeMonitor Key: YARN-9436 URL: https://issues.apache.org/jira/browse/YARN-9436 Project: Hadoop YARN Issue Type: Bug Components: scheduler, test Reporter: Peter Bacsko Assignee: Peter Bacsko In our test environment, we occasionally encounter this failure: {noformat} 2019-04-03 12:49:32 [INFO] Running org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor 2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.535 s <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor 2019-04-03 12:53:08 [ERROR] testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor) Time elapsed: 34.244 s <<< FAILURE! 2019-04-03 12:53:08 java.lang.AssertionError: Application killed before lifetime value 2019-04-03 12:53:08 at org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218) 2019-04-03 12:53:08 {noformat} The root cause is the condition here: {noformat} Assert.assertTrue("Application killed before lifetime value", totalTimeRun > maxLifetime); {noformat} However, there are two problems with this condition: 1. Logically it's not correct. In fact, since the app should be killed after 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to some asynchronicity and rounding, most of the time {{totalTimeRun}} ends up being 31. 2. 
Sometimes the application is killed fast enough and {{totalTimeRun}} is 30, but this is correct, because in {{setUpCSQueue}} we set the queue lifetime: {noformat} csConf.setMaximumLifetimePerQueue( CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime); csConf.setDefaultLifetimePerQueue( CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime); {noformat} A more appropriate condition is: {noformat} Assert.assertTrue("Application killed before lifetime value", totalTimeRun >= maxLifetime); {noformat} The assertion message in the next line is also misleading: {noformat} Assert.assertTrue( "Application killed before lifetime value " + totalTimeRun, totalTimeRun < maxLifetime + 10L); {noformat} If it is false, it means that the application is killed _after_ 40 seconds, which exceeds both the app's lifetime (40s) and that of the queue (30s). {noformat} Assert.assertTrue( "Application killed after queue/app lifetime value: " + totalTimeRun, totalTimeRun < maxLifetime + 10L); {noformat} We can even be stricter, since we expect a kill almost immediately after 30 seconds: {noformat} Assert.assertTrue( "Application killed too late: " + totalTimeRun, totalTimeRun < maxLifetime + 2L); {noformat} where we allow a 2-second tolerance.
[jira] [Resolved] (YARN-9264) [Umbrella] Follow-up on IntelOpenCL FPGA plugin
[ https://issues.apache.org/jira/browse/YARN-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9264. Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.3.0 > [Umbrella] Follow-up on IntelOpenCL FPGA plugin > --- > > Key: YARN-9264 > URL: https://issues.apache.org/jira/browse/YARN-9264 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0 > > > The Intel FPGA resource type support was released in Hadoop 3.1.0. > Right now the plugin implementation has some deficiencies that need to be > fixed. This JIRA lists all problems that need to be resolved.
[jira] [Created] (YARN-9270) Minor cleanup in TestFpgaDiscoverer
Peter Bacsko created YARN-9270: -- Summary: Minor cleanup in TestFpgaDiscoverer Key: YARN-9270 URL: https://issues.apache.org/jira/browse/YARN-9270 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Let's do some cleanup in this class.
* {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split into 5 separate tests, because it tests 5 different scenarios.
* remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a {{Function<String, String> envProvider = System::getenv}} in the plugin class, plus a setter method which allows the test to modify {{envProvider}}. Much simpler and more straightforward.
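The env-provider indirection suggested above can be sketched like this (class and method names are hypothetical, not the actual plugin code; production keeps using {{System::getenv}}, while a test injects its own lookup):

```java
import java.util.function.Function;

// Sketch of a testable environment lookup: instead of hacking the real
// process environment, the plugin asks an injectable Function for variables.
public class EnvAwarePluginSketch {
    // Defaults to the real environment in production.
    private Function<String, String> envProvider = System::getenv;

    // Visible for testing: lets a unit test replace the environment lookup.
    void setEnvProvider(Function<String, String> provider) {
        this.envProvider = provider;
    }

    // Example consumer: resolve the aocl binary from an SDK root variable.
    String getAoclPath() {
        String dir = envProvider.apply("ALTERAOCLSDKROOT");
        return dir == null ? "aocl" : dir + "/bin/aocl";
    }
}
```

A test then calls {{setEnvProvider(name -> ...)}} with a map-backed lambda and never touches the real environment.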
[jira] [Created] (YARN-9269) Minor cleanup in FpgaResourceAllocator
Peter Bacsko created YARN-9269: -- Summary: Minor cleanup in FpgaResourceAllocator Key: YARN-9269 URL: https://issues.apache.org/jira/browse/YARN-9269 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Some things that we observed:
* {{addFpga()}} - we check for duplicate devices, but we don't print any error/warning if there is one.
* {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is this method even needed? We already receive an {{FpgaDevice}} instance in {{updateFpga()}}, which I believe is the same one that we're looking up.
* the variable name {{IPIDpreference}} is confusing
* {{availableFpga}} / {{usedFpgaByRequestor}} are instances of {{LinkedHashMap}}. What's the rationale behind this? Wouldn't a simple {{HashMap}} suffice?
* {{usedFpgaByRequestor}} should be renamed, the naming is a bit unclear
* {{allowedFpgas}} should be an immutable list
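The first bullet (warn on duplicates instead of dropping them silently) can be sketched as follows; this is a hypothetical standalone helper, not the actual {{FpgaResourceAllocator}} code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of duplicate-device handling in a registration method: report the
// duplicate and reject it rather than ignoring it without a trace.
public class FpgaRegistrySketch {
    // aliasDevName -> FPGA type (e.g. "acl0" -> "IntelOpenCL")
    private final Map<String, String> devices = new HashMap<>();

    /** Returns false and warns when the device was already registered. */
    boolean addFpga(String aliasDevName, String type) {
        if (devices.containsKey(aliasDevName)) {
            System.err.println(
                "WARN: duplicate FPGA device ignored: " + aliasDevName);
            return false;
        }
        devices.put(aliasDevName, type);
        return true;
    }
}
```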
[jira] [Created] (YARN-9268) Various fixes are needed in FpgaDevice
Peter Bacsko created YARN-9268: -- Summary: Various fixes are needed in FpgaDevice Key: YARN-9268 URL: https://issues.apache.org/jira/browse/YARN-9268 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Need to fix the following in the class {{FpgaDevice}}:
* It implements {{Comparable}}, but not {{Comparable<FpgaDevice>}}, so we have a raw type warning. {{compareTo()}} also returns 0 in every case. There is no natural ordering among FPGA devices; perhaps "acl0" comes before "acl1", but this seems too forced and unnecessary. We think this class should not implement {{Comparable}} at all, at least not like that.
* It stores unnecessary fields: devName, busNum, temperature, power usage. For one, these are never needed in the code. Secondly, temperature and power usage change constantly. It's pointless to store these in this POJO.
* serialVersionUID is 1L - let's generate a proper number for it
* Use int instead of Integer - don't allow nulls. If major/minor uniquely identify the card, then let's demand them in the constructor and don't store Integers that can be null.
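Putting the bullets together, the suggested shape of the POJO could look like this (a hypothetical sketch, not the actual {{FpgaDevice}} class: primitive major/minor required in the constructor, no {{Comparable}}, no volatile readings such as temperature or power):

```java
import java.util.Objects;

// Sketch of a minimal FPGA device value object: identity is type + major/minor,
// nothing nullable, nothing that changes over the device's lifetime.
public final class FpgaDeviceSketch {
    private final String type; // e.g. "acl0"
    private final int major;   // device major number, required
    private final int minor;   // device minor number, required

    public FpgaDeviceSketch(String type, int major, int minor) {
        this.type = Objects.requireNonNull(type);
        this.major = major;
        this.minor = minor;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FpgaDeviceSketch)) return false;
        FpgaDeviceSketch other = (FpgaDeviceSketch) o;
        return major == other.major && minor == other.minor
            && type.equals(other.type);
    }

    @Override
    public int hashCode() {
        return Objects.hash(type, major, minor);
    }
}
```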
[jira] [Created] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl
Peter Bacsko created YARN-9267: -- Summary: Various fixes are needed in FpgaResourceHandlerImpl Key: YARN-9267 URL: https://issues.apache.org/jira/browse/YARN-9267 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Fix some problems in FpgaResourceHandlerImpl:
* preStart() does not reconfigure the card with the same IP - we see this as a problem. If you recompile the FPGA application, you must rename the aocx file, because otherwise the card will not be reprogrammed. Suggestion: instead of storing a Node<->IPID mapping, store a Node<->IPID hash (like the SHA-256 of the localized file).
* Switch to slf4j from Apache Commons Logging
* Remove some unused imports
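The hashing suggestion can be sketched with a small helper (hypothetical, not the actual {{FpgaResourceHandlerImpl}} code): keying the bookkeeping on a content hash of the localized aocx file means a recompiled bitstream produces a different key and triggers reprogramming, even if the file name is unchanged.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// Sketch of content-based IP identification: SHA-256 over the bitstream bytes.
public class IpFileHashSketch {
    static String sha256Hex(byte[] content) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Convenience overload for a localized aocx file.
    static String sha256Hex(Path aocxFile) throws Exception {
        return sha256Hex(Files.readAllBytes(aocxFile));
    }
}
```

Two builds of the same application then map to different hashes, so the "same file name, different bitstream" case is detected.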
[jira] [Created] (YARN-9265) FPGA plugin fails to recognize Intel PAC card
Peter Bacsko created YARN-9265: -- Summary: FPGA plugin fails to recognize Intel PAC card Key: YARN-9265 URL: https://issues.apache.org/jira/browse/YARN-9265 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.1.0 Reporter: Peter Bacsko The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card). There are two major issues.
Problem #1
The output of aocl diagnose:
{noformat}
Device Name: acl0
Package Pat: /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
Vendor: Intel Corp

Physical Dev Name   Status   Information
pac_a10_f20         Passed   PAC Arria 10 Platform (pac_a10_f20)
                             PCIe 08:00.0
                             FPGA temperature = 79 degrees C.

DIAGNOSTIC_PASSED

Call "aocl diagnose " to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices
{noformat}
This generates the following error message:
{noformat}
2019-01-25 06:46:02,834 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin: Using FPGA vendor plugin: org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
2019-01-25 06:46:02,943 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer: Trying to diagnose FPGA information ...
2019-01-25 06:46:03,085 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule: Using traffic control bandwidth handler
2019-01-25 06:46:03,108 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl: Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
2019-01-25 06:46:03,139 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl: FPGA Plugin bootstrap success.
2019-01-25 06:46:03,247 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: Couldn't find (?i)bus:slot.func\s=\s.*, pattern
2019-01-25 06:46:03,248 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
2019-01-25 06:46:03,251 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: Failed to get major-minor number from reading /dev/pac_a10_f30
2019-01-25 06:46:03,252 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems! org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: No FPGA devices detected!
{noformat}
Problem #2
The plugin assumes that the file name under {{/dev}} can be derived from the "Physical Dev Name". This is not always the case. For example, it thinks that the device file is {{/dev/pac_a10_f30}}, but the actual file is {{/dev/intel-fpga-port.0}}.
[jira] [Created] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin
Peter Bacsko created YARN-9266: -- Summary: Various fixes are needed in IntelFpgaOpenclPlugin Key: YARN-9266 URL: https://issues.apache.org/jira/browse/YARN-9266 Project: Hadoop YARN Issue Type: Sub-task Reporter: Peter Bacsko Problems identified in this class:
* InnerShellExecutor ignores the timeout parameter
* configureIP() uses printStackTrace() instead of logging
* configureIP() does not log the output of aocl if the exit code != 0
* parseDiagnoseInfo() is too heavyweight - it should be in its own class for better testability
* downloadIP() uses contains() for the file name check - this can really surprise users in some cases (e.g. you want to use hello.aocx, but hello2.aocx also matches)
* the method name downloadIP() is misleading - it actually tries to find the file. Everything is downloaded (localized) at this point.
* @VisibleForTesting methods should be package private
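The contains() pitfall from the list above can be demonstrated in a few lines (a hypothetical helper, not the actual plugin code; the loose check mimics the described behavior, the exact check is the suggested fix):

```java
// Sketch of the file-name matching pitfall: a substring check accepts
// "hello2.aocx" when the user asked for "hello.aocx"; exact comparison doesn't.
public class IpFileMatcherSketch {
    // Current-style check: match if the candidate contains the requested base name.
    static boolean looseMatch(String candidate, String wanted) {
        return candidate.contains(wanted.replace(".aocx", ""));
    }

    // Suggested check: the localized file name must match exactly.
    static boolean exactMatch(String candidate, String wanted) {
        return candidate.equals(wanted);
    }
}
```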
[jira] [Created] (YARN-9264) [Umbrella] Follow-up on IntelOpenCL FPGA plugin
Peter Bacsko created YARN-9264: -- Summary: [Umbrella] Follow-up on IntelOpenCL FPGA plugin Key: YARN-9264 URL: https://issues.apache.org/jira/browse/YARN-9264 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.1.1 Reporter: Peter Bacsko The Intel FPGA resource type support was released in Hadoop 3.1.0. Right now the plugin implementation has some deficiencies that need to be fixed. This JIRA lists all problems that need to be resolved.
[jira] [Created] (YARN-9011) Race condition during decommissioning
Peter Bacsko created YARN-9011: -- Summary: Race condition during decommissioning Key: YARN-9011 URL: https://issues.apache.org/jira/browse/YARN-9011 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.1.1 Reporter: Peter Bacsko Assignee: Antal Bálint Steinbach During internal testing, we found a nasty race condition which occurs during decommissioning.
Node manager, incorrect behaviour:
{noformat}
2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 hostname:node-6.hostname.com
{noformat}
Node manager, expected behaviour:
{noformat}
2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: DECOMMISSIONING node-6.hostname.com:8041 is ready to be decommissioned
{noformat}
Note the two different messages from the RM ("Disallowed NodeManager" vs "DECOMMISSIONING").
The problem is that {{ResourceTrackerService}} can see an inconsistent state of nodes while they're being updated:
{noformat}
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219} exclude:{node-6.hostname.com}
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully decommission node node-6.hostname.com:8041 with state RUNNING
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: node-6.hostname.com
2018-06-18 21:00:17,576 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node node-6.hostname.com:8041 in DECOMMISSIONING.
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn IP=172.26.22.115 OPERATION=refreshNodes TARGET=AdminService RESULT=SUCCESS
2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve original total capability:
2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
{noformat}
When the decommissioning succeeds, there is no output logged from {{ResourceTrackerService}}.
[jira] [Created] (YARN-9008) Extend YARN distributed shell with file localization feature
Peter Bacsko created YARN-9008: -- Summary: Extend YARN distributed shell with file localization feature Key: YARN-9008 URL: https://issues.apache.org/jira/browse/YARN-9008 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.1.1, 2.9.1 Reporter: Peter Bacsko Assignee: Peter Bacsko YARN distributed shell is a very handy tool to test various features of YARN. However, it lacks support for file localization - that is, you cannot define files on the command line that you wish to be localized remotely. Such support would be extremely useful in certain scenarios.
[jira] [Created] (YARN-6715) NodeHealthScriptRunner does not handle non-zero exit codes properly
Peter Bacsko created YARN-6715: -- Summary: NodeHealthScriptRunner does not handle non-zero exit codes properly Key: YARN-6715 URL: https://issues.apache.org/jira/browse/YARN-6715 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Peter Bacsko There is a bug in NodeHealthScriptRunner. The {{FAILED_WITH_EXIT_CODE}} case is incorrect:
{noformat}
void reportHealthStatus(HealthCheckerExitStatus status) {
  long now = System.currentTimeMillis();
  switch (status) {
    case SUCCESS:
      setHealthStatus(true, "", now);
      break;
    case TIMED_OUT:
      setHealthStatus(false, NODE_HEALTH_SCRIPT_TIMED_OUT_MSG);
      break;
    case FAILED_WITH_EXCEPTION:
      setHealthStatus(false, exceptionStackTrace);
      break;
    case FAILED_WITH_EXIT_CODE:
      setHealthStatus(true, "", now);
      break;
    case FAILED:
      setHealthStatus(false, shexec.getOutput());
      break;
  }
}
{noformat}
This case also lacks unit test coverage.
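The intended behavior can be sketched as follows (a hypothetical standalone reproduction, not the actual NodeHealthScriptRunner code): a non-zero exit code from the health script should mark the node unhealthy, instead of reporting it healthy as the snippet above does.

```java
// Minimal sketch of the corrected switch: every failure outcome, including
// FAILED_WITH_EXIT_CODE, marks the node unhealthy and keeps the report text.
enum HealthCheckerExitStatus {
    SUCCESS, TIMED_OUT, FAILED_WITH_EXCEPTION, FAILED_WITH_EXIT_CODE, FAILED
}

class HealthReporterSketch {
    boolean healthy;
    String report = "";

    void reportHealthStatus(HealthCheckerExitStatus status, String output) {
        switch (status) {
            case SUCCESS:
                healthy = true;
                report = "";
                break;
            case FAILED_WITH_EXIT_CODE: // previously marked the node healthy
            case TIMED_OUT:
            case FAILED_WITH_EXCEPTION:
            case FAILED:
                healthy = false;
                report = output;
                break;
        }
    }
}
```

A unit test for the missing case then simply feeds {{FAILED_WITH_EXIT_CODE}} and asserts the node is reported unhealthy.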