[jira] [Created] (YARN-9598) Make reservation work well when multi-node enabled
Tao Yang created YARN-9598:
-------------------------------

             Summary: Make reservation work well when multi-node enabled
                 Key: YARN-9598
                 URL: https://issues.apache.org/jira/browse/YARN-9598
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacityscheduler
            Reporter: Tao Yang
            Assignee: Tao Yang

This issue is to solve problems with reservation when multi-node placement is enabled:
# As discussed in YARN-9576, a re-reservation proposal may always be generated on the same node and break scheduling for this app and later apps. I think re-reservation is unnecessary here, and we can replace it with LOCALITY_SKIPPED so that the scheduler has a chance to look up the following candidates for this app when multi-node placement is enabled.
# The scheduler iterates all nodes and tries to allocate for the reserved container in LeafQueue#allocateFromReservedContainer. There are two problems here:
** The node of the reserved container should be taken as the candidate instead of all nodes when calling FiCaSchedulerApp#assignContainers; otherwise the scheduler may later generate a reservation-fulfilled proposal on another node, which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
** The assignment returned by FiCaSchedulerApp#assignContainers can never be null even if the allocation was just skipped, which breaks the normal scheduling process for this leaf queue because of the if clause in LeafQueue#assignContainers: "if (null != assignment) \{ return assignment;}"
# Nodes which have already been reserved should be skipped when iterating candidates in RegularContainerAllocator#allocate; otherwise the scheduler may generate an allocation or reservation proposal on these nodes, which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
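The candidate-restriction idea in point 2 above can be sketched as follows. This is a simplified stand-alone model under assumed names (ReservedCandidates, candidatesFor), not the actual CapacityScheduler code:

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the fix described in point 2: when allocating for a
// reserved container, only the node holding the reservation is a valid
// candidate. Before the fix, all nodes were passed as candidates, so a
// reservation-fulfilled proposal could land on a different node and then be
// rejected at commit time. All names here are hypothetical, not YARN APIs.
public class ReservedCandidates {
    static List<String> candidatesFor(String reservedNode, List<String> allNodes) {
        // If there is a reservation, restrict candidates to that node only;
        // otherwise the full node set remains eligible.
        return reservedNode != null ? List.of(reservedNode) : allNodes;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node1", "node2", "node3");
        System.out.println(candidatesFor("node2", nodes)); // [node2]
        System.out.println(candidatesFor(null, nodes));    // [node1, node2, node3]
    }
}
```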
[jira] [Created] (YARN-9597) Memory efficiency in speculator
Ahmed Hussein created YARN-9597:
-----------------------------------

             Summary: Memory efficiency in speculator
                 Key: YARN-9597
                 URL: https://issues.apache.org/jira/browse/YARN-9597
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Ahmed Hussein

The data structures in the speculator and runtime estimator are bloating. Data elements such as taskID, TA-ID, task stats, tasks speculated, tasks finished, etc. are added to the concurrent maps but never removed. For long-running jobs, this causes a couple of issues:
# Memory leakage: the speculator's memory usage increases over time.
# Performance: keeping large structures in the heap hurts performance due to poor locality and cache misses.

*Suggested Fixes:*
- When a TA transitions via {{MoveContainerToSucceededFinishingTransition}}, the TA notifies the speculator. The latter handles the event by cleaning the internal structures accordingly.
- When a task transitions to failed/killed, the speculator is notified to clean the internal data structures.
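The suggested fix amounts to evicting a task's bookkeeping entries once the task reaches a terminal state. A minimal sketch of that cleanup pattern, with illustrative class and field names (not the actual MR speculator code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hedged sketch of the suggested fix: drop a task's entries from the
// speculator's concurrent maps when the task succeeds, fails, or is
// killed, instead of letting them accumulate for the lifetime of the job.
// All names here are illustrative, not the real speculator internals.
public class SpeculatorCleanupSketch {
    private final ConcurrentMap<String, Long> taskStats = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, Boolean> tasksSpeculated = new ConcurrentHashMap<>();

    void recordAttempt(String taskId, long runtimeEstimate) {
        taskStats.put(taskId, runtimeEstimate);
    }

    // Invoked when the speculator is notified of a terminal transition:
    // remove every entry keyed by that task so the maps stay bounded.
    void onTaskFinished(String taskId) {
        taskStats.remove(taskId);
        tasksSpeculated.remove(taskId);
    }

    int trackedTasks() {
        return taskStats.size();
    }

    public static void main(String[] args) {
        SpeculatorCleanupSketch s = new SpeculatorCleanupSketch();
        s.recordAttempt("task_001", 1200L);
        s.recordAttempt("task_002", 900L);
        s.onTaskFinished("task_001"); // terminal state: evict entries
        System.out.println(s.trackedTasks()); // 1
    }
}
```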
[jira] [Created] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
Muhammad Samir Khan created YARN-9596:
-----------------------------------------

             Summary: QueueMetrics has incorrect metrics when labelled partitions are involved
                 Key: YARN-9596
                 URL: https://issues.apache.org/jira/browse/YARN-9596
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler
            Reporter: Muhammad Samir Khan
         Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 2019-06-03 at 4.44.15 PM.png

After YARN-6467, QueueMetrics should only be tracking metrics for the default partition. However, the metrics are incorrect when labelled partitions are involved.

Steps to reproduce:
# Configure capacity-scheduler.xml with label configuration.
# Add label "test" to the cluster and replace the label on node1 with "test".
# Note down "totalMB" at /ws/v1/cluster/metrics.
# Start the first job on the test queue.
# Start the second job on the default queue (the issue does not reproduce if the order of the two jobs is swapped).
# While the two applications are running, "totalMB" at /ws/v1/cluster/metrics will go down by the amount of MB used by the first job (screenshots attached).

Alternately, in TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), add the following lines at the end of the test before rm1.close():
{noformat}
CSQueue rootQueue = cs.getRootQueue();
assertEquals(10*GB, rootQueue.getMetrics().getAvailableMB()
    + rootQueue.getMetrics().getAllocatedMB());
{noformat}
There are two nodes of 10GB each and only one of them has a non-default label. The test will also fail against a 20*GB check.
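The invariant the test snippet above asserts can be stated as: for the default partition only, availableMB + allocatedMB should equal the total memory of the default-partition nodes (10 GB here), not the whole cluster's 20 GB. A minimal stand-alone illustration, with hypothetical names rather than the real QueueMetrics classes:

```java
// Minimal model of the invariant checked by the test above: QueueMetrics
// should only account for default-partition nodes, so available + allocated
// must equal the default partition's capacity. Names are illustrative.
public class DefaultPartitionInvariant {
    static final long GB = 1024; // MB per "GB" unit, as in the YARN tests

    static boolean holds(long availableMB, long allocatedMB, long defaultPartitionMB) {
        return availableMB + allocatedMB == defaultPartitionMB;
    }

    public static void main(String[] args) {
        // Two 10 GB nodes, one labelled "test": only 10 GB is default partition.
        long defaultPartitionMB = 10 * GB;
        // Correct accounting: e.g. 4 GB allocated, 6 GB available on the
        // default-partition node.
        System.out.println(holds(6 * GB, 4 * GB, defaultPartitionMB)); // true
        // The reported bug subtracts the labelled job's usage (e.g. 2 GB)
        // from the default partition's totals, so the sum no longer matches.
        System.out.println(holds(6 * GB - 2 * GB, 4 * GB, defaultPartitionMB)); // false
    }
}
```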
[jira] [Created] (YARN-9595) FPGA plugin: NullPointerException in FpgaNodeResourceUpdateHandler.updateConfiguredResource()
Peter Bacsko created YARN-9595:
----------------------------------

             Summary: FPGA plugin: NullPointerException in FpgaNodeResourceUpdateHandler.updateConfiguredResource()
                 Key: YARN-9595
                 URL: https://issues.apache.org/jira/browse/YARN-9595
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko

YARN-9264 accidentally introduced a bug in FpgaDiscoverer. Sometimes {{currentFpgaInfo}} is not set, resulting in an NPE being thrown:

{noformat}
2019-06-03 05:14:50,157 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED; cause: java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaNodeResourceUpdateHandler.updateConfiguredResource(FpgaNodeResourceUpdateHandler.java:54)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.updateConfiguredResourcesViaPlugins(NodeStatusUpdaterImpl.java:358)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceInit(NodeStatusUpdaterImpl.java:190)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:459)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:869)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:942)
{noformat}

The problem is that in {{FpgaDiscoverer}}, we don't set {{currentFpgaInfo}} if the following condition is true:

{noformat}
if (allowed == null || allowed.equalsIgnoreCase(
    YarnConfiguration.AUTOMATICALLY_DISCOVER_GPU_DEVICES)) {
  return list;
} else if (allowed.matches("(\\d,)*\\d")) {
  ...
{noformat}

The solution is simple: {{currentFpgaInfo}} should always be initialized, just like before. Unit tests should also be enhanced to verify that it is set properly.
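The shape of the proposed fix can be sketched as follows: compute the result on each branch, then record it in {{currentFpgaInfo}} unconditionally before returning, so no early-return path leaves the field null. This is a simplified stand-in, not the real FpgaDiscoverer; FpgaDevice and all other names here are illustrative:

```java
import java.util.Collections;
import java.util.List;

// Hedged sketch of the fix: every path through the discovery method assigns
// currentFpgaInfo before returning, so the resource update handler never
// dereferences null. Simplified model, not the actual YARN class.
public class FpgaDiscovererSketch {
    record FpgaDevice(String name) {}

    private List<FpgaDevice> currentFpgaInfo;

    List<FpgaDevice> discover(String allowed, List<FpgaDevice> list) {
        List<FpgaDevice> result;
        if (allowed == null || allowed.equalsIgnoreCase("auto")) {
            result = list;
        } else if (allowed.matches("(\\d,)*\\d")) {
            result = list; // filtering by allowed device indices would happen here
        } else {
            result = Collections.emptyList();
        }
        // The fix: always record the result, on every branch, before returning.
        currentFpgaInfo = result;
        return result;
    }

    List<FpgaDevice> getCurrentFpgaInfo() {
        return currentFpgaInfo; // no longer null after discover() has run
    }

    public static void main(String[] args) {
        FpgaDiscovererSketch d = new FpgaDiscovererSketch();
        d.discover(null, List.of(new FpgaDevice("acl0"))); // early-return branch
        System.out.println(d.getCurrentFpgaInfo() != null); // true
    }
}
```

A unit test for this, as the report suggests, would exercise each branch of discover() and assert that getCurrentFpgaInfo() is non-null afterwards.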
Apache Hadoop qbt Report: branch2+JDK7 on Linux/x86
For more details, see https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/

No changes

-1 overall

The following subsystems voted -1:
    asflicense findbugs hadolint pathlen unit xml

The following subsystems voted -1 but were configured to be filtered/ignored:
    cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace

The following subsystems are considered long running (runtime bigger than 1h 0m 0s):
    unit

Specific tests:

   XML : Parsing Error(s):
      hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/conf/empty-configuration.xml
      hadoop-tools/hadoop-azure/src/config/checkstyle-suppressions.xml
      hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/public/crossdomain.xml
      hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/src/main/webapp/public/crossdomain.xml

   FindBugs : module:hadoop-common-project/hadoop-common
      Class org.apache.hadoop.fs.GlobalStorageStatistics defines non-transient non-serializable instance field map In GlobalStorageStatistics.java

   FindBugs : module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice-hbase/hadoop-yarn-server-timelineservice-hbase-client
      Boxed value is unboxed and then immediately reboxed in org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.readResultsWithTimestamps(Result, byte[], byte[], KeyConverter, ValueConverter, boolean) At ColumnRWHelper.java:[line 335]

   Failed junit tests :
      hadoop.hdfs.server.namenode.TestNameNodeHttpServerXFrame
      hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys
      hadoop.hdfs.web.TestWebHdfsTimeouts
      hadoop.hdfs.TestBlockStoragePolicy
      hadoop.hdfs.TestTrashWithSecureEncryptionZones
      hadoop.hdfs.server.balancer.TestBalancerRPCDelay
      hadoop.hdfs.server.federation.router.TestRouterClientRejectOverload
      hadoop.registry.secure.TestSecureLogins
      hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2
      hadoop.mapred.TestMiniMRClasspath
      hadoop.mapred.TestMiniMRClientCluster
      hadoop.ipc.TestMRCJCSocketFactory

   cc:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-compile-cc-root-jdk1.7.0_95.txt [4.0K]

   javac:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-compile-javac-root-jdk1.7.0_95.txt [328K]

   cc:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-compile-cc-root-jdk1.8.0_212.txt [4.0K]

   javac:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-compile-javac-root-jdk1.8.0_212.txt [308K]

   checkstyle:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-checkstyle-root.txt [16M]

   hadolint:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-patch-hadolint.txt [4.0K]

   pathlen:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/pathlen.txt [12K]

   pylint:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-patch-pylint.txt [24K]

   shellcheck:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-patch-shellcheck.txt [72K]

   shelldocs:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-patch-shelldocs.txt [8.0K]

   whitespace:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/whitespace-eol.txt [12M]
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/whitespace-tabs.txt [1.2M]

   xml:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/xml.txt [12K]

   findbugs:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/branch-findbugs-hadoop-common-project_hadoop-common-warnings.html [8.0K]
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice-hbase_hadoop-yarn-server-timelineservice-hbase-client-warnings.html [8.0K]

   javadoc:
      https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-javadoc-javadoc-root-jdk1.7.0_95.txt [16K]