[jira] [Created] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-03 Thread Tao Yang (JIRA)
Tao Yang created YARN-9598:
--

 Summary: Make reservation work well when multi-node enabled
 Key: YARN-9598
 URL: https://issues.apache.org/jira/browse/YARN-9598
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Reporter: Tao Yang
Assignee: Tao Yang


This issue is to solve problems with reservation when multi-node is enabled:
 # As discussed in YARN-9576, a re-reservation proposal may always be generated 
on the same node and break the scheduling for this app and later apps. I think 
re-reservation is unnecessary and we can replace it with LOCALITY_SKIPPED to 
give the scheduler a chance to look at the following candidates for this app 
when multi-node is enabled.
 # The scheduler iterates over all nodes and tries to allocate for the reserved 
container in LeafQueue#allocateFromReservedContainer. There are two problems 
here:
 ** The node of the reserved container should be taken as the only candidate, 
instead of all nodes, when calling FiCaSchedulerApp#assignContainers. Otherwise 
the scheduler may later generate a reservation-fulfilled proposal on another 
node, which will always be rejected in 
FiCaScheduler#commonCheckContainerAllocation.
 ** The assignment returned by FiCaSchedulerApp#assignContainers is never null, 
even if it is just skipped. This breaks the normal scheduling process for this 
leaf queue because of the if clause in LeafQueue#assignContainers: "if 
(null != assignment) { return assignment; }"
 # Nodes which have been reserved should be skipped when iterating candidates 
in RegularContainerAllocator#allocate. Otherwise the scheduler may generate 
allocation or reservation proposals on these nodes, which will always be 
rejected in FiCaScheduler#commonCheckContainerAllocation.
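
The third point above (skipping already-reserved nodes while iterating 
candidates) could be sketched roughly as follows. This is only an 
illustration under assumptions: CandidateNode and ReservedNodeFilter are 
hypothetical stand-ins, not classes from the capacity scheduler, and the real 
change would live inside RegularContainerAllocator#allocate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a scheduler node; not a YARN class.
final class CandidateNode {
  final String name;
  final boolean hasReservedContainer;

  CandidateNode(String name, boolean hasReservedContainer) {
    this.name = name;
    this.hasReservedContainer = hasReservedContainer;
  }
}

// Hypothetical helper showing the filtering idea only.
final class ReservedNodeFilter {
  // Returns the names of candidates that carry no reservation yet, in the
  // original iteration order.
  static List<String> allocatableNodes(List<CandidateNode> candidates) {
    List<String> result = new ArrayList<>();
    for (CandidateNode node : candidates) {
      if (node.hasReservedContainer) {
        // Per the description, a proposal on such a node would be rejected
        // later, so it is skipped up front.
        continue;
      }
      result.add(node.name);
    }
    return result;
  }
}
```

With candidates n1 (reserved), n2 and n3, only n2 and n3 would remain as 
targets for new allocation or reservation proposals.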



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9597) Memory efficiency in speculator

2019-06-03 Thread Ahmed Hussein (JIRA)
Ahmed Hussein created YARN-9597:
---

 Summary: Memory efficiency in speculator 
 Key: YARN-9597
 URL: https://issues.apache.org/jira/browse/YARN-9597
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Ahmed Hussein


The data structures in the speculator and runtime estimator are bloating. Data 
elements such as taskID, TA-ID, task stats, tasks speculated, tasks finished, 
etc. are added to the concurrent maps but never removed.

For long-running jobs, there are a couple of issues:
 # Memory leakage: the speculator's memory usage increases over time.
 # Performance: keeping large structures in the heap hurts performance due to 
poor locality and cache misses.

*Suggested Fixes:*

- When a TA transitions to {{MoveContainerToSucceededFinishingTransition}}, the 
TA notifies the speculator. The latter handles the event by cleaning the 
internal structure accordingly.
- When a task transitions to failed/killed, the speculator is notified to clean 
the internal data structure.
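
A minimal sketch of the suggested cleanup, assuming a tracker that keeps 
per-attempt entries in concurrent maps. AttemptStatsTracker and its method 
names are hypothetical and do not correspond to the actual speculator or 
runtime-estimator classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker; illustrates removing entries once an attempt ends.
final class AttemptStatsTracker {
  private final Map<String, Long> startTimes = new ConcurrentHashMap<>();
  private final Map<String, Long> estimatedRuntimes = new ConcurrentHashMap<>();

  // Called when an attempt starts; records its start time.
  void onAttemptStarted(String attemptId, long startTimeMs) {
    startTimes.put(attemptId, startTimeMs);
  }

  // Called whenever a new runtime estimate is computed for an attempt.
  void onEstimateUpdated(String attemptId, long estimatedRuntimeMs) {
    estimatedRuntimes.put(attemptId, estimatedRuntimeMs);
  }

  // Called when the attempt succeeds, fails, or is killed. Removing the
  // entries here keeps the maps from growing for the lifetime of the job.
  void onAttemptFinished(String attemptId) {
    startTimes.remove(attemptId);
    estimatedRuntimes.remove(attemptId);
  }

  // Number of attempts that still have a start-time entry.
  int trackedAttempts() {
    return startTimes.size();
  }
}
```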

 






[jira] [Created] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-06-03 Thread Muhammad Samir Khan (JIRA)
Muhammad Samir Khan created YARN-9596:
-

 Summary: QueueMetrics has incorrect metrics when labelled 
partitions are involved
 Key: YARN-9596
 URL: https://issues.apache.org/jira/browse/YARN-9596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Reporter: Muhammad Samir Khan
 Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
2019-06-03 at 4.44.15 PM.png

After YARN-6467, QueueMetrics should only be tracking metrics for the default 
partition. However, the metrics are incorrect when labelled partitions are 
involved.

Steps to reproduce:
 # Configure capacity-scheduler.xml with label configuration.
 # Add label "test" to the cluster and replace the label on node1 with "test".
 # Note down "totalMB" at /ws/v1/cluster/metrics.
 # Start the first job on the test queue.
 # Start the second job on the default queue (the issue does not reproduce if 
the order of the two jobs is swapped).
 # While the two applications are running, the "totalMB" at 
/ws/v1/cluster/metrics will go down by the amount of MB used by the first job 
(screenshots attached).

Alternatively:

In TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), 
add the following lines at the end of the test, before rm1.close():

CSQueue rootQueue = cs.getRootQueue();
assertEquals(10*GB,
 rootQueue.getMetrics().getAvailableMB() + 
rootQueue.getMetrics().getAllocatedMB());

There are two nodes of 10GB each and only one of them has a non-default label. 
The test also fails against a 20*GB check.






[jira] [Created] (YARN-9595) FPGA plugin: NullPointerException in FpgaNodeResourceUpdateHandler.updateConfiguredResource()

2019-06-03 Thread Peter Bacsko (JIRA)
Peter Bacsko created YARN-9595:
--

 Summary: FPGA plugin: NullPointerException in 
FpgaNodeResourceUpdateHandler.updateConfiguredResource()
 Key: YARN-9595
 URL: https://issues.apache.org/jira/browse/YARN-9595
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Peter Bacsko
Assignee: Peter Bacsko


YARN-9264 accidentally introduced a bug in FpgaDiscoverer. Sometimes 
{{currentFpgaInfo}} is not set, resulting in an NPE being thrown:

{noformat}
2019-06-03 05:14:50,157 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED; cause: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaNodeResourceUpdateHandler.updateConfiguredResource(FpgaNodeResourceUpdateHandler.java:54)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.updateConfiguredResourcesViaPlugins(NodeStatusUpdaterImpl.java:358)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceInit(NodeStatusUpdaterImpl.java:190)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:459)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:869)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:942)
{noformat}

The problem is that in {{FpgaDiscoverer}}, we don't set {{currentFpgaInfo}} if 
the following condition is true:

{noformat}
if (allowed == null || allowed.equalsIgnoreCase(
YarnConfiguration.AUTOMATICALLY_DISCOVER_GPU_DEVICES)) {
  return list;
} else if (allowed.matches("(\\d,)*\\d")){
...
{noformat}

The solution is simple: it should always be initialized, as it was before.

Unit tests should be enhanced to verify that it's set properly.
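
One way to picture the fix is to assign the field on every path through the 
method instead of returning early. The sketch below is an assumption-laden 
illustration, not the FpgaDiscoverer code: DiscovererSketch, its Integer 
device representation, the literal "auto" standing in for the configuration 
constant, and the handling of non-matching values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; devices are represented by their minor numbers.
final class DiscovererSketch {
  // Stays null until discover() has run at least once.
  private List<Integer> currentInfo;

  List<Integer> discover(List<Integer> detected, String allowed) {
    List<Integer> list = new ArrayList<>(detected);
    // "auto" is an assumed value for the auto-discovery setting.
    boolean autoDiscover = allowed == null || allowed.equalsIgnoreCase("auto");
    if (!autoDiscover && allowed.matches("(\\d,)*\\d")) {
      // Keep only the devices whose minor number is listed in 'allowed'.
      List<String> minors = Arrays.asList(allowed.split(","));
      list.removeIf(minor -> !minors.contains(String.valueOf(minor)));
    }
    // Assumption: values that match neither case leave the list unchanged.
    // The field is assigned on every path, so later readers never see null.
    currentInfo = list;
    return list;
  }

  List<Integer> getCurrentInfo() {
    return currentInfo;
  }
}
```

A unit test along the lines suggested above would call discover() with a null 
or auto-discovery setting and then check that getCurrentInfo() is not null.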






Apache Hadoop qbt Report: branch2+JDK7 on Linux/x86

2019-06-03 Thread Apache Jenkins Server
For more details, see 
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/

No changes




-1 overall


The following subsystems voted -1:
asflicense findbugs hadolint pathlen unit xml


The following subsystems voted -1 but
were configured to be filtered/ignored:
cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace


The following subsystems are considered long running:
(runtime bigger than 1h  0m  0s)
unit


Specific tests:

XML :

   Parsing Error(s): 
   hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/conf/empty-configuration.xml 
   hadoop-tools/hadoop-azure/src/config/checkstyle-suppressions.xml 
   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/public/crossdomain.xml 
   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/src/main/webapp/public/crossdomain.xml 

FindBugs :

   module:hadoop-common-project/hadoop-common 
   Class org.apache.hadoop.fs.GlobalStorageStatistics defines non-transient non-serializable instance field map In GlobalStorageStatistics.java 

FindBugs :

   module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice-hbase/hadoop-yarn-server-timelineservice-hbase-client 
   Boxed value is unboxed and then immediately reboxed in org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.readResultsWithTimestamps(Result, byte[], byte[], KeyConverter, ValueConverter, boolean) At ColumnRWHelper.java:[line 335] 

Failed junit tests :

   hadoop.hdfs.server.namenode.TestNameNodeHttpServerXFrame 
   hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys 
   hadoop.hdfs.web.TestWebHdfsTimeouts 
   hadoop.hdfs.TestBlockStoragePolicy 
   hadoop.hdfs.TestTrashWithSecureEncryptionZones 
   hadoop.hdfs.server.balancer.TestBalancerRPCDelay 
   hadoop.hdfs.server.federation.router.TestRouterClientRejectOverload 
   hadoop.registry.secure.TestSecureLogins 
   hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2 
   hadoop.mapred.TestMiniMRClasspath 
   hadoop.mapred.TestMiniMRClientCluster 
   hadoop.ipc.TestMRCJCSocketFactory 
  

   cc:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-compile-cc-root-jdk1.7.0_95.txt
  [4.0K]

   javac:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-compile-javac-root-jdk1.7.0_95.txt
  [328K]

   cc:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-compile-cc-root-jdk1.8.0_212.txt
  [4.0K]

   javac:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-compile-javac-root-jdk1.8.0_212.txt
  [308K]

   checkstyle:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-checkstyle-root.txt
  [16M]

   hadolint:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-patch-hadolint.txt
  [4.0K]

   pathlen:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/pathlen.txt
  [12K]

   pylint:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-patch-pylint.txt
  [24K]

   shellcheck:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-patch-shellcheck.txt
  [72K]

   shelldocs:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-patch-shelldocs.txt
  [8.0K]

   whitespace:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/whitespace-eol.txt
  [12M]
   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/whitespace-tabs.txt
  [1.2M]

   xml:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/xml.txt
  [12K]

   findbugs:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/branch-findbugs-hadoop-common-project_hadoop-common-warnings.html
  [8.0K]
   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice-hbase_hadoop-yarn-server-timelineservice-hbase-client-warnings.html
  [8.0K]

   javadoc:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/341/artifact/out/diff-javadoc-javadoc-root-jdk1.7.0_95.txt
  [16K]