Wei Chen created YARN-7964:
------------------------------

Summary: Yarn Scheduler Load Simulator (SLS): MetricsLogRunnable stops working when there are too many jobs needed to load from sls
Key: YARN-7964
URL: https://issues.apache.org/jira/browse/YARN-7964
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler-load-simulator
Affects Versions: 3.0.0, 2.7.5
Environment: I am running sls on a linux server (ubuntu-16.04). The hadoop version is 3.0.0
Reporter: Wei Chen

Hi, I am using SLS to simulate a large-scale cluster that consists of more than 100 nodes and runs more than 4k jobs. I found that MetricsLogRunnable (which periodically flushes real-time metrics to a file) stops working if SLS takes too long to load the sls file. More specifically, the exception is thrown here, in the function String generateRealTimeTrackingMetrics() in SLSWebApp.java:

{code:java}
for (String queue : wrapper.getQueueSet()) {
  ..........
}
{code}

The exception is reported as:

2018-02-22 17:13:59,450 INFO sls.SLSRunner: newly creaed job: 6263127055conainer size: 10queue: default
java.lang.NullPointerException
        at org.apache.hadoop.yarn.sls.web.SLSWebApp.generateRealTimeTrackingMetrics(SLSWebApp.java:438)
        at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler$MetricsLogRunnable.run(SLSCapacityScheduler.java:724)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

So wrapper.getQueueSet() returns null, and iterating over that null value throws the NullPointerException.

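One detail worth spelling out is why a single NullPointerException stops the metrics output for good rather than for one interval. The stack trace shows the runnable executing inside a ScheduledThreadPoolExecutor, and for periodic tasks scheduled with scheduleAtFixedRate or scheduleWithFixedDelay an uncaught exception suppresses all subsequent executions. The following is a minimal stand-alone sketch of that behaviour, not SLS code: SuppressedTaskDemo, countRuns and the static queueSet field are hypothetical names, and the null guard is only one possible mitigation.

```java
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-alone demo; none of these names come from the SLS code base.
class SuppressedTaskDemo {
  // Stands in for a queue set that has not been handed to the wrapper yet.
  static volatile Set<String> queueSet = null;

  /** Runs a periodic task for about 300 ms and returns how often it was invoked. */
  static int countRuns(boolean guardAgainstNull) {
    AtomicInteger runs = new AtomicInteger();
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
    pool.scheduleAtFixedRate(() -> {
      runs.incrementAndGet();
      Set<String> queues = queueSet;
      if (guardAgainstNull && queues == null) {
        return; // nothing to report yet; try again on the next tick
      }
      // Without the guard this loop throws a NullPointerException, and the
      // executor silently cancels every later execution of the task.
      for (String queue : queues) {
        // per-queue metrics would be written here
      }
    }, 0, 20, TimeUnit.MILLISECONDS);
    try {
      Thread.sleep(300);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      pool.shutdownNow();
    }
    return runs.get();
  }

  public static void main(String[] args) {
    System.out.println("unguarded runs: " + countRuns(false));
    System.out.println("guarded runs:   " + countRuns(true));
  }
}
```

In this sketch the unguarded task is invoked exactly once and then never again, while the guarded task keeps ticking. That matches the symptom of the metrics log simply stopping.
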
After further analyzing the source code, we noticed the following in SLSRunner.java:

{code:java}
public void start() throws Exception {
  // start resource manager
  startRM();
  // start node managers
  startNM();
  // start application masters
  startAM();
  // set queue & tracked apps information
  ((SchedulerWrapper) rm.getResourceScheduler())
      .setQueueSet(this.queueAppNumMap.keySet());
  ((SchedulerWrapper) rm.getResourceScheduler())
      .setTrackedAppSet(this.trackedApps);
  // print out simulation info
  printSimulationInfo();
  // blocked until all nodes RUNNING
  waitForNodesRunning();
  // starting the runner once everything is ready to go,
  runner.start();
}
{code}

As you can see, the queue set for tracking is set by

((SchedulerWrapper) rm.getResourceScheduler()).setQueueSet(this.queueAppNumMap.keySet());

which is done after the RM, NM and app initialization. By the time the queue set is set, the MetricsLogRunnable has already been launched. That is why the queue set is still null when the runnable first reads it, which causes the NullPointerException.
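
One way to make the late call to setQueueSet harmless is for the holder to start with an empty set instead of null, so that a reader running before setup completes simply sees no queues. The following is a minimal sketch of that idea under stated assumptions: TrackingState and describeQueues are hypothetical names invented for illustration, not the SchedulerWrapper API, and the sorted defensive copy is a choice made here for a predictable order.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; TrackingState and its methods are illustrative names, not SLS APIs.
class TrackingState {
  // Start with an empty set instead of null so early readers can iterate safely.
  private volatile Set<String> queueSet = Collections.emptySet();

  void setQueueSet(Set<String> queues) {
    // Sorted, read-only copy: iteration order is predictable and callers cannot mutate it.
    this.queueSet = Collections.unmodifiableSet(new TreeSet<>(queues));
  }

  Set<String> getQueueSet() {
    return queueSet;
  }

  /** Builds one comma-separated line of queue names; empty before setup completes. */
  String describeQueues() {
    return String.join(",", getQueueSet());
  }

  public static void main(String[] args) {
    TrackingState state = new TrackingState();
    System.out.println("before setup: [" + state.describeQueues() + "]");
    state.setQueueSet(new TreeSet<>(Arrays.asList("default", "batch")));
    System.out.println("after setup:  [" + state.describeQueues() + "]");
  }
}
```

An alternative with the same effect would be to set the queue set before the periodic task is scheduled. Which of the two fits SLS better depends on where the runnable is launched, which this sketch does not model.
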