[ https://issues.apache.org/jira/browse/YARN-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338830#comment-14338830 ]

Karthik Kambatla commented on YARN-3231:
----------------------------------------

Thanks for reporting and working on this, [~l201514]. The approach looks generally good. A few comments (some nits):
# Rename {{updateRunnabilityonRefreshQueues}} to {{updateRunnabilityOnReload}}? Also, add a javadoc describing when it should be called and what it does.
# Add a javadoc for the newly added private method, including the significance of the new integer param.
# Call the above method from AllocationReloadListener#onReload after all the other queue configs are updated.
# The comment here no longer applies. Remove it?
{code}
// No more than one app per list will be able to be made runnable, so
// we can stop looking after we've found that many
if (noLongerPendingApps.size() >= maxRunnableApps) {
  break;
}
{code}
# Indentation:
{code}
updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size());
{code}
# Newly added tests:
## If it is not too much trouble, can we move them to a new test class (TestAppRunnability?), mostly because TestFairScheduler has so many tests already?
## Is it possible to reuse the code between these tests?
## Should we add tests for when the maxRunnableApps for a user or queue is decreased? If you think this might need additional work in the logic as well, I am open to filing a follow-up JIRA and addressing it there.

> FairScheduler changing queueMaxRunningApps on the fly will cause all pending job stuck
> --------------------------------------------------------------------------------------
>
>                 Key: YARN-3231
>                 URL: https://issues.apache.org/jira/browse/YARN-3231
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Siqi Li
>            Assignee: Siqi Li
>            Priority: Critical
>         Attachments: YARN-3231.v1.patch, YARN-3231.v2.patch
>
>
> When a queue is piling up with a lot of pending jobs due to the
> maxRunningApps limit, we want to increase this property on the fly to make
> some of the pending jobs active.
> However, once we increased the limit, the pending jobs were not assigned
> any resources and were stuck forever.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)