[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520822#comment-16520822 ]
Eric Payne commented on YARN-4606:
----------------------------------

[~maniraj...@gmail.com], we can fix the queue application starvation problem by making most of the changes in the scheduler-specific users managers. For {{CapacityScheduler}}, all the changes can be done in the {{UsersManager}} class. For the other schedulers (FIFO, Fair, etc.), I think some changes will be needed in the scheduler infrastructure classes to support retrieving information such as the number of pending and active apps per user, the amount of the queue's AM limit resources, the amount of a user's used AM resources, etc. But I think that most of the changes can be done in {{ActiveUsersManager}} for the other schedulers as well.

I am attaching a POC patch that only modifies {{UsersManager}}.

The {{UsersManager}} already keeps track of all users in the queue. Each user object keeps the number of active apps and the number of pending apps. Here is the sequence of events plus the proposed change:
- When an application is submitted, the user object's pending apps count is incremented.
- If limits are not exceeded, {{LeafQueue}} activates the app.
-- {{LeafQueue#activateApplications}} already checks whether or not activating an application will go over the queue's AM limit.
-- If activating the application will not go over the queue's AM limit, {{LeafQueue#activateApplications}} will increment the user object's active app count and decrement the pending app count.
-- However, if activating the application will go over the queue's AM limit, the user's pending app count remains the same.
- The change made in {{YARN-4606.POC.3.patch}} is that {{UsersManager#activateApplication}} will check whether or not the user object has any active apps. If not, it will not continue (thus not putting the user in the {{activeUsers}} list).

I have not yet analyzed the problem you pointed out above regarding moving apps to different queues.
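To make the sequence above concrete, here is a minimal, self-contained Java sketch of the bookkeeping. It is not the code in {{YARN-4606.POC.3.patch}} and not the real Hadoop {{UsersManager}}: the class {{UsersManagerSketch}}, its method signatures, and the {{applicationActivated}} helper (standing in for what {{LeafQueue#activateApplications}} does when the AM limit is not exceeded) are all hypothetical simplifications.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified stand-in for a per-queue users manager (not the Hadoop class). */
final class UsersManagerSketch {

  /** Per-user counters, as described in the comment above. */
  static final class User {
    int pendingApps;
    int activeApps;
  }

  private final Map<String, User> users = new HashMap<>();
  private final Set<String> activeUsers = new HashSet<>();

  /** An application is submitted: the user's pending app count is incremented. */
  void submitApplication(String userName) {
    users.computeIfAbsent(userName, u -> new User()).pendingApps++;
  }

  /**
   * Assumed hook for the case where the queue activates an app because the
   * AM limit is not exceeded: one app moves from pending to active.
   */
  void applicationActivated(String userName) {
    User user = users.get(userName);
    if (user == null || user.pendingApps == 0) {
      throw new IllegalStateException("no pending app for " + userName);
    }
    user.pendingApps--;
    user.activeApps++;
  }

  /**
   * Proposed check: a user with no active apps is not added to activeUsers,
   * so users whose apps are all pending do not count as active.
   */
  void activateApplication(String userName) {
    User user = users.get(userName);
    if (user == null || user.activeApps == 0) {
      return; // nothing active for this user; do not continue
    }
    activeUsers.add(userName);
  }

  int getNumActiveUsers() {
    return activeUsers.size();
  }
}
```

With the scenario from the issue description (user1 and user2 each have an active app, user3 and user4 only have pending apps), calling {{activateApplication}} for all four users in this sketch leaves only user1 and user2 in {{activeUsers}}, so the user limit would be computed over two users instead of four.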
> CapacityScheduler: applications could get starved because computation of
> #activeUsers considers pending apps
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4606
>                 URL: https://issues.apache.org/jira/browse/YARN-4606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 2.8.0, 2.7.1
>            Reporter: Karam Singh
>            Assignee: Manikandan R
>            Priority: Critical
>         Attachments: YARN-4606.001.patch, YARN-4606.002.patch,
> YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch,
> YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch
>
>
> Currently, if all applications belong to same user in LeafQueue are pending
> (caused by max-am-percent, etc.), ActiveUsersManager still considers the user
> is an active user. This could lead to starvation of active applications, for
> example:
> - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to
> user3)/app4(belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, there're only two users (user1/user2) are able to allocate new
> resources. So computed user-limit-resource could be lower than expected.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org