[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526328#comment-16526328 ]
Manikandan R commented on YARN-4606:
------------------------------------

[~eepayne] Thank you for the great explanation. I understand the flow better now.

I revisited the "move apps" problem that I raised earlier, this time against the new patch, and I don't think it requires any changes, because the variables required to calculate numActiveUsersWithOnlyPendingApps are already being set through the submitApplication, finishApplication, etc. calls. However, I am seeing a minor update issue, described below.

Let's say we want to move all apps from queue A1 to queue B1. A1 has 4 apps; only 2 were accommodated because of the max AM limit constraint, so the remaining 2 are not yet activated. The 4 apps were submitted by different users, u1 to u4 (app1 by u1, and so on). Only app1 and app2 have an allocate request in the pipeline. At this point, {{numActiveUsers}} is 4 and {{numActiveUsersWithOnlyPendingApps}} is 2 in queue A1.

Now the move is triggered. Since app1 and app2 both had running containers, app3 and app4 were activated before app1 and app2 in queue B1, because app1 and app2 were busy detaching and attaching containers. After the move operation and a thread sleep of 5s, I pulled these counts expecting u1 and u2 to show up as ActiveUsersWithOnlyPendingApps, but they did not: {{numActiveUsers}} is 2, as u3 and u4 had become active users, and {{numActiveUsersWithOnlyPendingApps}} is 0 in queue B1.

Then I introduced a NodeUpdate event after the move operation, just to force the user limit computation and see its impact on these counts. Now I can see ActiveUsersWithOnlyPendingApps as 2 and ActiveUsers as 0 (both u3 and u4 had become non-active users by this time, as there are no pending allocate requests).

So, after a move operation, if no event that can trigger user limit computation arrives for a brief amount of time, I am seeing an incorrect {{numActiveUsersWithOnlyPendingApps}} count. Is this acceptable?
Or should we trigger user limit computation after the move operation, as we do in other places? Please share your thoughts, and correct my understanding if you see a gap.

> CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4606
>                 URL: https://issues.apache.org/jira/browse/YARN-4606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 2.8.0, 2.7.1
>            Reporter: Karam Singh
>            Assignee: Manikandan R
>            Priority: Critical
>         Attachments: YARN-4606.001.patch, YARN-4606.002.patch, YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch
>
>
> Currently, if all applications belonging to the same user in a LeafQueue are pending (caused by max-am-percent, etc.), ActiveUsersManager still considers the user an active user. This could lead to starvation of active applications, for example:
> - App1 (belongs to user1) and app2 (belongs to user2) are active; app3 (belongs to user3) and app4 (belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, only two users (user1/user2) are able to allocate new resources, so the computed user-limit-resource could be lower than expected.
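
To make the starvation example in the issue description concrete, here is a minimal Java sketch. It is a simplified model, not the CapacityScheduler or ActiveUsersManager code: the class `ActiveUserCountSketch`, the `UserApps` tally, the queue size of 100 units, and the equal-split `equalShare` helper are all assumptions made for illustration, and the real user-limit computation also involves settings such as minimum-user-limit-percent and user-limit-factor, which are ignored here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical model of the counting problem in YARN-4606.
// This is NOT the CapacityScheduler or ActiveUsersManager implementation.
final class ActiveUserCountSketch {

    // Hypothetical per-user tally: apps already activated vs. apps still
    // waiting for activation (for example, blocked by the max AM limit).
    static final class UserApps {
        final int activatedApps;
        final int pendingApps;

        UserApps(int activatedApps, int pendingApps) {
            this.activatedApps = activatedApps;
            this.pendingApps = pendingApps;
        }
    }

    // Counts every user that has at least one app, activated or not.
    static int usersWithAnyApp(Map<String, UserApps> users) {
        int count = 0;
        for (UserApps u : users.values()) {
            if (u.activatedApps + u.pendingApps > 0) {
                count++;
            }
        }
        return count;
    }

    // Counts only users that have at least one activated app.
    static int usersWithActivatedApps(Map<String, UserApps> users) {
        int count = 0;
        for (UserApps u : users.values()) {
            if (u.activatedApps > 0) {
                count++;
            }
        }
        return count;
    }

    // Counts users whose apps are all still pending activation.
    static int usersWithOnlyPendingApps(Map<String, UserApps> users) {
        int count = 0;
        for (UserApps u : users.values()) {
            if (u.activatedApps == 0 && u.pendingApps > 0) {
                count++;
            }
        }
        return count;
    }

    // Assumption: the queue resource is split equally among counted users.
    static int equalShare(int queueResource, int userCount) {
        return userCount == 0 ? 0 : queueResource / userCount;
    }

    // The scenario from the issue description: user1/user2 have an
    // activated app each, user3/user4 have only a pending app each.
    static Map<String, UserApps> sampleQueue() {
        Map<String, UserApps> users = new LinkedHashMap<>();
        users.put("user1", new UserApps(1, 0));
        users.put("user2", new UserApps(1, 0));
        users.put("user3", new UserApps(0, 1));
        users.put("user4", new UserApps(0, 1));
        return users;
    }

    public static void main(String[] args) {
        Map<String, UserApps> users = sampleQueue();
        int queueResource = 100; // hypothetical resource units

        int all = usersWithAnyApp(users);
        int activated = usersWithActivatedApps(users);

        System.out.println("share when pending-only users are counted: "
                + equalShare(queueResource, all));
        System.out.println("share when only activated users are counted: "
                + equalShare(queueResource, activated));
        System.out.println("users with only pending apps: "
                + usersWithOnlyPendingApps(users));
    }
}
```

Under these assumptions, counting all four users halves each per-user share compared with counting only the two users that can actually allocate, which is the effect the issue describes as a user-limit-resource that is lower than expected.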