[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522599#comment-16522599 ]
Manikandan R commented on YARN-4606:
------------------------------------

[~eepayne] Thanks for the patch.

At a high level, the POC is very simple from an implementation perspective, and the changes would be minimal with this approach. At the same time, this patch is less "strict" about updates (specifically, about when they happen) than the approaches discussed in our earlier patches. For example, in the earlier approach, numActiveUsersWithOnlyPendingApps would be incremented as soon as an app gets activated and decremented as soon as its AM container gets allocated. In addition, these updates happen immediately, and only after the dependent steps have definitely completed. The new POC patch, in contrast, depends on values of the User object (pendingApplications, activeApplications, etc.) and on conditions evaluated before the actual work (for example, it assumes the AM container will be allocated successfully based on the checks in LeafQueue#activateApplications), and it updates numActiveUsersWithOnlyPendingApps as part of the regular computeUserLimits flow.

All of this creates slight discomfort and leads to questions such as:

- What is the time frame between accepting the app and updating numActiveUsersWithOnlyPendingApps? Is this time frame acceptable? Aren't we running a little slower in doing updates?
- Is there any chance that the AM container fails to allocate?
- Even if AM container allocation goes through successfully, could there be a delay in allocating AM containers? During that delay, we are considering the user an active user rather than treating the user as one of the "activeUsersWithOnlyPendingApps". Is this acceptable?

I am interested in understanding your thoughts behind this tradeoff.

Also, based on our earlier discussions, we need to depend on {{activeUsers.get()}} alone in certain contexts and on the sum of {{activeUsers.get()}} and {{activeUsersWithOnlyPendingApps.get()}} in other places. But the POC patch always depends on the latter value. I didn't understand this part.
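To make the contrast concrete, here is a minimal Java sketch of the two update styles described above. It is not taken from any of the attached patches: the class names {{ActiveUserCountSketch}}, {{UserSketch}} and {{EagerCounter}}, the method names, and the exact conditions used to classify a user are all assumptions made for illustration, and the eager counter assumes one app per user for simplicity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these types are the real CapacityScheduler classes.
final class ActiveUserCountSketch {

  /** Minimal stand-in for the per-user application counts kept on a User object. */
  static final class UserSketch {
    final String name;
    final int pendingApplications;
    final int activeApplications;

    UserSketch(String name, int pendingApplications, int activeApplications) {
      this.name = name;
      this.pendingApplications = pendingApplications;
      this.activeApplications = activeApplications;
    }
  }

  /**
   * Event-driven style (earlier approach, simplified to one app per user):
   * the counter changes at the moment each scheduler event fires.
   */
  static final class EagerCounter {
    private final AtomicInteger usersWithOnlyPendingApps = new AtomicInteger();

    // Called when an app is activated but its AM container is not yet allocated.
    void onAppActivated() {
      usersWithOnlyPendingApps.incrementAndGet();
    }

    // Called once the AM container has actually been allocated.
    void onAmContainerAllocated() {
      usersWithOnlyPendingApps.decrementAndGet();
    }

    int get() {
      return usersWithOnlyPendingApps.get();
    }
  }

  /**
   * Derived style (POC-like): recompute the count from per-user state whenever
   * user limits are computed. The condition below is an assumption.
   */
  static int countUsersWithOnlyPendingApps(List<UserSketch> users) {
    int count = 0;
    for (UserSketch u : users) {
      if (u.pendingApplications > 0 && u.activeApplications == 0) {
        count++;
      }
    }
    return count;
  }

  /** Users that have at least one active application. */
  static int countActiveUsers(List<UserSketch> users) {
    int count = 0;
    for (UserSketch u : users) {
      if (u.activeApplications > 0) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Scenario from the issue description: user1/user2 active, user3/user4 pending.
    List<UserSketch> users = Arrays.asList(
        new UserSketch("user1", 0, 1),
        new UserSketch("user2", 0, 1),
        new UserSketch("user3", 1, 0),
        new UserSketch("user4", 1, 0));

    System.out.println("active=" + countActiveUsers(users));
    System.out.println("onlyPending=" + countUsersWithOnlyPendingApps(users));
  }
}
```

In the derived style the count is only as fresh as the last recomputation, which is where the questions above about the update window come from. The eager counter is always current, but it needs a hook at every event that can change a user's state.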
On the other hand, we can avoid the {{AppAMAttemptsFailedSchedulerEvent}}-related changes completely with this new patch, since {{User.finishApplication()}} would be called anyway, even when the max AM attempts limit has been reached.

Please share your thoughts.

> CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4606
>                 URL: https://issues.apache.org/jira/browse/YARN-4606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 2.8.0, 2.7.1
>            Reporter: Karam Singh
>            Assignee: Manikandan R
>            Priority: Critical
>         Attachments: YARN-4606.001.patch, YARN-4606.002.patch, YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch
>
>
> Currently, if all applications belong to same user in LeafQueue are pending (caused by max-am-percent, etc.), ActiveUsersManager still considers the user is an active user. This could lead to starvation of active applications, for example:
> - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to user3)/app4(belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, there're only two users (user1/user2) are able to allocate new resources. So computed user-limit-resource could be lower than expected.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org