[ https://issues.apache.org/jira/browse/YARN-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678045#comment-16678045 ]
Tao Yang commented on YARN-8233:
--------------------------------

Hi, [~ajisakaa], [~cheersyang]

I have attached 3 patches:
(1) The patch for branch-3.1 only updates the test case, since the SchedulerApplicationAttempt#hasPendingResourceRequest API has changed in 3.2 and trunk.
(2) The patch for branch-3.0 includes the modification above and drops the change in CapacityScheduler#attemptAllocationOnNode, since that method does not exist yet in that branch.
(3) The patch for branch-2 includes the modifications above and updates the UT to add the final keyword to variables that are used in Mockito#doAnswer. branch-2.9 can use the branch-2 patch.

Please help to review these new patches before committing. Thanks.

> NPE in CapacityScheduler#tryCommit when handling allocate/reserve proposal
> whose allocatedOrReservedContainer is null
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8233
>                 URL: https://issues.apache.org/jira/browse/YARN-8233
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.2.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Critical
>             Fix For: 3.3.0, 3.2.1
>
>         Attachments: YARN-8233.001.branch-2.patch,
> YARN-8233.001.branch-3.0.patch, YARN-8233.001.branch-3.1.patch,
> YARN-8233.001.patch, YARN-8233.002.patch, YARN-8233.003.patch
>
>
> Recently we saw an NPE in CapacityScheduler#tryCommit when trying to find
> the attemptId by calling {{c.getAllocatedOrReservedContainer().get...}} from
> an allocate/reserve proposal. The allocatedOrReservedContainer was null, so
> an NPE was thrown.
> Reference code:
> {code:java}
> // find the application to accept and apply the ResourceCommitRequest
> if (request.anythingAllocatedOrReserved()) {
>   ContainerAllocationProposal<FiCaSchedulerApp, FiCaSchedulerNode> c =
>       request.getFirstAllocatedOrReservedContainer();
>   attemptId =
>       c.getAllocatedOrReservedContainer().getSchedulerApplicationAttempt()
>           .getApplicationAttemptId(); //NPE happens here
> } else { ...
> {code}
> The proposal was constructed in
> {{CapacityScheduler#createResourceCommitRequest}}, and
> allocatedOrReservedContainer can be null in the async-scheduling process
> when the node was lost or the application was finished (details in
> {{CapacityScheduler#getSchedulerContainer}}).
> Reference code:
> {code:java}
> // Allocated something
> List<AssignmentInformation.AssignmentDetails> allocations =
>     csAssignment.getAssignmentInformation().getAllocationDetails();
> if (!allocations.isEmpty()) {
>   RMContainer rmContainer = allocations.get(0).rmContainer;
>   allocated = new ContainerAllocationProposal<>(
>       getSchedulerContainer(rmContainer, true), //possibly null
>       getSchedulerContainersToRelease(csAssignment),
>       getSchedulerContainer(csAssignment.getFulfilledReservedContainer(),
>           false), csAssignment.getType(),
>       csAssignment.getRequestLocalityType(),
>       csAssignment.getSchedulingMode() != null ?
>           csAssignment.getSchedulingMode() :
>           SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY,
>       csAssignment.getResource());
> }
> {code}
> I think we should add a null check for allocatedOrReservedContainer before
> creating allocate/reserve proposals. Besides, the allocation process has
> already increased the unconfirmed resource of the app when creating an
> allocate assignment, so if the container is null, we should decrease the
> unconfirmed resource of the live app.
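
The null check proposed in the description can be illustrated with a minimal, self-contained sketch. It does not use the Hadoop classes and is not the attached patch: App, SchedulerContainer, Proposal, lookup and createProposal are hypothetical stand-ins, and the plain megabyte counter is an assumed simplification of the app's unconfirmed resource. It only shows the intended shape: when the container lookup returns null, no proposal is built and the unconfirmed resource added by the allocation step is rolled back.

```java
import java.util.Optional;

// Hypothetical stand-ins only; none of these are the real Hadoop YARN classes.
class NullProposalSketch {

    // Assumed simplification: unconfirmed resource tracked as a single MB counter.
    static final class App {
        private long unconfirmedMb;
        void incUnconfirmed(long mb) { unconfirmedMb += mb; }
        void decUnconfirmed(long mb) { unconfirmedMb -= mb; }
        long getUnconfirmedMb() { return unconfirmedMb; }
    }

    static final class SchedulerContainer {
        final App app;
        final long mb;
        SchedulerContainer(App app, long mb) { this.app = app; this.mb = mb; }
    }

    static final class Proposal {
        final SchedulerContainer container;
        Proposal(SchedulerContainer container) { this.container = container; }
    }

    // Mimics a lookup that returns null when the node has been lost.
    static SchedulerContainer lookup(App app, long mb, boolean nodeAlive) {
        return nodeAlive ? new SchedulerContainer(app, mb) : null;
    }

    // Builds a proposal only if the container resolved; otherwise rolls back
    // the unconfirmed resource of the still-live app and returns empty.
    static Optional<Proposal> createProposal(App app, long mb, boolean nodeAlive) {
        SchedulerContainer c = lookup(app, mb, nodeAlive);
        if (c == null) {
            app.decUnconfirmed(mb);
            return Optional.empty();
        }
        return Optional.of(new Proposal(c));
    }

    public static void main(String[] args) {
        App app = new App();
        app.incUnconfirmed(1024); // the allocation step reserved this earlier
        Optional<Proposal> p = createProposal(app, 1024, false); // node lost
        System.out.println("proposal present: " + p.isPresent());
        System.out.println("unconfirmed MB: " + app.getUnconfirmedMb());
    }
}
```

With this shape, the commit path never sees a proposal whose container is null, so it does not need to dereference one.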