[ https://issues.apache.org/jira/browse/YARN-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680649#comment-16680649 ]
Tao Yang commented on YARN-8233:
--------------------------------

Thanks [~ajisakaa] for your effort to figure out this failure. Does this failure occur only in branch-3.1? There seem to be no QA comments for branch-3.0 and branch-2.

> NPE in CapacityScheduler#tryCommit when handling allocate/reserve proposal whose allocatedOrReservedContainer is null
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8233
>                 URL: https://issues.apache.org/jira/browse/YARN-8233
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.2.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Critical
>             Fix For: 3.3.0, 3.2.1
>
>         Attachments: YARN-8233.001-branch-3.1-test.patch, YARN-8233.001-test-branch-3.1.patch, YARN-8233.001.branch-2.patch, YARN-8233.001.branch-3.0.patch, YARN-8233.001.branch-3.1.patch, YARN-8233.001.patch, YARN-8233.002.patch, YARN-8233.003.patch
>
>
> Recently we saw an NPE in CapacityScheduler#tryCommit when trying to find the attemptId by calling {{c.getAllocatedOrReservedContainer().get...}} on an allocate/reserve proposal. The allocatedOrReservedContainer was null, so an NPE was thrown.
> Reference code:
> {code:java}
> // find the application to accept and apply the ResourceCommitRequest
> if (request.anythingAllocatedOrReserved()) {
>   ContainerAllocationProposal<FiCaSchedulerApp, FiCaSchedulerNode> c =
>       request.getFirstAllocatedOrReservedContainer();
>   attemptId =
>       c.getAllocatedOrReservedContainer().getSchedulerApplicationAttempt()
>           .getApplicationAttemptId();   //NPE happens here
> } else { ...
> {code}
> The proposal is constructed in {{CapacityScheduler#createResourceCommitRequest}}, and allocatedOrReservedContainer can be null during async scheduling when the node was lost or the application has finished (details in {{CapacityScheduler#getSchedulerContainer}}).
> Reference code:
> {code:java}
> // Allocated something
> List<AssignmentInformation.AssignmentDetails> allocations =
>     csAssignment.getAssignmentInformation().getAllocationDetails();
> if (!allocations.isEmpty()) {
>   RMContainer rmContainer = allocations.get(0).rmContainer;
>   allocated = new ContainerAllocationProposal<>(
>       getSchedulerContainer(rmContainer, true),   //possibly null
>       getSchedulerContainersToRelease(csAssignment),
>       getSchedulerContainer(csAssignment.getFulfilledReservedContainer(),
>           false), csAssignment.getType(),
>       csAssignment.getRequestLocalityType(),
>       csAssignment.getSchedulingMode() != null ?
>           csAssignment.getSchedulingMode() :
>           SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY,
>       csAssignment.getResource());
> }
> {code}
> I think we should add a null check for allocatedOrReservedContainer before creating allocate/reserve proposals. Besides, the allocation process has already increased the app's unconfirmed resource when creating an allocate assignment, so if the container is null, we should decrease the unconfirmed resource of the live app.
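
The null check proposed in the description could look roughly like the sketch below. It is self-contained and is not the actual YARN patch: `App`, `Container`, `Proposal`, and `createProposal` are hypothetical stand-ins for FiCaSchedulerApp, the scheduler container, ContainerAllocationProposal, and the proposal-building step in createResourceCommitRequest, and the resource is reduced to a single memory value in MB for illustration.

```java
import java.util.Objects;
import java.util.Optional;

public class NullSafeProposalSketch {

    // Hypothetical stand-in for an application that tracks unconfirmed resources.
    static final class App {
        private long unconfirmedMb;

        void incUnconfirmed(long mb) { unconfirmedMb += mb; }
        void decUnconfirmed(long mb) { unconfirmedMb -= mb; }
        long getUnconfirmedMb() { return unconfirmedMb; }
    }

    // Hypothetical stand-in for the allocated-or-reserved scheduler container.
    static final class Container {
        final String attemptId;

        Container(String attemptId) { this.attemptId = attemptId; }
    }

    // Hypothetical stand-in for an allocate/reserve proposal; it never holds null.
    static final class Proposal {
        final Container allocatedOrReserved;

        Proposal(Container container) {
            this.allocatedOrReserved = Objects.requireNonNull(container);
        }
    }

    /**
     * Builds a proposal only when the container was resolved. If it is null
     * (node lost or application finished), the unconfirmed resource that was
     * added during allocation is rolled back on the still-live app, if any.
     */
    static Optional<Proposal> createProposal(Container resolved, App liveApp, long allocatedMb) {
        if (resolved == null) {
            if (liveApp != null) {
                liveApp.decUnconfirmed(allocatedMb);
            }
            return Optional.empty();
        }
        return Optional.of(new Proposal(resolved));
    }

    public static void main(String[] args) {
        App app = new App();
        app.incUnconfirmed(1024);          // allocation step counted 1024 MB as unconfirmed

        Optional<Proposal> proposal = createProposal(null, app, 1024);

        // No proposal is created and the unconfirmed resource is back to zero.
        System.out.println(proposal.isPresent() + " " + app.getUnconfirmedMb()); // prints "false 0"
    }
}
```

With no proposal created, the commit path never dereferences a null container, and the app's accounting stays consistent with what was actually allocated.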