[jira] [Commented] (YARN-10438) NPE while fetching container report for a node which is not there in active/decommissioned/lost/unhealthy nodes on RM
[ https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197560#comment-17197560 ] Raghvendra Singh commented on YARN-10438: - [~shubhamod] Following is line of code is at line 520. ApplicationAttemptId appAttemptId = containerId.getApplicationAttemptId(); Here is the method getContainerReport() from ClientRMService.java {noformat} @Override public GetContainerReportResponse getContainerReport( GetContainerReportRequest request) throws YarnException, IOException { ContainerId containerId = request.getContainerId(); ApplicationAttemptId appAttemptId = containerId.getApplicationAttemptId(); ApplicationId appId = appAttemptId.getApplicationId(); UserGroupInformation callerUGI = getCallerUgi(appId, AuditConstants.GET_CONTAINER_REPORT); RMApp application = verifyUserAccessForRMApp(appId, callerUGI, AuditConstants.GET_CONTAINER_REPORT, ApplicationAccessType.VIEW_APP, false); boolean allowAccess = checkAccess(callerUGI, application.getUser(), ApplicationAccessType.VIEW_APP, application); GetContainerReportResponse response = null; if (allowAccess) { RMAppAttempt appAttempt = application.getAppAttempts().get(appAttemptId); if (appAttempt == null) { throw new ApplicationAttemptNotFoundException( "ApplicationAttempt with id '" + appAttemptId + "' doesn't exist in RM."); } RMContainer rmContainer = this.rmContext.getScheduler().getRMContainer( containerId); if (rmContainer == null) { throw new ContainerNotFoundException("Container with id '" + containerId + "' doesn't exist in RM."); } response = GetContainerReportResponse.newInstance(rmContainer .createContainerReport()); } else { throw new YarnException("User " + callerUGI.getShortUserName() + " does not have privilege to see this application " + appId); } return response; } {noformat} Following is method from ApplicationClientProtocolPBServiceImpl.java {noformat} @Override public GetContainerReportResponseProto getContainerReport( RpcController controller, GetContainerReportRequestProto proto) throws ServiceException { GetContainerReportRequestPBImpl request = new GetContainerReportRequestPBImpl(proto); try { GetContainerReportResponse response = real.getContainerReport(request); return ((GetContainerReportResponsePBImpl) response).getProto(); } catch (YarnException e) { throw new ServiceException(e); } catch (IOException e) { throw new ServiceException(e); } } {noformat} > NPE while fetching container report for a node which is not there in > active/decommissioned/lost/unhealthy nodes on RM > - > > Key: YARN-10438 > URL: https://issues.apache.org/jira/browse/YARN-10438 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Raghvendra Singh >Priority: Major > > Here is the Exception trace which we are seeing, we are suspecting because of > this exception RM is reaching in a state where it is no more allowing any new > job to run on the cluster. > {noformat} > 2020-09-15 07:08:15,496 WARN ipc.Server: IPC Server handler 18 on default > port 8032, call Call#1463486 Retry#0 > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getContainerReport > from 10.39.91.205:49564 java.lang.NullPointerException at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:520) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:422) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915) > {noformat} > We are seeing this issue with this specific node only, we do run this cluster > at a scale of around 500 nodes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (YARN-10438) NPE while fetching container report for a node which is not there in active/decommissioned/lost/unhealthy nodes on RM
[ https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197387#comment-17197387 ] Shubham Gupta commented on YARN-10438: -- [~raghvendra.s], can you share the code ClientRMService.java at line 520 in your class and the full function getContainerReport() in ClientRMService.java and ApplicationClientProtocolPBServiceImpl.java. > NPE while fetching container report for a node which is not there in > active/decommissioned/lost/unhealthy nodes on RM > - > > Key: YARN-10438 > URL: https://issues.apache.org/jira/browse/YARN-10438 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Raghvendra Singh >Priority: Major > > Here is the Exception trace which we are seeing, we are suspecting because of > this exception RM is reaching in a state where it is no more allowing any new > job to run on the cluster. > {noformat} > 2020-09-15 07:08:15,496 WARN ipc.Server: IPC Server handler 18 on default > port 8032, call Call#1463486 Retry#0 > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getContainerReport > from 10.39.91.205:49564 java.lang.NullPointerException at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:520) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:422) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915) > {noformat} > We are seeing this issue with this specific node only, we do run this cluster > at a scale of around 500 nodes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10438) NPE while fetching container report for a node which is not there in active/decommissioned/lost/unhealthy nodes on RM
[ https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197376#comment-17197376 ] Raghvendra Singh commented on YARN-10438: - [~shubhamod] Version number is correct, in ClientRMService class we have made a custom change because of that line number might not be matching exactly. > NPE while fetching container report for a node which is not there in > active/decommissioned/lost/unhealthy nodes on RM > - > > Key: YARN-10438 > URL: https://issues.apache.org/jira/browse/YARN-10438 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Raghvendra Singh >Priority: Major > > Here is the Exception trace which we are seeing, we are suspecting because of > this exception RM is reaching in a state where it is no more allowing any new > job to run on the cluster. > {noformat} > 2020-09-15 07:08:15,496 WARN ipc.Server: IPC Server handler 18 on default > port 8032, call Call#1463486 Retry#0 > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getContainerReport > from 10.39.91.205:49564 java.lang.NullPointerException at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:520) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:422) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915) > {noformat} > We are seeing this issue with this specific node only, we do run this cluster > at a scale of around 500 nodes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10438) NPE while fetching container report for a node which is not there in active/decommissioned/lost/unhealthy nodes on RM
[ https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197176#comment-17197176 ] Shubham Gupta commented on YARN-10438: -- Hi [~raghvendra.s], Can you please again check the version number of Hadoop? > NPE while fetching container report for a node which is not there in > active/decommissioned/lost/unhealthy nodes on RM > - > > Key: YARN-10438 > URL: https://issues.apache.org/jira/browse/YARN-10438 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Raghvendra Singh >Priority: Major > > Here is the Exception trace which we are seeing, we are suspecting because of > this exception RM is reaching in a state where it is no more allowing any new > job to run on the cluster. > {noformat} > 2020-09-15 07:08:15,496 WARN ipc.Server: IPC Server handler 18 on default > port 8032, call Call#1463486 Retry#0 > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getContainerReport > from 10.39.91.205:49564 java.lang.NullPointerException at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:520) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:422) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915) > {noformat} > We are seeing this issue with this specific node only, we do run this cluster > at a scale of around 500 nodes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10438) NPE while fetching container report for a node which is not there in active/decommissioned/lost/unhealthy nodes on RM
[ https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196115#comment-17196115 ] Surendra Singh Lilhore commented on YARN-10438: --- Hi [~raghvendra.s], Why you think, this is not a problem ? > NPE while fetching container report for a node which is not there in > active/decommissioned/lost/unhealthy nodes on RM > - > > Key: YARN-10438 > URL: https://issues.apache.org/jira/browse/YARN-10438 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Raghvendra Singh >Priority: Major > > Here is the Exception trace which we are seeing, we are suspecting because of > this exception RM is reaching in a state where it is no more allowing any new > job to run on the cluster. > {noformat} > 2020-09-15 07:08:15,496 WARN ipc.Server: IPC Server handler 18 on default > port 8032, call Call#1463486 Retry#0 > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getContainerReport > from 10.39.91.205:49564 java.lang.NullPointerException at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:520) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:422) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915) > {noformat} > We are seeing this issue with this specific node only, we do run this cluster > at a scale of around 500 nodes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org