[jira] [Commented] (YARN-10438) NPE while fetching container report for a node which is not there in active/decommissioned/lost/unhealthy nodes on RM

2020-09-17 Thread Raghvendra Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197560#comment-17197560
 ] 

Raghvendra Singh commented on YARN-10438:
-

[~shubhamod] The following line of code is at line 520.

ApplicationAttemptId appAttemptId = containerId.getApplicationAttemptId();

Here is the method getContainerReport() from ClientRMService.java
{noformat}
  @Override
  public GetContainerReportResponse getContainerReport(
      GetContainerReportRequest request) throws YarnException, IOException {
    ContainerId containerId = request.getContainerId();
    ApplicationAttemptId appAttemptId = containerId.getApplicationAttemptId();
    ApplicationId appId = appAttemptId.getApplicationId();
    UserGroupInformation callerUGI = getCallerUgi(appId,
        AuditConstants.GET_CONTAINER_REPORT);
    RMApp application = verifyUserAccessForRMApp(appId, callerUGI,
        AuditConstants.GET_CONTAINER_REPORT, ApplicationAccessType.VIEW_APP,
        false);
    boolean allowAccess = checkAccess(callerUGI, application.getUser(),
        ApplicationAccessType.VIEW_APP, application);
    GetContainerReportResponse response = null;
    if (allowAccess) {
      RMAppAttempt appAttempt = application.getAppAttempts().get(appAttemptId);
      if (appAttempt == null) {
        throw new ApplicationAttemptNotFoundException(
            "ApplicationAttempt with id '" + appAttemptId +
            "' doesn't exist in RM.");
      }
      RMContainer rmContainer = this.rmContext.getScheduler().getRMContainer(
          containerId);
      if (rmContainer == null) {
        throw new ContainerNotFoundException("Container with id '" + containerId
            + "' doesn't exist in RM.");
      }
      response = GetContainerReportResponse.newInstance(rmContainer
          .createContainerReport());
    } else {
      throw new YarnException("User " + callerUGI.getShortUserName()
          + " does not have privilege to see this application " + appId);
    }
    return response;
  }
{noformat}
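
If line 520 in the patched class really is the containerId.getApplicationAttemptId() call, then containerId is the only reference dereferenced on that line, so a null result from request.getContainerId() would be one possible explanation for the NPE. The following is a minimal, self-contained sketch of a null guard for that case. All class names in it (FakeContainerId, FakeRequest, FakeYarnException) are hypothetical stand-ins, not the real Hadoop types, and this is not a proposed patch.

```java
// Illustrative sketch only: the Fake* types are hypothetical stand-ins
// for the Hadoop classes and carry just enough state for the example.
final class ContainerReportGuardSketch {

  /** Stand-in for ContainerId; holds only an attempt id string. */
  static final class FakeContainerId {
    private final String attemptId;

    FakeContainerId(String attemptId) {
      this.attemptId = attemptId;
    }

    String getApplicationAttemptId() {
      return attemptId;
    }
  }

  /** Stand-in for GetContainerReportRequest; the container id may be null. */
  static final class FakeRequest {
    private final FakeContainerId containerId;

    FakeRequest(FakeContainerId containerId) {
      this.containerId = containerId;
    }

    FakeContainerId getContainerId() {
      return containerId;
    }
  }

  /** Stand-in for a checked, YarnException-style error. */
  static final class FakeYarnException extends Exception {
    FakeYarnException(String message) {
      super(message);
    }
  }

  /**
   * Rejects a request without a container id using a descriptive checked
   * exception instead of letting the dereference throw an NPE.
   */
  static String resolveAttemptId(FakeRequest request) throws FakeYarnException {
    FakeContainerId containerId = request.getContainerId();
    if (containerId == null) {
      throw new FakeYarnException("Container id is missing from the request.");
    }
    return containerId.getApplicationAttemptId();
  }

  public static void main(String[] args) {
    try {
      resolveAttemptId(new FakeRequest(null));
    } catch (FakeYarnException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

With a guard of this kind the caller would receive a descriptive error rather than a bare NullPointerException. Whether a null container id is actually what happens here still needs to be confirmed against the custom build.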

The following is the method from ApplicationClientProtocolPBServiceImpl.java
{noformat}
  @Override
  public GetContainerReportResponseProto getContainerReport(
      RpcController controller, GetContainerReportRequestProto proto)
      throws ServiceException {
    GetContainerReportRequestPBImpl request =
        new GetContainerReportRequestPBImpl(proto);
    try {
      GetContainerReportResponse response = real.getContainerReport(request);
      return ((GetContainerReportResponsePBImpl) response).getProto();
    } catch (YarnException e) {
      throw new ServiceException(e);
    } catch (IOException e) {
      throw new ServiceException(e);
    }
  }
{noformat}
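
Note that the wrapper above catches only YarnException and IOException, so an unchecked exception such as a NullPointerException passes through it untranslated, which is consistent with the raw NPE appearing in the ipc.Server warning. The sketch below illustrates that behaviour with hypothetical stand-in types (FakeService, FakeServiceException); it is not the Hadoop code itself.

```java
import java.io.IOException;

// Illustrative sketch only: FakeService and FakeServiceException are
// hypothetical stand-ins for the real service and ServiceException.
final class ExceptionTranslationSketch {

  /** Stand-in for the checked exception that the wrapper translates into. */
  static final class FakeServiceException extends Exception {
    FakeServiceException(Throwable cause) {
      super(cause);
    }
  }

  /** Stand-in for the wrapped service; declares only a checked exception. */
  interface FakeService {
    String getContainerReport(String containerId) throws IOException;
  }

  /** Translates checked exceptions only, mirroring the shape of the wrapper. */
  static String callThroughWrapper(FakeService real, String containerId)
      throws FakeServiceException {
    try {
      return real.getContainerReport(containerId);
    } catch (IOException e) {
      throw new FakeServiceException(e);
    }
  }

  /** Reports whether an exception reached the caller wrapped or unwrapped. */
  static String outcome(FakeService real, String containerId) {
    try {
      return "ok: " + callThroughWrapper(real, containerId);
    } catch (FakeServiceException e) {
      return "wrapped: " + e.getCause().getClass().getSimpleName();
    } catch (RuntimeException e) {
      return "unwrapped: " + e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    // Dereferencing a null id throws an unchecked NPE inside the service.
    System.out.println(outcome(id -> "report for " + id.trim(), null));
    // A checked IOException is translated by the wrapper.
    System.out.println(outcome(id -> {
      throw new IOException("boom");
    }, "c1"));
  }
}
```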

> NPE while fetching container report for a node which is not there in 
> active/decommissioned/lost/unhealthy nodes on RM
> -
>
> Key: YARN-10438
> URL: https://issues.apache.org/jira/browse/YARN-10438
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: Raghvendra Singh
>Priority: Major
>
> Here is the exception trace we are seeing. We suspect that this exception 
> is driving the RM into a state where it no longer allows any new jobs to 
> run on the cluster.
> {noformat}
> 2020-09-15 07:08:15,496 WARN ipc.Server: IPC Server handler 18 on default port 8032, call Call#1463486 Retry#0 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getContainerReport from 10.39.91.205:49564
> java.lang.NullPointerException
>   at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:520)
>   at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
>   at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
> {noformat}
> We see this issue with this one specific node only. The cluster runs at a 
> scale of around 500 nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (YARN-10438) NPE while fetching container report for a node which is not there in active/decommissioned/lost/unhealthy nodes on RM

2020-09-16 Thread Shubham Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197387#comment-17197387
 ] 

Shubham Gupta commented on YARN-10438:
--

[~raghvendra.s], can you share the code at line 520 of ClientRMService.java in 
your build, along with the full getContainerReport() method from both 
ClientRMService.java and ApplicationClientProtocolPBServiceImpl.java?


-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10438) NPE while fetching container report for a node which is not there in active/decommissioned/lost/unhealthy nodes on RM

2020-09-16 Thread Raghvendra Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197376#comment-17197376
 ] 

Raghvendra Singh commented on YARN-10438:
-

[~shubhamod] The version number is correct. We have made a custom change in the 
ClientRMService class, so the line numbers might not match exactly.




[jira] [Commented] (YARN-10438) NPE while fetching container report for a node which is not there in active/decommissioned/lost/unhealthy nodes on RM

2020-09-16 Thread Shubham Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197176#comment-17197176
 ] 

Shubham Gupta commented on YARN-10438:
--

Hi [~raghvendra.s], can you please check the Hadoop version number again?




[jira] [Commented] (YARN-10438) NPE while fetching container report for a node which is not there in active/decommissioned/lost/unhealthy nodes on RM

2020-09-15 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196115#comment-17196115
 ] 

Surendra Singh Lilhore commented on YARN-10438:
---

Hi [~raghvendra.s], why do you think this is not a problem?
