[jira] [Commented] (YARN-10438) Handle null containerId in ClientRMService#getContainerReport()
    [ https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445996#comment-17445996 ]

Akira Ajisaka commented on YARN-10438:
--------------------------------------

Hi [~chaosun], I think it would be good to include this in the Hadoop 3.3.2 release. Could you check this?

> Handle null containerId in ClientRMService#getContainerReport()
> ---------------------------------------------------------------
>
>                 Key: YARN-10438
>                 URL: https://issues.apache.org/jira/browse/YARN-10438
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.2.1
>            Reporter: Raghvendra Singh
>            Assignee: Shubham Gupta
>            Priority: Major
>             Fix For: 3.4.0, 3.2.3, 3.3.2
>
>
> Here is the exception trace we are seeing. We suspect that, because of this
> exception, the RM reaches a state where it no longer allows any new job to
> run on the cluster.
> {noformat}
> 2020-09-15 07:08:15,496 WARN ipc.Server: IPC Server handler 18 on default port 8032, call Call#1463486 Retry#0 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getContainerReport from 10.39.91.205:49564
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:520)
>     at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
>     at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
> {noformat}
> We are seeing this issue with this specific node only; we run this cluster
> at a scale of around 500 nodes.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10438) Handle null containerId in ClientRMService#getContainerReport()
    [ https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Akira Ajisaka updated YARN-10438:
---------------------------------
    Fix Version/s: 3.2.3
                   3.3.2

> Handle null containerId in ClientRMService#getContainerReport()
> ---------------------------------------------------------------
>
>                 Key: YARN-10438
>                 URL: https://issues.apache.org/jira/browse/YARN-10438
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.2.1
>            Reporter: Raghvendra Singh
>            Assignee: Shubham Gupta
>            Priority: Major
>             Fix For: 3.4.0, 3.2.3, 3.3.2
>
>
> Here is the exception trace we are seeing. We suspect that, because of this
> exception, the RM reaches a state where it no longer allows any new job to
> run on the cluster.
> {noformat}
> 2020-09-15 07:08:15,496 WARN ipc.Server: IPC Server handler 18 on default port 8032, call Call#1463486 Retry#0 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getContainerReport from 10.39.91.205:49564
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:520)
>     at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
>     at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
> {noformat}
> We are seeing this issue with this specific node only; we run this cluster
> at a scale of around 500 nodes.
[jira] [Commented] (YARN-10438) Handle null containerId in ClientRMService#getContainerReport()
    [ https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445987#comment-17445987 ]

Akira Ajisaka commented on YARN-10438:
--------------------------------------

Backported to branch-3.3, branch-3.2, and branch-3.2.3.

> Handle null containerId in ClientRMService#getContainerReport()
> ---------------------------------------------------------------
>
>                 Key: YARN-10438
>                 URL: https://issues.apache.org/jira/browse/YARN-10438
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.2.1
>            Reporter: Raghvendra Singh
>            Assignee: Shubham Gupta
>            Priority: Major
>             Fix For: 3.4.0
>
>
> Here is the exception trace we are seeing. We suspect that, because of this
> exception, the RM reaches a state where it no longer allows any new job to
> run on the cluster.
> {noformat}
> 2020-09-15 07:08:15,496 WARN ipc.Server: IPC Server handler 18 on default port 8032, call Call#1463486 Retry#0 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getContainerReport from 10.39.91.205:49564
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:520)
>     at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
>     at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
> {noformat}
> We are seeing this issue with this specific node only; we run this cluster
> at a scale of around 500 nodes.
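
The trace quoted in this thread shows a NullPointerException raised inside ClientRMService#getContainerReport(), and the issue title attributes it to a null containerId. As a rough illustration only, a null guard of that kind might look like the sketch below. It is not the committed patch: `ReportRequest`, `InvalidContainerRequestException`, the in-memory map, and the error messages are all hypothetical stand-ins for the real YARN types, chosen so the example compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a service that looks up a container report by id. */
class ContainerReportNullCheck {

    /** Hypothetical request type; the real YARN request class is not reproduced here. */
    static final class ReportRequest {
        private final String containerId; // null when the client supplied no id

        ReportRequest(String containerId) {
            this.containerId = containerId;
        }

        String getContainerId() {
            return containerId;
        }
    }

    /** Hypothetical checked exception for a missing or unknown container id. */
    static final class InvalidContainerRequestException extends Exception {
        InvalidContainerRequestException(String message) {
            super(message);
        }
    }

    // Illustrative in-memory store; the real service consults scheduler state.
    private final Map<String, String> reports = new HashMap<>();

    ContainerReportNullCheck() {
        reports.put("container_1", "RUNNING");
    }

    String getContainerReport(ReportRequest request)
            throws InvalidContainerRequestException {
        String containerId = request.getContainerId();
        // Validate before using the id, so a missing id becomes a descriptive
        // error for the caller instead of a NullPointerException in the handler.
        if (containerId == null) {
            throw new InvalidContainerRequestException("Invalid container id: null");
        }
        String report = reports.get(containerId);
        if (report == null) {
            throw new InvalidContainerRequestException(
                "Container not found: " + containerId);
        }
        return report;
    }

    public static void main(String[] args) {
        ContainerReportNullCheck service = new ContainerReportNullCheck();
        try {
            service.getContainerReport(new ReportRequest(null));
        } catch (InvalidContainerRequestException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Rejecting the request with a typed exception keeps the failure on the client side of the RPC, which is the general idea the issue title suggests; how the real service reports it is not shown in this thread.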
[jira] [Commented] (YARN-10452) YARN scheduler response returns invalid values for capacity, maxCapacity and absoluteMaxCapacity
    [ https://issues.apache.org/jira/browse/YARN-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445740#comment-17445740 ]

Tamas Domok commented on YARN-10452:
------------------------------------

I found something similar: YARN-11010

The ui2 Queues page can't handle the NaNs.

> YARN scheduler response returns invalid values for capacity, maxCapacity and absoluteMaxCapacity
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10452
>                 URL: https://issues.apache.org/jira/browse/YARN-10452
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Akhil PB
>            Priority: Major
>         Attachments: yarn_scheduler_response_incorrect_partition_capacity.json
>
>
> When there are no nodes in the default partition, YARN scheduler response
> returns invalid values for capacities as listed below.
> - capacity is INF
> - maxCapacity is NaN
> - absoluteMaxCapacity is NaN
> Attached the YARN scheduler response json.
> cc: [~sunilg] [~wangda]
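
The issue above does not say how the scheduler arrives at INF and NaN. One common way such values appear in Java is floating-point division by zero, which would fit a partition whose total resource is zero because it has no nodes. The sketch below illustrates that behaviour under this assumption; `rawRatio` and `safeRatio` are hypothetical helpers, not scheduler code, and returning 0 for an empty partition is only one possible policy.

```java
/** Illustrates how a zero denominator produces Infinity or NaN in float math. */
class CapacityRatio {

    /** Plain float division: a zero total yields Infinity or NaN, not an exception. */
    static float rawRatio(float used, float total) {
        return used / total;
    }

    /** Guarded variant: reports 0 when the total is zero or the result is not finite. */
    static float safeRatio(float used, float total) {
        if (total == 0f) {
            return 0f;
        }
        float ratio = used / total;
        return Float.isFinite(ratio) ? ratio : 0f;
    }

    public static void main(String[] args) {
        System.out.println(rawRatio(4f, 0f));   // prints "Infinity"
        System.out.println(rawRatio(0f, 0f));   // prints "NaN"
        System.out.println(safeRatio(4f, 0f));  // prints "0.0"
        System.out.println(safeRatio(1f, 4f));  // prints "0.25"
    }
}
```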
[jira] [Created] (YARN-11010) YARN ui2 hangs on the Queues page when the scheduler response contains NaN values
Tamas Domok created YARN-11010:
----------------------------------

             Summary: YARN ui2 hangs on the Queues page when the scheduler response contains NaN values
                 Key: YARN-11010
                 URL: https://issues.apache.org/jira/browse/YARN-11010
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn-ui-v2
    Affects Versions: 3.4.0
            Reporter: Tamas Domok
            Assignee: Tamas Domok
         Attachments: capacity-scheduler.xml, shresponse.json


When the scheduler response contains NaN values for capacity and maxCapacity, the UI2 hangs on the Queues page. The console log shows the following error:
{code:java}
SyntaxError: Unexpected token N in JSON at position 666
{code}
The scheduler response:
{code:java}
"maxCapacity": NaN,
"absoluteMaxCapacity": NaN,
{code}
NaN, Infinity, and -Infinity are not valid in JSON syntax: https://www.json.org/json-en.html

This might be related as well: YARN-10452

I managed to reproduce this with AQCv1, where I set the parent queue's capacity in absolute mode and then used percentage mode on the leaf-queue-template. I'm not sure whether this is a valid configuration; however, there is no error or warning in the RM logs about any configuration error. To trigger the issue, the DominantResourceCalculator must be used. (When using absolute mode on the leaf-queue-template, this issue is not reproducible; further details in YARN-10922.)

Reproduction steps:
# Start the cluster with the attached configuration
# Check the Queues page on UI2 (it should work at this point)
# Submit an example job (yarn jar hadoop-mapreduce-examples-3.4.0-SNAPSHOT.jar pi 1 10)
# Check the Queues page on UI2 (it should not be working at this point)
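
Because JSON has no literal for NaN or Infinity, a response producer can only stay parseable by writing something else for those values. The sketch below shows one option, emitting `null` for non-finite floats. It is an assumption-laden illustration, not the fix adopted for this issue: `toJson` and `field` are hypothetical helpers, and whether the RM should emit `null`, `0`, or an error is a design decision the report leaves open.

```java
/** Hypothetical helpers that write float fields as valid JSON. */
class JsonNumber {

    /** Formats a float as a JSON value; non-finite values become null. */
    static String toJson(float value) {
        return Float.isFinite(value) ? Float.toString(value) : "null";
    }

    /** Renders one "name": value pair; assumes the name needs no escaping. */
    static String field(String name, float value) {
        return "\"" + name + "\": " + toJson(value);
    }

    public static void main(String[] args) {
        String json = "{" + field("maxCapacity", Float.NaN) + ", "
            + field("absoluteMaxCapacity", 100f) + "}";
        // prints {"maxCapacity": null, "absoluteMaxCapacity": 100.0}
        System.out.println(json);
    }
}
```

With `null` in place of `NaN`, a strict parser such as the browser's `JSON.parse` no longer fails on the token, although the consumer then has to handle a missing capacity value.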