[jira] [Commented] (YARN-10438) Handle null containerId in ClientRMService#getContainerReport()

2021-11-18 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445996#comment-17445996
 ] 

Akira Ajisaka commented on YARN-10438:
--

Hi [~chaosun], I think it would be good to include this in the Hadoop 3.3.2 release. 
Could you check this?

> Handle null containerId in ClientRMService#getContainerReport()
> ---
>
> Key: YARN-10438
> URL: https://issues.apache.org/jira/browse/YARN-10438
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.2.1
> Reporter: Raghvendra Singh
> Assignee: Shubham Gupta
> Priority: Major
> Fix For: 3.4.0, 3.2.3, 3.3.2
>
>
> Here is the exception trace we are seeing. We suspect that this exception 
> puts the RM into a state where it no longer allows any new job to run on 
> the cluster.
> {noformat}
> 2020-09-15 07:08:15,496 WARN ipc.Server: IPC Server handler 18 on default port 8032, call Call#1463486 Retry#0 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getContainerReport from 10.39.91.205:49564
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:520)
>     at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
>     at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
> {noformat}
> We see this issue with this specific node only; the cluster runs at a scale 
> of around 500 nodes. 
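
The issue title asks for a null containerId to be handled before it is dereferenced. A minimal sketch of that kind of guard follows. It is not the committed patch and does not use the real `ClientRMService` types: `Request`, `InvalidRequestException` and `describeContainer` are hypothetical stand-ins chosen only for illustration.

```java
// Hypothetical stand-ins for the real YARN request types; names are illustrative only.
final class ContainerReportGuard {

    /** Minimal stand-in for a request whose container ID may be missing. */
    static final class Request {
        private final String containerId; // null if the client never set it

        Request(String containerId) {
            this.containerId = containerId;
        }

        String getContainerId() {
            return containerId;
        }
    }

    /** Unchecked exception standing in for a client-facing "bad request" error. */
    static final class InvalidRequestException extends RuntimeException {
        InvalidRequestException(String message) {
            super(message);
        }
    }

    /**
     * Validates the request before the container ID is used, so a malformed
     * call fails with a descriptive error instead of a NullPointerException.
     */
    static String describeContainer(Request request) {
        if (request == null || request.getContainerId() == null) {
            throw new InvalidRequestException("Invalid container id: null");
        }
        return "report for " + request.getContainerId();
    }

    public static void main(String[] args) {
        System.out.println(describeContainer(new Request("container_1")));
        try {
            describeContainer(new Request(null));
        } catch (InvalidRequestException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting the call up front keeps the failure on the caller's side: the handler thread returns an ordinary error to the client rather than logging an NPE from deep inside the service method.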



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10438) Handle null containerId in ClientRMService#getContainerReport()

2021-11-18 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10438:
-
Fix Version/s: 3.2.3
   3.3.2







[jira] [Commented] (YARN-10438) Handle null containerId in ClientRMService#getContainerReport()

2021-11-18 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445987#comment-17445987
 ] 

Akira Ajisaka commented on YARN-10438:
--

Backported to branch-3.3, branch-3.2, and branch-3.2.3.

> Handle null containerId in ClientRMService#getContainerReport()
> ---
>
> Key: YARN-10438
> URL: https://issues.apache.org/jira/browse/YARN-10438
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.2.1
> Reporter: Raghvendra Singh
> Assignee: Shubham Gupta
> Priority: Major
> Fix For: 3.4.0






[jira] [Commented] (YARN-10452) YARN scheduler response returns invalid values for capacity, maxCapacity and absoluteMaxCapacity

2021-11-18 Thread Tamas Domok (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445740#comment-17445740
 ] 

Tamas Domok commented on YARN-10452:


I found something similar: YARN-11010

The ui2 Queues page can't handle the NaNs.

> YARN scheduler response returns invalid values for capacity, maxCapacity and 
> absoluteMaxCapacity
> 
>
> Key: YARN-10452
> URL: https://issues.apache.org/jira/browse/YARN-10452
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: Akhil PB
> Priority: Major
> Attachments: yarn_scheduler_response_incorrect_partition_capacity.json
>
>
> When there are no nodes in the default partition, the YARN scheduler response 
> returns invalid capacity values, as listed below:
> - capacity is INF
> - maxCapacity is NaN
> - absoluteMaxCapacity is NaN
> The YARN scheduler response JSON is attached.
> cc: [~sunilg] [~wangda]
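
The thread does not state the root cause, but INF and NaN are exactly what Java floating-point division produces when the divisor is zero. The sketch below only illustrates that arithmetic. `naiveRatio` and `guardedRatio` are hypothetical helpers, not scheduler code, and it is an assumption that the capacities are computed as a ratio against an empty partition's total.

```java
// Illustrative only: shows how float division by zero yields Infinity and NaN.
final class CapacityRatioDemo {

    /** Plain division: a zero total gives Infinity (x/0) or NaN (0/0), never an exception. */
    static float naiveRatio(float used, float total) {
        return used / total;
    }

    /** Guarded division: reports 0 when there is nothing to divide by (one possible policy). */
    static float guardedRatio(float used, float total) {
        return total == 0f ? 0f : used / total;
    }

    public static void main(String[] args) {
        System.out.println(naiveRatio(1f, 0f));   // non-zero over zero
        System.out.println(naiveRatio(0f, 0f));   // zero over zero
        System.out.println(guardedRatio(1f, 0f)); // guarded variant
    }
}
```

Whether 0 is the right value to report for an empty partition is a design decision for the scheduler; the point here is only that unguarded division silently produces non-finite values.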






[jira] [Created] (YARN-11010) YARN ui2 hangs on the Queues page when the scheduler response contains NaN values

2021-11-18 Thread Tamas Domok (Jira)
Tamas Domok created YARN-11010:
--

Summary: YARN ui2 hangs on the Queues page when the scheduler response contains NaN values
Key: YARN-11010
URL: https://issues.apache.org/jira/browse/YARN-11010
Project: Hadoop YARN
Issue Type: Bug
Components: yarn-ui-v2
Affects Versions: 3.4.0
Reporter: Tamas Domok
Assignee: Tamas Domok
Attachments: capacity-scheduler.xml, shresponse.json

When the scheduler response contains NaN values for capacity and maxCapacity, 
UI2 hangs on the Queues page. The console log shows the following error:
{code:java}
SyntaxError: Unexpected token N in JSON at position 666
{code}
The scheduler response:
{code:java}
"maxCapacity": NaN,
"absoluteMaxCapacity": NaN,
{code}
NaN, Infinity, and -Infinity are not valid in JSON syntax: 
https://www.json.org/json-en.html
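
One way to keep such values out of a response is to check for non-finite numbers at the point where they are rendered. The sketch below is a hedged illustration, not the fix adopted for this issue: `toJsonNumber` and `queueJson` are hypothetical helpers, and mapping non-finite values to `null` is just one possible policy.

```java
// Illustrative only: renders floats so that the output is always valid JSON.
final class JsonNumberGuard {

    /** Returns a JSON number for finite values and the JSON literal null otherwise. */
    static String toJsonNumber(float value) {
        return Float.isFinite(value) ? Float.toString(value) : "null";
    }

    /** Builds a small JSON object by hand to show the effect on the two fields above. */
    static String queueJson(float maxCapacity, float absoluteMaxCapacity) {
        return "{\"maxCapacity\": " + toJsonNumber(maxCapacity)
                + ", \"absoluteMaxCapacity\": " + toJsonNumber(absoluteMaxCapacity) + "}";
    }

    public static void main(String[] args) {
        System.out.println(queueJson(Float.NaN, 100f));
        System.out.println(queueJson(Float.POSITIVE_INFINITY, Float.NaN));
    }
}
```

A consumer such as the UI would still need to tolerate `null` in these fields, so a guard like this shifts the problem from a parse failure to an explicit missing value.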

This might be related as well: YARN-10452

 

I managed to reproduce this with AQCv1: I set the parent queue's capacity 
in absolute mode and used percentage mode on the leaf-queue-template. I'm 
not sure whether this is a valid configuration; however, the RM logs show no 
error or warning about the configuration. The DominantResourceCalculator must 
be in use to trigger the issue. (With absolute mode on the 
leaf-queue-template the issue is not reproducible; further details are in 
YARN-10922.)

 

Reproduction steps:
 # Start the cluster with the attached configuration
 # Check the Queues page on UI2 (it should work at this point)
 # Send an example job (yarn jar hadoop-mapreduce-examples-3.4.0-SNAPSHOT.jar pi 1 10)
 # Check the Queues page on UI2 (it should not be working at this point)


