[jira] [Created] (YARN-2578) NM does not failover timely if RM node network connection fails

2014-09-21 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-2578:
---

 Summary: NM does not failover timely if RM node network connection 
fails
 Key: YARN-2578
 URL: https://issues.apache.org/jira/browse/YARN-2578
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.1
Reporter: Wilfred Spiegelenburg


The NM does not fail over correctly when the network cable of the RM is 
unplugged or the failure is simulated by a "service network stop" or a firewall 
that drops all traffic on the node. The RM fails over to the standby node when 
the failure is detected, as expected. The NM should then re-register with the 
new active RM. This re-registration takes a long time (15 minutes or more). 
Until then the cluster has no nodes for processing and applications are stuck.

Reproduction test case which can be used in any environment:
- create a cluster with 3 nodes
node 1: ZK, NN, JN, ZKFC, DN, RM, NM
node 2: ZK, NN, JN, ZKFC, DN, RM, NM
node 3: ZK, JN, DN, NM
- start all services and make sure they are in good health
- kill the network connection of the active RM using one of the network kills 
from above
- observe the NN and RM failover
- the DNs fail over to the new active NN
- the NM does not recover for a long time
- the logs show a long delay and the traces show no change at all

The stack traces of the NM all show the same set of threads. The main thread 
that should perform the re-registration is the "Node Status Updater". This 
thread is stuck in:
{code}
"Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in 
Object.wait() [0x7f5a51fc1000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
at java.lang.Object.wait(Object.java:503)
at org.apache.hadoop.ipc.Client.call(Client.java:1395)
- locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
at org.apache.hadoop.ipc.Client.call(Client.java:1362)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
{code}

The client connection that goes through the proxy can be traced back to the 
ResourceTrackerPBClientImpl. The generated proxy does not time out; we should be 
using a version that takes the RPC timeout (from the configuration) as a 
parameter.
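
As an illustration only (this is not the attached patch, and the exact proxy 
creation call and configuration key are assumptions), the idea is to read an RPC 
timeout from the configuration and hand it to the proxy when it is created, so 
that a dead connection is abandoned instead of blocking forever in Client.call():
{code:java}
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.net.NetUtils;
import org.apache.hadoop.security.UserGroupInformation;

public class TimedProxySketch {
  // Hypothetical helper: create a protocol proxy that gives up after rpcTimeoutMs.
  public static <T> T createProxy(Class<T> protocol, long version,
      InetSocketAddress rmAddress, Configuration conf) throws Exception {
    // Assumed key and default for illustration; the real patch takes the
    // timeout from the YARN/IPC configuration.
    int rpcTimeoutMs = conf.getInt("ipc.client.rpc-timeout.ms", 60000);
    return RPC.getProxy(protocol, version, rmAddress,
        UserGroupInformation.getCurrentUser(), conf,
        NetUtils.getDefaultSocketFactory(conf), rpcTimeoutMs);
  }
}
{code}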





[jira] [Updated] (YARN-2578) NM does not failover timely if RM node network connection fails

2014-09-21 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-2578:

Attachment: YARN-2578.patch

Attached a patch for initial review.
In the patch the proxy uses the RPC timeout that is set in the configuration to 
time out the connection.






[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails

2014-09-22 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144016#comment-14144016
 ] 

Wilfred Spiegelenburg commented on YARN-2578:
-

To address [~vinodkv]'s comments: the active RM is completely shut off from the 
network, as are all the other services on the node, including ZooKeeper. The RM 
can update the ZooKeeper instance on the node, but that update will never be 
propagated outside of the node to the other ZooKeeper nodes, so it can not be 
seen by the standby RM. The standby RM detects no updates in ZooKeeper for the 
timeout period and becomes the active node. That is the normal HA behaviour of 
the standby node, as if the RM had crashed.







[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails

2014-09-22 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144022#comment-14144022
 ] 

Wilfred Spiegelenburg commented on YARN-2578:
-

I looked into automated testing but, as in HDFS-4858, I have not been able to 
find a way to test this using JUnit tests. Manual testing is really simple 
using the reproduction scenario above.






[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails

2014-09-25 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148452#comment-14148452
 ] 

Wilfred Spiegelenburg commented on YARN-2578:
-

I proposed fixing the RPC code and setting the timeout by default in HDFS-4858, 
but there was no interest in fixing the client (at that point in time). So we 
now have to fix it everywhere, unless we can get everyone on board and get the 
behaviour changed in the RPC code. The comments are still in that jira and it 
would be a straightforward fix in the RPC code.






[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.

2014-10-06 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161160#comment-14161160
 ] 

Wilfred Spiegelenburg commented on YARN-1061:
-

This is a duplicate of YARN-2578. Writes do not time out and they should.

> NodeManager is indefinitely waiting for nodeHeartBeat() response from 
> ResourceManager.
> -
>
> Key: YARN-1061
> URL: https://issues.apache.org/jira/browse/YARN-1061
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.5-alpha
>Reporter: Rohith
>
> It is observed that in one scenario the NodeManager is indefinitely waiting 
> for the nodeHeartbeat response from the ResourceManager, where the 
> ResourceManager is in a hung state.
> The NodeManager should get a timeout exception instead of waiting indefinitely.





[jira] [Created] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2014-11-26 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-2910:
---

 Summary: FSLeafQueue can throw ConcurrentModificationException
 Key: YARN-2910
 URL: https://issues.apache.org/jira/browse/YARN-2910
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.5.0
Reporter: Wilfred Spiegelenburg




The lists that maintain the runnable and the non-runnable apps are standard 
ArrayLists, but there is no guarantee that they will only be manipulated by one 
thread in the system. This can lead to the following exception:

{noformat}
2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM.
java.util.ConcurrentModificationException: java.util.ConcurrentModificationException
    at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859)
    at java.util.ArrayList$Itr.next(ArrayList.java:831)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516)
{noformat}

Full stack trace in the attached file.

We should guard against that by using a thread-safe alternative such as 
java.util.concurrent.CopyOnWriteArrayList.
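
A minimal sketch of the proposed guard (illustrative class and field names, not 
the actual FSLeafQueue code): with a CopyOnWriteArrayList one thread can iterate 
over the apps, for example to sum resource usage, while another thread adds or 
removes apps, without hitting a ConcurrentModificationException:
{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class RunnableAppsSketch {
  // iteration works on a stable snapshot; every write copies the backing array
  private final List<String> runnableApps = new CopyOnWriteArrayList<>();

  void addApp(String appId) {
    runnableApps.add(appId);
  }

  int countApps() {
    int count = 0;
    for (String appId : runnableApps) {  // safe even if addApp runs concurrently
      count++;
    }
    return count;
  }
}
{code}
The trade-off is that every write copies the whole array, which is acceptable 
when reads far outnumber writes.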






[jira] [Updated] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2014-11-26 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-2910:

Attachment: FSLeafQueue_concurrent_exception.txt
YARN-2910.patch

Full exception stack trace and patch






[jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2014-11-27 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227990#comment-14227990
 ] 

Wilfred Spiegelenburg commented on YARN-2910:
-

[~rohithsharma] Not a problem. I tried assigning it to myself but I do not seem 
to have the rights to do so. I don't know if you can assign it to me or whether 
I need to get some extra access.






[jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2014-12-03 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233006#comment-14233006
 ] 

Wilfred Spiegelenburg commented on YARN-2910:
-

I have the code change done with all the synchronisation around the for loops. 
Based on the javadoc, all iterator access of the {{Collections.synchronizedList}} 
needs to be synchronised, which might impact the performance as much as, or 
worse than, the copy-on-write approach.
The junit test is almost done and I will update the patch when it is finished.
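
A small sketch of the point above (illustrative names only): per the javadoc, 
iterating a {{Collections.synchronizedList}} still requires manually 
synchronising on the list, so every read loop takes the same lock as the writers:
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SynchronizedListSketch {
  private final List<Integer> usages =
      Collections.synchronizedList(new ArrayList<Integer>());

  void add(int usage) {
    usages.add(usage);               // already synchronised internally
  }

  int total() {
    int sum = 0;
    synchronized (usages) {          // required by the javadoc while iterating
      for (int usage : usages) {
        sum += usage;
      }
    }
    return sum;
  }
}
{code}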






[jira] [Updated] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2014-12-05 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-2910:

Attachment: YARN-2910.1.patch

Updated patch with the changes as discussed and a junit test






[jira] [Updated] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2014-12-08 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-2910:

Attachment: YARN-2910.4.patch

I did not change the assignment :-(

Yes, the {{when(schedulable.getResourceUsage()).thenReturn(smallResource);}} 
should not have been in the patch, my mistake. I am not sure how that ended up 
in the patch; I used it during development but not in the last tests.

On my machine the test failed with just adding applications. The issue seems to 
be in the initialisation of the application attempt. When I added debugging to 
the test run I could see the initialisation of the app attempt in the mock 
taking up a lot of time, which meant that {{getResourceUsage}} almost always ran 
over an empty list unless the number of iterations was raised above 1000. As 
soon as I moved the creation out of the thread, the failure occurred within 5 
iterations of the {{getResourceUsage}} call in the second thread after adding 
fewer than 15 or so app instances.

I have attached an updated patch which passes with the new code and has a 100% 
failure rate with the old code. This version of the test runs faster and is 
more reliable than the previous ones.
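
A toy illustration of the race the test relies on (not the actual junit test; 
names are made up): one thread keeps adding entries to a plain ArrayList while 
another iterates over it, and the iteration fails with a 
ConcurrentModificationException within a few passes once the list is being 
populated:
{code:java}
import java.util.ArrayList;
import java.util.List;

public class ArrayListRaceDemo {
  public static void main(String[] args) throws InterruptedException {
    final List<Integer> apps = new ArrayList<Integer>();
    Thread adder = new Thread(new Runnable() {
      public void run() {
        for (int i = 0; i < 1000000; i++) {
          apps.add(i);               // simulates apps being added to the queue
        }
      }
    });
    adder.start();
    try {
      while (adder.isAlive()) {
        int seen = 0;
        for (Integer app : apps) {   // simulates getResourceUsage iterating
          if (app != null) {
            seen++;
          }
        }
      }
    } catch (java.util.ConcurrentModificationException expected) {
      System.out.println("reproduced: " + expected);
    }
    adder.join();
  }
}
{code}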






[jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2014-12-08 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238007#comment-14238007
 ] 

Wilfred Spiegelenburg commented on YARN-2910:
-

The fix causes 
{{org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler}}
 to fail.
There is a deadlock created by the synchronised read access in the leaf queue 
for the {{runnableApps}}. If an app has two containers at different stages in 
the allocation, it can happen that the {{appAttempt}} is locked by one and the 
{{runnableApps}} by the second, causing the hang.

This is what I was afraid of when I mentioned the slowdown; I did not anticipate 
it being this bad, but the number of reads far outnumbers the writes.
The earlier proposed CopyOnWriteArrayList will also not work because of the sort 
that is called on the list (which I had overlooked) and which is not supported.






[jira] [Updated] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2014-12-08 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-2910:

Attachment: YARN-2910.5.patch

OK, a completely new approach. The other approaches did not work or did not fix 
it, so back to a simple lock and unlock around the read and write actions.

The locking is set up with fair distribution, which is almost a FIFO setup. This 
is not the default option and was chosen to make sure we do not cause a thread 
to be starved of the lock.
Multiple reads are allowed at the same time, and only one writer with no readers 
at the same time.

All junit tests pass in my local environment, including the ones that failed 
earlier.
As an extra change the {{synchronized}} has been removed from 
FSAppAttempt#getHeadRoom as discussed with [~kasha].
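
A minimal sketch of the locking scheme described above (illustrative names, not 
the actual FSLeafQueue change): a fair {{ReentrantReadWriteLock}} guards the 
list, so multiple readers can proceed concurrently, a writer gets exclusive 
access, and the fair (roughly FIFO) mode prevents starvation:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class GuardedAppListSketch {
  private final List<String> runnableApps = new ArrayList<String>();
  // fair = true: waiting threads acquire the lock roughly in FIFO order
  private final ReadWriteLock rwLock = new ReentrantReadWriteLock(true);

  void addApp(String appId) {
    rwLock.writeLock().lock();
    try {
      runnableApps.add(appId);
    } finally {
      rwLock.writeLock().unlock();
    }
  }

  int getNumRunnableApps() {
    rwLock.readLock().lock();
    try {
      return runnableApps.size();
    } finally {
      rwLock.readLock().unlock();
    }
  }
}
{code}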






[jira] [Updated] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2014-12-08 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-2910:

Attachment: YARN-2910.6.patch

Updated patch with try/finally clauses






[jira] [Updated] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2014-12-08 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-2910:

Attachment: YARN-2910.7.patch

One final update to shorten the time we keep the lock and make sure we do the 
least amount of work while holding the write lock.
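
A generic illustration of that principle (not the actual patch): prepare 
everything outside the lock and keep only the cheap mutation inside the 
write-lock section:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ShortWriteSectionSketch {
  private final List<String> apps = new ArrayList<String>();
  private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock(true);

  void addAll(List<String> incoming) {
    // any expensive preparation happens before the lock is taken
    List<String> prepared = new ArrayList<String>(incoming);
    rwLock.writeLock().lock();
    try {
      apps.addAll(prepared);   // only the minimal mutation runs under the lock
    } finally {
      rwLock.writeLock().unlock();
    }
  }
}
{code}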






[jira] [Updated] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2014-12-08 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-2910:

Attachment: YARN-2910.8.patch

Cleanup of spurious imports which should not be there






[jira] [Created] (YARN-3350) YARN RackResolver spams logs with messages at info level

2015-03-15 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-3350:
---

 Summary: YARN RackResolver spams logs with messages at info level
 Key: YARN-3350
 URL: https://issues.apache.org/jira/browse/YARN-3350
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.6.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


When you run an application the container logs show a lot of messages from the 
RackResolver:

2015-03-10 00:58:30,483 INFO [RMCommunicator Allocator] 
org.apache.hadoop.yarn.util.RackResolver: Resolved node175.example.com to 
/rack15

In a real-world example, a large job was generating 20+ messages in 2 
milliseconds during a sustained period of time, flooding the logs and causing 
the node to run out of disk space.





[jira] [Updated] (YARN-3350) YARN RackResolver spams logs with messages at info level

2015-03-15 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-3350:

Attachment: yarn-RackResolver-log.txt

Extract from a log which shows the messages logged by the RackResolver






[jira] [Updated] (YARN-3350) YARN RackResolver spams logs with messages at info level

2015-03-15 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-3350:

Attachment: YARN-3350.patch






[jira] [Commented] (YARN-3350) YARN RackResolver spams logs with messages at info level

2015-03-15 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362771#comment-14362771
 ] 

Wilfred Spiegelenburg commented on YARN-3350:
-

No tests for this change: it is just a simple log level change without any 
further code changes.

BTW: this same issue was encountered by Spark, as an application using the 
RackResolver, and they changed the log level on their side to prevent log 
flooding via SPARK-5393.






[jira] [Updated] (YARN-3350) YARN RackResolver spams logs with messages at info level

2015-03-16 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-3350:

Attachment: YARN-3350.2.patch

Updated patch wrapping the log statement in an if-debug check
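
For illustration (assumed logger type, class name and message, not the exact 
patch): the resolution message is logged at DEBUG and guarded so the string is 
only built when debug logging is enabled:
{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

class RackResolverLoggingSketch {
  private static final Log LOG = LogFactory.getLog(RackResolverLoggingSketch.class);

  static String resolved(String hostName, String rackName) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Resolved " + hostName + " to " + rackName);
    }
    return rackName;
  }
}
{code}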






[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-02-25 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291339#comment-17291339
 ] 

Wilfred Spiegelenburg commented on YARN-10652:
--

The change looks good, +1 (binding).

I'll let it sit for a day or so for other people to have a look at this too. I 
will commit if there are no comments in the next day or so.

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down a similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html






[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-02-28 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292620#comment-17292620
 ] 

Wilfred Spiegelenburg commented on YARN-10652:
--

I do not see the relation with placement rules or the FS for this fix at all.

The weight can be used with or without a placement rule. It is a setting on a 
queue. Having this setting on a user based queue also does not really make 
sense: it gives more resources to one user over the others in the queue. So I 
can see this being used on a parent queue or a group queue, but not on an 
individual's leaf queue. On top of that, the queue path of the configuration 
setting is not affected by this change. We're talking about this setting:
{code:java}
yarn.scheduler.capacity.<queue-path>.user-settings.<username>.weight{code}
The weight is retrieved for a specific queue as defined in the _<queue-path>_. 
That part is already resolved and is not changed. The only resolution that is 
changed is the _<username>_ part between _user-settings._ and _.weight_. The 
queue path could be anything and is not in play here. It could even be a fixed 
configured queue or one mapped on a group name.

The administrator should need to know as little as possible, preferably nothing, 
about the internals of storing users in the CS. If the queue mapping rule for 
the user changes the dots to make the user name a single part of the queue path, 
then that is independent of this change. It still does not change the way the 
user is stored in the CS. It changes the way you map a user to a queue in the 
placement rules.

On the FS side we thought about standardising dot usage. We considered both 
cases, using and not using a _dot_ in user names in the config files. When I 
looked at it I was not sure which was the correct solution. It could lead to 
strange behaviour and extra administrative work: the admin forgets to remove the 
dot and all of a sudden the config does not apply. That is why it never went 
further than just the Jira YARN-5674.

With this change as proposed you will support weights for all users with a dot 
in their name, except for a user whose name is of the form _<something>.weight_. 
That will be the only set of users that breaks, which is far less than breaking 
all users with a dot in the username. I do not see any other <username>-bound 
properties in the configuration at the moment.

If you want to solve the generic dot issue for user based placement then that 
is outside of this change.
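
A toy illustration of the resolution change discussed in this jira (not the 
actual CapacitySchedulerConfiguration code): with {{\w+}} the user part of the 
key cannot contain a dot, while {{\S+}} accepts {{firstname.lastname}}:
{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UserWeightKeyDemo {
  public static void main(String[] args) {
    String key =
        "yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight";
    Pattern wordOnly = Pattern.compile("user-settings\\.(\\w+)\\.weight");
    Pattern nonSpace = Pattern.compile("user-settings\\.(\\S+)\\.weight");

    Matcher m1 = wordOnly.matcher(key);
    Matcher m2 = nonSpace.matcher(key);
    // \w+ cannot span the dot in "firstname.lastname", so no match is found
    System.out.println("\\w+ matches: " + m1.find());
    // \S+ matches and captures the full dotted user name
    System.out.println("\\S+ user: " + (m2.find() ? m2.group(1) : "none"));
  }
}
{code}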







[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-03-07 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297089#comment-17297089
 ] 

Wilfred Spiegelenburg commented on YARN-10652:
--

I completely agree with your assessment [~pbacsko]. This is nowhere near a full 
fix for the dot problem at all. That needs to be tackled one issue at a time. 
We should not do all of it now. We can take multiple jiras to fix these issues.

I thus second the case-by-case solution approach. Fixing this one outside of the placement rule changes is one step. Introducing a standard way of property resolution for all properties that use the user name would be a *nice to have*, but it is not needed now. The property introduced for max apps resolves without an issue even with dots in the name.

Placement rules are complex; I would not recommend that this Jira look at them at all.

[~snemeth] & [~shuzirra]: since we need to fix this irrespective of what is done in the placement rules, I would like to proceed with the commit for this. The change allows the administrator to simply use the existing user name in the configuration, similar to the "max-parallel-apps" setting. When and if a solution is implemented for the placement rules to support dots in user and group names, which are part of the queue path, new fixes might be needed for this issue and YARN-9930. We might even leave these two jiras as is. That is not a decision we need to make now.
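
For reference, this is roughly what the administrator-facing configuration looks like for a user with a dot in the name. The weight key comes from this Jira; the exact shape of the per-user "max-parallel-apps" key is my recollection of YARN-9930 and should be checked against the docs before relying on it.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class DottedUserSettingsExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);

    // Per-user weight on root.default; the user name keeps its dot.
    conf.setFloat(
        "yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight",
        0.76f);

    // Per-user max parallel apps (key shape assumed, see the note above).
    conf.setInt(
        "yarn.scheduler.capacity.user.firstname.lastname.max-parallel-apps",
        10);

    System.out.println(conf.get(
        "yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight"));
  }
}
{code}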

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-03-16 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302978#comment-17302978
 ] 

Wilfred Spiegelenburg commented on YARN-10652:
--

Thank you to [~sahuja] for the fix, and to all ([~snemeth] , [~shuzirra] , 
[~gandras] & [~pbacsko]) for the discussion and resolution around this jira.

I committed to trunk with a comment in the commit message:
{quote}This only fixes the user name resolution for weights in the queues. It 
does not add generic support for user names with dots in all use cases in the 
capacity scheduler.
{quote}

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7769) FS QueueManager should not create default queue at init

2021-04-12 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319902#comment-17319902
 ] 

Wilfred Spiegelenburg commented on YARN-7769:
-

No problem with you taking over.

Let me know if you need me to review anything for you.

> FS QueueManager should not create default queue at init
> ---
>
> Key: YARN-7769
> URL: https://issues.apache.org/jira/browse/YARN-7769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Benjamin Teke
>Priority: Major
>
> Currently the FairScheduler QueueManager automatically creates the default 
> queue. However the default queue does not need to exist. We have two possible 
> cases which we should handle:
> * Based on the placement rule "Default" the name for the default queue might 
> not be default and it should be created with a different name
> * There might not be a "Default" placement rule at all which removes the need 
> to create the queue.
> We should leave the creation of the default queue to the point in time that 
> we can assess if it is needed or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7769) FS QueueManager should not create default queue at init

2021-04-13 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320581#comment-17320581
 ] 

Wilfred Spiegelenburg commented on YARN-7769:
-

These test failures are all related to the change. Please check: the tests assumed a default queue to be available, and it no longer is.

> FS QueueManager should not create default queue at init
> ---
>
> Key: YARN-7769
> URL: https://issues.apache.org/jira/browse/YARN-7769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-7769.001.patch
>
>
> Currently the FairScheduler QueueManager automatically creates the default 
> queue. However the default queue does not need to exist. We have two possible 
> cases which we should handle:
> * Based on the placement rule "Default" the name for the default queue might 
> not be default and it should be created with a different name
> * There might not be a "Default" placement rule at all which removes the need 
> to create the queue.
> We should leave the creation of the default queue to the point in time that 
> we can assess if it is needed or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7769) FS QueueManager should not create default queue at init

2021-04-19 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325396#comment-17325396
 ] 

Wilfred Spiegelenburg commented on YARN-7769:
-

Code change looks good. This does however also impact the 
[documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html]:
 * {{By default, all users share a single queue, named “default”.}}
 * {{user-as-default-queue}} says it falls back to the default queue
 * {{allow-undeclared-pools}} is another point that mentions the default queue

The last impact is on the default placement rule. I am not sure if that is just a documentation change or needs more.
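
For context, these are the two yarn-site settings behind the second and third bullets above. A quick reminder sketch only, to show where the documented fallback to the "default" queue is referenced; the values are illustrative, not a recommendation.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class FairSchedulerDefaultQueueSettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);

    // If true, apps submitted without a queue go to a queue named after the user;
    // the documentation says the fallback is otherwise the "default" queue.
    conf.setBoolean("yarn.scheduler.fair.user-as-default-queue", true);

    // If false, apps may only go to queues declared in the allocation file;
    // the documented fallback for undeclared queues is again the "default" queue.
    conf.setBoolean("yarn.scheduler.fair.allow-undeclared-pools", false);

    System.out.println(conf.getBoolean("yarn.scheduler.fair.user-as-default-queue", true));
  }
}
{code}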

> FS QueueManager should not create default queue at init
> ---
>
> Key: YARN-7769
> URL: https://issues.apache.org/jira/browse/YARN-7769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-7769.001.patch, YARN-7769.002.patch, 
> YARN-7769.003.patch
>
>
> Currently the FairScheduler QueueManager automatically creates the default 
> queue. However the default queue does not need to exist. We have two possible 
> cases which we should handle:
> * Based on the placement rule "Default" the name for the default queue might 
> not be default and it should be created with a different name
> * There might not be a "Default" placement rule at all which removes the need 
> to create the queue.
> We should leave the creation of the default queue to the point in time that 
> we can assess if it is needed or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7769) FS QueueManager should not create default queue at init

2021-04-25 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331726#comment-17331726
 ] 

Wilfred Spiegelenburg commented on YARN-7769:
-

Are we pushing the documentation in this Jira or are we opening up a new one? If we do a new Jira to fix the docs we're OK to commit; otherwise we need to get the documentation update added in this Jira before we commit.

[~bteke] & [~snemeth] any preference from your side?

> FS QueueManager should not create default queue at init
> ---
>
> Key: YARN-7769
> URL: https://issues.apache.org/jira/browse/YARN-7769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-7769.001.patch, YARN-7769.002.patch, 
> YARN-7769.003.patch
>
>
> Currently the FairScheduler QueueManager automatically creates the default 
> queue. However the default queue does not need to exist. We have two possible 
> cases which we should handle:
> * Based on the placement rule "Default" the name for the default queue might 
> not be default and it should be created with a different name
> * There might not be a "Default" placement rule at all which removes the need 
> to create the queue.
> We should leave the creation of the default queue to the point in time that 
> we can assess if it is needed or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7769) FS QueueManager should not create default queue at init

2021-04-26 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332842#comment-17332842
 ] 

Wilfred Spiegelenburg commented on YARN-7769:
-

OK, in that case we're ready to commit. +1 from my side.

[~snemeth]: I do think that we need to add a release note for this as it is a difference in behaviour. Do we also need to make sure that YARN-8951 works, or at least does not throw an NPE and take down the RM, before we commit this?

> FS QueueManager should not create default queue at init
> ---
>
> Key: YARN-7769
> URL: https://issues.apache.org/jira/browse/YARN-7769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-7769.001.patch, YARN-7769.002.patch, 
> YARN-7769.003.patch
>
>
> Currently the FairScheduler QueueManager automatically creates the default 
> queue. However the default queue does not need to exist. We have two possible 
> cases which we should handle:
> * Based on the placement rule "Default" the name for the default queue might 
> not be default and it should be created with a different name
> * There might not be a "Default" placement rule at all which removes the need 
> to create the queue.
> We should leave the creation of the default queue to the point in time that 
> we can assess if it is needed or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8470) Fair scheduler exception with SLS

2020-01-14 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015344#comment-17015344
 ] 

Wilfred Spiegelenburg commented on YARN-8470:
-

[~Steven Rand] this has been fixed via YARN-9984 and is in 3.2.2.

> Fair scheduler exception with SLS
> -
>
> Key: YARN-8470
> URL: https://issues.apache.org/jira/browse/YARN-8470
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Assignee: Szilard Nemeth
>Priority: Major
>
> I ran into the following exception with sls:
> 2018-06-26 13:34:04,358 ERROR resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> FSPreemptionThread, that exited unexpectedly: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptOnNode(FSPreemptionThread.java:207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptForOneContainer(FSPreemptionThread.java:161)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreempt(FSPreemptionThread.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.run(FSPreemptionThread.java:81)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8470) Fair scheduler exception with SLS

2020-01-14 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-8470.
-
  Assignee: Wilfred Spiegelenburg  (was: Szilard Nemeth)
Resolution: Duplicate

> Fair scheduler exception with SLS
> -
>
> Key: YARN-8470
> URL: https://issues.apache.org/jira/browse/YARN-8470
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>
> I ran into the following exception with sls:
> 2018-06-26 13:34:04,358 ERROR resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> FSPreemptionThread, that exited unexpectedly: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptOnNode(FSPreemptionThread.java:207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptForOneContainer(FSPreemptionThread.java:161)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreempt(FSPreemptionThread.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.run(FSPreemptionThread.java:81)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-14 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-7913:

Description: 
There are edge cases when the application recovery fails with an exception.

Example failure scenario:
 * setup: a queue is a leaf queue in the primary RM's config and the same queue 
is a parent queue in the secondary RM's config.
 * When failover happens with this setup, the recovery will fail for 
applications on this queue, and an APP_REJECTED event will be dispatched to the 
async dispatcher. On the same thread (that handles the recovery), a 
NullPointerException is thrown when the applicationAttempt is tried to be 
recovered 
(https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
 I don't see a good way to avoid the NPE in this scenario, because when the NPE 
occurs the APP_REJECTED has not been processed yet, and we don't know that the 
application recovery failed.

Currently the first exception will abort the recovery, and if there are X 
applications, there will be ~X passive -> active RM transition attempts - the 
passive -> active RM transition will only succeed when the last APP_REJECTED 
event is processed on the async dispatcher thread.


  was:
There are edge cases when the application recovery fails with an exception.

Example failure scenario:
 * setup: a queue is a leaf queue in the primary RM's config and the same queue 
is a parent queue in the secondary RM's config.
 * When failover happens with this setup, the recovery will fail for 
applications on this queue, and an APP_REJECTED event will be dispatched to the 
async dispatcher. On the same thread (that handles the recovery), a 
NullPointerException is thrown when the applicationAttempt is tried to be 
recovered 
(https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
 I don't see a good way to avoid the NPE in this scenario, because when the NPE 
occurs the APP_REJECTED has not been processed yet, and we don't know that the 
application recovery failed.

Currently the first exception will abort the recovery, and if there are X 
applications, there will be ~X passive -> active RM transition attempts - the 
passive -> active RM transition will only succeed when the last APP_REJECTED 
event is processed on the async dispatcher thread.

_The point of this ticket is to improve the error handling and reduce the 
number of passive -> active RM transition attempts (solving the above described 
failure scenario isn't in scope)._


> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

--

[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-14 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015435#comment-17015435
 ] 

Wilfred Spiegelenburg commented on YARN-7913:
-

Thank you [~snemeth] for looking at the patch.
1) Yes, it is a different approach: we now do a proper restore instead of failing the app and/or crashing the RM.

2) Removed that line; it does not make sense to throw errors and fail the app but not fail the RM.

3) You are correct in your assumption about the flag. It is checked in the case 
you are mentioning. This is the full change in that area:
{code}
487 if (!isAppRecovering) {
488   rejectApplicationWithMessage(applicationId,
489       queueName + " is not a leaf queue");
490   return;
491 }
492 // app is recovering we do not want to fail the app now as it was there
493 // before we started the recovery. Add it to the recovery queue:
494 // dynamic queue directly under root, no ACL needed (auto clean up)
495 queueName = "root.recovery";
496 queue = queueMgr.getLeafQueue(queueName, true, applicationId);
{code}
The flag is already checked in line 487 of the change. When we get to the creation of the recovery queue (lines 492-496) we are recovering; if we are not recovering we have already rejected the app and left the method via the return in line 490.

4) will fix the comment

5) Yes, we only want to check newly added apps that are _not_ recovering. When we recover there is a good chance that there are no NMs registered yet. When we use a percentage of the resources for the maximum size of a queue, the size check of the AM resource will then fail. Since the AM was/is already running we only need to perform that check if we are not recovering the app.

6) The largest number of statements inside an if..else I can see is 6 lines. 
Not sure if refactoring the method will make it any clearer. I also don't see 
any consistency in the checks that would allow us to change the flow easily and 
improve clarity.

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-01-14 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015701#comment-17015701
 ] 

Wilfred Spiegelenburg commented on YARN-9879:
-

A per-queue flag looks very strange. I am also not sure it will help or add anything on top of having a global flag that just prevents the config change. Summarising the proposed solution: add a flag that prevents the admin from adding non-unique leaf queue names and thus fails the config change when they try.

The behaviour inside the scheduler must all be based on the full queue paths 
anyway. You cannot have one queue being addressed by the leaf name and the 
other by the path. The code complexity to do that would be enormous and lead to 
unsupportable code. That means that after the placement rule(s) are run and the 
app is placed everything must be based on a full path.

Placement rules throw up a totally different issue here. When we use placement 
rules we have one of two possible cases:
 * the rule generates a queue name and a parent queue name, i.e. a path
 * the rule generates just a leaf queue name

Which means that the rule can generate a leaf queue anywhere in the hierarchy 
without specifying a hierarchy. So no parent is set by the rule but the leaf 
queue generated could be located below a parent. With that last possibility we 
have the extra complexity in that the rules are not behaving consistently.
Example:
Two CS definitions to compare; both allow queue creation and overwriting of the submitted queue:
 # queues: root.parent.wilfred
 # queues: root

mapping rule defined: {{u:%user:%user}}

1) The user submitting the app is {{wilfred}}, the queue given on submission is default.
In CS config 1 we submit to the {{root.parent.wilfred}} queue, while in the second CS config we submit to the {{root.wilfred}} queue.

2) The user submitting the app is {{peter}}, the queue given on submission is default.
In both CS configs we submit to the {{root.peter}} queue.

With a different config at the CS level but the same rule we sometimes place the app in a sub queue and sometimes not; that is inconsistent.

I think rules even need to start taking this flag into account to preserve this 
inconsistent behaviour.
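
To make the inconsistency tangible, here is a toy model of the resolution behaviour described above. It is my own sketch, not the CapacityScheduler placement rule code, and the class and method names are invented.
{code:java}
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// A rule that only yields a leaf name resolves to an existing leaf anywhere in the
// hierarchy; otherwise the queue is created directly under root.
public class LeafNamePlacementSketch {
  static String place(String user, Set<String> existingQueuePaths) {
    String leaf = user;                       // mapping rule u:%user:%user
    for (String path : existingQueuePaths) {
      if (path.substring(path.lastIndexOf('.') + 1).equals(leaf)) {
        return path;                          // existing leaf with that name found
      }
    }
    return "root." + leaf;                    // auto-create under root
  }

  public static void main(String[] args) {
    Set<String> config1 =
        new HashSet<>(Arrays.asList("root", "root.parent", "root.parent.wilfred"));
    Set<String> config2 = new HashSet<>(Collections.singletonList("root"));

    System.out.println(place("wilfred", config1)); // root.parent.wilfred
    System.out.println(place("wilfred", config2)); // root.wilfred
    System.out.println(place("peter", config1));   // root.peter
    System.out.println(place("peter", config2));   // root.peter
  }
}
{code}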

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10087) ATS possible NPE on REST API when data is missing

2020-01-15 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YARN-10087:


 Summary: ATS possible NPE on REST API when data is missing
 Key: YARN-10087
 URL: https://issues.apache.org/jira/browse/YARN-10087
 Project: Hadoop YARN
  Issue Type: Bug
  Components: ATSv2
Reporter: Wilfred Spiegelenburg


If the data stored by the ATS is not complete, REST calls to the ATS can return an NPE instead of results.

{{{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException"}}}

The issue shows up when the ATS was down for a short period and in that time 
new applications were started. This causes certain parts of the application 
data to be missing in the ATS store. In most cases this is not a problem and 
data will be returned but when you start filtering data the filtering fails 
throwing the NPE.
 In this case the request was for: 
{{http://:8188/ws/v1/applicationhistory/apps?user=hive'}}

If certain pieces of data are missing the ATS should not even consider 
returning that data, filtered or not. We should not display partial or 
incomplete data.
 In case of the missing user information ACL checks cannot be correctly 
performed and we could see more issues.

A similar issue was fixed in YARN-7118 where the queue details were missing. It just _skips_ the app to prevent the NPE, but that is not the correct thing to do when the user is missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10087) ATS possible NPE on REST API when data is missing

2020-01-15 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-10087:
-
Attachment: ats_stack.txt

> ATS possible NPE on REST API when data is missing
> -
>
> Key: YARN-10087
> URL: https://issues.apache.org/jira/browse/YARN-10087
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Reporter: Wilfred Spiegelenburg
>Priority: Major
> Attachments: ats_stack.txt
>
>
> If the data stored by the ATS is not complete REST calls to the ATS can 
> return a NPE instead of results.
> {{{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException"}}}
> The issue shows up when the ATS was down for a short period and in that time 
> new applications were started. This causes certain parts of the application 
> data to be missing in the ATS store. In most cases this is not a problem and 
> data will be returned but when you start filtering data the filtering fails 
> throwing the NPE.
>  In this case the request was for: 
> {{http://:8188/ws/v1/applicationhistory/apps?user=hive'}}
> If certain pieces of data are missing the ATS should not even consider 
> returning that data, filtered or not. We should not display partial or 
> incomplete data.
>  In case of the missing user information ACL checks cannot be correctly 
> performed and we could see more issues.
> A similar issue was fixed in YARN-7118 where the queue details were missing. 
> It just _skips_ the app to prevent the NPE but that is not the correct thing 
> when the user is missing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10087) ATS possible NPE on REST API when data is missing

2020-01-15 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016530#comment-17016530
 ] 

Wilfred Spiegelenburg commented on YARN-10087:
--

Attached the NPE as logged in the ATS logs.

The logs point to these lines of code in the release that was running:
{code:java}
188 if (userQuery != null && !userQuery.isEmpty()) {
189   if (!appReport.getUser().equals(userQuery)) {
190     continue;
191   }
192 }
{code}
Here {{appReport.getUser()}} returns null for the incomplete application data, so the {{equals()}} call in line 189 throws the NPE.
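
A null-safe variation could look like the sketch below. This is only an illustration of the direction, not the fix for this Jira, and the helper method name is made up; whether such incomplete records should be skipped, rejected or handled differently is exactly the open question described above.
{code:java}
// Hedged sketch, not the committed fix: the helper name is invented for illustration.
static boolean matchesUserFilter(String reportUser, String userQuery) {
  if (reportUser == null) {
    // Incomplete ATS data: never match, so partial records are not returned.
    return false;
  }
  if (userQuery == null || userQuery.isEmpty()) {
    // No user filter requested.
    return true;
  }
  return reportUser.equals(userQuery);
}
{code}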

> ATS possible NPE on REST API when data is missing
> -
>
> Key: YARN-10087
> URL: https://issues.apache.org/jira/browse/YARN-10087
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Reporter: Wilfred Spiegelenburg
>Priority: Major
> Attachments: ats_stack.txt
>
>
> If the data stored by the ATS is not complete REST calls to the ATS can 
> return a NPE instead of results.
> {{{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException"}}}
> The issue shows up when the ATS was down for a short period and in that time 
> new applications were started. This causes certain parts of the application 
> data to be missing in the ATS store. In most cases this is not a problem and 
> data will be returned but when you start filtering data the filtering fails 
> throwing the NPE.
>  In this case the request was for: 
> {{http://:8188/ws/v1/applicationhistory/apps?user=hive'}}
> If certain pieces of data are missing the ATS should not even consider 
> returning that data, filtered or not. We should not display partial or 
> incomplete data.
>  In case of the missing user information ACL checks cannot be correctly 
> performed and we could see more issues.
> A similar issue was fixed in YARN-7118 where the queue details were missing. 
> It just _skips_ the app to prevent the NPE but that is not the correct thing 
> when the user is missing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-21 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-7913:

Attachment: YARN-7913-branch-3.2.001.patch
YARN-7913-branch-3.1.001.patch

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-7913-branch-3.1.001.patch, 
> YARN-7913-branch-3.2.001.patch, YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-21 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020242#comment-17020242
 ] 

Wilfred Spiegelenburg commented on YARN-7913:
-

added branch-3.1 and branch-3.2 patches

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-7913-branch-3.1.001.patch, 
> YARN-7913-branch-3.2.001.patch, YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-21 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020662#comment-17020662
 ] 

Wilfred Spiegelenburg commented on YARN-7913:
-

Test failures are not related, branch-3.2 fix seems to be good to go.

You might also need to trigger the branch-3.1 build, the backport is slightly 
different because of YARN-8248

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-7913-branch-3.1.001.patch, 
> YARN-7913-branch-3.2.001.patch, YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-01-22 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020931#comment-17020931
 ] 

Wilfred Spiegelenburg commented on YARN-9879:
-

I agree, {{getQueueName()}} should stay as is. We have a {{getQueuePath()}} already. Every CSQueue can already return both. We should change all non-external-facing calls that get the name of a queue to the path version. The only calls that can stay are the ones that provide their data in an externally viewable form (REST, UI or IPC), so as not to break compatibility.

I also do not see why we would need the ambiguous queue list. The queue is 
always unique when a path is used. It does not matter if the current leaf queue 
name uniqueness is enforced or not.
 Everything can always be found by its path. If I do not have a path I expect 
leaf queue uniqueness and can find the queue by just checking the part after 
the last _dot_ in the path.
 i.e.
 * queue paths defined as: root.parent.child1
 child queue unique flag is set
 find a queue with name: *child1* (no dots, expect leaf queue uniqueness) -> 
returns the queue correctly
 * add a queue defined as: root.otherparent.child1
 child queue unique flag is not set, allowed
 find a queue with name: *child1* (no dots, expect leaf queue uniqueness) -> 
returns an error

Internally we just store everything using the path. That would remove the whole problem of keeping things in sync and make the code consistent when combined with using the path everywhere internally.
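
The lookup contract above can be illustrated with a small toy class (my own sketch; the class and method names are invented and this is not how CSQueue/QueueManager is implemented):
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QueueLookupSketch {
  // Full path is the single source of truth.
  private final Map<String, String> queuesByPath = new HashMap<>();

  void addQueue(String fullPath) {
    queuesByPath.put(fullPath, fullPath);
  }

  /** A reference with dots is treated as a full path; a bare name expects leaf uniqueness. */
  String resolve(String ref) {
    if (ref.contains(".")) {
      return queuesByPath.get(ref);
    }
    List<String> matches = new ArrayList<>();
    for (String path : queuesByPath.keySet()) {
      if (path.substring(path.lastIndexOf('.') + 1).equals(ref)) {
        matches.add(path);
      }
    }
    if (matches.size() != 1) {
      throw new IllegalArgumentException("Ambiguous or unknown queue: " + ref);
    }
    return matches.get(0);
  }

  public static void main(String[] args) {
    QueueLookupSketch lookup = new QueueLookupSketch();
    lookup.addQueue("root.parent.child1");
    System.out.println(lookup.resolve("child1"));                   // unique leaf: resolved
    lookup.addQueue("root.otherparent.child1");
    System.out.println(lookup.resolve("root.otherparent.child1"));  // a path always works
    try {
      lookup.resolve("child1");                                     // now ambiguous: error
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
{code}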

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf, YARN-9879.POC001.patch
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8990) Fix fair scheduler race condition in app submit and queue cleanup

2020-01-28 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025566#comment-17025566
 ] 

Wilfred Spiegelenburg commented on YARN-8990:
-

I do think that is a good idea. The patch should still apply to branch-3.2 for both; if not, I can provide a branch-specific patch if needed, but we need a committer to check it in for us.

> Fix fair scheduler race condition in app submit and queue cleanup
> -
>
> Key: YARN-8990
> URL: https://issues.apache.org/jira/browse/YARN-8990
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Blocker
> Fix For: 3.2.0, 3.3.0
>
> Attachments: YARN-8990.001.patch, YARN-8990.002.patch
>
>
> With the introduction of the dynamic queue deletion in YARN-8191 a race 
> condition was introduced that can cause a queue to be removed while an 
> application submit is in progress.
> The issue occurs in {{FairScheduler.addApplication()}} when an application is 
> submitted to a dynamic queue which is empty or the queue does not exist yet. 
> If during the processing of the application submit the 
> {{AllocationFileLoaderService}} kicks of for an update the queue clean up 
> will be run first. The application submit first creates the queue and get a 
> reference back to the queue. 
> Other checks are performed and as the last action before getting ready to 
> generate an AppAttempt the queue is updated to show the submitted application 
> ID..
> The time between the queue creation and the queue update to show the submit 
> is long enough for the queue to be removed. The application however is lost 
> and will never get any resources assigned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10112) Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used with Fair Scheduler size based weights enabled

2020-01-29 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026400#comment-17026400
 ] 

Wilfred Spiegelenburg commented on YARN-10112:
--

This does not happen in the current releases of YARN anymore.

In YARN-7414 we moved the {{getAppWeight}} out of the scheduler into the {{FSAppAttempt}}. That did not solve the locking issue but was the right thing to do. In the follow-up YARN-7513 I removed the lock from the new call. I would say that this is thus a duplicate of the combination of YARN-7414 & YARN-7513.

Both are fixed in Hadoop 3.0.1 and 3.1. Backporting this change is possible.

> Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used 
> with Fair Scheduler size based weights enabled
> ---
>
> Key: YARN-10112
> URL: https://issues.apache.org/jira/browse/YARN-10112
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.5
>Reporter: Yu Wang
>Priority: Minor
>
> The user uses the FairScheduler, and yarn.scheduler.fair.sizebasedweight is 
> set true. From the ticket JStack thread dump from the support engineers, we 
> could see that the method getAppWeight below in the class of FairScheduler 
> was occupying the FairScheduler object monitor always, which made 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
>  always await of entering the same object monitor, thus resulting in the the 
> livelock.
>  
> The issue occurs very infrequently and we are still unable to figure out a 
> way to consistently reproduce the issue. The issue resembles to what the Jira 
> YARN-1458 reports, but it seems that code fix has taken into effect since 
> 2.6. 
>  
>  
> {code:java}
> "ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x7fbcee65e800 
> nid=0x2ea4 waiting for monitor entry [0x7fbcbcd5e000] 
> java.lang.Thread.State: BLOCKED (on object monitor) at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
>  - waiting to lock <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
>  at java.lang.Thread.run(Thread.java:748) 
> "FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 
> tid=0x7fbceea0e800 nid=0x2ea2 runnable [0x7fbcbcf6] 
> java.lang.Thread.State: RUNNABLE at java.lang.StrictMath.log1p(Native Method) 
> at java.lang.Math.log1p(Math.java:1747) at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
>  - locked <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
>  - locked <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop

[jira] [Assigned] (YARN-10112) Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used with Fair Scheduler size based weights enabled

2020-01-29 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg reassigned YARN-10112:


Assignee: Wilfred Spiegelenburg

> Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used 
> with Fair Scheduler size based weights enabled
> ---
>
> Key: YARN-10112
> URL: https://issues.apache.org/jira/browse/YARN-10112
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.5
>Reporter: Yu Wang
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>
> The user uses the FairScheduler, and yarn.scheduler.fair.sizebasedweight is 
> set true. From the ticket JStack thread dump from the support engineers, we 
> could see that the method getAppWeight below in the class of FairScheduler 
> was occupying the FairScheduler object monitor always, which made 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
>  always await of entering the same object monitor, thus resulting in the the 
> livelock.
>  
> The issue occurs very infrequently and we are still unable to figure out a 
> way to consistently reproduce the issue. The issue resembles to what the Jira 
> YARN-1458 reports, but it seems that code fix has taken into effect since 
> 2.6. 
>  
>  
> {code:java}
> "ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x7fbcee65e800 
> nid=0x2ea4 waiting for monitor entry [0x7fbcbcd5e000] 
> java.lang.Thread.State: BLOCKED (on object monitor) at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
>  - waiting to lock <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
>  at java.lang.Thread.run(Thread.java:748) 
> "FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 
> tid=0x7fbceea0e800 nid=0x2ea2 runnable [0x7fbcbcf6] 
> java.lang.Thread.State: RUNNABLE at java.lang.StrictMath.log1p(Native Method) 
> at java.lang.Math.log1p(Math.java:1747) at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
>  - locked <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
>  - locked <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10124) Remove restriction of ParentQueue capacity zero when childCapacities > 0

2020-02-11 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034266#comment-17034266
 ] 

Wilfred Spiegelenburg commented on YARN-10124:
--

Based on what I read here, you expect a stopped queue to not have a capacity and 
thus return 0 when calculating the correct distribution. If that is the use 
case, then why can't we implement exactly that? In other words: why not turn 
this around and only take the capacity of a queue into account when it is not in 
a stopped state? So you return 0 for all stopped queues. You do not have to go 
further than that.

There is no need to (re)calculate anything below the parent that is stopped (as 
that is all ignored), and turning the queue back on will trigger the existing 
settings to be applied again without further changes.
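As a rough sketch of the idea (illustrative only, made-up names, not the 
CapacityScheduler code):
{code:java}
import java.util.List;

// Sketch: a stopped queue keeps its configured capacity but contributes 0 to
// the parent's capacity distribution; nothing below it needs recalculating.
public class StoppedQueueCapacitySketch {
  enum QueueState { RUNNING, STOPPED }

  static final class Queue {
    final QueueState state;
    final float configuredCapacity; // percent of the parent, 0..100

    Queue(QueueState state, float configuredCapacity) {
      this.state = state;
      this.configuredCapacity = configuredCapacity;
    }

    float effectiveCapacity() {
      return state == QueueState.STOPPED ? 0f : configuredCapacity;
    }
  }

  static float sumChildCapacities(List<Queue> children) {
    float sum = 0f;
    for (Queue child : children) {
      sum += child.effectiveCapacity();
    }
    return sum;
  }

  public static void main(String[] args) {
    List<Queue> children = List.of(
        new Queue(QueueState.RUNNING, 60f),
        new Queue(QueueState.STOPPED, 40f)); // ignored while stopped
    System.out.println(sumChildCapacities(children)); // prints 60.0
  }
}
{code}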

> Remove restriction of ParentQueue capacity zero when childCapacities > 0
> 
>
> Key: YARN-10124
> URL: https://issues.apache.org/jira/browse/YARN-10124
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10124-001.patch
>
>
> ParentQueue capacity cannot be set to 0 when child capacities > 0. To disable 
> a parent queue temporarily, user can only STOP the queue but the capacity of 
> the queue cannot be used for other queues. Allowing 0 capacity for parent 
> queue will allow user to use the capacity for other queues and also to retain 
> the child queue capacity values. (else user has to set all child queue 
> capacities to 0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10124) Remove restriction of ParentQueue capacity zero when childCapacities > 0

2020-02-19 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039817#comment-17039817
 ] 

Wilfred Spiegelenburg commented on YARN-10124:
--

When you set a parent queue to a 0 capacity could that not leave the apps in 
the leaf queues below it in a state that they are not scheduled?
How does it affect the capacity calculation, i.e. is the queue now always below 
or above its (hard)limits?

> Remove restriction of ParentQueue capacity zero when childCapacities > 0
> 
>
> Key: YARN-10124
> URL: https://issues.apache.org/jira/browse/YARN-10124
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10124-001.patch
>
>
> ParentQueue capacity cannot be set to 0 when child capacities > 0. To disable 
> a parent queue temporarily, user can only STOP the queue but the capacity of 
> the queue cannot be used for other queues. Allowing 0 capacity for parent 
> queue will allow user to use the capacity for other queues and also to retain 
> the child queue capacity values. (else user has to set all child queue 
> capacities to 0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10124) Remove restriction of ParentQueue capacity zero when childCapacities > 0

2020-02-19 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039817#comment-17039817
 ] 

Wilfred Spiegelenburg edited comment on YARN-10124 at 2/19/20 9:11 AM:
---

When you set a parent queue to a 0 capacity could that not leave the apps in 
the leaf queues below it in a state that they are not scheduled?
How does it affect the capacity calculation, i.e. is the queue now always below 
or above its (hard)limits?

I am not against removing the limitation but I am thinking about the side 
effects it could have.


was (Author: wilfreds):
When you set a parent queue to a 0 capacity could that not leave the apps in 
the leaf queues below it in a state that they are not scheduled?
How does it affect the capacity calculation, i.e. is the queue now always below 
or above its (hard)limits?

> Remove restriction of ParentQueue capacity zero when childCapacities > 0
> 
>
> Key: YARN-10124
> URL: https://issues.apache.org/jira/browse/YARN-10124
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10124-001.patch
>
>
> ParentQueue capacity cannot be set to 0 when child capacities > 0. To disable 
> a parent queue temporarily, user can only STOP the queue but the capacity of 
> the queue cannot be used for other queues. Allowing 0 capacity for parent 
> queue will allow user to use the capacity for other queues and also to retain 
> the child queue capacity values. (else user has to set all child queue 
> capacities to 0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10124) Remove restriction of ParentQueue capacity zero when childCapacities > 0

2020-02-25 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044286#comment-17044286
 ] 

Wilfred Spiegelenburg commented on YARN-10124:
--

1: this point by itself sounds good, we want the child queues to get resources 
and run the apps until they finish.

2: if, when 0 is set for the parent, the queue is always above the hard limit, 
then how does point 1 work? I would have expected that the limit is enforced in 
the hierarchy. So can you please explain a bit more how this works, as it is not 
what I would expect.

2: (second time) any resource above the limit should be preemptable, so that 
lines up with the first point 2.

3: is expected. I would consider setting any limit on the root (min, max, etc.) 
a bug: the root is the whole cluster and should thus mirror what is available in 
the cluster without limits. Otherwise a cluster could never grow or shrink.

> Remove restriction of ParentQueue capacity zero when childCapacities > 0
> 
>
> Key: YARN-10124
> URL: https://issues.apache.org/jira/browse/YARN-10124
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10124-001.patch
>
>
> ParentQueue capacity cannot be set to 0 when child capacities > 0. To disable 
> a parent queue temporarily, user can only STOP the queue but the capacity of 
> the queue cannot be used for other queues. Allowing 0 capacity for parent 
> queue will allow user to use the capacity for other queues and also to retain 
> the child queue capacity values. (else user has to set all child queue 
> capacities to 0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10124) Remove restriction of ParentQueue capacity zero when childCapacities > 0

2020-02-25 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044996#comment-17044996
 ] 

Wilfred Spiegelenburg commented on YARN-10124:
--

Ah ok, that makes sense now.
Based on the testing you have done it looks like we're all set and the change 
does not have any side effects.

Two little nits that need to be fixed before we can commit this: can we clean up 
the message generation and remove the surplus string concatenation in these two 
spots:
{code}
196 throw new IllegalArgumentException("Illegal" + " capacity of "
{code}
and
{code}
229 "Illegal" + " capacity of " + sum + " for children of queue "
{code}
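That is, just merge the adjacent literals into one string, along these lines 
({{sum}} is taken from the second snippet; {{queueName}} is only a placeholder 
for whatever the real message appends):
{code:java}
throw new IllegalArgumentException("Illegal capacity of " + sum
    + " for children of queue " + queueName);
{code}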

Besides that: +1 

> Remove restriction of ParentQueue capacity zero when childCapacities > 0
> 
>
> Key: YARN-10124
> URL: https://issues.apache.org/jira/browse/YARN-10124
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10124-001.patch
>
>
> ParentQueue capacity cannot be set to 0 when child capacities > 0. To disable 
> a parent queue temporarily, user can only STOP the queue but the capacity of 
> the queue cannot be used for other queues. Allowing 0 capacity for parent 
> queue will allow user to use the capacity for other queues and also to retain 
> the child queue capacity values. (else user has to set all child queue 
> capacities to 0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10182) SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM

2020-03-05 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-10182.
--
Resolution: Fixed

JIRA is not the correct place to ask questions on how to set up or run things. 
Please use the mailing lists if you need help: u...@hadoop.apache.org

> SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM
> ---
>
> Key: YARN-10182
> URL: https://issues.apache.org/jira/browse/YARN-10182
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
> Environment: Cloudera Express 6.0.0
> RM1 :active RM2:standby
> kerberos is on
> yarn-site.xml : /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab ===>the keytab of yarn
> when I run slsrun.sh, I get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml, I get "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How do I resolve it?
>  
>Reporter: zhangyu
>Priority: Major
> Attachments: slsrun.log.txt
>
>
> RM1 :active RM2:standby
> kerberos is on
> yarn-site.xml : /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab ===>the keytab of yarn
> when I run slsrun.sh on RM1, I get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml, I get "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How do I resolve it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10063) Usage output of container-executor binary needs to include --http/--https argument

2020-03-12 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058334#comment-17058334
 ] 

Wilfred Spiegelenburg commented on YARN-10063:
--

Can you check this line:
{code}
"%11s launch docker container:%2d appid containerid workdir "
{code}
It removes the spacing before the {{%2d}} modifier, which is used to line up the 
numeric option with the rest of the commands. I don't think that should be 
removed.

The rest of the change looks OK.
Waiting for a build to run.

> Usage output of container-executor binary needs to include --http/--https 
> argument
> --
>
> Key: YARN-10063
> URL: https://issues.apache.org/jira/browse/YARN-10063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Attachments: YARN-10063.001.patch, YARN-10063.002.patch, 
> YARN-10063.003.patch
>
>
> YARN-8448/YARN-6586 seems to have introduced a new option - "\--http" 
> (default) and "\--https" that is possible to be passed in to the 
> container-executor binary, see :
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L564
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L521
> however, the usage output seems to have missed this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L74
> Raising this jira to improve this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10063) Usage output of container-executor binary needs to include --http/--https argument

2020-03-12 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058354#comment-17058354
 ] 

Wilfred Spiegelenburg commented on YARN-10063:
--

+1 pending build

> Usage output of container-executor binary needs to include --http/--https 
> argument
> --
>
> Key: YARN-10063
> URL: https://issues.apache.org/jira/browse/YARN-10063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Attachments: YARN-10063.001.patch, YARN-10063.002.patch, 
> YARN-10063.003.patch
>
>
> YARN-8448/YARN-6586 seems to have introduced a new option - "\--http" 
> (default) and "\--https" that is possible to be passed in to the 
> container-executor binary, see :
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L564
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L521
> however, the usage output seems to have missed this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L74
> Raising this jira to improve this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2020-03-17 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-9940.
-
Resolution: Not A Problem

This issue is fixed in later versions via YARN-8373. In the version this is 
logged against, the problem does not exist.

The custom code that caused the issue to show up is a mix of Hadoop 2.7 and 
Hadoop 2.9.
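For anyone hitting this on a custom mix of versions: the generic pattern behind 
the TimSort error is a comparator that reads state other threads mutate during 
the sort. A minimal illustration of the problem and the usual fix of 
snapshotting the sort key first follows; it shows the general technique only, 
not the literal YARN-8373 change, and all names are made up.
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class NodeSortSketch {
  static final class Node {
    final String id;
    final AtomicLong availableMB = new AtomicLong(); // mutated by heartbeats
    Node(String id, long mb) { this.id = id; availableMB.set(mb); }
  }

  // Unsafe: the comparator reads a value other threads can change mid-sort,
  // so the ordering can become inconsistent and TimSort throws
  // "Comparison method violates its general contract!".
  static final Comparator<Node> UNSAFE =
      (a, b) -> Long.compare(b.availableMB.get(), a.availableMB.get());

  // Safe: capture the sort key once per node, then sort the snapshot.
  static final class Snapshot {
    final Node node;
    final long availableMB;
    Snapshot(Node node, long availableMB) {
      this.node = node;
      this.availableMB = availableMB;
    }
  }

  static List<Node> sortBySnapshot(List<Node> nodes) {
    List<Snapshot> snapshots = new ArrayList<>();
    for (Node n : nodes) {
      snapshots.add(new Snapshot(n, n.availableMB.get()));
    }
    snapshots.sort(Comparator.comparingLong((Snapshot s) -> s.availableMB).reversed());
    List<Node> sorted = new ArrayList<>();
    for (Snapshot s : snapshots) {
      sorted.add(s.node);
    }
    return sorted;
  }

  public static void main(String[] args) {
    List<Node> nodes = new ArrayList<>();
    nodes.add(new Node("n1", 4096));
    nodes.add(new Node("n2", 8192));
    System.out.println(sortBySnapshot(nodes).get(0).id); // prints n2
  }
}
{code}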

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Assignee: kailiu_dev
>Priority: Major
> Attachments: YARN-9940-branch-2.7.2.001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10063) Usage output of container-executor binary needs to include --http/--https argument

2020-04-07 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-10063.
--
Fix Version/s: 3.4.0
   3.3.0
   Resolution: Fixed

Committed to 3.3.

Removed the backport from the 3.2 branch: YARN-6586 was not backported to 3.2.

> Usage output of container-executor binary needs to include --http/--https 
> argument
> --
>
> Key: YARN-10063
> URL: https://issues.apache.org/jira/browse/YARN-10063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Fix For: 3.3.0, 3.4.0
>
> Attachments: YARN-10063.001.patch, YARN-10063.002.patch, 
> YARN-10063.003.patch, YARN-10063.004.patch
>
>
> YARN-8448/YARN-6586 seems to have introduced a new option - "\--http" 
> (default) and "\--https" that is possible to be passed in to the 
> container-executor binary, see :
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L564
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L521
> however, the usage output seems to have missed this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L74
> Raising this jira to improve this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10063) Usage output of container-executor binary needs to include --http/--https argument

2020-04-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091044#comment-17091044
 ] 

Wilfred Spiegelenburg commented on YARN-10063:
--

Sorry for that, and thank you for reverting, [~weichiu]. I had reverted the fix 
after the commit as per the 
[comment|https://issues.apache.org/jira/browse/YARN-10063?focusedCommentId=17077798&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17077798]
 above, but I either did not push it or the push failed. 
I have the revert in my local branch-3.2 :-(

> Usage output of container-executor binary needs to include --http/--https 
> argument
> --
>
> Key: YARN-10063
> URL: https://issues.apache.org/jira/browse/YARN-10063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Fix For: 3.3.0, 3.4.0
>
> Attachments: YARN-10063.001.patch, YARN-10063.002.patch, 
> YARN-10063.003.patch, YARN-10063.004.patch
>
>
> YARN-8448/YARN-6586 seems to have introduced a new option - "\--http" 
> (default) and "\--https" that is possible to be passed in to the 
> container-executor binary, see :
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L564
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L521
> however, the usage output seems to have missed this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L74
> Raising this jira to improve this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10290) Resourcemanager recover failed when fair scheduler queue acl changed

2020-06-01 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-10290.
--
Resolution: Duplicate

This issue is fixed by YARN-7913.
That change fixes a number of issues around restores that fail.

The change was not backported to Hadoop 2.x.
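For anyone stuck on 2.x: the failure mode is the attempt recovery dereferencing 
an application that was rejected earlier in the recovery. Purely as an 
illustration of the kind of guard the recovery path needs (a sketch with 
hypothetical names, not the YARN-7913 patch):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AttemptRecoverySketch {
  // Stand-in for the scheduler's application map; the value type is elided.
  private final Map<String, Object> applications = new ConcurrentHashMap<>();

  void addApplicationAttempt(String appId, String attemptId) {
    Object application = applications.get(appId);
    if (application == null) {
      // The application never made it into the scheduler during recovery
      // (for example the queue ACL no longer allows the user), so skip the
      // attempt instead of throwing an NPE that takes the RM down.
      System.err.println("Skipping recovery of " + attemptId
          + ": application " + appId + " is not in the scheduler");
      return;
    }
    // ... normal attempt bookkeeping would happen here ...
  }

  public static void main(String[] args) {
    new AttemptRecoverySketch().addApplicationAttempt(
        "application_1590393162216_0005", "appattempt_1590393162216_0005_000001");
  }
}
{code}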

> Resourcemanager recover failed when fair scheduler queue acl changed
> 
>
> Key: YARN-10290
> URL: https://issues.apache.org/jira/browse/YARN-10290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: yehuanhuan
>Priority: Blocker
>
> Resourcemanager recovery fails when the fair scheduler queue acl has changed. 
> Because the queue acl changed, recovering the application (addApplication() in 
> fairscheduler) is rejected. Recovering the applicationAttempt 
> (addApplicationAttempt() in fairscheduler) then finds the Application is null. 
> This leads to both RMs ending up in standby. Reproduce as follows.
>  
> # user run a long running application.
> # change queue acl (aclSubmitApps) so that the user does not have permission.
> # restart the RM.
> {code:java}
> 2020-05-25 16:04:06,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating 
> application application_1590393162216_0005 with final state: FAILED
> 2020-05-25 16:04:06,192 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to 
> load/recover state
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:663)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1246)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:116)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1072)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1036)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:789)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:845)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:897)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:850)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:723)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:322)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:427)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1173)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:584)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:980)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1021)
> at 
> org.apa

[jira] [Commented] (YARN-10966) nodeUpdate will make NPE when node decomissioning trans to decomissed at same time

2021-10-20 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432150#comment-17432150
 ] 

Wilfred Spiegelenburg commented on YARN-10966:
--

I can see the issue at a slightly different point than we had seen it happening 
before. It is similar to what was seen in YARN-4677.

Instead of adding the RMNode to the method calls just to get the node ID, can we 
not pass in just the node ID? It is the only part needed for the further calls 
and it can be used for logging as well.

Can you also look at the other schedulers (Fifo and Fair)? The same test that 
you extended for the capacity scheduler also exists for those schedulers; we 
should not break them and should keep the tests aligned.
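Roughly what I mean, as a sketch with simplified stand-in types (not the actual 
AbstractYarnScheduler signatures):
{code:java}
public class NodeIdParameterSketch {
  // Simplified stand-ins for the real NodeId/ContainerId types.
  static final class NodeId {
    final String host;
    NodeId(String host) { this.host = host; }
  }
  static final class ContainerId {
    final long id;
    ContainerId(long id) { this.id = id; }
  }

  // Pass the NodeId the caller already has instead of the RMNode, so this
  // method never dereferences a node object that may have been removed
  // between the heartbeat and the event being processed.
  void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
    System.out.println("Container " + containerId.id + " launched on " + nodeId.host);
    // ... look up the live SchedulerNode by nodeId; if it is gone, log and return.
  }

  public static void main(String[] args) {
    new NodeIdParameterSketch()
        .containerLaunchedOnNode(new ContainerId(42), new NodeId("node-1"));
  }
}
{code}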

> nodeUpdate will make NPE  when node decomissioning trans to decomissed at 
> same time
> ---
>
> Key: YARN-10966
> URL: https://issues.apache.org/jira/browse/YARN-10966
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 3.1.1, 3.2.1, 3.3.1
>Reporter: tuyu
>Priority: Major
> Fix For: 3.1.1, 3.2.1
>
> Attachments: YARN-10966.001.patch
>
>
> [YARN-4677|https://issues.apache.org/jira/browse/YARN-4677] fixed a race 
> condition, but the fix is not complete: an NPE can still occur when 
> containerLaunchedOnNode calls node.getNodeID but the node is null 
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.containerLaunchedOnNode(AbstractYarnScheduler.java:366)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNewContainerInfo(AbstractYarnScheduler.java:1029)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.nodeUpdate(AbstractYarnScheduler.java:1130)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1938)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.testRemovedNodeDecomissioningNode
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

2024-05-21 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848072#comment-17848072
 ] 

Wilfred Spiegelenburg commented on YARN-11697:
--

The stack trace does not correspond to hadoop 3.2.1: 
[FairScheduler.java:757|https://github.com/apache/hadoop/blob/branch-3.2.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L757]

In stock hadoop 3.2.1 that line is part of completedContainerInternal, not 
removeApplicationAttempt:
{code:java}
755        application.containerCompleted(rmContainer, containerStatus, event);
756        if (node != null) {
757          node.releaseContainer(rmContainer.getContainerId(), false);
758        } else if (LOG.isDebugEnabled()) {
759          LOG.debug("Skipping container release on removed node: " + nodeID);
760        } {code}
The comment in moveApplication around locking the app attempt is about 
scheduling: an application could be scheduled while being moved and that needs 
to be prevented. The removal of an application attempt takes a write lock on the 
scheduler itself, the same as the move does, so a moveApplication and a 
removeApplicationAttempt cannot happen at the same time. They both need that 
lock and are serialised.

I think you are looking at the wrong thing and a move is not involved.
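To make the locking argument concrete, a stripped-down sketch of the pattern 
(illustrative only, not the actual FairScheduler code):
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SchedulerLockSketch {
  private final ReentrantReadWriteLock schedulerLock = new ReentrantReadWriteLock();

  void removeApplicationAttempt(String attemptId) {
    schedulerLock.writeLock().lock();
    try {
      // remove the attempt from its queue ...
    } finally {
      schedulerLock.writeLock().unlock();
    }
  }

  void moveApplication(String appId, String targetQueue) {
    schedulerLock.writeLock().lock();
    try {
      // The additional lock taken on the attempt itself in the real code only
      // fences off concurrent *scheduling* of that attempt; serialisation with
      // removeApplicationAttempt comes from this shared scheduler write lock.
      // move the app between queues ...
    } finally {
      schedulerLock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    SchedulerLockSketch scheduler = new SchedulerLockSketch();
    scheduler.moveApplication("application_1", "root.other");
    scheduler.removeApplicationAttempt("appattempt_1");
  }
}
{code}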

> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> -
>
> Key: YARN-11697
> URL: https://issues.apache.org/jira/browse/YARN-11697
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with 
> the following exception
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove 
> appattempt_1706879498319_86660_01 Alloc:  does not 
> exist in queue [root, demand=, 
> running=, share=, w=1.0]
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks 
> like there are still some edge cases not covered by YARN-5136.
> 1. On a deeper look, I could see that, as mentioned in the comment here, if a 
> call for a moveApplication and a removeApplicationAttempt for the same attempt 
> are processed in short succession, the application attempt will still contain 
> a queue reference but is already removed from the list of applications for 
> the queue.
> 2. This can happen when 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
>  removes the appAttempt from the queue and 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
>  also tries to remove the same appAttempt from the queue.
> 3. On further checking, i could see that before doing 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
>  writeLock on appAttempt is taken where as for 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
>  , i don't see any writelock being taken which can result in race condition 
> if same appAttempt is being processed.
> 

[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

2024-05-21 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848123#comment-17848123
 ] 

Wilfred Spiegelenburg commented on YARN-11697:
--

You need to figure out why you get two remove events in a row for the same 
application. This code has not changed in multiple years. If this was really a 
big issue we should have seen it happen more often and years ago.

Try to reproduce without the backports and see if it still happens. You might 
have backported things that are not compatible, which can cause side effects.

> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> -
>
> Key: YARN-11697
> URL: https://issues.apache.org/jira/browse/YARN-11697
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with 
> the following exception
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove 
> appattempt_1706879498319_86660_01 Alloc:  does not 
> exist in queue [root, demand=, 
> running=, share=, w=1.0]
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks 
> like there are still some edge cases not covered by YARN-5136.
> 1. On a deeper look, I could see that, as mentioned in the comment here, if a 
> call for a moveApplication and a removeApplicationAttempt for the same attempt 
> are processed in short succession, the application attempt will still contain 
> a queue reference but is already removed from the list of applications for 
> the queue.
> 2. This can happen when 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
>  removes the appAttempt from the queue and 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
>  also tries to remove the same appAttempt from the queue.
> 3. On further checking, i could see that before doing 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
>  writeLock on appAttempt is taken where as for 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
>  , i don't see any writelock being taken which can result in race condition 
> if same appAttempt is being processed.
> 4. Additionally as mentioned in the comment here when such scenario occurs 
> ideally we should not take down RM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


