[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-10 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581553#comment-14581553
 ] 

zhihai xu commented on YARN-3795:
-

Hi [~lachisis], thanks for reporting this issue.
Most likely, the broken pipe is due to a Len error at the ZooKeeper server.
To confirm this, could you check the ZooKeeper server logs to see whether you
can find the following entry:
{code}
WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of 
session 0x due to java.io.IOException: Len error ???
{code}

You can work around the Len error by increasing the jute.maxbuffer size on the
ZooKeeper server, or you can try YARN-3469.
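For reference, jute.maxbuffer is a JVM system property, so it has to be raised on every ZooKeeper server in the ensemble (and, ideally, on the ResourceManager's ZooKeeper client side as well), followed by a restart of those processes. A rough sketch of how this is usually wired up; the file names and the 4 MB value below are only illustrative and depend on your distribution:
{code}
# ZooKeeper servers (zookeeper-env.sh / environment picked up by zkServer.sh)
export SERVER_JVMFLAGS="-Djute.maxbuffer=4194304 ${SERVER_JVMFLAGS}"

# ResourceManager, the ZooKeeper client side (yarn-env.sh)
export YARN_RESOURCEMANAGER_OPTS="-Djute.maxbuffer=4194304 ${YARN_RESOURCEMANAGER_OPTS}"
{code}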
 


[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-10 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581537#comment-14581537
 ] 

lachisis commented on YARN-3795:


But I think most of these watchers in ZKRMStateStore are not necessary.




-

[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-10 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581534#comment-14581534
 ] 

lachisis commented on YARN-3795:


It would be better if ZooKeeper fixed ZOOKEEPER-706.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-10 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581531#comment-14581531
 ] 

lachisis commented on YARN-3795:


I have found ZOOKEEPER-706: if the ZooKeeper server receives a request whose body
is larger than 1 MB, the server rejects the request and the client sees the
"Broken pipe" exception. This limit is meant to cap the body size of a znode.

By scanning the ZooKeeper snapshot, I did not find any znode created by
ZKRMStateStore with a large data size.
Then, analyzing the code, I found that large numbers of watchers are set when
"loadRMAppState" and "loadApplicationAttemptState" are called.




[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-10 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581517#comment-14581517
 ] 

lachisis commented on YARN-3795:


This exception appeared two days ago on a YARN platform.
There are about 7000+ history jobs in the RM store. At some point, the active
ResourceManager detected a session expiry and transitioned to standby.
Meanwhile, the standby ResourceManager started to transition to active, but threw
the exception attached above.


[jira] [Created] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-10 Thread lachisis (JIRA)
lachisis created YARN-3795:
--

 Summary: ZKRMStateStore crashes due to IOException: Broken pipe
 Key: YARN-3795
 URL: https://issues.apache.org/jira/browse/YARN-3795
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical


2015-06-05 06:06:54,848 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to dap88/134.41.33.88:2181, initiating session
2015-06-05 06:06:54,876 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server dap88/134.41.33.88:2181, sessionid = 
0x34db2f72ac50c86, negotiated timeout = 1
2015-06-05 06:06:54,881 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher 
event type: None with state:SyncConnected for path:null for Service 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:54,881 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
ZKRMStateStore Session connected
2015-06-05 06:06:54,881 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
ZKRMStateStore Session restored
2015-06-05 06:06:54,881 WARN org.apache.zookeeper.ClientCnxn: Session 
0x34db2f72ac50c86 for server dap88/134.41.33.88:2181, unexpected error, closing 
socket connection and attempting reconnect
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075)
2015-06-05 06:06:54,986 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher 
event type: None with state:Disconnected for path:null for Service 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:54,986 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
ZKRMStateStore Session disconnected
2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server dap87/134.41.33.87:2181. Will not attempt to authenticate 
using SASL (unknown error)
2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to dap87/134.41.33.87:2181, initiating session
2015-06-05 06:06:55,330 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server dap87/134.41.33.87:2181, sessionid = 
0x34db2f72ac50c86, negotiated timeout = 1
2015-06-05 06:06:55,343 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher 
event type: None with state:SyncConnected for path:null for Service 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:55,343 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
ZKRMStateStore Session connected
2015-06-05 06:06:55,344 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
ZKRMStateStore Session restored
2015-06-05 06:06:55,345 WARN org.apache.zookeeper.ClientCnxn: Session 
0x34db2f72ac50c86 for server dap87/134.41.33.87:2181, unexpected error, closing 
socket connection and attempting reconnect
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference

2015-06-10 Thread Chengbing Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengbing Liu updated YARN-3794:

Attachment: YARN-3794.01.patch

> TestRMEmbeddedElector fails because of ambiguous LOG reference
> --
>
> Key: YARN-3794
> URL: https://issues.apache.org/jira/browse/YARN-3794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.7.0
>Reporter: Chengbing Liu
>Assignee: Chengbing Liu
> Attachments: YARN-3794.01.patch
>
>
> After YARN-2921, {{MockRM}} also has a {{LOG}} field. Therefore, {{LOG}} in 
> the following code snippet is ambiguous.
> {code}
> protected AdminService createAdminService() {
>   return new AdminService(MockRMWithElector.this, getRMContext()) {
> @Override
> protected EmbeddedElectorService createEmbeddedElectorService() {
>   return new EmbeddedElectorService(getRMContext()) {
> @Override
> public void becomeActive() throws
> ServiceFailedException {
>   try {
> callbackCalled.set(true);
> LOG.info("Callback called. Sleeping now");
> Thread.sleep(delayMs);
> LOG.info("Sleep done");
>   } catch (InterruptedException e) {
> e.printStackTrace();
>   }
>   super.becomeActive();
> }
>   };
> }
>   };
> }
> {code}
> Eclipse gives the following error:
> {quote}
> The field LOG is defined in an inherited type and an enclosing scope
> {quote}
> IMO, we should fix this as {{TestRMEmbeddedElector.LOG}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference

2015-06-10 Thread Chengbing Liu (JIRA)
Chengbing Liu created YARN-3794:
---

 Summary: TestRMEmbeddedElector fails because of ambiguous LOG 
reference
 Key: YARN-3794
 URL: https://issues.apache.org/jira/browse/YARN-3794
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu


After YARN-2921, {{MockRM}} also has a {{LOG}} field. Therefore, {{LOG}} in the 
following code snippet is ambiguous.
{code}
protected AdminService createAdminService() {
  return new AdminService(MockRMWithElector.this, getRMContext()) {
@Override
protected EmbeddedElectorService createEmbeddedElectorService() {
  return new EmbeddedElectorService(getRMContext()) {
@Override
public void becomeActive() throws
ServiceFailedException {
  try {
callbackCalled.set(true);
LOG.info("Callback called. Sleeping now");
Thread.sleep(delayMs);
LOG.info("Sleep done");
  } catch (InterruptedException e) {
e.printStackTrace();
  }
  super.becomeActive();
}
  };
}
  };
}
{code}
Eclipse gives the following error:
{quote}
The field LOG is defined in an inherited type and an enclosing scope
{quote}

IMO, we should fix this as {{TestRMEmbeddedElector.LOG}}
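If the fix is simply to qualify the reference, the affected lines inside the anonymous {{EmbeddedElectorService}} would become something like this (a sketch, not the actual patch):
{code}
try {
  callbackCalled.set(true);
  // Qualify LOG so it no longer clashes with the LOG field inherited via MockRM.
  TestRMEmbeddedElector.LOG.info("Callback called. Sleeping now");
  Thread.sleep(delayMs);
  TestRMEmbeddedElector.LOG.info("Sleep done");
} catch (InterruptedException e) {
  e.printStackTrace();
}
{code}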



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3785) Support for Resource as an argument during submitApp call in MockRM test class

2015-06-10 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581452#comment-14581452
 ] 

Xuan Gong commented on YARN-3785:
-

Committed into trunk/branch-2. Thanks, [~sunilg]

> Support for Resource as an argument during submitApp call in MockRM test class
> --
>
> Key: YARN-3785
> URL: https://issues.apache.org/jira/browse/YARN-3785
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-3785.patch, 0002-YARN-3785.patch
>
>
> Currently MockRM#submitApp supports only memory. Adding test cases to support 
> vcores so that DominantResourceCalculator can be tested with this.
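As a usage sketch (assuming the new overload simply takes a {{Resource}}; see the attached patches for the exact signature), a test can now exercise vcores like this:
{code}
// Submit an app with an explicit Resource so that vcores are exercised and
// DominantResourceCalculator has both dimensions to work with.
Resource amResource = Resource.newInstance(2048, 4); // 2 GB, 4 vcores
RMApp app = rm.submitApp(amResource);
{code}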



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3785) Support for Resource as an argument during submitApp call in MockRM test class

2015-06-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581434#comment-14581434
 ] 

Hudson commented on YARN-3785:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8004 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8004/])
YARN-3785. Support for Resource as an argument during submitApp call in (xgong: 
rev 5583f88bf7f1852dc0907ce55d0755e4fb22107a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java


> Support for Resource as an argument during submitApp call in MockRM test class
> --
>
> Key: YARN-3785
> URL: https://issues.apache.org/jira/browse/YARN-3785
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-3785.patch, 0002-YARN-3785.patch
>
>
> Currently MockRM#submitApp supports only memory. Adding test cases to support 
> vcores so that DominantResourceCalculator can be tested with this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3779) Aggregated Logs Deletion doesnt work after refreshing Log Retention Settings in secure cluster

2015-06-10 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581430#comment-14581430
 ] 

Xuan Gong commented on YARN-3779:
-

[~varun_saxena] Thanks for the logs. Could you apply the patch and print the
UGI?
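Presumably "print the UGI" means logging the UserGroupInformation the deletion task actually runs as, roughly along these lines (a sketch only; {{LOG}} stands in for whatever logger the surrounding class uses):
{code}
try {
  UserGroupInformation current = UserGroupInformation.getCurrentUser();
  UserGroupInformation login = UserGroupInformation.getLoginUser();
  LOG.info("current UGI=" + current + ", login UGI=" + login
      + ", hasKerberosCredentials=" + current.hasKerberosCredentials());
} catch (IOException e) {
  LOG.warn("Could not resolve UGI", e);
}
{code}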

> Aggregated Logs Deletion doesnt work after refreshing Log Retention Settings 
> in secure cluster
> --
>
> Key: YARN-3779
> URL: https://issues.apache.org/jira/browse/YARN-3779
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
> Environment: mrV2, secure mode
>Reporter: Zhang Wei
>Assignee: Varun Saxena
>Priority: Critical
> Attachments: YARN-3779.01.patch, YARN-3779.02.patch
>
>
> {{GSSException}} is thrown every time log aggregation deletion is attempted 
> after executing bin/mapred hsadmin -refreshLogRetentionSettings in a secure 
> cluster.
> The problem can be reproduced by following steps:
> 1. startup historyserver in secure cluster.
> 2. Log deletion happens as per expectation. 
> 3. execute {{mapred hsadmin -refreshLogRetentionSettings}} command to refresh 
> the configuration value.
> 4. All the subsequent attempts of log deletion fail with {{GSSException}}
> Following exception can be found in historyserver's log if log deletion is 
> enabled. 
> {noformat}
> 2015-06-04 14:14:40,070 | ERROR | Timer-3 | Error reading root log dir this 
> deletion attempt is being aborted | AggregatedLogDeletionService.java:127
> java.io.IOException: Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]; Host Details : local host is: "vm-31/9.91.12.31"; 
> destination host is: "vm-33":25000; 
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
> at org.apache.hadoop.ipc.Client.call(Client.java:1414)
> at org.apache.hadoop.ipc.Client.call(Client.java:1363)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy9.getListing(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:519)
> at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy10.getListing(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1767)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1750)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:691)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:753)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:749)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:749)
> at 
> org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.run(AggregatedLogDeletionService.java:68)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS 
> initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Failed to find any Kerberos tgt)]
> at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:677)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1641)
> at 
> org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:640)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:724)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
> at org.apache.hadoop.ipc.Client.call(Client.

[jira] [Commented] (YARN-3785) Support for Resource as an argument during submitApp call in MockRM test class

2015-06-10 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581422#comment-14581422
 ] 

Xuan Gong commented on YARN-3785:
-

+1 lgtm. Will commit

> Support for Resource as an argument during submitApp call in MockRM test class
> --
>
> Key: YARN-3785
> URL: https://issues.apache.org/jira/browse/YARN-3785
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Minor
> Attachments: 0001-YARN-3785.patch, 0002-YARN-3785.patch
>
>
> Currently MockRM#submitApp supports only memory. Adding test cases to support 
> vcores so that DominantResourceCalculator can be tested with this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3791) FSDownload

2015-06-10 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581395#comment-14581395
 ] 

Varun Saxena commented on YARN-3791:


I see the class name is {{com.suning.cybertron.superion.util.FSDownload}},
which is not the same as the FSDownload in the Hadoop distribution. Is the code the same?

> FSDownload
> --
>
> Key: YARN-3791
> URL: https://issues.apache.org/jira/browse/YARN-3791
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
> Environment: Linux 2.6.32-279.el6.x86_64 
>Reporter: HuanWang
>
> Inadvertently, we set two source FTP paths:
> {code}
>  { { ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null 
> },pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING}
> ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null 
> },pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING}
> {code}
> The first one is a wrong path; only one source was set this way. But following the 
> log, I saw that starting with the first path's download, all subsequent job sources 
> were downloaded from ftp://10.27.178.207 by default.
> The log is:
> {code}
> 2015-06-09 11:14:34,653 INFO  [AsyncDispatcher event handler] 
> localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:addResource(544)) - Downloading public 
> rsrc:{ ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, 
> null }
> 2015-06-09 11:14:34,653 INFO  [AsyncDispatcher event handler] 
> localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:addResource(544)) - Downloading public 
> rsrc:{ ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, 
> null }
> 2015-06-09 11:14:37,883 INFO  [Public Localizer] 
> localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { 
> ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null 
> },pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING}
> java.io.IOException: Login failed on server - 10.27.178.207, port - 21
> at 
> org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133)
> at 
> org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390)
> at 
> com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172)
> at 
> com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279)
> at 
> com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 2015-06-09 11:14:37,885 INFO  [Public Localizer] localizer.LocalizedResource 
> (LocalizedResource.java:handle(203)) - Resource 
> ftp://10.27.178.207:21/home/cbt/1213/jxf.sql transitioned from DOWNLOADING to 
> FAILED
> 2015-06-09 11:14:37,886 INFO  [Public Localizer] 
> localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { 
> ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null 
> },pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING}
> java.io.IOException: Login failed on server - 10.27.178.207, port - 21
> at 
> org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133)
> at 
> org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390)
> at 
> com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172)
> at 
> com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279)
> at 
> com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 2015-06-09 11:14:37,886 INFO  [AsyncDispatcher event handler] 
> container.Container (ContainerImpl.java:handle(853)) - Container 
> container_20150608111420_41540_1213_1503_ transitioned from LOCALIZING to 
> LOCALIZ

[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-10 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581358#comment-14581358
 ] 

Brahma Reddy Battula commented on YARN-3793:


[~kasha], thanks for reporting this JIRA. It's a dupe of HADOOP-11878.

> Several NPEs when deleting local files on NM recovery
> -
>
> Key: YARN-3793
> URL: https://issues.apache.org/jira/browse/YARN-3793
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> When NM work-preserving restart is enabled, we see several NPEs on recovery. 
> These seem to correspond to sub-directories that need to be deleted. I wonder 
> if null pointers here mean incorrect tracking of these resources and a 
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : null
> 2015-05-18 07:06:10,224 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
> execution of task in DeletionService
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
> at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1042) add ability to specify affinity/anti-affinity in container requests

2015-06-10 Thread Weiwei Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-1042:
--
Attachment: YARN-1042.001.patch

> add ability to specify affinity/anti-affinity in container requests
> ---
>
> Key: YARN-1042
> URL: https://issues.apache.org/jira/browse/YARN-1042
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Arun C Murthy
> Attachments: YARN-1042-demo.patch, YARN-1042.001.patch
>
>
> container requests to the AM should be able to request anti-affinity to 
> ensure that things like Region Servers don't come up on the same failure 
> zones. 
> Similarly, you may be able to want to specify affinity to same host or rack 
> without specifying which specific host/rack. Example: bringing up a small 
> giraph cluster in a large YARN cluster would benefit from having the 
> processes in the same rack purely for bandwidth reasons.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1042) add ability to specify affinity/anti-affinity in container requests

2015-06-10 Thread Weiwei Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-1042:
--
Attachment: (was: YARN-1042-001.patch)

> add ability to specify affinity/anti-affinity in container requests
> ---
>
> Key: YARN-1042
> URL: https://issues.apache.org/jira/browse/YARN-1042
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Arun C Murthy
> Attachments: YARN-1042-demo.patch
>
>
> container requests to the AM should be able to request anti-affinity to 
> ensure that things like Region Servers don't come up on the same failure 
> zones. 
> Similarly, you may be able to want to specify affinity to same host or rack 
> without specifying which specific host/rack. Example: bringing up a small 
> giraph cluster in a large YARN cluster would benefit from having the 
> processes in the same rack purely for bandwidth reasons.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2768) optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread

2015-06-10 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581349#comment-14581349
 ] 

Hong Zhiguo commented on YARN-2768:
---

[~kasha], the execution time displayed in the profiling output is cumulative.
Actually, I repeated such profiling many times and got the same ratio.
The profiling was done with a cluster of NM/AM simulators, and I don't have such 
resources now.

I wrote a testcase which creates 8000 nodes and 4500 apps within 1200 queues, then 
performs 1 rounds of FairScheduler.update(), and prints the average execution time 
of one call to update. With this patch, the average execution time decreased from 
about 35ms to 20ms.

I think the effect comes from GC and memory allocation since in each round of 
FairScheduler.update(), Resource.multiply is called as many times as the number 
of pending ResourceRequests, which is more than 3 million in our production 
cluster.
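For clarity, the clone-free accumulation being measured here can be sketched as below (illustrative only, reusing the fields from the snippet in the issue description and the int-based Resource getters/setters of the 2.x API; the real change is in the attached patch):
{code}
// Accumulate capability * numContainers directly into demand instead of
// allocating a cloned Resource per ResourceRequest via Resources.multiply().
synchronized (app) {
  for (Priority p : app.getPriorities()) {
    for (ResourceRequest r : app.getResourceRequests(p).values()) {
      int n = r.getNumContainers();
      demand.setMemory(demand.getMemory() + r.getCapability().getMemory() * n);
      demand.setVirtualCores(
          demand.getVirtualCores() + r.getCapability().getVirtualCores() * n);
    }
  }
}
{code}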

> optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% 
> of computing time of update thread
> 
>
> Key: YARN-2768
> URL: https://issues.apache.org/jira/browse/YARN-2768
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Hong Zhiguo
>Assignee: Hong Zhiguo
>Priority: Minor
> Attachments: YARN-2768.patch, profiling_FairScheduler_update.png
>
>
> See the attached picture of profiling result. The clone of Resource object 
> within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the 
> function FairScheduler.update().
> The code of FSAppAttempt.updateDemand:
> {code}
> public void updateDemand() {
> demand = Resources.createResource(0);
> // Demand is current consumption plus outstanding requests
> Resources.addTo(demand, app.getCurrentConsumption());
> // Add up outstanding resource requests
> synchronized (app) {
>   for (Priority p : app.getPriorities()) {
> for (ResourceRequest r : app.getResourceRequests(p).values()) {
>   Resource total = Resources.multiply(r.getCapability(), 
> r.getNumContainers());
>   Resources.addTo(demand, total);
> }
>   }
> }
>   }
> {code}
> The code of Resources.multiply:
> {code}
> public static Resource multiply(Resource lhs, double by) {
> return multiplyTo(clone(lhs), by);
> }
> {code}
> The clone could be skipped by directly updating the value of this.demand.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581151#comment-14581151
 ] 

Hadoop QA commented on YARN-3790:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m  3s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 33s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 45s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 26s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m 55s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  88m 51s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12738905/YARN-3790.000.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c7729ef |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8235/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8235/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8235/console |


This message was automatically generated.

> TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
> trunk for FS scheduler
> 
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith
>Assignee: zhihai xu
> Attachments: YARN-3790.000.patch
>
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-10 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-3793:
--

 Summary: Several NPEs when deleting local files on NM recovery
 Key: YARN-3793
 URL: https://issues.apache.org/jira/browse/YARN-3793
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical


When NM work-preserving restart is enabled, we see several NPEs on recovery. 
These seem to correspond to sub-directories that need to be deleted. I wonder 
if null pointers here mean incorrect tracking of these resources and a 
potential leak. This JIRA is to investigate and fix anything required.

Logs show:
{noformat}
2015-05-18 07:06:10,225 INFO 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
absolute path : null
2015-05-18 07:06:10,224 ERROR 
org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
execution of task in DeletionService
java.lang.NullPointerException
at 
org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
at 
org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
{noformat}
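
A minimal defensive sketch of the idea (illustrative only, not the actual fix; the class and field names below are simplified) is to skip null paths before handing them to FileContext#delete:
{code}
// Illustrative sketch only: guard against null paths recorded across NM
// restart before asking FileContext to delete them, instead of hitting an NPE
// inside fixRelativePart(). Names are simplified for the example.
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

class FileDeletionSketch {
  private final FileContext lfs;

  FileDeletionSketch(FileContext lfs) {
    this.lfs = lfs;
  }

  void deleteQuietly(Path subDir) {
    if (subDir == null) {
      // A null path here suggests the resource was never fully tracked (or its
      // state was lost on restart); skip it rather than crash the task.
      return;
    }
    try {
      lfs.delete(subDir, true); // recursive delete, best effort
    } catch (Exception e) {
      // real code would log the failure and continue
    }
  }
}
{code}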



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-10 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3793:
---
Priority: Major  (was: Critical)

> Several NPEs when deleting local files on NM recovery
> -
>
> Key: YARN-3793
> URL: https://issues.apache.org/jira/browse/YARN-3793
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> When NM work-preserving restart is enabled, we see several NPEs on recovery. 
> These seem to correspond to sub-directories that need to be deleted. I wonder 
> if null pointers here mean incorrect tracking of these resources and a 
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : null
> 2015-05-18 07:06:10,224 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
> execution of task in DeletionService
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
> at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-10 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581026#comment-14581026
 ] 

zhihai xu commented on YARN-3790:
-

I uploaded a patch, YARN-3790.000.patch, which moves 
{{updateRootQueueMetrics}} after {{recoverContainersOnNode}} in {{addNode}}.
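
To make the reasoning concrete, here is a minimal, self-contained sketch of why the ordering matters (toy code, not the FairScheduler implementation; {{recoverContainersOnNode}} and {{updateRootQueueMetrics}} are reduced to their effect on the metrics):
{code}
// Toy sketch of the ordering idea: charge the recovered containers' usage to
// the queues first, and only then recompute the root queue metrics, so that
// available = total - used reflects the recovered usage after RM restart.
class AddNodeOrderingSketch {
  private long totalMB;
  private long usedMB;
  private long availableMB;

  void addNode(long nodeMB, long recoveredContainerMB) {
    totalMB += nodeMB;                             // grow cluster capacity
    recoverContainersOnNode(recoveredContainerMB); // recover containers first
    updateRootQueueMetrics();                      // then refresh root metrics
  }

  private void recoverContainersOnNode(long recoveredContainerMB) {
    usedMB += recoveredContainerMB;  // recovered containers count as used
  }

  private void updateRootQueueMetrics() {
    availableMB = totalMB - usedMB;  // the kind of value the test asserts on
  }
}
{code}
If {{updateRootQueueMetrics}} ran before the recovery step instead, the available resource would still reflect an empty node at the time the test reads the metrics.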

> TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
> trunk for FS scheduler
> 
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith
>Assignee: zhihai xu
> Attachments: YARN-3790.000.patch
>
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-10 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3790:

Attachment: YARN-3790.000.patch

> TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
> trunk for FS scheduler
> 
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith
>Assignee: zhihai xu
> Attachments: YARN-3790.000.patch
>
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-10 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580965#comment-14580965
 ] 

zhihai xu commented on YARN-3790:
-

[~rohithsharma], thanks for updating the title.
The containers are recovered and {{rootMetrics}}'s used resource is updated, 
but {{rootMetrics}}'s available resource is not updated.
The following logs from the failed test prove it:
{code}
2015-06-09 22:55:42,964 INFO  [ResourceManager Event Processor] 
fair.FairScheduler (FairScheduler.java:addNode(855)) - Added node 
127.0.0.1:1234 cluster capacity: 
2015-06-09 22:55:42,964 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl 
(RMAppImpl.java:handle(756)) - Processing event for 
application_1433915736884_0001 of type NODE_UPDATE
2015-06-09 22:55:42,964 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl 
(RMAppImpl.java:processNodeUpdate(820)) - Received node update 
event:NODE_USABLE for node:127.0.0.1:1234 with state:RUNNING
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemand(287)) - The updated demand for root.default is 
; the max is 
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.default 
is 
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue 
(FSParentQueue.java:updateDemand(163)) - Counting resource from root.default 
; Total resource consumption for root now 
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemandForApp(298)) - Counting resource from 
application_1433915736884_0001 ; Total resource consumption 
for root.zxu now 
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemand(287)) - The updated demand for root.zxu is 
; the max is 
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.zxu is 

2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue 
(FSParentQueue.java:updateDemand(163)) - Counting resource from root.zxu 
; Total resource consumption for root now 
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue 
(FSParentQueue.java:updateDemand(177)) - The updated demand for root is 
; the max is 
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSQueue 
(FSQueue.java:setFairShare(196)) - The updated fairShare for root is 

2015-06-09 22:55:42,965 INFO  [ResourceManager Event Processor] 
scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering 
container container_id { app_attempt_id { application_id { id: 1 
cluster_timestamp: 1433915736884 } attemptId: 1 } id: 1 } container_state: 
C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } 
diagnostics: "recover container" container_exit_status: 0 creation_time: 0 
nodeLabelExpression: ""
2015-06-09 22:55:42,965 DEBUG [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing 
container_1433915736884_0001_01_01 of type RECOVER
2015-06-09 22:55:42,965 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
 APP_RUNNING_ON_NODE
2015-06-09 22:55:42,965 INFO  [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - 
container_1433915736884_0001_01_01 Container Transitioned from NEW to 
RUNNING
2015-06-09 22:55:42,965 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl 
(RMAppImpl.java:handle(756)) - Processing event for 
application_1433915736884_0001 of type APP_RUNNING_ON_NODE
2015-06-09 22:55:42,965 INFO  [ResourceManager Event Processor] 
scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned 
container container_1433915736884_0001_01_01 of capacity  on host 127.0.0.1:1234, which has 1 containers,  used and  available after allocation
2015-06-09 22:55:42,966 INFO  [ResourceManager Event Processor] 
scheduler.SchedulerApplicationAttempt 
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt 
appattempt_1433915736884_0001_01 is recovering container 
container_1433915736884_0001_01_01
2015-06-09 22:55:42,966 INFO  [ResourceManager Event Processor] 
scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering 
container container_id { app_attempt_id { application_id { id: 1 
cluster_timestamp: 1433915736884 } attemptId: 1 } id: 2 } container_state: 
C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } 
diagnostics: "recover container" container_exit_status: 0 creation_time: 

[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers

2015-06-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580961#comment-14580961
 ] 

Hadoop QA commented on YARN-3051:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  17m 26s | Findbugs (version ) appears to 
be broken on YARN-2928. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 3 new or modified test files. |
| {color:green}+1{color} | javac |   7m 56s | There were no new javac warning 
messages. |
| {color:red}-1{color} | javadoc |  10m 12s | The applied patch generated  11  
additional warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 22s | The applied patch generated  
25 new checkstyle issues (total was 243, now 267). |
| {color:green}+1{color} | shellcheck |   0m  6s | There were no new shellcheck 
(v0.3.3) issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 40s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 40s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   4m  2s | The patch appears to introduce 5 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 22s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   1m 59s | Tests passed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |   1m 27s | Tests passed in 
hadoop-yarn-server-timelineservice. |
| | |  48m  2s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-server-timelineservice |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12738884/YARN-3051-YARN-2928.04.patch
 |
| Optional Tests | shellcheck javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / 0a3c147 |
| javadoc | 
https://builds.apache.org/job/PreCommit-YARN-Build/8234/artifact/patchprocess/diffJavadocWarnings.txt
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8234/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8234/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-timelineservice.html
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8234/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8234/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-timelineservice test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8234/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8234/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8234/console |


This message was automatically generated.

> [Storage abstraction] Create backing storage read interface for ATS readers
> ---
>
> Key: YARN-3051
> URL: https://issues.apache.org/jira/browse/YARN-3051
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Varun Saxena
> Attachments: YARN-3051-YARN-2928.003.patch, 
> YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, 
> YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch
>
>
> Per design in YARN-2928, create backing storage read interface that can be 
> implemented by multiple backing storage implementations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers

2015-06-10 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580910#comment-14580910
 ] 

Varun Saxena commented on YARN-3051:


As of now, there are very similar APIs for 
getEntity/getFlowEntity/getUserEntity, etc. Would it be fine to combine these 
APIs and pass something like a query type (ENTITY/USER/FLOW, etc.) in the API, 
which the storage implementation can then use to decide which type of query it 
is? A rough sketch of this idea follows below.
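
The following sketch is purely to make the question concrete; all names below are hypothetical, not a proposed signature:
{code}
// Hypothetical sketch: one read entry point keyed by a query type, which the
// storage implementation can use to pick the right table/query path, instead
// of separate getEntity/getFlowEntity/getUserEntity methods.
enum TimelineQueryType { ENTITY, FLOW, FLOW_RUN, USER }

interface CombinedTimelineReaderSketch {
  // Return type elided; a real API would return the timeline entity record.
  Object read(TimelineQueryType type, String clusterId, String id);
}
{code}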

> [Storage abstraction] Create backing storage read interface for ATS readers
> ---
>
> Key: YARN-3051
> URL: https://issues.apache.org/jira/browse/YARN-3051
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Varun Saxena
> Attachments: YARN-3051-YARN-2928.003.patch, 
> YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, 
> YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch
>
>
> Per design in YARN-2928, create backing storage read interface that can be 
> implemented by multiple backing storage implementations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers

2015-06-10 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3051:
---
Attachment: YARN-3051-YARN-2928.04.patch

> [Storage abstraction] Create backing storage read interface for ATS readers
> ---
>
> Key: YARN-3051
> URL: https://issues.apache.org/jira/browse/YARN-3051
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Varun Saxena
> Attachments: YARN-3051-YARN-2928.003.patch, 
> YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, 
> YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch
>
>
> Per design in YARN-2928, create backing storage read interface that can be 
> implemented by multiple backing storage implementations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2578) NM does not failover timely if RM node network connection fails

2015-06-10 Thread Masatake Iwasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated YARN-2578:
---
Attachment: YARN-2578.002.patch

I attached 002, which makes rpcTimeout configurable via "ipc.client.rpc.timeout". 
The default value is 0 in order to keep the current behaviour. We can test the 
timeout by setting the value explicitly and change the default later after some 
testing. I also left Client#getTimeout as-is to keep compatibility.
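
For readers following the thread, the mechanism being described is roughly the following (a minimal sketch assuming the value is read from the standard Hadoop {{Configuration}}; not the actual 002 patch):
{code}
import org.apache.hadoop.conf.Configuration;

// Minimal sketch: read the proposed "ipc.client.rpc.timeout" key, defaulting
// to 0 (which, per the comment above, preserves the current behaviour), and
// hand it to whatever builds the RPC proxy as its rpcTimeout.
class RpcTimeoutSketch {
  static final String IPC_CLIENT_RPC_TIMEOUT_KEY = "ipc.client.rpc.timeout";

  static int getClientRpcTimeout(Configuration conf) {
    return conf.getInt(IPC_CLIENT_RPC_TIMEOUT_KEY, 0);
  }
}
{code}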

> NM does not failover timely if RM node network connection fails
> ---
>
> Key: YARN-2578
> URL: https://issues.apache.org/jira/browse/YARN-2578
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.1
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
> Attachments: YARN-2578.002.patch, YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is 
> unplugged or the failure is simulated by a "service network stop" or a 
> firewall that drops all traffic on the node. The RM fails over to the standby 
> node when the failure is detected, as expected. The NM should then re-register 
> with the new active RM. This re-register takes a long time (15 minutes or 
> more). Until then the cluster has no nodes for processing and applications 
> are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
> node 1: ZK, NN, JN, ZKFC, DN, RM, NM
> node 2: ZK, NN, JN, ZKFC, DN, RM, NM
> node 3: ZK, JN, DN, NM
> - start all services make sure they are in good health
> - kill the network connection of the RM that is active using one of the 
> network kills from above
> - observe the NN and RM failover
> - the DN's fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and traces show no change at all
> The stack traces of the NM all show the same set of threads. The main thread 
> which should be used in the re-register is the "Node Status Updater" This 
> thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in 
> Object.wait() [0x7f5a51fc1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at java.lang.Object.wait(Object.java:503)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>   - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the 
> ResourceTrackerPBClientImpl. The generated proxy does not time out and we 
> should be using a version which takes the RPC timeout (from the 
> configuration) as a parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers

2015-06-10 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580858#comment-14580858
 ] 

Varun Saxena commented on YARN-3051:


[~zjshen], thanks for your inputs. I will brief you on the APIs I have decided 
on as of now. A rough sketch of the resulting interface shape follows below.

# APIs for querying an individual entity/flow/flow run/user, and APIs for 
querying a set of entities/flow runs/flows/users. APIs for a set of 
flows/users will contain aggregated data. The reason for separate endpoints for 
entities, flows, users, etc. is the different tables in the HBase/Phoenix 
schema.
# Most of the APIs will be variations of either getting a single entity or a 
set of entities, so I will primarily talk about an entity and a set of entities 
in the subsequent points.
# For getting a set of entities, there will be 3 kinds of filters: filtering on 
info, filtering on configs, and filtering on metrics. Filtering on info and 
fields will be based on equality, for instance, fetch entities whose config 
name matches a specific config value. Metrics filtering, though, will be based 
on a relational operator; for instance, a user can query entities which have a 
specific metric >= a certain value.
# In addition, certain predicates such as limit, windowStart, windowEnd, etc., 
which existed in ATSv1, still exist now. Some predicates such as fromId and 
fromTs may not make sense in ATSv2, but I have included them for now with the 
intention of discussing them.
# Additional predicates such as metrics windowStart and windowEnd have been 
specified to fetch metrics data for a specific time span. I included these 
because they can aid in plotting graphs on the UI for a specific metric of some 
entity.
# Only the entity id, type, created time and modified time will be returned if 
fields are not specified in the REST URL. This will be the default view of an 
entity.
# Moreover, you can also specify which configurations and metrics to return.
# Every query param will be received as a String, even timestamps. From the 
backing storage implementation's viewpoint, would it make more sense to pass 
these query params as strings or to do datatype conversion?

The concerns from Li Lu regarding the parameter list becoming too long are 
quite valid, as most of the parameters will be null. We can club multiple 
related parameters into separate classes to reduce them, or, as he said, have 
different methods for frequently occurring use cases. Thoughts?

Comments are welcome so that this JIRA can speed up, probably after Hadoop 
Summit :)
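
To make the shape above concrete, here is a rough, non-authoritative sketch of such a reader interface; every name, type and parameter below is illustrative only and not the proposed signature:
{code}
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the read interface discussed above: one method for a
// single entity and one for a set of entities with info/config/metric filters,
// a retrieval window, a limit, and explicit config/metric/field selection.
// Return types are elided (Object) since the entity class is out of scope here.
interface TimelineStorageReaderSketch {

  Object getEntity(String clusterId, String entityType, String entityId,
      Set<String> fieldsToRetrieve);

  Object getEntities(String clusterId, String entityType,
      Long limit, Long windowStart, Long windowEnd,
      Map<String, Object> infoFilters,    // equality match on info/fields
      Map<String, String> configFilters,  // equality match on configs
      Map<String, Object> metricFilters,  // relational match, e.g. metric >= value
      Set<String> configsToRetrieve,
      Set<String> metricsToRetrieve,
      Set<String> fieldsToRetrieve);
}
{code}
The long parameter list illustrates the concern raised above; wrapping the filters and retrieval options into small holder classes would be one way to keep the signatures manageable.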

> [Storage abstraction] Create backing storage read interface for ATS readers
> ---
>
> Key: YARN-3051
> URL: https://issues.apache.org/jira/browse/YARN-3051
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Varun Saxena
> Attachments: YARN-3051-YARN-2928.003.patch, 
> YARN-3051-YARN-2928.03.patch, YARN-3051.wip.02.YARN-2928.patch, 
> YARN-3051.wip.patch, YARN-3051_temp.patch
>
>
> Per design in YARN-2928, create backing storage read interface that can be 
> implemented by multiple backing storage implementations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging

2015-06-10 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580821#comment-14580821
 ] 

Bibin A Chundatt commented on YARN-3789:


With this patch there is no increase in the number of lines. The checkstyle 
issue seems unrelated.
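
For context, one common way to collapse this kind of repeated message (purely illustrative; the actual refactoring in the patch may differ) is to demote the per-application line to DEBUG and emit a single summary:
{code}
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only (not the YARN-3789 patch): log the per-application
// reason at DEBUG and print one INFO summary, so the INFO log is not flooded
// with identical "amIfStarted exceeds amLimit" lines.
class ActivationLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(ActivationLoggingSketch.class);

  void activateApplications(List<String> pendingApps) {
    int skipped = 0;
    for (String appId : pendingApps) {
      if (amIfStartedExceedsAmLimit(appId)) {
        skipped++;
        LOG.debug("not starting application {} as amIfStarted exceeds amLimit",
            appId);
        continue;
      }
      // ... activate the application ...
    }
    if (skipped > 0) {
      LOG.info("{} applications not started: amIfStarted exceeds amLimit",
          skipped);
    }
  }

  private boolean amIfStartedExceedsAmLimit(String appId) {
    return false; // placeholder for the real AM-limit check
  }
}
{code}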

> Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
> --
>
> Key: YARN-3789
> URL: https://issues.apache.org/jira/browse/YARN-3789
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 
> 0003-YARN-3789.patch
>
>
> Duplicate logging from the ResourceManager
> during the AM limit check for each application:
> {code}
> 015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3792) Test case failures in TestDistributedShell after changes for subjira's of YARN-2928

2015-06-10 Thread Naganarasimha G R (JIRA)
Naganarasimha G R created YARN-3792:
---

 Summary: Test case failures in TestDistributedShell after changes 
for subjira's of YARN-2928
 Key: YARN-3792
 URL: https://issues.apache.org/jira/browse/YARN-3792
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R


Encountered [testcase 
failures|https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/] 
which were happening even without the patch modifications in YARN-3044:

TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow
TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow
TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS

2015-06-10 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580508#comment-14580508
 ] 

Naganarasimha G R commented on YARN-3044:
-

[~zjshen],
It seems many of the test case failures in TestDistributedShell, 
TestDistributedShellWithNodeLabels, etc. are not related to this JIRA, so I am 
opening a new JIRA to handle them. Based on past experience, it is better to 
track them separately so that duplicate effort is avoided.

> [Event producers] Implement RM writing app lifecycle events to ATS
> --
>
> Key: YARN-3044
> URL: https://issues.apache.org/jira/browse/YARN-3044
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3044-YARN-2928.004.patch, 
> YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, 
> YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, 
> YARN-3044-YARN-2928.009.patch, YARN-3044-YARN-2928.010.patch, 
> YARN-3044-YARN-2928.011.patch, YARN-3044.20150325-1.patch, 
> YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch
>
>
> Per design in YARN-2928, implement RM writing app lifecycle events to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3791) FSDownload

2015-06-10 Thread HuanWang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HuanWang updated YARN-3791:
---
Description: 
Inadvertently, we set two source FTP paths:
{code}
 { { ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null 
},pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING}

ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null 
},pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING}

{code}

The first path is wrong; only one source was set this way. But following the 
log, I saw that starting from the first path's download, all subsequent jobs' 
sources were downloaded from ftp://10.27.178.207 by default.

The log is:

{code}
2015-06-09 11:14:34,653 INFO  [AsyncDispatcher event handler] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:addResource(544)) - Downloading public rsrc:{ 
ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null }
2015-06-09 11:14:34,653 INFO  [AsyncDispatcher event handler] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:addResource(544)) - Downloading public rsrc:{ 
ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null }
2015-06-09 11:14:37,883 INFO  [Public Localizer] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { 
ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null 
},pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING}
java.io.IOException: Login failed on server - 10.27.178.207, port - 21
at 
org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133)
at 
org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390)
at 
com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172)
at 
com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279)
at 
com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-06-09 11:14:37,885 INFO  [Public Localizer] localizer.LocalizedResource 
(LocalizedResource.java:handle(203)) - Resource 
ftp://10.27.178.207:21/home/cbt/1213/jxf.sql transitioned from DOWNLOADING to 
FAILED
2015-06-09 11:14:37,886 INFO  [Public Localizer] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { 
ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null 
},pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING}
java.io.IOException: Login failed on server - 10.27.178.207, port - 21
at 
org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133)
at 
org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390)
at 
com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172)
at 
com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279)
at 
com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-06-09 11:14:37,886 INFO  [AsyncDispatcher event handler] 
container.Container (ContainerImpl.java:handle(853)) - Container 
container_20150608111420_41540_1213_1503_ transitioned from LOCALIZING to 
LOCALIZATION_FAILED
2015-06-09 11:14:37,887 INFO  [Public Localizer] localizer.LocalizedResource 
(LocalizedResource.java:handle(203)) - Resource 
ftp://10.27.89.13:21/home/cbt/common/2/sql.jar transitioned from DOWNLOADING to 
FAILED
2015-06-09 11:14:37,887 INFO  [AsyncDispatcher event handler] 
localizer.LocalResourcesTrackerImpl 
(LocalResourcesTrackerImpl.java:handle(133)) - Container 
container_20150608111420_41540_1213_1503_ sent RELEASE event on a resource 
request { ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, 
null } not present
{code}

I debugged the YARN code and found that the point of failure is 
org.apache.hadoop.fs.FileSystem#cache.

The source code is here:

{code}
private Fil

[jira] [Updated] (YARN-3791) FSDownload

2015-06-10 Thread HuanWang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HuanWang updated YARN-3791:
---
Description: 
Inadvertently, we set two source FTP paths:
{code}
 { { ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null 
},pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING}

ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null 
},pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING}

{code}

The first path is wrong; only one source was set this way. But following the 
log, I saw that starting from the first path's download, all subsequent jobs' 
sources were downloaded from ftp://10.27.178.207 by default.

The log is:

{code}
2015-06-09 11:14:34,653 INFO  [AsyncDispatcher event handler] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:addResource(544)) - Downloading public rsrc:{ 
ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null }
2015-06-09 11:14:34,653 INFO  [AsyncDispatcher event handler] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:addResource(544)) - Downloading public rsrc:{ 
ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null }
2015-06-09 11:14:37,883 INFO  [Public Localizer] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { 
ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null 
},pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING}
java.io.IOException: Login failed on server - 10.27.178.207, port - 21
at 
org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133)
at 
org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390)
at 
com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172)
at 
com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279)
at 
com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-06-09 11:14:37,885 INFO  [Public Localizer] localizer.LocalizedResource 
(LocalizedResource.java:handle(203)) - Resource 
ftp://10.27.178.207:21/home/cbt/1213/jxf.sql transitioned from DOWNLOADING to 
FAILED
2015-06-09 11:14:37,886 INFO  [Public Localizer] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { 
ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null 
},pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING}
java.io.IOException: Login failed on server - 10.27.178.207, port - 21
at 
org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133)
at 
org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390)
at 
com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172)
at 
com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279)
at 
com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-06-09 11:14:37,886 INFO  [AsyncDispatcher event handler] 
container.Container (ContainerImpl.java:handle(853)) - Container 
container_20150608111420_41540_1213_1503_ transitioned from LOCALIZING to 
LOCALIZATION_FAILED
2015-06-09 11:14:37,887 INFO  [Public Localizer] localizer.LocalizedResource 
(LocalizedResource.java:handle(203)) - Resource 
ftp://10.27.89.13:21/home/cbt/common/2/sql.jar transitioned from DOWNLOADING to 
FAILED
2015-06-09 11:14:37,887 INFO  [AsyncDispatcher event handler] 
localizer.LocalResourcesTrackerImpl 
(LocalResourcesTrackerImpl.java:handle(133)) - Container 
container_20150608111420_41540_1213_1503_ sent RELEASE event on a resource 
request { ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, 
null } not present
{code}

I debugged the YARN code and found that the point of failure is 
org.apache.hadoop.fs.FileSystem#cache.

The source code is here:

{code}
private FileSystem getI

[jira] [Created] (YARN-3791) FSDownload

2015-06-10 Thread HuanWang (JIRA)
HuanWang created YARN-3791:
--

 Summary: FSDownload
 Key: YARN-3791
 URL: https://issues.apache.org/jira/browse/YARN-3791
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
 Environment: Linux 2.6.32-279.el6.x86_64 
Reporter: HuanWang


Inadvertently, we set two source FTP paths:

 { { ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null 
},pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING}

ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null 
},pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING}

The first path is wrong; only one source was set this way. But following the 
log, I saw that starting from the first path's download, all subsequent jobs' 
sources were downloaded from ftp://10.27.178.207 by default.

The log is:


2015-06-09 11:14:34,653 INFO  [AsyncDispatcher event handler] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:addResource(544)) - Downloading public rsrc:{ 
ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null }
2015-06-09 11:14:34,653 INFO  [AsyncDispatcher event handler] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:addResource(544)) - Downloading public rsrc:{ 
ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null }
2015-06-09 11:14:37,883 INFO  [Public Localizer] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { 
ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null 
},pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING}
java.io.IOException: Login failed on server - 10.27.178.207, port - 21
at 
org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133)
at 
org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390)
at 
com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172)
at 
com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279)
at 
com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-06-09 11:14:37,885 INFO  [Public Localizer] localizer.LocalizedResource 
(LocalizedResource.java:handle(203)) - Resource 
ftp://10.27.178.207:21/home/cbt/1213/jxf.sql transitioned from DOWNLOADING to 
FAILED
2015-06-09 11:14:37,886 INFO  [Public Localizer] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { 
ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null 
},pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING}
java.io.IOException: Login failed on server - 10.27.178.207, port - 21
at 
org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133)
at 
org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390)
at 
com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172)
at 
com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279)
at 
com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-06-09 11:14:37,886 INFO  [AsyncDispatcher event handler] 
container.Container (ContainerImpl.java:handle(853)) - Container 
container_20150608111420_41540_1213_1503_ transitioned from LOCALIZING to 
LOCALIZATION_FAILED
2015-06-09 11:14:37,887 INFO  [Public Localizer] localizer.LocalizedResource 
(LocalizedResource.java:handle(203)) - Resource 
ftp://10.27.89.13:21/home/cbt/common/2/sql.jar transitioned from DOWNLOADING to 
FAILED
2015-06-09 11:14:37,887 INFO  [AsyncDispatcher event handler] 
localizer.LocalResourcesTrackerImpl 
(LocalResourcesTrackerImpl.java:handle(133)) - Container 
container_20150608111420_41540_1213_1503_ sent RELEASE event on a resource 
request { ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 14

[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS

2015-06-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580378#comment-14580378
 ] 

Hadoop QA commented on YARN-3044:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  18m 29s | Findbugs (version ) appears to 
be broken on YARN-2928. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 4 new or modified test files. |
| {color:green}+1{color} | javac |   7m 55s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 56s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 42s | The applied patch generated  1 
new checkstyle issues (total was 236, now 236). |
| {color:green}+1{color} | whitespace |   0m  6s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 42s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 40s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   5m 46s | The patch appears to introduce 8 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 22s | Tests passed in 
hadoop-yarn-api. |
| {color:red}-1{color} | yarn tests |   6m 55s | Tests failed in 
hadoop-yarn-applications-distributedshell. |
| {color:green}+1{color} | yarn tests |   0m 26s | Tests passed in 
hadoop-yarn-server-common. |
| {color:red}-1{color} | yarn tests |  61m 37s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| {color:green}+1{color} | yarn tests |   1m 15s | Tests passed in 
hadoop-yarn-server-timelineservice. |
| | | 118m  6s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-applications-distributedshell |
| FindBugs | module:hadoop-yarn-server-resourcemanager |
| Failed unit tests | 
hadoop.yarn.applications.distributedshell.TestDistributedShellWithNodeLabels |
|   | hadoop.yarn.applications.distributedshell.TestDistributedShell |
|   | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |
| Timed out tests | 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12738768/YARN-3044-YARN-2928.011.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / 0a3c147 |
| Pre-patch Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/YARN-2928FindbugsWarningshadoop-yarn-server-common.html
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-applications-distributedshell.html
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-applications-distributedshell test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/testrun_hadoop-yarn-applications-distributedshell.txt
 |
| hadoop-yarn-server-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-timelineservice test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8233/console |


This message was automatically generated.

> [Event producers] Implement RM writing app lifecycle events to ATS
> --
>
> Key: YARN-3044
> URL: https://issues.apache.org/jira/browse/YARN-3044
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ti

[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580275#comment-14580275
 ] 

Hadoop QA commented on YARN-2194:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 56s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 37s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 42s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 36s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 13s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   6m  6s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  43m 44s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12738765/YARN-2194-4.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 6785661 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8232/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8232/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8232/console |


This message was automatically generated.

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch, 
> YARN-2194-4.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-10 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580255#comment-14580255
 ] 

Rohith commented on YARN-3790:
--

Thanks for looking into this issue.
bq. If UpdateThread call update after recoverContainersOnNode, the test will 
succeed
In the test, I see the below code, which verifies that the containers are 
recovered, right?
{code}
// Wait for RM to settle down on recovering containers;
waitForNumContainersToRecover(2, rm2, am1.getApplicationAttemptId());
Set launchedContainers =
((RMNodeImpl) rm2.getRMContext().getRMNodes().get(nm1.getNodeId()))
  .getLaunchedContainers();
assertTrue(launchedContainers.contains(amContainer.getContainerId()));
assertTrue(launchedContainers.contains(runningContainer.getContainerId()));
{code}

Am I missing anything?

> TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
> trunk for FS scheduler
> 
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith
>Assignee: zhihai xu
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS

2015-06-10 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580253#comment-14580253
 ] 

Naganarasimha G R commented on YARN-3044:
-

Hi [~zjshen],
I have taken care of the issue you mentioned and also added some test cases in 
TestDistributedShell to cover it (along with some code refactoring). Please 
review.
bq. I'm not sure because as far as I can tell, NM's impl is different from 
RM's, but it's up to you to figure out the proper solution
Yep, I will start on that now, but getting the experts' advice makes my job 
easier ;)

> [Event producers] Implement RM writing app lifecycle events to ATS
> --
>
> Key: YARN-3044
> URL: https://issues.apache.org/jira/browse/YARN-3044
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3044-YARN-2928.004.patch, 
> YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, 
> YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, 
> YARN-3044-YARN-2928.009.patch, YARN-3044-YARN-2928.010.patch, 
> YARN-3044-YARN-2928.011.patch, YARN-3044.20150325-1.patch, 
> YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch
>
>
> Per design in YARN-2928, implement RM writing app lifecycle events to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS

2015-06-10 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3044:

Attachment: YARN-3044-YARN-2928.011.patch

> [Event producers] Implement RM writing app lifecycle events to ATS
> --
>
> Key: YARN-3044
> URL: https://issues.apache.org/jira/browse/YARN-3044
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3044-YARN-2928.004.patch, 
> YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, 
> YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, 
> YARN-3044-YARN-2928.009.patch, YARN-3044-YARN-2928.010.patch, 
> YARN-3044-YARN-2928.011.patch, YARN-3044.20150325-1.patch, 
> YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch
>
>
> Per design in YARN-2928, implement RM writing app lifecycle events to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-10 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3790:
-
Summary: TestWorkPreservingRMRestart#testSchedulerRecovery fails 
intermittently in trunk for FS scheduler  (was: 
TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS 
scheduler)

> TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
> trunk for FS scheduler
> 
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith
>Assignee: zhihai xu
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler

2015-06-10 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580228#comment-14580228
 ] 

Rohith commented on YARN-3790:
--

bq. I think this test fails intermittently.
Yes, it is failing intermittently. Maybe the issue summary can be updated.

> TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS 
> scheduler
> -
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith
>Assignee: zhihai xu
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-10 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2194:
--
Attachment: YARN-2194-4.patch

Uploaded a patch that replaces the comma with '%'.
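
For context, a minimal runnable sketch of the escaping idea, assuming the 
launch failure comes from a downstream consumer that splits the cgroup path 
list on ','. The class and method names here are hypothetical illustrations 
and are not taken from the attached patch:
{code}
// Hypothetical illustration of the escaping idea (not the actual patch):
// a literal comma inside a single cgroup path (the RHEL7 "cpu,cpuacct"
// controller) is assumed to confuse a consumer that splits on ',', so the
// comma is encoded as '%' before the path is handed off.
public class CgroupPathEscapeDemo {
  static String encodeCgroupPath(String path) {
    return path.replace(",", "%");
  }

  public static void main(String[] args) {
    String rhel7Path = "/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1/tasks";
    // Prints the path with "cpu%cpuacct", which can then be split on ',' safely.
    System.out.println(encodeCgroupPath(rhel7Path));
  }
}
{code}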

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch, 
> YARN-2194-4.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler

2015-06-10 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580209#comment-14580209
 ] 

zhihai xu commented on YARN-3790:
-

Hi [~rohithsharma], thanks for reporting this issue. I think this test fails 
intermittently.
The following is the stack trace for the test failure:
{code}
java.lang.AssertionError: expected:<6144> but was:<8192>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:852)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:341)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:240)
{code}
The failure means {{rootMetrics}}'s available resource is not correct for the 
FairScheduler, and I know what causes it.
For the FairScheduler, {{updateRootQueueMetrics}} is used to update 
{{rootMetrics}}'s available resource.
But {{updateRootQueueMetrics}} is not called in or after 
{{recoverContainersOnNode}}, so in this case we can only depend on the 
UpdateThread to update {{rootMetrics}}'s available resource. Currently the 
UpdateThread is triggered in {{addNode}}, so its timing decides whether this 
test passes: if the UpdateThread calls {{update}} after 
{{recoverContainersOnNode}}, the test succeeds; if it calls {{update}} before 
{{recoverContainersOnNode}}, the test fails.
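For illustration, a hedged sketch of where a synchronous refresh could remove 
that timing dependency, assuming the NODE_ADDED handling in trunk looks 
roughly like this; the trailing {{updateRootQueueMetrics()}} call is a 
hypothetical change, not the committed patch:
{code}
// Sketch only: approximate shape of FairScheduler's NODE_ADDED handling.
// Today rootMetrics' available resource is refreshed by the asynchronous
// UpdateThread, so the assertion in checkFSQueue races with recovery.
case NODE_ADDED:
  NodeAddedSchedulerEvent nodeAddedEvent = (NodeAddedSchedulerEvent) event;
  addNode(nodeAddedEvent.getAddedRMNode());
  recoverContainersOnNode(nodeAddedEvent.getContainerReports(),
      nodeAddedEvent.getAddedRMNode());
  // Hypothetical: refresh the root queue metrics synchronously here so the
  // available resource no longer depends on when the UpdateThread runs.
  updateRootQueueMetrics();
  break;
{code}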

> TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS 
> scheduler
> -
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith
>Assignee: zhihai xu
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler

2015-06-10 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu reassigned YARN-3790:
---

Assignee: zhihai xu

> TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS 
> scheduler
> -
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith
>Assignee: zhihai xu
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)