[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581553#comment-14581553 ] zhihai xu commented on YARN-3795: Hi [~lachisis], thanks for reporting this issue. Most likely the broken pipe is caused by a "Len error" at the ZooKeeper server. To confirm this, could you check the ZooKeeper server logs for the following entry:
{code}
WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x due to java.io.IOException: Len error ???
{code}
You can work around the Len error by increasing the jute.maxbuffer size at the ZooKeeper server, or you can try YARN-3469.
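For readers hitting the same Len error, a minimal illustration of the jute.maxbuffer workaround mentioned above follows. The 4 MB value is an arbitrary example; on the ZooKeeper server the property is normally raised through the server's JVM options (for example -Djute.maxbuffer=4194304) rather than in code, and the client side generally needs to be kept consistent with it:
{code}
// Illustrative only: jute.maxbuffer is a plain JVM system property (default
// is roughly 1 MB). On the client it must be set before the ZooKeeper client
// classes are loaded; on the server it is usually passed as a -D JVM flag.
public class JuteMaxBufferExample {
  public static void main(String[] args) {
    System.setProperty("jute.maxbuffer", Integer.toString(4 * 1024 * 1024));
    System.out.println("jute.maxbuffer = " + System.getProperty("jute.maxbuffer"));
  }
}
{code}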
[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581537#comment-14581537 ] lachisis commented on YARN-3795: But I think most of these watchers in ZKRMStateStore are not necessary.
[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581534#comment-14581534 ] lachisis commented on YARN-3795: It would be better if ZooKeeper fixed ZOOKEEPER-706.
[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581531#comment-14581531 ] lachisis commented on YARN-3795: I have found ZOOKEEPER-706. It means that if the ZooKeeper server receives a request whose body is larger than 1 MB, the server throws a "Broken pipe" exception to reject the request; this feature is used to limit the size of znode data. Scanning the ZooKeeper snapshot, I did not find any znode created by ZKRMStateStore with a large data size. Then, analyzing the code, I found that large numbers of watchers are set when {{loadRMAppState}} and {{loadApplicationAttemptState}} are called.
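As an aside on the watcher angle, below is a small, self-contained sketch (not the YARN-3469 patch) showing how a znode can be read without leaving a watch behind; the connect string and path are placeholders, not the actual ZKRMStateStore configuration:
{code}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustrative sketch only: reading state znodes with watch=false avoids
// accumulating one watch per application/attempt znode during recovery.
public class NoWatchRead {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, null);
    try {
      Stat stat = new Stat();
      // Passing 'false' means this read registers no watch on the znode.
      byte[] data = zk.getData("/rmstore/ZKRMStateRoot", false, stat);
      int len = (data == null) ? 0 : data.length;
      System.out.println("Read " + len + " bytes, version " + stat.getVersion());
    } finally {
      zk.close();
    }
  }
}
{code}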
[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581517#comment-14581517 ] lachisis commented on YARN-3795: This exception appeared two days ago on a YARN platform with about 7000+ history jobs in the RM state store. At one point, the active ResourceManager detected session expiry and transitioned to standby; meanwhile, the standby ResourceManager started to transition to active, but it threw the exception attached above.
[jira] [Created] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
lachisis created YARN-3795: -- Summary: ZKRMStateStore crashes due to IOException: Broken pipe Key: YARN-3795 URL: https://issues.apache.org/jira/browse/YARN-3795 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: lachisis Priority: Critical 2015-06-05 06:06:54,848 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dap88/134.41.33.88:2181, initiating session 2015-06-05 06:06:54,876 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dap88/134.41.33.88:2181, sessionid = 0x34db2f72ac50c86, negotiated timeout = 1 2015-06-05 06:06:54,881 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED 2015-06-05 06:06:54,881 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-05 06:06:54,881 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-05 06:06:54,881 WARN org.apache.zookeeper.ClientCnxn: Session 0x34db2f72ac50c86 for server dap88/134.41.33.88:2181, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450) at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075) 2015-06-05 06:06:54,986 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED 2015-06-05 06:06:54,986 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session disconnected 2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dap87/134.41.33.87:2181. 
Will not attempt to authenticate using SASL (unknown error) 2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dap87/134.41.33.87:2181, initiating session 2015-06-05 06:06:55,330 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dap87/134.41.33.87:2181, sessionid = 0x34db2f72ac50c86, negotiated timeout = 1 2015-06-05 06:06:55,343 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED 2015-06-05 06:06:55,343 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-05 06:06:55,344 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-05 06:06:55,345 WARN org.apache.zookeeper.ClientCnxn: Session 0x34db2f72ac50c86 for server dap87/134.41.33.87:2181, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450) at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference
[ https://issues.apache.org/jira/browse/YARN-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated YARN-3794: Attachment: YARN-3794.01.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3794) TestRMEmbeddedElector fails because of ambiguous LOG reference
Chengbing Liu created YARN-3794: --- Summary: TestRMEmbeddedElector fails because of ambiguous LOG reference Key: YARN-3794 URL: https://issues.apache.org/jira/browse/YARN-3794 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu After YARN-2921, {{MockRM}} also has a {{LOG}} field. Therefore {{LOG}} in the following code snippet is ambiguous.
{code}
protected AdminService createAdminService() {
  return new AdminService(MockRMWithElector.this, getRMContext()) {
    @Override
    protected EmbeddedElectorService createEmbeddedElectorService() {
      return new EmbeddedElectorService(getRMContext()) {
        @Override
        public void becomeActive() throws ServiceFailedException {
          try {
            callbackCalled.set(true);
            LOG.info("Callback called. Sleeping now");
            Thread.sleep(delayMs);
            LOG.info("Sleep done");
          } catch (InterruptedException e) {
            e.printStackTrace();
          }
          super.becomeActive();
        }
      };
    }
  };
}
{code}
Eclipse gives the following error: {quote} The field LOG is defined in an inherited type and an enclosing scope {quote} IMO, we should fix this as {{TestRMEmbeddedElector.LOG}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
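A minimal sketch of the affected call sites with the qualification suggested above (assuming the fix is simply the qualified reference; this is not the attached YARN-3794.01.patch):
{code}
// Sketch only: same body as the snippet in the description, with LOG
// qualified by the enclosing test class to resolve the ambiguity.
try {
  callbackCalled.set(true);
  TestRMEmbeddedElector.LOG.info("Callback called. Sleeping now");
  Thread.sleep(delayMs);
  TestRMEmbeddedElector.LOG.info("Sleep done");
} catch (InterruptedException e) {
  e.printStackTrace();
}
super.becomeActive();
{code}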
[jira] [Commented] (YARN-3785) Support for Resource as an argument during submitApp call in MockRM test class
[ https://issues.apache.org/jira/browse/YARN-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581452#comment-14581452 ] Xuan Gong commented on YARN-3785: - Committed into trunk/branch-2. Thanks, [~sunilg] > Support for Resource as an argument during submitApp call in MockRM test class > -- > > Key: YARN-3785 > URL: https://issues.apache.org/jira/browse/YARN-3785 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G >Priority: Minor > Fix For: 2.8.0 > > Attachments: 0001-YARN-3785.patch, 0002-YARN-3785.patch > > > Currently MockRM#submitApp supports only memory. Adding test cases to support > vcores so that DominentResourceCalculator can be tested with this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3785) Support for Resource as an argument during submitApp call in MockRM test class
[ https://issues.apache.org/jira/browse/YARN-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581434#comment-14581434 ] Hudson commented on YARN-3785: -- FAILURE: Integrated in Hadoop-trunk-Commit #8004 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8004/]) YARN-3785. Support for Resource as an argument during submitApp call in (xgong: rev 5583f88bf7f1852dc0907ce55d0755e4fb22107a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java > Support for Resource as an argument during submitApp call in MockRM test class > -- > > Key: YARN-3785 > URL: https://issues.apache.org/jira/browse/YARN-3785 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G >Priority: Minor > Fix For: 2.8.0 > > Attachments: 0001-YARN-3785.patch, 0002-YARN-3785.patch > > > Currently MockRM#submitApp supports only memory. Adding test cases to support > vcores so that DominentResourceCalculator can be tested with this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3779) Aggregated Logs Deletion doesnt work after refreshing Log Retention Settings in secure cluster
[ https://issues.apache.org/jira/browse/YARN-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581430#comment-14581430 ] Xuan Gong commented on YARN-3779: - [~varun_saxena] Thanks for the logs. Could you apply the patch and print the ugi ? > Aggregated Logs Deletion doesnt work after refreshing Log Retention Settings > in secure cluster > -- > > Key: YARN-3779 > URL: https://issues.apache.org/jira/browse/YARN-3779 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 > Environment: mrV2, secure mode >Reporter: Zhang Wei >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3779.01.patch, YARN-3779.02.patch > > > {{GSSException}} is thrown everytime log aggregation deletion is attempted > after executing bin/mapred hsadmin -refreshLogRetentionSettings in a secure > cluster. > The problem can be reproduced by following steps: > 1. startup historyserver in secure cluster. > 2. Log deletion happens as per expectation. > 3. execute {{mapred hsadmin -refreshLogRetentionSettings}} command to refresh > the configuration value. > 4. All the subsequent attempts of log deletion fail with {{GSSException}} > Following exception can be found in historyserver's log if log deletion is > enabled. > {noformat} > 2015-06-04 14:14:40,070 | ERROR | Timer-3 | Error reading root log dir this > deletion attempt is being aborted | AggregatedLogDeletionService.java:127 > java.io.IOException: Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)]; Host Details : local host is: "vm-31/9.91.12.31"; > destination host is: "vm-33":25000; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764) > at org.apache.hadoop.ipc.Client.call(Client.java:1414) > at org.apache.hadoop.ipc.Client.call(Client.java:1363) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy9.getListing(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:519) > at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy10.getListing(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1767) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1750) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:691) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) > at > org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:753) > at > org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:749) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:749) > at > org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.run(AggregatedLogDeletionService.java:68) > at 
java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS > initiate failed [Caused by GSSException: No valid credentials provided > (Mechanism level: Failed to find any Kerberos tgt)] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:677) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1641) > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:640) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:724) > at > org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462) > at org.apache.hadoop.ipc.Client.call(Client.
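For anyone reproducing this, a rough, self-contained sketch of the kind of UGI logging being asked for (this is not the attached patch; it only prints which identity the JVM currently runs as):
{code}
import org.apache.hadoop.security.UserGroupInformation;

// Illustrative only: dump the current and login UGI so it is visible which
// Kerberos identity the log deletion task would run as after the refresh.
public class UgiDebug {
  public static void main(String[] args) throws Exception {
    UserGroupInformation current = UserGroupInformation.getCurrentUser();
    UserGroupInformation login = UserGroupInformation.getLoginUser();
    System.out.println("current UGI: " + current
        + " (auth: " + current.getAuthenticationMethod() + ")");
    System.out.println("login UGI:   " + login
        + " (from keytab: " + login.isFromKeytab() + ")");
  }
}
{code}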
[jira] [Commented] (YARN-3785) Support for Resource as an argument during submitApp call in MockRM test class
[ https://issues.apache.org/jira/browse/YARN-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581422#comment-14581422 ] Xuan Gong commented on YARN-3785: - +1 lgtm. Will commit > Support for Resource as an argument during submitApp call in MockRM test class > -- > > Key: YARN-3785 > URL: https://issues.apache.org/jira/browse/YARN-3785 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G >Priority: Minor > Attachments: 0001-YARN-3785.patch, 0002-YARN-3785.patch > > > Currently MockRM#submitApp supports only memory. Adding test cases to support > vcores so that DominentResourceCalculator can be tested with this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3791) FSDownload
[ https://issues.apache.org/jira/browse/YARN-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581395#comment-14581395 ] Varun Saxena commented on YARN-3791: I see the class name being {{com.suning.cybertron.superion.util.FSDownload}} which is not same as FSDownload in Hadoop distribution. Is the code same ? > FSDownload > -- > > Key: YARN-3791 > URL: https://issues.apache.org/jira/browse/YARN-3791 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 > Environment: Linux 2.6.32-279.el6.x86_64 >Reporter: HuanWang > > Inadvertently,we set two source ftp path: > {code} > { { ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null > },pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING} > ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null > },pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING} > the first one is a wrong path,only one source was set this;but Follow the > log,i saw Starting from the first path source download,All next jobs sources > were downloaded from ftp://10.27.178.207 by default. > {code} > the log is : > {code} > 2015-06-09 11:14:34,653 INFO [AsyncDispatcher event handler] > localizer.ResourceLocalizationService > (ResourceLocalizationService.java:addResource(544)) - Downloading public > rsrc:{ ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, > null } > 2015-06-09 11:14:34,653 INFO [AsyncDispatcher event handler] > localizer.ResourceLocalizationService > (ResourceLocalizationService.java:addResource(544)) - Downloading public > rsrc:{ ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, > null } > 2015-06-09 11:14:37,883 INFO [Public Localizer] > localizer.ResourceLocalizationService > (ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { > ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null > },pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING} > java.io.IOException: Login failed on server - 10.27.178.207, port - 21 > at > org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133) > at > org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390) > at > com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172) > at > com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279) > at > com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 2015-06-09 11:14:37,885 INFO [Public Localizer] localizer.LocalizedResource > (LocalizedResource.java:handle(203)) - Resource > ftp://10.27.178.207:21/home/cbt/1213/jxf.sql transitioned from DOWNLOADING to > FAILED > 2015-06-09 11:14:37,886 INFO [Public Localizer] > localizer.ResourceLocalizationService > (ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { > ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null > },pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING} > java.io.IOException: Login failed on 
server - 10.27.178.207, port - 21 > at > org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133) > at > org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390) > at > com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172) > at > com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279) > at > com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 2015-06-09 11:14:37,886 INFO [AsyncDispatcher event handler] > container.Container (ContainerImpl.java:handle(853)) - Container > container_20150608111420_41540_1213_1503_ transitioned from LOCALIZING to > LOCALIZ
[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581358#comment-14581358 ] Brahma Reddy Battula commented on YARN-3793: [~kasha] thanks for reporting this JIRA. It is a dupe of HADOOP-11878. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1042) add ability to specify affinity/anti-affinity in container requests
[ https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-1042: Attachment: YARN-1042.001.patch > add ability to specify affinity/anti-affinity in container requests > --- > > Key: YARN-1042 > URL: https://issues.apache.org/jira/browse/YARN-1042 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager > Affects Versions: 3.0.0 > Reporter: Steve Loughran > Assignee: Arun C Murthy > Attachments: YARN-1042-demo.patch, YARN-1042.001.patch > > > Container requests to the AM should be able to request anti-affinity to ensure that things like Region Servers don't come up in the same failure zone. > Similarly, you may want to specify affinity to the same host or rack without specifying which specific host/rack. Example: bringing up a small Giraph cluster in a large YARN cluster would benefit from having the processes in the same rack purely for bandwidth reasons. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1042) add ability to specify affinity/anti-affinity in container requests
[ https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-1042: Attachment: (was: YARN-1042-001.patch) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581349#comment-14581349 ] Hong Zhiguo commented on YARN-2768: [~kasha], the execution time displayed in the profiling output is cumulative. Actually, I repeated such profiling many times and got the same ratio. The profiling was done with a cluster of NM/AM simulators, and I don't have those resources now. I wrote a testcase which creates 8000 nodes and 4500 apps within 1200 queues, then performs 1 rounds of FairScheduler.update() and prints the average execution time of one call to update. With this patch, the average execution time decreased from about 35ms to 20ms. I think the effect comes from GC and memory allocation, since in each round of FairScheduler.update(), Resources.multiply is called as many times as there are pending ResourceRequests, which is more than 3 million in our production cluster. > optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread > > > Key: YARN-2768 > URL: https://issues.apache.org/jira/browse/YARN-2768 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler > Reporter: Hong Zhiguo > Assignee: Hong Zhiguo > Priority: Minor > Attachments: YARN-2768.patch, profiling_FairScheduler_update.png > > > See the attached picture of the profiling result. The clone of the Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) of the CPU time of FairScheduler.update(). > The code of FSAppAttempt.updateDemand: > {code} > public void updateDemand() { > demand = Resources.createResource(0); > // Demand is current consumption plus outstanding requests > Resources.addTo(demand, app.getCurrentConsumption()); > // Add up outstanding resource requests > synchronized (app) { > for (Priority p : app.getPriorities()) { > for (ResourceRequest r : app.getResourceRequests(p).values()) { > Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); > Resources.addTo(demand, total); > } > } > } > } > {code} > The code of Resources.multiply: > {code} > public static Resource multiply(Resource lhs, double by) { > return multiplyTo(clone(lhs), by); > } > {code} > The clone could be skipped by directly updating the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
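To make the proposal at the end of the description concrete, here is a minimal sketch of accumulating each request directly into the existing {{demand}} object instead of cloning a Resource per request. It is only an illustration of the approach, not necessarily the attached YARN-2768.patch:
{code}
// Illustrative sketch: scale each outstanding request into 'demand' in place,
// so no temporary Resource object is cloned per ResourceRequest.
synchronized (app) {
  for (Priority p : app.getPriorities()) {
    for (ResourceRequest r : app.getResourceRequests(p).values()) {
      Resource capability = r.getCapability();
      int containers = r.getNumContainers();
      demand.setMemory(demand.getMemory() + capability.getMemory() * containers);
      demand.setVirtualCores(
          demand.getVirtualCores() + capability.getVirtualCores() * containers);
    }
  }
}
{code}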
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581151#comment-14581151 ] Hadoop QA commented on YARN-3790: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 3s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 45s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 50m 55s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 88m 51s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12738905/YARN-3790.000.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c7729ef | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8235/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8235/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8235/console | This message was automatically generated. > TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in > trunk for FS scheduler > > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith >Assignee: zhihai xu > Attachments: YARN-3790.000.patch > > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! 
> java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3793) Several NPEs when deleting local files on NM recovery
Karthik Kambatla created YARN-3793: -- Summary: Several NPEs when deleting local files on NM recovery Key: YARN-3793 URL: https://issues.apache.org/jira/browse/YARN-3793 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical When NM work-preserving restart is enabled, we see several NPEs on recovery. These seem to correspond to sub-directories that need to be deleted. I wonder if null pointers here mean incorrect tracking of these resources and a potential leak. This JIRA is to investigate and fix anything required. Logs show: {noformat} 2015-05-18 07:06:10,225 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : null 2015-05-18 07:06:10,224 ERROR org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during execution of task in DeletionService java.lang.NullPointerException at org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274) at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458) at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
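Until the root cause of the null path is understood, a defensive guard along the following lines would at least avoid the NPE shown above. This is purely illustrative; the names ({{lfs}}, {{subDir}}) are assumptions about the surrounding DefaultContainerExecutor code, and skipping the path would not explain why it is null in the first place:
{code}
// Illustrative only: skip null paths before handing them to FileContext,
// which is where the NPE above originates (fixRelativePart).
if (subDir == null) {
  LOG.warn("Skipping deletion task with a null path");
} else {
  lfs.delete(subDir, true);
}
{code}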
[jira] [Updated] (YARN-3793) Several NPEs when deleting local files on NM recovery
[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3793: Priority: Major (was: Critical) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581026#comment-14581026 ] zhihai xu commented on YARN-3790: - I uploaded a patch YARN-3790.000.patch which will move {{updateRootQueueMetrics}} after {{recoverContainersOnNode}} in {{addNode}}. > TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in > trunk for FS scheduler > > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith >Assignee: zhihai xu > Attachments: YARN-3790.000.patch > > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! > java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3790: Attachment: YARN-3790.000.patch > TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in > trunk for FS scheduler > > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith >Assignee: zhihai xu > Attachments: YARN-3790.000.patch > > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! > java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580965#comment-14580965 ] zhihai xu commented on YARN-3790: - [~rohithsharma], thanks for updating the title. The containers are recovered. {{rootMetrics}}'s used resource is also updated, But {{rootMetrics}}'s available resource is not updated. The following logs in the failed test proved it: {code} 2015-06-09 22:55:42,964 INFO [ResourceManager Event Processor] fair.FairScheduler (FairScheduler.java:addNode(855)) - Added node 127.0.0.1:1234 cluster capacity: 2015-06-09 22:55:42,964 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756)) - Processing event for application_1433915736884_0001 of type NODE_UPDATE 2015-06-09 22:55:42,964 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:processNodeUpdate(820)) - Received node update event:NODE_USABLE for node:127.0.0.1:1234 with state:RUNNING 2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(287)) - The updated demand for root.default is ; the max is 2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.default is 2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(163)) - Counting resource from root.default ; Total resource consumption for root now 2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemandForApp(298)) - Counting resource from application_1433915736884_0001 ; Total resource consumption for root.zxu now 2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(287)) - The updated demand for root.zxu is ; the max is 2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.zxu is 2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(163)) - Counting resource from root.zxu ; Total resource consumption for root now 2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(177)) - The updated demand for root is ; the max is 2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSQueue (FSQueue.java:setFairShare(196)) - The updated fairShare for root is 2015-06-09 22:55:42,965 INFO [ResourceManager Event Processor] scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1433915736884 } attemptId: 1 } id: 1 } container_state: C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0 nodeLabelExpression: "" 2015-06-09 22:55:42,965 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing container_1433915736884_0001_01_01 of type RECOVER 2015-06-09 22:55:42,965 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType: APP_RUNNING_ON_NODE 2015-06-09 22:55:42,965 INFO [ResourceManager Event Processor] rmcontainer.RMContainerImpl 
(RMContainerImpl.java:handle(394)) - container_1433915736884_0001_01_01 Container Transitioned from NEW to RUNNING 2015-06-09 22:55:42,965 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756)) - Processing event for application_1433915736884_0001 of type APP_RUNNING_ON_NODE 2015-06-09 22:55:42,965 INFO [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned container container_1433915736884_0001_01_01 of capacity on host 127.0.0.1:1234, which has 1 containers, used and available after allocation 2015-06-09 22:55:42,966 INFO [ResourceManager Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt appattempt_1433915736884_0001_01 is recovering container container_1433915736884_0001_01_01 2015-06-09 22:55:42,966 INFO [ResourceManager Event Processor] scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1433915736884 } attemptId: 1 } id: 2 } container_state: C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time:
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580961#comment-14580961 ] Hadoop QA commented on YARN-3051: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 17m 26s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 7m 56s | There were no new javac warning messages. | | {color:red}-1{color} | javadoc | 10m 12s | The applied patch generated 11 additional warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 22s | The applied patch generated 25 new checkstyle issues (total was 243, now 267). | | {color:green}+1{color} | shellcheck | 0m 6s | There were no new shellcheck (v0.3.3) issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 40s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 4m 2s | The patch appears to introduce 5 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 22s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 59s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 1m 27s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 48m 2s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-timelineservice | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12738884/YARN-3051-YARN-2928.04.patch | | Optional Tests | shellcheck javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / 0a3c147 | | javadoc | https://builds.apache.org/job/PreCommit-YARN-Build/8234/artifact/patchprocess/diffJavadocWarnings.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8234/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8234/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-timelineservice.html | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8234/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8234/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8234/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8234/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8234/console | This message was automatically generated. 
> [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051-YARN-2928.003.patch, > YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, > YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580910#comment-14580910 ] Varun Saxena commented on YARN-3051: As of now, there are very similar APIs for getEntity/getFlowEntity/getUserEntity, etc. Would it be fine to combine these APIs and pass something like a query type (ENTITY/USER/FLOW, etc.) in the API, which the storage implementation can then use to decide which type of query it is? > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051-YARN-2928.003.patch, > YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, > YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
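One way to picture the single-API-plus-query-type idea from the comment above is sketched below. The enum and method names are purely illustrative assumptions, not the interface being committed under YARN-3051, and the return type is simplified to a set of ids.
{code}
import java.io.IOException;
import java.util.Set;

/** Illustrative only; not the real ATS reader interface. */
public interface TimelineReaderSketch {

  /** Kind of object being read, so one method can serve entity, flow and user queries. */
  enum TimelineQueryType { ENTITY, FLOW, FLOW_RUN, USER }

  /**
   * Single read method; the storage implementation branches on the query type
   * instead of exposing separate getEntity/getFlowEntity/getUserEntity calls.
   */
  Set<String> read(TimelineQueryType type, String clusterId, String userId,
      String flowId, Long limit) throws IOException;
}
{code}
The trade-off is that one signature then has to carry the union of all query parameters, which is the parameter-list concern discussed elsewhere in this thread.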
[jira] [Updated] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3051: --- Attachment: YARN-3051-YARN-2928.04.patch > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051-YARN-2928.003.patch, > YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, > YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-2578: --- Attachment: YARN-2578.002.patch I attached 002 which makes rpcTimeout configurable by "ipc.client.rpc.timeout". The default value is 0 in order to keep current behaviour. We can test timeout by changing the value explicitly and change the default value later after some tests. I also left Client#getTimeout as is to keep compatibility. > NM does not failover timely if RM node network connection fails > --- > > Key: YARN-2578 > URL: https://issues.apache.org/jira/browse/YARN-2578 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.1 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg > Attachments: YARN-2578.002.patch, YARN-2578.patch > > > The NM does not fail over correctly when the network cable of the RM is > unplugged or the failure is simulated by a "service network stop" or a > firewall that drops all traffic on the node. The RM fails over to the standby > node when the failure is detected as expected. The NM should than re-register > with the new active RM. This re-register takes a long time (15 minutes or > more). Until then the cluster has no nodes for processing and applications > are stuck. > Reproduction test case which can be used in any environment: > - create a cluster with 3 nodes > node 1: ZK, NN, JN, ZKFC, DN, RM, NM > node 2: ZK, NN, JN, ZKFC, DN, RM, NM > node 3: ZK, JN, DN, NM > - start all services make sure they are in good health > - kill the network connection of the RM that is active using one of the > network kills from above > - observe the NN and RM failover > - the DN's fail over to the new active NN > - the NM does not recover for a long time > - the logs show a long delay and traces show no change at all > The stack traces of the NM all show the same set of threads. The main thread > which should be used in the re-register is the "Node Status Updater" This > thread is stuck in: > {code} > "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in > Object.wait() [0x7f5a51fc1000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at java.lang.Object.wait(Object.java:503) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at org.apache.hadoop.ipc.Client.call(Client.java:1362) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > {code} > The client connection which goes through the proxy can be traced back to the > ResourceTrackerPBClientImpl. The generated proxy does not time out and we > should be using a version which takes the RPC timeout (from the > configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
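For illustration only, and assuming the 002 patch is applied, the new key could be exercised as below; the property name comes from the comment above, while the 60-second value is just an example, since the patch deliberately defaults to 0 (no timeout) to keep the current behaviour.
{code}
import org.apache.hadoop.conf.Configuration;

public class RpcTimeoutConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Setting a non-zero value would make NM-to-RM RPCs fail fast when the RM
    // host drops off the network, so the NM can re-register with the new
    // active RM sooner instead of waiting on a hung connection.
    conf.setInt("ipc.client.rpc.timeout", 60000);
    System.out.println("ipc.client.rpc.timeout = "
        + conf.getInt("ipc.client.rpc.timeout", 0) + " ms");
  }
}
{code}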
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580858#comment-14580858 ] Varun Saxena commented on YARN-3051: [~zjshen], thanks for your inputs. I will brief you on the APIs I have decided on as of now. # APIs for querying an individual entity/flow/flow run/user, and APIs for querying a set of entities/flow runs/flows/users. APIs for a set of flows/users will contain aggregated data. The reason for separate endpoints for entities, flows, users, etc. is the different tables in the HBase/Phoenix schema. # Most of the APIs will be variations of either getting a single entity or a set of entities, so I will primarily talk about an entity and a set of entities in subsequent points. # For getting a set of entities, there will be 3 kinds of filters: filtering on the basis of info, filtering on configs and filtering on metrics. Filtering on the basis of info and configs will be based on equality, for instance, fetch entities which have a config name matching a specific config value. Metrics filtering, though, will be on the basis of a relational operator; for instance, a user can query entities which have a specific metric >= a certain value. # In addition to that, certain predicates such as limit, windowStart, windowEnd, etc., which used to exist in ATSv1, exist even now. Some predicates such as fromId and fromTs may not make sense in ATSv2, but I have included them for now with the intention of discussion. # Additional predicates such as a metrics windowStart and windowEnd have been specified to fetch metrics data for a specific time span. I included these because they can aid in plotting graphs on the UI for a specific metric of some entity. # Only entity id, type, created and modified time will be returned if fields are not specified in the REST URL. This will be the default view of an entity. # Moreover, you can also specify which configurations and metrics to return. # Every query param will be received as a String, even timestamps. Now, from the backing storage implementation's viewpoint, would it make more sense to pass these query params as strings or to do datatype conversion? A few concerns from Li Lu regarding the parameter list becoming too long are quite valid, as most of the parameters will be nulls. We can also club multiple related parameters into different classes to reduce them, or, as he said, have different methods for frequently occurring use cases. Thoughts? Comments are welcome so that this JIRA can speed up, probably after Hadoop Summit :) > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051-YARN-2928.003.patch, > YARN-3051-YARN-2928.03.patch, YARN-3051.wip.02.YARN-2928.patch, > YARN-3051.wip.patch, YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
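To make the parameter discussion above more concrete, here is one possible way to club the related query parameters into holder classes, along the lines Li Lu suggested. All names here are hypothetical assumptions; the classes actually added under YARN-3051 may be organised quite differently.
{code}
import java.util.Map;
import java.util.Set;

/** Illustrative grouping of read-query parameters; not the real YARN-3051 classes. */
public final class TimelineReadRequestSketch {

  /** Relational operator for metric filters, e.g. "metric >= value". */
  public enum Relation { EQ, NE, GT, GE, LT, LE }

  /** A single relational filter on a metric. */
  public static final class MetricFilter {
    public final String metricName;
    public final Relation relation;
    public final long value;

    public MetricFilter(String metricName, Relation relation, long value) {
      this.metricName = metricName;
      this.relation = relation;
      this.value = value;
    }
  }

  // Predicates carried over from ATSv1-style queries.
  public Long limit;
  public Long windowStart;
  public Long windowEnd;

  // Equality filters on info and config key-value pairs.
  public Map<String, Object> infoFilters;
  public Map<String, String> configFilters;

  // Relational filters on metrics.
  public Set<MetricFilter> metricFilters;

  // Which configs/metrics to return; null means the default view
  // (entity id, type, created and modified time only).
  public Set<String> configsToRetrieve;
  public Set<String> metricsToRetrieve;
}
{code}
Grouping the parameters this way also lets the REST layer parse the string query params once before building the request object, which partly answers the string-versus-typed question for the storage implementations.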
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580821#comment-14580821 ] Bibin A Chundatt commented on YARN-3789: With this patch there is no increase in number of lines. Checkstyle issue seems unrelated > Refactor logs for LeafQueue#activateApplications() to remove duplicate logging > -- > > Key: YARN-3789 > URL: https://issues.apache.org/jira/browse/YARN-3789 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, > 0003-YARN-3789.patch > > > Duplicate logging from resource manager > during am limit check for each application > {code} > 015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
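As a rough illustration of the refactoring direction (not the attached patch itself), the duplicate lines disappear if the am-limit condition is counted inside the activation loop and summarised once per pass. The class below is a self-contained toy with assumed names, not LeafQueue code.
{code}
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AmLimitLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(AmLimitLoggingSketch.class);

  /** appAmDemands holds the AM resource demand of each pending application in MB. */
  void activateApplications(List<Integer> appAmDemands, int amUsed, int amLimit) {
    int skipped = 0;
    for (int amDemand : appAmDemands) {
      if (amUsed + amDemand > amLimit) {
        skipped++;           // count instead of logging per application
        continue;
      }
      amUsed += amDemand;    // "activate" the application
    }
    if (skipped > 0) {
      // One summary line per pass instead of one identical line per application.
      LOG.info("Not activating {} application(s) because amIfStarted exceeds amLimit",
          skipped);
    }
  }
}
{code}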
[jira] [Created] (YARN-3792) Test case failures in TestDistributedShell after changes for subjira's of YARN-2928
Naganarasimha G R created YARN-3792: --- Summary: Test case failures in TestDistributedShell after changes for subjira's of YARN-2928 Key: YARN-3792 URL: https://issues.apache.org/jira/browse/YARN-3792 Project: Hadoop YARN Issue Type: Sub-task Reporter: Naganarasimha G R Assignee: Naganarasimha G R Encountered [testcase failures|https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/] which were happening even without the patch modifications in YARN-3044: TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580508#comment-14580508 ] Naganarasimha G R commented on YARN-3044: - [~zjshen], it seems like many of the test case failures in TestDistributedShell, TestDistributedShellWithNodeLabels, etc. are not related to this JIRA, so I am opening a new JIRA to handle them; based on past experience, it is better to handle them in a new JIRA so that duplicate effort is avoided. > [Event producers] Implement RM writing app lifecycle events to ATS > -- > > Key: YARN-3044 > URL: https://issues.apache.org/jira/browse/YARN-3044 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3044-YARN-2928.004.patch, > YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, > YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, > YARN-3044-YARN-2928.009.patch, YARN-3044-YARN-2928.010.patch, > YARN-3044-YARN-2928.011.patch, YARN-3044.20150325-1.patch, > YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch > > > Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3791) FSDownload
[ https://issues.apache.org/jira/browse/YARN-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] HuanWang updated YARN-3791: --- Description: Inadvertently,we set two source ftp path: {code} { { ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null },pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING} ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null },pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING} the first one is a wrong path,only one source was set this;but Follow the log,i saw Starting from the first path source download,All next jobs sources were downloaded from ftp://10.27.178.207 by default. {code} the log is : {code} 2015-06-09 11:14:34,653 INFO [AsyncDispatcher event handler] localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(544)) - Downloading public rsrc:{ ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null } 2015-06-09 11:14:34,653 INFO [AsyncDispatcher event handler] localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(544)) - Downloading public rsrc:{ ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null } 2015-06-09 11:14:37,883 INFO [Public Localizer] localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null },pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING} java.io.IOException: Login failed on server - 10.27.178.207, port - 21 at org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133) at org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390) at com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172) at com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279) at com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-09 11:14:37,885 INFO [Public Localizer] localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource ftp://10.27.178.207:21/home/cbt/1213/jxf.sql transitioned from DOWNLOADING to FAILED 2015-06-09 11:14:37,886 INFO [Public Localizer] localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null },pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING} java.io.IOException: Login failed on server - 10.27.178.207, port - 21 at org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133) at org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390) at com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172) at com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279) at com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at 
java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-09 11:14:37,886 INFO [AsyncDispatcher event handler] container.Container (ContainerImpl.java:handle(853)) - Container container_20150608111420_41540_1213_1503_ transitioned from LOCALIZING to LOCALIZATION_FAILED 2015-06-09 11:14:37,887 INFO [Public Localizer] localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource ftp://10.27.89.13:21/home/cbt/common/2/sql.jar transitioned from DOWNLOADING to FAILED 2015-06-09 11:14:37,887 INFO [AsyncDispatcher event handler] localizer.LocalResourcesTrackerImpl (LocalResourcesTrackerImpl.java:handle(133)) - Container container_20150608111420_41540_1213_1503_ sent RELEASE event on a resource request { ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null } not present {code} I debug the code of yarn.I found the piont is org.apache.hadoop.fs.FileSystem#cache the code source is here: {code} private Fil
[jira] [Updated] (YARN-3791) FSDownload
[ https://issues.apache.org/jira/browse/YARN-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] HuanWang updated YARN-3791: --- Description: Inadvertently,we set two source ftp path: {code} { { ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null },pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING} ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null },pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING} the first one is a wrong path,only one source was set this;but Follow the log,i saw Starting from the first path source download,All next jobs sources were downloaded from ftp://10.27.178.207 by default. the log is : 2015-06-09 11:14:34,653 INFO [AsyncDispatcher event handler] localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(544)) - Downloading public rsrc:{ ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null } 2015-06-09 11:14:34,653 INFO [AsyncDispatcher event handler] localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(544)) - Downloading public rsrc:{ ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null } 2015-06-09 11:14:37,883 INFO [Public Localizer] localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null },pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING} java.io.IOException: Login failed on server - 10.27.178.207, port - 21 at org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133) at org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390) at com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172) at com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279) at com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-09 11:14:37,885 INFO [Public Localizer] localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource ftp://10.27.178.207:21/home/cbt/1213/jxf.sql transitioned from DOWNLOADING to FAILED 2015-06-09 11:14:37,886 INFO [Public Localizer] localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null },pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING} java.io.IOException: Login failed on server - 10.27.178.207, port - 21 at org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133) at org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390) at com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172) at com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279) at com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at 
java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-09 11:14:37,886 INFO [AsyncDispatcher event handler] container.Container (ContainerImpl.java:handle(853)) - Container container_20150608111420_41540_1213_1503_ transitioned from LOCALIZING to LOCALIZATION_FAILED 2015-06-09 11:14:37,887 INFO [Public Localizer] localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource ftp://10.27.89.13:21/home/cbt/common/2/sql.jar transitioned from DOWNLOADING to FAILED 2015-06-09 11:14:37,887 INFO [AsyncDispatcher event handler] localizer.LocalResourcesTrackerImpl (LocalResourcesTrackerImpl.java:handle(133)) - Container container_20150608111420_41540_1213_1503_ sent RELEASE event on a resource request { ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null } not present {code} I debug the code of yarn.I found the piont is org.apache.hadoop.fs.FileSystem#cache the code source is here: {code} private FileSystem getI
[jira] [Created] (YARN-3791) FSDownload
HuanWang created YARN-3791: -- Summary: FSDownload Key: YARN-3791 URL: https://issues.apache.org/jira/browse/YARN-3791 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Environment: Linux 2.6.32-279.el6.x86_64 Reporter: HuanWang Inadvertently,we set two source ftp path: { { ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null },pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING} ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null },pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING} the first one is a wrong path,only one source was set this;but Follow the log,i saw Starting from the first path source download,All next jobs sources were downloaded from ftp://10.27.178.207 by default. the log is : 2015-06-09 11:14:34,653 INFO [AsyncDispatcher event handler] localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(544)) - Downloading public rsrc:{ ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null } 2015-06-09 11:14:34,653 INFO [AsyncDispatcher event handler] localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(544)) - Downloading public rsrc:{ ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null } 2015-06-09 11:14:37,883 INFO [Public Localizer] localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { ftp://10.27.178.207:21/home/cbt/1213/jxf.sql, 143322551, FILE, null },pending,[(container_20150608111420_41540_1213_1503_)],4237640867118938,DOWNLOADING} java.io.IOException: Login failed on server - 10.27.178.207, port - 21 at org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133) at org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390) at com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172) at com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279) at com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-09 11:14:37,885 INFO [Public Localizer] localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource ftp://10.27.178.207:21/home/cbt/1213/jxf.sql transitioned from DOWNLOADING to FAILED 2015-06-09 11:14:37,886 INFO [Public Localizer] localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(672)) - Failed to download rsrc { { ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 1433225415000, FILE, null },pending,[(container_20150608111420_41540_1213_1503_)],4237640866988089,DOWNLOADING} java.io.IOException: Login failed on server - 10.27.178.207, port - 21 at org.apache.hadoop.fs.ftp.FTPFileSystem.connect(FTPFileSystem.java:133) at org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:390) at com.suning.cybertron.superion.util.FSDownload.copy(FSDownload.java:172) at com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:279) at com.suning.cybertron.superion.util.FSDownload.call(FSDownload.java:52) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-09 11:14:37,886 INFO [AsyncDispatcher event handler] container.Container (ContainerImpl.java:handle(853)) - Container container_20150608111420_41540_1213_1503_ transitioned from LOCALIZING to LOCALIZATION_FAILED 2015-06-09 11:14:37,887 INFO [Public Localizer] localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource ftp://10.27.89.13:21/home/cbt/common/2/sql.jar transitioned from DOWNLOADING to FAILED 2015-06-09 11:14:37,887 INFO [AsyncDispatcher event handler] localizer.LocalResourcesTrackerImpl (LocalResourcesTrackerImpl.java:handle(133)) - Container container_20150608111420_41540_1213_1503_ sent RELEASE event on a resource request { ftp://10.27.89.13:21/home/cbt/common/2/sql.jar, 14
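If the report above is right that org.apache.hadoop.fs.FileSystem's cache is handing back an FTP client bound to the wrong host, a possible workaround while the root cause is investigated is to keep ftp:// file systems out of the cache. The snippet below is only a sketch of that idea under that assumption; the URI is taken from the log above, and the fs.ftp.impl.disable.cache key follows Hadoop's generic fs.<scheme>.impl.disable.cache pattern.
{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FtpCacheWorkaroundExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Option 1: never cache ftp:// FileSystem instances.
    conf.setBoolean("fs.ftp.impl.disable.cache", true);

    // Option 2: bypass the cache for one particular source.
    URI source = new URI("ftp://10.27.89.13:21/home/cbt/common/2/sql.jar");
    FileSystem fs = FileSystem.newInstance(source, conf);  // fresh, uncached instance
    try {
      System.out.println("Using file system for " + fs.getUri());
    } finally {
      fs.close();  // safe to close since it is not shared through the cache
    }
  }
}
{code}
Disabling the cache costs a new FTP connection per localization, so it is a diagnostic aid rather than a fix for whatever the cache-key problem turns out to be.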
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580378#comment-14580378 ] Hadoop QA commented on YARN-3044: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 29s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 7m 55s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 56s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 42s | The applied patch generated 1 new checkstyle issues (total was 236, now 236). | | {color:green}+1{color} | whitespace | 0m 6s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 42s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 5m 46s | The patch appears to introduce 8 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 22s | Tests passed in hadoop-yarn-api. | | {color:red}-1{color} | yarn tests | 6m 55s | Tests failed in hadoop-yarn-applications-distributedshell. | | {color:green}+1{color} | yarn tests | 0m 26s | Tests passed in hadoop-yarn-server-common. | | {color:red}-1{color} | yarn tests | 61m 37s | Tests failed in hadoop-yarn-server-resourcemanager. | | {color:green}+1{color} | yarn tests | 1m 15s | Tests passed in hadoop-yarn-server-timelineservice. 
| | | | 118m 6s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-applications-distributedshell | | FindBugs | module:hadoop-yarn-server-resourcemanager | | Failed unit tests | hadoop.yarn.applications.distributedshell.TestDistributedShellWithNodeLabels | | | hadoop.yarn.applications.distributedshell.TestDistributedShell | | | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12738768/YARN-3044-YARN-2928.011.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / 0a3c147 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/YARN-2928FindbugsWarningshadoop-yarn-server-common.html | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-applications-distributedshell.html | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-applications-distributedshell test log | https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/testrun_hadoop-yarn-applications-distributedshell.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8233/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8233/console | This message was automatically generated. > [Event producers] Implement RM writing app lifecycle events to ATS > -- > > Key: YARN-3044 > URL: https://issues.apache.org/jira/browse/YARN-3044 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ti
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580275#comment-14580275 ] Hadoop QA commented on YARN-2194: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 56s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 37s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 36s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 13s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 6s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 43m 44s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12738765/YARN-2194-4.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 6785661 | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8232/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8232/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8232/console | This message was automatically generated. > Cgroups cease to work in RHEL7 > -- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch, > YARN-2194-4.patch > > > In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the > controller name leads to container launch failure. > RHEL7 deprecates libcgroup and recommends the user of systemd. However, > systemd has certain shortcomings as identified in this JIRA (see comments). > This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580255#comment-14580255 ] Rohith commented on YARN-3790: -- Thanks for looking into this issue. bq. If the UpdateThread calls update after recoverContainersOnNode, the test will succeed In the test, I see the below code, which verifies that the containers are recovered, right? {code} // Wait for RM to settle down on recovering containers; waitForNumContainersToRecover(2, rm2, am1.getApplicationAttemptId()); Set launchedContainers = ((RMNodeImpl) rm2.getRMContext().getRMNodes().get(nm1.getNodeId())) .getLaunchedContainers(); assertTrue(launchedContainers.contains(amContainer.getContainerId())); assertTrue(launchedContainers.contains(runningContainer.getContainerId())); {code} Am I missing anything? > TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in > trunk for FS scheduler > > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith >Assignee: zhihai xu > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! > java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580253#comment-14580253 ] Naganarasimha G R commented on YARN-3044: - Hi [~zjshen], I have taken care of the issue which you mentioned and also added some test cases in TestDistributedShell to cover it (along with some code refactoring). Please review. bq. I'm not sure because as far as I can tell, NM's impl is different from RM's, but it's up to you to figure out the proper solution Yep, I will start doing that now, but I am getting the experts' advice to make my job easier ;) > [Event producers] Implement RM writing app lifecycle events to ATS > -- > > Key: YARN-3044 > URL: https://issues.apache.org/jira/browse/YARN-3044 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3044-YARN-2928.004.patch, > YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, > YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, > YARN-3044-YARN-2928.009.patch, YARN-3044-YARN-2928.010.patch, > YARN-3044-YARN-2928.011.patch, YARN-3044.20150325-1.patch, > YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch > > > Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3044: Attachment: YARN-3044-YARN-2928.011.patch > [Event producers] Implement RM writing app lifecycle events to ATS > -- > > Key: YARN-3044 > URL: https://issues.apache.org/jira/browse/YARN-3044 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-3044-YARN-2928.004.patch, > YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, > YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, > YARN-3044-YARN-2928.009.patch, YARN-3044-YARN-2928.010.patch, > YARN-3044-YARN-2928.011.patch, YARN-3044.20150325-1.patch, > YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch > > > Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3790: - Summary: TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler (was: TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler) > TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in > trunk for FS scheduler > > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith >Assignee: zhihai xu > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! > java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580228#comment-14580228 ] Rohith commented on YARN-3790: -- bq. I think this test fails intermittently. Yes, it is failing intermittently. Maybe the issue summary can be updated. > TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS > scheduler > - > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith >Assignee: zhihai xu > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! > java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2194: -- Attachment: YARN-2194-4.patch Uploaded a patch by replacing comma with '%'. > Cgroups cease to work in RHEL7 > -- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch, > YARN-2194-4.patch > > > In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the > controller name leads to container launch failure. > RHEL7 deprecates libcgroup and recommends the user of systemd. However, > systemd has certain shortcomings as identified in this JIRA (see comments). > This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
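A toy demonstration of the escaping idea in the comment above. Assuming the comma in the RHEL7 controller name "cpu,cpuacct" collides with a comma-separated options string somewhere on the container-launch path, replacing it with '%' before serialization and restoring it afterwards avoids the clash. This is only an illustration with made-up paths, not the code in YARN-2194-4.patch.
{code}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CgroupCommaEscapeDemo {

  static String escape(String cgroupPath) {
    return cgroupPath.replace(",", "%");
  }

  static String unescape(String escaped) {
    return escaped.replace("%", ",");
  }

  public static void main(String[] args) {
    List<String> cgroupPaths = Arrays.asList(
        "/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_01/tasks",
        "/sys/fs/cgroup/memory/hadoop-yarn/container_01/tasks");

    // Serialized form passed along as a single comma-separated option.
    String resourcesOption = cgroupPaths.stream()
        .map(CgroupCommaEscapeDemo::escape)
        .collect(Collectors.joining(","));
    System.out.println(resourcesOption);

    // The receiving side can now split on ',' safely and restore the paths.
    Arrays.stream(resourcesOption.split(","))
        .map(CgroupCommaEscapeDemo::unescape)
        .forEach(System.out::println);
  }
}
{code}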
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580209#comment-14580209 ] zhihai xu commented on YARN-3790: - Hi [~rohithsharma], thanks for reporting this issue. I think this test fails intermittently. The following is the stack trace for the test failure: {code} java.lang.AssertionError: expected:<6144> but was:<8192> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:852) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:341) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:240) {code} The failure is that {{rootMetrics}}'s available resource is not correct for the FairScheduler, and I know what causes it. For the FairScheduler, {{updateRootQueueMetrics}} is used to update {{rootMetrics}}'s available resource, but {{updateRootQueueMetrics}} is not called in or after {{recoverContainersOnNode}}, so in this case we can only depend on the UpdateThread to update {{rootMetrics}}'s available resource. Currently the UpdateThread is triggered in {{addNode}}, and its timing decides whether this test succeeds: if the UpdateThread calls {{update}} after {{recoverContainersOnNode}}, the test will pass; if it calls {{update}} before {{recoverContainersOnNode}}, the test will fail. > TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS > scheduler > - > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith >Assignee: zhihai xu > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! > java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
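To make the race described in the comment above concrete, one hypothetical way to de-flake the assertion would be for the test to wait until the FairScheduler's UpdateThread has refreshed {{rootMetrics}} before checking the available resource. This is a sketch under assumptions and not the actual fix for this JIRA; {{QueueMetrics#getAvailableMB}} and the JUnit assertion are existing APIs, but the helper method itself is made up for illustration.
{code}
// Hypothetical helper, not the actual YARN-3790 fix: poll until the
// scheduler's UpdateThread has republished the root queue metrics instead
// of asserting immediately after recoverContainersOnNode.
// Assumes the usual test imports:
//   import static org.junit.Assert.assertEquals;
//   import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;
private static void waitForAvailableMB(QueueMetrics rootMetrics,
    long expectedMB, long timeoutMs) throws InterruptedException {
  long deadline = System.currentTimeMillis() + timeoutMs;
  while (rootMetrics.getAvailableMB() != expectedMB
      && System.currentTimeMillis() < deadline) {
    Thread.sleep(50); // give the UpdateThread a chance to run update()
  }
  assertEquals(expectedMB, rootMetrics.getAvailableMB());
}
{code}
With a helper like this, the expected:<6144> but was:<8192> mismatch in the trace above would only surface if the UpdateThread never ran within the timeout, rather than whenever it happened to run before {{recoverContainersOnNode}}.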
[jira] [Assigned] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu reassigned YARN-3790: --- Assignee: zhihai xu > TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS > scheduler > - > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith >Assignee: zhihai xu > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! > java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)