[ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reopened YARN-2368:
-------------------------

We can not only make yarn.resourcemanager.zk-jutemaxbuffer-bytes configurable, 
but also control whether we retry or simply log application information that 
identifies which application blew up the ZooKeeper buffer. That way we avoid 
the GC problem caused by retrying too much and timing out the ZooKeeper 
connection, and we can also find the root application that caused the buffer 
to overflow.
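
A minimal sketch of that idea (not the attached patch): read a configurable size 
ceiling and, if the serialized application state exceeds it, log the offending 
application and fail fast instead of retrying until the ZooKeeper session times 
out. The property name is taken from the comment above; the class name, default 
value and log message are hypothetical.

{code}
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;

// Sketch only: guard a ZooKeeper write with a configurable buffer limit and
// report which application produced the oversized znode.
public class ZnodeSizeGuard {
  private static final Log LOG = LogFactory.getLog(ZnodeSizeGuard.class);

  // Hypothetical property name, taken from the comment above.
  public static final String ZK_MAX_BUFFER_BYTES =
      "yarn.resourcemanager.zk-jutemaxbuffer-bytes";
  public static final long DEFAULT_ZK_MAX_BUFFER_BYTES = 1024 * 1024; // ZooKeeper's 1MB default

  private final long maxBytes;

  public ZnodeSizeGuard(Configuration conf) {
    this.maxBytes = conf.getLong(ZK_MAX_BUFFER_BYTES, DEFAULT_ZK_MAX_BUFFER_BYTES);
  }

  /** Call before the ZooKeeper create/setData; throws if the data cannot fit. */
  public void check(String appId, byte[] serializedState) throws IOException {
    if (serializedState.length > maxBytes) {
      LOG.error("Application " + appId + " produced " + serializedState.length
          + " bytes of state, larger than the configured limit of " + maxBytes
          + " bytes; failing fast instead of retrying the ZooKeeper operation.");
      throw new IOException("Znode data too large for " + appId);
    }
  }
}
{code}

The point is simply to surface the application ID before the write is attempted, 
so the oversized state is attributable from the RM log alone rather than from a 
ConnectionLoss stack trace; a note on raising the ZooKeeper-side limit itself 
follows the quoted logs below.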

> ResourceManager failed when ZKRMStateStore tries to update znode data larger 
> than 1MB
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-2368
>                 URL: https://issues.apache.org/jira/browse/YARN-2368
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.1
>            Reporter: Leitao Guo
>            Assignee: zhuqi
>            Priority: Critical
>         Attachments: YARN-2368.patch
>
>
> Both ResourceManagers throw STATE_STORE_OP_FAILED events and finally fail. 
> The ZooKeeper log shows that ZKRMStateStore tries to update a znode larger 
> than 1MB, which is the default limit configured by 'jute.maxbuffer' on both 
> the ZooKeeper server and client.
> The ResourceManager (IP addr: 10.153.80.8) log shows the following:
> {code}
> 2014-07-25 22:33:11,078 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2014-07-25 22:33:11,078 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2014-07-25 22:33:11,214 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for 
> /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_000001
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> Meanwhile, the ZooKeeper log shows the following:
> {code}
> 2014-07-25 22:10:09,728 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - 
> Accepted socket connection from /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client 
> attempting to renew session 0x247684586e70006 at /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating 
> client: 0x247684586e70006
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 
> 0x247684586e70006 with negotiated timeout 10000 for client /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth 
> packet /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth 
> success /10.153.80.8:58890
> 2014-07-25 22:10:09,742 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception 
> causing close of session 0x247684586e70006 due to java.io.IOException: Len 
> error 1530747
> 2014-07-25 22:10:09,743 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed 
> socket connection for client /10.153.80.8:58890 which had sessionid 
> 0x247684586e70006
> ... ...
> 2014-07-25 22:33:10,966 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception 
> causing close of session 0x247684586e70006 due to java.io.IOException: Len 
> error 1530747
> {code}
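
For reference, the 1MB ceiling mentioned in the quoted description is ZooKeeper's 
jute.maxbuffer default, read from a system property on both the server and the 
client. A hedged illustration of the usual workaround of raising it (the 4 MB 
value below is arbitrary):
{code}
// Illustration only: ZooKeeper reads its buffer ceiling from the "jute.maxbuffer"
// system property, so the usual workaround is a JVM flag on BOTH the ZooKeeper
// servers and the ResourceManager, e.g. -Djute.maxbuffer=4194304.
// Setting it programmatically also works, but only if it happens before any
// ZooKeeper class is loaded.
public final class RaiseJuteMaxBuffer {
  public static void main(String[] args) {
    System.setProperty("jute.maxbuffer", String.valueOf(4 * 1024 * 1024)); // 4 MB
    // ... start the code that creates the ZooKeeper client after this point ...
  }
}
{code}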


