[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747532#action_12747532
 ] 

Benjamin Reed commented on ZOOKEEPER-515:
-----------------------------------------

first, it is important to note that our limit of 1M for data is a sanity check. 
it is unwise to design your application to run on the edge of sanity. generally 
we talk about data in the kilobyte range 100 bytes - 64k. zookeeper stores 
meta-data not application data.

do you know how big the resulting data is? what is the size of a snapshot file?

1) perhaps you are hitting the memory error again when you try to rebuild your 
in-memory data structure. you may try increasing the memory limit using the 
-Xmx flag.
2) there is a configuration option to specify the number of requests in flight, 
globalOutstandingLimit, which defaults to 1000, but with 1000 1M requests you 
need 1G for just the inflight requests, in addition to the memory needed for 
the tree. if you want to handle such large requests you need to look at the 
amount of memory we have and possibly tune that parameter. also if you have a 
large in memory tree and you need to do a state transfer for followers that are 
behind, you will need some time to push a lot of data over the network, so you 
probably also need to adjust the syncLimit and initLimit.
3) if you want to reinitialize everything you need to remove the version-2 
directory from all servers, otherwise, a server that still has the version-2 
directory will get elected and the other servers will sync with it.

> Zookeeper quorum didn't provide service when restart after an "Out of memory" 
> crash
> -----------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-515
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-515
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.2.0
>         Environment: Linux 2.6.9-52bs-4core #2 SMP Wed Jan 16 14:44:08 EST 
> 2008 x86_64 x86_64 x86_64 GNU/Linux
> Jdk: 1.6.0_14 
>            Reporter: Qian Ye
>
> The Zookeeper quorum, containing 5 servers, didn't provide service when 
> restart after an "Out of memory" crash. 
> It happened as following:
> 1. we built  a Zookeeper quorum which contained  5 servers, say 1, 3, 4, 5, 6 
> (have no 2), and 6 was the leader.
> 2. we created 18 threads on 6 different servers to set and get data from a 
> znode in the Zookeeper at the same time.  The size of the data is 1MB. The 
> test threads did their job as fast as possible, no pause between two 
> operation, and they repeated the setting and getting 4000 times. 
> 3. the Zookeeper leader crashed about 10 mins  after the test threads 
> started. The leader printed out the log:
> 2009-08-25 12:00:12,301 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x523
> 4223c2dc00b5 due to java.io.IOException: Read error
> 2009-08-25 12:00:12,318 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x523
> 4223c2dc00b6 due to java.io.IOException: Read error
> 2009-08-25 12:03:44,086 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x523
> 4223c2dc00b8 due to java.io.IOException: Read error
> 2009-08-25 12:04:53,757 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x523
> 4223c2dc00b7 due to java.io.IOException: Read error
> 2009-08-25 12:15:45,151 - FATAL [SyncThread:0:syncrequestproces...@131] - 
> Severe unrecoverable error, exiting
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:2786)
>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
>     at java.io.DataOutputStream.writeInt(DataOutputStream.java:180)
>     at 
> org.apache.jute.BinaryOutputArchive.writeInt(BinaryOutputArchive.java:55)
>     at org.apache.zookeeper.txn.SetDataTxn.serialize(SetDataTxn.java:42)
>     at 
> org.apache.zookeeper.server.persistence.Util.marshallTxnEntry(Util.java:262)
>     at 
> org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:154)
>     at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:268)
>     at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:100)
> It is clear that the leader ran out of memory. then the server 4 was down 
> almost at the same time, and printed out the log:
> 2009-08-25 12:15:45,995 - ERROR 
> [FollowerRequestProcessor:3:followerrequestproces...@91] - Unexpected 
> exception causing
> exit
> java.net.SocketException: Connection reset
>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
>     at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
>     at 
> org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:119)
>     at 
> org.apache.zookeeper.server.quorum.QuorumPacket.serialize(QuorumPacket.java:51)
>     at 
> org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
>     at 
> org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:97)
>     at org.apache.zookeeper.server.quorum.Follower.request(Follower.java:399)
>     at 
> org.apache.zookeeper.server.quorum.FollowerRequestProcessor.run(FollowerRequestProcessor.java:86)
> 2009-08-25 12:15:45,996 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x423
> 4ab894330075 due to java.net.SocketException: Broken pipe
> 2009-08-25 12:15:45,996 - FATAL [SyncThread:3:syncrequestproces...@131] - 
> Severe unrecoverable error, exiting
> java.net.SocketException: Broken pipe
>     at java.net.SocketOutputStream.socketWrite0(Native Method)
>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     at 
> org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:100)
>     at 
> org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:52)
>     at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:147)
>     at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:92)
> 2009-08-25 12:15:45,995 - WARN  [QuorumPeer:/0.0.0.0:2181:follo...@309] - 
> Exception when following the leader
> java.net.SocketException: Broken pipe
>     at java.net.SocketOutputStream.socketWrite0(Native Method)
>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     at 
> org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:100)
>     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:256)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498)
> 2009-08-25 12:15:46,022 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 12:15:46,022 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 12:15:46,023 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> It was really strange that when the 2 server down, the other three servers 
> cannot provide service any more, the 'stat' command all leaded to 
> "ZooKeeperServer not running".
> 4. I restarted the server 6(the former leader) and the server 4. But the 
> service didn't come back. All the five servers printed "ZookeeperServer not 
> running". The server 6 printed the logs:
> 2009-08-25 14:02:15,395 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:02:27,703 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: Responded to info probe
> 2009-08-25 14:02:28,733 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:02:42,070 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:02:55,407 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:03:08,744 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:03:22,080 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:03:29,396 - ERROR [main:u...@238] - Last transaction was 
> partial.
> 2009-08-25 14:03:35,417 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:03:48,761 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> The server 4 printed logs like:
> 2009-08-25 14:03:48,747 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 14:04:02,091 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 14:04:15,427 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 14:04:17,816 - WARN  [QuorumPeer:/0.0.0.0:2181:follo...@164] - 
> Unexpected exception, tries=0
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>     at java.net.Socket.connect(Socket.java:525)
>     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498)
> 2009-08-25 14:04:18,820 - WARN  [QuorumPeer:/0.0.0.0:2181:follo...@164] - 
> Unexpected exception, tries=1
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>     at java.net.Socket.connect(Socket.java:525)
>     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498)
> 2009-08-25 14:04:19,823 - WARN  [QuorumPeer:/0.0.0.0:2181:follo...@164] - 
> Unexpected exception, tries=2
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>     at java.net.Socket.connect(Socket.java:525)
>     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498)
> 2009-08-25 14:04:28,764 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 14:04:42,101 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> the server 1, 3, 5 printed out the logs like:
> 2009-08-25 14:01:35,396 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 14:01:36,554 - WARN  [QuorumPeer:/0.0.0.0:2181:leaderelect...@194] 
> - Ignoring exception while looking for lea
> der
> java.net.SocketTimeoutException: Receive timed out 
>     at java.net.PlainDatagramSocketImpl.receive0(Native Method)
>     at 
> java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
>     at java.net.DatagramSocket.receive(DatagramSocket.java:712)
>     at 
> org.apache.zookeeper.server.quorum.LeaderElection.lookForLeader(LeaderElection.java:170)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)
> 2009-08-25 14:01:37,758 - WARN  [QuorumPeer:/0.0.0.0:2181:leaderelect...@194] 
> - Ignoring exception while looking for lea
> der
> java.net.SocketTimeoutException: Receive timed out 
>     at java.net.PlainDatagramSocketImpl.receive0(Native Method)
>     at 
> java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
>     at java.net.DatagramSocket.receive(DatagramSocket.java:712)
>     at 
> org.apache.zookeeper.server.quorum.LeaderElection.lookForLeader(LeaderElection.java:170)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)
> 2009-08-25 14:01:37,865 - WARN  [QuorumPeer:/0.0.0.0:2181:follo...@164] - 
> Unexpected exception, tries=0
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>     at java.net.Socket.connect(Socket.java:525)
>     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498)
> 2009-08-25 14:01:38,289 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: Responded to info probe
> my zoo.cfg is like:
> tickTime=2000
> dataDir=./status/
> clientPort=2181
> initLimit=10
> syncLimit=2
> server.1=10.81.11.107:2888:3888
> server.2=10.81.11.106:2888:3888
> server.3=10.81.11.89:2888:3888
> server.4=10.81.11.99:2888:3888
> server.5=10.81.11.79:2888:3888
> Several questions:
> 1. Why the leader selection failed after the restart?
> 2. Is the size of data too big to be processed properly?
> 3. How can I recover from this situation? Can I just remove the version-2 
> directory on server 6(the former leader) and restart the server?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to