I tested this scenario a while ago and saw no data loss over an extended period (a day).
Testing steps:

0. Start an ensemble running 5 servers.
1. Start a workload generator (e.g. push a strictly increasing sequence of numbers to a queue stored in ZooKeeper).
2. Every few seconds, kill the cluster leader (kill -9) and restart it.

Be careful how you handle ConnectionLossException and OperationTimeoutException.

You can find the code for this test here (executed against the trunk version): https://github.com/andreisavu/zookeeper-mq

-- Andrei Savu / andreisavu.ro

On Thu, Jul 28, 2011 at 9:05 AM, Benjamin Reed <br...@apache.org> wrote:
> almost everything we do in zookeeper is to make sure that we don't
> lose data in much worse scenarios. the probability of a loss in this
> scenario is really just the probability of a bug in the code. i don't
> think that kill -TERM vs kill -KILL changes that probability at all
> either way.
>
> ben
>
> On Thu, Jul 28, 2011 at 12:50 AM, Laxman <lakshman...@huawei.com> wrote:
>> Thanks for the responses Mahadev, Pat and Ben.
>> I understand your explanation.
>>
>> My only question is "Will there be any probability of data loss in the
>> scenario mentioned?"
>>
>>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted
>>>>>there is a chance of data loss.
>>
>>>>if we use sigterm in the script, we would want to put a timeout in to
>>>>escalate to a -9
>>
>> As Ben mentioned, even if we escalate to "kill -9" to ensure shutdown, we
>> may still have data loss. But the probability is much lower if we give the
>> process a chance to shut down gracefully.
>>
>> Please do correct me if my understanding is wrong.
>> --
>> Laxman
>>
>> -----Original Message-----
>> From: Benjamin Reed [mailto:br...@apache.org]
>> Sent: Thursday, July 28, 2011 11:40 AM
>> To: dev@zookeeper.apache.org
>> Subject: Re: FW: Does abrupt kill corrupts the datadir?
>>
>> i agree with pat.
>> if we use sigterm in the script, we would want to
>> put a timeout in to escalate to a -9 which makes the script a bit more
>> complicated without reason since we don't have any exit hooks that we
>> want to run. zookeeper is designed to recover well from hard failures,
>> much worse than a kill -9. i don't think we want to change that.
>>
>> ben
>>
>> On Wed, Jul 27, 2011 at 10:25 AM, Patrick Hunt <ph...@apache.org> wrote:
>>> ZK has been built around the "fail fast" approach. In order to
>>> maintain high availability we want to ensure that restarting a server
>>> will result in it attempting to rejoin the quorum. IMO we would not
>>> want to change this (kill -9).
>>>
>>> Patrick
>>>
>>> On Tue, Jul 26, 2011 at 2:02 AM, Laxman <lakshman...@huawei.com> wrote:
>>>> Hi Everyone,
>>>>
>>>> Any thoughts?
>>>> Do we need to consider changing abrupt shutdown to
>>>>
>>>> Implementations in some other Hadoop ecosystem projects for your
>>>> reference.
>>>> Hadoop - kill [SIGTERM]
>>>> HBase - kill [SIGTERM] and then "kill -9" [SIGKILL] if process hung
>>>> ZooKeeper - "kill -9" [SIGKILL]
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Laxman [mailto:lakshman...@huawei.com]
>>>> Sent: Wednesday, July 13, 2011 12:36 PM
>>>> To: 'dev@zookeeper.apache.org'
>>>> Subject: RE: Does abrupt kill corrupts the datadir?
>>>>
>>>> Hi Mahadev,
>>>>
>>>> Shutdown hook is just a quick thought. Another approach can be to just
>>>> send a kill [SIGTERM], which can be interpreted by the process.
>>>>
>>>> First look at the "kill -9" triggered the following scenario.
>>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted there
>>>>>is a chance of dataloss.
>>>>
>>>> How can zookeeper deal with this scenario gracefully?
>>>>
>>>> Also, I feel we should give the application a chance to shut down
>>>> gracefully before an abrupt shutdown.
>>>>
>>>> http://en.wikipedia.org/wiki/SIGKILL
>>>>
>>>> Because SIGKILL gives the process no opportunity to do cleanup operations on
>>>> terminating, in most system shutdown procedures an attempt is first made to
>>>> terminate processes using SIGTERM, before resorting to SIGKILL.
>>>>
>>>> http://rackerhacker.com/2010/03/18/sigterm-vs-sigkill/
>>>>
>>>> The application can determine what it wants to do once a SIGTERM is
>>>> received. While most applications will clean up their resources and stop,
>>>> some may not. An application may be configured to do something completely
>>>> different when a SIGTERM is received. Also, if the application is in a bad
>>>> state, such as waiting for disk I/O, it may not be able to act on the signal
>>>> that was sent.
>>>>
>>>> Most system administrators will usually resort to the more abrupt signal
>>>> when an application doesn't respond to a SIGTERM.
>>>>
>>>> -----Original Message-----
>>>> From: Mahadev Konar [mailto:maha...@hortonworks.com]
>>>> Sent: Wednesday, July 13, 2011 12:02 PM
>>>> To: dev@zookeeper.apache.org
>>>> Subject: Re: Does abrupt kill corrupts the datadir?
>>>>
>>>> Hi Laxman,
>>>> The servers take care of all the issues with data integrity, so a kill
>>>> -9 is OK. Shutdown hooks are tricky. Also, the best way to make sure
>>>> everything works reliably is to use kill -9 :).
>>>>
>>>> Thanks
>>>> mahadev
>>>>
>>>> On 7/12/11 11:16 PM, "Laxman" <lakshman...@huawei.com> wrote:
>>>>
>>>>>When we stop zookeeper through zkServer.sh stop, we are aborting the
>>>>>zookeeper process using "kill -9".
>>>>>
>>>>>stop)
>>>>>    echo -n "Stopping zookeeper ... "
>>>>>    if [ ! -f "$ZOOPIDFILE" ]
>>>>>    then
>>>>>      echo "error: could not find file $ZOOPIDFILE"
>>>>>      exit 1
>>>>>    else
>>>>>      $KILL -9 $(cat "$ZOOPIDFILE")
>>>>>      rm "$ZOOPIDFILE"
>>>>>      echo STOPPED
>>>>>      exit 0
>>>>>    fi
>>>>>    ;;
>>>>>
>>>>>This may corrupt the snapshot and transaction logs. Also, it's not
>>>>>recommended to use "kill -9".
>>>>>
>>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted there
>>>>>is a chance of dataloss.
>>>>>
>>>>>How about introducing a shutdown hook which will ensure zookeeper is
>>>>>shut down gracefully when we call stop?
>>>>>
>>>>>Note: This is just an observation and it's not found in a test.
>>>>>
>>>>>--
>>>>>Thanks,
>>>>>Laxman
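
For reference, the "SIGTERM first, then a timeout that escalates to -9" approach discussed in this thread could look roughly like the sketch below. It is only an illustration of the idea, not what zkServer.sh does: the function name stop_with_escalation and the default timeout of 10 seconds are made up for this example.

```shell
# Hypothetical helper, not part of zkServer.sh: send SIGTERM, wait up to
# <timeout> seconds for the process to go away, then escalate to SIGKILL.
stop_with_escalation() {
    pid=$1
    timeout=${2:-10}   # assumed default, chosen only for illustration

    # Ask the process to terminate. If the signal cannot be delivered,
    # the process is already gone and there is nothing left to do.
    kill -TERM "$pid" 2>/dev/null || return 0

    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # kill -0 sends no signal; it only checks that the pid still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        waited=$((waited + 1))
    done

    # Still present after the timeout: force termination.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

In the stop branch quoted above, this would be called as something like stop_with_escalation "$(cat "$ZOOPIDFILE")" 10 in place of the $KILL -9 line. As Ben points out, this adds complexity to the script, and whether it buys anything depends on the server actually doing useful work on SIGTERM.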