Andrei, you might find this useful for such testing:
https://github.com/toddlipcon/gremlins

Patrick

On Thu, Jul 28, 2011 at 4:14 PM, Andrei Savu <savu.and...@gmail.com> wrote:
> I've been doing some testing in the past for this scenario and I've
> seen no data loss over an extended period of time (a day).
>
> Testing steps:
> 0. start an ensemble running 5 servers
> 1. start a workload generator (e.g. push a strictly increasing
>    sequence of numbers to a queue stored in zookeeper)
> 2. every few seconds: kill the cluster leader (-9) and restart
>
> You should be careful how you handle ConnectionLossException &
> OperationTimeoutException
>
> You can find the code for this test here (executed against the trunk version):
> https://github.com/andreisavu/zookeeper-mq
>
> -- Andrei Savu / andreisavu.ro
>
> On Thu, Jul 28, 2011 at 9:05 AM, Benjamin Reed <br...@apache.org> wrote:
>> almost everything we do in zookeeper is to make sure that we don't
>> lose data in much worse scenarios. the probability of a loss in this
>> scenario is really just the probability of a bug in the code. i don't
>> think that kill -TERM vs kill -KILL changes that probability at all
>> either way.
>>
>> ben
>>
>> On Thu, Jul 28, 2011 at 12:50 AM, Laxman <lakshman...@huawei.com> wrote:
>>> Thanks for the responses Mahadev, Pat and Ben.
>>> I understand your explanation.
>>>
>>> My only question is "Will there be any probability of data loss in the
>>> scenario mentioned?"
>>>
>>>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted
>>> there is a chance of data loss.
>>>
>>>>>if we use sigterm in the script, we would want to put a timeout in to
>>> escalate to a -9
>>>
>>> As Ben mentioned, even if we escalate to "kill -9" to ensure shutdown, we
>>> may still have data loss. But the probability is much lower when we give
>>> the process a chance to shut down gracefully.
>>>
>>> Please do correct me if my understanding is wrong.
>>> --
>>> Laxman
>>>
>>> -----Original Message-----
>>> From: Benjamin Reed [mailto:br...@apache.org]
>>> Sent: Thursday, July 28, 2011 11:40 AM
>>> To: dev@zookeeper.apache.org
>>> Subject: Re: FW: Does abrupt kill corrupts the datadir?
>>>
>>> i agree with pat. if we use sigterm in the script, we would want to
>>> put a timeout in to escalate to a -9 which makes the script a bit more
>>> complicated without reason since we don't have any exit hooks that we
>>> want to run. zookeeper is designed to recover well from hard failures,
>>> much worse than a kill -9. i don't think we want to change that.
>>>
>>> ben
>>>
>>> On Wed, Jul 27, 2011 at 10:25 AM, Patrick Hunt <ph...@apache.org> wrote:
>>>> ZK has been built around the "fail fast" approach. In order to
>>>> maintain high availability we want to ensure that restarting a server
>>>> will result in it attempting to rejoin the quorum. IMO we would not
>>>> want to change this (kill -9).
>>>>
>>>> Patrick
>>>>
>>>> On Tue, Jul 26, 2011 at 2:02 AM, Laxman <lakshman...@huawei.com> wrote:
>>>>> Hi Everyone,
>>>>>
>>>>> Any thoughts?
>>>>> Do we need consider changing abrupt shutdown to
>>>>>
>>>>> Implementations in some other hadoop eco system projects for your
>>>>> reference.
>>>>> Hadoop - kill [SIGTERM]
>>>>> HBase - kill [SIGTERM] and then "kill -9" [SIGKILL] if process hung
>>>>> ZooKeeper - "kill -9" [SIGKILL]
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Laxman [mailto:lakshman...@huawei.com]
>>>>> Sent: Wednesday, July 13, 2011 12:36 PM
>>>>> To: 'dev@zookeeper.apache.org'
>>>>> Subject: RE: Does abrupt kill corrupts the datadir?
>>>>>
>>>>> Hi Mahadev,
>>>>>
>>>>> Shutdown hook is just a quick thought. Another approach can be just give
>>>>> a kill [SIGTERM] call which can be interpreted by process.
>>>>>
>>>>> First look at the "kill -9" triggered the following scenario.
>>>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted
>>>>>>there is a chance of dataloss.
>>>>>
>>>>> How can zookeeper deal with this scenario gracefully?
>>>>>
>>>>> Also, I feel we should give the application a chance to shut down
>>>>> gracefully before an abrupt shutdown.
>>>>>
>>>>> http://en.wikipedia.org/wiki/SIGKILL
>>>>>
>>>>> Because SIGKILL gives the process no opportunity to do cleanup operations
>>>>> on terminating, in most system shutdown procedures an attempt is first
>>>>> made to terminate processes using SIGTERM, before resorting to SIGKILL.
>>>>>
>>>>> http://rackerhacker.com/2010/03/18/sigterm-vs-sigkill/
>>>>>
>>>>> The application can determine what it wants to do once a SIGTERM is
>>>>> received. While most applications will clean up their resources and stop,
>>>>> some may not. An application may be configured to do something completely
>>>>> different when a SIGTERM is received. Also, if the application is in a
>>>>> bad state, such as waiting for disk I/O, it may not be able to act on the
>>>>> signal that was sent.
>>>>>
>>>>> Most system administrators will usually resort to the more abrupt signal
>>>>> when an application doesn't respond to a SIGTERM.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mahadev Konar [mailto:maha...@hortonworks.com]
>>>>> Sent: Wednesday, July 13, 2011 12:02 PM
>>>>> To: dev@zookeeper.apache.org
>>>>> Subject: Re: Does abrupt kill corrupts the datadir?
>>>>>
>>>>> Hi Laxman,
>>>>> The servers take care of all the issues with data integrity, so a kill
>>>>> -9 is OK. Shutdown hooks are tricky. Also, the best way to make sure
>>>>> everything works reliably is to use kill -9 :).
>>>>>
>>>>> Thanks
>>>>> mahadev
>>>>>
>>>>> On 7/12/11 11:16 PM, "Laxman" <lakshman...@huawei.com> wrote:
>>>>>
>>>>>>When we stop zookeeper through zkServer.sh stop, we are aborting the
>>>>>>zookeeper process using "kill -9".
>>>>>>
>>>>>>stop)
>>>>>>    echo -n "Stopping zookeeper ... "
>>>>>>    if [ ! -f "$ZOOPIDFILE" ]
>>>>>>    then
>>>>>>      echo "error: could not find file $ZOOPIDFILE"
>>>>>>      exit 1
>>>>>>    else
>>>>>>      $KILL -9 $(cat "$ZOOPIDFILE")
>>>>>>      rm "$ZOOPIDFILE"
>>>>>>      echo STOPPED
>>>>>>      exit 0
>>>>>>    fi
>>>>>>    ;;
>>>>>>
>>>>>>This may corrupt the snapshot and transaction logs. Also, it's not
>>>>>>recommended to use "kill -9".
>>>>>>
>>>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted
>>>>>>there is a chance of dataloss.
>>>>>>
>>>>>>How about introducing a shutdown hook which will ensure zookeeper is
>>>>>>shutdown gracefully when we call stop?
>>>>>>
>>>>>>Note: This is just an observation and it's not found in a test.
>>>>>>
>>>>>>--
>>>>>>Thanks,
>>>>>>Laxman
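
For reference, the "SIGTERM, then escalate to -9 after a timeout" approach discussed in this thread (what the thread describes HBase doing) could look roughly like the sketch below. This is not what zkServer.sh does, and Ben and Patrick argue above against adopting it for ZooKeeper. The function name stop_with_escalation and the 10-second default are assumptions made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of zkServer.sh: send SIGTERM, wait up to
# <timeout> seconds for the process to exit, then escalate to SIGKILL.
# Usage: stop_with_escalation <pid> [timeout_seconds]
stop_with_escalation() {
    pid=$1
    timeout=${2:-10}   # assumed default, chosen only for this sketch

    # kill -0 delivers no signal; it only checks that the process exists.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is not running"
        return 0
    fi

    # Ask politely first.
    kill -TERM "$pid" 2>/dev/null

    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "process $pid ignored SIGTERM for ${timeout}s, sending SIGKILL"
            kill -KILL "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done

    echo "process $pid exited after SIGTERM"
}
```

In a stop) branch like the one quoted above, this would stand in for the `$KILL -9` line, e.g. `stop_with_escalation "$(cat "$ZOOPIDFILE")" 10`. The extra polling loop is exactly the added script complexity Ben mentions.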