I tested this scenario some time ago and saw no data loss over an
extended period of time (a day).

Testing steps:
0. start an ensemble running 5 servers
1. start a workload generator (e.g. push a strictly increasing
sequence of numbers to a queue stored in ZooKeeper)
2. every few seconds: kill the cluster leader (kill -9) and restart it
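
As a rough illustration of the last step, the kill-and-restart loop could look like the sketch below. It is not taken from the zookeeper-mq repository: `kill_and_restart`, the pid file path and the restart command are hypothetical placeholders, and working out which server is currently the leader is left out.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZooKeeper): hard-kill the process whose
# pid is stored in a pid file, then run the given restart command.
kill_and_restart() {
    pidfile=$1
    shift
    if [ -f "$pidfile" ]; then
        # SIGKILL cannot be caught, so the server gets no chance to clean up.
        # Ignore the error if the pid is already gone.
        kill -9 "$(cat "$pidfile")" 2>/dev/null || true
        rm -f "$pidfile"
    fi
    # Whatever remains in "$@" is the restart command.
    "$@"
}

# Example loop (commented out; the path and the interval are assumptions):
# while true; do
#     kill_and_restart /path/to/zookeeper_server.pid bin/zkServer.sh start
#     sleep 5
# done
```

The helper takes the restart command as trailing arguments so the same function can be pointed at any of the five servers.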

Be careful how you handle ConnectionLossException and
OperationTimeoutException: when either is thrown, the client does not
know whether the request was applied.

You can find the code for this test here (executed against the trunk version):
https://github.com/andreisavu/zookeeper-mq

-- Andrei Savu / andreisavu.ro

On Thu, Jul 28, 2011 at 9:05 AM, Benjamin Reed <br...@apache.org> wrote:
> almost everything we do in zookeeper is to make sure that we don't
> lose data in much worse scenarios. the probability of a loss in this
> scenario is really just the probability of a bug in the code. i don't
> think that kill -TERM vs kill -KILL changes that probability at all
> either way.
>
> ben
>
> On Thu, Jul 28, 2011 at 12:50 AM, Laxman <lakshman...@huawei.com> wrote:
>> Thanks for the responses Mahadev, Pat and Ben.
>> I understand your explanation.
>>
>> My only question is "Is there any probability of data loss in the scenario
>> mentioned?"
>>
>>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted
>>>>>there is a chance of data loss.
>>
>>>>if we use sigterm in the script, we would want to put a timeout in to
>>>>escalate to a -9
>>
>> As Ben mentioned, even if we escalate to "kill -9" to ensure shutdown, we
>> may still have data loss. But the probability is much lower if we first
>> give the process a chance to shut down gracefully.
>>
>> Please do correct me if my understanding is wrong.
>> --
>> Laxman
>>
>> -----Original Message-----
>> From: Benjamin Reed [mailto:br...@apache.org]
>> Sent: Thursday, July 28, 2011 11:40 AM
>> To: dev@zookeeper.apache.org
>> Subject: Re: FW: Does abrupt kill corrupts the datadir?
>>
>> i agree with pat. if we use sigterm in the script, we would want to
>> put a timeout in to escalate to a -9 which makes the script a bit more
>> complicated without reason since we don't have any exit hooks that we
>> want to run. zookeeper is designed to recover well from hard failures,
>> much worse than a kill -9. i don't think we want to change that.
>>
>> ben
>>
>> On Wed, Jul 27, 2011 at 10:25 AM, Patrick Hunt <ph...@apache.org> wrote:
>>> ZK has been built around the "fail fast" approach. In order to
>>> maintain high availability we want to ensure that restarting a server
>>> will result in it attempting to rejoin the quorum. IMO we would not
>>> want to change this (kill -9).
>>>
>>> Patrick
>>>
>>> On Tue, Jul 26, 2011 at 2:02 AM, Laxman <lakshman...@huawei.com> wrote:
>>>> Hi Everyone,
>>>>
>>>> Any thoughts?
>>>> Do we need to consider changing abrupt shutdown to
>>>>
>>>> Implementations in some other Hadoop ecosystem projects, for your
>>>> reference:
>>>> Hadoop - kill [SIGTERM]
>>>> HBase - kill [SIGTERM] and then "kill -9" [SIGKILL] if process hung
>>>> ZooKeeper - "kill -9" [SIGKILL]
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Laxman [mailto:lakshman...@huawei.com]
>>>> Sent: Wednesday, July 13, 2011 12:36 PM
>>>> To: 'dev@zookeeper.apache.org'
>>>> Subject: RE: Does abrupt kill corrupts the datadir?
>>>>
>>>> Hi Mahadev,
>>>>
>>>> The shutdown hook was just a quick thought. Another approach would be to
>>>> send a kill [SIGTERM], which the process can handle itself.
>>>>
>>>> My first look at the "kill -9" raised the following scenario:
>>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted there
>>>>>is a chance of dataloss.
>>>>
>>>> How can ZooKeeper deal with this scenario gracefully?
>>>>
>>>> Also, I feel we should give the application a chance to shut down
>>>> gracefully before an abrupt shutdown.
>>>>
>>>> http://en.wikipedia.org/wiki/SIGKILL
>>>>
>>>> Because SIGKILL gives the process no opportunity to do cleanup operations
>>>> on terminating, in most system shutdown procedures an attempt is first
>>>> made to terminate processes using SIGTERM, before resorting to SIGKILL.
>>>>
>>>> http://rackerhacker.com/2010/03/18/sigterm-vs-sigkill/
>>>>
>>>> The application can determine what it wants to do once a SIGTERM is
>>>> received. While most applications will clean up their resources and stop,
>>>> some may not. An application may be configured to do something completely
>>>> different when a SIGTERM is received. Also, if the application is in a
>>>> bad state, such as waiting for disk I/O, it may not be able to act on the
>>>> signal that was sent.
>>>>
>>>> Most system administrators will usually resort to the more abrupt signal
>>>> when an application doesn't respond to a SIGTERM.
>>>>
>>>> -----Original Message-----
>>>> From: Mahadev Konar [mailto:maha...@hortonworks.com]
>>>> Sent: Wednesday, July 13, 2011 12:02 PM
>>>> To: dev@zookeeper.apache.org
>>>> Subject: Re: Does abrupt kill corrupts the datadir?
>>>>
>>>> Hi Laxman,
>>>>  The servers take care of all the issues with data integrity, so a kill
>>>> -9 is OK. Shutdown hooks are tricky. Also, the best way to make sure
>>>> everything works reliably is to use kill -9 :).
>>>>
>>>> Thanks
>>>> mahadev
>>>>
>>>> On 7/12/11 11:16 PM, "Laxman" <lakshman...@huawei.com> wrote:
>>>>
>>>>>When we stop zookeeper through zkServer.sh stop, we are aborting the
>>>>>zookeeper process using "kill -9".
>>>>>
>>>>>stop)
>>>>>    echo -n "Stopping zookeeper ... "
>>>>>    if [ ! -f "$ZOOPIDFILE" ]
>>>>>    then
>>>>>      echo "error: could not find file $ZOOPIDFILE"
>>>>>      exit 1
>>>>>    else
>>>>>      $KILL -9 $(cat "$ZOOPIDFILE")
>>>>>      rm "$ZOOPIDFILE"
>>>>>      echo STOPPED
>>>>>      exit 0
>>>>>    fi
>>>>>    ;;
>>>>>
>>>>>This may corrupt the snapshot and transaction logs. Also, it's not
>>>>>recommended to use "kill -9".
>>>>>
>>>>>In worst case, if latest snaps in all zookeeper nodes gets corrupted there
>>>>>is a chance of dataloss.
>>>>>
>>>>>
>>>>>
>>>>>How about introducing a shutdown hook which will ensure zookeeper is
>>>>>shutdown gracefully when we call stop?
>>>>>
>>>>>
>>>>>
>>>>>Note: This is just an observation; it was not found in a test.
>>>>>
>>>>>
>>>>>
>>>>>--
>>>>>
>>>>>Thanks,
>>>>>
>>>>>Laxman
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>
