Re: FW: Does abrupt kill corrupts the datadir?

Patrick Hunt Mon, 01 Aug 2011 11:37:29 -0700

Andrei, you might find this useful for such testing:
https://github.com/toddlipcon/gremlins


Patrick
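
For the testing steps quoted below, locating the current leader before killing it could be sketched roughly as follows. This is only a sketch under assumptions: the helper names and the host:port arguments are hypothetical, `nc` (netcat) must be installed, and it relies on the `stat` four-letter command reporting a "Mode:" line. How the leader is then killed and restarted depends on the deployment and is left out.

```shell
# Hypothetical helper: extract the server role from the text returned by
# ZooKeeper's "stat" four-letter command (it includes a "Mode: ..." line).
zk_mode_from_stat() {
  awk '/^Mode:/ { print $2 }'
}

# Hypothetical helper: ask one server for its role.
# $1 = host, $2 = client port (placeholders). Requires nc (netcat).
zk_mode() {
  echo stat | nc "$1" "$2" 2>/dev/null | zk_mode_from_stat
}

# Print the "host:port" of the first server that reports itself as leader.
# Returns non-zero if no server in the list claims to be leader.
find_leader() {
  for server in "$@"; do
    host=${server%:*}
    port=${server#*:}
    if [ "$(zk_mode "$host" "$port")" = "leader" ]; then
      echo "$server"
      return 0
    fi
  done
  return 1
}
```

A test loop would call `find_leader` with the ensemble's host:port pairs every few seconds, then kill -9 and restart the server it prints.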

On Thu, Jul 28, 2011 at 4:14 PM, Andrei Savu <savu.and...@gmail.com> wrote:
> I've done some testing of this scenario in the past and saw no data
> loss over an extended period of time (a day).
>
> Testing steps:
> 0. start an ensemble running 5 servers
> 1. start a workload generator (e.g. push a strictly increasing
> sequence of numbers to a queue stored in zookeeper)
> 2. every few seconds: kill the cluster leader (-9) and restart it
>
> You should be careful about how you handle ConnectionLossException and
> OperationTimeoutException.
>
> You can find the code for this test here (executed against the trunk version):
> https://github.com/andreisavu/zookeeper-mq
>
> -- Andrei Savu / andreisavu.ro
>
> On Thu, Jul 28, 2011 at 9:05 AM, Benjamin Reed <br...@apache.org> wrote:
>> almost everything we do in zookeeper is to make sure that we don't
>> lose data in much worse scenarios. the probability of a loss in this
>> scenario is really just the probability of a bug in the code. i don't
>> think that kill -TERM vs kill -KILL changes that probability at all
>> either way.
>>
>> ben
>>
>> On Thu, Jul 28, 2011 at 12:50 AM, Laxman <lakshman...@huawei.com> wrote:
>>> Thanks for the responses Mahadev, Pat and Ben.
>>> I understand your explanation.
>>>
>>> My only question is "Is there any probability of data loss in the scenario
>>> mentioned?"
>>>
>>>>>>In the worst case, if the latest snaps on all zookeeper nodes get corrupted,
>>>>>>there is a chance of data loss.
>>>
>>>>>if we use sigterm in the script, we would want to put a timeout in to
>>>>>escalate to a -9
>>>
>>> As Ben mentioned, even if we escalate to "kill -9" to ensure shutdown, we
>>> may still have data loss. But the probability is much lower if we give the
>>> process a chance to shut down gracefully.
>>>
>>> Please do correct me if my understanding is wrong.
>>> --
>>> Laxman
>>>
>>> -----Original Message-----
>>> From: Benjamin Reed [mailto:br...@apache.org]
>>> Sent: Thursday, July 28, 2011 11:40 AM
>>> To: dev@zookeeper.apache.org
>>> Subject: Re: FW: Does abrupt kill corrupts the datadir?
>>>
>>> i agree with pat. if we use sigterm in the script, we would want to
>>> put a timeout in to escalate to a -9 which makes the script a bit more
>>> complicated without reason since we don't have any exit hooks that we
>>> want to run. zookeeper is designed to recover well from hard failures,
>>> much worse than a kill -9. i don't think we want to change that.
>>>
>>> ben
>>>
>>> On Wed, Jul 27, 2011 at 10:25 AM, Patrick Hunt <ph...@apache.org> wrote:
>>>> ZK has been built around the "fail fast" approach. In order to
>>>> maintain high availability we want to ensure that restarting a server
>>>> will result in it attempting to rejoin the quorum. IMO we would not
>>>> want to change this (kill -9).
>>>>
>>>> Patrick
>>>>
>>>> On Tue, Jul 26, 2011 at 2:02 AM, Laxman <lakshman...@huawei.com> wrote:
>>>>> Hi Everyone,
>>>>>
>>>>> Any thoughts?
>>>>> Do we need to consider changing abrupt shutdown to
>>>>>
>>>>> Implementations in some other Hadoop ecosystem projects, for your
>>>>> reference:
>>>>> Hadoop - kill [SIGTERM]
>>>>> HBase - kill [SIGTERM] and then "kill -9" [SIGKILL] if process hung
>>>>> ZooKeeper - "kill -9" [SIGKILL]
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Laxman [mailto:lakshman...@huawei.com]
>>>>> Sent: Wednesday, July 13, 2011 12:36 PM
>>>>> To: 'dev@zookeeper.apache.org'
>>>>> Subject: RE: Does abrupt kill corrupts the datadir?
>>>>>
>>>>> Hi Mahadev,
>>>>>
>>>>> The shutdown hook is just a quick thought. Another approach would be to
>>>>> just send a kill [SIGTERM], which the process can interpret.
>>>>>
>>>>> A first look at the "kill -9" raised the following scenario:
>>>>>>In the worst case, if the latest snaps on all zookeeper nodes get corrupted,
>>>>>>there is a chance of data loss.
>>>>>
>>>>> How can zookeeper deal with this scenario gracefully?
>>>>>
>>>>> Also, I feel we should give the application a chance to shut down
>>>>> gracefully before an abrupt shutdown.
>>>>>
>>>>> http://en.wikipedia.org/wiki/SIGKILL
>>>>>
>>>>> Because SIGKILL gives the process no opportunity to do cleanup operations
>>>>> on terminating, in most system shutdown procedures an attempt is first made
>>>>> to terminate processes using SIGTERM, before resorting to SIGKILL.
>>>>>
>>>>> http://rackerhacker.com/2010/03/18/sigterm-vs-sigkill/
>>>>>
>>>>> The application can determine what it wants to do once a SIGTERM is
>>>>> received. While most applications will clean up their resources and stop,
>>>>> some may not. An application may be configured to do something completely
>>>>> different when a SIGTERM is received. Also, if the application is in a bad
>>>>> state, such as waiting for disk I/O, it may not be able to act on the
>>>>> signal that was sent.
>>>>>
>>>>> Most system administrators will usually resort to the more abrupt signal
>>>>> when an application doesn't respond to a SIGTERM.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mahadev Konar [mailto:maha...@hortonworks.com]
>>>>> Sent: Wednesday, July 13, 2011 12:02 PM
>>>>> To: dev@zookeeper.apache.org
>>>>> Subject: Re: Does abrupt kill corrupts the datadir?
>>>>>
>>>>> Hi Laxman,
>>>>> The servers take care of all the issues with data integrity, so a kill
>>>>> -9 is OK. Shutdown hooks are tricky. Also, the best way to make sure
>>>>> everything works reliably is to use kill -9 :).
>>>>>
>>>>> Thanks
>>>>> mahadev
>>>>>
>>>>> On 7/12/11 11:16 PM, "Laxman" <lakshman...@huawei.com> wrote:
>>>>>
>>>>>>When we stop zookeeper through zkServer.sh stop, we are aborting the
>>>>>>zookeeper process using "kill -9".
>>>>>>
>>>>>>stop)
>>>>>>    echo -n "Stopping zookeeper ... "
>>>>>>    if [ ! -f "$ZOOPIDFILE" ]
>>>>>>    then
>>>>>>      echo "error: could not find file $ZOOPIDFILE"
>>>>>>      exit 1
>>>>>>    else
>>>>>>      $KILL -9 $(cat "$ZOOPIDFILE")
>>>>>>      rm "$ZOOPIDFILE"
>>>>>>      echo STOPPED
>>>>>>      exit 0
>>>>>>    fi
>>>>>>    ;;
>>>>>>
>>>>>>This may corrupt the snapshot and transaction logs. Also, it's not
>>>>>>recommended to use "kill -9".
>>>>>>
>>>>>>In the worst case, if the latest snaps on all zookeeper nodes get corrupted,
>>>>>>there is a chance of data loss.
>>>>>>
>>>>>>How about introducing a shutdown hook which will ensure zookeeper is
>>>>>>shut down gracefully when we call stop?
>>>>>>
>>>>>>Note: This is just an observation; it was not found in a test.
>>>>>>
>>>>>>
>>>>>>
>>>>>>--
>>>>>>
>>>>>>Thanks,
>>>>>>
>>>>>>Laxman
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
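
The SIGTERM-then-escalate approach debated in this thread (send SIGTERM, wait for a timeout, then fall back to "kill -9") could look roughly like the sketch below. It is not what zkServer.sh does, and the thread's consensus was to keep the plain kill -9; the function name and the 10-second default timeout are assumptions made for illustration.

```shell
# Hypothetical helper, not part of zkServer.sh: send SIGTERM, wait up to a
# timeout for the process to exit, then escalate to SIGKILL.
# $1 = pid, $2 = seconds to wait before escalating (assumed default: 10)
stop_with_escalation() {
  pid=$1
  timeout=${2:-10}

  # If the signal cannot be delivered, the process is already gone.
  kill -TERM "$pid" 2>/dev/null || return 0

  waited=0
  # kill -0 sends no signal; it only checks that the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      # The process did not exit in time: escalate to SIGKILL.
      kill -KILL "$pid" 2>/dev/null || true
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

In a stop script this might be called as `stop_with_escalation "$(cat "$ZOOPIDFILE")" 10`. As Ben notes above, the extra loop and timeout are exactly the added complexity that comes with escalation.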
