Re: [sheepdog] [sheepdog-users] Several difficulties with sheepdog (from 0.4.0-0+tek2b-10 deb package)

2012-07-27 Thread Bastian Scholz

On 2012-07-26 23:06, David Douard wrote:

I've put a modified version of this in the wiki.



Never kill more than X sheep daemons (X being the number
of copies you formatted your cluster with) at a time


Technically it is a little more complex: you can kill more
than X sheep, but avoid killing sheep in more than X-1 zones...

Because you have to keep at least one copy alive, the limit is X-1...
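
Bastian's X-1 rule can be stated as a tiny helper (a sketch only; `copies` stands for the redundancy level the cluster was formatted with):

```shell
# How many zones may fail at once without losing data?
# With `copies` replicas, each object lives in `copies` distinct zones,
# so at least one replica survives as long as at most copies-1 zones die.
max_zone_failures() {
    copies=$1
    echo $(( copies - 1 ))
}

max_zone_failures 3   # a 3-copy cluster tolerates losing 2 zones at once
```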

For a quick change, could you[1] please change the first
sentence to "Never kill more than X-1 sheep daemons [...]"?
I will have a look next week to see if I can integrate the
zones concept into that part...


I think these questions need to be clarified for the newcomer,
maybe with some examples of failure scenarios up to the disaster
(data loss; when does this occur in each example config).

What do you think?


A Good Practice Guide with some examples for cluster
creation is a good idea, and some of the docs seem
to be a little outdated (snapshots, for example, give
a different output with 0.4).

Cheers

Bastian

[1] I have no github account right now, will create one next
week, with more time :-)


--
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog


Re: [sheepdog] [sheepdog-users] Several difficulties with sheepdog (from 0.4.0-0+tek2b-10 deb package)

2012-07-26 Thread David Douard
On 26/07/2012 19:59, Bastian Scholz wrote:
> On 2012-07-26 18:53, Jens WEBER wrote:
>> In case of a crash, like your network error, you have a problem if
>> one node doesn't have a full copy. So 3 nodes must have 3 copies. Or
>> use redundant network links, so the situation can't happen. For me,
>> sometimes collie cluster recover and collie cluster cleanup work
>> after a kill/crash.
> 
> Ironically this happened while testing the redundant network
> links, when a strange switch firmware error killed the whole
> network... ;-)
> 
> But even if this should nearly never happen in a well-designed
> network, I think it should be possible to recover from this
> kind of corner-case error. The sheep detect that they have
> too few living nodes and halt...
> 
> Theoretically they have a valid state, because they reject all
> further write requests at the same time, but I can't reconnect
> them after the network error...
> 
> On their own, they won't detect each other again...
> collie cluster shutdown won't work, because of too few hosts...
> killing them invalidates the data...
> 
> So, what I had expected is the following scenario:
> 
> If I lose more zones than the number of copies at the same time
> -> shit happens! That will be very unlikely under normal
> conditions.
> 
> But when I lose half or even more sheep at the same time, I
> think it should be possible to fail into a recoverable state...
> 
>> Next step is to write a best-practice guide on how to set up a
>> sheepdog cluster in the right way. All help is welcome.
> 
> I am not very good at documenting things, but I'll try
> my best ;-)
> 
> To answer David's question about best practice for updates,
> something like this should do it...
> 
> The update scenario depends on whether you need a running cluster
> the whole time, or whether you can plan a complete shutdown for
> some time.
> 
> If you need to run the cluster all the time, you have to kill
> the sheep on one node, perform the update and restart the sheep.
> After this, wait for recovery to complete and proceed with
> the next node. After finishing with all nodes, run ''collie
> cluster cleanup''; this removes objects no longer needed on the
> nodes after successful recovery.
> 
> If you have a timeframe to shut down the cluster completely, it
> is maybe faster to use ''collie cluster shutdown'' (shut down
> all connected qemu instances first) to stop all sheep on all
> nodes, which leaves the cluster in a clean state.
> Then apply the updates on all nodes and restart the sheep; the
> cluster starts working again once all original inhabitants are
> back alive on the farm.
> 
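
The rolling-upgrade path quoted above could be sketched as a dry-run shell outline (node names are hypothetical, and nothing is executed; the per-node steps are only printed):

```shell
# Dry-run sketch of the rolling upgrade Bastian describes: one node at a
# time, waiting for recovery between nodes, then a final cleanup.
rolling_upgrade_plan() {
    for node in "$@"; do
        echo "[$node] kill the local sheep daemons"
        echo "[$node] install the new sheepdog package"
        echo "[$node] restart the sheep daemons"
        echo "[$node] wait for object recovery to complete"
    done
    echo "[all]   collie cluster cleanup   # drop objects left over from recovery"
}

rolling_upgrade_plan node1 node2 node3
```

The point of the one-node-at-a-time loop is that at most one zone is ever down, which keeps the cluster inside the X-1-zones safety margin discussed earlier in the thread.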

Thanks,

I've put a modified version of this in the wiki.

I'd also like to have more documentation (on the wiki and in the man
pages) on the meaning and the implications of several parameters (what
are zones in the sheep command arguments? How is this related to the
mode (safe, quorum, unsafe) used when formatting the cluster? etc.). I
think these questions need to be clarified for the newcomer, maybe with
some examples of failure scenarios up to the disaster (data loss; when
does this occur in each example config).
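
As a starting point for such examples, a minimal cluster layout could be sketched like this (a dry-run that only prints commands; paths, node names, and exact option spellings are assumptions to be checked against `sheep --help` and `collie cluster format --help` for your version):

```shell
# Dry-run example: N nodes, `copies` replicas, one failure zone per node,
# so losing any copies-1 nodes at once still leaves one replica alive.
example_cluster_plan() {
    copies=$1; shift
    zone=0
    for node in "$@"; do
        echo "[$node] sheep --zone $zone /var/lib/sheepdog"
        zone=$(( zone + 1 ))
    done
    echo "[any]   collie cluster format --copies $copies"
}

example_cluster_plan 3 node1 node2 node3
```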

What do you think?

David

PS: I've CC'd this email to the dev list since I don't know how many
sheepdog developers are actually registered on this 'sheepdog-users' list.

> Cheers
> 
> Bastian
> 
> 
> 


-- 
--
David DOUARD           LOGILAB
+33 1 45 32 03 12   david.dou...@logilab.fr
+33 1 83 64 25 26   http://www.logilab.fr/id/david.douard

Formations - http://www.logilab.fr/formations
Développements - http://www.logilab.fr/services
Gestion de connaissances - http://www.cubicweb.org/
