At Sun, 25 Sep 2011 12:00:20 +0800,
Liu Yuan wrote:
> 
> From: Liu Yuan <tailai...@taobao.com>
> 
> Hi Kazum,
>     would this solve the data loss problem as you mentioned when there is
> no epoch overlap? This patch would allow cluster to recover as if the master
> (last failed node) were no ever crashed.

The concept of mastership transfer is great!  I think this is the
right way to go.  But your patch has some problems.  Here is a test
case that reproduces them:

  #!/bin/bash

  # create a directory which has a different creation time
  sheep /store/1 -p 7001
  sleep 1
  collie cluster format -p 7001
  collie cluster shutdown -p 7001
  sleep 1

  # start Sheepdog
  sheep /store/0 -p 7000
  sleep 1
  collie cluster format -p 7000

  while true; do
      sheep /store/1 -p 7001
      sheep /store/2 -p 7002

      # wait for node join
      while [ "`collie cluster info -p 7002 -r 2>&1 | head -1`" != 'running' ]; do
          sleep 0.1
      done

      if [ "`collie node list -p 7002 -r | wc -l`" -ne 2 ]; then
          # break if the result is not correct
          break
      fi

      pkill -f "sheep /store/2"
  done

  # show results
  collie cluster info -p 7000
  collie cluster info -p 7002

The detailed reasons are below.

> @@ -975,17 +992,6 @@ static void __sd_deliver(struct cpg_event *cevent)
>                  addr_to_str(name, sizeof(name), m->from.addr, m->from.port),
>                  m->pid);
>  
> -        /*
> -         * we don't want to perform any deliver events until we
> -         * join; we wait for our JOIN message.
> -         */
> -        if (!sys->join_finished) {
> -                if (m->pid != sys->this_pid || m->nodeid != sys->this_nodeid) {
> -                        cevent->skip = 1;
> -                        return;
> -                }
> -        }
> -

Sheepdog assumes that only joined nodes handle the delivered messages,
so we cannot remove this block.  Instead of removing it, you should let
only mastership transfer events pass through here.
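
As a rough sketch of that filter (not the actual sheep code): the
structs and enum below are simplified, hypothetical stand-ins for the
real sheep types, and MSG_MASTER_TRANSFER is an assumed new message
type that does not exist in the patch. The check keeps the original
rule for unjoined nodes and adds a single exception.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the sheep structures. */
enum msg_op {
	MSG_JOIN,
	MSG_LEAVE,
	MSG_MASTER_TRANSFER,	/* assumed new message type */
};

struct msg {
	enum msg_op op;
	uint32_t nodeid;
	uint32_t pid;
};

struct node_state {
	int join_finished;
	uint32_t this_nodeid;
	uint32_t this_pid;
};

/* Return 1 if the delivered message must be skipped, 0 if it may be handled. */
static int should_skip(const struct node_state *s, const struct msg *m)
{
	assert(s != NULL && m != NULL);

	/* joined nodes handle every delivered message */
	if (s->join_finished)
		return 0;

	/* the only exception for a node that has not joined yet */
	if (m->op == MSG_MASTER_TRANSFER)
		return 0;

	/* otherwise keep waiting for our own message, as the removed block did */
	return m->pid != s->this_pid || m->nodeid != s->this_nodeid;
}
```

With this shape, a leave message from another node is still skipped
before the join finishes, so it cannot change the state of an unjoined
node.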

>          if (m->op == SD_MSG_JOIN) {
>                  uint32_t nodeid = m->nodeid;
>                  uint32_t pid = m->pid;
> @@ -1052,7 +1058,15 @@ static void send_join_response(struct work_deliver *w)
>                                  jm->nr_leave_nodes++;
>                  }
>                  print_node_list(&sys->leave_list);
> +        } else if (jm->result != SD_RES_SUCCESS &&
> +                        jm->epoch > sys->epoch &&
> +                        jm->cluster_status == SD_STATUS_WAIT_FOR_JOIN) {
> +                eprintf("Transfer mastership.\n");
> +                leave_cluster();
> +                eprintf("Restart me later when master is up, please.Bye.\n");
> +                exit(1);
>          }
> +        jm->epoch = sys->epoch;
>          send_message(sys->handle, m);
>  }
>  
> @@ -1090,15 +1104,23 @@ static void __sd_deliver_done(struct cpg_event *cevent)
>                          lm = (struct leave_message *)m;
>                          add_node_to_leave_list(m);
>  
> -                        if (lm->epoch > sys->leave_epoch)
> -                                sys->leave_epoch = lm->epoch;
> +                        /* Sheep needs this to identify itself as master.
> +                         * Now mastership transfer is done.
> +                         */
> +                        if (!sys->join_finished) {
> +                                sys->join_finished = 1;
> +                                move_node_to_sd_list(sys->this_nodeid, sys->this_pid, sys->this_node);
> +                                sys->epoch = get_latest_epoch();
> +                        }

IIUC, this code assumes that every leave message comes from a node
that is leaving because this node has a newer epoch, so this node can
become the master.  But the assumption is wrong, because a node which
has completely wrong epoch information (e.g. a node with a different
creation time) also sends a leave message.

My suggestion is to introduce another message type, something like
SD_MSG_MASTER_TRANSFER.  I think we should clearly distinguish master
transfer events from leave messages.
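
A minimal sketch of that separation, under stated assumptions: the
types and the handle_msg() helper are hypothetical simplifications, not
the real sheep code, and MSG_MASTER_TRANSFER stands for the proposed
SD_MSG_MASTER_TRANSFER. The point is only that a leave message records
a departure, while mastership is taken over solely on the dedicated
message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message types; MSG_MASTER_TRANSFER is the proposed one. */
enum msg_op {
	MSG_LEAVE,
	MSG_MASTER_TRANSFER,
};

struct msg {
	enum msg_op op;
};

/* Hypothetical, simplified view of the local node state. */
struct node_state {
	int join_finished;
	int is_master;
	int nr_leave;	/* number of nodes recorded as having left */
};

static void handle_msg(struct node_state *s, const struct msg *m)
{
	assert(s != NULL && m != NULL);

	switch (m->op) {
	case MSG_LEAVE:
		/* a leave only records the departure; it never grants mastership */
		s->nr_leave++;
		break;
	case MSG_MASTER_TRANSFER:
		/* explicit handover: only now may an unjoined node become master */
		if (!s->join_finished) {
			s->join_finished = 1;
			s->is_master = 1;
		}
		break;
	}
}
```

In this sketch a leave message from a node with wrong epoch
information increments the leave count but leaves join_finished and
is_master untouched.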

Thanks,

Kazutaka

>  
>                  nr_local = get_nodes_nr_epoch(sys->epoch);
>                  nr = get_nodes_nr_from(&sys->sd_node_list);
>                  nr_leave = get_nodes_nr_from(&sys->leave_list);
> +
> +                dprintf("%d == %d + %d \n", nr_local, nr, nr_leave);
>                  if (nr_local == nr + nr_leave) {
>                          sys->status = SD_STATUS_OK;
> -                        sys->epoch = sys->leave_epoch + 1;
> +                        sys->epoch = sys->epoch;
>                          update_epoch_log(sys->epoch);
>                          update_epoch_store(sys->epoch);
>                  }
> @@ -1931,7 +1953,6 @@ join_retry:
>          sys->handle = cpg_handle;
>          sys->this_nodeid = nodeid;
>          sys->this_pid = getpid();
> -        sys->leave_epoch = 0;
>  
>          ret = set_addr(nodeid, port);
>          if (ret)
> diff --git a/sheep/sheep_priv.h b/sheep/sheep_priv.h
> index 6680f79..4711cdd 100644
> --- a/sheep/sheep_priv.h
> +++ b/sheep/sheep_priv.h
> @@ -144,7 +144,6 @@ struct cluster_info {
>          int nr_outstanding_reqs;
>  
>          uint32_t recovered_epoch;
> -        uint32_t leave_epoch; /* The highest number in the clsuter */
>  
>          int use_directio;
>  
> -- 
> 1.7.6.1
> 
> -- 
> sheepdog mailing list
> sheepdog@lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog