At Sun, 25 Sep 2011 12:00:20 +0800,
Liu Yuan wrote:
> 
> From: Liu Yuan <tailai...@taobao.com>
> 
> Hi Kazum,
>         would this solve the data loss problem as you mentioned when there is 
> no epoch overlap? This patch would allow the cluster to recover as if the master 
> (last failed node) had never crashed.

The concept of mastership transfer is great!  I think this is the
right way to go.

But your patch has some problems.  Here is a test case to reproduce
the problems:

    #!/bin/bash
    
    # create a directory which has a different creation time
    sheep /store/1 -p 7001
    sleep 1
    collie cluster format -p 7001
    collie cluster shutdown -p 7001
    sleep 1
    
    # start Sheepdog
    sheep /store/0 -p 7000
    sleep 1
    collie cluster format -p 7000
    
    while true; do
        sheep /store/1 -p 7001
        sheep /store/2 -p 7002
    
        # wait for node join
        while [ "`collie cluster info -p 7002 -r 2>&1 | head -1`" != 'running' ]; do
            sleep 0.1
        done
    
        if [ "`collie node list -p 7002 -r | wc -l`" -ne 2 ]; then
            # break if the result is not correct
            break
        fi
    
        pkill -f "sheep /store/2"
    done
    
    # show results
    collie cluster info -p 7000
    collie cluster info -p 7002


The detailed reasons are below.

> @@ -975,17 +992,6 @@ static void __sd_deliver(struct cpg_event *cevent)
>               addr_to_str(name, sizeof(name), m->from.addr, m->from.port),
>               m->pid);
>  
> -     /*
> -      * we don't want to perform any deliver events until we
> -      * join; we wait for our JOIN message.
> -      */
> -     if (!sys->join_finished) {
> -             if (m->pid != sys->this_pid || m->nodeid != sys->this_nodeid) {
> -                     cevent->skip = 1;
> -                     return;
> -             }
> -     }
> -

Sheepdog assumes that only joined nodes handle the delivered messages,
so we cannot remove this block.  Instead, you should let only
mastership transfer events pass through here.
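
As a rough illustration of that filter (a sketch only, not tested against
the tree): the struct layouts, the enum values, and the helper name
should_skip_deliver() below are stand-ins invented for this example, and
SKETCH_MSG_MASTER_TRANSFER represents a message type that does not exist
yet.  Only the field names (join_finished, this_pid, this_nodeid, op, pid,
nodeid) are taken from the quoted code.

```c
#include <stdint.h>

/* Stand-in opcodes; the real values live in sheepdog's headers. */
enum sketch_op {
	SKETCH_MSG_JOIN = 1,
	SKETCH_MSG_MASTER_TRANSFER = 2,	/* hypothetical new type */
	SKETCH_MSG_OTHER = 3,
};

/* Minimal stand-in for the delivered message header. */
struct sketch_msg {
	uint8_t op;
	uint32_t nodeid;
	uint32_t pid;
};

/* Minimal stand-in for the fields of sys used by the check. */
struct sketch_sys {
	int join_finished;
	uint32_t this_nodeid;
	uint32_t this_pid;
};

/*
 * Return 1 if the delivered message must be skipped, 0 if it may be
 * handled.  Before our own join completes, only messages that we sent
 * ourselves and mastership transfer events are let through.
 */
static int should_skip_deliver(const struct sketch_sys *sys,
			       const struct sketch_msg *m)
{
	if (sys->join_finished)
		return 0;
	if (m->op == SKETCH_MSG_MASTER_TRANSFER)
		return 0;
	if (m->pid != sys->this_pid || m->nodeid != sys->this_nodeid)
		return 1;
	return 0;
}
```

With something like this, __sd_deliver() would keep the original block and
set cevent->skip only when the helper returns 1.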


>       if (m->op == SD_MSG_JOIN) {
>               uint32_t nodeid = m->nodeid;
>               uint32_t pid = m->pid;
> @@ -1052,7 +1058,15 @@ static void send_join_response(struct work_deliver *w)
>                       jm->nr_leave_nodes++;
>               }
>               print_node_list(&sys->leave_list);
> +     } else if (jm->result != SD_RES_SUCCESS &&
> +                     jm->epoch > sys->epoch &&
> +                     jm->cluster_status == SD_STATUS_WAIT_FOR_JOIN) {
> +             eprintf("Transfer mastership.\n");
> +             leave_cluster();
> +             eprintf("Restart me later when master is up, please.Bye.\n");
> +             exit(1);
>       }
> +     jm->epoch = sys->epoch;
>       send_message(sys->handle, m);
>  }
>  
> @@ -1090,15 +1104,23 @@ static void __sd_deliver_done(struct cpg_event *cevent)
>                               lm = (struct leave_message *)m;
>                               add_node_to_leave_list(m);
>  
> -                             if (lm->epoch > sys->leave_epoch)
> -                                     sys->leave_epoch = lm->epoch;
> +                             /* Sheep needs this to identify itself as master.
> +                              * Now mastership transfer is done.
> +                              */
> +                             if (!sys->join_finished) {
> +                                     sys->join_finished = 1;
> +                                     move_node_to_sd_list(sys->this_nodeid, sys->this_pid, sys->this_node);
> +                                     sys->epoch = get_latest_epoch();
> +                             }

IIUC, this code assumes that all other nodes will send leave messages
because this node has a newer epoch, and therefore this node can become
the master.  But that assumption is wrong, because a node which has
completely wrong epoch information (e.g. a node with a different
creation time) also sends a leave message.

My suggestion is to introduce another message type, something like
SD_MSG_MASTER_TRANSFER.  I think we should clearly distinguish master
transfer events from leave messages.
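
To make the distinction concrete, here is a minimal sketch under stated
assumptions: the enum, the xfer_state struct, and handle_cluster_msg() are
all hypothetical names made up for this example and are not part of
sheepdog.  The point is only that a leave message is counted and never
promotes the receiver, while a dedicated transfer message is the sole
trigger for taking over mastership.

```c
/* Hypothetical opcodes for this sketch only. */
enum xfer_op {
	XFER_MSG_LEAVE = 1,
	XFER_MSG_MASTER_TRANSFER = 2,
};

/* Hypothetical, reduced view of the node state. */
struct xfer_state {
	int join_finished;	/* has this node completed its join? */
	int nr_leave;		/* number of leave messages seen */
	int is_master;		/* did we take over mastership? */
};

static void handle_cluster_msg(struct xfer_state *st, enum xfer_op op)
{
	switch (op) {
	case XFER_MSG_LEAVE:
		/*
		 * A leave may come from a node with unrelated epoch
		 * information, so it must not imply mastership.
		 */
		st->nr_leave++;
		break;
	case XFER_MSG_MASTER_TRANSFER:
		/* Only an explicit transfer makes a joining node the master. */
		if (!st->join_finished) {
			st->join_finished = 1;
			st->is_master = 1;
		}
		break;
	}
}
```

In this shape, a node with a different creation time can still send its
leave message without causing the joining node to treat itself as master.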


Thanks,

Kazutaka


>  
>                               nr_local = get_nodes_nr_epoch(sys->epoch);
>                               nr = get_nodes_nr_from(&sys->sd_node_list);
>                               nr_leave = get_nodes_nr_from(&sys->leave_list);
> +
> +                             dprintf("%d == %d + %d \n", nr_local, nr, nr_leave);
>                               if (nr_local == nr + nr_leave) {
>                                       sys->status = SD_STATUS_OK;
> -                                     sys->epoch = sys->leave_epoch + 1;
> +                                     sys->epoch = sys->epoch;
>                                       update_epoch_log(sys->epoch);
>                                       update_epoch_store(sys->epoch);
>                               }
> @@ -1931,7 +1953,6 @@ join_retry:
>       sys->handle = cpg_handle;
>       sys->this_nodeid = nodeid;
>       sys->this_pid = getpid();
> -     sys->leave_epoch = 0;
>  
>       ret = set_addr(nodeid, port);
>       if (ret)
> diff --git a/sheep/sheep_priv.h b/sheep/sheep_priv.h
> index 6680f79..4711cdd 100644
> --- a/sheep/sheep_priv.h
> +++ b/sheep/sheep_priv.h
> @@ -144,7 +144,6 @@ struct cluster_info {
>       int nr_outstanding_reqs;
>  
>       uint32_t recovered_epoch;
> -     uint32_t leave_epoch; /* The highest number in the clsuter */
>  
>       int use_directio;
>  
> -- 
> 1.7.6.1
> 
> -- 
> sheepdog mailing list
> sheepdog@lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog
