On 09/19/2011 11:21 PM, Shawn Moore wrote:
I sent a patch to show a correct output of 'collie cluster info'
without segfault.  Can you try it out?
I went ahead and pulled down "77f26b4" as I was using "3a2801b" for my testing.


From your log messages, it looks like node174 stores a higher epoch.
I think if you run a sheep daemon on node174 first, Sheepdog would
work again.
I had already tried starting node174 first, but with the new code, at
least "collie cluster info" doesn't segfault anymore:
[root@node174 ~]# collie cluster info
Cluster status: Waiting for other nodes joining

Creation time        Epoch Nodes
2011-09-15 20:21:18     17 [192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     16 [192.168.0.157:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     15 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     14 [192.168.0.156:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     13 [192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     12 [192.168.0.156:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     11 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     10 [192.168.0.156:7000, 192.168.0.173:7000, 192.168.0.174:7000]


But I still can't get the other nodes to join.  Here is the sheep.log
from node174:
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000001
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000002
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000003
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000004
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000005
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000006
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000007
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000008
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000009
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000010
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000011
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000012
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000013
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000014
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000015
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000016
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000017
Sep 19 11:08:43 jrnl_recover(2238) Openning the directory
/node/sheepdog/journal/00000017/.
Sep 19 11:08:43 worker_routine(206) started this thread 0
Sep 19 11:08:43 worker_routine(206) started this thread 0
Sep 19 11:08:43 worker_routine(206) started this thread 3
Sep 19 11:08:43 worker_routine(206) started this thread 0
Sep 19 11:08:43 worker_routine(206) started this thread 1
Sep 19 11:08:43 worker_routine(206) started this thread 0
Sep 19 11:08:43 worker_routine(206) started this thread 0
Sep 19 11:08:43 worker_routine(206) started this thread 1
Sep 19 11:08:43 worker_routine(206) started this thread 2
Sep 19 11:08:43 worker_routine(206) started this thread 2
Sep 19 11:08:43 worker_routine(206) started this thread 3
Sep 19 11:08:43 set_addr(1723) addr = 192.168.0.174, port = 7000
Sep 19 11:08:43 create_cluster(1778) zone id = 1
Sep 19 11:08:43 main(167) Sheepdog daemon (version 0.2.3) started
Sep 19 11:08:43 sd_confchg(1621) confchg nodeid aed92998
Sep 19 11:08:43 sd_confchg(1623) 1 0 1
Sep 19 11:08:43 sd_confchg(1627) [0] node_id: aed92998, pid: 8646, reason: 0
Sep 19 11:08:43 sd_confchg(1641) allow new confchg, 0x254e020
Sep 19 11:08:43 start_cpg_event_work(1465) 0 0
Sep 19 11:08:43 cpg_event_fn(1279) 0x254e020, 0 2
Sep 19 11:08:43 cpg_event_done(1315) 0x254e020
Sep 19 11:08:43 __sd_confchg_done(1206) 8646 aed92998
Sep 19 11:08:43 update_cluster_info(683) l nodeid: aed92998, pid:
8646, ip: 192.168.0.174:7000
Sep 19 11:08:43 cpg_event_done(1373) free 0x254e020
Sep 19 11:09:38 sd_confchg(1621) confchg nodeid add92998
Sep 19 11:09:38 sd_confchg(1623) 2 0 1
Sep 19 11:09:38 sd_confchg(1627) [0] node_id: add92998, pid: 8097,
reason: 1940777327
Sep 19 11:09:38 sd_confchg(1627) [1] node_id: aed92998, pid: 8646,
reason: 6485728
Sep 19 11:09:38 sd_confchg(1641) allow new confchg, 0x254e020
Sep 19 11:09:38 start_cpg_event_work(1465) 0 0
Sep 19 11:09:38 cpg_event_fn(1279) 0x254e020, 0 2
Sep 19 11:09:38 cpg_event_done(1315) 0x254e020
Sep 19 11:09:38 __sd_confchg_done(1232) l nodeid: aed92998, pid: 8646,
ip: 192.168.0.174:7000
Sep 19 11:09:38 cpg_event_done(1373) free 0x254e020
Sep 19 11:09:38 sd_deliver(987) op: 1, state: 1, size: 32840, from:
192.168.0.173:7000, nodeid: add92998, pid: 8097
Sep 19 11:09:38 sd_deliver(996) allow new deliver, 0x254e1a0
Sep 19 11:09:38 start_cpg_event_work(1465) 0 1
Sep 19 11:09:38 cpg_event_fn(1279) 0x254e1a0, 1 2
Sep 19 11:09:38 cpg_event_fn(1293) 1
Sep 19 11:09:38 __sd_deliver(839) op: 1, state: 1, size: 32840, from:
192.168.0.173:7000, pid: 8097
Sep 19 11:09:38 cpg_event_done(1315) 0x254e1a0
Sep 19 11:09:38 __sd_deliver_done(955) op: 1, state: 1, size: 32840,
from: 192.168.0.173:7000
Sep 19 11:09:38 get_cluster_status(440) sheepdog is waiting with newer
epoch, 16 17 192.168.0.173:7000
Sep 19 11:09:38 cpg_event_done(1373) free 0x254e1a0
Sep 19 11:09:39 sd_deliver(987) op: 1, state: 3, size: 32840, from:
192.168.0.173:7000, nodeid: aed92998, pid: 8646
Sep 19 11:09:39 sd_deliver(996) allow new deliver, 0x254e1a0
Sep 19 11:09:39 start_cpg_event_work(1465) 0 1
Sep 19 11:09:39 cpg_event_fn(1279) 0x254e1a0, 1 2
Sep 19 11:09:39 cpg_event_fn(1293) 3
Sep 19 11:09:39 __sd_deliver(839) op: 1, state: 3, size: 32840, from:
192.168.0.173:7000, pid: 8097
Sep 19 11:09:39 cpg_event_done(1315) 0x254e1a0
Sep 19 11:09:39 __sd_deliver_done(955) op: 1, state: 3, size: 32840,
from: 192.168.0.173:7000
Sep 19 11:09:39 cpg_event_done(1373) free 0x254e1a0
Sep 19 11:09:58 sd_confchg(1621) confchg nodeid 9cd92998
Sep 19 11:09:58 sd_confchg(1623) 3 0 1
Sep 19 11:09:58 sd_confchg(1627) [0] node_id: 9cd92998, pid: 14918, reason: 0
Sep 19 11:09:58 sd_confchg(1627) [1] node_id: add92998, pid: 8097, reason: 0
Sep 19 11:09:58 sd_confchg(1627) [2] node_id: aed92998, pid: 8646, reason: 0
Sep 19 11:09:58 sd_confchg(1641) allow new confchg, 0x254e020
Sep 19 11:09:58 start_cpg_event_work(1465) 0 0
Sep 19 11:09:58 cpg_event_fn(1279) 0x254e020, 0 2
Sep 19 11:09:58 cpg_event_done(1315) 0x254e020
Sep 19 11:09:58 __sd_confchg_done(1232) l nodeid: aed92998, pid: 8646,
ip: 192.168.0.174:7000
Sep 19 11:09:58 cpg_event_done(1373) free 0x254e020
Sep 19 11:09:58 sd_deliver(987) op: 1, state: 1, size: 32840, from:
192.168.0.156:7000, nodeid: 9cd92998, pid: 14918
Sep 19 11:09:58 sd_deliver(996) allow new deliver, 0x254e1a0
Sep 19 11:09:58 start_cpg_event_work(1465) 0 1
Sep 19 11:09:58 cpg_event_fn(1279) 0x254e1a0, 1 2
Sep 19 11:09:58 cpg_event_fn(1293) 1
Sep 19 11:09:58 __sd_deliver(839) op: 1, state: 1, size: 32840, from:
192.168.0.156:7000, pid: 14918
Sep 19 11:09:58 cpg_event_done(1315) 0x254e1a0
Sep 19 11:09:58 __sd_deliver_done(955) op: 1, state: 1, size: 32840,
from: 192.168.0.156:7000
Sep 19 11:09:58 get_cluster_status(440) sheepdog is waiting with newer
epoch, 15 17 192.168.0.156:7000
Sep 19 11:09:58 cpg_event_done(1373) free 0x254e1a0
Sep 19 11:09:58 sd_deliver(987) op: 1, state: 3, size: 32840, from:
192.168.0.156:7000, nodeid: aed92998, pid: 8646
Sep 19 11:09:58 sd_deliver(996) allow new deliver, 0x254e1a0
Sep 19 11:09:58 start_cpg_event_work(1465) 0 1
Sep 19 11:09:58 cpg_event_fn(1279) 0x254e1a0, 1 2
Sep 19 11:09:58 cpg_event_fn(1293) 3
Sep 19 11:09:58 __sd_deliver(839) op: 1, state: 3, size: 32840, from:
192.168.0.156:7000, pid: 14918
Sep 19 11:09:58 cpg_event_done(1315) 0x254e1a0
Sep 19 11:09:58 __sd_deliver_done(955) op: 1, state: 3, size: 32840,
from: 192.168.0.156:7000
Sep 19 11:09:58 cpg_event_done(1373) free 0x254e1a0
Sep 19 11:10:04 sd_confchg(1621) confchg nodeid 9cd92998
Sep 19 11:10:04 sd_confchg(1623) 4 0 1
Sep 19 11:10:04 sd_confchg(1627) [0] node_id: 9cd92998, pid: 14918, reason: 0
Sep 19 11:10:04 sd_confchg(1627) [1] node_id: 9dd92998, pid: 8515, reason: 0
Sep 19 11:10:04 sd_confchg(1627) [2] node_id: add92998, pid: 8097,
reason: 1940777327
Sep 19 11:10:04 sd_confchg(1627) [3] node_id: aed92998, pid: 8646,
reason: 6485728
Sep 19 11:10:04 sd_confchg(1641) allow new confchg, 0x254e020
Sep 19 11:10:04 start_cpg_event_work(1465) 0 0
Sep 19 11:10:04 cpg_event_fn(1279) 0x254e020, 0 2
Sep 19 11:10:04 cpg_event_done(1315) 0x254e020
Sep 19 11:10:04 __sd_confchg_done(1232) l nodeid: aed92998, pid: 8646,
ip: 192.168.0.174:7000
Sep 19 11:10:04 cpg_event_done(1373) free 0x254e020
Sep 19 11:10:04 sd_deliver(987) op: 1, state: 1, size: 32840, from:
192.168.0.157:7000, nodeid: 9dd92998, pid: 8515
Sep 19 11:10:04 sd_deliver(996) allow new deliver, 0x254e1a0
Sep 19 11:10:04 start_cpg_event_work(1465) 0 1
Sep 19 11:10:04 cpg_event_fn(1279) 0x254e1a0, 1 2
Sep 19 11:10:04 cpg_event_fn(1293) 1
Sep 19 11:10:04 __sd_deliver(839) op: 1, state: 1, size: 32840, from:
192.168.0.157:7000, pid: 8515
Sep 19 11:10:04 cpg_event_done(1315) 0x254e1a0
Sep 19 11:10:04 __sd_deliver_done(955) op: 1, state: 1, size: 32840,
from: 192.168.0.157:7000
Sep 19 11:10:04 get_cluster_status(440) sheepdog is waiting with newer
epoch, 16 17 192.168.0.157:7000
Sep 19 11:10:04 cpg_event_done(1373) free 0x254e1a0
Sep 19 11:10:04 sd_deliver(987) op: 1, state: 3, size: 32840, from:
192.168.0.157:7000, nodeid: aed92998, pid: 8646
Sep 19 11:10:04 sd_deliver(996) allow new deliver, 0x254e1a0
Sep 19 11:10:04 start_cpg_event_work(1465) 0 1
Sep 19 11:10:04 cpg_event_fn(1279) 0x254e1a0, 1 2
Sep 19 11:10:04 cpg_event_fn(1293) 3
Sep 19 11:10:04 __sd_deliver(839) op: 1, state: 3, size: 32840, from:
192.168.0.157:7000, pid: 8515
Sep 19 11:10:04 cpg_event_done(1315) 0x254e1a0
Sep 19 11:10:04 __sd_deliver_done(955) op: 1, state: 3, size: 32840,
from: 192.168.0.157:7000
Sep 19 11:10:04 cpg_event_done(1373) free 0x254e1a0
Sep 19 11:10:10 listen_handler(613) accepted a new connection, 11
Sep 19 11:10:10 queue_request(211) 82
Sep 19 11:10:10 start_cpg_event_work(1465) 0 2
Sep 19 11:10:10 cluster_queue_request(261) 0x7f92a13fb010 82
Sep 19 11:10:10 client_handler(563) closed a connection, 11
Sep 19 11:10:13 listen_handler(613) accepted a new connection, 11
Sep 19 11:10:13 queue_request(211) 87
Sep 19 11:10:13 start_cpg_event_work(1465) 0 2
Sep 19 11:10:13 cluster_queue_request(261) 0x254e340 87
Sep 19 11:10:13 client_handler(563) closed a connection, 11
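
For anyone reading along, the get_cluster_status lines in the log above
can be pulled out with a small script. This is only a rough sketch: it
assumes the two numbers in each message are the joining node's epoch
followed by the local epoch (that reading matches the epoch 17 shown by
"collie cluster info" on node174, but it is not confirmed from the sheep
source). The names SAMPLE_LOG, PATTERN and epoch_mismatches are made up
for this example.

```python
import re

# Lines copied from the sheep.log above, still wrapped as in the mail.
SAMPLE_LOG = """\
Sep 19 11:09:38 get_cluster_status(440) sheepdog is waiting with newer
epoch, 16 17 192.168.0.173:7000
Sep 19 11:09:58 get_cluster_status(440) sheepdog is waiting with newer
epoch, 15 17 192.168.0.156:7000
Sep 19 11:10:04 get_cluster_status(440) sheepdog is waiting with newer
epoch, 16 17 192.168.0.157:7000
"""

# \s+ also matches the newline, so wrapped log lines still match.
PATTERN = re.compile(
    r"sheepdog is waiting with newer\s+epoch,\s+(\d+)\s+(\d+)\s+(\S+)"
)


def epoch_mismatches(log_text):
    """Return (node, joiner_epoch, local_epoch) for each matching message.

    ASSUMPTION: the first number is the joining node's epoch and the
    second is the local node's epoch.
    """
    return [
        (m.group(3), int(m.group(1)), int(m.group(2)))
        for m in PATTERN.finditer(log_text)
    ]


if __name__ == "__main__":
    for node, theirs, ours in epoch_mismatches(SAMPLE_LOG):
        print(f"{node}: epoch {theirs} (local epoch {ours})")
```

Under that assumption, all three joining nodes report an epoch lower
than the 17 stored on node174.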


Thanks for your assistance with this.

So I guess you shut down the cluster with the 'collie cluster shutdown' command, no? Would you please attach the logs from the nodes that would not join?

I think the patch set 'sheep: teach sheepdog to better recovery the shut-down cluster' might solve your problem, if the problem is recovering the cluster from the shutdown state. Kazutaka may still be reviewing it, though, so please wait for it to be merged.

Thanks,
Yuan
--
sheepdog mailing list
[email protected]
http://lists.wpkg.org/mailman/listinfo/sheepdog
