I went ahead this morning, pulled down g066d753, and applied it to the
four nodes.  I brought up node174 first, as usual, and then tried
node173; it still refused to join, though with a different message:

.......
Sep 22 08:28:49 update_cluster_info(756) status = 2, epoch = 17, 66, 0
Sep 22 08:28:49 update_cluster_info(759) failed to join sheepdog, 66
Sep 22 08:28:49 leave_cluster(1984) 16
Sep 22 08:28:49 update_cluster_info(761) I am really hurt and gonna
leave cluster.
Sep 22 08:28:49 update_cluster_info(762) Fix yourself and restart me
later, pleaseeeee...Bye.
Sep 22 08:28:49 log_sigsegv(367) sheep logger exits abnormally, pid:24265


I then brought up node157 and it recovered.  Then I brought up node156
and it recovered as well.  Then I was able to bring up node173.

I noticed some odd things (mostly the in-use sizes changing) while
watching "collie node info" during the node startups.  I'm not sure
whether this is normal, but below is what I saw.  The output is ordered
from oldest (top) to most recent (bottom).  Do objects redistribute
themselves around the cluster during recovery or epoch changes?
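For what it's worth, my understanding is that sheepdog places objects by
consistent hashing over the node set, so when a node joins, some fraction
of object IDs is reassigned to it and recovery copies them over, which
would explain the shifting "Used" numbers above.  Here is a minimal sketch
of that idea (the node names match ours, but the object IDs and the ring
scheme are illustrative, not sheepdog's actual placement code):

```python
# Illustrative consistent-hashing sketch: adding a node to the ring
# reassigns a fraction of object IDs to it.  This is NOT sheepdog's
# real placement algorithm, just the general technique.
import hashlib

def ring_pos(key: str) -> int:
    """Map a key to a position on the hash ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

def place(obj_id: str, nodes: list) -> str:
    """Pick the first node at or after the object's ring position."""
    obj = ring_pos(obj_id)
    ranked = sorted(nodes, key=ring_pos)
    for node in ranked:
        if ring_pos(node) >= obj:
            return node
    return ranked[0]  # wrap around the ring

three = ["node174", "node157", "node156"]
four = three + ["node173"]

# Hypothetical object IDs, just to count how many would move.
objects = ["80f596950000%04x" % i for i in range(1000)]
moved = sum(place(o, three) != place(o, four) for o in objects)
print("%d/1000 objects change owner when node173 joins" % moved)
```

With four nodes on the ring, roughly a quarter of the IDs land on the new
node, which matches the order of magnitude of data movement I saw.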

node174 and node157:
[root@node174 ~]# collie node info
Id      Size    Used    Use%
 0      382 GB  17 GB     4%
 1      394 GB  17 GB     4%

Total   775 GB  34 GB     4%, total virtual VDI Size    100 GB

Then I added node156:
[root@node174 ~]# collie node info
Id      Size    Used    Use%
 0      365 GB  720 MB    0%
 1      376 GB  12 GB     3%
 2      380 GB  3.6 GB    0%
failed to read object, 80f5969500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80f5969600000000 Remote node has a new epoch
failed to read a inode header

Total   1.1 TB  16 GB     1%, total virtual VDI Size    0.0 MB

[root@node174 ~]# collie node info
Id      Size    Used    Use%
 0      365 GB  1008 MB   0%
 1      377 GB  12 GB     3%
 2      382 GB  5.2 GB    1%

Total   1.1 TB  19 GB     1%, total virtual VDI Size    100 GB

Then, after every node finished recovery, I added node173 back:
[root@node174 ~]# collie node info
Id      Size    Used    Use%
 0      374 GB  10 GB     2%
 1      377 GB  12 GB     3%
 2      399 GB  22 GB     5%

Total   1.1 TB  45 GB     3%, total virtual VDI Size    100 GB

[root@node174 ~]# collie node info
Id      Size    Used    Use%
 0      365 GB  496 MB    0%
 1      366 GB  1.3 GB    0%
 2      377 GB  792 MB    0%
 3      394 GB  17 GB     4%
failed to read object, 80f5969400000000 Remote node has a new epoch
failed to read a inode header

Total   1.5 TB  20 GB     1%, total virtual VDI Size    100 GB

[root@node174 sheepdog]# collie node info
Id      Size    Used    Use%
 0      386 GB  21 GB     5%
 1      381 GB  17 GB     4%
 2      397 GB  21 GB     5%
 3      394 GB  17 GB     4%

Total   1.5 TB  76 GB     4%, total virtual VDI Size    100 GB

But as far as I can tell, everything is working right now.
-- 
sheepdog mailing list
[email protected]
http://lists.wpkg.org/mailman/listinfo/sheepdog