I'd like to post an update based on my latest findings on this. Googling further, I ended up reading this article: https://community.rackspace.com/developers/f/7/t/4858
Comparing it with the docs (https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/) and with my situation, I was able to establish a reproducible chain of events:

# stop glusterfs
sst2# service glusterfs-server stop
sst2# killall glusterfs glusterfsd

# make sure there are no more glusterfs processes
sst2# ps auwwx | grep gluster

# preserve glusterd.info and clean everything else
sst2# cd /var/lib/glusterd && mv glusterd.info .. && rm -rf * && mv ../glusterd.info .

# start glusterfs
sst2# service glusterfs-server start

# probe peers
sst2# gluster peer status
Number of Peers: 0

sst2# gluster peer probe sst0
peer probe: success.
sst2# gluster peer probe sst1
peer probe: success.

# restart glusterd twice to bring peers back into the cluster
sst2# gluster peer status
Number of Peers: 2

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Accepted peer request (Connected)

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Accepted peer request (Connected)

sst2# service glusterfs-server restart
sst2# gluster peer status
Number of Peers: 2

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Sent and Received peer request (Connected)

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Sent and Received peer request (Connected)

sst2# service glusterfs-server restart
sst2# gluster peer status
Number of Peers: 2

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Peer in Cluster (Connected)

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Peer in Cluster (Connected)

# resync volume information
sst2# gluster volume sync sst0 all
Sync volume may make data inaccessible while the sync is in progress. Do you want to continue? (y/n) y
volume sync: success

sst2# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: sst0:/var/glusterfs
Brick2: sst2:/var/glusterfs
Options Reconfigured:
cluster.self-heal-daemon: enable
performance.readdir-ahead: on
storage.owner-uid: 1000
storage.owner-gid: 1000

sst2# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sst0:/var/glusterfs                   49153     0          Y       29830
Brick sst2:/var/glusterfs                   49152     0          Y       5137
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       6034
NFS Server on sst0                          N/A       N/A        N       N/A
Self-heal Daemon on sst0                    N/A       N/A        Y       29821
NFS Server on sst1                          N/A       N/A        N       N/A
Self-heal Daemon on sst1                    N/A       N/A        Y       19997

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

sst2# gluster volume heal gv0 full
Launching heal operation to perform full self heal on volume gv0 has been successful
Use heal info commands to check status

sst2# gluster volume heal gv0 info
Brick sst0:/var/glusterfs
Status: Connected
Number of entries: 0

Brick sst2:/var/glusterfs
Status: Connected
Number of entries: 0

The most disturbing thing about this is that I'm perfectly sure the bricks are NOT in sync, according to the du -s output:

sst0# du -s /var/glusterfs/
3107570500      /var/glusterfs/

sst2# du -s /var/glusterfs/
3107567396      /var/glusterfs/

If anybody could be so kind as to point out how to get the replicas back into a synchronous state, I would be extremely grateful.

Best,
Seva

28.04.2017, 13:01, "Seva Gluschenko" <g...@webkontrol.ru>:
> Of course. Please find attached. Hope they can shed some light on this.
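P.S. Regarding the du -s mismatch above: part of that difference can come from the bricks' internal .glusterfs metadata directory, which du counts but which can legitimately differ between replicas, so du alone isn't proof the data diverged. A rough content-level check I'd trust more (a sketch only: the brick_sum helper name is mine, GNU coreutils/findutils are assumed, and it compares file contents and relative paths, not ownership or permissions):

```shell
# brick_sum: print a single checksum summarizing every regular file under
# a brick, skipping the .glusterfs metadata directory (whose contents
# legitimately differ between bricks and skew du -s).
brick_sum() {
    (cd "$1" && find . -path ./.glusterfs -prune -o -type f -print0 \
        | sort -z | xargs -0 md5sum | md5sum | cut -d' ' -f1)
}

# Run on each brick host and compare the two hashes:
#   sst0# brick_sum /var/glusterfs
#   sst2# brick_sum /var/glusterfs
# Identical hashes mean identical file contents at identical relative paths.
```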
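P.P.S. For completeness, the listen-port edits I mention in my earlier message quoted below amounted to roughly the following (a sketch only: the brick-info file names are the ones from my message, the exact key=value layout of these files varies between glusterd versions, and hand-editing /var/lib/glusterd should only be done with glusterd stopped):

```shell
# fix_listen_port VOLDIR: rewrite listen-port in the two copied brick-info
# files. File names follow my earlier message; adapt the directory to where
# your glusterd keeps them (e.g. /var/lib/glusterd/vols/gv0).
fix_listen_port() {
    voldir=$1
    # remote brick (sst0): 0 lets glusterd rediscover the port
    sed -i 's/^listen-port=.*/listen-port=0/' "$voldir/sst0:-var-glusterfs"
    # local brick (sst2): the port the brick actually listens on
    sed -i 's/^listen-port=.*/listen-port=49152/' "$voldir/sst2:-var-glusterfs"
}
```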
>
> Thanks,
>
> Seva
>
> 28.04.2017, 12:41, "Mohammed Rafi K C" <rkavu...@redhat.com>:
>> Can you share the glusterd logs from the three nodes ?
>>
>> Rafi KC
>>
>> On 04/28/2017 02:34 PM, Seva Gluschenko wrote:
>>> Dear Community,
>>>
>>> I call for your wisdom, as it appears that googling for keywords
>>> doesn't help much.
>>>
>>> I have a glusterfs volume with replica count 2, and I tried to perform
>>> the online upgrade procedure described in the docs
>>> (http://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/).
>>> It all went almost fine when I was done with the first replica; the
>>> only problem was the self-heal procedure, which refused to complete
>>> until I commented out all IPv6 entries in /etc/hosts.
>>>
>>> Being sure that it would all work on the 2nd replica the same way it
>>> did on the 1st one, I proceeded with the upgrade on replica 2. All of
>>> a sudden, it told me that it didn't see the first replica at all. The
>>> state before the upgrade was:
>>>
>>> sst2# gluster volume status
>>> Status of volume: gv0
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick sst0:/var/glusterfs                   49152     0          Y       3482
>>> Brick sst2:/var/glusterfs                   49152     0          Y       29863
>>> NFS Server on localhost                     2049      0          Y       25175
>>> Self-heal Daemon on localhost               N/A       N/A        Y       25283
>>> NFS Server on sst0                          N/A       N/A        N       N/A
>>> Self-heal Daemon on sst0                    N/A       N/A        Y       4827
>>> NFS Server on sst1                          N/A       N/A        N       N/A
>>> Self-heal Daemon on sst1                    N/A       N/A        Y       15009
>>>
>>> Task Status of Volume gv0
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> sst2# gluster peer status
>>> Number of Peers: 2
>>>
>>> Hostname: sst0
>>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>>> State: Peer in Cluster (Connected)
>>>
>>> Hostname: sst1
>>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>>> State: Sent and Received peer request (Connected)
>>>
>>> sst2# gluster volume heal gv0 info
>>> Brick sst0:/var/glusterfs
>>> Number of entries: 0
>>>
>>> Brick sst2:/var/glusterfs
>>> Number of entries: 0
>>>
>>> After the upgrade, it looked like this:
>>>
>>> sst2# gluster volume status
>>> Status of volume: gv0
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>>
>>> Task Status of Volume gv0
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> sst2# gluster peer status
>>> Number of Peers: 2
>>>
>>> Hostname: sst1
>>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>>> State: Sent and Received peer request (Connected)
>>>
>>> Hostname: sst0
>>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>>> State: Peer Rejected (Connected)
>>>
>>> Probably my biggest mistake: at that point I googled, found this article
>>> https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
>>> -- and followed its advice, removing on sst2 all the /var/lib/glusterd
>>> contents except the glusterd.info file. As a result, the node,
>>> predictably, lost all information about the volume.
>>>
>>> sst2# gluster volume status
>>> No volumes present
>>>
>>> sst2# gluster peer status
>>> Number of Peers: 2
>>>
>>> Hostname: sst0
>>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>>> State: Accepted peer request (Connected)
>>>
>>> Hostname: sst1
>>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>>> State: Accepted peer request (Connected)
>>>
>>> Okay, I thought, this might be a good time to re-add the brick.
>>> Not that easy, Jack:
>>>
>>> sst0# gluster volume add-brick gv0 replica 2 'sst2:/var/glusterfs'
>>> volume add-brick: failed: Operation failed
>>>
>>> The reason appeared to be natural: sst0 still knows that there was a
>>> replica on sst2. What should I do then? At this point, I tried to
>>> recover the volume information on sst2 by putting it offline and
>>> copying all the volume info from sst0. Of course, it wasn't enough to
>>> just copy it as is; I modified
>>> /var/lib/glusterd/vols/gv0/sst*\:-var-glusterfs, setting listen-port=0
>>> for the remote brick (sst0) and listen-port=49152 for the local brick
>>> (sst2). It didn't help much, unfortunately. The final state I've
>>> reached is as follows:
>>>
>>> sst2# gluster peer status
>>> Number of Peers: 2
>>>
>>> Hostname: sst1
>>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>>> State: Sent and Received peer request (Connected)
>>>
>>> Hostname: sst0
>>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>>> State: Sent and Received peer request (Connected)
>>>
>>> sst2# gluster volume info
>>>
>>> Volume Name: gv0
>>> Type: Replicate
>>> Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: sst0:/var/glusterfs
>>> Brick2: sst2:/var/glusterfs
>>> Options Reconfigured:
>>> cluster.self-heal-daemon: enable
>>> performance.readdir-ahead: on
>>> storage.owner-uid: 1000
>>> storage.owner-gid: 1000
>>>
>>> sst2# gluster volume status
>>> Status of volume: gv0
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>>
>>> Task Status of Volume gv0
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> Meanwhile, on sst0:
>>>
>>> sst0# gluster volume info
>>>
>>> Volume Name: gv0
>>> Type: Replicate
>>> Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: sst0:/var/glusterfs
>>> Brick2: sst2:/var/glusterfs
>>> Options Reconfigured:
>>> storage.owner-gid: 1000
>>> storage.owner-uid: 1000
>>> performance.readdir-ahead: on
>>> cluster.self-heal-daemon: enable
>>>
>>> sst0 ~ # gluster volume status
>>> Status of volume: gv0
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick sst0:/var/glusterfs                   49152     0          Y       31263
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>> Self-heal Daemon on localhost               N/A       N/A        Y       31254
>>>
>>> Task Status of Volume gv0
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> Any ideas on how to bring sst2 back to normal are appreciated. As a
>>> last-resort solution, I can schedule downtime, back up the data, kill
>>> the volume, and start all over, but I would like to know if there is a
>>> shorter path. Thank you very much in advance.
>>>
>>> --
>>> Best Regards,
>>>
>>> Seva Gluschenko

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users