I'd like to post an update based on my latest findings on this. Googling further, I ended up reading this article: https://community.rackspace.com/developers/f/7/t/4858
Comparing it with the docs (https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/) and with my situation, I was able to establish a reproducible chain of events:

# stop glusterfs
sst2# service glusterfs-server stop
sst2# killall glusterfs glusterfsd

# make sure there are no more glusterfs processes
sst2# ps auwwx | grep gluster

# preserve glusterd.info and clean everything else
sst2# cd /var/lib/glusterd && mv glusterd.info .. && rm -rf * && mv ../glusterd.info .

# start glusterfs
sst2# service glusterfs-server start

# probe peers
sst2# gluster peer status
Number of Peers: 0

sst2# gluster peer probe sst0
peer probe: success.
sst2# gluster peer probe sst1
peer probe: success.

# restart glusterd twice to bring peers back into the cluster
sst2# gluster peer status
Number of Peers: 2

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Accepted peer request (Connected)

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Accepted peer request (Connected)

sst2# service glusterfs-server restart
sst2# gluster peer status
Number of Peers: 2

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Sent and Received peer request (Connected)

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Sent and Received peer request (Connected)

sst2# service glusterfs-server restart
sst2# gluster peer status
Number of Peers: 2

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Peer in Cluster (Connected)

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Peer in Cluster (Connected)

# resync volume information
sst2# gluster volume sync sst0 all
Sync volume may make data inaccessible while the sync is in progress. Do you want to continue? (y/n) y
volume sync: success

sst2# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: sst0:/var/glusterfs
Brick2: sst2:/var/glusterfs
Options Reconfigured:
cluster.self-heal-daemon: enable
performance.readdir-ahead: on
storage.owner-uid: 1000
storage.owner-gid: 1000

sst2# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sst0:/var/glusterfs                   49153     0          Y       29830
Brick sst2:/var/glusterfs                   49152     0          Y       5137
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       6034
NFS Server on sst0                          N/A       N/A        N       N/A
Self-heal Daemon on sst0                    N/A       N/A        Y       29821
NFS Server on sst1                          N/A       N/A        N       N/A
Self-heal Daemon on sst1                    N/A       N/A        Y       19997

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

sst2# gluster volume heal gv0 full
Launching heal operation to perform full self heal on volume gv0 has been successful
Use heal info commands to check status

sst2# gluster volume heal gv0 info
Brick sst0:/var/glusterfs
Status: Connected
Number of entries: 0

Brick sst2:/var/glusterfs
Status: Connected
Number of entries: 0

The most disturbing thing about this is that I'm perfectly sure the bricks are NOT in sync, according to the du -s output:

sst0# du -s /var/glusterfs/
3107570500      /var/glusterfs/

sst2# du -s /var/glusterfs/
3107567396      /var/glusterfs/

If anybody could be so kind as to point out how to get the replicas back into a synchronous state, I would be extremely grateful.

Best,
Seva

28.04.2017, 13:01, "Seva Gluschenko" <g...@webkontrol.ru>:
> Of course. Please find attached. Hope they can shed some light on this.
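P.S. Regarding the du -s mismatch above: part of that difference can come from the bricks' internal .glusterfs metadata directory, which du counts but which can legitimately differ between replicas, so du alone isn't proof the data diverged. A rough content-level check I'd trust more (a sketch only: the brick_sum helper name is mine, GNU coreutils/findutils are assumed, and it compares file contents and relative paths, not ownership or permissions):

```shell
# brick_sum: print a single checksum summarizing every regular file under
# a brick, skipping the .glusterfs metadata directory (whose contents
# legitimately differ between bricks and skew du -s).
brick_sum() {
    (cd "$1" && find . -path ./.glusterfs -prune -o -type f -print0 \
        | sort -z | xargs -0 md5sum | md5sum | cut -d' ' -f1)
}

# Run on each brick host and compare the two hashes:
#   sst0# brick_sum /var/glusterfs
#   sst2# brick_sum /var/glusterfs
# Identical hashes mean identical file contents at identical relative paths.
```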
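P.P.S. For completeness, the listen-port edits I mention in my earlier message quoted below amounted to roughly the following (a sketch only: the brick-info file names are the ones from my message, the exact key=value layout of these files varies between glusterd versions, and hand-editing /var/lib/glusterd should only be done with glusterd stopped):

```shell
# fix_listen_port VOLDIR: rewrite listen-port in the two copied brick-info
# files. File names follow my earlier message; adapt the directory to where
# your glusterd keeps them (e.g. /var/lib/glusterd/vols/gv0).
fix_listen_port() {
    voldir=$1
    # remote brick (sst0): 0 lets glusterd rediscover the port
    sed -i 's/^listen-port=.*/listen-port=0/' "$voldir/sst0:-var-glusterfs"
    # local brick (sst2): the port the brick actually listens on
    sed -i 's/^listen-port=.*/listen-port=49152/' "$voldir/sst2:-var-glusterfs"
}
```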
>
> Thanks,
>
> Seva
>
> 28.04.2017, 12:41, "Mohammed Rafi K C" <rkavu...@redhat.com>:
>> Can you share the glusterd logs from the three nodes ?
>>
>> Rafi KC
>>
>> On 04/28/2017 02:34 PM, Seva Gluschenko wrote:
>>> Dear Community,
>>>
>>> I call for your wisdom, as it appears that googling for keywords
>>> doesn't help much.
>>>
>>> I have a glusterfs volume with replica count 2, and I tried to perform
>>> the online upgrade procedure described in the docs
>>> (http://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/).
>>> It all went almost fine when I was done with the first replica; the
>>> only problem was the self-heal procedure, which refused to complete
>>> until I commented out all IPv6 entries in /etc/hosts.
>>>
>>> Being sure that it would all work on the 2nd replica the same way it
>>> did on the 1st one, I proceeded with the upgrade on replica 2. All of
>>> a sudden, it told me that it didn't see the first replica at all. The
>>> state before the upgrade was:
>>>
>>> sst2# gluster volume status
>>> Status of volume: gv0
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick sst0:/var/glusterfs                   49152     0          Y       3482
>>> Brick sst2:/var/glusterfs                   49152     0          Y       29863
>>> NFS Server on localhost                     2049      0          Y       25175
>>> Self-heal Daemon on localhost               N/A       N/A        Y       25283
>>> NFS Server on sst0                          N/A       N/A        N       N/A
>>> Self-heal Daemon on sst0                    N/A       N/A        Y       4827
>>> NFS Server on sst1                          N/A       N/A        N       N/A
>>> Self-heal Daemon on sst1                    N/A       N/A        Y       15009
>>>
>>> Task Status of Volume gv0
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> sst2# gluster peer status
>>> Number of Peers: 2
>>>
>>> Hostname: sst0
>>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>>> State: Peer in Cluster (Connected)
>>>
>>> Hostname: sst1
>>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>>> State: Sent and Received peer request (Connected)
>>>
>>> sst2# gluster volume heal gv0 info
>>> Brick sst0:/var/glusterfs
>>> Number of entries: 0
>>>
>>> Brick sst2:/var/glusterfs
>>> Number of entries: 0
>>>
>>> After the upgrade, it looked like this:
>>>
>>> sst2# gluster volume status
>>> Status of volume: gv0
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>>
>>> Task Status of Volume gv0
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> sst2# gluster peer status
>>> Number of Peers: 2
>>>
>>> Hostname: sst1
>>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>>> State: Sent and Received peer request (Connected)
>>>
>>> Hostname: sst0
>>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>>> State: Peer Rejected (Connected)
>>>
>>> Probably my biggest mistake: at that point I googled, found this article
>>> https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
>>> -- and followed its advice, removing on sst2 all the /var/lib/glusterd
>>> contents except the glusterd.info file. As a result, the node,
>>> predictably, lost all information about the volume.
>>>
>>> sst2# gluster volume status
>>> No volumes present
>>>
>>> sst2# gluster peer status
>>> Number of Peers: 2
>>>
>>> Hostname: sst0
>>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>>> State: Accepted peer request (Connected)
>>>
>>> Hostname: sst1
>>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>>> State: Accepted peer request (Connected)
>>>
>>> Okay, I thought, this might be a good time to re-add the brick.
>>> Not that easy, Jack:
>>>
>>> sst0# gluster volume add-brick gv0 replica 2 'sst2:/var/glusterfs'
>>> volume add-brick: failed: Operation failed
>>>
>>> The reason appeared to be natural: sst0 still knows that there was a
>>> replica on sst2. What should I do then? At this point, I tried to
>>> recover the volume information on sst2 by putting it offline and
>>> copying all the volume info from sst0. Of course, it wasn't enough to
>>> just copy it as is; I modified
>>> /var/lib/glusterd/vols/gv0/sst*\:-var-glusterfs, setting listen-port=0
>>> for the remote brick (sst0) and listen-port=49152 for the local brick
>>> (sst2). It didn't help much, unfortunately. The final state I've
>>> reached is as follows:
>>>
>>> sst2# gluster peer status
>>> Number of Peers: 2
>>>
>>> Hostname: sst1
>>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>>> State: Sent and Received peer request (Connected)
>>>
>>> Hostname: sst0
>>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>>> State: Sent and Received peer request (Connected)
>>>
>>> sst2# gluster volume info
>>>
>>> Volume Name: gv0
>>> Type: Replicate
>>> Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: sst0:/var/glusterfs
>>> Brick2: sst2:/var/glusterfs
>>> Options Reconfigured:
>>> cluster.self-heal-daemon: enable
>>> performance.readdir-ahead: on
>>> storage.owner-uid: 1000
>>> storage.owner-gid: 1000
>>>
>>> sst2# gluster volume status
>>> Status of volume: gv0
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>>
>>> Task Status of Volume gv0
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> Meanwhile, on sst0:
>>>
>>> sst0# gluster volume info
>>>
>>> Volume Name: gv0
>>> Type: Replicate
>>> Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: sst0:/var/glusterfs
>>> Brick2: sst2:/var/glusterfs
>>> Options Reconfigured:
>>> storage.owner-gid: 1000
>>> storage.owner-uid: 1000
>>> performance.readdir-ahead: on
>>> cluster.self-heal-daemon: enable
>>>
>>> sst0 ~ # gluster volume status
>>> Status of volume: gv0
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick sst0:/var/glusterfs                   49152     0          Y       31263
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>> Self-heal Daemon on localhost               N/A       N/A        Y       31254
>>>
>>> Task Status of Volume gv0
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> Any ideas on how to bring sst2 back to normal are appreciated. As a
>>> last-resort solution, I can schedule downtime, back up the data, kill
>>> the volume, and start all over, but I would like to know if there is a
>>> shorter path. Thank you very much in advance.
>>>
>>> --
>>> Best Regards,
>>>
>>> Seva Gluschenko

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users