I changed quota-version=1 on the two new nodes, and they were then able to join the cluster. I also rebooted the two new nodes and everything came back up correctly.
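For reference, the change on each of the two new nodes amounted to something like this (a sketch only; whether it's service or systemctl depends on the distro, and editing the file by hand works just as well):

    # stop glusterd on the node carrying the stale quota-version
    service glusterd stop
    # bump quota-version from 0 to 1 in the volume's info file
    sed -i 's/^quota-version=0$/quota-version=1/' /var/lib/glusterd/vols/storage/info
    # start glusterd again
    service glusterd start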
Then I triggered a rebalance fix-layout, and glusterd crashed on one of the original cluster members (gluster03). I restarted glusterd and the peer reconnected, but after a few minutes I'm left with:

# gluster peer status
Number of Peers: 5

Hostname: 10.0.231.51
Uuid: b01de59a-4428-486b-af49-cb486ab44a07
State: Peer in Cluster (Connected)

Hostname: 10.0.231.52
Uuid: 75143760-52a3-4583-82bb-a9920b283dac
*State: Peer Rejected (Connected)*

Hostname: 10.0.231.53
Uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
State: Peer in Cluster (Connected)

Hostname: 10.0.231.54
Uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
State: Peer in Cluster (Connected)

Hostname: 10.0.231.55
Uuid: 9c155c8e-2cd1-4cfc-83af-47129b582fd3
State: Peer in Cluster (Connected)

I see in the logs (attached) there is now a cksum error:

[2016-02-29 19:16:42.082256] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.55
[2016-02-29 19:16:42.082298] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.55 (0), ret: 0
[2016-02-29 19:16:42.092535] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411, host: 10.0.231.53, port: 0
[2016-02-29 19:16:42.096036] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-export-domain-storage/export-domain-storage on port 49153
[2016-02-29 19:16:42.097296] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-vm-storage/vm-storage on port 49155
[2016-02-29 19:16:42.100727] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.108495] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
[2016-02-29 19:16:42.109295] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.53
[2016-02-29 19:16:42.109338] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.53 (0), ret: 0
[2016-02-29 19:16:42.119521] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-env-modules/env-modules on port 49157
[2016-02-29 19:16:42.122856] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/raid6-storage/storage on port 49156
[2016-02-29 19:16:42.508104] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: b01de59a-4428-486b-af49-cb486ab44a07, host: 10.0.231.51, port: 0
[2016-02-29 19:16:42.519403] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.524353] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b01de59a-4428-486b-af49-cb486ab44a07
[2016-02-29 19:16:42.524999] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.51
[2016-02-29 19:16:42.525038] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.51 (0), ret: 0
[2016-02-29 19:16:42.592523] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c, host: 10.0.231.54, port: 0
[2016-02-29 19:16:42.599518] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.604821] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
[2016-02-29 19:16:42.605458] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.54
[2016-02-29 19:16:42.605492] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.54 (0), ret: 0
[2016-02-29 19:16:42.621943] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700
[2016-02-29 19:16:42.628443] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: a965e782-39e2-41cc-a0d1-b32ecccdcd2f
[2016-02-29 19:16:42.629079] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.50

On gluster01/02/04/05, /var/lib/glusterd/vols/storage/cksum contains:
info=998305000

On gluster03, /var/lib/glusterd/vols/storage/cksum contains:
info=998305001

How do I recover from this? Can I just stop glusterd on gluster03 and change the cksum value?
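(As a quick way to see which nodes disagree, a loop like the one below dumps the cksum file from every peer; the IPs are the ones from the peer status above, and passwordless root ssh between the nodes is assumed:)

    for h in 10.0.231.50 10.0.231.51 10.0.231.52 10.0.231.53 10.0.231.54 10.0.231.55; do
        # print each node's stored checksum for the 'storage' volume config
        echo -n "$h: "; ssh "$h" cat /var/lib/glusterd/vols/storage/cksum
    done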
On Thu, Feb 25, 2016 at 12:49 PM, Mohammed Rafi K C <rkavu...@redhat.com> wrote:
>
> On 02/26/2016 01:53 AM, Mohammed Rafi K C wrote:
>
> On 02/26/2016 01:32 AM, Steve Dainard wrote:
>
> I haven't done anything more than peer thus far, so I'm a bit confused as
> to how the volume info fits in. Can you expand on this a bit?
>
> Failed commits? Is this split-brain on the replica volumes? I don't get
> any output from 'gluster volume heal <volname> info' on any of the replica
> volumes, but if I try 'gluster volume heal <volname> full' I get:
> 'Launching heal operation to perform full self heal on volume <volname>
> has been unsuccessful'.
>
> Forget about this; it is not for metadata self-heal.
>
> I have 5 volumes total.
>
> 'Replica 3' volumes running on gluster01/02/03:
> vm-storage
> iso-storage
> export-domain-storage
> env-modules
>
> And one distributed-only volume, 'storage', whose info file is shown below:
>
> *From existing hosts gluster01/02:*
> type=0
> count=4
> status=1
> sub_count=0
> stripe_count=1
> replica_count=1
> disperse_count=0
> redundancy_count=0
> version=25
> transport-type=0
> volume-id=26d355cb-c486-481f-ac16-e25390e73775
> username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
> password=
> op-version=3
> client-op-version=3
> quota-version=1
> parent_volname=N/A
> restored_from_snap=00000000-0000-0000-0000-000000000000
> snap-max-hard-limit=256
> features.quota-deem-statfs=on
> features.inode-quota=on
> diagnostics.brick-log-level=WARNING
> features.quota=on
> performance.readdir-ahead=on
> performance.cache-size=1GB
> performance.stat-prefetch=on
> brick-0=10.0.231.50:-mnt-raid6-storage-storage
> brick-1=10.0.231.51:-mnt-raid6-storage-storage
> brick-2=10.0.231.52:-mnt-raid6-storage-storage
> brick-3=10.0.231.53:-mnt-raid6-storage-storage
>
> *From existing hosts gluster03/04:*
> type=0
> count=4
> status=1
> sub_count=0
> stripe_count=1
> replica_count=1
> disperse_count=0
> redundancy_count=0
> version=25
> transport-type=0
> volume-id=26d355cb-c486-481f-ac16-e25390e73775
> username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
> password=
> op-version=3
> client-op-version=3
> quota-version=1
> parent_volname=N/A
> restored_from_snap=00000000-0000-0000-0000-000000000000
> snap-max-hard-limit=256
> features.quota-deem-statfs=on
> features.inode-quota=on
> performance.stat-prefetch=on
> performance.cache-size=1GB
> performance.readdir-ahead=on
> features.quota=on
> diagnostics.brick-log-level=WARNING
> brick-0=10.0.231.50:-mnt-raid6-storage-storage
> brick-1=10.0.231.51:-mnt-raid6-storage-storage
> brick-2=10.0.231.52:-mnt-raid6-storage-storage
> brick-3=10.0.231.53:-mnt-raid6-storage-storage
>
> So far the configs on gluster01/02 and gluster03/04 are the same, although
> the ordering of some of the features differs.
>
> On gluster05/06 the ordering is different again, and quota-version=0
> instead of 1.
>
> This is why the peer shows as rejected. Can you check the op-version of
> all the glusterds, including the one which is in the rejected state? You
> can find the op-version in /var/lib/glusterd/glusterd.info.
>
> If all the op-versions are the same and 3.7.6, then as a work-around you
> can manually set quota-version=1, and restarting glusterd will solve the
> problem. But I would strongly recommend you figure out the root cause.
> Maybe you can file a bug for this.
>
> Rafi
>
> Rafi KC
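(A minimal way to run that check, assuming the 3.7.x glusterd.info layout where the key is named operating-version; run it on every node, including the rejected ones:)

    # print this node's cluster operating version
    grep operating-version /var/lib/glusterd/glusterd.info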
> *From new hosts gluster05/gluster06:*
> type=0
> count=4
> status=1
> sub_count=0
> stripe_count=1
> replica_count=1
> disperse_count=0
> redundancy_count=0
> version=25
> transport-type=0
> volume-id=26d355cb-c486-481f-ac16-e25390e73775
> username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
> password=
> op-version=3
> client-op-version=3
> quota-version=0
> parent_volname=N/A
> restored_from_snap=00000000-0000-0000-0000-000000000000
> snap-max-hard-limit=256
> performance.stat-prefetch=on
> performance.cache-size=1GB
> performance.readdir-ahead=on
> features.quota=on
> diagnostics.brick-log-level=WARNING
> features.inode-quota=on
> features.quota-deem-statfs=on
> brick-0=10.0.231.50:-mnt-raid6-storage-storage
> brick-1=10.0.231.51:-mnt-raid6-storage-storage
> brick-2=10.0.231.52:-mnt-raid6-storage-storage
> brick-3=10.0.231.53:-mnt-raid6-storage-storage
>
> Also, I forgot to mention that when I initially peered the two new hosts,
> glusterd crashed on gluster03 and had to be restarted (log attached), but
> it has been fine since.
>
> Thanks,
> Steve
>
> On Thu, Feb 25, 2016 at 11:27 AM, Mohammed Rafi K C <rkavu...@redhat.com>
> wrote:
>>
>> On 02/25/2016 11:45 PM, Steve Dainard wrote:
>>
>> Hello,
>>
>> I upgraded from 3.6.6 to 3.7.6 a couple of weeks ago. I just peered 2 new
>> nodes to a 4-node cluster, and gluster peer status is:
>>
>> # gluster peer status *<-- from node gluster01*
>> Number of Peers: 5
>>
>> Hostname: 10.0.231.51
>> Uuid: b01de59a-4428-486b-af49-cb486ab44a07
>> State: Peer in Cluster (Connected)
>>
>> Hostname: 10.0.231.52
>> Uuid: 75143760-52a3-4583-82bb-a9920b283dac
>> State: Peer in Cluster (Connected)
>>
>> Hostname: 10.0.231.53
>> Uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
>> State: Peer in Cluster (Connected)
>>
>> Hostname: 10.0.231.54 *<-- new node gluster05*
>> Uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
>> *State: Peer Rejected (Connected)*
>>
>> Hostname: 10.0.231.55 *<-- new node gluster06*
>> Uuid: 9c155c8e-2cd1-4cfc-83af-47129b582fd3
>> *State: Peer Rejected (Connected)*
>>
>> It looks like your configuration files are mismatched, i.e. the checksum
>> calculation on these two nodes differs from the others.
>>
>> Did you have any failed commits?
>>
>> Compare /var/lib/glusterd/vols/<volname>/info on a failed node against a
>> good one; most likely you will see some difference.
>>
>> Can you paste /var/lib/glusterd/vols/<volname>/info?
>>
>> Regards,
>> Rafi KC
>>
>> I followed the write-up here:
>> http://www.gluster.org/community/documentation/index.php/Resolving_Peer_Rejected
>> and the two new nodes peered properly, but after a reboot of the two new
>> nodes I'm seeing the same Peer Rejected (Connected) state.
>>
>> I've attached logs from an existing node and the two new nodes.
>>
>> Thanks for any suggestions,
>> Steve
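(For reference, the gist of the Resolving_Peer_Rejected procedure linked above, run on the rejected node; this is a sketch from memory of that write-up, so verify against the page before using it, since the cleanup step is destructive:)

    service glusterd stop
    # clear local glusterd state but keep the node's identity file
    cd /var/lib/glusterd
    find . -mindepth 1 -maxdepth 1 ! -name glusterd.info -exec rm -rf {} +
    service glusterd start
    # probe any healthy peer (gluster01 here) so the volume configs are re-fetched
    gluster peer probe 10.0.231.50
    service glusterd restart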
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users