Re: [Gluster-users] geo-replication fails after upgrade to gluster 3.10
I was able to get this working by deleting the geo-replication session and recreating it. Not sure why it broke in the first place but it is working now.

On 04/28/2017 03:08 PM, Michael Watters wrote:
> I've just upgraded my gluster hosts from gluster 3.8 to 3.10 and it
> appears that geo-replication on my volume is now broken. Here are the
> log entries from the master. I've tried restarting the geo-replication
> process several times which did not help. Is there any way to resolve this?
>
> [2017-04-28 19:03:56.477895] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/var/mnt/gluster/brick2). Slave node: ssh://root@mdct-gluster-srv3:gluster://localhost:slavevol
> [2017-04-28 19:03:56.616667] I [changelogagent(/var/mnt/gluster/brick2):73:__init__] ChangelogAgent: Agent listining...
> [2017-04-28 19:04:07.780885] I [master(/var/mnt/gluster/brick2):1328:register] _GMaster: Working dir: /var/lib/misc/glusterfsd/gv0/ssh%3A%2F%2Froot%4010.112.215.10%3Agluster%3A%2F%2F127.0.0.1%3Aslavevol/920a96ad4a5f9c0c2bdbd24a14eeb1af
> [2017-04-28 19:04:07.781146] I [resource(/var/mnt/gluster/brick2):1604:service_loop] GLUSTER: Register time: 1493406247
> [2017-04-28 19:04:08.143078] I [gsyncdstatus(/var/mnt/gluster/brick2):272:set_active] GeorepStatus: Worker Status: Active
> [2017-04-28 19:04:08.254343] I [gsyncdstatus(/var/mnt/gluster/brick2):245:set_worker_crawl_status] GeorepStatus: Crawl Status: History Crawl
> [2017-04-28 19:04:08.254739] I [master(/var/mnt/gluster/brick2):1244:crawl] _GMaster: starting history crawl... turns: 1, stime: (1493382958, 0), etime: 1493406248, entry_stime: (1493382958, 0)
> [2017-04-28 19:04:09.256428] I [master(/var/mnt/gluster/brick2):1272:crawl] _GMaster: slave's time: (1493382958, 0)
> [2017-04-28 19:04:09.381602] E [syncdutils(/var/mnt/gluster/brick2):297:log_raise_exception] : FAIL:
> Traceback (most recent call last):
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
>     main_i()
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in main_i
>     local.service_loop(*[r for r in [remote] if r])
>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1610, in service_loop
>     g3.crawlwrap(oneshot=True)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 600, in crawlwrap
>     self.crawl()
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1281, in crawl
>     self.changelogs_batch_process(changes)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1184, in changelogs_batch_process
>     self.process(batch)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1039, in process
>     self.process_change(change, done, retry)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 969, in process_change
>     entry_stime_to_update[0])
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 200, in set_field
>     return self._update(merger)
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 161, in _update
>     data = mergerfunc(data)
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 194, in merger
>     if data[key] == value:
> KeyError: 'last_synced_entry'
> [2017-04-28 19:04:09.383002] I [syncdutils(/var/mnt/gluster/brick2):238:finalize] : exiting.
> [2017-04-28 19:04:09.387280] I [repce(/var/mnt/gluster/brick2):92:service_loop] RepceServer: terminating on reaching EOF.
> [2017-04-28 19:04:09.387507] I [syncdutils(/var/mnt/gluster/brick2):238:finalize] : exiting.
> [2017-04-28 19:04:09.764077] I [monitor(monitor):357:monitor] Monitor: worker(/var/mnt/gluster/brick2) died in startup phase
> [2017-04-28 19:04:09.768179] I [gsyncdstatus(monitor):241:set_worker_status] GeorepStatus: Worker Status: Faulty
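The fix described at the top of this message -- deleting the geo-replication session and recreating it -- maps roughly onto the following CLI sequence. This is a hedged sketch, not a transcript from the thread: the master volume name gv0 and the slave mdct-gluster-srv3::slavevol are taken from the log above, and the exact flags (push-pem, force) may need adjusting for a given deployment.

# run on a master node; gv0 and mdct-gluster-srv3::slavevol are assumed from the log above
gluster volume geo-replication gv0 mdct-gluster-srv3::slavevol stop
gluster volume geo-replication gv0 mdct-gluster-srv3::slavevol delete
# recreate the session (push-pem redistributes the ssh keys) and start it again
gluster volume geo-replication gv0 mdct-gluster-srv3::slavevol create push-pem force
gluster volume geo-replication gv0 mdct-gluster-srv3::slavevol start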
[Gluster-users] geo-replication fails after upgrade to gluster 3.10
I've just upgraded my gluster hosts from gluster 3.8 to 3.10 and it appears that geo-replication on my volume is now broken. Here are the log entries from the master. I've tried restarting the geo-replication process several times which did not help. Is there any way to resolve this?

[2017-04-28 19:03:56.477895] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/var/mnt/gluster/brick2). Slave node: ssh://root@mdct-gluster-srv3:gluster://localhost:slavevol
[2017-04-28 19:03:56.616667] I [changelogagent(/var/mnt/gluster/brick2):73:__init__] ChangelogAgent: Agent listining...
[2017-04-28 19:04:07.780885] I [master(/var/mnt/gluster/brick2):1328:register] _GMaster: Working dir: /var/lib/misc/glusterfsd/gv0/ssh%3A%2F%2Froot%4010.112.215.10%3Agluster%3A%2F%2F127.0.0.1%3Aslavevol/920a96ad4a5f9c0c2bdbd24a14eeb1af
[2017-04-28 19:04:07.781146] I [resource(/var/mnt/gluster/brick2):1604:service_loop] GLUSTER: Register time: 1493406247
[2017-04-28 19:04:08.143078] I [gsyncdstatus(/var/mnt/gluster/brick2):272:set_active] GeorepStatus: Worker Status: Active
[2017-04-28 19:04:08.254343] I [gsyncdstatus(/var/mnt/gluster/brick2):245:set_worker_crawl_status] GeorepStatus: Crawl Status: History Crawl
[2017-04-28 19:04:08.254739] I [master(/var/mnt/gluster/brick2):1244:crawl] _GMaster: starting history crawl... turns: 1, stime: (1493382958, 0), etime: 1493406248, entry_stime: (1493382958, 0)
[2017-04-28 19:04:09.256428] I [master(/var/mnt/gluster/brick2):1272:crawl] _GMaster: slave's time: (1493382958, 0)
[2017-04-28 19:04:09.381602] E [syncdutils(/var/mnt/gluster/brick2):297:log_raise_exception] : FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1610, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 600, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1281, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1184, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1039, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 969, in process_change
    entry_stime_to_update[0])
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 200, in set_field
    return self._update(merger)
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 161, in _update
    data = mergerfunc(data)
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 194, in merger
    if data[key] == value:
KeyError: 'last_synced_entry'
[2017-04-28 19:04:09.383002] I [syncdutils(/var/mnt/gluster/brick2):238:finalize] : exiting.
[2017-04-28 19:04:09.387280] I [repce(/var/mnt/gluster/brick2):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-04-28 19:04:09.387507] I [syncdutils(/var/mnt/gluster/brick2):238:finalize] : exiting.
[2017-04-28 19:04:09.764077] I [monitor(monitor):357:monitor] Monitor: worker(/var/mnt/gluster/brick2) died in startup phase
[2017-04-28 19:04:09.768179] I [gsyncdstatus(monitor):241:set_worker_status] GeorepStatus: Worker Status: Faulty
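For anyone hitting the same KeyError, the state of the session can be inspected from the master with the geo-replication status and config commands. A hedged example, using the master volume gv0 (from the working-dir path in the log above) and the slave mdct-gluster-srv3::slavevol:

# session overview from the master side
gluster volume geo-replication gv0 mdct-gluster-srv3::slavevol status detail
# current session configuration (working dir, log locations, etc.)
gluster volume geo-replication gv0 mdct-gluster-srv3::slavevol config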
Re: [Gluster-users] What is the CLI NUFA option "local-volume-name" good for?
On Fri, Apr 28, 2017, at 10:57 AM, Jan Wrona wrote:
> I've been struggling with NUFA for a while now and I know very well what
> the "option local-volume-name brick" in the volfile does. In fact, I've
> been using a filter to force gluster to use the local subvolume I want
> instead of the first local subvolume it finds, but filters are very
> unreliable. Recently I've found this bug [1] and thought that I'll
> finally be able to set the NUFA's "local-volume-name" option
> *per-server* through the CLI without the use of the filter, but no. This
> options sets the value globally, so I'm asking what is the use of LOCAL
> volume name set GLOBALLY with the same value on every server?

You're right that it would be a bit silly to set this using "gluster volume set", but it makes much more sense as a command-line override using "--xlator-option" instead. Then it could in fact be different on every client, even though it's not even set in the volfile.
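A hedged sketch of what such a per-client override could look like when starting the fuse client directly. The translator name (gv0-dht) and subvolume name (gv0-client-1) are assumptions; the real names have to be read from the generated client volfile on each machine.

# mount a NUFA volume with a per-client local-volume-name override;
# gv0, gv0-dht and gv0-client-1 are placeholders -- check the client volfile for the real names
glusterfs --volfile-server=server1 --volfile-id=gv0 \
  --xlator-option=gv0-dht.local-volume-name=gv0-client-1 \
  /mnt/gv0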
[Gluster-users] What is the CLI NUFA option "local-volume-name" good for?
Hi,

I've been struggling with NUFA for a while now and I know very well what the "option local-volume-name brick" in the volfile does. In fact, I've been using a filter to force gluster to use the local subvolume I want instead of the first local subvolume it finds, but filters are very unreliable. Recently I found this bug [1] and thought that I'd finally be able to set NUFA's "local-volume-name" option *per-server* through the CLI without the use of the filter, but no. This option sets the value globally, so I'm asking: what is the use of a LOCAL volume name set GLOBALLY, with the same value on every server?

With thanks,
Jan Wrona

[1] https://bugzilla.redhat.com/show_bug.cgi?id=987240
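For readers who haven't seen the "filter" Jan refers to: glusterd can run executables dropped into its filter directory against every volfile it generates, which is the mechanism used to force the option today. A hedged sketch only; the directory path is distribution- and version-dependent, and the sed pattern assumes the option appears as an "option local-volume-name ..." line in the volfile:

#!/bin/sh
# e.g. /usr/lib64/glusterfs/<version>/filter/set-local-subvol (path varies by distribution)
# glusterd invokes each filter with the freshly generated volfile as $1.
# Rewrite the NUFA local-volume-name to the subvolume wanted on this host;
# gv0-client-1 is a placeholder for a name taken from the real volfile.
VOLFILE="$1"
sed -i 's/option local-volume-name .*/option local-volume-name gv0-client-1/' "$VOLFILE"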
[Gluster-users] how to change a default option for all new volumes ?
Hi,

We see an issue when using a volume that has the default option performance.write-behind enabled, so we have been running "gluster volume set performance.write-behind off" manually. How would I go about forcing all new volumes to have that set? We are trying to use heketi for dynamic volume provisioning, but without being able to set this flag automatically we still have a manual step.

Thanks
Steven
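For context, the manual step described above takes a volume name, so per volume it looks roughly like the following (the volume name newvol is a placeholder); how to have heketi apply it automatically at creation time is the open question of this thread.

# disable write-behind on an existing volume; "newvol" is a placeholder name
gluster volume set newvol performance.write-behind off
# confirm the value that is now in effect
gluster volume get newvol performance.write-behind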
[Gluster-users] Small files performance on SSD
Hello,

I have problems with tuning Gluster for optimal small files performance. My usage scenario is, as I've learned, the worst possible scenario, but it's not up to me to change it:
- small 1KB files
- at least 20M of those
- approx. 10 files/directory
- mostly writes
- average speed 1000 files/sec with peaks up to 10K files/sec.

I'm doing something wrong, because I cannot get past these figures:
4K files/sec for a distributed volume (2 bricks)
2K files/sec for a replicated volume (2 bricks).

I've been experimenting with various XFS formatting and mounting options and with Gluster tuning, but no luck. I've learned that it's not disk IO that is the bottleneck (direct tests on the mounted XFS partition show waaay better results, like 100K files/sec).

As I've learned from http://blog.gluster.org/2014/03/experiments-using-ssds-with-gluster/ it's possible to get 24K files/sec performance (and that was three years ago).

Test I'm using, run on one server (2 x Xeons, 256 GB RAM, 10GbE network):
smallfile_cli.py --operation create --threads 32 --file-size 1 --files 15625 --top /mnt/testdir/test

Setup:
2 servers with 2 Xeons each, 256GB RAM, 8 x 800GB SSD drives in RAID6, 10GbE network
Ubuntu 14.04
Gluster 3.7.3

Do you have any hints where I should start investigating for the bottleneck?

Best regards,
Szymon
[Gluster-users] Replica Out Of Sync - How to efficiently recover
Good morning guys,

We're using GlusterFS 3.6.7 on RHEL 6.7 on AWS, using multiple 1TB EBS GP2 disks as bricks. We have two nodes with several volumes of type Replicate with two bricks each: one brick belongs to server #1 and, of course, the other one to server #2. Transport is over TCP and the only option reconfigured is performance.cache-size, which is tuned to 4GB. Clients connect to those targets over FUSE with the backupvolfile-server parameter pointed at server #2 and the primary at server #1. It is worth mentioning that those bricks host hundreds of thousands of subdirectories which contain a lot of small xml files and images.

A couple of weeks ago one of the nodes went down because of some AWS problem; the reboot was so quick that we didn't even record it with our monitoring agents, and because the daemon was not enabled on autostart, brick two went out of sync by about 1+ TB. When we realized this we immediately tried to bring everything up and trigger the self heal, but it was literally killing our clients, ending up with high iowait, and it took forever to retrieve content from the FUSE share. The only option was to kill the sync process.

We tried using rsync and then triggering the self heal, with no consistent result. We tried removing the bad brick, cleaning up the directory on the second node and re-creating it, which caused the same massive iowait and the exact same situation. We tried cloning the EBS volume of the primary node, attaching it to the secondary and then trying self heal again, with no consistent result.

We noticed that once brick two comes back online it seems to be used as the primary even though it is configured in fstab as backupvolfile-server. I'm saying this because some directories appear missing while it is still possible to cd into them, which reflects the brick status on the secondary server.

Is there anything that you can suggest to solve this? Are we missing something?

Thanks a lot for any help.
Lorenzo
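One commonly used mitigation when heal traffic overwhelms clients (not confirmed as the fix in this thread) is to disable client-side self-heal so that only the self-heal daemon repairs files. A hedged sketch, with myvol as a placeholder volume name; the option names exist in the 3.6 series:

# leave healing to the self-heal daemon instead of the FUSE clients; "myvol" is a placeholder
gluster volume set myvol cluster.data-self-heal off
gluster volume set myvol cluster.metadata-self-heal off
gluster volume set myvol cluster.entry-self-heal off
gluster volume set myvol cluster.self-heal-daemon on
# watch heal progress
gluster volume heal myvol info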
Re: [Gluster-users] lost one replica after upgrading glusterfs from 3.7 to 3.10, please help
I'd just like to make an update according to my latest findings on this. Googling further, I ended up reading this article: https://community.rackspace.com/developers/f/7/t/4858

Reflecting it against the docs (https://gluster.readthedocs.io/en/latest/Administrator Guide/Resolving Peer Rejected/) and my situation, I was able to establish a reproducible chain of events, like this:

# stop the glusterfs
sst2# service glusterfs-server stop
sst2# killall glusterfs glusterfsd

# make sure there are no more glusterfs processes
sst2# ps auwwx | grep gluster

# preserve glusterd.info and clean everything else
sst2# cd /var/lib/glusterd && mv glusterd.info .. && rm -rf * && mv ../glusterd.info .

# start glusterfs
sst2# service glusterfs-server start

# probe peers
sst2# gluster peer status
Number of Peers: 0

sst2# gluster peer probe sst0
peer probe: success.
sst2# gluster peer probe sst1
peer probe: success.

# restart glusterd twice to bring peers back into the cluster
sst2# gluster peer status
Number of Peers: 2

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Accepted peer request (Connected)

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Accepted peer request (Connected)

sst2# service glusterfs-server restart
sst2# gluster peer status
Number of Peers: 2

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Sent and Received peer request (Connected)

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Sent and Received peer request (Connected)

sst2# service glusterfs-server restart
sst2# gluster peer status
Number of Peers: 2

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Peer in Cluster (Connected)

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Peer in Cluster (Connected)

# resync volume information
sst2# gluster volume sync sst0 all
Sync volume may make data inaccessible while the sync is in progress. Do you want to continue? (y/n) y
volume sync: success

sst2# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: sst0:/var/glusterfs
Brick2: sst2:/var/glusterfs
Options Reconfigured:
cluster.self-heal-daemon: enable
performance.readdir-ahead: on
storage.owner-uid: 1000
storage.owner-gid: 1000

sst2# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sst0:/var/glusterfs                   49153     0          Y       29830
Brick sst2:/var/glusterfs                   49152     0          Y       5137
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       6034
NFS Server on sst0                          N/A       N/A        N       N/A
Self-heal Daemon on sst0                    N/A       N/A        Y       29821
NFS Server on sst1                          N/A       N/A        N       N/A
Self-heal Daemon on sst1                    N/A       N/A        Y       19997

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

sst2# gluster volume heal gv0 full
Launching heal operation to perform full self heal on volume gv0 has been successful
Use heal info commands to check status

sst2# gluster volume heal gv0 info
Brick sst0:/var/glusterfs
Status: Connected
Number of entries: 0

Brick sst2:/var/glusterfs
Status: Connected
Number of entries: 0

The most disturbing thing about this is that I'm perfectly sure that the bricks are NOT in sync, according to du -s output:

sst0# du -s /var/glusterfs/
3107570500      /var/glusterfs/

sst2# du -s /var/glusterfs/
3107567396      /var/glusterfs/

If anybody could be so kind as to point out how to get the replicas back to a synchronous state, I would be extremely grateful.
Best,
Seva

28.04.2017, 13:01, "Seva Gluschenko" :
> Of course. Please find attached. Hope they can shed some light on this.
>
> Thanks,
>
> Seva
>
> 28.04.2017, 12:41, "Mohammed Rafi K C" :
>> Can you share the glusterd logs from the three nodes ?
>>
>> Rafi KC
>>
>> On 04/28/2017 02:34 PM, Seva Gluschenko wrote:
>>> Dear Community,
>>>
>>> I call for your wisdom, as it appears that googling for keywords doesn't help much.
>>>
>>> I have a glusterfs volume with replica count 2, and I tried to perform the online upgrade procedure described in the docs (http://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/). It all went almost fine when I'd done with the first replica, the only problem was the self-heal procedure that refused to complete until I commented out all IPv6 entries in the /etc/hosts.
>>>
>>> So far, being sure that it all should work on the 2nd replica
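A note on the du comparison above: a raw du of a brick also counts GlusterFS's internal .glusterfs metadata, so a small difference does not by itself prove that user data is out of sync. A hedged way to compare what matters, run on each brick path from the volume info above:

# count user files on a brick while ignoring gluster's internal metadata;
# run the same command on sst0 and sst2 and compare the totals
find /var/glusterfs -path '*/.glusterfs' -prune -o -type f -print | wc -l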
Re: [Gluster-users] Small files performance on SSD
On Fri, Apr 28, 2017 at 3:44 PM, Szymon Miotk wrote:

> Dear Gluster community,
>
> I have problems with tuning Gluster for small files performance on SSD.
>
> My usage scenario is, as I've learned, worst possible scenario, but
> it's not up to me to change it:
> - small 1KB files
> - at least 20M of those
> - approx. 10 files/directory
> - mostly writes
> - average speed 1000 files/sec with peaks up to 10K files/sec.
>
> I'm doing something wrong, because I cannot get past performance metrics
> 4K files/sec for distributed volume (2 bricks)
> 2K files/sec for replicated volume (2 bricks).
> I've been experimenting with various XFS formatting and mounting
> options and with Gluster tuning (md-cache, lookup optimize, thread,
> writeback, tiering), but no luck.
>
> I've learned that it's not disk IO that is the bottleneck (direct
> tests on mounted XFS partition show waaay better results, like 100K
> files/sec).
>
> As I've learned from
> http://blog.gluster.org/2014/03/experiments-using-ssds-with-gluster/
> it's possible to get 24K files/sec performance (and that was three years ago).

How many clients are you running? Considering Gluster is a distributed solution, the performance should be measured as aggregate of all the clients.

> Test I'm using, run on one server (2 x Xeons, 256 GB RAM, 10GbE network):
> smallfile_cli.py --operation create --threads 32 --file-size 1 --files 15625 --top /mnt/testdir/test
>
> Setup:
> 2 servers with 2 Xeons each, 256GB RAM, 8 x 800GB SSD drives in RAID6, 10GbE network
> Ubuntu 14.04
> Gluster 3.7.3
>
> Do you have any hints where I should start investigating for the bottleneck?

Currently just the fuse mount may be the bottleneck, but I recommend running multiple clients (from different machines) doing these operations in parallel to get the best results out of Gluster.

-Amar

> Best regards,
> Szymon

--
Amar Tumballi (amarts)
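A hedged illustration of Amar's suggestion: drive the same smallfile workload from several client machines at once and sum the per-client results. The hostnames client1/client2 and the mount path are placeholders; the smallfile flags are the ones Szymon already uses.

# launch the same create workload from two clients in parallel;
# client1/client2 and /mnt/testdir are placeholders for real client mounts
for host in client1 client2; do
  ssh "$host" "python smallfile_cli.py --operation create --threads 32 \
      --file-size 1 --files 15625 --top /mnt/testdir/test-$host" &
done
wait
# aggregate throughput = sum of the files/sec reported by each client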
[Gluster-users] Small files performance on SSD
Dear Gluster community,

I have problems with tuning Gluster for small files performance on SSD.

My usage scenario is, as I've learned, the worst possible scenario, but it's not up to me to change it:
- small 1KB files
- at least 20M of those
- approx. 10 files/directory
- mostly writes
- average speed 1000 files/sec with peaks up to 10K files/sec.

I'm doing something wrong, because I cannot get past these figures:
4K files/sec for a distributed volume (2 bricks)
2K files/sec for a replicated volume (2 bricks).

I've been experimenting with various XFS formatting and mounting options and with Gluster tuning (md-cache, lookup optimize, thread, writeback, tiering), but no luck.

I've learned that it's not disk IO that is the bottleneck (direct tests on the mounted XFS partition show waaay better results, like 100K files/sec).

As I've learned from http://blog.gluster.org/2014/03/experiments-using-ssds-with-gluster/ it's possible to get 24K files/sec performance (and that was three years ago).

Test I'm using, run on one server (2 x Xeons, 256 GB RAM, 10GbE network):
smallfile_cli.py --operation create --threads 32 --file-size 1 --files 15625 --top /mnt/testdir/test

Setup:
2 servers with 2 Xeons each, 256GB RAM, 8 x 800GB SSD drives in RAID6, 10GbE network
Ubuntu 14.04
Gluster 3.7.3

Do you have any hints where I should start investigating for the bottleneck?

Best regards,
Szymon
Re: [Gluster-users] lost one replica after upgrading glusterfs from 3.7 to 3.10, please help
Of course. Please find attached. Hope they can shed some light on this.

Thanks,

Seva

28.04.2017, 12:41, "Mohammed Rafi K C" :
> Can you share the glusterd logs from the three nodes ?
>
> Rafi KC
>
> On 04/28/2017 02:34 PM, Seva Gluschenko wrote:
>> Dear Community,
>>
>> I call for your wisdom, as it appears that googling for keywords doesn't help much.
>>
>> I have a glusterfs volume with replica count 2, and I tried to perform the online upgrade procedure described in the docs (http://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/). It all went almost fine when I'd done with the first replica, the only problem was the self-heal procedure that refused to complete until I commented out all IPv6 entries in the /etc/hosts.
>>
>> So far, being sure that it all should work on the 2nd replica pretty the same as it was on the 1st one, I had proceeded with the upgrade on the replica 2. All of a sudden, it told me that it doesn't see the first replica at all. The state before upgrade was:
>>
>> sst2# gluster volume status
>> Status of volume: gv0
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick sst0:/var/glusterfs                   49152     0          Y       3482
>> Brick sst2:/var/glusterfs                   49152     0          Y       29863
>> NFS Server on localhost                     2049      0          Y       25175
>> Self-heal Daemon on localhost               N/A       N/A        Y       25283
>> NFS Server on sst0                          N/A       N/A        N       N/A
>> Self-heal Daemon on sst0                    N/A       N/A        Y       4827
>> NFS Server on sst1                          N/A       N/A        N       N/A
>> Self-heal Daemon on sst1                    N/A       N/A        Y       15009
>>
>> Task Status of Volume gv0
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>>
>> sst2# gluster peer status
>> Number of Peers: 2
>>
>> Hostname: sst0
>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>> State: Peer in Cluster (Connected)
>>
>> Hostname: sst1
>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>> State: Sent and Received peer request (Connected)
>>
>> sst2# gluster volume heal gv0 info
>> Brick sst0:/var/glusterfs
>> Number of entries: 0
>>
>> Brick sst2:/var/glusterfs
>> Number of entries: 0
>>
>> After upgrade, it looked like this:
>>
>> sst2# gluster volume status
>> Status of volume: gv0
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A
>> NFS Server on localhost                     N/A       N/A        N       N/A
>> NFS Server on localhost                     N/A       N/A        N       N/A
>>
>> Task Status of Volume gv0
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>>
>> sst2# gluster peer status
>> Number of Peers: 2
>>
>> Hostname: sst1
>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>> State: Sent and Received peer request (Connected)
>>
>> Hostname: sst0
>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>> State: Peer Rejected (Connected)
>>
>> My biggest fault probably, at that point I googled and found this article https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/ -- and followed its advice, removing at sst2 all the /var/lib/glusterd contents except the glusterd.info file. As the result, the node, predictably, lost all information about the volume.
>>
>> sst2# gluster volume status
>> No volumes present
>>
>> sst2# gluster peer status
>> Number of Peers: 2
>>
>> Hostname: sst0
>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>> State: Accepted peer request (Connected)
>>
>> Hostname: sst1
>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>> State: Accepted peer request (Connected)
>>
>> Okay, I thought, this is might be a high time to re-add the brick. Not that easy, Jack:
>>
>> sst0# gluster volume add-brick gv0 replica 2 'sst2:/var/glusterfs'
>> volume add-brick: failed: Operation failed
>>
>> The reason appeared to be natural: sst0 still knows that there was the replica on sst2. What should I do then? At this point, I tried to recover the volume information on sst2 by putting it offline and copying all the volume info from the sst0. Of course it wasn't enough to just copy as is, I modified /var/lib/glusterd/vols/gv0/sst*\:-var-glusterfs, setting listen-port=0 for the remote brick (sst0) and listen-port=49152 for the local brick (sst2). It didn't help much, unfortunately. The final state I've reached is as follows:
>>
>> sst2# gluster peer status
>> Number of Peers: 2
>>
>> Hostname: sst1
>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>> State: Sent and Received peer request (Connected)
>>
>> Hostname: sst0
>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>> State: Sent and Received peer request (Connected)
>>
>> sst2# gluster volume info
>>
>> Volume Name: gv0
>> Type: Replicate
>> Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
>> Status: Started
>> Snapshot Count: 0
>> Numb
Re: [Gluster-users] lost one replica after upgrading glusterfs from 3.7 to 3.10, please help
Can you share the glusterd logs from the three nodes ?

Rafi KC

On 04/28/2017 02:34 PM, Seva Gluschenko wrote:
> Dear Community,
>
> I call for your wisdom, as it appears that googling for keywords doesn't help much.
>
> I have a glusterfs volume with replica count 2, and I tried to perform the online upgrade procedure described in the docs (http://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/). It all went almost fine when I'd done with the first replica, the only problem was the self-heal procedure that refused to complete until I commented out all IPv6 entries in the /etc/hosts.
>
> So far, being sure that it all should work on the 2nd replica pretty the same as it was on the 1st one, I had proceeded with the upgrade on the replica 2. All of a sudden, it told me that it doesn't see the first replica at all. The state before upgrade was:
>
> sst2# gluster volume status
> Status of volume: gv0
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick sst0:/var/glusterfs                   49152     0          Y       3482
> Brick sst2:/var/glusterfs                   49152     0          Y       29863
> NFS Server on localhost                     2049      0          Y       25175
> Self-heal Daemon on localhost               N/A       N/A        Y       25283
> NFS Server on sst0                          N/A       N/A        N       N/A
> Self-heal Daemon on sst0                    N/A       N/A        Y       4827
> NFS Server on sst1                          N/A       N/A        N       N/A
> Self-heal Daemon on sst1                    N/A       N/A        Y       15009
>
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> sst2# gluster peer status
> Number of Peers: 2
>
> Hostname: sst0
> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
> State: Peer in Cluster (Connected)
>
> Hostname: sst1
> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
> State: Sent and Received peer request (Connected)
>
> sst2# gluster volume heal gv0 info
> Brick sst0:/var/glusterfs
> Number of entries: 0
>
> Brick sst2:/var/glusterfs
> Number of entries: 0
>
> After upgrade, it looked like this:
>
> sst2# gluster volume status
> Status of volume: gv0
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A
> NFS Server on localhost                     N/A       N/A        N       N/A
> NFS Server on localhost                     N/A       N/A        N       N/A
>
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> sst2# gluster peer status
> Number of Peers: 2
>
> Hostname: sst1
> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
> State: Sent and Received peer request (Connected)
>
> Hostname: sst0
> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
> State: Peer Rejected (Connected)
>
> My biggest fault probably, at that point I googled and found this article https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/ -- and followed its advice, removing at sst2 all the /var/lib/glusterd contents except the glusterd.info file. As the result, the node, predictably, lost all information about the volume.
>
> sst2# gluster volume status
> No volumes present
>
> sst2# gluster peer status
> Number of Peers: 2
>
> Hostname: sst0
> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
> State: Accepted peer request (Connected)
>
> Hostname: sst1
> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
> State: Accepted peer request (Connected)
>
> Okay, I thought, this is might be a high time to re-add the brick. Not that easy, Jack:
>
> sst0# gluster volume add-brick gv0 replica 2 'sst2:/var/glusterfs'
> volume add-brick: failed: Operation failed
>
> The reason appeared to be natural: sst0 still knows that there was the replica on sst2. What should I do then? At this point, I tried to recover the volume information on sst2 by putting it offline and copying all the volume info from the sst0. Of course it wasn't enough to just copy as is, I modified /var/lib/glusterd/vols/gv0/sst*\:-var-glusterfs, setting listen-port=0 for the remote brick (sst0) and listen-port=49152 for the local brick (sst2). It didn't help much, unfortunately. The final state I've reached is as follows:
>
> sst2# gluster peer status
> Number of Peers: 2
>
> Hostname: sst1
> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
> State: Sent and Received peer request (Connected)
>
> Hostname: sst0
> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
> State: Sent and Received peer request (Connected)
>
> sst2# gluster volume info
>
> Vo
[Gluster-users] lost one replica after upgrading glusterfs from 3.7 to 3.10, please help
Dear Community,

I call for your wisdom, as it appears that googling for keywords doesn't help much.

I have a glusterfs volume with replica count 2, and I tried to perform the online upgrade procedure described in the docs (http://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/). It all went almost fine when I'd done with the first replica; the only problem was the self-heal procedure that refused to complete until I commented out all IPv6 entries in /etc/hosts.

So far, being sure that it all should work on the 2nd replica pretty much the same as it did on the 1st one, I proceeded with the upgrade on replica 2. All of a sudden, it told me that it doesn't see the first replica at all. The state before the upgrade was:

sst2# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sst0:/var/glusterfs                   49152     0          Y       3482
Brick sst2:/var/glusterfs                   49152     0          Y       29863
NFS Server on localhost                     2049      0          Y       25175
Self-heal Daemon on localhost               N/A       N/A        Y       25283
NFS Server on sst0                          N/A       N/A        N       N/A
Self-heal Daemon on sst0                    N/A       N/A        Y       4827
NFS Server on sst1                          N/A       N/A        N       N/A
Self-heal Daemon on sst1                    N/A       N/A        Y       15009

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

sst2# gluster peer status
Number of Peers: 2

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Peer in Cluster (Connected)

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Sent and Received peer request (Connected)

sst2# gluster volume heal gv0 info
Brick sst0:/var/glusterfs
Number of entries: 0

Brick sst2:/var/glusterfs
Number of entries: 0

After the upgrade, it looked like this:

sst2# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A
NFS Server on localhost                     N/A       N/A        N       N/A
NFS Server on localhost                     N/A       N/A        N       N/A

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

sst2# gluster peer status
Number of Peers: 2

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Sent and Received peer request (Connected)

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Peer Rejected (Connected)

My biggest fault, probably: at that point I googled and found this article https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/ -- and followed its advice, removing on sst2 all the /var/lib/glusterd contents except the glusterd.info file. As a result, the node predictably lost all information about the volume.

sst2# gluster volume status
No volumes present

sst2# gluster peer status
Number of Peers: 2

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Accepted peer request (Connected)

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Accepted peer request (Connected)

Okay, I thought, this might be high time to re-add the brick. Not that easy, Jack:

sst0# gluster volume add-brick gv0 replica 2 'sst2:/var/glusterfs'
volume add-brick: failed: Operation failed

The reason appeared to be natural: sst0 still knows that there was a replica on sst2. What should I do then? At this point, I tried to recover the volume information on sst2 by putting it offline and copying all the volume info from sst0. Of course it wasn't enough to just copy it as is; I modified /var/lib/glusterd/vols/gv0/sst*\:-var-glusterfs, setting listen-port=0 for the remote brick (sst0) and listen-port=49152 for the local brick (sst2). It didn't help much, unfortunately. The final state I've reached is as follows:

sst2# gluster peer status
Number of Peers: 2

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Sent and Received peer request (Connected)

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Sent and Received peer request (Connected)

sst2# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: sst0:/var/glusterfs
Brick2: sst2:/var/glusterfs
Options Reconfigured:
cluster.self-heal-daemon: enable
performance.readdir-ahead: on
storage.owner-uid: 1000
storage