[Gluster-users] How safe are major version upgrades?
Hello,

We have a production system with around 50GB of data running GlusterFS 5.13. It has 3 replicating/mirrored nodes, and also geo-replicates to another site. How safe would it be to upgrade to a more recent major version, e.g. 7.x? I'm not sure how recommended in-place upgrades are, or whether a complete re-install is necessary for safety. We have a maximum window of around 4 hours for this upgrade and would not want any significant risk of an unsuccessful upgrade at the end of that time. Is version 8.0 considered stable?

Thanks in advance,

--
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
[Gluster-users] Geo-replication log file not closed
Hello,

We're having an issue with the rotated gsyncd.log not being released. Here's the output of 'lsof':

# lsof | grep 'gsyncd.log.1'
python2  4495        root  3w  REG  8,1  9916750234332241  /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log.1 (deleted)
python2  4495  4496  root  3w  REG  8,1  9916750234332241  /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log.1 (deleted)
python2  4495  4507  root  3w  REG  8,1  9916750234332241  /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log.1 (deleted)
python2  4508        root  3w  REG  8,1  9916750234332241  /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log.1 (deleted)
python2  4508        root  5w  REG  8,1  9916750234332241  /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log.1 (deleted)
python2  4508  4511  root  3w  REG  8,1  9916750234332241  /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log.1 (deleted)
... etc ...

Those processes are:

# ps -ef | egrep '4495|4508'
root  4495     1  0 Aug10 ?  00:00:59 /usr/bin/python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py --path=/nodirectwritedata/gluster/gvol0 --monitor -c /var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf --iprefix=/var :gvol0 --glusterd-uuid=b7521445-ee93-4fed-8ced-6a609fa8c7d4 nvfs10::gvol0
root  4508  4495  0 Aug10 ?  00:01:56 python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py agent gvol0 nvfs10::gvol0 --local-path /nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 9,12,11,10

And here's the relevant part of the /etc/logrotate.d/glusterfs-georep script:

/var/log/glusterfs/geo-replication/*/*.log {
    sharedscripts
    rotate 52
    missingok
    compress
    delaycompress
    notifempty
    postrotate
    for pid in `ps -aef | grep glusterfs | egrep "\-\-aux-gfid-mount" | awk '{print $2}'`; do
        /usr/bin/kill -HUP $pid > /dev/null 2>&1 || true
    done
    endscript
}

If I run the postrotate part manually:

# ps -aef | grep glusterfs | egrep "\-\-aux-gfid-mount" | awk '{print $2}'
4520
# ps -aef | grep 4520
root  4520  1  0 Aug10 ?  01:24:23 /usr/sbin/glusterfs --aux-gfid-mount --acl --log-level=INFO --log-file=/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/mnt-nodirectwritedata-gluster-gvol0.log --volfile-server=localhost --volfile-id=gvol0 --client-pid=-1 /tmp/gsyncd-aux-mount-Tq_3sU

Perhaps the problem is that the kill -HUP in the logrotate script doesn't act on the right process? If so, does anyone have a command to get the right PID?

Thanks in advance for any help.

--
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782
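[Editorial note on the above: the postrotate loop only matches the --aux-gfid-mount glusterfs client (PID 4520 here), while the deleted gsyncd.log.1 is held open by the gsyncd.py python processes (4495, 4508, and their threads), which never get told to reopen the file. A minimal sketch of extracting the gsyncd PIDs instead is below; the sample ps line is copied from the output above, and whether gsyncd actually reopens its log on SIGHUP should be verified on a test system first.]

```shell
# Sample 'ps -ef' line for the gsyncd monitor process, taken from the
# output above, used to check the extraction pattern:
sample='root 4495 1 0 Aug10 ? 00:00:59 /usr/bin/python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py --monitor'

# Same PID extraction the logrotate script uses, but matching gsyncd.py
# rather than --aux-gfid-mount:
pid=$(printf '%s\n' "$sample" | grep 'gsyncd.py' | grep -v grep | awk '{print $2}')
echo "$pid"   # 4495

# In the postrotate block itself this would become (sketch, untested,
# and only useful if gsyncd reopens its log on SIGHUP):
# for pid in $(ps -aef | grep 'gsyncd.py' | grep -v grep | awk '{print $2}'); do
#     /usr/bin/kill -HUP "$pid" > /dev/null 2>&1 || true
# done
```

[Alternatively, logrotate's `copytruncate` directive sidesteps signal handling entirely by copying and truncating the open log file in place, at the cost of possibly losing a few log lines written during the truncation window.]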
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
Dear Nikhil,

Thank you for your answer. So does this mean that all my FUSE clients where I have the volume mounted will not lose their connection at any time during the whole upgrade procedure of all 3 nodes? I am asking because, if I understand correctly, there will be an overlap of time where more than one node is not running the glusterfsd (brick) process, which means that quorum is lost and my FUSE clients will lose connection to the volume? I just want to be sure that there will not be any downtime.

Best regards,
Mabi

‐‐‐ Original Message ‐‐‐
On Monday, August 24, 2020 11:14 AM, Nikhil Ladha wrote:

> Hello Mabi
>
> You don't need to follow the offline upgrade procedure. Please do follow the
> online upgrade procedure only. Upgrade the nodes one by one; you will notice
> the `Peer Rejected` state after upgrading one node or so, but once all the
> nodes are upgraded it will be back to `Peer in Cluster (Connected)`. Also, if
> any of the shds are not online you can try restarting that node to fix that.
> I have tried this on my own setup, so I am pretty sure it should work for you
> as well. This is the workaround for the time being so that you are able to
> upgrade; we are working on the issue to come up with a fix for it ASAP.
>
> And yes, if you face any issues even after upgrading all the nodes to 7.7,
> you will be able to downgrade back to 6.9, which I think you have already
> tried and which works as per your previous mail.
>
> Regards
> Nikhil Ladha
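[Editorial note: the rolling-upgrade advice above boils down to checking cluster health between nodes so that only one node's bricks are ever down at a time. A minimal sketch of such a check is below; the `gluster peer status` excerpt is hypothetical, and on a real node you would feed the command's live output into the same pipeline and retry until no peer is rejected before upgrading the next node.]

```shell
# Hypothetical excerpt of 'gluster peer status' output, mid-upgrade:
status='Hostname: node2
State: Peer Rejected (Connected)
Hostname: node3
State: Peer in Cluster (Connected)'

# Count peers still in the transient "Peer Rejected" state; upgrade the
# next node only once this reaches 0 (and pending heals have finished):
rejected=$(printf '%s\n' "$status" | grep -c 'Peer Rejected')
echo "$rejected"   # 1
```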
Re: [Gluster-users] How to fix I/O error ? (resend)
On 21/08/20 13:56, Diego Zuccato wrote:

Hello again.

I also tried disabling bitrot (and re-enabling it afterwards) and the procedure for recovery from split-brain[*], removing the file and its link from one of the nodes, but no luck. I'm now completely out of ideas :( How can I resync those gfids?

Tks!
Diego

[*] even though "gluster volume heal BigVol info split-brain" reports 0 for every brick.

> Hello all.
>
> I have a volume set up as:
> -8<--
> root@str957-biostor:~# gluster v info BigVol
>
> Volume Name: BigVol
> Type: Distributed-Replicate
> Volume ID: c51926bd-6715-46b2-8bb3-8c915ec47e28
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 28 x (2 + 1) = 84
> Transport-type: tcp
> Bricks:
> Brick1: str957-biostor2:/srv/bricks/00/BigVol
> Brick2: str957-biostor:/srv/bricks/00/BigVol
> Brick3: str957-biostq:/srv/arbiters/00/BigVol (arbiter)
> [...]
> Options Reconfigured:
> cluster.granular-entry-heal: enable
> client.event-threads: 8
> server.event-threads: 8
> server.ssl: on
> client.ssl: on
> nfs.disable: on
> performance.readdir-ahead: on
> transport.address-family: inet
> features.bitrot: on
> features.scrub: Active
> features.scrub-freq: biweekly
> auth.ssl-allow: str957-bio*
> ssl.certificate-depth: 1
> cluster.self-heal-daemon: enable
> features.quota: on
> features.inode-quota: on
> features.quota-deem-statfs: on
> server.manage-gids: on
> features.scrub-throttle: aggressive
> -8<--
>
> After a couple of failures (a disk on biostor2 went "missing", and glusterd
> on biostq got killed by OOM) I noticed that some files can't be accessed
> from the clients:
> -8<--
> $ ls -lh 1_germline_CGTACTAG_L005_R*
> -rwxr-xr-x 1 e.f domain^users 2,0G apr 24 2015 1_germline_CGTACTAG_L005_R1_001.fastq.gz
> -rwxr-xr-x 1 e.f domain^users 2,0G apr 24 2015 1_germline_CGTACTAG_L005_R2_001.fastq.gz
> $ ls -lh 1_germline_CGTACTAG_L005_R1_001.fastq.gz
> ls: cannot access '1_germline_CGTACTAG_L005_R1_001.fastq.gz': Input/output error
> -8<--
> (note that if I request ls for more files, it works...).
>
> The files have exactly the same contents (verified via md5sum). The only
> difference is in getfattr: trusted.bit-rot.version is
> 0x17005f3f9e670002ad5b on one node and
> 0x12005f3ce7af000dccad on the other.
>
> On the client, the log reports:
> -8<--
> [2020-08-21 11:32:52.208809] W [MSGID: 108008]
> [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check]
> 4-BigVol-replicate-13: GFID mismatch for
> /1_germline_CGTACTAG_L005_R1_001.fastq.gz
> d70a4a6d-05fc-4988-8041-5e7f62155fe5 on BigVol-client-55 and
> f249f88a-909f-489d-8d1d-d428e842ee96 on BigVol-client-34
> [2020-08-21 11:32:52.209768] W [fuse-bridge.c:471:fuse_entry_cbk]
> 0-glusterfs-fuse: 233606: LOOKUP()
> /[...]/1_germline_CGTACTAG_L005_R1_001.fastq.gz => -1 (Errore di
> input/output)
> -8<--
>
> As suggested on IRC, I tested the RAM, but the only thing I got was a
> "Peer rejected" status due to another OOM kill. No problem, I've been
> able to resolve it, but the original problem still remains.
>
> What else can I do?
>
> TIA!

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
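[Editorial note: the GFID mismatch in the log above can be confirmed directly on the bricks before attempting any fix. A minimal sketch follows; the brick paths are taken from the volume info above but the file path under them is hypothetical, and the two gfid values are copied from the GFID-mismatch log message.]

```shell
# On each brick hosting the file, the gfid is stored in the trusted.gfid
# extended attribute and can be read (as root, on the brick path, not the
# FUSE mount) with:
#   getfattr -n trusted.gfid -e hex /srv/bricks/00/BigVol/path/to/file
#
# Comparing the two values reported in the client log above confirms the
# entry split-brain:
gfid_client55='d70a4a6d-05fc-4988-8041-5e7f62155fe5'
gfid_client34='f249f88a-909f-489d-8d1d-d428e842ee96'
if [ "$gfid_client55" != "$gfid_client34" ]; then
    echo "gfid mismatch"
fi
```

[When removing the bad copy from a brick, the matching hard link under the brick's .glusterfs directory (named after the gfid) must be removed too before triggering a heal, as the post above already describes attempting.]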
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
Hello Mabi

You don't need to follow the offline upgrade procedure. Please do follow the online upgrade procedure only. Upgrade the nodes one by one; you will notice the `Peer Rejected` state after upgrading one node or so, but once all the nodes are upgraded it will be back to `Peer in Cluster (Connected)`. Also, if any of the shds are not online you can try restarting that node to fix that. I have tried this on my own setup, so I am pretty sure it should work for you as well. This is the workaround for the time being so that you are able to upgrade; we are working on the issue to come up with a fix for it ASAP.

And yes, if you face any issues even after upgrading all the nodes to 7.7, you will be able to downgrade back to 6.9, which I think you have already tried and which works as per your previous mail.

Regards
Nikhil Ladha