[Gluster-users] How safe are major version upgrades?

2020-08-24 Thread David Cunningham
Hello,

We have a production system with around 50GB of data running GlusterFS
5.13. It has 3 replicating/mirrored nodes, and also geo-replicates to
another site.

How safe would it be to upgrade to a more recent major version, e.g. 7.x? I'm
not sure whether in-place upgrades are recommended, or whether a complete
re-install is necessary for safety.

We have a maximum window of around 4 hours for this upgrade and would not
want any significant risk of an unsuccessful upgrade at the end of that
time.
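For reference, my understanding of what an in-place rolling upgrade would
involve on each node in turn is roughly the following. This is only a sketch
pieced together from the upgrade guides: VOLNAME is a placeholder and the
package command depends on the distribution.

# on one replica node at a time
systemctl stop glusterd
pkill glusterfs; pkill glusterfsd     # stop remaining brick/mount processes
apt-get install glusterfs-server      # or the yum/dnf equivalent
systemctl start glusterd
gluster volume heal VOLNAME info      # wait for 0 entries before the next node

That is the kind of per-node downtime I am trying to fit into the 4 hour
window.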

Is version 8.0 considered stable?

Thanks in advance,

-- 
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782


[Gluster-users] Geo-replication log file not closed

2020-08-24 Thread David Cunningham
Hello,

We're having an issue with the rotated gsyncd.log not being released.
Here's the output of 'lsof':

# lsof | grep 'gsyncd.log.1'
python2  4495        root  3w  REG  8,1  9916750234332241  /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log.1 (deleted)
python2  4495  4496  root  3w  REG  8,1  9916750234332241  /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log.1 (deleted)
python2  4495  4507  root  3w  REG  8,1  9916750234332241  /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log.1 (deleted)
python2  4508        root  3w  REG  8,1  9916750234332241  /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log.1 (deleted)
python2  4508        root  5w  REG  8,1  9916750234332241  /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log.1 (deleted)
python2  4508  4511  root  3w  REG  8,1  9916750234332241  /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log.1 (deleted)
... etc...

Those processes are:
# ps -ef | egrep '4495|4508'
root  4495 1  0 Aug10 ?00:00:59 /usr/bin/python2
/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py
--path=/nodirectwritedata/gluster/gvol0  --monitor -c
/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
--iprefix=/var :gvol0 --glusterd-uuid=b7521445-ee93-4fed-8ced-6a609fa8c7d4
nvfs10::gvol0
root  4508  4495  0 Aug10 ?00:01:56 python2
/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py agent gvol0
nvfs10::gvol0 --local-path /nodirectwritedata/gluster/gvol0 --local-node
cafs30 --local-node-id b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 9,12,11,10

And here's the relevant part of the /etc/logrotate.d/glusterfs-georep
script:

/var/log/glusterfs/geo-replication/*/*.log {
    sharedscripts
    rotate 52
    missingok
    compress
    delaycompress
    notifempty
    postrotate
        for pid in `ps -aef | grep glusterfs | egrep "\-\-aux-gfid-mount" | awk '{print $2}'`; do
            /usr/bin/kill -HUP $pid > /dev/null 2>&1 || true
        done
    endscript
}

If I run the postrotate part manually:
# ps -aef | grep glusterfs | egrep "\-\-aux-gfid-mount" | awk '{print $2}'
4520

# ps -aef | grep 4520
root  4520 1  0 Aug10 ?01:24:23 /usr/sbin/glusterfs
--aux-gfid-mount --acl --log-level=INFO
--log-file=/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/mnt-nodirectwritedata-gluster-gvol0.log
--volfile-server=localhost --volfile-id=gvol0 --client-pid=-1
/tmp/gsyncd-aux-mount-Tq_3sU

Perhaps the problem is that the kill -HUP in the logrotate script doesn't
act on the right process? If so, does anyone have a command to get the
right PID?
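In case it helps frame the question, this is the kind of postrotate I had in
mind instead. The pgrep pattern and the assumption that gsyncd reopens its
log on SIGHUP are both guesses on my part:

    postrotate
        # HUP the aux mount and the gsyncd python processes
        # (assumes gsyncd reopens gsyncd.log on SIGHUP, which I have not verified)
        for pid in $(pgrep -f 'aux-gfid-mount|gsyncd.py'); do
            /usr/bin/kill -HUP "$pid" > /dev/null 2>&1 || true
        done
    endscript

If gsyncd does not actually handle SIGHUP, then adding "copytruncate" to the
stanza and dropping the postrotate script entirely might be the simpler
workaround, at the risk of losing a few log lines during rotation.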

Thanks in advance for any help.

-- 
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782


Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)

2020-08-24 Thread mabi
Dear Nikhil,

Thank you for your answer. So does this mean that all my FUSE clients which 
have the volume mounted will not lose their connection at any time during the 
whole upgrade procedure of all 3 nodes?

I am asking because, if I understand correctly, there will be an overlap in 
time where more than one node is not running the glusterfsd (brick) process, 
which would mean that quorum is lost and my FUSE clients would lose their 
connection to the volume?

I just want to be sure that there will not be any downtime.
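For what it is worth, between each node I was planning to run something like 
the following before moving on, just to confirm the bricks and self-heal 
daemons are back (the volume name is a placeholder):

gluster peer status              # all peers back in "Peer in Cluster (Connected)"?
gluster volume status myvol      # bricks and Self-heal Daemon online on the upgraded node?
gluster volume heal myvol info   # wait until "Number of entries" is 0 on every brick

Please correct me if that is not enough to be sure quorum is never lost.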

Best regards,
Mabi

‐‐‐ Original Message ‐‐‐
On Monday, August 24, 2020 11:14 AM, Nikhil Ladha  wrote:

> Hello Mabi
>
> You don't need to follow the offline upgrade procedure; please follow the 
> online upgrade procedure instead. Upgrade the nodes one by one. You will notice 
> the `Peer Rejected` state after upgrading one node or so, but once all the 
> nodes are upgraded it will go back to `Peer in Cluster (Connected)`. Also, if 
> any of the shd processes are not online, you can try restarting that node to 
> fix it. I have tried this on my own setup, so I am pretty sure it will work 
> for you as well.
> This is a workaround for the time being so that you are able to upgrade; we 
> are working on a proper fix for the issue ASAP.
>
> And yes, if you face any issues even after upgrading all the nodes to 7.7, 
> you will be able to downgrade back to 6.9, which I think you have already 
> tried and which works, as per your previous mail.
>
> Regards
> Nikhil Ladha


Re: [Gluster-users] How to fix I/O error ? (resend)

2020-08-24 Thread Diego Zuccato
On 21/08/20 at 13:56, Diego Zuccato wrote:

Hello again.

I also tried disabling bitrot (and re-enabling it afterwards), as well as the
split-brain recovery procedure[*] of removing the file and its gfid hard link
from one of the nodes, but no luck.
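To be concrete, this is roughly what I did on the brick whose copy I decided
to discard (the brick path and file location below are just examples, not the
real ones):
-8<--
# on the brick hosting the copy to discard
cd /srv/bricks/NN/BigVol
rm path/to/1_germline_CGTACTAG_L005_R1_001.fastq.gz
# remove the matching gfid hard link (here the gfid reported for BigVol-client-55)
rm .glusterfs/d7/0a/d70a4a6d-05fc-4988-8041-5e7f62155fe5
# then, from any node, trigger a heal
gluster volume heal BigVol
-8<--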

I'm now completely out of ideas :(

How can I resync those GFIDs?

Tks!
Diego

[*] even though "gluster volume heal BigVol info split-brain" reports 0
entries for every brick.

> Hello all.
> 
> I have a volume setup as:
> -8<--
> root@str957-biostor:~# gluster v info BigVol
> 
> Volume Name: BigVol
> Type: Distributed-Replicate
> Volume ID: c51926bd-6715-46b2-8bb3-8c915ec47e28
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 28 x (2 + 1) = 84
> Transport-type: tcp
> Bricks:
> Brick1: str957-biostor2:/srv/bricks/00/BigVol
> Brick2: str957-biostor:/srv/bricks/00/BigVol
> Brick3: str957-biostq:/srv/arbiters/00/BigVol (arbiter)
> [...]
> Options Reconfigured:
> cluster.granular-entry-heal: enable
> client.event-threads: 8
> server.event-threads: 8
> server.ssl: on
> client.ssl: on
> nfs.disable: on
> performance.readdir-ahead: on
> transport.address-family: inet
> features.bitrot: on
> features.scrub: Active
> features.scrub-freq: biweekly
> auth.ssl-allow: str957-bio*
> ssl.certificate-depth: 1
> cluster.self-heal-daemon: enable
> features.quota: on
> features.inode-quota: on
> features.quota-deem-statfs: on
> server.manage-gids: on
> features.scrub-throttle: aggressive
> -8<--
> 
> After a couple failures (a disk on biostor2 went "missing", and glusterd
> on biostq got killed by OOM) I noticed that some files can't be accessed
> from the clients:
> -8<--
> $ ls -lh 1_germline_CGTACTAG_L005_R*
> -rwxr-xr-x 1 e.f domain^users 2,0G apr 24  2015
> 1_germline_CGTACTAG_L005_R1_001.fastq.gz
> -rwxr-xr-x 1 e.f domain^users 2,0G apr 24  2015
> 1_germline_CGTACTAG_L005_R2_001.fastq.gz
> $ ls -lh 1_germline_CGTACTAG_L005_R1_001.fastq.gz
> ls: cannot access '1_germline_CGTACTAG_L005_R1_001.fastq.gz':
> Input/output error
> -8<--
> (note that if I request ls for more files, it works...).
> 
> The files have exactly the same contents (verified via md5sum). The only
> difference is in getfattr: trusted.bit-rot.version is
> 0x17005f3f9e670002ad5b on a node and
> 0x12005f3ce7af000dccad on the other.
> 
> On the client, the log reports:
> -8<-
> [2020-08-21 11:32:52.208809] W [MSGID: 108008]
> [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check]
> 4-BigVol-replicate-13: GFID mismatch for
> /1_germline_CGTACTAG_L005_R1_001.fastq.gz
> d70a4a6d-05fc-4988-8041-5e7f62155fe5 on BigVol-client-55 and
> f249f88a-909f-489d-8d1d-d428e842ee96 on BigVol-client-34
> [2020-08-21 11:32:52.209768] W [fuse-bridge.c:471:fuse_entry_cbk]
> 0-glusterfs-fuse: 233606: LOOKUP()
> /[...]/1_germline_CGTACTAG_L005_R1_001.fastq.gz => -1 (Errore di
> input/output)
> -8<--
> 
> As suggested on IRC, I tested the RAM, but the only thing I got was a
> "Peer rejected" status due to another OOM kill. No problem, I was
> able to resolve it, but the original problem still remains.
> 
> What else can I do?
> 
> TIA!
> 
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786


-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


[Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)

2020-08-24 Thread Nikhil Ladha
Hello Mabi

You don't need to follow the offline upgrade procedure; please follow the
online upgrade procedure instead. Upgrade the nodes one by one. You will
notice the `Peer Rejected` state after upgrading one node or so, but once
all the nodes are upgraded it will go back to `Peer in Cluster (Connected)`.
Also, if any of the shd processes are not online, you can try restarting that
node to fix it. I have tried this on my own setup, so I am pretty sure it
will work for you as well.
This is a workaround for the time being so that you are able to upgrade;
we are working on a proper fix for the issue ASAP.
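To check the shd state after each node is upgraded, something like this
should be enough (the volume name is a placeholder):

# gluster volume status myvol
(look for the "Self-heal Daemon on <hostname>" rows and check that the
Online column shows Y)

If one of them stays offline, that is when I would restart that node.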

And yes, if you face any issues even after upgrading all the nodes to 7.7,
you will be able to downgrade back to 6.9, which I think you have already
tried and which works, as per your previous mail.

Regards
Nikhil Ladha