[Gluster-users] Registration for mountpoint is available!
mountpoint (https://mountpoint.io/), an open source software storage conference, is a co-located event with Open Source Summit North America, August 27-28, 2018:
https://events.linuxfoundation.org/events/open-source-summit-north-america-2018/program/co-located-events/

Registration for just mountpoint is now available at:
https://www.regonline.com/registration/Checkin.aspx?EventID=2447527

Looking forward to seeing you there!
- amye

--
Amye Scavarda | a...@redhat.com | Gluster Community Lead
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users
[Gluster-users] Rebalance state stuck or corrupted
We have had a rebalance operation going on for a few days. After a couple of days the rebalance status said "failed". We stopped the rebalance operation with "gluster volume rebalance gv0 stop", and the rebalance log indicates gluster did try to stop it. However, when we now try to stop the volume or restart the rebalance, it says a rebalance operation is in progress and the volume can't be stopped. I tried restarting the glusterfs-server service on all nodes (we're using Gluster 3.8.15 on Ubuntu) but that did not help.

user@gfs-vm000:~$ sudo gluster volume stop gv0
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: gv0: failed: Staging failed on gfs-vm001. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm017. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm011. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm006. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm003. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm004. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on 10.0.13.9. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm014. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm013. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm002. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm016. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm007. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm010. Error: rebalance session is in progress for the volume 'gv0'

user@gfs-vm000:~$ sudo gluster volume rebalance gv0 stop
volume rebalance: gv0: failed: Rebalance not started.
Tail of gv0-rebalance.log:

[2018-05-23 17:32:55.262168] I [MSGID: 109029] [dht-rebalance.c:4260:gf_defrag_stop] 0-: Received stop command on rebalance
[2018-05-23 17:32:55.262221] I [MSGID: 109028] [dht-rebalance.c:4079:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 749380.00 secs
[2018-05-23 17:32:55.262234] I [MSGID: 109028] [dht-rebalance.c:4083:gf_defrag_status_get] 0-glusterfs: Files migrated: 821417, size: 25797609415002, lookups: 1162021, failures: 0, skipped: 1814
[2018-05-23 17:32:55.777149] I [MSGID: 109022] [dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-50724.data-2-of-3 from subvolume gv0-replicate-0 to gv0-replicate-3
[2018-05-23 17:32:55.782048] W [dht-rebalance.c:2826:gf_defrag_process_dir] 0-gv0-dht: Found error from gf_defrag_get_entry
[2018-05-23 17:32:55.782358] E [MSGID: 109111] [dht-rebalance.c:3123:gf_defrag_fix_layout] 0-gv0-dht: gf_defrag_process_dir failed for directory: /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl
[2018-05-23 17:32:56.115106] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl
[2018-05-23 17:32:56.115586] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill
[2018-05-23 17:32:56.115849] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy/v-zhli2/generated/ende_with_teacher/model
[2018-05-23 17:32:56.116141] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy/v-zhli2/generated/ende_with_teacher
[2018-05-23 17:32:56.116237] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy/v-zhli2/generated
[2018-05-23 17:32:56.116393] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy/v-zhli2
[2018-05-23 17:32:56.116625] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy
[2018-05-23 17:32:56.129836] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 7
[2018-05-23 17:32:56.130072] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 8
[2018-05-23 17:32:56.130567] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 9
[2018-05-23 17:32:56.131273] I
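A common cause of this symptom is one or more glusterd daemons holding stale in-memory or on-disk rebalance state. The following is a rough recovery sketch, not an official procedure: the volume name gv0 comes from the thread, but the service name and the `node_state.info` path are assumptions that may vary by version and distro.

```shell
# 1) Ask every node what it thinks the rebalance state is.
gluster volume rebalance gv0 status

# 2) Restart the management daemon on *all* peers, not just one;
#    a single node holding stale state is enough to block "volume stop".
systemctl restart glusterfs-server   # Ubuntu service name for glusterd

# 3) If the state survives a restart everywhere, inspect glusterd's
#    persisted per-volume state (path is an assumption; check your install):
grep -H rebalance /var/lib/glusterd/vols/gv0/node_state.info
```

If `node_state.info` still records an in-progress rebalance on some node after a full glusterd restart cycle, that node is likely the one causing the "rebalance session is in progress" staging failures.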
Re: [Gluster-users] [Nfs-ganesha-support] [SOLVED] volume start: gv01: failed: Quorum not met. Volume operation not allowed.
Thanks, Tom. Good to know.

Daniel

On 05/22/2018 01:43 AM, TomK wrote:
This list has been deprecated. Please subscribe to the new support list at lists.nfs-ganesha.org.

Hey All,

Appears I solved this one, and NFS mounts now work on all my clients. No issues since fixing it a few hours back.

RESOLUTION

Auditd is to blame for the trouble. Noticed this in the logs on 2 of the 3 NFS servers (nfs01, nfs02, nfs03):

type=AVC msg=audit(1526965320.850:4094): avc: denied { write } for pid=8714 comm="ganesha.nfsd" name="nfs_0" dev="dm-0" ino=201547689 scontext=system_u:system_r:ganesha_t:s0 tcontext=system_u:object_r:krb5_host_rcache_t:s0 tclass=file
type=SYSCALL msg=audit(1526965320.850:4094): arch=c03e syscall=2 success=no exit=-13 a0=7f23b0003150 a1=2 a2=180 a3=2 items=0 ppid=1 pid=8714 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ganesha.nfsd" exe="/usr/bin/ganesha.nfsd" subj=system_u:system_r:ganesha_t:s0 key=(null)
type=PROCTITLE msg=audit(1526965320.850:4094): proctitle=2F7573722F62696E2F67616E657368612E6E667364002D4C002F7661722F6C6F672F67616E657368612F67616E657368612E6C6F67002D66002F6574632F67616E657368612F67616E657368612E636F6E66002D4E004E49565F4556454E54
type=AVC msg=audit(1526965320.850:4095): avc: denied { unlink } for pid=8714 comm="ganesha.nfsd" name="nfs_0" dev="dm-0" ino=201547689 scontext=system_u:system_r:ganesha_t:s0 tcontext=system_u:object_r:krb5_host_rcache_t:s0 tclass=file
type=SYSCALL msg=audit(1526965320.850:4095): arch=c03e syscall=87 success=no exit=-13 a0=7f23b0004100 a1=7f23b050 a2=7f23b0004100 a3=5 items=0 ppid=1 pid=8714 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ganesha.nfsd" exe="/usr/bin/ganesha.nfsd" subj=system_u:system_r:ganesha_t:s0 key=(null)
type=PROCTITLE msg=audit(1526965320.850:4095): proctitle=2F7573722F62696E2F67616E657368612E6E667364002D4C002F7661722F6C6F672F67616E657368612F67616E657368612E6C6F67002D66002F6574632F67616E657368612F67616E657368612E636F6E66002D4E004E49565F4556454E54

Fix was to adjust the SELinux rules using audit2allow. All the errors below, including the one in the link below, were due to that. Turns out that whenever it worked, it hit the only working server in the system, nfs03; whenever it didn't work, it was hitting the non-working servers. So sometimes it worked, and other times it didn't. It looked like it had to do with HAProxy / Keepalived as well, since I couldn't mount using the VIP but could using the host, but that wasn't the case either. I've also added the third brick to the GlusterFS volume, nfs03, to see if the backend FS was to blame, since GlusterFS recommends 3 bricks minimum for replication, but that had no effect.

In case anyone runs into this, I've added notes here as well:
http://microdevsys.com/wp/kernel-nfs-nfs4_discover_server_trunking-unhandled-error-512-exiting-with-error-eio-and-mount-hangs/
http://microdevsys.com/wp/nfs-reply-xid-3844308326-reply-err-20-auth-rejected-credentials-client-should-begin-new-session/

The errors thrown included:

NFS reply xid 3844308326 reply ERR 20: Auth Rejected Credentials (client should begin new session)
kernel: NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO

and mount hangs, plus the kernel exception below.
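As a side note, the `proctitle=` field in those audit records is just the hex-encoded command line of the denied process, with NUL bytes separating the argv entries. Decoding it confirms which process and invocation triggered the AVC:

```shell
# Decode the hex proctitle field from an audit record; audit separates
# argv entries with NUL bytes, so replace them with spaces for display.
echo '2F7573722F62696E2F67616E657368612E6E667364002D4C002F7661722F6C6F672F67616E657368612F67616E657368612E6C6F67002D66002F6574632F67616E657368612F67616E657368612E636F6E66002D4E004E49565F4556454E54' \
  | xxd -r -p | tr '\0' ' '
# -> /usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
```

The actual fix Tom describes then amounts to roughly `ausearch -m avc -c ganesha.nfsd | audit2allow -M ganesha_local && semodule -i ganesha_local.pp` as root on each affected server (the module name `ganesha_local` is arbitrary). Review the generated `.te` file before loading it, since audit2allow permits everything it saw denied.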
Re: [Gluster-users] New 3.12.7 possible split-brain on replica 3
On 05/23/2018 12:47 PM, mabi wrote:

> Hello,
>
> I just wanted to ask if you had time to look into this bug I am encountering and if there is anything else I can do? For now, in order to get rid of these 3 unsynched files, shall I use the same method that was suggested to me in this thread?

Sorry Mabi, I haven't had a chance to dig deeper into this. The workaround of resetting the xattrs should be fine though.

Thanks,
Ravi

> Thanks,
> Mabi
>
> ‐‐‐ Original Message ‐‐‐
> On May 17, 2018 11:07 PM, mabi wrote:
>
> Hi Ravi,
>
> Please find below the answers to your questions:
>
> 1) I have never touched the cluster.quorum-type option. Currently it is set as follows for this volume:
>
> Option                 Value
> cluster.quorum-type    none
>
> 2) The .shareKey files are not supposed to be empty. They should be 512 bytes big and contain binary data (a PGP secret sub-key). I am not in a position to say why it is 0 bytes in this specific case, or whether it is the fault of the software (Nextcloud) or of GlusterFS. I can just say that I have another file server, a simple NFS server with another Nextcloud installation, where I never saw any 0-byte .shareKey files being created.
>
> 3) It seems to be quite random, and I am not the person who uses the Nextcloud software, so I can't say what it was doing at that specific time, but I guess uploading files or moving files around. Basically, I use GlusterFS to store the files/data of the Nextcloud web application, where I have it mounted using a FUSE mount (mount -t glusterfs).
>
> Regarding the logs, I have attached the mount log file from the client, and below are the relevant log entries from the brick log files of all 3 nodes. Let me know if you need any other log files. Also, if you know of any "log file sanitizer tool" which can replace sensitive file names with random ones in log files, I would like to use it, as right now I have to do that manually.
> NODE 1 brick log:
>
> [2018-05-15 06:54:20.176679] E [MSGID: 113015] [posix.c:1211:posix_opendir] 0-myvol-private-posix: opendir failed on /data/myvol-private/brick/cloud/data/admin/files_encryption/keys/files/dir/dir/anotherdir/dir/OC_DEFAULT_MODULE [No such file or directory]
>
> NODE 2 brick log:
>
> [2018-05-15 06:54:20.176415] E [MSGID: 113015] [posix.c:1211:posix_opendir] 0-myvol-private-posix: opendir failed on /data/myvol-private/brick/cloud/data/admin/files_encryption/keys/files/dir/dir/anotherdir/dir/OC_DEFAULT_MODULE [No such file or directory]
>
> NODE 3 (arbiter) brick log:
>
> [2018-05-15 06:54:19.898981] W [MSGID: 113103] [posix.c:285:posix_lookup] 0-myvol-private-posix: Found stale gfid handle /srv/glusterfs/myvol-private/brick/.glusterfs/f0/65/f065a5e7-ac06-445f-add0-83acf8ce4155, removing it. [Stale file handle]
> [2018-05-15 06:54:20.056196] W [MSGID: 113103] [posix.c:285:posix_lookup] 0-myvol-private-posix: Found stale gfid handle /srv/glusterfs/myvol-private/brick/.glusterfs/8f/a1/8fa15dbd-cd5c-4900-b889-0fe7fce46a13, removing it. [Stale file handle]
> [2018-05-15 06:54:20.172823] I [MSGID: 115056] [server-rpc-fops.c:485:server_rmdir_cbk] 0-myvol-private-server: 14740125: RMDIR /cloud/data/admin/files_encryption/keys/files/dir/dir/anotherdir/dir/OC_DEFAULT_MODULE (f065a5e7-ac06-445f-add0-83acf8ce4155/OC_DEFAULT_MODULE), client: nextcloud.domain.com-7972-2018/05/10-20:31:46:163206-myvol-private-client-2-0-0, error-xlator: myvol-private-posix [Directory not empty]
> [2018-05-15 06:54:20.190911] I [MSGID: 115056] [server-rpc-fops.c:485:server_rmdir_cbk] 0-myvol-private-server: 14740141: RMDIR /cloud/data/admin/files_encryption/keys/files/dir/dir/anotherdir/dir (72a1613e-2ac0-48bd-8ace-f2f723f3796c/2016.03.15 AVB_Photovoltaik-Versicherung 2013.pdf), client: nextcloud.domain.com-7972-2018/05/10-20:31:46:163206-myvol-private-client-2-0-0, error-xlator: myvol-private-posix [Directory not empty]
>
> Best regards,
>
> Mabi
>
> ‐‐‐ Original Message ‐‐‐
> On May 17, 2018 7:00 AM, Ravishankar N ravishan...@redhat.com wrote:
>
>> Hi mabi,
>>
>> Some questions:
>>
>> - Did you by any chance change the cluster.quorum-type option from the default values?
>> - Is filename.shareKey supposed to be an empty file? Looks like the file was fallocated with the keep-size option but never written to. (On the 2 data bricks, stat output shows Size = 0, but non-zero Blocks, and yet a 'regular empty file'.)
>> - Do you have some sort of a reproducer / steps that you perform when the issue occurs? Please also share the logs from all 3 nodes and the client(s).
>>
>> Thanks,
>> Ravi
>>
>> On 05/15/2018 05:26 PM, mabi wrote:
>>
>>> Thank you Ravi for your fast answer. As requested you will find below the "stat" and "getfattr" of one of the files and its parent directory from all three nodes of my cluster.
>>>
>>> NODE 1:
>>>
>>> File: ‘/data/myvolume-private/brick/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/OC_DEFAULT_MODULE/filename.shareKey’
>>> Size: 0  Blocks: 38  IO Block: 131072  regular empty file
>>> Device: 23h/35d  Inode: 744413  Links: 2
>>> Access: (0644
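For readers hitting the same state, the "resetting xattrs" workaround Ravi refers to is typically done directly on the bricks. A hedged sketch follows; the brick path and volume name are taken from this thread but the exact client index (`client-0` here) and the decision of which copy to keep depend on your own `getfattr` output, so treat this as illustrative only.

```shell
# Inspect the AFR changelog xattrs of the problem file on each data brick
# (run as root on the brick servers; path is from this thread's example):
getfattr -d -m . -e hex \
  /data/myvol-private/brick/cloud/data/admin/filename.shareKey

# A non-zero trusted.afr.<volume>-client-N value records pending heals
# against brick N. Resetting it to all zeroes marks that copy clean
# (12 bytes: data / metadata / entry changelog counters, 4 bytes each):
setfattr -n trusted.afr.myvol-private-client-0 \
  -v 0x000000000000000000000000 \
  /data/myvol-private/brick/cloud/data/admin/filename.shareKey

# Then trigger a heal and verify nothing is left pending:
gluster volume heal myvol-private
gluster volume heal myvol-private info
```

Only reset the changelog on the brick whose copy you have decided is stale; zeroing the wrong side can make self-heal propagate the bad copy.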
Re: [Gluster-users] New 3.12.7 possible split-brain on replica 3
Hello,

I just wanted to ask if you had time to look into this bug I am encountering and if there is anything else I can do? For now, in order to get rid of these 3 unsynched files, shall I use the same method that was suggested to me in this thread?

Thanks,
Mabi

‐‐‐ Original Message ‐‐‐
On May 17, 2018 11:07 PM, mabi wrote:

> Hi Ravi,
>
> Please find below the answers to your questions:
>
> 1) I have never touched the cluster.quorum-type option. Currently it is set as follows for this volume:
>
> Option                 Value
> cluster.quorum-type    none
>
> 2) The .shareKey files are not supposed to be empty. They should be 512 bytes big and contain binary data (a PGP secret sub-key). I am not in a position to say why it is 0 bytes in this specific case, or whether it is the fault of the software (Nextcloud) or of GlusterFS. I can just say that I have another file server, a simple NFS server with another Nextcloud installation, where I never saw any 0-byte .shareKey files being created.
>
> 3) It seems to be quite random, and I am not the person who uses the Nextcloud software, so I can't say what it was doing at that specific time, but I guess uploading files or moving files around. Basically, I use GlusterFS to store the files/data of the Nextcloud web application, where I have it mounted using a FUSE mount (mount -t glusterfs).
>
> Regarding the logs, I have attached the mount log file from the client, and below are the relevant log entries from the brick log files of all 3 nodes. Let me know if you need any other log files. Also, if you know of any "log file sanitizer tool" which can replace sensitive file names with random ones in log files, I would like to use it, as right now I have to do that manually.
>
> NODE 1 brick log:
>
> [2018-05-15 06:54:20.176679] E [MSGID: 113015] [posix.c:1211:posix_opendir] 0-myvol-private-posix: opendir failed on /data/myvol-private/brick/cloud/data/admin/files_encryption/keys/files/dir/dir/anotherdir/dir/OC_DEFAULT_MODULE [No such file or directory]
>
> NODE 2 brick log:
>
> [2018-05-15 06:54:20.176415] E [MSGID: 113015] [posix.c:1211:posix_opendir] 0-myvol-private-posix: opendir failed on /data/myvol-private/brick/cloud/data/admin/files_encryption/keys/files/dir/dir/anotherdir/dir/OC_DEFAULT_MODULE [No such file or directory]
>
> NODE 3 (arbiter) brick log:
>
> [2018-05-15 06:54:19.898981] W [MSGID: 113103] [posix.c:285:posix_lookup] 0-myvol-private-posix: Found stale gfid handle /srv/glusterfs/myvol-private/brick/.glusterfs/f0/65/f065a5e7-ac06-445f-add0-83acf8ce4155, removing it. [Stale file handle]
> [2018-05-15 06:54:20.056196] W [MSGID: 113103] [posix.c:285:posix_lookup] 0-myvol-private-posix: Found stale gfid handle /srv/glusterfs/myvol-private/brick/.glusterfs/8f/a1/8fa15dbd-cd5c-4900-b889-0fe7fce46a13, removing it. [Stale file handle]
> [2018-05-15 06:54:20.172823] I [MSGID: 115056] [server-rpc-fops.c:485:server_rmdir_cbk] 0-myvol-private-server: 14740125: RMDIR /cloud/data/admin/files_encryption/keys/files/dir/dir/anotherdir/dir/OC_DEFAULT_MODULE (f065a5e7-ac06-445f-add0-83acf8ce4155/OC_DEFAULT_MODULE), client: nextcloud.domain.com-7972-2018/05/10-20:31:46:163206-myvol-private-client-2-0-0, error-xlator: myvol-private-posix [Directory not empty]
> [2018-05-15 06:54:20.190911] I [MSGID: 115056] [server-rpc-fops.c:485:server_rmdir_cbk] 0-myvol-private-server: 14740141: RMDIR /cloud/data/admin/files_encryption/keys/files/dir/dir/anotherdir/dir (72a1613e-2ac0-48bd-8ace-f2f723f3796c/2016.03.15 AVB_Photovoltaik-Versicherung 2013.pdf), client: nextcloud.domain.com-7972-2018/05/10-20:31:46:163206-myvol-private-client-2-0-0, error-xlator: myvol-private-posix [Directory not empty]
>
> Best regards,
>
> Mabi
>
> ‐‐‐ Original Message ‐‐‐
> On May 17, 2018 7:00 AM, Ravishankar N ravishan...@redhat.com wrote:
>
>> Hi mabi,
>>
>> Some questions:
>>
>> - Did you by any chance change the cluster.quorum-type option from the default values?
>> - Is filename.shareKey supposed to be an empty file? Looks like the file was fallocated with the keep-size option but never written to. (On the 2 data bricks, stat output shows Size = 0, but non-zero Blocks, and yet a 'regular empty file'.)
>> - Do you have some sort of a reproducer / steps that you perform when the issue occurs? Please also share the logs from all 3 nodes and the client(s).
>>
>> Thanks,
>> Ravi
>>
>> On 05/15/2018 05:26 PM, mabi wrote:
>>
>>> Thank you Ravi for your fast answer. As requested you will find below the "stat" and "getfattr" of one of the files and its parent directory from all three nodes of my cluster.
>>>
>>> NODE 1:
>>>
>>> File: ‘/data/myvolume-private/brick/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/OC_DEFAU