Re: [Gluster-users] heaps split-brains during back-transfert
Hi Vijay,

Yes, of course. I sent my email after running some tests and checks, and the result was still wrong (even a couple of hours to a day after having forced the start of every brick) ... until I decided to run a "du" on every quota path. Now everything seems roughly OK, as you can see below:

# gluster volume quota vol_home list
Path                  Hard-limit  Soft-limit  Used     Available  Soft-limit exceeded?  Hard-limit exceeded?
-------------------------------------------------------------------------------------------------------------
/simlab_team          5.0TB       80%         1.2TB    3.8TB      No                    No
/amyloid_team         7.0TB       80%         4.9TB    2.1TB      No                    No
/amyloid_team/nguyen  3.5TB       80%         2.0TB    1.5TB      No                    No
/sacquin_team         10.0TB      80%         55.3GB   9.9TB      No                    No
/baaden_team          20.0TB      80%         11.5TB   8.5TB      No                    No
/derreumaux_team      5.0TB       80%         2.2TB    2.8TB      No                    No
/sterpone_team        14.0TB      80%         9.3TB    4.7TB      No                    No
/admin_team           1.0TB       80%         15.8GB   1008.2GB   No                    No

# for path in $(gluster volume quota vol_home list | awk 'NR>2 {print $1}'); do pdsh -w storage[1,3] du -sh /export/brick_home/brick{1,2}/data$path; done
storage1: 219G /export/brick_home/brick1/data/simlab_team
storage3: 334G /export/brick_home/brick1/data/simlab_team
storage1: 307G /export/brick_home/brick2/data/simlab_team
storage3: 327G /export/brick_home/brick2/data/simlab_team
storage1: 1,2T /export/brick_home/brick1/data/amyloid_team
storage3: 1,2T /export/brick_home/brick1/data/amyloid_team
storage1: 1,2T /export/brick_home/brick2/data/amyloid_team
storage3: 1,2T /export/brick_home/brick2/data/amyloid_team
storage1: 505G /export/brick_home/brick1/data/amyloid_team/nguyen
storage1: 483G /export/brick_home/brick2/data/amyloid_team/nguyen
storage3: 508G /export/brick_home/brick1/data/amyloid_team/nguyen
storage3: 503G /export/brick_home/brick2/data/amyloid_team/nguyen
storage3: 16G /export/brick_home/brick1/data/sacquin_team
storage1: 14G /export/brick_home/brick1/data/sacquin_team
storage3: 13G /export/brick_home/brick2/data/sacquin_team
storage1: 13G /export/brick_home/brick2/data/sacquin_team
storage1: 3,2T /export/brick_home/brick1/data/baaden_team
storage1: 2,8T /export/brick_home/brick2/data/baaden_team
storage3: 2,9T /export/brick_home/brick1/data/baaden_team
storage3: 2,7T /export/brick_home/brick2/data/baaden_team
storage3: 588G /export/brick_home/brick1/data/derreumaux_team
storage1: 566G /export/brick_home/brick1/data/derreumaux_team
storage1: 563G /export/brick_home/brick2/data/derreumaux_team
storage3: 610G /export/brick_home/brick2/data/derreumaux_team
storage3: 2,5T /export/brick_home/brick1/data/sterpone_team
storage1: 2,7T /export/brick_home/brick1/data/sterpone_team
storage3: 2,4T /export/brick_home/brick2/data/sterpone_team
storage1: 2,4T /export/brick_home/brick2/data/sterpone_team
storage3: 519M /export/brick_home/brick1/data/admin_team
storage1: 11G /export/brick_home/brick1/data/admin_team
storage3: 974M /export/brick_home/brick2/data/admin_team
storage1: 4,0G /export/brick_home/brick2/data/admin_team

In short:
simlab_team: ~1.2TB
amyloid_team: ~4.8TB
amyloid_team/nguyen: ~2TB
sacquin_team: ~56GB
baaden_team: ~11.6TB
derreumaux_team: ~2.3TB
sterpone_team: ~10TB
admin_team: ~16.5GB

There is still some difference, but globally it is fairly close (except for the sterpone_team quota).
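For completeness, the cross-check above boils down to something like the following. This is only a rough sketch reusing the volume name, node names and brick paths from this setup; adjust them for any other layout:

#!/bin/bash
# Sketch: for every quota path, print the on-disk usage of each brick on each
# storage node, so it can be compared against the "Used" column of the quota list.
VOL=vol_home
NODES="storage[1,3]"

gluster volume quota "$VOL" list | awk 'NR>2 {print $1}' | while read -r qpath; do
    echo "== $qpath =="
    for brick in /export/brick_home/brick1/data /export/brick_home/brick2/data; do
        # pdsh fans the du out to every storage node; sizes are per brick, per node
        pdsh -w "$NODES" du -sh "${brick}${qpath}"
    done
done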
But I also noticed something strange. Here are the results of every "du" I ran to force the recompute of the quota size (on the glusterfs mount point):

# du -sh /home/simlab_team/
1,2T    /home/simlab_team/
# du -sh /home/amyloid_team/
4,7T    /home/amyloid_team/
# du -sh /home/sacquin_team/
56G     /home/sacquin_team/
# du -sh /home/baaden_team/
12T     /home/baaden_team/
# du -sh /home/derreumaux_team/
2,3T    /home/derreumaux_team/
# du -sh /home/sterpone_team/
9,9T    /home/sterpone_team/

As you can see above, I don't understand why the quota size computed by the quota daemon differs from a du, especially for /sterpone_team.

Now, concerning all the hangs I met: could you tell me the brand of your InfiniBand interconnect? On our side we use QLogic; maybe the problem originates there (Intel/QLogic and Mellanox are quite different).

Concerning the brick logs, I just noticed that I have a lot of errors in one of my brick logs, and the file is around 5 GB. Here is an extract: # tail
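As a side note on that 5 GB brick log, rotating it through the gluster CLI should keep it manageable. A minimal sketch follows (volume name as above; the optional brick argument is only an example of the syntax, and the exact form may vary slightly between releases):

# Rotate the brick log(s) of the volume; the current file is renamed with a
# timestamp and logging continues in a fresh file.
gluster volume log rotate vol_home
# Optionally limited to a single brick, e.g.:
# gluster volume log rotate vol_home storage1:/export/brick_home/brick1/data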
Re: [Gluster-users] Locking failed - since upgrade to 3.6.4
Hi,

[2015-08-03 14:51:57.791081] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: 76e4398c-e00a-4f3b-9206-4f885c4e5206, lock held by: 76e4398c-e00a-4f3b-9206-4f885c4e5206

This indicates that the cluster is still operating at an older op-version. You would need to bump up the op-version to 30604 using:
gluster volume set all cluster.op-version 30604

Hmm, it would be helpful if that were in the upgrade documentation somewhere obvious. Anyhow:

# gluster volume set all cluster.op-version 30604
volume set: failed: Required op_version (30604) is not supported

Not so good.

dpkg --list | grep glus
ii glusterfs-client 3.6.4-1 amd64 clustered file-system (client package)
ii glusterfs-common 3.6.4-1 amd64 GlusterFS common libraries and translator modules
ii glusterfs-server 3.6.4-1 amd64 clustered file-system (server package)

So I tried, on the basis of http://www.gluster.org/pipermail/gluster-users/2014-November/019666.html :

gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30600
volume set: success
gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30601
volume set: success
gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30602
volume set: success
gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30603
volume set: success
gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30604
volume set: failed: Required op_version (30604) is not supported

Which I guess is closer to where I want to be... Will see if that does what I need, even if not quite right...

Thanks

Paul
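For anyone following along, the stepwise bump amounts to roughly this sketch (assuming the default glusterd state directory; run the grep on every peer to see what each node is currently operating at):

# Current operating version of this node's glusterd:
grep operating-version /var/lib/glusterd/glusterd.info

# Raise the cluster op-version one step at a time; each step only succeeds
# if every peer supports that version, so the loop stops at the first failure.
for v in 30600 30601 30602 30603 30604; do
    gluster volume set all cluster.op-version $v || break
done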
Re: [Gluster-users] Locking failed - since upgrade to 3.6.4
Could you check the glusterd log at the other nodes; that would give you a hint about the exact issue. Also, looking at .cmd_log_history will give you the time interval at which the volume status commands are executed. If the gap is in milliseconds then you are bound to hit it, and it's expected.

-Atin
Sent from one plus one

On Aug 3, 2015 7:32 PM, Osborne, Paul (paul.osbo...@canterbury.ac.uk) paul.osbo...@canterbury.ac.uk wrote:

Hi,

Last week I upgraded one of my gluster clusters (3 hosts with bricks as replica 3) to 3.6.4 from 3.5.4 and all seemed well. Today I am getting reports that locking has failed:

gfse-cant-01:/var/log/glusterfs# gluster volume status
Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log file for details.
Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log file for details.

Logs:

[2015-08-03 13:45:29.974560] E [glusterd-syncop.c:1640:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2015-08-03 13:49:48.273159] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log file for details.
[2015-08-03 13:49:48.273778] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log file for details.

I am wondering if this is a new feature due to 3.6.4 or something that has gone wrong. Restarting gluster entirely (btw the restart script does not actually appear to kill the processes...) resolves the issue, but then it repeats a few minutes later, which is rather suboptimal for a running service. Googling suggests that there may be simultaneous actions going on that can cause a locking issue. I know that I have nagios running volume status <volname> for each of my volumes on each host every few minutes; however, this is not new and has been in place for the last 8-9 months against 3.5 without issue, so I would hope that this is not causing the issue. I am not sure where to look now tbh.

Paul Osborne
Senior Systems Engineer
Canterbury Christ Church University
Tel: 01227 782751
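For the .cmd_log_history check suggested above, something as simple as the following shows how closely together the status commands land (assuming the default log directory /var/log/glusterfs; the file name is as given above):

# Timestamps of the most recent volume status commands recorded by glusterd:
grep "volume status" /var/log/glusterfs/.cmd_log_history | tail -n 20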
Re: [Gluster-users] Gluster 3.7 for Ubuntu-12.04
Does the 3.7 client work for Precise?

On Mon, Aug 3, 2015 at 6:14 PM, Kaleb Keithley kkeit...@redhat.com wrote:

From: John S bun...@gmail.com

Hi All,
Is there any Gluster 3.7 version available for Ubuntu 12.04? Currently we have all production/test servers running on Ubuntu 12.04. I saw glusterfs-3.7 for Ubuntu 14.04, 14.10 and 15. Please help.

3.7.x does not build on 12.04 (Precise). Some of the dependencies apparently either don't exist or are too old. As you noted, there are 3.7.x packages for 14.04 LTS (Trusty). If you absolutely need 3.7 then you'll need to update your servers.

--
Kaleb
Re: [Gluster-users] Locking failed - since upgrade to 3.6.4
Hi,

OK, I have tracked through the logs which of the hosts apparently has a lock open:

[2015-08-03 14:55:37.602717] I [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management: Received status volume req for volume blogs
[2015-08-03 14:51:57.791081] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: 76e4398c-e00a-4f3b-9206-4f885c4e5206, lock held by: 76e4398c-e00a-4f3b-9206-4f885c4e5206

I identified the UUID for each peer via gluster peer status and worked backwards. I see that gluster volume clear-locks may clear the locks on the volume, but it is not clear from the logs what path holds the lock or what kind of lock it is.

Incidentally, my clients (using NFS) appear, through manual testing, to still be able to read/write to the volume - it is the volume status and heal checks that are failing. All of my clients and servers have been sequentially rebooted in the hope that this would clear the issue; however, that does not appear to be the case.

Thanks

Paul

Paul Osborne
Senior Systems Engineer
Canterbury Christ Church University
Tel: 01227 782751

From: Atin Mukherjee atin.mukherje...@gmail.com
Sent: 03 August 2015 15:22
To: Osborne, Paul (paul.osbo...@canterbury.ac.uk)
Cc: gluster-users@gluster.org
Subject: Re: [Gluster-users] Locking failed - since upgrade to 3.6.4

Could you check the glusterd log at the other nodes; that would give you a hint about the exact issue. Also, looking at .cmd_log_history will give you the time interval at which the volume status commands are executed. If the gap is in milliseconds then you are bound to hit it, and it's expected.

-Atin
Sent from one plus one

On Aug 3, 2015 7:32 PM, Osborne, Paul (paul.osbo...@canterbury.ac.uk) paul.osbo...@canterbury.ac.uk wrote:

Hi,

Last week I upgraded one of my gluster clusters (3 hosts with bricks as replica 3) to 3.6.4 from 3.5.4 and all seemed well. Today I am getting reports that locking has failed:

gfse-cant-01:/var/log/glusterfs# gluster volume status
Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log file for details.
Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log file for details.

Logs:

[2015-08-03 13:45:29.974560] E [glusterd-syncop.c:1640:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2015-08-03 13:49:48.273159] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log file for details.
[2015-08-03 13:49:48.273778] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log file for details.

I am wondering if this is a new feature due to 3.6.4 or something that has gone wrong. Restarting gluster entirely (btw the restart script does not actually appear to kill the processes...) resolves the issue, but then it repeats a few minutes later, which is rather suboptimal for a running service. Googling suggests that there may be simultaneous actions going on that can cause a locking issue. I know that I have nagios running volume status <volname> for each of my volumes on each host every few minutes; however, this is not new and has been in place for the last 8-9 months against 3.5 without issue, so I would hope that this is not causing the issue. I am not sure where to look now tbh.
Paul Osborne
Senior Systems Engineer
Canterbury Christ Church University
Tel: 01227 782751
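Regarding the clear-locks question above: the usual way to find the path and lock kind is a brick statedump rather than the glusterd log. A hedged sketch follows (the volume name blogs is taken from the status request in the log; the path and lock arguments are placeholders only). Note also that the "Unable to get lock for uuid" message refers to glusterd's cluster-wide management lock, not a file lock, so clear-locks would not address that part.

# 1) Dump brick state; the dump files (by default under /var/run/gluster)
#    list held inode/entry/posix locks together with the path and lock owner.
gluster volume statedump blogs

# 2) Once the offending path and kind are known, clear them, e.g. (placeholders):
gluster volume clear-locks blogs /path/to/locked/file kind granted inode 0,0-0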
Re: [Gluster-users] Locking failed - since upgrade to 3.6.4
-Atin
Sent from one plus one

On Aug 3, 2015 8:31 PM, Osborne, Paul (paul.osbo...@canterbury.ac.uk) paul.osbo...@canterbury.ac.uk wrote:

Hi,

OK, I have tracked through the logs which of the hosts apparently has a lock open:

[2015-08-03 14:55:37.602717] I [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management: Received status volume req for volume blogs
[2015-08-03 14:51:57.791081] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: 76e4398c-e00a-4f3b-9206-4f885c4e5206, lock held by: 76e4398c-e00a-4f3b-9206-4f885c4e5206

This indicates that the cluster is still operating at an older op-version. You would need to bump up the op-version to 30604 using:
gluster volume set all cluster.op-version 30604

I identified the UUID for each peer via gluster peer status and worked backwards. I see that gluster volume clear-locks may clear the locks on the volume, but it is not clear from the logs what path holds the lock or what kind of lock it is.

Incidentally, my clients (using NFS) appear, through manual testing, to still be able to read/write to the volume - it is the volume status and heal checks that are failing. All of my clients and servers have been sequentially rebooted in the hope that this would clear the issue; however, that does not appear to be the case.

Thanks

Paul

Paul Osborne
Senior Systems Engineer
Canterbury Christ Church University
Tel: 01227 782751

From: Atin Mukherjee atin.mukherje...@gmail.com
Sent: 03 August 2015 15:22
To: Osborne, Paul (paul.osbo...@canterbury.ac.uk)
Cc: gluster-users@gluster.org
Subject: Re: [Gluster-users] Locking failed - since upgrade to 3.6.4

Could you check the glusterd log at the other nodes; that would give you a hint about the exact issue. Also, looking at .cmd_log_history will give you the time interval at which the volume status commands are executed. If the gap is in milliseconds then you are bound to hit it, and it's expected.

-Atin
Sent from one plus one

On Aug 3, 2015 7:32 PM, Osborne, Paul (paul.osbo...@canterbury.ac.uk) paul.osbo...@canterbury.ac.uk wrote:

Hi,

Last week I upgraded one of my gluster clusters (3 hosts with bricks as replica 3) to 3.6.4 from 3.5.4 and all seemed well. Today I am getting reports that locking has failed:

gfse-cant-01:/var/log/glusterfs# gluster volume status
Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log file for details.
Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log file for details.

Logs:

[2015-08-03 13:45:29.974560] E [glusterd-syncop.c:1640:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2015-08-03 13:49:48.273159] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log file for details.
[2015-08-03 13:49:48.273778] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log file for details.

I am wondering if this is a new feature due to 3.6.4 or something that has gone wrong. Restarting gluster entirely (btw the restart script does not actually appear to kill the processes...) resolves the issue, but then it repeats a few minutes later, which is rather suboptimal for a running service. Googling suggests that there may be simultaneous actions going on that can cause a locking issue.
I know that I have nagios running volume status <volname> for each of my volumes on each host every few minutes; however, this is not new and has been in place for the last 8-9 months against 3.5 without issue, so I would hope that this is not causing the issue. I am not sure where to look now tbh.

Paul Osborne
Senior Systems Engineer
Canterbury Christ Church University
Tel: 01227 782751
Re: [Gluster-users] File out of sync
The date is not the same, but is the content different? Could you have disabled the mtime attribute to get better performance?

What are these 2 GFIDs? You can use this script to find out what they are: https://gist.github.com/semiosis/4392640

Regards,
Mathieu CHATEAU
http://www.lotp.fr

2015-08-03 12:30 GMT+02:00 shacky shack...@gmail.com:

2015-08-03 12:16 GMT+02:00 Mathieu Chateau mathieu.chat...@lotp.fr:
You should start a heal: gluster volume xxx heal, or even a full one if that is not enough: gluster volume xxx heal full

Thanks, I tried both but this did not solve the problem:

root@web1:~# ls -l /mnt/data/web/system/web/config.inc.php
-rw-r--r-- 1 1010 1010 4041 Jul 31 17:42 /mnt/data/web/system/web/config.inc.php
root@web2:~# ls -l /mnt/data/web/system/web/config.inc.php
-rw-r--r-- 1 1010 1010 4041 Jul 24 09:20 /mnt/data/web/system/web/config.inc.php
root@web3:~# ls -l /mnt/data/web/system/web/config.inc.php
-rw-r--r-- 1 1010 1010 4041 Jul 24 09:20 /mnt/data/web/system/web/config.inc.php

This is the heal info:

root@web1:~# gluster volume heal data info
Brick web1:/data/
gfid:d5f6d18f-082f-40e9-bdd1-b7d7eee0ad6d
gfid:376cc8a1-8592-4e3f-a47a-5d36fa62f6bc
Number of entries: 2
Brick web2:/data/
Number of entries: 0
Brick web3:/data/
Number of entries: 0

I guess clients wrote some files while a node was down or rebooting?

I don't think this happened, because when I updated that file all nodes were running.
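For reference, resolving a GFID by hand comes down to the hard link kept under .glusterfs on the brick; the gist above does roughly this. A minimal sketch, assuming the brick path /data and the first GFID from the heal info output (this works for regular files; directories are symlinks under .glusterfs instead):

#!/bin/bash
# Map a GFID to its real path on a brick. For regular files the entry under
# .glusterfs/<aa>/<bb>/<gfid> is a hard link to the actual file, so every
# other link to the same inode is the "real" path.
BRICK=/data
GFID=d5f6d18f-082f-40e9-bdd1-b7d7eee0ad6d

GFID_FILE="$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
find "$BRICK" -samefile "$GFID_FILE" ! -path "$BRICK/.glusterfs/*"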
[Gluster-users] Very slow ls
Dear Gluster users,

after setting up a distributed-replicated volume (3x2 bricks) on Gluster 3.6.4 on Ubuntu systems and populating it with some data (about 150 GB in 20k files), I experience extreme delays when navigating through directories or trying to ls their contents (actually the process now seems to hang completely until I kill the /usr/sbin/glusterfs process on the mounting machine). Is there some common misconfiguration or any performance tuning option that I could try? I mount via automount with the fstype=glusterfs option (using the native FUSE mount).

Any tips?

Best regards,
Florian Oppermann
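A hedged sketch of options that are commonly looked at for slow directory listings on replicated volumes (the volume name data-vol is a placeholder; verify the option names against gluster volume set help on your version, since defaults differ between releases):

# Show which options are already reconfigured on the volume:
gluster volume info data-vol

# Options often tuned for directory-listing / metadata performance:
gluster volume set data-vol performance.readdir-ahead on
gluster volume set data-vol performance.stat-prefetch on
gluster volume set data-vol cluster.readdir-optimize on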