Re: [Gluster-users] Locking failed - since upgrade to 3.6.4
Hi,

> [2015-08-03 14:51:57.791081] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: 76e4398c-e00a-4f3b-9206-4f885c4e5206, lock held by: 76e4398c-e00a-4f3b-9206-4f885c4e5206
>
> This indicates that the cluster is still operating at the older op-version. You would need to bump up the op-version to 30604 using:
>
>   gluster volume set all cluster.op-version 30604

Hmm, it would be helpful if that were in the upgrade documentation in an obvious location. Anyhow:

  # gluster volume set all cluster.op-version 30604
  volume set: failed: Required op_version (30604) is not supported

Not so good.

  # dpkg --list | grep glus
  ii  glusterfs-client  3.6.4-1  amd64  clustered file-system (client package)
  ii  glusterfs-common  3.6.4-1  amd64  GlusterFS common libraries and translator modules
  ii  glusterfs-server  3.6.4-1  amd64  clustered file-system (server package)

So, on the basis of http://www.gluster.org/pipermail/gluster-users/2014-November/019666.html , I tried stepping the op-version up one release at a time:

  gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30600
  volume set: success
  gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30601
  volume set: success
  gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30602
  volume set: success
  gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30603
  volume set: success
  gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30604
  volume set: failed: Required op_version (30604) is not supported

Which I guess is closer to where I want to be... Will see if that does what I need, even if it is not quite right.

Thanks

Paul

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users
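For what it is worth, the op-version integer appears to be derived from the release number as major*10000 + minor*100 + patch, which is consistent with 3.6.4 mapping to 30604 above. A throwaway sketch (the helper name is my own):

```shell
# Hypothetical helper: map a GlusterFS release string to the
# cluster.op-version integer, assuming the encoding
# major*10000 + minor*100 + patch (so 3.6.4 -> 30604, 3.5.4 -> 30504).
opversion() {
  IFS=. read -r major minor patch <<EOF
$1
EOF
  echo $(( major * 10000 + minor * 100 + patch ))
}

opversion 3.6.4   # prints 30604
```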
Re: [Gluster-users] Locking failed - since upgrade to 3.6.4
Could you check the glusterd log on the other nodes? That would give you a hint of the exact issue. Also, looking at .cmd_log_history will give you the interval at which volume status commands are executed. If the gap is in milliseconds then you are bound to hit this, and it is expected.

-Atin
Sent from one plus one

On Aug 3, 2015 7:32 PM, Osborne, Paul (paul.osbo...@canterbury.ac.uk) wrote:

> Hi,
>
> Last week I upgraded one of my gluster clusters (3 hosts with bricks as replica 3) from 3.5.4 to 3.6.4 and all seemed well. Today I am getting reports that locking has failed:
>
>   gfse-cant-01:/var/log/glusterfs# gluster volume status
>   Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log file for details.
>   Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log file for details.
>
> Logs:
>
>   [2015-08-03 13:45:29.974560] E [glusterd-syncop.c:1640:gd_sync_task_begin] 0-management: Locking Peers Failed.
>   [2015-08-03 13:49:48.273159] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log file for details.
>   [2015-08-03 13:49:48.273778] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log file for details.
>
> I am wondering if this is new behaviour in 3.6.4 or something that has gone wrong. Restarting gluster entirely (by the way, the restart script does not actually appear to kill the processes...) resolves the issue, but it then recurs a few minutes later, which is rather suboptimal for a running service.
>
> Googling suggests that there may be simultaneous actions going on that can cause a locking issue. I know that I have nagios running "volume status <volname>" for each of my volumes on each host every few minutes; however this is not new and has been in place for the last 8-9 months against 3.5 without issue, so I would hope that it is not causing the problem. I am not sure where to look now, to be honest.
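Atin's suggestion can be scripted. A rough sketch, with the timestamp layout assumed to match the "[YYYY-MM-DD HH:MM:SS.ffffff]" style shown in the glusterd logs above (the command-history file name and location vary by version, so check locally):

```shell
# Print the gap, in whole seconds, between successive "volume status"
# entries read on stdin. Sub-second precision and midnight rollover
# are ignored in this sketch.
gap_report() {
  awk -F'[][ :.]+' '/volume status/ {
      t = $3 * 3600 + $4 * 60 + $5   # seconds since midnight
      if (seen) print t - prev
      prev = t; seen = 1
  }'
}

# e.g.  gap_report < /var/log/glusterfs/.cmd_log_history
```

If this prints gaps of zero (or the raw log shows only millisecond separation), the overlapping nagios checks are the likely trigger.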
Paul Osborne
Senior Systems Engineer
Canterbury Christ Church University
Tel: 01227 782751
Re: [Gluster-users] Locking failed - since upgrade to 3.6.4
Hi,

OK, I have tracked through the logs to find which of the hosts apparently holds a lock:

  [2015-08-03 14:55:37.602717] I [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management: Received status volume req for volume blogs
  [2015-08-03 14:51:57.791081] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: 76e4398c-e00a-4f3b-9206-4f885c4e5206, lock held by: 76e4398c-e00a-4f3b-9206-4f885c4e5206

I have identified the UUID for each peer via "gluster peer status" and worked backwards. I see that "gluster volume clear-locks" may clear the locks on the volume, but what is not clear from the logs is the path that holds the lock or the kind of lock that is held.

Incidentally, my clients (using NFS) still appear, from manual testing, to be able to read and write to the volume; it is the volume status and heal checks that are failing. All of my clients and servers have been sequentially rebooted in the hope that this would clear the issue; however, that does not appear to be the case.

Thanks

Paul

Paul Osborne
Senior Systems Engineer
Canterbury Christ Church University
Tel: 01227 782751

From: Atin Mukherjee atin.mukherje...@gmail.com
Sent: 03 August 2015 15:22
To: Osborne, Paul (paul.osbo...@canterbury.ac.uk)
Cc: gluster-users@gluster.org
Subject: Re: [Gluster-users] Locking failed - since upgrade to 3.6.4

[Atin's reply and the original message, quoted in full earlier in the thread, trimmed.]
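One thing worth noticing in the log line above: the requesting UUID and the "lock held by" UUID are the same node. Pulling the holder out and matching it against the peer state can be scripted; a small sketch (the log and state-file paths in the comments are the usual Debian ones, so verify locally):

```shell
# Extract the most recent "lock held by:" UUID from glusterd log text
# supplied on stdin.
lock_holder() {
  grep -o 'lock held by: [0-9a-f-]*' | tail -n 1 | awk '{ print $NF }'
}

# Typical use (paths assumed):
#   uuid=$(lock_holder < /var/log/glusterfs/etc-glusterfs-glusterd.vol.log)
#   grep -r "$uuid" /var/lib/glusterd/glusterd.info /var/lib/glusterd/peers/
```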
[Remainder of the quoted original message trimmed; it appears in full earlier in the thread.]
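On the nagios angle: since glusterd takes a single cluster-wide lock per management command, several monitoring checks firing "volume status" at nearly the same moment will collide. One possible mitigation (my own sketch, not an official fix) is to serialise the checks on each host; note this only covers checks run on the same machine, so the cross-host case still needs the nagios schedules staggered:

```shell
# Run a command under a host-local exclusive lock, waiting up to 10 s
# for any other instance to finish (the lock file path is arbitrary;
# flock creates it if it does not exist).
serialized() {
  flock -w 10 /tmp/gluster-check.lock "$@"
}

# Nagios would then invoke e.g.:
#   serialized gluster volume status blogs
```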
Re: [Gluster-users] Locking failed - since upgrade to 3.6.4
-Atin
Sent from one plus one

On Aug 3, 2015 8:31 PM, Osborne, Paul (paul.osbo...@canterbury.ac.uk) wrote:

> Hi,
>
> OK, I have tracked through the logs to find which of the hosts apparently holds a lock:
>
>   [2015-08-03 14:55:37.602717] I [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management: Received status volume req for volume blogs
>   [2015-08-03 14:51:57.791081] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: 76e4398c-e00a-4f3b-9206-4f885c4e5206, lock held by: 76e4398c-e00a-4f3b-9206-4f885c4e5206

This indicates that the cluster is still operating at the older op-version. You would need to bump up the op-version to 30604 using:

  gluster volume set all cluster.op-version 30604

[Remainder of Paul's message and the earlier quoted thread trimmed; it appears in full above.]
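Before bumping, it may be worth confirming what each node currently records as its operating version; on 3.6.x packages that state appears to live in /var/lib/glusterd/glusterd.info (path assumed, so check your install):

```shell
# Read the operating-version recorded in a glusterd.info-style
# key=value file given as $1.
current_opversion() {
  awk -F= '$1 == "operating-version" { print $2 }' "$1"
}

# e.g.  current_opversion /var/lib/glusterd/glusterd.info
```

A node still reporting 30600 or lower would explain the "Required op_version (30604) is not supported" failure seen earlier in the thread.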
[Earlier messages in the thread, quoted in full above, trimmed.]