Re: [Gluster-devel] [glusterfs-3.6.0beta3-0.11.gitd01b00a] gluster volume status is running even though the Disk is detached
I set the zfs pool failmode to continue, which should fail only writes and not reads, as explained below:

failmode=wait | continue | panic
    Controls the system behavior in the event of catastrophic pool failure. This condition is typically a result of a loss of connectivity to the underlying storage device(s) or a failure of all devices within the pool. The behavior of such an event is determined as follows:

    wait      Blocks all I/O access until the device connectivity is recovered and the errors are cleared. This is the default behavior.

    continue  Returns EIO to any new write I/O requests but allows reads to any of the remaining healthy devices. Any write requests that have yet to be committed to disk would be blocked.

    panic     Prints out a message to the console and generates a system crash dump.

Now I rebuilt the glusterfs master branch and tested whether a failed drive results in a failed brick (which should in turn kill the brick process), but the brick does not go offline.

# gluster volume status
Status of volume: repvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 192.168.1.246:/zp1/brick1                 49152   Y       2400
Brick 192.168.1.246:/zp2/brick2                 49153   Y       2407
NFS Server on localhost                         2049    Y       30488
Self-heal Daemon on localhost                   N/A     Y       30495

Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks

The /var/log/glusterfs/mnt.log output:

[2014-10-31 09:18:15.934700] W [rpc-clnt-ping.c:154:rpc_clnt_ping_cbk] 0-repvol-client-1: socket disconnected
[2014-10-31 09:18:15.934725] I [client.c:2215:client_rpc_notify] 0-repvol-client-1: disconnected from repvol-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2014-10-31 09:18:15.935238] I [rpc-clnt.c:1765:rpc_clnt_reconfig] 0-repvol-client-1: changing port to 49153 (from 0)

Now if I copy a file to /mnt, it is copied without any hang and the brick still shows as online.

Thanks,
Kiran.

On Tue, Oct 28, 2014 at 3:44 PM, Niels de Vos nde...@redhat.com wrote:

On Tue, Oct 28, 2014 at 02:08:32PM +0530, Kiran Patil wrote:
The content of file zp2-brick2.log is at http://ur1.ca/iku0l ( http://fpaste.org/145714/44849041/ )
I can't open the file /zp2/brick2/.glusterfs/health_check since it hangs due to no disk present. Let me know the filename pattern, so that I can find it.

Hmm, if there is a hang while reading from the disk, it will not get detected by the current solution. We implemented failure detection on top of the detection that is done by the filesystem. Suspending a filesystem with fsfreeze or similar should probably not be seen as a failure. In your case, it seems that the filesystem suspends itself when the disk goes away.

I have no idea if it is possible to configure ZFS to not suspend, but to return an error to the reading/writing application. Please check for such an option. If you find one, please update the wiki page and recommend enabling it:
- http://gluster.org/community/documentation/index.php/GlusterOnZFS

Thanks, Niels

On Tue, Oct 28, 2014 at 1:42 PM, Niels de Vos nde...@redhat.com wrote:

On Tue, Oct 28, 2014 at 01:10:56PM +0530, Kiran Patil wrote:
I applied the patches, compiled and installed gluster.

# glusterfs --version
glusterfs 3.7dev built on Oct 28 2014 12:03:10
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2013 Red Hat, Inc. http://www.redhat.com/
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation.

# git log
commit 990ce16151c3af17e4cdaa94608b737940b60e4d
Author: Lalatendu Mohanty lmoha...@redhat.com
Date:   Tue Jul 1 07:52:27 2014 -0400

    Posix: Brick failure detection fix for ext4 filesystem
...
...

I see below messages

Many thanks Kiran! Do you have the messages from the brick that uses the zp2 mountpoint?

There also should be a file with a timestamp of when the last check was done successfully. If the brick is still running, this timestamp should get updated every storage.health-check-interval seconds:

    /zp2/brick2/.glusterfs/health_check

Niels

File /var/log/glusterfs/etc-glusterfs-glusterd.vol.log :

The message I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd. repeated 39 times between [2014-10-28 05:58:09.209419] and [2014-10-28 06:00:06.226330]
[2014-10-28
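For reference, the failmode setting described at the top of this message can be applied and verified with the standard zpool commands; a minimal sketch, assuming the affected pool is zp2 as in the status output above:

# zpool set failmode=continue zp2
# zpool get failmode zp2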
Re: [Gluster-devel] [glusterfs-3.6.0beta3-0.11.gitd01b00a] gluster volume status is running even though the Disk is detached
I am not seeing the below message in any log file under the /var/log/glusterfs directory or its subdirectories:

    health-check failed, going down

On Fri, Oct 31, 2014 at 3:16 PM, Kiran Patil ki...@fractalio.com wrote:

I set the zfs pool failmode to continue, which should fail only writes and not reads, as explained below:

failmode=wait | continue | panic
    Controls the system behavior in the event of catastrophic pool failure. This condition is typically a result of a loss of connectivity to the underlying storage device(s) or a failure of all devices within the pool. The behavior of such an event is determined as follows:

    wait      Blocks all I/O access until the device connectivity is recovered and the errors are cleared. This is the default behavior.

    continue  Returns EIO to any new write I/O requests but allows reads to any of the remaining healthy devices. Any write requests that have yet to be committed to disk would be blocked.

    panic     Prints out a message to the console and generates a system crash dump.

Now I rebuilt the glusterfs master branch and tested whether a failed drive results in a failed brick (which should in turn kill the brick process), but the brick does not go offline.

# gluster volume status
Status of volume: repvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 192.168.1.246:/zp1/brick1                 49152   Y       2400
Brick 192.168.1.246:/zp2/brick2                 49153   Y       2407
NFS Server on localhost                         2049    Y       30488
Self-heal Daemon on localhost                   N/A     Y       30495

Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks

The /var/log/glusterfs/mnt.log output:

[2014-10-31 09:18:15.934700] W [rpc-clnt-ping.c:154:rpc_clnt_ping_cbk] 0-repvol-client-1: socket disconnected
[2014-10-31 09:18:15.934725] I [client.c:2215:client_rpc_notify] 0-repvol-client-1: disconnected from repvol-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2014-10-31 09:18:15.935238] I [rpc-clnt.c:1765:rpc_clnt_reconfig] 0-repvol-client-1: changing port to 49153 (from 0)

Now if I copy a file to /mnt, it is copied without any hang and the brick still shows as online.

Thanks,
Kiran.

On Tue, Oct 28, 2014 at 3:44 PM, Niels de Vos nde...@redhat.com wrote:

On Tue, Oct 28, 2014 at 02:08:32PM +0530, Kiran Patil wrote:
The content of file zp2-brick2.log is at http://ur1.ca/iku0l ( http://fpaste.org/145714/44849041/ )
I can't open the file /zp2/brick2/.glusterfs/health_check since it hangs due to no disk present. Let me know the filename pattern, so that I can find it.

Hmm, if there is a hang while reading from the disk, it will not get detected by the current solution. We implemented failure detection on top of the detection that is done by the filesystem. Suspending a filesystem with fsfreeze or similar should probably not be seen as a failure. In your case, it seems that the filesystem suspends itself when the disk goes away.

I have no idea if it is possible to configure ZFS to not suspend, but to return an error to the reading/writing application. Please check for such an option. If you find one, please update the wiki page and recommend enabling it:
- http://gluster.org/community/documentation/index.php/GlusterOnZFS

Thanks, Niels

On Tue, Oct 28, 2014 at 1:42 PM, Niels de Vos nde...@redhat.com wrote:

On Tue, Oct 28, 2014 at 01:10:56PM +0530, Kiran Patil wrote:
I applied the patches, compiled and installed gluster.

# glusterfs --version
glusterfs 3.7dev built on Oct 28 2014 12:03:10
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2013 Red Hat, Inc. http://www.redhat.com/
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation.

# git log
commit 990ce16151c3af17e4cdaa94608b737940b60e4d
Author: Lalatendu Mohanty lmoha...@redhat.com
Date:   Tue Jul 1 07:52:27 2014 -0400

    Posix: Brick failure detection fix for ext4 filesystem
...
...

I see below messages

Many thanks Kiran! Do you have the messages from the brick that uses the zp2 mountpoint?

There also should be a file with a timestamp of when the last check was done successfully. If the brick is still running, this timestamp should get updated every storage.health-check-interval seconds:

    /zp2/brick2/.glusterfs/health_check

Niels

File /var/log/glusterfs/etc-glusterfs-glusterd.vol.log :

The message I [MSGID:
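To rule out a missed log file, the health-check message can be searched for across all the GlusterFS logs; a quick sketch, assuming the brick logs live under /var/log/glusterfs/bricks/ as usual:

# grep -r "health-check failed" /var/log/glusterfs/
# grep "going down" /var/log/glusterfs/bricks/zp2-brick2.log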
Re: [Gluster-devel] [glusterfs-3.6.0beta3-0.11.gitd01b00a] gluster volume status is running even though the Disk is detached
I changed

    git fetch git://review.gluster.org/glusterfs

to

    git fetch http://review.gluster.org/glusterfs

and now it works.

Thanks,
Kiran.

On Tue, Oct 28, 2014 at 11:13 AM, Kiran Patil ki...@fractalio.com wrote:

Hi Niels,

I am getting a "fatal: Couldn't find remote ref refs/changes/13/8213/9" error.

Steps to reproduce the issue:

1) # git clone git://review.gluster.org/glusterfs
Initialized empty Git repository in /root/gluster-3.6/glusterfs/.git/
remote: Counting objects: 84921, done.
remote: Compressing objects: 100% (48307/48307), done.
remote: Total 84921 (delta 57264), reused 63233 (delta 36254)
Receiving objects: 100% (84921/84921), 23.23 MiB | 192 KiB/s, done.
Resolving deltas: 100% (57264/57264), done.

2) # cd glusterfs
# git branch
* master

3) # git fetch git://review.gluster.org/glusterfs refs/changes/13/8213/9 && git checkout FETCH_HEAD
fatal: Couldn't find remote ref refs/changes/13/8213/9

Note: I also tried the above steps on the git repo https://github.com/gluster/glusterfs and the result is the same as above.

Please let me know if I missed any steps.

Thanks,
Kiran.

On Mon, Oct 27, 2014 at 5:53 PM, Niels de Vos nde...@redhat.com wrote:

On Mon, Oct 27, 2014 at 05:19:13PM +0530, Kiran Patil wrote:
Hi,

I created a replicated volume with two bricks on the same node and copied some data to it. Then I removed the disk which hosts one of the bricks of the volume. storage.health-check-interval is set to 30 seconds.

I can see that the disk is unavailable using the zpool command of ZFS on Linux, but gluster volume status still shows the brick process as running, even though it should have been shut down by now.

Is this a bug in 3.6, since it is mentioned as a feature ( https://github.com/gluster/glusterfs/blob/release-3.6/doc/features/brick-failure-detection.md ), or am I making a mistake here?

The initial detection of brick failures did not work for all filesystems. It may not work for ZFS either. A fix has been posted, but it has not been merged into the master branch yet. When the change has been merged, it can get backported to 3.6 and 3.5.

You may want to test with the patch applied, and add your "+1 Verified" to the change in case it makes it functional for you:
- http://review.gluster.org/8213

Cheers, Niels

[root@fractal-c92e gluster-3.6]# gluster volume status
Status of volume: repvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 192.168.1.246:/zp1/brick1                 49154   Y       17671
Brick 192.168.1.246:/zp2/brick2                 49155   Y       17682
NFS Server on localhost                         2049    Y       17696
Self-heal Daemon on localhost                   N/A     Y       17701

Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks

[root@fractal-c92e gluster-3.6]# gluster volume info
Volume Name: repvol
Type: Replicate
Volume ID: d4f992b1-1393-43b8-9fda-2e2b6e3b5039
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.1.246:/zp1/brick1
Brick2: 192.168.1.246:/zp2/brick2
Options Reconfigured:
storage.health-check-interval: 30

[root@fractal-c92e gluster-3.6]# zpool status zp2
  pool: zp2
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: none requested
config:

        NAME    STATE     READ WRITE CKSUM
        zp2     UNAVAIL      0     0     0  insufficient replicas
          sdb   UNAVAIL      0     0     0

errors: 2 data errors, use '-v' for a list

Thanks,
Kiran.
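For anyone hitting the same error, the full sequence that worked for me looks like this; a sketch in which only the remote URL scheme differs from the failing step 3 above:

# git clone http://review.gluster.org/glusterfs
# cd glusterfs
# git fetch http://review.gluster.org/glusterfs refs/changes/13/8213/9 && git checkout FETCH_HEAD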
Re: [Gluster-devel] [glusterfs-3.6.0beta3-0.11.gitd01b00a] gluster volume status is running even though the Disk is detached
I applied the patches, compiled and installed gluster.

# glusterfs --version
glusterfs 3.7dev built on Oct 28 2014 12:03:10
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2013 Red Hat, Inc. http://www.redhat.com/
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation.

# git log
commit 990ce16151c3af17e4cdaa94608b737940b60e4d
Author: Lalatendu Mohanty lmoha...@redhat.com
Date:   Tue Jul 1 07:52:27 2014 -0400

    Posix: Brick failure detection fix for ext4 filesystem
...
...

I see the below messages.

File /var/log/glusterfs/etc-glusterfs-glusterd.vol.log :

The message I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd. repeated 39 times between [2014-10-28 05:58:09.209419] and [2014-10-28 06:00:06.226330]
[2014-10-28 06:00:09.226507] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
[2014-10-28 06:00:09.226712] I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd.
[2014-10-28 06:00:12.226881] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
[2014-10-28 06:00:15.227249] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
[2014-10-28 06:00:18.227616] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
[2014-10-28 06:00:21.227976] W [socket.c:545:__socket_rwv] 0-management: readv on
.
.
[2014-10-28 06:19:15.142867] I [glusterd-handler.c:1280:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
The message I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd. repeated 12 times between [2014-10-28 06:18:09.368752] and [2014-10-28 06:18:45.373063]
[2014-10-28 06:23:38.207649] W [glusterfsd.c:1194:cleanup_and_exit] (-- 0-: received signum (15), shutting down

dmesg output:

SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has encountered an uncorrectable I/O failure and has been suspended.
SPLError: 7868:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has encountered an uncorrectable I/O failure and has been suspended.
SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has encountered an uncorrectable I/O failure and has been suspended.

The brick is still online.

# gluster volume status
Status of volume: repvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 192.168.1.246:/zp1/brick1                 49152   Y       4067
Brick 192.168.1.246:/zp2/brick2                 49153   Y       4078
NFS Server on localhost                         2049    Y       4092
Self-heal Daemon on localhost                   N/A     Y       4097

Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks

# gluster volume info
Volume Name: repvol
Type: Replicate
Volume ID: ba1e7c6d-1e1c-45cd-8132-5f4fa4d2d22b
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.1.246:/zp1/brick1
Brick2: 192.168.1.246:/zp2/brick2
Options Reconfigured:
storage.health-check-interval: 30

Let me know if you need further information.

Thanks,
Kiran.

On Tue, Oct 28, 2014 at 11:44 AM, Kiran Patil ki...@fractalio.com wrote:

I changed

    git fetch git://review.gluster.org/glusterfs

to

    git fetch http://review.gluster.org/glusterfs

and now it works.

Thanks,
Kiran.

On Tue, Oct 28, 2014 at 11:13 AM, Kiran Patil ki...@fractalio.com wrote:

Hi Niels,

I am getting a "fatal: Couldn't find remote ref refs/changes/13/8213/9" error.

Steps to reproduce the issue:

1) # git clone git://review.gluster.org/glusterfs
Initialized empty Git repository in /root/gluster-3.6/glusterfs/.git/
remote: Counting objects: 84921, done.
remote: Compressing objects: 100% (48307/48307), done.
remote: Total 84921 (delta 57264), reused 63233 (delta 36254)
Receiving objects: 100% (84921/84921), 23.23 MiB | 192 KiB/s, done.
Resolving deltas: 100% (57264/57264), done.

2) # cd glusterfs
# git branch
* master

3) # git fetch git://review.gluster.org/glusterfs refs/changes/13/8213/9 && git checkout FETCH_HEAD
fatal: Couldn't find remote ref refs/changes/13/8213/9

Note: I also tried the above steps on the git repo https://github.com/gluster/glusterfs and the result is the same as above.

Please let me know if I missed any steps.

Thanks,
Kiran.

On Mon, Oct 27, 2014 at 5:53 PM, Niels de Vos
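As a quick sanity check that the brick really is still up (and not just stale status output), the PID reported above can be checked directly; a minimal sketch using the PID from the gluster volume status output:

# ps -fp 4078    (brick process for /zp2/brick2; still running even though the pool is suspended)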
Re: [Gluster-devel] [glusterfs-3.6.0beta3-0.11.gitd01b00a] gluster volume status is running even though the Disk is detached
On Tue, Oct 28, 2014 at 01:10:56PM +0530, Kiran Patil wrote:

I applied the patches, compiled and installed gluster.

# glusterfs --version
glusterfs 3.7dev built on Oct 28 2014 12:03:10
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2013 Red Hat, Inc. http://www.redhat.com/
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation.

# git log
commit 990ce16151c3af17e4cdaa94608b737940b60e4d
Author: Lalatendu Mohanty lmoha...@redhat.com
Date:   Tue Jul 1 07:52:27 2014 -0400

    Posix: Brick failure detection fix for ext4 filesystem
...
...

I see below messages

Many thanks Kiran! Do you have the messages from the brick that uses the zp2 mountpoint?

There also should be a file with a timestamp of when the last check was done successfully. If the brick is still running, this timestamp should get updated every storage.health-check-interval seconds:

    /zp2/brick2/.glusterfs/health_check

Niels

File /var/log/glusterfs/etc-glusterfs-glusterd.vol.log :

The message I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd. repeated 39 times between [2014-10-28 05:58:09.209419] and [2014-10-28 06:00:06.226330]
[2014-10-28 06:00:09.226507] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
[2014-10-28 06:00:09.226712] I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd.
[2014-10-28 06:00:12.226881] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
[2014-10-28 06:00:15.227249] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
[2014-10-28 06:00:18.227616] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
[2014-10-28 06:00:21.227976] W [socket.c:545:__socket_rwv] 0-management: readv on
.
.
[2014-10-28 06:19:15.142867] I [glusterd-handler.c:1280:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
The message I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd. repeated 12 times between [2014-10-28 06:18:09.368752] and [2014-10-28 06:18:45.373063]
[2014-10-28 06:23:38.207649] W [glusterfsd.c:1194:cleanup_and_exit] (-- 0-: received signum (15), shutting down

dmesg output:

SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has encountered an uncorrectable I/O failure and has been suspended.
SPLError: 7868:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has encountered an uncorrectable I/O failure and has been suspended.
SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has encountered an uncorrectable I/O failure and has been suspended.

The brick is still online.

# gluster volume status
Status of volume: repvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 192.168.1.246:/zp1/brick1                 49152   Y       4067
Brick 192.168.1.246:/zp2/brick2                 49153   Y       4078
NFS Server on localhost                         2049    Y       4092
Self-heal Daemon on localhost                   N/A     Y       4097

Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks

# gluster volume info
Volume Name: repvol
Type: Replicate
Volume ID: ba1e7c6d-1e1c-45cd-8132-5f4fa4d2d22b
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.1.246:/zp1/brick1
Brick2: 192.168.1.246:/zp2/brick2
Options Reconfigured:
storage.health-check-interval: 30

Let me know if you need further information.

Thanks,
Kiran.

On Tue, Oct 28, 2014 at 11:44 AM, Kiran Patil ki...@fractalio.com wrote:

I changed

    git fetch git://review.gluster.org/glusterfs

to

    git fetch http://review.gluster.org/glusterfs

and now it works.

Thanks,
Kiran.

On Tue, Oct 28, 2014 at 11:13 AM, Kiran Patil ki...@fractalio.com wrote:

Hi Niels,

I am getting a "fatal: Couldn't find remote ref refs/changes/13/8213/9" error.

Steps to reproduce the issue:

1) # git clone git://review.gluster.org/glusterfs
Initialized empty Git repository in /root/gluster-3.6/glusterfs/.git/
remote: Counting objects: 84921, done.
remote: Compressing objects: 100% (48307/48307), done.
remote: Total 84921 (delta 57264), reused 63233 (delta 36254)
Receiving objects: 100%
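The health_check timestamp file mentioned above can be inspected like this; a sketch, noting that with the zp2 pool suspended any access under /zp2/brick2 may simply hang rather than fail, which matches what is reported later in the thread:

# stat /zp2/brick2/.glusterfs/health_check
# cat /zp2/brick2/.glusterfs/health_check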
Re: [Gluster-devel] [glusterfs-3.6.0beta3-0.11.gitd01b00a] gluster volume status is running even though the Disk is detached
On Tue, Oct 28, 2014 at 02:08:32PM +0530, Kiran Patil wrote:

The content of file zp2-brick2.log is at http://ur1.ca/iku0l ( http://fpaste.org/145714/44849041/ )
I can't open the file /zp2/brick2/.glusterfs/health_check since it hangs due to no disk present. Let me know the filename pattern, so that I can find it.

Hmm, if there is a hang while reading from the disk, it will not get detected by the current solution. We implemented failure detection on top of the detection that is done by the filesystem. Suspending a filesystem with fsfreeze or similar should probably not be seen as a failure. In your case, it seems that the filesystem suspends itself when the disk goes away.

I have no idea if it is possible to configure ZFS to not suspend, but to return an error to the reading/writing application. Please check for such an option. If you find one, please update the wiki page and recommend enabling it:
- http://gluster.org/community/documentation/index.php/GlusterOnZFS

Thanks, Niels

On Tue, Oct 28, 2014 at 1:42 PM, Niels de Vos nde...@redhat.com wrote:

On Tue, Oct 28, 2014 at 01:10:56PM +0530, Kiran Patil wrote:
I applied the patches, compiled and installed gluster.

# glusterfs --version
glusterfs 3.7dev built on Oct 28 2014 12:03:10
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2013 Red Hat, Inc. http://www.redhat.com/
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation.

# git log
commit 990ce16151c3af17e4cdaa94608b737940b60e4d
Author: Lalatendu Mohanty lmoha...@redhat.com
Date:   Tue Jul 1 07:52:27 2014 -0400

    Posix: Brick failure detection fix for ext4 filesystem
...
...

I see below messages

Many thanks Kiran! Do you have the messages from the brick that uses the zp2 mountpoint?

There also should be a file with a timestamp of when the last check was done successfully. If the brick is still running, this timestamp should get updated every storage.health-check-interval seconds:

    /zp2/brick2/.glusterfs/health_check

Niels

File /var/log/glusterfs/etc-glusterfs-glusterd.vol.log :

The message I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd. repeated 39 times between [2014-10-28 05:58:09.209419] and [2014-10-28 06:00:06.226330]
[2014-10-28 06:00:09.226507] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
[2014-10-28 06:00:09.226712] I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd.
[2014-10-28 06:00:12.226881] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
[2014-10-28 06:00:15.227249] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
[2014-10-28 06:00:18.227616] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
[2014-10-28 06:00:21.227976] W [socket.c:545:__socket_rwv] 0-management: readv on
.
.
[2014-10-28 06:19:15.142867] I [glusterd-handler.c:1280:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
The message I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd. repeated 12 times between [2014-10-28 06:18:09.368752] and [2014-10-28 06:18:45.373063]
[2014-10-28 06:23:38.207649] W [glusterfsd.c:1194:cleanup_and_exit] (-- 0-: received signum (15), shutting down

dmesg output:

SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has encountered an uncorrectable I/O failure and has been suspended.
SPLError: 7868:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has encountered an uncorrectable I/O failure and has been suspended.
SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has encountered an uncorrectable I/O failure and has been suspended.

The brick is still online.

# gluster volume status
Status of volume: repvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 192.168.1.246:/zp1/brick1                 49152   Y       4067
Brick 192.168.1.246:/zp2/brick2                 49153   Y       4078
NFS Server on localhost                         2049    Y       4092
Self-heal Daemon on localhost                   N/A     Y       4097
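To illustrate the point about detection sitting on top of the filesystem: the check only trips when an I/O call actually returns an error, so a pool that merely suspends never fails it, it just blocks. A rough shell sketch of the idea follows; this is not the actual GlusterFS code, and the scratch file name is made up for illustration:

BRICK=/zp2/brick2
while true; do
    # EIO from the filesystem would make this write fail and the brick would be taken down;
    # a suspended pool makes the write block forever instead, so the loop never notices.
    date +%s > "$BRICK/.glusterfs/health_check.tmp" || { echo "health-check failed, going down"; break; }
    sleep 30    # storage.health-check-interval
done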
Re: [Gluster-devel] [glusterfs-3.6.0beta3-0.11.gitd01b00a] gluster volume status is running even though the Disk is detached
Hi Niels,

I am getting a "fatal: Couldn't find remote ref refs/changes/13/8213/9" error.

Steps to reproduce the issue:

1) # git clone git://review.gluster.org/glusterfs
Initialized empty Git repository in /root/gluster-3.6/glusterfs/.git/
remote: Counting objects: 84921, done.
remote: Compressing objects: 100% (48307/48307), done.
remote: Total 84921 (delta 57264), reused 63233 (delta 36254)
Receiving objects: 100% (84921/84921), 23.23 MiB | 192 KiB/s, done.
Resolving deltas: 100% (57264/57264), done.

2) # cd glusterfs
# git branch
* master

3) # git fetch git://review.gluster.org/glusterfs refs/changes/13/8213/9 && git checkout FETCH_HEAD
fatal: Couldn't find remote ref refs/changes/13/8213/9

Note: I also tried the above steps on the git repo https://github.com/gluster/glusterfs and the result is the same as above.

Please let me know if I missed any steps.

Thanks,
Kiran.

On Mon, Oct 27, 2014 at 5:53 PM, Niels de Vos nde...@redhat.com wrote:

On Mon, Oct 27, 2014 at 05:19:13PM +0530, Kiran Patil wrote:
Hi,

I created a replicated volume with two bricks on the same node and copied some data to it. Then I removed the disk which hosts one of the bricks of the volume. storage.health-check-interval is set to 30 seconds.

I can see that the disk is unavailable using the zpool command of ZFS on Linux, but gluster volume status still shows the brick process as running, even though it should have been shut down by now.

Is this a bug in 3.6, since it is mentioned as a feature ( https://github.com/gluster/glusterfs/blob/release-3.6/doc/features/brick-failure-detection.md ), or am I making a mistake here?

The initial detection of brick failures did not work for all filesystems. It may not work for ZFS either. A fix has been posted, but it has not been merged into the master branch yet. When the change has been merged, it can get backported to 3.6 and 3.5.

You may want to test with the patch applied, and add your "+1 Verified" to the change in case it makes it functional for you:
- http://review.gluster.org/8213

Cheers, Niels

[root@fractal-c92e gluster-3.6]# gluster volume status
Status of volume: repvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 192.168.1.246:/zp1/brick1                 49154   Y       17671
Brick 192.168.1.246:/zp2/brick2                 49155   Y       17682
NFS Server on localhost                         2049    Y       17696
Self-heal Daemon on localhost                   N/A     Y       17701

Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks

[root@fractal-c92e gluster-3.6]# gluster volume info
Volume Name: repvol
Type: Replicate
Volume ID: d4f992b1-1393-43b8-9fda-2e2b6e3b5039
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.1.246:/zp1/brick1
Brick2: 192.168.1.246:/zp2/brick2
Options Reconfigured:
storage.health-check-interval: 30

[root@fractal-c92e gluster-3.6]# zpool status zp2
  pool: zp2
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: none requested
config:

        NAME    STATE     READ WRITE CKSUM
        zp2     UNAVAIL      0     0     0  insufficient replicas
          sdb   UNAVAIL      0     0     0

errors: 2 data errors, use '-v' for a list

Thanks,
Kiran.
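For completeness, the test setup described above can be reproduced with something like the following; a sketch, where 'force' is assumed to be required because both replicas live on the same node:

# gluster volume create repvol replica 2 192.168.1.246:/zp1/brick1 192.168.1.246:/zp2/brick2 force
# gluster volume start repvol
# gluster volume set repvol storage.health-check-interval 30
# mount -t glusterfs 192.168.1.246:/repvol /mnt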