I am not seeing the message below in any log file under the /var/log/glusterfs directory or its subdirectories.
    health-check failed, going down

On Fri, Oct 31, 2014 at 3:16 PM, Kiran Patil <ki...@fractalio.com> wrote:
> I set the ZFS pool failmode to continue, which should disable only writes
> and not reads, as explained below:
>
>   failmode=wait | continue | panic
>
>     Controls the system behavior in the event of catastrophic pool
>     failure. This condition is typically a result of a loss of
>     connectivity to the underlying storage device(s) or a failure of
>     all devices within the pool. The behavior of such an event is
>     determined as follows:
>
>     wait      Blocks all I/O access until the device connectivity is
>               recovered and the errors are cleared. This is the default
>               behavior.
>
>     continue  Returns EIO to any new write I/O requests but allows reads
>               to any of the remaining healthy devices. Any write requests
>               that have yet to be committed to disk would be blocked.
>
>     panic     Prints out a message to the console and generates a system
>               crash dump.
>
> Now I rebuilt glusterfs master and tried to verify that a failed drive
> results in a failed brick, which should in turn kill the brick process,
> but the brick is not going offline.
>
> # gluster volume status
> Status of volume: repvol
> Gluster process                                 Port    Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.1.246:/zp1/brick1                 49152   Y       2400
> Brick 192.168.1.246:/zp2/brick2                 49153   Y       2407
> NFS Server on localhost                         2049    Y       30488
> Self-heal Daemon on localhost                   N/A     Y       30495
>
> Task Status of Volume repvol
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> The /var/log/gluster/mnt.log output:
>
> [2014-10-31 09:18:15.934700] W [rpc-clnt-ping.c:154:rpc_clnt_ping_cbk]
>   0-repvol-client-1: socket disconnected
> [2014-10-31 09:18:15.934725] I [client.c:2215:client_rpc_notify]
>   0-repvol-client-1: disconnected from repvol-client-1. Client process
>   will keep trying to connect to glusterd until brick's port is available
> [2014-10-31 09:18:15.935238] I [rpc-clnt.c:1765:rpc_clnt_reconfig]
>   0-repvol-client-1: changing port to 49153 (from 0)
>
> Now if I copy a file to /mnt, it is copied without any hang and the brick
> still shows as online.
>
> Thanks,
> Kiran.
>
> On Tue, Oct 28, 2014 at 3:44 PM, Niels de Vos <nde...@redhat.com> wrote:
>> On Tue, Oct 28, 2014 at 02:08:32PM +0530, Kiran Patil wrote:
>> > The content of the file zp2-brick2.log is at http://ur1.ca/iku0l
>> > ( http://fpaste.org/145714/44849041/ )
>> >
>> > I can't open the file /zp2/brick2/.glusterfs/health_check since it
>> > hangs due to no disk being present.
>> >
>> > Let me know the filename pattern, so that I can find it.
>>
>> Hmm, if there is a hang while reading from the disk, it will not get
>> detected in the current solution. We implemented failure detection on
>> top of the detection that is done by the filesystem. Suspending a
>> filesystem with fsfreeze or similar should probably not be seen as a
>> failure.
>>
>> In your case, it seems that the filesystem suspends itself when the disk
>> went away. I have no idea if it is possible to configure ZFS to not
>> suspend, but to return an error to the reading/writing application.
>> Please check whether such an option exists and test with it.
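
(For reference, the failmode setting mentioned above is a per-pool property
and can be inspected and changed with zpool; zp2 is the pool backing the
second brick in my setup, so the commands were along these lines:

    # zpool get failmode zp2
    # zpool set failmode=continue zp2
)
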
>>
>> If you find such an option, please update the wiki page and recommend
>> enabling it:
>> - http://gluster.org/community/documentation/index.php/GlusterOnZFS
>>
>> Thanks,
>> Niels
>>
>> > On Tue, Oct 28, 2014 at 1:42 PM, Niels de Vos <nde...@redhat.com> wrote:
>> >
>> > > On Tue, Oct 28, 2014 at 01:10:56PM +0530, Kiran Patil wrote:
>> > > > I applied the patches, compiled and installed gluster.
>> > > >
>> > > > # glusterfs --version
>> > > > glusterfs 3.7dev built on Oct 28 2014 12:03:10
>> > > > Repository revision: git://git.gluster.com/glusterfs.git
>> > > > Copyright (c) 2006-2013 Red Hat, Inc. <http://www.redhat.com/>
>> > > > GlusterFS comes with ABSOLUTELY NO WARRANTY.
>> > > > It is licensed to you under your choice of the GNU Lesser
>> > > > General Public License, version 3 or any later version (LGPLv3
>> > > > or later), or the GNU General Public License, version 2 (GPLv2),
>> > > > in all cases as published by the Free Software Foundation.
>> > > >
>> > > > # git log
>> > > > commit 990ce16151c3af17e4cdaa94608b737940b60e4d
>> > > > Author: Lalatendu Mohanty <lmoha...@redhat.com>
>> > > > Date:   Tue Jul 1 07:52:27 2014 -0400
>> > > >
>> > > >     Posix: Brick failure detection fix for ext4 filesystem
>> > > > ...
>> > > > ...
>> > > >
>> > > > I see the messages below.
>> > >
>> > > Many thanks Kiran!
>> > >
>> > > Do you have the messages from the brick that uses the zp2 mountpoint?
>> > >
>> > > There should also be a file with a timestamp of when the last check
>> > > was done successfully. If the brick is still running, this timestamp
>> > > should get updated every storage.health-check-interval seconds:
>> > >   /zp2/brick2/.glusterfs/health_check
>> > >
>> > > Niels
>> > >
>> > > > File /var/log/glusterfs/etc-glusterfs-glusterd.vol.log :
>> > > >
>> > > > The message "I [MSGID: 106005]
>> > > > [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management:
>> > > > Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd."
>> > > > repeated 39 times between [2014-10-28 05:58:09.209419] and
>> > > > [2014-10-28 06:00:06.226330]
>> > > > [2014-10-28 06:00:09.226507] W [socket.c:545:__socket_rwv]
>> > > > 0-management: readv on
>> > > > /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed
>> > > > (Invalid argument)
>> > > > [2014-10-28 06:00:09.226712] I [MSGID: 106005]
>> > > > [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management:
>> > > > Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd.
>> > > > [2014-10-28 06:00:12.226881] W [socket.c:545:__socket_rwv]
>> > > > 0-management: readv on
>> > > > /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed
>> > > > (Invalid argument)
>> > > > [2014-10-28 06:00:15.227249] W [socket.c:545:__socket_rwv]
>> > > > 0-management: readv on
>> > > > /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed
>> > > > (Invalid argument)
>> > > > [2014-10-28 06:00:18.227616] W [socket.c:545:__socket_rwv]
>> > > > 0-management: readv on
>> > > > /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed
>> > > > (Invalid argument)
>> > > > [2014-10-28 06:00:21.227976] W [socket.c:545:__socket_rwv]
>> > > > 0-management: readv on
>> > > >
>> > > > .....
>> > > > .....
>> > > >
>> > > > [2014-10-28 06:19:15.142867] I
>> > > > [glusterd-handler.c:1280:__glusterd_handle_cli_get_volume]
>> > > > 0-glusterd: Received get vol req
>> > > > The message "I [MSGID: 106005]
>> > > > [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management:
>> > > > Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd."
>> > > > repeated 12 times between [2014-10-28 06:18:09.368752] and
>> > > > [2014-10-28 06:18:45.373063]
>> > > > [2014-10-28 06:23:38.207649] W [glusterfsd.c:1194:cleanup_and_exit]
>> > > > (--> 0-: received signum (15), shutting down
>> > > >
>> > > > dmesg output:
>> > > >
>> > > > SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has
>> > > > encountered an uncorrectable I/O failure and has been suspended.
>> > > >
>> > > > SPLError: 7868:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has
>> > > > encountered an uncorrectable I/O failure and has been suspended.
>> > > >
>> > > > SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has
>> > > > encountered an uncorrectable I/O failure and has been suspended.
>> > > >
>> > > > The brick is still online.
>> > > >
>> > > > # gluster volume status
>> > > > Status of volume: repvol
>> > > > Gluster process                             Port    Online  Pid
>> > > > ------------------------------------------------------------------------------
>> > > > Brick 192.168.1.246:/zp1/brick1             49152   Y       4067
>> > > > Brick 192.168.1.246:/zp2/brick2             49153   Y       4078
>> > > > NFS Server on localhost                     2049    Y       4092
>> > > > Self-heal Daemon on localhost               N/A     Y       4097
>> > > >
>> > > > Task Status of Volume repvol
>> > > > ------------------------------------------------------------------------------
>> > > > There are no active volume tasks
>> > > >
>> > > > # gluster volume info
>> > > >
>> > > > Volume Name: repvol
>> > > > Type: Replicate
>> > > > Volume ID: ba1e7c6d-1e1c-45cd-8132-5f4fa4d2d22b
>> > > > Status: Started
>> > > > Number of Bricks: 1 x 2 = 2
>> > > > Transport-type: tcp
>> > > > Bricks:
>> > > > Brick1: 192.168.1.246:/zp1/brick1
>> > > > Brick2: 192.168.1.246:/zp2/brick2
>> > > > Options Reconfigured:
>> > > > storage.health-check-interval: 30
>> > > >
>> > > > Let me know if you need further information.
>> > > >
>> > > > Thanks,
>> > > > Kiran.
>> > > >
>> > > > On Tue, Oct 28, 2014 at 11:44 AM, Kiran Patil <ki...@fractalio.com> wrote:
>> > > >
>> > > > > I changed git fetch git://review.gluster.org/glusterfs to git fetch
>> > > > > http://review.gluster.org/glusterfs and now it works.
>> > > > >
>> > > > > Thanks,
>> > > > > Kiran.
>> > > > >
>> > > > > On Tue, Oct 28, 2014 at 11:13 AM, Kiran Patil <ki...@fractalio.com> wrote:
>> > > > >
>> > > > >> Hi Niels,
>> > > > >>
>> > > > >> I am getting "fatal: Couldn't find remote ref refs/changes/13/8213/9"
>> > > > >> error.
>> > > > >>
>> > > > >> Steps to reproduce the issue.
>> > > > >>
>> > > > >> 1) # git clone git://review.gluster.org/glusterfs
>> > > > >> Initialized empty Git repository in /root/gluster-3.6/glusterfs/.git/
>> > > > >> remote: Counting objects: 84921, done.
>> > > > >> remote: Compressing objects: 100% (48307/48307), done.
>> > > > >> remote: Total 84921 (delta 57264), reused 63233 (delta 36254)
>> > > > >> Receiving objects: 100% (84921/84921), 23.23 MiB | 192 KiB/s, done.
>> > > > >> Resolving deltas: 100% (57264/57264), done.
>> > > > >>
>> > > > >> 2) # cd glusterfs
>> > > > >> # git branch
>> > > > >> * master
>> > > > >>
>> > > > >> 3) # git fetch git://review.gluster.org/glusterfs refs/changes/13/8213/9 && git checkout FETCH_HEAD
>> > > > >> fatal: Couldn't find remote ref refs/changes/13/8213/9
>> > > > >>
>> > > > >> Note: I also tried the above steps on the git repo
>> > > > >> https://github.com/gluster/glusterfs and the result is the same as above.
>> > > > >>
>> > > > >> Please let me know if I missed any steps.
>> > > > >>
>> > > > >> Thanks,
>> > > > >> Kiran.
>> > > > >>
>> > > > >> On Mon, Oct 27, 2014 at 5:53 PM, Niels de Vos <nde...@redhat.com> wrote:
>> > > > >>
>> > > > >>> On Mon, Oct 27, 2014 at 05:19:13PM +0530, Kiran Patil wrote:
>> > > > >>> > Hi,
>> > > > >>> >
>> > > > >>> > I created a replicated volume with two bricks on the same node
>> > > > >>> > and copied some data to it.
>> > > > >>> >
>> > > > >>> > Then I removed the disk which hosts one of the bricks of the
>> > > > >>> > volume.
>> > > > >>> >
>> > > > >>> > storage.health-check-interval is set to 30 seconds.
>> > > > >>> >
>> > > > >>> > I can see that the disk is unavailable using the zpool command
>> > > > >>> > of ZFS on Linux, but gluster volume status still displays the
>> > > > >>> > brick process as running, which should have been shut down by
>> > > > >>> > this time.
>> > > > >>> >
>> > > > >>> > Is this a bug in 3.6, since it is mentioned as a feature
>> > > > >>> > (https://github.com/gluster/glusterfs/blob/release-3.6/doc/features/brick-failure-detection.md),
>> > > > >>> > or am I making a mistake here?
>> > > > >>>
>> > > > >>> The initial detection of brick failures did not work for all
>> > > > >>> filesystems. It may not work for ZFS either. A fix has been
>> > > > >>> posted, but it has not been merged into the master branch yet.
>> > > > >>> Once the change has been merged, it can get backported to 3.6
>> > > > >>> and 3.5.
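
(For anyone else hitting the "Couldn't find remote ref" error in step 3
above: as noted earlier in the thread, switching the fetch URL from git://
to http:// is what worked for me, roughly:

    # git fetch http://review.gluster.org/glusterfs refs/changes/13/8213/9 && git checkout FETCH_HEAD
)
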
>> > > > >>>
>> > > > >>> You may want to test with the patch applied, and add your
>> > > > >>> "+1 Verified" to the change in case it makes it functional for you:
>> > > > >>> - http://review.gluster.org/8213
>> > > > >>>
>> > > > >>> Cheers,
>> > > > >>> Niels
>> > > > >>>
>> > > > >>> > [root@fractal-c92e gluster-3.6]# gluster volume status
>> > > > >>> > Status of volume: repvol
>> > > > >>> > Gluster process                             Port    Online  Pid
>> > > > >>> > ------------------------------------------------------------------------------
>> > > > >>> > Brick 192.168.1.246:/zp1/brick1             49154   Y       17671
>> > > > >>> > Brick 192.168.1.246:/zp2/brick2             49155   Y       17682
>> > > > >>> > NFS Server on localhost                     2049    Y       17696
>> > > > >>> > Self-heal Daemon on localhost               N/A     Y       17701
>> > > > >>> >
>> > > > >>> > Task Status of Volume repvol
>> > > > >>> > ------------------------------------------------------------------------------
>> > > > >>> > There are no active volume tasks
>> > > > >>> >
>> > > > >>> > [root@fractal-c92e gluster-3.6]# gluster volume info
>> > > > >>> >
>> > > > >>> > Volume Name: repvol
>> > > > >>> > Type: Replicate
>> > > > >>> > Volume ID: d4f992b1-1393-43b8-9fda-2e2b6e3b5039
>> > > > >>> > Status: Started
>> > > > >>> > Number of Bricks: 1 x 2 = 2
>> > > > >>> > Transport-type: tcp
>> > > > >>> > Bricks:
>> > > > >>> > Brick1: 192.168.1.246:/zp1/brick1
>> > > > >>> > Brick2: 192.168.1.246:/zp2/brick2
>> > > > >>> > Options Reconfigured:
>> > > > >>> > storage.health-check-interval: 30
>> > > > >>> >
>> > > > >>> > [root@fractal-c92e gluster-3.6]# zpool status zp2
>> > > > >>> >   pool: zp2
>> > > > >>> >  state: UNAVAIL
>> > > > >>> > status: One or more devices are faulted in response to IO failures.
>> > > > >>> > action: Make sure the affected devices are connected, then run 'zpool clear'.
>> > > > >>> >    see: http://zfsonlinux.org/msg/ZFS-8000-HC
>> > > > >>> >   scan: none requested
>> > > > >>> > config:
>> > > > >>> >
>> > > > >>> >         NAME        STATE     READ WRITE CKSUM
>> > > > >>> >         zp2         UNAVAIL      0     0     0  insufficient replicas
>> > > > >>> >           sdb       UNAVAIL      0     0     0
>> > > > >>> >
>> > > > >>> > errors: 2 data errors, use '-v' for a list
>> > > > >>> >
>> > > > >>> > Thanks,
>> > > > >>> > Kiran.
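
P.S. For completeness, this is roughly how I keep an eye on the health-check
timestamp file Niels mentioned while re-testing (assuming the volume name and
brick path from the output above; the interval comes from the
storage.health-check-interval option, 30 seconds in my case):

    # gluster volume set repvol storage.health-check-interval 30
    # stat -c '%y' /zp2/brick2/.glusterfs/health_check

While the brick's health checker is running, the timestamp should advance
roughly every 30 seconds; once the brick process detects a failure and goes
down, it should stop being updated.
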
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel