Re: [Gluster-users] md5sum-s of file on different nodes are different.
Hello.

I saw different md5sums with version 2.0.6 on node ftp2 at [2009-08-25 ~18:00]. I had read about this bug in the Gluster changelog before updating. I then updated my software and saw different md5sums again (you can see some errors in -etc-glusterfs-glusterfs-server.vol-ftp2.log at 2009-08-25 ~18:00). At that time I was already using 2.0.6.

Ilya.

Pavan Vilas Sondur wrote:
> Hi Ilya,
> The logfiles reveal that you're running version 2.0.4. We've had a similar
> corruption issue reported, and it is fixed in the latest release - Bug 126:
> http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=126
> Please use the 2.0.6 version and let us know if this problem recurs.
>
> Pavan

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
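Editorial sketch: the check Ilya describes (comparing md5sums of the same file on different nodes) can be scripted. The following is a minimal Python sketch, assuming the copies are reachable as ordinary directories (for example, each server's export directory mounted or synced locally). The function names `md5_of` and `find_mismatches` are hypothetical and not part of GlusterFS.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(copy_roots):
    """Compare every file under the first root with its counterparts.

    Returns {relative_path: {root: digest or None}} for each file whose
    copies differ, or which is missing from one of the other roots.
    """
    roots = [Path(r) for r in copy_roots]
    mismatches = {}
    for src in sorted(roots[0].rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(roots[0])
        digests = {}
        for root in roots:
            candidate = root / rel
            digests[str(root)] = md5_of(candidate) if candidate.is_file() else None
        if len(set(digests.values())) > 1:
            mismatches[str(rel)] = digests
    return mismatches
```

Only files present under the first root are examined, so a file that exists solely on another copy would go unnoticed; running the comparison once per root as the first argument would cover that case.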
[Gluster-users] Strange file permission
Hi,

I have 4 server nodes set up, distributed over 2 replicated pairs.

  volume client1a
    type protocol/client
    option transport-type tcp/client
    option remote-host gs1
    option remote-port 7001
    option remote-subvolume brick
  end-volume

  volume client2a
    type protocol/client
    option transport-type tcp/client
    option remote-host gs2
    option remote-port 7001
    option remote-subvolume brick
  end-volume

  volume client1b
    type protocol/client
    option transport-type tcp/client
    option remote-host gs1
    option remote-port 7002
    option remote-subvolume brick
  end-volume

  volume client2b
    type protocol/client
    option transport-type tcp/client
    option remote-host gs2
    option remote-port 7002
    option remote-subvolume brick
  end-volume

  volume afr1
    type cluster/replicate
    subvolumes client1a client2a
  end-volume

  volume afr2
    type cluster/replicate
    subvolumes client1b client2b
  end-volume

  volume distribute
    type cluster/distribute
    subvolumes afr1 afr2
  end-volume

I recently noticed that some files are missing. When I check the nodes client1a and client1b, or client2a and client2b, I see that on one node the files appear as:

  -T 1 root root 0 2009-08-25 16:25 DSC01927.JPG
  -T 1 root root 0 2009-08-25 16:26 DSC01929.JPG
  -T 1 root root 0 2009-08-25 16:26 DSC01931.JPG
  -T 1 root root 0 2009-08-25 16:25 DSC01942.JPG
  -T 1 root root 0 2009-08-25 16:25 DSC01944.JPG
  -T 1 root root 0 2009-08-25 16:25 DSC01946.JPG
  -T 1 root root 0 2009-08-25 16:26 DSC01915.JPG
  -T 1 root root 0 2009-08-25 16:26 DSC01905.JPG
  -T 1 root root 0 2009-08-25 16:26 DSC01907.JPG

But on the other node they are fine.

That is one problem. The other is that, since files are distributed between client(1,2)a and client(1,2)b, why do the files appear on both servers? Distribute should only place a file on one node or the other, not both.

Regards,
Simon
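Editorial sketch: the entries Simon lists are zero-length files whose mode shows only a "T" (sticky bit). To see how many such entries a backend directory holds, a small Python walk is enough. This is a sketch under assumptions: `find_sticky_placeholders` is a hypothetical helper, and the reading that these are pointer files created by the distribute translator is an assumption that nobody in this thread confirms.

```python
import os
import stat


def find_sticky_placeholders(brick_root):
    """Yield regular files under brick_root that are empty and have the sticky bit set."""
    for dirpath, _dirnames, filenames in os.walk(brick_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            is_regular = stat.S_ISREG(st.st_mode)
            has_sticky = bool(st.st_mode & stat.S_ISVTX)
            if is_regular and st.st_size == 0 and has_sticky:
                yield path
```

Running it against each server's export directory and comparing the resulting lists would show whether the same names appear as real files on one replicated pair and as empty sticky entries on the other.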
Re: [Gluster-users] Replication not working on server hang
Original Message
> Date: Mon, 31 Aug 2009 10:58:41 +1000 (EST)
> From: "Jeff Evans"
> To: da...@ols.es
> CC: gluster-users@gluster.org
> Subject: Re: [Gluster-users] Replication not working on server hang

> Hola David,
>
> > Maybe we could try to see if all the ones experiencing this problem
> > have something in common.
>
> Agreed.
>
> > In our case the hanged server is:
> >
> > Dell PE2900
> > 2 x x5...@3.33 8Gb RAM
> > SAS 6/iR Integrated RAID Controller
> > 7 x SEAGATE ST31000640SS
> > 1 x SEAGATE ST3300656SS
> > Debian testing
> > Kernel 2.6.26-2
> >
> > server hanged when writing to a unified volume (7 x 1Tb +
> > namespace and system on the ST3300656SS)
>
> We have:
>
> IBM x3650
> 2 x x5...@3.16 32Gb RAM
> SATA Integrated RAID Controller
> 4 X 1TB SATA Hitachi HUA72101
> RHEL 5.3
> Kernel 2.6.18-128.4.1.el5xen
> Glusterfs 2.0.3 w/ patch 943
>
> Couldn't be more different really!
>
> Server hangs when building software on a 100GB replicated volume,
> mounted with direct-io-mode=disabled.
>
> I have found that building:
>
> http://mirror.cs.wisc.edu/pub/mirrors/ghost/AFPL/GhostPCL/ghostpcl_1.40.tar.bz2
> Reliably produces the hang in my case.
> Even just a grep -R of the source gives me the dreaded hang.

I have now tested it on my GlusterFS with XFS underneath and it works without issues:

---
uranos test # wget http://mirror.cs.wisc.edu/pub/mirrors/ghost/AFPL/GhostPCL/ghostpcl_1.40.tar.bz2
--2009-08-31 03:49:06-- http://mirror.cs.wisc.edu/pub/mirrors/ghost/AFPL/GhostPCL/ghostpcl_1.40.tar.bz2
Resolving mirror.cs.wisc.edu... 128.105.103.12
Connecting to mirror.cs.wisc.edu|128.105.103.12|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10093425 (9.6M) [application/x-tar]
Saving to: `ghostpcl_1.40.tar.bz2'
100%[=>] 10,093,425 289K/s in 35s
2009-08-31 03:49:42 (278 KB/s) - `ghostpcl_1.40.tar.bz2' saved [10093425/10093425]
uranos test # time tar xjf ghostpcl_1.40.tar.bz2
real 1m32.895s
user 0m4.520s
sys 0m0.720s
uranos test # time echo $(grep -iR "test" ghostpcl_1.40/ | wc -l)
3180
real 0m9.776s
user 0m0.070s
sys 0m0.310s
uranos test #
---

> Talk of XFS being stable is encouraging me to give it a shot.
>
> XFS isn't shipped with RHEL 5.3, but then neither is FUSE! (both
> should be in 5.4 though, finally).
>
> Thanks, Jeff.

Steve
Re: [Gluster-users] Replication not working on server hang
Hola David,

> Maybe we could try to see if all the ones experiencing this problem
> have something in common.

Agreed.

> In our case the hanged server is:
>
> Dell PE2900
> 2 x x5...@3.33 8Gb RAM
> SAS 6/iR Integrated RAID Controller
> 7 x SEAGATE ST31000640SS
> 1 x SEAGATE ST3300656SS
> Debian testing
> Kernel 2.6.26-2
>
> server hanged when writing to a unified volume (7 x 1Tb +
> namespace and system on the ST3300656SS)

We have:

IBM x3650
2 x x5...@3.16 32Gb RAM
SATA Integrated RAID Controller
4 X 1TB SATA Hitachi HUA72101
RHEL 5.3
Kernel 2.6.18-128.4.1.el5xen
Glusterfs 2.0.3 w/ patch 943

Couldn't be more different really!

Server hangs when building software on a 100GB replicated volume, mounted with direct-io-mode=disabled.

I have found that building:

http://mirror.cs.wisc.edu/pub/mirrors/ghost/AFPL/GhostPCL/ghostpcl_1.40.tar.bz2

Reliably produces the hang in my case. Even just a grep -R of the source gives me the dreaded hang.

Talk of XFS being stable is encouraging me to give it a shot.

XFS isn't shipped with RHEL 5.3, but then neither is FUSE! (both should be in 5.4 though, finally).

Thanks, Jeff.
Re: [Gluster-users] Replication not working on server hang
Original Message
> Date: Sun, 30 Aug 2009 21:52:17 +0200
> From: Jasper van Wanrooy - Chatventure
> To: gluster-users
> Subject: Re: [Gluster-users] Replication not working on server hang

> Hi,

Hello

> I have been following the discussions about server hangs on the side
> for the last few weeks. However, I did quite a few stress tests on our
> test systems, but I'm unable to reproduce the hangs. The only real
> difference I see is that we are using the XFS filesystem. Does anyone
> have experience with that?

I use XFS as well and can't reproduce any issues with it when using anything >= GlusterFS 2.0.4. Older 2.0.x releases of GlusterFS were ultra unstable for me, but starting from 2.0.4 things seem to be getting better. Currently I am using 2.1.0git in production for serving web pages and things work flawlessly. If it continues like that, I am going to try again to move my mail storage onto GlusterFS, but not in the next 2 to 3 weeks.

Anyway... GlusterFS and XFS = no hangs at all for me. Crashing GlusterFS? Yes! Hangs? No!

If you use XFS, be sure not to use a kernel from the 2.6.29 or 2.6.30 series, as they have a bug with XFS. There are patches for 2.6.29 and 2.6.30, but none of them is included in the mainline kernel. Maybe the final 2.6.31 kernel will fix the issue? RC8, however, still has the same issue as 2.6.29/2.6.30.

> Kind Regards,
>
> Jasper

Steve
Re: [Gluster-users] Bonnie crash
Does no one else have a similar problem? Can someone let me know if there is another (better) way of getting performance numbers?

Josh.

Hiren Joshi wrote:
> Now this is weird, looking at the log while it happened:
>
> [2009-08-25 01:01:58] E [client-protocol.c:437:client_ping_timer_expired] glust1b_3: Server 192.168.4.51:6996 has not responded in the last 10 seconds, disconnecting.
> [2009-08-25 01:01:58] E [client-protocol.c:437:client_ping_timer_expired] glust1a_3: Server 127.0.0.1:6996 has not responded in the last 10 seconds, disconnecting.
> [2009-08-25 01:01:58] E [client-protocol.c:437:client_ping_timer_expired] glust1a_3: Server 127.0.0.1:6996 has not responded in the last 10 seconds, disconnecting.
> [2009-08-25 01:01:58] E [client-protocol.c:437:client_ping_timer_expired] glust1b_3: Server 192.168.4.51:6996 has not responded in the last 10 seconds, disconnecting.
> [2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1b_3: forced unwinding frame type(1) op(LOOKUP)
> [2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1b_3: forced unwinding frame type(1) op(STATFS)
> [2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1b_3: forced unwinding frame type(2) op(PING)
> [2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1b_3: forced unwinding frame type(1) op(XATTROP)
> [2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1b_3: forced unwinding frame type(2) op(PING)
> [2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1b_3: forced unwinding frame type(3) op(RELEASE)
> [2009-08-25 01:01:58] N [client-protocol.c:6246:notify] glust1b_3: disconnected
> [2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1a_3: forced unwinding frame type(1) op(LOOKUP)
> [2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1a_3: forced unwinding frame type(1) op(STATFS)
> [2009-08-25 01:01:58] W [fuse-bridge.c:1841:fuse_statfs_cbk] glusterfs-fuse: 167604474: ERR => -1 (Transport endpoint is not connected)
> [2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1a_3: forced unwinding frame type(2) op(PING)
> [2009-08-25 01:01:58] W [fuse-bridge.c:1841:fuse_statfs_cbk] glusterfs-fuse: 167604479: ERR => -1 (Transport endpoint is not connected)
> [2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1a_3: forced unwinding frame type(1) op(XATTROP)
> [2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1a_3: forced unwinding frame type(2) op(PING)
> [2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1a_3: forced unwinding frame type(3) op(RELEASE)
> [2009-08-25 01:01:58] N [client-protocol.c:6246:notify] glust1a_3: disconnected
> [2009-08-25 01:01:58] E [afr.c:2228:notify] mirror1_3: All subvolumes are down. Going offline until atleast one of them comes back up.
> [2009-08-25 01:01:58] W [fuse-bridge.c:1841:fuse_statfs_cbk] glusterfs-fuse: 167604480: ERR => -1 (Transport endpoint is not connected)
> [2009-08-25 01:01:58] W [fuse-bridge.c:1841:fuse_statfs_cbk] glusterfs-fuse: 167604481: ERR => -1 (Transport endpoint is not connected)
> [2009-08-25 01:01:58] W [fuse-bridge.c:395:fuse_entry_cbk] glusterfs-fuse: 167604483: MKDIR() /test/Bonnie.24759 => -1 (Transport endpoint is not connected)
>
> Perhaps a network problem?
>
> > -----Original Message-----
> > From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of Hiren Joshi
> > Sent: 28 August 2009 14:48
> > To: gluster-users@gluster.org
> > Subject: [Gluster-users] Bonnie crash
> >
> > Hello all,
> >
> > I'm using gluster 2.0.4 and bonnie++ 1.96, and I can't get the test to complete.
> >
> > bonnie++ -u 99:99 -d /home/webspace_glust/test/
> > Using uid:99, gid:99.
> > Writing a byte at a time...done
> > Writing intelligently...done
> > Rewriting...done
> > Reading a byte at a time...done
> > Reading intelligently...done
> > start 'em...done...done...done...done...done...
> > Create files in sequential order...Can't make directory ./Bonnie.24759
> > Cleaning up test directory after error.
> > Bonnie: drastic I/O error (rmdir): No such file or directory
> >
> > I can't see what's wrong; a quick google yielded very little.
> >
> > Any pointers appreciated.
> >
> > Josh.
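Editorial sketch: the client log quoted above follows a regular `[timestamp] LEVEL [file:line:function] translator: message` layout, so it can be summarised mechanically, for example to see which subvolumes hit the 10-second ping timeout. The Python below is a sketch that assumes every line of interest matches that layout; `LOG_LINE`, `parse_line` and `ping_timeouts` are hypothetical names, not GlusterFS tooling.

```python
import re
from collections import Counter

# [timestamp] LEVEL [source.c:lineno:function] translator: message
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<source>[^:\]]+):(?P<lineno>\d+):(?P<function>[^\]]+)\] "
    r"(?P<translator>[^:]+): (?P<message>.*)$"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def ping_timeouts(lines):
    """Count ping-timer expiries per translator (client subvolume) name."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["function"] == "client_ping_timer_expired":
            counts[entry["translator"]] += 1
    return counts
```

Fed the excerpt above, this would report two expiries each for glust1a_3 and glust1b_3, which matches the later "All subvolumes are down" message from mirror1_3.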
Re: [Gluster-users] Replication not working on server hang
Hi,

I have been following the discussions about server hangs on the side for the last few weeks. However, I did quite a few stress tests on our test systems, but I'm unable to reproduce the hangs. The only real difference I see is that we are using the XFS filesystem. Does anyone have experience with that?

Kind Regards,

Jasper
Re: [Gluster-users] Replication not working on server hang
On 08/30/2009 04:00 AM, Anand Avati wrote:
>> I'm wondering if there's some way for glusterfs to detect the flaws of the
>> underlying operating system. I believe there's no bug-free file systems in
>> the universe, so I believe it is the job of the glusterfs developer to
>> specify which underlying filesystem is tested and supported. It's not good
>> to simply say that glusterfs works on all real-world approximations to an
>> imaginary bug-free posix filesystem.
>
> I would be genuinely interested to know about another project which is
> geared up to be resilient against kernel hangs so that we can borrow
> some ideas on how to reliably detect kernel soft lockups or syscall
> hangs. As far as I know, even mature projects like Apache have not
> bothered fixing such hangs (or even detecting this kind of underlying
> OS flaw).

There are projects that require kernel patches to work properly (for example, the OpenVZ project), and most Linux distributions (e.g. RedHat) maintain a set of kernel patches. Vendors may provide workarounds for known kernel problems - for example, the dovecot people go through various means to flush the NFS or FUSE cache (including for GlusterFS) before doing certain operations, and these are done using non-portable operations.

The summary is that relying on the Linux kernel (or any kernel, for that matter) to be correct in all situations has limits. Sometimes it is necessary to track down the problem, correct it, and provide a patch. This can involve discussions on linux-dev leading to it finally being corrected upstream, so that a patch no longer needs to be provided.

I'm not saying it has to go this far - but unless the problem is understood, it shouldn't be written off either. If GlusterFS can issue a set of operations that reproducibly causes ext3 to freeze, this is a concern for both the ext3 developers/maintainers and the GlusterFS developers/maintainers, and it is a joint problem to solve, since ext3 is so common.

As for detecting lockups or hangs - I'm not aware of this being done in the userspace area, but it could be argued that this is a somewhat artificial comparison, because GlusterFS is, at its base, a network file system, and it *is* common for network file systems (such as NFS) to deal with problems with the underlying volumes. GlusterFS uses FUSE as a novel approach to avoiding the problem entirely - but if GlusterFS from user space can cause the backend storage volume to freeze up, even from outside GlusterFS, then it seems like the user space barrier is insufficient.

For all of the above, I am assuming that GlusterFS is being used to do something which ends up locking up the entire volume, even from outside GlusterFS. If anybody is experiencing GlusterFS-*only* problems, where the underlying volume is still accessible from another process, then this would be a different problem, probably GlusterFS specific.

Cheers,
mark

--
Mark Mielke
Re: [Gluster-users] Replication not working on server hang
Hi

> The calls (as you have seen in the logs as well) which are hanging are
> lookup calls, which have to be sent to all subvolumes to ensure all the
> copies are in sync.

One thing I could not understand: if these calls are sent to all servers to keep files in sync, why will replicate only self-heal when the files exist on the first subvolume, and not when they are missing from the first subvolume?

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Known Issues : Replicate will only self-heal if the files exist on the first subvolume. Server A-> B works, Server A <-B does not work.
On Sat, 29 Aug 2009 03:46:04 +0200 "supp...@citytoo.com" wrote:

> Hello,
>
> Known Issues: Replicate will only self-heal if the files exist on the first
> subvolume. Server A -> B works, Server A <- B does not work.
>
> When will this problem be fixed? It is very important.
>
> Ben
>
> Regards

Hi Ben,

really, don't push too hard in this direction, because this is easily solvable by running find on server B and stat'ing the file list on server A. You may call that inconvenient, but at least there is a trivial solution.

--
Regards,
Stephan
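Editorial sketch: Stephan's workaround (list the files on server B, then stat each of them on server A) can be written as a short Python walk. This sketch assumes that "stat'ing on server A" means stat'ing the same relative path through a GlusterFS client mount, and that both the backend export directory and the mount point are visible from the machine running the script. `stat_through_mount` is a hypothetical helper name.

```python
import os


def stat_through_mount(backend_root, mount_root):
    """Walk backend_root and lstat each entry's counterpart under mount_root.

    Returns (visited, failed): the number of successful lstat calls and a
    list of (path, errno) for the entries that could not be stat'ed.
    """
    visited, failed = 0, []
    for dirpath, dirnames, filenames in os.walk(backend_root):
        rel_dir = os.path.relpath(dirpath, backend_root)
        for name in dirnames + filenames:
            target = os.path.normpath(os.path.join(mount_root, rel_dir, name))
            try:
                os.lstat(target)  # the lookup itself is the point; the result is discarded
                visited += 1
            except OSError as exc:
                failed.append((target, exc.errno))
    return visited, failed
```

The walk only reads the backend tree and never writes to it; any repair would come from the lookups issued through the mount point.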
Re: [Gluster-users] Replication not working on server hang
On Sun, 30 Aug 2009 01:00:13 -0700 Anand Avati wrote:

> > I'm wondering if there's some way for glusterfs to detect the flaws of the
> > underlying operating system. I believe there's no bug-free file systems in
> > the universe, so I believe it is the job of the glusterfs developer to
> > specify which underlying filesystem is tested and supported. It's not good
> > to simply say that glusterfs works on all real-world approximations to an
> > imaginary bug-free posix filesystem.
>
> I would be genuinely interested to know about another project which is
> geared up to be resilient against kernel hangs so that we can borrow
> some ideas on how to reliably detect kernel soft lockups or syscall
> hangs. As far as I know, even mature projects like Apache have not
> bothered fixing such hangs (or even detecting this kind of underlying
> OS flaw).

Apache is not software whose primary use is to overcome hardware (and software) issues that lead to offline filesystems. You cannot compare two applications with totally different usage patterns.

And, just to say that clearly, nobody expects you to _solve_ or fix a hang. The users only expect glusterfs to _recognise_ a problem and just shut down. It is far better to shut down without a real problem than to continue while having one and hang. The first leads to more work at most, but the second leads to an offline service. And that's exactly why we are all here: to prevent an offline file service.

> Avati

--
Regards,
Stephan
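Editorial sketch: "recognise a problem and just shut down" can be approximated from user space by probing the backend filesystem in a helper thread and giving up after a deadline. The Python below is a sketch of that idea only, not how GlusterFS does or will do it; `probe_backend` and its return values are hypothetical.

```python
import os
import threading


def probe_backend(path, timeout, probe=os.statvfs):
    """Return 'up', 'error' or 'hung' depending on how probe(path) behaves.

    The probe runs in a daemon thread so that a call stuck in the kernel
    cannot block the caller; the caller merely stops waiting after
    `timeout` seconds and reports 'hung'.
    """
    outcome = {}

    def worker():
        try:
            probe(path)
            outcome["status"] = "up"
        except OSError:
            outcome["status"] = "error"

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return "hung"
    return outcome.get("status", "error")
```

A thread blocked in a hung system call cannot be cancelled, so every "hung" verdict leaves one stuck thread behind. A monitor built on this sketch should stop probing a path once it has been declared hung, which fits Stephan's preference for shutting down over continuing.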
Re: [Gluster-users] Replication not working on server hang
Hi

>> This backend fs hang can happen only because of a kernel bug.
>
> That is indeed false. The fs hang can well have simple hardware reasons,
> too.

Maybe it could also be due to some sort of wrong access to the filesystem. What is clear is that the soft lockup itself is a kernel bug. The open questions are what exactly is causing this bug (file system, controller driver, hardware, the kernel itself, ...) and why glusterfs triggers it while direct operations on the ext3 file system, or through nfs, do not.

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
Hi

>> The discussion in this thread is about those situations where the server
>> (machine hosting the storage/posix volume) hangs the backend filesystem
>> (verified by kernel console logs) and that in turn results in the
>> mountpoint hang.
>
> That seems to be the case in Stephan's situation, yes, as we have
> evidence from reiserFS. What evidence have we in the ext3 cases?

Just searching on the net, I found similar cases that were due to the SATA driver (although in our case all disks are SAS), so the problem could also be due to the disk driver or to some other piece of the system. It is very unlikely that both reiserfs and ext3 have a bug that produces these hangs.

Maybe we could try to see if all the ones experiencing this problem have something in common. In our case the hanged server is:

Dell PE2900
2 x x5...@3.33 8Gb RAM
SAS 6/iR Integrated RAID Controller
7 x SEAGATE ST31000640SS
1 x SEAGATE ST3300656SS
Debian testing
Kernel 2.6.26-2

server hanged when writing to a unified volume (7 x 1Tb + namespace and system on the ST3300656SS)

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
On Sat, 29 Aug 2009 18:23:23 -0700 Anand Avati wrote:

> This backend fs hang can happen only because of a kernel bug.

That is indeed false. The fs hang can well have simple hardware reasons, too. In fact it is a good idea, and defensive programming style, not to count on everybody being perfect - just as on the street you should not count on the drivers of other cars being perfect either. Let's say your favourite hd controller just died halfway: you cannot blame the kernel for keeping networking up while everything fs-related just blocks.

--
Regards,
Stephan
Re: [Gluster-users] Replication not working on server hang
Hi Avati,

I'm experiencing complete system-wide hangs exactly as David has mentioned.

> The discussion in this thread is about those situations where the server
> (machine hosting the storage/posix volume) hangs the backend filesystem
> (verified by kernel console logs) and that in turn results in the
> mountpoint hang.

That seems to be the case in Stephan's situation, yes, as we have evidence from reiserFS. What evidence have we in the ext3 cases?

> While your symptoms are similar on the client side hanging,

In the case of 144, my systems didn't hang. Maybe I was just lucky. Now that I have disabled read-ahead to work around 144, I am seeing total system hangs. I also saw these hangs back before I used read-ahead (with 1.3).

As I have said, it is like new FDs cannot be allocated, while those already open continue normally. I'm talking about regular ext3 mounts here, not glusterfs ones.

> The discussion thread is about the situation where the server side
> kernel misbehaves and results in glusterfs hanging. The two
> actual problems are quite different.

Perhaps, as I said, it may be coincidence, but when I ran with read-ahead, I didn't get any system hangs, just the core dumps. Now I don't get core dumps any more. I get system-wide hangs.

Thanks, Jeff.
Re: [Gluster-users] Replication not working on server hang
Hi

> As far as I know, even mature projects like Apache have not bothered
> fixing such hangs (or even detecting this kind of underlying OS flaw).

That's right, but Apache is not a fault-tolerant file system. On the other hand, some applications that have to face bugs in other applications offer options to work around them (as dovecot does for some Outlook bugs). From a fault-tolerant file system I would expect that it can at least detect and handle any problem in any of the subsystems involved.

--
Best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
On Sun, Aug 30, 2009 at 01:00:13AM -0700, Anand Avati wrote:

> > I'm wondering if there's some way for glusterfs to detect the flaws of the
> > underlying operating system. I believe there's no bug-free file systems in
> > the universe, so I believe it is the job of the glusterfs developer to
> > specify which underlying filesystem is tested and supported. It's not good
> > to simply say that glusterfs works on all real-world approximations to an
> > imaginary bug-free posix filesystem.
>
> I would be genuinely interested to know about another project which is
> geared up to be resilient against kernel hangs so that we can borrow
> some ideas on how to reliably detect kernel soft lockups or syscall
> hangs. As far as I know, even mature projects like Apache have not
> bothered fixing such hangs (or even detecting this kind of underlying
> OS flaw).

Check out heartbeat, and the rest (perhaps you knew of this):
http://www.linux-ha.org/

cheers
zenaan

--
Homepage: www.SoulSound.net -- Free Australia: www.UPMART.org
Please respect the confidentiality of this email as sensibly warranted.
Re: [Gluster-users] Replication not working on server hang
Hi

> What we have here (kernel lockups and glusterfs on the same machine)
> might not be a co-incidence.

But it could be.

> There might well be a correlation -- but by nature of the problem it is
> not right to treat this as a cause-effect relation with glusterfs being
> the cause.

I think it's also not right to simply discard glusterfs as the cause.

> It is just not right to blame _any_ userspace application for any kind
> of kernel lockups or hangs.

Well, I'm not saying that this is glusterfs' fault. What I'm saying is that it is very likely that glusterfs is at least triggering this fault.

> So any hang or lockup in the kernel can only be caused by a bug in
> itself, which could possibly be triggered by a specific user application.

Maybe, but don't you feel that this needs to be investigated in order to know what is really happening?

> What we will be fixing is failing over to other machines when the
> backend FS hangs. The reason why this was not a priority (so far
> atleast) is because a kernel is a trusted piece of software in the
> system, and when you are having a kernel which has a bug in the fs, you
> should just upgrade to a newer kernel.

Yes, but right now there is no evidence that this is a kernel bug. From a user's point of view, if this did not happen when using nfs and does happen when using glusterfs, the most evident solution is to switch back to nfs (like you, we usually prefer to trust kernel stability over application stability) and not to do any kernel upgrade unless there is evidence that this is a kernel bug (as a kernel upgrade could mean having to upgrade many other pieces of software that were working fine and that would need to be tested again).

> What we promise to fix is a way to (as best as possible) somehow
> translate a backend FS hang into a "subvolume down" status and consider
> that subvolume to be down. After that, you will _still_ continue to face
> kernel hangs and lockups and just glusterfs will stop hanging. Your
> machines would still remain locked up.

That's great!

--
Thanx & best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
Re: [Gluster-users] Replication not working on server hang
Hi

>> a) documentation says "All operations that do not modify the file or
>> directory are sent to all the subvolumes and the first successful reply
>> is returned to the application", why is blocking then ? it's suposed
>> that the reply from the non blocked server will come first and nothing
>> will block, but clients are blocking on a simple ls operation
>
> The calls (as you have seen in the logs as well) which are hanging are
> lookup calls, which have to be sent to all subvolumes to ensure all the
> copies are in sync.

OK, then the simplest fix would be to add a timeout for lookup calls, although I would prefer to optionally also have the first reply to the lookup sent to the application, and then wait in the background for the other replies so that gluster can keep files in sync. This would eliminate the hang and also make the system more responsive.

BTW, will switching off some of the self-heal options in the client make glusterfs use only the first reply received to the lookup call?

--
Thanx & best regards ...

David Saez Padros http://www.ols.es
On-Line Services 2000 S.L. telf +34 902 50 29 75
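Editorial sketch: David's proposal (send the lookup to every subvolume, hand the first successful reply to the application, and bound the wait with a timeout) can be illustrated in Python with a thread pool. This only illustrates the proposal and is not GlusterFS code; `first_reply` and the shape of its `lookups` argument (a non-empty dict mapping a subvolume name to a zero-argument callable) are hypothetical.

```python
import concurrent.futures as cf


def first_reply(lookups, timeout):
    """Run all lookups in parallel and return (name, result) of the first success.

    Raises RuntimeError if no lookup succeeds within `timeout` seconds.
    Slower lookups are not cancelled; they keep running in the background.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(lookups))
    futures = {pool.submit(fn): name for name, fn in lookups.items()}
    try:
        for future in cf.as_completed(futures, timeout=timeout):
            if future.exception() is None:
                return futures[future], future.result()
    except cf.TimeoutError:
        pass  # deadline reached with at least one lookup still pending
    finally:
        pool.shutdown(wait=False)  # do not wait for stragglers
    raise RuntimeError("no subvolume replied successfully within the timeout")
```

For example, `first_reply({"client1a": lookup_a, "client2a": lookup_b}, timeout=10)` would return as soon as either hypothetical callable returns. The background half of the proposal (collecting the remaining replies to keep copies in sync) is left out of this sketch.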
Re: [Gluster-users] Replication not working on server hang
> I'm wondering if there's some way for glusterfs to detect the flaws of the > underlying operating system. I believe there's no bug-free file systems in > the universe, so I believe it is the job of the glusterfs developer to > specify which underlying filesystem is tested and supported. It's not good > to simply say that glusterfs works on all real-world approximations to an > imaginary bug-free posix filesystem. I would be genuinely interested to know about another project which is geared up to be resilient against kernel hangs so that we can borrow some ideas on how to reliably detect kernel soft lockups or syscall hangs. As far as I know, even mature projects like Apache have not bothered fixing such hangs (or even detecting this kind of underlying OS flaw). Avati ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users