Re: [Gluster-users] md5sum-s of file on different nodes are different.

2009-08-30 Thread Ilya Galanin

Hello.
   I saw different md5sums on version 2.0.6, on node ftp2, at
[2009-08-25 ~18:00].


   I had also read about this bug in the Gluster changelog before updating.
   I then updated my software and saw different md5sums again
(you can see some errors in -etc-glusterfs-glusterfs-server.vol-ftp2.log
at 2009-08-25 ~18:00).

   At that time I was already using 2.0.6.

Ilya.

Pavan Vilas Sondur wrote:

Hi Ilya,
The logfiles reveal that you're running version 2.0.4. A similar
corruption issue was reported and is fixed in the latest release - Bug 126:
http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=126

Please use version 2.0.6 and let us know if this problem recurs.

Pavan
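
One way to confirm a mismatch like the one reported here is to hash the copy of the file held on each node and compare the digests. The following is a minimal sketch and not part of the original exchange; the function names, and the idea of passing one locally reachable path per replica, are illustrative assumptions.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicas_agree(paths):
    """Hash every path; return (all_equal, {path: digest})."""
    digests = {path: md5_of_file(path) for path in paths}
    return len(set(digests.values())) <= 1, digests
```

Calling `replicas_agree` with the paths of the same file on each backend export shows at a glance which copy differs.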

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Gluster-users] Strange file permission

2009-08-30 Thread Simon Liang
Hi,

 

I have 4 server nodes set up, distributed over 2-replicated-nodes.

volume client1a
  type protocol/client
  option transport-type tcp/client
  option remote-host gs1
  option remote-port 7001
  option remote-subvolume brick
end-volume

volume client2a
  type protocol/client
  option transport-type tcp/client
  option remote-host gs2
  option remote-port 7001
  option remote-subvolume brick
end-volume

volume client1b
  type protocol/client
  option transport-type tcp/client
  option remote-host gs1
  option remote-port 7002
  option remote-subvolume brick
end-volume

volume client2b
  type protocol/client
  option transport-type tcp/client
  option remote-host gs2
  option remote-port 7002
  option remote-subvolume brick
end-volume

volume afr1
  type cluster/replicate
  subvolumes client1a client2a
end-volume

volume afr2
  type cluster/replicate
  subvolumes client1b client2b
end-volume

volume distribute
  type cluster/distribute
  subvolumes afr1 afr2
end-volume

 

I recently noticed that some files are missing. When I check the nodes
client1a and client1b, or client2a and client2b, I notice that on one
node the files appear as:

-T 1 root root   0 2009-08-25 16:25 DSC01927.JPG
-T 1 root root   0 2009-08-25 16:26 DSC01929.JPG
-T 1 root root   0 2009-08-25 16:26 DSC01931.JPG
-T 1 root root   0 2009-08-25 16:25 DSC01942.JPG
-T 1 root root   0 2009-08-25 16:25 DSC01944.JPG
-T 1 root root   0 2009-08-25 16:25 DSC01946.JPG
-T 1 root root   0 2009-08-25 16:26 DSC01915.JPG
-T 1 root root   0 2009-08-25 16:26 DSC01905.JPG
-T 1 root root   0 2009-08-25 16:26 DSC01907.JPG

But on the other node they are fine.
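
Zero-byte entries whose only mode bit is the sticky bit (the trailing `T` in the listing) resemble the pointer ("link") files that cluster/distribute can leave on a backend when a file's data lives on another subvolume. That reading is an assumption here and is not confirmed anywhere in this thread. A minimal sketch for listing such entries in a backend export directory, with hypothetical function names:

```python
import os
import stat


def looks_like_pointer_entry(mode, size):
    """True for a zero-byte regular file whose only mode bit is sticky."""
    return (
        stat.S_ISREG(mode)
        and size == 0
        and stat.S_IMODE(mode) == stat.S_ISVTX
    )


def find_pointer_entries(directory):
    """Return the sorted names in `directory` that match the pattern."""
    matches = []
    for entry in os.scandir(directory):
        info = entry.stat(follow_symlinks=False)
        if looks_like_pointer_entry(info.st_mode, info.st_size):
            matches.append(entry.name)
    return sorted(matches)
```

Run against the export directory on each node, this separates the placeholder entries from files that actually hold data.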

 

That is one problem. The other is that, since files are distributed
between client(1,2)a and client(1,2)b, why are the files appearing on
both servers? Distribute should only copy files to one node or the
other, not both.
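
Reading the volfile above, each replicate pair has one child on gs1 and one on gs2 (afr1 = client1a + client2a, afr2 = client1b + client2b), so whichever pair distribute selects, a copy lands on both hosts. The sketch below models only that topology. The checksum-based `pick_subvolume` is a stand-in chosen for illustration and is not GlusterFS's actual hashing.

```python
import zlib

# Topology copied from the volfile: client name -> (host, port).
CLIENTS = {
    "client1a": ("gs1", 7001),
    "client2a": ("gs2", 7001),
    "client1b": ("gs1", 7002),
    "client2b": ("gs2", 7002),
}
REPLICATE = {
    "afr1": ["client1a", "client2a"],
    "afr2": ["client1b", "client2b"],
}
DISTRIBUTE = ["afr1", "afr2"]


def pick_subvolume(filename):
    """Stand-in for distribute's hashing: one replicate set per name."""
    return DISTRIBUTE[zlib.crc32(filename.encode()) % len(DISTRIBUTE)]


def hosts_holding(filename):
    """Hosts that receive a full copy of the file under this layout."""
    chosen = REPLICATE[pick_subvolume(filename)]
    return sorted({CLIENTS[client][0] for client in chosen})
```

In this model `hosts_holding` returns both gs1 and gs2 for every name. What distribute decides is the port (7001 or 7002), that is, which export on each host receives the file.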

 

Regards,

Simon



Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread Steve

-------- Original Message --------
> Date: Mon, 31 Aug 2009 10:58:41 +1000 (EST)
> From: "Jeff Evans" 
> To: da...@ols.es
> CC: gluster-users@gluster.org
> Subject: Re: [Gluster-users] Replication not working on server hang

> Hola David,
> 
> > Maybe we could try to
> > see if all the ones experiencing this problem have something in
> > common.
> 
> Agreed.
> 
> > In our case the hanged server is:
> >
> > Dell PE2900
> > 2 x x5...@3.33 8Gb RAM
> > SAS 6/iR Integrated RAID Controller
> > 7 x SEAGATE ST31000640SS
> > 1 x SEAGATE ST3300656SS
> > Debian testing
> > Kernel 2.6.26-2
> >
> > server hanged when writing to a unified volume (7 x 1Tb +
> > namespace and system on the ST3300656SS)
> 
> We have:
> 
> IBM x3650
> 2 x x5...@3.16 32Gb RAM
> SATA Integrated RAID Controller
> 4 X 1TB SATA Hitachi HUA72101
> RHEL 5.3
> Kernel 2.6.18-128.4.1.el5xen
> Glusterfs 2.0.3 w/ patch 943
> 
> Couldn't be more different really!
> 
> Server hangs when building software on a 100GB replicated volume,
> mounted with direct-io-mode=disabled.
> 
> I have found that building:
> 
> http://mirror.cs.wisc.edu/pub/mirrors/ghost/AFPL/GhostPCL/ghostpcl_1.40.tar.bz2
> Reliably produces the hang in my case.
> Even just a grep -R of the source gives me the dreaded hang.
> 
I have now tested this on my GlusterFS with XFS underneath, and it works without issues:
---
uranos test # wget http://mirror.cs.wisc.edu/pub/mirrors/ghost/AFPL/GhostPCL/ghostpcl_1.40.tar.bz2
--2009-08-31 03:49:06--  http://mirror.cs.wisc.edu/pub/mirrors/ghost/AFPL/GhostPCL/ghostpcl_1.40.tar.bz2
Resolving mirror.cs.wisc.edu... 128.105.103.12
Connecting to mirror.cs.wisc.edu|128.105.103.12|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10093425 (9.6M) [application/x-tar]
Saving to: `ghostpcl_1.40.tar.bz2'

100%[=>] 10,093,425   289K/s   in 35s

2009-08-31 03:49:42 (278 KB/s) - `ghostpcl_1.40.tar.bz2' saved [10093425/10093425]

uranos test # time tar xjf ghostpcl_1.40.tar.bz2

real    1m32.895s
user    0m4.520s
sys     0m0.720s
uranos test # time echo $(grep -iR "test" ghostpcl_1.40/ | wc -l)
3180

real    0m9.776s
user    0m0.070s
sys     0m0.310s
uranos test #
---


> Talk of XFS being stable is encouraging me to give it a shot.
> 
> XFS isn't shipped with RHEL 5.3, but then neither is FUSE! (both
> should be in 5.4 though, finally).
> 
> Thanks, Jeff.
> 
Steve




Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread Jeff Evans
Hola David,

> Maybe we could try to
> see if all the ones experiencing this problem have something in
> common.

Agreed.

> In our case the hanged server is:
>
> Dell PE2900
> 2 x x5...@3.33 8Gb RAM
> SAS 6/iR Integrated RAID Controller
> 7 x SEAGATE ST31000640SS
> 1 x SEAGATE ST3300656SS
> Debian testing
> Kernel 2.6.26-2
>
> server hanged when writing to a unified volume (7 x 1Tb +
> namespace and system on the ST3300656SS)

We have:

IBM x3650
2 x x5...@3.16 32Gb RAM
SATA Integrated RAID Controller
4 X 1TB SATA Hitachi HUA72101
RHEL 5.3
Kernel 2.6.18-128.4.1.el5xen
Glusterfs 2.0.3 w/ patch 943

Couldn't be more different really!

Server hangs when building software on a 100GB replicated volume,
mounted with direct-io-mode=disabled.

I have found that building:

http://mirror.cs.wisc.edu/pub/mirrors/ghost/AFPL/GhostPCL/ghostpcl_1.40.tar.bz2
Reliably produces the hang in my case.
Even just a grep -R of the source gives me the dreaded hang.

Talk of XFS being stable is encouraging me to give it a shot.

XFS isn't shipped with RHEL 5.3, but then neither is FUSE! (both
should be in 5.4 though, finally).

Thanks, Jeff.


___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread Steve

-------- Original Message --------
> Date: Sun, 30 Aug 2009 21:52:17 +0200
> From: Jasper van Wanrooy - Chatventure 
> To: gluster-users 
> Subject: Re: [Gluster-users] Replication not working on server hang

> Hi,
> 
Hello


> Sideways I'm reading the discussions about server hangs the last few  
> weeks. However, I did quite a few stress tests on our test systems,  
> but I'm unable to reproduce the hangs. The only real difference I see  
> is that we are using the XFS filesystem. Does anyone have experience  
> with that?
> 
I use XFS as well and can't reproduce any issues with it when using
GlusterFS 2.0.4 or later. Older 2.0.x releases of GlusterFS were ultra
unstable for me, but starting from 2.0.4 things seem to be getting better.
Currently I am using 2.1.0git in production for serving web pages and things
work flawlessly. If it continues like that then I am going to try again to
move my mail storage onto GlusterFS. But not in the next 2 to 3 weeks.

Anyway... GlusterFS and XFS = no hangs at all for me.
Crashing GlusterFS? Yes! Hangs? No!

If you use XFS then be sure not to use a kernel from the 2.6.29 or 2.6.30
series, as they have a bug with XFS. There are patches for 2.6.29 and 2.6.30,
but none of them is included in the mainline kernel. Maybe the released
2.6.31 kernel will fix the issue? RC8, however, still has the same issue as
2.6.29/2.6.30.


> Kind Regards,
> 
> Jasper
>
Steve




Re: [Gluster-users] Bonnie crash

2009-08-30 Thread Hiren Joshi

Does no one else have a similar problem?

Can someone let me know if there is another (better) way of getting
performance numbers?


Josh.

Hiren Joshi wrote:

Now this is weird. Looking at the log while it happened:
[2009-08-25 01:01:58] E [client-protocol.c:437:client_ping_timer_expired] glust1b_3: Server 192.168.4.51:6996 has not responded in the last 10 seconds, disconnecting.
[2009-08-25 01:01:58] E [client-protocol.c:437:client_ping_timer_expired] glust1a_3: Server 127.0.0.1:6996 has not responded in the last 10 seconds, disconnecting.
[2009-08-25 01:01:58] E [client-protocol.c:437:client_ping_timer_expired] glust1a_3: Server 127.0.0.1:6996 has not responded in the last 10 seconds, disconnecting.
[2009-08-25 01:01:58] E [client-protocol.c:437:client_ping_timer_expired] glust1b_3: Server 192.168.4.51:6996 has not responded in the last 10 seconds, disconnecting.
[2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1b_3: forced unwinding frame type(1) op(LOOKUP)
[2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1b_3: forced unwinding frame type(1) op(STATFS)
[2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1b_3: forced unwinding frame type(2) op(PING)
[2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1b_3: forced unwinding frame type(1) op(XATTROP)
[2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1b_3: forced unwinding frame type(2) op(PING)
[2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1b_3: forced unwinding frame type(3) op(RELEASE)
[2009-08-25 01:01:58] N [client-protocol.c:6246:notify] glust1b_3: disconnected
[2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1a_3: forced unwinding frame type(1) op(LOOKUP)
[2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1a_3: forced unwinding frame type(1) op(STATFS)
[2009-08-25 01:01:58] W [fuse-bridge.c:1841:fuse_statfs_cbk] glusterfs-fuse: 167604474: ERR => -1 (Transport endpoint is not connected)
[2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1a_3: forced unwinding frame type(2) op(PING)
[2009-08-25 01:01:58] W [fuse-bridge.c:1841:fuse_statfs_cbk] glusterfs-fuse: 167604479: ERR => -1 (Transport endpoint is not connected)
[2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1a_3: forced unwinding frame type(1) op(XATTROP)
[2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1a_3: forced unwinding frame type(2) op(PING)
[2009-08-25 01:01:58] E [saved-frames.c:165:saved_frames_unwind] glust1a_3: forced unwinding frame type(3) op(RELEASE)
[2009-08-25 01:01:58] N [client-protocol.c:6246:notify] glust1a_3: disconnected
[2009-08-25 01:01:58] E [afr.c:2228:notify] mirror1_3: All subvolumes are down. Going offline until atleast one of them comes back up.
[2009-08-25 01:01:58] W [fuse-bridge.c:1841:fuse_statfs_cbk] glusterfs-fuse: 167604480: ERR => -1 (Transport endpoint is not connected)
[2009-08-25 01:01:58] W [fuse-bridge.c:1841:fuse_statfs_cbk] glusterfs-fuse: 167604481: ERR => -1 (Transport endpoint is not connected)
[2009-08-25 01:01:58] W [fuse-bridge.c:395:fuse_entry_cbk] glusterfs-fuse: 167604483: MKDIR() /test/Bonnie.24759 => -1 (Transport endpoint is not connected)


Perhaps a network problem? 
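
To see which subvolumes timed out and when, the ping-timeout entries can be pulled out of the client log. This is a minimal sketch, assuming each log entry sits on a single line (the mail wrapped them) and that the wording matches the excerpt above. The regular expression is derived from that excerpt only, and the function names are hypothetical.

```python
import re
from collections import Counter

PING_TIMEOUT = re.compile(
    r"\[(?P<when>[^\]]+)\] E "
    r"\[client-protocol\.c:\d+:client_ping_timer_expired\] "
    r"(?P<subvolume>\S+): Server (?P<server>\S+) has not responded "
    r"in the last (?P<seconds>\d+) seconds"
)


def ping_timeouts(lines):
    """Yield (timestamp, subvolume, server, seconds) per timeout entry."""
    for line in lines:
        match = PING_TIMEOUT.search(line)
        if match:
            yield (
                match["when"],
                match["subvolume"],
                match["server"],
                int(match["seconds"]),
            )


def timeouts_per_subvolume(lines):
    """Count timeout entries per subvolume name."""
    return Counter(subvolume for _, subvolume, _, _ in ping_timeouts(lines))
```

If both the local (127.0.0.1) and the remote subvolume time out in the same second, as they do in the excerpt, that is a hint to look at load on the machine as well as at the network.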


-----Original Message-----
From: gluster-users-boun...@gluster.org 
[mailto:gluster-users-boun...@gluster.org] On Behalf Of Hiren Joshi

Sent: 28 August 2009 14:48
To: gluster-users@gluster.org
Subject: [Gluster-users] Bonnie crash

Hello all,
 
I'm using gluster 2.0.4 and bonnie++ 1.96, and I can't get the test to
complete.
 
bonnie++ -u 99:99 -d /home/webspace_glust/test/
 
Using uid:99, gid:99.

Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...Can't make directory ./Bonnie.24759
Cleaning up test directory after error.
Bonnie: drastic I/O error (rmdir): No such file or directory

I can't see what's wrong; a quick Google search yielded very little. Any
pointers appreciated.
 
Josh.



Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread Jasper van Wanrooy - Chatventure

Hi,

I have been following the discussions about server hangs from the
sidelines for the last few weeks. However, I did quite a few stress tests
on our test systems, but I'm unable to reproduce the hangs. The only real
difference I see is that we are using the XFS filesystem. Does anyone have
experience with that?


Kind Regards,


Jasper


Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread Mark Mielke

On 08/30/2009 04:00 AM, Anand Avati wrote:

> > I'm wondering if there's some way for glusterfs to detect the flaws of the
> > underlying operating system.  I believe there's no bug-free file systems in
> > the universe, so I believe it is the job of the glusterfs developer to
> > specify which underlying filesystem is tested and supported.  It's not good
> > to simply say that glusterfs works on all real-world approximations to an
> > imaginary bug-free posix  filesystem.
>
> I would be genuinely interested to know about another project which is
> geared up to be resilient against kernel hangs so that we can borrow
> some ideas on how to reliably detect kernel soft lockups or syscall
> hangs. As far as I know, even mature projects like Apache have not
> bothered fixing such hangs (or even detecting this kind of underlying
> OS flaw).

   


There are projects that require kernel patches to work properly (for 
example, the OpenVZ project), and most Linux distributions (i.e. RedHat) 
maintain a set of kernel patches. Vendors may provide work arounds for 
known kernel problems - for example, the dovecot people go through 
various means to flush the NFS or FUSE cache (including for GlusterFS) 
before doing certain operations, and these are done using non-portable 
operations.


In summary, relying on the Linux kernel to be correct in all 
situations (or any kernel for that matter) will have limits. Sometimes, 
it is necessary to track down the problem, correct it, and provide a 
patch. This can involve discussions on linux-dev leading to it finally 
being corrected upstream, and no longer needing to provide a patch. Not 
saying it has to go this far - but unless the problem is understood, it 
shouldn't be written off either. If GlusterFS can issue a set of 
operations that reproducibly causes ext3 to freeze, this is of a concern 
for both the ext3 developers/maintainers and the GlusterFS 
developers/maintainers, and it is a joint problem to solve, since ext3 
is so common.


As for detecting lockups or hangs - I'm not aware of this being done in 
user space, but it could be argued that this is a somewhat artificial 
comparison, because GlusterFS is, at its base, a network file 
system, and it *is* common for network file systems (such as NFS) to 
deal with problems with the underlying volumes. GlusterFS uses FUSE as a 
novel approach to avoiding the problem entirely - but if GlusterFS from 
user space can cause the backend storage volume to freeze up, even from 
outside GlusterFS, then it seems like the user space barrier is 
insufficient.


For all of the above - I am assuming that GlusterFS is being used to do 
something which ends up locking up the entire volume, even from outside 
GlusterFS. If anybody is experiencing GlusterFS *only* problems, where 
the underlying volume is still accessible from another process, then 
this would be a different problem, probably GlusterFS specific.


Cheers,
mark

--
Mark Mielke



Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread David Saez Padros

Hi


> The calls (as you have seen in the logs as well) which are hanging are
> lookup calls, which have to be sent to all subvolumes to ensure all
> the copies are in sync.


One thing I cannot understand: if these calls are sent to all servers
to keep files in sync, why will replicate only self-heal when the files
exist on the first subvolume, and not when they are missing from the
first subvolume?

--
Best regards ...


   David Saez Padros            http://www.ols.es
   On-Line Services 2000 S.L.   telf    +34 902 50 29 75





Re: [Gluster-users] Known Issues : Replicate will only self-heal if the files exist on the first subvolume. Server A-> B works, Server A <-B does not work.

2009-08-30 Thread Stephan von Krawczynski
On Sat, 29 Aug 2009 03:46:04 +0200
"supp...@citytoo.com"  wrote:

> Hello,
> 
>  Known Issues : Replicate will only self-heal if the files exist on the first 
> subvolume. Server A-> B works, Server A <-B does not work.
> 
> When will this problem be fixed? It is very important.
> 
> Ben
> 
> Regards

Hi Ben,

really, don't push too hard in this direction, because this is easily solvable
by running find on server b and stat'ing the file list on server a. You may
call that inconvenient, but at least there is a trivial solution.
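
The workaround described above amounts to listing every path on the second server's backend and stat'ing the same relative path through a GlusterFS mount, so that replicate looks each entry up. Below is a minimal sketch of that idea. The two directory arguments and the function name are placeholders, and whether a lookup actually triggers the heal depends on the GlusterFS version in use.

```python
import os


def stat_listed_paths(backend_root, mount_root):
    """Walk backend_root and lstat the matching path under mount_root.

    Returns (visited, failures): the relative paths that were stat'ed
    successfully and those for which the stat raised OSError.
    """
    visited, failures = [], []
    for current, dirnames, filenames in os.walk(backend_root):
        for name in dirnames + filenames:
            full = os.path.join(current, name)
            relative = os.path.relpath(full, backend_root)
            try:
                os.lstat(os.path.join(mount_root, relative))
                visited.append(relative)
            except OSError:
                failures.append(relative)
    return visited, failures
```

The `failures` list is worth keeping: paths that still cannot be stat'ed through the mount are the ones that did not heal.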

-- 
Regards,
Stephan


Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread Stephan von Krawczynski
On Sun, 30 Aug 2009 01:00:13 -0700
Anand Avati  wrote:

> > I'm wondering if there's some way for glusterfs to detect the flaws of the
> > underlying operating system.  I believe there's no bug-free file systems in
> > the universe, so I believe it is the job of the glusterfs developer to
> > specify which underlying filesystem is tested and supported.  It's not good
> > to simply say that glusterfs works on all real-world approximations to an
> > imaginary bug-free posix  filesystem.
> 
> I would be genuinely interested to know about another project which is
> geared up to be resilient against kernel hangs so that we can borrow
> some ideas on how to reliably detect kernel soft lockups or syscall
> hangs. As far as I know, even mature projects like Apache have not
> bothered fixing such hangs (or even detecting this kind of underlying
> OS flaw).

Apache is not software whose primary use is to overcome hardware (and
software) issues leading to offline filesystems.
You cannot compare two applications with totally different usage patterns.
And, just to say that clearly, nobody expects you to _solve_ or fix a hang.
Users only expect you to _recognise_ a problem and just shut down. It is far
better to shut down without a real problem than to continue while having
one and hang. The first leads to more work at most, but the second leads to
an offline service. And that's exactly why we are all here: to prevent an
offline file service.
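
One way to "recognise" a stuck backend from user space is to run a cheap filesystem call in a helper thread and treat a missed deadline as a failure. This is only a sketch of that idea and not how GlusterFS does or did it; the `probe` callable, the deadline, and the function name are assumptions. A thread blocked inside the kernel cannot be cancelled, so on timeout the helper is simply abandoned.

```python
import os
import threading


def backend_responds(path, deadline_seconds=5.0, probe=os.statvfs):
    """Return True if probe(path) completes within the deadline.

    The probe runs in a daemon thread so that a call stuck in the
    kernel does not block the caller; on timeout it is left behind.
    """
    finished = threading.Event()
    outcome = {}

    def run_probe():
        try:
            probe(path)
            outcome["ok"] = True
        except OSError:
            outcome["ok"] = False
        finally:
            finished.set()

    threading.Thread(target=run_probe, daemon=True).start()
    if not finished.wait(deadline_seconds):
        return False  # no answer in time: treat the backend as down
    return outcome["ok"]
```

A server process could call this periodically on its export directory and shut itself down after a failed probe, which is the "shut down rather than hang" behaviour argued for above.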

> Avati

-- 
Regards,
Stephan


Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread David Saez Padros

Hi


> > This backend fs
> > hang can happen only because of a kernel bug.
>
> That is indeed false. The fs hang can well have simple hardware reasons, too.


Maybe it could also be due to some sort of wrong access to the
filesystem. What is clear is that the soft lockup itself is a kernel
bug. The open questions are what exactly is causing this bug (file
system, controller driver, hardware, kernel itself, ...) and why
glusterfs triggers it while direct operations on the ext3 file
system, or through nfs, do not.

--
Best regards ...


   David Saez Padros            http://www.ols.es
   On-Line Services 2000 S.L.   telf    +34 902 50 29 75





Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread David Saez Padros

HI


> > The discussion in this
> > thread is about those situations where the server (machine
> > hosting the
> > storage/posix volume) hangs the backend filesystem (verified by
> > kernel console logs) and that in turn results in the mountpoint
> > hang.
>
> That seems to be the case in Stephan's situation, yes, as we have
> evidence from reiserFS. What evidence have we in the ext3 cases?


Just searching on the net I found similar cases that were due
to the SATA driver (although in our case all disks are SAS), so
the problem could also be due to the disk driver or to some other
piece of the system. It is very unlikely that both reiserfs and ext3
have a bug that produces these hangs. Maybe we could try to see
if all the ones experiencing this problem have something in common.
In our case the server that hung is:

Dell PE2900
2 x x5...@3.33 8Gb RAM
SAS 6/iR Integrated RAID Controller
7 x SEAGATE ST31000640SS
1 x SEAGATE ST3300656SS
Debian testing
Kernel 2.6.26-2

The server hung when writing to a unified volume (7 x 1Tb +
namespace and system on the ST3300656SS)

--
Best regards ...


   David Saez Padros            http://www.ols.es
   On-Line Services 2000 S.L.   telf    +34 902 50 29 75





Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread Stephan von Krawczynski
On Sat, 29 Aug 2009 18:23:23 -0700
Anand Avati  wrote:

> This backend fs
> hang can happen only because of a kernel bug.

That is indeed false. The fs hang can well have simple hardware reasons, too.
In fact it is a good idea, and defensive programming style, not to count on
everybody being perfect, just as on the street you should not count on the
drivers of other cars being perfect either.

Let's say your favourite HD controller just died halfway: you cannot blame the
kernel for keeping networking up while everything fs-related just blocks.

-- 
Regards,
Stephan


Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread Jeff Evans
Hi Avati,

I'm experiencing complete system-wide hangs exactly as David has
mentioned.

> The discussion in this
> thread is about those situations where the server (machine
> hosting the
> storage/posix volume) hangs the backend filesystem (verified by
> kernel console logs) and that in turn results in the mountpoint
> hang.

That seems to be the case in Stephan's situation, yes, as we have
evidence from reiserFS. What evidence have we in the ext3 cases?

> While your symptoms are similar on the client side hanging,

In the case of 144, my systems didn't hang. Maybe I was just lucky.
Now that I have disabled read-ahead to work around 144, I am seeing
total system hangs. I also saw these hangs back before I used
read-ahead (with 1.3).

As I have said, it is like new FD's cannot be allocated, while those
already open continue normally. I'm talking about regular ext3 mounts
here, not glusterfs ones.

> The discussion thread is about the situation where the server side
> kernel misbehaves and results in glusterfs hanging. The two
> actual problems are quite different.

Perhaps, as I said, it may be coincidence, but when I ran with
read-ahead, I didn't get any system hangs, just the core-dumps.

Now, I don't get core dumps any more. I get system-wide hangs.

Thanks, Jeff.





Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread David Saez Padros

Hi


> As far as I know, even mature projects like Apache have not
> bothered fixing such hangs (or even detecting this kind of underlying
> OS flaw).


That's right, but Apache is not a fault-tolerant file system. On the
other hand, some applications that face bugs in other apps have options
to work around those bugs (like dovecot has for some Outlook bugs).
For a fault-tolerant file system I would expect that it can at least
detect and handle any problem in any of the subsystems involved.

--
Best regards ...


   David Saez Padros            http://www.ols.es
   On-Line Services 2000 S.L.   telf    +34 902 50 29 75





Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread Zenaan Harkness
On Sun, Aug 30, 2009 at 01:00:13AM -0700, Anand Avati wrote:
> > I'm wondering if there's some way for glusterfs to detect the flaws of the
> > underlying operating system.  I believe there's no bug-free file systems in
> > the universe, so I believe it is the job of the glusterfs developer to
> > specify which underlying filesystem is tested and supported.  It's not good
> > to simply say that glusterfs works on all real-world approximations to an
> > imaginary bug-free posix  filesystem.
> 
> I would be genuinely interested to know about another project which is
> geared up to be resilient against kernel hangs so that we can borrow
> some ideas on how to reliably detect kernel soft lockups or syscall
> hangs. As far as I know, even mature projects like Apache have not
> bothered fixing such hangs (or even detecting this kind of underlying
> OS flaw).

Check out heartbeat, and the rest (perhaps you knew of this):
http://www.linux-ha.org/

cheers
zenaan

-- 
Homepage: www.SoulSound.net -- Free Australia: www.UPMART.org
Please respect the confidentiality of this email as sensibly warranted.


Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread David Saez Padros

Hi


> What we have here (kernel lockups and glusterfs on the same machine)
> might not be a co-incidence.

But it could be.

> There might well be a correlation -- but
> by nature of the problem it is not right to treat this as a
> cause-effect relation with glusterfs being the cause.

I think it's also not right to simply rule out glusterfs being the
cause.

> It is just
> not right to blame _any_ userspace application for any kind of kernel
> lockups or hangs.

Well, I'm not saying that this is glusterfs's fault; what I'm saying
is that it is very likely that glusterfs is at least triggering this
fault.

> So any hang or lockup in the kernel can only be
> caused by a bug in itself, which could possibly be triggered by a
> specific user application.

Maybe, but don't you feel that this needs to be investigated in order
to know what is really happening?

> What we will be fixing is failing over to other machines when the
> backend FS hangs. The reason why this was not a priority (so far
> atleast) is because a kernel is a trusted piece of software in the
> system, and when you are having a kernel which has a bug in the fs,
> you should just upgrade to a newer kernel.

Yes, but right now there is no evidence that this is a kernel bug.
From a user's point of view, if this did not happen when using nfs and
happens when using glusterfs, the most evident solution is to switch back
to nfs (like you, we usually prefer to trust kernel stability over
application stability) and not do any kernel upgrade unless there is
evidence that this is a kernel bug (as a kernel upgrade could mean
having to upgrade many other pieces of software that were working ok
and that will need to be tested again).

> What we promise to fix is a way to (as best
> as possible) somehow translate a backend FS hang into a "subvolume
> down" status and consider that subvolume to be down. After that, you
> will _still_ continue to face kernel hangs and lockups and just
> glusterfs will stop hanging. Your machines would still remain locked
> up.

That's great!

--
Thanx & best regards ...


   David Saez Padros            http://www.ols.es
   On-Line Services 2000 S.L.   telf    +34 902 50 29 75





Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread David Saez Padros

Hi


> > a) documentation says "All operations that do not modify the file
> > or directory are sent to all the subvolumes and the first successful
> > reply is returned to the application", so why is it blocking then?
> > The reply from the non-blocked server is supposed to
> > come first so that nothing blocks, but clients are blocking on
> > a simple ls operation
>
> The calls (as you have seen in the logs as well) which are hanging are
> lookup calls, which have to be sent to all subvolumes to ensure all
> the copies are in sync.


OK, then the simplest fix would be to add a timeout for lookup
calls, although I would prefer to optionally also have the first reply
to the lookup sent to the application and then wait in the
background for the other ones so gluster can keep files in sync.
This would eliminate this hang and also make the system more responsive.
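
The proposal above (answer the application with the first lookup reply, keep waiting for the others in the background, and bound the wait with a timeout) can be sketched with a thread pool. This illustrates the idea only. `lookup` is a hypothetical callable standing in for a per-subvolume lookup, and nothing here reflects GlusterFS internals.

```python
import concurrent.futures as futures


def first_reply(lookup, subvolumes, timeout_seconds):
    """Return (reply, pending) from the first subvolume that answers.

    The remaining lookups keep running in the pool; `pending` holds
    their futures so the caller can compare replies later.
    Raises TimeoutError if no lookup succeeds within the timeout.
    """
    pool = futures.ThreadPoolExecutor(max_workers=len(subvolumes))
    submitted = [pool.submit(lookup, name) for name in subvolumes]
    pool.shutdown(wait=False)  # do not block on slow subvolumes
    try:
        for done in futures.as_completed(submitted, timeout=timeout_seconds):
            if done.exception() is None:
                pending = [f for f in submitted if f is not done]
                return done.result(), pending
    except futures.TimeoutError:
        pass
    raise TimeoutError("no subvolume returned a successful reply in time")
```

The trade-off is the one raised in this thread: the application gets an answer quickly, but the copies are only known to be in sync once the futures in `pending` have been checked.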

BTW, will switching off some of the self heal options in the client
make glusterfs use only the first reply received to the lookup call ?

--
Thanx & best regards ...


   David Saez Padros            http://www.ols.es
   On-Line Services 2000 S.L.   telf    +34 902 50 29 75





Re: [Gluster-users] Replication not working on server hang

2009-08-30 Thread Anand Avati
> I'm wondering if there's some way for glusterfs to detect the flaws of the
> underlying operating system.  I believe there's no bug-free file systems in
> the universe, so I believe it is the job of the glusterfs developer to
> specify which underlying filesystem is tested and supported.  It's not good
> to simply say that glusterfs works on all real-world approximations to an
> imaginary bug-free posix  filesystem.

I would be genuinely interested to know about another project which is
geared up to be resilient against kernel hangs so that we can borrow
some ideas on how to reliably detect kernel soft lockups or syscall
hangs. As far as I know, even mature projects like Apache have not
bothered fixing such hangs (or even detecting this kind of underlying
OS flaw).

Avati