Re: [Gluster-users] Gluster failure testing

2012-08-15 Thread Bryan Whitehead
Are you using ext4 with Red Hat/CentOS? A previous thread describes
an ext4-related bug that causes similar-sounding problems.

If you are using ext4, try using xfs.

On Tue, Aug 14, 2012 at 11:12 PM, Brian Candler wrote:
> On Tue, Aug 14, 2012 at 08:19:27PM -0700, stephen pierce wrote:
>>I let both clients run for a while, then I stop one client. I then
>>reset the brick/server that is not active (the other one is servicing
>>the HTTP traffic) now.
>
> Do you mean that client1 sends HTTP traffic to brick/server1, and client2
> sends HTTP traffic to brick/server2?
>
>>While investigating, I discover that there are a lot of phantom
>>files that are listed with just a filename, and lots of question marks
>>when doing an ls -l. rm -rf * on the Gluster volume seems to
>>complete, but leaves behind all the broken files.
>
> It would be helpful if you could show the actual ls -l output, but my guess
> is you are seeing something like this (demo on a local filesystem, not
> gluster):
>
> $ mkdir testdir
> $ touch testdir/testfile
> $ chmod -x testdir
> $ ls -l testdir
> ls: cannot access testdir/testfile: Permission denied
> total 0
> -????????? ? ? ? ?            ? testfile
>
> If so, these aren't really "phantom files", but the permissions of the
> enclosing directory are set wrongly (which might be some intermediate state
> in gluster replication, I don't know)
>
> So an "ls -ld" of the parent directory would also be a good thing. Also, are
> these filenames those you'd expect your application to create?
>
> What might be helpful is to trace your backend-application and what's making
> it return a 500 server error, which may or may not be related to these
> permissions.  If you can see what file operations the backend is trying to
> do and what filesystem error is being returned (e.g.  with strace), this may
> make it clearer what's going on.  Then you can perhaps crank up gluster logs
> at the appropriate place too.
>
> Any log messages talking about "split brain" would be especially interesting.
>
> Regards,
>
> Brian.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] Gluster failure testing

2012-08-14 Thread Brian Candler
On Tue, Aug 14, 2012 at 08:19:27PM -0700, stephen pierce wrote:
>I let both clients run for a while, then I stop one client. I then
>reset the brick/server that is not active (the other one is servicing
>the HTTP traffic) now.

Do you mean that client1 sends HTTP traffic to brick/server1, and client2
sends HTTP traffic to brick/server2?

>While investigating, I discover that there are a lot of phantom
>files that are listed with just a filename, and lots of question marks
>when doing an ls -l. rm -rf * on the Gluster volume seems to
>complete, but leaves behind all the broken files.

It would be helpful if you could show the actual ls -l output, but my guess
is you are seeing something like this (demo on a local filesystem, not
gluster):

$ mkdir testdir
$ touch testdir/testfile
$ chmod -x testdir
$ ls -l testdir
ls: cannot access testdir/testfile: Permission denied
total 0
-????????? ? ? ? ?            ? testfile

If so, these aren't really "phantom files", but the permissions of the
enclosing directory are set wrongly (which might be some intermediate state
in gluster replication, I don't know)
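
The same check can be scripted. What follows is only a minimal Python sketch of the local demo above, not anything Gluster-specific: it lists a directory and records which entries cannot be stat'ed, together with the errno name. The names stat_entries and demo are made up for illustration, and the sketch assumes a POSIX system and a non-root user (root bypasses the permission check, so every entry would stat fine).

```python
import errno
import os
import tempfile


def stat_entries(directory):
    """Map each name in `directory` to None, or to the errno name if lstat fails."""
    results = {}
    # Listing names needs only read permission on the directory.
    for name in os.listdir(directory):
        try:
            # Looking up an entry needs search (x) permission on the directory.
            os.lstat(os.path.join(directory, name))
            results[name] = None
        except OSError as exc:
            results[name] = errno.errorcode.get(exc.errno, str(exc.errno))
    return results


def demo():
    """Recreate the testdir/testfile case in a temporary directory."""
    with tempfile.TemporaryDirectory() as base:
        testdir = os.path.join(base, "testdir")
        os.mkdir(testdir)
        open(os.path.join(testdir, "testfile"), "w").close()
        # Owner may read the directory but not search it, similar to "chmod -x".
        os.chmod(testdir, 0o600)
        try:
            return stat_entries(testdir)
        finally:
            # Restore access so the temporary directory can be cleaned up.
            os.chmod(testdir, 0o700)


if __name__ == "__main__":
    for name, error in demo().items():
        print(name, error or "ok")
```

Pointing stat_entries at the affected directory on the Gluster mount would show whether the question marks come from EACCES, as in this demo, or from some other error.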

So an "ls -ld" of the parent directory would also be a good thing. Also, are
these filenames those you'd expect your application to create?

What might be helpful is to trace your backend-application and what's making
it return a 500 server error, which may or may not be related to these
permissions.  If you can see what file operations the backend is trying to
do and what filesystem error is being returned (e.g.  with strace), this may
make it clearer what's going on.  Then you can perhaps crank up gluster logs
at the appropriate place too.

Any log messages talking about "split brain" would be especially interesting.

Regards,

Brian.


[Gluster-users] Gluster failure testing

2012-08-14 Thread stephen pierce
I’ve been doing some failure testing, and I ran into a really nasty
condition. I'm hoping that I did something stupid. If you guys know what
happened, or can shed some light, please let me know.



My test environment is four virtual machines. On two of them I installed
Gluster 3.3 and created a redundant volume between the two. I also installed
Apache and my custom application (it's like webdav) on these boxes. The boxes
mount the redundant volume via 127.0.0.1 as Gluster clients. The application
uses the volume as its storage.



The other two boxes are clients. They run a custom python script to
download files, upload files, remove files and list directories; very
similar to webdav. Clients connect via http, perform the operation
(PUT,GET,DELETE) then disconnect. Rinse, repeat. The balance of
PUT/GET/DELETE is 1/5/1. One client connects to one server/brick, the other
client connects to the other server/brick.
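
This is not the actual test script, just a minimal sketch of how a 1/5/1 mix of operations can be drawn, with the HTTP calls left out. The names pick_operations and summarize are hypothetical.

```python
import random

# Weights taken from the 1/5/1 PUT/GET/DELETE balance described above.
OPERATIONS = ("PUT", "GET", "DELETE")
WEIGHTS = (1, 5, 1)


def pick_operations(count, seed=None):
    """Return `count` operation names drawn with the 1/5/1 weighting."""
    rng = random.Random(seed)
    return rng.choices(OPERATIONS, weights=WEIGHTS, k=count)


def summarize(ops):
    """Count how often each operation appears in `ops`."""
    return {op: ops.count(op) for op in OPERATIONS}


if __name__ == "__main__":
    print(summarize(pick_operations(700)))
```

Over a long run, GET should come out at roughly five times the count of PUT or DELETE.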



I let both clients run for a while, then I stop one client. I then ‘reset’
the brick/server that is not ‘active’ (the other one is servicing the HTTP
traffic) now. It is interesting to watch the test client at this point,
because there is a 15 second pause, then the operations proceed. This is
great. I'm very happy with this.



When the ‘failed’ brick comes back up, the operations stop for 45 seconds.
This is also fine. I then let the client run for a while, but the test
suite fails shortly (10 minutes?) afterwards with a 500 server error. While
investigating, I discover that there are a lot of ‘phantom’ files that are
listed with just a filename, and lots of question marks when doing
an ‘ls -l’. ‘rm -rf *’ on the Gluster volume seems to complete, but leaves
behind all the ‘broken’ files.


I eventually decided to blow away the volume and start over again, which
caused me to get educated on 'setfattr' and wasted the rest of the day. I’m
going to start some overnight runs now (before I leave for the day). I'm
going to try to reproduce this failure mode tomorrow.


So guys, what might be going on here? My workload is moderate, and it’s
only one client; not like it’s writing a bunch of files at once. Gluster has
been pretty bulletproof and this is the first time it’s really scared me.
If this were production, I'd certainly have data loss. I have to believe
that I'm doing something very wrong, as hardware failures (simulated by the
virtual 'reset') are very common, and should not be a problem.



Thanks for any insights,


Steve