On 01/26/2011 07:25 PM, David Lloyd wrote:
Well, I did this and it seems to have worked. I was just guessing really,
didn't have any documentation or advice from anyone in the know.

I just reset the attributes on the root directory for each brick that was
not all zeroes.

I found it easier to dump the attributes without the '-e hex' option:

g4:~ # getfattr -d  -m trusted.afr /mnt/glus1 /mnt/glus2
getfattr: Removing leading '/' from absolute path names
# file: mnt/glus1
trusted.afr.glustervol1-client-2=0sAAAAAAAAAAEAAAAA
trusted.afr.glustervol1-client-3=0sAAAAAAAAAAAAAAAA
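
The same attributes can also be dumped with '-e hex' (a sketch assuming the same brick paths as above), which shows the three counters directly:

getfattr -d -m trusted.afr -e hex /mnt/glus1
# file: mnt/glus1
trusted.afr.glustervol1-client-2=0x000000000000000100000000
trusted.afr.glustervol1-client-3=0x000000000000000000000000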

Then:

setfattr -n trusted.afr.glustervol1-client-2 -v 0sAAAAAAAAAAAAAAAA /mnt/glus1

I did that on all the bricks that didn't have all A's.
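
Spelled out as a loop (a sketch only, assuming the brick root and attribute names from the dump above; run it on each server against each affected brick root):

# zero the pending-operation counters for every replica peer on this brick
# (setting an already all-zero value is a harmless no-op)
for a in trusted.afr.glustervol1-client-2 trusted.afr.glustervol1-client-3; do
    setfattr -n "$a" -v 0sAAAAAAAAAAAAAAAA /mnt/glus1
done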

The next time I stat-ed the root of the filesystem on the client, the self-heal worked OK.
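
On the client side that amounts to something like (assuming a hypothetical client mount point /mnt/glustervol1):

# looking up the directory from a client triggers the self-heal check
stat /mnt/glustervol1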

I'm not comfortable advising you to do this as I'm really feeling my way
here, but it looks as though it worked for me.

This seems really dangerous to me. On a brick xxx, the trusted.afr.yyy attribute consists of three unsigned 32-bit counters, indicating how many uncommitted operations (data, metadata, and namespace respectively) might exist at yyy. If xxx shows uncommitted operations at yyy but not vice versa, then we know that xxx is more up to date and it should be the source for self-heal. If two bricks show uncommitted operations at each other, then we're in the infamous "split brain" scenario. Some client was unable to clear the counter at xxx while another was unable to clear it at yyy, or both xxx and yyy went down after the operation was complete but before they could clear the counters for each other.
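
To make that concrete, the value from the dump above decodes like this (an illustrative aside using base64(1) and od(1); the counters are read here as big-endian):

# the leading '0s' in getfattr output just marks base64 and is not part of the value
echo AAAAAAAAAAEAAAAA | base64 -d | od -An -tx1
#  00 00 00 00 00 00 00 01 00 00 00 00
#  data = 0, metadata = 1, namespace = 0  ->  one uncommitted metadata operation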

In this case, it looks like a metadata operation (permission change) was in this state. If the permissions are in fact the same both places then it doesn't matter which way self-heal happens, or whether it happens at all. In fact, it seems to me that AFR should be able to detect this particular condition and not flag it as an error. In any case, I think you're probably fine in this case but in general it's a very bad idea to clear these flags manually because it can cause updates to be lost (if self-heal goes the wrong way) or files to remain in an inconsistent state (if no self-heal occurs).

The real thing I'd wonder about is why both servers are so frequently becoming unavailable at the same instant (switch problem?) and why permission changes on the root are apparently so frequent that this often results in a split-brain.