On 10/19/2014 06:05 PM, Anirban Ghoshal wrote:
I see. Thanks a tonne for the thorough explanation! :) I can see that our setup would be vulnerable here because the logger on one server is not generally aware of the state of the replica on the other server. So, it is possible that the log files may have been renamed before heal had a chance to kick in.

Could I also ask for the bug ID (if there is one) against which you are coding the fix, so that we get a notification once it is merged?

This bug was reported by Red Hat QE and is cloned upstream. I copied the relevant content so you can see the context:
https://bugzilla.redhat.com/show_bug.cgi?id=1154491

Pranith

Also, as an aside, is O_DIRECT supposed to prevent this from occurring if one were to make allowance for the performance hit?

Unfortunately, no :-(. As far as I understand, the work-around I described is the only one.

Pranith

Thanks again,
Anirban


------------------------------------------------------------------------
*From:* Pranith Kumar Karampuri <pkara...@redhat.com>
*To:* Anirban Ghoshal <chalcogen_eg_oxy...@yahoo.com>; <gluster-users@gluster.org>
*Subject:* Re: [Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors
*Sent:* Sun, Oct 19, 2014 9:01:58 AM


On 10/19/2014 01:36 PM, Anirban Ghoshal wrote:
It is possible, yes, because these are actually a kind of log file. Like other logging frameworks, these files can remain open for a considerable period and then get renamed to support log-rotate semantics.

That said, I might need to check with the team that actually manages the logging framework to be sure. I only take care of the file-system stuff. I can tell you for sure Monday.

If it is the same race that you mention, is there a fix for it?

Thanks,
Anirban


I am working on the fix.

RCA:
0) Let's say the file 'abc.log' is open for writing on the replica pair (brick-0, brick-1).
1) brick-0 goes down.
2) abc.log is renamed to abc.log.1.
3) brick-0 comes back up.
4) The mount re-opens the old abc.log on brick-0.
5) Self-heal kicks in, deletes the old abc.log on brick-0, and creates and syncs abc.log.1.
6) But the mount is still writing to the deleted old abc.log on brick-0, so abc.log.1 on brick-0 stays at the same size while abc.log.1 on brick-1 keeps growing. This leads to a size-mismatch split-brain on abc.log.1.

The race is between steps 4) and 5): if 5) happens before 4), no split-brain is observed.
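The divergence in step 6) comes from POSIX rename semantics: a writer's open fd follows the inode, not the name. A minimal Python sketch of that behavior (filenames are illustrative; this is not GlusterFS code):

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
log = os.path.join(workdir, "abc.log")

writer = open(log, "w")            # like the mount holding abc.log open
writer.write("before rotate\n")
writer.flush()

os.rename(log, log + ".1")         # log rotation: abc.log -> abc.log.1

writer.write("after rotate\n")     # writes still land in the same inode,
writer.flush()                     # which is now named abc.log.1
writer.close()

with open(log + ".1") as f:
    rotated_contents = f.read()    # both writes are in abc.log.1
```

This is why, once the mount re-opens the *old* abc.log on brick-0, its writes never reach the abc.log.1 that self-heal created there.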

Work-around:

0) Take backup of good abc.log.1 file from brick-1. (Just being paranoid)

Do either of the following two steps to make sure the stale open file is closed:
1-a) Take the brick process with the bad file down using kill -9 <brick-pid> (in my example, brick-0).
1-b) Introduce a temporary disconnect between mount and brick-0.
(I would choose 1-a)
2) Remove the bad file (abc.log.1) and its gfid backend file (under .glusterfs) from brick-0.
3) Bring the brick back up (gluster volume start <volname> force) or restore the connection, and trigger the heal by doing a 'stat' on abc.log.1 from the mount.
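As a concrete sketch of the steps above (brick paths, volume name, and PID are placeholders, not taken from this thread):

```shell
# 0) Back up the good copy from brick-1 (just being paranoid)
cp /bricks/brick-1/abc.log.1 /root/abc.log.1.bak

# 1-a) Take the brick process holding the bad file down (brick-0 here)
kill -9 <brick-0-pid>

# 2) Remove the bad file and its gfid backend hard link on brick-0
#    (the backend file lives at .glusterfs/<aa>/<bb>/<gfid>)
rm /bricks/brick-0/abc.log.1
rm /bricks/brick-0/.glusterfs/<aa>/<bb>/<full-gfid>

# 3) Restart the brick and trigger the heal from the mount
gluster volume start <volname> force
stat /mnt/<volname>/abc.log.1
```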

This bug has existed since 2012, when I first implemented rename/hard-link self-heal. It is difficult to re-create; I have to put breakpoints at several places in the process to hit the race.

Pranith


Thanks,
Anirban

------------------------------------------------------------------------
*From:* Pranith Kumar Karampuri <pkara...@redhat.com>
*To:* Anirban Ghoshal <chalcogen_eg_oxy...@yahoo.com>; <gluster-users@gluster.org>
*Subject:* Re: [Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors
*Sent:* Sun, Oct 19, 2014 5:42:24 AM


On 10/18/2014 04:36 PM, Anirban Ghoshal wrote:
Hi,

Yes, they do, and considerably. I'd forgotten to mention that in my last email. Their mtimes, however, as far as I could tell on the separate servers, seemed to coincide.

Thanks,
Anirban



Are these files always open? And is it possible that the file could have been renamed when one of the bricks was offline? I know of a race which can introduce this one. Just trying to find if it is the same case.

Pranith


------------------------------------------------------------------------
*From:* Pranith Kumar Karampuri <pkara...@redhat.com>
*To:* Anirban Ghoshal <chalcogen_eg_oxy...@yahoo.com>; gluster-users@gluster.org <gluster-users@gluster.org>
*Subject:* Re: [Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors
*Sent:* Sat, Oct 18, 2014 12:26:08 AM

hi,
      Could you see if the size of the file mismatches?

Pranith

On 10/18/2014 04:20 AM, Anirban Ghoshal wrote:
Hi everyone,

I have this really confusing split-brain here that's bothering me. I am running glusterfs 3.4.2 over linux 2.6.34, with a replica 2 volume 'testvol'. It seems I cannot read/stat/edit the file in question, and `gluster volume heal testvol info split-brain` shows nothing. Here are the logs from the fuse-mount for the volume:

[2014-09-29 07:53:02.867111] W [fuse-bridge.c:1172:fuse_err_cbk] 0-glusterfs-fuse: 4560969: FLUSH() ERR => -1 (Input/output error)
[2014-09-29 07:54:16.007799] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c8529d20 & waitq = 0x7fd5c8067d40
[2014-09-29 07:54:16.007854] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561103: READ => -1 (Input/output error)
[2014-09-29 07:54:16.008018] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c8607ee0 & waitq = 0x7fd5c8067d40
[2014-09-29 07:54:16.008056] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561104: READ => -1 (Input/output error)
[2014-09-29 07:54:16.008233] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c8066f30 & waitq = 0x7fd5c8067d40
[2014-09-29 07:54:16.008269] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561105: READ => -1 (Input/output error)
[2014-09-29 07:54:16.008800] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c860bcf0 & waitq = 0x7fd5c863b1f0
[2014-09-29 07:54:16.008839] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561107: READ => -1 (Input/output error)
[2014-09-29 07:54:16.009365] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c85fd120 & waitq = 0x7fd5c8067d40
[2014-09-29 07:54:16.009413] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561109: READ => -1 (Input/output error)
[2014-09-29 07:54:16.040549] W [afr-open.c:213:afr_open] 0-testvol-replicate-0: failed to open as split brain seen, returning EIO
[2014-09-29 07:54:16.040594] W [fuse-bridge.c:915:fuse_fd_cbk] 0-glusterfs-fuse: 4561142: OPEN() /SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log => -1 (Input/output error)

Could somebody please give me some clue on where to begin? I checked the xattrs on /SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log and it seems the changelogs are [0, 0] on both replicas, and the gfid's match.
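For reference, the AFR changelog xattrs can be inspected directly on each brick with getfattr (the brick path here is a placeholder; run it on both servers):

```shell
getfattr -d -m . -e hex \
    /bricks/<brick-path>/SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log
# Look at trusted.afr.testvol-client-0 and trusted.afr.testvol-client-1:
# all-zero values correspond to the [0 0] pending changelog mentioned above.
# trusted.gfid should be identical on both bricks.
```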

Thank you very much for any help on this.
Anirban





_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users




