To my eyes this specific case looks like a split-brain scenario, but the output of "volume heal info split-brain" does not show any files. Should I still use the procedure for split-brain files as documented in the GlusterFS documentation, or what do you recommend here?
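A hedged sketch of how this could be double-checked: the split-brain listing is per volume, so it is worth running it against the affected volume directly (volume name taken from the output quoted below; the `sb_total` helper name is illustrative, the gluster commands are the standard CLI forms):

```shell
# Per-volume split-brain listing:
#   gluster volume heal myvol-pro info split-brain
# If an entry did show up, recent GlusterFS versions can resolve it from
# the CLI with a source policy instead of the manual xattr procedure, e.g.:
#   gluster volume heal myvol-pro split-brain latest-mtime <path-to-file>
# Each brick section of the listing ends with a "Number of entries:" line;
# an illustrative helper to total them across bricks (sample input below
# mirrors the zero-entry case described above):
sb_total() {
  awk -F': ' '/^Number of entries:/ { n += $2 } END { print n+0 }'
}
sb_total <<'EOF'
Brick node1:/data/myvol-pro/brick
Status: Connected
Number of entries: 0
Brick node2:/data/myvol-pro/brick
Status: Connected
Number of entries: 0
EOF
# prints 0
```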
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, November 5, 2018 4:36 PM, mabi <m...@protonmail.ch> wrote:

> Ravi, I did not yet modify the cluster.data-self-heal parameter to off
> because in the meantime node2 of my cluster had a memory shortage (this
> node has 32 GB of RAM) and as such I had to reboot it. After that reboot
> all locks got released and there are no more files left to heal on that
> volume. So the reboot of node2 did the trick (but this still seems to be
> a bug).
>
> Now on another volume of this same cluster I have a total of 8 unsynced
> entries (4 of them directories) on node1 and node3 (arbiter), as you can
> see below:
>
> Brick node1:/data/myvol-pro/brick
> /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/oc_dir
> gfid:3c92459b-8fa1-4669-9a3d-b38b8d41c360
> /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/le_dir
> gfid:aae4098a-1a71-4155-9cc9-e564b89957cf
> Status: Connected
> Number of entries: 4
>
> Brick node2:/data/myvol-pro/brick
> Status: Connected
> Number of entries: 0
>
> Brick node3:/srv/glusterfs/myvol-pro/brick
> /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/oc_dir
> /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/le_dir
> gfid:aae4098a-1a71-4155-9cc9-e564b89957cf
> gfid:3c92459b-8fa1-4669-9a3d-b38b8d41c360
> Status: Connected
> Number of entries: 4
>
> If I check the directory
> "/data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/" with an
> "ls -l" on the client (gluster fuse mount) I get the following garbage:
>
> drwxr-xr-x 4 www-data www-data 4096 Nov 5 14:19 .
> drwxr-xr-x 31 www-data www-data 4096 Nov 5 14:23 ..
> d????????? ? ? ? ? ? le_dir
>
> I checked on the nodes, and indeed node1 and node3 have the same
> directory from 14:19, but node2 has a directory from 14:12.
>
> Again the self-heal daemon doesn't seem to be doing anything... What do
> you recommend I do in order to heal these unsynced files?
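Since node1/node3 and node2 disagree on that directory, comparing its trusted.gfid xattr across the bricks would confirm whether node2's copy diverged. getfattr is the standard tool here; the `gfid_to_uuid` helper below is purely illustrative (not part of gluster) and only rewrites the hex xattr value into the UUID form that "heal info" prints, so the two outputs can be compared directly:

```shell
# On each node, as root, against the full brick path (shortened here to
# match the abbreviation used in the thread):
#   getfattr -n trusted.gfid -e hex /data/myvol-pro/brick/data/.../le_dir
# getfattr prints one hex string; convert it to heal-info's UUID form:
gfid_to_uuid() {
  printf '%s\n' "$1" |
    sed -E 's/^0x//; s/^(.{8})(.{4})(.{4})(.{4})(.{12})$/\1-\2-\3-\4-\5/'
}
gfid_to_uuid 0xaae4098a1a7141559cc9e564b89957cf
# prints aae4098a-1a71-4155-9cc9-e564b89957cf
```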
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, November 5, 2018 2:42 AM, Ravishankar N <ravishan...@redhat.com> wrote:
>
> > On 11/03/2018 04:13 PM, mabi wrote:
> >
> > > Ravi (or anyone else who can help), I now have even more files which
> > > are pending for healing.
> >
> > If the count is increasing, there is likely a network (disconnect)
> > problem between the gluster clients and the bricks that needs fixing.
> >
> > > Here is the output of a "volume heal info summary":
> > >
> > > Brick node1:/data/myvol-private/brick
> > > Status: Connected
> > > Total Number of entries: 49845
> > > Number of entries in heal pending: 49845
> > > Number of entries in split-brain: 0
> > > Number of entries possibly healing: 0
> > >
> > > Brick node2:/data/myvol-private/brick
> > > Status: Connected
> > > Total Number of entries: 26644
> > > Number of entries in heal pending: 26644
> > > Number of entries in split-brain: 0
> > > Number of entries possibly healing: 0
> > >
> > > Brick node3:/srv/glusterfs/myvol-private/brick
> > > Status: Connected
> > > Total Number of entries: 0
> > > Number of entries in heal pending: 0
> > > Number of entries in split-brain: 0
> > > Number of entries possibly healing: 0
> > >
> > > Should I try to set the "cluster.data-self-heal" parameter of that
> > > volume to "off" as mentioned in the bug?
> >
> > Yes, as mentioned in the workaround in the thread that I shared.
> >
> > > And by doing that, does it mean that my files pending heal are in
> > > danger of being lost?
> >
> > No.
> >
> > > Also is it dangerous to leave "cluster.data-self-heal" to off?
> >
> > No. This is only disabling client-side data healing. The self-heal
> > daemon would still heal the files.
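For reference, the workaround Ravi points to is an ordinary volume-set, sketched below with the volume name from the output above; the `pending_total` helper is illustrative and just totals the per-brick "heal pending" counts that "volume heal <vol> info summary" prints (figures taken from the summary above):

```shell
# The workaround from the referenced thread (standard volume-set syntax):
#   gluster volume set myvol-private cluster.data-self-heal off
# Illustrative helper: total the pending-heal counts across bricks. On a
# live cluster you would pipe the heal-info summary itself into it.
pending_total() {
  awk -F': ' '/Number of entries in heal pending/ { n += $2 } END { print n+0 }'
}
pending_total <<'EOF'
Number of entries in heal pending: 49845
Number of entries in heal pending: 26644
Number of entries in heal pending: 0
EOF
# prints 76489
```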
> > -Ravi
> >
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Saturday, November 3, 2018 1:31 AM, Ravishankar N <ravishan...@redhat.com> wrote:
> > >
> > > > Mabi,
> > > > If bug 1637953 is what you are experiencing, then you need to
> > > > follow the workarounds mentioned in
> > > > https://lists.gluster.org/pipermail/gluster-users/2018-October/035178.html.
> > > > Can you see if this works?
> > > > -Ravi
> > > >
> > > > On 11/02/2018 11:40 PM, mabi wrote:
> > > >
> > > > > I tried again to manually run a heal by using the "gluster
> > > > > volume heal" command because still no files have been healed,
> > > > > and noticed the following warning in the glusterd.log file:
> > > > >
> > > > > [2018-11-02 18:04:19.454702] I [MSGID: 106533]
> > > > > [glusterd-volume-ops.c:938:__glusterd_handle_cli_heal_volume]
> > > > > 0-management: Received heal vol req for volume myvol-private
> > > > > [2018-11-02 18:04:19.457311] W [rpc-clnt.c:1753:rpc_clnt_submit]
> > > > > 0-glustershd: error returned while attempting to connect to
> > > > > host:(null), port:0
> > > > >
> > > > > It looks like glustershd can't connect to "host:(null)". Could
> > > > > that be the reason why no healing is taking place? If yes, why do
> > > > > I see "host:(null)" here, and what needs fixing?
> > > > > This seems to have happened since I upgraded from 3.12.14 to
> > > > > 4.1.5. I would really appreciate some help here; I suspect this
> > > > > is an issue with GlusterFS 4.1.5.
> > > > > Thank you in advance for any feedback.
> > > > >
> > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > On Wednesday, October 31, 2018 11:13 AM, mabi <m...@protonmail.ch> wrote:
> > > > >
> > > > > > Hello,
> > > > > > I have a GlusterFS 4.1.5 cluster with 3 nodes (including 1
> > > > > > arbiter) and currently have a volume with around 27174 files
> > > > > > which are not being healed.
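A hedged side note on the "host:(null)" warning: restarting the self-heal daemon is a common first step ("volume start force" respawns missing volume daemons, including glustershd, without disturbing running bricks), and it can help to check how often the connect failure recurs in the log. The `conn_errors` helper below is illustrative; on a node you would feed it the real log file rather than a here-document (sample line taken from the log quoted above):

```shell
# Restart the self-heal daemon without touching running bricks:
#   gluster volume start myvol-private force
# Illustrative helper: count the glustershd connect warnings in a log.
conn_errors() {
  grep -c 'error returned while attempting to connect'
}
conn_errors <<'EOF'
[2018-11-02 18:04:19.454702] I [MSGID: 106533] 0-management: Received heal vol req for volume myvol-private
[2018-11-02 18:04:19.457311] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glustershd: error returned while attempting to connect to host:(null), port:0
EOF
# prints 1
```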
> > > > > > The "volume heal info" command shows the same 27k files under
> > > > > > the first node and the second node, but there is nothing under
> > > > > > the 3rd node (arbiter).
> > > > > > I already tried running a "volume heal" but none of the files
> > > > > > got healed.
> > > > > > In the glfsheal log file for that particular volume the only
> > > > > > error I see is a few of these entries:
> > > > > >
> > > > > > [2018-10-31 10:06:41.524300] E [rpc-clnt.c:184:call_bail]
> > > > > > 0-myvol-private-client-0: bailing out frame type(GlusterFS 4.x v1)
> > > > > > op(INODELK(29)) xid = 0x108b sent = 2018-10-31 09:36:41.314203.
> > > > > > timeout = 1800 for 127.0.1.1:49152
> > > > > >
> > > > > > and then a few of these warnings:
> > > > > >
> > > > > > [2018-10-31 10:08:12.161498] W [dict.c:671:dict_ref]
> > > > > > (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/replicate.so(+0x6734a)
> > > > > > [0x7f2a6dff434a]
> > > > > > -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x5da84)
> > > > > > [0x7f2a798e8a84]
> > > > > > -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x58)
> > > > > > [0x7f2a798a37f8] ) 0-dict: dict is NULL [Invalid argument]
> > > > > >
> > > > > > The glustershd.log file shows the following:
> > > > > >
> > > > > > [2018-10-31 10:10:52.502453] E [rpc-clnt.c:184:call_bail]
> > > > > > 0-myvol-private-client-0: bailing out frame type(GlusterFS 4.x v1)
> > > > > > op(INODELK(29)) xid = 0xaa398 sent = 2018-10-31 09:40:50.927816.
> > > > > > timeout = 1800 for 127.0.1.1:49152
> > > > > > [2018-10-31 10:10:52.502502] E [MSGID: 114031]
> > > > > > [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
> > > > > > 0-myvol-private-client-0: remote operation failed [Transport
> > > > > > endpoint is not connected]
> > > > > >
> > > > > > Any idea what could be wrong here?
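The INODELK call_bail with timeout = 1800 suggests a lock held for the full 30-minute frame timeout; a brick statedump shows which locks are granted vs. blocked. The statedump and clear-locks subcommands are the standard CLI; the dump lines in the sample below are illustrative rather than a verbatim dump format, and `lock_counts` is a helper name of mine:

```shell
# Dump brick state (dump files appear under /var/run/gluster/ on each node):
#   gluster volume statedump myvol-private
# If a stale granted inode lock turns up, the clear-locks subcommand can
# release it (check "gluster volume clear-locks" help for the exact
# kind/range arguments before running it).
# Illustrative helper: count active vs. blocked inode locks in a dump.
lock_counts() {
  awk '/inodelk.*ACTIVE/  { a++ }
       /inodelk.*BLOCKED/ { b++ }
       END { printf "active=%d blocked=%d\n", a+0, b+0 }'
}
lock_counts <<'EOF'
inodelk.inodelk[0](ACTIVE)=type=WRITE, start=0, len=0
inodelk.inodelk[1](BLOCKED)=type=WRITE, start=0, len=0
EOF
# prints active=1 blocked=1
```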
> > > > > > Regards,
> > > > > > Mabi
> > > > > >
> > > > > > Gluster-users mailing list
> > > > > > Gluster-users@gluster.org
> > > > > > https://lists.gluster.org/mailman/listinfo/gluster-users

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users