To my eyes this specific case looks like a split-brain scenario, but the output of "volume heal info split-brain" does not show any files. Should I still use the procedure for split-brain files as documented in the GlusterFS documentation, or what do you recommend here?
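A hedged sketch of how this could be double-checked: the split-brain listing is per volume, so it is worth running it against the affected volume directly (volume name taken from the output quoted below; the `sb_total` helper name is illustrative, the gluster commands are the standard CLI forms):

```shell
# Per-volume split-brain listing:
#   gluster volume heal myvol-pro info split-brain
# If an entry did show up, recent GlusterFS versions can resolve it from
# the CLI with a source policy instead of the manual xattr procedure, e.g.:
#   gluster volume heal myvol-pro split-brain latest-mtime <path-to-file>
# Each brick section of the listing ends with a "Number of entries:" line;
# an illustrative helper to total them across bricks (sample input below
# mirrors the zero-entry case described above):
sb_total() {
  awk -F': ' '/^Number of entries:/ { n += $2 } END { print n+0 }'
}
sb_total <<'EOF'
Brick node1:/data/myvol-pro/brick
Status: Connected
Number of entries: 0
Brick node2:/data/myvol-pro/brick
Status: Connected
Number of entries: 0
EOF
# prints 0
```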
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, November 5, 2018 4:36 PM, mabi <m...@protonmail.ch> wrote:

> Ravi, I did not yet modify the cluster.data-self-heal parameter to off
> because in the meantime node2 of my cluster had a memory shortage (this
> node has 32 GB of RAM) and as such I had to reboot it. After that reboot
> all locks got released and there are no more files left to heal on that
> volume. So the reboot of node2 did the trick (but this still seems to be
> a bug).
>
> Now on another volume of this same cluster I have a total of 8 unsynced
> entries (4 of them directories) on node1 and node3 (arbiter), as you can
> see below:
>
> Brick node1:/data/myvol-pro/brick
> /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/oc_dir
> gfid:3c92459b-8fa1-4669-9a3d-b38b8d41c360
> /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/le_dir
> gfid:aae4098a-1a71-4155-9cc9-e564b89957cf
> Status: Connected
> Number of entries: 4
>
> Brick node2:/data/myvol-pro/brick
> Status: Connected
> Number of entries: 0
>
> Brick node3:/srv/glusterfs/myvol-pro/brick
> /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/oc_dir
> /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/le_dir
> gfid:aae4098a-1a71-4155-9cc9-e564b89957cf
> gfid:3c92459b-8fa1-4669-9a3d-b38b8d41c360
> Status: Connected
> Number of entries: 4
>
> If I check the directory
> "/data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/" with an
> "ls -l" on the client (gluster fuse mount) I get the following garbage:
>
> drwxr-xr-x 4 www-data www-data 4096 Nov 5 14:19 .
> drwxr-xr-x 31 www-data www-data 4096 Nov 5 14:23 ..
> d????????? ? ? ? ? ? le_dir
>
> I checked on the nodes, and indeed node1 and node3 have the same
> directory from 14:19, but node2 has a directory from 14:12.
>
> Again the self-heal daemon doesn't seem to be doing anything... What do
> you recommend I do in order to heal these unsynced files?
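Since node1/node3 and node2 disagree on that directory, comparing its trusted.gfid xattr across the bricks would confirm whether node2's copy diverged. getfattr is the standard tool here; the `gfid_to_uuid` helper below is purely illustrative (not part of gluster) and only rewrites the hex xattr value into the UUID form that "heal info" prints, so the two outputs can be compared directly:

```shell
# On each node, as root, against the full brick path (shortened here to
# match the abbreviation used in the thread):
#   getfattr -n trusted.gfid -e hex /data/myvol-pro/brick/data/.../le_dir
# getfattr prints one hex string; convert it to heal-info's UUID form:
gfid_to_uuid() {
  printf '%s\n' "$1" |
    sed -E 's/^0x//; s/^(.{8})(.{4})(.{4})(.{4})(.{12})$/\1-\2-\3-\4-\5/'
}
gfid_to_uuid 0xaae4098a1a7141559cc9e564b89957cf
# prints aae4098a-1a71-4155-9cc9-e564b89957cf
```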
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, November 5, 2018 2:42 AM, Ravishankar N <ravishan...@redhat.com> wrote:
>
> > On 11/03/2018 04:13 PM, mabi wrote:
> >
> > > Ravi (or anyone else who can help), I now have even more files which
> > > are pending for healing.
> >
> > If the count is increasing, there is likely a network (disconnect)
> > problem between the gluster clients and the bricks that needs fixing.
> >
> > > Here is the output of a "volume heal info summary":
> > >
> > > Brick node1:/data/myvol-private/brick
> > > Status: Connected
> > > Total Number of entries: 49845
> > > Number of entries in heal pending: 49845
> > > Number of entries in split-brain: 0
> > > Number of entries possibly healing: 0
> > >
> > > Brick node2:/data/myvol-private/brick
> > > Status: Connected
> > > Total Number of entries: 26644
> > > Number of entries in heal pending: 26644
> > > Number of entries in split-brain: 0
> > > Number of entries possibly healing: 0
> > >
> > > Brick node3:/srv/glusterfs/myvol-private/brick
> > > Status: Connected
> > > Total Number of entries: 0
> > > Number of entries in heal pending: 0
> > > Number of entries in split-brain: 0
> > > Number of entries possibly healing: 0
> > >
> > > Should I try to set the "cluster.data-self-heal" parameter of that
> > > volume to "off" as mentioned in the bug?
> >
> > Yes, as mentioned in the workaround in the thread that I shared.
> >
> > > And by doing that, does it mean that my files pending heal are in
> > > danger of being lost?
> >
> > No.
> >
> > > Also is it dangerous to leave "cluster.data-self-heal" to off?
> >
> > No. This is only disabling client-side data healing. The self-heal
> > daemon would still heal the files.
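For reference, the workaround Ravi points to is an ordinary volume-set, sketched below with the volume name from the output above; the `pending_total` helper is illustrative and just totals the per-brick "heal pending" counts that "volume heal <vol> info summary" prints (figures taken from the summary above):

```shell
# The workaround from the referenced thread (standard volume-set syntax):
#   gluster volume set myvol-private cluster.data-self-heal off
# Illustrative helper: total the pending-heal counts across bricks. On a
# live cluster you would pipe the heal-info summary itself into it.
pending_total() {
  awk -F': ' '/Number of entries in heal pending/ { n += $2 } END { print n+0 }'
}
pending_total <<'EOF'
Number of entries in heal pending: 49845
Number of entries in heal pending: 26644
Number of entries in heal pending: 0
EOF
# prints 76489
```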
> > -Ravi
> >
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Saturday, November 3, 2018 1:31 AM, Ravishankar N <ravishan...@redhat.com> wrote:
> > >
> > > > Mabi,
> > > > If bug 1637953 is what you are experiencing, then you need to
> > > > follow the workarounds mentioned in
> > > > https://lists.gluster.org/pipermail/gluster-users/2018-October/035178.html.
> > > > Can you see if this works?
> > > > -Ravi
> > > >
> > > > On 11/02/2018 11:40 PM, mabi wrote:
> > > >
> > > > > I tried again to manually run a heal by using the "gluster
> > > > > volume heal" command because still no files have been healed,
> > > > > and noticed the following warning in the glusterd.log file:
> > > > >
> > > > > [2018-11-02 18:04:19.454702] I [MSGID: 106533]
> > > > > [glusterd-volume-ops.c:938:__glusterd_handle_cli_heal_volume]
> > > > > 0-management: Received heal vol req for volume myvol-private
> > > > > [2018-11-02 18:04:19.457311] W [rpc-clnt.c:1753:rpc_clnt_submit]
> > > > > 0-glustershd: error returned while attempting to connect to
> > > > > host:(null), port:0
> > > > >
> > > > > It looks like glustershd can't connect to "host:(null)". Could
> > > > > that be the reason why no healing is taking place? If yes, why do
> > > > > I see "host:(null)" here, and what needs fixing?
> > > > > This seems to have happened since I upgraded from 3.12.14 to
> > > > > 4.1.5. I would really appreciate some help here; I suspect this
> > > > > is an issue with GlusterFS 4.1.5.
> > > > > Thank you in advance for any feedback.
> > > > >
> > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > On Wednesday, October 31, 2018 11:13 AM, mabi <m...@protonmail.ch> wrote:
> > > > >
> > > > > > Hello,
> > > > > > I have a GlusterFS 4.1.5 cluster with 3 nodes (including 1
> > > > > > arbiter) and currently have a volume with around 27174 files
> > > > > > which are not being healed.
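A hedged side note on the "host:(null)" warning: restarting the self-heal daemon is a common first step ("volume start force" respawns missing volume daemons, including glustershd, without disturbing running bricks), and it can help to check how often the connect failure recurs in the log. The `conn_errors` helper below is illustrative; on a node you would feed it the real log file rather than a here-document (sample line taken from the log quoted above):

```shell
# Restart the self-heal daemon without touching running bricks:
#   gluster volume start myvol-private force
# Illustrative helper: count the glustershd connect warnings in a log.
conn_errors() {
  grep -c 'error returned while attempting to connect'
}
conn_errors <<'EOF'
[2018-11-02 18:04:19.454702] I [MSGID: 106533] 0-management: Received heal vol req for volume myvol-private
[2018-11-02 18:04:19.457311] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glustershd: error returned while attempting to connect to host:(null), port:0
EOF
# prints 1
```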
> > > > > > The "volume heal info" command shows the same 27k files under
> > > > > > the first node and the second node, but there is nothing under
> > > > > > the 3rd node (arbiter).
> > > > > > I already tried running a "volume heal" but none of the files
> > > > > > got healed.
> > > > > > In the glfsheal log file for that particular volume the only
> > > > > > error I see is a few of these entries:
> > > > > >
> > > > > > [2018-10-31 10:06:41.524300] E [rpc-clnt.c:184:call_bail]
> > > > > > 0-myvol-private-client-0: bailing out frame type(GlusterFS 4.x v1)
> > > > > > op(INODELK(29)) xid = 0x108b sent = 2018-10-31 09:36:41.314203.
> > > > > > timeout = 1800 for 127.0.1.1:49152
> > > > > >
> > > > > > and then a few of these warnings:
> > > > > >
> > > > > > [2018-10-31 10:08:12.161498] W [dict.c:671:dict_ref]
> > > > > > (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/replicate.so(+0x6734a)
> > > > > > [0x7f2a6dff434a]
> > > > > > -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x5da84)
> > > > > > [0x7f2a798e8a84]
> > > > > > -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x58)
> > > > > > [0x7f2a798a37f8] ) 0-dict: dict is NULL [Invalid argument]
> > > > > >
> > > > > > The glustershd.log file shows the following:
> > > > > >
> > > > > > [2018-10-31 10:10:52.502453] E [rpc-clnt.c:184:call_bail]
> > > > > > 0-myvol-private-client-0: bailing out frame type(GlusterFS 4.x v1)
> > > > > > op(INODELK(29)) xid = 0xaa398 sent = 2018-10-31 09:40:50.927816.
> > > > > > timeout = 1800 for 127.0.1.1:49152
> > > > > > [2018-10-31 10:10:52.502502] E [MSGID: 114031]
> > > > > > [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
> > > > > > 0-myvol-private-client-0: remote operation failed [Transport
> > > > > > endpoint is not connected]
> > > > > >
> > > > > > Any idea what could be wrong here?
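The INODELK call_bail with timeout = 1800 suggests a lock held for the full 30-minute frame timeout; a brick statedump shows which locks are granted vs. blocked. The statedump and clear-locks subcommands are the standard CLI; the dump lines in the sample below are illustrative rather than a verbatim dump format, and `lock_counts` is a helper name of mine:

```shell
# Dump brick state (dump files appear under /var/run/gluster/ on each node):
#   gluster volume statedump myvol-private
# If a stale granted inode lock turns up, the clear-locks subcommand can
# release it (check "gluster volume clear-locks" help for the exact
# kind/range arguments before running it).
# Illustrative helper: count active vs. blocked inode locks in a dump.
lock_counts() {
  awk '/inodelk.*ACTIVE/  { a++ }
       /inodelk.*BLOCKED/ { b++ }
       END { printf "active=%d blocked=%d\n", a+0, b+0 }'
}
lock_counts <<'EOF'
inodelk.inodelk[0](ACTIVE)=type=WRITE, start=0, len=0
inodelk.inodelk[1](BLOCKED)=type=WRITE, start=0, len=0
EOF
# prints active=1 blocked=1
```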
> > > > > > Regards,
> > > > > > Mabi
> > > > > >
> > > > > > Gluster-users mailing list
> > > > > > Gluster-users@gluster.org
> > > > > > https://lists.gluster.org/mailman/listinfo/gluster-users

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users