I had tried increasing the log level, but didn't find anything of note.

However, after trying a number of different things over the weekend, it turned 
out that simply starting and stopping the volume seemed to have fixed this.
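For anyone else who hits this, the sequence was roughly the following (just a sketch; the volume name "scratch" comes from the logs further down this thread, and the brick arguments are placeholders):

    gluster volume stop scratch
    gluster volume start scratch
    gluster volume remove-brick scratch <BRICK> start    # re-issue the removal
    gluster volume remove-brick scratch <BRICK> status   # watch for progress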

It does seem like a bug then, or perhaps some confused state, given that there 
doesn't seem to be any issue with communication between the nodes.  I'm not really 
sure how to report it though, since I don't have steps to reproduce, or much 
insight from the logs into what the cause might be.
________________________________
From: Strahil <hunter86...@yahoo.com>
Sent: Sunday, October 27, 2019 10:19 AM
To: Timothy Orme <to...@ancestry.com>; gluster-users <gluster-users@gluster.org>
Subject: [EXTERNAL] Re: Re: [Gluster-users] Remove Brick Rebalance Hangs With 
No Activity


I guess you can increase the log level (check 
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level )
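
For example, the per-volume diagnostics options should do it (the volume name below is just the one from your logs; set it back to INFO when done):

    gluster volume set scratch diagnostics.brick-log-level DEBUG
    gluster volume set scratch diagnostics.client-log-level DEBUG
    # revert afterwards:
    # gluster volume set scratch diagnostics.brick-log-level INFO
    # gluster volume set scratch diagnostics.client-log-level INFO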

Also, have you checked if the new and old servers can communicate properly?

Also, consider running a tcpdump (for a short time) on the problematic node; it 
can prove whether communication is OK.
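
Something along these lines should be enough (ports below are the defaults: glusterd on 24007, brick ports starting at 49152 — adjust the range to your setup):

    tcpdump -i any -s 0 -w /tmp/gluster.pcap 'port 24007 or portrange 49152-49251'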

I would go with the logs first.

Best Regards,
Strahil Nikolov

On Oct 26, 2019 20:25, Timothy Orme <to...@ancestry.com> wrote:
That's what I thought as well.  All instances seem to be responding and alive 
according to the volume status.  I also was able to run a `rebalance 
fix-layout` without any issues, so it seems that communication between the 
nodes is OK.  I also tried replacing the 10.158.10.1 brick with an entirely new 
server, since that seemed to be the common one in the logs.  Self heal ran just 
fine in that replica set.  However, it still just hangs on the removal when I 
then try to remove those bricks.

I might try a full rebalance as well, just to verify that it works.
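
Roughly this, again just a sketch with the volume name from the logs:

    gluster volume rebalance scratch fix-layout start   # this part already worked
    gluster volume rebalance scratch start              # full rebalance, to check data actually moves
    gluster volume rebalance scratch status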

Only other thing I can think to note is that I'm using SSL for both client and 
server, and maybe that's obfuscating some more important error message, but it 
would still seem odd given that other communication between the nodes is just 
fine.

Any other suggestions for things to try, or other log locations to check on?

Thanks,
Tim
________________________________
From: Strahil <hunter86...@yahoo.com>
Sent: Saturday, October 26, 2019 2:21 AM
To: Timothy Orme <to...@ancestry.com>; gluster-users <gluster-users@gluster.org>
Subject: [EXTERNAL] Re: [Gluster-users] Remove Brick Rebalance Hangs With No 
Activity


According to the logs there is some communication problem.

Check that glusterd is running everywhere and that every brick process has a pid & 
port (gluster volume status should point out any issues).
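
For example (the volume name is just the one from the logs below):

    systemctl status glusterd         # on every node
    gluster peer status               # every peer should show "Peer in Cluster (Connected)"
    gluster volume status scratch     # every brick should be Online with a PID and a port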

Best Regards,
Strahil Nikolov

On Oct 26, 2019 06:25, Timothy Orme <to...@ancestry.com> wrote:
It looks like this does eventually fail.  I'm at a bit of a loss as to what to do 
here... At this point I'm unable to remove any nodes from the cluster.  Any help is 
greatly appreciated!

Here's the log from one of the nodes

[2019-10-26 01:54:35.912284] E [rpc-clnt.c:183:call_bail] 0-scratch-client-4: 
bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x38, unique = 
0, sent = 2019-10-26 01:24:35.787361, timeout = 1800 for 10.158.10.2:49152
[2019-10-26 01:54:35.912304] E [MSGID: 114031] 
[client-rpc-fops_v2.c:1345:client4_0_inodelk_cbk] 0-scratch-client-4: remote 
operation failed [Transport endpoint is not connected]
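
For reference, the hung operation can at least be inspected or aborted with the standard remove-brick sub-commands (brick arguments omitted here):

    gluster volume remove-brick scratch <BRICK> status
    gluster volume remove-brick scratch <BRICK> stop    # aborts the pending remove-brick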
