Re: [Gluster-users] How to diagnose volume rebalance failure?

PuYun Mon, 14 Dec 2015 16:21:50 -0800

Hi,

Another odd piece of information that may be useful: the failed task had actually 
been running for hours, but the status command reports a run time of only 3621 seconds.


============== shell ==============
[root@d001 glusterfs]# gluster volume rebalance FastVol status
     Node  Rebalanced-files      size   scanned  failures   skipped    status  run time in secs
---------  ----------------  --------  --------  --------  --------  --------  ----------------
localhost                 0    0Bytes    952767         0         0    failed           3621.00
volume rebalance: FastVol: success:
================================

As you can see, I started the rebalance task only once. 
======== cmd_history.log-20151215 ======
[2015-12-14 12:50:41.443937]  : volume start FastVol : SUCCESS
[2015-12-14 12:55:01.367519]  : volume rebalance FastVol start : SUCCESS
[2015-12-14 13:55:22.132199]  : volume rebalance FastVol status : SUCCESS
[2015-12-14 23:04:01.780885]  : volume rebalance FastVol status : SUCCESS
[2015-12-14 23:35:56.708077]  : volume rebalance FastVol status : SUCCESS
=================================

Because the task failed at [2015-12-14 21:46:54.179xx], something might have gone 
wrong 3621 seconds earlier, that is, at [2015-12-14 20:46:33.179xx]. I checked the 
logs around that time and found nothing special. 
========== FastVol-rebalance.log ========
[2015-12-14 20:46:33.166748] I [dht-rebalance.c:1010:dht_migrate_file] 
0-FastVol-dht: 
/for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/userPoint: 
attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.171009] I [MSGID: 109022] 
[dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of 
/for_ybest_fsdir/user/Weixin.oClDcjjJ/t2/n1/VSXZlm65KjfhbgoM/flag_finished from 
subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.174851] I [dht-rebalance.c:1010:dht_migrate_file] 
0-FastVol-dht: 
/for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_origin.jpg:
 attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.181448] I [MSGID: 109022] 
[dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of 
/for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/userPoint from 
subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.184996] I [dht-rebalance.c:1010:dht_migrate_file] 
0-FastVol-dht: 
/for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_small.jpg:
 attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.191681] I [MSGID: 109022] 
[dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of 
/for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_origin.jpg
 from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.195396] I [dht-rebalance.c:1010:dht_migrate_file] 
0-FastVol-dht: 
/for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_big_square.jpg:
 attempting to move from FastVol-client-0 to FastVol-client-1
==================================
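
The 3621-second arithmetic above can be double-checked with a few lines of 
Python. This is only a sketch: the timestamps are copied from this message with 
the sub-second digits dropped, and the variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
run_time = timedelta(seconds=3621)  # "run time in secs" from the status output

# Failure time seen in the brick and glusterd logs.
failed_at = datetime.strptime("2015-12-14 21:46:54", FMT)
# "volume rebalance FastVol start" entry in cmd_history.log.
started_at = datetime.strptime("2015-12-14 12:55:01", FMT)

print(failed_at - run_time)    # 2015-12-14 20:46:33
print(started_at + run_time)   # 2015-12-14 13:55:22
print(failed_at - started_at)  # 8:51:53
```

The second value, 13:55:22, happens to match the first "volume rebalance FastVol 
status" entry in cmd_history.log. Whether that is related to the reported run 
time is a guess from the numbers; nothing in the logs confirms it.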

Also, there are no log entries around [2015-12-14 20:46:33.179xx] in 
mnt-b1-brick.log, mnt-c1-brick.log or etc-glusterfs-glusterd.vol.log.
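
For anyone repeating this check, a small Python filter can pull out the entries 
near a given moment from any of these logs. It is a sketch that assumes each 
entry starts with a "[YYYY-MM-DD HH:MM:SS.ffffff]" stamp, as in the excerpts 
here; the function name and the abbreviated sample lines are illustrative.

```python
import re
from datetime import datetime, timedelta

# Leading "[YYYY-MM-DD HH:MM:SS.ffffff]" stamp, as seen in the excerpts.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]")

def entries_near(lines, center, seconds=1.0):
    """Yield log lines stamped within `seconds` of `center`."""
    window = timedelta(seconds=seconds)
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # wrapped continuation line or unrelated text
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if abs(ts - center) <= window:
            yield line.rstrip("\n")

# Two entries abbreviated from the logs quoted in this thread.
sample = [
    "[2015-12-14 20:46:33.166748] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: attempting to move",
    "[2015-12-14 21:46:54.179662] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-FastVol-server: disconnecting",
]
center = datetime(2015, 12, 14, 20, 46, 33, 179000)
hits = list(entries_near(sample, center))
print(len(hits))  # 1
```

Passing an open file object instead of `sample` scans a log file line by line 
without loading it into memory.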



PuYun
 
From: PuYun
Date: 2015-12-15 07:30
To: gluster-users
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
Hi,

It failed again. I can see disconnections in the logs, but no further details.

=========== mnt-b1-brick.log ===========
[2015-12-14 21:46:54.179662] I [MSGID: 115036] [server.c:552:server_rpc_notify] 
0-FastVol-server: disconnecting connection from 
d001-1799-2015/12/14-12:54:56:347561-FastVol-client-1-0-0
[2015-12-14 21:46:54.181764] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /
[2015-12-14 21:46:54.181815] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir
[2015-12-14 21:46:54.181856] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user
[2015-12-14 21:46:54.181918] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg
[2015-12-14 21:46:54.181961] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user/ji/ay/an
[2015-12-14 21:46:54.182003] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif
[2015-12-14 21:46:54.182036] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user/ji
[2015-12-14 21:46:54.182076] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user/ji/ay
[2015-12-14 21:46:54.182110] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user/ji/ay/an/ling00
[2015-12-14 21:46:54.182203] I [MSGID: 101055] [client_t.c:419:gf_client_unref] 
0-FastVol-server: Shutting down connection 
d001-1799-2015/12/14-12:54:56:347561-FastVol-client-1-0-0
======================================

============== mnt-c1-brick.log ============
[2015-12-14 21:46:54.179597] I [MSGID: 115036] [server.c:552:server_rpc_notify] 
0-FastVol-server: disconnecting connection from 
d001-1799-2015/12/14-12:54:56:347561-FastVol-client-0-0-0
[2015-12-14 21:46:54.180428] W [inodelk.c:404:pl_inodelk_log_cleanup] 
0-FastVol-server: releasing lock on 5e300cdb-7298-44c0-90eb-5b50018daed6 held 
by {client=0x7effc810cce0, pid=-3 lk-owner=fdffffff}
[2015-12-14 21:46:54.180454] W [inodelk.c:404:pl_inodelk_log_cleanup] 
0-FastVol-server: releasing lock on 3c9a1cd5-84c8-4967-98d5-e75a402b1f74 held 
by {client=0x7effc810cce0, pid=-3 lk-owner=fdffffff}
[2015-12-14 21:46:54.180483] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /
[2015-12-14 21:46:54.180525] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir
[2015-12-14 21:46:54.180570] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user
[2015-12-14 21:46:54.180604] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg
[2015-12-14 21:46:54.180634] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user/ji
[2015-12-14 21:46:54.180678] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user/ji/ay
[2015-12-14 21:46:54.180725] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user/ji/ay/an/ling00
[2015-12-14 21:46:54.180779] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif
[2015-12-14 21:46:54.180820] I [MSGID: 115013] 
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on 
/for_ybest_fsdir/user/ji/ay/an
[2015-12-14 21:46:54.180859] I [MSGID: 101055] [client_t.c:419:gf_client_unref] 
0-FastVol-server: Shutting down connection 
d001-1799-2015/12/14-12:54:56:347561-FastVol-client-0-0-0
======================================


============== etc-glusterfs-glusterd.vol.log ==========
[2015-12-14 21:46:54.179819] W [socket.c:588:__socket_rwv] 0-management: readv 
on /var/run/gluster/gluster-rebalance-dbee250a-e3fe-4448-b905-b76c5ba80b25.sock 
failed (No data available)
[2015-12-14 21:46:54.209586] I [MSGID: 106007] 
[glusterd-rebalance.c:162:__glusterd_defrag_notify] 0-management: Rebalance 
process for volume FastVol has disconnected.
[2015-12-14 21:46:54.209627] I [MSGID: 101053] 
[mem-pool.c:616:mem_pool_destroy] 0-management: size=588 max=1 total=1
[2015-12-14 21:46:54.209640] I [MSGID: 101053] 
[mem-pool.c:616:mem_pool_destroy] 0-management: size=124 max=1 total=1
=============================================


================== FastVol-rebalance.log ============
...
[2015-12-14 21:46:53.423719] I [MSGID: 109022] 
[dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of 
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/07.jpg from subvolume 
FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.423976] I [MSGID: 109022] 
[dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of 
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/126724/1d0ca0de913c4e50f85f2b29694e4e64.html
 from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436268] I [dht-rebalance.c:1010:dht_migrate_file] 
0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg: 
attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436597] I [dht-rebalance.c:1010:dht_migrate_file] 
0-FastVol-dht: 
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif:
 attempting to move from FastVol-client-0 to FastVol-client-1
<EOF>
==============================================



PuYun
 
From: PuYun
Date: 2015-12-14 21:51
To: gluster-users
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
Hi,

Thank you for your reply. I don't know how to send you the rebalance log file, 
which is huge (about 2 GB). 
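
A log of that size does not have to be sent whole. As a rough sketch, the 
Python below streams the lines and keeps only entries whose severity letter is 
not I (info). The position of the severity letter is inferred from the excerpts 
in this thread, which show only I and W; E and C are assumed to follow the same 
layout. The function name and the abbreviated sample lines are illustrative.

```python
import re

# "[timestamp] <level> ..." where <level> is a single capital letter.
ENTRY = re.compile(r"^\[[^\]]+\] ([A-Z]) ")

def non_info_entries(lines, levels=("W", "E", "C")):
    """Yield only the entries whose severity letter is in `levels`."""
    for line in lines:
        m = ENTRY.match(line)
        if m and m.group(1) in levels:
            yield line.rstrip("\n")

# Two entries abbreviated from the brick log quoted in this thread.
sample = [
    "[2015-12-14 21:46:54.179597] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-FastVol-server: disconnecting connection",
    "[2015-12-14 21:46:54.180428] W [inodelk.c:404:pl_inodelk_log_cleanup] 0-FastVol-server: releasing lock",
]
kept = list(non_info_entries(sample))
print(len(kept))  # 1
```

Run against the real file (an open file object can be passed in place of 
`sample`), the output is likely to be far smaller than the original log.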

However, I might have found the reason why the task failed. My gluster 
server has only 2 CPU cores and carries 2 SSD bricks. When the rebalance task 
began, the top 3 processes were at 70%~80%, 30%~40% and 30%~40% CPU usage, and 
all others were below 1%. But after a while, both CPU cores were fully used up, 
and I could not even log in until the rebalance task failed. 

It seems 2 bricks require at least 4 CPU cores. I have now upgraded the virtual 
server to 8 CPU cores and started the rebalance task again. Everything is going 
well so far.

I will report again when the current task completes or fails.



PuYun
 
From: Nithya Balachandran
Date: 2015-12-14 18:57
To: PuYun
CC: gluster-users
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
Hi,
 
Can you send us the rebalance log?
 
Regards,
Nithya
 
----- Original Message -----
> From: "PuYun" <clou...@126.com>
> To: "gluster-users" <gluster-users@gluster.org>
> Sent: Monday, December 14, 2015 11:33:40 AM
> Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
> 
> Here is the tail of the failed rebalance log, any clue?
> 
> [2015-12-13 21:30:31.527493] I [dht-rebalance.c:2340:gf_defrag_process_dir]
> 0-FastVol-dht: Migration operation on dir
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/5F/1MsH5--BcoGRAJPI took 20.95 secs
> [2015-12-13 21:30:31.528704] I [dht-rebalance.c:1010:dht_migrate_file]
> 0-FastVol-dht:
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Kn/hM/oHcPMp4hKq5Tq2ZQ/flag_finished:
> attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:30:31.543901] I [dht-rebalance.c:1010:dht_migrate_file]
> 0-FastVol-dht:
> /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/userPoint:
> attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:31:37.210496] I [MSGID: 109081]
> [dht-common.c:3780:dht_setxattr] 0-FastVol-dht: fixing the layout of
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q
> [2015-12-13 21:31:37.722825] I [MSGID: 109045]
> [dht-selfheal.c:1508:dht_fix_layout_of_directory] 0-FastVol-dht: subvolume 0
> (FastVol-client-0): 1032124 chunks
> [2015-12-13 21:31:37.722837] I [MSGID: 109045]
> [dht-selfheal.c:1508:dht_fix_layout_of_directory] 0-FastVol-dht: subvolume 1
> (FastVol-client-1): 1032124 chunks
> [2015-12-13 21:33:03.955539] I [MSGID: 109064]
> [dht-layout.c:808:dht_layout_dir_mismatch] 0-FastVol-dht: subvol:
> FastVol-client-0; inode layout - 0 - 2146817919 - 1; disk layout -
> 2146817920 - 4294967295 - 1
> [2015-12-13 21:33:04.069859] I [MSGID: 109018]
> [dht-common.c:806:dht_revalidate_cbk] 0-FastVol-dht: Mismatching layouts for
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q, gfid =
> f38c4ed2-a26a-4d83-adfd-6b0331831738
> [2015-12-13 21:33:04.118800] I [MSGID: 109064]
> [dht-layout.c:808:dht_layout_dir_mismatch] 0-FastVol-dht: subvol:
> FastVol-client-1; inode layout - 2146817920 - 4294967295 - 1; disk layout -
> 0 - 2146817919 - 1
> [2015-12-13 21:33:19.979507] I [MSGID: 109022]
> [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration
> of
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Kn/hM/oHcPMp4hKq5Tq2ZQ/flag_finished
> from subvolume FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:19.979459] I [MSGID: 109022]
> [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration
> of /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/userPoint
> from subvolume FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:25.543941] I [dht-rebalance.c:1010:dht_migrate_file]
> 0-FastVol-dht:
> /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/portrait_origin.jpg:
> attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:25.962547] I [dht-rebalance.c:1010:dht_migrate_file]
> 0-FastVol-dht:
> /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/portrait_small.jpg:
> attempting to move from FastVol-client-0 to FastVol-client-1
> 
> 
> Cloudor
> 
> 
> 
> From: Sakshi Bansal
> Date: 2015-12-12 13:02
> To: 蒲云
> CC: gluster-users
> Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
> In the rebalance log file you can check the file/directory for which the
> rebalance has failed. It may also mention the fop for which the failure
> happened.
> 
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users

