Re: [Gluster-devel] Skipped files during rebalance
Hi,

Mostly the rebalance failures are due to network problems. Here is the log:

[2015-08-16 20:31:36.301467] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/003002002.flex lookup failed
[2015-08-16 20:31:36.921405] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/003004005.flex lookup failed
[2015-08-16 20:31:36.921591] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/006004004.flex lookup failed
[2015-08-16 20:31:36.921770] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/005004007.flex lookup failed
[2015-08-16 20:31:37.577758] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/007004005.flex lookup failed
[2015-08-16 20:34:12.387425] E [socket.c:2332:socket_connect_finish] 0-live-client-4: connection to 192.168.123.106:24007 failed (Connection refused)
[2015-08-16 20:34:12.392820] E [socket.c:2332:socket_connect_finish] 0-live-client-5: connection to 192.168.123.106:24007 failed (Connection refused)
[2015-08-16 20:34:12.398023] E [socket.c:2332:socket_connect_finish] 0-live-client-0: connection to 192.168.123.104:24007 failed (Connection refused)
[2015-08-16 20:34:12.402904] E [socket.c:2332:socket_connect_finish] 0-live-client-2: connection to 192.168.123.104:24007 failed (Connection refused)
[2015-08-16 20:34:12.407464] E [socket.c:2332:socket_connect_finish] 0-live-client-3: connection to 192.168.123.106:24007 failed (Connection refused)
[2015-08-16 20:34:12.412249] E [socket.c:2332:socket_connect_finish] 0-live-client-1: connection to 192.168.123.104:24007 failed (Connection refused)
[2015-08-16 20:34:12.416621] E [socket.c:2332:socket_connect_finish] 0-live-client-6: connection to 192.168.123.105:24007 failed (Connection refused)
[2015-08-16 20:34:12.420906] E [socket.c:2332:socket_connect_finish] 0-live-client-8: connection to 192.168.123.105:24007 failed (Connection refused)
[2015-08-16 20:34:12.425066] E [socket.c:2332:socket_connect_finish] 0-live-client-7: connection to 192.168.123.105:24007 failed (Connection refused)
[2015-08-16 20:34:17.479925] E [socket.c:2332:socket_connect_finish] 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2015-08-16 20:36:23.788206] E [MSGID: 101075] [common-utils.c:314:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2015-08-16 20:36:23.788286] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-4: DNS resolution failed on host stor106
[2015-08-16 20:36:23.788387] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-5: DNS resolution failed on host stor106
[2015-08-16 20:36:23.788918] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-0: DNS resolution failed on host stor104
[2015-08-16 20:36:23.789233] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-2: DNS resolution failed on host stor104
[2015-08-16 20:36:23.789295] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-3: DNS resolution failed on host stor106

For the high mem-usage part I will try to run rebalance and analyze. In the meantime it would be helpful if you could take a state dump of the rebalance process when it is using high RAM. Here are the steps to take the state dump:

1. Find your state-dump destination: run gluster --print-statedumpdir. The state dump will be stored in this location.

2.
When you see a rebalance process on any of the servers using high memory, issue the following command:

kill -USR1 pid-of-rebalance-process

(ps aux | grep rebalance should give the rebalance process pid.)

The state dump should give some hint about the high mem-usage.

Thanks,
Susant
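The pid lookup in step 2 can be scripted. A minimal sketch of a hypothetical helper (not part of Gluster itself; assumes a standard `ps aux`-style listing):

```shell
# rebalance_pid: read a `ps aux`-style process listing on stdin and print
# the pid (second column) of the first line mentioning "rebalance",
# skipping any grep/awk process doing the searching.
rebalance_pid() {
  awk '/rebalance/ && !/awk|grep/ { print $2; exit }'
}

# Intended use on a storage node (writes a dump into the directory
# reported by `gluster --print-statedumpdir`):
#   ps aux | rebalance_pid | xargs -r kill -USR1
```

`xargs -r` makes the kill a no-op when no rebalance process is found.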
Re: [Gluster-devel] Skipped files during rebalance
Thanks Christophe for the details. Will get back to you with the analysis.

Regards,
Susant

----- Original Message -----
From: Christophe TREFOIS christophe.tref...@uni.lu
To: Susant Palai spa...@redhat.com
Cc: Raghavendra Gowdappa rgowd...@redhat.com, Nithya Balachandran nbala...@redhat.com, Shyamsundar Ranganathan srang...@redhat.com, Mohammed Rafi K C rkavu...@redhat.com, Gluster Devel gluster-devel@gluster.org
Sent: Friday, 21 August, 2015 12:39:05 AM
Subject: Re: [Gluster-devel] Skipped files during rebalance

Dear Susant,

The rebalance failed again and also had (in my opinion) excessive RAM usage. Please find a very detailed list below.

All logs: http://wikisend.com/download/651948/allstores.tar.gz

Thank you for letting me know how I can successfully complete the rebalance process. The fedora pastes are the output of top on each node at that time (more or less).

Please let me know if you need more information,

Best,

—— Start of mem info

# After reboot, before starting glusterd
[root@highlander ~]# pdsh -g live 'free -m'
stor106:        total    used    free  shared  buff/cache  available
stor106: Mem:  193249    2208  190825       9         215     190772
stor106: Swap:      0       0       0
stor105:        total    used    free  shared  buff/cache  available
stor105: Mem:  193248    2275  190738       9         234     190681
stor105: Swap:      0       0       0
stor104:        total    used    free  shared  buff/cache  available
stor104: Mem:  193249    2221  190811       9         216     190757
stor104: Swap:      0       0       0
[root@highlander ~]#

# Gluster Info
[root@stor106 glusterfs]# gluster volume info
Volume Name: live
Type: Distribute
Volume ID: 1328637d-7730-4627-8945-bbe43626d527
Status: Started
Number of Bricks: 9
Transport-type: tcp
Bricks:
Brick1: stor104:/zfs/brick0/brick
Brick2: stor104:/zfs/brick1/brick
Brick3: stor104:/zfs/brick2/brick
Brick4: stor106:/zfs/brick0/brick
Brick5: stor106:/zfs/brick1/brick
Brick6: stor106:/zfs/brick2/brick
Brick7: stor105:/zfs/brick0/brick
Brick8: stor105:/zfs/brick1/brick
Brick9: stor105:/zfs/brick2/brick
Options Reconfigured:
nfs.disable: true
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.write-behind-window-size: 4MB
performance.io-thread-count: 32
performance.client-io-threads: on
performance.cache-size: 1GB
performance.cache-refresh-timeout: 60
performance.cache-max-file-size: 4MB
cluster.data-self-heal-algorithm: full
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
cluster.min-free-disk: 1%
server.allow-insecure: on

# Starting glusterd
[root@highlander ~]# pdsh -g live 'systemctl start glusterd'
[root@highlander ~]# pdsh -g live 'free -m'
stor106:        total    used    free  shared  buff/cache  available
stor106: Mem:  193249    2290  190569       9         389     190587
stor106: Swap:      0       0       0
stor104:        total    used    free  shared  buff/cache  available
stor104: Mem:  193249    2297  190557       9         394     190571
stor104: Swap:      0       0       0
stor105:        total    used    free  shared  buff/cache  available
stor105: Mem:  193248    2286  190554       9         407     190595
stor105: Swap:      0       0       0
[root@highlander ~]# systemctl start glusterd
[root@highlander ~]# gluster volume start live
volume start: live: success
[root@highlander ~]# gluster volume status
Status of volume: live
Gluster process                    TCP Port  RDMA Port  Online  Pid
Brick stor104:/zfs/brick0/brick    49164     0          Y       5945
Brick stor104:/zfs/brick1/brick    49165     0          Y       5963
Brick stor104:/zfs/brick2/brick    49166     0          Y       5981
Brick stor106:/zfs/brick0/brick    49158     0          Y       5256
Brick stor106:/zfs/brick1/brick    49159     0          Y       5274
Brick stor106:/zfs/brick2/brick    49160     0          Y       5292
Brick stor105:/zfs/brick0/brick    49155     0          Y       5284
Brick stor105:/zfs/brick1/brick    49156     0          Y       5302
Brick stor105:/zfs/brick2/brick    49157     0          Y       5320
NFS Server on localhost            N/A       N/A        N       N/A
NFS Server on 192.168.123.106      N/A       N/A        N       N/A
NFS Server on stor105              N/A       N/A        N       N/A
NFS Server on 192.168.123.104      N/A       N/A        N       N/A
Task Status of Volume live
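As an aside, the Online column of a status listing like the one above can be checked mechanically. A minimal sketch of a hypothetical helper (not a Gluster tool; assumes the row layout shown, ending in the Online flag and the pid):

```shell
# check_bricks_online: read `gluster volume status` output on stdin and
# print each brick whose Online column (second to last field) is not "Y".
# Assumes rows shaped "Brick <host:path> <tcp-port> <rdma-port> <online> <pid>".
check_bricks_online() {
  awk '$1 == "Brick" && $(NF - 1) != "Y" { print $2 }'
}
```

An empty result means every brick reports online; any printed brick needs attention before retrying a rebalance.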
Re: [Gluster-devel] Skipped files during rebalance
192.168.123.104    748812     4.4TB     4160456    1311156772       failed   63114.00
192.168.123.106    1187917    3.3TB     6021931    21625  1209503   failed   75243.00
stor105            0          0Bytes    244043116  196              failed   69658.00
volume rebalance: live: success

Dr Christophe Trefois, Dipl.-Ing.
Technical Specialist / Post-Doc
UNIVERSITÉ DU LUXEMBOURG
LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
Campus Belval | House of Biomedicine
6, avenue du Swing
L-4367 Belvaux
T: +352 46 66 44 6124
F: +352 46 66 44 6949
http://www.uni.lu/lcsb

This message is confidential and may contain privileged information. It is intended for the named recipient only. If you receive it in error please notify me and permanently delete the original message and any copies.

On 19 Aug 2015, at 08:14, Susant Palai spa...@redhat.com wrote:

Comments inline.
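The failed rows of a rebalance status listing like the one earlier in the thread can be picked out with a small filter. A minimal sketch of a hypothetical helper (assumes the usual status row, with the status field second to last before the run time):

```shell
# failed_rebalance_nodes: read `gluster volume rebalance <vol> status` data
# rows on stdin and print the node (first column) of every row whose
# status field (second to last) reads "failed".
failed_rebalance_nodes() {
  awk '$(NF - 1) == "failed" { print $1 }'
}
```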
Re: [Gluster-devel] Skipped files during rebalance
Comments inline.

----- Original Message -----
From: Christophe TREFOIS christophe.tref...@uni.lu
To: Susant Palai spa...@redhat.com
Cc: Raghavendra Gowdappa rgowd...@redhat.com, Nithya Balachandran nbala...@redhat.com, Shyamsundar Ranganathan srang...@redhat.com, Mohammed Rafi K C rkavu...@redhat.com, Gluster Devel gluster-devel@gluster.org
Sent: Tuesday, August 18, 2015 8:08:41 PM
Subject: Re: [Gluster-devel] Skipped files during rebalance

Hi Susant,

Thank you for the response.

On 18 Aug 2015, at 10:45, Susant Palai spa...@redhat.com wrote:

Hi Christophe,

Need some info regarding the high mem-usage.

1. Top output: To see whether any other process is eating up memory.

I will be interested to know the memory usage of all the gluster processes relating to the high mem-usage. These processes include glusterfsd, glusterd, gluster, any mount process (glusterfs), and rebalance (glusterfs).

2. Gluster volume info

[root@highlander ~]# gluster volume info
Volume Name: live
Type: Distribute
Volume ID: 1328637d-7730-4627-8945-bbe43626d527
Status: Started
Number of Bricks: 9
Transport-type: tcp
Bricks:
Brick1: stor104:/zfs/brick0/brick
Brick2: stor104:/zfs/brick1/brick
Brick3: stor104:/zfs/brick2/brick
Brick4: stor106:/zfs/brick0/brick
Brick5: stor106:/zfs/brick1/brick
Brick6: stor106:/zfs/brick2/brick
Brick7: stor105:/zfs/brick0/brick
Brick8: stor105:/zfs/brick1/brick
Brick9: stor105:/zfs/brick2/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
server.allow-insecure: on
cluster.min-free-disk: 1%
diagnostics.brick-log-level: ERROR
diagnostics.client-log-level: ERROR
cluster.data-self-heal-algorithm: full
performance.cache-max-file-size: 4MB
performance.cache-refresh-timeout: 60
performance.cache-size: 1GB
performance.client-io-threads: on
performance.io-thread-count: 32
performance.write-behind-window-size: 4MB

3. Is rebalance process still running? If yes can you point to specific mem usage by rebalance process? The high mem-usage was seen during rebalance or even post rebalance?

I would like to restart the rebalance process since it failed… But I can’t, as the volume cannot be stopped (I wanted to reboot the servers to have clean testing grounds). Here are the logs from the three nodes: http://paste.fedoraproject.org/256183/43989079

Maybe you could help me figure out how to stop the volume? This is what happens:

[root@highlander ~]# gluster volume rebalance live stop
volume rebalance: live: failed: Rebalance not started.

Requesting glusterd team to give input.

[root@highlander ~]# ssh stor105 gluster volume rebalance live stop
volume rebalance: live: failed: Rebalance not started.
[root@highlander ~]# ssh stor104 gluster volume rebalance live stop
volume rebalance: live: failed: Rebalance not started.
[root@highlander ~]# ssh stor106 gluster volume rebalance live stop
volume rebalance: live: failed: Rebalance not started.
[root@highlander ~]# gluster volume rebalance live stop
volume rebalance: live: failed: Rebalance not started.
[root@highlander ~]# gluster volume stop live
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: live: failed: Staging failed on stor106. Error: rebalance session is in progress for the volume 'live'
Staging failed on stor104. Error: rebalance session is in progress for the volume 'live'

Can you run [ps aux | grep rebalance] on all the servers and post here? Just want to check whether rebalance is really running or not. Again requesting glusterd team to give inputs.

4. Gluster version

[root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster'
stor104: glusterfs-api-3.7.3-1.el7.x86_64
stor104: glusterfs-server-3.7.3-1.el7.x86_64
stor104: glusterfs-libs-3.7.3-1.el7.x86_64
stor104: glusterfs-3.7.3-1.el7.x86_64
stor104: glusterfs-fuse-3.7.3-1.el7.x86_64
stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64
stor104: glusterfs-cli-3.7.3-1.el7.x86_64
stor105: glusterfs-3.7.3-1.el7.x86_64
stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64
stor105: glusterfs-api-3.7.3-1.el7.x86_64
stor105: glusterfs-cli-3.7.3-1.el7.x86_64
stor105: glusterfs-server-3.7.3-1.el7.x86_64
stor105: glusterfs-libs-3.7.3-1.el7.x86_64
stor105: glusterfs-fuse-3.7.3-1.el7.x86_64
stor106: glusterfs-libs-3.7.3-1.el7.x86_64
stor106: glusterfs-fuse-3.7.3-1.el7.x86_64
stor106: glusterfs-client-xlators-3.7.3-1.el7.x86_64
stor106: glusterfs-api-3.7.3-1.el7.x86_64
stor106: glusterfs-cli-3.7.3-1.el7.x86_64
stor106: glusterfs-server-3.7.3-1.el7.x86_64
stor106: glusterfs-3.7.3-1.el7.x86_64

Will ask for more information in case needed.

Regards,
Susant
Re: [Gluster-devel] Skipped files during rebalance
Hi Christophe,

Forgot to ask you to post the rebalance and glusterd logs.

Regards,
Susant

----- Original Message -----
From: Susant Palai spa...@redhat.com
To: Christophe TREFOIS christophe.tref...@uni.lu
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Wednesday, August 19, 2015 11:44:35 AM
Subject: Re: [Gluster-devel] Skipped files during rebalance

Comments inline.
Re: [Gluster-devel] Skipped files during rebalance
Dear Susant,

Apparently the glusterd process was stuck in a strange state, so we restarted the glusterd process on stor106. This allowed us to stop the volume and reboot.

I will start a new rebalance now, and will gather the information you asked for during the rebalance operation. I think it makes more sense to post the logs of this new rebalance operation.

Kind regards,
— Christophe

On 19 Aug 2015, at 08:49, Susant Palai spa...@redhat.com wrote:

Hi Christophe,

Forgot to ask you to post the rebalance and glusterd logs.

Regards,
Susant
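The per-node memory evidence being gathered here can be condensed by reducing each free(1) Mem: line to a used-memory percentage. A minimal sketch of a hypothetical helper (tolerates the "host: " prefix that pdsh adds):

```shell
# used_mem_pct: read `free -m` output (optionally pdsh-prefixed) on stdin
# and, for each "Mem:" line, print used memory as an integer percentage
# of total. Fields after "Mem:" are total then used, as in free(1).
used_mem_pct() {
  awk '{ for (i = 1; i < NF; i++)
           if ($i == "Mem:") printf "%d\n", ($(i + 2) * 100) / $(i + 1) }'
}

# e.g.: pdsh -g live 'free -m' | used_mem_pct
```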
Re: [Gluster-devel] Skipped files during rebalance
Hi Christophe,

Need some info regarding the high mem-usage.

1. Top output: To see whether any other process is eating up memory.
2. Gluster volume info
3. Is rebalance process still running? If yes can you point to specific mem usage by rebalance process? The high mem-usage was seen during rebalance or even post rebalance?
4. Gluster version

Will ask for more information in case needed.

Regards,
Susant

----- Original Message -----
From: Christophe TREFOIS christophe.tref...@uni.lu
To: Raghavendra Gowdappa rgowd...@redhat.com, Nithya Balachandran nbala...@redhat.com, Susant Palai spa...@redhat.com, Shyamsundar Ranganathan srang...@redhat.com
Cc: Mohammed Rafi K C rkavu...@redhat.com
Sent: Monday, 17 August, 2015 7:03:20 PM
Subject: Fwd: [Gluster-devel] Skipped files during rebalance

Hi DHT team,

This email somehow didn’t get forwarded to you. In addition to my problem described below, here is one example of free memory after everything failed:

[root@highlander ~]# pdsh -g live 'free -m'
stor106:        total    used    free  shared  buff/cache  available
stor106: Mem:  193249  124784    1347       9       67118      12769
stor106: Swap:      0       0       0
stor104:        total    used    free  shared  buff/cache  available
stor104: Mem:  193249  107617   31323       9       54308      42752
stor104: Swap:      0       0       0
stor105:        total    used    free  shared  buff/cache  available
stor105: Mem:  193248  141804    6736       9       44707       9713
stor105: Swap:      0       0       0

So after the failed operation, there’s almost no memory free, and it is also not freed up.

Thank you for pointing me in the right direction,

Kind regards,
— Christophe

Begin forwarded message:
From: Christophe TREFOIS christophe.tref...@uni.lu
Subject: Re: [Gluster-devel] Skipped files during rebalance
Date: 17 Aug 2015 11:54:32 CEST
To: Mohammed Rafi K C rkavu...@redhat.com
Cc: gluster-devel@gluster.org

Dear Rafi,

Thanks for submitting a patch.

@DHT, I have two additional questions / problems:

1.
When doing a rebalance (with data) RAM consumption on the nodes goes dramatically high, eg out of 196 GB available per node, RAM usage would fill up to 195.6 GB. This seems quite excessive and strange to me. 2. As you can see, the rebalance (with data) failed as one endpoint becomes unconnected (even though it still is connected). I’m thinking this could be due to the high RAM usage? Thank you for your help, — Christophe Dr Christophe Trefois, Dipl.-Ing. Technical Specialist / Post-Doc UNIVERSITÉ DU LUXEMBOURG LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE Campus Belval | House of Biomedicine 6, avenue du Swing L-4367 Belvaux T: +352 46 66 44 6124 F: +352 46 66 44 6949 http://www.uni.lu/lcsb [Facebook]https://www.facebook.com/trefex [Twitter] https://twitter.com/Trefex [Google Plus] https://plus.google.com/+ChristopheTrefois/ [Linkedin] https://www.linkedin.com/in/trefoischristophe [skype] http://skype:Trefex?call This message is confidential and may contain privileged information. It is intended for the named recipient only. If you receive it in error please notify me and permanently delete the original message and any copies. On 17 Aug 2015, at 11:27, Mohammed Rafi K C rkavu...@redhat.commailto:rkavu...@redhat.com wrote: On 08/17/2015 01:58 AM, Christophe TREFOIS wrote: Dear all, I have successfully added a new node to our setup, and finally managed to get a successful fix-layout run as well with no errors. Now, as per the documentation, I started a gluster volume rebalance live start task and I see many skipped files. The error log contains then entires as follows for each skipped file. 
[2015-08-16 20:23:30.591161] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004010008.flex lookup failed
[2015-08-16 20:23:30.768391] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007005003.flex lookup failed
[2015-08-16 20:23:30.804811] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006005009.flex lookup failed
[2015-08-16 20:23:30.805201] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005006011.flex lookup failed
[2015-08-16 20:23:30.880037] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005009012.flex lookup failed
[2015-08-16 20:23:31.038236] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/003008007.flex lookup failed
[2015-08-16 20:23:31.259762] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004008006.flex lookup failed
[2015-08-16 20:23:31.333764] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007008001.flex lookup failed
[2015-08-16 20:23:31.340190] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006007004.flex lookup failed

Update: one of the rebalance tasks now failed. @Rafi, I got the same error as Friday, except this time with data.

Packets carrying the ping request could be waiting in the queue during the whole time-out period, because of the heavy traffic in the network. I have sent a patch for this. You can track the status here: http://review.gluster.org/11935

[2015-08-16 20:24:34.533167] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0: server 192.168.123.104:49164 has not responded in the last 42 seconds, disconnecting.
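The "has not responded in the last 42 seconds" line above is the ping timer expiring. To see at a glance which brick connections hit the timeout (and how often), the client/rebalance log can be scanned with a small helper; this is only a sketch over the log format shown above, and the helper name `ping_timeouts` and the example log path are assumptions, not part of GlusterFS:

```shell
# ping_timeouts: count ping-timer expiries per server address in a
# glusterfs log read from stdin (format as in the excerpt above).
ping_timeouts() {
    awk '/rpc_clnt_ping_timer_expired/ {
        # The field right after the literal word "server" is host:port.
        for (i = 1; i <= NF; i++)
            if ($i == "server") counts[$(i + 1)]++
    }
    END { for (s in counts) print counts[s], s }'
}

# Example (log path is an assumption):
#   ping_timeouts < /var/log/glusterfs/live-rebalance.log
```

Repeated expiries against a single brick point at that node (or its network path) rather than at the volume as a whole.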
[2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2015-08-16 20:23:51.305640 (xid=0x5dd4da)
[2015-08-16 20:24:34.533672] E [MSGID: 114031] [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote operation failed [Transport endpoint is not connected]
[2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.303938 (xid=0x5dd4d7)
[2015-08-16 20:24:34.534347] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007008007.flex: failed to migrate data
[2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
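Since skipped and failed files need to be re-checked after the run, it helps to pull a deduplicated list of paths out of the "Migrate file failed" entries above. A sketch over the two entry shapes shown in this thread ("... lookup failed" and "...: failed to migrate data"); the helper name `migrate_failures` and the example log path are assumptions:

```shell
# migrate_failures: print the unique file paths named in "Migrate file
# failed" entries of a rebalance log read from stdin.
migrate_failures() {
    # Keep only matching lines, strip the log prefix, then strip the
    # trailing failure reason, and deduplicate.
    sed -n 's/.*Migrate file failed: *//p' \
        | sed -e 's/ lookup failed$//' -e 's/: failed to migrate data$//' \
        | sort -u
}

# Example (log path is an assumption):
#   migrate_failures < /var/log/glusterfs/live-rebalance.log
```

The resulting list can be cross-checked on the bricks (or via a stat on the mount) once the memory issue and the disconnects are resolved.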