Re: [Gluster-devel] Skipped files during rebalance

2015-08-21 Thread Christophe TREFOIS
 in this location.
 
 2. When you see a rebalance process on any of the servers using high memory, 
 issue the following command:
    kill -USR1 <pid-of-rebalance-process>
 (ps aux | grep rebalance should give you the rebalance process pid.)
 
 The state dump should give some hint about the high mem-usage.
 
 Thanks,
 Susant
 
 - Original Message -
 From: Susant Palai spa...@redhat.com
 To: Christophe TREFOIS christophe.tref...@uni.lu
 Cc: Gluster Devel gluster-devel@gluster.org
 Sent: Friday, 21 August, 2015 3:52:07 PM
 Subject: Re: [Gluster-devel] Skipped files during rebalance
 
 Thanks Christophe for the details. Will get back to you with the analysis.
 
 Regards,
 Susant
 
 - Original Message -
 From: Christophe TREFOIS christophe.tref...@uni.lu
 To: Susant Palai spa...@redhat.com
 Cc: Raghavendra Gowdappa rgowd...@redhat.com, Nithya Balachandran 
 nbala...@redhat.com, Shyamsundar Ranganathan srang...@redhat.com, 
 Mohammed Rafi K C rkavu...@redhat.com, Gluster Devel 
 gluster-devel@gluster.org
 Sent: Friday, 21 August, 2015 12:39:05 AM
 Subject: Re: [Gluster-devel] Skipped files during rebalance
 
 Dear Susant,
 
 The rebalance failed again and also had (in my opinion) excessive RAM usage.
 
 Please find a very detailed list below.
 
 All logs:
 
 http://wikisend.com/download/651948/allstores.tar.gz
 
 Thank you for letting me know how I could successfully complete the rebalance 
 process.
 The Fedora pastes are the output of top on each node at that time (more or 
 less).
 
 Please let me know if you need more information,
 
 Best,
 
 —— Start of mem info
 
 # After reboot, before starting glusterd
 
 [root@highlander ~]# pdsh -g live 'free -m'
 stor106:              total        used        free      shared  buff/cache   available
 stor106: Mem:         193249        2208      190825           9         215      190772
 stor106: Swap:             0           0           0
 stor105:              total        used        free      shared  buff/cache   available
 stor105: Mem:         193248        2275      190738           9         234      190681
 stor105: Swap:             0           0           0
 stor104:              total        used        free      shared  buff/cache   available
 stor104: Mem:         193249        2221      190811           9         216      190757
 stor104: Swap:             0           0           0
 [root@highlander ~]#
 
 # Gluster Info
 
 [root@stor106 glusterfs]# gluster volume info
 
 Volume Name: live
 Type: Distribute
 Volume ID: 1328637d-7730-4627-8945-bbe43626d527
 Status: Started
 Number of Bricks: 9
 Transport-type: tcp
 Bricks:
 Brick1: stor104:/zfs/brick0/brick
 Brick2: stor104:/zfs/brick1/brick
 Brick3: stor104:/zfs/brick2/brick
 Brick4: stor106:/zfs/brick0/brick
 Brick5: stor106:/zfs/brick1/brick
 Brick6: stor106:/zfs/brick2/brick
 Brick7: stor105:/zfs/brick0/brick
 Brick8: stor105:/zfs/brick1/brick
 Brick9: stor105:/zfs/brick2/brick
 Options Reconfigured:
 nfs.disable: true
 diagnostics.count-fop-hits: on
 diagnostics.latency-measurement: on
 performance.write-behind-window-size: 4MB
 performance.io-thread-count: 32
 performance.client-io-threads: on
 performance.cache-size: 1GB
 performance.cache-refresh-timeout: 60
 performance.cache-max-file-size: 4MB
 cluster.data-self-heal-algorithm: full
 diagnostics.client-log-level: ERROR
 diagnostics.brick-log-level: ERROR
 cluster.min-free-disk: 1%
 server.allow-insecure: on
 
 # Starting glusterd
 
 [root@highlander ~]# pdsh -g live 'systemctl start glusterd'
 [root@highlander ~]# pdsh -g live 'free -m'
 stor106:              total        used        free      shared  buff/cache   available
 stor106: Mem:         193249        2290      190569           9         389      190587
 stor106: Swap:             0           0           0
 stor104:              total        used        free      shared  buff/cache   available
 stor104: Mem:         193249        2297      190557           9         394      190571
 stor104: Swap:             0           0           0
 stor105:              total        used        free      shared  buff/cache   available
 stor105: Mem:         193248        2286      190554           9         407      190595
 stor105: Swap:             0           0           0
 
 [root@highlander ~]# systemctl start glusterd
 [root@highlander ~]# gluster volume start live
 volume start: live: success
 [root@highlander ~]# gluster volume status
 Status of volume: live
 Gluster process TCP Port  RDMA Port  Online  Pid
 --
 Brick stor104:/zfs/brick0/brick 49164 0  Y   5945
 Brick stor104:/zfs/brick1/brick 49165 0  Y   5963
 Brick stor104:/zfs/brick2/brick 49166 0  Y   5981
 Brick stor106:/zfs/brick0/brick 49158 0  Y   5256
 Brick stor106

Re: [Gluster-devel] Skipped files during rebalance

2015-08-21 Thread Susant Palai
Hi,
 The rebalance failures appear to be mostly due to network problems.

Here is the log:

[2015-08-16 20:31:36.301467] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/PA 
27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/003002002.flex 
lookup failed
[2015-08-16 20:31:36.921405] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/PA 
27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/003004005.flex 
lookup failed
[2015-08-16 20:31:36.921591] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/PA 
27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/006004004.flex 
lookup failed
[2015-08-16 20:31:36.921770] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/PA 
27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/005004007.flex 
lookup failed
[2015-08-16 20:31:37.577758] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/PA 
27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/007004005.flex 
lookup failed
[2015-08-16 20:34:12.387425] E [socket.c:2332:socket_connect_finish] 
0-live-client-4: connection to 192.168.123.106:24007 failed (Connection refused)
[2015-08-16 20:34:12.392820] E [socket.c:2332:socket_connect_finish] 
0-live-client-5: connection to 192.168.123.106:24007 failed (Connection refused)
[2015-08-16 20:34:12.398023] E [socket.c:2332:socket_connect_finish] 
0-live-client-0: connection to 192.168.123.104:24007 failed (Connection refused)
[2015-08-16 20:34:12.402904] E [socket.c:2332:socket_connect_finish] 
0-live-client-2: connection to 192.168.123.104:24007 failed (Connection refused)
[2015-08-16 20:34:12.407464] E [socket.c:2332:socket_connect_finish] 
0-live-client-3: connection to 192.168.123.106:24007 failed (Connection refused)
[2015-08-16 20:34:12.412249] E [socket.c:2332:socket_connect_finish] 
0-live-client-1: connection to 192.168.123.104:24007 failed (Connection refused)
[2015-08-16 20:34:12.416621] E [socket.c:2332:socket_connect_finish] 
0-live-client-6: connection to 192.168.123.105:24007 failed (Connection refused)
[2015-08-16 20:34:12.420906] E [socket.c:2332:socket_connect_finish] 
0-live-client-8: connection to 192.168.123.105:24007 failed (Connection refused)
[2015-08-16 20:34:12.425066] E [socket.c:2332:socket_connect_finish] 
0-live-client-7: connection to 192.168.123.105:24007 failed (Connection refused)
[2015-08-16 20:34:17.479925] E [socket.c:2332:socket_connect_finish] 
0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2015-08-16 20:36:23.788206] E [MSGID: 101075] 
[common-utils.c:314:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or 
service not known)
[2015-08-16 20:36:23.788286] E [name.c:247:af_inet_client_get_remote_sockaddr] 
0-live-client-4: DNS resolution failed on host stor106
[2015-08-16 20:36:23.788387] E [name.c:247:af_inet_client_get_remote_sockaddr] 
0-live-client-5: DNS resolution failed on host stor106
[2015-08-16 20:36:23.788918] E [name.c:247:af_inet_client_get_remote_sockaddr] 
0-live-client-0: DNS resolution failed on host stor104
[2015-08-16 20:36:23.789233] E [name.c:247:af_inet_client_get_remote_sockaddr] 
0-live-client-2: DNS resolution failed on host stor104
[2015-08-16 20:36:23.789295] E [name.c:247:af_inet_client_get_remote_sockaddr] 
0-live-client-3: DNS resolution failed on host stor106
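
Since the errors above point at name resolution and refused connections rather
than at rebalance itself, a quick network sanity check from every node may help.
A minimal sketch, assuming the pdsh group 'live' used elsewhere in this thread:

   pdsh -g live 'getent hosts stor104 stor105 stor106'   # confirm each node resolves the brick hosts
   pdsh -g live 'gluster peer status'                     # confirm each glusterd still sees its peers connected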


For the high memory usage part, I will try to run a rebalance and analyze it. In the 
meantime it would be helpful if you could take a state dump of the rebalance 
process when it is using high RAM.

Here are the steps to take the state dump.

1. Find your state-dump destination by running gluster --print-statedumpdir. The 
state dump will be stored in this location.

2. When you see a rebalance process on any of the servers using high memory, 
issue the following command:
   kill -USR1 <pid-of-rebalance-process>
(ps aux | grep rebalance should give you the rebalance process pid.)

The state dump should give some hint about the high mem-usage.
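
Putting the two steps together, a minimal sketch (assuming a single rebalance
process per node, which is the usual case) would be:

   gluster --print-statedumpdir            # directory where the dump will be written
   ps aux | grep '[r]ebalance'             # note the pid of the rebalance glusterfs process
   kill -USR1 <pid-of-rebalance-process>   # typically writes a glusterdump.<pid>.dump.* file into that directory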

Thanks,
Susant

- Original Message -
From: Susant Palai spa...@redhat.com
To: Christophe TREFOIS christophe.tref...@uni.lu
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Friday, 21 August, 2015 3:52:07 PM
Subject: Re: [Gluster-devel] Skipped files during rebalance

Thanks Christophe for the details. Will get back to you with the analysis.

Regards,
Susant

- Original Message -
From: Christophe TREFOIS christophe.tref...@uni.lu
To: Susant Palai spa...@redhat.com
Cc: Raghavendra Gowdappa rgowd...@redhat.com, Nithya Balachandran 
nbala...@redhat.com, Shyamsundar Ranganathan srang...@redhat.com, 
Mohammed Rafi K C rkavu...@redhat.com, Gluster Devel 
gluster-devel@gluster.org
Sent: Friday, 21 August, 2015 12

Re: [Gluster-devel] Skipped files during rebalance

2015-08-21 Thread Susant Palai
Thanks Christophe for the details. Will get back to you with the analysis.

Regards,
Susant

- Original Message -
From: Christophe TREFOIS christophe.tref...@uni.lu
To: Susant Palai spa...@redhat.com
Cc: Raghavendra Gowdappa rgowd...@redhat.com, Nithya Balachandran 
nbala...@redhat.com, Shyamsundar Ranganathan srang...@redhat.com, 
Mohammed Rafi K C rkavu...@redhat.com, Gluster Devel 
gluster-devel@gluster.org
Sent: Friday, 21 August, 2015 12:39:05 AM
Subject: Re: [Gluster-devel] Skipped files during rebalance

Dear Susant,

The rebalance failed again and also had (in my opinion) excessive RAM usage.

Please find a very detailed list below.

All logs:

http://wikisend.com/download/651948/allstores.tar.gz

Thank you for letting me know how I could successfully complete the rebalance 
process.
The Fedora pastes are the output of top on each node at that time (more or 
less).

Please let me know if you need more information,

Best,

—— Start of mem info

# After reboot, before starting glusterd

[root@highlander ~]# pdsh -g live 'free -m'
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:         193249        2208      190825           9         215      190772
stor106: Swap:             0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:         193248        2275      190738           9         234      190681
stor105: Swap:             0           0           0
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:         193249        2221      190811           9         216      190757
stor104: Swap:             0           0           0
[root@highlander ~]#

# Gluster Info

[root@stor106 glusterfs]# gluster volume info

Volume Name: live
Type: Distribute
Volume ID: 1328637d-7730-4627-8945-bbe43626d527
Status: Started
Number of Bricks: 9
Transport-type: tcp
Bricks:
Brick1: stor104:/zfs/brick0/brick
Brick2: stor104:/zfs/brick1/brick
Brick3: stor104:/zfs/brick2/brick
Brick4: stor106:/zfs/brick0/brick
Brick5: stor106:/zfs/brick1/brick
Brick6: stor106:/zfs/brick2/brick
Brick7: stor105:/zfs/brick0/brick
Brick8: stor105:/zfs/brick1/brick
Brick9: stor105:/zfs/brick2/brick
Options Reconfigured:
nfs.disable: true
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.write-behind-window-size: 4MB
performance.io-thread-count: 32
performance.client-io-threads: on
performance.cache-size: 1GB
performance.cache-refresh-timeout: 60
performance.cache-max-file-size: 4MB
cluster.data-self-heal-algorithm: full
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
cluster.min-free-disk: 1%
server.allow-insecure: on

# Starting glusterd

[root@highlander ~]# pdsh -g live 'systemctl start glusterd'
[root@highlander ~]# pdsh -g live 'free -m'
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:         193249        2290      190569           9         389      190587
stor106: Swap:             0           0           0
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:         193249        2297      190557           9         394      190571
stor104: Swap:             0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:         193248        2286      190554           9         407      190595
stor105: Swap:             0           0           0

[root@highlander ~]# systemctl start glusterd
[root@highlander ~]# gluster volume start live
volume start: live: success
[root@highlander ~]# gluster volume status
Status of volume: live
Gluster process TCP Port  RDMA Port  Online  Pid
--
Brick stor104:/zfs/brick0/brick 49164 0  Y   5945
Brick stor104:/zfs/brick1/brick 49165 0  Y   5963
Brick stor104:/zfs/brick2/brick 49166 0  Y   5981
Brick stor106:/zfs/brick0/brick 49158 0  Y   5256
Brick stor106:/zfs/brick1/brick 49159 0  Y   5274
Brick stor106:/zfs/brick2/brick 49160 0  Y   5292
Brick stor105:/zfs/brick0/brick 49155 0  Y   5284
Brick stor105:/zfs/brick1/brick 49156 0  Y   5302
Brick stor105:/zfs/brick2/brick 49157 0  Y   5320
NFS Server on localhost N/A   N/AN   N/A
NFS Server on 192.168.123.106   N/A   N/AN   N/A
NFS Server on stor105   N/A   N/AN   N/A
NFS Server on 192.168.123.104   N/A   N/AN   N/A

Task Status of Volume live

Re: [Gluster-devel] Skipped files during rebalance

2015-08-20 Thread Christophe TREFOIS
   -  ---   ---   ---   ---   ---  --
 192.168.123.104    748812   4.4TB   4160456   1311156772   failed   63114.00
 192.168.123.106   1187917   3.3TB   6021931   21625   1209503   failed   75243.00
 stor105                  0   0Bytes   244043116   196   failed   69658.00
volume rebalance: live: success:



Dr Christophe Trefois, Dipl.-Ing.  
Technical Specialist / Post-Doc

UNIVERSITÉ DU LUXEMBOURG

LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
Campus Belval | House of Biomedicine  
6, avenue du Swing 
L-4367 Belvaux  
T: +352 46 66 44 6124 
F: +352 46 66 44 6949  
http://www.uni.lu/lcsb




This message is confidential and may contain privileged information. 
It is intended for the named recipient only. 
If you receive it in error please notify me and permanently delete the original 
message and any copies. 


  

 On 19 Aug 2015, at 08:14, Susant Palai spa...@redhat.com wrote:
 
 Comments inline.
 
 - Original Message -
 From: Christophe TREFOIS christophe.tref...@uni.lu
 To: Susant Palai spa...@redhat.com
 Cc: Raghavendra Gowdappa rgowd...@redhat.com, Nithya Balachandran 
 nbala...@redhat.com, Shyamsundar
 Ranganathan srang...@redhat.com, Mohammed Rafi K C 
 rkavu...@redhat.com, Gluster Devel
 gluster-devel@gluster.org
 Sent: Tuesday, August 18, 2015 8:08:41 PM
 Subject: Re: [Gluster-devel] Skipped files during rebalance
 
 Hi Susan,
 
 Thank you for the response.
 
 On 18 Aug 2015, at 10:45, Susant Palai spa...@redhat.com wrote:
 
 Hi Christophe,
 
  Need some info regarding the high mem-usage.
 
 1. Top output: To see whether any other process is eating up memory.
 
 I am interested in the memory usage of all the gluster processes involved in the 
 high mem-usage. These processes include glusterfsd, glusterd, gluster, any 
 mount process (glusterfs), and the rebalance process (glusterfs).
 
 
 2. Gluster volume info
 
 root@highlander ~]# gluster volume info
 
 Volume Name: live
 Type: Distribute
 Volume ID: 1328637d-7730-4627-8945-bbe43626d527
 Status: Started
 Number of Bricks: 9
 Transport-type: tcp
 Bricks:
 Brick1: stor104:/zfs/brick0/brick
 Brick2: stor104:/zfs/brick1/brick
 Brick3: stor104:/zfs/brick2/brick
 Brick4: stor106:/zfs/brick0/brick
 Brick5: stor106:/zfs/brick1/brick
 Brick6: stor106:/zfs/brick2/brick
 Brick7: stor105:/zfs/brick0/brick
 Brick8: stor105:/zfs/brick1/brick
 Brick9: stor105:/zfs/brick2/brick
 Options Reconfigured:
 diagnostics.count-fop-hits: on
 diagnostics.latency-measurement: on
 server.allow-insecure: on
 cluster.min-free-disk: 1%
 diagnostics.brick-log-level: ERROR
 diagnostics.client-log-level: ERROR
 cluster.data-self-heal-algorithm: full
 performance.cache-max-file-size: 4MB
 performance.cache-refresh-timeout: 60
 performance.cache-size: 1GB
 performance.client-io-threads: on
 performance.io-thread-count: 32
 performance.write-behind-window-size: 4MB
 
 3. Is the rebalance process still running? If yes, can you point to the specific 
 memory usage of the rebalance process? Was the high mem-usage seen during 
 rebalance or even post rebalance?
 
 I would like to restart the rebalance process since it failed… But I can’t as
 the volume cannot be stopped (I wanted to reboot the servers to have a clean
 testing grounds).
 
 Here are the logs from the three nodes:
 http://paste.fedoraproject.org/256183/43989079
 
 Maybe you could help me figure out how to stop the volume?
 
 This is what happens
 
 [root@highlander ~]# gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 Requesting glusterd team to give input. 
 
 [root@highlander ~]# ssh stor105 gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 [root@highlander ~]# ssh stor104 gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 [root@highlander ~]# ssh stor106 gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 [root@highlander ~]# gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 [root@highlander ~]# gluster volume stop live
 Stopping volume will make its data inaccessible. Do you want to continue?
 (y/n) y
 volume stop: live: failed: Staging failed on stor106. Error: rebalance
 session is in progress for the volume 'live'
 Staging failed on stor104. Error: rebalance session is in progress for the
 volume ‘live'
 Can you run [ps aux |  grep rebalance] on all the servers and post here? 
 Just want to check whether rebalance is really running or not. Again 
 requesting glusterd team to give inputs.
 
 
 
 4. Gluster version
 
 [root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster'
 stor104: glusterfs-api-3.7.3-1.el7.x86_64
 stor104

Re: [Gluster-devel] Skipped files during rebalance

2015-08-19 Thread Susant Palai
Comments inline.

- Original Message -
 From: Christophe TREFOIS christophe.tref...@uni.lu
 To: Susant Palai spa...@redhat.com
 Cc: Raghavendra Gowdappa rgowd...@redhat.com, Nithya Balachandran 
 nbala...@redhat.com, Shyamsundar
 Ranganathan srang...@redhat.com, Mohammed Rafi K C 
 rkavu...@redhat.com, Gluster Devel
 gluster-devel@gluster.org
 Sent: Tuesday, August 18, 2015 8:08:41 PM
 Subject: Re: [Gluster-devel] Skipped files during rebalance
 
 Hi Susan,
 
 Thank you for the response.
 
  On 18 Aug 2015, at 10:45, Susant Palai spa...@redhat.com wrote:
  
  Hi Christophe,
  
Need some info regarding the high mem-usage.
  
  1. Top output: To see whether any other process is eating up memory.

I am interested in the memory usage of all the gluster processes involved in the 
high mem-usage. These processes include glusterfsd, glusterd, gluster, any 
mount process (glusterfs), and the rebalance process (glusterfs).


  2. Gluster volume info
 
 root@highlander ~]# gluster volume info
 
 Volume Name: live
 Type: Distribute
 Volume ID: 1328637d-7730-4627-8945-bbe43626d527
 Status: Started
 Number of Bricks: 9
 Transport-type: tcp
 Bricks:
 Brick1: stor104:/zfs/brick0/brick
 Brick2: stor104:/zfs/brick1/brick
 Brick3: stor104:/zfs/brick2/brick
 Brick4: stor106:/zfs/brick0/brick
 Brick5: stor106:/zfs/brick1/brick
 Brick6: stor106:/zfs/brick2/brick
 Brick7: stor105:/zfs/brick0/brick
 Brick8: stor105:/zfs/brick1/brick
 Brick9: stor105:/zfs/brick2/brick
 Options Reconfigured:
 diagnostics.count-fop-hits: on
 diagnostics.latency-measurement: on
 server.allow-insecure: on
 cluster.min-free-disk: 1%
 diagnostics.brick-log-level: ERROR
 diagnostics.client-log-level: ERROR
 cluster.data-self-heal-algorithm: full
 performance.cache-max-file-size: 4MB
 performance.cache-refresh-timeout: 60
 performance.cache-size: 1GB
 performance.client-io-threads: on
 performance.io-thread-count: 32
 performance.write-behind-window-size: 4MB
 
  3. Is the rebalance process still running? If yes, can you point to the specific
  memory usage of the rebalance process? Was the high mem-usage seen during
  rebalance or even post rebalance?
 
 I would like to restart the rebalance process since it failed… But I can’t as
 the volume cannot be stopped (I wanted to reboot the servers to have a clean
 testing grounds).
 
 Here are the logs from the three nodes:
 http://paste.fedoraproject.org/256183/43989079
 
 Maybe you could help me figure out how to stop the volume?
 
 This is what happens
 
 [root@highlander ~]# gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.

Requesting glusterd team to give input. 
 
 [root@highlander ~]# ssh stor105 gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 [root@highlander ~]# ssh stor104 gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 [root@highlander ~]# ssh stor106 gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 [root@highlander ~]# gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 [root@highlander ~]# gluster volume stop live
 Stopping volume will make its data inaccessible. Do you want to continue?
 (y/n) y
 volume stop: live: failed: Staging failed on stor106. Error: rebalance
 session is in progress for the volume 'live'
 Staging failed on stor104. Error: rebalance session is in progress for the
 volume ‘live'
Can you run [ps aux |  grep rebalance] on all the servers and post here? Just 
want to check whether rebalance is really running or not. Again requesting 
glusterd team to give inputs.
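
Since you already drive the nodes with pdsh and the 'live' group, one way to
grab this from all three servers in one shot (a sketch, assuming that group
covers stor104-106) is:

   pdsh -g live 'ps aux | grep "[r]ebalance"'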

 
 
  4. Gluster version
 
 [root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster'
 stor104: glusterfs-api-3.7.3-1.el7.x86_64
 stor104: glusterfs-server-3.7.3-1.el7.x86_64
 stor104: glusterfs-libs-3.7.3-1.el7.x86_64
 stor104: glusterfs-3.7.3-1.el7.x86_64
 stor104: glusterfs-fuse-3.7.3-1.el7.x86_64
 stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64
 stor104: glusterfs-cli-3.7.3-1.el7.x86_64
 
 stor105: glusterfs-3.7.3-1.el7.x86_64
 stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64
 stor105: glusterfs-api-3.7.3-1.el7.x86_64
 stor105: glusterfs-cli-3.7.3-1.el7.x86_64
 stor105: glusterfs-server-3.7.3-1.el7.x86_64
 stor105: glusterfs-libs-3.7.3-1.el7.x86_64
 stor105: glusterfs-fuse-3.7.3-1.el7.x86_64
 
 stor106: glusterfs-libs-3.7.3-1.el7.x86_64
 stor106: glusterfs-fuse-3.7.3-1.el7.x86_64
 stor106: glusterfs-client-xlators-3.7.3-1.el7.x86_64
 stor106: glusterfs-api-3.7.3-1.el7.x86_64
 stor106: glusterfs-cli-3.7.3-1.el7.x86_64
 stor106: glusterfs-server-3.7.3-1.el7.x86_64
 stor106: glusterfs-3.7.3-1.el7.x86_64
 
  
  Will ask for more information in case needed.
  
  Regards,
  Susant
  
  
  - Original Message -
  From: Christophe TREFOIS christophe.tref...@uni.lu
  To: Raghavendra Gowdappa rgowd...@redhat.com, Nithya Balachandran
  nbala...@redhat.com, Susant Palai
  spa...@redhat.com, Shyamsundar

Re: [Gluster-devel] Skipped files during rebalance

2015-08-19 Thread Susant Palai
Hi Christophe,
   Forgot to ask you to post the rebalance and glusterd logs.

Regards,
Susant
   

- Original Message -
 From: Susant Palai spa...@redhat.com
 To: Christophe TREFOIS christophe.tref...@uni.lu
 Cc: Gluster Devel gluster-devel@gluster.org
 Sent: Wednesday, August 19, 2015 11:44:35 AM
 Subject: Re: [Gluster-devel] Skipped files during rebalance
 
 Comments inline.
 
 - Original Message -
  From: Christophe TREFOIS christophe.tref...@uni.lu
  To: Susant Palai spa...@redhat.com
  Cc: Raghavendra Gowdappa rgowd...@redhat.com, Nithya Balachandran
  nbala...@redhat.com, Shyamsundar
  Ranganathan srang...@redhat.com, Mohammed Rafi K C
  rkavu...@redhat.com, Gluster Devel
  gluster-devel@gluster.org
  Sent: Tuesday, August 18, 2015 8:08:41 PM
  Subject: Re: [Gluster-devel] Skipped files during rebalance
  
  Hi Susan,
  
  Thank you for the response.
  
   On 18 Aug 2015, at 10:45, Susant Palai spa...@redhat.com wrote:
   
   Hi Christophe,
   
 Need some info regarding the high mem-usage.
   
   1. Top output: To see whether any other process is eating up memory.
 
  I am interested in the memory usage of all the gluster processes involved in
  the high mem-usage. These processes include glusterfsd, glusterd, gluster,
  any mount process (glusterfs), and the rebalance process (glusterfs).
 
 
   2. Gluster volume info
  
  root@highlander ~]# gluster volume info
  
  Volume Name: live
  Type: Distribute
  Volume ID: 1328637d-7730-4627-8945-bbe43626d527
  Status: Started
  Number of Bricks: 9
  Transport-type: tcp
  Bricks:
  Brick1: stor104:/zfs/brick0/brick
  Brick2: stor104:/zfs/brick1/brick
  Brick3: stor104:/zfs/brick2/brick
  Brick4: stor106:/zfs/brick0/brick
  Brick5: stor106:/zfs/brick1/brick
  Brick6: stor106:/zfs/brick2/brick
  Brick7: stor105:/zfs/brick0/brick
  Brick8: stor105:/zfs/brick1/brick
  Brick9: stor105:/zfs/brick2/brick
  Options Reconfigured:
  diagnostics.count-fop-hits: on
  diagnostics.latency-measurement: on
  server.allow-insecure: on
  cluster.min-free-disk: 1%
  diagnostics.brick-log-level: ERROR
  diagnostics.client-log-level: ERROR
  cluster.data-self-heal-algorithm: full
  performance.cache-max-file-size: 4MB
  performance.cache-refresh-timeout: 60
  performance.cache-size: 1GB
  performance.client-io-threads: on
  performance.io-thread-count: 32
  performance.write-behind-window-size: 4MB
  
   3. Is the rebalance process still running? If yes, can you point to the
   specific memory usage of the rebalance process? Was the high mem-usage seen
   during rebalance or even post rebalance?
  
  I would like to restart the rebalance process since it failed… But I can’t
  as
  the volume cannot be stopped (I wanted to reboot the servers to have a
  clean
  testing grounds).
  
  Here are the logs from the three nodes:
  http://paste.fedoraproject.org/256183/43989079
  
  Maybe you could help me figure out how to stop the volume?
  
  This is what happens
  
  [root@highlander ~]# gluster volume rebalance live stop
  volume rebalance: live: failed: Rebalance not started.
 
 Requesting glusterd team to give input.
  
  [root@highlander ~]# ssh stor105 gluster volume rebalance live stop
  volume rebalance: live: failed: Rebalance not started.
  
  [root@highlander ~]# ssh stor104 gluster volume rebalance live stop
  volume rebalance: live: failed: Rebalance not started.
  
  [root@highlander ~]# ssh stor106 gluster volume rebalance live stop
  volume rebalance: live: failed: Rebalance not started.
  
  [root@highlander ~]# gluster volume rebalance live stop
  volume rebalance: live: failed: Rebalance not started.
  
  [root@highlander ~]# gluster volume stop live
  Stopping volume will make its data inaccessible. Do you want to continue?
  (y/n) y
  volume stop: live: failed: Staging failed on stor106. Error: rebalance
  session is in progress for the volume 'live'
  Staging failed on stor104. Error: rebalance session is in progress for the
  volume ‘live'
 Can you run [ps aux |  grep rebalance] on all the servers and post here?
 Just want to check whether rebalance is really running or not. Again
 requesting glusterd team to give inputs.
 
  
  
   4. Gluster version
  
  [root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster'
  stor104: glusterfs-api-3.7.3-1.el7.x86_64
  stor104: glusterfs-server-3.7.3-1.el7.x86_64
  stor104: glusterfs-libs-3.7.3-1.el7.x86_64
  stor104: glusterfs-3.7.3-1.el7.x86_64
  stor104: glusterfs-fuse-3.7.3-1.el7.x86_64
  stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64
  stor104: glusterfs-cli-3.7.3-1.el7.x86_64
  
  stor105: glusterfs-3.7.3-1.el7.x86_64
  stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64
  stor105: glusterfs-api-3.7.3-1.el7.x86_64
  stor105: glusterfs-cli-3.7.3-1.el7.x86_64
  stor105: glusterfs-server-3.7.3-1.el7.x86_64
  stor105: glusterfs-libs-3.7.3-1.el7.x86_64
  stor105: glusterfs-fuse-3.7.3-1.el7.x86_64
  
  stor106: glusterfs-libs-3.7.3-1.el7.x86_64
  stor106: glusterfs-fuse-3.7.3-1.el7.x86_64
  stor106: glusterfs

Re: [Gluster-devel] Skipped files during rebalance

2015-08-19 Thread Christophe TREFOIS
Dear Susant,

Apparently the glusterd process was stuck in a strange state, so we restarted 
the glusterd process on stor106. This allowed us to stop the volume and reboot.
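
In case it helps someone else hitting this, the recovery was roughly the
following (a sketch of what we ran; stor106 was the node with the stuck
glusterd):

   ssh stor106 'systemctl restart glusterd'
   gluster volume stop live
   # then reboot the storage nodes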

I will start a new rebalance now, and will get the information you asked during 
the rebalance operation.

I think it makes more sense to post the logs of this new rebalance operation.

Kind regards,

—
Christophe

 

 On 19 Aug 2015, at 08:49, Susant Palai spa...@redhat.com wrote:
 
 Hi Christophe,
   Forgot to ask you to post the rebalance and glusterd logs.
 
 Regards,
 Susant
 
 
 - Original Message -
 From: Susant Palai spa...@redhat.com
 To: Christophe TREFOIS christophe.tref...@uni.lu
 Cc: Gluster Devel gluster-devel@gluster.org
 Sent: Wednesday, August 19, 2015 11:44:35 AM
 Subject: Re: [Gluster-devel] Skipped files during rebalance
 
 Comments inline.
 
 - Original Message -
 From: Christophe TREFOIS christophe.tref...@uni.lu
 To: Susant Palai spa...@redhat.com
 Cc: Raghavendra Gowdappa rgowd...@redhat.com, Nithya Balachandran
 nbala...@redhat.com, Shyamsundar
 Ranganathan srang...@redhat.com, Mohammed Rafi K C
 rkavu...@redhat.com, Gluster Devel
 gluster-devel@gluster.org
 Sent: Tuesday, August 18, 2015 8:08:41 PM
 Subject: Re: [Gluster-devel] Skipped files during rebalance
 
 Hi Susan,
 
 Thank you for the response.
 
 On 18 Aug 2015, at 10:45, Susant Palai spa...@redhat.com wrote:
 
 Hi Christophe,
 
  Need some info regarding the high mem-usage.
 
  1. Top output: To see whether any other process is eating up memory.
  
  I am interested in the memory usage of all the gluster processes involved in
  the high mem-usage. These processes include glusterfsd, glusterd, gluster,
  any mount process (glusterfs), and the rebalance process (glusterfs).
 
 
 2. Gluster volume info
 
 root@highlander ~]# gluster volume info
 
 Volume Name: live
 Type: Distribute
 Volume ID: 1328637d-7730-4627-8945-bbe43626d527
 Status: Started
 Number of Bricks: 9
 Transport-type: tcp
 Bricks:
 Brick1: stor104:/zfs/brick0/brick
 Brick2: stor104:/zfs/brick1/brick
 Brick3: stor104:/zfs/brick2/brick
 Brick4: stor106:/zfs/brick0/brick
 Brick5: stor106:/zfs/brick1/brick
 Brick6: stor106:/zfs/brick2/brick
 Brick7: stor105:/zfs/brick0/brick
 Brick8: stor105:/zfs/brick1/brick
 Brick9: stor105:/zfs/brick2/brick
 Options Reconfigured:
 diagnostics.count-fop-hits: on
 diagnostics.latency-measurement: on
 server.allow-insecure: on
 cluster.min-free-disk: 1%
 diagnostics.brick-log-level: ERROR
 diagnostics.client-log-level: ERROR
 cluster.data-self-heal-algorithm: full
 performance.cache-max-file-size: 4MB
 performance.cache-refresh-timeout: 60
 performance.cache-size: 1GB
 performance.client-io-threads: on
 performance.io-thread-count: 32
 performance.write-behind-window-size: 4MB
 
  3. Is the rebalance process still running? If yes, can you point to the
  specific memory usage of the rebalance process? Was the high mem-usage seen
  during rebalance or even post rebalance?
 
 I would like to restart the rebalance process since it failed… But I can’t
 as
 the volume cannot be stopped (I wanted to reboot the servers to have a
 clean
 testing grounds).
 
 Here are the logs from the three nodes:
 http://paste.fedoraproject.org/256183/43989079
 
 Maybe you could help me figure out how to stop the volume?
 
 This is what happens
 
 [root@highlander ~]# gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 Requesting glusterd team to give input.
 
 [root@highlander ~]# ssh stor105 gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 [root@highlander ~]# ssh stor104 gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 [root@highlander ~]# ssh stor106 gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 [root@highlander ~]# gluster volume rebalance live stop
 volume rebalance: live: failed: Rebalance not started.
 
 [root@highlander ~]# gluster volume stop live
 Stopping volume will make its data inaccessible. Do you want to continue?
 (y/n) y
 volume stop: live: failed: Staging failed on stor106. Error: rebalance
 session is in progress for the volume 'live'
 Staging failed on stor104. Error: rebalance session is in progress for the
 volume ‘live'
 Can you run [ps aux |  grep rebalance] on all the servers and post here?
 Just want to check whether rebalance is really running or not. Again
 requesting glusterd team to give inputs.
 
 
 
 4. Gluster version
 
 [root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster'
 stor104: glusterfs-api-3.7.3-1.el7.x86_64
 stor104: glusterfs-server-3.7.3-1.el7.x86_64
 stor104: glusterfs-libs-3.7.3-1.el7.x86_64
 stor104: glusterfs-3.7.3-1.el7.x86_64
 stor104: glusterfs-fuse-3.7.3-1.el7.x86_64
 stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64
 stor104: glusterfs-cli-3.7.3-1.el7.x86_64
 
 stor105: glusterfs-3.7.3-1.el7.x86_64
 stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64

Re: [Gluster-devel] Skipped files during rebalance

2015-08-18 Thread Susant Palai
Hi Christophe,
  
   Need some info regarding the high mem-usage.

1. Top output: To see whether any other process is eating up memory.
2. Gluster volume info
3. Is the rebalance process still running? If yes, can you point to the specific 
memory usage of the rebalance process? Was the high mem-usage seen during 
rebalance or even post rebalance?
4. Gluster version
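
If it is easier, all of the above can be collected in one go from the admin
node, e.g. with something like this (a sketch, assuming the pdsh group 'live'
from your mail below covers all three storage servers):

   pdsh -g live 'top -b -n 1 | head -n 25'      # 1: snapshot of top on each node
   gluster volume info live                     # 2
   pdsh -g live 'ps aux | grep "[g]luster"'     # 3: per-process memory of the gluster daemons, including rebalance
   pdsh -g live 'rpm -qa | grep gluster'        # 4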

Will ask for more information in case needed.

Regards,
Susant


- Original Message -
 From: Christophe TREFOIS christophe.tref...@uni.lu
 To: Raghavendra Gowdappa rgowd...@redhat.com, Nithya Balachandran 
 nbala...@redhat.com, Susant Palai
 spa...@redhat.com, Shyamsundar Ranganathan srang...@redhat.com
 Cc: Mohammed Rafi K C rkavu...@redhat.com
 Sent: Monday, 17 August, 2015 7:03:20 PM
 Subject: Fwd: [Gluster-devel] Skipped files during rebalance
 
 Hi DHT team,
 
 This email somehow didn’t get forwarded to you.
 
 In addition to my problem described below, here is one example of free memory
 after everything failed
 
 [root@highlander ~]# pdsh -g live 'free -m'
 stor106:              total        used        free      shared  buff/cache   available
 stor106: Mem:         193249      124784        1347           9       67118       12769
 stor106: Swap:             0           0           0
 stor104:              total        used        free      shared  buff/cache   available
 stor104: Mem:         193249      107617       31323           9       54308       42752
 stor104: Swap:             0           0           0
 stor105:              total        used        free      shared  buff/cache   available
 stor105: Mem:         193248      141804        6736           9       44707        9713
 stor105: Swap:             0           0           0
 
 So after the failed operation, there’s almost no memory free, and it is also
 not freed up.
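
 For reference, the per-node top memory consumers can be listed with something
 like the following (a sketch, assuming the procps ps shipped on these CentOS 7
 nodes):
 
    pdsh -g live 'ps aux --sort=-rss | head -n 5'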
 
 Thank you for pointing me to any directions,
 
 Kind regards,
 
 —
 Christophe
 
 
 Begin forwarded message:
 
 From: Christophe TREFOIS christophe.tref...@uni.lu
 Subject: Re: [Gluster-devel] Skipped files during rebalance
 Date: 17 Aug 2015 11:54:32 CEST
 To: Mohammed Rafi K C rkavu...@redhat.com
 Cc: gluster-devel@gluster.org
 
 Dear Rafi,
 
 Thanks for submitting a patch.
 
 @DHT, I have two additional questions / problems.
 
 1. When doing a rebalance (with data), RAM consumption on the nodes goes
 dramatically high; e.g., out of 196 GB available per node, RAM usage would fill
 up to 195.6 GB. This seems quite excessive and strange to me.
 
 2. As you can see, the rebalance (with data) failed as one endpoint becomes
 unconnected (even though it still is connected). I’m thinking this could be
 due to the high RAM usage?
 
 Thank you for your help,
 
 —
 Christophe
 
 Dr Christophe Trefois, Dipl.-Ing.
 Technical Specialist / Post-Doc
 
 UNIVERSITÉ DU LUXEMBOURG
 
 LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
 Campus Belval | House of Biomedicine
 6, avenue du Swing
 L-4367 Belvaux
 T: +352 46 66 44 6124
 F: +352 46 66 44 6949
 http://www.uni.lu/lcsb
 
 [Facebook]https://www.facebook.com/trefex  [Twitter]
 https://twitter.com/Trefex   [Google Plus]
 https://plus.google.com/+ChristopheTrefois/   [Linkedin]
 https://www.linkedin.com/in/trefoischristophe   [skype]
 http://skype:Trefex?call
 
 
 
 This message is confidential and may contain privileged information.
 It is intended for the named recipient only.
 If you receive it in error please notify me and permanently delete the
 original message and any copies.
 
 
 
 
 On 17 Aug 2015, at 11:27, Mohammed Rafi K C rkavu...@redhat.com wrote:
 
 
 
 On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:
 Dear all,
 
 I have successfully added a new node to our setup, and finally managed to get
 a successful fix-layout run as well with no errors.
 
 Now, as per the documentation, I started a gluster volume rebalance live
 start task and I see many skipped files.
 The error log contains entries like the following for each skipped file.
 
 [2015-08-16 20:23:30.591161] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
 s_05(2013-10-11_17-12-02)/004010008.flex lookup failed
 [2015-08-16 20:23:30.768391] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
 s_05(2013-10-11_17-12-02)/007005003.flex lookup failed
 [2015-08-16 20:23:30.804811] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
 s_05(2013-10-11_17-12-02)/006005009.flex lookup failed
 [2015-08-16 20:23:30.805201] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK

Re: [Gluster-devel] Skipped files during rebalance

2015-08-17 Thread Christophe TREFOIS
Dear Rafi,

Thanks for submitting a patch.

@DHT, I have two additional questions / problems.

1. When doing a rebalance (with data), RAM consumption on the nodes goes 
dramatically high; e.g., out of 196 GB available per node, RAM usage would fill up 
to 195.6 GB. This seems quite excessive and strange to me.

2. As you can see, the rebalance (with data) failed as one endpoint becomes 
unconnected (even though it still is connected). I’m thinking this could be due 
to the high RAM usage?

Thank you for your help,

—
Christophe

Dr Christophe Trefois, Dipl.-Ing.
Technical Specialist / Post-Doc

UNIVERSITÉ DU LUXEMBOURG

LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
Campus Belval | House of Biomedicine
6, avenue du Swing
L-4367 Belvaux
T: +352 46 66 44 6124
F: +352 46 66 44 6949
http://www.uni.lu/lcsb

[Facebook]https://www.facebook.com/trefex  [Twitter] 
https://twitter.com/Trefex   [Google Plus] 
https://plus.google.com/+ChristopheTrefois/   [Linkedin] 
https://www.linkedin.com/in/trefoischristophe   [skype] 
http://skype:Trefex?call


This message is confidential and may contain privileged information.
It is intended for the named recipient only.
If you receive it in error please notify me and permanently delete the original 
message and any copies.




On 17 Aug 2015, at 11:27, Mohammed Rafi K C rkavu...@redhat.com wrote:



On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:
Dear all,

I have successfully added a new node to our setup, and finally managed to get a 
successful fix-layout run as well with no errors.

Now, as per the documentation, I started a gluster volume rebalance live start 
task and I see many skipped files.
The error log contains entries like the following for each skipped file.

[2015-08-16 20:23:30.591161] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/004010008.flex lookup failed
[2015-08-16 20:23:30.768391] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/007005003.flex lookup failed
[2015-08-16 20:23:30.804811] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/006005009.flex lookup failed
[2015-08-16 20:23:30.805201] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/005006011.flex lookup failed
[2015-08-16 20:23:30.880037] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/005009012.flex lookup failed
[2015-08-16 20:23:31.038236] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/003008007.flex lookup failed
[2015-08-16 20:23:31.259762] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/004008006.flex lookup failed
[2015-08-16 20:23:31.333764] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/007008001.flex lookup failed
[2015-08-16 20:23:31.340190] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/006007004.flex lookup failed

Update: one of the rebalance tasks now failed.

@Rafi, I got the same error as Friday except this time with data.

Packets carrying the ping request could be waiting in the queue for the whole 
time-out period because of heavy traffic on the network. I have sent a patch 
for this. You can track its status here: 
http://review.gluster.org/11935



[2015-08-16 20:24:34.533167] C 
[rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0: server 
192.168.123.104:49164 has not responded in the last 42 seconds, disconnecting.
[2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (-- 
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-- 
/lib64/libgfrpc.so.0(saved_frames_unwin
d+0x1de)[0x7fa454bb09be] (-- 
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-- 
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (-- 
/lib64/li
bgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ) 0-live-client-0: forced 
unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2015-08-16 

Re: [Gluster-devel] Skipped files during rebalance

2015-08-17 Thread Mohammed Rafi K C


On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:

 Dear all,

  

 I have successfully added a new node to our setup, and finally managed
 to get a successful fix-layout run as well with no errors.

  

 Now, as per the documentation, I started a gluster volume rebalance
 live start task and I see many skipped files. 

 The error log contains entries like the following for each skipped file.

  

 [2015-08-16 20:23:30.591161] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea

 s_05(2013-10-11_17-12-02)/004010008.flex lookup failed

 [2015-08-16 20:23:30.768391] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea

 s_05(2013-10-11_17-12-02)/007005003.flex lookup failed

 [2015-08-16 20:23:30.804811] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea

 s_05(2013-10-11_17-12-02)/006005009.flex lookup failed

 [2015-08-16 20:23:30.805201] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea

 s_05(2013-10-11_17-12-02)/005006011.flex lookup failed

 [2015-08-16 20:23:30.880037] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea

 s_05(2013-10-11_17-12-02)/005009012.flex lookup failed

 [2015-08-16 20:23:31.038236] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea

 s_05(2013-10-11_17-12-02)/003008007.flex lookup failed

 [2015-08-16 20:23:31.259762] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea

 s_05(2013-10-11_17-12-02)/004008006.flex lookup failed

 [2015-08-16 20:23:31.333764] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea

 s_05(2013-10-11_17-12-02)/007008001.flex lookup failed

 [2015-08-16 20:23:31.340190] E [MSGID: 109023]
 [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
 failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea

 s_05(2013-10-11_17-12-02)/006007004.flex lookup failed

  

 Update: one of the rebalance tasks now failed.

  

 @Rafi, I got the same error as Friday except this time with data.


Packets carrying the ping request could be waiting in the queue for the whole
time-out period because of heavy traffic on the network. I have sent a patch
for this. You can track its status here:
http://review.gluster.org/11935
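
As a possible interim workaround while the patch is under review (only a
mitigation, not a fix for the queuing itself), the client ping timeout could be
raised for the duration of the rebalance; the 42 seconds reported in the log
below is the default value of network.ping-timeout:

   gluster volume set live network.ping-timeout 90
   # ... run the rebalance ...
   gluster volume reset live network.ping-timeout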


  

 [2015-08-16 20:24:34.533167] C
 [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0:
 server 192.168.123.104:49164 has not responded in the last 42 seconds,
 disconnecting.

 [2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind]
 (-- /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
 (-- /lib64/libgfrpc.so.0(saved_frames_unwin

 d+0x1de)[0x7fa454bb09be] (--
 /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--
 /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
 (-- /lib64/li

 bgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )
 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
 op(INODELK(29)) called at 2015-08-16 20:23:51.305640 (xid=0x5dd4da)

 [2015-08-16 20:24:34.533672] E [MSGID: 114031]
 [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote
 operation failed [Transport endpoint is not connected]

 [2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind]
 (-- /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
 (-- /lib64/libgfrpc.so.0(saved_frames_unwin

 d+0x1de)[0x7fa454bb09be] (--
 /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--
 /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
 (-- /lib64/li

 bgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )
 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
 op(READ(12)) called at 2015-08-16 20:23:51.303938 (xid=0x5dd4d7)

 [2015-08-16 20:24:34.534347] E [MSGID: 109023]
 [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file
 failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_

 12(2013-10-12_00-12-55)/007008007.flex: failed to migrate data

 [2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind]
 (-- /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
 (-- /lib64/libgfrpc.so.0(saved_frames_unwin

 d+0x1de)[0x7fa454bb09be] (--
 /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--