Re: [Gluster-devel] Skipped files during rebalance

2015-08-21 Thread Susant Palai
Hi,
The rebalance failures are mostly due to a network problem.

Here is the log:

[2015-08-16 20:31:36.301467] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/PA 
27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/003002002.flex 
lookup failed
[2015-08-16 20:31:36.921405] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/PA 
27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/003004005.flex 
lookup failed
[2015-08-16 20:31:36.921591] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/PA 
27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/006004004.flex 
lookup failed
[2015-08-16 20:31:36.921770] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/PA 
27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/005004007.flex 
lookup failed
[2015-08-16 20:31:37.577758] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/PA 
27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/007004005.flex 
lookup failed
[2015-08-16 20:34:12.387425] E [socket.c:2332:socket_connect_finish] 
0-live-client-4: connection to 192.168.123.106:24007 failed (Connection refused)
[2015-08-16 20:34:12.392820] E [socket.c:2332:socket_connect_finish] 
0-live-client-5: connection to 192.168.123.106:24007 failed (Connection refused)
[2015-08-16 20:34:12.398023] E [socket.c:2332:socket_connect_finish] 
0-live-client-0: connection to 192.168.123.104:24007 failed (Connection refused)
[2015-08-16 20:34:12.402904] E [socket.c:2332:socket_connect_finish] 
0-live-client-2: connection to 192.168.123.104:24007 failed (Connection refused)
[2015-08-16 20:34:12.407464] E [socket.c:2332:socket_connect_finish] 
0-live-client-3: connection to 192.168.123.106:24007 failed (Connection refused)
[2015-08-16 20:34:12.412249] E [socket.c:2332:socket_connect_finish] 
0-live-client-1: connection to 192.168.123.104:24007 failed (Connection refused)
[2015-08-16 20:34:12.416621] E [socket.c:2332:socket_connect_finish] 
0-live-client-6: connection to 192.168.123.105:24007 failed (Connection refused)
[2015-08-16 20:34:12.420906] E [socket.c:2332:socket_connect_finish] 
0-live-client-8: connection to 192.168.123.105:24007 failed (Connection refused)
[2015-08-16 20:34:12.425066] E [socket.c:2332:socket_connect_finish] 
0-live-client-7: connection to 192.168.123.105:24007 failed (Connection refused)
[2015-08-16 20:34:17.479925] E [socket.c:2332:socket_connect_finish] 
0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2015-08-16 20:36:23.788206] E [MSGID: 101075] 
[common-utils.c:314:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or 
service not known)
[2015-08-16 20:36:23.788286] E [name.c:247:af_inet_client_get_remote_sockaddr] 
0-live-client-4: DNS resolution failed on host stor106
[2015-08-16 20:36:23.788387] E [name.c:247:af_inet_client_get_remote_sockaddr] 
0-live-client-5: DNS resolution failed on host stor106
[2015-08-16 20:36:23.788918] E [name.c:247:af_inet_client_get_remote_sockaddr] 
0-live-client-0: DNS resolution failed on host stor104
[2015-08-16 20:36:23.789233] E [name.c:247:af_inet_client_get_remote_sockaddr] 
0-live-client-2: DNS resolution failed on host stor104
[2015-08-16 20:36:23.789295] E [name.c:247:af_inet_client_get_remote_sockaddr] 
0-live-client-3: DNS resolution failed on host stor106


For the high memory usage part, I will try to run a rebalance and analyze it. In the 
meantime, it would be helpful if you could take a state dump of the rebalance 
process while it is using high RAM.

Here are the steps to take the state dump.

1. Find your state-dump destination: run "gluster --print-statedumpdir". The 
state dump will be stored in that location.

2. When you see a rebalance process on any of the servers using high memory, 
issue the following command: "kill -USR1 <pid-of-rebalance-process>". 
(ps aux | grep rebalance should give you the rebalance process PID.)

The state dump should give some hint about the high mem-usage.
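
A minimal shell sketch of those two steps (the pgrep pattern is an assumption; the plain "ps aux | grep rebalance" above works just as well):

```
# 1. Where will the state dump be written?
gluster --print-statedumpdir

# 2. Ask every rebalance process on this server to dump its state
for pid in $(pgrep -f 'glusterfs.*rebalance'); do
    kill -USR1 "$pid"
done

# The dump files land in the directory from step 1
# (typically named glusterdump.<pid>.dump.<timestamp>)
ls -lt "$(gluster --print-statedumpdir)" | head
```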

Thanks,
Susant

- Original Message -
From: "Susant Palai" 
To: "Christophe TREFOIS" 
Cc: "Gluster Devel" 
Sent: Friday, 21 August, 2015 3:52:07 PM
Subject: Re: [Gluster-devel] Skipped files during rebalance

Thanks Christophe for the details. Will get back to you with the analysis.

Regards,
Susant

- Original Message -
From: "Christophe TREFOIS" 
To: "Susant Palai" 
Cc: "Raghavendra Gowdappa" , "Nithya Balachandran" 
, "Shyamsundar Ranganathan" , 
"Mohammed Rafi K C" , "Gluster Devel" 

Sent: Friday, 21 August, 2015 12:39:05 AM
Subject: Re: [Gluster-devel] Skipped files during rebalance

Dear Susant,


Re: [Gluster-devel] Skipped files during rebalance

2015-08-21 Thread Susant Palai
Thanks Christophe for the details. Will get back to you with the analysis.

Regards,
Susant

- Original Message -
From: "Christophe TREFOIS" 
To: "Susant Palai" 
Cc: "Raghavendra Gowdappa" , "Nithya Balachandran" 
, "Shyamsundar Ranganathan" , 
"Mohammed Rafi K C" , "Gluster Devel" 

Sent: Friday, 21 August, 2015 12:39:05 AM
Subject: Re: [Gluster-devel] Skipped files during rebalance

Dear Susant,

The rebalance failed again and also had (in my opinion) excessive RAM usage.

> Please find a very detailed list below.

All logs:

http://wikisend.com/download/651948/allstores.tar.gz

> Thank you in advance for letting me know how I could successfully complete the 
> rebalance process.
> The Fedora pastes are the output of top on each node at roughly that time.

Please let me know if you need more information,

Best,

—— Start of mem info

# After reboot, before starting glusterd

[root@highlander ~]# pdsh -g live 'free -m'
> stor106:              total        used        free      shared  buff/cache   available
> stor106: Mem:        193249        2208      190825           9         215      190772
> stor106: Swap:            0           0           0
> stor105:              total        used        free      shared  buff/cache   available
> stor105: Mem:        193248        2275      190738           9         234      190681
> stor105: Swap:            0           0           0
> stor104:              total        used        free      shared  buff/cache   available
> stor104: Mem:        193249        2221      190811           9         216      190757
> stor104: Swap:            0           0           0
[root@highlander ~]#

# Gluster Info

[root@stor106 glusterfs]# gluster volume info

Volume Name: live
Type: Distribute
Volume ID: 1328637d-7730-4627-8945-bbe43626d527
Status: Started
Number of Bricks: 9
Transport-type: tcp
Bricks:
Brick1: stor104:/zfs/brick0/brick
Brick2: stor104:/zfs/brick1/brick
Brick3: stor104:/zfs/brick2/brick
Brick4: stor106:/zfs/brick0/brick
Brick5: stor106:/zfs/brick1/brick
Brick6: stor106:/zfs/brick2/brick
Brick7: stor105:/zfs/brick0/brick
Brick8: stor105:/zfs/brick1/brick
Brick9: stor105:/zfs/brick2/brick
Options Reconfigured:
nfs.disable: true
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.write-behind-window-size: 4MB
performance.io-thread-count: 32
performance.client-io-threads: on
performance.cache-size: 1GB
performance.cache-refresh-timeout: 60
performance.cache-max-file-size: 4MB
cluster.data-self-heal-algorithm: full
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
cluster.min-free-disk: 1%
server.allow-insecure: on

> # Starting glusterd

[root@highlander ~]# pdsh -g live 'systemctl start glusterd'
[root@highlander ~]# pdsh -g live 'free -m'
> stor106:              total        used        free      shared  buff/cache   available
> stor106: Mem:        193249        2290      190569           9         389      190587
> stor106: Swap:            0           0           0
> stor104:              total        used        free      shared  buff/cache   available
> stor104: Mem:        193249        2297      190557           9         394      190571
> stor104: Swap:            0           0           0
> stor105:              total        used        free      shared  buff/cache   available
> stor105: Mem:        193248        2286      190554           9         407      190595
> stor105: Swap:            0           0           0

[root@highlander ~]# systemctl start glusterd
[root@highlander ~]# gluster volume start live
volume start: live: success
[root@highlander ~]# gluster volume status
Status of volume: live
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick stor104:/zfs/brick0/brick             49164     0          Y       5945
> Brick stor104:/zfs/brick1/brick             49165     0          Y       5963
> Brick stor104:/zfs/brick2/brick             49166     0          Y       5981
> Brick stor106:/zfs/brick0/brick             49158     0          Y       5256
> Brick stor106:/zfs/brick1/brick             49159     0          Y       5274
> Brick stor106:/zfs/brick2/brick             49160     0          Y       5292
> Brick stor105:/zfs/brick0/brick             49155     0          Y       5284
> Brick stor105:/zfs/brick1/brick             49156     0          Y       5302
> Brick stor105:/zfs/brick2/brick             49157     0          Y       5320
> NFS Server on localhost                     N/A       N/A        N       N/A
> NFS Server on 192.168.123.106               N/A       N/A        N       N/A
> NFS Server on stor105                       N/A       N/A        N       N/A
> NFS Server on 192.168.123.104               N/A       N/A        N       N/A
> 
> Task Status of Volume live
> ------------------------------------------------------------------------------

Re: [Gluster-devel] Skipped files during rebalance

2015-08-20 Thread Christophe TREFOIS
Node   Rebalanced-files   size   scanned   failures   skipped   status   run time in secs
192.168.123.104   748812   4.4TB   4160456   1311156772   failed   63114.00
192.168.123.106   1187917   3.3TB   6021931   21625   1209503   failed   75243.00
stor105   0   0Bytes   244043116   196   failed   69658.00
volume rebalance: live: success:
```



Dr Christophe Trefois, Dipl.-Ing.  
Technical Specialist / Post-Doc

UNIVERSITÉ DU LUXEMBOURG

LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
Campus Belval | House of Biomedicine  
6, avenue du Swing 
L-4367 Belvaux  
T: +352 46 66 44 6124 
F: +352 46 66 44 6949  
http://www.uni.lu/lcsb




This message is confidential and may contain privileged information. 
It is intended for the named recipient only. 
If you receive it in error please notify me and permanently delete the original 
message and any copies. 


  

> On 19 Aug 2015, at 08:14, Susant Palai  wrote:
> 
> Comments inline.
> 
> - Original Message -
>> From: "Christophe TREFOIS" 
>> To: "Susant Palai" 
>> Cc: "Raghavendra Gowdappa" , "Nithya Balachandran" 
>> , "Shyamsundar
>> Ranganathan" , "Mohammed Rafi K C" 
>> , "Gluster Devel"
>> 
>> Sent: Tuesday, August 18, 2015 8:08:41 PM
>> Subject: Re: [Gluster-devel] Skipped files during rebalance
>> 
>> Hi Susant,
>> 
>> Thank you for the response.
>> 
>>> On 18 Aug 2015, at 10:45, Susant Palai  wrote:
>>> 
>>> Hi Christophe,
>>> 
>>>  Need some info regarding the high mem-usage.
>>> 
>>> 1. Top output: To see whether any other process is eating up memory.
> 
> I will be interested to know the memory usage of all the gluster processes, in 
> relation to the high mem-usage. These processes include glusterfsd, glusterd, 
> gluster, any mount process (glusterfs), and rebalance (glusterfs).
> 
> 
>>> 2. Gluster volume info
>> 
>> root@highlander ~]# gluster volume info
>> 
>> Volume Name: live
>> Type: Distribute
>> Volume ID: 1328637d-7730-4627-8945-bbe43626d527
>> Status: Started
>> Number of Bricks: 9
>> Transport-type: tcp
>> Bricks:
>> Brick1: stor104:/zfs/brick0/brick
>> Brick2: stor104:/zfs/brick1/brick
>> Brick3: stor104:/zfs/brick2/brick
>> Brick4: stor106:/zfs/brick0/brick
>> Brick5: stor106:/zfs/brick1/brick
>> Brick6: stor106:/zfs/brick2/brick
>> Brick7: stor105:/zfs/brick0/brick
>> Brick8: stor105:/zfs/brick1/brick
>> Brick9: stor105:/zfs/brick2/brick
>> Options Reconfigured:
>> diagnostics.count-fop-hits: on
>> diagnostics.latency-measurement: on
>> server.allow-insecure: on
>> cluster.min-free-disk: 1%
>> diagnostics.brick-log-level: ERROR
>> diagnostics.client-log-level: ERROR
>> cluster.data-self-heal-algorithm: full
>> performance.cache-max-file-size: 4MB
>> performance.cache-refresh-timeout: 60
>> performance.cache-size: 1GB
>> performance.client-io-threads: on
>> performance.io-thread-count: 32
>> performance.write-behind-window-size: 4MB
>> 
>>> 3. Is rebalance process still running? If yes can you point to specific mem
>>> usage by rebalance process? The high mem-usage was seen during rebalance
>>> or even post rebalance?
>> 
>> I would like to restart the rebalance process since it failed… But I can’t as
>> the volume cannot be stopped (I wanted to reboot the servers to have a clean
>> testing grounds).
>> 
>> Here are the logs from the three nodes:
>> http://paste.fedoraproject.org/256183/43989079
>> 
>> Maybe you could help me figure out how to stop the volume?
>> 
>> This is what happens
>> 
>> [root@highlander ~]# gluster volume rebalance live stop
>> volume rebalance: live: failed: Rebalance not started.
> 
> Requesting glusterd team to give input. 
>> 
>> [root@highlander ~]# ssh stor105 "gluster volume rebalance live stop"
>> volume rebalance: live: failed: Rebalance not started.
>> 
>> [root@highlander ~]# ssh stor104 "gluster volume rebalance live stop"
>> volume rebalance: live: failed: Rebalance not started.
>> 
>> [root@highlander ~]# ssh stor106 "gluster volume rebalance live stop"
>> volume rebalance: live: failed: Rebalance not 

Re: [Gluster-devel] Skipped files during rebalance

2015-08-19 Thread Christophe TREFOIS
Dear Susant,

Apparently the glusterd process was stuck in a strange state, so we restarted 
glusterd on stor106. This allowed us to stop the volume and reboot.
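
A sketch of that recovery sequence, assuming systemd-managed glusterd and the same pdsh "live" host group used elsewhere in this thread:

```
ssh stor106 'systemctl restart glusterd'   # clear the stuck glusterd
gluster volume stop live                   # now succeeds; answer the prompt with y
pdsh -g live 'reboot'                      # reboot the storage nodes
```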

I will start a new rebalance now and will gather the information you asked for 
during the rebalance operation.

I think it makes more sense to post the logs of this new rebalance operation.

Kind regards,

—
Christophe

 

> On 19 Aug 2015, at 08:49, Susant Palai  wrote:
> 
> Hi Christophe,
>   Forgot to ask you to post the rebalance and glusterd logs.
> 
> Regards,
> Susant
> 
> 
> - Original Message -
>> From: "Susant Palai" 
>> To: "Christophe TREFOIS" 
>> Cc: "Gluster Devel" 
>> Sent: Wednesday, August 19, 2015 11:44:35 AM
>> Subject: Re: [Gluster-devel] Skipped files during rebalance
>> 
>> Comments inline.
>> 
>> - Original Message -
>>> From: "Christophe TREFOIS" 
>>> To: "Susant Palai" 
>>> Cc: "Raghavendra Gowdappa" , "Nithya Balachandran"
>>> , "Shyamsundar
>>> Ranganathan" , "Mohammed Rafi K C"
>>> , "Gluster Devel"
>>> 
>>> Sent: Tuesday, August 18, 2015 8:08:41 PM
>>> Subject: Re: [Gluster-devel] Skipped files during rebalance
>>> 
>>> Hi Susant,
>>> 
>>> Thank you for the response.
>>> 
>>>> On 18 Aug 2015, at 10:45, Susant Palai  wrote:
>>>> 
>>>> Hi Christophe,
>>>> 
>>>>  Need some info regarding the high mem-usage.
>>>> 
>>>> 1. Top output: To see whether any other process is eating up memory.
>> 
>> I will be interested to know the memory usage of all the gluster processes, in
>> relation to the high mem-usage. These processes include glusterfsd, glusterd,
>> gluster, any mount process (glusterfs), and rebalance (glusterfs).
>> 
>> 
>>>> 2. Gluster volume info
>>> 
>>> root@highlander ~]# gluster volume info
>>> 
>>> Volume Name: live
>>> Type: Distribute
>>> Volume ID: 1328637d-7730-4627-8945-bbe43626d527
>>> Status: Started
>>> Number of Bricks: 9
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: stor104:/zfs/brick0/brick
>>> Brick2: stor104:/zfs/brick1/brick
>>> Brick3: stor104:/zfs/brick2/brick
>>> Brick4: stor106:/zfs/brick0/brick
>>> Brick5: stor106:/zfs/brick1/brick
>>> Brick6: stor106:/zfs/brick2/brick
>>> Brick7: stor105:/zfs/brick0/brick
>>> Brick8: stor105:/zfs/brick1/brick
>>> Brick9: stor105:/zfs/brick2/brick
>>> Options Reconfigured:
>>> diagnostics.count-fop-hits: on
>>> diagnostics.latency-measurement: on
>>> server.allow-insecure: on
>>> cluster.min-free-disk: 1%
>>> diagnostics.brick-log-level: ERROR
>>> diagnostics.client-log-level: ERROR
>>> cluster.data-self-heal-algorithm: full
>>> performance.cache-max-file-size: 4MB
>>> performance.cache-refresh-timeout: 60
>>> performance.cache-size: 1GB
>>> performance.client-io-threads: on
>>> performance.io-thread-count: 32
>>> performance.write-behind-window-size: 4MB
>>> 
>>>> 3. Is rebalance process still running? If yes can you point to specific
>>>> mem
>>>> usage by rebalance process? The high mem-usage was seen during rebalance
>>>> or even post rebalance?
>>> 
>>> I would like to restart the rebalance process since it failed… But I can’t
>>> as
>>> the volume cannot be stopped (I wanted to reboot the servers to have a
>>> clean
>>> testing grounds).
>>> 
>>> Here are the logs from the three nodes:
>>> http://paste.fedoraproject.org/256183/43989079
>>> 
>>> Maybe you could help me figure out how to stop the volume?
>>> 
>>> This is what happens
>>> 
>>> [root@highlander ~]# gluster volume rebalance live stop
>>> volume rebalance: live: failed: Rebalance not started.
>> 
>> Requesting glusterd team to give input.
>>> 
>>> [root@highlander ~]# ssh stor105 "gluster volume rebalance live stop"
>>> volume rebalance: live: failed: Rebalance not started.
>>> 
>>> [root@highlander ~]# ssh stor104 "gluster volume rebalance live stop"
>>> volume rebalance: live: failed: Rebalance not started.
>>> 
>>> [root@highlander ~]# ssh stor106 "gluster volume rebalance live stop"
>>> volume rebalance: live: failed: Rebalance 

Re: [Gluster-devel] Skipped files during rebalance

2015-08-18 Thread Susant Palai
Hi Christophe,
   Forgot to ask you to post the rebalance and glusterd logs.

Regards,
Susant
   

- Original Message -
> From: "Susant Palai" 
> To: "Christophe TREFOIS" 
> Cc: "Gluster Devel" 
> Sent: Wednesday, August 19, 2015 11:44:35 AM
> Subject: Re: [Gluster-devel] Skipped files during rebalance
> 
> Comments inline.
> 
> - Original Message -
> > From: "Christophe TREFOIS" 
> > To: "Susant Palai" 
> > Cc: "Raghavendra Gowdappa" , "Nithya Balachandran"
> > , "Shyamsundar
> > Ranganathan" , "Mohammed Rafi K C"
> > , "Gluster Devel"
> > 
> > Sent: Tuesday, August 18, 2015 8:08:41 PM
> > Subject: Re: [Gluster-devel] Skipped files during rebalance
> > 
> > Hi Susant,
> > 
> > Thank you for the response.
> > 
> > > On 18 Aug 2015, at 10:45, Susant Palai  wrote:
> > > 
> > > Hi Christophe,
> > > 
> > >   Need some info regarding the high mem-usage.
> > > 
> > > 1. Top output: To see whether any other process is eating up memory.
> 
> I will be interested to know the memory usage of all the gluster processes, in
> relation to the high mem-usage. These processes include glusterfsd, glusterd,
> gluster, any mount process (glusterfs), and rebalance (glusterfs).
> 
> 
> > > 2. Gluster volume info
> > 
> > root@highlander ~]# gluster volume info
> > 
> > Volume Name: live
> > Type: Distribute
> > Volume ID: 1328637d-7730-4627-8945-bbe43626d527
> > Status: Started
> > Number of Bricks: 9
> > Transport-type: tcp
> > Bricks:
> > Brick1: stor104:/zfs/brick0/brick
> > Brick2: stor104:/zfs/brick1/brick
> > Brick3: stor104:/zfs/brick2/brick
> > Brick4: stor106:/zfs/brick0/brick
> > Brick5: stor106:/zfs/brick1/brick
> > Brick6: stor106:/zfs/brick2/brick
> > Brick7: stor105:/zfs/brick0/brick
> > Brick8: stor105:/zfs/brick1/brick
> > Brick9: stor105:/zfs/brick2/brick
> > Options Reconfigured:
> > diagnostics.count-fop-hits: on
> > diagnostics.latency-measurement: on
> > server.allow-insecure: on
> > cluster.min-free-disk: 1%
> > diagnostics.brick-log-level: ERROR
> > diagnostics.client-log-level: ERROR
> > cluster.data-self-heal-algorithm: full
> > performance.cache-max-file-size: 4MB
> > performance.cache-refresh-timeout: 60
> > performance.cache-size: 1GB
> > performance.client-io-threads: on
> > performance.io-thread-count: 32
> > performance.write-behind-window-size: 4MB
> > 
> > > 3. Is rebalance process still running? If yes can you point to specific
> > > mem
> > > usage by rebalance process? The high mem-usage was seen during rebalance
> > > or even post rebalance?
> > 
> > I would like to restart the rebalance process since it failed… But I can’t
> > as
> > the volume cannot be stopped (I wanted to reboot the servers to have a
> > clean
> > testing grounds).
> > 
> > Here are the logs from the three nodes:
> > http://paste.fedoraproject.org/256183/43989079
> > 
> > Maybe you could help me figure out how to stop the volume?
> > 
> > This is what happens
> > 
> > [root@highlander ~]# gluster volume rebalance live stop
> > volume rebalance: live: failed: Rebalance not started.
> 
> Requesting glusterd team to give input.
> > 
> > [root@highlander ~]# ssh stor105 "gluster volume rebalance live stop"
> > volume rebalance: live: failed: Rebalance not started.
> > 
> > [root@highlander ~]# ssh stor104 "gluster volume rebalance live stop"
> > volume rebalance: live: failed: Rebalance not started.
> > 
> > [root@highlander ~]# ssh stor106 "gluster volume rebalance live stop"
> > volume rebalance: live: failed: Rebalance not started.
> > 
> > [root@highlander ~]# gluster volume rebalance live stop
> > volume rebalance: live: failed: Rebalance not started.
> > 
> > [root@highlander ~]# gluster volume stop live
> > Stopping volume will make its data inaccessible. Do you want to continue?
> > (y/n) y
> > volume stop: live: failed: Staging failed on stor106. Error: rebalance
> > session is in progress for the volume 'live'
> > Staging failed on stor104. Error: rebalance session is in progress for the
> > volume ‘live'
> Can you run [ps aux | grep "rebalance"] on all the servers and post the output
> here? I just want to check whether rebalance is really running or not. Again
> requesting the glusterd team to give inputs.
>

Re: [Gluster-devel] Skipped files during rebalance

2015-08-18 Thread Susant Palai
Comments inline.

- Original Message -
> From: "Christophe TREFOIS" 
> To: "Susant Palai" 
> Cc: "Raghavendra Gowdappa" , "Nithya Balachandran" 
> , "Shyamsundar
> Ranganathan" , "Mohammed Rafi K C" 
> , "Gluster Devel"
> 
> Sent: Tuesday, August 18, 2015 8:08:41 PM
> Subject: Re: [Gluster-devel] Skipped files during rebalance
> 
> Hi Susant,
> 
> Thank you for the response.
> 
> > On 18 Aug 2015, at 10:45, Susant Palai  wrote:
> > 
> > Hi Christophe,
> > 
> >   Need some info regarding the high mem-usage.
> > 
> > 1. Top output: To see whether any other process is eating up memory.

I will be interested to know the memory usage of all the gluster processes, in 
relation to the high mem-usage. These processes include glusterfsd, glusterd, 
gluster, any mount process (glusterfs), and rebalance (glusterfs).
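
One way to collect that across the three servers in one go (a sketch; the ps -C selection of command names is my assumption about what is convenient here, and RSS/VSZ are reported in KB):

```
# Per-process memory for all gluster-related daemons on every node
pdsh -g live 'ps -C glusterd,glusterfsd,glusterfs,gluster -o pid,rss,vsz,args --sort=-rss'
```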


> > 2. Gluster volume info
> 
> root@highlander ~]# gluster volume info
> 
> Volume Name: live
> Type: Distribute
> Volume ID: 1328637d-7730-4627-8945-bbe43626d527
> Status: Started
> Number of Bricks: 9
> Transport-type: tcp
> Bricks:
> Brick1: stor104:/zfs/brick0/brick
> Brick2: stor104:/zfs/brick1/brick
> Brick3: stor104:/zfs/brick2/brick
> Brick4: stor106:/zfs/brick0/brick
> Brick5: stor106:/zfs/brick1/brick
> Brick6: stor106:/zfs/brick2/brick
> Brick7: stor105:/zfs/brick0/brick
> Brick8: stor105:/zfs/brick1/brick
> Brick9: stor105:/zfs/brick2/brick
> Options Reconfigured:
> diagnostics.count-fop-hits: on
> diagnostics.latency-measurement: on
> server.allow-insecure: on
> cluster.min-free-disk: 1%
> diagnostics.brick-log-level: ERROR
> diagnostics.client-log-level: ERROR
> cluster.data-self-heal-algorithm: full
> performance.cache-max-file-size: 4MB
> performance.cache-refresh-timeout: 60
> performance.cache-size: 1GB
> performance.client-io-threads: on
> performance.io-thread-count: 32
> performance.write-behind-window-size: 4MB
> 
> > 3. Is rebalance process still running? If yes can you point to specific mem
> > usage by rebalance process? The high mem-usage was seen during rebalance
> > or even post rebalance?
> 
> I would like to restart the rebalance process since it failed… But I can’t as
> the volume cannot be stopped (I wanted to reboot the servers to have a clean
> testing grounds).
> 
> Here are the logs from the three nodes:
> http://paste.fedoraproject.org/256183/43989079
> 
> Maybe you could help me figure out how to stop the volume?
> 
> This is what happens
> 
> [root@highlander ~]# gluster volume rebalance live stop
> volume rebalance: live: failed: Rebalance not started.

Requesting glusterd team to give input. 
> 
> [root@highlander ~]# ssh stor105 "gluster volume rebalance live stop"
> volume rebalance: live: failed: Rebalance not started.
> 
> [root@highlander ~]# ssh stor104 "gluster volume rebalance live stop"
> volume rebalance: live: failed: Rebalance not started.
> 
> [root@highlander ~]# ssh stor106 "gluster volume rebalance live stop"
> volume rebalance: live: failed: Rebalance not started.
> 
> [root@highlander ~]# gluster volume rebalance live stop
> volume rebalance: live: failed: Rebalance not started.
> 
> [root@highlander ~]# gluster volume stop live
> Stopping volume will make its data inaccessible. Do you want to continue?
> (y/n) y
> volume stop: live: failed: Staging failed on stor106. Error: rebalance
> session is in progress for the volume 'live'
> Staging failed on stor104. Error: rebalance session is in progress for the
> volume ‘live'
Can you run [ps aux | grep "rebalance"] on all the servers and post the output 
here? I just want to check whether rebalance is really running or not. Again 
requesting the glusterd team to give inputs.
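
From the admin node, that check can be done in one shot with the pdsh group used earlier in this thread (the bracketed [r] just keeps grep from matching itself):

```
pdsh -g live 'ps aux | grep "[r]ebalance"'
```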

> 
> 
> > 4. Gluster version
> 
> [root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster'
> stor104: glusterfs-api-3.7.3-1.el7.x86_64
> stor104: glusterfs-server-3.7.3-1.el7.x86_64
> stor104: glusterfs-libs-3.7.3-1.el7.x86_64
> stor104: glusterfs-3.7.3-1.el7.x86_64
> stor104: glusterfs-fuse-3.7.3-1.el7.x86_64
> stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64
> stor104: glusterfs-cli-3.7.3-1.el7.x86_64
> 
> stor105: glusterfs-3.7.3-1.el7.x86_64
> stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64
> stor105: glusterfs-api-3.7.3-1.el7.x86_64
> stor105: glusterfs-cli-3.7.3-1.el7.x86_64
> stor105: glusterfs-server-3.7.3-1.el7.x86_64
> stor105: glusterfs-libs-3.7.3-1.el7.x86_64
> stor105: glusterfs-fuse-3.7.3-1.el7.x86_64
> 
> stor106: glusterfs-libs-3.7.3-1.el7.x86_64
> stor106: glusterfs-fuse-3.7.3-1.el7.x86_64
> stor106:

Re: [Gluster-devel] Skipped files during rebalance

2015-08-18 Thread Susant Palai
++CCing the glusterd team to look into the glusterd part of the problem.

- Original Message -
> From: "Christophe TREFOIS" 
> To: "Susant Palai" 
> Cc: "Raghavendra Gowdappa" , "Nithya Balachandran" 
> , "Shyamsundar
> Ranganathan" , "Mohammed Rafi K C" 
> , "Gluster Devel"
> 
> Sent: Tuesday, August 18, 2015 8:08:41 PM
> Subject: Re: [Gluster-devel] Skipped files during rebalance
> 
> Hi Susant,
> 
> Thank you for the response.
> 
> > On 18 Aug 2015, at 10:45, Susant Palai  wrote:
> > 
> > Hi Christophe,
> > 
> >   Need some info regarding the high mem-usage.
> > 
> > 1. Top output: To see whether any other process is eating up memory.
> > 2. Gluster volume info
> 
> root@highlander ~]# gluster volume info
> 
> Volume Name: live
> Type: Distribute
> Volume ID: 1328637d-7730-4627-8945-bbe43626d527
> Status: Started
> Number of Bricks: 9
> Transport-type: tcp
> Bricks:
> Brick1: stor104:/zfs/brick0/brick
> Brick2: stor104:/zfs/brick1/brick
> Brick3: stor104:/zfs/brick2/brick
> Brick4: stor106:/zfs/brick0/brick
> Brick5: stor106:/zfs/brick1/brick
> Brick6: stor106:/zfs/brick2/brick
> Brick7: stor105:/zfs/brick0/brick
> Brick8: stor105:/zfs/brick1/brick
> Brick9: stor105:/zfs/brick2/brick
> Options Reconfigured:
> diagnostics.count-fop-hits: on
> diagnostics.latency-measurement: on
> server.allow-insecure: on
> cluster.min-free-disk: 1%
> diagnostics.brick-log-level: ERROR
> diagnostics.client-log-level: ERROR
> cluster.data-self-heal-algorithm: full
> performance.cache-max-file-size: 4MB
> performance.cache-refresh-timeout: 60
> performance.cache-size: 1GB
> performance.client-io-threads: on
> performance.io-thread-count: 32
> performance.write-behind-window-size: 4MB
> 
> > 3. Is rebalance process still running? If yes can you point to specific mem
> > usage by rebalance process? The high mem-usage was seen during rebalance
> > or even post rebalance?
> 
> I would like to restart the rebalance process since it failed… But I can’t as
> the volume cannot be stopped (I wanted to reboot the servers to have a clean
> testing grounds).
> 
> Here are the logs from the three nodes:
> http://paste.fedoraproject.org/256183/43989079
> 
> Maybe you could help me figure out how to stop the volume?
> 
> This is what happens
> 
> [root@highlander ~]# gluster volume rebalance live stop
> volume rebalance: live: failed: Rebalance not started.
> 
> [root@highlander ~]# ssh stor105 "gluster volume rebalance live stop"
> volume rebalance: live: failed: Rebalance not started.
> 
> [root@highlander ~]# ssh stor104 "gluster volume rebalance live stop"
> volume rebalance: live: failed: Rebalance not started.
> 
> [root@highlander ~]# ssh stor106 "gluster volume rebalance live stop"
> volume rebalance: live: failed: Rebalance not started.
> 
> [root@highlander ~]# gluster volume rebalance live stop
> volume rebalance: live: failed: Rebalance not started.
> 
> [root@highlander ~]# gluster volume stop live
> Stopping volume will make its data inaccessible. Do you want to continue?
> (y/n) y
> volume stop: live: failed: Staging failed on stor106. Error: rebalance
> session is in progress for the volume 'live'
> Staging failed on stor104. Error: rebalance session is in progress for the
> volume ‘live'
> 
> 
> > 4. Gluster version
> 
> [root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster'
> stor104: glusterfs-api-3.7.3-1.el7.x86_64
> stor104: glusterfs-server-3.7.3-1.el7.x86_64
> stor104: glusterfs-libs-3.7.3-1.el7.x86_64
> stor104: glusterfs-3.7.3-1.el7.x86_64
> stor104: glusterfs-fuse-3.7.3-1.el7.x86_64
> stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64
> stor104: glusterfs-cli-3.7.3-1.el7.x86_64
> 
> stor105: glusterfs-3.7.3-1.el7.x86_64
> stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64
> stor105: glusterfs-api-3.7.3-1.el7.x86_64
> stor105: glusterfs-cli-3.7.3-1.el7.x86_64
> stor105: glusterfs-server-3.7.3-1.el7.x86_64
> stor105: glusterfs-libs-3.7.3-1.el7.x86_64
> stor105: glusterfs-fuse-3.7.3-1.el7.x86_64
> 
> stor106: glusterfs-libs-3.7.3-1.el7.x86_64
> stor106: glusterfs-fuse-3.7.3-1.el7.x86_64
> stor106: glusterfs-client-xlators-3.7.3-1.el7.x86_64
> stor106: glusterfs-api-3.7.3-1.el7.x86_64
> stor106: glusterfs-cli-3.7.3-1.el7.x86_64
> stor106: glusterfs-server-3.7.3-1.el7.x86_64
> stor106: glusterfs-3.7.3-1.el7.x86_64
> 
> > 
> > Will ask for more information in case needed.
> > 
> > Regards,
> > Susant
> > 
> > 
> > - Original Message -
> &


Re: [Gluster-devel] Skipped files during rebalance

2015-08-18 Thread Susant Palai
Hi Christophe,
  
   Need some info regarding the high mem-usage.

1. Top output: To see whether any other process is eating up memory.
2. Gluster volume info
3. Is the rebalance process still running? If yes, can you point to the specific 
memory usage of the rebalance process? Was the high mem-usage seen during the 
rebalance, or even post-rebalance?
4. Gluster version

Will ask for more information in case needed.

Regards,
Susant


- Original Message -
> From: "Christophe TREFOIS" 
> To: "Raghavendra Gowdappa" , "Nithya Balachandran" 
> , "Susant Palai"
> , "Shyamsundar Ranganathan" 
> Cc: "Mohammed Rafi K C" 
> Sent: Monday, 17 August, 2015 7:03:20 PM
> Subject: Fwd: [Gluster-devel] Skipped files during rebalance
> 
> Hi DHT team,
> 
> This email somehow didn’t get forwarded to you.
> 
> In addition to my problem described below, here is one example of free memory
> after everything failed
> 
> [root@highlander ~]# pdsh -g live 'free -m'
> stor106:              total        used        free      shared  buff/cache   available
> stor106: Mem:        193249      124784        1347           9       67118       12769
> stor106: Swap:            0           0           0
> stor104:              total        used        free      shared  buff/cache   available
> stor104: Mem:        193249      107617       31323           9       54308       42752
> stor104: Swap:            0           0           0
> stor105:              total        used        free      shared  buff/cache   available
> stor105: Mem:        193248      141804        6736           9       44707        9713
> stor105: Swap:            0           0           0
> 
> So after the failed operation, there’s almost no memory free, and it is also
> not freed up.
> 
> Thank you for pointing me to any directions,
> 
> Kind regards,
> 
> —
> Christophe
> 
> 
> Begin forwarded message:
> 
> From: Christophe TREFOIS <christophe.tref...@uni.lu>
> Subject: Re: [Gluster-devel] Skipped files during rebalance
> Date: 17 Aug 2015 11:54:32 CEST
> To: Mohammed Rafi K C <rkavu...@redhat.com>
> Cc: gluster-devel@gluster.org
> 
> Dear Rafi,
> 
> Thanks for submitting a patch.
> 
> @DHT, I have two additional questions / problems.
> 
> 1. When doing a rebalance (with data), RAM consumption on the nodes goes
> dramatically high; e.g., out of 196 GB available per node, RAM usage would fill
> up to 195.6 GB. This seems quite excessive and strange to me.
> 
> 2. As you can see, the rebalance (with data) failed as one endpoint became
> unconnected (even though it still is connected). I’m thinking this could be
> due to the high RAM usage?
> 
> Thank you for your help,
> 
> —
> Christophe
> 
> Dr Christophe Trefois, Dipl.-Ing.
> Technical Specialist / Post-Doc
> 
> UNIVERSITÉ DU LUXEMBOURG
> 
> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
> Campus Belval | House of Biomedicine
> 6, avenue du Swing
> L-4367 Belvaux
> T: +352 46 66 44 6124
> F: +352 46 66 44 6949
> http://www.uni.lu/lcsb
> 
> 
> 
> 
> 
> 
> On 17 Aug 2015, at 11:27, Mohammed Rafi K C <rkavu...@redhat.com> wrote:
> 
> 
> 
> On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:
> Dear all,
> 
> I have successfully added a new node to our setup, and finally managed to get
> a successful fix-layout run as well with no errors.
> 
> Now, as per the documentation, I started a gluster volume rebalance live
> start task and I see many skipped files.
> The error log then contains entries like the following for each skipped file.
> 
> [2015-08-16 20:23:30.591161] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
> s_05(2013-10-11_17-12-02)/004010008.flex lookup failed
> [2015-08-16 20:23:30.768391] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
> 

Re: [Gluster-devel] Skipped files during rebalance

2015-08-17 Thread Christophe TREFOIS
Dear Rafi,

Thanks for submitting a patch.

@DHT, I have two additional questions / problems.

1. When doing a rebalance (with data), RAM consumption on the nodes goes 
dramatically high; e.g., out of 196 GB available per node, RAM usage would fill up 
to 195.6 GB. This seems quite excessive and strange to me.

2. As you can see, the rebalance (with data) failed as one endpoint became 
unconnected (even though it still is connected). I’m thinking this could be due 
to the high RAM usage?

Thank you for your help,

—
Christophe

Dr Christophe Trefois, Dipl.-Ing.
Technical Specialist / Post-Doc

UNIVERSITÉ DU LUXEMBOURG

LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
Campus Belval | House of Biomedicine
6, avenue du Swing
L-4367 Belvaux
T: +352 46 66 44 6124
F: +352 46 66 44 6949
http://www.uni.lu/lcsb








On 17 Aug 2015, at 11:27, Mohammed Rafi K C <rkavu...@redhat.com> wrote:



On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:
Dear all,

I have successfully added a new node to our setup, and finally managed to get a 
successful fix-layout run as well with no errors.

Now, as per the documentation, I started a gluster volume rebalance live start 
task and I see many skipped files.
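
For reference, that workflow boils down to the standard rebalance CLI sequence; a sketch, assuming the volume name "live" used throughout this thread (the status output it produces is the table quoted earlier):

```
# fix-layout first (spreads the new layout, moves no data), then the data rebalance
gluster volume rebalance live fix-layout start
gluster volume rebalance live status

gluster volume rebalance live start
gluster volume rebalance live status    # poll until each node reports completed or failed
```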
The error log then contains entries like the following for each skipped file.

[2015-08-16 20:23:30.591161] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/004010008.flex lookup failed
[2015-08-16 20:23:30.768391] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/007005003.flex lookup failed
[2015-08-16 20:23:30.804811] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/006005009.flex lookup failed
[2015-08-16 20:23:30.805201] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/005006011.flex lookup failed
[2015-08-16 20:23:30.880037] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/005009012.flex lookup failed
[2015-08-16 20:23:31.038236] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/003008007.flex lookup failed
[2015-08-16 20:23:31.259762] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/004008006.flex lookup failed
[2015-08-16 20:23:31.333764] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/007008001.flex lookup failed
[2015-08-16 20:23:31.340190] E [MSGID: 109023] 
[dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file 
failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
s_05(2013-10-11_17-12-02)/006007004.flex lookup failed

Update: one of the rebalance tasks now failed.

@Rafi, I got the same error as Friday except this time with data.

Packets carrying the ping request could be waiting in the queue for the whole 
time-out period because of heavy traffic on the network. I have sent a patch for 
this. You can track its status here: 
http://review.gluster.org/11935



[2015-08-16 20:24:34.533167] C 
[rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0: server 
192.168.123.104:49164 has not responded in the last 42 seconds, disconnecting.
[2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (--> 
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> 
/lib64/libgfrpc.so.0(saved_frames_unwin
d+0x1de)[0x7fa454bb09be] (--> 
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> 
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> 
/lib64/li
bgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ) 0-live-client-0: forced 
unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2015-08-16 
20:

Re: [Gluster-devel] Skipped files during rebalance

2015-08-17 Thread Mohammed Rafi K C


On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:
>
> Dear all,
>
>  
>
> I have successfully added a new node to our setup, and finally managed
> to get a successful fix-layout run as well with no errors.
>
>  
>
> Now, as per the documentation, I started a gluster volume rebalance
> live start task and I see many skipped files. 
>
> The error log then contains entries like the following for each skipped file.
>
>  
>
> [2015-08-16 20:23:30.591161] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
>
> s_05(2013-10-11_17-12-02)/004010008.flex lookup failed
>
> [2015-08-16 20:23:30.768391] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
>
> s_05(2013-10-11_17-12-02)/007005003.flex lookup failed
>
> [2015-08-16 20:23:30.804811] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
>
> s_05(2013-10-11_17-12-02)/006005009.flex lookup failed
>
> [2015-08-16 20:23:30.805201] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
>
> s_05(2013-10-11_17-12-02)/005006011.flex lookup failed
>
> [2015-08-16 20:23:30.880037] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
>
> s_05(2013-10-11_17-12-02)/005009012.flex lookup failed
>
> [2015-08-16 20:23:31.038236] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
>
> s_05(2013-10-11_17-12-02)/003008007.flex lookup failed
>
> [2015-08-16 20:23:31.259762] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
>
> s_05(2013-10-11_17-12-02)/004008006.flex lookup failed
>
> [2015-08-16 20:23:31.333764] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
>
> s_05(2013-10-11_17-12-02)/007008001.flex lookup failed
>
> [2015-08-16 20:23:31.340190] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea
>
> s_05(2013-10-11_17-12-02)/006007004.flex lookup failed
>
>  
>
> Update: one of the rebalance tasks now failed.
>
>  
>
> @Rafi, I got the same error as Friday except this time with data.
>

Packets carrying the ping request could be waiting in the queue for the whole
time-out period because of heavy traffic on the network. I have sent a patch
for this. You can track its status here:
http://review.gluster.org/11935
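
Until that patch lands, the timeout itself is tunable per volume; the 42 seconds in the log is simply the default network.ping-timeout. A possible stop-gap (my suggestion, not part of the patch above; volume name "live" assumed):

```
# Give heavily loaded nodes more headroom before the client declares a disconnect
gluster volume set live network.ping-timeout 90
gluster volume info live | grep ping-timeout   # appears under "Options Reconfigured"
```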


>  
>
> [2015-08-16 20:24:34.533167] C
> [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0:
> server 192.168.123.104:49164 has not responded in the last 42 seconds,
> disconnecting.
>
> [2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwin
>
> d+0x1de)[0x7fa454bb09be] (-->
> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/li
>
> bgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(INODELK(29)) called at 2015-08-16 20:23:51.305640 (xid=0x5dd4da)
>
> [2015-08-16 20:24:34.533672] E [MSGID: 114031]
> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote
> operation failed [Transport endpoint is not connected]
>
> [2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwin
>
> d+0x1de)[0x7fa454bb09be] (-->
> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/li
>
> bgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(READ(12)) called at 2015-08-16 20:23:51.303938 (xid=0x5dd4d7)
>
> [2015-08-16 20:24:34.534347] E [MSGID: 109023]
> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file
> failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_
>
> 12(2013-10-12_00-12-55)/007008007.flex: failed to migrate data
>
> [2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /l