Thanks Christophe for the details. Will get back to you with the analysis.

Regards,
Susant

----- Original Message -----
From: "Christophe TREFOIS" <christophe.tref...@uni.lu>
To: "Susant Palai" <spa...@redhat.com>
Cc: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Nithya Balachandran" 
<nbala...@redhat.com>, "Shyamsundar Ranganathan" <srang...@redhat.com>, 
"Mohammed Rafi K C" <rkavu...@redhat.com>, "Gluster Devel" 
<gluster-devel@gluster.org>
Sent: Friday, 21 August, 2015 12:39:05 AM
Subject: Re: [Gluster-devel] Skipped files during rebalance

Dear Susant,

The rebalance failed again and also had (in my opinion) excessive RAM usage.

Please find a very detailed list below.

All logs:

http://wikisend.com/download/651948/allstores.tar.gz

Thank you in advance for letting me know how I could successfully complete the rebalance process.
The Fedora pastes are the output of top on each node at (more or less) that time.

Please let me know if you need more information,

Best,

—— Start of mem info

# After reboot, before starting glusterd

[root@highlander ~]# pdsh -g live 'free -m'
stor106:               total        used        free      shared  buff/cache   available
stor106: Mem:         193249        2208      190825           9         215      190772
stor106: Swap:             0           0           0
stor105:               total        used        free      shared  buff/cache   available
stor105: Mem:         193248        2275      190738           9         234      190681
stor105: Swap:             0           0           0
stor104:               total        used        free      shared  buff/cache   available
stor104: Mem:         193249        2221      190811           9         216      190757
stor104: Swap:             0           0           0
[root@highlander ~]#

# Gluster Info

[root@stor106 glusterfs]# gluster volume info

Volume Name: live
Type: Distribute
Volume ID: 1328637d-7730-4627-8945-bbe43626d527
Status: Started
Number of Bricks: 9
Transport-type: tcp
Bricks:
Brick1: stor104:/zfs/brick0/brick
Brick2: stor104:/zfs/brick1/brick
Brick3: stor104:/zfs/brick2/brick
Brick4: stor106:/zfs/brick0/brick
Brick5: stor106:/zfs/brick1/brick
Brick6: stor106:/zfs/brick2/brick
Brick7: stor105:/zfs/brick0/brick
Brick8: stor105:/zfs/brick1/brick
Brick9: stor105:/zfs/brick2/brick
Options Reconfigured:
nfs.disable: true
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.write-behind-window-size: 4MB
performance.io-thread-count: 32
performance.client-io-threads: on
performance.cache-size: 1GB
performance.cache-refresh-timeout: 60
performance.cache-max-file-size: 4MB
cluster.data-self-heal-algorithm: full
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
cluster.min-free-disk: 1%
server.allow-insecure: on

# Starting glusterd

[root@highlander ~]# pdsh -g live 'systemctl start glusterd'
[root@highlander ~]# pdsh -g live 'free -m'
stor106:               total        used        free      shared  buff/cache   available
stor106: Mem:         193249        2290      190569           9         389      190587
stor106: Swap:             0           0           0
stor104:               total        used        free      shared  buff/cache   available
stor104: Mem:         193249        2297      190557           9         394      190571
stor104: Swap:             0           0           0
stor105:               total        used        free      shared  buff/cache   available
stor105: Mem:         193248        2286      190554           9         407      190595
stor105: Swap:             0           0           0

[root@highlander ~]# systemctl start glusterd
[root@highlander ~]# gluster volume start live
volume start: live: success
[root@highlander ~]# gluster volume status
Status of volume: live
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick stor104:/zfs/brick0/brick             49164     0          Y       5945
Brick stor104:/zfs/brick1/brick             49165     0          Y       5963
Brick stor104:/zfs/brick2/brick             49166     0          Y       5981
Brick stor106:/zfs/brick0/brick             49158     0          Y       5256
Brick stor106:/zfs/brick1/brick             49159     0          Y       5274
Brick stor106:/zfs/brick2/brick             49160     0          Y       5292
Brick stor105:/zfs/brick0/brick             49155     0          Y       5284
Brick stor105:/zfs/brick1/brick             49156     0          Y       5302
Brick stor105:/zfs/brick2/brick             49157     0          Y       5320
NFS Server on localhost                     N/A       N/A        N       N/A
NFS Server on 192.168.123.106               N/A       N/A        N       N/A
NFS Server on stor105                       N/A       N/A        N       N/A
NFS Server on 192.168.123.104               N/A       N/A        N       N/A

Task Status of Volume live
------------------------------------------------------------------------------
There are no active volume tasks

[root@highlander ~]#

# Memory usage of each node after 5 minutes

Output of top:

pdsh -g live 'top -n 1 -b' | fpaste

http://paste.fedoraproject.org/256710/14399886/
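
Since the full top dumps are verbose, the memory of just the gluster processes could also be sampled directly; a minimal sketch, assuming the standard process names (glusterd for the management daemon, glusterfsd for the bricks, glusterfs for mounts and the rebalance process):

```bash
# Sample per-process memory (RSS/VSZ in KiB) of the gluster daemons on every node of the "live" group.
pdsh -g live 'ps -C glusterd,glusterfsd,glusterfs -o pid,rss,vsz,etime,comm --sort=-rss'
```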

[root@highlander ~]# pdsh -g live 'free -m'
stor106:               total        used        free      shared  buff/cache   available
stor106: Mem:         193249        6877      184154           9        2218      184250
stor106: Swap:             0           0           0
stor105:               total        used        free      shared  buff/cache   available
stor105: Mem:         193248       22126      169351           9        1771      169403
stor105: Swap:             0           0           0
stor104:               total        used        free      shared  buff/cache   available
stor104: Mem:         193249        2708      188638           9        1902      188687
stor104: Swap:             0           0           0


# Memory usage of each node after 45 minutes

[root@highlander ~]# pdsh -g live 'free -m'
stor104:               total        used        free      shared  buff/cache   available
stor104: Mem:         193249        3131      184168           9        5949      184524
stor104: Swap:             0           0           0
stor106:               total        used        free      shared  buff/cache   available
stor106: Mem:         193249       27919      158176           9        7153      158894
stor106: Swap:             0           0           0
stor105:               total        used        free      shared  buff/cache   available
stor105: Mem:         193248      117096       70621           9        5530       70891
stor105: Swap:             0           0           0

http://paste.fedoraproject.org/256726/43999119

# Memory usage of each node after 90 minutes

[root@highlander ~]# pdsh -g live 'free -m'
stor104:               total        used        free      shared  buff/cache   available
stor104: Mem:         193249        3390      181034           9        8825      181661
stor104: Swap:             0           0           0
stor106:               total        used        free      shared  buff/cache   available
stor106: Mem:         193249       45780      136424           9       11044      137759
stor106: Swap:             0           0           0
stor105:               total        used        free      shared  buff/cache   available
stor105: Mem:         193248      151483       33492           9        8272       33972
stor105: Swap:             0           0           0

http://paste.fedoraproject.org/256745/14399937

# Memory usage after 5 hours

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor104:               total        used        free      shared  buff/cache   available
stor104: Mem:         193249        4645      163186           9       25417      165473
stor104: Swap:             0           0           0
stor105:               total        used        free      shared  buff/cache   available
stor105: Mem:         193248      155094       14784           9       23369       16640
stor105: Swap:             0           0           0
stor106:               total        used        free      shared  buff/cache   available
stor106: Mem:         193249      141379       16515           9       35355       23714
stor106: Swap:             0           0           0
```

http://paste.fedoraproject.org/256879/44001235

# Memory usage after 6 hours

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor106:               total        used        free      shared  buff/cache   available
stor106: Mem:         193249      140526       12207           9       40516       21612
stor106: Swap:             0           0           0
stor104:               total        used        free      shared  buff/cache   available
stor104: Mem:         193249      102923       58748           9       31578       63632
stor104: Swap:             0           0           0
stor105:               total        used        free      shared  buff/cache   available
stor105: Mem:         193248      155394       10876           9       26977       13154
stor105: Swap:             0           0           0
```

http://paste.fedoraproject.org/256905/00168781

# Memory after 24 hours + Failed

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor105:               total        used        free      shared  buff/cache   available
stor105: Mem:         193248      136123        6323           9       50801       10281
stor105: Swap:             0           0           0
stor104:               total        used        free      shared  buff/cache   available
stor104: Mem:         193249      125320        2812           9       65116       17337
stor104: Swap:             0           0           0
stor106:               total        used        free      shared  buff/cache   available
stor106: Mem:         193249      111997       13969           9       67282       19429
stor106: Swap:             0           0           0
[root@highlander ~]#
```

 http://paste.fedoraproject.org/257254/14400880

# Failed logs

```bash
[root@highlander ~]# gluster volume rebalance live status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                         192.168.123.104           748812         4.4TB       4160456          1311        156772               failed           63114.00
                         192.168.123.106          1187917         3.3TB       6021931         21625       1209503               failed           75243.00
                                 stor105                0        0Bytes       2440431            16           196               failed           69658.00
volume rebalance: live: success:
```
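
For quick reference, the failure-level entries could also be pulled directly from each node's rebalance log; a minimal sketch, assuming the default log location /var/log/glusterfs/<volname>-rebalance.log:

```bash
# Show the 20 most recent error (E) and critical (C) entries from the rebalance log on every node.
# The log path is an assumption based on the default /var/log/glusterfs naming for the "live" volume.
pdsh -g live "grep -E '\] [EC] \[' /var/log/glusterfs/live-rebalance.log | tail -n 20"
```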



Dr Christophe Trefois, Dipl.-Ing.  
Technical Specialist / Post-Doc

UNIVERSITÉ DU LUXEMBOURG

LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
Campus Belval | House of Biomedicine  
6, avenue du Swing 
L-4367 Belvaux  
T: +352 46 66 44 6124 
F: +352 46 66 44 6949  
http://www.uni.lu/lcsb

        

----
This message is confidential and may contain privileged information. 
It is intended for the named recipient only. 
If you receive it in error please notify me and permanently delete the original 
message and any copies. 
----

  

> On 19 Aug 2015, at 08:14, Susant Palai <spa...@redhat.com> wrote:
> 
> Comments inline.
> 
> ----- Original Message -----
>> From: "Christophe TREFOIS" <christophe.tref...@uni.lu>
>> To: "Susant Palai" <spa...@redhat.com>
>> Cc: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Nithya Balachandran" 
>> <nbala...@redhat.com>, "Shyamsundar
>> Ranganathan" <srang...@redhat.com>, "Mohammed Rafi K C" 
>> <rkavu...@redhat.com>, "Gluster Devel"
>> <gluster-devel@gluster.org>
>> Sent: Tuesday, August 18, 2015 8:08:41 PM
>> Subject: Re: [Gluster-devel] Skipped files during rebalance
>> 
>> Hi Susant,
>> 
>> Thank you for the response.
>> 
>>> On 18 Aug 2015, at 10:45, Susant Palai <spa...@redhat.com> wrote:
>>> 
>>> Hi Christophe,
>>> 
>>>  Need some info regarding the high mem-usage.
>>> 
>>> 1. Top output: to see whether any other process is eating up memory.
> 
> I would be interested in the memory usage of all the gluster processes
> involved in the high mem-usage. These processes include glusterfsd, glusterd,
> gluster, any mount process (glusterfs), and rebalance (glusterfs).
> 
> 
>>> 2. Gluster volume info
>> 
>> [root@highlander ~]# gluster volume info
>> 
>> Volume Name: live
>> Type: Distribute
>> Volume ID: 1328637d-7730-4627-8945-bbe43626d527
>> Status: Started
>> Number of Bricks: 9
>> Transport-type: tcp
>> Bricks:
>> Brick1: stor104:/zfs/brick0/brick
>> Brick2: stor104:/zfs/brick1/brick
>> Brick3: stor104:/zfs/brick2/brick
>> Brick4: stor106:/zfs/brick0/brick
>> Brick5: stor106:/zfs/brick1/brick
>> Brick6: stor106:/zfs/brick2/brick
>> Brick7: stor105:/zfs/brick0/brick
>> Brick8: stor105:/zfs/brick1/brick
>> Brick9: stor105:/zfs/brick2/brick
>> Options Reconfigured:
>> diagnostics.count-fop-hits: on
>> diagnostics.latency-measurement: on
>> server.allow-insecure: on
>> cluster.min-free-disk: 1%
>> diagnostics.brick-log-level: ERROR
>> diagnostics.client-log-level: ERROR
>> cluster.data-self-heal-algorithm: full
>> performance.cache-max-file-size: 4MB
>> performance.cache-refresh-timeout: 60
>> performance.cache-size: 1GB
>> performance.client-io-threads: on
>> performance.io-thread-count: 32
>> performance.write-behind-window-size: 4MB
>> 
>>> 3. Is the rebalance process still running? If yes, can you point to the specific mem
>>> usage by the rebalance process? Was the high mem-usage seen during rebalance
>>> or even post rebalance?
>> 
>> I would like to restart the rebalance process since it failed… but I can’t, as
>> the volume cannot be stopped (I wanted to reboot the servers to have a clean
>> testing ground).
>> 
>> Here are the logs from the three nodes:
>> http://paste.fedoraproject.org/256183/43989079
>> 
>> Maybe you could help me figure out how to stop the volume?
>> 
>> This is what happens
>> 
>> [root@highlander ~]# gluster volume rebalance live stop
>> volume rebalance: live: failed: Rebalance not started.
> 
> Requesting glusterd team to give input. 
>> 
>> [root@highlander ~]# ssh stor105 "gluster volume rebalance live stop"
>> volume rebalance: live: failed: Rebalance not started.
>> 
>> [root@highlander ~]# ssh stor104 "gluster volume rebalance live stop"
>> volume rebalance: live: failed: Rebalance not started.
>> 
>> [root@highlander ~]# ssh stor106 "gluster volume rebalance live stop"
>> volume rebalance: live: failed: Rebalance not started.
>> 
>> [root@highlander ~]# gluster volume rebalance live stop
>> volume rebalance: live: failed: Rebalance not started.
>> 
>> [root@highlander ~]# gluster volume stop live
>> Stopping volume will make its data inaccessible. Do you want to continue?
>> (y/n) y
>> volume stop: live: failed: Staging failed on stor106. Error: rebalance
>> session is in progress for the volume 'live'
>> Staging failed on stor104. Error: rebalance session is in progress for the
>> volume 'live'
> Can you run [ps aux | grep "rebalance"] on all the servers and post the output here?
> I just want to check whether rebalance is really running or not. Again,
> requesting the glusterd team to give inputs.
> 
>> 
>> 
>>> 4. Gluster version
>> 
>> [root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster'
>> stor104: glusterfs-api-3.7.3-1.el7.x86_64
>> stor104: glusterfs-server-3.7.3-1.el7.x86_64
>> stor104: glusterfs-libs-3.7.3-1.el7.x86_64
>> stor104: glusterfs-3.7.3-1.el7.x86_64
>> stor104: glusterfs-fuse-3.7.3-1.el7.x86_64
>> stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64
>> stor104: glusterfs-cli-3.7.3-1.el7.x86_64
>> 
>> stor105: glusterfs-3.7.3-1.el7.x86_64
>> stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64
>> stor105: glusterfs-api-3.7.3-1.el7.x86_64
>> stor105: glusterfs-cli-3.7.3-1.el7.x86_64
>> stor105: glusterfs-server-3.7.3-1.el7.x86_64
>> stor105: glusterfs-libs-3.7.3-1.el7.x86_64
>> stor105: glusterfs-fuse-3.7.3-1.el7.x86_64
>> 
>> stor106: glusterfs-libs-3.7.3-1.el7.x86_64
>> stor106: glusterfs-fuse-3.7.3-1.el7.x86_64
>> stor106: glusterfs-client-xlators-3.7.3-1.el7.x86_64
>> stor106: glusterfs-api-3.7.3-1.el7.x86_64
>> stor106: glusterfs-cli-3.7.3-1.el7.x86_64
>> stor106: glusterfs-server-3.7.3-1.el7.x86_64
>> stor106: glusterfs-3.7.3-1.el7.x86_64
>> 
>>> 
>>> Will ask for more information in case needed.
>>> 
>>> Regards,
>>> Susant
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Christophe TREFOIS" <christophe.tref...@uni.lu>
>>>> To: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Nithya Balachandran"
>>>> <nbala...@redhat.com>, "Susant Palai"
>>>> <spa...@redhat.com>, "Shyamsundar Ranganathan" <srang...@redhat.com>
>>>> Cc: "Mohammed Rafi K C" <rkavu...@redhat.com>
>>>> Sent: Monday, 17 August, 2015 7:03:20 PM
>>>> Subject: Fwd: [Gluster-devel] Skipped files during rebalance
>>>> 
>>>> Hi DHT team,
>>>> 
>>>> This email somehow didn’t get forwarded to you.
>>>> 
>>>> In addition to my problem described below, here is one example of free
>>>> memory
>>>> after everything failed
>>>> 
>>>> [root@highlander ~]# pdsh -g live 'free -m'
>>>> stor106:               total        used        free      shared  buff/cache   available
>>>> stor106: Mem:         193249      124784        1347           9       67118       12769
>>>> stor106: Swap:             0           0           0
>>>> stor104:               total        used        free      shared  buff/cache   available
>>>> stor104: Mem:         193249      107617       31323           9       54308       42752
>>>> stor104: Swap:             0           0           0
>>>> stor105:               total        used        free      shared  buff/cache   available
>>>> stor105: Mem:         193248      141804        6736           9       44707        9713
>>>> stor105: Swap:             0           0           0
>>>> 
>>>> So after the failed operation, there’s almost no memory free, and it is
>>>> also
>>>> not freed up.
>>>> 
>>>> Thank you for pointing me to any directions,
>>>> 
>>>> Kind regards,
>>>> 
>>>> —
>>>> Christophe
>>>> 
>>>> 
>>>> Begin forwarded message:
>>>> 
>>>> From: Christophe TREFOIS <christophe.tref...@uni.lu>
>>>> Subject: Re: [Gluster-devel] Skipped files during rebalance
>>>> Date: 17 Aug 2015 11:54:32 CEST
>>>> To: Mohammed Rafi K C <rkavu...@redhat.com>
>>>> Cc: "gluster-devel@gluster.org" <gluster-devel@gluster.org>
>>>> 
>>>> Dear Rafi,
>>>> 
>>>> Thanks for submitting a patch.
>>>> 
>>>> @DHT, I have two additional questions / problems.
>>>> 
>>>> 1. When doing a rebalance (with data), RAM consumption on the nodes goes
>>>> dramatically high, e.g. out of 196 GB available per node, RAM usage would
>>>> fill up to 195.6 GB. This seems quite excessive and strange to me.
>>>> 
>>>> 2. As you can see, the rebalance (with data) failed because one endpoint
>>>> became disconnected (even though it actually still is connected). I’m
>>>> thinking this could be due to the high RAM usage?
>>>> 
>>>> Thank you for your help,
>>>> 
>>>> —
>>>> Christophe
>>>> 
>>>> Dr Christophe Trefois, Dipl.-Ing.
>>>> Technical Specialist / Post-Doc
>>>> 
>>>> UNIVERSITÉ DU LUXEMBOURG
>>>> 
>>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>>>> Campus Belval | House of Biomedicine
>>>> 6, avenue du Swing
>>>> L-4367 Belvaux
>>>> T: +352 46 66 44 6124
>>>> F: +352 46 66 44 6949
>>>> http://www.uni.lu/lcsb
>>>> 
>>>> 
>>>> 
>>>> ----
>>>> This message is confidential and may contain privileged information.
>>>> It is intended for the named recipient only.
>>>> If you receive it in error please notify me and permanently delete the
>>>> original message and any copies.
>>>> ----
>>>> 
>>>> 
>>>> 
>>>> On 17 Aug 2015, at 11:27, Mohammed Rafi K C <rkavu...@redhat.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:
>>>> Dear all,
>>>> 
>>>> I have successfully added a new node to our setup, and finally managed to
>>>> get
>>>> a successful fix-layout run as well with no errors.
>>>> 
>>>> Now, as per the documentation, I started a gluster volume rebalance live
>>>> start task and I see many skipped files.
>>>> The error log then contains entries like the following for each skipped file.
>>>> 
>>>> [2015-08-16 20:23:30.591161] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004010008.flex lookup failed
>>>> [2015-08-16 20:23:30.768391] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007005003.flex lookup failed
>>>> [2015-08-16 20:23:30.804811] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006005009.flex lookup failed
>>>> [2015-08-16 20:23:30.805201] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005006011.flex lookup failed
>>>> [2015-08-16 20:23:30.880037] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005009012.flex lookup failed
>>>> [2015-08-16 20:23:31.038236] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/003008007.flex lookup failed
>>>> [2015-08-16 20:23:31.259762] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004008006.flex lookup failed
>>>> [2015-08-16 20:23:31.333764] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007008001.flex lookup failed
>>>> [2015-08-16 20:23:31.340190] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006007004.flex lookup failed
>>>> 
>>>> Update: one of the rebalance tasks now failed.
>>>> 
>>>> @Rafi, I got the same error as on Friday, except this time with data.
>>>> 
>>>> Packets carrying the ping request could be waiting in the queue for the
>>>> whole time-out period because of the heavy traffic in the network. I
>>>> have sent a patch for this. You can track its status here:
>>>> http://review.gluster.org/11935
>>>> 
>>>> 
>>>> 
>>>> [2015-08-16 20:24:34.533167] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0: server 192.168.123.104:49164 has not responded in the last 42 seconds, disconnecting.
>>>> [2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2015-08-16 20:23:51.305640 (xid=0x5dd4da)
>>>> [2015-08-16 20:24:34.533672] E [MSGID: 114031] [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote operation failed [Transport endpoint is not connected]
>>>> [2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.303938 (xid=0x5dd4d7)
>>>> [2015-08-16 20:24:34.534347] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007008007.flex: failed to migrate data
>>>> [2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.303969 (xid=0x5dd4d8)
>>>> [2015-08-16 20:24:34.534579] E [MSGID: 109023]
>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>> /hcs/hcs/OperaArchiveCol/SK
>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007009012.flex:
>>>> failed to migrate data
>>>> [2015-08-16 20:24:34.534676] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>> (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12))
>>>> called at 2015-08-16 20:23:51.313548 (xid=0x5dd4db)
>>>> [2015-08-16 20:24:34.534745] E [MSGID: 109023]
>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>> /hcs/hcs/OperaArchiveCol/SK
>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/006008011.flex:
>>>> failed to migrate data
>>>> [2015-08-16 20:24:34.535199] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>> (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12))
>>>> called at 2015-08-16 20:23:51.326369 (xid=0x5dd4dc)
>>>> [2015-08-16 20:24:34.535232] E [MSGID: 109023]
>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>> /hcs/hcs/OperaArchiveCol/SK
>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005003001.flex:
>>>> failed to migrate data
>>>> [2015-08-16 20:24:34.535984] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>> (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12))
>>>> called at 2015-08-16 20:23:51.326437 (xid=0x5dd4dd)
>>>> [2015-08-16 20:24:34.536069] E [MSGID: 109023]
>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>> /hcs/hcs/OperaArchiveCol/SK
>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007010012.flex:
>>>> failed to migrate data
>>>> [2015-08-16 20:24:34.536267] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>> (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
>>>> called at 2015-08-16 20:23:51.337240 (xid=0x5dd4de)
>>>> [2015-08-16 20:24:34.536339] E [MSGID: 109023]
>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/002005012.flex
>>>> lookup failed
>>>> [2015-08-16 20:24:34.536487] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>> (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
>>>> called at 2015-08-16 20:23:51.425254 (xid=0x5dd4df)
>>>> [2015-08-16 20:24:34.536685] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>> (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
>>>> called at 2015-08-16 20:23:51.738907 (xid=0x5dd4e0)
>>>> [2015-08-16 20:24:34.536891] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>> (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
>>>> called at 2015-08-16 20:23:51.805096 (xid=0x5dd4e1)
>>>> [2015-08-16 20:24:34.537316] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>> (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
>>>> called at 2015-08-16 20:23:51.805977 (xid=0x5dd4e2)
>>>> [2015-08-16 20:24:34.537735] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>> (-->
>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>> 0-live-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called
>>>> at
>>>> 2015-08-16 20:23:52.530107 (xid=0x5dd4e3)
>>>> [2015-08-16 20:24:34.538475] E [MSGID: 114031]
>>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote
>>>> operation failed [Transport endpoint is not connected]
>>>> The message "E [MSGID: 114031]
>>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk]
>>>> 0-live-client-0: remote operation failed [Transport endpoint is not
>>>> connected]" repeated 4 times between [2015-08-16 20:24:34.538475] and
>>>> [2015-08-16 20:24:34.538535]
>>>> [2015-08-16 20:24:34.538584] E [MSGID: 109023]
>>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate
>>>> file failed: 002004003.flex lookup failed
>>>> [2015-08-16 20:24:34.538904] E [MSGID: 109023]
>>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate
>>>> file failed: 003009008.flex lookup failed
>>>> [2015-08-16 20:24:34.539724] E [MSGID: 109023]
>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/005009006.flex
>>>> lookup failed
>>>> [2015-08-16 20:24:34.539820] E [MSGID: 109016]
>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed
>>>> for /hcs/hcs/OperaArchiveCol/SK
>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)
>>>> [2015-08-16 20:24:34.540031] E [MSGID: 109016]
>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed
>>>> for /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1
>>>> [2015-08-16 20:24:34.540691] E [MSGID: 114031]
>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote
>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK
>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/002005008.flex
>>>> [Transport endpoint is not connected]
>>>> [2015-08-16 20:24:34.541152] E [MSGID: 114031]
>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote
>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK
>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005004009.flex
>>>> [Transport endpoint is not connected]
>>>> [2015-08-16 20:24:34.541331] E [MSGID: 114031]
>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote
>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK
>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007005011.flex
>>>> [Transport endpoint is not connected]
>>>> [2015-08-16 20:24:34.541486] E [MSGID: 109016]
>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed
>>>> for /hcs/hcs/OperaArchiveCol
>>>> [2015-08-16 20:24:34.541572] E [MSGID: 109016]
>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed
>>>> for /hcs/hcs
>>>> [2015-08-16 20:24:34.541639] E [MSGID: 109016]
>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed
>>>> for /hcs
>>>> 
>>>> Any help would be greatly appreciated.
>>>> CCing the DHT team to give you a better idea about why the rebalance failed
>>>> and about the huge memory consumption by the rebalance process (200 GB RAM).
>>>> 
>>>> Regards
>>>> Rafi KC
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> --
>>>> Christophe
>>>> 
>>>> Dr Christophe Trefois, Dipl.-Ing.
>>>> Technical Specialist / Post-Doc
>>>> 
>>>> UNIVERSITÉ DU LUXEMBOURG
>>>> 
>>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>>>> Campus Belval | House of Biomedicine
>>>> 6, avenue du Swing
>>>> L-4367 Belvaux
>>>> T: +352 46 66 44 6124
>>>> F: +352 46 66 44 6949
>>>> http://www.uni.lu/lcsb
>>>> 
>>>> ----
>>>> This message is confidential and may contain privileged information.
>>>> It is intended for the named recipient only.
>>>> If you receive it in error please notify me and permanently delete the
>>>> original message and any copies.
>>>> ----
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel@gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>> 
>> 

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
