Re: [Gluster-users] heaps split-brains during back-transfert

2015-08-03 Thread Geoffrey Letessier
Hi Vijay,

Yes of course. I sent my email after running some tests and checks and the 
result was still wrong (even a couple of hours/1 day after having forced 
the start of every brick)… until I decided to do a « du » on every quota 
path. Now everything seems to be roughly OK, as you can read below:
# gluster volume quota vol_home list
                  Path                   Hard-limit  Soft-limit      Used  Available  Soft-limit exceeded? Hard-limit exceeded?
---------------------------------------------------------------------------------------------------------------------------
/simlab_team                                  5.0TB         80%     1.2TB      3.8TB                    No                   No
/amyloid_team                                 7.0TB         80%     4.9TB      2.1TB                    No                   No
/amyloid_team/nguyen                          3.5TB         80%     2.0TB      1.5TB                    No                   No
/sacquin_team                                10.0TB         80%    55.3GB      9.9TB                    No                   No
/baaden_team                                 20.0TB         80%    11.5TB      8.5TB                    No                   No
/derreumaux_team                              5.0TB         80%     2.2TB      2.8TB                    No                   No
/sterpone_team                               14.0TB         80%     9.3TB      4.7TB                    No                   No
/admin_team                                   1.0TB         80%    15.8GB   1008.2GB                    No                   No
# for path in $(gluster volume quota vol_home list | awk 'NR>2 {print $1}'); do pdsh -w storage[1,3] du -sh /export/brick_home/brick{1,2}/data$path; done
storage1: 219G  /export/brick_home/brick1/data/simlab_team
storage3: 334G  /export/brick_home/brick1/data/simlab_team
storage1: 307G  /export/brick_home/brick2/data/simlab_team
storage3: 327G  /export/brick_home/brick2/data/simlab_team
storage1: 1,2T  /export/brick_home/brick1/data/amyloid_team
storage3: 1,2T  /export/brick_home/brick1/data/amyloid_team
storage1: 1,2T  /export/brick_home/brick2/data/amyloid_team
storage3: 1,2T  /export/brick_home/brick2/data/amyloid_team
storage1: 505G  /export/brick_home/brick1/data/amyloid_team/nguyen
storage1: 483G  /export/brick_home/brick2/data/amyloid_team/nguyen
storage3: 508G  /export/brick_home/brick1/data/amyloid_team/nguyen
storage3: 503G  /export/brick_home/brick2/data/amyloid_team/nguyen
storage3: 16G   /export/brick_home/brick1/data/sacquin_team
storage1: 14G   /export/brick_home/brick1/data/sacquin_team
storage3: 13G   /export/brick_home/brick2/data/sacquin_team
storage1: 13G   /export/brick_home/brick2/data/sacquin_team
storage1: 3,2T  /export/brick_home/brick1/data/baaden_team
storage1: 2,8T  /export/brick_home/brick2/data/baaden_team
storage3: 2,9T  /export/brick_home/brick1/data/baaden_team
storage3: 2,7T  /export/brick_home/brick2/data/baaden_team
storage3: 588G  /export/brick_home/brick1/data/derreumaux_team
storage1: 566G  /export/brick_home/brick1/data/derreumaux_team
storage1: 563G  /export/brick_home/brick2/data/derreumaux_team
storage3: 610G  /export/brick_home/brick2/data/derreumaux_team
storage3: 2,5T  /export/brick_home/brick1/data/sterpone_team
storage1: 2,7T  /export/brick_home/brick1/data/sterpone_team
storage3: 2,4T  /export/brick_home/brick2/data/sterpone_team
storage1: 2,4T  /export/brick_home/brick2/data/sterpone_team
storage3: 519M  /export/brick_home/brick1/data/admin_team
storage1: 11G   /export/brick_home/brick1/data/admin_team
storage3: 974M  /export/brick_home/brick2/data/admin_team
storage1: 4,0G  /export/brick_home/brick2/data/admin_team

In short:
simlab_team: ~1.2TB
amyloid_team: ~4.8TB
amyloid_team/nguyen: ~2TB
sacquin_team: ~56GB
baaden_team: ~11.6TB
derreumaux_team: 2.3TB
sterpone_team: ~10TB
admin_team: ~16.5GB

There is still some difference, but overall it is roughly correct (except for 
the quota defined on sterpone_team).

But I also noticed something strange. Here are the results of every « du » I 
ran to force the « recompute » of the quota size (on the GlusterFS mount point):
# du -sh /home/simlab_team/
1,2T    /home/simlab_team/
# du -sh /home/amyloid_team/
4,7T    /home/amyloid_team/
# du -sh /home/sacquin_team/
56G     /home/sacquin_team/
# du -sh /home/baaden_team/
12T     /home/baaden_team/
# du -sh /home/derreumaux_team/
2,3T    /home/derreumaux_team/
# du -sh /home/sterpone_team/
9,9T    /home/sterpone_team/

As you can see above, I don't understand why the quota size computed by the quota 
daemon differs from a du, especially for /sterpone_team.
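
(For what it is worth, a rough way to put the two numbers side by side, assuming the 
volume is mounted on /home and reusing the paths from the listing above; the $4 column 
index assumes the output layout shown there:)

# vol=vol_home; mnt=/home
# for d in simlab_team amyloid_team amyloid_team/nguyen sacquin_team baaden_team derreumaux_team sterpone_team admin_team; do
    used=$(gluster volume quota $vol list /$d | awk '$1 ~ /^\// {print $4}')   # "Used" column of the data row
    real=$(du -sh $mnt/$d | awk '{print $1}')
    printf '%-22s quota=%-8s du=%s\n' "/$d" "$used" "$real"
  done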

Now, concerning all the hangs I ran into: can you tell me the brand of your 
InfiniBand interconnect? On my side we use QLogic; maybe the problem originates 
here (Intel/QLogic and Mellanox are quite different).


Concerning the brick logs, I just noticed I have a lot of errors in one of my 
brick logs and the file is around 5GB. Here is an extract:
# tail 

Re: [Gluster-users] Locking failed - since upgrade to 3.6.4

2015-08-03 Thread Osborne, Paul (paul.osbo...@canterbury.ac.uk)
Hi,


 [2015-08-03 14:51:57.791081] E [glusterd-utils.c:148:glusterd_lock] 
 0-management: Unable to get lock for uuid: 
 76e4398c-e00a-4f3b-9206-4f885c4e5206, lock held by: 
 76e4398c-e00a-4f3b-9206-4f885c4e5206



 This indicates that the cluster is still operating at an older op-version. You would 
 need to bump up the op-version to 30604 using:

 gluster volume set all cluster.op-version 30604


Hmm, it would be helpful if that were in the upgrade documentation in a 
location that is obvious.
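
(For reference, and not something mentioned in the thread: the op-version a cluster is 
currently running can usually be read from glusterd's state file, which prints a line 
like operating-version=NNNNN. The path below is the usual one but may differ per 
distribution.)

# grep operating-version /var/lib/glusterd/glusterd.info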


Anyhow:


# gluster volume set all cluster.op-version 30604
volume set: failed: Required op_version (30604) is not supported


Not so good.


dpkg --list | grep glus
ii  glusterfs-client   3.6.4-1   amd64   clustered file-system (client package)
ii  glusterfs-common   3.6.4-1   amd64   GlusterFS common libraries and translator modules
ii  glusterfs-server   3.6.4-1   amd64   clustered file-system (server package)


So I tried on the basis of 
http://www.gluster.org/pipermail/gluster-users/2014-November/019666.html:


gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30600
volume set: success
gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30601
volume set: success
gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30602
volume set: success
gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30603
volume set: success
gfse-rh-01:/var/log/glusterfs# gluster volume set all cluster.op-version 30604
volume set: failed: Required op_version (30604) is not supported
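
(That stepwise bump can also be scripted. A rough sketch, assuming the CLI exits 
non-zero when a set fails so the loop stops at the first unsupported version:)

# for v in 30600 30601 30602 30603 30604; do
    gluster volume set all cluster.op-version $v || break
  done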



Which I guess is closer to where I want to be...

Will see if that does what I need - even if not quite right...

Thanks

Paul

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Locking failed - since upgrade to 3.6.4

2015-08-03 Thread Atin Mukherjee
Could you check the glusterd log on the other nodes? That would give you
a hint of the exact issue. Also, looking at .cmd_log_history will give you
the time interval at which volume status commands are executed. If the gap
is in milliseconds then you are bound to hit this, and it is expected.
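
(A quick way to eyeball that interval, assuming the file sits under /var/log/glusterfs 
on each node; adjust the path if your build keeps it elsewhere. It prints the timestamps 
of the most recent volume status invocations:)

# grep 'volume status' /var/log/glusterfs/.cmd_log_history | tail -20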

-Atin
Sent from one plus one
On Aug 3, 2015 7:32 PM, Osborne, Paul (paul.osbo...@canterbury.ac.uk) 
paul.osbo...@canterbury.ac.uk wrote:


 Hi,

 Last week I upgraded one of my gluster clusters (3 hosts with bricks as
 replica 3) to 3.6.4 from 3.5.4 and all seemed well.

 Today I am getting reports that locking has failed:


 gfse-cant-01:/var/log/glusterfs# gluster volume status
 Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log file
 for details.
 Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log
 file for details.

 Logs:
 [2015-08-03 13:45:29.974560] E [glusterd-syncop.c:1640:gd_sync_task_begin] 0-management: Locking Peers Failed.
 [2015-08-03 13:49:48.273159] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log file for details.
 [2015-08-03 13:49:48.273778] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log file for details.


 I am wondering if this is a new feature due to 3.6.4 or something that has
 gone wrong.

 Restarting gluster entirely (btw the restart script does not actually
 appear to kill the processes...) resolves the issue but then it repeats a
 few minutes later which is rather suboptimal for a running service.

 Googling suggests that there may be simultaneous actions going on that can
 cause a locking issue.

 I know that I have nagios running 'volume status <volname>' for each of my
 volumes on each host every few minutes; however, this is not new and has been
 in place for the last 8-9 months against 3.5 without issue, so I would
 hope that this is not causing the issue.
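
 (One way to take the monitoring out of the equation, not something suggested in this
 thread: serialise the checks on each host with flock so two status commands can never
 overlap. A sketch, with an arbitrary lock-file path and 'volname' as a placeholder:)

 # flock -w 30 /var/lock/nagios-gluster-status.lock gluster volume status volname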

 I am not sure where to look now tbh.




 Paul Osborne
 Senior Systems Engineer
 Canterbury Christ Church University
 Tel: 01227 782751
 ___
 Gluster-users mailing list
 Gluster-users@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-users

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Gluster 3.7 for Ubuntu-12.04

2015-08-03 Thread Prasun Gera
Does the 3.7 client work for Precise?

On Mon, Aug 3, 2015 at 6:14 PM, Kaleb Keithley kkeit...@redhat.com wrote:



  From: John S bun...@gmail.com
 
  Hi All,
 
  Is there any Gluster 3.7 version available for Ubuntu 12.04? Currently we
  have all production/test servers running on Ubuntu 12.04.
 
  I saw glusterfs-3.7 for Ubuntu 14.04, 14.10 and 15. Please help.
 

 3.7.x does not build on 12.04 (Precise). Some of the dependencies
 apparently either don't exist or are too old.

 As you noted, there are 3.7.x packages for 14.04 LTS (Trusty). If you
 absolutely need 3.7 then you'll need to update your servers.

 --

 Kaleb


 ___
 Gluster-users mailing list
 Gluster-users@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-users

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Locking failed - since upgrade to 3.6.4

2015-08-03 Thread Osborne, Paul (paul.osbo...@canterbury.ac.uk)
Hi,


OK I have tracked through the logs which of the hosts apparently has a lock 
open:


[2015-08-03 14:55:37.602717] I 
[glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management: 
Received status volume req for volume blogs

[2015-08-03 14:51:57.791081] E [glusterd-utils.c:148:glusterd_lock] 
0-management: Unable to get lock for uuid: 
76e4398c-e00a-4f3b-9206-4f885c4e5206, lock held by: 
76e4398c-e00a-4f3b-9206-4f885c4e5206


I have identified the UUID for each peer via gluster peer status and worked 
backwards.

I see that gluster volume clear-locks may clear the locks on the volume - but it 
is not clear from the logs what the path is that holds the lock, or what kind of 
lock it is.
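
(For reference, clear-locks expects the path and the lock kind/type spelled out 
explicitly. A sketch only, using the 'blogs' volume from the log above and the volume 
root as the path; worth checking against 'gluster volume help' and trying on something 
non-critical first, since it releases locks held by live clients:)

# gluster volume clear-locks blogs / kind granted inode
# gluster volume clear-locks blogs / kind granted entry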

Incidentally, my clients (using NFS) still appear, through manual testing, to be 
able to read/write to the volume - it is the volume status and heal checks that 
are failing. All of my clients and servers have been sequentially rebooted in 
the hope that this would clear the issue - however that does not appear to be 
the case.



Thanks

Paul




Paul Osborne
Senior Systems Engineer
Canterbury Christ Church University
Tel: 01227 782751



From: Atin Mukherjee atin.mukherje...@gmail.com
Sent: 03 August 2015 15:22
To: Osborne, Paul (paul.osbo...@canterbury.ac.uk)
Cc: gluster-users@gluster.org
Subject: Re: [Gluster-users] Locking failed - since upgrade to 3.6.4


Could you check the glusterd log at the other nodes, that would give you the 
hint of the exact issue. Also looking at .cmd_log_history will give you the 
time interval at which volume status commands are executed. If the gap is in 
milisecs then you are bound to hit it and its expected.

-Atin
Sent from one plus one

On Aug 3, 2015 7:32 PM, Osborne, Paul (paul.osbo...@canterbury.ac.uk) 
paul.osbo...@canterbury.ac.uk wrote:

Hi,

Last week I upgraded one of my gluster clusters (3 hosts with bricks as replica 
3) to 3.6.4 from 3.5.4 and all seemed well.

Today I am getting reports that locking has failed:


gfse-cant-01:/var/log/glusterfs# gluster volume status
Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log file for details.
Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log file for details.

Logs:
[2015-08-03 13:45:29.974560] E [glusterd-syncop.c:1640:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2015-08-03 13:49:48.273159] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log file for details.
[2015-08-03 13:49:48.273778] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log file for details.


I am wondering if this is a new feature due to 3.6.4 or something that has gone 
wrong.

Restarting gluster entirely (btw the restart script does not actually appear to 
kill the processes...) resolves the issue but then it repeats a few minutes 
later which is rather suboptimal for a running service.

Googling suggests that there may be simultaneous actions going on that can 
cause a locking issue.

I know that I have nagios running volume status volname for each of my 
volumes on each host every few minutes however this is not new and has been in 
place for the last 8-9 months that against 3.5 without issue so would hope that 
this is not causing the issue.

I am not sure where to look now tbh.




Paul Osborne
Senior Systems Engineer
Canterbury Christ Church University
Tel: 01227 782751
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Locking failed - since upgrade to 3.6.4

2015-08-03 Thread Atin Mukherjee
-Atin
Sent from one plus one
On Aug 3, 2015 8:31 PM, Osborne, Paul (paul.osbo...@canterbury.ac.uk) 
paul.osbo...@canterbury.ac.uk wrote:

 Hi,


 OK I have tracked through the logs which of the hosts apparently has a
lock open:


 [2015-08-03 14:55:37.602717] I
[glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management:
Received status volume req for volume blogs

 [2015-08-03 14:51:57.791081] E [glusterd-utils.c:148:glusterd_lock]
0-management: Unable to get lock for uuid:
76e4398c-e00a-4f3b-9206-4f885c4e5206, lock held by:
76e4398c-e00a-4f3b-9206-4f885c4e5206

This indicates that the cluster is still operating at an older op-version. You
would need to bump up the op-version to 30604 using:

gluster volume set all cluster.op-version 30604

 I have identified the UID for each peer via gluster peer status and
working backwards.

 I see that gluster volume clear-locks may the locks on the volume - but
is not clear from the logs is what the path is that has the lock or the
kind that is locked.

 Incidentally my clients (using NFS) through manual testing appear to
still be able to read/write to the volume - it is the volume status and
heal checks that are failing. All of my clients and servers have been
sequentially rebooted in the hope that this would clear any issue - however
that doe not appear to be the case.



 Thanks

 Paul




 Paul Osborne
 Senior Systems Engineer
 Canterbury Christ Church University
 Tel: 01227 782751


 
 From: Atin Mukherjee atin.mukherje...@gmail.com
 Sent: 03 August 2015 15:22
 To: Osborne, Paul (paul.osbo...@canterbury.ac.uk)
 Cc: gluster-users@gluster.org
 Subject: Re: [Gluster-users] Locking failed - since upgrade to 3.6.4


 Could you check the glusterd log at the other nodes, that would give you
the hint of the exact issue. Also looking at .cmd_log_history will give you
the time interval at which volume status commands are executed. If the gap
is in milisecs then you are bound to hit it and its expected.

 -Atin
 Sent from one plus one

 On Aug 3, 2015 7:32 PM, Osborne, Paul (paul.osbo...@canterbury.ac.uk) 
paul.osbo...@canterbury.ac.uk wrote:


 Hi,

 Last week I upgraded one of my gluster clusters (3 hosts with bricks as
replica 3) to 3.6.4 from 3.5.4 and all seemed well.

 Today I am getting reports that locking has failed:


 gfse-cant-01:/var/log/glusterfs# gluster volume status
 Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please check log
file for details.
 Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please check log
file for details.

 Logs:
 [2015-08-03 13:45:29.974560] E
[glusterd-syncop.c:1640:gd_sync_task_begin] 0-management: Locking Peers
Failed.
 [2015-08-03 13:49:48.273159] E [glusterd-syncop.c:105:gd_collate_errors]
0-: Locking failed on gfse-rh-01.core.canterbury.ac.uk. Please ch
 eck log file for details.
 [2015-08-03 13:49:48.273778] E [glusterd-syncop.c:105:gd_collate_errors]
0-: Locking failed on gfse-isr-01.core.canterbury.ac.uk. Please c
 heck log file for details.


 I am wondering if this is a new feature due to 3.6.4 or something that
has gone wrong.

 Restarting gluster entirely (btw the restart script does not actually
appear to kill the processes...) resolves the issue but then it repeats a
few minutes later which is rather suboptimal for a running service.

 Googling suggests that there may be simultaneous actions going on that
can cause a locking issue.

 I know that I have nagios running volume status volname for each of my
volumes on each host every few minutes however this is not new and has been
in place for the last 8-9 months that against 3.5 without issue so would
hope that this is not causing the issue.

 I am not sure where to look now tbh.




 Paul Osborne
 Senior Systems Engineer
 Canterbury Christ Church University
 Tel: 01227 782751
 ___
 Gluster-users mailing list
 Gluster-users@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-users
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] File out of sync

2015-08-03 Thread Mathieu Chateau
The date is not the same, but is the content different?
You may have disabled the mtime attribute to get better performance?
What are these two GFIDs?
You can use this script to find out what they are:
https://gist.github.com/semiosis/4392640
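
(If you would rather resolve them by hand: on a brick, every GFID has a hard link under 
.glusterfs/<first two hex chars>/<next two>/<gfid>, so something along these lines, run 
on web1 with the /data brick path taken from the heal info below, should point at the 
real file. This works for regular files; a directory shows up as a symlink instead:)

# g=d5f6d18f-082f-40e9-bdd1-b7d7eee0ad6d
# find /data -samefile "/data/.glusterfs/${g:0:2}/${g:2:2}/$g" -not -path '*/.glusterfs/*'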


Cordialement,
Mathieu CHATEAU
http://www.lotp.fr

2015-08-03 12:30 GMT+02:00 shacky shack...@gmail.com:

 2015-08-03 12:16 GMT+02:00 Mathieu Chateau mathieu.chat...@lotp.fr:

  You should start a heal
  gluster volume heal xxx
  or even a full one if not enough
  gluster volume heal xxx full

 Thanks, I tried both but this did not solve the problem:

 root@web1:~# ls -l /mnt/data/web/system/web/config.inc.php
 -rw-r--r-- 1 1010 1010 4041 Jul 31 17:42
 /mnt/data/web/system/web/config.inc.php

 root@web2:~# ls -l /mnt/data/web/system/web/config.inc.php
 -rw-r--r-- 1 1010 1010 4041 Jul 24 09:20
 /mnt/data/web/system/web/config.inc.php

 root@web3:~# ls -l /mnt/data/web/system/web/config.inc.php
 -rw-r--r-- 1 1010 1010 4041 Jul 24 09:20
 /mnt/data/web/system/web/config.inc.php

 This is the heal info:

 root@web1:~# gluster volume heal data info
 Brick web1:/data/
 gfid:d5f6d18f-082f-40e9-bdd1-b7d7eee0ad6d
 gfid:376cc8a1-8592-4e3f-a47a-5d36fa62f6bc
 Number of entries: 2

 Brick web2:/data/
 Number of entries: 0

 Brick web3:/data/
 Number of entries: 0

  I guess clients wrote some files while node was down or rebooting?

 I don't think this happened, because when I updated that file all
  nodes were running.

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] Very slow ls

2015-08-03 Thread Florian Oppermann
Dear Gluster users,

after setting up a distributed replicated volume (3x2 bricks) on gluster
3.6.4 on Ubuntu systems and populating it with some data (about 150 GB
in 20k files) I experience extreme delay when navigating through
directories or trying to ls the contents (actually the process seems to
hang completely now until I kill the /usr/sbin/glusterfs process on the
mounting machine).

Is there some common misconfiguration or any performance tuning option
that I could try?

I mount via automount with fstype=glusterfs option (using the native
fuse mount).
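
(Nothing in this thread pins down the cause, but for slow directory listings on a 
distributed-replicated volume the knobs people usually experiment with look like the 
lines below. Treat them as a sketch: option availability and safe values vary across 
3.6.x releases, and 'myvol' is a placeholder for the real volume name.)

# gluster volume set myvol cluster.readdir-optimize on      # fewer readdir round-trips on DHT
# gluster volume set myvol performance.readdir-ahead on     # prefetch directory entries
# gluster volume set myvol performance.stat-prefetch on     # cache stat/metadata on the client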

Any tips?

Best regards,
Florian Oppermann
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users