Re: [Gluster-users] fractured/split glusterfs - 2 up, 2 down for an hour - SOLVED
All these problems disappeared with a client unmount/remount of the gluster filesystem. A remarkably simple fix for such a bizarre set of symptoms. We'll see how durable the fix is, but all cluster nodes that have had it applied are now behaving normally (AFAICT), and the 2 that have not (due to long-running jobs writing to new files on existing dirs) still have the otherwise odd behavior, previously described in excruciating detail. Maybe this should be added to the HOWTO/DOTHISINCASEOFEMERGENCY doc.

hjm

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
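For anyone hitting the same symptoms, the fix above boils down to something like the following - a minimal sketch, assuming the volume 'gl' from server bs1 is FUSE-mounted at /gl (both names inferred from the logs in this thread):

# Unmount the gluster FUSE client; fall back to a lazy unmount if files are open:
umount /gl || umount -l /gl
# Remount (or just 'mount /gl' if the volume is in /etc/fstab):
mount -t glusterfs bs1:/gl /gl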
Re: [Gluster-users] fractured/split glusterfs - 2 up, 2 down for an hour
On Saturday, January 04, 2014 10:45:29 PM Vijay Bellur wrote:
> rdma.so seems to be missing here. Is the glusterfs-rdma-3.4.2-1 rpm installed on the servers?

It's not. The original gluster (3.2, I think) was set up with RDMA and IP transport, but RDMA was never instantiated and it's been working fine without it (except for the zillions of repeating errors).

hjm

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Re: [Gluster-users] fractured/split glusterfs - 2 up, 2 down for an hour
Also some other anomalies. Even when the files are visible and readable, many dirs are unwritable and/or undeletable. For example:

Sat Jan 04 18:36:17 [0.02 0.08 0.12] root@hpc-s:/bio/mmacchie
1104 $ mkdir hjmtest
mkdir: cannot create directory `hjmtest': Invalid argument
Sat Jan 04 18:36:23 [0.02 0.08 0.12] root@hpc-s:/bio/mmacchie

The client log says this for that operation (note offset times - UTC vs local): http://pastie.org/8602365

And in many subdirs, other dirs can be made, but not deleted:

Sat Jan 04 18:41:45 [0.00 0.04 0.09] root@hpc-s:/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered
1109 $ mkdir j1
Sat Jan 04 18:42:00 [0.00 0.03 0.09] root@hpc-s:/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered
1110 $ rmdir j1
rmdir: failed to remove `j1': Transport endpoint is not connected
Sat Jan 04 18:42:09 [0.08 0.05 0.09] root@hpc-s:/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered

With the client log saying:

[2014-01-05 02:42:09.548263] W [client-rpc-fops.c:526:client3_3_stat_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected
[2014-01-05 02:42:09.549314] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1 (aebbf21f-37fe-4edc-be8a-0f57b057b516)
[2014-01-05 02:42:09.550124] W [client-rpc-fops.c:2541:client3_3_opendir_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1 (aebbf21f-37fe-4edc-be8a-0f57b057b516)
[2014-01-05 02:42:09.552439] W [fuse-bridge.c:1193:fuse_unlink_cbk] 0-glusterfs-fuse: 5805445: RMDIR() /bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1 = -1 (Transport endpoint is not connected)
[2014-01-05 02:42:12.175860] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
[2014-01-05 02:42:15.181365] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
[2014-01-05 02:42:18.186668] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)

This is odd - how can a dir be created OK but then the fs lose track of it when asked to delete it? And that dir (j1) can have /files/ created and deleted inside of it, but not other /dirs/ (same result as for the parent dir).

In looking thru the client log, I see instances of this:

[2014-01-05 02:27:20.721043] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/mmacchie/Nematodes (----)
[2014-01-05 02:27:20.769058] I [dht-layout.c:630:dht_layout_normalize] 0-gl-dht: found anomalies in /bio/mmacchie/Nematodes.
holes=2 overlaps=0
[2014-01-05 02:27:20.769090] W [dht-selfheal.c:900:dht_selfheal_directory] 0-gl-dht: 1 subvolumes down -- not fixing
[2014-01-05 02:27:20.784335] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2:

more at: http://pastie.org/8602381

alarming since it says:

[2014-01-05 02:27:20.769090] W [dht-selfheal.c:900:dht_selfheal_directory] 0-gl-dht: 1 subvolumes down -- not fixing

All my servers and bricks appear to be up and online:

Sat Jan 04 18:54:09 [0.76 0.30 0.20] root@biostor1:~
1003 $ gluster volume status gl detail | egrep 'Brick|Online'
Brick: Brick bs2:/raid1
Online : Y
Brick: Brick bs2:/raid2
Online : Y
Brick: Brick bs3:/raid1
Online : Y
Brick: Brick bs3:/raid2
Online : Y
Brick: Brick bs4:/raid1
Online : Y
Brick: Brick bs4:/raid2
Online : Y
Brick: Brick bs1:/raid1
Online : Y
Brick: Brick bs1:/raid2
Online : Y

The gluster server logs seem to be fairly quiet thru this. The following contains the logs for the last day or so from the 4 servers, reduced by this command to eliminate the 'socket.c:2788' errors:

grep -v socket.c:2788 /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

http://pastie.org/8602412

hjm

On Saturday, January 04, 2014 10:45:29 PM Vijay Bellur wrote:
> On 01/04/2014 07:21 AM, harry mangalam wrote:
> > This is a distributed-only glusterfs on 4 servers with 2 bricks each on an IPoIB network. Thanks to a misconfigured autoupdate script, when 3.4.2 was released today, my gluster servers tried to update themselves. 2 succeeded but then failed to restart; the other 2 failed to update and kept running. Not realizing the sequence of events, I restarted the 2 that failed to restart, which gave
[Gluster-users] fractured/split glusterfs - 2 up, 2 down for an hour
This is a distributed-only glusterfs on 4 servers with 2 bricks each on an IPoIB network. Thanks to a misconfigured autoupdate script, when 3.4.2 was released today, my gluster servers tried to update themselves. 2 succeeded but then failed to restart; the other 2 failed to update and kept running. Not realizing the sequence of events, I restarted the 2 that failed to restart, which gave my fs 2 servers running 3.4.1 and 2 running 3.4.2. When I realized this after about 30m, I shut everything down, updated the 2 remaining to 3.4.2, and then restarted, but now I'm getting lots of reports of file errors of the type 'endpoints not connected' and the like:

[2014-01-04 01:31:18.593547] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh (----)
[2014-01-04 01:31:18.594928] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh (----)
[2014-01-04 01:31:18.595818] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/.#test_cuffdiff.sh (14c3b612-e952-4aec-ae18-7f3dbb422dcc)
[2014-01-04 01:31:18.597381] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh (----)
[2014-01-04 01:31:18.598212] W [client-rpc-fops.c:814:client3_3_statfs_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected
[2014-01-04 01:31:18.598236] W [dht-diskusage.c:45:dht_du_info_cbk] 0-gl-dht: failed to get disk info from gl-client-2
[2014-01-04 01:31:19.912210] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
[2014-01-04 01:31:22.912717] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
[2014-01-04 01:31:25.913208] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)

The servers at the same time provided the following error 'E' messages:

Fri Jan 03 17:46:42 [0.20 0.12 0.13] root@biostor1:~
1008 $ grep ' E ' /var/log/glusterfs/bricks/raid1.log | grep '2014-01-03'
[2014-01-03 06:11:36.251786] E [server-helpers.c:751:server_alloc_frame] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103) [0x3161e090d3] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x245) [0x3161e08f85] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/server.so(server3_3_lookup+0xa0) [0x7fa60e577170]))) 0-server: invalid argument: conn
[2014-01-03 06:11:36.251813] E [rpcsvc.c:450:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2014-01-03 17:48:44.236127] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.1/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2014-01-03 19:15:26.643378] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory

The missing/misbehaving files /are/ accessible on the individual bricks, but not thru gluster. This is a distributed-only setup, not replicated, so it seems like 'gluster volume heal <volume>' is appropriate. Do the gluster wizards agree?
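For anyone cleaning up after a similar half-upgrade, a quick sanity check that all servers ended up on the same build - a sketch, assuming root ssh to the 4 server hostnames used elsewhere in these threads:

for h in bs1 bs2 bs3 bs4; do
    # each server should report the same glusterfs-server (and glusterfs-rdma) build
    echo -n "$h: "; ssh $h 'rpm -q glusterfs-server glusterfs-rdma'
done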
---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Re: [Gluster-users] gluster fails under heavy array job load
Bug 1043009 submitted.

On Thursday, December 12, 2013 11:46:03 PM Anand Avati wrote:
> Please provide the full client and server logs (in a bug report). The snippets give some hints, but are not very meaningful without the full context/history since mount time (they have after-the-fact symptoms, but not the part which shows the reason why the disconnects happened). Even before looking into the full logs, here are some quick observations:
>
> - write-behind-window-size = 1024MB seems *excessively* high. Please set this to 1MB (default) and check if the stability improves.
>
> - I see RDMA is enabled on the volume. Are you mounting clients through RDMA? If so, for the purpose of diagnostics, can you mount through TCP and check if the stability improves? If you are using RDMA with such a high write-behind-window-size, spurious ping-timeouts are an almost certainty during heavy writes. The RDMA driver has limited flow control, and setting such a high window-size can easily congest all the RDMA buffers, resulting in spurious ping-timeouts and disconnections.
>
> Avati
>
> On Thu, Dec 12, 2013 at 5:03 PM, harry mangalam <harry.manga...@uci.edu> wrote:
> > Hi All, (Gluster Volume Details at bottom)
> >
> > I've posted some of this previously, but even after various upgrades, attempted fixes, etc, it remains a problem.
> >
> > Short version: Our gluster fs (~340TB) provides scratch space for a ~5000-core academic compute cluster. Much of our load is streaming IO, doing a lot of genomics work, and that is the load under which we saw this latest failure. Under heavy batch load, especially array jobs, where there might be several 64-core nodes doing I/O on the 4 servers/8 bricks, we often get job failures that have the following profile:
> >
> > Client POV: here is a sampling of the client logs (/var/log/glusterfs/gl.log) for all compute nodes that indicated interaction with the user's files: http://pastie.org/8548781
> >
> > Here are some client Info logs that seem fairly serious: http://pastie.org/8548785
> >
> > The errors that referenced this user were gathered from all the nodes that were running his code (in compute*) and agglomerated with:
> >
> > cut -f2,3 -d']' compute* | cut -f1 -dP | sort | uniq -c | sort -gr
> >
> > and placed here to show the profile of errors that his run generated: http://pastie.org/8548796
> >
> > So 71 of them were:
> >
> > W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-7: remote operation failed: Transport endpoint is not connected.
> >
> > etc. We've seen this before and previously discounted it bc it seemed to have been related to the problem of spurious NFS-related bugs, but now I'm wondering whether it's a real problem. Also the 'remote operation failed: Stale file handle.' warnings. There were no Errors logged per se, tho some of the W's looked fairly nasty, like the 'dht_layout_dir_mismatch'.
> >
> > From the server side, however, during the same period, there were:
> > 0 Warnings about this user's files
> > 0 Errors
> > 458 Info lines, of which only 1 was not a 'cleanup' line like this:
> >
> > 10.2.7.11:[2013-12-12 21:22:01.064289] I [server-helpers.c:460:do_fd_cleanup] 0-gl-server: fd cleanup on /path/to/file
> >
> > It was:
> >
> > 10.2.7.14:[2013-12-12 21:00:35.209015] I [server-rpc-fops.c:898:_gf_server_log_setxattr_failure] 0-gl-server: 113697332: SETXATTR /bio/tdlong/RNAseqIII/ckpt.1084030 (c9488341-c063-4175-8492-75e2e282f690) ==> trusted.glusterfs.dht
> >
> > We're losing about 10% of these kinds of array jobs bc of this, which is just not supportable.
> > Gluster details: servers and clients running gluster 3.4.0-8.el6 over QDR IB, IPoIB, thru 2 Mellanox and 1 Voltaire switches, Mellanox cards, CentOS 6.4
> >
> > $ gluster volume info
> > Volume Name: gl
> > Type: Distribute
> > Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
> > Status: Started
> > Number of Bricks: 8
> > Transport-type: tcp,rdma
> > Bricks:
> > Brick1: bs2:/raid1
> > Brick2: bs2:/raid2
> > Brick3: bs3:/raid1
> > Brick4: bs3:/raid2
> > Brick5: bs4:/raid1
> > Brick6: bs4:/raid2
> > Brick7: bs1:/raid1
> > Brick8: bs1:/raid2
> > Options Reconfigured:
> > performance.write-behind-window-size: 1024MB
> > performance.flush-behind: on
> > performance.cache-size: 268435456
> > nfs.disable: on
> > performance.io-cache: on
> > performance.quick-read: on
> > performance.io-thread-count: 64
> > auth.allow: 10.2.*.*,10.1.*.*
> >
> > 'gluster volume status gl detail': http://pastie.org/8548826
> >
> > ---
> > Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
> > Google Voice Multiplexer: (949) 478-4487
> > 415 South Circle View Dr, Irvine, CA, 92697 [shipping]
> > MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
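A sketch of Avati's first suggestion, using the volume name above (verify the exact option string against 'gluster volume set help' on your release):

# Return write-behind to its 1MB default, then watch whether the ping-timeouts stop:
gluster volume set gl performance.write-behind-window-size 1MB
# or drop the override entirely:
gluster volume reset gl performance.write-behind-window-size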
Re: [Gluster-users] gluster fails under heavy array job load
On Thursday, December 12, 2013 11:46:03 PM Anand Avati wrote:
> - I see RDMA is enabled on the volume. Are you mounting clients through RDMA? If so, for the purpose of diagnostics, can you mount through TCP and check if the stability improves? If you are using RDMA with such a high write-behind-window-size, spurious ping-timeouts are an almost certainty during heavy writes. The RDMA driver has limited flow control, and setting such a high window-size can easily congest all the RDMA buffers, resulting in spurious ping-timeouts and disconnections.

Is there a way to remove the RDMA transport option once it is enabled? I was under the impression that our system was NOT using RDMA, but from the logs I see the following, which implies that they /are/ using RDMA now:

== 10.2.7.11 ==
4: option transport-type socket,rdma
[2013-12-10 17:42:12.498076] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid1.rdma on port 49153
[2013-12-10 17:42:15.571287] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid2.rdma on port 49155

== 10.2.7.12 ==
4: option transport-type socket,rdma
[2013-12-10 17:42:17.974841] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid1.rdma on port 49153
[2013-12-10 17:42:21.266486] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid2.rdma on port 49155

== 10.2.7.13 ==
4: option transport-type socket,rdma
[2013-12-10 17:42:17.929753] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid1.rdma on port 49153
[2013-12-10 17:42:21.646482] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid2.rdma on port 49155

== 10.2.7.14 ==
4: option transport-type socket,rdma
[2013-12-10 17:42:15.791176] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid1.rdma on port 49153
[2013-12-10 17:42:15.941182] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid2.rdma on port 49155

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
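Whether 3.4 can strip the rdma transport from an existing tcp,rdma volume is a separate question, but the transport each client negotiates is selectable at mount time - a sketch, reusing the volume/mount names from this thread:

# Explicitly mount over TCP (sockets), even though the volume advertises tcp,rdma:
mount -t glusterfs -o transport=tcp bs1:/gl /gl
# Conversely, suffixing the volume name with .rdma requests the rdma transport:
# mount -t glusterfs bs1:/gl.rdma /gl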
Re: [Gluster-users] gluster fails under heavy array job load
Hi Alex,

Thanks for taking the time to think about this. I don't have metrics at hand, but I tend to think not, for a few reasons:

- When I have looked at stats from the network, it has never been close to saturating; the bottlenecks appear to be mostly on the gluster server side. I get emailed if my servers go above a load of 8 (the servers have 8 cores), and when that happens I often get complaints from users that they've had incomplete runs. At these points the network load is often fairly high (1GB/s, aggregate), but on a QDR network that shouldn't be saturating.

- The same jobs, when run using another distributed FS on the same IB fabric, show no such behavior, which would tend to point the fault at gluster or (granted) my configuration of it.

- While a lot of the IO load is large streaming RW, there is a subsection of jobs whose users insist on using Zillions of Tiny (ZOT) files as output - they use the file names for indices or as table row entries. (One user had 20M files in a tree.) We're trying to educate them, but it takes time and energy. Gluster seems to have a lot of trouble traversing these huge file fields, more so than DFSs that use metadata servers.

That said, it has been stable otherwise and there are a lot of things to recommend it.

hjm

On Friday, December 13, 2013 02:00:19 PM Alex Chekholko wrote:
> Hi Harry,
> My best guess is that you overloaded your interconnect. Do you have metrics for if/when your network was saturated? That would cause Gluster clients to time out. My best guess is that you went into the E state of your USE (Utilization, Saturation, Error) spectrum. IME, that is a common pattern for our Lustre/GPFS clients: you get all kinds of weird error states if you manage to saturate your I/O for an extended period of time and fill all of the buffers everywhere.
> Regards, Alex
>
> On 12/12/2013 05:03 PM, harry mangalam wrote:
> > Short version: Our gluster fs (~340TB) provides scratch space for a ~5000-core academic compute cluster. Much of our load is streaming IO, doing a lot of genomics work, and that is the load under which we saw this latest failure.

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
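A quick way to capture that network metric during a heavy run - a sketch with sysstat's sar, assuming the IPoIB interface is named ib0:

# Sample the interface every 5s during an array job; compare rxkB/s+txkB/s
# against the QDR IPoIB ceiling to see how close to saturation it actually gets:
sar -n DEV 5 | egrep 'IFACE|ib0'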
[Gluster-users] gluster fails under heavy array job load
Hi All, (Gluster Volume Details at bottom)

I've posted some of this previously, but even after various upgrades, attempted fixes, etc, it remains a problem.

Short version: Our gluster fs (~340TB) provides scratch space for a ~5000-core academic compute cluster. Much of our load is streaming IO, doing a lot of genomics work, and that is the load under which we saw this latest failure. Under heavy batch load, especially array jobs, where there might be several 64-core nodes doing I/O on the 4 servers/8 bricks, we often get job failures that have the following profile:

Client POV: here is a sampling of the client logs (/var/log/glusterfs/gl.log) for all compute nodes that indicated interaction with the user's files: http://pastie.org/8548781

Here are some client Info logs that seem fairly serious: http://pastie.org/8548785

The errors that referenced this user were gathered from all the nodes that were running his code (in compute*) and agglomerated with:

cut -f2,3 -d']' compute* | cut -f1 -dP | sort | uniq -c | sort -gr

and placed here to show the profile of errors that his run generated: http://pastie.org/8548796

So 71 of them were:

W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-7: remote operation failed: Transport endpoint is not connected.

etc. We've seen this before and previously discounted it bc it seemed to have been related to the problem of spurious NFS-related bugs, but now I'm wondering whether it's a real problem. Also the 'remote operation failed: Stale file handle.' warnings. There were no Errors logged per se, tho some of the W's looked fairly nasty, like the 'dht_layout_dir_mismatch'.

From the server side, however, during the same period, there were:
0 Warnings about this user's files
0 Errors
458 Info lines, of which only 1 was not a 'cleanup' line like this:

10.2.7.11:[2013-12-12 21:22:01.064289] I [server-helpers.c:460:do_fd_cleanup] 0-gl-server: fd cleanup on /path/to/file

It was:

10.2.7.14:[2013-12-12 21:00:35.209015] I [server-rpc-fops.c:898:_gf_server_log_setxattr_failure] 0-gl-server: 113697332: SETXATTR /bio/tdlong/RNAseqIII/ckpt.1084030 (c9488341-c063-4175-8492-75e2e282f690) ==> trusted.glusterfs.dht

We're losing about 10% of these kinds of array jobs bc of this, which is just not supportable.

Gluster details: servers and clients running gluster 3.4.0-8.el6 over QDR IB, IPoIB, thru 2 Mellanox and 1 Voltaire switches, Mellanox cards, CentOS 6.4

$ gluster volume info
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

'gluster volume status gl detail': http://pastie.org/8548826

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Re: [Gluster-users] Where does the 'date' string in '/var/log/glusterfs/gl.log' come from?
To confirm: Joe's explanation, uncomfortable as it is, looks to be the correct one. When the servers were powered off and restarted (so the gluster processes /had/ to be restarted), the new ones started up and used the correct time format. The 'problem' clients were the ones which were running the updated version; when all the clients were forced to restart the glusterfs, they all appear to be running with the UTC time (tho hard to tell, since the number of logged incidents has fallen dramatically).

There is a repeating entry in all the server logs tho:

1001 $ tail -3 /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
[2013-12-11 18:26:41.824180] E [socket.c:2788:socket_connect] 0-management: connection attempt failed (Connection refused)
[2013-12-11 18:26:44.825089] E [socket.c:2788:socket_connect] 0-management: connection attempt failed (Connection refused)
[2013-12-11 18:26:47.825952] E [socket.c:2788:socket_connect] 0-management: connection attempt failed (Connection refused)

Is there a way to detect which client(s) this is coming from?

On Tuesday, December 10, 2013 11:23:11 AM Joe Julian wrote:
> If I were to hazard a guess, since the timestamp is not configurable and *is* UTC in 3.4, it would seem that any server that's logging in local time must not be running 3.4. Sure, it's installed, but the application hasn't been restarted since it was installed. That's the only thing I can think of that would allow that behavior.

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
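One way to see what is on the other end of those refused connections - a sketch; note that socket_connect errors in glusterd's own log are glusterd dialing *out*, so strace may be more telling than a packet capture:

# Which address is glusterd trying (and failing) to connect to?
strace -f -e trace=connect -p $(pgrep -o glusterd) 2>&1 | grep ECONNREFUSED
# Or watch SYNs around the glusterd management port (24007) on the wire:
tcpdump -n -i any 'port 24007 and tcp[tcpflags] & tcp-syn != 0'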
Re: [Gluster-users] Where does the 'date' string in '/var/log/glusterfs/gl.log' come from?
On Tuesday, December 10, 2013 12:49:25 PM Sharuzzaman Ahmat Raslan wrote:
> Hi Harry,
> Did you setup ntp on each of the nodes, and sync the time to one single source?

Yes, this is done by ROCKS and all the nodes have identical time. (2 admins have checked repeatedly.)

> Thanks.
>
> On Tue, Dec 10, 2013 at 12:44 PM, harry mangalam <harry.manga...@uci.edu> wrote:
> > Admittedly I should search the source, but I wonder if anyone knows this offhand.
> >
> > Background: of our 84 ROCKS (6.1)-provisioned compute nodes, 4 have picked up an 'advanced date' in the /var/log/glusterfs/gl.log file - that date string is running about 5-6 hours ahead of the system date and all the Gluster servers (which are identical and correct). The time advancement does not appear to be identical, tho it's hard to tell, since it only shows on errors and those update irregularly. All the clients are the same version and all the servers are the same (gluster v 3.4.0-8.el6.x86_64).
> >
> > This would not be of interest except that those 4 clients are losing files, unable to reliably do IO, etc on the gluster fs. They don't appear to be having problems with NFS mounts, nor with a Fraunhofer FS that is also mounted on each node. Rebooting 2 of them has no effect - they come right back with an advanced date.

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Re: [Gluster-users] Where does the 'date' string in '/var/log/glusterfs/gl.log' come from?
On Tuesday, December 10, 2013 10:42:28 AM Vijay Bellur wrote:
> On 12/10/2013 10:14 AM, harry mangalam wrote:
> > Admittedly I should search the source, but I wonder if anyone knows this offhand.
> > Background: of our 84 ROCKS (6.1)-provisioned compute nodes, 4 have picked up an 'advanced date' in the /var/log/glusterfs/gl.log file - that date string is running about 5-6 hours ahead of the system date and all the Gluster servers (which are identical and correct). The time advancement does not appear to be identical, tho it's hard to tell, since it only shows on errors and those update irregularly.
>
> The timestamps in the log file are by default in UTC. That could possibly explain why the timestamps look advanced in the log file.

That seems to make sense. The advanced time on the 4 problem nodes looks to be the correct UTC time, but the others are using /local time/ in their logs, for some reason. And the localtime nodes are the ones NOT having problems. ...??!

However, this looks to be more of a ROCKS/config problem than a general gluster problem at this point. All the nodes have the md5-identical /etc/localtime, but they seem to be behaving differently as to the logging. Thanks for the pointer.

hjm

> > All the clients are the same version and all the servers are the same (gluster v 3.4.0-8.el6.x86_64).
> > This would not be of interest except that those 4 clients are losing files, unable to reliably do IO, etc on the gluster fs. They don't appear to be having problems with NFS mounts, nor with a Fraunhofer FS that is also mounted on each node.
>
> Do you observe anything in the client log files of these machines that indicates I/O problems?

Yes.

> Thanks,
> Vijay

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
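A trivial way to confirm that the 'advanced' timestamps really are just UTC, and that the tz config is uniform - run on an affected and an unaffected node and compare:

date          # local time
date -u       # UTC, what gluster 3.4 logs in; the gap should match the log offset
md5sum /etc/localtime    # should be identical across nodes, as noted above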
[Gluster-users] Where does the 'date' string in '/var/log/glusterfs/gl.log' come from?
Admittedly I should search the source, but I wonder if anyone knows this offhand.

Background: of our 84 ROCKS (6.1)-provisioned compute nodes, 4 have picked up an 'advanced date' in the /var/log/glusterfs/gl.log file - that date string is running about 5-6 hours ahead of the system date and all the Gluster servers (which are identical and correct). The time advancement does not appear to be identical, tho it's hard to tell, since it only shows on errors and those update irregularly. All the clients are the same version and all the servers are the same (gluster v 3.4.0-8.el6.x86_64).

This would not be of interest except that those 4 clients are losing files, unable to reliably do IO, etc on the gluster fs. They don't appear to be having problems with NFS mounts, nor with a Fraunhofer FS that is also mounted on each node. Rebooting 2 of them has no effect - they come right back with an advanced date.

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Re: [Gluster-users] Slow metadata
This is a widely perceived feature/bug of gluster. It also affects other distributed filesystems, tho generally not as much. We've done 2 things to address this.

One is a distributed 'du' that is clusterfork'ed out to the storage nodes and compiles the results. This is realtime and will provide data to that point. If you're interested in it, let me know and I can provide the code to do this. However, it requires clusterfork, some per-site config, and is specific to 'du', altho it could be modified to support other shell commands. Here's the difference in performance on a fairly busy gluster system (4 storage nodes, 8 volumes, 340TB, 60% used):

=
14:54:09 root@hpc-s:/som
1226 $ time du -sh abusch/*
694M    abusch/MATS-gtf
^C
real    3m58.098s   --- killed after ~4m
user    0m0.033s
sys     0m0.351s

14:58:24 root@hpc-s:/som
1227 $ gfdu abusch/\*
INFO: Corrected gluster starting path: [/som/abusch/*]
About to execute [/root/bin/cf --script --tar=GLSRV du -s /raid1/som/abusch/* ; du -s /raid2/som/abusch/*; ]
Go? [yN] y
INFO: For raw results [cd /root/cf/CF-du--s--raid1-som-abu-14.58.38_2013-10-08]
Size:        File|Dir
693.8203 M   /som/abusch/MATS-gtf
1.5292 G     /som/abusch/MISO-gffs
764.5117 M   /som/abusch/MISO-gffs-v2
23.8720 G    /som/abusch/deepSeq
25.2845 G    /som/abusch/genomes
5.4239 G     /som/abusch/index
16.8011 G    /som/abusch/index2
---
74.3348 G    Total

time was ~4s
=

The other approach is the RobinHood Policy Engine (http://sourceforge.net/apps/trac/robinhood), which runs on a cron and recurses thru your FS, taking X hours, but compiles that info into a MySQL DB that is instantly responsive (but could be slightly out of date). NTL, it's a very helpful tool to detect hotspots and ZOTfiles (Zillions Of Tiny files). We are using it to monitor NFS volumes, Gluster, and Fraunhofer FSs. It is a very slick system, and a student (Adam Brenner) is modifying it to generate better stats via the web interface. See his github and the robinhood trac:

https://github.com/abrenner/robinhood-multifs-web
http://sourceforge.net/apps/trac/robinhood

On Tuesday, October 08, 2013 09:07:52 AM Anders Salling Andersen wrote:
> Hi all, I have a 50tb glusterfs replicated setup, with many small files. My metadata is very slow; ex. 'du -sh' takes over 24 hours. Is there a way to make metadata faster?
> Regards, Anders.

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
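The same per-brick trick works without clusterfork - a rough sketch, assuming passwordless root ssh to the 4 storage nodes and the /raid1 and /raid2 brick layout from these threads; on a distribute-only volume the per-brick sizes simply sum:

for h in bs1 bs2 bs3 bs4; do
    # du each brick's copy of the tree directly, skipping missing paths
    ssh $h 'du -s /raid1/som/abusch/* /raid2/som/abusch/* 2>/dev/null'
done | awk '{p=$2; sub(/^\/raid[12]/,"",p); kb[p]+=$1}
            END {for (d in kb) printf "%10.4f G  %s\n", kb[d]/1048576, d}'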
Re: [Gluster-users] working infiniband HW ??
I concur - I used the same cards on our test cluster - just make sure you upgrade the firmware to the latest revision:
http://www.mellanox.com/content/pages.php?pg=custom_firmware_table

The fw upgrade is not trivial - well, it IS trivial, but bringing all the info together was not - if you want some rough notes, let me know.

the other harry

On Mon, Jul 8, 2013 at 5:23 AM, Justin Clift <jcl...@redhat.com> wrote:
> On 08/07/2013, at 10:20 AM, HL wrote:
> > I am currently testing glusterfs on a small-scale non-production env with ordinary nics. I would like to purchase a couple of infiniband nics in order to connect 3 servers in point-to-point mode, that is, with no switches in between. Since I've noticed that some of you have this kind of H/W, any info on brands/models and a good known-to-work setup will be highly appreciated.
>
> Depends on the kind of performance you're after. :) If you're just wanting something better than 10GbE at minimal cost, these work fine under Linux:
>
> http://www.ebay.co.uk/itm/360657396651  (~US$40 each, not counting postage)
>
> They're rebadged Mellanox MHGH28-XTC cards. Completely ok to be flashed with standard Mellanox firmware. For cables, I use these:
>
> http://www.ebay.co.uk/itm/251200441924  (US$14 each, not counting postage)
>
> Note, I'm comfortable getting stuff off eBay where the seller seems ok. So far, so good. ;)
>
> If you do decide to use the above stuff, and later on want a switch, try and find a Voltaire ISR 9024D-M. They're DDR infiniband (20Gb/s per port) and can run fanless (completely silent).
>
> http://www.ebay.co.uk/itm/350827346490
>
> They can be had pretty cheaply if you're willing to wait for good pricing. I got mine for ~US$270.
>
> Hope that helps. :)
>
> Regards and best wishes,
> Justin Clift
>
> --
> Open Source and Standards @ Red Hat
> twitter.com/realjustinclift

Regards,
Harry

--
Data backup by the NSA - your tax dollars at work.

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
[Gluster-users] Glusterfs 3.3 rapidly generating write errors under heavy load.
Online           : Y
Pid              : 2961
File System      : xfs
Device           : /dev/sdd
Mount Options    : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size       : 256
Disk Space Free  : 24.1TB
Total Disk Space : 40.9TB
Inode Count      : 8789028864
Free Inodes      : 8786101010
--
Brick: Brick bs1:/raid1
Port             : 24013
Online           : Y
Pid              : 3043
File System      : xfs
Device           : /dev/sdc
Mount Options    : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size       : 256
Disk Space Free  : 29.1TB
Total Disk Space : 43.7TB
Inode Count      : 9374964096
Free Inodes      : 9372036362
--
Brick: Brick bs1:/raid2
Port             : 24015
Online           : Y
Pid              : 3049
File System      : xfs
Device           : /dev/sdd
Mount Options    : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size       : 256
Disk Space Free  : 25.9TB
Total Disk Space : 40.9TB
Inode Count      : 8789028864
Free Inodes      : 8786101382

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
A Message From a Dying Veteran http://goo.gl/tTHdo
[Gluster-users] gluster 3.3 sporadic 'Permission denied' failure under load in cluster env
Sending this again, since I'm not even sure that the 1st made it to the list, and it's just happened again, even with the same user (one of the heaviest users, but I don't think there's anything odd about his usage). In the last 3 days, we've had 6 such errors, resulting in the logged error:

E [posix.c:1730:posix_create] 0-gl-posix: setting gfid on [file] failed

A question that could be answered is: has anyone had such errors show up in their brick logs? ie:

grep -n posix.c:1730:posix_create /var/log/glusterfs/bricks/raid[12].log

hjm

=== previously ===

We have a ~2500-core academic cluster with saturating amounts of use. The main data store is running on a 4-node/8-brick/340TB/QDR IB gluster 3.3 filesystem. All are 8xOpteron/32GB systems with 3ware 9750 SAS controllers. The servers all run SL6.2 and are stable, with load steady at about 2 continuously. gluster is config'ed as:

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

Many of our users run large array jobs under SGE, and especially during those runs where there is LOTS of IO, we will VERY occasionally (20 times since last June, according to brick logs) see these kinds of errors, resulting in the failure of that particular element of the array job. Sometimes these failures are acceptable, but often the next job depends on all elements of the array job completing correctly. At any rate, from the fs POV they should all complete. The rarity of this error, the type of error, and where it is located suggest that it might be a hash collision..? According to gluster bugzilla this doesn't seem to be a registered bug, so here I am asking if this has been seen by others and how it might be addressed.
=
The error below, being reported by Grid Engine, says:

user root 03/21/2013 15:29:23 [507:26777]: error: can't open output file /gl/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103: Permission denied
03/21/2013 15:29:23 [400:25458]: wait3
=

Looking thru all the server logs (/var/log/glusterfs/etc-glusterfs-glusterd.vol.log) reveals nothing about this error, but the brick logs yield this set of lines referencing that file at the correct time:

/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:18.667171] W [posix-handle.c:461:posix_handle_hard] 0-gl-posix: link /raid1/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 -> /raid1/.glusterfs/5a/0e/5a0e87a6-e35d-4368-841e-b45802fecc4e failed (File exists)
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:18.667249] E [posix.c:1730:posix_create] 0-gl-posix: setting gfid on /raid1/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 failed
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:19.241602] I [server3_1-fops.c:1538:server_open_cbk] 0-gl-server: 644765: OPEN /bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 (5a0e87a6-e35d-4368-841e-b45802fecc4e) ==> -1 (Permission denied)
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:19.520455] I [server3_1-fops.c:1538:server_open_cbk] 0-gl-server: 644970: OPEN /bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 (5a0e87a6-e35d-4368-841e-b45802fecc4e) ==> -1 (Permission denied)

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
[Gluster-users] gluster 3.3 sporadic 'Permission denied' failure under load in cluster env
We have a ~2500-core academic cluster with saturating amounts of use. The main data store is running on a 4-node/8-brick/340TB/QDR IB gluster 3.3 filesystem. All are 8xOpteron/32GB systems with 3ware 9750 SAS controllers. The servers all run SL6.2 and are stable, with load steady at about 2 continuously. gluster is config'ed as:

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

Many of our users run large array jobs under SGE, and especially during those runs where there is LOTS of IO, we will VERY occasionally (20 times since last June, according to brick logs) see these kinds of errors, resulting in the failure of that particular element of the array job. Sometimes these failures are acceptable, but often the next job depends on all elements of the array job completing correctly. At any rate, from the fs POV they should all complete. The rarity of this error, the type of error, and where it is located suggest that it might be a hash collision..? According to gluster bugzilla this doesn't seem to be a registered bug, so here I am asking if this has been seen by others and how it might be addressed.

=
The error below, being reported by Grid Engine, says:

user root 03/21/2013 15:29:23 [507:26777]: error: can't open output file /gl/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103: Permission denied
03/21/2013 15:29:23 [400:25458]: wait3
=

Looking thru all the server logs (/var/log/glusterfs/etc-glusterfs-glusterd.vol.log) reveals nothing about this error, but the brick logs yield this set of lines referencing that file at the correct time:

/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:18.667171] W [posix-handle.c:461:posix_handle_hard] 0-gl-posix: link /raid1/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 -> /raid1/.glusterfs/5a/0e/5a0e87a6-e35d-4368-841e-b45802fecc4e failed (File exists)
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:18.667249] E [posix.c:1730:posix_create] 0-gl-posix: setting gfid on /raid1/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 failed
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:19.241602] I [server3_1-fops.c:1538:server_open_cbk] 0-gl-server: 644765: OPEN /bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 (5a0e87a6-e35d-4368-841e-b45802fecc4e) ==> -1 (Permission denied)
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:19.520455] I [server3_1-fops.c:1538:server_open_cbk] 0-gl-server: 644970: OPEN /bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 (5a0e87a6-e35d-4368-841e-b45802fecc4e) ==> -1 (Permission denied)

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Re: [Gluster-users] Slow read performance
Have you run oprofile on the client and server simultaneously to see if there's some race condition developing? Obviously the NFS client is fine, so it's clear that there's nothing wrong with the hardware. oprofile will at least reveal where the bits are vacationing and may point to a specific bottleneck. See oprofile.sf.net for docs and examples (pretty good); it's fairly easy to set up to profile applications, a bit more trouble if you're trying to profile kernel interactions, but it looks like you might not have to.

I wouldn't want to forklift 160TB either. My sympathies.

hjm

On Thursday, March 07, 2013 09:27:42 PM Thomas Wakefield wrote:
> Inode size is 256. Pretty stuck with these settings and ext4. I missed the memo that Gluster started to prefer xfs; back in the 2.x days xfs was not the preferred filesystem.
> At this point it's a 340TB filesystem with 160TB used. I just added more space, and was doing some followup testing and wasn't impressed with the results. But I am sure I was happier before with the performance. Still running CentOS 5.8.
> Anything else I could look at?
> Thanks, Tom
...

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
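For the archives, the classic workflow looks roughly like this - a sketch for the legacy (pre-operf) oprofile shipped with CentOS 5/6; the profiled binary path is illustrative:

opcontrol --init                  # load the oprofile kernel module
opcontrol --no-vmlinux --start    # userspace-only profiling is enough here
# ... reproduce the slow reads on the client/server ...
opcontrol --dump && opcontrol --shutdown
opreport -l /usr/sbin/glusterfsd | head -30   # hottest glusterfsd symbols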
Re: [Gluster-users] GlusterFS performance
This kind of info is surprisingly hard to obtain. The gluster docs do contain some of it, ie:
http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/

I also found well-described kernel tuning parameters in the FHGFS wiki (as another distributed fs, they share some characteristics):
http://www.fhgfs.com/wiki/wikka.php?wakka=StorageServerTuning

and more XFS tuning filesystem params here:
http://www.mythtv.org/wiki/Optimizing_Performance#Further_Information
and here:
http://www.mysqlperformanceblog.com/2011/12/16/setting-up-xfs-the-simple-edition

But of course, YMMV, and a number of these parameters conflict and/or have serious tradeoffs, as you discovered.

LSI recently loaned me a Nytro SAS controller (on-card SSD-cached) which seems pretty phenomenal on a single brick (and is predicted to perform well based on their profiling), but I am waiting for another node to arrive before I can test it under true gluster conditions. Anyone else tried this hardware?

hjm

On Tuesday, March 05, 2013 12:34:41 PM Nikita A Kardashin wrote:
> Hello all!
> This problem was solved by me today. The root of it all is an incompatibility between the gluster cache and the kvm cache. The bug reproduces if a KVM virtual machine is created with cache=writethrough (default for OpenStack) and hosted on a GlusterFS volume. If any other cache mode (cache=writeback, or cache=none with direct-io) is used, the performance of writing to an existing file inside the VM is equal to bare-storage (from the host machine) write performance.
> I think this must be documented in Gluster, and maybe a bug filed.
> Other question: where can I read something about gluster tuning (optimal cache size, write-behind, flush-behind use cases and other)? I found only an options list, without any how-to or tested cases.
>
> 2013/3/5 Toby Corkindale <toby.corkind...@strategicdata.com.au>
> > On 01/03/13 21:12, Brian Candler wrote:
> > > On Fri, Mar 01, 2013 at 03:30:07PM +0600, Nikita A Kardashin wrote:
> > > > If I try to execute the above command inside a virtual machine (KVM), the first time all goes right - about 900MB/s (cache effect, I think), but if I run this test again on an existing file, the task (dd) hangs and can be stopped only by Ctrl+C. Overall virtual system latency is poor too. For example, apt-get upgrade upgrades the system very, very slowly, freezing on "Unpacking replacement" and other io-related steps. Does glusterfs have any tuning options that can help me?
> > >
> > > If you are finding that processes hang or freeze indefinitely, this is not a question of tuning; this is simply broken. Anyway, you're asking the wrong person - I'm currently in the process of stripping out glusterfs, although I remain interested in the project. I did find that KVM performed very poorly, but KVM was not my main application and that's not why I'm abandoning it. I'm stripping out glusterfs primarily because it's not supportable in my environment, because there is no documentation on how to analyse and recover from failure scenarios which can and do happen. This point in more detail:
> > > http://www.gluster.org/pipermail/gluster-users/2013-January/035118.html
> > >
> > > The other downside of gluster was its lack of flexibility, in particular the fact that there is no usage scaling factor on bricks, so that even with a simple distributed setup all your bricks have to be the same size. Also, the object store feature which I wanted to use has clearly had hardly any testing (even the RPM packages don't install properly).
> > > I *really* wanted to deploy gluster, because in principle I like the idea of a virtual distribution/replication system which sits on top of existing local filesystems. But for storage, I need something where operational supportability is at the top of the pile.
> >
> > I have to agree; GlusterFS has been in use here in production for a while, and while it mostly works, it's been fragile and documentation has been disappointing. Despite 3.3 being in beta for a year, it still seems to have been poorly tested. For eg, I can't believe almost no-one else noticed that the log files were busted... nor that the bug report has been around for a quarter of a year without being responded to or fixed.
> >
> > I have to ask -- what are you moving to now, Brian?
> > -Toby

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Something must be done. [X] is something. Therefore, we must do it.
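For reference, the cache mode Nikita describes is set per-disk at VM launch - a hedged sketch with a raw qemu-kvm invocation (the image path is illustrative; OpenStack/libvirt set the same knob via their disk config):

# writeback avoids the hang described above; writethrough was the problem case:
qemu-kvm -m 1024 -nographic \
    -drive file=/gl/vm/test.qcow2,if=virtio,cache=writeback
# cache=none needs direct-I/O support from the underlying FS, per the post above:
#   -drive file=/gl/vm/test.qcow2,if=virtio,cache=none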
Re: [Gluster-users] Peer Probe
It /might be/probably is/ DNS-related. Are you trying to do this with RDMA or IPoIB? If IPoIB, are ALL your /etc/hosts files in sync (IB names separate and distinct from the ethernet interfaces) and responsive on the appropriate interfaces? Do the IB interfaces show up as distinct (and connected) in an 'ifconfig -a' and 'ibstat' dump? Do all the peers show up in an 'ibhosts' query?

What is the output of:

gluster volume status <your_volume>
and
gluster volume status <your_volume> detail

hjm

On Monday, February 25, 2013 07:46:00 PM Tony Saenz wrote:
> It shows this, but it's still going through my NIC cards and not the Infiniband. (Checked the traffic on the cards themselves.)
>
> [root@fpsgluster ~]# gluster peer status
> Number of Peers: 1
>
> Hostname: fpsgluster2
> Uuid: 9b7e7c2d-f05b-4cc8-b55a-571e383328d0
> State: Peer in Cluster (Connected)
>
> On Feb 25, 2013, at 10:51 AM, Torbjørn Thorsen <torbj...@trollweb.no> wrote:
> > Your error message seems to indicate that the peer is already in the storage pool? What is the output of "gluster peer status"?
> >
> > On Mon, Feb 25, 2013 at 7:28 PM, Tony Saenz <t...@filmsolutions.com> wrote:
> > > Any help please? The regular NICs are fine, which is what it currently sees, but I'd like to move them over to the Infiniband cards.
> > >
> > > On Feb 22, 2013, at 1:50 PM, Anthony Saenz <t...@filmsolutions.com> wrote:
> > > > Hey, I was wondering if I could get a bit of help... I installed a new Infiniband card into my servers, but I'm unable to get it to come up as a peer. Is there something I'm missing?
> > > >
> > > > [root@fpsgluster testvault]# gluster peer probe fpsgluster2ib
> > > > Probe on host fpsgluster2ib port 0 already in peer list
> > > >
> > > > [root@fpsgluster testvault]# yum list installed | grep gluster
> > > > glusterfs.x86_64 3.3.1-1.el6 installed
> > > > glusterfs-devel.x86_64 3.3.1-1.el6 installed
> > > > glusterfs-fuse.x86_64 3.3.1-1.el6 installed
> > > > glusterfs-geo-replication.x86_64 3.3.1-1.el6 installed
> > > > glusterfs-rdma.x86_64 3.3.1-1.el6 installed
> > > > glusterfs-server.x86_64 3.3.1-1.el6 installed
> > > > Thanks.
> >
> > --
> > Best regards,
> > Torbjørn Thorsen
> > Developer / operations engineer
> > Trollweb Solutions AS - Professional Magento Partner
> > www.trollweb.no
> > Daytime phone: +47 51215300
> > Evening/weekend phone: for customers with a service agreement
> > Visiting address: Luramyrveien 40, 4313 Sandnes
> > Postal address: Maurholen 57, 4316 Sandnes
> > Note that all our standard terms always apply

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Something must be done. [X] is something. Therefore, we must do it.
Bruce Schneier, on American response to just about anything.
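If the pool really was first built by probing the ethernet name, one hedged way to move the peering over to the IB name - detach removes the peer from the pool, so do this only with volumes stopped and after checking the 3.3 docs:

gluster peer detach fpsgluster2     # drop the eth-named peer
gluster peer probe fpsgluster2ib    # re-probe via the IPoIB hostname
gluster peer status                 # should now list the ib name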
Re: [Gluster-users] Peer Probe
That looks OK (but your 2 MTUs are mismatched - you should fix that):

UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1     <- 1st
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1    <- 2nd

IBDEV=ibX
modprobe ib_umad
modprobe ib_ipoib
echo connected > /sys/class/net/${IBDEV}/mode
echo 65520 > /sys/class/net/${IBDEV}/mtu

How did you set up the peering? By name? By IP#? (I assume pinging by hostname also works both ways?) If you can't get the peers to ack, then what do the logs say on failure to:

gluster peer probe <host>

or to create the volume:

gluster volume create <volname> host1ib:/gl_part host2ib:/gl_part

hjm

On Monday, February 25, 2013 09:50:02 PM Tony Saenz wrote:
> Trying to first get this working with IPoIB.
>
> [root@fpsgluster ~]# ibhosts
> Ca : 0x0011757937b2 ports 1 "fpsgluster2 qib0"
> Ca : 0x001175792af2 ports 1 "fpsgluster qib0"
>
> I'm able to ping the other box from Infiniband to Infiniband card.
>
> Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes. Because an Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly. Ifconfig is obsolete! For replacement check ip.
> ib0  Link encap:InfiniBand HWaddr 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>      inet addr:10.0.4.35 Bcast:10.0.4.255 Mask:255.255.255.0
>      inet6 addr: fe80::211:7500:79:2af2/64 Scope:Link
>      UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
>      RX packets:1567 errors:0 dropped:0 overruns:0 frame:0
>      TX packets:587 errors:0 dropped:24 overruns:0 carrier:0
>      collisions:0 txqueuelen:256
>      RX bytes:342622 (334.5 KiB) TX bytes:96554 (94.2 KiB)
>
> [root@fpsgluster2 ~]# ifconfig ib0
> Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes. Because an Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly. Ifconfig is obsolete! For replacement check ip.
> ib0  Link encap:InfiniBand HWaddr 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>      inet addr:10.0.4.34 Bcast:10.0.4.255 Mask:255.255.255.0
>      inet6 addr: fe80::211:7500:79:37b2/64 Scope:Link
>      UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
>      RX packets:599 errors:0 dropped:0 overruns:0 frame:0
>      TX packets:1558 errors:0 dropped:8 overruns:0 carrier:0
>      collisions:0 txqueuelen:256
>      RX bytes:95180 (92.9 KiB) TX bytes:346728 (338.6 KiB)
>
> [root@fpsgluster ~]# ping -I ib0 10.0.4.34
> PING 10.0.4.34 (10.0.4.34) from 10.0.4.35 ib0: 56(84) bytes of data.
> 64 bytes from 10.0.4.34: icmp_seq=1 ttl=64 time=12.6 ms
> 64 bytes from 10.0.4.34: icmp_seq=2 ttl=64 time=0.184 ms
>
> /etc/hosts looks correct:
>
> [root@fpsgluster2 ~]# cat /etc/hosts | grep ib
> 10.0.4.35 fpsglusterib
> 10.0.4.34 fpsgluster2ib
> [root@fpsgluster ~]# cat /etc/hosts | grep ib
> 10.0.4.35 fpsglusterib
> 10.0.4.34 fpsgluster2ib
>
> I haven't created the new volume yet, as I can't get the peer probe to work off the Infiniband card. It's only seeing the NIC cards I currently have it hooked in to.
>
> On Feb 25, 2013, at 11:57 AM, harry mangalam <harry.manga...@uci.edu> wrote:
> > It /might be/probably is/ DNS-related. Are you trying to do this with RDMA or IPoIB? If IPoIB, are ALL your /etc/hosts files in sync (IB names separate and distinct from the ethernet interfaces) and responsive on the appropriate interfaces? Do the IB interfaces show up as distinct (and connected) in an 'ifconfig -a' and 'ibstat' dump? Do all the peers show up in an 'ibhosts' query?
Re: [Gluster-users] high CPU load on all bricks
but nothing solid. If the consensus is that NFS will not gain anything, then I won't waste the time setting it all up.

NFS gains you the use of FSCache to cache directories and file stats, making directory listings faster, but it adds overhead, decreasing the overall throughput (from all the reports I've seen). I would suspect that you have the kernel nfs server running on your servers. Make sure it's disabled.

Thanks, ~Mike C.

From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of Michael Colonno
Sent: Friday, February 01, 2013 4:46 PM
To: gluster-users@gluster.org
Subject: Re: [Gluster-users] high CPU load on all bricks

Update: after a few hours the CPU usage seems to have dropped down to a small value. I did not change anything with respect to the configuration or unmount / stop anything, as I wanted to see if this would persist for a long period of time. Both the client and the self-mounted bricks are now showing CPU < 1% (as reported by top). Prior to the larger CPU loads I installed a bunch of software into the volume (~5 GB total). Is this kind of transient behavior - by which I mean larger CPU loads after a lot of filesystem activity in a short time - typical? This is not a problem in my deployment; I just want to know what to expect in the future and to complete this thread for future users. If this is expected behavior we can wrap up this thread. If not, then I'll do more digging into the logs on the client and brick sides.

Thanks, ~Mike C.

From: Joe Julian [mailto:j...@julianfamily.org]
Sent: Friday, February 01, 2013 2:08 PM
To: Michael Colonno; gluster-users@gluster.org
Subject: Re: [Gluster-users] high CPU load on all bricks

Check the client log(s).

Michael Colonno mcolo...@stanford.edu wrote:
Forgot to mention: on a client system (not a brick) the glusterfs process is consuming ~68% CPU continuously. This is a much less powerful desktop system, so the CPU load can't be compared 1:1 with the systems comprising the bricks, but it is still very high. So the issue seems to exist with both glusterfsd and glusterfs processes.

Thanks, ~Mike C.

From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of Michael Colonno
Sent: Friday, February 01, 2013 12:46 PM
To: gluster-users@gluster.org
Subject: [Gluster-users] high CPU load on all bricks

Gluster gurus ~ I've deployed an 8-brick (2x replicate) Gluster 3.3.1 volume on CentOS 6.3 with tcp transport. I was able to build, start, mount, and use the volume. On each system contributing a brick, however, my CPU usage (glusterfsd) is hovering around 20% (virtually zero memory usage, thankfully). These are brand new, fairly beefy servers, so 20% CPU load is quite a bit. The deployment is pretty plain, with each brick mounting the volume to itself via a glusterfs mount. I assume this type of CPU usage is atypically high; is there anything I can do to investigate what's soaking up CPU and minimize it? Total usable volume size is only about 22 TB (about 45 TB total with 2x replicate).

Thanks, ~Mike C.
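On the kernel NFS point above: a quick way to confirm and disable knfsd on a CentOS 6-era server (assuming the stock init scripts) is something like:

# is the kernel NFS server registered with the portmapper?
rpcinfo -p | grep -w nfs

# stop it now and keep it off across reboots
service nfs stop
chkconfig nfs off

Gluster's built-in NFS server registers the same RPC service, so the two will conflict if knfsd is left running.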
Re: [Gluster-users] NFS availability
On Thursday, January 31, 2013 11:28:04 AM glusterzhxue wrote:
Hi all, As is known to us all, gluster provides NFS mount. However, if the mount point fails, clients will lose connection to Gluster, while if we use the gluster native client, this failure will have no effect on clients. For example:

mount -t glusterfs host1:/vol1 /mnt

If host1 goes down for some reason, the client still works; it is unaware of the failure (assuming we have multiple gluster servers).

The client will still fail (in most cases) since host1 (if I follow you) is part of the gluster groupset. Certainly if it's distributed-only; maybe not if it's a dist/repl gluster. But if host1 goes down, the client will not be able to find a gluster vol to mount.

However, if we use the following:

mount -t nfs -o vers=3 host1:/vol1 /mnt

If host1 fails, the client will lose its connection to the gluster servers.

If the client was mounting the glusterfs via a re-export from an intermediate host, you might be able to fail over to another intermediate NFS server, but if it was a gluster host, it would fail due to the reasons above.

Now, we want to use the NFS way. Could anyone give us some suggestions to solve the issue?

Multiple intermediate NFS servers with round-robin addressing? Anyone tried this?

Thanks Zhenghua

--- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) --- Something must be done. [X] is something. Therefore, we must do it. Bruce Schneier, on American response to just about anything.

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
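To make those suggestions concrete - both hedged, neither tested here: the native client in recent 3.x releases accepts a backup volfile server, which only protects the mount operation itself (after mounting, the client talks to all bricks directly); for NFS, a round-robin DNS name spreads client mounts across servers but will not fail over an established mount - for that you would need a floating IP managed by something like CTDB or ucarp in front of the gluster NFS servers.

# native client: fall back to host2 if host1 is down at mount time
mount -t glusterfs -o backupvolfile-server=host2 host1:/vol1 /mnt

# NFS: 'glusternfs' is a hypothetical round-robin A record over all servers
mount -t nfs -o vers=3 glusternfs:/vol1 /mnt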
Re: [Gluster-users] Question regarding write performance issues
Just a guess, but how are the writes being done? If they're being written in zillions of tiny writes, then what you may be seeing is described here: http://moo.nac.uci.edu/~hjm/bduc/BDUC_USER_HOWTO.html#writeperfongl and the following stanza on named pipes. This is often the case with the large files being used in NGS/HTS where the fasta/fastq files are composed of millions of short (60-100 char) lines of characters and are typically written line-by-line. hjm On Wednesday, January 16, 2013 02:47:37 PM Ayelet Shemesh wrote: Hi to all Gluster experts, I have a cluster of 10 machines exposing a volume into which 12 other machines do many writes of large files (~100-300MB each). In general I'm very happy with gluster. It's a great solution, and is quite stable (thanks for the great work!). However, I have a problem which I was unable to solve yet, nor find any solution to in the documentation or on this list archive. When the client machines write locally, and then just copy the files they created to the gluster mount - everything works great. When the client machines write directly to the gluster mounted volume I get a huge performance hit. In one specific test case the difference was 20 minutes for the copy and 8 hours for the direct write. I tried to set the iocache attributes of write-behind-window and flush-behind, but to no avail. I will very much appreciate your help in solving this problem. Thanks, Ayelet --- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) --- Something must be done. [X] is something. Therefore, we must do it. Bruce Schneier, on American response to just about anything. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
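To make the diagnosis concrete: the difference is between the application dribbling one write() per 60-100 char line straight at the FUSE mount, and a single buffered (or compressed) stream arriving in large blocks. A sketch - 'generate_reads' is a stand-in for whatever is producing the fastq output:

# slow: one tiny write() per line onto the gluster mount
generate_reads > /gl/sample.fastq

# usually much faster: gzip coalesces the tiny writes into large blocks
generate_reads | gzip -1 > /gl/sample.fastq.gz

# no-compression alternative: just re-block the stream
generate_reads | dd of=/gl/sample.fastq bs=1M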
Re: [Gluster-users] Redux: Stale NFS file handle
Following up on myself.. There is a RH bug report https://bugzilla.redhat.com/show_bug.cgi?id=678544 that sounds like what I'm seeing, but it's associated with the NFS client. We're using the gluster native client, so it shouldn't be related, but it looks like there may be some NFS client code that was used in the gluster native client code as well (the cause of the 'Stale NFS handle' warning in a non-NFS setting). Could this be the case? Could the pseudo-solution for the NFS case also work for the gluster native client case? ie:

mount -o acdirmin=[23] ...

It doesn't seem like 'acdirmin' is a mount option for glusterfs, but is there a gluster equivalent?

Also, this bug https://bugzilla.redhat.com/show_bug.cgi?id=844584, which complains about the same thing as I reference below, has this notation from Amar:
---
..this issue ['Stale NFS handle'] can happen in the case where file is changed on server after it was being accessed from one node (changed as in, re-created). as long as the application doesn't see any issue, this should be ok.
---

In our case, it's causing the SGE array job to fail (or at least it appears to be highly related):

= SGE error extract:
Shepherd error: 01/03/2013 20:03:03 [507:64196]: error: can't chdir to /gl/bio/krthornt/build_div/yak/line10_CY22B/prinses: No such file or directory

= glusterfs log on client (native gluster client):
[...] remote operation failed: Stale NFS file handle. Path: /bio/krthornt/build_div/yak/line10_CY22B/prinses (590802f1-7fba-4103-ba30-e4d415b9db36)
[2013-01-03 20:03:03.168229] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-1: remote operation failed: Stale NFS file handle. Path: /bio/krthornt/build_div/yak/line10_CY22B/prinses (590802f1-7fba-4103-ba30-e4d415b9db36)
[2013-01-03 20:03:03.168287] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-2: remote operation failed: Stale NFS file handle. Path: /bio/krthornt/build_div/yak/line10_CY22B/prinses (590802f1-7fba-4103-ba30-e4d415b9db36)

(and 13 more identical lines except for the timestamp, extending out to '20:03:17.202688').

So while we might not be losing data /per se/, this failure is definitely causing our cluster to lose jobs. If this has not been reported previously, I'd be happy to file an official bug report.

hjm

On Thursday, January 03, 2013 05:56:10 PM harry mangalam wrote:
From ~ an hour's googling and reading, it looks like this (not uncommon) bug/warning/error has not necessarily been associated with data loss, but we are finding that our gluster fs is interrupting our cluster jobs with 'Stale NFS handle' warnings like this (on the client):

[2013-01-03 12:30:59.149230] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-0: remote operation failed: Stale NFS file handle. Path: /bio/krthornt/build_div/yak/line06_CY08A/prinses (3b0aa7b2-bf7f-4b27-b515-32e94b1206e3)

(and 7 more, differing by the timestamp of 1s). The dir mentioned existed before the job was asked to read from it, and shortly after SGE failed, I checked that the glusterfs (/bio) was still mounted and that the dir was still r/w. We are getting these errors infrequently, but fairly regularly (a couple of times a week, usually during a big array job that heavily reads from a particular dir) and I haven't seen any resolutions of the fault besides the vocabulary being corrected. I know it's not necessarily an NFS problem, but I haven't seen a fix from the gluster folks.
Our glusterfs on this system is set up like this (over QDR/tcpoib):

$ gluster volume info gl

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
auth.allow: 10.2.*.*,10.1.*.*
performance.io-thread-count: 64
performance.quick-read: on
performance.io-cache: on
nfs.disable: on
performance.cache-size: 268435456
performance.flush-behind: on
performance.write-behind-window-size: 1024MB

and otherwise appears to be happy. We were having a low-level problem with the RAID servers, where this LSI/3ware error was temporally close (~2m) to the gluster error:

LSI 3DM2 alert -- host: biostor4.oit.uci.edu Jan 03, 2013 03:32:09PM - Controller 6 ERROR - Drive timeout detected: encl=1, slot=3

This error seemed to be related to construction around our data center and the dust associated with it. We have had 10s of these LSI/3ware errors with no related gluster errors or apparent problems with the RAIDs. No drives were ejected from the RAIDs and the errors did not repeat. 3ware explains: http://cholla.mmto.org/computers/3ware/3dm2/en/3DM_2_OLH-8-6.html
== 009h Drive timeout detected: The 3ware RAID controller has a sophisticated recovery mechanism to handle various types of failures
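On the acdirmin question earlier in this message: the closest glusterfs analog I know of is the FUSE attribute/entry cache timeouts, which are settable at mount time. Whether raising them papers over this particular race is untested - a sketch only, with 'server' as a placeholder for the volfile server:

mount -t glusterfs -o attribute-timeout=2,entry-timeout=2 server:/gl /bio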
[Gluster-users] Redux: Stale NFS file handle
Re: [Gluster-users] Gluster in a cluster
This doesn't seem like a good way to do what I think you want to do.

1 - /scratch should be as fast as possible, so putting it on a distributed fs, unless that fs is optimized for speed (mumble.Lustre.mumble), is a mistake.

2 - if you insist on doing this with gluster (perhaps because none of your individual /scratch partitions is large enough), making a dist-replicated /scratch is making a bad decision worse, as replication will slow the process down even more. (Why replicate what is a temp data store?)

3 - integrating the gluster server into the rocks environment (on a per-node basis) seems like a recipe for .. well, migraines, at least.

If you need a relatively fast, simple, large, reliable, aggressively caching fs for /scratch, NFS to a large RAID0/10 has some attractions, unless the gluster server fanout IO overwhelms the aforementioned attractions. IMHO...

On Thursday, November 15, 2012 09:30:52 AM Jerome wrote:
Dear all, I'm testing Gluster on a cluster of compute nodes, based on Rocks. The idea is to use the scratch space of each node as one big scratch volume, accessible on all the nodes of the cluster. For the moment, I have installed this gluster system on 4 nodes, in a distributed replica of 2, like this:

# gluster volume info

Volume Name: scratch1
Type: Distributed-Replicate
Volume ID: c8c3e3fe-c785-4438-86eb-0b84c7c29123
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: compute-2-0:/state/partition1/scratch
Brick2: compute-2-1:/state/partition1/scratch
Brick3: compute-2-6:/state/partition1/scratch
Brick4: compute-2-9:/state/partition1/scratch

# gluster volume status
Status of volume: scratch1
Gluster process                              Port   Online  Pid
----------------------------------------------------------------
Brick compute-2-0:/state/partition1/scratch  24009  Y       16464
Brick compute-2-1:/state/partition1/scratch  24009  Y       3848
Brick compute-2-6:/state/partition1/scratch  24009  Y       511
Brick compute-2-9:/state/partition1/scratch  24009  Y       2086
NFS Server on localhost                      38467  N       4060
Self-heal Daemon on localhost                N/A    N       4065
NFS Server on compute-2-0                    38467  Y       16470
Self-heal Daemon on compute-2-0              N/A    Y       16476
NFS Server on compute-2-9                    38467  Y       2092
Self-heal Daemon on compute-2-9              N/A    Y       2099
NFS Server on compute-2-6                    38467  Y       517
Self-heal Daemon on compute-2-6              N/A    Y       524
NFS Server on compute-2-1                    38467  Y       3854
Self-heal Daemon on compute-2-1              N/A    Y       3860

All of this runs correctly; I used some file stress tests to satisfy myself that the configuration was usable. My problem is when a node reboots accidentally, or for some administration task: the node reinstalls itself, and the gluster volume begins to fail. I noticed that the UUID of a machine is generated during the installation, so I developed a script to restore the original UUID of the node. Despite this, the node could not get back into the volume; I am missing some needed step. So, is it possible to do such a system with gluster? Or maybe I have to reconfigure the whole volume when a node reinstalls?

Best regards.

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Passive-Aggressive Supporter of the The Canada Party: http://www.americabutbetter.com/

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
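For the reinstall problem specifically, restoring the UUID is necessary but not sufficient; the rebuilt node also needs the volume definitions pushed back to it. A hedged sketch of the sequence (paths per stock packaging, node names from the example above):

# before reinstall (or from backup): preserve the node's identity
cp /var/lib/glusterd/glusterd.info /somewhere/safe/

# after reinstall, with glusterd stopped:
cp /somewhere/safe/glusterd.info /var/lib/glusterd/
service glusterd start

# from a surviving peer, re-probe the rebuilt node:
gluster peer probe compute-2-0

# on the rebuilt node, pull the volume definitions from a healthy peer:
gluster volume sync compute-2-1 all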
Re: [Gluster-users] Very slow directory listing and high CPU usage on replicated volume
Jeff Darcy wrote a nice piece in his hekafs blog about 'the importance of keeping things sequential', which is essentially about the contention for heads between data IO and journal IO: http://hekafs.org/index.php/2012/11/the-importance-of-staying-sequential/ (also congrats on the Linux Journal article on the glupy python/gluster approach).

We've been experimenting with SSDs on ZFS (using the SSDs for the ZIL (journal)) and while it's provided a little bit of a boost, it has not been dramatic. Ditto XFS. However, we did not stress it at all with heavy loads in a gluster env, and I'm now thinking that that is where you would see the improvement. (See Jeff's graph about how the diff in threads/load affects IOPS.)

Is anyone running a gluster system with the underlying XFS writing the journal to SSDs? If so, any improvement? I would have expected to hear about this as a recommended architecture for gluster if it had performed MUCH better, but ...? We're about to combine 2 clusters and may just go ahead with this approach as a /scratch system to test it; see the sketch after this message.

hjm

On Monday, November 05, 2012 07:58:22 AM Jonathan Lefman wrote:
I take it back. Things deteriorated pretty quickly after I began dumping data onto my volume from multiple clients. Initially my transfer rates were okay - not fast, but livable. However, after about an hour of copying several terabytes from 3-4 client machines, the rates of transfer often dropped to ~1 b/s. Sometimes I would see a couple-second burst of good transfer rates. Anyone have ideas on how to address this effectively? I'm at a loss. -Jon

On Nov 2, 2012 1:21 PM, Jonathan Lefman jonathan.lef...@essess.com wrote:
I should have also said that my volume is working well now and all is well. -Jon

On Fri, Nov 2, 2012 at 1:21 PM, Jonathan Lefman jonathan.lef...@essess.com wrote:
Thank you Brian. I'm happy to hear that this behavior is not typical. I am now using xfs on all of my drives. I also wiped out the entire /etc/glusterd directory for good measure. I bet that there was residual information from a previous attempt at a gluster volume that must have caused problems. Or moving to xfs from ext4 is an amazing fix, but I think this is less likely. I appreciate your time responding to me. -Jon

On Nov 2, 2012 4:44 AM, Brian Candler b.cand...@pobox.com wrote:
On Thu, Nov 01, 2012 at 08:03:21PM -0400, Jonathan Lefman wrote:
Soon after loading up about 100 MB of small files (about 300kb each), the drive usage is at 1.1T.

That is very odd. What do you get if you run du and df on the individual bricks themselves? 100MB is only ~330 files of 300KB each. Did you specify any special options to mkfs.ext4? Maybe -I 512 would help, as the xattrs are more likely to sit within the inodes themselves. If you start everything from scratch, it would be interesting to see df stats when the filesystem is empty. It may be that a huge amount of space has been allocated to inodes. If you expect most of your files to be >16KB, then you could add -i 16384 to mkfs.ext4 to reduce the space reserved for inodes. But using xfs would be better, as it doesn't reserve any space for inodes; it allocates them dynamically. Ignore the comment that glusterfs is not designed for handling large counts of small files - 300KB is not small.

Regards, Brian.
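Returning to the XFS-journal-on-SSD question at the top of this message: for anyone wanting to run the experiment, the relevant knobs are mkfs- and mount-time options; the device names below are examples only:

# put the XFS log on a small, fast SSD partition
mkfs.xfs -l logdev=/dev/ssd_part,size=512m /dev/md0
mount -o logdev=/dev/ssd_part,logbsize=256k /dev/md0 /brick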
Re: [Gluster-users] volume started but not 'startable', not 'stoppable'
Sorry for not responding immediately - I was drowning in flopsweat trying to get it back up. After some false starts, mostly due to premature mounting of the inconsistent volfile, remounting after the glusterfs came back and re-established a valid volfile seems to have resolved everything. Thanks very much for the help..

hjm

On Monday, October 08, 2012 02:30:07 PM John Mark Walker wrote:
I suspect this didn't go through - forwarding.

- Original Message -
Harry, could you paste/attach the contents of the /var/lib/glusterd/gli/info files and the glusterd log files from the 4 peers in the cluster? From the volume-info snippet you had pasted, it appears that the node which was shut down differs in its view of the volume's status.
thanks, krish

- Original Message -
From: harry mangalam johnm...@johnmark.org
To: gluster-users@gluster.org
Sent: Monday, October 8, 2012 2:49:05 AM
Subject: Re: [Gluster-users] volume started but not 'startable', not 'stoppable'

And a few more data points: it appears the reason for the flaky gluster fs is that not all the servers are running glusterfsd's (see below). Is there a way to force the servers to all start the glusterfsd's, as they're supposed to?

The mystery rebalance did complete, and seems to have fixed some but not all problem files - ie:

drwx------ 2 spoorkas spoorkas 8211 Jun 2 00:22 QPSK_2Tx_2Rx_BH_Method2/
?--------- ? ? ? ? ? QPSK_2Tx_2Rx_ML_Method1

And the started/not-started status has gotten weirder, if possible. The gluster volume is still being exported to clients, despite gluster insisting that the volume is not started (servers are pbs[1234]).

result of $ gluster volume status:
pbs1: Volume gli is not started
pbs2: Volume gli is not started
pbs3: Volume gli is not started
pbs4: Volume gli is not started

$ gluster volume info:
pbs1: Status: Stopped
pbs2: Status: Started <- aha!
pbs3: Status: Started <- aha!
pbs4: Status: Started

This correlates with the glusterfsd status, in which only pbs[2,3] are running glusterfsd:

pbs2: root 1799 0.1 0.0 184296 16464 ? Ssl 13:07 0:06 /usr/sbin/glusterfsd -s localhost --volfile-id gli.pbs2ib.bducgl -p /var/lib/glusterd/vols/gli/run/pbs2ib-bducgl.pid -S /tmp/c70b2f910e2fe1bb485b1d76ef63e3db.socket --brick-name /bducgl -l /var/log/glusterfs/bricks/bducgl.log --xlator-option *-posix.glusterd-uuid=26de63bd-c5b7-48ba-b81d-5d77a533d077 --brick-port 24025 24026 --xlator-option gli-server.transport.rdma.listen-port=24026 --xlator-option gli-server.listen-port=24025

pbs3: root 1751 0.1 0.0 184168 16468 ? Ssl 13:07 0:06 /usr/sbin/glusterfsd -s localhost --volfile-id gli.pbs3ib.bducgl -p /var/lib/glusterd/vols/gli/run/pbs3ib-bducgl.pid -S /tmp/7096377992feb7f5a7805cafd82c3100.socket --brick-name /bducgl -l /var/log/glusterfs/bricks/bducgl.log --xlator-option *-posix.glusterd-uuid=c79c4084-d6b9-4af9-b975-40dd6aa99b42 --brick-port 24018 24020 --xlator-option gli-server.transport.rdma.listen-port=24020 --xlator-option gli-server.listen-port=24018

pbs[1,4] are only running the glusterd process, not any glusterfsd's. In previous startups, pbs4 WAS running a glusterfsd, but pbs1 has not run one since the powerdown, AFAIK.
hjm On Saturday, October 06, 2012 10:19:14 PM harry mangalam wrote: ...and should have added: the rebalance log (the volume claimed to be rebalancing before I shut it down but was idle or wedged at that time) is active as well with about 1 warning of a 1 subvolumes down -- not fixing for every 3 informational messages: 2012-10-06 22:05:35.396650] I [dht-rebalance.c:1058:gf_defrag_migrate_data] 0-gli-dht: migrate data called on /nlduong/nduong2-t- illiac/workspace/m5_sim/trunk/src/arch/.svn/tmp/wcprops [2012-10-06 22:05:35.451925] I [dht-layout.c:593:dht_layout_normalize] 0-gli- dht: found anomalies in /nlduong/nduong2-t- illiac/workspace/m5_sim/trunk/src/arch/.svn/wcprops. holes=1 overlaps=0 [2012-10-06 22:05:35.451957] W [dht-selfheal.c:875:dht_selfheal_directory] 0- gli-dht: 1 subvolumes down -- not fixing previously... gluster 3.3, running on ubuntu 10.04, was running OK, had to shut down for a power outage. When I tried to shut it down, it insisted that it was rebalancing, but seeemed wedged - no activity in the logs. Was able to shut it down tho. After power was restored, tried to restart the volume but altho the 4 peers claimed to be visible and could ping each other etc: == Sat Oct 06 21
Re: [Gluster-users] volume started but not 'startable', not 'stoppable'
Hi Amar. Thanks SO much. That did it. (well, there are other remaining problems but they seem to be easily addressed relative to that initial problem). My Mum told me never to force anything, but you've proved her wrong. :) for others following in this thread - a 'force stop' and a 'force start' made everything come back. I'm re-running a rebalance to address some of the file inconsistencies, but almost all of them were resolved in the /forced/ restarting of the volume. hjm On Monday, October 08, 2012 03:46:54 PM Amar Tumballi wrote: And since they think it's not started, I can't stop it. How is this resolvable? can you try 'gluster volume stop VOLNAME force' ? (or 'gluster volume start VOLNAME force' Regards, Amar ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Passive-Aggressive Supporter of the The Canada Party: http://www.americabutbetter.com/ ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
[Gluster-users] volume started but not 'startable', not 'stoppable'
root 18 Sep 10 13:41 bonnie2/
drwx------ 2 spoorkas spoorkas  8211 Jun  2 00:22 QPSK_2Tx_2Rx_BH_Method2/
?--------- ? ?        ?             ?            QPSK_2Tx_2Rx_ML_Method1
drwx------ 2 spoorkas spoorkas  8237 Jun  3 11:22 QPSK_2Tx_2Rx_ML_Method2/
drwx------ 2 spoorkas spoorkas 12288 Jun  4 01:24 QPSK_2Tx_3Rx_BH/
drwx------ 2 spoorkas spoorkas  4232 Jun  2 00:26 QPSK_2Tx_3Rx_BH_Method1/
drwx------ 2 spoorkas spoorkas  8274 Jun  2 00:34 QPSK_2Tx_3Rx_BH_Method2/
?--------- ? ?        ?             ?            QPSK_2Tx_3Rx_ML_Method1
?--------- ? ?        ?             ?            QPSK_2Tx_3Rx_ML_Method2
-rw-r--r-- 1 spoorkas spoorkas     0 Apr 17 14:16 simple.sh.e1802207

(These files appear to be intact on the individual bricks, tho.)

== Sat Oct 06 21:38:18 [0.76 0.71 0.58] root@pbs2:/var/log/glusterfs/bricks
568 $ gluster volume status
Volume gli is not started
==

and since that is the case, other utilities also claim this:

== Sat Oct 06 21:41:25 [1.04 0.84 0.65] root@pbs2:/var/log/glusterfs/bricks
571 $ gluster volume status gli detail
Volume gli is not started
==

And since they think it's not started, I can't stop it. How is this resolvable?

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Passive-Aggressive Supporter of the The Canada Party: http://www.americabutbetter.com/

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
[Gluster-users] Retraction: Protocol stacking: gluster over NFS
Hi All, Well, it http://goo.gl/hzxyw was too good to be true. Under extreme, extended IO on a 48core node, some part of the NFS stack collapses and leads to an IO lockup thru NFS. We've replicated it on 48core and 64core nodes, but don't know yet whether it acts similarly on lower-core-count nodes. Tho I haven't had time to figure out exactly /how/ it collapses, I owe it to those who might be thinking of using it to tell them not to. This is what I wrote, describing the situation to some co-workers:

===
With Joseph's and Kevin's help, I've been able to replicate Kevin's complete workflow on BDUC and executed it with a normally mounted gluster fs and my gluster-via-NFS-loopback (on both NFS3 and NFS4 clients). The good news is that the workflow went to completion on BDUC with the native gluster fs mount, doing pretty decent IO on one node - topping out at about 250MB/s in and 75MB/s out (DDR IB):

ib1   KB/s in    KB/s out
      268248.1   62278.40
      262835.1   64813.55
      248466.0   61000.24
      250071.3   67770.03
      252924.1   67235.13
      196261.3   56165.20
      255562.3   68524.45
      237479.3   68813.99
      209901.8   73147.73
      217020.4   70855.45

The bad news is that I've been able to replicate the failures that JF has seen. The workflow starts normally but then eats up free RAM as KT's workflow saturates the nodes with about 26 instances of samtools, which does a LOT of IO (10s of GB in the ~30m of the run). This was the case even when I increased the number of nfsd's to 16 and even 32. When using native gluster, the workflow goes to completion in about 23 hrs - about the same as when KT executed it on his machine (using NFS, I think..?). However, when using the loopback mount, on both NFS3 and NFS4, it locks up the NFS side (the gluster mount continues to be R/W), requiring a hard reset on the node to clear the NFS error. It is interesting that the samtools processes lock up during /reads/, not writes (via stracing several of the processes).

I found this entry in a FraunhoferFS discussion, from https://groups.google.com/forum/?fromgroups=#!topic/fhgfs-user/XoGPbv3kfhc

[[ In general, any network file system that uses the standard kernel page cache on the client side (including e.g. NFS, just to give another example) is not suitable for running client and server on the same machine, because that would lead to memory allocation deadlocks under high memory pressure - so you might want to watch out for that. (fhgfs uses a different caching mechanism on the clients to allow running it in such scenarios.) ]]

but why this would be the case, I'm not sure - the server and client processes should be unable to step on each other's data structures, so why they would interfere with each other is unclear. Others on this list have mentioned similar opinions - I'd be interested in why this is theoretically the case.

The upshot is that under extreme, extended IO, NFS will lock up, so while we haven't seen it on BDUC except for KT's workflow, it's repeatable and we can't recover from it smoothly. So we should move away from it. I haven't been able to test it on a 3.x kernel (but will after this weekend); it's possible that it might work better, but I'm not optimistic.
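The deadlock mechanism, as I understand it (hedged - this is the standard explanation, not something traced here): it isn't the two processes stepping on each other's data structures, but a circular dependency in memory reclaim - the NFS client fills RAM with dirty pages, the kernel must write them out via NFS to free memory, and the NFS server on the same machine needs free memory to service those writes. The commonly suggested (partial) mitigation is to force earlier, smaller flushes:

sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.min_free_kbytes=262144   # keep some headroom for the server side

This narrows the window but does not remove the circularity.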
Re: [Gluster-users] NFS over gluster stops responding under write load
heading towards OOM territory. The glusterfs daemon is currently consuming 90% of MEM according to top. thanks ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- What does it say about a society that would rather send its children to kill and die for oil than to get on a bike? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] cannot create a new volume with a brick that used to be part of a deleted volume?
I believe gluster writes 2 entries into the top level of your gluster brick filesystems: -rw-r--r-- 2 root root36 2012-06-22 15:58 .gl.mount.check drw--- 258 root root 8192 2012-04-16 13:20 .glusterfs You will have to remove these as well as all the other fs info from the volume to re-add the fs as another brick. Or just remake the filesystem - instantaneous with XFS, less so with ext4. hjm On Tuesday, September 18, 2012 11:03:35 AM Lonni J Friedman wrote: Greetings, I'm running v3.3.0 on Fedora16-x86_64. I used to have a replicated volume on two bricks. This morning I deleted it successfully: [root@farm-ljf0 ~]# gluster volume stop gv0 Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y Stopping volume gv0 has been successful [root@farm-ljf0 ~]# gluster volume delete gv0 Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y Deleting volume gv0 has been successful [root@farm-ljf0 ~]# gluster volume info all No volumes present I then attempted to create a new volume using the same bricks that used to be part of the (now) deleted volume, but it keeps refusing failing claiming that the brick is already part of a volume: [root@farm-ljf1 ~]# gluster volume create gv0 rep 2 transport tcp 10.31.99.165:/mnt/sdb1 10.31.99.166:/mnt/sdb1 /mnt/sdb1 or a prefix of it is already part of a volume [root@farm-ljf1 ~]# gluster volume info all No volumes present Note farm-ljf0 is 10.31.99.165 and farm-ljf1 is 10.31.99.166. I also tried restarting glusterd (and glusterfsd) hoping that might clear things up, but it had no impact. How can /mnt/sdb1 be part of a volume when there are no volumes present? Is this a bug, or am I just missing something obvious? thanks ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- What does it say about a society that would rather send its children to kill and die for oil than to get on a bike? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
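In 3.3 the 'already part of a volume' check also looks at extended attributes on the brick root, so removing those two entries may not be enough by itself. A sketch of the full cleanup - this destroys gluster metadata, so only do it on a brick you mean to recycle:

BRICK=/mnt/sdb1
setfattr -x trusted.glusterfs.volume-id $BRICK
setfattr -x trusted.gfid $BRICK
rm -rf $BRICK/.glusterfs

After that, the volume create should accept the brick again.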
Re: [Gluster-users] Protocol stacking: gluster over NFS
Just to clarify, we are using the native kNFS-server from the distro, not gluster's NFS implementation. Note that CentOS 5.7/5.8 does not seem to support this kind of loopback mounting with the kNFS version they use (kNFS ver 1.0.8/9). However, the recent kNFS servers from Ubuntu/Debian (kNFSv 1.2.0) and SL 6.2 (1.2.3) do support it. We're still testing but have not yet found the kind of deadlocks/crashes that others have mentioned with the gluster NFS (touch wood). hjm On Monday, September 17, 2012 09:08:08 AM Jeff White wrote: I was under the impression that self-mounting NFS of any kind (mount -t nfs localhost...) was a dangerous thing. When I did that with gNFS I could cause a server to crash in no time at all with a simple dd into the mount point. I was under the impression that kNFS would have the same problem though I have not tested in myself (this was discussed in #gluster on irc.freenode.net some time ago). I'm guessing this would be a bug in the kernel. Has anyone seen issues or crashes with locally mounted NFS (either gNFS or kNFS)? Jeff White - GNU+Linux Systems Administrator University of Pittsburgh - CSSD On 09/14/2012 03:22 PM, John Mark Walker wrote: A note on recent history: There were past attempts to export GlusterFS client mounts over NFS, but those used the GlusterFS NFS service. I believe this is the first instance in the wild of someone trying this with knfsd. With the former, while there was increased performance, there would invariably be race conditions that would lock up GlusterFS. See the ominous warnings posted on this QA thread: http://community.gluster.org/a/nfs-performance-with-fuse-client-redundanc y/ I am curious to see if using knfsd, as opposed to GlusterFS' NFS service, yields a long-term solution for this type of workload. Please do continue to keep us updated. Thanks, JM -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- What does it say about a society that would rather send its children to kill and die for oil than to get on a bike? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Protocol stacking: gluster over NFS
Hi Venky - thanks for the link to this translator. I'll take a look at it, but right now we don't have too much trouble with reads - it's the 'zillions of tiny writes' problem that's hosing us, and the NFS solution gives us a bit more headroom. We'll be moving this out to part of our cluster today (unless someone can convince me otherwise) and we'll see if it shows real-world improvements.

hjm

On Friday, September 14, 2012 11:33:34 AM Venky Shankar wrote:
Hi Harry, There is a compression translator in Gerrit which you might be interested in: http://review.gluster.org/#change,3251 It compresses data (using the zlib library) before it is sent out to the network from the server, and on the other side (client; FUSE mount) it decompresses it. Also, note that it only does this for data transferred as part of the read fop, and the volfiles need to be hand-edited (CLI support is still pending). I've not done any performance runs till now but I plan to do so soon. -Venky

On Friday 14 September 2012 10:25 AM, harry mangalam wrote:
Hi All, We have been experimenting with 'protocol stacking' - that is, running gluster over NFS. What I mean:
- mounting a gluster fs via the native client,
- then NFS-exporting the gluster fs to the client itself
- then mounting that gluster fs via NFS3 to take advantage of the client-side caching.

We've tried it on a limited basis (single client) and not only does it work, but it works surprisingly well, gaining about 2-3X the write performance relative to the native gluster mount on uncompressed data, using small writes. Using compressed data (piping thru gzip, for example) is more variable - if the data is highly compressible, it tends to increase performance; if less compressible, it tends to decrease performance. As I noted previously http://goo.gl/7G7k3, piping small writes thru gzip /can/ tremendously increase performance on a gluster fs in some bioinformatics applications.

A graph of the performance on various file sizes (created by a trivial program that does zillions of tiny writes - a sore point in the gluster performance spectrum) is shown here: http://moo.nac.uci.edu/~hjm/fs_perf_gl-tmp-glnfs.png

The graphs show the time to complete and sync on a set of writes from 10MB to 30GB on 3 fs's:
- /tmp on the client's system disk (a single 10K USCSI)
- /gl, a 4 server, 4 brick gluster (distributed-only) fs
- /gl/nfs, the same gluster fs, loopback-mounted via NFS3 on the client

The results show that using a gluster fs loopback-mounted to itself increased performance by 2-3X, increasing as the file size increased to 30GB. The client (64GB RAM) was otherwise idle when I did these tests. In addition (data not shown), I also tested how compression (piping the output thru gzip) affected the total time-to-complete. In one case, due to the identical string being written, gzip managed about 1000X compression, so the eventual file size sent to the disk was almost inconsequential. Nevertheless, the extra time for the compression was more than made up for by the reduced data, and adding gzip decreased the time-to-complete significantly. In other testing with less compressible data (shown above), the compression time overwhelmed the write time and all the fs had essentially identical times per file size. In all cases, the writes were followed by a 'sync' to flush the cache.
It seems that the loopback NFS mounting of the gluster fs is a fairly obvious win (overall, about 2-3x times the write speed) in terms of taking avantage of gluster's fs scaling and namespace with NFS3's client-side caching, but I'd like to hear from other gluster users as to possible downsides of this approach. hjm ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- What does it say about a society that would rather send its children to kill and die for oil than to get on a bike? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Protocol stacking: gluster over NFS
Well, it was too clever for me too :) - someone else suggested it when I was describing some of the options we were facing. I admit to initially thinking that it was silly to expect better performance by stacking protocols, but we tried it and it seems to have worked.

To your point:
- the 'client' is the end node that uses the gluster storage - in our case it's a compute node (w/ limited storage) in a research cluster.
- the 'server' is the collection of nodes that provides the gluster storage.
- the client mounts the server with the native gluster client, providing all the gluster advantages of single namespace, scalability, reliability, etc. to 'client:/glmount'.
- the client then exports that gluster fs via NFS to itself, so 'client:/glmount' is listed in '/etc/exports' as rw to itself.
- the client then mounts itself (innuendo and disturbing mental images notwithstanding) via NFS: 'mount -t nfs localhost:/glmount /glnfs', so that the gluster fs (/glmount) is NFS-loopback-mounted on the client (itself).

From our test case, simplified (all non-gluster-related entries deleted):

hmangala@claw5:~
$ cat /etc/mtab
...
pbs1ib:/gli /glmount fuse.glusterfs rw,default_permissions,allow_other,max_read=131072 0 0
...
claw5:/glmount /glnfs nfs rw,addr=10.255.78.40 0 0
...

In the above extract, pbs1ib:/gli is the gluster fs that is mounted on 'claw5:/glmount'. claw5 then NFS-mounts claw5:/glmount onto /glnfs, which users actually use to read/write. I agree, not very intuitive... but it seems to work. This is with NFS3 clients. NFS4 may provide an additional perf boost by allowing clients to work out of cache until forced to sync, but we haven't tried that yet, and the test methodology we used wouldn't show a gain anyway. I'll have to try to create a more realistic test harness.

hjm

On Friday, September 14, 2012 01:04:59 PM Whit Blauvelt wrote:
On Fri, Sep 14, 2012 at 09:41:42AM -0700, harry mangalam wrote:
What I mean:
- mounting a gluster fs via the native client,
- then NFS-exporting the gluster fs to the client itself
- then mounting that gluster fs via NFS3 to take advantage of the client-side caching.

Harry, What is the client itself here? I'm having trouble picturing what's doing what with what. No doubt because it's too clever for me. Maybe a bit more description would clarify it nonetheless.

Thanks, Whit

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- What does it say about a society that would rather send its children to kill and die for oil than to get on a bike?

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
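For anyone wanting to replicate the loopback setup: the kNFS export of a FUSE filesystem needs an explicit fsid=, since FUSE mounts lack the stable device number the server would otherwise use. A sketch, using the paths from the example above:

# /etc/exports on the client node itself:
/glmount localhost(rw,no_root_squash,fsid=1)

# then:
exportfs -ra
mount -t nfs -o vers=3 localhost:/glmount /glnfs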
[Gluster-users] Protocol stacking: gluster over NFS
Hi All, We have been experimenting with 'protocol stacking' - that is, running gluster over NFS. What I mean:
- mounting a gluster fs via the native client,
- then NFS-exporting the gluster fs to the client itself
- then mounting that gluster fs via NFS3 to take advantage of the client-side caching.

We've tried it on a limited basis (single client) and not only does it work, but it works surprisingly well, gaining about 2-3X the write performance relative to the native gluster mount on uncompressed data, using small writes. Using compressed data (piping thru gzip, for example) is more variable - if the data is highly compressible, it tends to increase performance; if less compressible, it tends to decrease performance. As I noted previously http://goo.gl/7G7k3, piping small writes thru gzip /can/ tremendously increase performance on a gluster fs in some bioinformatics applications.

A graph of the performance on various file sizes (created by a trivial program that does zillions of tiny writes - a sore point in the gluster performance spectrum) is shown here: http://moo.nac.uci.edu/~hjm/fs_perf_gl-tmp-glnfs.png

The graphs show the time to complete and sync on a set of writes from 10MB to 30GB on 3 fs's:
- /tmp on the client's system disk (a single 10K USCSI)
- /gl, a 4 server, 4 brick gluster (distributed-only) fs
- /gl/nfs, the same gluster fs, loopback-mounted via NFS3 on the client

The results show that using a gluster fs loopback-mounted to itself increased performance by 2-3X, increasing as the file size increased to 30GB. The client (64GB RAM) was otherwise idle when I did these tests. In addition (data not shown), I also tested how compression (piping the output thru gzip) affected the total time-to-complete. In one case, due to the identical string being written, gzip managed about 1000X compression, so the eventual file size sent to the disk was almost inconsequential. Nevertheless, the extra time for the compression was more than made up for by the reduced data, and adding gzip decreased the time-to-complete significantly. In other testing with less compressible data (shown above), the compression time overwhelmed the write time and all the fs had essentially identical times per file size. In all cases, the writes were followed by a 'sync' to flush the cache.

It seems that the loopback NFS mounting of the gluster fs is a fairly obvious win (overall, about 2-3X the write speed) in terms of taking advantage of gluster's fs scaling and namespace with NFS3's client-side caching, but I'd like to hear from other gluster users as to possible downsides of this approach.

hjm

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) --

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
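A minimal reconstruction of that kind of timing test - 'tinywrite' is a hypothetical stand-in for the trivial line-by-line writer, not the actual script used:

for target in /tmp /gl /gl/nfs; do
  for size in 10M 100M 1G 10G 30G; do
    /usr/bin/time -f "$target $size: %e s" \
      sh -c "tinywrite $size > $target/test.$size && sync"
  done
done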
Re: [Gluster-users] XFS and MD RAID
We're using 3ware Inc 9750 SAS2/SATA-II RAID controllers in a 4-brick, 400TB gluster system. The 4 have performed very well overall in about 6mo of production work, alerting us to problem disks, etc. Tho 3ware is an LSI product now, this model retains the familiar, if somewhat grunty, 3dm2 interface and a usable cli, as opposed to the sulfuric acid enema interface of many native LSI controllers.

We have also used mdadm with the older 8-port Marvell PCI-X software raid controller in another, older gluster system, which has worked flawlessly. lspci says: MV88SX6081 8-port SATA II PCI-X. This one is also ZFS-compatible. As others have said, using mdadm is much easier due to its unified interface on top of heterogeneous hardware, and if it's any slower, I haven't felt it. Using the hardware RAID was sort of forced on me due to fear of the commandline from others using the system. :(

hjm

On Monday, September 10, 2012 09:39:18 AM Brian Candler wrote:
On Mon, Sep 10, 2012 at 09:29:25AM +0800, Jack Wang wrote:
below patch should fix your bug.

Thank you Jack - that was a very quick response! I'm building a new kernel with this patch now and will report back. However, I think the existence of this bug suggests that Linux with software RAID is unsuitable for production use. There has obviously been no testing of basic critical functionality like hot-plugging drives, and serious regressions are introduced into supposedly stable kernels. So I'm now on the lookout for a 24-port SATA RAID controller with good Linux support. What are my options? Googling, I have found:

* 3ware 9650SE-24
* Areca ARC-1280ML
* LSI MegaRAID 9280-24i (newer SAS/SATA)
* Areca ARC-1882ix-24 (newer SAS/SATA)

However, I see some people suggesting just a RAID card with a few ports plus a SAS expander backplane. This would be fine too - I don't mind an aggregate throughput limit of 6Gb/s for some or all of the drives. I just want to be sure that the RAID controller will handle all the possible failure modes and swap events of the various drives.

Regards, Brian.

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- What does it say about a society that would rather send its children to kill and die for oil than to get on a bike?

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
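For reference, the day-to-day health checks differ mainly in syntax between the two setups; a sketch, with controller/array IDs as examples:

# 3ware/LSI 9750, via the cli mentioned above:
tw_cli /c0 show          # controller 0: unit and drive status
tw_cli /c0/u0 show       # one unit in detail

# mdadm software RAID:
cat /proc/mdstat
mdadm --detail /dev/md0
mdadm --monitor --scan --daemonise --mail=root   # mail on degraded arrays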
[Gluster-users] Non-progressing, Unstoppable rebalance on 3.3
Following an interchange with Jeff Darcy and Shishir Gowda, I started a rebalance of my cluster (3.3 on Ubuntu 10.04.4). Note: shortly after it started, 3/4 of the glusterfsd's shut down (which was exciting..). I stopped and restarted glusterd and the glusterfsd's restarted in turn and all was well; however, it may have caused a problem with the rebalance: after 2 days of waiting, the rebalance has apparently done nothing (distracted by other things) and presents with the same values as it had originally:

Thu Aug 23 10:35:11 [0.00 0.00 0.00] root@pbs1:/var/log/glusterfs
770 $ gluster volume rebalance gli status
Node       Rebalanced-files  size   scanned  failures  status
---------  ----------------  -----  -------  --------  -----------
localhost  0                 0      0        0         in progress
pbs4ib     0                 0      0        0         not started
pbs2ib     1380547324969 76863                         completed
pbs3ib     0                 0      0        0         not started

(The above has the leading 32 blanks trimmed from the output - is there a reason for including those in the output?)

The above implies that it is at least partially in progress, but after stopping it:

Thu Aug 23 10:53:26 [0.00 0.00 0.00] root@pbs1:/var/log/glusterfs
774 $ gluster volume rebalance gli stop
Node       Rebalanced-files  size   scanned  failures  status
---------  ----------------  -----  -------  --------  -----------
localhost  0                 0      0        0         in progress
pbs4ib     0                 0      0        0         not started
pbs2ib     1380547324969 76863                         completed
pbs3ib     0                 0      0        0         not started
Stopped rebalance process on volume gli

it still seems to be going:

Thu Aug 23 10:53:28 [0.00 0.00 0.00] root@pbs1:/var/log/glusterfs
775 $ gluster volume rebalance gli status
Node       Rebalanced-files  size   scanned  failures  status
---------  ----------------  -----  -------  --------  -----------
localhost  0                 0      0        0         in progress
pbs4ib     0                 0      0        0         not started
pbs2ib     1380547324969 76863                         completed
pbs3ib     0                 0      0        0         not started

Examining the server nodes, only pbs1 (localhost in the above output) had glusterfs running, which may have been 'orphaned' when I had the glusterfsd hiccups and has been hanging since that time. However, when I killed it, nothing changed. gluster still reports that the rebalance is in progress (even tho no glusterfs's are running on any of the nodes). If I try to reset it with a 'start force':

Thu Aug 23 11:14:39 [0.06 0.04 0.00] root@pbs1:/var/log/glusterfs
789 $ gluster volume rebalance gli start force
Rebalance on gli is already started

and the status remains exactly as above. From the clients' POV, all seems to be fine, but I've got a hanging rebalance that is both annoying and worrying. Is there a way to reset this smoothly, or does it require a server restart?

hjm

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
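Absent a cleaner knob, the usual way out of a wedged 3.3 rebalance seems to be restarting the management daemons - a hedged sketch for these Ubuntu servers (restarting glusterd should not touch the brick glusterfsd processes or client mounts):

gluster volume rebalance gli stop       # even if status still claims 'in progress'
service glusterfs-server restart        # on each server; restarts glusterd only
gluster volume rebalance gli status     # should now show stopped / not started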
[Gluster-users] files on gluster brick that have '---------T' designation.
I have a working but unbalanced gluster config where one brick has about 2X the usage of the 3 others. I started a remove-brick to force a resolution of this problem (thanks to JD for the help!), but it's going very slowly - about 2.2MB/s over DDR IPoIB, or ~2.3 files/s. In investigating the problem, I may have found a partial explanation - I have found 100s of thousands (maybe millions) of zero-length files existing on the problem brick that do not exist in the client view, and that have the designation '---------T' via 'ls -l', ie:

/bducgl/alamng/Research/Yuki/newF20/runF20_2513/data:
total 0
---------T 2 root root 0 2012-08-04 11:23 backward_sm1003
---------T 2 root root 0 2012-08-04 11:23 backward_sm1007
---------T 2 root root 0 2012-08-04 11:23 backward_sm1029

I suspect that these are the ones that are responsible for the enormous expansion of the storage space on this brick and the very slow speed of the 'remove-brick' operation. Does this sound possible? Can I delete these files on the brick to resolve the imbalance? If not, is there a way to process them in some better way to rationalize the imbalance?

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
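Link files are identifiable on a brick by the sticky bit, the zero size, and a trusted.glusterfs.dht.linkto xattr naming the subvolume that holds the real data; a sketch for counting and inspecting them (run against the brick, not the client mount):

# count candidate link files on this brick
find /bducgl -type f -perm -1000 -size 0 | wc -l

# confirm that one really is a dht link file
getfattr -n trusted.glusterfs.dht.linkto \
    /bducgl/alamng/Research/Yuki/newF20/runF20_2513/data/backward_sm1003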
Re: [Gluster-users] files on gluster brick that have '---------T' designation.
Hi Shishir, Thanks for your attention. Hmm - your explanation makes some sense, but those 'T' files don't show up in the client view of the dir - only in the brick view. Is that valid?

I'm using 3.3 on 4 ubuntu 12.04 servers over DDR IPoIB, and the command to initiate the remove-brick was:

$ gluster volume remove-brick gli pbs3ib:/bducgl start

and the current status is:

$ gluster volume remove-brick gli pbs3ib:/bducgl status
Node       Rebalanced-files  size          scanned  failures  status
---------  ----------------  ------------  -------  --------  -----------
localhost  0                 0             137702   21406     stopped
pbs2ib     0                 0             168991   6921      stopped
pbs3ib     724683            594890945282  4402804  0         in progress
pbs4ib     0                 0             169081   7923      stopped

(the failures were the same as were seen when I tried the rebalance command previously).

Best, harry

On Mon, Aug 20, 2012 at 7:09 PM, Shishir Gowda sgo...@redhat.com wrote:
Hi Harry, These are valid files in glusterfs-dht xlator configured volumes. These are known as link files, which dht uses to maintain files on the hashed subvol when the actual data resides in non-hashed subvolumes (renames can lead to these). The cleanup of these files will be taken care of by running rebalance. Can you please provide the gluster version you are using, and the remove-brick command you used?

With regards, Shishir

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] files on gluster brick that have '---------T' designation.
Hi Shishir, Here's the 'df -h' of the appropriate filesystem on all 4 of the gluster servers. It has equilibrated a bit since the original post - pbs3 has decreased from 73% and the others have increased from about 29% - but still, slow.

pbs1:/dev/sdb    6.4T  2.0T  4.4T  32%  /bducgl
pbs2:/dev/md0    8.2T  2.9T  5.4T  35%  /bducgl
pbs3:/dev/md127  8.2T  5.3T  3.0T  65%  /bducgl
pbs4:/dev/sda    6.4T  2.2T  4.3T  34%  /bducgl

The 'errors-only' extract of the log (since the remove-brick was started) is here: http://moo.nac.uci.edu/~hjm/gluster/remove-brick_errors.log.gz (2707 lines) and the last 100 lines of the active log (gli-rebalance.log) are here: http://pastie.org/4559913 Thanks for your help. Harry

On Mon, Aug 20, 2012 at 7:42 PM, Shishir Gowda sgo...@redhat.com wrote: Hi Harry, That is correct, the files won't be seen on the client. Can you provide an output of these: 1. df of all exports 2. the remove-brick/rebalance (volname-rebalance.log) log (if large, just the failure messages and the tail of the file). With regards, Shishir
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] files on gluster brick that have '---------T' designation.
All the bricks are 3.3, and all the bricks were started via starting glusterd on each of them and then peer-probing etc. The initial reason for starting this fix-layout/rebalance/remove-brick was a CPU overload on the pbs3 brick (load of 30 on a 4-CPU server) that was dramatically decreasing performance. I killed glusterd, restarted it, and checked with 'gluster peer status' that it had re-established connection; that implied that all 4 peers were connected. I didn't find out until later, via the 'gluster volume status VOLUME detail' command, that this was incorrect. So this peer has been hashed and thrashed somewhat (and miraculously is still serving files), but in the process has gone out of proper balance with the other peers.

It sounds like you're saying that this:

Node       Rebalanced-files          size   scanned  failures       status
---------  ----------------  ------------  --------  --------  -----------
localhost                 0             0    137702     21406      stopped
pbs2ib                    0             0    168991      6921      stopped
pbs3ib               724683  594890945282   4402804         0  in progress
pbs4ib                    0             0    169081      7923      stopped

implies that the other peers are not participating in the remove-brick? The change in storage across the servers implies that they are participating, just very slowly. On the other hand, the last of the errors stopped 2 days ago (there are no more errors in the last 350MB of the rebalance logs), which also implies that the rest of the files are being migrated, just very slowly. At any rate, if you've diagnosed the problem, what is the solution? A cluster-wide glusterd restart to sync the uuids? Or is there another way to re-identify them to each other? Best, Harry

On Mon, Aug 20, 2012 at 9:06 PM, Shishir Gowda sgo...@redhat.com wrote: Hi Harry, Are all the bricks from 3.3? Or did you start any of the bricks manually (not through gluster volume commands)? remove-brick/rebalance processes are started across all nodes (1 per node) of the volume. We use the node-uuid to distribute work across nodes, so migration is handled by all the nodes to which the data belongs. In your case, there are errors being reported that the node-uuid is not available. With regards, Shishir
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
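(A hedged way to eyeball the uuid bookkeeping Shishir describes, assuming glusterd's usual state directory: each server's own UUID is recorded in glusterd.info, and every peer it knows about is a file named by that peer's UUID - run on each server and compare:)

$ grep UUID /var/lib/glusterd/glusterd.info
$ ls /var/lib/glusterd/peers/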
Re: [Gluster-users] Problem mounting Gluster volume [3.3]
On Tue, Aug 14, 2012 at 7:56 AM, Paolo Di Tommaso paolo.ditomm...@gmail.com wrote: [2012-08-14 10:42:05.550471] E [socket.c:1715:socket_connect_finish] 0-glusterfs: connection to failed (No route to host)

ping uses ICMP whereas the mount command uses TCP, and they can use different routing (as I was just taught by my network admins). Does ssh work? I believe that a gluster mount has to resolve by both forward and reverse DNS (like ssh... well, to work without spewing warnings). Try a traceroute to the server and see if its IP # resolves to the same host. Are the rest of the gluster hosts resolvable in the same way? Do you have old /etc/hosts entries that might interfere with the DNS resolution? hjm
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
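(A quick sketch of the forward/reverse check suggested above, using 'bs1' as a stand-in hostname:)

$ host bs1                                            # forward lookup
$ host $(host bs1 | awk '/has address/ {print $4}')   # reverse lookup should name the same host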
Re: [Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;
Thanks for your comments. I use mdadm on many servers and I've seen md numbering like this a fair bit. Usually it occurs after another RAID has been created and the numbering shifts. Neil Brown (mdadm's author) seems to think it's fine, so I don't think that's the problem. And you're right - this is a Frankengluster made from a variety of chassis and controllers, and normally it's fine. As Brian noted, it's all the same to gluster, mod some small local differences in IO performance. Re the size difference, I'll explicitly rebalance the brick after the fix-layout finishes, but I'm even more worried about this fantastic increase in CPU usage and its effect on user performance. In the fix-layout routines (still running), I've seen CPU usage of glusterfsd rise to ~400% and loadavg go up to 15 on all the servers (except pbs3, the one that originally had that problem). That high load does not last long tho (maybe a few minutes) - we've just installed nagios on these nodes and I'm getting a ton of emails about load increasing and then decreasing on all the nodes (except pbs3). When the load goes very high on a server node, the user-end performance drops appreciably. hjm

On Sat, Aug 11, 2012 at 4:20 AM, Brian Candler b.cand...@pobox.com wrote: On Sat, Aug 11, 2012 at 12:11:39PM +0100, Nux! wrote: On 10.08.2012 22:16, Harry Mangalam wrote: pbs3:/dev/md127 8.2T 5.9T 2.3T 73% /bducgl <--- Harry, The name of that md device (127) indicates there may be something dodgy going on there. A device shouldn't be named 127 unless some problems occurred. Are you sure your drives are OK?

I have systems with /dev/md127 all the time, and there's no problem. It seems to number downwards from /dev/md127 - if I create another md array on the same system it is /dev/md126. However, this does suggest that the nodes are not configured identically: two are /dev/sda or /dev/sdb, which suggests either plain disk or hardware RAID, while two are /dev/md0 or /dev/md127, which is software RAID. Although this could explain performance differences between the nodes, this is transparent to gluster and doesn't explain why the files are unevenly balanced - unless there is one huge file which happens to have been allocated to this node. Regards, Brian.
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;
On Sat, Aug 11, 2012 at 9:41 AM, Brian Candler b.cand...@pobox.com wrote: On Sat, Aug 11, 2012 at 08:31:51AM -0700, Harry Mangalam wrote: Re the size difference, I'll explicitly rebalance the brick after the fix-layout finishes, but I'm even more worried about this fantastic increase in CPU usage and its effect on user performance.

This presumably means you were originally running the cluster with fewer nodes, and then added some later?

No, but the unbalanced current situation suggests that at some point, it got out of balance.

In the fix-layout routines (still running), I've seen CPU usage of glusterfsd rise to ~400% and loadavg go up to 15 on all the servers (except the pbs3, the one that originally had that problem). That high load does not last long tho (maybe a few minutes) - we've just installed nagios on these nodes and I'm getting a ton of emails about load increasing and then decreasing on all the nodes (except pbs3). When the load goes very high on a server node, the user-end performance drops appreciably.

Maybe worth trying an strace (strace -f -p <pid> 2>strace.out) on the glusterfsd process, or whatever it is which is causing the high load, during such a burst, just for a few seconds. The output might give some clues.

Good idea. I'll watch, and when it goes wacko I'll post the filtered results. Thanks Harry
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
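(One hedged way to do that sampling without drowning in output, assuming GNU strace and coreutils' timeout are available; -c prints a syscall-count summary instead of logging every call:)

$ timeout 10 strace -c -f -p "$(pgrep -o glusterfsd)" 2> strace-summary.txt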
[Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;
running 3.3 distributed on IPoIB on 4 nodes, 1 brick per node. Any idea why, on one of those nodes, glusterfsd would go berserk, running up to 370% CPU and driving load to 30 (file performance on the clients slows to a crawl)? While very slow, it continued to serve out files. This is the second time this has happened in about a week. I had turned on the gluster nfs services, but wasn't using it when this happened. It's now off. kill -HUP did nothing to either glusterd or glusterfsd, so I had to kill both and restart glusterd. That solved the overload on glusterfsd and performance is back to near normal. I'm now doing a rebalance/fix-layout which is running as expected, but will take the weekend to complete. I did notice that the affected node (pbs3) has more files than the others, tho I'm not sure that this is significant.

Filesystem       Size  Used  Avail  Use%  Mounted on
pbs1:/dev/sdb    6.4T  1.9T   4.6T   29%  /bducgl
pbs2:/dev/md0    8.2T  2.4T   5.9T   30%  /bducgl
pbs3:/dev/md127  8.2T  5.9T   2.3T   73%  /bducgl  <---
pbs4:/dev/sda    6.4T  1.8T   4.6T   29%  /bducgl
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Unable to rebalance...status or stop after upgrade to 3.3
This sounds similar, tho not identical, to a problem that I had recently (described here: http://gluster.org/pipermail/gluster-users/2012-August/011054.html). My problems were the result of starting this kind of rebalance with a server node appearing to be connected (via the 'gluster peer status' output) but not actually being connected, as shown by the 'gluster volume status all detail' output. Note especially the part that describes its online state:

------------------------------------------------------------------------------
Brick            : Brick pbs3ib:/bducgl
Port             : 24018
Online           : N   <=====
Pid              : 20953
File System      : xfs

You may have already verified this, but what I did was to start a rebalance / fix-layout with a disconnected brick and it went ahead and tried to do it, unsuccessfully as you might guess. But when I finally was able to reconnect the downed brick and restart the rebalance, it (astonishingly) was able to bring everything back. So props to the gluster team. hjm

On Wed, Aug 8, 2012 at 11:58 AM, Dan Bretherton d.a.brether...@reading.ac.uk wrote: Hello All- I have noticed another problem after upgrading to version 3.3. I am unable to do "gluster volume rebalance VOLUME fix-layout status" or "...fix-layout ... stop" after starting a rebalance operation with "gluster volume rebalance VOLUME fix-layout start". The fix-layout operation seemed to be progressing normally on all the servers according to the log files, but all attempts to do status or stop result in the CLI usage message being returned. The only references to the rebalance commands in the log files were these, which all the servers seem to have one or more of.

[root@romulus glusterfs]# grep rebalance *.log
etc-glusterfs-glusterd.vol.log:[2012-08-08 12:49:04.870709] W [socket.c:1512:__socket_proto_state_machine] 0-management: reading from socket failed. Error (Transport endpoint is not connected), peer (/var/lib/glusterd/vols/tracks/rebalance/cb21050d-05c2-42b3-8660-230954bab324.sock)
tracks-rebalance.log:[2012-08-06 10:41:18.550241] I [graph.c:241:gf_add_cmdline_options] 0-tracks-dht: adding option 'rebalance-cmd' for volume 'tracks-dht' with value '4'

The volume name is tracks by the way. I wanted to stop the rebalance operation because it seemed to be causing a very high load on some of the servers and had been running for several days. I ended up having to manually kill the rebalance processes on all the servers followed by restarting glusterd. After that I found that one of the servers had rebalance_status=4 in file /var/lib/glusterd/vols/tracks/node_state.info, whereas all the others had rebalance_status=0. I manually changed the '4' to '0' and restarted glusterd. I don't know if this was a consequence of the way I had killed the rebalance operation or the cause of the strange behaviour. I don't really want to start another rebalance going to test because the last one was so disruptive. Has anyone else experienced this problem since upgrading to 3.3? Regards, Dan.
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
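(To see whether the servers agree, a hedged loop over the node_state.info files mentioned above - the host names are placeholders for your own servers, and 'tracks' is Dan's volume name:)

$ for h in server1 server2 server3 server4; do
    ssh "$h" grep rebalance_status /var/lib/glusterd/vols/tracks/node_state.info
  done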
Re: [Gluster-users] brick online or not? Don't trust 'gluster peer status'
As a final(?) follow-up to my problem: after restarting the rebalance with

gluster volume rebalance [vol-name] fix-layout start

it finished up last night after plowing thru the entirety of the filesystem - fixing about ~1M files (apparently ~2.2TB), all while the fs remained live (tho probably a bit slower than users would have liked). That's a strong '+' in the gluster column for resiliency. I started the rebalance without waiting for any advice to the contrary. 3.3 is supposed to have a built-in rebalance operator, but I saw no evidence of it. Other info from gluster.org suggested that it wouldn't do any harm to do this, so I went ahead and started it. Do the gluster wizards have any final words on this before I write this up in our trouble report? best wishes harry
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Gluster-users Digest, Vol 51, Issue 46
Hi Ben, Thanks for the expert advice.

On Fri, Aug 3, 2012 at 2:35 PM, Ben England bengl...@redhat.com wrote: 4. Re: kernel parameters for improving gluster writes on millions of small writes (long) (Harry Mangalam) Harry, You are correct, Glusterfs throughput with small write transfer sizes is a client-side problem, here are workarounds that at least some applications could use.

Not to be impertinent nor snarky, but why is the gluster client written in this way, and is that a high priority for fixing? It seems that caching/buffering is one of the great central truths of computer science in general. Is there a countering argument for not doing this?

1) NFS client is one workaround, since it buffers writes using the kernel buffer cache.

Yes, I tried this and I find the same thing. One thing I am unclear about tho is whether you can set up and run 1 NFS server per gluster server node. ie my glusterfs runs on 4 servers - could I connect clients to each one using a round-robin selection or other load/bandwidth-balancing approach? I've read opinions that seem to support both yes and no.

2) If your app does not have a configurable I/O size, but it lets you write to stdout, you can try piping your output to stdout and letting dd aggregate your I/O to the filesystem for you. In this example we triple single-thread write throughput for 4-KB I/O requests.

I agree again - I wrote this up for the gluster 'hints' http://goo.gl/NyMXO using gzip (other utilities seem to work as well, as do named pipes for handling more complex output options). [nice examples deleted]

3) If your program is written in C and it uses stdio.h, you can probably do a setvbuf() C RTL call to increase buffer size to something greater than 8 KB, which is the default in gcc-4.4. http://en.cppreference.com/w/c/io/setvbuf

Most of our users are not programmers, so this is not an option in most cases.
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
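(A concrete, hedged version of Ben's dd trick, using the burp.pl test script from later in this thread; obs sets dd's output block size, so it gathers the app's tiny writes into 1MB writes before they hit the gluster mount. Paths are illustrative.)

$ ./burp.pl 100 | dd obs=1M of=/gl/hmangala/burp.out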
Re: [Gluster-users] Change NFS parameters post-start
Thanks, Joe, for this (and other help on the IRC). Yes, I did check this, and no, it's not running. Harry

On Fri, Aug 3, 2012 at 4:26 PM, Joe Julian j...@julianfamily.org wrote: You also have to ensure that the kernel nfs server isn't running.
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] brick online or not? Don't trust 'gluster peer status'
Further to what I wrote before: gluster server overload; recovers, now Transport endpoint is not connected for some files http://goo.gl/CN6ud

I'm getting conflicting info here. On one hand, the peer that had its glusterfsd lock up seems to be in the gluster system, according to the frequently referenced 'gluster peer status':

Thu Aug 02 15:48:46 [1.00 0.89 0.92] root@pbs1:~
729 $ gluster peer status
Number of Peers: 3

Hostname: pbs4ib
Uuid: 2a593581-bf45-446c-8f7c-212c53297803
State: Peer in Cluster (Connected)

Hostname: pbs2ib
Uuid: 26de63bd-c5b7-48ba-b81d-5d77a533d077
State: Peer in Cluster (Connected)

Hostname: pbs3ib
Uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42
State: Peer in Cluster (Connected)

On the other hand, some errors that I provided yesterday:
===
[2012-08-01 18:07:26.104910] W [dht-selfheal.c:875:dht_selfheal_directory] 0-gli-dht: 1 subvolumes down -- not fixing
===
as well as this information:

$ gluster volume status all detail
[top 2 brick stanzas trimmed; they're online]
------------------------------------------------------------------------------
Brick            : Brick pbs3ib:/bducgl
Port             : 24018
Online           : N   <=====
Pid              : 20953
File System      : xfs
Device           : /dev/md127
Mount Options    : rw
Inode Size       : 256
Disk Space Free  : 6.1TB
Total Disk Space : 8.2TB
Inode Count      : 1758158080
Free Inodes      : 1752326373
------------------------------------------------------------------------------
Brick            : Brick pbs4ib:/bducgl
Port             : 24009
Online           : Y
Pid              : 20948
File System      : xfs
Device           : /dev/sda
Mount Options    : rw
Inode Size       : 256
Disk Space Free  : 4.6TB
Total Disk Space : 6.4TB
Inode Count      : 1367187392
Free Inodes      : 1361305613

The above implies fairly strongly that the brick did not re-establish connection to the volume, altho the gluster peer info did. Strangely enough, when I RE-restarted the glusterd, it DID come back and re-joined the gluster volume, and now the (restarted) fix-layout job is proceeding without those 'subvolumes down -- not fixing' errors, just a steady stream of 'found anomalies/fixing the layout' messages, tho at the rate that it's going it looks like it will take several days. Still better several days to fix the data on-disk and having the fs live than having to tell users that their data is gone and then having to rebuild from zero. Luckily, it's officially a /scratch filesystem. Harry
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
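(The moral in one line - check the bricks, not just the peers; a hedged one-liner over the same 'detail' output:)

$ gluster volume status all detail | grep -E 'Brick|Online'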
[Gluster-users] gluster server overload; recovers, now Transport endpoint is not connected for some files
/benchmarks/ALPBench/Face_Rec/data/EBGM_CSUNG
[2012-08-01 18:04:31.462275] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/EBGM_CSU_FG
[2012-08-01 18:04:31.778421] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/csuScrapShots
[2012-08-01 18:04:31.885009] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/csuScrapShots/normSep2002sfi
[2012-08-01 18:04:32.337981] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/csuScrapShots/source
[2012-08-01 18:04:32.441383] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/csuScrapShots/source/pgm
[2012-08-01 18:04:32.558827] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/faceGraphsWiskott
[2012-08-01 18:04:32.617823] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/novelGraphsWiskott

Unfortunately, I'm also seeing this:

[2012-08-01 18:07:26.104859] I [dht-layout.c:593:dht_layout_normalize] 0-gli-dht: found anomalies in /nlduong/benchmarks/SPEC2K6-org/benchspec/CPU2006/403.gcc/data/test/input. holes=1 overlaps=0
[2012-08-01 18:07:26.104910] W [dht-selfheal.c:875:dht_selfheal_directory] 0-gli-dht: 1 subvolumes down -- not fixing
[2012-08-01 18:07:26.104996] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/SPEC2K6-org/benchspec/CPU2006/403.gcc/data/test/input
[2012-08-01 18:07:26.189403] I [dht-layout.c:593:dht_layout_normalize] 0-gli-dht: found anomalies in /nlduong/benchmarks/SPEC2K6-org/benchspec/CPU2006/403.gcc/data/test/output. holes=1 overlaps=0
[2012-08-01 18:07:26.189457] W [dht-selfheal.c:875:dht_selfheal_directory] 0-gli-dht: 1 subvolumes down -- not fixing

which implies that some of the errors are not fixable. Is there a best-practices solution for this problem? I suspect this is one of the most common problems to affect an operating gluster fs. hjm
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] Change NFS parameters post-start
In trying to convert clients from using the gluster native client to an NFS client, I'm trying to get the gluster volume mounted on a test mount point on the same client where the native client has mounted the volume. The client refuses with the error:

mount -t nfs bs1:/gl /mnt/glnfs
mount: bs1:/gl failed, reason given by server: No such file or directory

In looking at the gluster nfs.log, it looks like the nfs volume was mounted RDMA-only, which is odd, seeing that 3.3-1 does not fully support RDMA: the nfs.log is 99% these failure messages:

E [rdma.c:4458:tcp_connect_finish] 0-gl-client-2: tcp connect to etc

but the few messages that aren't these reveal that:

1: volume gl-client-0
2:   type protocol/client
3:   option remote-host bs2
4:   option remote-subvolume /raid1
5:   option transport-type rdma   <-
6:   option username a2994eef-60d6-4609-a6d1-8d760cf82424
7:   option password bbf8e05d-6ada-4371-99d0-09b4c55cc899
8: end-volume

The volume was created tcp,rdma (before I realized that rdma was temporarily deprecated):

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: off
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

and the gluster clients talk to it just fine over IPoIB. But the NFS client apparently insists on trying to use RDMA, which isn't being used. I didn't originally ask for or want the NFS subsystem and later turned it off (nfs.disable = on), but now I want to use it, and I'd like to be able to tell it to use sockets/TCP. Is there a way to do this after the fact? That is, without destroying and re-creating the current volume as tcp-only. I have another gluster FS where the transport type is set to tcp and it's working fine under NFS:

1: volume gli-client-0
2:   type protocol/client
3:   option remote-host pbs1ib
4:   option remote-subvolume /bducgl
5:   option transport-type tcp
6:   option username c173a866-a561-4da9-b977-93f8df4766a1
7:   option password 09480722-0b0f-4b41-bc73-9970fe129d27
8: end-volume

hjm
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
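(One hedged thing to try from the client side - forcing both the mount protocol and the transfer protocol to TCP, along the lines of the options I use for the gli volume elsewhere in this thread; note this only pins the client-to-NFS-server leg, not whatever transport the gluster nfs server uses toward the bricks:)

$ mount -t nfs -o mountproto=tcp,proto=tcp,vers=3 bs1:/gl /mnt/glnfs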
Re: [Gluster-users] kernel parameters for improving gluster writes on millions of small writes (long)
Hi Bryan, thanks for the suggestion. In fact, we're using XFS for the underlying filesystem (under 3ware controllers) and it was tuned (or at least I thought it was) for large files. We do get decent perf on large-file reads and writes as long as the writes are fairly large. I'll post my controller and XFS settings to see if they seem odd. I was experimenting more last night after the latest revelations and discovered some more things that may be illuminating. I wrote 2 tiny perl scripts: one that wrote ~400MB in short writes (burp) and another (bigburp) that created the same-sized string in-memory and then wrote it in one write. The burp script took about 4 times as long to write to a file on a gluster fs (and sync) as did the bigburp script. So, if the writes are as I described previously (individual writes of ~100 bytes), the performance is very poor (and the gluster process is driven very high - 100% for several seconds, I'm assuming due to queued instructions). If the same amount of data is written in a single write, the performance is pretty good, and while the gluster process goes high, it doesn't exceed about 60% and it lasts only a few sec. Why should this be? Why should Linux file caching care if the data to be written is the result of a single write or the result of lots of writes (other than the function call overhead - would that explain it?)? I can test that with oprofile, but it doesn't explain why the gluster process takes so much longer to process one than the other. From its POV, it should just be data, regardless of where it came from. Or am I missing some critical point? If it does matter that it's not just the size of the files but the way they are created that has a large effect on gluster write performance, then gluster (or at least the native gluster client) will not be appropriate for a lot of bioinformatics apps, many of which use this kind of write profile. hjm

On Thu, Jul 26, 2012 at 6:23 AM, Washer, Bryan bwas...@netsuite.com wrote: Harry, Just a question, but what file system are you using under the gluster system? You may need to tune that before you continue to try and tune the output system. I found that by using the xfs file system and tuning it for very large files I was able to improve my performance quite a bit. In this case though I was working with a lot of big files, so my tuning would not help you... but just wanted to make sure you had looked at this detail in your setup. Bryan
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] kernel parameters for improving gluster writes on millions of small writes (long)
I had not, tho I had searched for something like this for a good bit yesterday(?!) Back to google class for me. Thanks very much! hjm

On Thu, Jul 26, 2012 at 8:07 AM, John Mark Walker johnm...@redhat.com wrote: Harry, Have you seen this post? http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/ Be sure and read all the comments, as Ben England chimes in on the comments, and he's one of the performance engineers at Red Hat. -JM

----- Harry Mangalam hjmanga...@gmail.com wrote: This is a continuation of my previous posts about improving write perf when trapping millions of small writes to a gluster filesystem. I was able to improve write perf by ~30x by running STDOUT thru gzip to consolidate and reduce the output stream. Today, another similar problem, having to do with yet another bioinformatics program (which these days typically handle the 'short reads' that come out of the majority of sequencing hardware, each read being 30-150 characters, with some metadata, typically in an ASCII file containing millions of such entries). Reading them doesn't seem to be a problem (at least on our systems) but writing them is quite awful. The program is called 'art_illumina' from the Broad Inst's 'ALLPATHS' suite and it generates an artificial Illumina data set from an input genome - in this case about 5GB of the type of data described above. Like before, the gluster process goes to 100% and the program itself slows to ~20-30% of a CPU. In this case, the app's output cannot be externally trapped by redirecting thru gzip since the output flag specifies the base filename for 2 files that are created internally and then written directly. This prevents even setting up a named pipe to trap and process the output. Since this gluster storage was set up specifically for bioinformatics, this is a repeating problem, and while some of the issues can be dealt with by trapping and converting output, it would be VERY NICE if we could deal with it at the OS level. The gluster volume is running over IPoIB on QDR IB and looks like this:

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

I've tried to increase every caching option that might improve this kind of performance, but it doesn't seem to help. At this point, I'm wondering whether changing the client (or server) kernel parameters will help. The client's meminfo is:

cat /proc/meminfo
MemTotal:       529425924 kB
MemFree:        241833188 kB
Buffers:           355248 kB
Cached:         279699444 kB
SwapCached:             0 kB
Active:           2241580 kB
Inactive:       278287248 kB
Active(anon):      190988 kB
Inactive(anon):    287952 kB
Active(file):     2050592 kB
Inactive(file): 277999296 kB
Unevictable:        16856 kB
Mlocked:            16856 kB
SwapTotal:      563198732 kB
SwapFree:       563198732 kB
Dirty:               1656 kB
Writeback:              0 kB
AnonPages:         486876 kB
Mapped:             19808 kB
Shmem:                164 kB
Slab:             1475476 kB
SReclaimable:     1205944 kB
SUnreclaim:        269532 kB
KernelStack:         5928 kB
PageTables:         27312 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:    827911692 kB
Committed_AS:      536852 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      1227732 kB
VmallocChunk:   33888774404 kB
HardwareCorrupted:      0 kB
AnonHugePages:     376832 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
DirectMap4k:       201088 kB
DirectMap2M:     15509504 kB
DirectMap1G:    521142272 kB

and the server's meminfo is:

$ cat /proc/meminfo
MemTotal:        32861400 kB
MemFree:          1232172 kB
Buffers:            29116 kB
Cached:          30017272 kB
SwapCached:            44 kB
Active:          18840852 kB
Inactive:        11772428 kB
Active(anon):      492928 kB
Inactive(anon):     75264 kB
Active(file):    18347924 kB
Inactive(file): 11697164 kB
Unevictable:            0 kB
Mlocked:                0 kB
SwapTotal:       16382900 kB
SwapFree:        16382680 kB
Dirty:                  8 kB
Writeback:              0 kB
AnonPages:         566876 kB
Mapped:             14212 kB
Shmem:               1276 kB
Slab:              429164 kB
SReclaimable:      324752 kB
SUnreclaim:        104412 kB
KernelStack:         3528 kB
PageTables:         16956 kB
Re: [Gluster-users] kernel parameters for improving gluster writes on millions of small writes (long)
I read and am still digesting the kernel tuning parameters mentioned in John's link. There's another useful link that expands on some of the same points here: The Linux Page Cache and pdflush: Theory of Operation and Tuning for Write-Heavy Loads http://www.westnet.com/~gsmith/content/linux-pdflush.htm However, while I digest them, I have a few more observations: It's not that the server is slow; it's the gluster native client that is. So I'm not sure that increasing the perf of the server will help much at this point. I wrote a tiny script (burp.pl) that just emits lots of short strings to stdout, like the problem app that originated this discussion (and a colleague did the same with a C++ app), to verify. If I send stdout to my gluster fs via the native gluster client, I observe a steady stream of data at about 14MB/s (this is on a DDR/IPoIB cluster):

$ time `./burp.pl 100 > /gl/hmangala/burp.out; sync`

real    0m29.646s
user    0m17.830s
sys     0m2.000s

In this case, burp.pl is only getting about 70% of a CPU and the gluster process is getting ~40%. Here's the ifstat output for the IB channel (~1 entry/s). Note the continuous data-out rate of about 14MB/s (and the odd input rate of about 1MB/s).

ib1  KB/s in  KB/s out
        0.00      0.00
        0.00      0.00
      383.34   5200.51   <- burp starts
     1039.43  14243.11
     1031.59  14132.11
     1037.36  14223.32
     1044.20  14304.81
     1040.40  14288.45
     1037.78  14217.64
     1042.19  14306.66
     1036.54  14200.05
     1062.26  14699.87
     1072.64  14711.29
     1072.87  14694.52
     1065.18  14608.67
     1074.23  14711.32
     1073.26  14711.43
     1069.79  14672.60
     1066.66  14608.58
     1067.68  14647.14
     1074.16  14711.48
     1069.16  14651.39
     1077.19  14767.32
     1075.74  14736.75
     1068.77  14634.86
     1066.81  14625.90
     1063.89  14586.81
     1064.79  14608.46
     1065.37  14583.04
     1065.10  14604.44
     1063.86  14591.14
      388.41   5323.84   <- burp ends
        0.00      0.00
     ---------------------
    30460.65 417607.51   totals (30MB input vs 417MB output)

For the NFS-mounted channel (mount command: mount -o mountproto=tcp,vers=3,noatime,auto -t nfs pbs1ib:/gli /mnt/glnfs):

$ time `./burp.pl 100 > /mnt/glnfs/hmangala/burp.out; sync`

real    0m24.704s   <- a little faster
user    0m20.710s
sys     0m0.810s

In this case burp.pl gets 100% of a CPU; gluster isn't involved and so doesn't register. Here's the ifstat output for the IB channel. Note the complete lack of input and no data output until the very end, when it bursts at ~140MB/s.

ib1  KB/s in  KB/s out
        0.00      0.00
        0.73      0.00   <- burp starts
        0.18      0.00
        0.00      0.00
        1.33      1.88
        0.00      0.00
        0.00      0.00
        0.04      0.00
        0.04      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.04      1.29
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
      314.08  83002.70
      517.10  142239.6
      513.94  141469.4
      123.96  33219.89   <- burp ends
        0.04      0.00
     ---------------------
     1471.44 399934.76   Totals (1.47MB input vs 400MB output)

It's hard to argue with that. NFS is clearly superior / more efficient on a single process and may be more efficient overall for the use cases on our clusters. So why doesn't the gluster native client do client-side caching like NFS? It looks like it's explicitly refusing to be cached by the usual (and usually excellent) Linux mechanisms. What's the reason for declining this OS advantage on the client side while providing such a technically sweet solution on the server side? I'm at a loss to explain this behavior to our technical group.

[previous deleted]
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] temp fix: Simultaneous reads and writes from specific apps to IPoIB volume seem to conflict and kill performance.
The problem described in the subject appears NOT to be the case. It's not that simultaneous reads and writes dramatically decrease perf, but that the type of /writes/ being done by this app (bedtools) kills performance. If this was a self-writ app or an infrequently used one, I wouldn't bother writing this up, but bedtools is a fairly popular genomics app, and since many installations use gluster to host Next-Gen sequencing data and analysis, I thought I'd follow up on my own post.

The short version:
==================
Insert gzip to compress and stream the data before sending it to the gluster fs. The improvement in IO (and application) performance is dramatic. ie (all files on a gluster fs):

genomeCoverageBed -ibam RS_11261.bam -g \
  ref/dmel-all-chromosome-r5.1.fasta -d | gzip > output.cov.gz

Inserting the '| gzip' increased the app speed by more than 30X (relative to not using it on a gluster fs; however, it even improved the wall-clock speed of the app relative to running on a local filesystem by about 1/3), decreased the gluster CPU utilization by ~99%, and reduced the output size by 80%. So, wins all round.

The long version:
==================
The type of writes that bedtools does is also fairly common - lots of writes of tiny amounts of data. As I understand it (which may be wrong; please correct), the gluster native client (which we're using) does not buffer IO as well as the NFS client, which is why we frequently see complaints about gluster vs NFS perf. The apparent problem for bedtools is that these zillions of tiny writes are being handled separately, or at least not cached well enough to be consolidated into large writes. To present the data to gluster as a continuous stream instead of these tiny writes, they have to be 'converted' to such a stream. gzip is a nice solution because it compresses as it converts. Apparently anything that takes STDIN, buffers it appropriately, and then spits it out on STDOUT will work. Even piping the data thru 'cat' will work to allow bedtools to continue to run at 100%, tho it will increase the gluster CPU utilization to 90%. 'cat' of course uses less CPU (~14%) while gzip will use more (~60%), tho decreasing gluster's use enormously. I did try the performance options I mentioned earlier:

performance.write-behind-window-size: 1024MB
performance.flush-behind: on

They did not seem to help at all, and I'd still like an explanation of what they're supposed to do. The upshot is that this seems like, if not a bug, then at least an opportunity to improve gluster performance considerably.
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
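(For tools that insist on writing to a file path rather than stdout, the named-pipe variant mentioned elsewhere in this thread can sometimes substitute - a hedged sketch, where 'someapp -o' is a stand-in for such a tool and the paths are illustrative; note it won't help apps like art_illumina that derive several output names from one base:)

$ mkfifo /tmp/cov.pipe
$ gzip < /tmp/cov.pipe > output.cov.gz &    # reader compresses and streams to the gluster fs
$ someapp -o /tmp/cov.pipe input.bam        # 'someapp -o' is hypothetical
$ wait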
[Gluster-users] Simultaneous reads and writes from specific apps to IPoIB volume seem to conflict and kill performance.
I have a fairly new gluster fs of 4 nodes with 2 RAID6 bricks on each node, connected to a cluster via IPoIB on QDR IB. The servers are all SL6.2, running gluster 3.3-1; the clients are running the gluster-released glusterfs-fuse-3.3.0qa42-1 and glusterfs-3.3.0qa42-1. The volume seems normal:

$ gluster volume info
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

The logs on both the server and client are remarkable in their lack of anything amiss (the server has the previously reported zillion-times-repeating string of:

I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now

which seems to be correlated with turning the NFS server off. This has been mentioned before.) The gluster volume log, stripped of that line, is here: http://pastie.org/4309225 Individual large-file reads and writes are in the 300MB/s range, which is not magnificent but tolerable. However, we've recently detected what appears to be a conflict in reading and writing for some applications. When some applications are reading and writing to the gluster fs, the client /usr/sbin/glusterfs increases its CPU consumption to 100% and the IO goes to almost zero. When the inputs are on the gluster fs and the output is on another fs, performance is as good as on a local RAID. This seems to be specific to a particular application (bedtools, perhaps some other openmp genomics apps - still checking). Other utilities (cp, perl, tar, and other utilities) that read and write to the gluster filesystem seem to be able to push and pull fairly large amounts of data to/from it. The client is running a genomics utility (bedtools) which reads very large chunks of data from the gluster fs, then aligns it to a reference genome. Stracing the run yields this stanza, after which it hangs until I kill it. The user has said that it does complete, but at a speed hundreds of times slower (maybe timing out at each step..?)

open("/data/users/tdlong/bin/genomeCoverageBed", O_RDONLY) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffcf0e5bb0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "#!/bin/sh\n${0%/*}/bedtools genom"..., 80) = 42
lseek(3, 0, SEEK_SET)                   = 0
getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=4*1024}) = 0
dup2(3, 255)                            = 255
close(3)                                = 0
fcntl(255, F_SETFD, FD_CLOEXEC)         = 0
fcntl(255, F_GETFL)                     = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fstat(255, {st_mode=S_IFREG|0755, st_size=42, ...}) = 0
lseek(255, 0, SEEK_CUR)                 = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
read(255, "#!/bin/sh\n${0%/*}/bedtools genom"..., 42) = 42
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2ae9318729e0) = 8229
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x436f40, [], SA_RESTORER, 0x3cb64302d0}, {SIG_DFL, [], SA_RESTORER, 0x3cb64302d0}, 8) = 0
wait4(-1,

Does this indicate any optional tuning or operational parameters that we should be using? hjm
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Simultaneous reads and writes from specific apps to IPoIB volume seem to conflict and kill performance.
Some more info.. I think the problem is the way the bedtools is writing the output - it's not getting buffered correctly. Using some more useful strace flags to force strace into the fork'ed child, you can see that the output is being written, just very slowly due to the awful, horrible, skeezy, skanky, lazy, wanky way that biologists (me included) tend to write code. ie: after the data is read in and processed, you get gigantic amounts of this kind of output being written to the file [pid 17021] 21:56:21 write(1, U\t137095\t43\n, 12) = 12 0.000120 [pid 17021] 21:56:21 write(1, U\t137096\t40\n, 12) = 12 0.000119 [pid 17021] 21:56:21 write(1, U\t137097\t40\n, 12) = 12 0.000119 [pid 17021] 21:56:21 write(1, U\t137098\t40\n, 12) = 12 0.000119 [pid 17021] 21:56:21 write(1, U\t137099\t38\n, 12) = 12 0.000116 [pid 17021] 21:56:21 write(1, U\t137100\t38\n, 12) = 12 0.000119 [pid 17021] 21:56:21 write(1, U\t137101\t38\n, 12) = 12 0.000117 ie (the file itself): ... 137098 U 137098 40 137099 U 137099 38 137100 U 137100 38 137101 U 137101 38 137102 U 137102 36 IT looks like the current gluster config isn't being set up to buffer this particular output correctly, so it's being written on a write-by-write basis. As noted below, my gluster performance options are: performance.cache-size: 268435456 performance.io-cache: on performance.quick-read: on performance.io-thread-count: 64 Is there an option to address this extremely slow write perf? These options (p 38 of the 'Gluster File System 3.3.0 Administration Guide') sound like they may help but without knowing what they actually do, I'm hesitant to apply them to what is now a live fs. performance.flush-behind: If this option is set ON, instructs write-behind translator to perform flush in background, by returning success (or any errors, if any of previous writes were failed) to application even before flush is sent to backend filesystem. performance.write-behind-window-size Size of the per-file write-behind buffer. Advice? hjm On Mon, Jul 23, 2012 at 4:59 PM, Harry Mangalam hjmanga...@gmail.com wrote: I have fairly new gluster fs of 4 nodes with 2 RAID6 bricks on each node connected to a cluster via IPoIB on QDR IB. The servers are all SL6.2, running gluster 3.3-1; the clients are running the gluster-released glusterfs-fuse-3.3.0qa42-1 glusterfs-3.3.0qa42-1. The volume seems normal: $ gluster volume info Volume Name: gl Type: Distribute Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332 Status: Started Number of Bricks: 8 Transport-type: tcp,rdma Bricks: Brick1: bs2:/raid1 Brick2: bs2:/raid2 Brick3: bs3:/raid1 Brick4: bs3:/raid2 Brick5: bs4:/raid1 Brick6: bs4:/raid2 Brick7: bs1:/raid1 Brick8: bs1:/raid2 Options Reconfigured: performance.cache-size: 268435456 nfs.disable: on performance.io-cache: on performance.quick-read: on performance.io-thread-count: 64 auth.allow: 10.2.*.*,10.1.*.* The logs on both the server and client are remarkable in their lack of anything amiss (the server has the previously reported zillion times repeating string of: I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now which seems to be correlated with turning the NFS server off. This has been mentioned before. The gluster volume log, stripped of that line is here: http://pastie.org/4309225 Individual large-file reads and writes are in the 300MB/s range which is not magnificent but tolerable. However, we've recently detected what appears to be a conflict in reading and writing for some applications. 
When some applications are reading and writing to the gluster fs, the client /usr/sbin/glusterfs increases its CPU consumption to 100% and the IO goes to almost zero. When the inputs are on the gluster fs and the output is on another fs, performance is as good as on a local RAID. This seems to be specific to a particular application (bedtools, perhaps some other openmp genomics apps - still checking). Other utilities (cp, perl, tar, etc.) that read and write to the gluster filesystem seem to be able to push and pull fairly large amounts of data to/from it.

The client is running a genomics utility (bedtools) which reads very large chunks of data from the gluster fs, then aligns it to a reference genome. Stracing the run yields this stanza, after which it hangs until I kill it. The user has said that it does complete, but at a speed hundreds of times slower (maybe timing out at each step..?)

open("/data/users/tdlong/bin/genomeCoverageBed", O_RDONLY) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffcf0e5bb0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
read(3, "#!/bin/sh\n${0%/*}/bedtools genom"..., 80) = 42
lseek(3, 0, SEEK_SET) = 0
getrlimit(RLIMIT_NOFILE, {rlim_cur
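For anyone wanting to try this, a minimal sketch of both angles. The two volume options are real gluster options (they are quoted from the 3.3 Admin Guide above), but the window-size value here is only an illustration, not a tested recommendation:

  gluster volume set gl performance.flush-behind on
  gluster volume set gl performance.write-behind-window-size 4MB

On the application side, the 12-byte per-line writes suggest unbuffered or line-buffered stdio; strace with fork-following and per-syscall timing exposes the pattern, and stdbuf can force block buffering - with the caveat that stdbuf only helps programs that use C stdio and don't set their own buffering:

  # follow forks, timestamp each call, and time each syscall
  strace -f -tt -T -o /tmp/bedtools.trace genomeCoverageBed ...
  # force ~1MB stdout buffering instead of write-per-line
  stdbuf -o1M genomeCoverageBed ... > output.cov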
Re: [Gluster-users] Distributed replicated volume
I think the strong consensus of gluster users would be that this is the wrong filesystem to use for what you propose. gluster has a lot to recommend it, but using it for mail or other ZOTfiles (Zillions Of Tiny files) will bring nothing but astonishingly poor performance, tearing of hair, rending of clothes, pain, and heartbreak. hjm

On Wed, Jul 18, 2012 at 12:43 PM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: Hi, i'm planning to develop a 4 node distributed replicated infrastructure with Gluster 3.3. I'll use 4 DELL R510s with 8x 2TB SATA disks each, giving us 32TB of raw redundant and distributed capacity. Some questions:
- in the near future we will add 2 other identical nodes; will it be possible to extend the gluster volume, going up to 32+16TB of raw capacity?
- What do you suggest, RAID5 or no-raid (one disk for each brick)?
Our primary use will be mail and web servers, so many small files. Any other advice? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
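On the expansion question, though: yes. For a distributed-replicated volume with replica 2, bricks are added in multiples of the replica count and the layout is then rebalanced. A sketch (VOLNAME, node5 and node6 are placeholders, not names from this thread):

  gluster volume add-brick VOLNAME node5:/brick node6:/brick
  gluster volume rebalance VOLNAME start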
Re: [Gluster-users] according to mtab, GlusterFS is already mounted
Do you have some dangling symlinks? /home -> /home2 (or vice versa), ie:

ls -ld /home*

What does 'mount' or /etc/mtab say? (assuming that the '_' are supposed to be spaces; if not, all bets are off) hjm

On Thu, Jul 5, 2012 at 1:41 AM, Jon Tegner teg...@renget.se wrote: Hi, I want to mount from two different gluster-filesystems, according to the following lines in fstab:

server1:glusterStore1___/home1__glusterfs___defaults,_netdev,transport=rdma___0_0
server2:/glusterStore2___/home2__glusterfs___defaults,_netdev,transport=rdma___0_0

However, when the second one is mounted I get the following message:

/sbin/mount.glusterfs: according to mtab, GlusterFS is already mounted on /home

However, both systems seem to be mounted OK. Am I doing something terribly wrong here, or can I just disregard this message? On server1 I'm using 3.2.3-1. On the client and server2 it's 3.2.6-1. Regards, /jon ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
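A quick way to see exactly what mount.glusterfs is reacting to (sketch; mountpoints as in the fstab above):

  ls -ld /home /home1 /home2    # is any of these a symlink?
  grep -i gluster /etc/mtab     # what mtab actually records
  mount | grep -i gluster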
Re: [Gluster-users] Gluster-3.3 Puzzler
which OS are you using? I believe 3.3 will install but won't run on older CentOSs (5.7/5.8) due to libc skew. And you did 'modprobe fuse' before you tried to mount it...? hjm

On Wed, Jun 27, 2012 at 12:46 PM, Robin, Robin rob...@muohio.edu wrote: Hi, Just updated to Gluster-3.3; I can't seem to mount my initial test volume. I did the mount on the gluster server itself (which works on Gluster-3.2).

# rpm -qa | grep -i gluster
glusterfs-fuse-3.3.0-1.el6.x86_64
glusterfs-server-3.3.0-1.el6.x86_64
glusterfs-3.3.0-1.el6.x86_64

# gluster volume info all
Volume Name: vmvol
Type: Replicate
Volume ID: b105560a-e157-4b94-bac9-39378db6c6c9
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: mualglup01:/mnt/gluster/vmvol001
Brick2: mualglup02:/mnt/gluster/vmvol001
Options Reconfigured:
auth.allow: 127.0.0.1,134.53.*,10.*

# mount -t glusterfs mualglup01.mcs.muohio.edu:vmvol /mnt/test
(did this on the gluster machine itself)

I'm getting the following in the logs:
+--+
[2012-06-27 15:40:52.116160] I [rpc-clnt.c:1660:rpc_clnt_reconfig] 0-vmvol-client-0: changing port to 24009 (from 0)
[2012-06-27 15:40:52.116479] I [rpc-clnt.c:1660:rpc_clnt_reconfig] 0-vmvol-client-1: changing port to 24009 (from 0)
[2012-06-27 15:40:56.055124] I [client-handshake.c:1636:select_server_supported_programs] 0-vmvol-client-0: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
[2012-06-27 15:40:56.055575] I [client-handshake.c:1433:client_setvolume_cbk] 0-vmvol-client-0: Connected to 10.0.72.132:24009, attached to remote volume '/mnt/gluster/vmvol001'.
[2012-06-27 15:40:56.055610] I [client-handshake.c:1445:client_setvolume_cbk] 0-vmvol-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-06-27 15:40:56.055682] I [afr-common.c:3627:afr_notify] 0-vmvol-replicate-0: Subvolume 'vmvol-client-0' came back up; going online.
[2012-06-27 15:40:56.055871] I [client-handshake.c:453:client_set_lk_version_cbk] 0-vmvol-client-0: Server lk version = 1
[2012-06-27 15:40:56.057871] I [client-handshake.c:1636:select_server_supported_programs] 0-vmvol-client-1: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
[2012-06-27 15:40:56.058277] I [client-handshake.c:1433:client_setvolume_cbk] 0-vmvol-client-1: Connected to 10.0.72.133:24009, attached to remote volume '/mnt/gluster/vmvol001'.
[2012-06-27 15:40:56.058304] I [client-handshake.c:1445:client_setvolume_cbk] 0-vmvol-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-06-27 15:40:56.063514] I [fuse-bridge.c:4193:fuse_graph_setup] 0-fuse: switched to graph 0
[2012-06-27 15:40:56.063638] I [client-handshake.c:453:client_set_lk_version_cbk] 0-vmvol-client-1: Server lk version = 1
[2012-06-27 15:40:56.063802] I [fuse-bridge.c:4093:fuse_thread_proc] 0-fuse: unmounting /mnt/test
[2012-06-27 15:40:56.064207] W [glusterfsd.c:831:cleanup_and_exit] (--/lib64/libc.so.6(clone+0x6d) [0x35f0ce592d] (--/lib64/libpthread.so.0() [0x35f14077f1] (--/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdd) [0x405cfd]))) 0-: received signum (15), shutting down
[2012-06-27 15:40:56.064250] I [fuse-bridge.c:4643:fini] 0-fuse: Unmounting '/mnt/test'.

The server and client should be the same version (as attested by the rpm). I've seen that some other people are getting the same errors in the archive; no solutions were offered. Any help is appreciated.
Thanks, Robin ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
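A quick sanity check for the fuse prerequisite mentioned at the top of this thread (sketch; the mount command is Robin's own):

  modprobe fuse
  lsmod | grep fuse     # fuse module loaded?
  ls -l /dev/fuse       # device node present?
  mount -t glusterfs mualglup01.mcs.muohio.edu:vmvol /mnt/test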
Re: [Gluster-users] Fedora 17 GlusterFS 3.3.0 problmes
Were you automounting your glusterfs? This is extremely similar to what I described previously: http://comments.gmane.org/gmane.comp.file-systems.gluster.user/9241 In my case the servers were running SL6.2; clients were running CentOS 5.7 and were automounting the glusterfs with IPoIB. When we switched to hard mounts, the problem seems to have gone away, but I'd be interested in seeing this resolved. The "Client lk-version numbers are not same, reopening the fds" strings in the logs were also identical, even when I compiled new client versions that matched the server version. If this does not have a trivial answer, I'll be happy to file an official bug report. One report could be a twit :); a second one could be a bug. hjm

On Fri, Jun 22, 2012 at 7:49 AM, Nathan Stratton nat...@robotics.net wrote: When I do a NFS mount and do a ls I get:

[root@ovirt share]# ls
ls: reading directory .: Too many levels of symbolic links
[root@ovirt share]# ls -fl
ls: reading directory .: Too many levels of symbolic links
total 3636
drwxr-xr-x   3 root root 16384 Jun 21 19:34 .
dr-xr-xr-x. 21 root root  4096 Jun 21 19:29 ..
drwxr-xr-x   3 root root 16384 Jun 21 19:34 .
dr-xr-xr-x. 21 root root  4096 Jun 21 19:29 ..
drwxr-xr-x   3 root root 16384 Jun 21 19:34 .
dr-xr-xr-x. 21 root root  4096 Jun 21 19:29 ..
drwxr-xr-x   3 root root 16384 Jun 21 19:34 .
dr-xr-xr-x. 21 root root  4096 Jun 21 19:29 ..
drwxr-xr-x   3 root root 16384 Jun 21 19:34 .
{and so on}

When I try a fuse mount I get:

[client-handshake.c:1445:client_setvolume_cbk] 0-share-client-0: Server and Client lk-version numbers are not same, reopening the fds

I have tried the fedora 16 RPMs and also built new fedora 17 RPMs. Full Log:

[2012-06-21 19:24:35.633510] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.3.0
[2012-06-21 19:24:35.646832] I [io-cache.c:1549:check_cache_size_ok] 0-share-quick-read: Max cache size is 16825450496
[2012-06-21 19:24:35.646916] I [io-cache.c:1549:check_cache_size_ok] 0-share-io-cache: Max cache size is 16825450496
[2012-06-21 19:24:35.660807] I [client.c:2142:notify] 0-share-client-0: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.664026] I [client.c:2142:notify] 0-share-client-1: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.666801] I [client.c:2142:notify] 0-share-client-2: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.669385] I [client.c:2142:notify] 0-share-client-3: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.671951] I [client.c:2142:notify] 0-share-client-4: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.674514] I [client.c:2142:notify] 0-share-client-5: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.677093] I [client.c:2142:notify] 0-share-client-6: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.679652] I [client.c:2142:notify] 0-share-client-7: parent translators are ready, attempting connect on transport

Given volfile:
+--+
  1: volume share-client-0
  2:   type protocol/client
  3:   option remote-host virt01.casabi.net
  4:   option remote-subvolume /export
  5:   option transport-type tcp
  6:   option username f0a71b3a-1cb2-4a38-8620-f3853aa90798
  7:   option password 9641cac2-de93-4621-b94b-9320ada20fd7
  8: end-volume
  9:
 10: volume share-client-1
 11:   type protocol/client
 12:   option remote-host virt02.casabi.net
 13:   option remote-subvolume /export
 14:   option transport-type tcp
 15:   option username f0a71b3a-1cb2-4a38-8620-f3853aa90798
 16:   option password 9641cac2-de93-4621-b94b-9320ada20fd7
 17: end-volume
 18:
 19: volume share-client-2
 20:   type protocol/client
 21:   option remote-host virt03.casabi.net
 22:   option remote-subvolume /export
 23:   option transport-type tcp
 24:   option username f0a71b3a-1cb2-4a38-8620-f3853aa90798
 25:   option password 9641cac2-de93-4621-b94b-9320ada20fd7
 26: end-volume
 27:
 28: volume share-client-3
 29:   type protocol/client
 30:   option remote-host virt04.casabi.net
 31:   option remote-subvolume /export
 32:   option transport-type tcp
 33:   option username f0a71b3a-1cb2-4a38-8620-f3853aa90798
 34:   option password 9641cac2-de93-4621-b94b-9320ada20fd7
 35: end-volume
 36:
 37: volume share-client-4
 38:   type protocol/client
 39:   option remote-host virt05.casabi.net
 40:   option remote-subvolume /export
 41:   option transport-type tcp
 42:   option username f0a71b3a-1cb2-4a38-8620-f3853aa90798
 43:   option password 9641cac2-de93-4621-b94b-9320ada20fd7
 44: end-volume
 45:
 46: volume share-client-5
 47:   type
Re: [Gluster-users] shrinking volume
In 3.3, this is exactly what 'remove-brick' does. It migrates the data off an active volume, and when it's done, allows the removed brick to be upgraded, shut down, killed off, etc.

gluster volume remove-brick vol server:/brick start

(takes a while to start up, but then goes fairly rapidly.) The following is the result of a recent remove-brick I did with 3.3:

% gluster volume remove-brick gl bs1:/raid1 status
Node       Rebalanced-files  size  scanned  failures  status
localhost  2  10488397779  120  in progress
bs2        0  0  0  0  not started
bs3        0  0  0  0  not started
bs4        0  0  0  0  not started

(time passes)

$ gluster volume remove-brick gl bs1:/raid2 status
Node       Rebalanced-files  size  scanned  failures  status
localhost  952  26889337908  83060  completed

Note that once the 'status' says completed, you need to issue the remove-brick command again - the 'commit' form - to actually finalize the operation. And that 'remove-brick' command will not clear the dir structure on the removed brick.

On Thu, 2012-06-21 at 12:29 -0400, Brian Cipriano wrote: Hi all - is there a safe way to shrink an active gluster volume without losing files? I've used remove-brick before, but this causes the files on that brick to be removed from the volume. Which is fine for some situations. But I'm trying to remove a brick without losing files. This is because our file usage can grow dramatically over short periods. During those times we add a lot of buffer to our gluster volume, to keep it at about 50% usage. After things settle down and file usage isn't changing as much, we'd like to remove some bricks in order to keep usage at about 80%. (These bricks are AWS EBS volumes - we want to remove the bricks to save a little $ when things are slow.) So what I'd like to do is the following. This is a simple distributed volume, no replication.
* Let gluster know I want to remove a brick
* No new files will go to that brick
* Gluster starts copying files from that brick to other bricks, essentially rebalancing the data
* Once all files have been duplicated onto other bricks, the brick is marked as removed and I can do a normal remove-brick
* Over the course of this procedure the files are always available because there's always at least one active copy of every file
This procedure seems very similar to replace-brick, except the goal would be to evenly distribute to all other active bricks (without interfering with pre-existing files), not one new brick. Is there any way to do this? I *could* just do my remove-brick, then manually distribute the files from that old brick back onto the volume, but that would cause those files to become unavailable for some amount of time. Many thanks for all your help, - brian ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
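Condensed, the whole lifecycle looks like this (sketch, using the volume and brick names from the example above; the 'commit' spelling of the finalize step is per 3.3):

  gluster volume remove-brick gl bs1:/raid1 start
  gluster volume remove-brick gl bs1:/raid1 status   # repeat until it says 'completed'
  gluster volume remove-brick gl bs1:/raid1 commit   # finalize: brick leaves the volume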
[Gluster-users] How Fatal? Server and Client lk-version numbers are not same, reopening the fds
Despite Joe Landman's sage advice to the contrary, I'm trying to convince an IPoIB volume to service requests from a GbE client via some /etc/hosts manipulation. (This may or may not be related to the automount problems we're having as well.) This has worked (and continues to work) well on another cluster with a slightly older version of gluster (the 3.3.0qa42 version on both server and client). In the following case the servers are on IPoIB (net 10.2.x.x) and GbE (10.1.x.x) and can ping back and forth on their respective networks to all the clients and servers. The gluster volume was created using the IPoIB network and numbers:

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

using 3.3 servers on SciLi 6.2 and 3.3.0qa42 clients. When trying to mount the gluster volume on an ethernet client, coerced into believing that the server is on ethernet using /etc/hosts manipulations, it doesn't complete the mount, failing rapidly with the following log: http://pastie.org/4123348. The server log doesn't seem to show anything. There are repeated references to "Server and Client lk-version numbers are not same, reopening the fds". Is this a fatal error or a side effect? How important is having exactly matching versions? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
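For concreteness, the /etc/hosts coercion on the GbE-only client looks like this (sketch - these 10.1.x.x addresses are made up for illustration; only bs1's IPoIB address appears elsewhere in these threads):

  # /etc/hosts on the ethernet client: resolve the server names to their
  # 10.1.x.x (GbE) addresses instead of the 10.2.x.x (IPoIB) ones
  10.1.7.11  bs1
  10.1.7.12  bs2
  10.1.7.13  bs3
  10.1.7.14  bs4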
Re: [Gluster-users] How Fatal? Server and Client lk-version numbers are not same, reopening the fds
To check whether the point-version skew might have an effect, I compiled gluster from the source: http://download.gluster.org/pub/gluster/glusterfs/LATEST/glusterfs-3.3.0.tar.gz and tried it again. However, even tho the server and client are now from (I assume) the same source, I still get that error:

[2012-06-20 17:13:52.084846] I [client-handshake.c:453:client_set_lk_version_cbk] 0-gl-client-6: Server lk version = 1
[2012-06-20 17:13:52.087352] I [client-handshake.c:1636:select_server_supported_programs] 0-gl-client-5: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)

The server was installed from the CentOS binary of that version: http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-server-3.3.0-1.el6.x86_64.rpm - the client from the self-compiled code. ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
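For the record, the client build was nothing exotic - the stock autotools sequence (sketch; default prefix and no special configure flags assumed):

  tar xzf glusterfs-3.3.0.tar.gz
  cd glusterfs-3.3.0
  ./configure
  make && sudo make install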
[Gluster-users] Too many levels of symbolic links with glusterfs automounting
(Apologies if this already posted, but I recently had to change smtp servers, which scrambled some list permissions, and I haven't seen it post.)

I set up a 3.3 gluster volume for another sysadmin and he has added it to his cluster via automount. It seems to work initially, but after some time (days) he is now regularly seeing this warning when he tries to traverse the mounted filesystems:

$ df
df: `/share/gl': Too many levels of symbolic links

It's supposed to be mounted on /share/gl with a symlink to /gl, ie: /gl -> /share/gl

I've been using gluster with static mounts on a cluster and have never seen this behavior; google does not seem to record anyone else seeing this with gluster. However, I note that the Howto Automount GlusterFS page at http://www.gluster.org/community/documentation/index.php/Howto_Automount_GlusterFS has been deleted. Is automounting no longer supported?

His auto.master file is as follows (sorry for the wrapping):

w1 -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async 10.1.50.2:/
w2 -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async 10.1.50.3:/
mathbio -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async 10.1.50.2:/
tw -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async 10.1.50.4:/
shwstore -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async shwraid.biomol.uci.edu:/
djtstore -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async djtraid.biomol.uci.edu:/
djtstore2 -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async djtraid2.biomol.uci.edu:/djtraid2:/
djtstore3 -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async djtraid3.biomol.uci.edu:/djtraid3:/
kevin -rw,intr,bg,rsize=65520,wsize=65520,retrans=10,timeo=20,hard,lock,defaults,noatime,async 10.2.255.230:/
samlab -rw,intr,bg,rsize=65520,wsize=65520,retrans=10,timeo=20,hard,lock,defaults,noatime,async 10.2.255.237:/
new-data -rw,intr,bg,rsize=65520,wsize=65520,retrans=10,timeo=20,hard,lock,defaults,noatime,async nas-1-1.ib:/
gl -fstype=glusterfs bs1:/

He has never seen this behavior with the other automounted fs's. The system logs from the affected nodes do not have any gluster strings that appear to be relevant, but /var/log/glusterfs/share-gl.log ends with this series of odd lines:

[2012-06-18 08:57:38.964243] I [client-handshake.c:453:client_set_lk_version_cbk] 0-gl-client-6: Server lk version = 1
[2012-06-18 08:57:38.964507] I [fuse-bridge.c:3376:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.16
[2012-06-18 09:16:48.692701] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
[2012-06-18 09:16:48.693030] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
[2012-06-18 09:16:48.693165] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
[2012-06-18 09:16:48.693394] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle.
Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
[2012-06-18 10:56:32.756551] I [fuse-bridge.c:4037:fuse_thread_proc] 0-fuse: unmounting /share/gl
[2012-06-18 10:56:32.757148] W [glusterfsd.c:816:cleanup_and_exit] (--/lib64/libc.so.6(clone+0x6d) [0x3829ed44bd] (--/lib64/libpthread.so.0 [0x382aa0673d] (--/usr/sbin/glusterfs(glusterfs_sigwaiter+0x17c) [0x40524c]))) 0-: received signum (15), shutting down

Any hints as to why this is happening? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
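If automount is the trigger, a static mount is the obvious control experiment; the hard-mount equivalent of the map entry above would be something like this (sketch - assumes the volume is named 'gl', which the map entry's bare 'bs1:/' doesn't actually spell out):

  # /etc/fstab on the client
  bs1:/gl  /share/gl  glusterfs  defaults,_netdev  0 0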
Re: [Gluster-users] Too many levels of symbolic links with glusterfs automounting
One client log file is here: http://goo.gl/FyYfy

On the server side, on bs1 and bs4, there is a huge, current nfs.log file (odd, since I neither wanted nor configured an nfs export). It is filled entirely with these lines:

tail -5 nfs.log
[2012-06-19 21:11:54.402567] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-1: tcp connect to  failed (Connection refused)
[2012-06-19 21:11:54.406023] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-2: tcp connect to  failed (Connection refused)
[2012-06-19 21:11:54.409486] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-3: tcp connect to  failed (Connection refused)
[2012-06-19 21:11:54.412822] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-6: tcp connect to 10.2.7.11:24008 failed (Connection refused)
[2012-06-19 21:11:54.416231] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-7: tcp connect to 10.2.7.11:24008 failed (Connection refused)

on servers bs2 and bs3 there is a current, huge log of this line, repeating every 3s:

[2012-06-19 21:14:00.907387] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now

I was reminded as I was copying it that the client and servers are slightly different - the client is 3.3.0qa42-1 while the server is 3.3.0-1. Is this enough version skew to cause a difference? There are no other problems that I'm aware of, but if it's the case that a slight version skew will be problematic, I'll be careful to keep them exactly aligned. I think this was done since the final release binary did not support the glibc that we were using on the compute nodes and the 3.3.0qa42-1 did. Perhaps too sloppy...?

gluster volume info
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

gluster volume status
Status of volume: gl
Gluster process             Port   Online  Pid
--
Brick bs2:/raid1            24009  Y  2908
Brick bs2:/raid2            24011  Y  2914
Brick bs3:/raid1            24009  Y  2860
Brick bs3:/raid2            24011  Y  2866
Brick bs4:/raid1            24009  Y  2992
Brick bs4:/raid2            24011  Y  2998
Brick bs1:/raid1            24013  Y  10122
Brick bs1:/raid2            24015  Y  10154
NFS Server on localhost     38467  Y  9475
NFS Server on 10.2.7.11     38467  Y  10160
NFS Server on bs2           38467  N  N/A
NFS Server on bs3           38467  N  N/A

Hmm, sure enough, bs1 and bs4 (localhost in the above info) appear to be running NFS servers, while bs2 and bs3 are not...? OK - after some googling, the gluster nfs service can be shut off with:

gluster volume set gl nfs.disable on

and now the status looks like this:

gluster volume status
Status of volume: gl
Gluster process             Port   Online  Pid
--
Brick bs2:/raid1            24009  Y  2908
Brick bs2:/raid2            24011  Y  2914
Brick bs3:/raid1            24009  Y  2860
Brick bs3:/raid2            24011  Y  2866
Brick bs4:/raid1            24009  Y  2992
Brick bs4:/raid2            24011  Y  2998
Brick bs1:/raid1            24013  Y  10122
Brick bs1:/raid2            24015  Y  10154

hjm

On Tue, 2012-06-19 at 13:05 -0700, Anand Avati wrote: Can you post the complete logs? Are the 'Too many levels of symbolic links' (or ELOOP) logs seen in the client log or brick logs?
Avati On Tue, Jun 19, 2012 at 11:22 AM, harry mangalam hjmanga...@gmail.com wrote: (Apologies if this already posted, but I recently had to change smtp servers which scrambled some list permissions, and I haven't seen it post) I set up a 3.3 gluster volume for another sysadmin and he has added it to his cluster via automount. It seems to work initially but after some time (days) he is now regularly seeing this warning: Too many levels of symbolic links when he tries to traverse the mounted filesystems. $ df: `/share
[Gluster-users] remove-brick redux; repeat as necessary
I had the opportunity (actually, desperate requirement) to try this again on a newly live system, and while it worked again, it required 2 remove-brick statements to actually get the volume to drop the brick. The first apparently did the data moving; the second was necessary to tell the volume to drop the brick. Since I had 2 bricks to drop, and re-add, it was repeatable. Not really serious enough for a bug report, but .. interesting .. to experience. -- Harry Mangalam, Research Computing, OIT, Rm 225 MSTB, UC Irvine [mailcode 2225] Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] Lat/Long: (33.642025,-117.844414) [paste into Google Maps] -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] too many redirects at the gluster download page
It may be just me/chrome, but trying to download the latest gluster by clicking on the Download button next to the Ant leads not to a download page but to the info page, which invites you to go back to the gluster.org page from which you just came. And when you click on the alternative 'Download' links (the button on the upper right or the larger Download GlusterFS icon with the package image), you get this in Chrome:

This webpage has a redirect loop
The webpage at http://www.gluster.org/download/ has resulted in too many redirects. Clearing your cookies for this site or allowing third-party cookies may fix the problem. If not, it is possibly a server configuration issue and not a problem with your computer.

Bug or feature? -- Harry Mangalam, Research Computing, OIT, Rm 225 MSTB, UC Irvine [mailcode 2225] Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] Lat/Long: (33.642025,-117.844414) [paste into Google Maps] -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] unable to delete 'empty' dirs with 3.3qa42
Just before upgrading to 3.3final, we had an rsync collision on our gluster filesystem which left us with undeletable dirs. The transport is IPoIB over 4 bricks as shown below.

$ gluster volume info
Volume Name: gli
Type: Distribute
Volume ID: 76cc5e88-0ac4-42ac-a4a3-31bf2ba611d4
Status: Started
Number of Bricks: 4
Transport-type: tcp,rdma
Bricks:
Brick1: pbs1ib:/bducgl
Brick2: pbs2ib:/bducgl
Brick3: pbs3ib:/bducgl
Brick4: pbs4ib:/bducgl
Options Reconfigured:
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64

The last 100 lines of the gluster log are here: http://pastie.org/4036243

The problem started when I was running an rsync with --delete, pruning the target of older cruft while moving a large /home onto the gluster volume. A user directed a job to start reading from a directory that was simultaneously being pruned, and all hell broke loose. Now we're left with a number of dirs and files that can't be deleted. 'du' says there are files in an apparently empty dir:

root@bduc-login:/gl/tvanerp/adni
561 $ du -sh *
40K dicom

altho 'ls' can't see them from the client:

root@bduc-login:/gl/tvanerp/adni
562 $ ls -lat dicom
total 40
drwxrwxr-x 3 tvanerp tvanerp 94214 Jun  5 23:40 ./
drwxrwxr-x 3 tvanerp tvanerp    72 Apr 23 13:11 ../

From the bricks you can see that there are still files there:

Tue Jun 05 23:18:02 [3.06 2.42 1.65][457.48/606] root@bduc-login:~
556 $ ssh pbs2 'ls -lat /bducgl/tvanerp/adni/dicom'
total 24
drwxrwxr-x 3 7335 7335 28672 2012-06-05 22:59 .
drwxr-xr-x 2 7335 7335  8192 2012-06-05 14:50 128_S_0947_20080422_MPRAGE
drwxrwxr-x 3 7335 7335    18 2012-04-23 13:11 ..

Tue Jun 05 23:18:26 [2.82 2.42 1.67][457.62/606] root@bduc-login:~
557 $ ssh pbs3 'ls -lat /bducgl/tvanerp/adni/dicom'
total 28
drwxrwxr-x 3 7335 7335 28672 2012-06-05 23:00 .
drwxr-xr-x 2 7335 7335 12288 2012-06-05 14:52 128_S_0947_20080422_MPRAGE
drwxrwxr-x 3 7335 7335    18 2012-04-23 13:11 ..

Tue Jun 05 23:18:51 [2.73 2.43 1.69][457.7/606] root@bduc-login:~
558 $ ssh pbs4 'ls -lat /bducgl/tvanerp/adni/dicom'
total 32
drwxrwxr-x 3 7335 7335 36864 2012-06-05 22:59 .
drwxr-xr-x 2 7335 7335 12288 2012-06-05 14:50 128_S_0947_20080422_MPRAGE
---------T 2 root root     0 2012-05-20 18:35 adni_pib_subjects.txt
drwxrwxr-x 3 7335 7335    18 2012-04-23 13:11 ..

but the client is not able to see/delete them, even as root. Suggestions? harry -- Harry Mangalam, Research Computing, OIT, Rm 225 MSTB, UC Irvine [mailcode 2225] Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] Lat/Long: (33.642025,-117.844414) [paste into Google Maps] -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
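One possible cleanup, strictly a sketch and not a supported procedure: the zero-length, mode ---------T entries visible on the bricks are DHT link files, and if they (and the leftover subdirs) are confirmed orphans - i.e. nothing in the volume still points at them - the usual last resort is to remove them directly on the bricks and then re-stat the parent from a client. Inspect before deleting anything:

  # list zero-length sticky-bit files (link-file candidates) on one brick
  ssh pbs4 'find /bducgl/tvanerp/adni/dicom -maxdepth 1 -type f -size 0 -perm -1000 -ls'
  # only after confirming they are stale:
  ssh pbs4 'find /bducgl/tvanerp/adni/dicom -maxdepth 1 -type f -size 0 -perm -1000 -delete'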
Re: [Gluster-users] A very special announcement from Gluster.org
WooHooo! Thanks, Glusteristas! Timing is quite fortuitous for us. hjm

On 05/31/2012 09:33 AM, John Mark Walker wrote: Today, we're announcing the next generation of GlusterFS (http://www.gluster.org/), version 3.3. The release has been a year in the making and marks several firsts: the first post-acquisition release under Red Hat, our first major act as an openly-governed project (http://www.gluster.org/roadmaps/) and our first foray beyond NAS. We've also taken our first steps towards merging big data and unstructured data storage, giving users and developers new ways of managing their data scalability challenges.

GlusterFS is an open source, fully distributed storage solution for the world's ever-increasing volume of unstructured data. It is a software-only, highly available, scale-out, centrally managed storage pool that can be backed by POSIX filesystems that support extended attributes, such as Ext3/4, XFS, BTRFS and many more.

This release provides many of the most commonly requested features including proactive self-healing, quorum enforcement, and granular locking for self-healing, as well as many additional bug fixes and enhancements. Some of the more noteworthy features include:

* Unified File and Object storage -- Blending OpenStack's Object Storage API (http://openstack.org/projects/storage/) with GlusterFS provides simultaneous read and write access to data as files or as objects.
* HDFS compatibility -- Gives Hadoop administrators the ability to run MapReduce jobs on unstructured data on GlusterFS and access the data with well-known tools and shell scripts.
* Proactive self-healing -- GlusterFS volumes will now automatically restore file integrity after a replica recovers from failure.
* Granular locking -- Allows large files to be accessed even during self-healing, a feature that is particularly important for VM images.
* Replication improvements -- With quorum enforcement you can be confident that your data has been written in at least the configured number of places before the file operation returns, allowing a user-configurable adjustment to fault tolerance vs performance.

Visit http://www.gluster.org/ to download. Packages are available for most distributions, including Fedora, Debian, RHEL, Ubuntu and CentOS.

Get involved! Join us on #gluster on freenode, join our mailing list (http://www.gluster.org/interact/mailinglists/), 'like' our Facebook page (http://facebook.com/GlusterInc), follow us on Twitter (http://twitter.com/glusterorg), or check out our LinkedIn group (http://www.linkedin.com/groups?gid=99784).

GlusterFS is an open source project sponsored by Red Hat (http://www.redhat.com/), who uses it in its line of Red Hat Storage (http://www.redhat.com/storage/) products.

(this post published at http://www.gluster.org/2012/05/introducing-glusterfs-3-3/) ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam, Research Computing, OIT, Rm 225 MSTB, UC Irvine [mailcode 2225] Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] Lat/Long: (33.642025,-117.844414) [paste into Google Maps] -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] 'remove-brick' is removing more bytes than are in the brick(?)
Installed the qa42 version on servers and clients, and under load it worked as advertised (tho of course more slowly than I would have liked :)) - removed ~1TB in just under 24 hr (on a DDR-IB connected 4-node set), ~40MB/s overall, tho there were a huge number of tiny files. The remove-brick cleared the brick (~1TB), tho with an initial set of 120 failures (what does this mean?):

Node    Rebalanced-files  size          scanned  failures  status       timestamp
pbs2ib  15676             69728541188   365886   120       in progress  May 22 17:33:14
pbs2ib  24844             134323243354  449667   120       in progress  May 22 18:08:56
pbs2ib  37937             166673066147  714175   120       in progress  May 22 19:08:21
pbs2ib  42014             173145657374  806556   120       in progress  May 22 19:33:21
pbs2ib  418842            222883965887  5729324  120       in progress  May 23 07:15:19
pbs2ib  419148            222907742889  5730903  120       in progress  May 23 07:16:26
pbs2ib  507375            266212060954  6192573  120       in progress  May 23 09:48:05
pbs2ib  540201            312712114570  6325234  120       in progress  May 23 11:15:51
pbs2ib  630332            416533679754  6633562  120       in progress  May 23 14:24:16
pbs2ib  644156            416745820627  6681746  120       in progress  May 23 14:45:44
pbs2ib  732989            432162450646  7024331  120       completed    May 23 17:26:20

(sorry for any wrapping)

and finally deleted the files:

root@pbs2:~
404 $ df -h
Filesystem  Size  Used   Avail  Use%  Mounted on
/dev/md0    8.2T  1010G  7.2T   13%   /bducgl   <- retained brick
/dev/sda    1.9T  384M   1.9T   1%    /bducgl1  <- removed brick

altho it left the directory skeleton (is this a bug or a feature?):

root@pbs2:/bducgl1
406 $ ls
aajames aelsadek amentes avuong1 btatevos chiaoyic dbecerra aamelire aganesan anasr awaring bvillac clarkap dbkeator aaskariz agkentanhml balakire calvinjs cmarcum dcs abanaiya agold argardne bgajare casem cmarkega dcuccia aboessen ahnsh arup biggsjcbatmall courtnem detwiler abondar aihlerasidhwa binz cesar crex dgorur abraatz aisenber asuncion bjanakal cestark cschendhealion abriscoe akathaatenner blind cfalvoculverr dkyu abuschalai2 atfrank blutescgalasso daliz dmsmith acohanalamngathinabmmiller cgarner danieldmvuong acstern allisons athsu bmobashe chadwicr dariusa dphillip ademirta almquist aveidlab brentmchangd1 dasherdshanthi etc.

And once completed with the 'commit' command, it no longer reports the brick as part of the volume:

$ gluster volume info gli
Volume Name: gli
Type: Distribute
Volume ID: 76cc5e88-0ac4-42ac-a4a3-31bf2ba611d4
Status: Started
Number of Bricks: 4
Transport-type: tcp,rdma
Bricks:
Brick1: pbs1ib:/bducgl
Brick2: pbs2ib:/bducgl   <- pbs2ib:/bducgl1 is no longer listed
Brick3: pbs3ib:/bducgl
Brick4: pbs4ib:/bducgl
Options Reconfigured:
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64

And no longer reports the removed brick as part of the gluster volume:

$ gluster volume status
Status of volume: gli
Gluster process        Port   Online  Pid
---
Brick pbs1ib:/bducgl   24016  Y  10770
Brick pbs2ib:/bducgl   24025  Y  1788
Brick pbs3ib:/bducgl   24018  Y  20953
Brick pbs4ib:/bducgl   24009  Y  20948

So this was a big improvement over the previous trial; the only glitches were the 120 failures (which mean...?) and the directory skeleton left on the removed brick, which may be a feature..? So it seems to have been fixed in qa42. thanks! hjm

On Tuesday 22 May 2012 00:02:02 Amar Tumballi wrote:
pbs2ib 8780091379699182236 2994733 in progress
Hi Harry, Can you please test once again with 'glusterfs-3.3.0qa42' and confirm the behavior? This seems like a bug (suspect it to be some overflow type of bug, not sure yet). Please help us with opening a bug report; meantime, we will investigate on this issue.
Regards, Amar -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Performance issues with striped volume over Infiniband
try 'ifstat' to see traffic on all interfaces simultaneously:

501 $ ifstat
       eth2                 ib1
 KB/s in  KB/s out   KB/s in  KB/s out
    0.80      0.70      0.00      0.00
    0.19      0.15      0.00      0.00
    0.07      0.15      0.00      0.00

The ifstat package is in debian/ubuntu. hjm

On Saturday 21 April 2012 02:45:10 Ionescu, A. wrote: Michael, Thanks for your suggestion. I had the same intuition as you, but then I used iptraf and saw no eth0 traffic associated with I/O on the Gluster volume (the tool doesn't show the ib0 interface, unfortunately). node01 and node02 are entered into /etc/hosts; they resolve to the ipoib addresses and are pingable. I will try increasing the number of threads and applying the patch Bryan suggested. Thanks, Adrian -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
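If ifstat isn't available, the kernel's own interface counters cover ib0 as well (sketch; ib0 as in Adrian's description):

  # sample bytes in/out on ib0 once a second
  watch -n1 'cat /sys/class/net/ib0/statistics/rx_bytes /sys/class/net/ib0/statistics/tx_bytes'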
[Gluster-users] IPoIB Volume (3.3b3) started but not online, not mountable
I was successfully running a IPoIB gluster testbed (V3.3b3 on Ubuntu 10.04.04) and brought it down smoothly to adjust some parameters. It now looks like this (the options reconfigured were just added):

# gluster volume info
Volume Name: gli
Type: Distribute
Volume ID: 76cc5e88-0ac4-42ac-a4a3-31bf2ba611d4
Status: Started
Number of Bricks: 5
Transport-type: tcp,rdma
Bricks:
Brick1: pbs1ib:/bducgl
Brick2: pbs2ib:/bducgl
Brick3: pbs2ib:/bducgl1
Brick4: pbs3ib:/bducgl
Brick5: pbs4ib:/bducgl
Options Reconfigured:
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.255.77.*, 128.200.15.*, 10.255.78.*, 10.255.89.*

however, a status query gives this:

# gluster volume status
Status of volume: gli
Gluster process           Port   Online  Pid
--
Brick pbs1ib:/bducgl      24016  N  N/A
Brick pbs2ib:/bducgl      24023  N  N/A
Brick pbs2ib:/bducgl1     24025  N  N/A
Brick pbs3ib:/bducgl      24016  N  N/A
Brick pbs4ib:/bducgl      24016  N  N/A
NFS Server on localhost   38467  N  N/A
NFS Server on pbs4ib      38467  N  N/A
NFS Server on pbs3ib      38467  N  N/A
NFS Server on pbs2ib      38467  N  N/A

(I didn't want the NFS Server options - is that a default to start it?) But the operative bit is that it's not online, despite being started. What could give this situation? As might be expected, clients can't mount the gluster vol. The last part of etc-glusterfs-glusterd.vol.log is many lines like this:

[2012-04-18 11:36:57.456318] E [socket.c:2115:socket_connect] 0-management: connection attempt failed (Connection refused)

and the last lines before are a number of stanzas like this:

[2012-04-18 11:31:14.698184] I [glusterd-op-sm.c::glusterd_op_ac_send_commit_op] 0-management: Sent op req to 3 peers
[2012-04-18 11:31:14.698379] I [glusterd-rpc-ops.c:1294:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: 2a593581-bf45-446c-8f7c-212c53297803
[2012-04-18 11:31:14.698496] I [glusterd-rpc-ops.c:1294:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42
[2012-04-18 11:31:14.698581] I [glusterd-rpc-ops.c:1294:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: 26de63bd-c5b7-48ba-b81d-5d77a533d077
[2012-04-18 11:31:14.698834] I [glusterd-rpc-ops.c:606:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: 2a593581-bf45-446c-8f7c-212c53297803
[2012-04-18 11:31:14.698879] I [glusterd-rpc-ops.c:606:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: 26de63bd-c5b7-48ba-b81d-5d77a533d077
[2012-04-18 11:31:14.698910] I [glusterd-rpc-ops.c:606:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42
[2012-04-18 11:31:14.698929] I [glusterd-op-sm.c:2491:glusterd_op_txn_complete] 0-glusterd: Cleared local lock
[2012-04-18 11:31:15.410106] E [socket.c:2115:socket_connect] 0-management: connection attempt failed (Connection refused)

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] IPoIB Volume (3.3b3) started but not online, not mountable
With JoeJulian's help, tracked this down to what looks like a bug in the IP# format which causes glusterfsd to crash. The bug is: https://bugzilla.redhat.com:443/show_bug.cgi?id=813937 If anyone has an immediate workaround or correction, be glad to hear of it. hjm -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] IPoIB Volume (3.3b3) started but not online, not mountable
Interim fix is to use ONLY commas - no spaces allowed (this used to be OK previously):

gluster volume set gli auth.allow \
'10.255.77.*,128.200.15.*,10.255.78.*,10.255.89.*'

is OK (glusterfsd starts correctly), but

gluster volume set gli auth.allow '10.255.77.*, 128.200.15.*, 10.255.78.*, 10.255.89.*'

is NOT OK (glusterfsd will not start). hjm

On Wednesday 18 April 2012 12:56:08 Harry Mangalam wrote: With JoeJulian's help, tracked this down to what looks like a bug in the IP# format which causes glusterfsd to crash. The bug is: https://bugzilla.redhat.com:443/show_bug.cgi?id=813937 If anyone has an immediate workaround or correction, be glad to hear of it. hjm -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] IPoIB Volume (3.3b3) started but not online, not mountable
And one more observation that will probably be obvious in retrospect. If you enable auth.allow (on 3.3b3), it will do reverse lookups to verify hostnames so it will be more complicated to share an IPoIB gluster volume to IPoEth clients. I had been overriding DNS entries with /etc/hosts entries, but the auth.allow option will prevent that hack. If anyone knows how to share an IPoIB volume to ethernet clients in a more formally correct way, I'd be happy to learn of it. -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Best practice: Scaling down / Removing Servers
As I read the new 3.3b3 FAQ, yes. The remove-brick feature can do this (altho I have not yet tried it - maybe later this next week). http://community.gluster.org/q/what-s-new-in-glusterfs-3-3/ hjm On Saturday 14 April 2012 02:36:41 Philip wrote: We have some issues with the RAID-Controller of our gluster servers. We decided to buy complete new servers to replace the others but it is quite unclear how to perform this task without downtime. It is possible to add the new servers to the volume, rebalance it and then get all the data off the old servers and remove them after this? -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Gluster/RDMA
thanks for the advice. Those modules were not autoloaded, but could be loaded manually. I now have this profile:

$ lsmod | egrep '(ib|rdma)'
ib_umad        12363  0
ib_ucm         12496  0
rdma_ucm       11779  0
ib_uverbs      31902  2 ib_ucm,rdma_ucm
rdma_cm        31030  1 rdma_ucm
ib_cm          40316  2 ib_ucm,rdma_cm
iw_cm           9385  1 rdma_cm
ib_sa          22006  2 rdma_cm,ib_cm
ib_addr         6766  1 rdma_cm
zlib_deflate   21834  1 btrfs
libcrc32c       1244  1 btrfs
ib_mthca      140934  0
ib_mad         41321  4 ib_umad,ib_cm,ib_sa,ib_mthca
ib_core        64935 10 ib_umad,ib_ucm,rdma_ucm,ib_uverbs,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mthca,ib_mad

and semi-magically, on one of the nodes, ibv_devinfo reports:

root@pbs3:~ 661 $ ibv_devinfo
hca_id: mthca0
    transport:        InfiniBand (0)
    fw_ver:           1.0.800
    node_guid:        0006:6a00:9800:6e55
    sys_image_guid:   0006:6a00:9800:6e55
    vendor_id:        0x066a
    vendor_part_id:   25204
    hw_ver:           0xA0
    board_id:         MT_023002
    phys_port_cnt:    1
    port: 1
        state:        PORT_INIT (2)
        max_mtu:      2048 (4)
        active_mtu:   512 (2)
        sm_lid:       0
        port_lid:     0
        port_lmc:     0x00

However, 2 other nodes are still mute - there's obviously some kind of software drift between them, but this gives me some purchase to figure other things out. Still haven't upgraded the firmware, but this allows me to extract the PSID and get the right FW image. Thanks! Harry

On Monday 07 November 2011 06:44:06 Ben England wrote: To Harry Mangalam about Gluster/RDMA: make sure these modules are loaded:
# modprobe -v rdma_ucm
# modprobe -v ib_uverbs
# modprobe -v ib_ucm
To run the subnet manager:
# modprobe -v ib_umad
Make sure libibverbs and (libmlx4 or libmthca) RPMs are installed. I don't understand why the appropriate modules aren't loaded automatically. Could put something in /etc/modprobe.d/ to make this happen maybe? Infiniband should not require troubleshooting after 5-10 years of development, it should just work. ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED! ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
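To make those loads persist across reboots, one convention (sketch; /etc/modules is the Debian/Ubuntu mechanism, as on the pbs* nodes elsewhere in these threads - RHEL-family boxes would instead use a script under /etc/sysconfig/modules/):

  # append the RDMA/IB modules, one per line, to be loaded at boot
  for m in rdma_ucm ib_uverbs ib_ucm ib_umad; do echo "$m" >> /etc/modules; done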
Re: [Gluster-users] forcing a brick to unload storage to replace disks?
How is this functionality invoked? I've now upgraded to 3.3b2, which I believe is the latest version available. hjm On Friday 09 December 2011 00:23:12 Amar Tumballi wrote: Is there a process whereby I can clear a brick by forcing the files to migrate to the other bricks? Hi Harry, This feature got committed to master branch (upstream) recently, with which a remove-brick will take care of migrating data out of the brick. This feature is not part of any of current 3.2.x (or earlier) releases. If you are in testing/validating phase, 3.3.0qa15 should have this feature for you. Regards, Amar ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Citzens United: Democracy on meth - Walter Egan ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Problem: Gluster performance too high..
Thanks! bonnie++ was going to be my next step. I thought of doing the local create-local-files-and-copy, but it complicates the process a bit (he said, lazily sipping his mint julep). But the time to read a local copy would be fairly trivial compared to the network time, so this sounds like it would be a good follow-up. On Thursday 29 March 2012 12:40:23 Jeff White wrote: Maybe it's cheating by writing sparse files or something of the like because it knows it's all zeros? Create some files locally from /dev/urandom and copy them. I think you'll see much lower performance. Better yet, use bonnie++. Jeff White - Linux/Unix Systems Engineer University of Pittsburgh - CSSD -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
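Something like this makes that follow-up concrete (sketch; /gl is the client mountpoint used elsewhere in these threads):

  # 1GB of incompressible data, so sparse/zero-detection can't flatter the numbers
  dd if=/dev/urandom of=/tmp/rand.1G bs=1M count=1024
  time cp /tmp/rand.1G /gl/rand.1G        # network write path
  time cp /gl/rand.1G /tmp/rand.1G.back   # network read path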
Re: [Gluster-users] Problem: Gluster performance too high..
On Thursday 29 March 2012 12:50:48 Harry Mangalam wrote: My assumption was that at some point there would be a 1Gb bottleneck, but if the packets were switched all the way thru, there would be a theoretical max related to the number of gluster-server Gb ports (4). So until I approach 4Gb/s, I guess I wouldn't necessarily see this bottleneck. Is that correct? Adding a little bit more: Digging thru my raw numbers, the fastest completion of the script yields an effective thruput of ~478MB/s, a bit less than the theoretical max of 500MB/s (if my above mumblings were correct), so there may be something to this. -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
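Spelling out the arithmetic: 4 server ports x 1 Gb/s = 4 Gb/s aggregate, and 4 Gb/s / 8 bits per byte = 500 MB/s. The observed 478 MB/s is about 96% of that ceiling - roughly as close as real traffic gets - which supports the guess that the servers' aggregate GbE bandwidth is the bottleneck.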
[Gluster-users] gluster 3.31b1 destroying mountpoints?
Testing gluster 3.3b2 on Ubuntu 10.04.4. 3.3b1 seemed to be fine in this regard, but I'm getting a very peculiar effect when trying to mount gluster with 3.3b2:

$ mount -t glusterfs pbs1:/gl /gll
Mount failed. Please check the log file for more details.

$ ls -l /
...
drwxr-xr-x 193 root root 12288 2012-03-22 13:19 etc
d?????????   ? ?    ?        ?                ? gl
d?????????   ? ?    ?        ?                ? gll
drwxr-xr-x 403 root root 12288 2012-03-17 08:26 home
...

The mount seems to be destroying the mountpoints. Both /gl and /gll were created as mountpoints, and then destroyed by trying to mount the gluster volume on them. I did not use a transport option since it's supposed to default to socket. The package used on both servers and clients was the Debian pkg glusterfs_3.3beta2-1_amd64_with_rdma.deb.

The volume was created with:

gluster volume create gl \
  transport tcp,rdma \
  pbs1:/bducgl \
  pbs2:/bducgl pbs2:/bducgl1 \
  pbs3:/bducgl \
  pbs4:/bducgl
gluster volume set gl auth.allow 10.255.78.*,10.255.89.*,128.200.15.*
gluster volume set gl performance.io-thread-count 64

(I'm also experimenting with IB transport, but would like to test the TCP transport as well.) and the server-side status was:

root@pbs1:~# gluster volume info
Volume Name: gl
Type: Distribute
Status: Created
Number of Bricks: 5
Transport-type: tcp,rdma
Bricks:
Brick1: pbs1:/bducgl
Brick2: pbs2:/bducgl
Brick3: pbs2:/bducgl1
Brick4: pbs3:/bducgl
Brick5: pbs4:/bducgl
Options Reconfigured:
performance.io-thread-count: 64
auth.allow: 10.255.78.*,10.255.89.*,xxx.xxx.xx.*

on the client side, /var/log/glusterfs/gl.log says:

[2012-03-22 13:08:52.903423] W [fuse-bridge.c:2280:fuse_statfs_cbk] 0-glusterfs-fuse: 33: ERR = -1 (Transport endpoint is not connected)
[2012-03-22 13:08:53.681680] I [client.c:1885:client_rpc_notify] 0-gl-client-4: disconnected

On the server side, there are many lines of the format:

[2012-03-21 14:22:13.279799] W [rpc-transport.c:183:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to socket
[2012-03-21 14:22:13.487349] I [cli-rpc-ops.c:1000:gf_cli3_1_set_volume_cbk] 0-cli: Received resp to set
[2012-03-21 14:22:13.487548] I [input.c:46:cli_batch] 0-: Exiting with: 0
[2012-03-21 14:23:23.869580] W [rpc-transport.c:183:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to socket
[2012-03-21 14:23:24.74471] I [cli-rpc-ops.c:1000:gf_cli3_1_set_volume_cbk] 0-cli: Received resp to set
[2012-03-21 14:23:24.74668] I [input.c:46:cli_batch] 0-: Exiting with: 0
[2012-03-22 12:56:51.299987] W [rpc-transport.c:183:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to socket
[2012-03-22 12:56:51.425471] I [cli-rpc-ops.c:413:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2012-03-22 12:56:51.425694] I [cli-rpc-ops.c:606:gf_cli3_1_get_volume_cbk] 0-: Returning: 0
[2012-03-22 12:56:51.462537] I [cli-rpc-ops.c:413:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2012-03-22 12:56:51.462649] I [cli-rpc-ops.c:606:gf_cli3_1_get_volume_cbk] 0-: Returning: 0
[2012-03-22 12:56:51.462663] I [input.c:46:cli_batch] 0-: Exiting with: 0

This can't be normal -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] should be: gluster 3.31b*2* destroying mountpoints?
Resolution: As Brian noted, and Haris alluded to, these strange dirs are shown this way when a non-running glusterfs is mounted. When I started the gluster volume remotely, they immediately mounted, such that I had the same gluster volume mounted on both /gl and /gll (/gll was created so I could ignore the odd state of /gl for a while). I was able to umount the gluster vol from /gll and delete it normally. So now it seems to be working correctly. While this might be an edge case, I wonder if this might be detected and a relevant error emitted by the client (yes, easy for me to say..).

thanks,
hjm

On Thursday 22 March 2012 14:05:25 Brian Candler wrote:
On Thu, Mar 22, 2012 at 01:56:49PM -0700, Harry Mangalam wrote: Previous email had a typo in Subject line.
What do you mean by destroying the mountpoint? I have seen those d??? entries before (not with gluster). IIRC it's when a directory has 'read' but not 'execute' bits set:

$ mkdir foo
$ touch foo/bar
$ chmod 666 foo
$ ls -l foo
ls: cannot access foo/bar: Permission denied
total 0
-????????? ? ? ? ? ? bar

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) --
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
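P.S. For my own future reference, the sanity check I'll run before mounting from now on; a minimal sketch assuming my volume name 'gl' and server 'pbs1':

$ gluster volume info gl | grep Status    # should report 'Status: Started'
$ gluster volume start gl                 # only if it reports Created/Stopped
$ mount -t glusterfs pbs1:/gl /gl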
Re: [Gluster-users] Reducing a volume
Is this documented anywhere? I'm in a position where this would be very useful. I've posted a few queries about this as well, and if the function exists, it's an anonymous one. Last year, Amar responded that: "If you are in testing/validating phase, 3.3.0qa15 should have this feature for you." But there are no docs pointing to how to do it. If the development docs could be made available as a wiki or something similar, we stunt-testers could provide feedback (examples of gotchas and edge cases, and what works as expected) that might have value.

Best wishes, Harry

On Monday 13 February 2012 00:42:10 Brian Candler wrote:
On Mon, Feb 13, 2012 at 12:37:20AM +0100, Arnold Krille wrote: the third issue I encountered today: How do I tell gluster to remove two bricks of my six-brick-two-replica volume without losing data?
From what I've read (but I've not tried it yet), this will be a feature in 3.3: http://community.gluster.org/q/what-s-new-in-glusterfs-3-3/ "Remove-brick can migrate data to remaining bricks."
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Advertising the device from which your email was sent may reflect poorly on your imagination, self-image, and/or technical ability.
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
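P.S. For anyone searching the archive later: my guess at what the 3.3 invocation will look like, extrapolated from the start/status/commit pattern the CLI already uses for replace-brick, and NOT verified against 3.3.0qa15 (volume and brick names below are placeholders):

$ gluster volume remove-brick myvol server5:/brick server6:/brick start
$ gluster volume remove-brick myvol server5:/brick server6:/brick status    # poll until the data drain completes
$ gluster volume remove-brick myvol server5:/brick server6:/brick commit    # then actually drop the bricks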
Re: [Gluster-users] Reducing a volume
On Monday 13 February 2012 10:12:13 John Mark Walker wrote:
- Original Message -
"But there's no docs to point to how to do it. If the development docs could be made available as a wiki or something similar, we stunt-testers could provide feedback, examples of gotchas and edge cases, and what works as expected that might have value."
Greetings - yes, you are quite correct. We are in the process of documenting new features in 3.3 and putting them on the wiki. Are you volunteering to assist us in this process? :)

Yes, as implied in the post - I'd be happy to.

Have you tried out any of the new QA builds to test?

I'm running 3.3b1 and will be upgrading to 3.3b2 later this week if there are no other disasters taking precedence. If there's a place to get more recent versions, I'll try those as well.

Best, Harry

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Advertising the device from which your email was sent may reflect poorly on your imagination, self-image, and/or technical ability.
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] Preliminary 3.3 Admin/User docs?
There have been quite a lot of hints and suggestions about what 3.3 will provide. I'm running 3.3b1, but am having trouble figuring out just what features it has and how to exploit them. In particular, the ability to force a brick to unload is apparently available in 3.3.0qa15 (ref below), but how does one go about doing this, other than reading source code? Are there preliminary Admin docs or a wiki where this info is being assembled? I looked thru the docs on the gluster site, but there doesn't seem to be much in the way of docs such as the very useful, but now a bit dated, 'Gluster_FS_3.2_Admin_Guide.pdf'.

The Gluster Management Console is now in early release:
http://download.gluster.org/pub/gluster/glustermc/1.0/Documentation/User_Guide/html/index.html
and has some docs - is this what we're supposed to use to manipulate our glusterfs's?

best wishes
harry

On Friday 09 December 2011 00:23:12 Amar Tumballi wrote:
Is there a process whereby I can clear a brick by forcing the files to migrate to the other bricks?
Hi Harry, This feature got committed to the master branch (upstream) recently, with which a remove-brick will take care of migrating data out of the brick. This feature is not part of any of the current 3.2.x (or earlier) releases. If you are in the testing/validating phase, 3.3.0qa15 should have this feature for you.

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Advertising the device from which your email was sent may reflect poorly on your imagination, self-image, and/or technical ability.
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] what happens when adding a pre-populated brick?
Thanks very much for the answer, Jeff. Lacking further info, I'll hang out on the gluster irc and, over the break, try it out on an experimental volume.

hjm

On Friday 16 December 2011 07:43:06 Jeff White wrote:
I have been told this is do-able but I haven't tested it much myself. I would be interested in hearing about this from anyone who has done it. From what I heard on irc you can do the following:
1. Have existing data in server1:/data1
2. Stop all changes to server1:/data1
3. Create a volume: gluster volume create <volname> server1:/data1 server2:/data2
4. Mount the new volume via FUSE
5. Trigger a self-heal: find <gluster-mount> -noleaf -print0 | xargs --null stat >/dev/null
I also heard that there could be a GFID problem if the data was previously in another Gluster volume.
Jeff White - Linux/Unix Systems Engineer, University of Pittsburgh - CSSD

On 12/15/2011 01:48 PM, Harry Mangalam wrote:
The use case is that we have a multiTB data partition that we would like to glusterize. Could we add that store to a gluster volume and have it explicitly rebalance across the gluster volume? Or would the existing files/layout be ignored? This would be a big selling point in justifying gluster to owners of large existing data stores.
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
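P.S. For my own notes, Jeff's sequence written out end-to-end; a sketch only, with my own placeholder volume name ('premade') and mount path, untested on real data:

# on server1, with all writers to /data1 stopped
gluster volume create premade server1:/data1 server2:/data2
gluster volume start premade
# on a client
mount -t glusterfs server1:/premade /mnt/premade
# walk every file to trigger lookup/self-heal
find /mnt/premade -noleaf -print0 | xargs --null stat >/dev/null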
[Gluster-users] what happens when adding a pre-populated brick?
The use case is that we have a multiTB data partition that we would like to glusterize. Could we add that store to a gluster volume and have it explicitly rebalance across the gluster volume? Or would the existing files/layout be ignored? This would be a big selling point in justifying gluster to owners of large existing data stores.

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] where's the API docs?
This still works for me:
http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/

hjm

On Wednesday 14 December 2011 11:54:26 Redding, Erik wrote:
I'm working on a proposal for a 1pb Gluster rig and I wanted to be able to present programmers with the API docs, to sweeten the deal and to be able to programmatically manage it. I assumed there was a REST API for interfacing with the filesystem, either management or actual file i/o, because it appears in the feature list from time to time. I'm trying to dig around on how to pull the 3.3 beta, but with the Red Hat transition I'm finding all of those offerings have disappeared. I'm digging around and
Erik Redding, Systems Programmer, RHCE, Core Systems, Texas State University-San Marcos

On Dec 14, 2011, at 11:19 AM, John Mark Walker wrote:
- Original Message -
Ah - OK, thanks, Jeff. I was looking for the Swift and REST API docs. I assumed there were API interfaces in 3.2.5. I'll go dig up some roadmap info.
I guess the first question to ask is, what are you looking to do? -JM

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] mixing rdma/tcp bricks, rebalance operation locked up
I built an rdma-based volume out of 5 bricks:

$ gluster volume info
Volume Name: glrdma
Type: Distribute
Status: Started
Number of Bricks: 5
Transport-type: rdma
Bricks:
Brick1: pbs1:/data2
Brick2: pbs2:/data2
Brick3: pbs3:/data2
Brick4: pbs3:/data
Brick5: pbs4:/data
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on

and everything was working well. I then tried to add a TCP/socket brick to it, thinking that it would be refused, but gluster happily added it:

$ gluster volume info
Volume Name: glrdma
Type: Distribute
Status: Started
Number of Bricks: 6
Transport-type: rdma
Bricks:
Brick1: pbs1:/data2
Brick2: pbs2:/data2
Brick3: pbs3:/data2
Brick4: pbs3:/data
Brick5: pbs4:/data
Brick6: dabrick:/data2 -- the TCP/socket brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on

However, not too surprisingly, there are problems when I try to rebalance onto the added brick. It allowed me to start a rebalance/fix-layout, but it never ended, and the logs continue to contain the following reports of 'connection refused' (see at bottom). Attempts to remove the TCP brick are unsuccessful, even after stopping the volume:

$ gluster volume stop glrdma
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Stopping volume glrdma has been successful
$ gluster volume remove-brick glrdma dabrick:/data2
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Remove Brick unsuccessful

(with more errors citing "missing 'option transport-type'. defaulting to socket"):

[2011-12-13 10:34:57.241676] I [cli-rpc-ops.c:1073:gf_cli3_1_remove_brick_cbk] 0-cli: Received resp to remove brick
[2011-12-13 10:34:57.241852] I [input.c:46:cli_batch] 0-: Exiting with: -1
[2011-12-13 10:46:08.937294] W [rpc-transport.c:606:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to socket
[2011-12-13 10:46:09.110636] I [cli-rpc-ops.c:417:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2011-12-13 10:46:09.110845] I [cli-rpc-ops.c:596:gf_cli3_1_get_volume_cbk] 0-: Returning: 0
[2011-12-13 10:46:09.111038] I [cli-rpc-ops.c:417:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2011-12-13 10:46:09.111070] I [cli-rpc-ops.c:596:gf_cli3_1_get_volume_cbk] 0-: Returning: 0
[2011-12-13 10:46:09.111080] I [input.c:46:cli_batch] 0-: Exiting with: 0
[2011-12-13 10:52:18.142283] W [rpc-transport.c:606:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to socket

And the rebalance operations now seem to be locked up, since the responses to the rebalance commands are nonsensical (the commands were given serially, with no other intervening commands):

$ gluster volume rebalance glrdma fix-layout start
Rebalance on glrdma is already started
$ gluster volume rebalance glrdma fix-layout status
rebalance stopped
$ gluster volume rebalance glrdma fix-layout stop
stopped rebalance process of volume glrdma (after rebalancing 0 files totaling 0 bytes)
$ gluster volume rebalance glrdma fix-layout start
Rebalance on glrdma is already started

Is there a way to back out of this situation? Or has incorrectly adding the TCP brick permanently hosed the volume? And does this imply a bug in the add-brick routine? (hopefully fixed?)
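In case the answer turns out to be "start over", my untested plan B is to recreate the volume with both transports declared up front, using the transport syntax that volume create already accepts (and assuming I can first delete the volume and clean the bricks):

$ gluster volume create glrdma transport tcp,rdma \
    pbs1:/data2 pbs2:/data2 pbs3:/data2 pbs3:/data pbs4:/data dabrick:/data2
$ gluster volume start glrdma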
Log extracts -- tc-glusterd-mount-glrdma.log (and nfs.log, even tho I haven't tried to export it via nfs) has zillions of these lines:

[2011-12-13 10:36:11.702130] E [rdma.c:4417:tcp_connect_finish] 0-glrdma-client-5: tcp connect to failed (Connection refused)

cli.log has many of these lines:

[2011-12-13 10:34:55.142428] W [rpc-transport.c:606:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to socket

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] forcing a brick to unload storage to replace disks?
Hi All,

More time for gluster; after much travel in the twisty dark tunnels of OFED, IB, card firmware upgrades, OS compatibility, etc, I now have a distributed rdma volume over 5 bricks (2 on one server) and it seems to be working well. I would now like to force-unload one brick to emulate a disk-upgrade process. Here's my vol info:

---
Thu Dec 08 11:44:05 [0.08 0.05 0.01] root@pbs3:~
522 $ gluster volume info
Volume Name: glrdma
Type: Distribute
Status: Started
Number of Bricks: 5
Transport-type: rdma
Bricks:
Brick1: pbs1:/data2
Brick2: pbs2:/data2
Brick3: pbs3:/data2
Brick4: pbs3:/data
Brick5: pbs4:/data
---

From the Admin doc, I can do a 'replace-brick' operation, but that seems to require an unused brick: when I try it with an already-incorporated brick, gluster complains that:

---
Thu Dec 08 11:52:12 [0.00 0.01 0.00] root@pbs3:~
524 $ gluster volume replace-brick glrdma pbs2:/data2 pbs4:/data start
Brick: pbs4:/data already in use
---

Is there a process whereby I can clear a brick by forcing the files to migrate to the other bricks?

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
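P.S. For completeness: replace-brick should work if I had a genuinely unused brick to swap in (pbs4:/data3 below is a hypothetical spare; the start/status/commit pattern is from the 3.2 Admin guide), but that's a one-for-one swap, not the shrink I'm after:

$ gluster volume replace-brick glrdma pbs2:/data2 pbs4:/data3 start
$ gluster volume replace-brick glrdma pbs2:/data2 pbs4:/data3 status    # poll until migration completes
$ gluster volume replace-brick glrdma pbs2:/data2 pbs4:/data3 commit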
Re: [Gluster-users] setting up a server
See Chapter 6 of the Admin guide: "6. Setting Up GlusterFS Server Volumes". This assumes you've dedicated servers to be gluster bricks. If not, that's obviously 1st. Then install the gluster software, then on to chapter 6.

hjm

On Monday 28 November 2011 13:07:07 Steven Jones wrote:
Hi, I have the install guide and the admin guide; nothing in either that I can see from the contents pages tells me how to create the first server (I assume that is what I have to do?). Is there another doc I'm missing? Or a good URL for a howto on a redhat-based machine? Also, from what I can see the free version is cli only? And there is no virtual (vmware) appliance?
regards, Steven Jones, Technical Specialist - Linux RHCE, Victoria University, Wellington, NZ, 0064 4 463 6272
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
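P.S. To make "on to chapter 6" concrete, the bare-bones first-volume sequence looks something like this; hostnames, brick paths, and the volume name are placeholders, and the service invocation may differ by distro:

# on each server, after installing the packages
service glusterd start
# from server1, join the second server to the pool
gluster peer probe server2
# create and start a simple distributed volume
gluster volume create testvol server1:/export/brick1 server2:/export/brick1
gluster volume start testvol
# on a client
mount -t glusterfs server1:/testvol /mnt/testvol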
Re: [Gluster-users] OT: possible to upgrade mellanox HBA firmware from ubuntu?
I can verify that the CentOS approach works as Dan described it, with the caveats prefixed by ##hjm, using a testbed machine running a fresh install of CentOS6. Thanks to Dan for such a nice outline.

hjm

On Tuesday 08 November 2011 12:11:44 Dan Cyr wrote:
Harry, I had these same problems when I upgraded my firmware. Here's what I ended up doing (CentOS6 or RHEL6, *not 6.1*):

If you use the default kernel, 2.6.32_71.el6 (don't run yum update), then you can use the kernel-mft RPM included with OFED (kernel-mft-2.7.0-2.6.32_71.el6.x86_64.x86_64.rpm):

yum -y install mstflint
wget http://www.mellanox.com/downloads/ofed/MLNX_OFED_LINUX-1.5.3-1.0.0-rhel6-x86_64.iso
mkdir -p /mnt/iso
mount -o loop MLNX_OFED_LINUX-1.5.3-1.0.0-rhel6-x86_64.iso /mnt/iso/
yum -y install /mnt/iso/RPMS/kernel-mft-2.7.0-2.6.32_71.el6.x86_64.x86_64.rpm --nogpgcheck
##hjm: you need to install the mst app from the mft rpm that is
##hjm: distributed in the above iso image:
##hjm: sudo rpm -i /mnt/iso/RPMS/mft-2.7.0-20.x86_64.rpm
mst start
ls /dev/mst/
# Verify the proper device is there (*_pciconf?) - if there are none, this needs to be resolved before continuing.
##hjm: this worked fine:
ls /dev/mst/
mt25204_pciconf0 mt25204_pci_cr0

mkdir -p /usr/src/firmware
cd /usr/src/firmware
# My cards are SuperMicro UIO - replace the below commands appropriately
#wget ftp://ftp.supermicro.com/Firmware/InfiniBand/AOC-UINF-m2/aocuinfm2_20090609.zip
#unzip aocuinfm2_20090609.zip
# -allow_psid_change might not be required - check with your hardware vendor. For me I found this link: http://64.174.237.178/support/faqs/faq.cfm?faq=9803
#flint -d /dev/mst/mt25418_pciconf0 -i aocuinfm2_20090609.bin -nofs -allow_psid_change burn
#reboot

===
##hjm: found the latest firmware upgrade here:
http://www.mellanox.com/content/pages.php?pg=custom_firmware_table
And once you get the appropriate package and download and unpack it, the VERY latest firmware needs to be created with mlxburn like this:

$ cd ~hjm/fw-25204-rel-1_2_940
$ sudo mlxburn -fw ./fw-25204-rel.mlx -dev /dev/mst/mt25204_pci_cr0 -nofs

## and it seems to work:
$ flint -d /dev/mst/mt25204_pciconf0 q
Image type: Failsafe
FW Version: 1.2.940 ---VERY latest fw.
I.S. Version: 1
Device ID: 25204
Description: Node / Port1 / Sys image
GUIDs: 00066a0098006e5f 00066a00a0006e5f 00066a0098006e5f
Board ID: j (MT_023002)
VSD: j
PSID: MT_023002
===

So with that, good luck.
Dan

From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of Harry Mangalam
Sent: Tuesday, November 08, 2011 11:45 AM
To: gluster-users@gluster.org
Subject: [Gluster-users] OT: possible to upgrade mellanox HBA firmware from ubuntu?

Sorry for the OT, but this problem is preventing my further testing of gluster and this group seems like it might know. I've been looking into this for a few days and have not run across any success stories in upgrading firmware using Ubuntu (Mellanox supports RH and SuSE); I've tried to install the MFT/MST packages, but it keeps erroring out for various reasons. If it's not possible (or very painful), I'll upgrade the card firmware via a LiveCD/DVD of SciLinux or CentOS on another machine. Other suggestions for upgrading the IB card firmware gratefully accepted.

Harry

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED! ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] OT: possible to upgrade mellanox HBA firmware from ubuntu?
Sorry for the OT, but this problem is preventing my further testing of gluster and this group seems like it might know. I've been looking into this for a few days and have not run across any success stories in upgrading firmware using Ubuntu (Mellanox supports RH and SuSE); I've tried to install the MFT/MST packages, but it keeps erroring out for various reasons. If it's not possible (or very painful), I'll upgrade the card firmware via a LiveCD/DVD of SciLinux or CentOS on another machine. Other suggestions for upgrading the IB card firmware gratefully accepted.

Harry

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users