Re: [Gluster-users] gluster fails under heavy array job load load

2013-12-13 Thread harry mangalam
Hi Alex,

Thanks for taking the time to think about this.

I don't have metrics at hand, but I tend to think not, for two reasons.
- when I have looked at stats from the network, it has never been close to 
saturating; the bottlenecks appear to be mostly on the gluster server side.
I get emailed if my servers go above a load of 8 (the servers have 8 cores) 
and when that happens, I often get complaints from users that they've had 
incomplete runs.

At these points the network load is often fairly high (1GB/s, aggregate), but 
on a QDR network, that shouldn't be saturating.
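
For reference, the kind of spot check I mean is nothing fancier than something 
like the following, run on a gluster server while a heavy array job is active 
(the IPoIB interface name ib0 is an assumption):

uptime                  # load average vs. the 8 cores per server
sar -n DEV 1 10         # per-interface RX/TX throughput; watch the IPoIB interface
ip -s link show ib0     # cumulative packet and error counters on that interface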

- the same jobs, when run using another distributed FS on the same IB 
fabric, show no such behavior, which tends to point the fault at gluster 
or (granted) my configuration of it.

- while a lot of the IO load is large streaming R/W, there is a subset of 
jobs whose users insist on using Zillions of Tiny (ZOT) files as output - they 
use the file names as indices or as table row entries.  (One user had >20M 
files in a tree.)  We're trying to educate them, but it takes time and energy.
Gluster seems to have a lot of trouble traversing these huge file trees, 
more so than DFSs that use metadata servers (see the rough comparison below).
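
As a rough illustration of what I mean - both paths below are placeholders for 
the same ZOT tree reached once through the gluster mount and once through the 
other DFS - a metadata-heavy walk is where the difference shows up:

time find /gluster-scratch/zot-tree -type f | wc -l    # through the gluster mount
time find /other-dfs/zot-tree -type f | wc -l          # same tree via the other DFS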

That said, it has been stable otherwise and there are a lot of things to 
recommend it.

hjm





On Friday, December 13, 2013 02:00:19 PM Alex Chekholko wrote:
> Hi Harry,
> 
> My best guess is that you overloaded your interconnect.  Do you have
> metrics for if/when your network was saturated?  That would cause
> Gluster clients to time out.
> 
> My best guess is that you went into the "E" state of your "USE
> (Utilization, Saturation, Error)" spectrum.
> 
> IME, that is a common pattern for our Lustre/GPFS clients: you get all
> kinds of weird error states if you manage to saturate your I/O for an
> extended period of time and fill all of the buffers everywhere.
> 
> Regards,
> Alex
> 
> On 12/12/2013 05:03 PM, harry mangalam wrote:
> > Short version: Our gluster fs (~340TB) provides scratch space for a
> > ~5000core academic compute cluster.
> > 
> > Much of our load is streaming IO, doing a lot of genomics work, and that
> > is the load under which we saw this latest failure.

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---

Re: [Gluster-users] gluster fails under heavy array job load load

2013-12-13 Thread Lists

On 12/13/2013 02:00 PM, Alex Chekholko wrote:


My best guess is that you overloaded your interconnect.  Do you have 
metrics for if/when your network was saturated?  That would cause 
Gluster clients to time out.


My best guess is that you went into the "E" state of your "USE 
(Utilization, Saturation, Error)" spectrum.


IME, that is a common pattern for our Lustre/GPFS clients: you get all 
kinds of weird error states if you manage to saturate your I/O for an 
extended period of time and fill all of the buffers everywhere.


When we tried to roll out GlusterFS for a production environment a few 
years ago, we ran into exactly this problem. Our scenario was a 
multi-master cluster, and the worst part appeared to be log files. Any 
time a host wrote to a log file, that file had to be synchronized, and 
since there were multiple masters this very quickly clogged our 
interconnect and ended the rollout.


We ended up rolling back GlusterFS for this purpose and moved to a 
distributed, asynchronous logging system rolled in-house that used Linux 
kernel message queues, with the understanding that replicated log files 
would see a small amount of jitter and out-of-order appearance between 
hosts. While this may sound cavalier, all log entries carry a timestamp 
anyway, so it's all good, and it has worked well for us.
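
To make the "timestamps make it all good" point concrete: as long as each host's 
log is internally ordered, something like the following (the paths are made up) 
reassembles a single time-ordered view after the fact:

# merge per-host logs that are already time-ordered locally, sorting on the
# leading date and time fields
sort -m -k1,1 -k2,2 /var/log/app/host*.log > merged.log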


It may be that this has been fixed recently, but it's a use case I 
thought might warrant consideration.


-Ben




Re: [Gluster-users] gluster fails under heavy array job load load

2013-12-13 Thread Alex Chekholko

Hi Harry,

My best guess is that you overloaded your interconnect.  Do you have 
metrics for if/when your network was saturated?  That would cause 
Gluster clients to time out.


My best guess is that you went into the "E" state of your "USE 
(Utilization, Saturation, Error)" spectrum.


IME, that is a common pattern for our Lustre/GPFS clients: you get all 
kinds of weird error states if you manage to saturate your I/O for an 
extended period of time and fill all of the buffers everywhere.


Regards,
Alex


On 12/12/2013 05:03 PM, harry mangalam wrote:

Short version: Our gluster fs (~340TB) provides scratch space for a
~5000core academic compute cluster.

Much of our load is streaming IO, doing a lot of genomics work, and that
is the load under which we saw this latest failure.



--
Alex Chekholko ch...@stanford.edu


Re: [Gluster-users] gluster fails under heavy array job load load

2013-12-13 Thread harry mangalam
On Thursday, December 12, 2013 11:46:03 PM Anand Avati wrote:
> - I see RDMA is enabled on the volume. Are you mounting clients through
> RDMA? If so, for the purpose of diagnostics can you mount through TCP and
> check whether the stability improves? If you are using RDMA with such a high
> write-behind-window-size, spurious ping-timeouts are an almost certainty
> during heavy writes. The RDMA driver has limited flow control, and setting
> such a high window-size can easily congest all the RDMA buffers resulting
> in spurious ping-timeouts and disconnections. 

Is there a way to remove the RDMA transport option once it has been enabled?  I was 
under the impression that our system was NOT using RDMA, but from the logs, I 
see the following, which implies that they /are/ using RDMA now.

==> 10.2.7.11 <==
  4: option transport-type socket,rdma
[2013-12-10 17:42:12.498076] I [glusterd-pmap.c:227:pmap_registry_bind] 0-
pmap: adding brick /raid1.rdma on port 49153
[2013-12-10 17:42:15.571287] I [glusterd-pmap.c:227:pmap_registry_bind] 0-
pmap: adding brick /raid2.rdma on port 49155

==> 10.2.7.12 <==
  4: option transport-type socket,rdma
[2013-12-10 17:42:17.974841] I [glusterd-pmap.c:227:pmap_registry_bind] 0-
pmap: adding brick /raid1.rdma on port 49153
[2013-12-10 17:42:21.266486] I [glusterd-pmap.c:227:pmap_registry_bind] 0-
pmap: adding brick /raid2.rdma on port 49155

==> 10.2.7.13 <==
  4: option transport-type socket,rdma
[2013-12-10 17:42:17.929753] I [glusterd-pmap.c:227:pmap_registry_bind] 0-
pmap: adding brick /raid1.rdma on port 49153
[2013-12-10 17:42:21.646482] I [glusterd-pmap.c:227:pmap_registry_bind] 0-
pmap: adding brick /raid2.rdma on port 49155

==> 10.2.7.14 <==
  4: option transport-type socket,rdma
[2013-12-10 17:42:15.791176] I [glusterd-pmap.c:227:pmap_registry_bind] 0-
pmap: adding brick /raid1.rdma on port 49153
[2013-12-10 17:42:15.941182] I [glusterd-pmap.c:227:pmap_registry_bind] 0-
pmap: adding brick /raid2.rdma on port 49155
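
As a cross-check, something like this should show what the volume advertises 
versus what a FUSE client actually negotiated (gl.log is the client log path 
from the original post):

gluster volume info gl | grep '^Transport-type'            # what the volume advertises
grep -i 'transport' /var/log/glusterfs/gl.log | tail -n 5  # what a client mount logged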


---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---

Re: [Gluster-users] gluster fails under heavy array job load load

2013-12-13 Thread harry mangalam

Bug 1043009 Submitted


On Thursday, December 12, 2013 11:46:03 PM Anand Avati wrote:
> Please provide the full client and server logs (in a bug report). The
> snippets give some hints, but are not very meaningful without the full
> context/history since mount time (they have after-the-fact symptoms, but
> not the part which show the reason why disconnects happened).
> 
> Even before looking into the full logs here are some quick observations:
> 
> - write-behind-window-size = 1024MB seems *excessively* high. Please set
> this to 1MB (default) and check if the stability improves.
> 
> - I see RDMA is enabled on the volume. Are you mounting clients through
> RDMA? If so, for the purpose of diagnostics can you mount through TCP and
> check whether the stability improves? If you are using RDMA with such a high
> write-behind-window-size, spurious ping-timeouts are an almost certainty
> during heavy writes. The RDMA driver has limited flow control, and setting
> such a high window-size can easily congest all the RDMA buffers resulting
> in spurious ping-timeouts and disconnections.
> 
> Avati
> 
> On Thu, Dec 12, 2013 at 5:03 PM, harry mangalam wrote:
> >  Hi All,
> > 
> > (Gluster Volume Details at bottom)
> > 
> > 
> > 
> > I've posted some of this previously, but even after various upgrades,
> > attempted fixes, etc, it remains a problem.
> > 
> > 
> > 
> > 
> > 
> > Short version: Our gluster fs (~340TB) provides scratch space for a
> > ~5000core academic compute cluster.
> > 
> > Much of our load is streaming IO, doing a lot of genomics work, and that
> > is the load under which we saw this latest failure.
> > 
> > Under heavy batch load, especially array jobs, where there might be
> > several 64core nodes doing I/O on the 4servers/8bricks, we often get job
> > failures that have the following profile:
> > 
> > 
> > 
> > Client POV:
> > 
> > Here is a sampling of the client logs (/var/log/glusterfs/gl.log) for all
> > compute nodes that indicated interaction with the user's files
> > 
> > 
> > 
> > 
> > 
> > Here are some client Info logs that seem fairly serious:
> > 
> > 
> > 
> > 
> > 
> > The errors that referenced this user were gathered from all the nodes that
> > were running his code (in compute*) and agglomerated with:
> > 
> > 
> > 
> > cut -f2,3 -d']' compute* |cut -f1 -dP | sort | uniq -c | sort -gr
> > 
> > 
> > 
> > and placed here to show the profile of errors that his run generated.
> > 
> > 
> > 
> > 
> > 
> > so 71 of them were:
> > 
> > W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-7: remote
> > operation failed: Transport endpoint is not connected.
> > 
> > etc
> > 
> > 
> > 
> > We've seen this before and previously discounted it bc it seems to have
> > been related to the problem of spurious NFS-related bugs, but now I'm
> > wondering whether it's a real problem.
> > 
> > Also the 'remote operation failed: Stale file handle. ' warnings.
> > 
> > 
> > 
> > There were no Errors logged per se, tho some of the W's looked fairly
> > nasty, like the 'dht_layout_dir_mismatch'
> > 
> > 
> > 
> > From the server side, however, during the same period, there were:
> > 
> > 0 Warnings about this user's files
> > 
> > 0 Errors
> > 
> > 458 Info lines
> > 
> > of which only 1 line was not a 'cleanup' line like this:
> > 
> > ---
> > 
> > 10.2.7.11:[2013-12-12 21:22:01.064289] I
> > [server-helpers.c:460:do_fd_cleanup] 0-gl-server: fd cleanup on
> > /path/to/file
> > 
> > ---
> > 
> > it was:
> > 
> > ---
> > 
> > 10.2.7.14:[2013-12-12 21:00:35.209015] I
> > [server-rpc-fops.c:898:_gf_server_log_setxattr_failure] 0-gl-server:
> > 113697332: SETXATTR /bio/tdlong/RNAseqIII/ckpt.1084030
> > (c9488341-c063-4175-8492-75e2e282f690) ==> trusted.glusterfs.dht
> > 
> > ---
> > 
> > 
> > 
> > We're losing about 10% of these kinds of array jobs bc of this, which is
> > just not supportable.
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Gluster details
> > 
> > 
> > 
> > servers and clients running gluster 3.4.0-8.el6 over QDR IB, IPoIB, thru 2
> > Mellanox, 1 Voltaire switches, Mellanox cards, CentOS 6.4
> > 
> > 
> > 
> > $ gluster volume info
> > 
> >  Volume Name: gl
> > 
> > Type: Distribute
> > 
> > Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
> > 
> > Status: Started
> > 
> > Number of Bricks: 8
> > 
> > Transport-type: tcp,rdma
> > 
> > Bricks:
> > 
> > Brick1: bs2:/raid1
> > 
> > Brick2: bs2:/raid2
> > 
> > Brick3: bs3:/raid1
> > 
> > Brick4: bs3:/raid2
> > 
> > Brick5: bs4:/raid1
> > 
> > Brick6: bs4:/raid2
> > 
> > Brick7: bs1:/raid1
> > 
> > Brick8: bs1:/raid2
> > 
> > Options Reconfigured:
> > 
> > performance.write-behind-window-size: 1024MB
> > 
> > performance.flush-behind: on
> > 
> > performance.cache-size: 268435456
> > 
> > nfs.disable: on
> > 
> > performance.io-cache: on
> > 
> > performance.quick-read: on
> > 
> > performance.io-thread-count: 64
> > 
> > auth.allow: 10.2.*.*,10.1.*.*
> > 
> > 
> > 
> > 

Re: [Gluster-users] gluster fails under heavy array job load load

2013-12-12 Thread Anand Avati
Please provide the full client and server logs (in a bug report). The
snippets give some hints, but are not very meaningful without the full
context/history since mount time (they have after-the-fact symptoms, but
not the part which show the reason why disconnects happened).

Even before looking into the full logs here are some quick observations:

- write-behind-window-size = 1024MB seems *excessively* high. Please set
this to 1MB (default) and check if the stability improves.

- I see RDMA is enabled on the volume. Are you mounting clients through
RDMA? If so, for the purpose of diagnostics can you mount through TCP and
check whether the stability improves? If you are using RDMA with such a high
write-behind-window-size, spurious ping-timeouts are an almost certainty
during heavy writes. The RDMA driver has limited flow control, and setting
such a high window-size can easily congest all the RDMA buffers resulting
in spurious ping-timeouts and disconnections.
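
Concretely - assuming the volume name gl from the report quoted below, and a 
placeholder mount point - those two suggestions amount to something like the 
following; the exact mount syntax for choosing TCP vs RDMA can vary between 
3.4 minor releases:

# put write-behind back to its default
gluster volume set gl performance.write-behind-window-size 1MB

# on a client, remount over TCP for the test; with transport tcp,rdma the plain
# volume name normally selects TCP, while VOLNAME.rdma selects the RDMA transport
umount /mnt/gl
mount -t glusterfs bs1:/gl /mnt/gl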

Avati


On Thu, Dec 12, 2013 at 5:03 PM, harry mangalam wrote:

>  Hi All,
>
> (Gluster Volume Details at bottom)
>
>
>
> I've posted some of this previously, but even after various upgrades,
> attempted fixes, etc, it remains a problem.
>
>
>
>
>
> Short version: Our gluster fs (~340TB) provides scratch space for a
> ~5000core academic compute cluster.
>
> Much of our load is streaming IO, doing a lot of genomics work, and that
> is the load under which we saw this latest failure.
>
> Under heavy batch load, especially array jobs, where there might be
> several 64core nodes doing I/O on the 4servers/8bricks, we often get job
> failures that have the following profile:
>
>
>
> Client POV:
>
> Here is a sampling of the client logs (/var/log/glusterfs/gl.log) for all
> compute nodes that indicated interaction with the user's files
>
> 
>
>
>
> Here are some client Info logs that seem fairly serious:
>
> 
>
>
>
> The errors that referenced this user were gathered from all the nodes that
> were running his code (in compute*) and agglomerated with:
>
>
>
> cut -f2,3 -d']' compute* |cut -f1 -dP | sort | uniq -c | sort -gr
>
>
>
> and placed here to show the profile of errors that his run generated.
>
> 
>
>
>
> so 71 of them were:
>
> W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-7: remote
> operation failed: Transport endpoint is not connected.
>
> etc
>
>
>
> We've seen this before and previously discounted it bc it seems to have
> been related to the problem of spurious NFS-related bugs, but now I'm
> wondering whether it's a real problem.
>
> Also the 'remote operation failed: Stale file handle. ' warnings.
>
>
>
> There were no Errors logged per se, tho some of the W's looked fairly
> nasty, like the 'dht_layout_dir_mismatch'
>
>
>
> From the server side, however, during the same period, there were:
>
> 0 Warnings about this user's files
>
> 0 Errors
>
> 458 Info lines
>
> of which only 1 line was not a 'cleanup' line like this:
>
> ---
>
> 10.2.7.11:[2013-12-12 21:22:01.064289] I
> [server-helpers.c:460:do_fd_cleanup] 0-gl-server: fd cleanup on
> /path/to/file
>
> ---
>
> it was:
>
> ---
>
> 10.2.7.14:[2013-12-12 21:00:35.209015] I
> [server-rpc-fops.c:898:_gf_server_log_setxattr_failure] 0-gl-server:
> 113697332: SETXATTR /bio/tdlong/RNAseqIII/ckpt.1084030
> (c9488341-c063-4175-8492-75e2e282f690) ==> trusted.glusterfs.dht
>
> ---
>
>
>
> We're losing about 10% of these kinds of array jobs bc of this, which is
> just not supportable.
>
>
>
>
>
>
>
> Gluster details
>
>
>
> servers and clients running gluster 3.4.0-8.el6 over QDR IB, IPoIB, thru 2
> Mellanox, 1 Voltaire switches, Mellanox cards, CentOS 6.4
>
>
>
> $ gluster volume info
>
>  Volume Name: gl
>
> Type: Distribute
>
> Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
>
> Status: Started
>
> Number of Bricks: 8
>
> Transport-type: tcp,rdma
>
> Bricks:
>
> Brick1: bs2:/raid1
>
> Brick2: bs2:/raid2
>
> Brick3: bs3:/raid1
>
> Brick4: bs3:/raid2
>
> Brick5: bs4:/raid1
>
> Brick6: bs4:/raid2
>
> Brick7: bs1:/raid1
>
> Brick8: bs1:/raid2
>
> Options Reconfigured:
>
> performance.write-behind-window-size: 1024MB
>
> performance.flush-behind: on
>
> performance.cache-size: 268435456
>
> nfs.disable: on
>
> performance.io-cache: on
>
> performance.quick-read: on
>
> performance.io-thread-count: 64
>
> auth.allow: 10.2.*.*,10.1.*.*
>
>
>
>
>
> 'gluster volume status gl detail':
>
> 
>
>
>
> ---
>
> Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
>
> [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
>
> 415 South Circle View Dr, Irvine, CA, 92697 [shipping]
>
> MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
>
> ---
>
>
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>

[Gluster-users] gluster fails under heavy array job load load

2013-12-12 Thread harry mangalam
Hi All,
(Gluster Volume Details at bottom)

I've posted some of this previously, but even after various upgrades, 
attempted fixes, etc, it remains a problem.


Short version:  Our gluster fs (~340TB) provides scratch space for a ~5000-core 
academic compute cluster.  
Much of our load is streaming IO, doing a lot of genomics work, and that is 
the load under which we saw this latest failure.
Under heavy batch load, especially array jobs, where there might be several 
64-core nodes doing I/O against the 4 servers / 8 bricks, we often get job 
failures with the following profile:

Client POV:
Here is a sampling of the client logs (/var/log/glusterfs/gl.log) for all 
compute nodes that indicated interaction with the user's files


Here are some client Info logs that seem fairly serious:


The errors that referenced this user were gathered from all the nodes that 
were running his code (in compute*) and agglomerated with:

cut -f2,3 -d']' compute* |cut -f1 -dP | sort | uniq -c | sort -gr 

and placed here to show the profile of errors that his run generated.
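
Roughly, what that one-liner does (the field choices match the log lines quoted 
just below; the cut at 'P' truncates each message at the first capital 'P', which 
in these logs is typically a trailing per-file 'Path: ...', so identical error 
classes collapse into one count):

#  cut -f2,3 -d']'  : split on ']' and keep fields 2 and 3, i.e. drop the leading
#                     timestamp and keep the severity, source function and message
#  cut -f1 -dP      : truncate each message at the first 'P'
#  sort | uniq -c   : count each distinct message class
#  sort -gr         : list the most frequent class first
cut -f2,3 -d']' compute* | cut -f1 -dP | sort | uniq -c | sort -gr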


so 71 of them were:
  W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-7: remote 
operation failed: Transport endpoint is not connected. 
etc

We've seen this before and previously discounted it bc it seems to have been 
related to the problem of spurious NFS-related bugs, but now I'm wondering 
whether it's a real problem. 
Also the 'remote operation failed: Stale file handle. ' warnings.

There were no Errors logged per se, tho some of the W's looked fairly nasty, 
like the 'dht_layout_dir_mismatch'

From the server side, however, during the same period, there were:
0 Warnings about this user's files
0 Errors 
458 Info lines
of which only 1 line was not a 'cleanup' line like this:
---
10.2.7.11:[2013-12-12 21:22:01.064289] I [server-helpers.c:460:do_fd_cleanup] 
0-gl-server: fd cleanup on /path/to/file
---
it was:
---
10.2.7.14:[2013-12-12 21:00:35.209015] I [server-rpc-
fops.c:898:_gf_server_log_setxattr_failure] 0-gl-server: 113697332: SETXATTR 
/bio/tdlong/RNAseqIII/ckpt.1084030 (c9488341-c063-4175-8492-75e2e282f690) ==> 
trusted.glusterfs.dht
---
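
For what it's worth, one way to see whether that SETXATTR on trusted.glusterfs.dht 
(and the dht_layout_dir_mismatch warnings above) corresponds to an actual layout 
problem is to compare the layout xattr for the directory across all bricks; the 
brick hosts and paths below come from the volume info, the directory from the log 
line above, and this needs root on the servers:

# print the trusted.glusterfs.dht layout xattr for the same directory on every brick
for h in bs1 bs2 bs3 bs4; do
  for b in raid1 raid2; do
    echo "== $h:/$b =="
    ssh $h getfattr -n trusted.glusterfs.dht -e hex /$b/bio/tdlong/RNAseqIII 2>/dev/null
  done
done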

We're losing about 10% of these kinds of array jobs bc of this, which is just 
not supportable.



Gluster details

servers and clients run gluster 3.4.0-8.el6 over QDR IB (IPoIB), through 2 
Mellanox and 1 Voltaire switches, with Mellanox cards, on CentOS 6.4

$ gluster volume info
 
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*


'gluster volume status gl detail': 


---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users