Re: [Gluster-users] Gluster-users Digest, Vol 86, Issue 1 - Message 5: client load high using FUSE mount

2015-06-01 Thread Ben England


- Original Message -
> From: gluster-users-requ...@gluster.org
> To: gluster-users@gluster.org
> Sent: Monday, June 1, 2015 8:00:01 AM
> Subject: Gluster-users Digest, Vol 86, Issue 1
> 
> Message: 5
> Date: Mon, 01 Jun 2015 13:11:13 +0200
> From: Mitja Mihelič
> To: gluster-users@gluster.org
> Subject: [Gluster-users] Client load high (300) using fuse mount
> Message-ID: <556c3dd1.1080...@arnes.si>
> Content-Type: text/plain; charset=utf-8; format=flowed
> 
> Hi!
> 
> I am trying to set up a Wordpress cluster using GlusterFS used for
> storage. Web nodes will access the same Wordpress install on a volume
> mounted via FUSE from a 3 peer GlusterFS TSP.
> 
> I started with one web node and Wordpress on local storage. The load
> average was constantly about 5. iotop showed about 300kB/s disk reads or
> less. The load average was below 6.
> 
> When I mounted the GlusterFS volume to the web node the 1min load
> average went over 300. Each of the 3 peers is transmitting about 10MB/s
> to my web node regardless of the load.
> TSP peers are on 10Gbit NICs and the web node is on a 1Gbit NIC.

30 MB/s aggregate (3 peers x ~10 MB/s) is roughly a quarter of line speed for a 
1-Gbps NIC port.  It sounds like network latency and the lack of client-side 
caching are your bottleneck; you might want to put a 10-Gbps NIC port on your 
client.  You did disable client-side caching (the md-cache and io-cache 
translators) below -- was that your intent?  Also, the defaults for these 
translators are very conservative; if there is only 1 client, you may want to 
increase the time that data is cached (in the client) using the FUSE mount 
options "entry-timeout=30" and "attribute-timeout=30".  Unlike non-distributed 
Linux filesystems, Gluster is very conservative about client-side caching in 
order to avoid cache-coherency issues.
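For example (hypothetical server and volume names -- substitute your own), the longer FUSE cache timeouts can be supplied at mount time:

```shell
# hypothetical host "server1" and volume "wpvol"; cache dentries and
# attributes on the client for 30 seconds instead of the default 1 second
mount -t glusterfs -o entry-timeout=30,attribute-timeout=30 \
    server1:/wpvol /var/www/wordpress
```

This is a configuration sketch, not something to apply blindly; longer timeouts trade cache coherency across clients for lower metadata latency on one client.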

> 
> I'm out of ideas here... Could it be the network?
> What should I look at for optimizing the network stack on the client?
> 
> Options set on TSP:
> Options Reconfigured:
> performance.cache-size: 4GB
> network.ping-timeout: 15
> cluster.quorum-type: auto
> network.remote-dio: on
> cluster.eager-lock: on
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> performance.cache-refresh-timeout: 4
> performance.io-thread-count: 32
> nfs.disable: on
> 

That is a lot of tunings -- what is each of them intended to do?  The "gluster 
volume reset" command allows you to undo them.  In Gluster 3.7, the "gluster 
volume get your-volume all" command lets you see what the defaults are.  
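A sketch of those two commands (hypothetical volume name "wpvol"):

```shell
# reset a single reconfigured option back to its default
gluster volume reset wpvol performance.io-cache
# reset every reconfigured option on the volume
gluster volume reset wpvol
# Gluster 3.7+: list all options with their current values
gluster volume get wpvol all
```

These are configuration commands against a live cluster, so treat the exact output as version-dependent.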

> Regards, Mitja
> 
> --
> --
> Mitja Mihelič
> ARNES, Tehnološki park 18, p.p. 7, SI-1001 Ljubljana, Slovenia
> tel: +386 1 479 8877, fax: +386 1 479 88 78
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Gluster-users Digest, Vol 85, Issue 22 - 9. Re: seq read performance comparion between libgfapi and fuse

2015-05-29 Thread Ben England
Paul, I don't check this list every day.  I would expect you to get more than 
half of the minimum of network line speed and storage block device speed using 
a single libgfapi sequential read thread.  I did not see any throughput 
calculation or file size in your e-mail.

HTH, inline below...

-ben e

- Original Message -
> From: gluster-users-requ...@gluster.org
> To: gluster-users@gluster.org
> Sent: Friday, May 22, 2015 8:00:02 AM
> Subject: Gluster-users Digest, Vol 85, Issue 22
> 
> Message: 8
> Date: Fri, 22 May 2015 18:50:40 +0800
> From: Paul Guo 
> To: gluster-users@gluster.org
> Subject: [Gluster-users] seq read performance comparion between
>   libgfapi and fuse
> Message-ID: <555f0a00.2060...@foxmail.com>
> Content-Type: text/plain; charset=gbk; format=flowed
> 
> Hello,
> 
> I wrote two simple single-process seq read test case to compare libgfapi
> and fuse. The logic looks like this.
> 
> char buf[32768];
> while (1) {
>     cnt = read(fd, buf, sizeof(buf));
>     if (cnt == 0)
>         break;
>     else if (cnt > 0)
>         total += cnt;
>     // No "cnt < 0" was found during testing.
> }
> 
> Following is the time which is needed to finish reading a large file.
> 
> fuse libgfapi
> direct io: 40s  51s
> non direct io: 40s  47s
> 
> The version is 3.6.3 on CentOS 6.5. The result shows that libgfapi is
> obviously slower than the fuse interface, although libgfapi used far
> fewer CPU cycles during testing. Before each test, all kernel page cache,
> inode and dentry caches were dropped, and glusterd/gluster was stopped
> and restarted (to clean up the Gluster cache).

If you use libgfapi in a single-threaded app, you may need to tune the gluster 
volume parameter read-ahead-page-count (defaults to 4).  The default is 
intended to trade off single-thread performance for better aggregate 
performance and response time.  Here is an example of how to tune it for a 
single-thread use case; don't do this all the time. 

gluster volume set your-volume performance.read-ahead-page-count 16

As a debugging tool, you can try disabling the read-ahead translator altogether: 

# gluster v set your-volume read-ahead off

To reset these parameters to their defaults:

# gluster v reset your-volume read-ahead
# gluster v reset your-volume read-ahead-page-count

I have a benchmark for libgfapi testing in case this is useful to you:

https://github.com/bengland2/parallel-libgfapi

Please e-mail me directly if you have problems with it.

> 
> I tested direct io because I suspected that fuse kernel readahead
> helped more than the read optimization solutions in gluster. I searched
> a lot but I did not find much about the comparison between fuse and
> libgfapi. Anyone has known about this and known why?
> 

If you use O_DIRECT you may be bypassing the readahead translator in Gluster, 
and this may account for your problem.  Try NOT using O_DIRECT, and try the 
above tuning.  Or, if you really need O_DIRECT on the client, try this command, 
which disables O_DIRECT on the server side but not the client; it's the 
equivalent of NFS behavior.

# gluster v set your-volume network.remote-dio on

Also try turning off the io-cache translator, which will not help you here.

# gluster v set your-volume io-cache off

Also, O_DIRECT is passed all the way to the server by Gluster, so your disk 
reads will ALSO use O_DIRECT; this is terrible for performance.  You want to 
have block device readahead when doing this test.  Suggest you set it to at 
least 4096 KB for the block devices used for Gluster brick mountpoints.
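For example (hypothetical device name -- assuming the brick filesystem lives on /dev/sdb), block device readahead can be raised through sysfs:

```shell
# hypothetical brick device; value is in KB and resets at boot
echo 4096 > /sys/block/sdb/queue/read_ahead_kb
# verify the new setting
cat /sys/block/sdb/queue/read_ahead_kb
```

This is a per-device configuration change; persist it with a udev rule or boot script if the result helps.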


Re: [Gluster-users] [Gluster-devel] High CPU Usage - Glusterfsd

2015-02-22 Thread Ben England
Renchu, 

I didn't see anything about average file size or read/write mix.  One example 
of how to observe both of these, as well as latency and throughput: on the 
server, run these commands:

# gluster volume profile your-volume start
# gluster volume profile your-volume info > /tmp/dontcare
# sleep 60
# gluster volume profile your-volume info > profile-for-last-minute.log

There is also a "gluster volume top" command that may be of use to you in 
understanding what your users are doing with Gluster.

Also you may want to run "top -H" and see whether any threads in either 
glusterfsd or smbd are at or near 100% CPU - if so, you really are hitting a 
CPU bottleneck.  Looking at process CPU utilization can be deceptive, since a 
process may include multiple threads.  "sar -n DEV 2" will show you network 
utilization, and "iostat -mdx /dev/sd? 2" on your server will show block device 
queue depth (latter two tools require sysstat rpm).  Together these can help 
you to understand what kind of bottleneck you are seeing.

I don't see how many "bricks" are in your Gluster volume, but it sounds like you 
have only one glusterfsd per server.  If you have idle cores on your servers, 
you can harness more CPU power by using multiple bricks per server, which 
results in multiple glusterfsd processes on each server, allowing greater 
parallelism.  For example, you can do this by presenting individual disk drives 
as bricks rather than RAID volumes.

Let us know if these suggestions helped

-ben england

- Original Message -
> From: "Renchu Mathew" 
> To: gluster-users@gluster.org
> Cc: gluster-de...@gluster.org
> Sent: Sunday, February 22, 2015 7:09:09 AM
> Subject: [Gluster-devel] High CPU Usage - Glusterfsd
> 
> 
> 
> Dear all,
> 
> 
> 
> I have implemented glusterfs storage at my company: 2 servers with
> replication. But glusterfsd shows more than 100% CPU utilization most of the
> time, so it is very slow to access the gluster volume. My setup is two
> glusterfs servers with replication. The gluster volume (almost 10TB of data)
> is mounted on another server (glusterfs native client) and using samba share
> for the network users to access those files. Is there any way to reduce the
> processor usage on these servers? Please give a solution ASAP since the
> users are complaining about the poor performance. I am using glusterfs
> version 3.6.
> 
> 
> 
> Regards
> 
> 
> 
> Renchu Mathew | Sr. IT Administrator
> 
> 
> 
> 
> 
> 
> 
> CRACKNELL DUBAI | P.O. Box 66231 | United Arab Emirates | T +971 4 3445417 |
> F +971 4 3493675 | M +971 50 7386484
> 
> ABU DHABI | DUBAI | LONDON | MUSCAT | DOHA | JEDDAH
> 
> EMAIL ren...@cracknell.com | WEB www.cracknell.com
> 
> 
> 
> This email, its content and any files transmitted with it are intended solely
> for the addressee(s) and may be legally privileged and/or confidential. If
> you are not the intended recipient please let us know by email reply and
> delete it from the system. Please note that any views or opinions presented
> in this email do not necessarily represent those of the company. Email
> transmissions cannot be guaranteed to be secure or error-free as information
> could be intercepted, corrupted, lost, destroyed, arrive late or incomplete,
> or contain viruses. The company therefore does not accept liability for any
> errors or omissions in the contents of this message which arise as a result
> of email transmission.
> 
> 
> 
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 

Re: [Gluster-users] Gluster-users Digest, Vol 77, Issue 2

2014-09-08 Thread Ben England



> Message: 9
> Date: Tue, 2 Sep 2014 17:17:25 +0800
> From: Jaden Liang 
> To: gluster-de...@gluster.org, gluster-users@gluster.org
> Subject: [Gluster-users] [Gluster-devel] Regarding the write
>   performance in replica 1 volume in 1Gbps Ethernet, get about 50MB/s
>   while writing single file.
> Message-ID:
>   
> Content-Type: text/plain; charset="utf-8"
> 
> Hello, gluster-devel and gluster-users team,
> 
> We are running a performance test on a replica 1 volume and find that
> single-file sequential write performance only reaches about 50MB/s on 1Gbps
> Ethernet. However, if we test multiple-file sequential writing, the write
> performance can go up to 120MB/s, which is the top speed of the network.
> 

Not sure what you mean: are you writing multiple files concurrently or one at a 
time?  With FUSE, this matters; I typically see best throughput with more 
than one file being transferred at the same time.

> We also tried to use the stat xlator to find out where the bottleneck of
> single-file write performance is. Here is the stat data:
> 
> Client-side:
> ..
> vs_vol_rep1-client-8.latency.WRITE=total:21834371.00us,
> mean:2665.328491us, count:8192, max:4063475, min:1849
> ..
> 
> Server-side:
> ..
> /data/sdb1/brick1.latency.WRITE=total:6156857.00us, mean:751.569458us,
> count:8192, max:230864, min:611
> ..
> 

What's your write transfer size?  With FUSE, this really matters a lot, since 
FUSE does not aggregate writes, so each write has to travel from the 
application to the glusterfs mountpoint process, resulting in slow performance 
for small transfer sizes.  In general, it's a good idea to supply the details 
of your workload generator and how it was run, so we can compare with other 
known workloads and results.  

> Note that the test writes a single 1GB file sequentially to a replica 1
> volume over a 1Gbps Ethernet network.
> 

So for example try using

# dd if=/dev/zero of=/mnt/glusterfs/your-file.dd bs=1024k count=1k

and see whether your throughput is still 50 MB/s. 

> On the client side, we can see there are 8192 write requests in total. Each
> request writes 128KB of data. The total elapsed time is 21834371us, about 21
> seconds. The mean time per request is 2665us, about 2.6ms, which means it
> could only serve about 380 requests per second. There are also other
> time-consuming calls like statfs and lookup, but those are not major factors.
> 
> On the server side, the mean time per request is 751us, including writing the
> data to the HDD. So we think that is not the major reason.
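A quick back-of-envelope check of the numbers above (my arithmetic, not from the original post): 8192 requests x 128 KiB = 1 GiB moved in ~21.8 s of client-side WRITE latency, which lines up with the ~50 MB/s observed:

```shell
# 8192 writes x 128 KiB, over 21.834371 s of total client-side WRITE time
awk 'BEGIN { printf "%.1f MB/s\n", (8192 * 128 / 1024) / 21.834371 }'
# -> 46.9 MB/s
```

So the per-request round-trip latency alone accounts for essentially all of the observed single-file throughput.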
> 
> And we also modified some code to collect statistics on epoll elapsed
> time. It only took about 20us from enqueueing the data to finishing the send.
> 
> Now we are digging into the rpc mechanism in glusterfs. Still, we think this
> issue may have been encountered before by the gluster-devel or gluster-users
> teams, so any suggestions would be appreciated. Has anyone seen this issue?
> 
> Best regards,
> Jaden Liang
> 9/2/2014
> 
> 
> --
> Best regards,
> Jaden Liang


Re: [Gluster-users] Gluster-users Digest, Vol 76, Issue 18 - Re: reading not distributed across bricks

2014-08-12 Thread Ben England

> Message: 1
> Date: Mon, 11 Aug 2014 09:53:30 -0400 (EDT)
> From: Justin Clift 
> To: Pranith Kumar Karampuri 
> Cc: gluster-users@gluster.org, Ray Mannings 
> Subject: Re: [Gluster-users] Reading not distributed across bricks
> Message-ID:
>   <417182971.4749068.1407765210056.javamail.zim...@redhat.com>
> Content-Type: text/plain; charset=utf-8
> 
> - Original Message -
> > hi Ray,
> >Reads are served from the bricks which respond the fastest at the
> > moment. They are not load-balanced.
> 
> Maybe a good feature for 3.7? :)
> 

Ray,
There already is a feature, from gluster volume set help:
"Option: cluster.read-hash-mode
Description: inode-read fops happen only on one of the bricks in replicate. AFR 
will prefer the one computed using the method specified using this option
0 = first responder, 
1 = hash by GFID of file (all clients use same subvolume), 
2 = hash by GFID of file and client PID"

This is particularly useful for benchmark tests, where the system may not have 
enough response-time data to properly load balance; with the default value of 
0 I have seen all the clients select the same replica.  The value 2 is nice 
because if many clients are reading the same file, the load is distributed 
across bricks.
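A sketch of how to apply it (hypothetical volume name):

```shell
# hash reads by file GFID and client PID, so clients spread across replicas
gluster volume set your-volume cluster.read-hash-mode 2
```

This is a volume-wide configuration change; it only affects which replica serves inode reads, not correctness.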
-ben



Re: [Gluster-users] Gluster-users Digest, Vol 75, Issue 25 - striped volume x8, poor sequential read performance

2014-07-29 Thread Ben England
Sergey, cmts inline...

Is your intended workload really single-client, single-thread?  Or is it more 
MPI-like?  For example, do you have many clients reading from different parts 
of the same large file?  If the latter, perhaps IOR would be a better benchmark 
for you.

Sorry, I'm not familiar with the striping translator.

- Original Message -
> From: gluster-users-requ...@gluster.org
> To: gluster-users@gluster.org
> Sent: Tuesday, July 22, 2014 7:21:56 AM
> Subject: Gluster-users Digest, Vol 75, Issue 25
> 
> --
> 
> Message: 9
> Date: Mon, 21 Jul 2014 21:35:15 +0100 (BST)
> From: Sergey Koposov 
> To: gluster-users@gluster.org
> Subject: [Gluster-users] glusterfs, striped volume x8, poor sequential
>   read performance, good write performance
> Message-ID:
>   
> Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII
> 
> Hi,
> 
> I have an HPC installation with 8 nodes. Each node has a software
> RAID1 using two NL-SAS disks. The disks from the 8 nodes are combined into a
> large shared striped 20TB glusterfs partition, which seems to show
> abnormally slow sequential read performance, with good write performance.
> 
> Basically what I see is that the write performance is very decent, ~500MB/sec
> (tested using dd):
> 
> [root@ bigstor]# dd if=/dev/zero of=test2 bs=1M count=10
> 10+0 records in
> 10+0 records out
> 10485760 bytes (105 GB) copied, 186.393 s, 563 MB/s
> 
> And all this is not just sitting in the cache of each node, as I see the
> data being flushed to disks at approximately the right speed.
> 
> At the same time, the read performance
> (tested using dd after dropping the caches beforehand) is really bad:
> 
> [root@ bigstor]# dd if=/data/bigstor/test of=/dev/null bs=1M
> count=1
> 1+0 records in
> 1+0 records out
> 1048576 bytes (10 GB) copied, 309.821 s, 33.8 MB/s
> 
> When doing this, the glusterfs processes only take ~10-15% CPU max. So it
> isn't CPU starvation.
> 
> The underlying  devices do not seem to be loaded at all:
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
> avgqu-sz   await  svctm  %util
> sda   0.00 0.00   73.000.00  9344.00 0.00   256.00
> 0.111.48   1.47  10.70
> 
> To check that the disks are not the problem,
> I did a separate test of the read speed of the raided disks on all machines,
> and they have read speeds of ~180MB/s (uncached). So they aren't the
> problem.
> 

Gluster has a read-ahead-page-count setting; I'd try setting it to 16 (as high 
as it will go), the default is 4.  Writes are different because a write to a 
brick can complete before the data hits the disk (in other words, as soon as 
the data reaches server memory), but with reads, if the data is not cached in 
memory then your only solution is to get all bricks reading at the same time.
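A sketch using the poster's volume name (glvol, taken from the mount output quoted further down):

```shell
# prefetch 16 x 128 KB per fd instead of the default 4
gluster volume set glvol performance.read-ahead-page-count 16
```

As noted elsewhere in this thread, this favors single-thread sequential reads and is not a good default for mixed workloads.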

Contrast this with a single-brick 12-disk RAID6 volume (with 32-MB readahead) 
that can hit 800 MB/s on reads.  Clearly it isn't the rest of Gluster that's 
holding you back; it's probably the stripe translator's behavior.  Does the 
stripe translator support parallel reads to different subvolumes in the stripe?  
Can you post a protocol trace that shows the on-the-wire behavior (collect with 
tcpdump, display with wireshark)?

You could try running a re-read test without the stripe translator, I suspect 
it will perform better based on my own experience.

> I also tried to increase the readahead on the raid disks
> echo 2048 > /sys/block/md126/queue/read_ahead_kb
> but that doesn't seem to help at all.
> 

To prove this, try re-reading a file that fits in the Linux buffer cache on the 
servers -- block device readahead is then irrelevant since there is no disk I/O 
at all.  You are then doing a network test with Gluster.

Also, try just doing a dd read from the "brick" (subvolume) directly.

> Does anyone have any advice on what to do here? What knobs should I adjust?
> To me it honestly looks like a bug, but I would be happy if there is a
> magic switch I forgot to turn on :)
> 

Also, if you are using IPoIB, try the jumbo frame settings MTU=65520 and 
MODE=connected (in ifcfg-ib0) to reduce InfiniBand interrupts on the client side.  

Try FUSE mount option "-o gid-timeout=2" . 

What is the stripe width of the Gluster volume in KB?  It looks like it's the 
default (I forget what that is), but you probably want it to be something like 
128 KB x 8.  A very large stripe size will prevent Gluster from utilizing more 
than 1 brick at the same time.


> Here is more details about my system
> 
> OS: Centos 6.5
> glusterfs : 3.4.4
> Kernel 2.6.32-431.20.3.el6.x86_64
> mount options and df output:
> 
> [root@ bigstor]# cat /etc/mtab
> 
> /dev/md126p4 /data/glvol/brick1 xfs rw 0 0
> node1:/glvol /data/bigstor fuse.glusterfs
> rw,default_permissions,allow_other,max_read=131072 0 0
> 
> [root@ bigstor]# df
> Filesystem   1K-blocksUsed  Available Use% Mounted on
> /dev/md126p42516284988  235682084

Re: [Gluster-users] Gluster-users Digest, Vol 59, Issue 15 - GlusterFS performance

2013-03-02 Thread Ben England

- Original Message -
> From: gluster-users-requ...@gluster.org
> To: gluster-users@gluster.org
> Sent: Friday, March 1, 2013 4:03:13 PM
> Subject: Gluster-users Digest, Vol 59, Issue 15
> 
> --
> 
> Message: 2
> Date: Fri, 01 Mar 2013 10:22:21 -0800
> From: Joe Julian 
> To: gluster-users@gluster.org
> Subject: Re: [Gluster-users] GlusterFS performance
> Message-ID: <5130f1dd.9050...@julianfamily.org>
> Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
> 
> The kernel developers introduced a bug into ext4 that has yet to be
> fixed. If you use xfs you won't have those hangs.
> 
> On 03/01/2013 01:30 AM, Nikita A Kardashin wrote:
> > Hello again!
> >
> > I am complete rebuild my storage.
> > As base: ext4 over mdadm-raid1
> > Gluster volume in distributed-replicated mode with settings:
> >
> > Options Reconfigured:
> > performance.cache-size: 1024MB
> > nfs.disable: on
> > performance.write-behind-window-size: 4MB
> > performance.io-thread-count: 64
> > features.quota: off
> > features.quota-timeout: 1800
> > performance.io-cache: on
> > performance.write-behind: on
> > performance.flush-behind: on
> > performance.read-ahead: on
> >
> > As a result, I got write performance of about 80MB/s with dd if=/dev/zero
> > of=testfile.bin bs=100M count=10, 

Make sure your network and storage bricks are performing as you expect them to; 
Gluster is only as good as the underlying hardware.  What happens with reads?  
What happens when you run multiple threads doing writes? 

for n in `seq 1 4` ; do 
  dd if=/dev/zero of=testfile$n.bin bs=100M count=10 &
done
time wait

> > If I try to execute the above command inside a virtual machine (KVM), the
> > first time all goes right, about 900MB/s (a cache effect, I think), but if
> > I run the test again on the existing file, the dd task hangs and can be
> > stopped only by Ctrl+C.

In the future, post the qemu process command line (from ps awux).  Are you 
writing to a "local" file system inside the virtual disk image, or are you 
mounting Gluster from inside the VM?  If you are going through /dev/vda, are 
you using KVM qemu cache=writeback?  You could try cache=writethrough or 
cache=none; see the comments below for cache=none.  Also, try io=threads, not 
io=native.  

> >
> > Overall virtual system latency is poor too. For example, apt-get
> > upgrade runs very, very slowly, freezing on "Unpacking
> > replacement" and other io-related steps.
> >

If you don't have a fast connection to storage, the Linux VM will buffer write 
data in the kernel buffer cache until it runs out of memory for that 
(vm.dirty_ratio), then it will freeze any process that issues writes.  If 
your VM has a lot of memory relative to storage speed, this can result in very 
long delays.  Try reducing the Linux kernel's vm.dirty_background_ratio to get 
writes going sooner, and vm.dirty_ratio so that the freezes don't last as long.  
You can even reduce the VM's block device queue depth.  But most of all, make 
sure that gluster writes are performing near a typical local block device speed.
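A sketch of checking and lowering the dirty-page thresholds inside the guest (the specific values are illustrative, not a recommendation):

```shell
# show the current thresholds (percent of system memory)
sysctl vm.dirty_background_ratio vm.dirty_ratio
# start background writeback sooner, and throttle writers earlier
# (requires root; illustrative values -- tune for your memory size)
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
```

These are runtime settings; add them to /etc/sysctl.conf to persist across reboots.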

> > Does glusterfs have any tuning options that can help me?
> >
> >

If your workload is strictly large-file, try this volume tuning:

storage.linux-aio: off (default)
cluster.eager-lock: enable (default is disabled)
network.remote-dio: on (default is off)
performance.write-behind-window-size: 1MB (default)

For a pure single-thread sequential-read workload, you can tune the read-ahead 
translator to be more aggressive; this will help single-thread reads, but don't 
do it for other workloads, such as virtual machine images in the Gluster volume 
(which appear to Gluster as more of a random I/O workload).

performance.read-ahead-page-count: 16 (default is 4 128-KB prefetched buffers)

http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/

Red Hat Storage distribution will help tune Linux block device for better 
performance on many workloads.


Re: [Gluster-users] Gluster-users Digest, Vol 54, Issue 5 -- 3.3.0 replica performance

2012-10-07 Thread Ben England


- Original Message -
> From: gluster-users-requ...@gluster.org
> To: gluster-users@gluster.org
> Sent: Friday, October 5, 2012 9:29:33 AM
> Subject: Gluster-users Digest, Vol 54, Issue 5
> 

> Message: 3
> Date: Fri, 05 Oct 2012 21:50:16 +1000
> From: Andrew 
> To: gluster-users@gluster.org
> Subject: [Gluster-users] 3.3.0 replica performance
> Message-ID: <506ec978.3070...@donehue.net>
> 
> Hi All,
> 
> I have two systems connected via 10GBE .  Hardware is new and
> performs
> well (more details below).  I am hitting problems with write
> performance.  I have spent a few days reviewing previous posts
> without
> success. Any advice would be greatly appreciated.
> 
Andrew, 300 MB/s for sequential writes is a reasonable expectation for 
replicated writes with 10-GbE.  You shouldn't need the volume parameters.  Make 
sure you don't have a CPU bottleneck; I'm not sure what speed CPUs you have, 
but use "top", press H, and see if any thread in glusterfs (client) or 
glusterfsd (server) hits 100% on either the client or the server.  Try a single 
replica to see if the problem is replication-related or something else.  If it 
is related to replication, you can try setting "gluster volume set your-volume 
performance.eager-lock on" (remount afterwards).  Consider Red Hat Storage if 
you want a solution that is pre-tuned, performance-tested and supported by Red 
Hat. -- Ben England, Red Hat


Re: [Gluster-users] Gluster-users Digest, Vol 53, Issue 56 -- GlusterFS performance (Steve Thompson)

2012-10-02 Thread Ben England
Steve,

try glusterfs 3.3 and look at: 

http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/

There will be more optimizations in the next Gluster release.  Take advantage 
of the translators that Gluster supplies, including readahead translator and 
quick-read translator.

Red Hat does offer support for Red Hat Storage based on Gluster, and it has a 
pre-packaged tuning profile built into it.   We test with 10-GbE networks and 
Gluster 3.3 does have reasonably good performance for large-file sequential 
workloads (and it's scalable).


Re: [Gluster-users] Gluster-users Digest, Vol 51, Issue 46

2012-08-03 Thread Ben England
4. Re: kernel parameters for improving gluster writes on millions of small 
writes (long) (Harry Mangalam)

Harry, you are correct: Glusterfs throughput with small write transfer sizes is 
a client-side problem.  Here are workarounds that at least some applications 
could use. 

1) NFS client is one workaround, since it buffers writes using the kernel 
buffer cache.

2) If your app does not have a configurable I/O size but it can write to 
stdout, you can try piping its output through dd and letting dd aggregate the 
I/O to the filesystem for you.  In the example below, this triples 
single-thread write throughput for 4-KB I/O requests.

[root@perf56 ~]# mount -t glusterfs perf66-10ge:/repl2 /mnt/repl2

[root@perf56 ~]# dd if=/dev/zero of=/mnt/repl2/a.dd bs=4k count=1024k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 55.8191 s, 76.9 MB/s

[root@perf56 ~]# dd if=/dev/zero bs=4k count=1024k | dd of=/mnt/repl2/a.dd 
bs=1024k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 20.6882 s, 208 MB/s
0+58116 records in
0+58116 records out
4294967296 bytes (4.3 GB) copied, 19.9023 s, 216 MB/s

3) If your program is written in C and uses stdio.h, you can probably call 
setvbuf() from the C runtime library to increase the buffer size to something 
greater than 8 KB, which is the default in gcc-4.4.

http://en.cppreference.com/w/c/io/setvbuf


Re: [Gluster-users] Gluster-users Digest, Vol 51, Issue 49

2012-08-03 Thread Ben England
> Message: 4
> Date: Fri, 27 Jul 2012 15:29:41 -0700
> From: Harry Mangalam 
> Subject: [Gluster-users] Change NFS parameters post-start
> To: gluster-users 
> Message-ID:
>   
> Content-Type: text/plain; charset=ISO-8859-1
> 
> In trying to convert clients from using the gluster native client to
> an NFS client, I'm trying to get the gluster volume mounted on a test
> mount point on the same client that the native client has mounted the
> volume.  The client refuses with the error:
> 
>  mount -t nfs bs1:/gl /mnt/glnfs
> mount: bs1:/gl failed, reason given by server: No such file or
> directory
> 

Harry,

Have you tried: 
# mount -t nfs -o nfsvers=3,tcp bs1:/gl /mnt/glnfs

Also, there is an /etc/sysconfig/nfs file that may let you remove RDMA as a 
mount option for NFS.



Re: [Gluster-users] Gluster-users Digest, Vol 49, Issue 25 -- Disk utilization

2012-05-21 Thread Ben England
Peter,

see comments marked with ben> below, hope this helps.

Message: 1
Date: Tue, 15 May 2012 22:12:10 +0200
From: Peter Frey 
Subject: [Gluster-users] Disk utilisation
To: gluster-users@gluster.org
Message-ID:

Content-Type: text/plain; charset="iso-8859-1"

Hi,

we are using Gluster to make http file downloads available. We currently
have 2 gluster servers serving a replicated volume. Each gluster server has
22 disks in a hardware raid, the underlying file system is XFS. The average
file size is around 3-4MB. There are stored around 16TB of data on the
volume.

ben> Linux distro version and Gluster version would be helpful.  What is the 
RAID stripe element size?  If you have a 64-KB stripe element size, then EVERY 
disk will be made busy by reading a single 4-MB file; striping will not help 
you much at that file size.  ~130 Mbit/s is ~16 MB/s, and most disks can read 
at > 50 MB/s, so your total system throughput is far less than the throughput 
of a single disk drive -- so why use striping?  Wouldn't it be better to serve 
many files in parallel from your disks?  You may want to increase readahead if 
the application tends to read the entire file sequentially; try increasing it 
way up, as the Linux default of 128 KB is not good for Gluster.  Lastly, try 
the deadline I/O scheduler on your data disks; CFQ can't help on a Gluster 
server.
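A sketch of the scheduler change (hypothetical device name sda; the setting is per block device and resets at boot):

```shell
# switch the data disk from CFQ to the deadline elevator
echo deadline > /sys/block/sda/queue/scheduler
# verify: the active scheduler is shown in brackets
cat /sys/block/sda/queue/scheduler
```

Repeat for each brick device, and persist via a udev rule or the kernel "elevator=" boot parameter if it helps.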


Once we start sending live http traffic towards the infrastructure we see a
horrible performance. For instance if the outgoing bandwidth on each of the
gluster servers is at ~130mbit/s our hardware raid has a busy rate of ~30%.
Once we increase the traffic towards 250mbit/s the busy rate doubles to
60%. With this the iowait values also increase.

We started to play with the read buffers on the http servers. There is no
difference between loading the whole file into memory at once and loading
the file in 64k chunks. This makes me believe that the gluster server loads
the file with its own buffers and the clients buffer has no influence. We
have also enabled profiling on the gluster volume: There are roughly 18
read() calls for each open() call which should be an indication for too
small buffers.

ben> Gluster avoids read caching on the client side.  You can give Gluster 
servers more memory so that XFS can cache more files, if this leads to more 
cache hits.  If you really need aggressive client-side caching, you can 
NFS-mount the gluster server.  If your app is HTTP-based and RESTful, there 
are web caching servers that can intercept requests before they reach your 
application.  18 read calls per open is not a terrible ratio.  In my 
experience, if network tuning is correct and read files are cached (or 
prefetched) on the server, Gluster reads at network speed (which is why disk 
read-ahead is important).  How much traffic can your network transmit?  Have 
you tested the network by itself (i.e. without Gluster)?
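ben> One way to test the network by itself is a plain TCP throughput test between a client and a server (a sketch; the hostname is a placeholder, and iperf must be installed on both ends):

```shell
# On one Gluster server (placeholder hostname gluster-server1):
iperf -s

# On the web node, run a 30-second TCP throughput test toward it:
iperf -c gluster-server1 -t 30
```

If iperf cannot reach line speed, Gluster will not either, so fix the network first.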


We have also made the mistake of storing all files in a single directory, but
XFS advertises that it can handle millions of files in a single directory,
so it shouldn't be a problem -- or should it?

ben> Never put millions of files in a single directory if you can help it.  
Many file systems do not do well with this many files per directory.  But even 
if the filesystem handled it perfectly, applications that attempt to display 
directory contents (other than "find") tend to lock up, because they will read 
the entire directory, read all inodes in it, sort them, and then display them.  
Classic example: the "ls" command.
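ben> A common workaround is to hash filenames into a fixed set of subdirectories.  A minimal sketch (the /data/files root is a placeholder):

```shell
# Spread files over 256 subdirectories keyed by the first two hex characters
# of the filename's md5, instead of one flat multi-million-entry directory.
base=/data/files                        # placeholder storage root
name="example-image-0001.jpg"
sub=$(printf '%s' "$name" | md5sum | cut -c1-2)
path="$base/$sub/$name"
echo "$path"                            # /data/files/<xx>/example-image-0001.jpg
```

Each subdirectory then holds a manageable fraction of the files, and "ls" on any one of them stays fast.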

ben> Recent XFS versions (such as the version in RHEL 6.2) handle metadata far 
better than before (e.g. RHEL 6.1), so you may want to make sure you're using 
the right one.  
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] This benchmark is OK?

2012-05-21 Thread Ben England
Lawrence, here are my answers to your questions, based on the analysis that 
follows them.

Q1: Why is read better than write -- is that OK?
A less-than-optimal bonding mode is the reason for the difference.

Q2: In your experience, is this benchmark best, good, or bad?
The results are consistent with past experience with bonding mode 0 
(round-robin).

Q3: How do I tune Gluster to improve the benchmark?
I suspect your network configuration needs optimization, but you might not 
need the striping feature.

Q4: How can I get other people's benchmarks to compare against?
I don't understand the question, but iozone is what I use for sequential I/O 
benchmarking.


configuration -- It appears that you have 2 servers and 6 clients, and your 
gluster volume is striped, with no replication.

servers:
each server has 4 cores
each server has 4-way bonding mode 0
each server has 12 1-TB drives configured as 2 6-drive RAID volumes (what kind 
of RAID?)

clients:
each has 2-way bonding mode 0
client disks are irrelevant since gluster does not use them 

results:
You do cluster iozone sequential write test followed by sequential read test 
with 25 threads, total of 50 GB data, using 1-MB transfer size, including fsync 
and close in throughput calculation.  Results are:

initial write: 414 MB/s   
re-write:  447 MB/s
initial read: 655 MB/s
re-read: 778 MB/s
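The test described above would look roughly like this as an iozone cluster-mode invocation (a sketch; clients.ioz is a hypothetical client-list file):

```shell
# 25 threads x 2 GB = 50 GB total, 1-MB records, sequential write (-i 0)
# then read (-i 1), counting close (-c) and fsync (-e) in throughput:
iozone -+m clients.ioz -t 25 -s 2g -r 1m -i 0 -i 1 -c -e
```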

Clients have 12 NICs and servers have 8 NICs, so cross-sectional bandwidth 
between clients and servers is ~800 MB/s.  So for your volume type you would 
treat 800 MB/s as the theoretical limit of your iozone throughput.  It appears 
that you have enough threads and enough disk drives in your servers that your 
storage should not be the bottleneck.   With this I/O transfer size and volume 
type, a server CPU bottleneck is less likely, but you should still check.
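The arithmetic behind that ~800-MB/s limit (assuming each GbE NIC carries roughly 100 MB/s of payload):

```shell
# 2 servers x 4 bonded GbE NICs, ~100 MB/s usable per NIC:
SERVER_NICS=8
MB_PER_NIC=100
LIMIT=$((SERVER_NICS * MB_PER_NIC))
echo "${LIMIT} MB/s"    # prints "800 MB/s"
```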

With Linux bonding, the biggest challenge is load-balancing INCOMING traffic 
to the servers (writes) -- almost any mode can load-balance outgoing traffic 
(reads) across the NICs in a bond.  For writes, you are transmitting from the 
many client NICs to the fewer server NICs.  The problem with bonding mode 0 is 
that the ARP protocol associates a single MAC address with a single IP 
address, and bonding mode 0 assigns the SAME MAC ADDRESS to all NICs, so the 
network switch will "learn" that the last port that transmitted with that MAC 
address is the port where all receives should take place for that MAC address.  
This reduces the effectiveness of bonding at balancing receive load across the 
available server NICs.  For better load-balancing of receive traffic by the 
switch and servers, try:

bonding mode 6 (balance-alb) -- use this if clients and servers are mostly on 
the same VLAN.  In this mode, the Linux bonding driver uses ARP to 
load-balance clients across the available server NICs (NICs retain unique MAC 
addresses), so the network switch can deliver IP packets from different 
clients to different server NICs.  This can result in optimal utilization of 
server NICs when the client/server ratio is larger than the number of server 
NICs, usually with no switch configuration necessary.

bonding mode 4 (802.3ad "trunking") -- if your switch supports it, you can 
configure the switch and the servers to treat all server NICs as a single 
"trunk".  Any incoming IP packet destined for that server can then be passed 
by the switch to whichever server NIC is least busy (subject to constraints?).  
This works even when clients are on a different subnet and does not depend on 
the ARP protocol, but both servers and switch must be configured for it, and 
switch configuration has historically been vendor-specific.
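As a sketch of what the server-side setup might look like on RHEL-family systems (interface names, addresses, and mode are placeholders to adapt):

```shell
# Declare the bond device to the bonding driver:
cat > /etc/modprobe.d/bonding.conf <<'EOF'
alias bond0 bonding
EOF

# RHEL-style ifcfg for bonding mode 6 (balance-alb); substitute real addresses:
cat > /etc/sysconfig/network-scripts/ifcfg-bond0 <<'EOF'
DEVICE=bond0
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=balance-alb miimon=100"
EOF
# Each slave NIC's ifcfg then needs MASTER=bond0 and SLAVE=yes.
```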

Also, I do not see how using gluster striping feature will improve your 
performance with this workload.  Gluster striping could be expected to help 
under 2 conditions:
- Gluster client can read/write data much faster than any one server can
- Gluster client is only reading/writing one or two files at a time
Neither of these conditions is satisfied by your workload and configuration.

You achieved close to the network throughput limit for re-reads.  The 
difference between initial read and re-read result suggests that you might be 
able to improve your initial read result with better pre-fetching on the server 
block devices.

Ben England, Red Hat



- Original Message -
From: "Amar Tumballi" 
To: "Ben England" 
Sent: Saturday, May 19, 2012 2:22:03 AM
Subject: Fwd: [Gluster-users]  This benchmark is OK?

Ben,

When you have time, can you have a look on this thread and respond?

-Amar

 Original Message 
Subject:[Gluster-users] This benchmark is OK?
Date:   Thu, 17 May 2012 00:11:48 +0800
From:   soft_lawre...@hotmail.com 
Reply-To:   soft_lawrency 
To: gluster-users 
CC: wangleiyf 



Hi Amar,
here is my benchmark, pls help me to evaluate it.
1. [Env - Storage servers]
Gluster version: 3.3 beta3
OS : CentOS 6.1
2* Server : CPU : E5506 @ 2

Re: [Gluster-users] Gluster-users Digest, Vol 48, Issue 18 - Horrible Gluster Performance

2012-04-16 Thread Ben England
Philip,

What parts of your system perform well?   Can you give a specific example of 
your workload (what you are asking system to do)?  If it's a mixture of 
different workloads that's important too.  What version of Gluster and Linux 
are you using?  My suggestions would be 

a) to reset all your gluster tuning parameters to their default values unless 
you are sure that they actually improve performance, and 

b) try to isolate your performance problem to as simple a workload as possible 
before you try to fix it, and try to determine what workloads DO work well in 
your configuration.  This will make it easier for others to help.  

c) if latency spikes are the issue, this sounds like it could be related to 
writes being excessively buffered by the Linux kernel and then flushed all at 
once, which can block reads.  If so, use "iostat -kx /dev/sd? 5" or equivalent 
to observe.  You can throttle back "dirty pages" in the kernel to avoid 
buffering dirty pages for long periods of time and so avoid these spikes.  
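For (c), the dirty-page throttles live under the vm.* sysctls; the values below are illustrative starting points, not recommendations:

```shell
# Start background writeback earlier and cap dirty memory lower, so flushes
# happen in smaller, more frequent bursts instead of one large stall:
sysctl -w vm.dirty_background_ratio=2   # kick off writeback at 2% of RAM dirty
sysctl -w vm.dirty_ratio=10             # block writers once 10% of RAM is dirty

# Then watch per-device queue depth and await while reproducing the spikes:
iostat -kx /dev/sd? 5
```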

http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/ provides 
some suggestions that may be relevant to your problem; my recommendations are 
in a comment there.  

>Message: 9
>Date: Fri, 13 Apr 2012 11:25:58 +0200
>From: Philip 
>Subject: [Gluster-users] Horrible Gluster Performance
>To: gluster-users@gluster.org
>Message-ID:

>Content-Type: text/plain; charset="iso-8859-1"

>I have a small GlusterFS Cluster providing a replicated volume. Each server
>has 2 SAS disks for the OS and logs and 22 SATA disks for the actual data
>striped together as a RAID10 using MegaRAID SAS 9280-4i4e with this
>configuration: http://pastebin.com/2xj4401J

>Connected to this cluster are a few other servers with the native client
>running nginx to serve files stored on it in the order of 3-10MB.

>Right now a storage server has an outgoing bandwidth of 300Mbit/s and the
>busy rate of the raid array is at 30-40%. There are also strange
>side-effects: Sometimes the io-latency skyrockets and there is no access
>possible on the raid for >10 seconds. This happens at 300Mbit/s or
>1000Mbit/s of outgoing bandwidth. The file system used is xfs and it has
>been tuned to match the raid stripe size.

>I've tested all sorts of gluster settings but none seem to have any effect
>because of that I've reset the volume configuration and it is using the
>default one.

>Does anyone have an idea what could be the reason for such a bad
>performance? 22 Disks in a RAID10 should deliver *way* more throughput.


Re: [Gluster-users] Gluster-users Digest, Vol 47, Issue 10, topic 1, write perf.

2012-03-06 Thread Ben England
Harold Hannelius, 

If you are using the "cp" utility to copy into Gluster, then you may be 
running into a problem with an extremely small record size coupled with 
write-replication overhead.  Try

dd if=/dev/zero of=/gluster/test.dd bs=1024k count=256

which uses a 1-MB write size, and see if that behaves differently.
I'm not a debian expert, but in RHEL 5.3 the "cp" utility's record size was 
4 KB, and in RHEL 6.2 the same utility has a 32-KB I/O size; you can check 
using the strace utility.
For small-write-size utilities you can also try NFS to the Gluster server, 
since the NFS client aggregates writes.


[Gluster-users] RDMA/Ethernet with ROCEE - failed to modify QP to RTR

2011-11-14 Thread Ben England
Did any RDMA/Ethernet users see this Gluster error?  If so, do you know what 
caused it and how to fix it?  If you haven't seen it, what RPMs and 
configuration do you use specific to RDMA/Ethernet?

[2011-11-10 10:30:20.595801] C 
[rdma.c:2417:rdma_connect_qp]0-rpc-transport/rdma: Failed to modify QP to RTR 
[2011-11-10 10:30:20.595930] E [rdma.c:4159:rdma_handshake_pollin] 
0-rpc-transport/rdma: rdma.management: failed to connect with remote QP

I see this when I run RDMA over Ethernet using ROCEE RPMs, but when I run over 
Infiniband using RHEL 6.2-, it runs fine.  On the same Ethernet configuration, 
Gluster/TCP runs fine, and NFS/RDMA runs fine, as does an AMQP app.  But the 
qperf and rping utilities fail in the same way.  The firmware on the HCAs is 
not the latest; is it worth the risk to upgrade?

I went into the debugger and found the line where qperf fails; it's near line 
2056 in rdma.c in the qperf sources (qperf-debuginfo; I used the Makefile)

(gdb)
2088} else if (dev->trans == IBV_QPT_RC) {
(gdb)
2090flags = IBV_QP_STATE  |
(gdb)
2097if (ibv_modify_qp(dev->qp, &rtr_attr, flags) != 0)
(gdb)
2098error(SYS, "failed to modify QP to RTR");
(gdb)

Gluster fails in rdma_connect_qp() calling the same routine, but perhaps with 
different parameters.  


[Gluster-users] Gluster/RDMA

2011-11-07 Thread Ben England
To Harry Mangalam about Gluster/RDMA:

Make sure these modules are loaded:

# modprobe -v rdma_ucm
# modprobe -v ib_uverbs
# modprobe -v ib_ucm

To run the subnet manager, also load:

# modprobe -v ib_umad

Make sure libibverbs and (libmlx4 or libmthca) RPMs are installed.

I don't understand why the appropriate modules aren't loaded automatically.  
Could we put something in /etc/modprobe.d/ to make this happen, maybe?  
Infiniband should not require troubleshooting after 5-10 years of development; 
it should just work.
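One way to make the loading automatic on RHEL-family systems (a sketch, not a tested recipe) is a boot-time module script:

```shell
# RHEL runs any executable /etc/sysconfig/modules/*.modules script at boot:
cat > /etc/sysconfig/modules/rdma.modules <<'EOF'
#!/bin/sh
for m in rdma_ucm ib_uverbs ib_ucm ib_umad; do
    modprobe "$m" >/dev/null 2>&1
done
EOF
chmod +x /etc/sysconfig/modules/rdma.modules
```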