Re: [Gluster-users] gluster client performance

2011-08-09 Thread Jesse Stroik

Pavan,

Thank you for your help.  We wanted to get back to you with our results 
and observations.  I'm cc'ing gluster-users for posterity.


We did experiment with enable-trickling-writes.  That was one of the 
translator tunables we wanted to know the precise syntax for so that we 
could be certain we were disabling it.  As hoped, disabling trickling 
writes improved performance somewhat.
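(For the archives, a sketch of what that volume-file stanza might look like --
the option name is the one discussed in this thread, but the volume/subvolume
names and its placement in the write-behind section are illustrative
assumptions, not our exact configuration:)

volume testvol-write-behind
  type performance/write-behind
  option enable-trickling-writes off
  subvolumes testvol-client-0
end-volume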


We are definitely interested in any other undocumented write-buffer 
related tunables.  We've tested the documented tuning parameters.


Performance improved significantly when we switched clients to the mainline 
kernel (2.6.35-13).  We also updated to OFED 1.5.3, but it wasn't 
responsible for the performance improvement.


Our findings with 32KB block size (cp) write performance:

250-300MB/sec single stream performance
400MB/sec multiple-stream per client performance
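(For context, a roughly equivalent single-stream test can be driven with dd 
instead of cp, since dd lets the block size be pinned explicitly. A sketch -- 
the mount point, file name and file size below are placeholders:)

dd if=/dev/zero of=/mnt/glusterfs/testfile bs=32K count=65536   # ~2 GB written in 32 KB blocks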

This is much higher than we observed with the 2.6.18 kernel series.  Using 
the 2.6.18 line, we also observed virtually no difference between 
running single-stream tests and multi-stream tests, suggesting a 
bottleneck with the fabric.


Both 2.6.18 and 2.6.35-13 performed very well (about 600MB/sec) when 
writing 128KB blocks.


When I disabled write-behind on the 2.6.18 series of kernels as a test, 
performance plummeted to a few MB/sec when writing block sizes smaller 
than 128KB.  We did not test this extensively.
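(Aside: newer 3.x releases also expose a volume-set key for toggling 
write-behind from the CLI instead of editing volume files. A hedged sketch, 
assuming a volume named testvol and that your release carries this key:

gluster volume set testvol performance.write-behind off
(run the test, then re-enable)
gluster volume set testvol performance.write-behind on
)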


Disabling enable-trickling-writes gave us approximately a 20% boost, 
reflected in the numbers above, for single-stream writes.  We observed 
no significant difference with several streams per client due to 
disabling that tunable.


For reference, we are running another cluster file system on the same 
underlying hardware/software.  With both the old kernel (2.6.18.x) and 
the new kernel (2.6.35-13) we get approximately:


450-550MB/sec single stream performance
1200MB+/sec multiple stream per client performance

We set the test directory to write entire files to a single LUN which is 
how we configured gluster in an effort to mitigate differences.


It is treacherous to speculate why we might be more limited with gluster 
over RDMA than the other cluster file system without doing a significant 
amount of analysis.  That said, I wonder if there may be an issue with 
the way in which fuse handles write buffers causing a bottleneck for RDMA.


The bottom line is that our observed performance was poor using the 
2.6.18 RHEL 5 kernel line relative to the mainline (2.6.35) kernels. 
Updating to the newer kernels was well worth the testing and downtime. 
Hopefully this information can help others.


Best,
Jesse Stroik


Re: [Gluster-users] gluster client performance

2011-08-09 Thread Pavan T C

On Wednesday 10 August 2011 12:11 AM, Jesse Stroik wrote:

Pavan,

Thank you for your help. We wanted to get back to you with our results
and observations. I'm cc'ing gluster-users for posterity.

We did experiment with enable-trickling-writes. That was one of the
translator tunables we wanted to know the precise syntax for so that we
could be certain we were disabling it. As hoped, disabling trickling
writes improved performance somewhat.

We are definitely interested in any other undocumented write-buffer
related tunables. We've tested the documented tuning parameters.

Performance improved significantly when we switched clients to the mainline
kernel (2.6.35-13). We also updated to OFED 1.5.3, but it wasn't
responsible for the performance improvement.

Our findings with 32KB block size (cp) write performance:

250-300MB/sec single stream performance
400MB/sec multiple-stream per client performance


OK. Let's see if we can improve this further. Please use the following 
tunables as suggested below:


For write-behind -
option cache-size 16MB

For read-ahead -
option page-count 16

For io-cache -
option cache-size 64MB

You will need to place these lines in the client volume file, restart 
the server, and remount the volume on the clients.
Your client (fuse) volume file sections will look like the following (of 
course, with your volume name in place of 'testvol') -


volume testvol-write-behind
type performance/write-behind
option cache-size 16MB
subvolumes testvol-client-0
end-volume

volume testvol-read-ahead
type performance/read-ahead
option page-count 16
subvolumes testvol-write-behind
end-volume

volume testvol-io-cache
type performance/io-cache
option cache-size 64MB
subvolumes testvol-read-ahead
end-volume

Run your copy command with these tunables. For now, let's have the 
default setting for trickling writes, which is 'ENABLED'. You can simply 
remove this tunable from the volume file to get the default behaviour.
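(The restart/remount cycle itself is the usual one; a sketch, with placeholder 
volume name, mount point and server hostname:

On the server:
# gluster volume stop testvol
# gluster volume start testvol

On each client:
# umount /mnt/testvol
# mount -t glusterfs server1:/testvol /mnt/testvol

Alternatively, mount.glusterfs also accepts the path of the edited client .vol 
file directly in place of the server:/volume specification.)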


Pavan


This is much higher than we observed with the 2.6.18 kernel series. Using
the 2.6.18 line, we also observed virtually no difference between
running single-stream tests and multi-stream tests, suggesting a
bottleneck with the fabric.

Both 2.6.18 and 2.6.35-13 performed very well (about 600MB/sec) when
writing 128KB blocks.

When I disabled write-behind on the 2.6.18 series of kernels as a test,
performance plummeted to a few MB/sec when writing block sizes smaller
than 128KB. We did not test this extensively.

Disabling enable-trickling-writes gave us approximately a 20% boost,
reflected in the numbers above, for single-stream writes. We observed no
significant difference with several streams per client due to disabling
that tunable.

For reference, we are running another cluster file system on the same
underlying hardware/software. With both the old kernel (2.6.18.x) and
the new kernel (2.6.35-13) we get approximately:

450-550MB/sec single stream performance
1200MB+/sec multiple stream per client performance

We set the test directory to write entire files to a single LUN which is
how we configured gluster in an effort to mitigate differences.

It is treacherous to speculate why we might be more limited with gluster
over RDMA than the other cluster file system without doing a significant
amount of analysis. That said, I wonder if there may be an issue with
the way in which fuse handles write buffers causing a bottleneck for RDMA.

The bottom line is that our observed performance was poor using the
2.6.18 RHEL 5 kernel line relative to the mainline (2.6.35) kernels.
Updating to the newer kernels was well worth the testing and downtime.
Hopefully this information can help others.

Best,
Jesse Stroik




Re: [Gluster-users] gluster client performance

2011-07-27 Thread Pavan T C

[..]



I don't know why my writes are so slow compared to reads. Let me know
if you're able to get better write speeds with the newer version of
gluster and any of the configurations (if they apply) that I've
posted. It might compel me to upgrade.



From your documentation of nfsspeedtest, I see that the reads can 
happen either via dd or via perl's sysread. I'm not sure if one is 
better than the other.


Secondly - Are you doing direct IO on the backend XFS ? If not, try it 
with direct IO so that you are not misled by the memory situation in the 
system at the time of your test. It will give a clearer picture of what 
your backend is capable of.


Your test is such that you write a file and immediately read the same 
file back. It is possible that a good chunk of it is cached on the 
backend. After the write, do a flush of the filesystem caches by using:

echo 3 > /proc/sys/vm/drop_caches. Sleep for a while. Then do the read.
Or as suggested earlier, resort to direct IO while testing the backend FS.
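(Put together, a sketch of the backend test without direct IO -- the paths and 
sizes are placeholders:

dd if=/dev/zero of=/data-brick1/export/testfile bs=128K count=81920   # write ~10 GB
sync
echo 3 > /proc/sys/vm/drop_caches   # as root: drop page cache and dentries/inodes
sleep 30
dd if=/data-brick1/export/testfile of=/dev/null bs=128K               # now read it back
)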

Pavan


Re: [Gluster-users] gluster client performance

2011-07-27 Thread Pavan T C

But that still does not explain why you should get as low as 50 MB/s for
a single stream single client write when the backend can support direct
IO throughput of more than 700 MB/s.

On the server, can you collect:

# iostat -xcdh 2 > iostat.log.brickXX

for the duration of the dd command ?

and

# strace -f -o stracelog.server -tt -T -e trace=write,writev -p <glusterfsd pid>
(again for the duration of the dd command)


Hi John,

A small change in the request. I hope you have not already spent time on 
this. The strace command should be:


strace -f -o stracelog.server -tt -T -e trace=pwrite -p <glusterfsd pid>
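(If it helps, the brick process id can be picked up with something like:

# pgrep -lf glusterfsd

and substituted for the -p argument above.)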

Thanks,
Pavan



With the above, I want to measure the delay between the writes coming in
from the client. iostat will describe the IO scenario on the server.
Once the exercise is done, please attach the iostat.log.brickXX and
stracelog.server.




Re: [Gluster-users] gluster client performance

2011-07-27 Thread John Lalande

On 07/27/2011 12:53 AM, Pavan T C wrote:




2. What is the disk bandwidth you are getting on the local filesystem
on a given storage node ? I mean, pick any of the 10 storage servers
dedicated for Gluster Storage and perform a dd as below:

Seeing an average of 740 MB/s write, 971 MB/s read.


I presume you did this in one of the /data-brick*/export directories ?
Command output with the command line would have been clearer, but 
that's fine.

That is correct -- we used /data-brick1/export.




3. What is the IB bandwidth that you are getting between the compute
node and the glusterfs storage node? You can run the tool rdma_bw to
get the details:

30407: Bandwidth peak (#0 to #976): 2594.58 MB/sec
30407: Bandwidth average: 2593.62 MB/sec
30407: Service Demand peak (#0 to #976): 978 cycles/KB
30407: Service Demand Avg : 978 cycles/KB



This looks like a DDR connection. ibv_devinfo -v will tell a better 
story about the line width and speed of your infiniband connection.

QDR should have a much higher bandwidth.
But that still does not explain why you should get as low as 50 MB/s 
for a single stream single client write when the backend can support 
direct IO throughput of more than 700 MB/s.
ibv_devinfo shows 4x for active width and 10 Gbps for active speed. Not 
sure why we're not seeing better bandwidth with rdma_bw -- we'll have to 
troubleshoot that some more -- but I agree, it shouldn't be the limiting 
factor as far as the Gluster client speed problems we're seeing are concerned.


I'll send you the log files you requested off-list.

John

--



John Lalande
University of Wisconsin-Madison
Space Science & Engineering Center
1225 W. Dayton Street, Room 439, Madison, WI 53706
608-263-2268 / john.lala...@ssec.wisc.edu







Re: [Gluster-users] gluster client performance

2011-07-26 Thread Pavan T C

On Tuesday 26 July 2011 03:42 AM, John Lalande wrote:

Hi-

I'm new to Gluster, but am trying to get it set up on a new compute
cluster we're building. We picked Gluster for one of our cluster file
systems (we're also using Lustre for fast scratch space), but the
Gluster performance has been so bad that I think maybe we have a
configuration problem -- perhaps we're missing a tuning parameter that
would help, but I can't find anything in the Gluster documentation --
all the tuning info I've found seems geared toward Gluster 2.x.

For some background, our compute cluster has 64 compute nodes. The
gluster storage pool has 10 Dell PowerEdge R515 servers, each with 12 x
2 TB disks. We have another 16 Dell PowerEdge R515s used as Lustre
storage servers. The compute and storage nodes are all connected via QDR
Infiniband. Both Gluster and Lustre are set to use RDMA over Infiniband.
We are using OFED version 1.5.2-20101219, Gluster 3.2.2 and CentOS 5.5
on both the compute and storage nodes.


Hi John,

I would need some more information about your setup to estimate the 
performance you should get with your gluster setup.


1. Can you provide the details of how disks are connected to the storage 
boxes ? Is it via FC ? What raid configuration is it using (if at all any) ?


2. What is the disk bandwidth you are getting on the local filesystem on 
a given storage node ? I mean, pick any of the 10 storage servers 
dedicated for Gluster Storage and perform a dd as below:


Write bandwidth measurement:
dd if=/dev/zero of=/export_directory/10g_file bs=128K count=81920 oflag=direct

Read bandwidth measurement:
dd if=/export_directory/10g_file of=/dev/null bs=128K count=81920 iflag=direct


[The above command is doing a direct IO of 10GB via your backend FS - 
ext4/xfs.]


3. What is the IB bandwidth that you are getting between the compute 
node and the glusterfs storage node? You can run the tool rdma_bw to 
get the details:


On the server, run:
# rdma_bw -b
[ -b measures bi-directional bandwidth]

On the compute node, run:
# rdma_bw -b <server>

[If you have not already installed it, rdma_bw is available via -
http://mirror.centos.org/centos/5/os/x86_64/CentOS/perftest-1.2.3-1.el5.x86_64.rpm]

Let's start with this, and I will ask for more if necessary.

Pavan



Oddly, it seems like there's some sort of bottleneck on the client side
-- for example, we're only seeing about 50 MB/s write throughput from a
single compute node when writing a 10GB file. But, if we run multiple
simultaneous writes from multiple compute nodes to the same Gluster
volume, we get 50 MB/s from each compute node. However, running multiple
writes from the same compute node does not increase throughput. The
compute nodes have 48 cores and 128 GB RAM, so I don't think the issue
is with the compute node hardware.

With Lustre, on the same hardware, with the same version of OFED, we're
seeing write throughput on that same 10 GB file as follows: 476 MB/s
single stream write from a single compute node and aggregate performance
of more like 2.4 GB/s if we run simultaneous writes. That leads me to
believe that we don't have a problem with RDMA, otherwise Lustre, which
is also using RDMA, should be similarly affected.

We have tried both xfs and ext4 for the backend file system on the
Gluster storage nodes (we're currently using ext4). We went with
distributed (not distributed striped) for the Gluster volume -- the
thought was that if there was a catastrophic failure of one of the
storage nodes, we'd only lose the data on that node; presumably with
distributed striped you'd lose any data striped across that volume,
unless I have misinterpreted the documentation.

So ... what's expected/normal throughput for Gluster over QDR IB to a
relatively large storage pool (10 servers / 120 disks)? Does anyone have
suggested tuning tips for improving performance?

Thanks!

John





Re: [Gluster-users] gluster client performance

2011-07-26 Thread Sabuj Pattanayek
 3. What is the IB bandwidth that you are getting between the compute node
 and the glusterfs storage node? You can run the tool rdma_bw to get the
 details:

This is what I got on bidirectional :

2638: Bandwidth peak (#0 to #785): 6052.22 MB/sec
2638: Bandwidth average: 6050.02 MB/sec
2638: Service Demand peak (#0 to #785): 364 cycles/KB
2638: Service Demand Avg  : 364 cycles/KB


Re: [Gluster-users] gluster client performance

2011-07-26 Thread John Lalande

Thanks for your help, Pavan!


Hi John,

I would need some more information about your setup to estimate the 
performance you should get with your gluster setup.


1. Can you provide the details of how disks are connected to the 
storage boxes ? Is it via FC ? What raid configuration is it using (if 
at all any) ?
The disks are 2TB near-line SAS direct attached via a PERC H700 
controller (the Dell PowerEdge R515 has 12 3.5" drive bays). They are in 
a RAID6 config, exported as a single volume, that's split into 3 
equal-size partitions (due to ext4's (well, e2fsprogs') 16 TB limit).
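(For anyone reproducing this layout, a hedged sketch of splitting one large 
RAID6 block device into three sub-16 TB ext4 partitions -- the device name and 
the percentage boundaries below are placeholders, not our exact commands:

# parted -s /dev/sdb mklabel gpt
# parted -s /dev/sdb mkpart brick1 0% 33%
# parted -s /dev/sdb mkpart brick2 33% 66%
# parted -s /dev/sdb mkpart brick3 66% 100%
# mkfs.ext4 /dev/sdb1
# mkfs.ext4 /dev/sdb2
# mkfs.ext4 /dev/sdb3
)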


2. What is the disk bandwidth you are getting on the local filesystem 
on a given storage node ? I mean, pick any of the 10 storage servers 
dedicated for Gluster Storage and perform a dd as below:

Seeing an average of 740 MB/s write, 971 MB/s read.



3. What is the IB bandwidth that you are getting between the compute 
node and the glusterfs storage node? You can run the tool rdma_bw to 
get the details:

30407: Bandwidth peak (#0 to #976): 2594.58 MB/sec
30407: Bandwidth average: 2593.62 MB/sec
30407: Service Demand peak (#0 to #976): 978 cycles/KB
30407: Service Demand Avg  : 978 cycles/KB


Here's our gluster config:

# gluster volume info data

Volume Name: data
Type: Distribute
Status: Started
Number of Bricks: 30
Transport-type: rdma
Bricks:
Brick1: data-3-1-infiniband.infiniband:/data-brick1/export
Brick2: data-3-3-infiniband.infiniband:/data-brick1/export
Brick3: data-3-5-infiniband.infiniband:/data-brick1/export
Brick4: data-3-7-infiniband.infiniband:/data-brick1/export
Brick5: data-3-9-infiniband.infiniband:/data-brick1/export
Brick6: data-3-11-infiniband.infiniband:/data-brick1/export
Brick7: data-3-13-infiniband.infiniband:/data-brick1/export
Brick8: data-3-15-infiniband.infiniband:/data-brick1/export
Brick9: data-3-17-infiniband.infiniband:/data-brick1/export
Brick10: data-3-19-infiniband.infiniband:/data-brick1/export
Brick11: data-3-1-infiniband.infiniband:/data-brick2/export
Brick12: data-3-3-infiniband.infiniband:/data-brick2/export
Brick13: data-3-5-infiniband.infiniband:/data-brick2/export
Brick14: data-3-7-infiniband.infiniband:/data-brick2/export
Brick15: data-3-9-infiniband.infiniband:/data-brick2/export
Brick16: data-3-11-infiniband.infiniband:/data-brick2/export
Brick17: data-3-13-infiniband.infiniband:/data-brick2/export
Brick18: data-3-15-infiniband.infiniband:/data-brick2/export
Brick19: data-3-17-infiniband.infiniband:/data-brick2/export
Brick20: data-3-19-infiniband.infiniband:/data-brick2/export
Brick21: data-3-1-infiniband.infiniband:/data-brick3/export
Brick22: data-3-3-infiniband.infiniband:/data-brick3/export
Brick23: data-3-5-infiniband.infiniband:/data-brick3/export
Brick24: data-3-7-infiniband.infiniband:/data-brick3/export
Brick25: data-3-9-infiniband.infiniband:/data-brick3/export
Brick26: data-3-11-infiniband.infiniband:/data-brick3/export
Brick27: data-3-13-infiniband.infiniband:/data-brick3/export
Brick28: data-3-15-infiniband.infiniband:/data-brick3/export
Brick29: data-3-17-infiniband.infiniband:/data-brick3/export
Brick30: data-3-19-infiniband.infiniband:/data-brick3/export
Options Reconfigured:
nfs.disable: on

--



John Lalande
University of Wisconsin-Madison
Space Science & Engineering Center
1225 W. Dayton Street, Room 439, Madison, WI 53706
608-263-2268 / john.lala...@ssec.wisc.edu






Re: [Gluster-users] gluster client performance

2011-07-26 Thread Pavan T C

On Tuesday 26 July 2011 09:24 PM, John Lalande wrote:

Thanks for your help, Pavan!


Hi John,

I would need some more information about your setup to estimate the
performance you should get with your gluster setup.

1. Can you provide the details of how disks are connected to the
storage boxes ? Is it via FC ? What raid configuration is it using (if
at all any) ?

The disks are 2TB near-line SAS direct attached via a PERC H700
controller (the Dell PowerEdge R515 has 12 3.5" drive bays). They are in
a RAID6 config, exported as a single volume, that's split into 3
equal-size partitions (due to ext4's (well, e2fsprogs') 16 TB limit).


2. What is the disk bandwidth you are getting on the local filesystem
on a given storage node ? I mean, pick any of the 10 storage servers
dedicated for Gluster Storage and perform a dd as below:

Seeing an average of 740 MB/s write, 971 MB/s read.


I presume you did this in one of the /data-brick*/export directories ?
Command output with the command line would have been clearer, but that's 
fine.






3. What is the IB bandwidth that you are getting between the compute
node and the glusterfs storage node? You can run the tool rdma_bw to
get the details:

30407: Bandwidth peak (#0 to #976): 2594.58 MB/sec
30407: Bandwidth average: 2593.62 MB/sec
30407: Service Demand peak (#0 to #976): 978 cycles/KB
30407: Service Demand Avg : 978 cycles/KB


This looks like a DDR connection. ibv_devinfo -v will tell a better 
story about the line width and speed of your infiniband connection.

QDR should have a much higher bandwidth.
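(A quick check -- the field names below are from ibv_devinfo's verbose port 
section:

# ibv_devinfo -v | grep -E 'active_width|active_speed'

QDR should report active_width 4X with active_speed 10.0 Gbps per lane; 4X at 
5.0 Gbps would indicate DDR.)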

But that still does not explain why you should get as low as 50 MB/s for 
a single stream single client write when the backend can support direct 
IO throughput of more than 700 MB/s.


On the server, can you collect:

# iostat -xcdh 2 > iostat.log.brickXX

for the duration of the dd command ?

and

# strace -f -o stracelog.server -tt -T -e trace=write,writev -p <glusterfsd pid>

(again for the duration of the dd command)

With the above, I want to measure the delay between the writes coming in 
from the client. iostat will describe the IO scenario on the server.
Once the exercise is done, please attach the iostat.log.brickXX and 
stracelog.server.


Pavan




Here's our gluster config:

# gluster volume info data

Volume Name: data
Type: Distribute
Status: Started
Number of Bricks: 30
Transport-type: rdma
Bricks:
Brick1: data-3-1-infiniband.infiniband:/data-brick1/export
Brick2: data-3-3-infiniband.infiniband:/data-brick1/export
Brick3: data-3-5-infiniband.infiniband:/data-brick1/export
Brick4: data-3-7-infiniband.infiniband:/data-brick1/export
Brick5: data-3-9-infiniband.infiniband:/data-brick1/export
Brick6: data-3-11-infiniband.infiniband:/data-brick1/export
Brick7: data-3-13-infiniband.infiniband:/data-brick1/export
Brick8: data-3-15-infiniband.infiniband:/data-brick1/export
Brick9: data-3-17-infiniband.infiniband:/data-brick1/export
Brick10: data-3-19-infiniband.infiniband:/data-brick1/export
Brick11: data-3-1-infiniband.infiniband:/data-brick2/export
Brick12: data-3-3-infiniband.infiniband:/data-brick2/export
Brick13: data-3-5-infiniband.infiniband:/data-brick2/export
Brick14: data-3-7-infiniband.infiniband:/data-brick2/export
Brick15: data-3-9-infiniband.infiniband:/data-brick2/export
Brick16: data-3-11-infiniband.infiniband:/data-brick2/export
Brick17: data-3-13-infiniband.infiniband:/data-brick2/export
Brick18: data-3-15-infiniband.infiniband:/data-brick2/export
Brick19: data-3-17-infiniband.infiniband:/data-brick2/export
Brick20: data-3-19-infiniband.infiniband:/data-brick2/export
Brick21: data-3-1-infiniband.infiniband:/data-brick3/export
Brick22: data-3-3-infiniband.infiniband:/data-brick3/export
Brick23: data-3-5-infiniband.infiniband:/data-brick3/export
Brick24: data-3-7-infiniband.infiniband:/data-brick3/export
Brick25: data-3-9-infiniband.infiniband:/data-brick3/export
Brick26: data-3-11-infiniband.infiniband:/data-brick3/export
Brick27: data-3-13-infiniband.infiniband:/data-brick3/export
Brick28: data-3-15-infiniband.infiniband:/data-brick3/export
Brick29: data-3-17-infiniband.infiniband:/data-brick3/export
Brick30: data-3-19-infiniband.infiniband:/data-brick3/export
Options Reconfigured:
nfs.disable: on





[Gluster-users] gluster client performance

2011-07-25 Thread John Lalande

Hi-

I'm new to Gluster, but am trying to get it set up on a new compute 
cluster we're building. We picked Gluster for one of our cluster file 
systems (we're also using Lustre for fast scratch space), but the 
Gluster performance has been so bad that I think maybe we have a 
configuration problem -- perhaps we're missing a tuning parameter that 
would help, but I can't find anything in the Gluster documentation -- 
all the tuning info I've found seems geared toward Gluster 2.x.


For some background, our compute cluster has 64 compute nodes. The 
gluster storage pool has 10 Dell PowerEdge R515 servers, each with 12 x 
2 TB disks. We have another 16 Dell PowerEdge R515s used as Lustre 
storage servers. The compute and storage nodes are all connected via QDR 
Infiniband. Both Gluster and Lustre are set to use RDMA over Infiniband. 
We are using OFED version 1.5.2-20101219, Gluster 3.2.2 and CentOS 5.5 
on both the compute and storage nodes.


Oddly, it seems like there's some sort of bottleneck on the client side 
-- for example, we're only seeing about 50 MB/s write throughput from a 
single compute node when writing a 10GB file. But, if we run multiple 
simultaneous writes from multiple compute nodes to the same Gluster 
volume, we get 50 MB/s from each compute node. However, running multiple 
writes from the same compute node does not increase throughput. The 
compute nodes have 48 cores and 128 GB RAM, so I don't think the issue 
is with the compute node hardware.


With Lustre, on the same hardware, with the same version of OFED, we're 
seeing write throughput on that same 10 GB file as follows: 476 MB/s 
single stream write from a single compute node and aggregate performance 
of more like 2.4 GB/s if we run simultaneous writes. That leads me to 
believe that we don't have a problem with RDMA, otherwise Lustre, which 
is also using RDMA, should be similarly affected.


We have tried both xfs and ext4 for the backend file system on the 
Gluster storage nodes (we're currently using ext4). We went with 
distributed (not distributed striped) for the Gluster volume -- the 
thought was that if there was a catastrophic failure of one of the 
storage nodes, we'd only lose the data on that node; presumably with 
distributed striped you'd lose any data striped across that volume, 
unless I have misinterpreted the documentation.
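(For reference, the create command for a plain distribute volume over rdma in 
3.2 looks roughly like the following -- the brick list is shortened and the 
hostnames are patterned on the brick names shown elsewhere in this thread:

# gluster volume create data transport rdma \
    data-3-1-infiniband.infiniband:/data-brick1/export \
    data-3-3-infiniband.infiniband:/data-brick1/export \
    data-3-5-infiniband.infiniband:/data-brick1/export
# gluster volume start data
)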


So ... what's expected/normal throughput for Gluster over QDR IB to a 
relatively large storage pool (10 servers / 120 disks)? Does anyone have 
suggested tuning tips for improving performance?


Thanks!

John

--



John Lalande
University of Wisconsin-Madison
Space Science & Engineering Center
1225 W. Dayton Street, Room 439, Madison, WI 53706
608-263-2268 / john.lala...@ssec.wisc.edu






Re: [Gluster-users] gluster client performance

2011-07-25 Thread Sabuj Pattanayek
Hi,

Here's our QDR IB gluster setup:

http://piranha.structbio.vanderbilt.edu

We're still using gluster 3.0 on all our servers and clients as well
as CENTOS5.6 kernels and ofed 1.4. To simulate a single stream I use
this nfsSpeedTest script I wrote :

http://code.google.com/p/nfsspeedtest/

From a single QDR IB connected client to our /pirstripe directory
which is a stripe of the gluster storage servers, this is the
performance I get (note: use a file size > amount of RAM on client and
server systems, 13GB in this case) :

4k block size :

111 pir4:/pirstripe% /sb/admin/scripts/nfsSpeedTest -s 13g -y
pir4: Write test (dd): 142.281 MB/s 1138.247 mbps 93.561 seconds
pir4: Read test (dd): 274.321 MB/s 2194.570 mbps 48.527 seconds

testing from 8k - 128k block size on the dd, best performance was
achieved at 64k block sizes:

114 pir4:/pirstripe% /sb/admin/scripts/nfsSpeedTest -s 13g -b 64k -y
pir4: Write test (dd): 213.344 MB/s 1706.750 mbps 62.397 seconds
pir4: Read test (dd): 955.328 MB/s 7642.620 mbps 13.934 seconds

This is to the /pirdist directories which are mounted in distribute
mode (file is written to only one of the gluster servers) :

105 pir4:/pirdist% /sb/admin/scripts/nfsSpeedTest -s 13g -y
pir4: Write test (dd): 182.410 MB/s 1459.281 mbps 72.978 seconds
pir4: Read test (dd): 244.379 MB/s 1955.033 mbps 54.473 seconds

106 pir4:/pirdist% /sb/admin/scripts/nfsSpeedTest -s 13g -y -b 64k
pir4: Write test (dd): 204.297 MB/s 1634.375 mbps 65.160 seconds
pir4: Read test (dd): 340.427 MB/s 2723.419 mbps 39.104 seconds

For reference/control, here's the same test writing straight to the
XFS filesystem on one of the gluster storage nodes:

[sabujp@gluster1 tmp]$ /sb/admin/scripts/nfsSpeedTest -s 13g -y
gluster1: Write test (dd): 398.971 MB/s 3191.770 mbps 33.366 seconds
gluster1: Read test (dd): 234.563 MB/s 1876.501 mbps 56.752 seconds

[sabujp@gluster1 tmp]$ /sb/admin/scripts/nfsSpeedTest -s 13g -y -b 64k
gluster1: Write test (dd): 442.251 MB/s 3538.008 mbps 30.101 seconds
gluster1: Read test (dd): 219.708 MB/s 1757.660 mbps 60.590 seconds

The read test seems to scale linearly with the # of storage servers
(almost 1GB/s!). Interestingly, the /pirdist read test at 64k block
size was 120MB/s faster than the read test straight from XFS, however,
it could have been that gluster1 was busy and when I read from
/pirdist the file was actually being read from one of the other 4 less
busy storage nodes.

Here's our storage node setup (many of these settings may not apply to v3.2) :



volume posix-stripe
  type storage/posix
  option directory /export/gluster1/stripe
end-volume

volume posix-distribute
  type storage/posix
  option directory /export/gluster1/distribute
end-volume

volume locks
  type features/locks
  subvolumes posix-stripe
end-volume

volume locks-dist
  type features/locks
  subvolumes posix-distribute
end-volume

volume iothreads
  type performance/io-threads
  option thread-count 16
  subvolumes locks
end-volume

volume iothreads-dist
  type performance/io-threads
  option thread-count 16
  subvolumes locks-dist
end-volume

volume server
  type protocol/server
  option transport-type ib-verbs
  option auth.addr.iothreads.allow 10.2.178.*
  option auth.addr.iothreads-dist.allow 10.2.178.*
  option auth.addr.locks.allow 10.2.178.*
  option auth.addr.posix-stripe.allow 10.2.178.*
  subvolumes iothreads iothreads-dist locks posix-stripe
end-volume



Here's our stripe client setup :



volume client-stripe-1
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster1
  option remote-subvolume iothreads
end-volume

volume client-stripe-2
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster2
  option remote-subvolume iothreads
end-volume

volume client-stripe-3
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster3
  option remote-subvolume iothreads
end-volume

volume client-stripe-4
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster4
  option remote-subvolume iothreads
end-volume

volume client-stripe-5
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster5
  option remote-subvolume iothreads
end-volume

volume readahead-gluster1
  type performance/read-ahead
  option page-count 4   # 2 is default
  option force-atime-update off # default is off
  subvolumes client-stripe-1
end-volume

volume readahead-gluster2
  type performance/read-ahead
  option page-count 4   # 2 is default
  option force-atime-update off # default is off
  subvolumes client-stripe-2
end-volume

volume readahead-gluster3
  type performance/read-ahead
  option page-count 4   # 2 is default
  option force-atime-update off # default is off
  subvolumes client-stripe-3
end-volume

volume readahead-gluster4
  type performance/read-ahead
  option page-count 4   # 2 is default
  option force-atime-update off # default is