Re: [Gluster-users] is glusterfs DHT really distributed?

2009-09-29 Thread Wei Dong
I think TCP_NODELAY is critical to performance.  Actually, after spending 
a large number of unfruitful hours on glusterfs, I wrote my own simple 
shared storage with a BerkeleyDB backend, and I found that enabling 
TCP_NODELAY on my system gives me nearly 10x readback throughput.  
Thanks for pointing this out; I'll definitely try that.
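
For anyone who wants to reproduce the effect outside of glusterfs, here is a 
minimal C sketch of enabling TCP_NODELAY on a connected socket (an 
illustration only, not the actual BerkeleyDB-backed code):

/* Illustration only: disable Nagle on an already-connected TCP socket. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int disable_nagle(int fd)
{
    int one = 1;
    /* Small request/response messages go out immediately instead of
       waiting to be coalesced by the Nagle algorithm. */
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}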


- Wei

Mark Mielke wrote:

On 09/29/2009 03:39 AM, David Saez Padros wrote:

The
second is 'option transport.socket.nodelay on' in each of your
protocol/client _and_ protocol/server volumes.


where is this option documented ?


I'm a little surprised TCP_NODELAY isn't set by default? I set it on 
all servers I write as a matter of principle.


The Nagle algorithm exists so that very simple servers get acceptable 
performance. The servers that benefit are the ones that write individual 
bytes with no buffering.


Serious servers intended to perform well should be able to easily beat 
the Nagle algorithm. writev(), sendmsg(), or even write(buffer) where 
the buffer is built first, should all beat the Nagle algorithm in 
terms of increased throughput and reduced latency. On Linux, there is 
also TCP_CORK. Unless GlusterFS does small writes, I suggest 
TCP_NODELAY be set by default in future releases.


Just an opinion. :-)

Cheers,
mark



___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] is glusterfs DHT really distributed?

2009-09-29 Thread Vijay Bellur

David Saez Padros wrote:

Hi


The
second is 'option transport.socket.nodelay on' in each of your
protocol/client _and_ protocol/server volumes.


where is this option documented ?


Thanks for pointing this out.

We introduced this as an experimental option in the 2.0.x releases and plan 
to expose it as a regular option in the upcoming 2.1 release.

Hence it will be documented in the 2.1 user manual.

Thanks,
Vijay


Re: [Gluster-users] is glusterfs DHT really distributed?

2009-09-29 Thread Vijay Bellur

Mark Mielke wrote:
I'm a little surprised TCP_NODELAY isn't set by default? I set it on 
all servers I write as a matter of principle.


Serious servers intended to perform well should be able to easily beat 
the Nagle algorithm. writev(), sendmsg(), or even write(buffer) where 
the buffer is built first, should all beat the Nagle algorithm in 
terms of increased throughput and reduced latency. On Linux, there is 
also TCP_CORK. Unless GlusterFS does small writes, I suggest 
TCP_NODELAY be set by default in future releases.


Just an opinion. :-)


Thanks for this feedback, Mark. Pre-2.0.3, there was no option to turn 
off Nagle's algorithm. We introduced this in 2.0.3 and are debating 
whether this needs to be made the default, since it involves altering a 
default behavior :-). We will certainly consider making this the default 
behavior in our upcoming releases.


Thanks,
Vijay



Re: [Gluster-users] is glusterfs DHT really distributed?

2009-09-29 Thread Mark Mielke

On 09/29/2009 03:39 AM, David Saez Padros wrote:

The
second is 'option transport.socket.nodelay on' in each of your
protocol/client _and_ protocol/server volumes.


where is this option documented ?


I'm a little surprised TCP_NODELAY isn't set by default? I set it on all 
servers I write as a matter of principle.


The Nagle algorithm exists so that very simple servers get acceptable 
performance. The servers that benefit are the ones that write individual 
bytes with no buffering.


Serious servers intended to perform well should be able to easily beat 
the Nagle algorithm. writev(), sendmsg(), or even write(buffer) where 
the buffer is built first, should all beat the Nagle algorithm in terms 
of increased throughput and reduced latency. On Linux, there is also 
TCP_CORK. Unless GlusterFS does small writes, I suggest TCP_NODELAY be 
set by default in future releases.
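
As a hypothetical sketch of the "build the buffer first, then write once" 
approach: a single writev() hands the kernel the header and payload together, 
so there is no small trailing segment left for Nagle to delay:

#include <sys/types.h>
#include <sys/uio.h>

static ssize_t send_request(int fd, const void *hdr, size_t hdr_len,
                            const void *payload, size_t payload_len)
{
    /* Gather both parts into one system call; a real implementation
       would loop until everything is written. */
    struct iovec iov[2] = {
        { (void *) hdr,     hdr_len     },
        { (void *) payload, payload_len }
    };
    return writev(fd, iov, 2);
}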


Just an opinion. :-)

Cheers,
mark

--
Mark Mielke



Re: [Gluster-users] is glusterfs DHT really distributed?

2009-09-29 Thread David Saez Padros

Hi


The
second is 'option transport.socket.nodelay on' in each of your
protocol/client _and_ protocol/server volumes.


where is this option documented ?

--
Thanx & best regards ...


   David Saez Padros            http://www.ols.es
   On-Line Services 2000 S.L.   telf. +34 902 50 29 75





Re: [Gluster-users] is glusterfs DHT really distributed?

2009-09-28 Thread Anand Avati
>  http://www.gluster.com/community/documentation/index.php/Translators/cluster/distribute
>
> It seems to suggest that the default for 'lookup-unhashed' is 'on'.
>
> Perhaps try turning it 'off'?

Wei,
   There are two things we would like you to try. The first is what Mark 
has just pointed out: 'option lookup-unhashed off' in distribute. The 
second is 'option transport.socket.nodelay on' in each of your 
protocol/client _and_ protocol/server volumes. Do let us know what 
influence these changes have on your performance.
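
For illustration, the option lines would sit roughly like this in Wei's 
attached volume files (a sketch; only the two transport.socket.nodelay lines 
are new, the rest mirrors the attachment further down):

volume server
type protocol/server
option transport-type tcp
option transport.socket.nodelay on
option transport.socket.listen-port 7001
option auth.addr.brick0.allow *.*.*.*
option auth.addr.brick1.allow *.*.*.*
option auth.addr.brick2.allow *.*.*.*
option auth.addr.brick3.allow *.*.*.*
subvolumes brick0 brick1 brick2 brick3
end-volume

volume brick-0-0-0
type protocol/client
option transport-type tcp
option transport.socket.nodelay on
option remote-host c8-0-0
option remote-port 7001
option remote-subvolume brick0
end-volume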

Avati


Re: [Gluster-users] is glusterfs DHT really distributed?

2009-09-28 Thread Mark Mielke

On 09/28/2009 10:51 AM, Wei Dong wrote:
Your reply makes sense to me.  I remember that auto-heal happens when a 
file is read; does that mean opening a file for read is also a global 
operation?  Do you mean that there's no other way of copying 30 million 
files to our 66-node glusterfs cluster for parallel processing other 
than waiting for half a month?  Can I somehow disable self-heal and get 
a speedup?


This is turning out quite badly for me.


On this page:

http://www.gluster.com/community/documentation/index.php/Translators/cluster/distribute


It seems to suggest that the default for 'lookup-unhashed' is 'on'.

Perhaps try turning it 'off'?
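
A sketch of where that option would go, assuming the client volume file ends 
with a cluster/distribute volume over the replicate pairs (that part of the 
attachment is truncated below, so the volume name and subvolume list here are 
placeholders):

volume dht
type cluster/distribute
option lookup-unhashed off
subvolumes rep-0-0 rep-0-1 rep-0-2 rep-0-3
end-volume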

Cheers,
mark






Mark Mielke wrote:

On 09/28/2009 10:35 AM, Wei Dong wrote:

Hi All,

I noticed a very weird phenomenon while copying data (200KB image 
files) to our glusterfs storage.  When I run only one client, it copies 
roughly 20 files per second, and as soon as I start a second client on 
another machine, the copy rate of the first client immediately degrades 
to 5 files per second.  When I stop the second client, the first client 
immediately speeds up to the original 20 files per second.  When I run 
15 clients, the aggregate throughput is about 8 files per second, much 
worse than running only one client.  Neither CPU nor network is 
saturated.  My volume file is attached.  The servers run on a 66-node 
cluster and the clients on a 15-node cluster.


We have 33x2 servers and at most 15 separate machines, with each 
server serving < 0.5 clients on average.  I cannot think of a reason 
for a distributed system to behave like this.  There must be some 
kind of central access point.


Although there is probably room for the GlusterFS folk to optimize...

You should consider directory write operations to involve the whole 
cluster. Creating a file is a directory write operation. Think of how 
it might have to do self-heal across the cluster, make sure the name 
is right and not already in use across the cluster, and such things.


Once you get to reads and writes for a particular file, it should be 
distributed.


Cheers,
mark







--
Mark Mielke



Re: [Gluster-users] is glusterfs DHT really distributed?

2009-09-28 Thread Wei Dong
Your reply makes sense to me.  I remember that auto-heal happens when a 
file is read; does that mean opening a file for read is also a global 
operation?  Do you mean that there's no other way of copying 30 million 
files to our 66-node glusterfs cluster for parallel processing other 
than waiting for half a month?  Can I somehow disable self-heal and get 
a speedup?


This is turning out quite badly for me.

- Wei


Mark Mielke wrote:

On 09/28/2009 10:35 AM, Wei Dong wrote:

Hi All,

I noticed a very weird phenomenon while copying data (200KB image 
files) to our glusterfs storage.  When I run only one client, it copies 
roughly 20 files per second, and as soon as I start a second client on 
another machine, the copy rate of the first client immediately degrades 
to 5 files per second.  When I stop the second client, the first client 
immediately speeds up to the original 20 files per second.  When I run 
15 clients, the aggregate throughput is about 8 files per second, much 
worse than running only one client.  Neither CPU nor network is 
saturated.  My volume file is attached.  The servers run on a 66-node 
cluster and the clients on a 15-node cluster.


We have 33x2 servers and at most 15 separate machines, with each 
server serving < 0.5 clients on average.  I cannot think of a reason 
for a distributed system to behave like this.  There must be some 
kind of central access point.


Although there is probably room for the GlusterFS folk to optimize...

You should consider directory write operations to involve the whole 
cluster. Creating a file is a directory write operation. Think of how 
it might have to do self-heal across the cluster, make sure the name 
is right and not already in use across the cluster, and such things.


Once you get to reads and writes for a particular file, it should be 
distributed.


Cheers,
mark





Re: [Gluster-users] is glusterfs DHT really distributed?

2009-09-28 Thread Mark Mielke

On 09/28/2009 10:35 AM, Wei Dong wrote:

Hi All,

I noticed a very weird phenomenon while copying data (200KB image 
files) to our glusterfs storage.  When I run only one client, it copies 
roughly 20 files per second, and as soon as I start a second client on 
another machine, the copy rate of the first client immediately degrades 
to 5 files per second.  When I stop the second client, the first client 
immediately speeds up to the original 20 files per second.  When I run 
15 clients, the aggregate throughput is about 8 files per second, much 
worse than running only one client.  Neither CPU nor network is 
saturated.  My volume file is attached.  The servers run on a 66-node 
cluster and the clients on a 15-node cluster.


We have 33x2 servers and at most 15 separate machines, with each 
server serving < 0.5 clients on average.  I cannot think of a reason 
for a distributed system to behave like this.  There must be some kind 
of central access point.


Although there is probably room for the GlusterFS folk to optimize...

You should consider directory write operations to involve the whole 
cluster. Creating a file is a directory write operation. Think of how it 
might have to do self-heal across the cluster, make sure the name is 
right and not already in use across the cluster, and such things.


Once you get to reads and writes for a particular file, it should be 
distributed.


Cheers,
mark

--
Mark Mielke



[Gluster-users] is glusterfs DHT really distributed?

2009-09-28 Thread Wei Dong

Hi All,

I noticed a very weird phenomenon while copying data (200KB image 
files) to our glusterfs storage.  When I run only one client, it copies 
roughly 20 files per second, and as soon as I start a second client on 
another machine, the copy rate of the first client immediately degrades 
to 5 files per second.  When I stop the second client, the first client 
immediately speeds up to the original 20 files per second.  When I run 
15 clients, the aggregate throughput is about 8 files per second, much 
worse than running only one client.  Neither CPU nor network is 
saturated.  My volume file is attached.  The servers run on a 66-node 
cluster and the clients on a 15-node cluster.


We have 33x2 servers and at most 15 separate machines, with each server 
serving < 0.5 clients on average.  I cannot think of a reason for a 
distributed system to behave like this.  There must be some kind of 
central access point.


- Wei






# Server-side volume file: four storage/posix bricks, each wrapped in
# features/locks and performance/io-threads, exported by protocol/server
# on port 7001.
volume posix0
type storage/posix
option directory /state/partition1/gluster
end-volume

volume lock0
type features/locks
subvolumes posix0
end-volume

volume brick0
type performance/io-threads
option thread-count 4
subvolumes lock0
end-volume

volume posix1
type storage/posix
option directory /state/partition2/gluster
end-volume

volume lock1
type features/locks
subvolumes posix1
end-volume

volume brick1
type performance/io-threads
option thread-count 4
subvolumes lock1
end-volume

volume posix2
type storage/posix
option directory /state/partition3/gluster
end-volume

volume lock2
type features/locks
subvolumes posix2
end-volume

volume brick2
type performance/io-threads
option thread-count 4
subvolumes lock2
end-volume

volume posix3
type storage/posix
option directory /state/partition4/gluster
end-volume

volume lock3
type features/locks
subvolumes posix3
end-volume

volume brick3
type performance/io-threads
option thread-count 4
subvolumes lock3
end-volume

volume server
type protocol/server
option transport-type tcp
option transport.socket.listen-port 7001
option auth.addr.brick0.allow *.*.*.*
option auth.addr.brick1.allow *.*.*.*
option auth.addr.brick2.allow *.*.*.*
option auth.addr.brick3.allow *.*.*.*
subvolumes brick0 brick1 brick2 brick3
end-volume


# Client-side volume file: one protocol/client per exported brick, with
# each pair of bricks mirrored by a cluster/replicate volume.
volume brick-0-0-0
type protocol/client
option transport-type tcp
option remote-host c8-0-0
option remote-port 7001
option remote-subvolume brick0
end-volume

volume brick-0-0-1
type protocol/client
option transport-type tcp
option remote-host c8-1-0
option remote-port 7001
option remote-subvolume brick0
end-volume

volume rep-0-0
type cluster/replicate
subvolumes brick-0-0-0 brick-0-0-1
end-volume

volume brick-0-1-0
type protocol/client
option transport-type tcp
option remote-host c8-0-0
option remote-port 7001
option remote-subvolume brick1
end-volume

volume brick-0-1-1
type protocol/client
option transport-type tcp
option remote-host c8-1-0
option remote-port 7001
option remote-subvolume brick1
end-volume

volume rep-0-1
type cluster/replicate
subvolumes brick-0-1-0 brick-0-1-1
end-volume

volume brick-0-2-0
type protocol/client
option transport-type tcp
option remote-host c8-0-0
option remote-port 7001
option remote-subvolume brick2
end-volume

volume brick-0-2-1
type protocol/client
option transport-type tcp
option remote-host c8-1-0
option remote-port 7001
option remote-subvolume brick2
end-volume

volume rep-0-2
type cluster/replicate
subvolumes brick-0-2-0 brick-0-2-1
end-volume

volume brick-0-3-0
type protocol/client
option transport-type tcp
option remote-host c8-0-0
option remote-port 7001
option remote-subvolume brick3
end-volume

volume brick-0-3-1
type protocol/client
option transport-type tcp
option remote-host c8-1-0
option remote-port 7001
option remote-subvolume brick3
end-volume

volume rep-0-3
type cluster/replicate
subvolumes brick-0-3-0 brick-0-3-1
end-volume

volume brick-1-0-0
type protocol/client
option transport-type tcp
option remote-host c8-0-1
option remote-port 7001
option remote-subvolume brick0
end-volume

volume brick-1-0-1
type protocol/client
option transport-type tcp
option remote-host c8-1-1
option remote-port 7001
option remote-subvolume brick0
end-volume

volume rep-1-0
type cluster/replicate
subvolumes brick-1-0-0 brick-1-0-1
end-volume

volume brick-1-1-0
type protocol/client
option transport-type tcp
option remote-host c8-0-1
option remote-port 7001
option remote-subvolume brick1
end-volume

volume brick-1-1-1
type protocol/client
option transport-type tcp
option remote-host c8-1-1
option remote-port 7001
option remote-subvolume brick1
end-volume

volume rep-1-1
type cluster/replicate
subvolumes brick-1-1-0 brick-1-1-1
end-volume

volume brick-1-2-0
type protocol/client
option transport-type tcp
option remote-host c8-0-1
option remote-port 7001
option remote-subvolume brick2
end-volume

volume brick-1-2-1
type protocol/client
option transport-type tcp
option remote-host c8-1-1
option remote-port 7001
option remote-subvolume brick2
end-volume

volume rep-1-2
type clu