Re: [Gluster-users] is glusterfs DHT really distributed?
I think TCP_NODELAY is critical to performance. Actually, after spending a large number of unfruitful hours on glusterfs, I wrote my own simple shared storage with a BerkeleyDB backend, and I found that enabling TCP_NODELAY on my system gives me nearly 10x readback throughput. Thanks for pointing this out, I'll definitely try it.

- Wei

Mark Mielke wrote:
> On 09/29/2009 03:39 AM, David Saez Padros wrote:
>>> The second is 'option transport.socket.nodelay on' in each of your protocol/client _and_ protocol/server volumes.
>>
>> where is this option documented ?
>
> I'm a little surprised TCP_NODELAY isn't set by default? I set it on all servers I write as a matter of principle. The Nagle algorithm exists so that very simple servers have acceptable performance; the servers that benefit are the ones that write individual bytes with no buffering. Serious servers intended to perform well should be able to easily beat the Nagle algorithm: writev(), sendmsg(), or even write(buffer) where the buffer is built first should all beat it in terms of increased throughput and reduced latency. On Linux, there is also TCP_CORK.
>
> Unless GlusterFS does small writes, I suggest TCP_NODELAY be set by default in future releases. Just an opinion. :-)
>
> Cheers,
> mark
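For anyone wanting to try the same thing in their own server code, the change Wei describes comes down to one setsockopt() call per connected socket. A minimal generic sketch follows; the descriptor name 'fd' and the error handling are illustrative, and this is not GlusterFS's own transport code:

    /* Minimal sketch: disable Nagle's algorithm on a connected TCP socket. */
    #include <errno.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    static int set_nodelay(int fd)
    {
        int on = 1;

        /* With TCP_NODELAY set, small writes (e.g. RPC headers or short
           read replies) go out immediately instead of waiting for the
           previous segment to be acknowledged. */
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) < 0) {
            fprintf(stderr, "setsockopt(TCP_NODELAY): %s\n", strerror(errno));
            return -1;
        }
        return 0;
    }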
Re: [Gluster-users] is glusterfs DHT really distributed?
David Saez Padros wrote:
> Hi
>
>> The second is 'option transport.socket.nodelay on' in each of your protocol/client _and_ protocol/server volumes.
>
> where is this option documented ?

Thanks for pointing this out. We wanted to expose this as a regular option in the upcoming 2.1 release and had introduced it as an experimental option in the 2.0.x releases. Hence it will be documented in the 2.1 user manual.

Thanks,
Vijay
Re: [Gluster-users] is glusterfs DHT really distributed?
Mark Mielke wrote:
> I'm a little surprised TCP_NODELAY isn't set by default? I set it on all servers I write as a matter of principle. Serious servers intended to perform well should be able to easily beat the Nagle algorithm: writev(), sendmsg(), or even write(buffer) where the buffer is built first should all beat it in terms of increased throughput and reduced latency. On Linux, there is also TCP_CORK.
>
> Unless GlusterFS does small writes, I suggest TCP_NODELAY be set by default in future releases. Just an opinion. :-)

Thanks for this feedback, Mark. Pre-2.0.3 there was no option to turn off Nagle's algorithm. We introduced this option in 2.0.3 and are debating whether it should become the default, since that means altering existing default behavior :-). We will certainly consider making it the default in our upcoming releases.

Thanks,
Vijay
Re: [Gluster-users] is glusterfs DHT really distributed?
On 09/29/2009 03:39 AM, David Saez Padros wrote:
>> The second is 'option transport.socket.nodelay on' in each of your protocol/client _and_ protocol/server volumes.
>
> where is this option documented ?

I'm a little surprised TCP_NODELAY isn't set by default? I set it on all servers I write as a matter of principle. The Nagle algorithm exists so that very simple servers have acceptable performance; the servers that benefit are the ones that write individual bytes with no buffering. Serious servers intended to perform well should be able to easily beat the Nagle algorithm: writev(), sendmsg(), or even write(buffer) where the buffer is built first should all beat it in terms of increased throughput and reduced latency. On Linux, there is also TCP_CORK.

Unless GlusterFS does small writes, I suggest TCP_NODELAY be set by default in future releases. Just an opinion. :-)

Cheers,
mark

--
Mark Mielke
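To make the writev() point concrete, a rough sketch of the pattern Mark describes is below: build the small header and the payload separately, then hand both to the kernel in a single call, so there is no tiny trailing write for Nagle to batch. The struct and function names are purely illustrative (not GlusterFS code), and a real server would also loop on short writes; on Linux, setting TCP_CORK before such a burst and clearing it afterwards achieves a similar coalescing effect.

    /* Illustrative only: send a small header plus payload in one writev()
       call instead of two small write() calls. */
    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    struct msg_hdr {            /* hypothetical wire header */
        uint32_t len;
        uint32_t op;
    };

    static ssize_t send_message(int fd, uint32_t op,
                                const void *payload, uint32_t len)
    {
        struct msg_hdr hdr = { .len = len, .op = op };
        struct iovec iov[2] = {
            { .iov_base = &hdr,            .iov_len = sizeof(hdr) },
            { .iov_base = (void *)payload, .iov_len = len },
        };

        /* One syscall, one coalesced segment (up to the MSS): no need to
           rely on Nagle to merge a tiny header write with the payload. */
        return writev(fd, iov, 2);
    }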
Re: [Gluster-users] is glusterfs DHT really distributed?
Hi

> The second is 'option transport.socket.nodelay on' in each of your protocol/client _and_ protocol/server volumes.

where is this option documented ?

--
Thanx & best regards ...

David Saez Padros              http://www.ols.es
On-Line Services 2000 S.L.     telf +34 902 50 29 75
Re: [Gluster-users] is glusterfs DHT really distributed?
> http://www.gluster.com/community/documentation/index.php/Translators/cluster/distribute
>
> It seems to suggest that the default for 'lookup-unhashed' is 'on'.
>
> Perhaps try turning it 'off'?

Wei,

There are two things we would like you to try. The first is what Mark has just pointed out: 'option lookup-unhashed off' in the distribute volume. The second is 'option transport.socket.nodelay on' in each of your protocol/client _and_ protocol/server volumes. Do let us know what influence these changes have on your performance.

Avati
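For readers following along with Wei's volume files (quoted at the bottom of this thread), a rough sketch of where these two options would sit is below. The volume and host names here are illustrative, and the exact option names should be verified against the release in use, since nodelay was still experimental in the 2.0.x series:

    # illustrative volume names, not Wei's actual files
    volume server
      type protocol/server
      option transport-type tcp
      option transport.socket.nodelay on
      option auth.addr.brick0.allow *
      subvolumes brick0
    end-volume

    volume remote-brick0
      type protocol/client
      option transport-type tcp
      option remote-host node01
      option remote-subvolume brick0
      option transport.socket.nodelay on
    end-volume

    volume dht
      type cluster/distribute
      option lookup-unhashed off
      subvolumes remote-brick0
    end-volume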
Re: [Gluster-users] is glusterfs DHT really distributed?
On 09/28/2009 10:51 AM, Wei Dong wrote:
> Your reply makes sense to me. I remember that auto-heal happens when a file is read; does that mean opening a file for read is also a global operation? Do you mean that there is no other way of copying 30 million files to our 66-node glusterfs cluster for parallel processing than waiting for half a month? Can I somehow disable self-heal and get a speedup? Things are turning out badly for me.

On this page:

http://www.gluster.com/community/documentation/index.php/Translators/cluster/distribute

it seems to suggest that the default for 'lookup-unhashed' is 'on'. Perhaps try turning it 'off'?

Cheers,
mark

> Mark Mielke wrote:
>> On 09/28/2009 10:35 AM, Wei Dong wrote:
>>> Hi All, I noticed a very weird phenomenon while copying data (200KB image files) to our glusterfs storage. When I run only one client, it copies roughly 20 files per second; as soon as I start a second client on another machine, the copy rate of the first client immediately degrades to 5 files per second. When I stop the second client, the first client immediately speeds up to the original 20 files per second. When I run 15 clients, the aggregate throughput is about 8 files per second, much worse than running only one client. Neither CPU nor network is saturated. My volume file is attached. The servers run on a 66-node cluster and the clients on a 15-node cluster. We have 33x2 servers and at most 15 client machines, so each server serves fewer than 0.5 clients on average. I cannot think of a reason for a distributed system to behave like this. There must be some kind of central access point.
>>
>> Although there is probably room for the GlusterFS folks to optimize... You should expect directory write operations to involve the whole cluster. Creating a file is a directory write operation. Think of how it might have to do self-heal across the cluster, make sure the name is right and not already in use across the cluster, and such things. Once you get to reads and writes for a particular file, it should be distributed.
>>
>> Cheers,
>> mark

--
Mark Mielke
Re: [Gluster-users] is glusterfs DHT really distributed?
Your reply makes sense to me. I remember that auto-heal happens when a file is read; does that mean opening a file for read is also a global operation? Do you mean that there is no other way of copying 30 million files to our 66-node glusterfs cluster for parallel processing than waiting for half a month? Can I somehow disable self-heal and get a speedup? Things are turning out badly for me.

- Wei

Mark Mielke wrote:
> On 09/28/2009 10:35 AM, Wei Dong wrote:
>> Hi All, I noticed a very weird phenomenon while copying data (200KB image files) to our glusterfs storage. When I run only one client, it copies roughly 20 files per second; as soon as I start a second client on another machine, the copy rate of the first client immediately degrades to 5 files per second. When I stop the second client, the first client immediately speeds up to the original 20 files per second. When I run 15 clients, the aggregate throughput is about 8 files per second, much worse than running only one client. Neither CPU nor network is saturated. My volume file is attached. The servers run on a 66-node cluster and the clients on a 15-node cluster. We have 33x2 servers and at most 15 client machines, so each server serves fewer than 0.5 clients on average. I cannot think of a reason for a distributed system to behave like this. There must be some kind of central access point.
>
> Although there is probably room for the GlusterFS folks to optimize... You should expect directory write operations to involve the whole cluster. Creating a file is a directory write operation. Think of how it might have to do self-heal across the cluster, make sure the name is right and not already in use across the cluster, and such things. Once you get to reads and writes for a particular file, it should be distributed.
>
> Cheers,
> mark
Re: [Gluster-users] is glusterfs DHT really distributed?
On 09/28/2009 10:35 AM, Wei Dong wrote:
> Hi All, I noticed a very weird phenomenon while copying data (200KB image files) to our glusterfs storage. When I run only one client, it copies roughly 20 files per second; as soon as I start a second client on another machine, the copy rate of the first client immediately degrades to 5 files per second. When I stop the second client, the first client immediately speeds up to the original 20 files per second. When I run 15 clients, the aggregate throughput is about 8 files per second, much worse than running only one client. Neither CPU nor network is saturated. My volume file is attached. The servers run on a 66-node cluster and the clients on a 15-node cluster. We have 33x2 servers and at most 15 client machines, so each server serves fewer than 0.5 clients on average. I cannot think of a reason for a distributed system to behave like this. There must be some kind of central access point.

Although there is probably room for the GlusterFS folks to optimize... You should expect directory write operations to involve the whole cluster. Creating a file is a directory write operation. Think of how it might have to do self-heal across the cluster, make sure the name is right and not already in use across the cluster, and such things. Once you get to reads and writes for a particular file, it should be distributed.

Cheers,
mark

--
Mark Mielke
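A conceptual sketch of why creating a file can touch every server may help here. With unhashed lookups enabled, a name lookup that misses on the hashed subvolume (which is the normal case for a file that is about to be created) is retried on every subvolume, so each create costs roughly one round trip per server. The types, helper names, and hash below are hypothetical stand-ins, not GlusterFS source:

    /* Conceptual sketch only: DHT-style name resolution before a create. */
    #include <stdbool.h>

    struct server { const char *name; };

    /* stand-in for the real per-server LOOKUP RPC */
    static bool server_has(const struct server *s, const char *path)
    {
        (void)s; (void)path;
        return false;
    }

    static int resolve(struct server *servers, int n, const char *path,
                       bool lookup_unhashed)
    {
        unsigned h = 0;
        for (const char *p = path; *p; p++)   /* toy hash, not the real one */
            h = h * 31 + (unsigned char)*p;

        if (server_has(&servers[h % n], path))
            return (int)(h % n);              /* one RPC: the hashed server */

        if (!lookup_unhashed)
            return -1;                        /* trust the hash: stop here */

        /* On a miss with unhashed lookups enabled, ask every server:
           one extra RPC per server for every file created. */
        for (int i = 0; i < n; i++)
            if (server_has(&servers[i], path))
                return i;
        return -1;
    }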
[Gluster-users] is glusterfs DHT really distributed?
Hi All,

I noticed a very weird phenomenon while copying data (200KB image files) to our glusterfs storage. When I run only one client, it copies roughly 20 files per second; as soon as I start a second client on another machine, the copy rate of the first client immediately degrades to 5 files per second. When I stop the second client, the first client immediately speeds up to the original 20 files per second. When I run 15 clients, the aggregate throughput is about 8 files per second, much worse than running only one client. Neither CPU nor network is saturated. My volume file is attached.

The servers run on a 66-node cluster and the clients on a 15-node cluster. We have 33x2 servers and at most 15 client machines, so each server serves fewer than 0.5 clients on average. I cannot think of a reason for a distributed system to behave like this. There must be some kind of central access point.

- Wei

volume posix0
  type storage/posix
  option directory /state/partition1/gluster
end-volume

volume lock0
  type features/locks
  subvolumes posix0
end-volume

volume brick0
  type performance/io-threads
  option thread-count 4
  subvolumes lock0
end-volume

volume posix1
  type storage/posix
  option directory /state/partition2/gluster
end-volume

volume lock1
  type features/locks
  subvolumes posix1
end-volume

volume brick1
  type performance/io-threads
  option thread-count 4
  subvolumes lock1
end-volume

volume posix2
  type storage/posix
  option directory /state/partition3/gluster
end-volume

volume lock2
  type features/locks
  subvolumes posix2
end-volume

volume brick2
  type performance/io-threads
  option thread-count 4
  subvolumes lock2
end-volume

volume posix3
  type storage/posix
  option directory /state/partition4/gluster
end-volume

volume lock3
  type features/locks
  subvolumes posix3
end-volume

volume brick3
  type performance/io-threads
  option thread-count 4
  subvolumes lock3
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option transport.socket.listen-port 7001
  option auth.addr.brick0.allow *.*.*.*
  option auth.addr.brick1.allow *.*.*.*
  option auth.addr.brick2.allow *.*.*.*
  option auth.addr.brick3.allow *.*.*.*
  subvolumes brick0 brick1 brick2 brick3
end-volume

volume brick-0-0-0
  type protocol/client
  option transport-type tcp
  option remote-host c8-0-0
  option remote-port 7001
  option remote-subvolume brick0
end-volume

volume brick-0-0-1
  type protocol/client
  option transport-type tcp
  option remote-host c8-1-0
  option remote-port 7001
  option remote-subvolume brick0
end-volume

volume rep-0-0
  type cluster/replicate
  subvolumes brick-0-0-0 brick-0-0-1
end-volume

volume brick-0-1-0
  type protocol/client
  option transport-type tcp
  option remote-host c8-0-0
  option remote-port 7001
  option remote-subvolume brick1
end-volume

volume brick-0-1-1
  type protocol/client
  option transport-type tcp
  option remote-host c8-1-0
  option remote-port 7001
  option remote-subvolume brick1
end-volume

volume rep-0-1
  type cluster/replicate
  subvolumes brick-0-1-0 brick-0-1-1
end-volume

volume brick-0-2-0
  type protocol/client
  option transport-type tcp
  option remote-host c8-0-0
  option remote-port 7001
  option remote-subvolume brick2
end-volume

volume brick-0-2-1
  type protocol/client
  option transport-type tcp
  option remote-host c8-1-0
  option remote-port 7001
  option remote-subvolume brick2
end-volume

volume rep-0-2
  type cluster/replicate
  subvolumes brick-0-2-0 brick-0-2-1
end-volume

volume brick-0-3-0
  type protocol/client
  option transport-type tcp
  option remote-host c8-0-0
  option remote-port 7001
  option remote-subvolume brick3
end-volume

volume brick-0-3-1
  type protocol/client
  option transport-type tcp
  option remote-host c8-1-0
  option remote-port 7001
  option remote-subvolume brick3
end-volume

volume rep-0-3
  type cluster/replicate
  subvolumes brick-0-3-0 brick-0-3-1
end-volume

volume brick-1-0-0
  type protocol/client
  option transport-type tcp
  option remote-host c8-0-1
  option remote-port 7001
  option remote-subvolume brick0
end-volume

volume brick-1-0-1
  type protocol/client
  option transport-type tcp
  option remote-host c8-1-1
  option remote-port 7001
  option remote-subvolume brick0
end-volume

volume rep-1-0
  type cluster/replicate
  subvolumes brick-1-0-0 brick-1-0-1
end-volume

volume brick-1-1-0
  type protocol/client
  option transport-type tcp
  option remote-host c8-0-1
  option remote-port 7001
  option remote-subvolume brick1
end-volume

volume brick-1-1-1
  type protocol/client
  option transport-type tcp
  option remote-host c8-1-1
  option remote-port 7001
  option remote-subvolume brick1
end-volume

volume rep-1-1
  type cluster/replicate
  subvolumes brick-1-1-0 brick-1-1-1
end-volume

volume brick-1-2-0
  type protocol/client
  option transport-type tcp
  option remote-host c8-0-1
  option remote-port 7001
  option remote-subvolume brick2
end-volume

volume brick-1-2-1
  type protocol/client
  option transport-type tcp
  option remote-host c8-1-1
  option remote-port 7001
  option remote-subvolume brick2
end-volume

volume rep-1-2
  type clu