Re: [Gluster-users] GlusterFS 3.0.2 small file read performance benchmark

2010-03-02 Thread Tejas N. Bhise
Ed,

Oplocks are implemented by Samba, and they would not be part of GlusterFS per se 
until we implement a native Samba translator (something that would replace the 
Samba server itself with a thin SMB-like layer on top of GlusterFS itself). We 
are doing that for NFS by building an NFS translator.

At some point it would be interesting to explore clustered Samba using CTDB, 
where two GlusterFS clients export the same volume. CTDB itself seems to be 
coming along well now.

Regards,
Tejas.

- Original Message -
From: Ed W li...@wildgooses.com
To: Gluster Users gluster-users@gluster.org
Sent: Wednesday, March 3, 2010 12:10:47 AM GMT +05:30 Chennai, Kolkata, Mumbai, 
New Delhi
Subject: Re: [Gluster-users] GlusterFS 3.0.2 small file read performance 
benchmark

On 01/03/2010 20:44, Ed W wrote:

 I believe Samba (and probably others) use a two-way lock escalation 
 facility to mitigate a similar problem. You can take a read lock (or, 
 phrased differently, express your interest in caching some files/metadata), 
 and then if someone changes what you are watching, a lock break is pushed 
 to you to invalidate your cache.

Seems NFS v4 implements something similar via delegations (not 
believed to be implemented in Linux NFSv4 though...)

In Samba the equivalent are called oplocks.

I guess this would be a great project for someone interested in working on 
it: an oplock translator for Gluster.

Ed W


Re: [Gluster-users] GlusterFS 3.0.2 small file read performance benchmark

2010-03-02 Thread Ed W
Well, oplocks are an SMB concept, but the basic idea of opportunistic 
locking is independent of the filesystem. For example, it appears that 
oplocks now exist in the NFS v4 standard under the name of delegations 
(I would assume some variation of oplocks also exists in GFS and OCFS, 
but I'm not familiar with them).


The basic concept would potentially provide a huge performance boost for 
GlusterFS because it allows cache-coherent write-back caching.


In fact, let's cut to the chase: what we want is cache-coherent write-back 
caching. That is, reads on one server can be served from the local client 
cache, but if the file is changed elsewhere then our local cache is 
immediately invalidated; likewise, we can write at will to a local copy of 
the file and let it get out of sync with the other servers, but as soon as 
some other server tries to read or write that file we must be notified and 
flush our cache (and either request alternative locks or fall back to 
synchronous reads/writes).


How do we do this? In NFS v3 and earlier, and I believe in GlusterFS today, 
the only option implemented is "cache and hope": data is cached for a 
second or so in the hope that the file doesn't change underneath us. The 
improved algorithm is opportunistic locking: the client indicates to the 
server its desire to work with some data locally and let it get out of sync 
with the server; the server then tracks that reservation, and if some other 
client wants to access the data it pushes a lock break to the original 
client, informing it that it needs to fsync and run without the oplock.


I believe an oplock service like this could be implemented via a new 
translator which works in conjunction with the read and write-back caches. 
Effectively it would be a two-way lock manager, but its job is somewhat 
simpler in that all it needs to do is control the existing caches on a 
per-file basis. So, for example, if we read some attributes for some files, 
at present they are blindly cached for X ms and then dropped; our oplock 
translator would instead allow the attributes to be cached indefinitely, 
until we get a push notification from the server side that our cache must 
be invalidated. The same goes for writes: we can use the write-back cache 
as long as no one else has tried to read or write our file, but as soon as 
someone else touches it we need to fsync and run without the cache.
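The client-side half might then look like the sketch below, again in Python 
and again just an illustration with invented names (this is not the 
io-cache or locks translator API): a per-file cache whose entries live 
until the server pushes a break, rather than expiring after a fixed timeout.

from typing import Any, Callable, Dict

class PushInvalidatedCache:
    def __init__(self, fetch: Callable[[str], Any]) -> None:
        self._fetch = fetch                 # goes to the server on a cache miss
        self._entries: Dict[str, Any] = {}

    def lookup(self, path: str) -> Any:
        # Serve attributes/data from the local cache for as long as we hold
        # them; there is no timer-based expiry at all.
        if path not in self._entries:
            self._entries[path] = self._fetch(path)
        return self._entries[path]

    def on_lease_break(self, path: str) -> None:
        # The server pushed a lock break: drop our copy (and, for dirty
        # write-back data, this is where we would fsync before continuing
        # uncached).
        self._entries.pop(path, None)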


I have had a very quick glance at the current locks module and it's quite 
a bit more complex than I might have guessed... I had wondered whether it 
might be possible to make the locks module talk to the cache module and add 
server-side lock breaking through that module? Essentially it's the 
addition of the pushed lock break which helps: if we are reading away and 
some other client modifies a file, then we need a feedback loop to 
invalidate our read cache.


Perhaps this is all implemented in glusterfs already though and I'm just 
missing the point...


Cheers

Ed W

On 02/03/2010 18:52, Tejas N. Bhise wrote:

Ed,

Oplocks are implemented by Samba, and they would not be part of GlusterFS per se 
until we implement a native Samba translator (something that would replace the 
Samba server itself with a thin SMB-like layer on top of GlusterFS itself). We 
are doing that for NFS by building an NFS translator.

At some point it would be interesting to explore clustered Samba using CTDB, 
where two GlusterFS clients export the same volume. CTDB itself seems to be 
coming along well now.

Regards,
Tejas.



Re: [Gluster-users] GlusterFS 3.0.2 small file read performance benchmark

2010-03-01 Thread Ed W

On 27/02/2010 18:56, John Feuerstein wrote:

It would be really great if all of this could be cached within io-cache, 
only falling back to a namespace query (and probably locking) if something 
wants to write to the file, or if the entry has been in the cache for 
longer than cache-timeout seconds. So even if the file has been renamed, 
unlinked, or has changed permissions/metadata, simply serve the io-cache's 
version until it is invalidated. At least that is what I would expect 
io-cache to do. This will introduce a discrepancy between the cached file 
version and the real version in the global namespace, but isn't that what 
one would expect from caching...?


I believe Samba (and probably others) use a two-way lock escalation 
facility to mitigate a similar problem. You can take a read lock (or, 
phrased differently, express your interest in caching some files/metadata), 
and then if someone changes what you are watching, a lock break is pushed 
to you to invalidate your cache.


It seems like something similar would be a candidate for implementation 
with the gluster native clients?


You still have performance issues with random reads, because when you open 
some file you still need to check that it isn't open, locked, or in need of 
replication from some other brick. However, what you can do is proactive 
caching with active notification of any cache invalidation; this benefits 
the case where you re-read data you have already read, and/or where an 
effective read-ahead is grabbing data for you.


Interesting problem

Ed W


[Gluster-users] GlusterFS 3.0.2 small file read performance benchmark

2010-02-27 Thread John Feuerstein
Greetings,

in contrast to some of the performance tips regarding small file *read* 
performance, I want to share these results. The test is rather simple but 
yields some very remarkable results: a 400% improvement in read performance 
by simply dropping some of the so-called performance translators!

Please note that this test resembles a simplified version of our workload, 
which is more or less sequential, read-only small file serving with an 
average of 100 concurrent clients. (We use GlusterFS as a flat-file backend 
for a cluster of webservers, which is hit only after missing some caches in 
a more sophisticated caching infrastructure on top of it.)

The test setup is a 3-node AFR cluster, with server+client on each node in 
the single-process model (one volfile; the local volume is attached to 
within the same process to save overhead), connected via 1 Gbit Ethernet. 
This way each node can continue to operate on its own, even if the whole 
internal network for GlusterFS is down.

We used commodity hardware for the test. Each node is identical:
- Intel Core i7
- 12G RAM
- 500GB filesystem
- 1 Gbit NIC dedicated for GlusterFS

Software:
- Linux 2.6.32.8
- GlusterFS 3.0.2
- FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
- Filesystem / Storage Backend:
  - LVM2 on top of software RAID 1
  - ext4 with noatime

I will paste the configurations inline, so people can comment on them.


/etc/fstab:
-
/dev/data/test  /mnt/brick/test  ext4  noatime  0 2

/etc/glusterfs/test.vol  /mnt/glusterfs/test  glusterfs
noauto,noatime,log-level=NORMAL,log-file=/var/log/glusterfs/test.log 0 0
-


***
Please note: this is the final configuration with the best results. All
translators are numbered to make the explanation easier later on. Unused
translators are commented out...
The volume spec is identical on all nodes, except that the bind-address
option in the server volume [*4*] is adjusted.
***

/etc/glusterfs/test.vol
-
# Sat Feb 27 16:53:00 CET 2010 John Feuerstein j...@feurix.com
#
# Single Process Model with AFR (Automatic File Replication).


##
## Storage backend
##

#
# POSIX STORAGE [*1*]
#
volume posix
  type storage/posix
  option directory /mnt/brick/test/glusterfs
end-volume

#
# POSIX LOCKS [*2*]
#
#volume locks
volume brick
  type features/locks
  subvolumes posix
end-volume


##
## Performance translators (server side)
##

#
# IO-Threads [*3*]
#
#volume brick
#  type performance/io-threads
#  subvolumes locks
#  option thread-count 8
#end-volume

### End of performance translators


#
# TCP/IP server [*4*]
#
volume server
  type protocol/server
  subvolumes brick
  option transport-type tcp
  option transport.socket.bind-address 10.1.0.1   # FIXME
  option transport.socket.listen-port 820
  option transport.socket.nodelay on
  option auth.addr.brick.allow 127.0.0.1,10.1.0.1,10.1.0.2,10.1.0.3
end-volume


#
# TCP/IP clients [*5*]
#
volume node1
  type protocol/client
  option remote-subvolume brick
  option transport-type tcp/client
  option remote-host 10.1.0.1
  option remote-port 820
  option transport.socket.nodelay on
end-volume

volume node2
  type protocol/client
  option remote-subvolume brick
  option transport-type tcp/client
  option remote-host 10.1.0.2
  option remote-port 820
  option transport.socket.nodelay on
end-volume

volume node3
  type protocol/client
  option remote-subvolume brick
  option transport-type tcp/client
  option remote-host 10.1.0.3
  option remote-port 820
  option transport.socket.nodelay on
end-volume


#
# Automatic File Replication Translator (AFR) [*6*]
#
# NOTE: node3 is the primary metadata node, so this one *must*
#   be listed first in all volume specs! Also, node3 is the global
#   favorite-child with the definite file version if any conflict
#   arises while self-healing...
#
volume afr
  type cluster/replicate
  subvolumes node3 node1 node2
  option read-subvolume node2
  option favorite-child node3
end-volume



##
## Performance translators (client side)
##

#
# IO-Threads [*7*]
#
#volume client-threads-1
#  type performance/io-threads
#  subvolumes afr
#  option thread-count 8
#end-volume

#
# Write-Behind [*8*]
#
volume wb
  type performance/write-behind
  subvolumes afr
  option cache-size 4MB
end-volume


#
# Read-Ahead [*9*]
#
#volume ra
#  type performance/read-ahead
#  subvolumes wb
#  option page-count 2
#end-volume


#
# IO-Cache [*10*]
#
volume cache
  type performance/io-cache
  subvolumes wb
  option cache-size 1024MB
  option cache-timeout 60
end-volume

#
# Quick-Read for small files [*11*]
#
#volume qr
#  type performance/quick-read
#  subvolumes cache
#  option cache-timeout 60
#end-volume

#
# Metadata prefetch [*12*]
#
#volume sp
#  type performance/stat-prefetch
#  subvolumes qr
#end-volume

#
# IO-Threads [*13*]
#
#volume 

Re: [Gluster-users] GlusterFS 3.0.2 small file read performance benchmark

2010-02-27 Thread John Feuerstein
After reading the mail again, I'm under the impression that I didn't make 
it clear enough: we don't have a pure read-only workload, but a mostly 
read-only one. This is the reason we've tried GlusterFS with AFR, so we can 
have a multi-master read/write filesystem with a persistent copy on each 
node. If we didn't need write access every now and then, we could have gone 
with plain copies of the data.


Now another idea is the following, based on the fact that the local ext4
filesystem + VFS cache is *much* faster:

 GlusterFS with populated IO-Cache:
 real    0m38.576s
 user    0m3.356s
 sys     0m6.076s
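
(The exact command behind that GlusterFS figure isn't shown in the quoted 
mail; presumably it was the same 100-way tar loop run against the FUSE 
mountpoint, something along these lines, with the data path being my guess:)

$ cd /mnt/glusterfs/test/data
$ for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; time wait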

# Work directly on the back-end (this is read-only...)
$ cd /mnt/brick/test/glusterfs/data

# Ext4 without VFS Cache:
$ echo 3 > /proc/sys/vm/drop_caches
$ for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; time wait
real    0m1.598s
user    0m2.136s
sys     0m3.696s

# Ext4 with VFS Cache:
$ for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; time wait
real    0m1.312s
user    0m2.264s
sys     0m3.256s


So the idea now is to bind-mount the backend filesystem *read-only* and 
use it for all read operations. For all write operations, use the GlusterFS 
mountpoint, which provides locking etc. (This implies some sort of 
read/write splitting, but we can do that...)
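
A minimal sketch of that split (the /mnt/readonly path is just an example, 
and on the kernel/util-linux versions of this era a read-only bind mount 
needs the extra remount step):

$ mkdir -p /mnt/readonly
$ mount --bind /mnt/brick/test/glusterfs /mnt/readonly
$ mount -o remount,ro,bind /mnt/readonly
# ...then point all read-only paths at /mnt/readonly and keep writes on
# /mnt/glusterfs/test.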

The downside is that reads from the backend won't make use of the GlusterFS 
on-demand self-healing. But since 99% of our read-only files are "write 
once, read many times", this could work out. After a node failure, a simple 
"ls -lR" should self-heal everything and then the backend is fine too. The 
chance of reading a broken file is very low?
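
For example, after a failed node comes back, something like this run 
against the GlusterFS mountpoint (not the backend) should walk the whole 
namespace and trigger self-heal; the find variant is just an alternative 
way to stat every file:

$ ls -lR /mnt/glusterfs/test > /dev/null
$ find /mnt/glusterfs/test -noleaf -print0 | xargs --null stat > /dev/null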

Any comments on this idea? Is there something else that could go wrong
by using the backend in a pure read-only fashion that I've missed?

Any ideas why the GlusterFS performance/io-cache translator with a 
cache-timeout of 60 is still so slow? Is there any way to *really* cache 
metadata and file data in GlusterFS _without_ hitting the network and thus 
suffering the very poor small file performance introduced by network 
latency?

Are there any plans to implement support for FS-Cache [1] (CacheFS, 
CacheFiles), shipped with recent Linux kernels? Or to improve io-cache 
along those lines?

[1] http://people.redhat.com/steved/fscache/docs/FS-Cache.pdf

Lots of questions... :)

Best regards,
John


Re: [Gluster-users] GlusterFS 3.0.2 small file read performance benchmark

2010-02-27 Thread John Feuerstein
Another thing that makes me wonder is the read-subvolume setting:

 volume afr
   type cluster/replicate
   ...
   option read-subvolume node2
   ...
 end-volume

So even if we play around and set this to the local node or some remote 
node respectively, it doesn't gain any performance for small files. It 
looks like the whole bottleneck for small files is metadata and the global 
namespace lookup.

It would be really great if all of this could be cached within io-cache, 
only falling back to a namespace query (and probably locking) if something 
wants to write to the file, or if the entry has been in the cache for 
longer than cache-timeout seconds. So even if the file has been renamed, 
unlinked, or has changed permissions/metadata, simply serve the io-cache's 
version until it is invalidated. At least that is what I would expect 
io-cache to do. This will introduce a discrepancy between the cached file 
version and the real version in the global namespace, but isn't that what 
one would expect from caching...?

Note that in all tests the cache-size on all nodes was 1024MB, and the 
whole set of test data was ~240MB. Add some metadata and it's probably at 
250MB. In addition, cache-timeout was 60 seconds, while the whole test took 
around 40 seconds.

So *all* of the read-only test could have been served completely from the 
io-cache... or am I mistaken here?

I'm trying to understand the poor performance, because network latency
should be eliminated by the cache.

Could some Gluster-Dev please elaborate a bit on that one?


Best Regards,
John