Re: [Gluster-users] GlusterFS performance questions

2011-03-15 Thread Ed W
On 14/03/2011 22:18, Alexander Todorov wrote:
 Hello folks,
 I'm looking for GlusterFS performance metrics. What I'm particularly
 interested in is:
 
 * Does adding more bricks to a volume make reads faster?
 * How does replica count affect that?

Although no one seems to really talk about performance in these
terms, I think the limiting factor is usually going to be network
latency.  In very approximate terms, each time you touch a file in
Glusterfs you need to ask every other brick for its opinion as to
whether you have the newest version of the file or not.  Therefore your
file IOs/sec are bounded by your network latency...

So I would presume that those who get InfiniBand network hardware, with
its latency of a few microseconds, see far better performance than those
of us on gigabit and the barely sub-millisecond latency that entails?

So I suspect you can predict rough performance across hardware changes
by thinking about how the network constrains you, e.g. consider
your access pattern: small files vs. large files, small reads vs. large
reads, number of bricks, etc.
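
To make that concrete, here is a back-of-envelope sketch in Python (toy
numbers, and an assumption that each operation costs one round trip -
real operations may need several):

def max_file_ops_per_sec(rtt_seconds, round_trips_per_op=1):
    """Upper bound on synchronous file operations per second when every
    operation must wait for the network round trip to complete."""
    return 1.0 / (rtt_seconds * round_trips_per_op)

print(max_file_ops_per_sec(0.0005))    # gigabit, ~0.5ms RTT -> ~2,000 ops/s
print(max_file_ops_per_sec(0.00001))   # InfiniBand, ~10us RTT -> ~100,000 ops/s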

Note it doesn't seem popular to discuss performance in these terms, but
I think if you read through the old posts on the list you will see that
really it's this network latency vs. required access pattern which
determines whether people feel gluster is fast or slow?

To jump to a conclusion, it makes sense that large reads on large files
do much better than accessing lots of small files...  If you make the
files large enough then you start to test the disk performance instead, etc.

Good luck

Ed W


Re: [Gluster-users] What NAS device(s) do you use? And why?

2010-12-12 Thread Ed W

On 11/12/2010 16:17, Rudi Ahlers wrote:

If you use any NAS (or SAN) devices, what do you use? And I'm
referring more to larger-scale network storage than your home PC or
home theater system.

We've had very good experiences with our NetGear ReadyNAS devices but
I'm in the market for something new. The NetGears aren't the cheapest
ones around but they do what it says on the box. My only real gripe
with them is the lack of decent scalability.

Thecus devices seem to be rather powerful as well, and you can stack
up to 5 units together. But that's where the line stops.


You said no HTPC systems and then listed a couple?

I would have thought that at the 100TB level you would want the 
experience to manage the machine in-house anyway?  You want to be 100% 
comfortable that when that machine goes down you can rescue it...


So I would suggest a Norco or Supermicro case - these go up to 30-36 
drives per physical box.  Then choose your favourite distro and get 
super comfortable with the ins and outs of LVM, Linux RAID and iSCSI.  
Break it, fix it, break it...


There is a growing amount of support for RAID6 as being far more 
reliable than RAID10 for a given set of parameters (and a given 
definition of "reliable").  RAID10 is capable of far more IOPS though, 
so pick your poison...  I definitely buy the double-parity argument 
though, so try to gain it somehow...  (The issue in practice seems to 
be that the first drive feels like protection, but once it's failed 
it's ever so easy to have some kind of tiny error during recovery, e.g. 
an unscrubbed array, unplugging the wrong drive, a gremlin, a second 
drive failure, etc.)
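
As a toy illustration of why the double parity matters - using the
commonly quoted consumer-disk figure of one unrecoverable read error
(URE) per 1e14 bits, which real drives and arrays will deviate from:

def p_ure_during_rebuild(terabytes_read, ure_per_bit=1e-14):
    """Chance of hitting at least one unrecoverable read error while
    re-reading `terabytes_read` TB of surviving disks during a rebuild,
    assuming independent errors at the quoted per-bit rate."""
    bits = terabytes_read * 8e12
    return 1.0 - (1.0 - ure_per_bit) ** bits

# Rebuilding one failed drive in a 12x2TB RAID5 re-reads ~22TB:
print(p_ure_during_rebuild(22))   # ~0.83 - a single URE now costs you data
# Rebuilding one RAID10 mirror re-reads only its 2TB partner:
print(p_ure_during_rebuild(2))    # ~0.15 - better, but hardly rare
# RAID6 keeps a second parity in hand, so a URE during rebuild is survivable.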


I think you can buy a well-supported Supermicro box running a 
well-supported enterprise distro and still spend less than a mid-spec 
NAS at the level you are aiming at?  However, I would 100% concede that 
above the level of NAS boxes using off-the-shelf Linux software there is 
a potentially large performance gap, e.g. a NetApp box should blow away 
your Linux box (caveat: I don't own a NetApp box...)


Remember also that at this kind of storage level you need to be really 
sure what your goals are.  It's not so hard to get 100TB in a single 
chassis, but getting it "reliable" and "fast" (choose your own 
definitions) is a tradeoff and much harder.


Good luck - I love hearing about these larger projects, please send some 
feedback on your choices?


Ed W




Re: [Gluster-users] Possible to use gluster w/ email services + Tuning for fast replication

2010-11-01 Thread Ed W



Right now, I am testing out a 2-node setup, with one server replicating data to another 
node. One thing I noticed was that when I created a file or directory on the server, the new 
data did not replicate to the other node. The only time data is synced from the server to 
the other node is when I run "gluster volume rebalance test start". Is this 
normal? I had envisioned gluster constantly replicating changes from the server to 
the other nodes - am I off base?


Are you examining the second node directly, i.e. not through the mount?  I 
think the point is that replication is only triggered when you access the 
files on the second node through the mount?
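
If so, a minimal sketch of forcing that access-triggered heal is to stat
everything through the client mount (in Python; the mount path here is
an assumption):

import os

MOUNT = "/mnt/glustervol"   # hypothetical GlusterFS client mount point

# Touching each entry through the mount makes the replicate translator
# compare the copies on the bricks and repair any that are out of date.
for root, dirs, files in os.walk(MOUNT):
    for name in dirs + files:
        os.lstat(os.path.join(root, name))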


Glusterfs is targeted at HPC applications where typically the nodes 
are all connected over high-performance interconnects.  It appears that 
performance degrades very quickly as the latency between nodes increases, 
and so whether the solution works for you is largely going to be 
determined by the latency between nodes on your network.


I'm not actually sure what representative numbers should be?  I 
have two machines hooked up using bonded round-robin Intel gigabit cards 
(crossover to each other) and these ping at around 0.3ms.  However, I 
have one other machine on a gigabit connection, hooked up to a switch, 
and that sometimes drops to around 0.15ms...  I believe InfiniBand will 
drop that latency to a few tens of microseconds?
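
If you want to measure your own numbers, a rough sketch is to time TCP
connects to a brick server (24007 is the usual glusterd port; treat this
as an approximation of RTT, not a precise benchmark):

import socket, time

def approx_rtt_ms(host, port=24007, samples=20):
    """Best-case round-trip estimate in ms, from timing the TCP
    handshake to the given host/port over several attempts."""
    best = float("inf")
    for _ in range(samples):
        t0 = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            pass
        best = min(best, time.perf_counter() - t0)
    return best * 1000.0

print(approx_rtt_ms("storage1"))   # hypothetical brick hostname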


So basically every file access on my system would suffer a 0.3ms access 
latency.  This is better than a spinning disk with no cache, which comes 
in more like 3-10ms, but obviously it's still not brilliant.


Please let us know how you get on?

Good luck

Ed W


Re: [Gluster-users] Resync failure

2010-10-17 Thread Ed W

 On 28/09/2010 08:11, Marcus Bointon wrote:

On 28 Sep 2010, at 06:30, Craig Carl wrote:


The extended attributes on the files will be different between the two servers. 
Depending on the version of rsync you are running, it may be reporting 
differences because of the attributes. Can you run md5sum on both servers on a couple 
of the files rsync is telling you are out of sync? If there isn't a difference 
in the md5sum values you are good to go. Otherwise please let us know.

When I say they're out of sync I mean that there are files on one but not the 
other (both ways around, so both additions and deletions have not happened at 
some point) - I'm using cluster/replicate.


Hi Marcus

Can you confirm that you got into this situation by fiddling with the 
files outside of gluster?


I hate reading reports like this on the list because it worries me that 
stuff can get out of sync, but in the majority of cases at least, the 
reason for the lack of sync appears to be some variation of talking to 
the underlying volume directly rather than through the gluster mount 
point?  Can you confirm your problem was traced to this?


Cheers

Ed W


[Gluster-users] Async Replication

2010-10-17 Thread Ed W

 On 11/10/2010 19:36, Aaron Porter wrote:

On Fri, Oct 8, 2010 at 10:03 AM, Christopher J Bidwell
cbidw...@usgs.gov  wrote:

Can gluster be used as a WAN-based distributed filesystem?  I've got four
servers spread around the country to provide geographic redundancy and am
looking for a good system with which I can maintain continuity between the
servers - replication, etc.  Currently I'm using rsync, which is just terrible
and has high overhead, as I've got large directories that need continuous
updates.

We've got some small scale testing going (and working) West Coast (US)
-> East Coast (US). Bandwidth isn't so much of a problem as latency.
Gluster doesn't have an async mode, so you have to wait for your
operations to complete on all nodes -- that can take a while. Our
current setup backs a couple of Samba shares; users seem happy.


I think the paying customers for gluster are HPC compute clusters, 
which as a class have something approaching a significant fraction of 
memory-level speed between servers.  As a result the current focus has 
been on improving throughput for an environment which has very low 
latency access to all the nodes.


I think it's clear that the solution would be some kind of smart 
distributed lock manager which can push locks out to the server closest 
to the client, but that's clearly not a trivial step up from the 
current code base and will likely require someone to pay for it.


I think the Gluster developers might be on the cusp of being receptive 
to such a feature request, but I sense that at this stage it's likely to 
need to be accompanied by some financial commitment...


If there are other businesses with this requirement then now is 
probably a good time to show your hand.  That said, I would expect to 
face some reasonable costs if we wanted to pay some gluster devs for 
time on this?  I guess it's possible they would do it at a reduced 
rate rather than straight time and materials, but first let's see if 
anyone else pipes up before we ask for prices...


GFS has such a lock manager, and I would have thought that in the first 
instance the right answer is to investigate whether integration with 
it would make sense / solve the underlying problem.


Anyway, I guess the point is that just because you and I find gluster 
useful on commodity hardware doesn't mean we are actually the current 
development target market - just lucky users!


Cheers

Ed W


Re: [Gluster-users] Hardware advice?

2010-09-27 Thread Ed W


Now we have to see what kind of price we can get here in Sweden, as I 
guess there will be a hefty shipping cost if ordering things from the US, 
and the PSU will not be the right one for us.


PSUs are almost exclusively 110/220/240/250V (i.e. anything) these days.  
Check with the supplier, but I doubt it's an issue and it will be the same 
part wherever you buy from.  Warranty is a slight issue, but far less 
than you might imagine (you are always at the mercy of the muppets who 
run the shop you buy from, wherever you buy...)


Shipping worldwide is pretty inexpensive these days.  I regularly ship 
30kg parcels from the UK to places such as the US or Singapore.  Prices 
around the £80-£140 mark are normal for, say, FedEx on a 1-2 day express 
shipment.  If you choose a slower carrier you can likely pull that down 
a lot further.  If you do your own freight forwarding then things will 
be even cheaper (but it's a pain in the arse clearing customs yourself, 
etc. - up to you how much you want to economise).


My feeling is that it's not a problem where you shop.

That said, I have no idea who these US folks are, so my point is as much 
that you should buy from my UK guy as that you should buy from the US...  
Someone knowledgeable who can help spec the kit is very valuable though.


So far I'm really impressed with Supermicro (and Intel actually) 
prices.  I'm buying 2x machines with a quad-core Xeon L3426 (low power), 
16GB RAM, 6TB of RAID drives and quad gigabit NICs, in a 1U chassis, and 
they are coming in at a little over £1,400 each.  Someone is bound to tell 
me I'm being robbed, but that seems very good to me (and I could have 
saved some cash if I'd gone for a lower-spec mainboard or chassis...)


Good luck

Ed W



Re: [Gluster-users] Hardware advice?

2010-09-27 Thread Ed W

 On 27/09/2010 09:09, Janne Aho wrote:

On 25/09/10 00:36, Jason Alinen wrote:

Can we set up a call for Monday with our sales engineers?

If so, 2pm PST is available.


Thanks for the offer, but I think the shipping cost will be a 
disadvantage (I don't think you have free shipping to Sweden), as will 
the 9h time difference, but overall I guess you would give better 
service than your Swedish counterparts.



Often your local distributor for Supermicro will need to buy in the 
equipment from the US for you anyway (I just ordered some from a UK chap 
and the delivery time is 2 weeks since it's not in stock).  I tried a few 
other places and they all quote 5-7 days (which means shipping from the 
US on demand).  Hence you are probably no worse off buying from a US 
supplier if the other aspects work for you...


Good luck

Ed W


Re: [Gluster-users] Configuration suggestions (aka poor/slow performance on new hardware)

2010-03-31 Thread Ed W

On 31/03/2010 06:14, Tom Lanyon wrote:

On 31/03/2010, at 2:36 PM, Raghavendra G wrote:

   

Current design of write-behind acknowledges writes (to applications) even
when they've not hit the disk. Can you please explain how this design is
different (if it is different) from the idea you've explained above?
 

Is this gluster method of write-behind acknowledging the writes before they've 
left the client? The method Ed was describing is one where the write is acknowledged 
only once it's reached the server (and a defined number of replication targets), 
even though it hasn't been written to disk on the server yet. This is a 
hybrid approach which safeguards against client power failure before the write 
(which has already been acknowledged) gets pushed to any servers, but improves 
performance over end-to-end write-through as it does not wait for the write 
acknowledgement from the physical disk(s).

   



Agreed.  So assume, say, one client talking over the network to 100 server 
replicas (an absurd number, but useful for clarification).


Our safety levels are:

1) ACK sent as soon as the app hands data to the client OS, before it has 
even left the client machine. Complete data loss is possible if the client 
is unplugged/dies at that instant. (weak / fast)


2) ACK sent only once the data has been sent to all 100 replicas AND 
written to disk. Data loss only possible if all replicas are lost. 
(strong / slowest)


3) ACK sent once X server machines have received the request (to RAM).  
Data loss possible only if all X machines are lost before any of them 
writes the request to disk. A good compromise of speed vs. reliability 
guarantees.
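
A minimal sketch of option 3, assuming hypothetical replica objects
whose send(data) method returns True once the payload is in that
replica's RAM:

from concurrent.futures import ThreadPoolExecutor, as_completed

def write_with_quorum(data, replicas, quorum):
    """Push `data` to every replica in parallel, but ACK the client as
    soon as `quorum` of them confirm receipt in memory; each replica
    flushes to disk on its own schedule afterwards."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.send, data) for r in replicas]
    acked = 0
    for done in as_completed(futures):
        if done.result():
            acked += 1
        if acked >= quorum:
            pool.shutdown(wait=False)   # remaining sends finish in background
            return True                 # ACK the client now
    pool.shutdown(wait=False)
    return False                        # quorum never reached: report failure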



In the simplest situation of a single server, we have roughly 
achieved the effect of moving the writeback cache to the server side.  In 
the case of multiple servers with exactly equal latency to the client, 
we have roughly achieved the same as moving the writeback cache to the 
server side on all servers.  In the case of non-equal latency between 
client and server, or with server-side replication, or with very busy 
servers, we gain a performance improvement due to the lower latency 
before the ACK is sent to the client.


I thought this was a very clever technique and actually very compatible 
with the gluster philosophy (independent bricks)


Ed W


Re: [Gluster-users] Configuration suggestions (aka poor/slow performance on new hardware)

2010-03-29 Thread Ed W

On 26/03/2010 18:22, Ramiro Magallanes wrote:


You could run the genfiles script simultaneously (my English is really
poor - we can change the subject of this mail to something like "poor
performance and poor English" xDDD) but it's not like a threaded
application (iozone rulez).

If I run 3 processes of genfiles.sh I get 440, 441, and 450 files
(approx. 1300 files), but if you add some more processes you're not
going to obtain any big number :)

With 6 genfiles at the same time I have:

PID 12832 : 249 files created in 60 seconds.
PID 12830 : 249 files created in 60 seconds.
PID 12829 : 248 files created in 60 seconds.
PID 12827 : 262 files created in 60 seconds.
PID 12828 : 252 files created in 60 seconds.
PID 12831 : 255 files created in 60 seconds.

1515 files.


Just speaking theoretically, I believe that without a write-behind 
cache on the client side, gluster is required to effectively sync 
after each file operation (well, it's probably only half a sync, but some 
variation of this problem).  This is safe, but of course it reduces 
write speed to something which is a function of the network latency.


So in your case, if you had say around 1ms of latency, you would be 
limited to around 1,000 operations per second simply due to the wait 
for the far side to ACK each operation.  This seems to correlate with the 
figures you are seeing (can you show your ping time and correlate it with 
IOs per second?)
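
For comparison, here is a rough Python stand-in for genfiles.sh (not the
original script) that you could run against a mounted volume:

import os, time, uuid

def genfiles(directory, seconds=60):
    """Create empty files as fast as possible for `seconds` and return
    the count; on a synchronous network filesystem the rate is roughly
    bounded by 1/RTT divided by the round trips needed per create."""
    deadline = time.monotonic() + seconds
    count = 0
    while time.monotonic() < deadline:
        fd = os.open(os.path.join(directory, uuid.uuid4().hex),
                     os.O_CREAT | os.O_WRONLY)
        os.close(fd)
        count += 1
    return count

print(genfiles("/mnt/glustervol/bench"))   # hypothetical mount subdirectory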


I don't see this as a gluster issue - it's a fundamental limitation 
whenever you want an ACK for network-based operations.  Many people 
switch to Fibre Channel or similar for the IO for exactly this reason.  
If you can drop the latency by a factor of 10 then you increase 
your IOs by a factor of 10.


Untested, but at least theoretically, switching on write-behind caching on 
the client should mean that it ploughs on without waiting on network 
latency for the ACK.  There are lots of potential issues, but if this is 
OK for your requirements then give it a try?



Note, just an idea for the gluster guys, but I think I saw in AFS (or 
was it something else?) a kind of hybrid server-side writeback cache.  
The idea was that the server could ACK the write once a certain number of 
storage nodes at least had the pending IO in memory, even if it hadn't 
hit the disk yet.  This is subtly different to server-side writeback, 
but seems like a very neat idea.  Note it's probably not relevant to 
small-file creation tests like the above, but it would help in other 
situations.


I do think some of the benchmarks here might not really be addressing 
network latency as the limiting bottleneck?


Good luck

Ed W


Re: [Gluster-users] GlusterFS 3.0.2 small file read performance benchmark

2010-03-02 Thread Ed W
Well, oplocks are an SMB concept, but the basic idea of 
opportunistic locking is independent of the filesystem.  For example, it 
appears that oplocks now appear in the NFSv4 standard under the name 
"delegations" (I would assume some variation of oplocks also exists in 
GFS and OCFS, but I'm not familiar with them).


The basic concept would potentially provide a huge performance boost for 
glusterfs because it allows cache-coherent writeback caching.


In fact let's cut to the chase - what we desire is cache-coherent 
writeback caching, i.e. reads on one server can be served from the local 
client cache, but if the file is changed elsewhere then our local cache 
is instantly invalidated; likewise we can write at will to a local 
copy of the file and allow it to get out of sync with the other servers, 
but as soon as some other server tries to read/write our file we 
must be notified and must flush our cache (and request alternative locks 
or fall back to sync reads/writes).


How do we do this?  Well, NFSv3 and before - and I believe Glusterfs - 
implement only a "cache and hope" option, which caches data 
for a second or so and hopes the file doesn't change underneath us.  The 
improved algorithm is opportunistic locking, where the client indicates 
to the server its desire to work with some data locally and let it get 
out of sync with the server; the server then tracks that reservation, 
and if some other client wants to access the data it pushes a lock break 
to the original client, informing it that it needs to fsync and run 
without the oplock.
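
The "cache and hope" baseline is trivial to sketch (illustrative only,
not the actual io-cache code):

import time

class TTLCache:
    """Serve entries from cache for `ttl` seconds and hope nobody changed
    the file underneath us (the NFSv3-style approach described above)."""
    def __init__(self, ttl=1.0):
        self.ttl = ttl
        self.entries = {}   # path -> (expires_at, value)

    def get(self, path, fetch):
        hit = self.entries.get(path)
        if hit and hit[0] > time.monotonic():
            return hit[1]   # possibly stale - we just hope it isn't
        value = fetch(path)
        self.entries[path] = (time.monotonic() + self.ttl, value)
        return value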


I believe such an oplock service could be implemented via a new 
translator which works in conjunction with the read and writeback 
caching.  Effectively it would be a two-way lock manager, but its job is 
somewhat simpler in that all it need do is vary the existing caches on 
a per-file basis.  So, for example, if we read some attributes for some 
files then at present they are blindly cached for X ms and then dropped, 
but our oplock translator would instead allow the attributes to be cached 
indefinitely, until we get a push notification from the server side that 
our cache must be invalidated.  The same goes for writes - we can use the 
writeback cache as long as no one else has tried to read or write our 
file, but as soon as someone else touches it we need to fsync and run 
without the cache.
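
Here is a minimal sketch of the client side of that idea, with
hypothetical fetch/flush callbacks standing in for the real translator
plumbing:

class OplockCache:
    """Cache entries indefinitely until the server pushes a lock break;
    unflushed writeback data is synced when the break arrives."""
    def __init__(self):
        self.entries = {}   # path -> cached data/attributes
        self.dirty = set()  # paths with unflushed writeback data

    def read(self, path, fetch):
        if path not in self.entries:
            self.entries[path] = fetch(path)   # valid until a lock break
        return self.entries[path]

    def write(self, path, data):
        self.entries[path] = data   # writeback: no network round trip yet
        self.dirty.add(path)

    def on_lock_break(self, path, flush):
        """Server push: another client touched `path`; fsync and invalidate."""
        if path in self.dirty:
            flush(path, self.entries[path])
            self.dirty.discard(path)
        self.entries.pop(path, None)   # next access falls back to sync IO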


I have had a very quick glance at the current locks module and it's 
quite a bit more complex than I might have guessed...  I had wondered 
whether it might be possible to make the locks module talk to the cache 
module and add server-side lock breaking through that module?  
Essentially it's the addition of the "push" lock breaking which helps: 
if we are reading away and some other client modifies a file then we 
need a feedback loop to invalidate our read cache.


Perhaps this is all implemented in glusterfs already though and I'm just 
missing the point...


Cheers

Ed W

On 02/03/2010 18:52, Tejas N. Bhise wrote:

Ed,

oplocks are implemented by SAMBA and would not be a part of GlusterFS per se 
till we implement a native SAMBA translator (something that would replace the 
SAMBA server itself with a thin SAMBA kind of layer on top of GlusterFS 
itself). We are doing that for NFS by building an NFS translator.

At some point it would be interesting to explore clustered SAMBA using ctdb, 
where two GlusterFS clients can export the same volume. ctdb itself seems to be 
coming up well now.

Regards,
Tejas.

- Original Message -
From: Ed W li...@wildgooses.com
To: Gluster Users gluster-users@gluster.org
Sent: Wednesday, March 3, 2010 12:10:47 AM GMT +05:30 Chennai, Kolkata, Mumbai, 
New Delhi
Subject: Re: [Gluster-users] GlusterFS 3.0.2 small file read performance 
benchmark

On 01/03/2010 20:44, Ed W wrote:
   

I believe samba (and probably others) use a two-way lock escalation
facility to mitigate a similar problem.  So you can "read-lock", or,
phrased differently, express your interest in caching some
files/metadata, and then if someone changes what you are watching the
lock break is pushed to you to invalidate your cache.
 

It seems NFS v4 implements something similar via "delegations" (not
believed to be implemented in Linux NFSv4 though...)

In samba the equivalent are called oplocks.

I guess this would be a great project for someone interested in working
on it - an oplock translator for gluster.

Ed W




[Gluster-users] Issue with replication of open files & server reboot

2010-03-01 Thread Ed W
Hi, is there an open bug report that I can follow for development on the 
issue reported here:

http://gluster.com/community/documentation/index.php/Understanding_AFR_Translator#File_re-opening_after_a_server_comes_back_up:


For my use case it seems rather worrying that if one server goes down, 
then potentially all files open at that point are now corrupted? As I 
understand the issue, the files will never be corrected or self-healed - 
is this correct?


Thanks

Ed W


Re: [Gluster-users] GlusterFS 3.0.2 small file read performance benchmark

2010-03-01 Thread Ed W

On 27/02/2010 18:56, John Feuerstein wrote:

It would be really great if all of this could be cached within io-cache,
only falling back to a namespace query (and probably locking) if
something wants to write to the file, or if the result has been in the
cache for longer than cache-timeout seconds. So even if the file has been
renamed, is unlinked, or has changed permissions / metadata - simply take
the version from the io-cache until it's invalidated. At least that is
what I would expect the io-cache to do. This will introduce a
discrepancy between the cached file version and the real version in the
global namespace, but isn't that what one would expect from caching...?
   


I believe samba (and probably others) use a two-way lock escalation 
facility to mitigate a similar problem.  So you can "read-lock", or, 
phrased differently, express your interest in caching some 
files/metadata, and then if someone changes what you are watching the 
lock break is pushed to you to invalidate your cache.


It seems like something similar would be a candidate for implementation 
with the gluster native clients?


You still have performance issues with random reads, because when you 
try to open some file you still need to check that it's not open/locked/in 
need of replication from some other brick.  However, what you can do is 
have proactive caching with active notification of any cache invalidation, 
and this benefits the situation where you re-read stuff you have already 
read, and/or where you have an effective read-ahead grabbing stuff 
for you.


Interesting problem

Ed W


Re: [Gluster-users] Issue with replication of open files & server reboot

2010-03-01 Thread Ed W

On 01/03/2010 20:19, Vikas Gorur wrote:

On Mar 1, 2010, at 11:59 AM, Ed W wrote:

   

Hi, is there an open bug report that I can follow for development on the 
issue reported here:

http://gluster.com/community/documentation/index.php/Understanding_AFR_Translator#File_re-opening_after_a_server_comes_back_up:

 


This issue has been fixed in the 3.x releases.
   


Aha! Super. Many thanks.

I updated the docs to state this (link above).

Is anything else on that page resolved in 3.x? E.g. self-heal of files 
not on the first subvolume, or self-heal of hardlinked files?


Thanks

Ed W