Re: [Gluster-users] GlusterFS performance questions
On 14/03/2011 22:18, Alexander Todorov wrote:
> Hello folks, I'm looking for GlusterFS performance metrics. What I'm
> interested in, in particular, is:
> * Does adding more bricks to a volume make reads faster?
> * How does the replica count affect that?

Although no one seems to talk about performance in these terms, I think the limiting factor is usually going to be network latency. In very approximate terms, each time you touch a file in GlusterFS you need to ask every other brick for its opinion on whether you have the newest copy. Your file IOs/sec are therefore bounded by your network latency.

So I would presume that those who run InfiniBand hardware, with its few-µs latencies, see far better performance than those of us on gigabit and the barely sub-millisecond latency that entails.

I suspect you can predict rough performance across hardware changes by thinking about how the network constrains you - consider your access pattern: small files vs large files, small reads vs large reads, number of bricks, etc.

Note that it doesn't seem popular to discuss performance in these terms, but if you read through the old posts on the list, I think you will see that it's really this network latency vs the required access pattern which determines whether people find Gluster fast or slow.

To jump to a conclusion: it makes sense that large reads on large files do much better than access to lots of small files. If you make the files large enough then you start to test disk performance instead.

Good luck

Ed W

___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
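[Editor's note: a rough back-of-envelope illustration of the latency bound described above - my own numbers, not Gluster measurements. It assumes one serial network round trip per file operation, which is a simplification.]

```python
# Rough upper bound on synchronous file ops/sec, assuming each
# operation costs at least one network round trip to the bricks.
# The latency figures below are illustrative, not measured.

def max_ops_per_sec(round_trip_seconds):
    """Latency-imposed ceiling on serial operations per second."""
    return 1.0 / round_trip_seconds

links = {
    "InfiniBand (~10 us)": 10e-6,
    "Gigabit LAN (~0.3 ms)": 0.3e-3,
    "WAN (~30 ms)": 30e-3,
}

for name, rtt in links.items():
    print(f"{name}: ~{max_ops_per_sec(rtt):,.0f} ops/sec ceiling")
```

This is why access pattern dominates: one large streaming read pays the round-trip cost once, while ten thousand small files pay it ten thousand times.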
Re: [Gluster-users] What NAS device(s) do you use? And why?
On 11/12/2010 16:17, Rudi Ahlers wrote:
> If you use any NAS (or SAN) devices, what do you use? And I'm
> referring more to larger scale network storage than your home PC or
> home theatre system. We've had very good experiences with our NetGear
> ReadyNAS devices but I'm in the market for something new. The
> NetGears aren't the cheapest ones around but they do what it says on
> the box. My only real gripe with them is the lack of decent
> scalability. Thecus devices seem to be rather powerful as well, and
> you can stack up to 5 units together. But that's where the line stops.

You said no HTPC systems and then listed a couple?

I would have thought at the 100TB level you would want the experience to manage the machine in-house anyway? You want to be 100% comfortable that when that machine goes down you can rescue it. So I would suggest a Norco or Supermicro case - these go up to 30-36 drives per physical box. Then choose your favourite distro and get super comfortable with the ins and outs of LVM, Linux RAID and iSCSI. Break it, fix it, break it, fix it...

There is a growing body of support for RAID6 being far more reliable than RAID10 for a given set of parameters (and a given definition of reliable). RAID10 is capable of far more IOPS though, so pick your poison... I definitely buy the double-parity argument, so try to gain it somehow. (The issue in practice seems to be that the first redundant drive feels like protection, but once it has failed it's ever so easy to have some kind of tiny error during recovery, e.g. an unscrubbed array, unplugging the wrong drive, a gremlin, a second drive failure, etc.)

I think you can buy a well-supported Supermicro box, with support from a well-supported enterprise distro, and still spend less than a mid-spec NAS at the level you are aiming at?
However, I would 100% concede that above the level of NAS boxes built on off-the-shelf Linux software there is a potentially large performance gap, e.g. a NetApp box should blow away your Linux box (caveat: I don't own a NetApp box...)

Remember also that at this kind of storage level you need to be really sure what your goals are. It's not so hard to get 100TB in a single chassis, but getting it reliable AND fast (choose your own definitions) is a tradeoff and much harder.

Good luck - I love hearing about these larger projects, please send some feedback on your choices?

Ed W
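[Editor's note: the RAID6 vs RAID10 tradeoff above, made concrete with illustrative arithmetic only - a real reliability comparison would also need rebuild times and unrecoverable-read-error rates.]

```python
# Compare usable capacity and guaranteed failure tolerance of
# RAID6 vs RAID10 for the same drive count. Illustrative only.

def raid6(drives, size_tb):
    # Two drives' worth of parity; survives ANY two drive failures.
    return {"usable_tb": (drives - 2) * size_tb,
            "guaranteed_failures_survived": 2}

def raid10(drives, size_tb):
    # Mirrored pairs; a second failure in the SAME pair loses data,
    # so only one failure is guaranteed survivable.
    return {"usable_tb": (drives // 2) * size_tb,
            "guaranteed_failures_survived": 1}

for layout in (raid6, raid10):
    print(layout.__name__, layout(12, 2))
```

For a 12 x 2TB shelf, RAID6 gives more usable space and survives any double failure; RAID10 gives fewer TB and a weaker worst-case guarantee, but far better random IOPS - which is exactly the "pick your poison" above.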
Re: [Gluster-users] Possible to use gluster w/ email services + Tuning for fast replication
> Right now, I am testing out a 2 node setup, with one server
> replicating data to another node. One thing I noticed was that when I
> created a file or directory on the server, the new data did not
> replicate to the other node. The only time data is synced from the
> server to the other node is when I run "gluster volume rebalance test
> start". Is this normal? I had envisioned gluster would constantly
> replicate changes from the server to the other nodes - am I off base?

Are you examining the second node directly, i.e. not by mounting it? I think the point is that replication only happens when you access the file through the mount point - observing the file is what triggers the sync to the second node.

GlusterFS is targeted at HPC applications, where the nodes are typically all connected over high-performance interconnects. Performance appears to degrade very quickly as the latency between nodes increases, so whether the solution works for you is largely going to be determined by the latency between the nodes on your network.

I'm not actually sure what some representative numbers should be. I have two machines hooked up using bonded-rr Intel gigabit cards (crossover to each other) and these ping at around 0.3 ms. However, I have one other machine on a gigabit connection, hooked up to a switch, that sometimes drops to around 0.15 ms... I believe InfiniBand will drop that latency to a few tens of microseconds.

So basically every file access on my system would suffer a 0.3 ms access latency. This is better than a spinning disk with no cache, which comes in more like 3-10 ms, but obviously it's still not brilliant.

Please let us know how you get on?

Good luck

Ed W
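[Editor's note: since self-heal in replicated Gluster of this era is triggered by access through the client mount, a commonly suggested fix is to walk the mount and stat() everything. A minimal sketch - the mount path is an example, adjust for your setup.]

```python
# Force replicated files to be checked/healed by looking each one up
# through the GlusterFS client mount (NOT on the bricks directly).
import os

def trigger_self_heal(mount_point):
    """stat() every entry under the mount; returns entries checked."""
    checked = 0
    for root, dirs, files in os.walk(mount_point):
        for name in dirs + files:
            try:
                os.stat(os.path.join(root, name))  # lookup => heal check
                checked += 1
            except OSError:
                pass  # entry vanished mid-walk; ignore
    return checked

if __name__ == "__main__":
    # Example mount point - an assumption, substitute your own.
    print(trigger_self_heal("/mnt/glusterfs"), "entries checked")
```

The equivalent one-liner with `find ... | xargs stat` over the mount achieves the same thing; the key point is going through the mount, never the brick directories.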
Re: [Gluster-users] Resync failure
On 28/09/2010 08:11, Marcus Bointon wrote:
> On 28 Sep 2010, at 06:30, Craig Carl wrote:
>> The extended attributes on the files will be different between the
>> two servers. Depending on the version of rsync you are running, it
>> may be reporting differences because of the attributes. Can you run
>> md5sum on both servers against a couple of the files rsync is
>> telling you are out of sync? If there isn't a difference in the
>> md5sum values you are good to go. Otherwise please let us know.
> When I say they're out of sync I mean that there are files on one but
> not the other (both ways around, so both additions and deletions have
> not happened at some point) - I'm using cluster/replicate.

Hi Marcus,

Can you confirm that you got into this situation by fiddling with the files outside of Gluster? I hate reading reports like this on the list because it worries me that stuff can get out of sync, but in at least the majority of cases the reason for the lack of sync appears to be some variation of talking to the underlying volume directly rather than through the Gluster mount point. Can you confirm your problem was traced to this?

Cheers

Ed W
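[Editor's note: Craig's md5sum check can be scripted. A sketch - run it against the same relative path on each brick; rsync may flag files as different purely because Gluster stores replication metadata in xattrs, so a content hash settles it.]

```python
# Compare file CONTENT (ignoring extended attributes) across bricks.
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_content(path_a, path_b):
    return md5_of(path_a) == md5_of(path_b)
```

If `same_content()` is true for the files rsync complains about, the mismatch is only in the xattrs and the data itself is fine.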
[Gluster-users] Async Replication
On 11/10/2010 19:36, Aaron Porter wrote:
> On Fri, Oct 8, 2010 at 10:03 AM, Christopher J Bidwell
> cbidw...@usgs.gov wrote:
>> Can gluster be used as a WAN-based distributed filesystem? I've got
>> four servers spread around the country to provide geographic
>> redundancy and am looking for a good system with which I can
>> maintain continuity between the servers. Replication, etc. Currently
>> using rsync, which is just terrible and has a high overhead, as I've
>> got large directories that need continuous updates.
> We've got some small scale testing going (and working) West Coast
> (US) - East Coast (US). Bandwidth isn't so much of a problem as
> latency. Gluster doesn't have an async mode, so you have to wait for
> your operations to complete on all nodes -- that can take a while.
> Our current setup backs a couple of Samba shares; users seem happy.

I think the paying customers for Gluster are HPC compute clusters, where as a class you have inter-server speeds at a significant fraction of memory speed. As a result the current focus has been on improving throughput for an environment which has very low latency access to all the nodes.

I think it's clear that the solution would be some kind of smart distributed lock manager which can push the locks out to the server closest to the client, but that's clearly not a trivial step up from the current code base and will likely require someone to pay for it. I think the Gluster developers might be on the cusp of being receptive to such a feature request, but I sense that at this stage it's likely to need to be accompanied by some financial commitment... If there are other businesses with this requirement then now is probably a good time to show your hand.

That said, I would expect to face some reasonable costs if we wanted to pay some Gluster devs for time on this. I guess it's possible they would do it at a reduced rate rather than straight time and materials, but first let's see if anyone else pipes up before we ask for prices...
GFS has such a lock manager, and I would have thought that in the first instance the right answer is to investigate whether integration with it would make sense / solve the underlying problem.

Anyway, I guess the point is that just because you and I find Gluster useful on commodity hardware doesn't mean we are actually the current development target market - just lucky users!

Cheers

Ed W
Re: [Gluster-users] Hardware advice?
> Now we have to see what kind of price we can get here in Sweden, as I
> guess there will be a hefty shipping cost if ordering things from the
> US, and the PSU will not be the right one for us.

PSUs are almost exclusively 110/220/240/250V (i.e. anything) these days. Check with the supplier, but I doubt it's an issue - it will be the same part wherever you buy from. Warranty is a slight issue, but far less than you might imagine (you are always at the mercy of the muppets who run the shop you buy from, wherever you buy...)

Shipping worldwide is pretty inexpensive these days. I regularly ship 30 kg parcels from the UK to places such as the US or Singapore. Prices around the £80-£140 mark are normal for, say, FedEx on a 1-2 day express shipment. If you choose a slower carrier you can likely pull that down a lot further. If you do your own freight forwarding then things will be even cheaper (but it's a pain in the arse clearing customs yourself, etc - up to you how much you want to economise).

My feeling is that where you shop is not a problem. That said, I have no idea who these US folks are, so my point is as much that you should buy from my UK guy as that you should buy from the US... Someone helpful to help spec the kit is very valuable though.

So far I'm really impressed with Supermicro (and Intel, actually) prices. I'm buying 2x machines with quad-core L3426 (low power), 16GB RAM, 6TB of RAID drives and quad gigabit, in a 1U chassis, and they are coming in at a little over £1,400 each. Someone is bound to tell me I'm being robbed, but that seems very good to me (and I could have saved some cash if I'd gone for a lower-spec mainboard or chassis...)

Good luck

Ed W
Re: [Gluster-users] Hardware advice?
On 27/09/2010 09:09, Janne Aho wrote:
> On 25/09/10 00:36, Jason Alinen wrote:
>> Can we set up a call for Monday with our sales engineers? If so, 2pm
>> PST is available.
> Thanks for the offer, but I think the shipping cost will be a
> disadvantage (I don't think you have free shipping to Sweden), as
> will the 9h time difference, but overall I guess you would give
> better service than your Swedish counterparts.

Often your local distributor for Supermicro will need to buy in the equipment from the US for you anyway (I just ordered some from a UK chap and the delivery time is 2 weeks since it's not in stock; I tried a few other places and they all quote 5-7 days, which means shipping from the US on demand). Hence you are probably no worse off buying from a US supplier if the other aspects work for you...

Good luck

Ed W
Re: [Gluster-users] Configuration suggestions (aka poor/slow performance on new hardware)
On 31/03/2010 06:14, Tom Lanyon wrote:
> On 31/03/2010, at 2:36 PM, Raghavendra G wrote:
>> The current design of write-behind acknowledges writes (to
>> applications) even when they've not hit the disk. Can you please
>> explain how this design is different (if it is different) from the
>> idea you've explained above?
> Is the gluster method of write-behind acknowledging the writes before
> they've left the client? The method Ed was describing is that the
> write is acknowledged only once it has reached the server (and a
> defined number of replication targets), even though it hasn't been
> written to disk on the server yet. This is a hybrid approach which
> safeguards against client power failure before the (already
> acknowledged) write gets pushed to any servers, but improves
> performance over end-to-end write-through as it does not wait for the
> write acknowledgement from the physical disk(s).

Agreed. So assume one client talking over the network to 100 server replicas (absurd, but useful for clarification). Our safety levels are:

1) ACK sent as soon as the app hands data to the client OS, before it has even left the client machine. Complete data loss is possible if the client is unplugged/dies at that instant. (weak / fast)

2) ACK sent only once data has been sent to all 100 replicas AND written to disk. Data loss is only possible if all replicas are lost. (strong / slowest)

3) ACK sent once X server machines have received the request (into RAM). Data loss is possible only if all of those server machines are lost before any writes the request to disk. A good compromise of speed vs reliability guarantees.

In the simplest situation of a single server, we have roughly achieved the effect of moving the writeback cache to the server side. In the case of multiple servers with exactly equal latency to the client, we have roughly achieved the same as moving the writeback cache to the server side on all servers.
In the case of non-equal latency between client and servers, or with server-side replication, or with very busy servers, we gain a performance improvement due to the lower latency before the ACK is sent to the client.

I thought this was a very clever technique, and actually very compatible with the Gluster philosophy (independent bricks).

Ed W
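[Editor's note: safety level 3 above - ACK once X replicas hold the write in RAM - sketched as a toy model. This is not Gluster code; the replica class and threshold are illustrative.]

```python
# Toy model of "ack once X replicas have buffered the write":
# return to the client as soon as ack_threshold replicas confirm
# receipt in memory; the slow fsync to disk happens later.
from concurrent.futures import ThreadPoolExecutor, as_completed

def replicated_write(replicas, data, ack_threshold):
    """Return True once ack_threshold replicas have buffered data."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(replicas)))
    futures = [pool.submit(r.buffer_write, data) for r in replicas]
    acks = 0
    for future in as_completed(futures):
        if future.result():
            acks += 1
        if acks >= ack_threshold:
            pool.shutdown(wait=False)  # ack now; stragglers finish later
            return True
    pool.shutdown(wait=False)
    return False  # fewer than ack_threshold replicas reachable

class FakeReplica:
    """Stand-in replica that just buffers writes in memory."""
    def __init__(self):
        self.buffered = []
    def buffer_write(self, data):
        self.buffered.append(data)
        return True
```

With a single slow or distant replica, the client's ACK latency is set by the X fastest replicas rather than the slowest one - which is exactly where the performance win over level 2 comes from.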
Re: [Gluster-users] Configuration suggestions (aka poor/slow performance on new hardware)
On 26/03/2010 18:22, Ramiro Magallanes wrote:
> You could run the genfiles script simultaneously (my English is
> really poor - we can change the subject of this mail to something
> like "poor performance and poor English" xDDD) but it's not a
> threaded application (iozone rules). If I run 3 processes of
> genfiles.sh I get 440, 441, and 450 files (1300 files approx.), but
> if you add some more processes you're not going to obtain any bigger
> number :) With 6 genfiles at the same time I have:
> PID 12832 : 249 files created in 60 seconds.
> PID 12830 : 249 files created in 60 seconds.
> PID 12829 : 248 files created in 60 seconds.
> PID 12827 : 262 files created in 60 seconds.
> PID 12828 : 252 files created in 60 seconds.
> PID 12831 : 255 files created in 60 seconds.
> 1515 files.

Speaking purely theoretically: I believe that without a write-behind cache on the client side, Gluster is required to effectively sync after each file operation (well, it's probably only half a sync, but some variation of this problem). This is safe, but of course it decreases write speed to something which is a function of the network latency. So in your case, if you had say around 1 ms of latency, you would be limited to around 1,000 operations per second simply from waiting for the far side to ACK each operation. This seems to correlate with the figures you are seeing (show your ping time and correlate it with IOs per sec?).

I don't see this as a Gluster issue - it's a fundamental limitation of wanting an ACK for network-based operations. Many people switch to Fibre Channel or similar for their IO for exactly this reason: if you can drop the latency by a factor of 10 then you increase your IOs by a factor of 10.

Untested, but at least theoretically, switching on writeback caching on the client should mean that it ploughs on without waiting for the network round trip to give you your ACK. There are lots of potential issues, but if that is acceptable for your requirements then give it a try?
Note - just an idea for the Gluster guys - I think I saw in AFS (or was it something else?) a kind of hybrid server-side writeback cache. The idea was that the server could ACK the write once at least a certain number of storage nodes had the pending IO in memory, even if it hadn't hit the disk yet. This is subtly different to server-side writeback, but seems like a very neat idea.

Note it's probably not relevant to small-file creation tests like the above, but for other situations I do think some of the benchmarks here might not really be addressing network latency as the limiting bottleneck?

Good luck

Ed W
Re: [Gluster-users] GlusterFS 3.0.2 small file read performance benchmark
Well, oplocks are an SMB concept, but the basic idea of opportunistic locking is independent of the filesystem. For example, oplocks now appear in the NFSv4 standard under the name "delegations" (I would assume some variation of oplocks also exists in GFS and OCFS, but I'm not familiar with them).

The basic concept would potentially provide a huge performance boost for GlusterFS because it allows cache-coherent writeback caching. In fact, let's cut to the chase - what we desire is cache-coherent writeback caching: reads on one server can be served from the local client cache, but if the file is changed elsewhere then our cache here is instantly invalidated; likewise we can write at will to a local copy of the file and allow it to get out of sync with the other servers, but as soon as some other client tries to read/write our file we must be notified and flush our cache (and request alternative locks, or fall back to sync reads/writes).

How do we do this? Well, in NFSv3 and before - and, I believe, in GlusterFS - the only thing implemented is a "cache and hope" option, which caches data for a second or so and hopes the file doesn't change under us. The improved algorithm is opportunistic locking, where the client indicates to the server its desire to work with some data locally and let it get out of sync with the server. The server then tracks that reservation, and if some other client wants to access the data it pushes a "lock break" to the original client, informing it that it needs to fsync and run without the oplock.

I believe such an oplock service could be implemented via a new translator which works in conjunction with the read and writeback caching. Effectively it would be a two-way lock manager, but its job is somewhat simpler in that all it needs to do is vary the existing caches on a per-file basis.
So, for example, at present if we read some attributes for some files they are blindly cached for X ms and then dropped; our oplock translator would instead allow the attributes to be cached indefinitely, until we get a push notification from the server side that our cache must be invalidated. The same goes for writes - we can use the writeback cache as long as no one else has tried to read or write our file, but as soon as someone else touches it we need to fsync and run without cache.

I have had a very quick glance at the current locks module and it's quite a bit more complex than I might have guessed... I had wondered whether it might be possible to make the locks module talk to the cache module and add server-side lock breaking through that module? Essentially it's the addition of push lock breaking which helps: if we are reading away and some other client modifies a file, we need a feedback loop to invalidate our read cache.

Perhaps this is all implemented in GlusterFS already and I'm just missing the point...

Cheers

Ed W

On 02/03/2010 18:52, Tejas N. Bhise wrote:
> Ed, oplocks are implemented by SAMBA and would not be a part of
> GlusterFS per se until we implement a native SAMBA translator
> (something that would replace the SAMBA server itself with a thin
> SAMBA-like layer on top of GlusterFS). We are doing that for NFS by
> building an NFS translator. At some point it would be interesting to
> explore clustered SAMBA using ctdb, where two GlusterFS clients can
> export the same volume. ctdb itself seems to be coming up well now.
> Regards, Tejas.
> ----- Original Message -----
> From: Ed W <li...@wildgooses.com>
> To: Gluster Users <gluster-users@gluster.org>
> Sent: Wednesday, March 3, 2010 12:10:47 AM GMT +05:30 Chennai,
> Kolkata, Mumbai, New Delhi
> Subject: Re: [Gluster-users] GlusterFS 3.0.2 small file read
> performance benchmark
>
> On 01/03/2010 20:44, Ed W wrote:
>> I believe samba (and probably others) use a two-way lock escalation
>> facility to mitigate a similar problem. So you can "read-lock" - or,
>> phrased differently, express your interest in caching some
>> files/metadata - and then if someone changes what you are watching,
>> the lock break is pushed to you to invalidate your cache.
> Seems NFSv4 implements something similar via delegations (not
> believed implemented in Linux NFSv4 though...). In samba the
> equivalents are called oplocks. I guess this would be a great project
> for someone interested to work on - an oplock translator for gluster.
>
> Ed W
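[Editor's note: the push-based lock-break flow described in this thread, sketched as a toy illustration. This is not Gluster's translator API - the class and method names are invented for the example.]

```python
# Toy oplock manager: a client registers its interest in caching a
# file; when another client opens the same file, the server pushes a
# "break" so the first client flushes and drops its cached copy.

class OplockServer:
    def __init__(self):
        self.holders = {}  # path -> client currently holding an oplock

    def open_cached(self, client, path):
        """Try to grant client a cacheable open on path."""
        holder = self.holders.get(path)
        if holder is not None and holder is not client:
            holder.on_break(path)      # push notification: flush + drop
            del self.holders[path]
            return False               # caller must run uncached for now
        self.holders[path] = client
        return True                    # safe to cache locally

class Client:
    def __init__(self, name):
        self.name = name
        self.cache = {}
    def on_break(self, path):
        self.cache.pop(path, None)     # invalidate; real code would fsync
```

The essential difference from "cache and hope" is visible here: invalidation is driven by a server push at the moment of conflict, not by a fixed cache-timeout on the client.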
[Gluster-users] Issue with replication of open files server reboot
Hi,

Is there an open bug report that I can follow for development on the issue reported here:

http://gluster.com/community/documentation/index.php/Understanding_AFR_Translator#File_re-opening_after_a_server_comes_back_up

For my use case it seems rather worrying that if one server goes down then potentially all files open at that point are now corrupted? As I understand the issue, the files will never be corrected or self-healed - is this correct?

Thanks

Ed W
Re: [Gluster-users] GlusterFS 3.0.2 small file read performance benchmark
On 27/02/2010 18:56, John Feuerstein wrote:
> It would be really great if all of this could be cached within
> io-cache, only falling back to a namespace query (and probably
> locking) if something wants to write to the file, or if the result
> has been in the cache longer than cache-timeout seconds. So even if
> the file has been renamed, unlinked, or has changed permissions /
> metadata - simply take the version in the io-cache until it's
> invalidated. At least that is what I would expect the io-cache to do.
> This will introduce a discrepancy between the cached file version and
> the real version in the global namespace, but isn't that what one
> would expect from caching...?

I believe samba (and probably others) uses a two-way lock escalation facility to mitigate a similar problem. So you can "read-lock" - or, phrased differently, express your interest in caching some files/metadata - and then if someone changes what you are watching, the lock break is pushed to you to invalidate your cache. It seems like something similar would be a candidate for implementation in the Gluster native clients?

You still have performance issues with random reads, because when you open some file you still need to check it's not open/locked/in need of replication from some other brick. However, what you can do is have proactive caching with active notification of any cache invalidation. This benefits the situation where you re-read stuff you have already read, and/or where an effective read-ahead is grabbing stuff for you.

Interesting problem.

Ed W
Re: [Gluster-users] Issue with replication of open files server reboot
On 01/03/2010 20:19, Vikas Gorur wrote:
> On Mar 1, 2010, at 11:59 AM, Ed W wrote:
>> Hi, is there an open bug report that I can follow for development on
>> the issue reported here:
>> http://gluster.com/community/documentation/index.php/Understanding_AFR_Translator#File_re-opening_after_a_server_comes_back_up
> This issue has been fixed in the 3.x releases.

Aha! Super. Many thanks. I have updated the docs (link above) to state this.

Is anything else on that page resolved in 3.x? E.g. self-heal of files not on the first subvolume, or self-heal of hardlinked files?

Thanks

Ed W