Re: [Gluster-users] which components needs ssh keys?
> Have we deprecated SSL/TLS for the local I/O and management paths? The
> code's still there, and I think I've even seen patches to it recently.

Never mind. Saw that you were talking about ssh, not ssl. One more reason we should stop saying "ssl" and call it "tls", I guess. ;)

___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] which components needs ssh keys?
On Wed, Jan 3, 2018, at 3:23 AM, Aravinda wrote:
> Only Geo-replication uses SSH since it is between two Clusters. All
> other features are limited to a single Cluster/Volume, so communication
> happens via Glusterd (port tcp/24007) and the brick ports (tcp/47152-47251).

Have we deprecated SSL/TLS for the local I/O and management paths? The code's still there, and I think I've even seen patches to it recently.
Re: [Gluster-users] ZFS with SSD ZIL vs XFS
On Tue, Oct 10, 2017, at 11:19 AM, Gandalf Corvotempesta wrote:
> Anyone made some performance comparison between XFS and ZFS with ZIL
> on SSD, in a gluster environment?
>
> I've tried to compare both on another SDS (LizardFS) and I haven't
> seen any tangible performance improvement.
>
> Is gluster different?

Probably not. If there is a difference, it would probably favor XFS.

The developers at Red Hat use XFS almost exclusively. We at Facebook have a mix, but XFS is (I think) the most common. Whatever the developers use tends to become "the way local filesystems work" and code is written based on that profile, so even without intention that tends to get a bit of a boost. To the extent that ZFS makes different tradeoffs - e.g. using lots more memory, very different disk access patterns - it's probably going to have a bit more of an "impedance mismatch" with the choices Gluster itself has made.

If you're interested in ways to benefit from a disk+SSD combo under XFS, it is possible to configure XFS with a separate journal device, but I believe some bugs were encountered when doing that. Richard Wareing's upcoming Dev Summit talk on Hybrid XFS might cover those, in addition to his own work on using an SSD in even more interesting ways.
Re: [Gluster-users] Gluster operations speed limit
On Tue, Aug 1, 2017, at 06:16 AM, Alexey Zakurin wrote:
> I have a large distributed-replicated Glusterfs volume that contains
> a few hundred VM images. Between servers there is a 20Gb/sec link.
> When I start some operations like healing or removing, storage
> performance becomes too low for a few days and server load becomes
> like this:
>
> 13:06:32 up 13 days, 20:02, 3 users, load average: 43.62, 31.75, 23.53
>
> Is it possible to set a limit on these operations? Actually, the VMs on
> my cluster go offline when I start healing, rebalance or removing a
> brick.

In addition to the cgroups workaround that Mohit mentions, there are two longer-term efforts in progress (that I'm aware of) to address this and similar issues.

(1) Some folks at Red Hat are working on limiting the number of files that SHD will heal at one time (https://github.com/gluster/glusterfs/issues/255).

(2) At Facebook, we're working on a more general solution to apportion I/O among any users of a system, where "users" might be real users or internal pseudo-users such as self-heal or rebalance (https://github.com/gluster/glusterfs/issues/266).

Either or both of these might land in 4.0; we're still planning that release, so no definite answer yet.
Re: [Gluster-users] Add single server
On Mon, May 1, 2017, at 02:34 PM, Gandalf Corvotempesta wrote:
> I'm still thinking that saving (I don't know where, I don't know how)
> a mapping between files and bricks would solve many issues and add
> much more flexibility.

Every system we've discussed has a map. The differences are only in the granularity, and in how the map is stored. Per-file maps inevitably become a scaling problem, so a deterministic function is used to map individual files into a much smaller number of buckets, placement groups, hash ranges, or whatever. Then information about those buckets and their locations is stored somehow:

* Centrally - Lustre, HDFS, Moose/Lizard
* Distributed among a few servers - Ceph, possibly Gluster with DHT2
* Distributed among all servers - Gluster today

No matter which approach you use, you can manipulate the maps. Without changing the fundamental structure of Gluster, you could take a brick's hash range and split it in two to create two bricks. Then you could quietly migrate the files in one brick to anywhere else in the background. That doesn't quite work today because the two bricks would be trying to operate on the same directories, seeing each other's files, etc. Making it more transparent won't be easy, but the changes would be pretty well localized to DHT.

Brick multiplexing can help too, because it allows a volume to be created with many more bricks initially, so they'd already be in separate directories and ready to move. Multiple bricks living in one process also makes coordination during such transitions much easier.

This has been part of my plan for years, not only to support adding a single server but also to support more sophisticated forms of tiering, quality of service, etc. The big question as I see it is what we can do *in the near term* to make N+1 addition easier on *existing* clusters. That probably deserves a separate answer, so I'll leave it for another time.
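The range-splitting idea can be sketched in a few lines. This is illustrative Python, not Gluster's DHT code; it only assumes DHT's 32-bit hash space with inclusive per-brick ranges.

```python
# A sketch (not Gluster code) of splitting one brick's DHT hash range in
# two, so that one half can later be migrated to a new brick in the
# background. Ranges are inclusive over a 32-bit hash space.

HASH_MAX = 2**32 - 1

def split_range(start, end):
    """Split an inclusive hash range into two contiguous halves."""
    mid = start + (end - start) // 2
    return (start, mid), (mid + 1, end)

# Suppose one brick owns the top half of the hash space.
old = (0x80000000, HASH_MAX)
left, right = split_range(*old)
assert left[1] + 1 == right[0]       # no gap, no overlap
assert (left[0], right[1]) == old    # together they cover the original
```

The second brick could then take ownership of `right` while file migration proceeds quietly, which is exactly the part that doesn't quite work today.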
Re: [Gluster-users] What is the CLI NUFA option "local-volume-name" good for?
On Fri, Apr 28, 2017, at 10:57 AM, Jan Wrona wrote:
> I've been struggling with NUFA for a while now and I know very well what
> the "option local-volume-name brick" in the volfile does. In fact, I've
> been using a filter to force gluster to use the local subvolume I want
> instead of the first local subvolume it finds, but filters are very
> unreliable. Recently I found this bug [1] and thought that I'd finally
> be able to set NUFA's "local-volume-name" option *per-server* through
> the CLI without the use of the filter, but no. This option sets the
> value globally, so I'm asking: what is the use of a LOCAL volume name
> set GLOBALLY with the same value on every server?

You're right that it would be a bit silly to set this using "gluster volume set", but it makes much more sense as a command-line override using "--xlator-option" instead. Then it could in fact be different on every client, even though it's not even set in the volfile.
Re: [Gluster-users] Does glusterfs supports brick locality?
On Mon, Apr 24, 2017, at 10:14 AM, atris adam wrote:
> I have two data centers in two different provinces. Each data center
> has 3 servers. I want to set up cloud storage with glusterfs. I want
> to make one glusterfs volume with this layout:
>
> province "a" ==> 3 servers, each server has one 5TB brick (brick
> numbers 1-3)
> province "b" ==> 3 servers, each server has one 5TB brick (brick
> numbers 4-6)
>
> distributed gluster volume size: 30TB
>
> Does glusterfs support servers over long distances?
> If yes, when an end user in province "a" creates a file, where is the
> file allocated? I mean, which brick is selected to write the file to?
> I need one of bricks 1-3 to be selected, because I think if glusterfs
> selects other bricks (4-6, which are in province "b"), higher latency
> and unnecessary network traffic will result. Am I right?

The "nufa" translator/option exists for exactly this purpose, but it's a bit more limited than what I think you want. When creating a file, it will create it on a brick *on the same node* if possible, but it doesn't distinguish between other nodes near or far away. This might be good enough if you're using NFS as the access protocol, because in that case the "same node" is the one running the NFS/Gluster proxy.
Re: [Gluster-users] Remove an artificial limitation of disperse volume
----- Original Message -----
> Okay so the 4 nodes thing is a kind of exception? What about 8 nodes
> with redundancy 4?
>
> I made a table to recap possible configurations, can you take a quick
> look and tell me if it's OK?
>
> Here: https://gist.github.com/olivierlambert/8d530ac11b10dd8aac95749681f19d2c

As I understand it, the "power of two" thing is only about maximum efficiency, and other values can work without wasting space (they'll just be a bit slower). So, for example, with 12 disks you would be able to do 10+2 and get 83% space efficiency.

Xavier's the expert, though, so it's probably best to let him clarify.
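The space-efficiency arithmetic above is just the ratio of data bricks to total bricks; a tiny illustrative calculation (not Gluster code):

```python
# Space efficiency of a disperse (erasure-coded) volume is simply
# data_bricks / total_bricks. Values match the 10+2 example above.

def ec_efficiency(data, redundancy):
    return data / (data + redundancy)

print(round(ec_efficiency(10, 2) * 100))   # 10+2 on 12 disks -> 83 (%)
print(round(ec_efficiency(8, 4) * 100))    # 8+4 on 12 disks  -> 67 (%)
```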
Re: [Gluster-users] Remove an artificial limitation of disperse volume
> So far, I can't create a disperse volume if the redundancy level is
> 50% or more of the number of bricks. I know that perf would be better
> in dist/rep, but what if I prefer to have disperse anyway?
>
> Conclusion: would it be possible to have a "force" flag during
> disperse volume creation even if redundancy is 50% or higher?

The problem is that the math behind erasure coding doesn't work for all fragment counts and redundancy levels. To get two-failure protection you need more than four bricks. If you had multiple disks in each server you could get protection against multiple disk failures, but you still wouldn't have protection against multiple server failures. The only thing your "force" flag could do is allow placement of multiple fragments on a single physical disk, but then you wouldn't even have protection against two disk failures.

If you want higher levels of protection you need more disks, either to satisfy the mathematical requirements of EC or to overcome the space inefficiency of replication.
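The constraint being discussed can be stated as a one-line check: redundancy must be strictly less than half the brick count, so that a majority of fragments (the data's worth) always survives. A minimal illustrative sketch, not Gluster's actual validation code:

```python
# Disperse-volume validity check (illustrative): with `bricks` total
# fragments and `redundancy` parity fragments, any `bricks - redundancy`
# fragments must be enough to rebuild the data, so redundancy has to
# stay below 50% of the brick count.

def disperse_config_ok(bricks, redundancy):
    return redundancy >= 1 and 2 * redundancy < bricks

print(disperse_config_ok(4, 2))   # False: 50% redundancy is rejected
print(disperse_config_ok(6, 2))   # True: 4+2 survives any 2 failures
print(disperse_config_ok(11, 3))  # True: the 8+3 layout discussed later
```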
Re: [Gluster-users] rebalance and volume commit hash
> I don't understand why a new commit hash is generated for the volume
> during the rebalance process. I think it should be generated only
> during add/remove brick events, not during rebalance.

The mismatch only becomes important during rebalance. Prior to that, even if we've added or removed a brick, the layouts haven't changed and the optimization is still as valid as it was before. If there are multiple add/remove operations, we don't need or want to change the hash between them.

Conversely, there are cases besides add/remove brick where we might want to do a rebalance - e.g. after replace-brick with a brick of a different size, or to change between total-space and free-space weighting. Changing the hash on add/remove brick doesn't handle these cases, but changing it at the start of rebalance does.
Re: [Gluster-users] rebalance and volume commit hash
> Can you tell me please why every volume rebalance generates a new value
> for the volume commit hash?
>
> If I have a fully rebalanced cluster (or almost) with millions of
> directories, then rebalance has to change the DHT xattr on every
> directory only because there is a new volume commit hash value. It is
> pointless in my opinion. Is there any reason behind this? As I observed,
> the volume commit hash is set at the beginning of rebalance, which
> totally destroys the benefit of the lookup optimization algorithm for
> directories not yet scanned/fixed by this rebalance run.

It disables the optimization because the optimization would no longer lead to correct results. There are plenty of distributed filesystems that seem to have "fast but wrong" as a primary design goal; we're not one of them.

The best way to think of the volume-commit-hash update is as a kind of cache invalidation. Lookup optimization is only valid as long as we know that the actual distribution of files within a directory is consistent with the current volume topology. That ceases to be the case as soon as we add or remove a brick, leaving us with three choices.

(1) Don't do lookup optimization at all. *Every* time we fail to find a file on the brick where hashing says it should be, look *everywhere* else. That's how things used to work, and still work if lookup optimization is disabled. The drawback is that every add/remove brick operation causes a permanent and irreversible degradation of lookup performance. Even on a freshly created volume, lookups for files that don't exist anywhere will cause every brick to be queried.

(2) Mark every directory as "unoptimized" at the very beginning of rebalance. Besides being almost as slow as fix-layout itself, this would require blocking all lookups and other directory operations *anywhere in the volume* while it completes.

(3) Change the volume commit hash, effectively marking every directory as unoptimized without actually having to touch every one.
The root-directory operation is cheap and almost instantaneous. Checking each directory's commit hash isn't free, but it's still a lot better than (1) above. With upcalls we can enhance this even further.

Now that you know a bit more about the tradeoffs, do "pointless" and "destroys the benefit" still seem accurate?
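The cache-invalidation idea can be sketched in a few lines of illustrative Python. This is a toy model of the scheme described above, not DHT's real implementation: a lookup may skip the "search every brick" fallback only when the directory's stored commit hash matches the volume's current one.

```python
# Toy model of commit-hash-based invalidation: bumping one volume-wide
# counter implicitly invalidates every directory's layout without
# touching millions of directory xattrs.

class Volume:
    def __init__(self):
        self.commit_hash = 1
        self.dir_hash = {}        # directory -> hash at last fix-layout

    def fix_layout(self, directory):
        self.dir_hash[directory] = self.commit_hash

    def rebalance_start(self):
        self.commit_hash += 1     # one cheap volume-level update

    def lookup_optimized(self, directory):
        # Safe to trust the layout only if it matches the current hash
        return self.dir_hash.get(directory) == self.commit_hash

v = Volume()
v.fix_layout("/a")
print(v.lookup_optimized("/a"))   # True: layout is current
v.rebalance_start()
print(v.lookup_optimized("/a"))   # False: must search all bricks again
v.fix_layout("/a")
print(v.lookup_optimized("/a"))   # True again once this dir is re-fixed
```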
Re: [Gluster-users] Cheers and some thoughts
> Both ceph and lizard manage this automatically.
> If you want, you can add a single disk to a working cluster and
> automatically the whole cluster is rebalanced transparently with no
> user intervention.

This relates to the granularity problem I mentioned earlier. As long as we're not splitting bricks into smaller units, our flexibility to do things like add a single disk is very limited and the performance impact of rebalance is large. Automatically triggering rebalance would avoid a manual step, but it would just make the pain immediate instead of prolonged. ;)

When we start splitting bricks into tiles or bricklets or whatever we want to call them, a lot of what you talk about will become more feasible.
Re: [Gluster-users] Cheers and some thoughts
> Gluster (3.8.7) coped perfectly - no data loss, no maintenance
> required; each time it came up by itself with no hand-holding and
> started healing nodes, which completed very quickly. VMs on gluster
> auto-started with no problems, and I/O load while healing was ok. I
> felt quite confident in it.

Glad to hear that part went well.

> The alternate cluster fs - not so good. Many times running VMs were
> corrupted, and several times I lost the entire filesystem. Also IOPS
> were atrocious (fuse based). It's easy to claim HA when you exclude
> such things as power supply failures, dodgy network switches etc.

Too true. Unfortunately, I think just about every distributed storage system has to go through this learning curve, from not handling failure at all, to handling the simplest/easiest cases, to handling the weird stuff that real deployments can throw at you. It's not just about the actual failure handling, either. Sometimes it's about things you do in the main I/O path, such as not throwing away O_SYNC flags to claim better performance. From the information you've provided, I'll bet that's where your data corruption came from.

> I think gluster's active/active quorum-based design, where every node
> is a master, is a winner; active/passive systems where you have a SPOF
> master are difficult to DR manage.

Active/passive designs create a very tough set of tradeoffs. Detecting and responding to failures quickly enough, while also avoiding false alarms, is like balancing on a knife edge. Then there are problems with overload turning into failure, with failback, etc. It can all be done right and work well, but it's *really* hard. While I guess it's better than nothing, experience has shown that active/active designs are easier to make robust, and the techniques for doing so have been well known for at least a decade or so.
> However :) Things I'd really like to see in Gluster:
>
> - More flexible/easier management of servers and bricks (add/remove/replace)
>
> - More flexible replication rules
>
> One of the things I really *really* like with LizardFS is the powerful
> goal system and chunkservers. Nodes and disks can be trivially easily
> added/removed on the fly and chunks will be shuffled, replicated or
> deleted to balance the system. Individual objects can have different
> goals (replication levels) which can also be changed on the fly, and
> the system will rebalance them. Objects can even be changed between
> simple replication and erasure-coded objects.
>
> I doubt this could be fitted to the existing gluster, but is there
> potential for this sort of thing in Gluster 4.0? I read the design docs
> and they look ambitious.

There used to be an idea called "data classification" to cover this kind of case. You're right that setting arbitrary goals for arbitrary objects would be too difficult. However, we could have multiple pools with different replication/EC strategies, then use a translator like the one for tiering to control which objects go into which pools based on some kind of policy. To support that with a relatively small number of nodes/bricks we'd also need to be able to split bricks into smaller units, but that's not really all that hard.

Unfortunately, although many of these ideas have been around for at least a year and a half, nobody has ever been freed up to work on them. Maybe, with all of the interest in multi-tenancy to support containers and hyperconvergence and whatever else, we might finally be able to get these under way.
Re: [Gluster-users] Very slow writes through Samba mount to Gluster with crypt on
> Is there some known formula for getting performance out of this stack,
> or is Samba with Glusterfs with encryption-at-rest just not that
> workable a proposition for now?

I think it's very likely that the combination you describe is not workable. The crypt translator became an orphan years ago, when the author left behind a highly idiosyncratic blob of code and practically no tests. Nobody has tried to promote it since then, and "at your own risk" has been the answer for anyone who asks. If you found it in the source tree and decided to give it a try, I'm sorry. Even though it's based in large part on work I had done for HekaFS, I personally wouldn't trust it to store my data correctly, let alone securely.
Re: [Gluster-users] [Gluster-devel] Community Meetings - Feedback on new meeting format
> This has resulted in several good changes:
> a. Meetings are now livelier, with more people speaking up and
> making themselves heard.
> b. Each topic in the open floor gets a lot more time for discussion.
> c. Developers are sending out weekly updates on the work they are
> doing, and linking those mails in the meeting agenda.

I agree with these points. People seem much more engaged during the meeting, which is a good thing.

> Though the response and attendance at the initial 2 meetings was
> good, it dropped for the last 2. This week in particular didn't have a
> lot of updates added to the meeting agenda. It seems like interest has
> dropped already.
>
> We could probably do a better job of collecting updates to make it
> easier for people to add their updates, but the current format of
> adding updates to etherpad(/hackmd) is simple enough. I'd like to know
> if there is anything else preventing people from providing updates.

I'm one of the culprits here. As an observation, not an excuse, I'll point out that we were already missing lots of updates from people who didn't even show up to the meetings. Has the overall level of missed updates gone up or down? Has the level of attention paid to them? If people provide updates about as consistently, and those updates are at least as detailed (possibly more, because they're written and meant to be read asynchronously), then we might actually be *ahead* of where we were before.

The new format gets a big +1 from me.
Re: [Gluster-users] Automation of single server addition to replica
> And that's why I really prefer gluster, without any metadata or
> similar. But metadata servers aren't mandatory to achieve automatic
> rebalance. Gluster is already able to rebalance and move data around
> the cluster, and already has the tools to add a single server even in
> a replica 3.
>
> What I'm asking is to automate this feature. Gluster could be able to
> move bricks around without user intervention.

Some of us have thought long and hard about this. The root of the problem is that our I/O stack works on the basis of replicating bricks, not files. Changing that would be hard, but so is working with it. Most ideas (like Joe's) involve splitting larger bricks into smaller ones, so that the smaller units can be arranged into more flexible configurations.

So, for example, let's say you have bricks X through Z, each split in two. Define replica sets along the diagonal and place some files A through L.

                 Brick X    Brick Y    Brick Z
               +----------+----------+----------+
Subdirectory 1 | A B C D  | E F G H  | I J K L  |
               +----------+----------+----------+
Subdirectory 2 | I J K L  | A B C D  | E F G H  |
               +----------+----------+----------+

Now you want to add a fourth brick on a fourth machine. Each (divided) brick should now contain three files instead of four, so some will have to move. Here's one possibility, based on our algorithms to maximize overlaps between the old and new DHT hash ranges.

                 Brick X    Brick Y    Brick Z    Brick W
               +----------+----------+----------+----------+
Subdirectory 1 | A B C    | D E F    | J K L    | G H I    |
               +----------+----------+----------+----------+
Subdirectory 2 | G H I    | A B C    | D E F    | J K L    |
               +----------+----------+----------+----------+

Even trying to minimize data motion, a third of all the files have to be moved. This can be reduced still further by splitting the original bricks into even smaller parts, and that actually meshes quite well with the "virtual nodes" technique used by other systems that do similar hash-based distribution, but it gets so messy that I won't even try to draw the pictures.
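Counting the motion in the first subdirectory of the tables above makes the "a third of the files" figure concrete. Illustrative Python only; the letter layouts are taken straight from the example:

```python
# Count how many of the 12 files in subdirectory 1 change bricks when
# going from 3 bricks to 4 in the example above.

before = {"X": "ABCD", "Y": "EFGH", "Z": "IJKL"}
after  = {"X": "ABC",  "Y": "DEF",  "Z": "JKL", "W": "GHI"}

def location(layout, f):
    return next(b for b, files in layout.items() if f in files)

moved = [f for f in "ABCDEFGHIJKL"
         if location(before, f) != location(after, f)]
print(sorted(moved))        # ['D', 'G', 'H', 'I']
print(len(moved), "of 12")  # 4 of 12: one third of the files move
```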
The main point is that doing all this requires significant I/O, with significant impact on other activity on the system, so it's not necessarily true that we should just do it without user intervention.

Can we automate this process? Yes, and we should. This is already in scope for GlusterD 2. However, in addition to the obvious recalculation and rebalancing, it also means setting up the bricks differently even when a volume is first created, and making sure that we don't double-count available space on two bricks that are really on the same disks or LVs, and so on. Otherwise, the initial setup will seem simple but later side effects could lead to confusion.
Re: [Gluster-users] understanding dht value
> Thanks for pointing to the article. I have been following the article
> all the way. What intrigues me is the dht values associated with
> subdirectories.
>
> [root@glusterhackervm3 glus]# getfattr -n trusted.glusterfs.dht -e hex /brick2/vol
> getfattr: Removing leading '/' from absolute path names
> # file: brick2/vol
> trusted.glusterfs.dht=0x00017de2
>
> [root@glusterhackervm3 glus]# getfattr -n trusted.glusterfs.dht -e hex /brick2/vol/d/
> getfattr: Removing leading '/' from absolute path names
> # file: brick2/vol/d/
> trusted.glusterfs.dht=0x00017ffe
>
> Does it mean that only files whose DHT value ranges from 0x00 to 0x7ffe
> can be saved inside the 'd' directory? But then it provides a very
> narrow range of 0x7de2 to 0x7ffe to be created in that directory.

To know the distribution for a directory as seen by the user, you need to look at the xattrs for the matching directory on every brick. What the output above shows us is that, *for this brick*: in brick2/vol, this brick will store files whose hashes start at 7de2; in the subdirectory /brick2/vol/d, this brick will store files whose hashes run up through 7ffe.

Note that this is *completely independent* for each directory, and only affects placement of non-directories. There's no set intersection going on between the ranges for a directory and its parent(s).

The subdirectory looks like what I'd expect to see for a volume with two (possibly replicated) DHT subvolumes. The root looks a lot weirder - almost, but not quite, the top half of the hash distribution. I think this could happen if there had been multiple brick additions/removals and rebalances, but I'm not sure what that combination would have to be. If .../d had been created later, it would be unaffected by all of these prior actions and would still get an exactly-half share of the hash range.
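The per-directory placement rule can be sketched as follows. This is illustrative Python, not DHT code: `file_hash` is a stand-in (CRC32) for Gluster's real hash function, and the example layout is a hypothetical two-brick split, not the one from the getfattr output above.

```python
# Per-directory layouts, sketched: each brick holds an independent hash
# range per directory, and a file lands on whichever brick's range for
# that directory covers the hash of the file's name.

import zlib

def file_hash(name):
    # Stand-in 32-bit hash; NOT the hash function Gluster actually uses
    return zlib.crc32(name.encode()) & 0xFFFFFFFF

# Hypothetical layout for one directory: brick -> inclusive (start, end)
layout = {"brick1": (0x00000000, 0x7FFFFFFF),
          "brick2": (0x80000000, 0xFFFFFFFF)}

def brick_for(name, layout):
    h = file_hash(name)
    return next(b for b, (lo, hi) in layout.items() if lo <= h <= hi)

print(brick_for("file1", layout))   # deterministic for a given name
```

The key point the reply makes is that `layout` here is looked up per directory; a file's brick in `/d` has nothing to do with its parent's ranges.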
Re: [Gluster-users] How gluster parallelize reads
> Anyway, in gerrit you are talking about "local" reads. How could you
> have a "local" read? This would be possible only by mounting the volume
> locally on a server. Is this a supported configuration?

Whether or not it's supported for the native protocol, it's a common case when using NFS or SMB, with the servers for those protocols appearing as native-protocol clients on the server machines.

> Probably, a "priority" could be added as a mount option, so that when
> mounting the gluster volume I can set the preferred host for reads.
>
> Something like this:
>
> mount -t glusterfs -o preferred-read-host=1.2.3.4 server1:/test-volume /mnt/glusterfs

It's a great idea that would work well for a volume containing a single replica set, but what about when the volume contains several? Specify a preferred read source for each? Even that will get tricky when we start to work around the limitation of adding bricks in multiples of the replica count. Then we'll be building new replica sets "automatically", so the user would have to keep re-examining the volume structure to decide on a new priority list.

Also, what should we do if that priority list is "pathological" in the sense of creating unnecessary hot spots? Should we accept it as an expression of the user's will anyway, or override it to ensure continued smooth operation? IMO we should try harder to find the right answers *autonomously*, perhaps based on user-specified relationships between client networks and servers. (Ceph does some of this in their CRUSH maps, but I think that conflates the separate problems of managing placement and traffic.) To look at it another way, we'd be doing the same calculations the user might do to create that explicit priority list, except we'd be in a better position to *re*calculate that list when appropriate.
We're thinking about some of this in the context of handling multiple networks better in general, but it's still a bit of a research effort because AFAICT nobody else has come up with much empirically-backed research to guide solutions.
Re: [Gluster-users] How gluster parallelize reads
> > 0 means use the first server to respond, I think - at least that's
> > my guess of what "first up server" means
> > 1 means hashed by GFID, so clients will use the same server for a
> > given file, but different files may be accessed from different nodes.
>
> I think that 1 is better.
> Why is "0" the default?

Basic storage-developer conservatism. Zero was the behavior before read-hash-mode was implemented. As strongly as some of us might believe that such tweaks lead to better behavior - as I did with this one in 2012 [1] - we've kind of learned the hard way that existing users often disagree with our estimations. Thus, new behavior is often kept as a "special" for particular known environments or use cases, and the default is left unchanged until there's clear feedback indicating it should be otherwise.

[1] http://review.gluster.org/#/c/2926/
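The hash-by-GFID behavior described in the quote can be sketched with a simple modulus. This is an illustrative stand-in, not AFR's actual read-child selection code; it only assumes that a GFID is a UUID and that the mapping must be deterministic so every client picks the same replica for a given file:

```python
# read-hash-mode 1, sketched: hashing the GFID means every client reads
# a given file from the same replica, while different files spread
# across replicas. Illustrative only.

import uuid

def read_child(gfid: uuid.UUID, n_replicas: int) -> int:
    # Deterministic: same file -> same replica index, on every client
    return gfid.int % n_replicas

gfid = uuid.UUID("3f2504e0-4f89-11d3-9a0c-0305e82c3301")  # example GFID
print(read_child(gfid, 3))   # stable across runs, clients, and mounts
```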
Re: [Gluster-users] EC clarification
> 2016-09-21 20:56 GMT+02:00 Serkan Çoban:
> > Then you can use 8+3 with 11 servers.
>
> Stripe size won't be good: 512*(8-3) = 2560 and not 2048 (or a multiple)

It's not really 512*(8+3), though. Even though there are 11 fragments, they only contain 8 fragments' worth of data. They just encode it with enough redundancy that *any* 8 of them contain the whole.
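The arithmetic behind that reply, spelled out (illustrative only; the 512-byte fragment size is taken from the quoted thread):

```python
# In an 8+3 disperse volume, a stripe of user data is cut into 8 data
# fragments and encoded into 11 fragments, any 8 of which can rebuild it.

fragment = 512                       # bytes per fragment (from the thread)
data, redundancy = 8, 3

stripe = fragment * data             # user data per stripe
stored = fragment * (data + redundancy)
print(stripe)                        # 4096: a clean power-of-two stripe
print(stored)                        # 5632 bytes actually written
print(round(stripe / stored, 3))     # 0.727 space efficiency
```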
Re: [Gluster-users] [Gluster-devel] CFP for Gluster Developer Summit
Two proposals, both pretty developer-focused.

(1) Gluster: The Ugly Parts

Like any code base of its size and age, Gluster has accumulated its share of dead, redundant, or simply inelegant code. This code makes us more vulnerable to bugs, and slows our entire development process for every feature. In this interactive discussion, we'll identify translators or other modules that can be removed or significantly streamlined, and develop a plan for doing so within the next year or so. Bring your favorite gripes and pet peeves (about the code).

(2) Gluster Debugging

Every developer has their own "bag of tricks" for debugging Gluster code - things to look for in logs, options to turn on, obscure test-script features, gdb macros, and so on. In this session we'll share many of these tricks, and hopefully collect more, along with a plan to document them so that newcomers can get up to speed more quickly.

I could extend #2 to cover more user/support-level problem diagnosis, but I think I'd need a co-presenter for that, because it's not an area in which I feel like an expert myself.
Re: [Gluster-users] [Gluster-devel] One client can effectively hang entire gluster array
> In either of these situations, one glusterfsd process on whatever peer
> the client is currently talking to will skyrocket to *nproc* cpu usage
> (800%, 1600%) and the storage cluster is essentially useless; all other
> clients will eventually try to read or write data to the overloaded
> peer and, when that happens, their connection will hang. Heals between
> peers hang because the load on the peer is around 1.5x the number of
> cores or more. This occurs in either gluster 3.6 or 3.7, is very
> repeatable, and happens much too frequently.

I have some good news and some bad news. The good news is that features to address this are already planned for the 4.0 release. Primarily I'm referring to QoS enhancements, some parts of which were already implemented for the bitrot daemon. I'm still working out the exact requirements for this as a general facility, though. You can help! :) Also, some of the work on "brick multiplexing" (multiple bricks within one glusterfsd process) should help to prevent the thrashing that causes a complete freeze-up.

Now for the bad news. Did I mention that these are 4.0 features? 4.0 is not near term, and not getting any nearer as other features and releases keep "jumping the queue" to absorb all of the resources we need for 4.0 to happen. Not that I'm bitter or anything. ;)

To address your more immediate concerns, I think we need to consider more modest changes that can be completed in more modest time. For example:

* The load should *never* get to 1.5x the number of cores. Perhaps we could tweak the thread-scaling code in io-threads and epoll to check system load and not scale up (or even scale down) if system load is already high.

* We might be able to tweak io-threads (which already runs on the bricks and already has a global queue) to schedule requests in a fairer way across clients. Right now it executes them in the same order that they were read from the network.
That tends to be a bit "unfair" and that should be fixed in the network code, but that's a much harder task. These are only weak approximations of what we really should be doing, and will be doing in the long term, but (without making any promises) they might be sufficient and achievable in the near term. Thoughts? ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
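The io-threads fairness tweak suggested above can be sketched in a few lines. This is a hypothetical Python model, not the actual io-threads code: requests are grouped per client and drained round-robin instead of in strict arrival order, so one busy client can no longer starve everyone else in the queue.

```python
from collections import OrderedDict, deque

def fair_schedule(requests):
    """Reorder (client_id, request) pairs from arrival (FIFO) order into
    per-client round-robin order, so no single client monopolizes the queue.
    Illustrative sketch only."""
    queues = OrderedDict()
    for client, req in requests:
        queues.setdefault(client, deque()).append(req)
    out = []
    while queues:
        # One pass over a snapshot of the clients; drop a client's queue
        # once it is drained.
        for client in list(queues):
            out.append(queues[client].popleft())
            if not queues[client]:
                del queues[client]
    return out
```

With arrivals [("a", 1), ("a", 2), ("a", 3), ("b", 4)], FIFO would make "b" wait behind all of "a"'s requests; round-robin interleaves them as [1, 4, 2, 3].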
[Gluster-users] Securing GlusterD management
As some of you might already have noticed, GlusterD has been notably insecure ever since it was written. Unlike our I/O path, which does check access control on each request, anyone who can craft a CLI RPC request and send it to GlusterD's well-known TCP port can do anything that the CLI itself can do. TLS support was added for GlusterD a while ago, but it has always been a bit problematic and as far as I know hasn't been used much. It's a bit of a chicken-and-egg problem: nobody wants to use a buggy or incomplete feature, but as long as nobody's using it there's little incentive to improve it.

Recently, there have been some efforts to add features which would turn the existing security problem into a full-fledged "arbitrary code execution" vulnerability (as the security folks would call it). These efforts have been blocked, but they have also highlighted the fact that we're *long* past the point where we should have tried to make GlusterD more secure. To that end, I've submitted the following patch to make TLS mandatory for all GlusterD communication, with some very basic authorization for CLI commands.

http://review.gluster.org/#/c/14866/

The technical details are in the commit message, but the salient point is that it requires *zero configuration* to get basic authentication and encryption. This is equivalent to putting a lock on the door. Sure, maybe everybody knows the default combination, but *at least there's a lock*, and people who want to secure their systems can change the combination to whatever they want. That's better than the door hanging open, without even a solid attachment point for a lock, and it's essential infrastructure for anything else we might do. The patch also fixes some bugs that affect even today's optional TLS implementation.

One significant downside of this change has to do with rolling upgrades. While it might be possible for those who are already using TLS to do a rolling upgrade, it would still require some manual steps. The vast majority of users who haven't enabled TLS will be unable to upgrade without "stopping the world" (as is already the case for enabling TLS).

I'd appreciate feedback from users on both the positive and negative aspects of this change. Should it go into 3.9? Should it be backported to 3.8? Or should it wait until 4.0? Feedback from developers is also appreciated, though at this point I think any problems with the patch itself have already been resolved, to the point where GlusterFS with the patch is more stable than GlusterFS without it. I'm just fighting through some NetBSD testing issues at this point, hoping to make that situation better as well.

___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
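For context, the existing *optional* management TLS gives a feel for what's involved. The following is a configuration sketch based on the documented procedure, not on the patch above; the paths are the defaults gluster looks for, but verify them against your version's documentation.

```shell
# On every node: key, certificate, and CA bundle where gluster expects them.
#   /etc/ssl/glusterfs.key   this node's private key
#   /etc/ssl/glusterfs.pem   this node's certificate
#   /etc/ssl/glusterfs.ca    concatenation of the certificates you trust
# Enabling TLS on the management path is done with a marker file:
touch /var/lib/glusterd/secure-access
# ...then restart glusterd on each node for it to take effect.
```

The patch would effectively make that last step unnecessary by turning TLS on unconditionally.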
Re: [Gluster-users] files on glusterfs disappears
> Thank you for the answer. If I have understood correctly, you suggest disabling NUFA to verify whether it is the source of the problem, correct?

That would certainly provide a very useful data point. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] files on glusterfs disappears
> This could be because of nufa xlator. As you say the files are present on the brick I don't suspect RDMA here.

Agreed.

> Is nufa still supported? Could this be a bug in nufa + dht?

Until we explicitly decide to stop building and distributing it, it's still "supported" in some sense, but only to the extent that there's someone available to look at it. Unfortunately, nobody has that as an assignment, and our tests for NUFA are minimal, so they're not likely to detect breakage automatically. With the amount of change we've seen to DHT over the last several months, it's entirely possible that a NUFA bug or two has crept in. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] Default quorum for 2 way replication
> I like the default to be 'none'. Reason: If we have 'auto' as quorum for 2-way replication and first brick dies, there is no HA. If users are fine with it, it is better to use plain distribute volume

"Availability" is a tricky word. Does it mean access to data now, or later despite failure? Taking a volume down due to loss of quorum might be equivalent to having no replication in the first sense, but certainly not in the second. When the possibility (likelihood?) of split brain is considered, enforcing quorum actually does a *better* job of preserving availability in the second sense. I believe this second sense is most often what users care about, and therefore quorum enforcement should be the default.

I think we all agree that quorum is a bit slippery when N=2. That's where there really is a tradeoff between (immediate) availability and (highest levels of) data integrity. That's why arbiters showed up first in the NSR specs, and later in AFR. We should definitely try to push people toward N>=3 as much as we can. However, the ability to "scale down" is one of the things that differentiates us vs. both our Ceph cousins and our true competitors. Many of our users will stop at N=2 no matter what we say. However unwise that might be, we must still do what we can to minimize harm when things go awry. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
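The N=2 tradeoff being argued about can be made concrete with a small model. This is an assumption-laden Python sketch, not gluster code: 'auto' is modeled as a strict majority, with the tie for even N broken in favor of the set that contains the first brick, which matches AFR's documented behavior.

```python
def have_quorum(up_bricks, n_bricks, quorum_type="auto", quorum_count=None):
    """Decide whether writes are allowed, given the set of brick indices
    that are up. Sketch of client-side quorum semantics."""
    up = len(up_bricks)
    if quorum_type == "none":
        return up >= 1                    # any live brick is enough
    if quorum_type == "fixed":
        return up >= quorum_count         # user-chosen threshold
    # "auto": strict majority wins outright...
    if up * 2 > n_bricks:
        return True
    # ...and exactly half counts only if the first brick survived.
    if up * 2 == n_bricks:
        return 0 in up_bricks
    return False
```

This shows the complaint exactly: with replica 2 and 'auto', losing the first brick (leaving only brick 1 up) drops quorum, so there is no HA for that failure, while 'none' keeps writes available at the cost of split-brain risk.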
Re: [Gluster-users] gluster small file performance
> Note: The log files attached have the "No data available" messages parsed out to reduce the file size. There were an enormous number of these. One of my colleagues submitted something to the message board about these errors in 3.7.3.
>
> [2015-08-17 17:03:37.270219] W [fuse-bridge.c:1230:fuse_err_cbk] 0-glusterfs-fuse: 6643: REMOVEXATTR() /boost_1_57_0/boost/accumulators/accumulators.hpp = -1 (No data available)
> [2015-08-17 17:03:37.271004] W [fuse-bridge.c:1230:fuse_err_cbk] 0-glusterfs-fuse: 6646: REMOVEXATTR() /boost_1_57_0/boost/accumulators/accumulators.hpp = -1 (No data available)
> [2015-08-17 17:03:37.271663] W [fuse-bridge.c:1230:fuse_err_cbk] 0-glusterfs-fuse: 6648: REMOVEXATTR() /boost_1_57_0/boost/accumulators/accumulators.hpp = -1 (No data available)
> [2015-08-17 17:03:37.274273] W [fuse-bridge.c:1230:fuse_err_cbk] 0-glusterfs-fuse: 6662: REMOVEXATTR() /boost_1_57_0/boost/accumulators/accumulators_fwd.hpp = -1 (No data available)

I can't help but wonder how much these are affecting your performance. That's a lot of extra messages, and even more effort to log the failures. When I run your tests myself, I don't see any of these, and I don't see a performance drop-off either. Maybe something ACL- or SELinux-related? It would be extra-helpful to get a stack trace for just one of these, to see where they're coming from. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] gluster small file performance
> I changed the logging to error to get rid of these messages as I was wondering if this was part of the problem. It didn't change the performance. Also, I get these same errors both before and after the reboot. I only see the slowdown after the reboot. I have SELinux disabled. Not sure about ACL. Don't think I can turn ACL off on XFS. I am happy to post results of strace. Do I just do 'strace tar -xPf boost.tar strace.log'?

That will show the calls if they're coming from tar, but I suspect they're internally generated, so you'd have to attach strace to the glusterfsd process. Either way, you'd probably want to add -e removexattr to keep the results manageable. That will at least tell us *which* xattr we're trying to remove, which might give a clue to what's going on. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
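Spelled out, that advice translates into something like the following hypothetical commands; substitute the real PID from the pgrep output.

```shell
pgrep -a glusterfsd                      # find the brick daemon's PID
strace -f -e trace=removexattr -p <PID>  # attach and log only removexattr calls
# ...then run the tar extraction on the client while strace is attached.
```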
Re: [Gluster-users] trouble mounting ssl-enabled volume
> Hi all, I'm just installing my first ever glusterfs volume, and am running into trouble, which I think may be related to using ssl. I don't have a network I can trust, so going without secure authentication and encryption is a show-stopper for me. I am using gluster 3.6.3 on Debian stable, and the command I'm using to mount is:
>
> # mount -t glusterfs localhost:/austen /home
>
> and the error message I am seeing is the following:
>
> # tail -23 /var/log/glusterfs/home.log
> [2015-06-16 00:12:12.691413] I [socket.c:379:ssl_setup_connection] 0-austen-client-0: peer CN = elliot
> [2015-06-16 00:12:12.691978] I [rpc-clnt.c:1761:rpc_clnt_reconfig] 0-austen-client-0: changing port to 49152 (from 0)
> [2015-06-16 00:12:12.694267] I [socket.c:379:ssl_setup_connection] 0-austen-client-1: peer CN = wentworth
> [2015-06-16 00:12:12.695846] I [rpc-clnt.c:1761:rpc_clnt_reconfig] 0-austen-client-1: changing port to 49152 (from 0)
> [2015-06-16 00:12:12.703270] I [socket.c:379:ssl_setup_connection] 0-austen-client-0: peer CN = elliot
> [2015-06-16 00:12:12.703544] I [client-handshake.c:1413:select_server_supported_programs] 0-austen-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2015-06-16 00:12:12.703912] W [client-handshake.c:1109:client_setvolume_cbk] 0-austen-client-0: failed to set the volume (Permission denied)

Are you setting auth.ssl-allow to enable specific users (identified by CN) to access the volume? The following page shows how:

http://www.gluster.org/community/documentation/index.php/SSL

Also, note that the CN can't contain spaces. I know that's inconvenient, but space was already used as a delimiter and changing that would have affected backward compatibility.
> [2015-06-16 00:12:12.703940] W [client-handshake.c:1135:client_setvolume_cbk] 0-austen-client-0: failed to get 'process-uuid' from reply dict
> [2015-06-16 00:12:12.703956] E [client-handshake.c:1141:client_setvolume_cbk] 0-austen-client-0: SETVOLUME on remote-host failed: Authentication failed
> [2015-06-16 00:12:12.703970] I [client-handshake.c:1225:client_setvolume_cbk] 0-austen-client-0: sending AUTH_FAILED event
> [2015-06-16 00:12:12.703992] E [fuse-bridge.c:5145:notify] 0-fuse: Server authenication failed. Shutting down.
> [2015-06-16 00:12:12.704010] I [fuse-bridge.c:5599:fini] 0-fuse: Unmounting '/home'.
> [2015-06-16 00:12:12.709146] I [socket.c:379:ssl_setup_connection] 0-austen-client-1: peer CN = wentworth
> [2015-06-16 00:12:12.710243] I [client-handshake.c:1413:select_server_supported_programs] 0-austen-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2015-06-16 00:12:12.711294] W [client-handshake.c:1109:client_setvolume_cbk] 0-austen-client-1: failed to set the volume (Permission denied)
> [2015-06-16 00:12:12.711321] W [client-handshake.c:1135:client_setvolume_cbk] 0-austen-client-1: failed to get 'process-uuid' from reply dict
> [2015-06-16 00:12:12.711330] E [client-handshake.c:1141:client_setvolume_cbk] 0-austen-client-1: SETVOLUME on remote-host failed: Authentication failed
> [2015-06-16 00:12:12.711339] I [client-handshake.c:1225:client_setvolume_cbk] 0-austen-client-1: sending AUTH_FAILED event
> [2015-06-16 00:12:12.711349] E [fuse-bridge.c:5145:notify] 0-fuse: Server authenication failed. Shutting down.
> [2015-06-16 00:12:12.711358] I [fuse-bridge.c:5599:fini] 0-fuse: Unmounting '/home'.
> [2015-06-16 00:12:12.711374] E [mount-common.c:228:fuse_mnt_umount] 0-glusterfs-fuse: fuse: failed to unmount /home: Invalid argument
> [2015-06-16 00:12:12.711586] W [glusterfsd.c:1194:cleanup_and_exit] (-- 0-: received signum (15), shutting down
>
> Sadly, I have very little idea as to how to debug this.
> I fear it may be a problem with my ssl keys (I created a CA key and used it to sign the keys for the two servers, but may have done this wrong). Any suggestions are welcome. I understand I haven't given all the information you likely need to help, but I don't even know what information would really be relevant, as I do not understand what this AUTH_FAILED event means.
>
> David

___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
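For anyone else hitting this, the settings being referred to look roughly like the following. This is a hypothetical sketch for the "austen" volume from the log; "client1" stands in for whatever CN the client machine's certificate actually carries, and the CNs must match the certificate subjects exactly (no spaces).

```shell
gluster volume set austen auth.ssl-allow 'elliot,wentworth,client1'
gluster volume set austen client.ssl on
gluster volume set austen server.ssl on
```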
Re: [Gluster-users] reading from local replica?
> In short, it would seem that either were I to use geo-replication, whether recommended or not in this kind of usage, I'd need to own both which volume to mount and what to do with writes when the client has chosen to mount the slave.

True. Various active/active geo-replication solutions have been on the road map for some time, but in each release there are other things deemed more important. :(

> Finally, given that ping times between regions are typically in excess of 200 ms in my case, would you strongly discourage AFR usage?

Pretty strongly. The AFR write protocol is quite latency-sensitive. Obviously, this affects performance. Also, as RTT increases, it becomes harder and harder to tune things so that network brownouts don't become full partitions. If the read-replica selection options worked, then reads should be OK, and an almost entirely read-only workload might be OK. Otherwise, I'd say you're likely to have a bad time. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] reading from local replica?
> Am I misunderstanding cluster.read-subvolume/cluster.read-subvolume-index? I have two regions, A and B, with servers a and b in regions A and B, respectively. I have clients in both regions. Intra-region communication is fast, but the pipe between the regions is terrible. I'd like to minimize inter-region communication to as close to glusterfs write operations only and have reads go to the server in the region the client is running in. I have created a replica volume as:
>
> gluster volume create gv0 replica 2 a:/data/brick1/gv0 b:/data/brick1/gv0 force
>
> As a baseline, if I use scp to copy from the brick directly, I get -- for a 100M file -- times of about 6s if the client scps from the server in the same region and anywhere from 3 to 5 minutes if the client scps from the server in the other region. I was under the impression (from something I read but can't now find) that glusterfs automatically picks the fastest replica, but that has not been my experience; glusterfs seems to generally prefer the server in the other region over the local one, with times usually in excess of 4 minutes.

The choice of which replica to read from has become rather complicated over time. The first parameter that matters is cluster.read-hash-mode, which selects between dynamic and (two forms of) static selection. For the default mode, we try to spread the read load across replicas based on both the file's ID and the client's. For read-hash-mode=0 *only*, we do this:

* If choose-local is set (as it is by default) and there's a local replica, use that.
* Otherwise, select a replica based on fastest *initial* response.

Note that these are both a bit prone to hot spots, which is why this method is not the default. Also, re-evaluating response times is as likely to lead to mobile hotspot behavior as anything else - clients keep following each other around to previously idle but now overloaded replicas, moving the congestion around but never resolving it. Thus, we only tend to re-evaluate in response to brick up/down events. Probably some room for improvement here.

That brings us to read-subvolume and read-subvolume-index. The difference between them is that read-subvolume takes a translator *name* (which you'd have to get from the volfile) and only applies to one replica set within a volume. It's really only useful for testing and debugging. By contrast, read-subvolume-index applies to all replica sets in a volume and doesn't require any knowledge of translator names. Either one is used *before* read-hash-mode; if it's set, and if the corresponding replica is up, it will be chosen. Yes, it's a bit of a mess. However, as you've clearly guessed, this is a pretty critical decision, so it's nice to have many different ways to control it.

> I've also tried having clients mount the volume using the xlator options cluster.read-subvolume and cluster.read-subvolume-index, but neither seem to have any impact. Here are sample mount commands to show what I'm attempting:
>
> mount -t glusterfs -o xlator-option=cluster.read-subvolume=gv0-client-0 (or 1) a:/gv0 /mnt/glusterfs
> mount -t glusterfs -o xlator-option=cluster.read-subvolume-index=0 (or 1) a:/gv0 /mnt/glusterfs

I would guess that the translator options are somehow not being passed all the way through to the translator that actually makes the decision. If it is being passed, it definitely should force the decision as described above. There might be a bug here, or perhaps I'm just misunderstanding code I haven't read in a while.

Also, please note that synchronous replication (AFR) isn't really intended or expected to work over long distances. Anything over 5ms RTT is risky territory; that's why we have separate geo-replication. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] reading from local replica?
> Sorry for neglecting to mention the version, it's 3.7.1.

I've filed a bug to track this:

https://bugzilla.redhat.com/show_bug.cgi?id=1229808 ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] reading from local replica?
> So, maybe passing these options as a mount command doesn't work/is a no-op, but what I don't understand is why -- given that there is no measure by which glusterfs should ever conclude the replica in the other region is ever faster than the replica in the same region.

If read-subvolume or read-subvolume-index is somehow not getting through, then we're back to the read-hash-mode default - which does *not* try to use round-trip-time measurements. Still, it should give different results for different files.

> In fact, it appears as though glusterfs is *preferring* the slower replica.

It's hard to see how that would be the case, since the code to set read_child based on first-to-reply seems to be *missing* in the current code. :( What version are you running, again? ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] Network Topology Question
> Hi there, I'm planning to setup a 3-node cluster for oVirt and would like to use 56 GbE (RoCE) exclusively for GlusterFS. Since 56 GbE switches are far too expensive, it's not planned to add more nodes, and a switch would add a SPOF, I'd like to cross-connect the nodes as shown below:
>
> Node1 <--> Node2, Node2 <--> Node3, Node1 <--> Node3 (direct links)
>
> This way there's a dedicated 56 Gbit connection to/from each member node. Is it possible to do this with GlusterFS? My first thought was to have different IPs in each node's /etc/hosts mapped to the node hostnames, but I'm unsure if I can force GlusterFS to use hostnames instead of IPs.

There are two ways you can do this. Both involve asymmetric configurations. Imagine that you have three subnets, one per wire:

192.168.1.1 and 192.168.1.2 between Node1 and Node2
192.168.2.1 and 192.168.2.2 between Node2 and Node3
192.168.3.1 and 192.168.3.2 between Node1 and Node3

So, /etc/hosts on Node1 would look like this:

192.168.1.2 node2
192.168.3.2 node3

On Node2 you'd have this:

192.168.1.1 node1
192.168.2.2 node3

And so on. Note that these are all different than the clients, which would have entries (probably in DNS rather than /etc/hosts) for the servers' slower external addresses.

The other way to do the same thing is with explicit host routes or iptables rules. In that kind of setup, you put each server into its own subnet, then add routes on the others to go through the interfaces you want. For example:

node1 is 172.30.16.1
node2 is 172.30.17.1
node3 is 172.30.18.1

Therefore, on node1 (using the interface addresses above):

route add -host node2 gw 192.168.1.2
route add -host node3 gw 192.168.3.2

On node2:

route add -host node1 gw 192.168.1.1
route add -host node3 gw 192.168.2.2

And so on, again. Don't forget to turn on IP forwarding. Also, this still requires that the servers have a different /etc/hosts than clients, but at least it can be the same across all servers.
Alternatively, you could use the same /etc/hosts (or DNS) everywhere, if you can add routes on the clients as well. All that said, the benefit of such a configuration is rather limited. Using FUSE or GFAPI, replication will still occur over the slow client network because it's being driven by the clients (this is likely to change in 4.0). On the other hand, self-heal and rebalance traffic will use the faster internal network. SMB and NFS will use both, so they might see some benefit in *aggregate* but not per-client throughput. Depending on your usage pattern, the extra complexity of setting up this kind of routing might not be worth the effort. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] A HowTo for setting up network encryption with GlusterFS
> I've written a how-to for setting up network encryption on GlusterFS at [1]. This was something that was requested, as setting up network encryption is not really easy. I've tried to cover all possible cases.

Great job, Kaushal! Thank you.

> Please read through, and let me know of any changes/improvements needed.

I did spot a couple of minor things. I'll forward those off-list. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] Multi-tenancy
> Can anyone provide any insight on how to configure gluster networking to support multi-tenancy by separating Native/NFS/SMB client connections at layer 2? Our thinking was each client will come into our network on a dedicated vlan, but unsure whether gluster can support, say, a dedicated client trunk interface with 50 or so vlan interfaces? Is this possible? And if not, what could be another way?

Some of this works by default and some of it is work in progress. When a brick sends a reply to a client request, that reply will simply follow the default routing for its destination. In a VLAN environment, this would mean sending it on the pseudo-interface for that VLAN, so in effect the traffic for groups of clients on separate VLANs will remain segregated.

What we don't have is a way to do VLAN-based access control across native, NFS, and SMB. The brick I/O infrastructure does support address-based access control, but IIRC that doesn't affect who can connect at the TCP level. We'll still initially accept connections on any interface, and then close any that don't pass the address filter. If you want to play with this, then auth.allow is the volume option you want to look at. I don't know of any similar options for NFS or SMB, so there might be no way to prevent them from accepting connections on any VLAN. Maybe someone from one of those teams can correct me.

In 4.0 we're working on ways to give users more control over what networks get used for what. Primarily this is to let internal traffic (replication, self-heal, and so on) go over a private back-end network. However, giving users more explicit control over the relationships between volumes and front-end networks has also come up. The feature page is here:

http://www.gluster.org/community/documentation/index.php/Features/SplitNetwork

Does that help? ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
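For reference, the auth.allow option mentioned above takes address patterns along these lines (hypothetical values; and remember this filters connections after the TCP accept, not at the listen level, and only for gluster-native access):

```shell
# Allow only one tenant's VLAN subnet to access this volume natively:
gluster volume set gv0 auth.allow '10.1.1.*'
```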
Re: [Gluster-users] Question on file reads from a distributed replicated volume
Do 'gluster volume set help'; there is a pretty good explanation of the read-subvolume preferences and options there. Specifically, you'll want to look at the cluster.read-hash-mode option, which has one of three values:

(0) Each client will determine which brick seems fastest, and use that for all files unless a brick *failure* causes it to re-evaluate. If the once-fastest brick becomes slower, this will *not* be noticed by clients unless there's a failure. Unfortunately, this mode is likely to *create* such a condition by overloading one server.

(1) The read child for each file will be found using a hash of its GFID, to ensure even distribution. Note that if some servers are faster than others, the distribution will be *even* but not *optimal*. This mode is the default.

(2) Similar to (1), except that each client uses the hash of the file's GFID *plus its own PID*, so that different clients will be spread across different bricks and avoid file-level hot spots.

All of these modes might be overridden if one of the bricks is local to the client. In that case, the client will always read the local copy, and this option is effectively ignored. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
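The three modes (plus the local-brick override) can be modeled in a few lines of Python. This is an illustrative sketch using CRC32; gluster's real hash function and the dynamic behavior of mode 0 differ.

```python
import zlib

def pick_read_child(mode, gfid, client_pid, n_children, local_child=None):
    """Choose which replica ("read child") serves reads for a file,
    mimicking the cluster.read-hash-mode policies described above."""
    # A local brick, when present, overrides all three modes.
    if local_child is not None:
        return local_child
    if mode == 0:
        # "Fastest initial responder" is dynamic state, not a hash;
        # modeled here as a fixed choice purely for illustration.
        return 0
    if mode == 1:
        # Hash of the file's GFID only: every client picks the same child.
        return zlib.crc32(gfid.encode()) % n_children
    if mode == 2:
        # Hash of GFID plus the client's PID: clients spread out per file.
        return zlib.crc32(f"{gfid}:{client_pid}".encode()) % n_children
    raise ValueError(f"unknown read-hash-mode {mode}")
```

Note how mode 1 gives a stable per-file answer regardless of client, while mode 2 makes the answer depend on the client as well, which is exactly the file-level hot-spot avoidance described in (2).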
Re: [Gluster-users] Synchronous replication, or no?
> I was under the impression that gluster replication was synchronous, so the appserver would not return back to the client until the created file was replicated to the other server. But this does not seem to be the case, because sleeping a little bit always seems to make the read failures go away. Is there any other reason why a file created is not immediately available on a second request?

It's quite possible that the replication is synchronous (the bits do hit disk before returning) but that the results are not being seen immediately due to caching at some level. There are some GlusterFS mount options (especially --negative-timeout) that might be relevant here, but it's also possible that the culprit is somewhere above that in your app servers. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] Synchronous replication, or no?
> Jeff: I don't really understand how a write-behind translator could keep data in memory before flushing to the replication module if the replication is synchronous. Or put another way, from whose perspective is the replication synchronous? The gluster daemon or the creating client?

That's actually a more complicated question than many would think. When we say synchronous replication we're talking about *durability* (i.e. does the disk see it) from the perspective of the replication module. It does none of its own caching or buffering. When it is asked to do a write, it does not report that write as complete until all copies have been updated. However, durability is not the same as consistency (i.e. do *other clients* see it), and the replication component does not exist in a vacuum. There are other components both before and after it that can affect durability and consistency.

We've already touched on the "after" part. There might be caches at many levels that become stale as the result of a file being created and written. Of particular interest here are negative directory entries, which indicate that a file is *not* present. Until those expire, it is possible to see a file as not there even though it does actually exist on disk. We can control some of this caching, but not all.

The other side is *before* the replication module, and that's where write-behind comes in. POSIX does not require that a write be immediately durable in the absence of O_SYNC/fsync and so on. We do honor those requirements where applicable. However, the most common user expectation is that we will defer/batch/coalesce writes, because making every write individually immediate and synchronous has a very large performance impact. Therefore we implement write-behind, as a layer above replication. Absent any specific request to perform a write immediately, data might sit there for an indeterminate (but usually short) time before the replication code even gets to see it.
I don't think write-behind is likely to be the issue here, because it only applies to data within a file. It will pass create(2) calls through immediately, so all servers should become aware of the file's existence right away. On the other hand, various forms of caching on the *client* side (even if they're the same physical machines) could still prevent a new file from being seen immediately. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
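The layering described above (write-behind sitting in front of synchronous replication, with creates passing straight through while writes are buffered) can be modeled in a toy form. This is a hypothetical Python sketch, not the actual write-behind translator.

```python
class WriteBehind:
    """Toy model: create() goes straight to the backend (so all replicas
    learn of the file immediately), while write() is deferred until
    fsync()/flush, mirroring the behavior described above."""

    def __init__(self, backend):
        self.backend = backend   # stands in for the replication layer
        self.pending = []        # buffered (path, data) writes

    def create(self, path):
        self.backend.create(path)          # passed through immediately

    def write(self, path, data):
        self.pending.append((path, data))  # backend hasn't seen this yet

    def fsync(self):
        # The explicit durability request: flush everything downward.
        for path, data in self.pending:
            self.backend.write(path, data)
        self.pending.clear()
```

The point of the model: after create() plus write() but before fsync(), the file *exists* everywhere while its data does not, which is exactly why create visibility and data durability are separate questions.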
Re: [Gluster-users] Synchronous replication, or no?
> Ok, that made a lot of sense. I guess what I was expecting was that the writes were (close to) immediately consistent, but Gluster is rather designed to be eventually consistent.

All distributed file systems are, to some extent; we just try to be clearer than most about what the guarantees are. For example, some systems buffer at the client *despite* fsync or O_SYNC. The temptation is obvious; POSIX single-system-image behavior is far more expensive in a distributed file system than in a local one, and everyone has to compete on performance. We're actually far stricter than most when it comes to durability, and the performance disadvantage has been difficult to bear sometimes. Hopefully, now that we have the upcall facility (developed for NFSv4), we can improve consistency as well without having to give up more performance. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] Got a slogan idea?
> What I am saying is that if you have a slogan idea for Gluster, I want to hear it. You can reply on list or send it to me directly. I will collect all the proposals (yours and the ones that Red Hat comes up with) and circle back around for community discussion in about a month or so.

Personally I don't like any of these all that much, but maybe they'll get someone else thinking.

GlusterFS: your data, your way
GlusterFS: any data, any servers, any protocol
GlusterFS: scale-out storage for everyone
GlusterFS: software defined storage for everyone
GlusterFS: the Swiss Army Knife of storage

___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] SSL ciphers
> I dug a bit on the matter and I'm quite puzzled here. In OpenSSL, there's a SSLv23_METHOD which selects whichever is more appropriate, but I see nothing equivalent for TLS! Each version has its dedicated function call like TLSv1_METHOD, TLSv1_1_METHOD and TLSv1_2_METHOD!

I was kind of surprised by the same thing, but I guess I shouldn't have been. This only scratches the surface of the horror that is the OpenSSL API, but what's really scary is that the two main alternatives (GnuTLS and NSS) seem even worse. I used to have hopes of switching to PolarSSL, which has a better and better-documented API, but I keep getting buried by other tasks, so I don't know if/when that will ever happen.

> Thank you very much for pointing out the interesting bits and helping figure out things. Have fun debugging :-)

You're quite welcome. Misery loves company. ;) Please keep us informed of your findings. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] SSL ciphers
> socket.c:2915: priv->ssl_meth = (SSL_METHOD *)TLSv1_method();
>
> I'm really glad to hear that :-)

FWIW, using TLSv1_2_method instead doesn't immediately seem to break. Unfortunately, every possible piece of code for 3.7 got merged one second before the feature-freeze deadline today, and that generated a lot of wreckage. I'll have to wait for that to clear before I can do a meaningful test of this one-line change. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] SSL ciphers
The problem with Gluster's setting is that it's impossible to go beyond HIGH:!SSLv2:!3DES:!RC4:!aNULL:!ADH. Which is bad. Gluster uses SSL only and not TLS :-( An upgrade should be considered. That is untrue in current code: socket.c:2915 priv->ssl_meth = (SSL_METHOD *)TLSv1_method(); Please put the version you're complaining about into a bug report. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
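For anyone wanting to check what that cipher string actually permits, the openssl command-line tool can expand it (a quick sketch; the string is the one quoted above):

```shell
# Expand the cipher string and list what it permits; each output
# line shows a cipher name and the protocol it belongs to.
openssl ciphers -v 'HIGH:!SSLv2:!3DES:!RC4:!aNULL:!ADH'

# Count how many ciphers survive the filter.
openssl ciphers 'HIGH:!SSLv2:!3DES:!RC4:!aNULL:!ADH' | tr ':' '\n' | wc -l
```

This is a handy way to verify whether a proposed exclusion (e.g. adding !SHA1) still leaves a usable cipher set before deploying it.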
Re: [Gluster-users] Quorum setup for 2+1
I have a follow-up question. When a node is disconnected from the rest, the client gets an error message Transport endpoint is not connected and all access is prevented. Write access must not be allowed to such a node. I understand that. In my case it would be a desired feature to be able to at least read files. Is it possible to retain read-only access? Client-side quorum can do this with the cluster.quorum-reads option, but it lacks support for arbiters. The options you've set enforce quorum at the server side, by killing brick daemons if quorum is lost. I suppose it might be possible to add the read-only translator instead of killing the daemon, but AFAIK there's no plan to add that feature. Could I use cluster.quorum-type and cluster.quorum-count? Would it work with a rep 2+1 setup or would I need a rep 3 setup? To use client-side quorum, you'd need a true replica-3 setup (not just two plus an arbiter). ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
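For reference, the client-side quorum options discussed above are set like this (a sketch assuming a volume named myvol; as noted in the answer, this requires a true replica-3 volume, not an arbiter setup):

```shell
# Require a fixed number of up bricks before allowing writes.
gluster volume set myvol cluster.quorum-type fixed
gluster volume set myvol cluster.quorum-count 2

# Allow reads even when write quorum is lost
# (this option does not support arbiters).
gluster volume set myvol cluster.quorum-reads on
```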
Re: [Gluster-users] Quorum setup for 2+1
I would like to set up server-side quorum by using the following setup: - 2x storage nodes (s-node-1, s-node-2) - 1x arbiter node (s-node-3) So the trusted storage pool has three peers. This is my volume info: Volume Name: wp-vol-0 Type: Replicate Volume ID: 8808ee87-b201-474f-83ae-6f08eb259b43 Status: Started Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: s-node-1:/gluster/gvol0/brick0/brick Brick2: s-node-2:/gluster/gvol0/brick0/brick I would like to set up the server-side quorum so that any two nodes would have quorum. s-node-1, s-node-2 = quorum s-node-1, s-node-3 = quorum s-node-2, s-node-3 = quorum According to the Gluster guys at FOSDEM this should be possible. I have been fiddling with the quorum options, but have not been able to achieve the desired setup. Theoretically I would do: # gluster volume set wp-vol-0 cluster.server-quorum-type server # gluster volume set wp-vol-0 cluster.server-quorum-ratio 60 But the cluster.server-quorum-ratio option produces an error: volume set: failed: Not a valid option for single volume How would I achieve the desired setup? Somewhat counter-intuitively, server-quorum-type is a *volume* option but server-quorum-ratio is a *cluster-wide* option. Therefore, instead of specifying a volume name on that command, use this: # gluster volume set all cluster.server-quorum-ratio 60 ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
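Putting the correction together, the working command sequence looks like this (a sketch using the volume name from the question):

```shell
# server-quorum-type is a per-volume option...
gluster volume set wp-vol-0 cluster.server-quorum-type server

# ...but server-quorum-ratio is cluster-wide, so it must be
# set on "all" rather than on a single volume.
gluster volume set all cluster.server-quorum-ratio 60
```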
Re: [Gluster-users] Configure separate network for inter-node communication
I would be very interested to read your blog post as soon as it's out, and I guess many others would be too. Please do post the link to this list as soon as it's online. Sorry, forgot to do this earlier. It's here: http://pl.atyp.us/2015-03-life-on-the-server-side.html ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] poor performance with encryption and SSL enabled
I took the recommendation and disabled the stripes. Now I just have encryption (at rest) and SSL enabled. The test I am running is a bwa indexing. Basic dd read/writes work fine and I don't see any errors in the gluster logs. Then when I try the bwa index I see the following: /shared/perftest/bwa/bwa index -a bwtsw hg19.fa [bwa_index] Pack FASTA... 26.29 sec [bwa_index] Construct BWT for the packed sequence... BWTIncConstructFromPacked() : Can't read from hg19.fa.pac : Unexpected end of file This does look like some sort of bad interaction between the two features. I'll add it as a bug report and see if we can get someone assigned. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] Configure separate network for inter-node communication
I have two gluster nodes in a replicated setup and have connected the two nodes together directly through a 10 Gbit/s crossover cable. Now I would like to tell gluster to use this separate private network for any communications between the two nodes. Does that make sense? Will this bring me any performance gain? And if yes, how do I configure that? It is possible, but it's not likely to improve performance much (yet). The easiest way to do this is to use a custom /etc/hosts on the servers, so that *on a server* every other server's name resolves to its private back-end address. Meanwhile, clients resolve that same name to the server's front-end address. You can get a similar effect with explicit host routes or iptables rules on the servers. The reason this won't have much effect on performance is that the servers do not (currently) replicate to one another. Instead, clients send data directly to every replica themselves. The only time a private network would see much traffic would be when the servers are themselves acting as clients to perform administrative operations - self-heal, rebalance, and so on. In 4.0, both parts of this answer would be different. First, we expect to have better handling of multiple networks and multi-homed hosts, including user specification of which networks to use for which traffic[1]. Second, 4.0 will have a new form of replication which *does* replicate directly between servers[2]. Parts of this second feature are in fact likely to appear well before the rest of 4.0, using the server-to-server data flow but retaining our current methods of tracking changes and re-syncing servers after a failure. In fact I'm writing a blog post right now about this, including some performance measurements. I'll respond again here when it's done. 
[1] http://www.gluster.org/community/documentation/index.php/Features/SplitNetwork [2] http://www.gluster.org/community/documentation/index.php/Features/new-style-replication ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
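As a concrete sketch of the /etc/hosts approach described above (all hostnames and addresses here are made up for illustration): on each server, resolve the other servers' names to their back-end addresses, while clients keep resolving those same names to the front-end addresses.

```shell
# /etc/hosts on each *server* -- peer names resolve to the
# private back-end network (hypothetical 192.168.100.x subnet
# on the 10 Gbit/s crossover link).
cat >> /etc/hosts <<'EOF'
192.168.100.1  gluster-node-1
192.168.100.2  gluster-node-2
EOF

# Clients keep resolving the same names to the public
# front-end addresses, e.g. via normal DNS:
#   203.0.113.1  gluster-node-1
#   203.0.113.2  gluster-node-2
```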
Re: [Gluster-users] poor performance with encryption and SSL enabled
SSL certs are self-signed and generated on all servers. Combined into a glusterfs.ca in /etc/ssl. By itself the SSL is working well. Glad to hear it. ;) If I run dd or any I/O operations I see a flurry of these messages in the logs. [2015-02-24 16:58:51.144099] W [stripe.c:5288:stripe_internal_getxattr_cbk] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x3fd0620550] (--> /usr/lib64/glusterfs/3.6.2/xlator/cluster/stripe.so(stripe_internal_getxattr_cbk+0x36a)[0x7f6a152a12ba] (--> /usr/lib64/glusterfs/3.6.2/xlator/protocol/client.so(client3_3_fgetxattr_cbk+0x174)[0x7f6a154db284] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x3fd0e0ea75] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x142)[0x3fd0e0ff02] ) 0-data-stripe-3: invalid argument: frame-local Have you tried encryption (at rest) without striping, or vice versa? I suspect some kind of bad interaction between the two, but before we go down that path it would be nice to make sure they're working separately. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] Looking for volunteer to write up official How to do GlusterFS in the Cloud: The Right Way for Rackspace...
I could probably chip in too. I've run tons of my own science experiments on Rackspace instead of our own hardware, because that makes my results more reproducible by others. If we can enable more people to do likewise, that benefits everyone. P.S. Hi Jesse. Small world, huh? ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] Looking for volunteer to write up official How to do GlusterFS in the Cloud: The Right Way for Rackspace...
Looks like we have four volunteers: * Ben Turner (primary GlusterFS perf tuning guy) * Jeff Darcy (greybeard GlusterFS developer and scalability expert) * Josh Boon (experienced GlusterFS guy - Ubuntu focused) * Nico Schottelius (newer GlusterFS guy - familiar with Ubuntu/CentOS) This sounds like a fairly good mix, so let's go with that. Ben and Jeff, does it make sense for you two to do the leading, with Josh and Nico involved and learning/assisting/idea-generation/stuff as needed? Sounds good to me. By purest coincidence, I was planning to do some experiments on Rackspace today anyway. I'll try to take notes, and share them when I'm done. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
[Gluster-users] REMINDER: GlusterFS.next (a.k.a. 4.0) status/planning meeting
This is *tomorrow* at 12:00 UTC (approximately 15.5 hours from now) in #gluster-meeting on Freenode. See you all there! - Original Message - Perhaps it's not obvious to the broader community, but a bunch of people have put a bunch of work into various projects under the 4.0 banner. Some of the results can be seen in the various feature pages here: http://www.gluster.org/community/documentation/index.php/Planning40 Now that the various subproject feature pages have been updated, it's time to get people together and decide what 4.0 is *really* going to be. To that end, I'd like to schedule an IRC meeting for February 6 at 12:00 UTC - that's this Friday, same time as the triage/community meetings but on Friday instead of Tuesday/Wednesday. An initial agenda includes: * Introduction and expectation-setting * Project-by-project status and planning * Discussion of future meeting formats and times * Discussion of collaboration tools (e.g. gluster.org wiki or Freedcamp) going forward. Anyone with an interest in the future of GlusterFS is welcome to attend. This is *not* a Red Hat only effort, tied to Red Hat product needs and schedules and strategies. This is a chance for the community to come together and define what the next generation of distributed file systems for the real world will look like. I hope to see everyone there. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] how to shrink client translator
gluster volume set volname open-behind off turns off this xlator in the client stack. There is no way to turn off debug/io-stats. Any reason why you would like to turn off the io-stats translator? For improving efficiency. It might not be a very fruitful kind of optimization. Repeating an experiment someone else had done a while ago, I just ran an experiment to compare a normal client volfile vs. one with a *hundred* extra do-nothing translators added. There was no statistically significant difference, even on a fairly capable SSD-equipped system. I/O latency variation and other general measurement noise still far outweigh the cost of a few extra function calls to invoke translators that aren't doing any I/O themselves. Is there any command to show the current translator tree after dynamically adding or deleting any xlator? The new graph should show up in the logs. Also, you can always use gluster system getspec xxx to get the current client volfile for any volume xxx. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
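The two commands mentioned above, spelled out (a sketch assuming a volume named myvol; depending on the GlusterFS version, getspec is invoked via the `system::` namespace):

```shell
# Turn the open-behind translator off in the client stack.
gluster volume set myvol performance.open-behind off

# Dump the current client volfile, which shows the full
# translator tree as clients will load it.
gluster system:: getspec myvol
```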
Re: [Gluster-users] ... i was able to produce a split brain...
Pranith and I had a discussion regarding this issue and here is what we have in our mind right now. We plan to provide the user commands to execute from mount so that he can access the files in split-brain. This way he can choose which copy is to be used as source. The user will have to perform a set of getfattrs and setfattrs (on virtual xattrs) to decide which child to choose as source and inform AFR with his decision. A) To know the split-brain status : getfattr -n trusted.afr.split-brain-status path-to-file This will provide user with the following details - 1) Whether the file is in metadata split-brain 2) Whether the file is in data split-brain It will also list the name of afr-children to choose from. Something like : Option0: client-0 Option1: client-1 We also tell the user what the user could do to view metadata/data info; like stat to get metadata etc. B) Now the user has to choose one of the options (client-x/client-y..) to inspect the files. e.g., setfattr -n trusted.afr.split-brain-choice -v client-0 path-to-file We save the read-child info in inode-ctx in order to provide the user access to the file in split-brain from that child. Once the user inspects the file, he proceeds to do the same from the other child of replica pair and makes an informed decision. C) Once the above steps are done, AFR is to be informed with the final choice for source. This is achieved by - (say the fresh copy is in client-0) e.g., setfattr -n trusted.afr.split-brain-heal-finalize -v client-0 path-to-file This child will be chosen as source and split-brain resolution will be done. +1 That looks quite nice, and AFAICT shouldn't be prohibitively hard to implement. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
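The proposed sequence above, end to end, would look something like this (a sketch of the *proposed* interface, not shipped code; the mount path, file name, and choice of client-0 are illustrative):

```shell
# (A) Check whether the file is in data/metadata split-brain
#     and list the replica children to choose from.
getfattr -n trusted.afr.split-brain-status /mnt/gluster/path-to-file

# (B) Pick one child to inspect; reads are now served from that copy.
setfattr -n trusted.afr.split-brain-choice -v client-0 /mnt/gluster/path-to-file
stat /mnt/gluster/path-to-file    # inspect this copy's metadata/data

# Repeat (B) with client-1 and compare, then...

# (C) Tell AFR which copy is the good one; it becomes the heal source.
setfattr -n trusted.afr.split-brain-heal-finalize -v client-0 /mnt/gluster/path-to-file
```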
Re: [Gluster-users] ... i was able to produce a split brain...
On 01/27/2015 11:43 PM, Joe Julian wrote: No, there's not. I've been asking for this for years. Hey Joe, Vijay and I were just talking about this today. We were wondering if you could give us the inputs to make it a feature to implement. Here are the questions I have: Basic requirements if I understand correctly are as follows: 1) User should be able to fix the split-brain without any intervention from admin as the user knows best about the data. 2) He should be able to preview some-how about the data before selecting the copy which he/she wants to preserve. One possibility would be to implement something like DHT's filter_loc_subvol_key, though perhaps using child indices instead of translator names. Another would be a script which can manipulate volfiles and use GFAPI to fetch a specific version of a file. I've written several scripts which can do the necessary volfile manipulation. If we finally have a commitment to do something like this, actually implementing it will be the easy part. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] Reddit thread on GlusterFS
Created a reddit account and posted. Could use some upvotes, though, if we don't want "It seems GlusterFS popularity didn't take off, and Ceph ate Gluster's lunch." to be the top comment. I wouldn't worry about it too much. Most people know that Redditors tend to be negative, contrarian, and clueless. Start a thread about Ceph and you'd probably see people talking about how difficult and unstable and slow it was during the one hour they spent with it. Some of them might even make disparaging comparisons to GlusterFS. It's better for us to let comments there stand in their usual Reddit context than to risk being accused of astroturfing. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] New architecture: some advice needed
Actually I have 3 supermicro servers with 12 4TB SATA disks each and 2 SSDs (in each server). Each server also has one dual-port DDR Infiniband card. I would like to create a scale-out storage infrastructure (primarily used by web servers), totally HA and fault tolerant. I was thinking about 1 brick for each SATA disk in Distributed Dispersed mode, replica set to 3 (so, actually, only 12*4TB=48TB would be available). What do you suggest? Is Distributed Dispersed good for my environment or should I go with Distributed Replicated? In replicated mode, I can always access the raw files in case of disaster; this would not be possible with dispersed mode, right? What are the pros and cons between replicated and dispersed modes? We plan to add up to 10 servers (all with 12*4TB SATA disks) in the near future, ending up with 336TB of available and replicated space. Any suggestions? The key tradeoffs here are storage utilization vs. performance. In general, erasure codes (disperse) will give better storage utilization than replication for the same level of performance. However, this might not be the case for N=3. With replication, that will protect against two failures. However, from the admin guide section on disperse: redundancy must be greater than 0, and the total number of bricks must be greater than 2 * redundancy. I interpret this to mean that for two-failure protection you would need at least five bricks. With three bricks disperse can only offer one-failure protection. In this case it's roughly equivalent to RAID-5, with only a 50% storage penalty vs. 100% for replica 2 offering the same protection. The other issue is performance. With disperse, all writes *and reads* must be done to all bricks, and at a stripe size equal to 512 times the number of bricks (minus those used for redundancy). This means more data transfer, especially for reads, and also more write contention than with replication. 
This being new code, some optimizations that already exist for replication do not yet exist for disperse even though they're applicable. Adding Xavier, who's the real expert on disperse, in case I got something wrong here. ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
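To make the brick-count arithmetic above concrete, a dispersed volume with two-failure protection could be created like this (a sketch with hypothetical host and brick names; 5 bricks with redundancy 2 satisfies bricks > 2 * redundancy):

```shell
# 5 bricks total, 2 of which hold redundancy data:
# survives any 2 brick failures, with a storage overhead
# of 2/5 = 40% (vs. 200% for replica 3).
gluster volume create disp-vol disperse 5 redundancy 2 \
    host1:/bricks/b0 host2:/bricks/b0 host3:/bricks/b0 \
    host4:/bricks/b0 host5:/bricks/b0
```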
Re: [Gluster-users] # of replica != number pf bricks?
What happens if I have 3 peers for quorum. I create 3 bricks and want to have only two replicas in my volume. The number of *bricks* must be a multiple of the replica count, but quorum is based on the number of *servers* and there can be multiple bricks per server. Therefore, if you have servers A, B, and C with two bricks each, you can do this: volume create foo replica 2 \ A:/brick0 B:/brick0 C:/brick0 A:/brick1 B:/brick1 C:/brick1 First we'll combine this into the following two-way replica sets: A:/brick0 and B:/brick0 C:/brick0 and A:/brick1 B:/brick1 and C:/brick1 Then we'll distribute files among those three sets. If one server fails then we'll still have quorum (2/3) and each replica set will have at least one surviving replica. If two fail then neither of those things will be true and we'll disable the volume. In 4.0 we plan to improve on this by splitting bricks and creating the necessary replica sets from the pieces ourselves. Besides making configuration simpler, this should remove the restriction on the number of bricks being a multiple of the replica count, and also redistribute load more evenly during or after a failure. 4.0 is a long way off, though, so I probably shouldn't even be talking about it. ;) ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
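The create command from the answer above, formatted for readability:

```shell
# Two bricks per server, ordered so that consecutive pairs
# (each two-way replica set) always span two different servers.
gluster volume create foo replica 2 \
    A:/brick0 B:/brick0 \
    C:/brick0 A:/brick1 \
    B:/brick1 C:/brick1
```

Bricks are paired in the order given, which yields the three replica sets listed in the answer: (A:/brick0, B:/brick0), (C:/brick0, A:/brick1), and (B:/brick1, C:/brick1).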
Re: [Gluster-users] Stupid question re multiple networks
What if the gluster servers are also clients? I locally plan to use a number of servers acting as gluster and VM servers, so that gluster serves both the VM's and other clients. I think that fits fairly well into this paradigm. Note that the routing of traffic is by *type* (e.g. user I/O, rebalance) rather than by destination. By default, everything's on the same network, so things would work just as now. If you want, you can redirect user I/O over one network and internal traffic over another, even if the machines are both clients and servers. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] Stupid question re multiple networks
AFAIK the multiple-network scenario only works if you are not using Gluster via FUSE mounts or gfapi from remote hosts. It will work for NFS access or when you set up something like Samba with CTDB. Just not with native Gluster, as the server always tells the clients which addresses to connect to: i.e. your storage hosts will always supply the connection details of the hosts that are configured in gluster to your storage clients. I wonder if this could be gaffer-taped with some bridging/vlan/arp spoofing trickery but I'm not sure I'd trust such a hack. It would be *really* nice if there was a way to set up gluster so you could specify different IPs for backend and frontend operations. As you suggest, there are various kinds of trickery that can be used to fake multi-network support even for native mounts. I've seen it done via split-horizon DNS, explicit host routes, and iptables. *Proper* support for multiple networks is part of the proposed 4.0 feature set. http://www.gluster.org/community/documentation/index.php/Planning40 In fact, I would greatly appreciate your help defining what "proper" means in this context. Clearly, we need to add the concept of a network to our (informal) object model, and sort out the host/address/network relationships. Then we need a way to direct certain traffic flows to certain networks. The question is: how do we present this to the user? Let's take a whack at how to define networks etc. using the CLI's current object-verb syntax (even though it's a bit clunky). gluster network add user-net 1.2.3.0/24 gluster network add back-end 5.6.0.0/16 gluster peer probe 1.2.3.4 gluster peer probe 5.6.7.8 So far, so good. Note that on the second probe we should be able to recognize that this is just a new address (on another network) for the host we already added with the first probe. Heartbeats, quorums, etc. should also be aware of multi-homed hosts. Maybe there's a better syntax, but this will do for now. Let's add a volume. 
gluster volume create silly-vol 1.2.3.4:/brick So, which network address should the daemon for 1.2.3.4:/brick expose for clients? Which address should it use for internal traffic such as rebalance or self-heal? This is where it gets tricky. Let's start by saying that *by default* all traffic is on the interface specified on the volume create line. If we want to do something different... # ONLY redirect rebalance traffic. gluster volume route silly-vol rebalance back-end Now rebalance traffic goes through 5.6.7.8 instead. Is that intuitive? What about these? # Export a volume on multiple networks. gluster volume route silly-vol client user-net some-other-net # Redirect rebalance, self-heal, anything else we think of. gluster volume route silly-vol all-mgmt back-end # Redirect GLOBALLY instead of per volume. gluster cluster route rebalance back-end Does this seem like it's heading in the right direction? It doesn't look too bad to me, but my perspective is hardly typical. Is there something *users* would like to be able to do with multiple networks that can't be expressed this way, or is there some better way to define how these multiple networks should be used? Please let us know. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
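Gathering the syntax sketched above into one place (this is *proposed*, hypothetical 4.0-era syntax from the discussion, not commands that exist today):

```shell
# Define networks and probe each host on both of them.
gluster network add user-net 1.2.3.0/24
gluster network add back-end 5.6.0.0/16
gluster peer probe 1.2.3.4
gluster peer probe 5.6.7.8   # recognized as a second address of the same host

# Create a volume; by default all traffic uses the address given here.
gluster volume create silly-vol 1.2.3.4:/brick

# ONLY redirect rebalance traffic to the back-end network.
gluster volume route silly-vol rebalance back-end

# Export a volume on multiple networks.
gluster volume route silly-vol client user-net some-other-net

# Redirect rebalance, self-heal, and any other management traffic.
gluster volume route silly-vol all-mgmt back-end

# Redirect globally instead of per volume.
gluster cluster route rebalance back-end
```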
Re: [Gluster-users] hekafs.org is not accessible
The links on http://gluster.org/documentation/architecture/internals/Dougw:A_Newbie%27s_Guide_to_Gluster_Internals/ pointing to Jeff's tutorials on hekafs.org seem to be broken. Perhaps they are mirrored somewhere else? When I saw that the domain was going to expire a few months ago, I tried to find out if anyone at Red Hat would be interested in taking it over. Nobody seemed to be, so I let it lapse. Now it's in a state where the domain registrars wouldn't even let me revive it. Meanwhile, the files are all still accessible two ways: (1) Modify the URL to point to //pl.atyp.us/hekafs.org/... instead (2) Modify your /etc/hosts to have an entry for hekafs.org which is the same as pl.atyp.us (currently 162.243.99.140) With method (1) any secondary URLs e.g. for images are likely to be broken. With method (2) everything should be just as it would be if the domain were still alive. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
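For method (2), the /etc/hosts entry would look like this (using the address given above, which may of course have changed since):

```shell
# Point hekafs.org at the same address as pl.atyp.us,
# so old hekafs.org URLs resolve to the mirror.
echo '162.243.99.140  hekafs.org' >> /etc/hosts
```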
Re: [Gluster-users] Small files
To what extent is Gluster a good choice for the many-small-files scenario, as opposed to HDFS? Last I checked, HDFS would consume humongous memory resources if the cluster has many small files, given its architecture. There are some hackish solutions on top of HDFS for the case of many small files rather than huge files, but it would be nice to find a file system that matches that scenario well as is. So I wonder how would Gluster do when files are typically small. We're not as bad as HDFS, but it's still not what I'd call a good scenario for us. While we have good space efficiency for small files, and we don't have a single-metadata-server SPOF either, the price we pay is a hit to our performance for creates (and renames). There are several efforts under way to improve this, but there's only so much we can do when directory contents must be consistent across the volume despite being spread across many bricks (or replica sets). More details on those efforts are here. http://www.gluster.org/community/documentation/index.php/Features/Feature_Smallfile_Perf ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] Perfomance issue on a 90+% full file system
Yup, pretty common for us. Once we hit ~90% on either of our two production clusters (107 TB usable each), performance takes a beating. I don't consider this a problem, per se. Most file systems (clustered or otherwise) are the same. I consider a high water mark for any production file system to be 80% (and I consider that vendor agnostic), at which time action should be taken to begin clean up. That's good sysadminning 101. I can't think of a good reason for such a steep drop-off in GlusterFS. Sure, performance should degrade somewhat due to fragmenting, but not suddenly. It's not like Lustre, which would do massive preallocation and fall apart when there was no longer enough space to do that. It might be worth measuring average latency at the local-FS level, to see if the problem is above or below that line. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
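One quick way to take that measurement (a sketch; iostat comes from the sysstat package, and the await column reports per-request latency at the block layer under the local file system):

```shell
# Sample extended device statistics every 5 seconds.
# Compare 'await' (average ms per request) on a near-full
# brick's device vs. an emptier one to see whether the
# slowdown is below the local-FS line or above it.
iostat -x 5
```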
Re: [Gluster-users] To GlusterFS or not...
SSD has been considered but is not an option due to cost. SAS has been considered but is not an option due to the relatively small sizes of the drives. We are *rapidly* growing towards a PB of actual online storage. We are exploring raid controllers with onboard SSD cache which may help. We have had some pretty good results with those in the lab. They're not *always* beneficial, and getting the right SSD:disk ratio for your workload might require some experimentation, but it's certainly a good direction to explore. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] To GlusterFS or not...
The biggest issue that we are having, is that we are talking about -billions- of small (max 5MB) files. Seek times are killing us completely from what we can make out. (OS, HW/RAID has been tweaked to kingdom come and back). This is probably the key point. It's unlikely that seek times are going to get better with GlusterFS, unless it's because the new servers have more memory and disks, but if that's the case then you might as well just deploy more memory and disks in your existing scheme. On top of that, using any distributed file system is likely to mean more network round trips, to maintain consistency. There would be a benefit from letting GlusterFS handle the distribution (and redistribution) of files automatically instead of having to do your own sharding, but that's not the same as a performance benefit. I’m not yet too clued up on all the GlusterFS naming, but essentially if we do go the GlusterFS route, we would like to use non replicated storage bricks on all the front-end, as well as back-end servers in order to maximize storage. That's fine, so long as you recognize that recovering from a failed server becomes more of a manual process, but it's probably a moot point in light of the seek-time issue mentioned above. As much as I hate to discourage people from using GlusterFS, it's even worse to have them be disappointed, or for other users with other needs to be so as we spend time trying to fix the unfixable. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] Who's who ?
For new columns which may be useful, these ones spring to mind: * Twitter username - many people have them these days * A free form text description - eg I'm Justin, I'm into databases, storage, and developing embedded human augmentation systems. ;) * Some kind of thumbnail photo - probably as the first column on the left I think the current table is already quite wide, and adding more columns is going to be very problematic design-wise. Instead, I suggest that we make each person's name a link to their wiki user page, where they can put whatever contact or other info makes sense. I just did that for myself, and it barely takes more time than updating the Who's Who page itself (plus it cuts down on the update notifications for that page). ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0
> Has anyone looked into whether LogCabin can provide the consistent small storage based on Raft for Gluster? https://github.com/logcabin/logcabin I have no experience with using it so I cannot say if it is good or suitable. I do know the following project uses it; it's just not as easy to set up as Gluster is - it also has Zookeeper support etc. https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud

LogCabin is the canonical implementation of Raft, by the author of the Raft protocol, so it was the first implementation I looked at. Sad to say, it didn't seem that stable. AFAIK RAMCloud - itself an academic project - is the only user, whereas etcd and consul are being used by multiple projects and in production. Also, I found the etcd code at least more readable than LogCabin, despite the fact that I've worked in C++ before and had never seen any Go code until that time. Then again, those were early days for all three projects (consul didn't even exist yet) so things might have changed. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0
> Yes. I came across Salt recently, for unified management of storage (managing gluster and ceph), which is still in the planning phase. I could think of a complete set of infra requirements to solve, from glusterd to unified management. The Calamari ceph management tool already uses Salt. Salt (or any such infra) would be the ideal solution if gluster, ceph and unified management all used it.

I think the idea of using Salt (or similar) is interesting, but it's also key that Ceph still has its mon cluster as well. (Is mon calamari an *intentional* Star Wars reference?) As I see it, glusterd or anything we use to replace it has multiple responsibilities:

(1) Track the current up/down state of cluster members and resources.
(2) Store configuration and coordinate changes to it.
(3) Orchestrate complex or long-running activities (e.g. rebalance).
(4) Provide service discovery (current portmapper).

Salt and its friends clearly shine at (2) and (3), though they outsource the actual data storage to an external data store. With such a data store, (4) becomes pretty trivial. The sticking point for me is (1). How does Salt handle that need, or how might it be satisfied on top of the facilities Salt does provide? I can see *very* clearly how to do it on top of etcd or consul. Could those in fact be used for Salt's data store? It seems like Salt shouldn't need a full-fledged industrial-strength database, just something with high consistency/availability and some basic semantics. Maybe we should try to engage with the Salt developers to come up with ideas. Or find out exactly what functionality they found still needs to be in the mon cluster and not in Salt. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
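To make responsibility (1) concrete: here's a toy, in-memory model (plain Python - not the real consul or etcd API, and `TTLStore` is an invented name) of how liveness tracking falls out of an etcd/consul-style store with expiring keys. Each member registers a key with a TTL and heartbeats to refresh it; "up" is simply whichever keys haven't expired.

```python
class TTLStore:
    """Toy model of liveness tracking via expiring keys (consul/etcd style)."""

    def __init__(self):
        self.deadlines = {}   # node name -> model time at which its key expires
        self.now = 0.0        # model time, advanced explicitly

    def register(self, node, ttl):
        """Create or refresh a node's key; a heartbeat is just re-registering."""
        self.deadlines[node] = self.now + ttl

    def tick(self, seconds):
        self.now += seconds

    def alive(self):
        """Up/down state: nodes whose keys have not yet expired."""
        return {n for n, t in self.deadlines.items() if t > self.now}

store = TTLStore()
store.register("server1", ttl=10)
store.register("server2", ttl=10)
store.tick(6)
store.register("server1", ttl=10)   # server1 heartbeats; server2 does not
store.tick(6)
print(sorted(store.alive()))        # -> ['server1']
```

In the real systems this is roughly what consul sessions and etcd TTLs provide; the point is that a consistent store with expiring keys gives you membership tracking almost for free.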
Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0
> For a distributed store, I would think of MongoDB, which provides a distributed/replicated/highly available/master read-write/slave read-only database. Let's get what the community thinks about SaltStack and/or MongoDB.

I definitely do not think MongoDB is the right tool for this job. I'm not one of those people who just bash MongoDB out of fashion, either. I frequently defend them against such attacks, and I used MongoDB for some work on CloudForms a while ago. However, a full MongoDB setup carries a pretty high operational complexity, to support high scale and rich features... which we don't need. This part of our system doesn't need sharding. It doesn't need complex ad-hoc query capability. If we don't need those features, we *certainly* don't need the complexity that comes with them. We need something with the very highest levels of reliability and consistency, with as little complexity as possible to go with that. Even its strongest advocates would probably agree that MongoDB doesn't fit those requirements very well. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0
> Is there any reason not to consider zookeeper?

I did bring up that idea a while ago. I'm no Java fan myself, but still I was surprised by the vehemence of the reactions. To put it politely, many seemed to consider the dependency on Java unacceptable for both resource and security reasons. Some community members said that they'd be forced to switch to another DFS if we went that way. It didn't seem like a very promising direction to explore further. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0
> Isn't some of this covered by crm/corosync/pacemaker/heartbeat?

Sorta, kinda, mostly no. Those implement virtual synchrony, which is closely related to consensus but not quite the same even in a formal CS sense. In practice, using them is *very* different. Two jobs ago, I inherited a design based on the idea that if everyone starts at the same state and handles the same messages in the same order (in that case they were using Spread) then they'd all stay consistent. Sounds great in theory, right? Unfortunately, in practice it meant that returning a node which had missed messages to a consistent state was our problem, and it was an unreasonably complex one. Debugging failure-during-recovery problems in that code was some of the least fun I ever had at that job. A consensus protocol, with its focus on consistency of data rather than consistency of communication, seems like a better fit for what we're trying to achieve. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
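The failure mode described above fits in a few lines. This is a toy illustration (invented names, no Spread/corosync code) of why "same messages, same order" replication makes recovery the application's problem: one missed message and the replicas silently diverge.

```python
def apply(state, msg):
    """Apply one replicated message to a replica's state."""
    op, key, val = msg
    if op == "set":
        state[key] = val
    elif op == "incr":
        state[key] = state.get(key, 0) + val

messages = [("set", "x", 1), ("incr", "x", 2), ("set", "y", 5)]

replica_a = {}
replica_b = {}
for i, msg in enumerate(messages):
    apply(replica_a, msg)
    if i != 1:                 # replica B misses one message (e.g. a reboot)
        apply(replica_b, msg)

print(replica_a)  # -> {'x': 3, 'y': 5}
print(replica_b)  # -> {'x': 1, 'y': 5}: diverged, and nothing flags it
```

A consensus store sidesteps this by making the *data* authoritative: a recovering node reads the current committed state back, rather than the application having to replay every missed message correctly.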
Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0
> As part of the first phase, we aim to delegate the distributed configuration store. We are exploring consul [1] as a replacement for the existing distributed configuration store (the sum total of /var/lib/glusterd/* across all nodes). Consul provides a distributed configuration store which is consistent and partition tolerant. By moving all Gluster-related configuration information into consul we could avoid split-brain situations.

Overall, I like the idea. But I think you knew that. ;) Is the idea to run consul on all nodes as we do with glusterd, or to run it only on a few nodes (similar to Ceph's mon cluster) and then use them to coordinate membership etc. for the rest? ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
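For concreteness, a purely hypothetical sketch of how the contents of /var/lib/glusterd/* might map onto a consul-style KV namespace - every key name below is invented for illustration, not a proposed schema:

```
gluster/peers/<peer-uuid>            hostname, peer state
gluster/vols/<volname>/info          volume type, replica count, options
gluster/vols/<volname>/bricks/<n>    host:/path for each brick
gluster/ports/<brick-id>             portmapper data
```

Because consul serializes all updates through a single replicated log, two glusterds could no longer commit conflicting versions of the same volume info - which is exactly the configuration split-brain the proposal wants to avoid.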
Re: [Gluster-users] split-brain on glusterfs running with quorum on server and client
> I have a replicate glusterfs setup on 3 bricks (replicate = 3). I have client and server quorum turned on. I rebooted one of the 3 bricks. When it came back up, the client started throwing error messages that one of the files went into split brain.

This is a good example of how split brain can happen even with all kinds of quorum enabled. Let's look at those xattrs. BTW, thank you for a very nicely detailed bug report which includes those.

BRICK 1
=======
[root@ip-172-31-38-189 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
getfattr: Removing leading '/' from absolute path names
# file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
trusted.afr.PL2-client-0=0x000000000000000000000000
trusted.afr.PL2-client-1=0x000000010000000000000000
trusted.afr.PL2-client-2=0x000000010000000000000000
trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

BRICK 2
=======
[root@ip-172-31-16-220 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
getfattr: Removing leading '/' from absolute path names
# file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
trusted.afr.PL2-client-0=0x00000d460000000000000000
trusted.afr.PL2-client-1=0x000000000000000000000000
trusted.afr.PL2-client-2=0x000000000000000000000000
trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

BRICK 3
=======
[root@ip-172-31-12-218 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
getfattr: Removing leading '/' from absolute path names
# file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
trusted.afr.PL2-client-0=0x00000d460000000000000000
trusted.afr.PL2-client-1=0x000000000000000000000000
trusted.afr.PL2-client-2=0x000000000000000000000000
trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

Here, we see that brick 1 shows a single pending operation for the other two, while they show 0xd46 (3398) pending operations for brick 1. Here's how this can happen.

(1) There is exactly one pending operation.
(2) Brick1 completes the write first, and says so.
(3) Client sends messages to all three, saying to decrement brick1's count.
(4) All three bricks receive and process that message.
(5) Brick1 fails.
(6) Brick2 and brick3 complete the write, and say so.
(7) Client tells all bricks to decrement the remaining counts.
(8) Brick2 and brick3 receive and process that message.
(9) Brick1 is dead, so its counts for brick2/3 stay at one.
(10) Brick2 and brick3 have quorum, with all-zero pending counters.
(11) Client sends 0xd46 more writes to brick2 and brick3.

Note that at no point did we lose quorum. Note also the tight timing required. If brick1 had failed an instant earlier, it would not have decremented its own counter. If it had failed an instant later, it would have decremented brick2's and brick3's as well. If brick1 had not finished first, we'd be in yet another scenario. If delayed changelog had been operative, the messages at (3) and (7) would have been combined, leaving us in yet another scenario. As far as I can tell, we would have been able to resolve the conflict in all those cases.

*** Key point: quorum enforcement does not totally eliminate split brain. It only makes the frequency a few orders of magnitude lower. ***

So, is there any way to prevent this completely? Some AFR enhancements, such as the oft-promised outcast feature[1], might have helped. NSR[2] is immune to this particular problem. Policy-based split-brain resolution[3] might have resolved it automatically instead of merely flagging it. Unfortunately, those are all in the future. For now, I'd say the best approach is to resolve the conflict manually and try to move on. Unless there's more going on than meets the eye, recurrence should be very unlikely.
[1] http://www.gluster.org/community/documentation/index.php/Features/outcast
[2] http://www.gluster.org/community/documentation/index.php/Features/new-style-replication
[3] http://www.gluster.org/community/documentation/index.php/Features/pbspbr
___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
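The eleven-step scenario above can be checked mechanically. Below is a toy model of the AFR pending counters (plain Python, not real AFR code; all names are invented). It reproduces the end state seen in the getfattr output: brick1 blames each survivor once, and the survivors blame brick1 0xd46 times.

```python
# Each brick keeps a per-peer count of operations it believes are still
# pending (unacknowledged) on that peer -- the AFR changelog xattrs.
names = ("brick1", "brick2", "brick3")
bricks = {b: {p: 0 for p in names} for b in names}
alive = set(names)

def preop():
    """Before a write, every live brick increments every peer's count."""
    for b in alive:
        for p in names:
            bricks[b][p] += 1

def postop(peer):
    """After `peer` acks the write, every live brick decrements its count."""
    for b in alive:
        bricks[b][peer] -= 1

preop()                    # (1) exactly one pending operation
postop("brick1")           # (2)-(4) brick1 finishes first; all decrement it
alive.remove("brick1")     # (5) brick1 fails
postop("brick2")           # (6)-(9) survivors finish; only they process
postop("brick3")           #         the decrements; brick1's table is frozen
for _ in range(0xd46):     # (10)-(11) the surviving quorum keeps writing
    preop()
    postop("brick2")
    postop("brick3")       # brick1 never acks, so its count keeps growing

print(bricks["brick1"])             # -> {'brick1': 0, 'brick2': 1, 'brick3': 1}
print(bricks["brick2"]["brick1"])   # -> 3398 (0xd46)
```

Each side ends up blaming the other with nonzero counters - split brain - even though quorum was never lost at any step.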
Re: [Gluster-users] complete f......p thanks to glusterfs...applause, you crashed weeks of work
> ssl keys have to be 2048-bit fixed size

No, they don't.

> all keys have to be everywhere (all versions) - which noob programmed that??

That noob would be me. It's not necessary to have the same key on all servers, but using different ones would be even more complex and confusing for users. Instead, the servers authenticate to one another using a single identity. According to SSL 101, anyone authenticating as an identity needs the key for that identity, because it's really the key - not the publicly readable cert - that guarantees authenticity. If you want to set up a separate key+cert for each server, each one having a CA file for the others, you certainly can and it works. However, you'll still have to deal with distributing those new certs. That's inherent to how SSL works. Instead of forcing a particular PKI or cert-distribution scheme on users, the GlusterFS SSL implementation is specifically intended to let users make those choices.

> only control connection is encrypted

That's not true. There are *separate* options to control encryption for the data path, and in fact that code's much older. Why separate? Because the data-path usage of SSL is based on a different identity model - probably more what you expected, with a separate identity per client instead of a shared one between servers.

> At a certain point it also used tons of diskspace due to not deleting files in the .glusterfs directory (but still being connected and up serving volumes)

For a long time, the only internal conditions that might have caused the .glusterfs links not to be cleaned up were about 1000x less common than similar problems which arise when users try to manipulate files directly on the bricks. Perhaps if you could describe what you were doing on the bricks, we could help identify what was going on and suggest safer ways of achieving the same goals.

> IT WAS A LONG AND PAINFUL SYNCING PROCESS until i thought i was happy ;)

Syncing what?
I'm guessing a bit here, but it sounds like you were trying to do the equivalent of a replace-brick (or perhaps rebalance) by hand. As you've clearly discovered, such attempts are fraught with peril. Again, with some more constructive engagement perhaps we can help guide you toward safer solutions.

> Due to an online resizing of lvm/XFS under glusterfs (i watch the logs nearly all the time) i discovered mismatching disk layouts, realizing also that server1 was up and happy when you mount from it, but server2 spewed input/output errors on several directories (for now just in that volume)

The mismatching layout messages are usually the result of extended attributes that are missing from one brick's copy of a directory. It's possible that the XFS resize code is racy, in the sense that extended attributes become unavailable at some stage even though the directory itself is still accessible. I suggest that you follow up on that bug with the XFS developers, who are sure to be much more polite and responsive than we are.

> i tried to rename one directory, it created a recursive loop inside XFS (e.g. BIGGEST FILE-SYSTEM FAIL: TWO INODES linking to one dir, ideally containing another) i got at least the XFS loop solved.

Another one for the XFS developers.

> Then the pre-last resort option came up.. deleted the volumes, cleaned all xattr on that ~2T ... and recreated the volumes, since shd seems to work somehow since 3.4

You mention that you cleared all xattrs. Did you also clear out .glusterfs? In general, using anything but a completely empty directory tree as a brick can be a bit problematic.

> Maybe anyone has a suggestion, except create a new clean volume and move all your TB's.

More suggestions might have been available if you had sought them earlier. At this point, none of us can tell what state your volume is in, and there are many indications that it's probably a state none of us have ever seen or anticipated.
As you've found, attempting random fixes in such a situation often makes things worse. It would be irresponsible for us to suggest that you go down even more unknown and untried paths. Our first priority should be to get things back to a known and stable state. Unfortunately, the only such state at this point would seem to be a clean volume. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] Transparent encryption in GlusterFS: Implications on manageability
> I.1 Generating the master volume key
> The master volume key should be generated by the user on a trusted machine. Recommendations on master key generation are provided in section 6.2 of the manpages [1]. Generating the master volume key is the user's responsibility.

That was fine for an initial implementation, but it's still the single largest obstacle to adoption of this feature. Looking forward, we need to provide full CLI support for generating keys in the necessary format, specifying their location, etc.

> I.2 Location of the master volume key when mounting a volume
> At mount time the crypt translator searches for the master volume key on the client machine at the location specified by the respective translator option. If there is no key at the specified location, or the key at the specified location is in an improper format, then the mount will fail. Otherwise, the crypt translator loads the key into its private memory data structures. The location of the master volume key can be specified at volume creation time (see option master-key, section 6.7 of the man pages [1]). However, this option can be overridden by the user at mount time to specify another location, see section 7 of the manpages [1], steps 6, 7, 8.

Again, we need to improve on this. We should support this as a volume or mount option in its own right, not rely on the generic --xlator-option mechanism. Adding options to mount.glusterfs isn't hard. Alternatively, we could make this look like a volume option settable once through the CLI, even though the path is stored locally on the client. Or we could provide a separate special-purpose command/script, which again only needs to be run once. It would even be acceptable to treat the path to the key file (not its contents!) as a true volume option, stored on the servers. Any of these would be better than requiring the user to understand our volfile format and construction so that they can add the necessary option by hand.

> II. Check the graph of translators on your client machine after mount!
> During mount your client machine receives configuration info from the non-trusted server. In particular, this info contains the graph of translators, which can be subject to tampering, so that encryption won't be invoked for your volume at all. So it is highly important to verify this graph. After a successful mount make sure that the graph of translators contains the crypt translator with proper options (see FAQ#1, section 11 of the manpages [1]).

It is important to verify the graph, but not by poking through log files and not without more information about what to look for. So we got a volfile that includes the crypt translator, with some options. The *code* should ensure that the master-key option has the value from the command line or local config, and not some other. If we have to add special support for this in otherwise-generic graph initialization code, that's fine. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
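As a sketch of the kind of check being argued for here: parse the volfile the server sent and confirm the crypt translator is present with the master-key value the *client* expects. The `volume ... / type ... / option ... / end-volume` volfile syntax is gluster's own, but `crypt_ok` and the sample below are hypothetical, not part of any existing tool.

```python
def parse_volfile(text):
    """Parse gluster's simple volfile syntax into a list of translator dicts."""
    xlators, current = [], None
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "volume":
            current = {"name": words[1], "type": None, "options": {}}
        elif words[0] == "type":
            current["type"] = words[1]
        elif words[0] == "option":
            current["options"][words[1]] = " ".join(words[2:])
        elif words[0] == "end-volume":
            xlators.append(current)
    return xlators

def crypt_ok(volfile_text, expected_key_path):
    """True iff the graph contains encryption/crypt with our master-key path."""
    return any(x["type"] == "encryption/crypt"
               and x["options"].get("master-key") == expected_key_path
               for x in parse_volfile(volfile_text))

sample = """volume test-crypt
    type encryption/crypt
    option master-key /etc/ssl/master.key
end-volume
volume test-client-0
    type protocol/client
    option remote-host server1
end-volume"""

print(crypt_ok(sample, "/etc/ssl/master.key"))  # -> True
print(crypt_ok(sample, "/tmp/attacker.key"))    # -> False
```

The real fix belongs in graph initialization code, as argued above, but the check itself is about this simple.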
Re: [Gluster-users] User-serviceable snapshots design
>> * Since a snap volume will refer to multiple bricks, we'll need more brick daemons as well. How are *those* managed?
>
> This infra is handled by the core snapshot functionality/feature. When a snap is created, it is treated not only as an lvm2 thin-LV but as a glusterfs volume as well. The snap volume is activated, mounted and made available for regular use through the native fuse-protocol client. Management of these is not part of the USS feature, but handled as part of the core snapshot implementation.

If we're auto-starting snapshot volumes, are we auto-stopping them as well? According to what policy?

> USS (mainly the snapview-server xlator) talks to the snapshot volumes (and hence the bricks) through the glfs_t *, passing a glfs_object pointer.

So snapview-server is using GFAPI from within a translator? This caused a *lot* of problems in NSR reconciliation, especially because of how GFAPI constantly messes around with the THIS pointer. Does the USS work include fixing these issues?

If snapview-server runs on all servers, how does a particular client decide which one to use? Do we need to do something to avoid hot spots? Overall, it seems like having clients connect *directly* to the snapshot volumes once they've been started might have avoided some complexity or problems. Was this considered?

>> * How does snapview-server manage user credentials for connecting to snap bricks? What if multiple users try to use the same snapshot at the same time? How does any of this interact with on-wire or on-disk encryption?
>
> No interaction with on-disk or on-wire encryption. Multiple users can always access the same snapshot (volume) at the same time. Why do you see any restrictions there?

If we're using either on-disk or on-network encryption, client keys and certificates must remain on the clients. They must not be on servers. If the volumes are being proxied through snapview-server, it needs those credentials, but letting it have them defeats both security mechanisms.
Also, do we need to handle the case where the credentials have changed since the snapshot was taken? This is probably a more general problem with snapshots themselves, but still needs to be considered. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] User-serviceable snapshots design
>> * How do clients find it? Are we dynamically changing the client-side graph to add new protocol/client instances pointing to new snapview-servers, or is snapview-client using RPC directly? Are the snapview-server ports managed through the glusterd portmapper interface, or patched in some other way?
>
> Adding a protocol/client instance to connect to protocol/server at the daemon.

So now the client graph is being dynamically modified, in ways that make it un-derivable from the volume configuration (because they're based in part on user activity since then)? What happens if a normal graph switch (e.g. due to add-brick) happens? I'll need to think some more about what this architectural change really means. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] Proposal for improvements for heal commands
> 2) According to the feedback we got, the commands "gluster volume heal volname info healed/heal-failed" are not helpful in debugging anything. So I am thinking of deprecating these two commands. Reasons: the commands only give the last 1024 entries that succeeded/failed, so most of the time users need to inspect logs anyway.

Seems reasonable, though if it's just an issue of not keeping enough information to be useful, we could fix that by simply retaining more.

> 3) "gluster volume heal volname info split-brain" will be re-implemented to print all the files that are in split-brain instead of the limited 1024 entries. One constant complaint is that even after the file is fixed from split-brain, it may still show up in the previously cached output. In this implementation the goal is to remove all the caching and compute the results afresh.

This seems reasonable too. I can't help but wonder if it might be worth tracking split-brain files using a Merkle-tree approach like we did with xtime, so we could track any number of such files efficiently. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
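To illustrate the Merkle-tree idea (a hypothetical sketch, not the xtime implementation): hash each file's split-brain status, roll the hashes up the directory tree, and descend only into subtrees whose hashes differ, so locating the affected files costs a handful of comparisons rather than a full tree walk.

```python
import hashlib

def merkle(tree):
    """Hash a {name: subtree-dict | status-string} tree bottom-up."""
    h = hashlib.sha256()
    for name in sorted(tree):               # sorted for a canonical order
        child = tree[name]
        sub = merkle(child) if isinstance(child, dict) \
              else hashlib.sha256(child.encode()).hexdigest()
        h.update(name.encode() + sub.encode())
    return h.hexdigest()

before = {"logs": {"x.log": "clean", "y.log": "clean"},
          "www":  {"index": "clean"}}
after  = {"logs": {"x.log": "split-brain", "y.log": "clean"},
          "www":  {"index": "clean"}}

print(merkle(before) != merkle(after))                  # -> True: something changed
print(merkle(before["www"]) == merkle(after["www"]))    # -> True: skip this subtree
print(merkle(before["logs"]) == merkle(after["logs"]))  # -> False: descend here
```

The same pruning is what lets the approach scale to any number of split-brain files without caching 1024-entry snapshots of the answer.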
Re: [Gluster-users] User-serviceable snapshots design
>> Overall, it seems like having clients connect *directly* to the snapshot volumes once they've been started might have avoided some complexity or problems. Was this considered?
>
> Can you explain this in more detail? Are you saying that the virtual namespace overlay used by the current design can be reused along with returning extra info to clients, or is this a new approach where you make the clients much more intelligent than they are in the current approach?

Basically the clients would have the same intelligence that now resides in snapview-server. Instead of spinning up a new protocol/client to talk to a new snapview-server, they'd send a single RPC to start the snapshot brick daemons, then connect to those themselves. Of course, this exacerbates the problem with dynamically changing translator graphs on the client side, because now the dynamically added parts will be whole trees (corresponding to whole volfiles) instead of single protocol/client translators.

Long term, I think we should consider *not* handling these overlays as modifications to the main translator graph, but instead allowing multiple translator graphs to be active in the glusterfs process concurrently. For example, this greatly simplifies the question of how to deal with a graph change after we've added several overlays.

* Splice method: graph comparisons must be enhanced to ignore the overlays, overlays must be re-added after the graph switch takes place, etc.
* Multiple-graph method: just change the main graph (the one that's rooted at mount/fuse) and leave the others alone.

Stray thought: does any of this break when we're in an NFS or Samba daemon instead of a native-mount glusterfs daemon?

>> If we're using either on-disk or on-network encryption, client keys and certificates must remain on the clients. They must not be on servers. If the volumes are being proxied through snapview-server, it needs those credentials, but letting it have them defeats both security mechanisms. Also, do we need to handle the case where the credentials have changed since the snapshot was taken? This is probably a more general problem with snapshots themselves, but still needs to be considered.
>
> Agreed. Very nice point you brought up. We will need to think a bit more on this, Jeff.

This is what reviews are for. ;) Another thought: are there any interesting security implications because USS allows one user to expose *other users'* previous versions through the automatically mounted snapshot? ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] User-serviceable snapshots design
> client graph is not dynamically modified. the snapview-client and protocol/server are inserted by volgen and no further changes are made on the client side.

I believe Anand was referring to "Adding a protocol/client instance to connect to protocol/server at the daemon" as an action being performed by volgen. OK, so let's say we create a new volfile including connections for a snapshot that didn't even exist when the client first mounted. Are you saying we do a full graph switch to that new volfile? That still seems dynamic. Doesn't that still mean we need to account for USS state when we regenerate the next volfile after an add-brick (for example)? One way or another the graph's going to change, which creates a lot of state-management issues. Those need to be addressed in a reviewable design so everyone can think about it and contribute their thoughts based on their perspectives. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] User-serviceable snapshots design
>> Overall, it seems like having clients connect *directly* to the snapshot volumes once they've been started might have avoided some complexity or problems. Was this considered?
>
> Yes this was considered. I have mentioned the two reasons why this was dropped in the other mail.

I look forward to the next version of the design which reflects the new ideas since this email thread started. They were:

> a) snap view generation requires privileged ops to glusterd. So moving this task to the server side solves a lot of those challenges.

Not really. A server-side component issuing privileged requests whenever a client asks it to is no more secure than a client-side component issuing them directly. There needs to be some sort of authentication and authorization at the glusterd level (the only place these all converge). This is a more general problem that we've had with glusterd for a long time. If security is a sincere concern for USS, shouldn't we address it by trying to move the general solution forward? ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] [Gluster-devel] User-serviceable snapshots design
> No graph changes either on client side or server side. The snap-view-server will detect availability of new snapshots from glusterd, will spin up a new glfs_t for the corresponding snap, and start returning the new list of names in readdir(), etc.

I asked if we were dynamically changing the client graph to add new protocol/client instances. Here is Varun's answer:

> Adding a protocol/client instance to connect to protocol/server at the daemon.

Apparently the addition he mentions wasn't the kind I was asking about, but something that only occurs at normal volfile-generation time. Is that correct?

> No volfile/graph changes at all. Creation/removal of snapshots is handled in the form of a dynamic list of glfs_t's on the server side.

So we still have dynamically added graphs, but they're wrapped up in GFAPI objects? Let's be sure to capture that nuance in v2 of the spec. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] User-serviceable snapshots design
> Attached is a basic write-up of the user-serviceable snapshot feature design (Avati's). Please take a look and let us know if you have questions of any sort...

A few. The design creates a new type of daemon: snapview-server.

* Where is it started? One server (selected how) or all?
* How do clients find it? Are we dynamically changing the client-side graph to add new protocol/client instances pointing to new snapview-servers, or is snapview-client using RPC directly? Are the snapview-server ports managed through the glusterd portmapper interface, or patched in some other way?
* Since a snap volume will refer to multiple bricks, we'll need more brick daemons as well. How are *those* managed?
* How does snapview-server manage user credentials for connecting to snap bricks? What if multiple users try to use the same snapshot at the same time? How does any of this interact with on-wire or on-disk encryption?

I'm sure I'll come up with more later. Also, next time it might be nice to use the upstream feature proposal template *as it was designed*, to make sure that questions like these get addressed where the whole community can participate in a timely fashion. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] Would there be a use for cluster-specific filesystem tools?
(thanks to brain-dead Zimbra for the empty response before) Okay, so interest seems to be there. What tools would be useful? So far my list consists of:

1) du -sk or -s --si
2) rm -fr
3) find (or at least find -print)

What else would you add to this list? How about grep -r? ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
[Gluster-users] Inktank acquisition
As many of you have probably heard by now, we're joining forces with our good friends working on Ceph at Inktank. As one of the community's semi-official bloggers, here's my own take on this momentous event.

http://pl.atyp.us/2014-04-inktank-acquisition.html

(same thing inline, for convenience)

I know a lot of people are going to be asking me about Red Hat's acquisition of Inktank, so I've decided to collect some thoughts on the subject. The very very simple version is that **I'm delighted**. Occasional sniping back and forth notwithstanding, I've always been a huge fan of Ceph and the people working on it. This is great news. More details in a bit, but first I have to take care of some administrivia.

*Unlike everything else I have ever written here, this post has been submitted to my employer for approval prior to publication. I swear to you that it's still my own sincere thoughts, but I believe it's an ethical requirement for independent bloggers such as myself to be up front about any such entanglement no matter how slight the effect might have been. Now, on with the real content.*

As readers and conference-goers beyond number can attest, I've always said that Ceph and GlusterFS are allies in a common fight against common rivals. First, we've both stood against proprietary storage appliances, including both traditional vendors and the latest crop of startups. A little less obviously, we've also both stood for Real File Systems. Both projects have continued to implement and promote the classic file system API even as other projects (some even with the gall to put FS in their names) implement various stripped-down APIs that don't preserve the property of working with every script and library and application of the last thirty years. Not having to rewrite applications, or import/export data between various special-purpose data stores, is a **huge** benefit to users. Naturally, these two projects have a lot of similarities.
In addition to the file system API, both have tried to address object and block APIs as well. Because of their slightly different architectures and user bases, however, they've approached those interfaces in slightly different ways. For example, GlusterFS is files all the way down, whereas Ceph has separate bulk-data and metadata layers. GlusterFS distributes cluster management among all servers, while Ceph limits some of that to a dedicated monitor subset. Whether it's because of these technical differences or because of relationships or pure happenstance, the two projects have experienced different levels of traction in each of these markets. This has led to different lessons, and different ideas embedded in each project's code.

One of the nice things about joining forces is that we each gain even more freedom than before to borrow each other's ideas. Yes, they were both open source, so we could always do some of that, but it's not like we could have used one project's management console on top of the other's data path. GlusterFS using RADOS would have been unthinkable, as would Ceph using GFAPI. Now, all things are possible. In each area, we have the chance to take two sets of ideas and either converge on the better one or merge the two to come up with something even better than either was before. I don't know what the outcomes will be, or even what all of the pieces are that we'll be looking at, but I do know that there are some very smart people joining the team I'm on. Whenever that happens, all sorts of unpredictable good things tend to happen.

So, welcome to my new neighbors from the Ceph community. Come on in, make yourself comfortable by the fire, and let's have a good long chat.

___ Announce mailing list annou...@gluster.org http://supercolony.gluster.org/mailman/listinfo/announce
Re: [Gluster-users] Glusterfs Rack-Zone Awareness feature...
> When I create a new replicated volume, using only 2 nodes, I use this command line:
>
>     gluster volume create vol_name replica 2 transport tcp server1:/export/brick1/1 server2:/export/brick1/1
>
> server1 and server2 are in 2 different datacenters. Now, if I want to expand the gluster volume using 2 new servers (e.g. server3 and server4), I use these command lines:
>
>     gluster volume add-brick vol_name server3:/export/brick1/1
>     gluster volume add-brick vol_name server4:/export/brick1/1
>     gluster volume rebalance vol_name fix-layout start
>     gluster volume rebalance vol_name start
>
> How does the rebalance command work? How can I be sure that replicated data are not stored on servers hosted in the same datacenter?

Right now, bricks are grouped into replica sets in the order they are given, with the set size fixed by the replica count. Some of the infrastructure we need for the data classification task in 3.6 will allow us to relax that limitation and even support multiple replica counts within one volume, but for the simple/general case that probably won't be until 3.7 or later. For now, we ensure when a volume is created or bricks are added that the bricks within each replica set are not co-located.

Also, since this is the first time I've noticed a mention of multiple *data centers* (as opposed to multiple racks within one data center), it's important to note that AFR will fall down quite badly if the latency is greater than a few milliseconds. NSR will be much better at handling such environments, but won't be available for a while yet. Yeah, I'm also frustrated that all the good stuff always seems to be in the future. ;)
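The grouping rule described here — bricks form replica sets in the order given, replica-count at a time — can be sketched as follows (a simplified illustration, not actual GlusterFS code; the datacenter-lookup function is hypothetical):

```python
def replica_sets(bricks, replica_count):
    """Group bricks into replica sets in the order they were given."""
    return [bricks[i:i + replica_count]
            for i in range(0, len(bricks), replica_count)]

def colocated_sets(bricks, replica_count, datacenter_of):
    """Return the replica sets whose members share a datacenter.

    datacenter_of is a hypothetical mapping from a brick spec to a
    datacenter name; any set this flags would go offline with a
    single site failure.
    """
    return [s for s in replica_sets(bricks, replica_count)
            if len({datacenter_of(b) for b in s}) < len(s)]
```

So with replica 2, ['s1:/b', 's2:/b', 's3:/b', 's4:/b'] pairs s1 with s2 and s3 with s4 — which is why brick *order* is what controls cross-datacenter placement.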
Re: [Gluster-users] Glusterfs Rack-Zone Awareness feature...
> I do not understand why it could be a problem to place the data's replica on a different node group. If a group of nodes becomes unavailable (due to datacenter failure, for example), the volume should remain online, using the second group.

I'm not sure what you're getting at here. If you're talking about initial placement of replicas, we can place all members of each replica set in different node groups (e.g. racks). If you're talking about adding new replica members when a previous one has failed, then the question is *when*. Re-populating a new replica can be very expensive. It's not worth starting if the previously failed replica is likely to come back before you're done. We provide the tools (e.g. replace-brick) to deal with longer-term or even permanent failures, but we don't re-replicate automatically. Is that what you're talking about?
Re: [Gluster-users] Glusterfs Rack-Zone Awareness feature...
> I have a little question. I have read the glusterfs documentation looking for replication management. I want to be able to localize replicas on nodes hosted in 2 datacenters (dual-building). Couchbase provides the feature I'm looking for in GlusterFS: "Rack-Zone Awareness".
>
> https://blog.couchbase.com/announcing-couchbase-server-25
>
> "Rack-Zone Awareness - This feature will allow logical groupings of Couchbase Server nodes (where each group is physically located on a rack or an availability zone). Couchbase Server will automatically allocate replica copies of data on servers that belong to a group different from where the active data lives. This significantly increases reliability in case an entire rack becomes unavailable. This is of particular importance for customers running deployments in public clouds."
>
> Do you know if GlusterFS provides a similar feature? If not, do you plan to develop it in the near future?

There are two parts to the answer. Rack-aware placement in general is part of the data classification feature planned for the 3.6 release.

http://www.gluster.org/community/documentation/index.php/Features/data-classification

With this feature, files can be placed according to various policies using any of several properties associated with objects or physical locations. Rack-aware placement would use the physical location of a brick. Tiering would use the performance properties of a brick and the access time/frequency of an object. Multi-tenancy would use the tenant identity for both bricks and objects. And so on. It's all essentially the same infrastructure.

For replication decisions in particular, there needs to be another piece. Right now, the way we use N bricks with a replication factor of R is to define N/R replica sets, each containing R members. This is sub-optimal in many ways.
We can still compare the value or fitness of two replica sets for storing a particular object, but our options are limited to the replica sets as defined the last time bricks were added or removed. The differences between one choice and another effectively get smoothed out, and the load balancing after a failure is less than ideal. To do this right, we need to use more (overlapping) combinations of bricks. Some of us have discussed ways that we can do this without sacrificing the modularity of having distribution and replication as two separate modules, but there's no defined plan or date for that feature becoming available.

BTW, note that using *too many* combinations can also be a problem. Every time an object is replicated across a certain set of storage locations, it creates a coupling between those locations. Before long, all locations are coupled together, so that *any* failure of R-1 locations anywhere in the system will result in data loss or unavailability. Many systems, possibly including Couchbase Server, have made this mistake and become *less* reliable as a result. Emin Gün Sirer does a better job describing the problem - and solutions - than I do, here:

http://hackingdistributed.com/2014/02/14/chainsets/
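The coupling argument can be made concrete with a small count (purely illustrative, not GlusterFS code): with disjoint replica sets, only N/R of the possible R-node failure combinations destroy some set, while with per-object random placement nearly every combination eventually does.

```python
from itertools import combinations
import random

def fatal_fraction(copysets, n, r):
    """Fraction of all r-node failure combinations that wipe out
    every copy of some object, given the replica sets in use."""
    fatal = {frozenset(c) for c in copysets}
    all_combos = list(combinations(range(n), r))
    return sum(1 for c in all_combos if frozenset(c) in fatal) / len(all_combos)

n, r = 9, 3
# Disjoint replica sets (the N/R scheme described above): 3 fatal combos out of 84.
partitioned = [tuple(range(i, i + r)) for i in range(0, n, r)]
# Per-object random placement: almost every 3-node combination becomes fatal.
random.seed(0)
scattered = [tuple(random.sample(range(n), r)) for _ in range(500)]
```

Here fatal_fraction(partitioned, n, r) is only a few percent, while fatal_fraction(scattered, n, r) approaches one — exactly the trade-off the chainsets article analyzes.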
Re: [Gluster-users] Gluster 3.4.2 on Redhat 6.5
I see two separate bugs there.

1. A missing package requirement
2. The process hanging in a reproducible way.

I've submitted a fix for #2.

http://review.gluster.org/#/c/7360/
Re: [Gluster-users] Different brick sizes in a volume
On Tue, Mar 18, 2014 at 1:42 PM, Greg Waite wrote:

> I've been playing around with a 2x2 distributed replicated setup with replicating group 1 having a different brick size than replicating group 2. I've been running into out-of-disk errors when the smaller replicating pair's disks fill up. I know of the minimum free disk feature which should prevent this issue. My question is, are there features that allow gluster to smartly use different brick sizes so extra space on larger bricks does not go unused?
>
> It looks like different sized bricks will be a core feature in 3.6 (coming soon).

Correct. In fact, a lot of the logic already exists and is even in the tree.

http://review.gluster.org/#/c/3573/

The trick now is to get that logic integrated into the place where rebalance calculates the new layout. Until then, you could try running that script, but I should warn you that it hasn't been looked at for over a year, so you should try it out on a small test volume first to make sure it's still doing the right thing(s). I'll be glad to help with that.

Another thing you can do is to divide your larger bricks in two - or three, or whatever's necessary to even things out. This means more ports, more glusterfsd processes, quite possibly some performance loss as those contend with one another, but it's something you can do *right now* that's pretty easy and bullet-proof.
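The even-things-out arithmetic in that last paragraph can be sketched like this (my own illustration; the brick names and sizes are made up):

```python
def subdivision_counts(brick_sizes_gb):
    """How many sub-bricks to carve each disk into so all sub-bricks
    come out roughly the same size: divide every brick's size by the
    smallest brick and round to the nearest whole number."""
    smallest = min(brick_sizes_gb.values())
    return {brick: max(1, round(size / smallest))
            for brick, size in brick_sizes_gb.items()}
```

For example, subdivision_counts({'h1:/b1': 1000, 'h2:/b1': 2000}) suggests leaving the 1 TB disk as one brick and splitting the 2 TB disk into two, so DHT's equal-weight layout stops starving the big disk.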
Re: [Gluster-users] PLEASE READ ! We need your opinion. GSOC-2014 and the Gluster community
> I am a little bit impressed by the lack of action on this topic. I hate to be that guy, especially being new here, but it has to be done. If I've got this right, we have here a chance of developing Gluster even further, sponsored by Google, with a dedicated programmer for the summer. In other words, if we play our cards right, we can get a free programmer and at least a good start/advance on this fantastic project.

Welcome, Carlos. I think it's great that you're taking initiative here. However, it's also important to set proper expectations for what a GSoC intern could reasonably be expected to achieve. I've seen some amazing stuff out of GSoC, but if we set the bar too high then we end up with incomplete code and the student doesn't learn much except frustration.

GlusterFS consists of 430K lines of code in the core project alone. Most of it's written in a style that is generally hard for newcomers to pick up - both callback-oriented and highly concurrent, often using our own unique interpretation of standard concepts. It's also in an area (storage) that is not well taught in most universities. Given those facts and the short duration of GSoC, it's important to focus on projects that don't require deep knowledge of existing code, to keep the learning curve short and productive time correspondingly high. With that in mind, let's look at some of your suggestions.

> I think it would be nice to listen to the COMMUNITY (yes, that means YOU), for either suggestions, or at least a vote.

It certainly would have been nice to have you at the community IRC meeting yesterday, at which we discussed release content for 3.6 based on the feature proposals here:

http://www.gluster.org/community/documentation/index.php/Planning36

The results are here:

http://titanpad.com/glusterfs-3-6-planning

> My opinion, being also my vote, in order of PERSONAL preference:
>
> 1) There is a project going on (https://forge.gluster.org/disperse) that consists of re-writing the stripe module in gluster.
> This is especially important because it has a HUGE impact on Total Cost of Implementation (customer side), Total Cost of Ownership, and also matching what the competition has to offer. Among other things, it would allow gluster to implement a RAIDZ/RAID5 type of fault tolerance, much more efficient, and would, as far as I understand, allow you to use 3 nodes as a minimum for stripe+replication. This means 25% less money in computer hardware, with increased data safety/resilience.

This was decided as a core feature for 3.6. I'll let Xavier (the feature owner) answer w.r.t. whether there's any part of it that would be appropriate for GSoC.

> 2) We have a recurring issue with split-brain resolution. There is an entry on trello asking/suggesting a mechanism that arbitrates this resolution automatically. I pretty much think this could come together with another solution that is file replication consistency check.

This is also core for 3.6, under the name policy-based split brain resolution:

http://www.gluster.org/community/documentation/index.php/Features/pbspbr

Implementing this feature requires significant knowledge of AFR, which both causes split brain and would be involved in its repair. Because it's also one of our most complicated components, and the person who just rewrote it won't be around to offer help, I don't think this project *as a whole* would be a good fit for GSoC. On the other hand, there might be specific pieces of the policy implementation (not execution) that would be a good fit.

> 3) Accelerator node project. Some storage solutions out there offer an accelerator node, which is, in short, an extra node with a lot of RAM, eventually fast disks (SSD), that works like a proxy to the regular volumes. Active chunks of files are moved there, logs (ZIL style) are recorded on fast media, among other things. There is NO active project for this, or trello entry, because it is something I started discussing with a few fellows just a couple of days ago.
> I thought of starting to play with RAM disks (tmpfs) as scratch disks, but, since we have an opportunity to do something more efficient, or at the very least start it, why not?

Looks like somebody has read the Isilon marketing materials. ;)

A full production-level implementation of this, with cache consistency and so on, would be a major project. However, a non-consistent prototype good for specific use cases - especially Hadoop, as Jay mentions - would be pretty easy to build. Having a GlusterFS server (for the real clients) also be a GlusterFS client (to the real cluster) is pretty straightforward. Testing performance would also be a significant component of this, and IMO that's something more developers should learn about early in their careers. I encourage you to keep thinking about how this could be turned into a real GSoC proposal.

Keep the ideas coming!
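The non-consistent prototype described here boils down to a read-through cache in front of the real cluster. A toy sketch of the essential behavior (purely illustrative; no GlusterFS specifics, and the backend callback is hypothetical):

```python
from collections import OrderedDict

class ReadThroughCache:
    """Tiny LRU read-through cache, the essence of an 'accelerator
    node': serve hot reads from fast local storage (here, memory),
    falling back to the backing store on a miss.  Deliberately NOT
    consistent -- a writer going straight to the backend makes the
    cached copy stale, which is the trade-off discussed above."""

    def __init__(self, backend_read, capacity=1024):
        self.backend_read = backend_read  # e.g. a read from the real volume
        self.capacity = capacity
        self.cache = OrderedDict()

    def read(self, path):
        if path in self.cache:
            self.cache.move_to_end(path)    # mark as recently used
            return self.cache[path]
        data = self.backend_read(path)      # miss: go to the real cluster
        self.cache[path] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return data
```

Repeated reads of a hot file never touch the backend, which is the entire performance story — and also exactly where the consistency gap lives.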
Re: [Gluster-users] gfid files which are not hard links anymore
> Most likely reason is that someone deleted these files manually from the brick directories. You must never access/modify the data from the brick directories directly.

Unfortunately, that's exactly what users must do to resolve split-brain. Until we implement a mechanism for people to do this through the client mount, we need to make sure users know how to remove files properly themselves. Here are a couple of relevant blog posts.

http://www.gluster.org/2012/07/fixing-split-brain-with-glusterfs-3-3/
http://joejulian.name/blog/glusterfs-split-brain-recovery-made-easy/

There are also some efforts under way that should make this better in the future.

http://www.gluster.org/community/documentation/index.php/Features/pbspbr
http://www.gluster.org/2012/06/healing-split-brain/
http://review.gluster.org/#/c/4132/
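When a file is removed directly from a brick, its gfid hard link under .glusterfs has to be removed with it — otherwise you get exactly the orphaned gfid files this thread describes. A sketch of locating that link, assuming the standard brick layout where the gfid (from the file's trusted.gfid xattr) is linked at .glusterfs/<first two hex chars>/<next two>/<full uuid>:

```python
import os

def gfid_link_path(brick_root, gfid):
    """Return the path of the .glusterfs hard link for a gfid.

    gfid is the UUID string form of the file's trusted.gfid xattr
    (e.g. 'd3f18a01-...').  The link lives under .glusterfs in a
    two-level directory named by the first two byte pairs of the gfid.
    """
    return os.path.join(brick_root, '.glusterfs', gfid[0:2], gfid[2:4], gfid)
```

The gfid itself would come from something like os.getxattr(file_path, 'trusted.gfid') (16 raw bytes, formatted as a UUID); when fixing split-brain, remove the bad copy *and* this link together.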
Re: [Gluster-users] [Gluster-devel] 3.6 Feature Go/No-go in this week's community meeting
> Since the feature proposal freeze for 3.6 has happened, I am considering having the 3.6 feature go/no-go decision making as part of this week's community meeting on Wednesday. Does that seem acceptable to all? If yes, this agenda item can probably consume the entire 60 minutes and we might have to move other agenda items to next week.

Works for me. We really need to get this done.
Re: [Gluster-users] [Gluster-devel] Mechanisms for automatic management of Gluster
> This is along the lines of tools for sysadmins. I plan on using these algorithms for puppet-gluster, but will try to maintain them separately as a standalone tool.
>
> The problem: Given a set of bricks and servers, if they have a logical naming convention, can an algorithm decide the ideal order? This could allow parameters such as replica count, and chained=true/false/offset#.
>
> The second problem: Given a set of bricks in a volume, if someone adds X bricks and removes Y bricks, is this valid, and what is the valid sequence of add/remove brick commands?
>
> I've written some code with test cases to try and figure this all out. I've left out a lot of corner cases, but the boilerplate is there to make it happen. Hopefully it's self explanatory. (gluster.py) Read and run it.
>
> Once this all works, the puppet-gluster use case is magic. It will be able to take care of these operations for you (if you want). For non-puppet users, this will give admins the confidence to know what commands they should _probably_ run in what order. I say probably because we assume that if there's an error, they'll stop and inspect first.
>
> I haven't yet tried to implement the chained cases, or anything involving striping. There are also some corner cases with some of the current code. Once you add chaining and striping, etc., I realized it was time to step back and ask for help :)
>
> I hope this all makes sense. Comments, code, test cases are appreciated!

It's a good start. For the chained case, you'd probably want to start with something like this:

    # Convert the input into a list of lists like this:
    # [
    #   [ 'host1', [ 'path1', 'path2', ... ] ],
    #   [ 'host2', [ 'path1', 'path2', ... ] ],
    #   ...
    # ]
    out_list = []
    while in_list:
        first_host = in_list.pop()
        first_path = first_host[1].pop()
        # If there are any bricks left on this host, move the host to
        # the end so the next iteration will start with the next host.
        # Otherwise, we've used all bricks from this host so discard it.
        if first_host[1]:
            in_list.append(first_host)
        second_host = in_list[0]
        second_path = second_host[1].pop()
        # Have we exhausted this host as well?
        if not second_host[1]:
            del in_list[0]
        out_list.append({'host': first_host[0], 'path': first_path})
        out_list.append({'host': second_host[0], 'path': second_path})
    return out_list

(I haven't actually run this. It's merely illustrative of the algorithm.)

Can you spot the bug? If one host has more bricks than the others, it might run out of bricks on other hosts to pair with, so it'll end up pairing with itself. For example, consider the following input:

    H1P1, H1P2, H1P3, H1P4, H2P1, H2P2, H3P1, H3P2

This algorithm would yield the following replica pairs.

    H1P1 + H2P1
    H2P2 + H3P1
    H3P2 + H1P2
    H1P3 + H1P4 (oops)

Instead, we need to find this:

    H1P1 + H2P1
    H2P2 + H1P2
    H1P3 + H3P1
    H3P2 + H1P4

I would actually not try to deal with this in the loop above. Why not? Because that loop's already going to get a bit hairy when it's enhanced to handle replica counts greater than two. Instead, I would deal with the imbalance cases *up front* - check the number of bricks for each host, then equalize them e.g. by splitting a host with many bricks into two virtual hosts separated by enough others that they'll never pair with one another.

Alternatively, one could do a recursive implementation, roughly like this:

    if less than rep_factor hosts left, fail
    pick rep_factor bricks from different hosts
    loop:
        pass remainder to recursive call
        if result is valid, combine and return
        pick a *different* rep_factor bricks from different hosts

That will generate *some* valid order if any exists, but it will tend toward sub-optimal orders where e.g. all of X's bricks are paired with all of Y's instead of being spread around. There might be some sort of optimization pass we could do that would swap replica-set members to address this, but I'm sure you can see it's already becoming a hard problem.
I'd have to code up full versions of both algorithms and run them on many different inputs to say with any confidence which is better.
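For what it's worth, a runnable version of that recursive sketch might look like this (my own illustrative code, not part of gluster.py; brute-force backtracking, so only suitable for small brick counts):

```python
from itertools import combinations

def order_bricks(bricks, rep_factor):
    """Order (host, path) bricks so each consecutive group of
    rep_factor bricks -- one replica set -- spans distinct hosts.
    Returns None if no valid ordering exists."""
    if not bricks:
        return []
    if len(bricks) < rep_factor:
        return None  # leftover bricks can't form a full replica set
    for combo in combinations(range(len(bricks)), rep_factor):
        hosts = {bricks[i][0] for i in combo}
        if len(hosts) < rep_factor:
            continue  # two bricks from the same host; not a valid set
        rest = [b for i, b in enumerate(bricks) if i not in combo]
        tail = order_bricks(rest, rep_factor)
        if tail is not None:  # remainder worked out; combine and return
            return [bricks[i] for i in combo] + tail
    return None
```

On the H1P1..H3P2 example from the thread this finds an ordering where no replica pair is confined to one host, and it correctly fails when e.g. all bricks live on a single host.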
Re: [Gluster-users] Atomic file updates
> I'm not currently a Gluster user, but I'm hoping it's the answer to a problem I'm working on. I manage a private web site that is basically a reporting tool for equipment located at several hundred sites. Each site regularly uploads zipped XML files to a cloud-based server, which also provides a web interface to the data using apache/PHP. The problem I need to solve is that with a single server, disk I/O has become a bottleneck. The plan is to use a load balancer and multiple web servers with a 4-node Gluster volume behind them to store the data. Data would be replicated over 2 nodes.
>
> The uploaded files are stored and then unzipped ready for reading by the web interface code. Each file is unzipped into a temporary file and then renamed, e.g.
>
>     file1.xml.zip --unzip-- uniquename.tmp --rename-- file1.xml
>
> Use of the rename function makes these updates atomic. How can I achieve atomic updates in this way using a Gluster volume? My understanding is that renaming a file on a Gluster volume causes a link file to be created, and that clearly wouldn't be appropriate where there are frequent updates.

Creating a file with one name and then renaming it to another *might* cause creation of linkfiles, but I think concerns about linkfiles are often overblown. The one extra call to create a linkfile isn't much compared to those for creating the file, writing into it, and then renaming it even if the rename is local to one brick. What really matters is the performance of the entire sequence, with or without the linkfile.

That said, there's also a trick you can use to avoid creation of a linkfile. Other tools, such as rsync and our own object interface, use the same write-then-rename idiom. To serve them, there's an option called extra-hash-regex that can be used to place files on the right brick according to their final name even though they're created with another.
Unfortunately, specifying that option via the command line doesn't seem to work (it creates a malformed volfile), so you have to mount a bit differently. For example:

    glusterfs --volfile-server=a_server --volfile-id=a_volume \
        --xlator-option a_volume-dht.extra_hash_regex='(.*)tmp' \
        /a/mountpoint

The important part is the --xlator-option line. That causes any file with a tmp suffix to be hashed and placed as though only the part in the first parenthesized part of the regex (i.e. without the tmp) was there. Therefore, creating xxxtmp and then renaming it to xxx is the same as just creating xxx in the first place as far as linkfiles etc. are concerned. Note that the excluded part can be anything that a regex can match, including a unique random number. If I recall, rsync uses temp files something like this:

    fubar = .fubar.NNN (where NNN is a random number)

I know this probably seems a little voodoo-ish, but with a little bit of experimentation to find the right regex you should be able to avoid those dreaded linkfiles altogether.
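To see what the regex is doing, here's a quick check with plain Python re (this only illustrates the matching; the anchoring and exact DHT semantics in GlusterFS are as described above, and the helper name is mine):

```python
import re

# The extra hash regex from the mount example: everything before the
# "tmp" suffix is the name DHT hashes on, so "xxxtmp" is placed as
# though it were "xxx".  ($ anchor added here for the demo.)
EXTRA_HASH_REGEX = re.compile(r'(.*)tmp$')

def hashed_name(filename):
    """Return the name the regex trick would hash for placement."""
    m = EXTRA_HASH_REGEX.match(filename)
    return m.group(1) if m else filename
```

So a temp file named file1.xmltmp hashes to the same brick as its final name file1.xml, and the rename never needs a linkfile; non-matching names are hashed unchanged.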
Re: [Gluster-users] Atomic file updates
> Are you saying that with these mount options I can just write files directly without using flock or renaming a temporary file, and that other processes trying to read the file will always see a complete and consistent view of the file?

For write-once files, the rename is really the key to ensuring that readers never see an incomplete file. If you ever rewrite a file in place, you'll need flock to avoid reading a partially updated (i.e. inconsistent) file. Jay's suggestions might also be helpful even though they both have to do with metadata, because we use attributes to determine when it's necessary to re-read a file that might have changed. It's kind of up to you to determine which combination is needed to meet your own consistency goals with your own workload.
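The write-then-rename idiom this thread keeps coming back to looks like this in practice (a generic POSIX sketch, not Gluster-specific; the function name and default suffix are mine — the tmp suffix is chosen to line up with the extra-hash-regex example earlier in the thread):

```python
import os

def atomic_write(path, data, tmp_suffix='tmp'):
    """Write data so readers see either the old file or the new one,
    never a partial write: write a temp file in the same directory,
    flush it to disk, then rename over the target.

    Note: for concurrent writers you'd want a unique per-writer
    suffix; the plain suffix here matches the '(.*)tmp' hash-regex
    trick so the temp file lands on the final name's brick."""
    tmp = path + tmp_suffix
    with open(tmp, 'wb') as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, path)  # atomic within one filesystem on POSIX
```

This covers the write-once case; as noted above, rewriting a file in place still needs flock on both sides.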