[Gluster-devel] Log archiving for smoke runs as well
I've changed the smoke.sh script to archive the logs on failures. The archives will be saved into /d/logs/smoke and will be available for download from [1].

I've also moved the location of the regression log archives from /d/logs to /d/logs/regression. These will now be available for download at [2].

~kaushal

[1]: http://build.gluster.org:443/logs/smoke/
[2]: http://build.gluster.org:443/logs/regression/
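For context, the shape of such a change is roughly the following. This is an illustrative sketch only, not the actual smoke.sh diff; the test invocation and log paths are placeholders:

    #!/bin/bash
    # Illustrative sketch: archive GlusterFS logs when a smoke run fails.
    ARCHIVE_DIR=/d/logs/smoke
    LOG_SRC=/var/log/glusterfs        # placeholder for wherever the run writes its logs

    run_smoke_tests                   # placeholder for the real smoke test invocation
    RET=$?

    if [ $RET -ne 0 ]; then
        mkdir -p "$ARCHIVE_DIR"
        tar -czf "$ARCHIVE_DIR/smoke-$(date +%Y%m%d-%H%M%S).tar.gz" "$LOG_SRC"
    fi
    exit $RET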
Re: [Gluster-devel] Split-brain present and future in afr
- Original Message -
From: Jeff Darcy jda...@redhat.com
To: Pranith Kumar Karampuri pkara...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Tuesday, May 20, 2014 10:08:12 PM
Subject: Re: [Gluster-devel] Split-brain present and future in afr

1. Better protection for split-brain over time.
2. Policy based split-brain resolution.
3. Provide better availability with client quorum and replica 2.

> I would add the following:
>
> (4) Quorum enforcement - any kind - on by default.

For replica 3 we can do that. For replica 2, the quorum implementation at the moment is not good enough. Until we fix it correctly, maybe we should let it be. We can revisit that decision once we come up with a better solution for replica 2.

> (5) Fix the problem of volumes losing quorum because unrelated nodes went down (i.e. implement volume-level quorum).
>
> (6) Better tools for users to resolve split brain themselves.

Agreed. Already in plan for 3.6.

For 3, we are planning to introduce arbiter bricks that can be used to determine quorum. The arbiter bricks will be dummy bricks that host only files that will be updated from multiple clients. This will be achieved by bringing about a variable replication count for a configurable class of files within a volume. In the case of a replicated volume with one arbiter brick per replica group, certain files that are prone to split-brain will be on 3 bricks (2 data bricks + 1 arbiter brick). All other files will be present on the regular data bricks. For example, when oVirt VM disks are hosted on a replica 2 volume, sanlock is used by oVirt for arbitration. sanlock lease files will be written by all clients, while VM disks are written by only a single client at any given point in time. In this scenario, we can place the sanlock lease files on 2 data + 1 arbiter bricks; the VM disk files will only be present on the 2 data bricks. Client quorum is now determined by looking at 3 bricks instead of 2, and we have better protection when network split-brains happen.

> Constantly filtering requests to use either N or N+1 bricks is going to be complicated and hard to debug. Every data-structure allocation or loop based on replica count will have to be examined, and many will have to be modified. That's a *lot* of places. This also overlaps significantly with functionality that can be achieved with data classification (i.e. supporting multiple replica levels within the same volume). What use case requires that it be implemented within AFR instead of more generally and flexibly?

1) It still wouldn't bring in an arbiter for replica 2.
2) That would need more bricks, more processes, more ports.

Pranith
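To make the quorum arithmetic concrete, here is a purely illustrative sketch (not AFR code; written as a tiny shell check) of why adding one arbiter brick to a replica-2 group lets clients enforce a real majority:

    #!/bin/bash
    # Illustrative only -- not actual AFR logic. With 2 data bricks + 1 arbiter
    # per replica group, a client keeps allowing writes only while it can reach
    # a strict majority (2 of 3) of the bricks in that group.
    reachable=$1     # how many of the 3 bricks this client can currently reach
    total=3
    if [ "$reachable" -gt $((total / 2)) ]; then
        echo "quorum met: writes allowed"
    else
        echo "quorum lost: writes rejected, avoiding split-brain"
    fi

With only 2 bricks there is no situation in which a single surviving brick constitutes a majority, which is exactly the replica-2 weakness being discussed above.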
[Gluster-devel] Gluster on OSX
Hi guys,

Do you reckon we should get that Mac Mini in the Westford lab set up to automatically test Gluster builds each night or something?

If so, we should probably take/claim ownership of it, upgrade the memory in it, and (possibly) see if it can be put in the DMZ.

Thoughts?

+ Justin

--
Open Source and Standards @ Red Hat
twitter.com/realjustinclift
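If we do set it up, the job itself could be fairly small; a hypothetical sketch follows (the script name, paths, and schedule are all made up here, not an existing job):

    #!/bin/bash
    # Hypothetical nightly build sketch for the Mac Mini.
    # Could be driven by a crontab entry such as:
    #   0 2 * * * builder /opt/qa/nightly-build.sh >> /var/log/gluster-nightly.log 2>&1
    set -e
    cd /opt/src/glusterfs
    git fetch origin
    git checkout -f origin/master
    ./autogen.sh
    ./configure
    make -j 4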
Re: [Gluster-devel] Split-brain present and future in afr
Constantly filtering requests to use either N or N+1 bricks is going to be complicated and hard to debug. Every data-structure allocation or loop based on replica count will have to be examined, and many will have to be modified. That's a *lot* of places. This also overlaps significantly with functionality that can be achieved with data classification (i.e. supporting multiple replica levels within the same volume). What use case requires that it be implemented within AFR instead of more generally and flexibly?

> 1) It still wouldn't bring in an arbiter for replica 2.

It's functionally the same, just implemented in a more modular fashion. Either way, for the same set of data that was previously replicated twice, most data would still be replicated twice but some subset would be replicated three times. The policy filter is just implemented in a translator dedicated to the purpose, instead of within AFR. In addition to being simpler, this keeps the user experience consistent for setting this vs. other kinds of policies.

> 2) That would need more bricks, more processes, more ports.

Fewer, actually. Either approach requires that we split bricks (as the user sees them). One way, we turn N user bricks into N regular bricks plus N/2 arbiter bricks. The other way, we turn N user bricks into N bricks for the replica-2 part and another N for the replica-3 part. That seems like slightly more, but (a) it's the same user view, and (b) for processes and ports it will actually be less. Since data classification is likely to involve splitting bricks many times, and multi-tenancy likewise, the data classification project is already scoped to include multiplexing multiple bricks into one process on one port (like HekaFS used to do). Thus the total number of ports and processes for an N-brick volume will go back down to N, even with the equivalent of arbiter functionality.

Doing replica 2.5 as part of data classification instead of within AFR also has other advantages. For example, it naturally gives us support for overlapping replica sets - an often requested feature to spread load more evenly after a failure. Perhaps most importantly, it doesn't require separate implementations or debugging for AFRv1, AFRv2, and NSR. Let's for once put our effort where it will do us the most good, instead of succumbing to the streetlight effect[1] yet again and hacking on the components that are most familiar.

[1] http://en.wikipedia.org/wiki/Streetlight_effect
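To put rough numbers on the two approaches, using the N and N/2 figures above (illustrative only; exact counts depend on how bricks get split), take a 6-brick replica-2 volume, i.e. N = 6:

    Arbiter inside AFR:             6 data bricks + 3 arbiter bricks
                                    -> 9 brick processes and 9 ports
    Data classification with        6 replica-2 sub-bricks + 6 replica-3 sub-bricks,
    brick multiplexing:             multiplexed -> 6 processes and 6 ports
                                    (one per original user-visible brick)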
[Gluster-devel] mempool disabling for regression tests
Hi Pranith,

You don't have an account on build.gluster.org yet, do you? It's where the current (not my stuff) regression tests are run.

This is the script presently used to build the regression tests:

$ more /opt/qa/build.sh
#!/bin/bash
set -e
SRC=$(pwd);
rpm -qa | grep glusterfs | xargs --no-run-if-empty rpm -e
./autogen.sh;
P=/build;
rm -rf $P/scratch;
mkdir -p $P/scratch;
cd $P/scratch;
sudo rm -rf $P/install;
$SRC/configure --prefix=$P/install --with-mountutildir=$P/install/sbin --with-initdir=$P/install/etc --enable-bd-xlator=yes --silent
make install CFLAGS="-g -O0 -Wall -Werror" -j 4 >/dev/null
cd $SRC;

What needs to be done so that the mempool change is in effect when the master branch is being compiled?

(note, my regression testing uses a slightly different version of this stuff, here: https://forge.gluster.org/gluster-patch-acceptance-tests)

+ Justin

--
Open Source and Standards @ Red Hat
twitter.com/realjustinclift
Re: [Gluster-devel] Log archiving for smoke runs as well
Turns out this change isn't working as I had thought it would. Vijay helped me identify the problem and I've done another change. Hopefully it works now.

~kaushal

On Fri, May 23, 2014 at 2:35 PM, Kaushal M kshlms...@gmail.com wrote:
> I've changed the smoke.sh script to archive the logs on failures. The archives will be saved into /d/logs/smoke and will be available for download from [1].
>
> I've also moved the location of the regression log archives from /d/logs to /d/logs/regression. These will now be available for download at [2].
>
> ~kaushal
>
> [1]: http://build.gluster.org:443/logs/smoke/
> [2]: http://build.gluster.org:443/logs/regression/
Re: [Gluster-devel] mempool disabling for regression tests
On Fri, May 23, 2014 at 6:02 PM, Justin Clift jus...@gluster.org wrote:
> Hi Pranith,
>
> You don't have an account on build.gluster.org yet, do you? It's where the current (not my stuff) regression tests are run.
>
> This is the script presently used to build the regression tests:
>
> $ more /opt/qa/build.sh
> #!/bin/bash
> set -e
> SRC=$(pwd);
> rpm -qa | grep glusterfs | xargs --no-run-if-empty rpm -e
> ./autogen.sh;
> P=/build;
> rm -rf $P/scratch;
> mkdir -p $P/scratch;
> cd $P/scratch;
> sudo rm -rf $P/install;
> $SRC/configure --prefix=$P/install --with-mountutildir=$P/install/sbin --with-initdir=$P/install/etc --enable-bd-xlator=yes --silent

A --enable-debug flag to configure should enable the debug build.

> make install CFLAGS="-g -O0 -Wall -Werror" -j 4 >/dev/null
> cd $SRC;
>
> What needs to be done so that the mempool change is in effect when the master branch is being compiled?
>
> (note, my regression testing uses a slightly different version of this stuff, here: https://forge.gluster.org/gluster-patch-acceptance-tests)
>
> + Justin
>
> --
> Open Source and Standards @ Red Hat
> twitter.com/realjustinclift
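In other words, the change being suggested would touch the configure invocation in /opt/qa/build.sh. A sketch of the amended line, assuming --enable-debug is all that is needed (as Pranith notes) and keeping the rest of the arguments as they are:

    $SRC/configure --prefix=$P/install --with-mountutildir=$P/install/sbin \
        --with-initdir=$P/install/etc --enable-bd-xlator=yes --enable-debug --silent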
Re: [Gluster-devel] Gluster on OSX
> Do you reckon we should get that Mac Mini in the Westford lab set up to automatically test Gluster builds each night or something? If so, we should probably take/claim ownership of it, upgrade the memory in it, and (possibly) see if it can be put in the DMZ.

Up to you guys; it would be great. I am doing it manually for now, once in 2 days :-)

--
Religious confuse piety with mere ritual, the virtuous confuse regulation with outcomes
Re: [Gluster-devel] Split-brain present and future in afr
On 23/05/2014, at 10:17 AM, Pranith Kumar Karampuri wrote:
<snip>
> 2) That would need more bricks, more processes, more ports.

Meh to more ports. We should be moving to a model (maybe in 4.x?) where we use fewer ports. Preferably just one or two in total, if it's feasible at the network layer. Backup applications can manage it, and they're transferring a tonne of data too. ;)

+ Justin

--
Open Source and Standards @ Red Hat
twitter.com/realjustinclift
[Gluster-devel] Data classification proposal
One of the things holding up our data classification efforts (which include tiering as well as other features) has been the extension of the same conceptual model from the I/O path to the configuration subsystem and ultimately to the user experience. How does an administrator define a tiering policy without tearing their hair out? How does s/he define a mixed replication/erasure-coding setup without wanting to rip *our* hair out? The included Markdown document attempts to remedy this by proposing one out of many possible models and user interfaces. It includes examples for some of the most common use cases, including the replica 2.5 case we've been discussing recently. Constructive feedback would be greatly appreciated.

# Data Classification Interface

The data classification feature is extremely flexible, to cover use cases from SSD/disk tiering to rack-aware placement to security or other policies. With this flexibility comes complexity. While this complexity does not affect the I/O path much, it does affect both the volume-configuration subsystem and the user interface to set placement policies. This document describes one possible model and user interface.

The model we used is based on two kinds of information: brick descriptions and aggregation rules. Both are contained in a configuration file (format TBD) which can be associated with a volume using a volume option.

## Brick Descriptions

A brick is described by a series of simple key/value pairs. Predefined keys include:

* **media-type**
  The underlying media type for the brick. In its simplest form this might just be *ssd* or *disk*. More sophisticated users might use something like *15krpm* to represent a faster disk, or *perc-raid5* to represent a brick backed by a RAID controller.

* **rack** (and/or **row**)
  The physical location of the brick. Some policy rules might be set up to spread data across more than one rack.

User-defined keys are also allowed. For example, some users might use a *tenant* or *security-level* tag as the basis for their placement policy.

## Aggregation Rules

Aggregation rules are used to define how bricks should be combined into subvolumes, and those potentially combined into higher-level subvolumes, and so on until all of the bricks are accounted for. Each aggregation rule consists of the following parts:

* **id**
  The base name of the subvolumes the rule will create. If a rule is applied multiple times this will yield *id-0*, *id-1*, and so on.

* **selector**
  A filter for which bricks or lower-level subvolumes the rule will aggregate. This is an expression similar to a *WHERE* clause in SQL, using brick/subvolume names and properties in lieu of columns. These values are then matched against literal values or regular expressions, using the usual set of boolean operators to arrive at a *yes* or *no* answer to the question of whether this brick/subvolume is affected by this rule.

* **group-size** (optional)
  The number of original bricks/subvolumes to be combined into each produced subvolume. The special default value zero means to collect all original bricks or subvolumes into one final subvolume. In this case, *id* is used directly instead of having a numeric suffix appended.

* **type** (optional)
  The type of the generated translator definition(s). Examples might include AFR to do replication, EC to do erasure coding, and so on. The more general data classification task includes the definition of new translators to do tiering and other kinds of filtering, but those are beyond the scope of this document. If no type is specified, cluster/dht will be used to do random placement among its constituents.

* **tag** and **option** (optional, repeatable)
  Additional tags and/or options to be applied to each newly created subvolume. See the replica 2.5 example to see how this can be used.

Since each type might have unique requirements, such as ensuring that replication is done across machines or racks whenever possible, it is assumed that there will be corresponding type-specific scripts or functions to do the actual aggregation. This might even be made pluggable some day (TBD). Once all rule-based aggregation has been done, volume options are applied similarly to how they are now.

Astute readers might have noticed that it's possible for a brick to be aggregated more than once. This is intentional. If a brick is part of multiple aggregates, it will be automatically split into multiple bricks internally, but this will be invisible to the user.

## Examples

Let's start with a simple tiering example. Here's what the data-classification config file might look like.

    brick host1:/brick
        media-type = ssd
    brick host2:/brick
        media-type = disk
    brick host3:/brick
        media-type = disk
    rule tier-1