[Gluster-devel] Log archiving for smoke runs as well

2014-05-23 Thread Kaushal M
I've changed the smoke.sh script to archive the logs on failure. The
archives will be saved into /d/logs/smoke and will be available for
download from [1].

I've also moved the location of the regression log archives from
/d/logs to /d/logs/regression. These will now be available for
download at [2].
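
The gist of the change is roughly the following (a hypothetical sketch only,
not the actual smoke.sh code; the glusterfs log directory and the archive
naming are assumptions):

    # Sketch: archive this run's logs only if the smoke run failed.
    RET=$?
    if [ $RET -ne 0 ]; then
        mkdir -p /d/logs/smoke
        tar -czf "/d/logs/smoke/smoke-run-$(date +%Y%m%d-%H%M%S).tar.gz" \
            /var/log/glusterfs
    fi
    exit $RET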

~kaushal

[1]: http://build.gluster.org:443/logs/smoke/
[2]: http://build.gluster.org:443/logs/regression/


Re: [Gluster-devel] Split-brain present and future in afr

2014-05-23 Thread Pranith Kumar Karampuri


- Original Message -
 From: Jeff Darcy jda...@redhat.com
 To: Pranith Kumar Karampuri pkara...@redhat.com
 Cc: Gluster Devel gluster-devel@gluster.org
 Sent: Tuesday, May 20, 2014 10:08:12 PM
 Subject: Re: [Gluster-devel] Split-brain present and future in afr
 
  1. Better protection for split-brain over time.
  2. Policy based split-brain resolution.
  3. Provide better availability with client quorum and replica 2.
 
 I would add the following:
 
 (4) Quorum enforcement - any kind - on by default.

For replica 3 we can do that. For replica 2, the quorum implementation at the
moment is not good enough. Until we fix it correctly, maybe we should let it
be. We can revisit that decision once we come up with a better solution for
replica 2.

 
 (5) Fix the problem of volumes losing quorum because unrelated nodes
 went down (i.e. implement volume-level quorum).
 
 (6) Better tools for users to resolve split brain themselves.

Agreed. Already planned for 3.6.

 
  For 3, we are planning to introduce arbiter bricks that can be used to
  determine quorum. The arbiter bricks will be dummy bricks that host only
  files that will be updated from multiple clients. This will be achieved by
   bringing about a variable replication count for a configurable class of files
  within a volume.
   In the case of a replicated volume with one arbiter brick per replica
   group,
   certain files that are prone to split-brain will be in 3 bricks (2 data
   bricks + 1 arbiter brick).  All other files will be present in the regular
   data bricks. For example, when oVirt VM disks are hosted on a replica 2
   volume, sanlock is used by oVirt for arbitration. sanlock lease files will
   be written by all clients, and VM disks are written by only a single client
   at any given point in time. In this scenario, we can place sanlock lease
   files on 2 data + 1 arbiter bricks. The VM disk files will only be present
   on the 2 data bricks. Client quorum is now determined by looking at 3
   bricks instead of 2 and we have better protection when network
   split-brains
   happen.
 
 Constantly filtering requests to use either N or N+1 bricks is going to be
 complicated and hard to debug.  Every data-structure allocation or loop
 based on replica count will have to be examined, and many will have to be
 modified.  That's a *lot* of places.  This also overlaps significantly
 with functionality that can be achieved with data classification (i.e.
 supporting multiple replica levels within the same volume).  What use case
 requires that it be implemented within AFR instead of more generally and
 flexibly?

1) It still wouldn't bring in an arbiter for replica 2.
2) That would need more bricks, more processes, more ports.

 
 

Pranith


[Gluster-devel] Gluster on OSX

2014-05-23 Thread Justin Clift
Hi guys,

Do you reckon we should get that Mac Mini in the Westford
lab set up to automatically test Gluster builds each
night or something?

If so, we should probably take/claim ownership of it,
upgrade the memory in it, and (possibly) see if it can be
put in the DMZ.

Thoughts?

+ Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift



Re: [Gluster-devel] Split-brain present and future in afr

2014-05-23 Thread Jeff Darcy
  Constantly filtering requests to use either N or N+1 bricks is going to be
  complicated and hard to debug.  Every data-structure allocation or loop
  based on replica count will have to be examined, and many will have to be
  modified.  That's a *lot* of places.  This also overlaps significantly
  with functionality that can be achieved with data classification (i.e.
  supporting multiple replica levels within the same volume).  What use case
  requires that it be implemented within AFR instead of more generally and
  flexibly?
 
 1) It still wouldn't bring in an arbiter for replica 2.

It's functionally the same, just implemented in a more modular fashion.
Either way, for the same set of data that was previously replicated
twice, most data would still be replicated twice but some subset would
be replicated three times.  The policy filter is just implemented in a
translator dedicated to the purpose, instead of within AFR.  In addition
to being simpler, this keeps the user experience consistent for setting
this vs. other kinds of policies.
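
To illustrate the idea (a purely hypothetical, volfile-style sketch; the
policy-filter translator, its option, and the subvolume names below do not
exist and are only meant to show where such a filter would sit in the graph):

    volume replica-2-subvol
        type cluster/afr
        subvolumes brick-a brick-b
    end-volume

    volume replica-3-subvol
        type cluster/afr
        subvolumes brick-a-split brick-b-split brick-c-arbiter
    end-volume

    volume policy-filter
        # hypothetical translator: route e.g. sanlock lease files to the
        # replica-3 subvolume, everything else to the replica-2 subvolume
        type features/policy-filter
        option route-pattern *.lease
        subvolumes replica-2-subvol replica-3-subvol
    end-volume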

 2) That would need more bricks, more processes, more ports.

Fewer, actually.  Either approach requires that we split bricks (as the
user sees them).  One way we turn N user bricks into N regular bricks
plus N/2 arbiter bricks.  The other way we turn N user bricks into N
bricks for the replica-2 part and another N for the replica-3 part.
That seems like slightly more, but (a) it's the same user view, and (b)
for processes and ports it will actually be less.  Since data
classification is likely to involve splitting bricks many times, and
multi-tenancy likewise, the data classification project is already
scoped to include multiplexing multiple bricks into one process on one
port (like HekaFS used to do).  Thus the total number of ports and
processes for an N-brick volume will go back down to N even with the
equivalent of arbiter functionality.
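
To make the counting concrete, take a hypothetical volume with 4 user-visible
bricks (the numbers follow directly from the N and N/2 figures above):

    arbiter inside AFR:    4 data bricks + 2 arbiter bricks = 6 internal bricks,
                           each presumably with its own process and port
    data classification:   4 replica-2 bricks + 4 replica-3 bricks = 8 internal
                           bricks, multiplexed back down to 4 processes and ports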

Doing replica 2.5 as part of data classification instead of within AFR
also has other advantages.  For example, it naturally gives us support
for overlapping replica sets - an often requested feature to spread load
more evenly after a failure.  Perhaps most importantly, it doesn't
require separate implementations or debugging for AFRv1, AFRv2, and NSR.

Let's for once put our effort where it will do us the most good, instead of
succumbing to the streetlight effect[1] yet again and hacking on the
components that are most familiar.

[1] http://en.wikipedia.org/wiki/Streetlight_effect


[Gluster-devel] mempool disabling for regression tests

2014-05-23 Thread Justin Clift
Hi Pranith,

You don't have an account on build.gluster.org yet do you?

It's where the current (not my stuff) regression tests are run.

This is the script presently used to build the regression
tests:

  $ more /opt/qa/build.sh
  #!/bin/bash

  set -e

  SRC=$(pwd);
  rpm -qa | grep glusterfs | xargs --no-run-if-empty rpm -e
  ./autogen.sh;
  P=/build;
  rm -rf $P/scratch;
  mkdir -p $P/scratch;
  cd $P/scratch;
  sudo rm -rf $P/install;
  $SRC/configure --prefix=$P/install --with-mountutildir=$P/install/sbin \
      --with-initdir=$P/install/etc --enable-bd-xlator=yes --silent
  make install CFLAGS="-g -O0 -Wall -Werror" -j 4 >/dev/null
  cd $SRC;

What needs to be done so that the mempool change is in effect
when the master branch is being compiled?

(note, my regression testing uses a slightly different version
of this stuff, here: https://forge.gluster.org/gluster-patch-acceptance-tests)

+ Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift



Re: [Gluster-devel] Log archiving for smoke runs as well

2014-05-23 Thread Kaushal M
Turns out this change isn't working as I had thought it would. Vijay
helped me identify the problem and I've made another change.
Hopefully, it works now.

~kaushal

On Fri, May 23, 2014 at 2:35 PM, Kaushal M kshlms...@gmail.com wrote:
 I've changed the smoke.sh script to archive the logs on failure. The
 archives will be saved into /d/logs/smoke and will be available for
 download from [1].

 I've also moved the location of the regression log archives from
 /d/logs to /d/logs/regression. These will now be available for
 download at [2].

 ~kaushal

 [1]: http://build.gluster.org:443/logs/smoke/
 [2]: http://build.gluster.org:443/logs/regression/


Re: [Gluster-devel] mempool disabling for regression tests

2014-05-23 Thread Kaushal M
On Fri, May 23, 2014 at 6:02 PM, Justin Clift jus...@gluster.org wrote:
 Hi Pranith,

 You don't have an account on build.gluster.org yet do you?

 It's where the current (not my stuff) regression tests are run.

 This is the script presently used to build the regression
 tests:

   $ more /opt/qa/build.sh
   #!/bin/bash

   set -e

   SRC=$(pwd);
   rpm -qa | grep glusterfs | xargs --no-run-if-empty rpm -e
   ./autogen.sh;
   P=/build;
   rm -rf $P/scratch;
   mkdir -p $P/scratch;
   cd $P/scratch;
   sudo rm -rf $P/install;
   $SRC/configure --prefix=$P/install --with-mountutildir=$P/install/sbin \
       --with-initdir=$P/install/etc --enable-bd-xlator=yes --silent

Adding the --enable-debug flag to configure should enable the debug build.
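
For example, the configure invocation quoted above would become something
like this (untested sketch):

    $SRC/configure --prefix=$P/install --with-mountutildir=$P/install/sbin \
        --with-initdir=$P/install/etc --enable-bd-xlator=yes --enable-debug \
        --silent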

   make install CFLAGS="-g -O0 -Wall -Werror" -j 4 >/dev/null
   cd $SRC;

 What needs to be done so that the mempool change is in effect
 when the master branch is being compiled?

 (note, my regression testing uses a slightly different version
 of this stuff, here: https://forge.gluster.org/gluster-patch-acceptance-tests)

 + Justin

 --
 Open Source and Standards @ Red Hat

 twitter.com/realjustinclift



Re: [Gluster-devel] Gluster on OSX

2014-05-23 Thread Harshavardhana

 Do you reckon we should get that Mac Mini in the Westford
 lab set up to automatically test Gluster builds each
 night or something?

 If so, we should probably take/claim ownership of it,
 upgrade the memory in it, and (possibly) see if it can be
 put in the DMZ.

Up to you guys; it would be great. I am doing it manually for now, once
every 2 days :-)

-- 
Religious confuse piety with mere ritual, the virtuous confuse
regulation with outcomes


Re: [Gluster-devel] Split-brain present and future in afr

2014-05-23 Thread Justin Clift
On 23/05/2014, at 10:17 AM, Pranith Kumar Karampuri wrote:
<snip>
 2) That would need more bricks, more processes, more ports.


Meh to more ports.  We should be moving to a model (maybe in 4.x?)
where we use fewer ports.  Preferably just one or two in total, if it's
feasible from a network layer.  Backup applications can manage it,
and they're transferring a tonne of data too. ;)

+ Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift



[Gluster-devel] Data classification proposal

2014-05-23 Thread Jeff Darcy
One of the things holding up our data classification efforts (which include
tiering but other things as well) has been the extension of the same
conceptual model from the I/O path to the configuration subsystem and
ultimately to the user experience.  How does an administrator define a tiering
policy without tearing their hair out?  How does s/he define a mixed
replication/erasure-coding setup without wanting to rip *our* hair out?  The
included Markdown document attempts to remedy this by proposing one out of many
possible models and user interfaces.  It includes examples for some of the most
common use cases, including the replica 2.5 case we've been discussing
recently.  Constructive feedback would be greatly appreciated.



# Data Classification Interface

The data classification feature is extremely flexible, to cover use cases from
SSD/disk tiering to rack-aware placement to security or other policies.  With
this flexibility comes complexity.  While this complexity does not affect the
I/O path much, it does affect both the volume-configuration subsystem and the
user interface to set placement policies.  This document describes one possible
model and user interface.

The model we used is based on two kinds of information: brick descriptions and
aggregation rules.  Both are contained in a configuration file (format TBD)
which can be associated with a volume using a volume option.

## Brick Descriptions

A brick is described by a series of simple key/value pairs.  Predefined keys
include:

 * **media-type**  
   The underlying media type for the brick.  In its simplest form this might
   just be *ssd* or *disk*.  More sophisticated users might use something like
   *15krpm* to represent a faster disk, or *perc-raid5* to represent a brick
   backed by a RAID controller.

 * **rack** (and/or **row**)  
   The physical location of the brick.  Some policy rules might be set up to
   spread data across more than one rack.

User-defined keys are also allowed.  For example, some users might use a
*tenant* or *security-level* tag as the basis for their placement policy.
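
For illustration, a brick description carrying a user-defined key might look
like this (hypothetical host and values, written in the same style as the
examples below):

    brick host4:/brick
        media-type = disk
        rack = rack-3
        tenant = acme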

## Aggregation Rules

Aggregation rules are used to define how bricks should be combined into
subvolumes, and those potentially combined into higher-level subvolumes, and so
on until all of the bricks are accounted for.  Each aggregation rule consists
of the following parts:

 * **id**  
   The base name of the subvolumes the rule will create.  If a rule is applied
   multiple times this will yield *id-0*, *id-1*, and so on.

 * **selector**  
   A filter for which bricks or lower-level subvolumes the rule will
   aggregate.  This is an expression similar to a *WHERE* clause in SQL, using
   brick/subvolume names and properties in lieu of columns.  These values are
   then matched against literal values or regular expressions, using the usual
   set of boolean operators to arrive at a *yes* or *no* answer to the question
   of whether this brick/subvolume is affected by this rule.

 * **group-size** (optional)  
   The number of original bricks/subvolumes to be combined into each produced
   subvolume.  The special default value zero means to collect all original
   bricks or subvolumes into one final subvolume.  In this case, *id* is used
   directly instead of having a numeric suffix appended.

 * **type** (optional)  
   The type of the generated translator definition(s).  Examples might include
   AFR to do replication, EC to do erasure coding, and so on.  The more
   general data classification task includes the definition of new translators
   to do tiering and other kinds of filtering, but those are beyond the scope
   of this document.  If no type is specified, cluster/dht will be used to do
   random placement among its constituents.

 * **tag** and **option** (optional, repeatable)  
   Additional tags and/or options to be applied to each newly created
   subvolume.  See the replica 2.5 example to see how this can be used.
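
Putting these parts together, a rule might look something like the following
(purely illustrative; the concrete file format is still TBD, so the keywords
and the selector expression here are assumptions):

    rule fast-tier
        select media-type = ssd
        group-size 2
        type cluster/afr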

Since each type might have unique requirements, such as ensuring that
replication is done across machines or racks whenever possible, it is assumed
that there will be corresponding type-specific scripts or functions to do the
actual aggregation.  This might even be made pluggable some day (TBD).  Once
all rule-based aggregation has been done, volume options are applied similarly
to how they are now.

Astute readers might have noticed that it's possible for a brick to be
aggregated more than once.  This is intentional.  If a brick is part of
multiple aggregates, it will be automatically split into multiple bricks
internally but this will be invisible to the user.

## Examples

Let's start with a simple tiering example.  Here's what the data-classification
config file might look like.

    brick host1:/brick
        media-type = ssd

    brick host2:/brick
        media-type = disk

    brick host3:/brick
        media-type = disk

    rule tier-1