Re: [Gluster-users] which components needs ssh keys?

2018-01-05 Thread Jeff Darcy
> Have we deprecated SSL/TLS for the local I/O and management paths? The 
> code's still there, and I think I've even seen patches to it recently.

Never mind.  Saw that you were talking about ssh, not ssl. One more reason we 
should stop saying "ssl" and call it "tls" I guess.  ;)


Re: [Gluster-users] which components needs ssh keys?

2018-01-05 Thread Jeff Darcy


On Wed, Jan 3, 2018, at 3:23 AM, Aravinda wrote:
> Only Geo-replication uses SSH since it is between two Clusters. All 
> other features are limited to single Cluster/Volume, so communications 
> happens via Glusterd(Port tcp/24007 and brick ports(tcp/47152-47251))

Have we deprecated SSL/TLS for the local I/O and management paths? The code's 
still there, and I think I've even seen patches to it recently.



Re: [Gluster-users] ZFS with SSD ZIL vs XFS

2017-10-10 Thread Jeff Darcy
On Tue, Oct 10, 2017, at 11:19 AM, Gandalf Corvotempesta wrote:
> Anyone made some performance comparison between XFS and ZFS with ZIL
> on SSD, in gluster environment ?
> 
> I've tried to compare both on another SDS (LizardFS) and I haven't
> seen any tangible performance improvement.
> 
> Is gluster different ?

Probably not.  If there is, it would probably favor XFS.  The developers
at Red Hat use XFS almost exclusively.  We at Facebook have a mix, but
XFS is (I think) the most common.  Whatever the developers use tends to
become "the way local filesystems work" and code is written based on
that profile, so even without intention that tends to get a bit of a
boost.  To the extent that ZFS makes different tradeoffs - e.g. using
lots more memory, very different disk access patterns - it's probably
going to have a bit more of an "impedance mismatch" with the choices
Gluster itself has made.

If you're interested in ways to benefit from a disk+SSD combo under XFS,
it is possible to configure XFS with a separate journal device but I
believe there were some bugs encountered when doing that.  Richard
Wareing's upcoming Dev Summit talk on Hybrid XFS might cover those, in
addition to his own work on using an SSD in even more interesting ways.
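
For anyone who wants to experiment with that combination, the rough shape
of it is below.  Treat this as a sketch only - device names and the log
size are placeholders, and given the bugs mentioned above you should test
carefully before trusting real data to it:

   # create the XFS filesystem with its journal (log) on a separate SSD partition
   mkfs.xfs -l logdev=/dev/nvme0n1p1,size=128m /dev/sdb1

   # the external log device must also be named at mount time
   mount -o logdev=/dev/nvme0n1p1,noatime /dev/sdb1 /bricks/brick1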


Re: [Gluster-users] Gluster operations speed limit

2017-08-04 Thread Jeff Darcy


On Tue, Aug 1, 2017, at 06:16 AM, Alexey Zakurin wrote:
> I have a large distributed-replicated Glusterfs volume that contains a
> few hundred VM images. Between the servers there is a 20Gb/sec link.
> When I start some operations like healing or removing, storage 
> performance becomes too low for a few days and server load becomes like 
> this:
> 
> 13:06:32 up 13 days, 20:02,  3 users,  load average: 43.62, 31.75, 
> 23.53.
> 
> Is it possible to set a limit on these operations? Actually, the VMs on
> my cluster go offline when I start healing, rebalancing, or removing a
> brick.

In addition to the cgroups workaround that Mohit mentions, there are two
longer-term efforts in progress (that I'm aware of) to address this and
similar issues.

(1) Some folks at Red Hat are working on limiting the number of files
that SHD will heal at one time
(https://github.com/gluster/glusterfs/issues/255).

(2) At Facebook, we're working on a more general solution to apportion
I/O among any users of a system, where "users" might be real users or
internal pseudo-users such as self heal or rebalance
(https://github.com/gluster/glusterfs/issues/266).

Either or both of these might land in 4.0; we're still planning that
release, so no definite answer yet.
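
For completeness, the cgroups workaround amounts to capping the CPU
available to the self-heal or rebalance process.  A minimal sketch,
assuming cgroup v1 mounted under /sys/fs/cgroup and with the PID as a
placeholder:

   # create a cgroup limited to roughly two cores' worth of CPU time
   mkdir /sys/fs/cgroup/cpu/gluster-maintenance
   echo 100000 > /sys/fs/cgroup/cpu/gluster-maintenance/cpu.cfs_period_us
   echo 200000 > /sys/fs/cgroup/cpu/gluster-maintenance/cpu.cfs_quota_us

   # move the self-heal daemon (or rebalance process) into that cgroup
   echo <pid-of-glustershd> > /sys/fs/cgroup/cpu/gluster-maintenance/tasks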


Re: [Gluster-users] Add single server

2017-05-12 Thread Jeff Darcy


On Mon, May 1, 2017, at 02:34 PM, Gandalf Corvotempesta wrote:
> I'm still thinking that saving (I don't know where, I don't know how)
> a mapping between
> files and bricks would solve many issues and add much more flexibility.

Every system we've discussed has a map.  The differences are only in the
granularity, and how the map is stored.  Per-file maps inevitably become
a scaling problem, so a deterministic function is used to map individual
files into a much smaller number of buckets, placement groups, hash
ranges, or whatever.  Then information about those buckets and their
locations is stored somehow:

* Centrally - Lustre, HDFS, Moose/Lizard

* Distributed among a few servers - Ceph, possibly Gluster with DHT2

* Distributed among all servers - Gluster today

No matter which approach you use, you can manipulate the maps.  Without
changing the fundamental structure of Gluster, you could take a brick's
hash range and split it in two to create two bricks.  Then you could
quietly migrate the files in one brick to anywhere else in the
background.  That doesn't quite work today because the two bricks would
be trying to operate on the same directories, seeing each other's files,
etc.  Making it more transparent won't be easy, but the changes would be
pretty well localized to DHT.  Brick multiplexing can help too, because
it allows a volume to be created with many more bricks initially so
they'd already be in separate directories and ready to move.  Multiple
bricks living in one process also makes coordination during such
transitions much easier.  This has been part of my plan for years, not
only to support adding a single server but also to support more
sophisticated forms of tiering, quality of service, etc.

The big question as I see it is what we can do *in the near term* to
make N+1 addition easier on *existing* clusters.  That probably deserves
a separate answer, so I'll leave it for another time.






Re: [Gluster-users] What is the CLI NUFA option "local-volume-name" good for?

2017-04-28 Thread Jeff Darcy


On Fri, Apr 28, 2017, at 10:57 AM, Jan Wrona wrote:
> I've been struggling with NUFA for a while now and I know very well what 
> the "option local-volume-name brick" in the volfile does. In fact, I've 
> been using a filter to force gluster to use the local subvolume I want 
> instead of the first local subvolume it finds, but filters are very 
> unreliable. Recently I've found this bug [1] and thought that I'll 
> finally be able to set the NUFA's "local-volume-name" option 
> *per-server* through the CLI without the use of the filter, but no. This 
> options sets the value globally, so I'm asking what is the use of LOCAL 
> volume name set GLOBALLY with the same value on every server?

You're right that it would be a bit silly to set this using "gluster
volume set" but it makes much more sense as a command-line override
using "--xlator-option" instead.  Then it could in fact be different on
every client, even though it's not even set in the volfile.
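
As a sketch of what that override looks like - the translator and
subvolume names here are only illustrative, so take the real ones from
your client volfile:

   # native mount, forcing the preferred local subvolume on this client only
   mount -t glusterfs \
 -o xlator-option=myvol-dht.local-volume-name=myvol-client-2 \
 myserver:/myvol /mnt/myvol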


Re: [Gluster-users] Does glusterfs supports brick locality?

2017-04-24 Thread Jeff Darcy



On Mon, Apr 24, 2017, at 10:14 AM, atris adam wrote:
> I have two data centers in two different provinces. Each data center
> has 3 servers. I want to set up cloud storage with glusterfs. I want to
> make one glusterfs volume with this information:
>
> province "a" ==> 3 servers, each server has one 5TB brick (bricks
> numbered 1-3)
> province "b" ==> 3 servers, each server has one 5TB brick (bricks
> numbered 4-6)
>
> distributed gluster volume size: 30TB
>
> Does glusterfs support servers over long distances?
> If yes, when an end user in province "a" creates a file, where is the
> file allocated? I mean, which brick is selected to write the file to? I
> need a brick numbered 1-3 to be selected, because I think if glusterfs
> selects other bricks (4-6, which are in province "b"), there will be
> higher latency and unnecessary network traffic. Am I right?

The "nufa" translator/option exists for exactly this purpose, but it's a
bit more limited than what I think you want.   When creating a file, it
will create it in a brick *on the same node* if possible, but it doesn't
distinguish between other nodes near or far away.  This might be good
enough if you're using NFS as the access protocol, because in that case
the "same node" is the one running the NFS/Gluster proxy.

Re: [Gluster-users] Remove an artificial limitation of disperse volume

2017-02-07 Thread Jeff Darcy


- Original Message -
> Okay so the 4 nodes thing is a kind of exception? What about 8 nodes
> with redundancy 4?
> 
> I made a table to recap possible configurations, can you take a quick
> look and tell me if it's OK?
> 
> Here: https://gist.github.com/olivierlambert/8d530ac11b10dd8aac95749681f19d2c

As I understand it, the "power of two" thing is only about maximum
efficiency, and other values can work without wasting space (they'll
just be a bit slower).  So, for example, with 12 disks you would be
able to do 10+2 and get 83% space efficiency.  Xavier's the expert,
though, so it's probably best to let him clarify.
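
As a concrete sketch of that 10+2 layout - hostnames and brick paths are
made up, assuming twelve servers with one brick each:

   gluster volume create ecvol disperse-data 10 redundancy 2 \
 server{1..12}:/bricks/ecvol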


Re: [Gluster-users] Remove an artificial limitation of disperse volume

2017-02-07 Thread Jeff Darcy
> So far, I can't create a disperse volume if the redundancy level is
> 50% or more the number of bricks. I know that perfs would be better in
> dist/rep, but what if I prefer anyway to have disperse?
> 
> Conclusion: would it be possible to have a "force" flag during
> disperse volume creation even if redundancy is higher that 50%?

The problem is that the math behind erasure coding doesn't work for all
fragment counts and redundancy levels.  To get two-failure protection
you need more than four bricks.  If you had multiple disks in each
server you could get protection against multiple disk failures, but you
still wouldn't have protection against multiple server failures.  The
only thing your "force" flag could do is allow placement of multiple
fragments on a single physical disk, but then you wouldn't even have
protection against two disk failures.  If you want higher levels of
protection you need more disks, either to satisfy the mathematical
requirements of EC or to overcome the space inefficiency of replication.


Re: [Gluster-users] rebalance and volume commit hash

2017-01-17 Thread Jeff Darcy
> I don't understand why  new commit hash is generated for the volume during
> rebalance process? I think it should be generated only during add/remove
> brick events but not during rebalance.

The mismatch only becomes important during rebalance.  Prior to that, even
if we've added or removed a brick, the layouts haven't changed and the
optimization is still as valid as it was before.  If there are multiple
add/remove operations, we don't need or want to change the hash between
them.  Conversely, there are cases besides add/remove brick where we might
want to do a rebalance - e.g. after replace-brick with a brick of a
different size, or to change between total-space vs. free-space weighting.
Changing the hash in add/remove brick doesn't handle these cases, but
changing it at the start of rebalance does.

Re: [Gluster-users] rebalance and volume commit hash

2017-01-17 Thread Jeff Darcy
> Can you tell me please why every volume rebalance generates a new value
> for the volume commit hash?
> 
> If I have fully rebalanced cluster (or almost) with millions of
> directories then rebalance has to change DHT xattr for every directory
> only because there is a new volume commit hash value. It is pointless in
> my opinion. Is there any reason behind this? As I observed, the volume
> commit hash is set at the rebalance beginning which totally destroys
> benefit of lookup optimization algorithm for directories not
> scanned/fixed yet by this rebalance run.

It disables the optimization because the optimization would no longer
lead to correct results.  There are plenty of distributed filesystems
that seem to have "fast but wrong" as a primary design goal; we're
not one of them.

The best way to think of the volume-commit-hash update is as a kind of
cache invalidation.  Lookup optimization is only valid as long as we
know that the actual distribution of files within a directory is
consistent with the current volume topology.  That ceases to be the
case as soon as we add or remove a brick, leaving us with three choices.

(1) Don't do lookup optimization at all.  *Every* time we fail to find
a file on the brick where hashing says it should be, look *everywhere*
else.  That's how things used to work, and still work if lookup
optimization is disabled.  The drawback is that every add/remove brick
operation causes a permanent and irreversible degradation of lookup
performance.  Even on a freshly created volume, lookups for files that
don't exist anywhere will cause every brick to be queried.

(2) Mark every directory as "unoptimized" at the very beginning of
rebalance.  Besides being almost as slow as fix-layout itself, this
would require blocking all lookups and other directory operations
*anywhere in the volume* while it completes.

(3) Change the volume commit hash, effectively marking every
directory as unoptimized without actually having to touch every one.
The root-directory operation is cheap and almost instantaneous.
Checking each directory commit hash isn't free, but it's still a
lot better than (1) above.  With upcalls we can enhance this even
further.

Now that you know a bit more about the tradeoffs, do "pointless"
and "destroys the benefit" still seem accurate?



Re: [Gluster-users] Cheers and some thoughts

2017-01-05 Thread Jeff Darcy
> Both ceph and lizard manage this automatically.
> If you want, you can add a single disk to a working cluster and automatically
> the whole cluster is rebalanced transparently with no user intervention

This relates to the granularity problem I mentioned earlier.  As long as
we're not splitting bricks into smaller units, our flexibility to do things
like add a single disk is very limited and the performance impact of
rebalance is large.  Automatically triggering rebalance would avoid a
manual step, but it would just make the pain immediate instead of
prolonged.  ;)  When we start splitting bricks into tiles or bricklets
or whatever we want to call them, a lot of what you talk about will become
more feasible.


Re: [Gluster-users] Cheers and some thoughts

2017-01-05 Thread Jeff Darcy
> Gluster (3.8.7) coped perfectly - no data loss, no maintenance required,
> each time it came up by itself with no hand holding and started healing
> nodes, which completed very quickly. VM's on gluster auto started with
> no problems, i/o load while healing was ok. I felt quite confident in it.

Glad to hear that part went well.

> The alternate cluster fs - not so good. Many times running VM's were
> corrupted, several times I lost the entire filesystem. Also IOPS where
> atrocious (fuse based). It easy to claim HA when you exclude such things
> as power supply failures, dodgy network switches etc.

Too true.  Unfortunately, I think just about every distributed storage
system has to go through this learning curve, from not handling failure
at all to handling the simplest/easiest cases to handling the weird stuff
that real deployments can throw at you.  It's not just about the actual
failure handling, either.  Sometimes, it's about things you do in the
main I/O path, such as not throwing away O_SYNC flags to claim better
performance.  From the information you've provided, I'll bet that's where
your data corruption came from.

> I think glusters active/active quorum based design, where is every node
> is a master is a winner, active/passive systems where you have a SPOF
> master are difficult to DR manage.

Active/passive designs create a very tough set of tradeoffs.  Detecting
and responding to failures quickly enough, while also avoiding false
alarms, is like balancing on a knife edge.  Then there's problems with
overload turning into failure, with failback, etc.  It can all be done
right and work well, but it's *really* hard.  While I guess it's better
than nothing, experience has shown that active/active designs are easier
to make robust, and the techniques for doing so have been well known for
at least a decade or so.
 
> However :) Things I'd really like to see in Gluster:
> 
> - More flexible/easier management of servers and bricks (add/remove/replace)
> 
> - More flexible replication rules
> 
> One of the things I really *really* like with LizardFS is the powerful
> goal system and chunkservers. Nodes and disks can be trivially easily
> added/removed on the fly and chunks will be shuffled, replicated or
> deleted to balance the system. Individual objects can have difference
> goals (replication levels) which can also be changed on the fly and the
> system will rebalance them. Objects can even be changed from/to simple
> replication to Erasure Encoded objects.
> 
> I doubt this could be fitted to the existing gluster, but is there
> potential for this sort of thing in Gluster 4.0? I read the design docs
> and they look ambitious.

There used to be an idea called "data classification" to cover this
kind of case.  You're right that setting arbitrary goals for arbitrary
objects would be too difficult.  However, we could have multiple pools
with different replication/EC strategies, then use a translator like
the one for tiering to control which objects go into which pools based
on some kind of policy.  To support that with a relatively small
number of nodes/bricks we'd also need to be able to split bricks into
smaller units, but that's not really all that hard.

Unfortunately, although many of these ideas have been around for at
least a year and a half, nobody has ever been freed up to work on
them.  Maybe, with all of the interest in multi-tenancy to support
containers and hyperconvergence and whatever else, we might finally
be able to get these under way.


Re: [Gluster-users] Very slow writes through Samba mount to Gluster with crypt on

2016-12-20 Thread Jeff Darcy
> Is there some known formula for getting performance out of this stack, or is
> Samba with Glusterfs with encryption-at-rest just not that workable a
> proposition for now?

I think it's very likely that the combination you describe is not workable.
The crypt translator became an orphan years ago, when the author left a
highly idiosyncratic blob of code and practically no tests behind.  Nobody
has tried to promote it since then, and "at your own risk" has been the
answer for anyone who asks.  If you found it in the source tree and
decided to give it a try, I'm sorry.  Even though it's based in large part
on work I had done for HekaFS, I personally wouldn't trust it to keep my
data correctly let alone securely.


Re: [Gluster-users] [Gluster-devel] Community Meetings - Feedback on new meeting format

2016-11-17 Thread Jeff Darcy
> This has resulted in several good changes,
> a. Meetings are now more livelier with more people speaking up and
> making themselves heard.
> b. Each topic in the open floor gets a lot more time for discussion.
> c. Developers are sending out weekly updates of works they are doing,
> and linking those mails in the meeting agenda.

I agree with these points.  People seem much more engaged during the
meeting, which is a good thing.

> Though the response and attendance to the initial 2 meetings were
> good, it dropped for the last 2. This week in particular didn't have a
> lot of updates added to the meeting agenda. It seems like interest has
> dropped already.
> 
> We could probably do a better job of collecting updates to make it
> easier for people to add their updates, but the current format of
> adding updates to etherpad(/hackmd) is simple enough. I'd like to know
> if there is anything else preventing people from providing updates.

I'm one of the culprits here.  As an observation, not an excuse, I'll
point out that we were already missing lots of updates from people
who didn't even show up to the meetings.  Has the overall level of
missed updates gone up or down?  Has the level of attention paid to
them?  If people provide updates about as consistently, and those
updates are at least as detailed (possibly more because they're
written and meant to be read asynchronously), then we might actually
be *ahead* of where we were before.

The new format gets a big +1 from me.


Re: [Gluster-users] Automation of single server addition to replica

2016-11-09 Thread Jeff Darcy
> And that's why I really prefere gluster, without any metadata or
> similiar.
> But metadata servers aren't mandatory to achieve automatic rebalance.
> Gluster is already able to rebalance and move data around the cluster,
> and already has the tool to add a single server even in a replica 3.
> 
> What i'm asking is to automate this feature.  Gluster could be able to
> move bricks around without user intervention.

Some of us have thought long and hard about this.  The root of the
problem is that our I/O stack works on the basis of replicating bricks,
not files.  Changing that would be hard, but so is working with it.
Most ideas (like Joe's) involve splitting larger bricks into smaller
ones, so that the smaller units can be arranged into more flexible
configurations.  So, for example, let's say you have bricks X through Z
each split in two.  Define replica sets along the diagonal and place
some files A through L.

                     Brick X    Brick Y    Brick Z
                   +----------+----------+----------+
   Subdirectory 1  | A B C D  | E F G H  | I J K L  |
                   +----------+----------+----------+
   Subdirectory 2  | I J K L  | A B C D  | E F G H  |
                   +----------+----------+----------+

Now you want to add a fourth brick on a fourth machine.  Each (divided)
brick should then contain three files instead of four, so some will
have to move.  Here's one possibility, based on our algorithms to
maximize overlaps between the old and new DHT hash ranges.

                     Brick X    Brick Y    Brick Z    Brick W
                   +----------+----------+----------+----------+
   Subdirectory 1  | A B C    | D E F    | J K L    | G H I    |
                   +----------+----------+----------+----------+
   Subdirectory 2  | G H I    | A B C    | D E F    | J K L    |
                   +----------+----------+----------+----------+

Even trying to minimize data motion, a third of all the files have to be
moved.  This can be reduced still further by splitting the original
bricks into even smaller parts, and that actually meshes quite well with
the "virtual nodes" technique used by other systems that do similar
hash-based distribution, but it gets so messy that I won't even try to
draw the pictures.  The main point is that doing all this requires
significant I/O, with significant impact on other activity on the
system, so it's not necessarily true that we should just do it without
user intervention.

Can we automate this process?  Yes, and we should.  This is already in
scope for GlusterD 2.  However, in addition to the obvious recalculation
and rebalancing, it also means setting up the bricks differently even
when a volume is first created, and making sure that we don't
double-count available space on two bricks that are really on the same
disks or LVs, and so on.  Otherwise, the initial setup will seem simple
but later side-effects could lead to confusion.


Re: [Gluster-users] understanding dht value

2016-11-08 Thread Jeff Darcy
> Thanks for pointing to the article. I have been following the article all the
> way. What intrigues me is the dht values associated with sub directories.

> [root@glusterhackervm3 glus]# getfattr -n trusted.glusterfs.dht -e hex
> /brick2/vol
> getfattr: Removing leading '/' from absolute path names
> # file: brick2/vol
> trusted.glusterfs.dht=0x00017de2

> [root@glusterhackervm3 glus]# getfattr -n trusted.glusterfs.dht -e hex
> /brick2/vol/d/
> getfattr: Removing leading '/' from absolute path names
> # file: brick2/vol/d/
> trusted.glusterfs.dht=0x00017ffe

> Does it mean that only files whose DHT value ranges from 0x00 to 0x 7ffe
> can be saved inside the ‘d’ directory. But then it provides a very narrow
> range of 0x7de2 to 0x 7ffe to be created in that directory.

To know the distribution for a directory as seen by the user, you need to look 
at the xattrs for the matching directory on every brick. What the above shows 
us is that, *for this brick*: 

   in brick2/vol, this brick will store files with hashes from 7de2 to 


   in the subdirectory /brick2/vol/d, this brick will store files from  
to 7ffe

Note that this is *completely independent* for each directory, and only affects 
placement of non-directories.  There's no set-intersection going on between the 
ranges for a directory and its parent(s).  The subdirectory looks like what I'd 
expect to see for a volume with two (possibly replicated) DHT subvolumes.  The 
root looks a lot weirder - almost, but not quite, the top half of the hash 
distribution.  I think this could happen if there had been multiple brick 
additions/removals and rebalances, but I'm not sure what that combination would 
have to be.  If .../d had been created later, it would be unaffected by all of 
these prior actions and would still get an exactly-half share of the hash range.
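
To see the full picture for a directory, run the same getfattr on every
brick and line the ranges up - something like this, with the host names
as placeholders patterned on the one in your output:

   for h in glusterhackervm1 glusterhackervm2 glusterhackervm3; do
 ssh $h getfattr -n trusted.glusterfs.dht -e hex /brick2/vol/d
   done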


Re: [Gluster-users] How gluster parallelize reads

2016-10-03 Thread Jeff Darcy
> Anyway, in gerrit you are talking about "local" reads. How could you
> have a "local" read? This would be possible only mounting the volume
> locally on a server. is this a supported configuration?

Whether or not it's supported for native protocol, it's a common case
when using NFS or SMB with the servers for those protocols appearing
as native-protocol clients on the server machines.

> Probably, a "priority" could be added in mount option, so that when
> mounting the gluster volume i can set the preferred host for reads.
> 
> Something like this:
> 
> mount -t glusterfs -o preferred-read-host=1.2.3.4 server1:/test-volume
> /mnt/glusterfs

It's a great idea that would work well for a volume containing a single
replica set, but what about when that volume contains multiple?  Specify
a preferred read source for each?  Even that will get tricky when we
start to work around the limitation of adding bricks in multiples of the
replica count.  Then we'll be building new replica sets "automatically"
so the user would have to keep re-examining the volume structure to
decide on a new priority list.  Also, what should we do if that priority
list is "pathological" in the sense of creating unnecessary hot spots?
Should we accept it as an expression of the user's will anyway, or
override it to ensure continued smooth operation?

IMO we should try harder to find the right answers *autonomously*,
perhaps based on user-specified relationships between client networks
and servers.  (Ceph does some of this in their CRUSH maps, but I think
that conflates separate problems of managing placement and traffic.)  To
look at it another way, we'd be doing the same calculations the user
might do to create that explicit priority list, except we'd be in a
better position to *re*calculate that list when appropriate.  We're
thinking about some of this in the context of handling multiple networks
better in general, but it's still a bit of a research effort because
AFAICT nobody else has come up with much empirically-backed research to
guide solutions.



Re: [Gluster-users] How gluster parallelize reads

2016-10-03 Thread Jeff Darcy
> > 0 means use the first server to respond I think - at least that's my guess
> > of what "first up server" means
> > 1 hashed by GFID,  so clients will use the same server for a given file but
> > different files may be accessed from different nodes.
> 
> I think that 1 is better.
> Why "0" is the default ?

Basic storage-developer conservatism.  Zero was the behavior before
read-hash-mode was implemented.  As strongly as some of us might believe
that such tweaks lead to better behavior - as I did with this one in
2012[1] - we've kind of learned the hard way that existing users often
disagree with our estimations.  Thus, new behavior is often kept as a
"special" for particular known environments or use cases, and the
default is left unchanged until there's clear feedback indicating it
should be otherwise.

[1] http://review.gluster.org/#/c/2926/


Re: [Gluster-users] EC clarification

2016-09-21 Thread Jeff Darcy
> 2016-09-21 20:56 GMT+02:00 Serkan Çoban :
> > Then you can use 8+3 with 11 servers.
> 
> Stripe size won't be good: 512*(8-3) = 2560 and not 2048 (or multiple)

It's not really 512*(8+3) though.  Even though there are 11 fragments,
they only contain 8 fragments' worth of data.  They just encode it with
enough redundancy that *any* 8 contains the whole.
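
Concretely, if I have the arithmetic right: with 8+3, each stripe holds
8 x 512 = 4096 bytes of user data, encoded into eleven 512-byte
fragments, so the effective stripe width is still 4096 bytes (a multiple
of 2048), not 2560.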

Re: [Gluster-users] [Gluster-devel] CFP for Gluster Developer Summit

2016-08-22 Thread Jeff Darcy
Two proposals, both pretty developer-focused.

(1) Gluster: The Ugly Parts
Like any code base of its size and age, Gluster has accumulated its share of dead, 
redundant, or simply inelegant code.  This code makes us more vulnerable to 
bugs, and slows our entire development process for any feature.  In this 
interactive discussion, we'll identify translators or other modules that can be 
removed or significantly streamlined, and develop a plan for doing so within 
the next year or so.  Bring your favorite gripes and pet peeves (about the 
code).

(2) Gluster Debugging
Every developer has their own "bag of tricks" for debugging Gluster code - 
things to look for in logs, options to turn on, obscure test-script features, 
gdb macros, and so on.  In this session we'll share many of these tricks, and 
hopefully collect more, along with a plan to document them so that newcomers 
can get up to speed more quickly.


I could extend #2 to cover more user/support level problem diagnosis, but I 
think I'd need a co-presenter for that because it's not an area in which I feel 
like an expert myself.


Re: [Gluster-users] [Gluster-devel] One client can effectively hang entire gluster array

2016-07-08 Thread Jeff Darcy
> In either of these situations, one glusterfsd process on whatever peer the
> client is currently talking to will skyrocket to *nproc* cpu usage (800%,
> 1600%) and the storage cluster is essentially useless; all other clients
> will eventually try to read or write data to the overloaded peer and, when
> that happens, their connection will hang. Heals between peers hang because
> the load on the peer is around 1.5x the number of cores or more. This occurs
> in either gluster 3.6 or 3.7, is very repeatable, and happens much too
> frequently.

I have some good news and some bad news.

The good news is that features to address this are already planned for the
4.0 release.  Primarily I'm referring to QoS enhancements, some parts of
which were already implemented for the bitrot daemon.  I'm still working
out the exact requirements for this as a general facility, though.  You
can help!  :)  Also, some of the work on "brick multiplexing" (multiple
bricks within one glusterfsd process) should help to prevent the thrashing
that causes a complete freeze-up.

Now for the bad news.  Did I mention that these are 4.0 features?  4.0 is
not near term, and not getting any nearer as other features and releases
keep "jumping the queue" to absorb all of the resources we need for 4.0
to happen.  Not that I'm bitter or anything.  ;)  To address your more
immediate concerns, I think we need to consider more modest changes that
can be completed in more modest time.  For example:

 * The load should *never* get to 1.5x the number of cores.  Perhaps we
   could tweak the thread-scaling code in io-threads and epoll to check
   system load and not scale up (or even scale down) if system load is
   already high.

 * We might be able to tweak io-threads (which already runs on the
   bricks and already has a global queue) to schedule requests in a
   fairer way across clients.  Right now it executes them in the
   same order that they were read from the network.  That tends to
   be a bit "unfair" and that should be fixed in the network code,
   but that's a much harder task.

These are only weak approximations of what we really should be doing,
and will be doing in the long term, but (without making any promises)
they might be sufficient and achievable in the near term.  Thoughts?
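
P.S. For anyone who wants a stopgap today: the knobs that already exist
are blunt, but they can take some of the edge off.  A sketch - verify the
option names with "gluster volume set help" on your version:

   # fewer worker threads per brick means less opportunity to thrash
   gluster volume set myvol performance.io-thread-count 16

   # fewer epoll threads on servers and clients (3.7 and later)
   gluster volume set myvol server.event-threads 2
   gluster volume set myvol client.event-threads 2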


[Gluster-users] Securing GlusterD management

2016-07-06 Thread Jeff Darcy
As some of you might already have noticed, GlusterD has been notably insecure 
ever since it was written.  Unlike our I/O path, which does check access 
control on each request, anyone who can craft a CLI RPC request and send it to 
GlusterD's well known TCP port can do anything that the CLI itself can do.  TLS 
support was added for GlusterD a while ago, but it has always been a bit 
problematic and as far as I know hasn't been used much.  It's a bit of a 
chicken-and-egg problem.  Nobody wants to use a buggy or incomplete feature, 
but as long as nobody's using it there's little incentive to improve it.

Recently, there have been some efforts to add features which would turn the 
existing security problem into a full-fledged "arbitrary code execution" 
vulnerability (as the security folks would call it).  These efforts have been 
blocked, but they have also highlighted the fact that we're *long* past the 
point where we should have tried to make GlusterD more secure.  To that end, 
I've submitted the following patch to make TLS mandatory for all GlusterD 
communication, with some very basic authorization for CLI commands.

   http://review.gluster.org/#/c/14866/

The technical details are in the commit message, but the salient point is that 
it requires *zero configuration* to get basic authentication and encryption.  
This is equivalent to putting a lock on the door.  Sure, maybe everybody knows 
the default combination, but *at least there's a lock* and people who want to 
secure their systems can change the combination to whatever they want.  That's 
better than the door hanging open, without even a solid attachment point for a 
lock, and it's essential infrastructure for anything else we might do.  The 
patch also fixes some bugs that affect even today's optional TLS implementation.

One significant downside of this change has to do with rolling upgrades.  While 
it might be possible for those who are already using TLS to do a rolling 
upgrade, it would still require some manual steps.  The vast majority of users 
who haven't enabled TLS will be unable to upgrade without "stopping the world" 
(as is already the case for enabling TLS).

I'd appreciate feedback from users on both the positive and negative aspects of 
this change.  Should it go into 3.9?  Should it be backported to 3.8?  Or 
should it wait until 4.0?  Feedback from developers is also appreciated, though 
at this point I think any problems with the patch itself have already been 
resolved to the point where GlusterFS with the patch is more stable than 
GlusterFS without it.  I'm just fighting through some NetBSD testing issues at 
this point, hoping to make that situation better as well.
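
For anyone who wants to try the optional TLS support that already exists,
the setup is roughly as follows.  The paths are the defaults as I remember
them - adjust for your distribution, and treat this as a sketch rather
than a complete guide:

   # on every node: key, signed certificate, and CA bundle
   #   /etc/ssl/glusterfs.key  /etc/ssl/glusterfs.pem  /etc/ssl/glusterfs.ca

   # enable TLS on the I/O path for a volume
   gluster volume set myvol client.ssl on
   gluster volume set myvol server.ssl on

   # enable TLS on the GlusterD management path
   touch /var/lib/glusterd/secure-access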


Re: [Gluster-users] files on glusterfs disappears

2016-06-07 Thread Jeff Darcy
> Thank you for the answer,
> if I have understood you suggest to disable NUFA to verify if this is
> the problem originator,
> is it correct?

That would certainly provide a very useful data point.


Re: [Gluster-users] files on glusterfs disappears

2016-06-06 Thread Jeff Darcy
> This could be because of nufa xlator. As you say the files are present on the
> brick I don't suspect RDMA here.

Agreed.

> Is nufa still supported? Could this a bug in nufa + dht?

Until we explicitly decide to stop building and distributing it, it's still
"supported" in some sense, but only to the extent that there's someone
available to look at it.  Unfortunately, nobody has that as an assignment
and our tests for NUFA are minimal so they're not likely to detect breakage
automatically.  With the amount of change we've seen to DHT over the last
several months, it's entirely possible that a NUFA bug or two has crept in.


Re: [Gluster-users] [Gluster-devel] Default quorum for 2 way replication

2016-03-04 Thread Jeff Darcy
> I like the default to be 'none'. Reason: If we have 'auto' as quorum for
> 2-way replication and first brick dies, there is no HA. If users are
> fine with it, it is better to use plain distribute volume

"Availability" is a tricky word.  Does it mean access to data now, or
later despite failure?  Taking a volume down due to loss of quorum might
be equivalent to having no replication in the first sense, but certainly
not in the second.  When the possibility (likelihood?) of split brain is
considered, enforcing quorum actually does a *better* job of preserving
availability in the second sense.  I believe this second sense is most
often what users care about, and therefore quorum enforcement should be
the default.

I think we all agree that quorum is a bit slippery when N=2.  That's
where there really is a tradeoff between (immediate) availability and
(highest levels of) data integrity.  That's why arbiters showed up first
in the NSR specs, and later in AFR.  We should definitely try to push
people toward N>=3 as much as we can.  However, the ability to "scale
down" is one of the things that differentiate us vs. both our Ceph
cousins and our true competitors.  Many of our users will stop at N=2 no
matter what we say.  However unwise that might be, we must still do what
we can to minimize harm when things go awry.
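
For reference, whatever default we pick, these are the per-volume options
in question, so users can always make their own choice (volume name is a
placeholder):

   # enforce client-side quorum automatically (for replica 2, the first
   # brick must be up)
   gluster volume set myvol cluster.quorum-type auto

   # or disable enforcement entirely
   gluster volume set myvol cluster.quorum-type none

   # or require a fixed number of replicas to be up
   gluster volume set myvol cluster.quorum-type fixed
   gluster volume set myvol cluster.quorum-count 2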


Re: [Gluster-users] gluster small file performance

2015-08-18 Thread Jeff Darcy
 Note: The log files attached have the No data available messages parsed out
 to reduce the file size. There were an enormous amount of these. One of my
 colleagues submitted something to the message board about these errors in
 3.7.3.

  [2015-08-17 17:03:37.270219] W [fuse-bridge.c:1230:fuse_err_cbk]
  0-glusterfs-fuse: 6643: REMOVEXATTR()
  /boost_1_57_0/boost/accumulators/accumulators.hpp = -1 (No data available)
 
  [2015-08-17 17:03:37.271004] W [fuse-bridge.c:1230:fuse_err_cbk]
  0-glusterfs-fuse: 6646: REMOVEXATTR()
  /boost_1_57_0/boost/accumulators/accumulators.hpp = -1 (No data available)
 
  [2015-08-17 17:03:37.271663] W [fuse-bridge.c:1230:fuse_err_cbk]
  0-glusterfs-fuse: 6648: REMOVEXATTR()
  /boost_1_57_0/boost/accumulators/accumulators.hpp = -1 (No data available)
 
  [2015-08-17 17:03:37.274273] W [fuse-bridge.c:1230:fuse_err_cbk]
  0-glusterfs-fuse: 6662: REMOVEXATTR()
  /boost_1_57_0/boost/accumulators/accumulators_fwd.hpp = -1 (No data
  available)
 

I can't help but wonder how much these are affecting your performance. That's a 
lot of extra messages, and even more effort to log the failures. When I run 
your tests myself, I don't see any of these and I don't see a performance 
drop-off either. Maybe something ACL- or SELinux-related? It would be 
extra-helpful to get a stack trace for just one of these, to see where they're 
coming from. 

Re: [Gluster-users] gluster small file performance

2015-08-18 Thread Jeff Darcy
 I changed the logging to error to get rid of these messages as I was
 wondering if this was part of the problem. It didn't change the performance.
 Also, I get these same errors both before and after the reboot. I only see
 the slowdown after the reboot.
 I have SELinux disabled. Not sure about ACL. Don't think I can turn ACL off
 on XFS.
 I am happy to post results of strace. Do I just do 'strace tar -xPf boost.tar
  strace.log'?

That will show the calls if they're coming from tar, but I suspect they're 
internally generated so you'd have to attach strace to the glusterfsd process. 
Either way, you'd probably want to add -e removexattr to keep the results 
manageable. That will at least tell us *which* xattr we're trying to remove, 
which might give a clue to what's going on. 
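
Something along these lines should do it - the PID is a placeholder for
whichever brick process you're testing against:

   # if the calls come from tar itself
   strace -e trace=removexattr,lremovexattr,fremovexattr -o /tmp/tar-xattr.log \
 tar -xPf boost.tar

   # if they're generated internally, attach to the brick process instead
   strace -f -p <glusterfsd-pid> \
 -e trace=removexattr,lremovexattr,fremovexattr -o /tmp/brick-xattr.log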

Re: [Gluster-users] trouble mounting ssl-enabled volume

2015-06-15 Thread Jeff Darcy
- Original Message -

 Hi all,

 I'm just installing my first ever glusterfs volume, and am running into
 trouble, which I think may be related to using ssl. I don't have a network I
 can trust, so using secure authentication and encryption is a show-stopper
 for me.

 I am using gluster 3.6.3 on Debian stable, and the command I'm using to mount
 is:

 # mount -t glusterfs localhost:/austen /home

 and the error message I am seeing is the following:

 # tail -23 /var/log/glusterfs/home.log
 +--+
 [2015-06-16 00:12:12.691413] I [socket.c:379:ssl_setup_connection]
 0-austen-client-0: peer CN = elliot
 [2015-06-16 00:12:12.691978] I [rpc-clnt.c:1761:rpc_clnt_reconfig]
 0-austen-client-0: changing port to 49152 (from 0)
 [2015-06-16 00:12:12.694267] I [socket.c:379:ssl_setup_connection]
 0-austen-client-1: peer CN = wentworth
 [2015-06-16 00:12:12.695846] I [rpc-clnt.c:1761:rpc_clnt_reconfig]
 0-austen-client-1: changing port to 49152 (from 0)
 [2015-06-16 00:12:12.703270] I [socket.c:379:ssl_setup_connection]
 0-austen-client-0: peer CN = elliot
 [2015-06-16 00:12:12.703544] I
 [client-handshake.c:1413:select_server_supported_programs]
 0-austen-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
 [2015-06-16 00:12:12.703912] W [client-handshake.c:1109:client_setvolume_cbk]
 0-austen-client-0: failed to set the volume (Permission denied)

Are you setting auth.ssl-allow to enable specific users (identified by CN) to 
access the volume? The following page shows how. 

http://www.gluster.org/community/documentation/index.php/SSL 

Also, note that the CN can't contain spaces. I know that's inconvenient, but 
space was already used as a delimiter and changing that would have affected 
backward compatibility. 
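
If not, something along these lines should do it, using the CNs that show
up in your log plus whatever CN you gave the client certificate (the exact
list syntax is described on the page above):

   gluster volume set austen auth.ssl-allow elliot,wentworth,myclient-cn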

 [2015-06-16 00:12:12.703940] W [client-handshake.c:1135:client_setvolume_cbk]
 0-austen-client-0: failed to get 'process-uuid' from reply dict
 [2015-06-16 00:12:12.703956] E [client-handshake.c:1141:client_setvolume_cbk]
 0-austen-client-0: SETVOLUME on remote-host failed: Authentication failed
 [2015-06-16 00:12:12.703970] I [client-handshake.c:1225:client_setvolume_cbk]
 0-austen-client-0: sending AUTH_FAILED event
 [2015-06-16 00:12:12.703992] E [fuse-bridge.c:5145:notify] 0-fuse: Server
 authenication failed. Shutting down.
 [2015-06-16 00:12:12.704010] I [fuse-bridge.c:5599:fini] 0-fuse: Unmounting
 '/home'.
 [2015-06-16 00:12:12.709146] I [socket.c:379:ssl_setup_connection]
 0-austen-client-1: peer CN = wentworth
 [2015-06-16 00:12:12.710243] I
 [client-handshake.c:1413:select_server_supported_programs]
 0-austen-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
 [2015-06-16 00:12:12.711294] W [client-handshake.c:1109:client_setvolume_cbk]
 0-austen-client-1: failed to set the volume (Permission denied)
 [2015-06-16 00:12:12.711321] W [client-handshake.c:1135:client_setvolume_cbk]
 0-austen-client-1: failed to get 'process-uuid' from reply dict
 [2015-06-16 00:12:12.711330] E [client-handshake.c:1141:client_setvolume_cbk]
 0-austen-client-1: SETVOLUME on remote-host failed: Authentication failed
 [2015-06-16 00:12:12.711339] I [client-handshake.c:1225:client_setvolume_cbk]
 0-austen-client-1: sending AUTH_FAILED event
 [2015-06-16 00:12:12.711349] E [fuse-bridge.c:5145:notify] 0-fuse: Server
 authenication failed. Shutting down.
 [2015-06-16 00:12:12.711358] I [fuse-bridge.c:5599:fini] 0-fuse: Unmounting
 '/home'.
 [2015-06-16 00:12:12.711374] E [mount-common.c:228:fuse_mnt_umount]
 0-glusterfs-fuse: fuse: failed to unmount /home: Invalid argument
 [2015-06-16 00:12:12.711586] W [glusterfsd.c:1194:cleanup_and_exit] (-- 0-:
 received signum (15), shutting down

 Sadly, I have very little idea as to how to debug this. I fear it may be a
 problem with my ssl keys (I created a CA key and used it to sign the keys
 for the two servers, but may have done this wrong.)

 Any suggestions are welcome. I understand I haven't given all the information
 you likely need to help, but I don't even know what information would really
 be relevant, as I do not understand what this AUTH_FAILED event means.

 David


Re: [Gluster-users] reading from local replica?

2015-06-10 Thread Jeff Darcy
 In short, it would seem that either were I to use geo-repliciation,
 whether recommended or not in this kind of usage, I'd need to own both
 which volume to mount and what to do with writes when the client has
 chosen to mount the slave.

True.  Various active/active geo-replication solutions have been on
the road map for some time, but in each release there are other things
deemed more important.  :(

 Finally, given that ping times between regions are typically in excess
 of 200 ms in my case, would you strongly discourage AFR usage?

Pretty strongly.  The AFR write protocol is quite latency-sensitive.
Obviously, this affects performance.  Also, as RTT increases it
becomes harder and harder to tune things so that network brownouts
don't become full partitions.  If the read-replica selection options
worked, then reads should be OK and an almost entirely read-only
workload might be OK.  Otherwise, I'd say you're likely to have a bad
time.


Re: [Gluster-users] reading from local replica?

2015-06-09 Thread Jeff Darcy
 Am I misunderstanding cluster.read-subvolume/cluster.read-subvolume-index?
 
 I have two regions, A and B with servers a and b in,
 respectively, each region.  I have clients in both regions. Intra-region
 communication is fast, but the pipe between the regions is terrible.
 I'd like to minimize inter-region communication to as close to glusterfs
 write operations only and have reads go to the server in the region the
 client is running in.
 
 I have created a replica volume as:
 gluster volume create gv0 replica 2 a:/data/brick1/gv0
 b:/data/brick1/gv0 force
 
 As a baseline, if I use scp to copy from the brick directly, I get --
 for a 100M file -- times of about 6s if the client scps from the server
 in the same region and anywhere from 3 to 5 minutes if I the client scps
 the server in the other region.
 
 I was under the impression (from something I read but can't now find)
 that glusterfs automatically picks the fastest replica, but that has not
 been my experience; glusterfs seems to generally prefer the server in
 the other region over the local one, with times usually in excess of 4
 minutes.

The choice of which replica to read from has become rather complicated
over time.  The first parameter that matters is cluster.read-hash-mode,
which selects between dynamic and (two forms of) static selection.  For
the default mode, we try to spread the read load across replicas based
on both the file's ID and the client's.  For read-hash-mode=0 *only*,
we do the following instead:

 * If choose-local is set (as it is by default) and there's a local
   replica, use that.

 * Otherwise, select a replica based on fastest *initial* response.

Note that these are both a bit prone to hot spots, which is why this
method is not the default.  Also, re-evaluating response times is as
likely to lead to mobile hotspot behavior as anything else -
clients keep following each other around to previously idle but now
overloaded replicas, moving the congestion around but never resolving
it.  Thus, we only tend to re-evaluate in response to brick up/down
events.  Probably some room for improvement here.

That brings us to read-subvolume and read-subvolume-index.  The
difference between them is that read-subvolume takes a translator
*name* (which you'd have to get from the volfile) and only applies
to one replica set within a volume.  It's really only useful for
testing and debugging.  By contrast, read-subvolume-index applies
to all replica sets in a volume and doesn't require any knowledge
of translator names.  Either one is used *before* read-hash-mode;
if it's set, and if the corresponding replica is up, it will be
chosen.

Yes, it's a bit of a mess.  However, as you've clearly guessed,
this is a pretty critical decision so it's nice to have many
different ways to control it.
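
To make that concrete, here's roughly how the main knobs are set, using
the volume name from your example (whether the mount-time overrides reach
the translator is a separate question, addressed below):

   # pick the read-balancing policy described above (0, 1, or 2)
   gluster volume set gv0 cluster.read-hash-mode 2

   # control whether a local replica is always preferred when one exists
   gluster volume set gv0 cluster.choose-local on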

 I've also tried having clients mount the volume using the xlator
 options cluster.read-subvolume and cluster.read-subvolume-index, but
 neither seem to have any impact.  Here are sample mount commands to show
 what I'm attempting:
 
 mount -t glusterfs -o xlator-option=cluster.read-subvolume=gv0-client-0
 or 1 a:/gv0 /mnt/glusterfs
 mount -t glusterfs -o xlator-option=cluster.read-subvolume-index=0 or
 1 a:/gv0 /mnt/glusterfs

I would guess that the translator options are somehow not being passed
all the way through to the translator that actually makes the decision.
If it is being passed, it definitely should force the decision as
described above.  There might be a bug here, or perhaps I'm just
misunderstanding code I haven't read in a while.

Also, please note that synchronous replication (AFR) isn't really
intended or expected to work over long distances.  Anything over 5ms
RTT is risky territory; that's why we have separate geo-replication.


Re: [Gluster-users] reading from local replica?

2015-06-09 Thread Jeff Darcy
 Sorry for neglecting to mention the version, it's 3.7.1.


I've filed a bug to track this.

https://bugzilla.redhat.com/show_bug.cgi?id=1229808


Re: [Gluster-users] reading from local replica?

2015-06-09 Thread Jeff Darcy
 So, maybe passing these options as a mount command doesn't work/is a
 no-op, but what I don't understand is why -- given that there is no
 measure by which glusterfs should ever conclude the replica in the
 other region is ever faster than the replica in the same region.

If read-subvolume or read-subvolume-index is somehow not getting through,
then we're back to the read-hash-mode default - which does *not* try to
use round-trip-time measurements.  Still, it should give different
results for different files.

 In
 fact, it appears as though glusterfs is *preferring* the slower replica.

It's hard to see how that would be the case, since the code to set
read_child based on first-to-reply seems to be *missing* in the current
code.  :(  What version are you running, again?


Re: [Gluster-users] Network Topology Question

2015-05-22 Thread Jeff Darcy
- Original Message - 

 Hi there,

 I'm planning to setup a 3-Node Cluster for oVirt and would like to use 56 GBe
 (RoCe)
 exclusively for GlusterFS. Since 56 GBe switches are far too expensive and
 it's not
 planned to add more nodes and furthermore this would add a SPOF I'd like to
 cross connect the nodes as shown in the diagram below:

 Node 1 ------- Node 2 ------- Node 3
   |_____________________________|

 This way there's a dedicated 56 Gbit connection to/from each member node.

 Is is possible to do this with GlusterFS?
 My first thought was to have different IPs on each node's /etc/host mapped to
 the node
 hostnames but I'm unsure if I can force GlusterFS to hostnames instead of
 IPs.

There are two ways you can do this.  Both involve asymmetric configurations.
Imagine that you have three subnets, one per wire:

  192.168.1.1 and 192.168.1.2 between Node1 and Node2
  192.168.2.1 and 192.168.2.2 between Node2 and Node3
  192.168.3.1 and 192.168.3.2 between Node1 and Node3

So, /etc/hosts on Node1 would look like this:

  192.168.1.2 node2
  192.168.3.2 node3

On Node2 you'd have this:

  192.168.1.1 node1
  192.168.2.2 node3

And so on.  Note that these are all different than the clients, which would
have entries (probably in DNS rather than /etc/hosts) for the servers'
slower external addresses.

The other way to do the same thing is with explicit host routes or iptables
rules.  In that kind of setup, you put each server into its own subnet,
then add routes on the others to go through the interfaces you want.  For
example:

  node1 is 172.30.16.1
  node2 is 172.30.17.1
  node3 is 172.30.18.1

Therefore, on node1 (using the interface addresses above):

  route add -host node2 gw 192.168.1.2
  route add -host node3 gw 192.168.3.2

On node2:

  route add -host node1 gw 192.168.1.1
  route add -host node3 gw 192.168.2.2

And so on, again.  Don't forget to turn on IP forwarding.  Also, this
still requires that the servers have a different /etc/hosts than clients,
but at least it can be the same across all servers.  Alternatively, you
could use the same /etc/hosts (or DNS) everywhere, if you can add routes
on the clients as well.

All that said, the benefit of such a configuration is rather limited.
Using FUSE or GFAPI, replication will still occur over the slow client
network because it's being driven by the clients (this is likely to
change in 4.0).  On the other hand, self-heal and rebalance traffic
will use the faster internal network.  SMB and NFS will use both, so
they might see some benefit in *aggregate* but not per-client
throughput.  Depending on your usage pattern, the extra complexity of
setting up this kind of routing might not be worth the effort.


   


Re: [Gluster-users] [Gluster-devel] A HowTo for setting up network encryption with GlusterFS

2015-05-07 Thread Jeff Darcy


 I've written a how-to for setting up network encryption on GlusterFS at [1].
 This was something that was requested as setting up network encryption
 is not really easy.
 I've tried to cover all possible cases.

Great job, Kaushal!  Thank you.

 Please read through, and let me know of any changes,improvements needed.

I did spot a couple of minor things.  I'll forward those off-list.


Re: [Gluster-users] Multi-tenancy

2015-05-04 Thread Jeff Darcy
 Can anyone provide any insight on how to configure gluster networking to
 support multi-tenancy by separating Native/NFS/SMB client connections at
 layer 2? Our thinking was each client will come into our network on a
 dedicated vlan but unsure whether gluster can support say a dedicated client
 trunk interface with 50 or so vlan interfaces? Is this possible? And if not
 what could be another way?

Some of this is works by default and some of it's work in progress.
When a brick sends a reply to a client request, that reply will simply
follow the default routing for its destination.  In a VLAN environment,
this would mean sending it on the pseudo-interface for that VLAN, so in
effect the traffic for groups of clients on separate VLANs will remain
segregated.

What we don't have is a way to do VLAN-based access control across
native, NFS, and SMB.  The brick I/O infrastructure does support
address-based access control, but IIRC that doesn't affect who can
connect at the TCP level.  We'll still initially accept connections on
any interface, and then close any that don't pass the address filter.
If you want to play with this, then auth.allow is the volume option you
want to look at.  I don't know of any similar options for NFS or SMB, so
there might be no way to prevent them from accepting connections on any
VLAN.  Maybe someone from one of those teams can correct me.

In 4.0 we're working on ways to give users more control over what
networks get used for what.  Primarily this is to let internal traffic
(replication, self-heal, and so on) go over a private back-end network.
However, giving users more explicit control over the relationships
between volumes and front-end networks has also come up.  The feature
page is here.

   
http://www.gluster.org/community/documentation/index.php/Features/SplitNetwork

Does that help?
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Question on file reads from a distributed replicated volume

2015-04-13 Thread Jeff Darcy
 Run 'gluster volume set help'. There is a pretty good explanation of
 the read subvolume preferences and options.

Specifically, you'll want to look at the cluster.read-hash-mode option,
which has one of three values:

(0) Each client will determine which brick seems fastest, and use that
for all files unless a brick *failure* causes it to re-evaluate.  If
the once-fastest brick becomes slower this will *not* be noticed by
clients unless there's a failure.  Unfortunately, this mode is
likely to *create* such a condition by overloading one server.

(1) The read child for each file will be found using a hash of its
GFID, to ensure even distribution.  Note that if some servers are
faster than others, the distribution will be *even* but not
*optimal*.  This mode is the default.

(2) Similar to (1) except that each client uses the hash of the file's
GFID *plus its own PID*, so that different clients will be spread
across different bricks and avoid file-level hot spots.

All of these modes might be overridden if one of the bricks is local to
the client.  In that case, the client will always read the local copy
and this option is effectively ignored.
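
For example, to get behavior (2) on a volume hypothetically named myvol:

   gluster volume set myvol cluster.read-hash-mode 2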
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Synchronous replication, or no?

2015-04-09 Thread Jeff Darcy
 I was under the impression that gluster replication was synchrounous, so the
 appserver would not return back to the client until the created file was
 replicated to the other server. But this does not seem to be the case,
 because sleeping a little bit always seems to make the read failures go
 away. Is there any other reason why a file created is not immediately
 available on a second request?

It's quite possible that the replication is synchronous (the bits do hit
disk before returning) but that the results are not being seen immediately
due to caching at some level.  There are some GlusterFS mount options
(especially --negative-timeout) that might be relevant here, but it's also
possible that the culprit is somewhere above that in your app servers.
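
If negative-entry caching turns out to be the culprit, one quick experiment is
to disable it at mount time; a sketch, with hypothetical server, volume, and
mount point names:

   mount -t glusterfs -o negative-timeout=0 server1:/myvol /mnt/myvol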

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Synchronous replication, or no?

2015-04-09 Thread Jeff Darcy
 Jeff: I don't really understand how a write-behind translator could keep data
 in memory before flushing to the replication module if the replication is
 synchronous. Or put another way, from whose perspective is the replication
 synchronous? The gluster daemon or the creating client?

That's actually a more complicated question than many would think.  When we
say synchronous replication we're talking about *durability* (i.e. does
the disk see it) from the perspective of the replication module.  It does
none of its own caching or buffering.  When it is asked to do a write, it
does not report that write as complete until all copies have been updated.

However, durability is not the same as consistency (i.e. do *other clients*
see it) and the replication component does not exist in a vacuum.  There
are other components both before and after that can affect durability and
consistency.  We've already touched on the after part.  There might be
caches at many levels that become stale as the result of a file being
created and written.  Of particular interest here are negative directory
entries which indicate that a file is *not* present.  Until those expire,
it is possible to see a file as not there even though it does actually
exist on disk.  We can control some of this caching, but not all.

The other side is *before* the replication module, and that's where
write-behind comes in.  POSIX does not require that a write be immediately
durable in the absence of O_SYNC/fsync and so on.  We do honor those
requirements where applicable.  However, the most common user expectation
is that we will defer/batch/coalesce writes, because making every write
individually immediate and synchronous has a very large performance impact.
Therefore we implement write-behind, as a layer above replication.  Absent
any specific request to perform a write immediately, data might sit there
for an indeterminate (but usually short) time before the replication code
even gets to see it.

I don't think write-behind is likely to be the issue here, because it
only applies to data within a file.  It will pass create(2) calls through
immediately, so all servers should become aware of the file's existence
right away.  On the other hand, various forms of caching on the *client*
side (even if they're the same physical machines) could still prevent a
new file from being seen immediately.
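
If an application really needs the stricter behavior, it can ask for it
explicitly with O_SYNC or fsync.  A quick way to see the difference from the
shell (the mount point is hypothetical; the second command forces the data
through write-behind and replication before dd returns):

   dd if=/dev/zero of=/mnt/myvol/testfile bs=1M count=16
   dd if=/dev/zero of=/mnt/myvol/testfile bs=1M count=16 conv=fsync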
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Synchronous replication, or no?

2015-04-09 Thread Jeff Darcy
 Ok, that made a lot of sense. I guess what I was expecting was that the
 writes were (close to) immediately consistent, but Gluster is rather
 designed to be eventually consistent.

All distributed file systems are, to some extent; we just try to be
clearer than most about what the guarantees are.  For example, some of
them buffer at the client *despite* fsync or O_SYNC.  The temptation is
obvious; POSIX single system image behavior is far more expensive
in a distributed file system than in a local one, and everyone has
to compete on performance.  We're actually far stricter than most
when it comes to durability, and the performance disadvantage has
been difficult to bear sometimes.  Hopefully, now that we have the
upcall facility (developed for NFSv4) we can improve consistency
as well without having to give up more performance.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Got a slogan idea?

2015-04-01 Thread Jeff Darcy
 What I am saying is that if you have a slogan idea for Gluster, I want
 to hear it. You can reply on list or send it to me directly. I will
 collect all the proposals (yours and the ones that Red Hat comes up
 with) and circle back around for community discussion in about a month
 or so.

Personally I don't like any of these all that much, but maybe they'll
get someone else thinking.

GlusterFS: your data, your way

GlusterFS: any data, any servers, any protocol

GlusterFS: scale-out storage for everyone

GlusterFS: software defined storage for everyone

GlusterFS: the Swiss Army Knife of storage
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] SSL ciphers

2015-03-22 Thread Jeff Darcy
 I dug a bit into the matter and I'm quite puzzled here. In OpenSSL, there's an
 SSLv23_METHOD which selects whichever is more appropriate, but I see nothing
 equivalent for TLS! Each version has its dedicated function call, like
 TLSv1_METHOD, TLSv1_1_METHOD and TLSv1_2_METHOD!

I was kind of surprised by the same thing, but I guess I shouldn't have been.
This only scratches the surface of the horror that is the OpenSSL API, but
what's really scary is that the two main alternatives (GnuTLS and NSS) seem
even worse.  I used to have hopes of switching to PolarSSL, which has a
better and better-documented API, but I keep getting buried by other tasks so
I don't know if/when that will ever happen.

 Thank you very much for pointing out the interesting bits and helping figure
 out things. Have fun debugging :-)

You're quite welcome.  Misery loves company.  ;)  Please keep us informed of
your findings.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] SSL ciphers

2015-03-19 Thread Jeff Darcy
  socket.c:2915
  priv->ssl_meth = (SSL_METHOD *)TLSv1_method();
 
 I'm really glad to hear that :-)


FWIW, using TLSv1_2_method instead doesn't immediately seem to break.
Unfortunately, every possible piece of code for 3.7 got merged one
second before the feature-freeze deadline today, and that generated a
lot of wreckage.  I'll have to wait for that to clear before I can do
a meaningful test of this one-line change.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] SSL ciphers

2015-03-19 Thread Jeff Darcy
 The problem with the Gluster setting is that it's impossible to go above

 HIGH:!SSLv2:!3DES:!RC4:!aNULL:!ADH

 Which is bad.. Gluster uses SSL only and not TLS :-( An upgrade should be
 considered.

That is untrue in current code:

socket.c:2915
priv->ssl_meth = (SSL_METHOD *)TLSv1_method();

Please put the version you're complaining about into a bug report.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Quorum setup for 2+1

2015-03-12 Thread Jeff Darcy
   I have a follow-up question.
   When a node is disconnected from the rest, the client gets an error
   message Transport endpoint is not connected and all access is
   prevented. Write access must not be allowed to such a node. I understand
   that.
   In my case it would be a desired feature to be able to at least read
   files. Is it possible to retain read-only access?

  Client-side quorum can do this with the cluster.quorum-reads option, but
  it lacks support for arbiters.  The options you've set enforce quorum at
  the server side, by killing brick daemons if quorum is lost.  I suppose
  it might be possible to add the read-only translator instead of killing
  the daemon, but AFAIK there's no plan to add that feature.
 
 Could I use cluster.quorum-type and cluster.quorum-count?
 Would it work with a rep 2+1 setup or would I need a rep 3 setup?

To use client-side quorum, you'd need a true replica-3 setup (not just two
plus an arbiter).
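
With a true replica 3, a minimal sketch of the client-side settings (volume
name is hypothetical):

   gluster volume set myvol cluster.quorum-type auto
   gluster volume set myvol cluster.quorum-reads on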
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Quorum setup for 2+1

2015-03-11 Thread Jeff Darcy
 I have a follow-up question.
 When a node is disconnected from the rest, the client gets an error
 message Transport endpoint is not connected and all access is
 prevented. Write access must not be allowed to such a node. I understand
 that.
 In my case it would be a desired feature to be able to at least read
 files. Is it possible to retain read-only access?

Client-side quorum can do this with the cluster.quorum-reads option, but
it lacks support for arbiters.  The options you've set enforce quorum at
the server side, by killing brick daemons if quorum is lost.  I suppose
it might be possible to add the read-only translator instead of killing
the daemon, but AFAIK there's no plan to add that feature.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Quorum setup for 2+1

2015-03-10 Thread Jeff Darcy
 I would like to set up server-side quorum by using the following setup:
 - 2x storage nodes (s-node-1, s-node-2)
 - 1x arbiter node (s-node-3)
 So the trusted storage pool has three peers.
 
 This is my volume info:
 Volume Name: wp-vol-0
 Type: Replicate
 Volume ID: 8808ee87-b201-474f-83ae-6f08eb259b43
 Status: Started
 Number of Bricks: 1 x 2 = 2
 Transport-type: tcp
 Bricks:
 Brick1: s-node-1:/gluster/gvol0/brick0/brick
 Brick2: s-node-2:/gluster/gvol0/brick0/brick
 
 I would like to setup the server side quorum so that any two nodes would
 have quorum.
 s-node-1, s-node-2 = quorum
 s-node-1, s-node-3 = quorum
 s-node-2, s-node-3 = quorum
 According to the Gluster guys at FOSDEM this should be possible.
 
 I have been fiddling with the quorum options, but have not been able to
 achieve the desired setup.
 Theoretically I would do:
 # gluster volume set wp-vol-0 cluster.server-quorum-type server
 # gluster volume set wp-vol-0 cluster.server-quorum-ratio 60
 
 But the cluster.server-quorum-ratio option produces an error:
 volume set: failed: Not a valid option for single volume
 
 How would I achieve the desired setup?

Somewhat counter-intuitively, server-quorum-type is a *volume* option
but server-quorum-ratio is a *cluster wide* option.  Therefore, instead
of specifying a volume name on that command, use this:

# gluster volume set all cluster.server-quorum-ratio 60
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Configure separate network for inter-node communication

2015-03-09 Thread Jeff Darcy
 I would be very interested to read your blog post as soon as its out and I
 guess many others too. Please do post the link to this list as soon as its
 online.

Sorry, forgot to do this earlier.  It's here:

http://pl.atyp.us/2015-03-life-on-the-server-side.html

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] poor performance with encryption and SSL enabled

2015-03-09 Thread Jeff Darcy
 I took the recommendation of disabling the stripes. Now I just have encryption
 (at rest) and SSL enabled. The test I am running is a bwa indexing. Basic dd
 read/writes work fine and I don't see any errors in the gluster logs. Then
 when I try the bwa index I see the following:

 /shared/perftest/bwa/bwa index -a bwtsw hg19.fa
 [bwa_index] Pack FASTA... 26.29 sec
 [bwa_index] Construct BWT for the packed sequence...
 BWTIncConstructFromPacked() : Can't read from hg19.fa.pac : Unexpected end of
 file

This does look like some sort of bad interaction between the two features.
I'll add it as a bug report and see if we can get someone assigned.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Configure separate network for inter-node communication

2015-03-05 Thread Jeff Darcy
 I have two gluster nodes in a replicated setup and have connected the two
 nodes together directly through a 10 Gbit/s crossover cable. Now I would
 like to tell gluster to use this seperate private network for any
 communications between the two nodes. Does that make sense? Will this bring
 me any performance gain? and if yes how do I configure that?

It is possible, but it's not likely to improve performance much (yet).

The easiest way to do this is to use a custom /etc/hosts on the servers,
so that *on a server* every other server's name resolves to its private
back-end address.  Meanwhile, clients resolve that same name to the
server's front-end address.  You can get a similar effect with explicit
host routes or iptables rules on the servers.
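
A minimal sketch, assuming a made-up 10.0.0.0/24 subnet on the crossover link
and bricks that were defined using the names server1 and server2 (clients keep
resolving those names to the front-end addresses via DNS):

   # in /etc/hosts on server1 (10.0.0.1 on the crossover link)
   10.0.0.2  server2
   # in /etc/hosts on server2 (10.0.0.2 on the crossover link)
   10.0.0.1  server1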

The reason this won't have much effect on performance is that the
servers do not (currently) replicate to one another.  Instead, clients
send data directly to every replica themselves.  The only time time a
private network would see much traffic would be when the clients are
actually the servers performing administrative operations - self heal,
rebalance, and so on.

In 4.0, both parts of this answer would be different.  First, we expect
to have better handling of multiple networks and multi-homed hosts,
including user specification of which networks to use for which
traffic[1].  Second, 4.0 will have a new form of replication which
*does* replicate directly between servers[2].  Parts of this second
feature are in fact likely to appear well before the rest of 4.0, using
the server-to-server data flow but retaining our current methods of
tracking changes and re-syncing servers after a failure.  In fact I'm
writing a blog post right now about this, including some performance
measurements.  I'll respond again here when it's done.

[1] 
http://www.gluster.org/community/documentation/index.php/Features/SplitNetwork
[2] 
http://www.gluster.org/community/documentation/index.php/Features/new-style-replication
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] poor performance with encryption and SSL enabled

2015-02-24 Thread Jeff Darcy
 SSL certs are self-signed and generated on all servers. Combined into a
 glusterfs.ca in /etc/ssl. By itself the SSL is working well.

Glad to hear it.  ;)

 If I run dd or any i/o operations I see a flurry of these messages in the
 logs.

 [2015-02-24 16:58:51.144099] W [stripe.c:5288:stripe_internal_getxattr_cbk]
 (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x3fd0620550]
 (-->
 /usr/lib64/glusterfs/3.6.2/xlator/cluster/stripe.so(stripe_internal_getxattr_cbk+0x36a)[0x7f6a152a12ba]
 (-->
 /usr/lib64/glusterfs/3.6.2/xlator/protocol/client.so(client3_3_fgetxattr_cbk+0x174)[0x7f6a154db284]
 (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x3fd0e0ea75] (-->
 /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x142)[0x3fd0e0ff02] )
 0-data-stripe-3: invalid argument: frame-local


Have you tried encryption (at rest) without striping, or vice versa?  I
suspect some kind of bad interaction between the two, but before we go
down that path it would be nice to make sure they're working separately.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Looking for volunteer to write up official How to do GlusterFS in the Cloud: The Right Way for Rackspace...

2015-02-18 Thread Jeff Darcy
I could probably chip in too.  I've run tons of my own science
experiments on Rackspace instead of our own hardware, because that makes
my results more reproducible by others.  If we can enable more people to
do likewise, that benefits everyone.

P.S. Hi Jesse.  Small world, huh?
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Looking for volunteer to write up official How to do GlusterFS in the Cloud: The Right Way for Rackspace...

2015-02-18 Thread Jeff Darcy
 Looks like we have four volunteers:
 
 * Ben Turner (primary GlusterFS perf tuning guy)
 * Jeff Darcy (greybeard GlusterFS developer and scalability expert)
 * Josh Boon (experienced GlusterFS guy - Ubuntu focused)
 * Nico Schottelius (newer GlusterFS guy - familiar with Ubuntu/CentOS)
 
 This sounds like a fairly good mix, so lets go with that.
 
 Ben and Jeff, does it make sense for you two to do the leading, with
 Josh and Nico involved and learning/assisting/idea-generation/stuff as
 needed?

Sounds good to me.  By purest coincidence, I was planning to do some
experiments on Rackspace today anyway.  I'll try to take notes, and
share them when I'm done.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] REMINDER: GlusterFS.next (a.k.a. 4.0) status/planning meeting

2015-02-05 Thread Jeff Darcy
This is *tomorrow* at 12:00 UTC (approximately 15.5 hours from now) in
#gluster-meeting on Freenode.  See you all there!

- Original Message -
 Perhaps it's not obvious to the broader community, but a bunch of people
 have put a bunch of work into various projects under the 4.0 banner.
 Some of the results can be seen in the various feature pages here:
 
 http://www.gluster.org/community/documentation/index.php/Planning40
 
 Now that the various subproject feature pages have been updated, it's
 time to get people together and decide what 4.0 is *really* going to be.
 To that end, I'd like to schedule an IRC meeting for February 6 at 12:00
 UTC - that's this Friday, same time as the triage/community meetings but
 on Friday instead of Tuesday/Wednesday.  An initial agenda includes:
 
 * Introduction and expectation-setting
 
 * Project-by-project status and planning
 
 * Discussion of future meeting formats and times
 
 * Discussion of collaboration tools (e.g. gluster.org wiki or
   Freedcamp) going forward.
 
 Anyone with an interest in the future of GlusterFS is welcome to attend.
 This is *not* a Red Hat only effort, tied to Red Hat product needs and
 schedules and strategies.  This is a chance for the community to come
 together and define what the next generation of distributed file
 systems for the real world will look like.  I hope to see everyone
 there.
 
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] GlusterFS.next (a.k.a. 4.0) status/planning meeting

2015-02-02 Thread Jeff Darcy
Perhaps it's not obvious to the broader community, but a bunch of people
have put a bunch of work into various projects under the 4.0 banner.
Some of the results can be seen in the various feature pages here:

http://www.gluster.org/community/documentation/index.php/Planning40

Now that the various subproject feature pages have been updated, it's
time to get people together and decide what 4.0 is *really* going to be.
To that end, I'd like to schedule an IRC meeting for February 6 at 12:00
UTC - that's this Friday, same time as the triage/community meetings but
on Friday instead of Tuesday/Wednesday.  An initial agenda includes:

* Introduction and expectation-setting

* Project-by-project status and planning

* Discussion of future meeting formats and times

* Discussion of collaboration tools (e.g. gluster.org wiki or
  Freedcamp) going forward.

Anyone with an interest in the future of GlusterFS is welcome to attend.
This is *not* a Red Hat only effort, tied to Red Hat product needs and
schedules and strategies.  This is a chance for the community to come
together and define what the next generation of distributed file
systems for the real world will look like.  I hope to see everyone
there.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] how to shrink client translator

2015-02-02 Thread Jeff Darcy
 gluster volume set volname open-behind off turns off this xlator in
 the client stack. There is no way to turn off debug/io-stats. Any reason
 why you would like to turn off io-stats translator?

 For improving efficiency.

It might not be a very fruitful kind of optimization.  Repeating an
experiment someone else had done a while ago, I just compared a normal
client volfile vs. one with a *hundred* extra do-nothing
translators added.  There was no statistically significant difference,
even on a fairly capable SSD-equipped system.  I/O latency variation and
other general measurement noise still far outweigh the cost of a few extra
function calls to invoke translators that aren't doing any I/O themselves.

 Is there any command to show the current translator tree after dynamically
 adding or deleting any xlator?

The new graph should show up in the logs.  Also, you can always use gluster
system getspec xxx to get the current client volfile for any volume xxx.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] ... i was able to produce a split brain...

2015-01-30 Thread Jeff Darcy
 Pranith and I had a discussion regarding this issue and here is what we have
 in our mind right now.
 
 We plan to provide the user commands to execute from mount so that he can
 access the files in split-brain. This way he can choose which copy is to be
 used as source. The user will have to perform a set of getfattrs and
 setfattrs (on virtual xattrs) to decide which child to choose as source and
 inform AFR with his decision.
 
 A) To know the split-brain status :
 getfattr -n trusted.afr.split-brain-status path-to-file
 
 This will provide user with the following details -
 1) Whether the file is in metadata split-brain
 2) Whether the file is in data split-brain
 
 It will also list the name of afr-children to choose from. Something like :
 Option0: client-0
 Option1: client-1
 
 We also tell the user what the user could do to view metadata/data info; like
 stat to get metadata etc.
 
 B) Now the user has to choose one of the options (client-x/client-y..) to
 inspect the files.
 e.g., setfattr -n trusted.afr.split-brain-choice -v client-0 path-to-file
 We save the read-child info in inode-ctx in order to provide the user access
 to the file in split-brain from that child. Once the user inspects the file,
 he proceeds to do the same from the other child of replica pair and makes an
 informed decision.
 
 C) Once the above steps are done, AFR is to be informed with the final choice
 for source. This is achieved by -
 (say the fresh copy is in client-0)
 e.g., setfattr -n trusted.afr.split-brain-heal-finalize -v client-0
 path-to-file
 This child will be chosen as source and split-brain resolution will be done.

+1

That looks quite nice, and AFAICT shouldn't be prohibitively hard to
implement.


___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] ... i was able to produce a split brain...

2015-01-28 Thread Jeff Darcy
 On 01/27/2015 11:43 PM, Joe Julian wrote:
  No, there's not. I've been asking for this for years.
 Hey Joe,
 Vijay and I were just talking about this today. We were
 wondering if you could give us the inputs to make it a feature to implement.
 Here are the questions I have:
 Basic requirements if I understand correctly are as follows:
 1) User should be able to fix the split-brain without any intervention
 from admin as the user knows best about the data.
 2) He should be able to preview some-how about the data before selecting
 the copy which he/she wants to preserve.

One possibility would be to implement something like DHT's
filter_loc_subvol_key, though perhaps using child indices instead of
translator names.  Another would be a script which can manipulate
volfiles and use GFAPI to fetch a specific version of a file.  I've
written several scripts which can do the necessary volfile manipulation.
If we finally have a commitment to do something like this, actually
implementing it will be the easy part.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Reddit thread on GlusterFS

2015-01-23 Thread Jeff Darcy
 Created a reddit account and posted. Could use some upvotes though if we
 don't want "It seems GlusterFS popularity didn't take off, and Ceph ate
 Gluster's lunch." to be the top comment.

I wouldn't worry about it too much.  Most people know that Redditors
tend to be negative, contrarian, and clueless.  Start a thread about
Ceph and you'd probably see people talking about how difficult and
unstable and slow it was during the one hour they spent with it.  Some
of them might even make disparaging comparisons to GlusterFS.  It's
better for us to let comments there stand in their usual Reddit context
than to risk being accused of astroturfing.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] New architecture: some advice needed

2014-12-22 Thread Jeff Darcy
 Actually I have 3 supermicro servers with 12 4TB SATA disks each and 2
 SSD (in each server)
 Each server also has one dual port DDR Infiniband card.
 
 I would like to create a scale-out storage infrastructure (primary
 used by web servers), totally HA and fault tollerant.
 I was thinking about 1 brick for each SATA disks in Distributed
 Dispersed mode. Replica set to 3 (so, actually, only 12*4TB=48TB would
 be available)
 
 What do you suggest? Is Distributed Dispersed good for my environment
 or should I go with Distributed Replicated ?
 
 In replicated mode, I can always access to raw files , in case of
 disaster, this would not be possible with dispersed mode, right?
 
 Which are pro and cons between replicated and dispersed modes?
 
 We plan to add up to 10 servers (all with 12*4 SATA disks) in the near
 future ending to 336TB of available and replicated space.
 
 Any suggestions?

The key tradeoffs here are storage utilization vs. performance.  In
general, erasure codes (disperse) will give better storage utilization
than replication for the same level of performance.  However, this might
not be the case for N=3.  With replication, that will protect against
two failures.  However, from the admin guide section on disperse:

*redundancy* must be greater than 0, and the total number of bricks must
be greater than 2 * *redundancy*

I interpret this to mean that for two-failure protection you would need
at least five bricks.  With three bricks disperse can only offer
one-failure protection.  In this case it's roughly equivalent to RAID-5,
with only a 50% storage penalty vs. 100% for replica 2 offering the same
protection.
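
To make the brick math concrete, a two-failure-tolerant dispersed volume would
be created roughly like this (server and brick names are made up):

   gluster volume create ecvol disperse 5 redundancy 2 \
  server1:/brick server2:/brick server3:/brick server4:/brick server5:/brick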

The other issue is performance.  With disperse, all writes *and reads*
must be done to all bricks, and at a stripe size equal to 512 times the
number of bricks (minus those used for redundancy).  This means more
data transfer, especially for reads, and also more write contention than
with replication.  This being new code, some optimizations that already
exist for replication do not yet exist for disperse even though they're
applicable.

Adding Xavier, who's the real expert on disperse, in case I got
something wrong here.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] # of replica != number pf bricks?

2014-12-10 Thread Jeff Darcy
 What happens if I have 3 peers for quorum. I create 3 bricks and want to have
 only two replicas in my volume.

The number of *bricks* must be a multiple of the replica count, but
quorum is based on the number of *servers* and there can be multiple
bricks per server.  Therefore, if you have servers A, B, and C with two
bricks each, you can do this:

   gluster volume create foo replica 2 \
  A:/brick0 B:/brick0 C:/brick0 A:/brick1 B:/brick1 C:/brick1

First we'll combine this into the following two-way replica sets:

   A:/brick0 and B:/brick0
   C:/brick0 and A:/brick1
   B:/brick1 and C:/brick1

Then we'll distribute files among those three sets.  If one server fails
then we'll still have quorum (2/3) and each replica set will have at
least one surviving replica.  If two fail then neither of those things
will be true and we'll disable the volume.

In 4.0 we plan to improve on this by splitting bricks and creating the
necessary replica sets from the pieces ourselves.  Besides making
configuration simpler, this should remove the restriction on the number
of bricks being a multiple of the replica count, and also redistribute
load more evenly during or after a failure.  4.0 is a long way off,
though, so I probably shouldn't even be talking about it.  ;)
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Stupid question re multiple networks

2014-11-14 Thread Jeff Darcy
 What if the gluster servers are also clients? I locally plan to use
 a number of servers acting as gluster and VM servers, so that gluster serves
 both the VM's and other clients.

I think that fits fairly well into this paradigm.  Note that the routing of
traffic is by *type* (e.g. user I/O, rebalance) rather than by destination.
By default, everything's on the same network, so things would work just as
now.  If you want, you can redirect user I/O over one network and internal
traffic over another, even if the machines are both clients and servers.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Stupid question re multiple networks

2014-11-13 Thread Jeff Darcy
 AFAIK multiple network scenario only works if you are not using Gluster via
 FUSE mounts or gfapi from remote hosts. It will work for NFS access or when
 you set up something like Samba with CTDB. Just not with native Gluster as
 the server always tells the clients which addresses to connect to: ie your
 storage hosts will always supply the connection details of the hosts that
 are configured in gluster to your storage clients.

 I wonder if this could be gaffer-taped with some bridging/vlan/arp spoofing
 trickery but I'm not sure I'd trust such a hack.

 It would be *really* nice if there was a way to set up gluster so you could
 specify different IPs for backend and frontend operations.

As you suggest, there are various kinds of trickery that can be used
to fake multi-network support even for native mounts.  I've seen it done
via split-horizon DNS, explicit host routes, and iptables.  *Proper*
support for multiple networks is part of the proposed 4.0 feature set.

http://www.gluster.org/community/documentation/index.php/Planning40

In fact, I would greatly appreciate your help defining what proper
means in this context.  Clearly, we need to add the concept of a network
to our (informal) object model, and sort out the host/address/network
relationships.  Then we need a way to direct certain traffic flows to
certain networks.  The question is: how do we present this to the user?
Let's take a whack at how to define networks etc. using the CLI's
current object-verb syntax (even though it's a bit clunky).

   gluster network add user-net 1.2.3.0/24
   gluster network add back-end 5.6.0.0/16
   gluster peer probe 1.2.3.4
   gluster peer probe 5.6.7.8

So far, so good.  Note that on the second probe we should be able to
recognize that this is just a new address (on another network) for the
host we already added with the first probe.  Heartbeats, quorums, etc.
should also be aware of multi-homed hosts.  Maybe there's a better
syntax, but this will do for now.  Let's add a volume.

   gluster volume create silly-vol 1.2.3.4:/brick

So, which network address should the daemon for 1.2.3.4:/brick expose
for clients?  Which address should it use for internal traffic such as
rebalance or self-heal?  This is where it gets tricky.  Let's start by
saying that *by default* all traffic is on the interface specified on
the volume create line.  If we want to do something different...

   # ONLY redirect rebalance traffic.
   gluster volume route silly-vol rebalance back-end

Now rebalance traffic goes through 5.6.7.8 instead.  Is that intuitive?
What about these?

   # Export a volume on multiple networks.
   gluster volume route silly-vol client user-net some-other-net

   # Redirect rebalance, self-heal, anything else we think of.
   gluster volume route silly-vol all-mgmt back-end

   # Redirect GLOBALLY instead of per volume.
   gluster cluster route rebalance back-end

Does this seem like it's heading in the right direction?  It doesn't
look too bad to me, but my perspective is hardly typical.  Is there
something *users* would like to be able to do with multiple networks
that can't be expressed this way, or is there some better way to define
how these multiple networks should be used?  Please let us know.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] hekafs.org is not accessible

2014-10-31 Thread Jeff Darcy
 The links on
 http://gluster.org/documentation/architecture/internals/Dougw:A_Newbie%27s_Guide_to_Gluster_Internals/
 pointing to Jeff's tutorials on hekafs.org seem to be broken. Perhaps
 they are mirrored somewhere else?

When I saw that the domain was going to expire a few months ago, I tried to 
find out if anyone at Red Hat would be interested in taking it over.  Nobody 
seemed to be, so I let it lapse.  Now it's in a state where the domain 
registrars wouldn't even let me revive it.  Meanwhile, the files are all still 
accessible two ways:

(1) Modify the URL to point to //pl.atyp.us/hekafs.org/... instead

(2) Modify your /etc/hosts to have an entry for hekafs.org which is the same as 
pl.atyp.us (currently 162.243.99.140)

With method (1) any secondary URLs e.g. for images are likely to be broken.  
With method (2) everything should be just as it would be if the domain were 
still alive.
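
In other words, for method (2), something like this (using the address above,
which may of course change):

   echo '162.243.99.140 hekafs.org' >> /etc/hosts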
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Small files

2014-10-27 Thread Jeff Darcy
 To what extent, is Gluster a good choice for the many small files scenario,
 as opposed to HDFS? Last I checked, hdfs would consume humongous memory
 resources if the cluster has many small files, given its architecture. There
 are some hackish solutions on top HDFS for the case of many small files
 rather than huge files, but it would be nice to find a file system that
 matches that scenario well as is. So I wonder how would Gluster do when
 files are typically small.

We're not as bad as HDFS, but it's still not what I'd call a good
scenario for us.  While we have good space efficiency for small files,
and we don't have a single-metadata-server SPOF either, the price we pay
is a hit to our performance for creates (and renames).  There are
several efforts under way to improve this, but there's only so much we
can do when directory contents must be consistent across the volume
despite being spread across many bricks (or replica sets).  More details
on those efforts are here.

http://www.gluster.org/community/documentation/index.php/Features/Feature_Smallfile_Perf
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Perfomance issue on a 90+% full file system

2014-10-06 Thread Jeff Darcy
 Yup, pretty common for us.  Once we hit ~90% on either of our two
 production clusters (107 TB usable each), performance takes a beating.
 
 I don't consider this a problem, per se.  Most file systems (clustered
 or otherwise) are the same.  I consider a high water mark for any
 production file system to be 80% (and I consider that vendor
 agnostic), at which time action should be taken to begin clean up.
 That's good sysadminning 101.

I can't think of a good reason for such a steep drop-off in GlusterFS.
Sure, performance should degrade somewhat due to fragmenting, but not
suddenly.  It's not like Lustre, which would do massive preallocation
and fall apart when there was no longer enough space to do that.  It
might be worth measuring average latency at the local-FS level, to see
if the problem is above or below that line.
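
One low-effort way to gather per-brick latency numbers is the built-in
io-stats profiler (volume name is hypothetical).  It doesn't isolate the local
FS by itself, but a jump in brick-side fop latency is a strong hint that the
problem is below the Gluster line:

   gluster volume profile myvol start
   gluster volume profile myvol info
   gluster volume profile myvol stop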
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] To GlusterFS or not...

2014-09-23 Thread Jeff Darcy
 SSD has been considered but is not an option due to cost.  SAS has
 been considered but is not a option due to the relatively small sizes
 of the drives.  We are *rapidly* growing towards a PB of actual online
 storage.
 
 We are exploring raid controllers with onboard SSD cache which may help.

We have had some pretty good results with those in the lab.  They're not
*always* beneficial, and getting the right SSD:disk ratio for your
workload might require some experimentation, but it's certainly a good
direction to explore.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] To GlusterFS or not...

2014-09-22 Thread Jeff Darcy

 The biggest issue that we are having, is that we are talking about
 -billions- of small (max 5MB) files. Seek times are killing us
 completely from what we can make out. (OS, HW/RAID has been tweaked to
 kingdom come and back).

This is probably the key point.  It's unlikely that seek times are going
to get better with GlusterFS, unless it's because the new servers have
more memory and disks, but if that's the case then you might as well
just deploy more memory and disks in your existing scheme.  On top of
that, using any distributed file system is likely to mean more network
round trips, to maintain consistency.  There would be a benefit from
letting GlusterFS handle the distribution (and redistribution) of files
automatically instead of having to do your own sharding, but that's not
the same as a performance benefit.

 I’m not yet too clued up on all the GlusterFS naming, but essentially
 if we do go the GlusterFS route, we would like to use non replicated
 storage bricks on all the front-end, as well as back-end servers in
 order to maximize storage.

That's fine, so long as you recognize that recovering from a failed
server becomes more of a manual process, but it's probably a moot point
in light of the seek-time issue mentioned above.  As much as I hate to
discourage people from using GlusterFS, it's even worse to have them be
disappointed, or for other users with other needs to be so, as we spend
time trying to fix the unfixable.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] [Gluster-devel] Who's who ?

2014-09-15 Thread Jeff Darcy
  For new columns which may be useful, these ones spring to mind:

 * Twitter username - many people have them these days
 * A free form text description - eg I'm Justin, I'm into databases, storage,
 and developing embedded human augmentation systems. ;)
 * Some kind of thumbnail photo - probably as the first column on the left

I think the current table is already quite wide, and adding more columns
is going to be very problematic design-wise.  Instead, I suggest that we
make each person's name a link to their wiki user page, where they can
put whatever contact or other info makes sense.  I just did that for
myself, and it barely takes more time than updating the Who's Who page
itself (plus it cuts down on the update notifications for that page).
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0

2014-09-12 Thread Jeff Darcy
 Has anyone looked into whether LogCabin can provide the consistent small
 storage based on RAFT for Gluster?
 
 https://github.com/logcabin/logcabin
 
 I have no experience with using it so I cannot say if it is good or suitable.
 
 I do know the following project uses it and it's just not as easy to setup as
 Gluster is - it also has Zookeeper support etc.
 
 https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud

LogCabin is the canonical implementation of Raft, by the author of the Raft
protocol, so it was the first implementation I looked at.  Sad to say, it
didn't seem that stable.  AFAIK RAMCloud - itself an academic project - is
the only user, whereas etcd and consul are being used by multiple projects
and in production.  Also, I found the etcd code at least more readable than
LogCabin despite the fact that I've worked in C++ before and had never seen
any Go code until that time.  Then again, those were early days for all
three projects (consul didn't even exist yet) so things might have changed.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0

2014-09-11 Thread Jeff Darcy
 Yes.  I came across Salt currently for unified management for storage to
 manage gluster and ceph which is still in planning phase.  I could think of
 a complete requirement of infra requirement to solve from glusterd to
 unified management.  Calamari ceph management already uses Salt.  It would
 be the ideal solution with Salt (or any infra) if gluster, ceph and unified
 management uses.


I think the idea of using Salt (or similar) is interesting, but it's also
key that Ceph still has its mon cluster as well.  (Is mon calamari an
*intentional* Star Wars reference?)  As I see it, glusterd or anything we
use to replace it has multiple responsibilities:

(1) Track the current up/down state of cluster members and resources.

(2) Store configuration and coordinate changes to it.

(3) Orchestrate complex or long-running activities (e.g. rebalance).

(4) Provide service discovery (current portmapper).

Salt and its friends clearly shine at (2) and (3), though they outsource
the actual data storage to an external data store.  With such a data
store, (4) becomes pretty trivial.  The sticking point for me is (1).  How
does Salt handle that need, or how might it be satisfied on top of the
facilities Salt does provide?  I can see *very* clearly how to do it on
top of etcd or consul.  Could those in fact be used for Salt's data store?
It seems like Salt shouldn't need a full-fledged industrial strength
database, just something with high consistency/availability and some basic
semantics.

Maybe we should try to engage with the Salt developers to come up with
ideas.  Or find out exactly what functionality they found still needs to
be in the mon cluster and not in Salt.


___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0

2014-09-11 Thread Jeff Darcy
 For distributed store, I would think of MongoDB which provides
 distributed/replicated/highly available/master read-write/slave read-only
 database.  Lets get what community think about SaltStack and/or MongoDB.


I definitely do not think MongoDB is the right tool for this job.  I'm
not one of those people who just bash MongoDB out of fashion, either.  I
frequently defend them against such attacks, and I used MongoDB for some
work on CloudForms a while ago.  However, a full MongoDB setup carries a
pretty high operational complexity, to support high scale and rich
features . . . which we don't need.  This part of our system doesn't
need sharding.  It doesn't need complex ad-hoc query capability.  If we
don't need those features, we *certainly* don't need the complexity that
comes with them.  We need something with the very highest levels of
reliability and consistency, with as little complexity as possible to go
with that.  Even its strongest advocates would probably agree that
MongoDB doesn't fit those requirements very well.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0

2014-09-08 Thread Jeff Darcy
 Is there any reason not to consider zookeeper?

I did bring up that idea a while ago.  I'm no Java fan myself, but still
I was surprised by the vehemence of the reactions.  To put it politely,
many seemed to consider the dependency on Java unacceptable for both
resource and security reasons.  Some community members said that they'd
be forced to switch to another DFS if we went that way.  It didn't seem
like a very promising direction to explore further.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0

2014-09-05 Thread Jeff Darcy
 Isn't some of this covered by crm/corosync/pacemaker/heartbeat?

Sorta, kinda, mostly no.  Those implement virtual synchrony, which is
closely related to consensus but not quite the same even in a formal CS
sense.  In practice, using them is *very* different.  Two jobs ago, I
inherited a design based on the idea that if everyone starts at the same
state and handles the same messages in the same order (in that case they
were using Spread) then they'd all stay consistent.  Sounds great in
theory, right?  Unfortunately, in practice it meant that returning a
node which had missed messages to a consistent state was our problem,
and it was an unreasonably complex one.  Debugging
failure-during-recovery problems in that code was some of the least fun
I ever had at that job.  A consensus protocol, with its focus on
consistency of data rather than consistency of communication, seems like
a better fit for what we're trying to achieve.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Proposal for GlusterD-2.0

2014-09-05 Thread Jeff Darcy
 As part of the first phase, we aim to delegate the distributed configuration
 store. We are exploring consul [1] as a replacement for the existing
 distributed configuration store (sum total of /var/lib/glusterd/* across all
 nodes). Consul provides distributed configuration store which is consistent
 and partition tolerant. By moving all Gluster related configuration
 information into consul we could avoid split-brain situations.

Overall, I like the idea.  But I think you knew that.  ;)

Is the idea to run consul on all nodes as we do with glusterd, or to run
it only on a few nodes (similar to Ceph's mon cluster) and then use them
to coordinate membership etc. for the rest?
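
For anyone who hasn't looked at it yet, consul's KV store is just an HTTP API,
so the glusterd-side integration could be quite thin; a sketch with made-up
key names:

   # store a bit of volume configuration
   curl -X PUT -d 'replica=3' \
  http://127.0.0.1:8500/v1/kv/gluster/volumes/myvol/options
   # read it back
   curl http://127.0.0.1:8500/v1/kv/gluster/volumes/myvol/options?raw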
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] split-brain on glusterfs running with quorum on server and client

2014-09-05 Thread Jeff Darcy
 I have a replicate glusterfs setup on 3 Bricks ( replicate = 3 ). I have
 client and server quorum turned on. I rebooted one of the 3 bricks. When it
 came back up, the client started throwing error messages that one of the
 files went into split brain.

This is a good example of how split brain can happen even with all kinds of
quorum enabled.  Let's look at those xattrs.  BTW, thank you for a very
nicely detailed bug report which includes those.

 BRICK1
 
 [root@ip-172-31-38-189 ~]# getfattr -d -m . -e hex
 /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
 getfattr: Removing leading '/' from absolute path names
 # file:
 data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
 trusted.afr.PL2-client-0=0x
 trusted.afr.PL2-client-1=0x0001
 trusted.afr.PL2-client-2=0x0001
 trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

 BRICK 2
 ===
 [root@ip-172-31-16-220 ~]# getfattr -d -m . -e hex
 /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
 getfattr: Removing leading '/' from absolute path names
 # file:
 data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
 trusted.afr.PL2-client-0=0x0d46
 trusted.afr.PL2-client-1=0x
 trusted.afr.PL2-client-2=0x
 trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

 BRICK 3
 =
 [root@ip-172-31-12-218 ~]# getfattr -d -m . -e hex
 /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
 getfattr: Removing leading '/' from absolute path names
 # file:
 data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
 trusted.afr.PL2-client-0=0x0d46
 trusted.afr.PL2-client-1=0x
 trusted.afr.PL2-client-2=0x
 trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

Here, we see that brick 1 shows a single pending operation for the other
two, while they show 0xd46 (3398) pending operations for brick 1.
Here's how this can happen.

(1) There is exactly one pending operation.

(2) Brick1 completes the write first, and says so.

(3) Client sends messages to all three, saying to decrement brick1's
count.

(4) All three bricks receive and process that message.

(5) Brick1 fails.

(6) Brick2 and brick3 complete the write, and say so.

(7) Client tells all bricks to decrement remaining counts.

(8) Brick2 and brick3 receive and process that message.

(9) Brick1 is dead, so its counts for brick2/3 stay at one.

(10) Brick2 and brick3 have quorum, with all-zero pending counters.

(11) Client sends 0xd46 more writes to brick2 and brick3.

Note that at no point did we lose quorum. Note also the tight timing
required.  If brick1 had failed an instant earlier, it would not have
decremented its own counter.  If it had failed an instant later, it
would have decremented brick2's and brick3's as well.  If brick1 had not
finished first, we'd be in yet another scenario.  If delayed changelog
had been operative, the messages at (3) and (7) would have been combined
to leave us in yet another scenario.  As far as I can tell, we would
have been able to resolve the conflict in all those cases.

*** Key point: quorum enforcement does not totally eliminate split
brain.  It only makes the frequency a few orders of magnitude lower. ***

So, is there any way to prevent this completely?  Some AFR enhancements,
such as the oft-promised outcast feature[1], might have helped.
NSR[2] is immune to this particular problem.  Policy based split brain
resolution[3] might have resolved it automatically instead of merely
flagging it.  Unfortunately, those are all in the future.  For now, I'd
say the best approach is to resolve the conflict manually and try to
move on.  Unless there's more going on than meets the eye, recurrence
should be very unlikely.
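
For reference, the usual manual resolution in a case like this is a sketch
along the following lines.  You choose the copy to discard yourself, and the
gfid-link path is derived from the trusted.gfid value shown above; verify
every path before removing anything:

   # ON THE BRICK WHOSE COPY YOU WANT TO DISCARD (brick1 in this example)
   cd /data/vol2/gluster-data
   rm apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
   rm .glusterfs/ea/95/ea950263-977e-46bf-89a0-ef631ca139c2
   # then let self-heal copy the surviving version back
   gluster volume heal PL2 full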

[1] http://www.gluster.org/community/documentation/index.php/Features/outcast

[2] 
http://www.gluster.org/community/documentation/index.php/Features/new-style-replication

[3] http://www.gluster.org/community/documentation/index.php/Features/pbspbr
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] complete f......p thanks to glusterfs...applause, you crashed weeks of work

2014-09-02 Thread Jeff Darcy
 ssl keys have to be 2048-bit fixed size

No, they don't.

 all keys have to be everywhere (all versions... which noob programmed
 that??)

That noob would be me.

It's not necessary to have the same key on all servers, but using
different ones would be even more complex and confusing for users.
Instead, the servers authenticate to one another using a single
identity.  According to SSL 101, anyone authenticating as an identity
needs the key for that identity, because it's really the key - not the
publicly readable cert - that guarantees authenticity.

If you want to set up a separate key+cert for each server, each one
having a CA file for the others, you certainly can and it works.
However, you'll still have to deal with distributing those new certs.
That's inherent to how SSL works.  Instead of forcing a particular PKI
or cert-distribution scheme on users, the GlusterFS SSL implementation
is specifically intended to let users make those choices.

 only control connection is encrypted

That's not true.  There are *separate* options to control encryption
for the data path, and in fact that code's much older.  Why separate?
Because the data-path usage of SSL is based on a different identity
model - probably more what you expected, with a separate identity per
client instead of a shared one between servers.

 At a certain point it also used tons of diskspace due to not deleting
 files in the .glusterfs directory , (but still being connected and
 up serving volumes)

For a long time, the only internal conditions that might have caused
the .glusterfs links not to be cleaned up were about 1000x less common
than similar problems which arise when users try to manipulate files
directly on the bricks.  Perhaps if you could describe what you were
doing on the bricks, we could help identify what was going on and
suggest safer ways of achieving the same goals.

 IT WAS A LONG AND PAINFUL SYNCING PROCESS until i thought i was happy
 ;)

Syncing what?  I'm guessing a bit here, but it sounds like you were
trying to do the equivalent of a replace-brick (or perhaps rebalance) by
hand.  As you've clearly discovered, such attempts are fraught with
peril.  Again, with some more constructive engagement perhaps we can
help guide you toward safer solutions.

 Due to an Online-resizing lvm/XFS glusterfs (i watch the logs nearly
 all the time) i discovered mismacthing disk layouts , realizing also
 that

 server1 was up and happy when you mount from it, but server2 spew
 input/output errors on several directories (for now just in that
 volume),

The mismatching layout messages are usually the result of extended
attributes that are missing from one brick's copy of a directory.  It's
possible that the XFS resize code is racy, in the sense that extended
attributes become unavailable at some stage even though the directory
itself is still accessible.  I suggest that you follow up on that bug
with the XFS developers, who are sure to be much more polite and
responsive than we are.

 i tried to rename one directory, it created a recursive loop inside
 XFS (e.g.  BIGGEST FILE-SYSTEM FAIL : TWO INODES linking to one dir ,
 ideally containing another) i got at least the XFS loop solved.

Another one for the XFS developers.

 Then the pre-last resort option came up.. deleted the volumes, cleaned
 all xattr on that ~2T ... and recreated the volumes, since shd seems
 to work somehow since 3.4

You mention that you cleared all xattrs.  Did you also clear out
.glusterfs?  In general, using anything but a completely empty directory
tree as a brick can be a bit problematic.

 Maybe anyone has a suggestion , except create a new clean volume and
 move all your TB's .

More suggestions might have been available if you had sought them
earlier.  At this point, none of us can tell what state your volume is
in, and there are many indications that it's probably a state none of us
have ever seen or anticipated.  As you've found, attempting random
fixes in such a situation often makes things worse.  It would be
irresponsible for us to suggest that you go down even more unknown and
untried paths.  Our first priority should be to get things back to a
known and stable state.  Unfortunately, the only such state at this point
would seem to be a clean volume.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Transparent encryption in GlusterFS: Implications on manageability

2014-08-13 Thread Jeff Darcy
 I.1 Generating the master volume key


 Master volume key should be generated by user on the trusted machine.
 Recommendations on master key generation provided at section 6.2 of
 the manpages [1]. Generating of master volume key is in user's
 competence.

That was fine for an initial implementation, but it's still the single
largest obstacle to adoption of this feature.  Looking forward, we need
to provide full CLI support for generating keys in the necessary format,
specifying their location, etc.
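
To make that concrete, the helper could be as small as the sketch below.
This is not an existing gluster command; the 256-bit size, hex encoding,
and file path are my assumptions -- the authoritative format is whatever
the manpages [1] specify.

    import os

    def generate_master_key(path, bits=256):
        # Write a random, hex-encoded master key to a root-only file.
        key = os.urandom(bits // 8).hex()
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        with os.fdopen(fd, "w") as f:
            f.write(key + "\n")

    # Hypothetical location; the real path is whatever the user chooses.
    generate_master_key("/etc/glusterfs/master-volume-key")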

I.2 Location of the master volume key when mounting a
volume


 At mount time the crypt translator searches for a master volume key on
 the client machine at the location specified by the respective
 translator option. If there is no any key at the specified location,
 or the key at specified location is in improper format, then mount
 will fail. Otherwise, the crypt translator loads the key to its
 private memory data structures.

 Location of the master volume key can be specified at volume creation
 time (see option master-key, section 6.7 of the man pages [1]).
 However, this option can be overridden by user at mount time to
 specify another location, see section 7 of manpages [1], steps 6, 7,
 8.

Again, we need to improve on this.  We should support this as a volume
or mount option in its own right, not rely on the generic
--xlator-option mechanism.  Adding options to mount.glusterfs isn't
hard.  Alternatively, we could make this look like a volume option
settable once through the CLI, even though the path is stored locally on
the client.  Or we could provide a separate special-purpose
command/script, which again only needs to be run once.  It would even be
acceptable to treat the path to the key file (not its contents!) as a
true volume option, stored on the servers.  Any of these would be better
than requiring the user to understand our volfile format and
construction so that they can add the necessary option by hand.

II. Check graph of translators on your client machine
after mount!


 During mount your client machine receives configuration info from the
 non-trusted server. In particular, this info contains the graph of
 translators, which can be subjected to tampering, so that encryption
 won't be invoked for your volume at all. So it is highly important to
 verify this graph. After successful mount make sure that the graph of
 translators contains the crypt translator with proper options (see
 FAQ#1, section 11 of the manpages [1]).

It is important to verify the graph, but not by poking through log files
and not without more information about what to look for.  So we got a
volfile that includes the crypt translator, with some options.  The
*code* should ensure that the master-key option has the value from the
command line or local config, and not some other.  If we have to add
special support for this in otherwise-generic graph initialization code,
that's fine.
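
As an illustration (my own sketch, not shipped code), the check could be
as simple as parsing the fetched volfile and refusing to proceed unless a
crypt translator with the expected master-key value is present:

    def volfile_has_crypt(volfile_text, expected_key_path):
        # Walk the volume/type/option/end-volume stanzas of a volfile and
        # return True only if some translator of type encryption/crypt
        # carries the master-key value we were given locally.
        in_volume = is_crypt = key_ok = False
        for line in volfile_text.splitlines():
            words = line.split()
            if not words:
                continue
            if words[0] == "volume":
                in_volume, is_crypt, key_ok = True, False, False
            elif words[0] == "type" and in_volume and len(words) > 1:
                is_crypt = (words[1] == "encryption/crypt")
            elif words[0] == "option" and in_volume and len(words) > 2:
                if words[1] == "master-key" and words[2] == expected_key_path:
                    key_ok = True
            elif words[0] == "end-volume":
                if is_crypt and key_ok:
                    return True
                in_volume = False
        return False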
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] User-serviceable snapshots design

2014-05-08 Thread Jeff Darcy
  * Since a snap volume will refer to multiple bricks, we'll need
 more brick daemons as well.  How are *those* managed?
 
 This is infra handled by the core snapshot functionality/feature. When
 a snap is created, it is treated not only as a lvm2 thin-lv but as a
 glusterfs volume as well. The snap volume is activated and mounted and
 made available for regular use through the native fuse-protocol client.
 Management of these is not part of the USS feature. But handled as part
 of the core snapshot implementation.

If we're auto-starting snapshot volumes, are we auto-stopping them as
well?  According to what policy?

 USS (mainly snapview-server xlator)
 talks to the snapshot volumes (and hence the bricks) through the glfs_t
 *, and passing a glfs_object pointer.

So snapview-server is using GFAPI from within a translator?  This caused
a *lot* of problems in NSR reconciliation, especially because of how
GFAPI constantly messes around with the THIS pointer.  Does the USS
work include fixing these issues?

If snapview-server runs on all servers, how does a particular client
decide which one to use?  Do we need to do something to avoid hot spots?

Overall, it seems like having clients connect *directly* to the snapshot
volumes once they've been started might have avoided some complexity or
problems.  Was this considered?

  * How does snapview-server manage user credentials for connecting
 to snap bricks?  What if multiple users try to use the same
 snapshot at the same time?  How does any of this interact with
 on-wire or on-disk encryption?
 
 No interaction with on-disk or on-wire encryption. Multiple users can
 always access the same snapshot (volume) at the same time. Why do you
 see any restrictions there?

If we're using either on-disk or on-network encryption, client keys and
certificates must remain on the clients.  They must not be on servers.
If the volumes are being proxied through snapview-server, it needs
those credentials, but letting it have them defeats both security
mechanisms.

Also, do we need to handle the case where the credentials have changed
since the snapshot was taken?  This is probably a more general problem
with snapshots themselves, but still needs to be considered.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] User-serviceable snapshots design

2014-05-08 Thread Jeff Darcy
  * How do clients find it?  Are we dynamically changing the client
 side graph to add new protocol/client instances pointing to new
 snapview-servers, or is snapview-client using RPC directly?  Are
 the snapview-server ports managed through the glusterd portmapper
 interface, or patched in some other way?
 Adding a protocol/client instance to connect to protocol/server at the
 daemon.

So now the client graph is being dynamically modified, in ways that
make it un-derivable from the volume configuration (because they're
based in part on user activity since then)?  What happens if a normal
graph switch (e.g. due to add-brick) happens?  I'll need to think some
more about what this architectural change really means.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Proposal for improvements for heal commands

2014-05-08 Thread Jeff Darcy
 2) According to the feedback we got, Commands: gluster volume heal volname
 info healed/heal-failed are not helpful in debugging anything. So I am
 thinking of deprecating these two commands.
Reasons:
- The commands only give the last 1024 entries that succeeded/failed, so
most of the times users need to inspect logs.

Seems reasonable, though if it's just an issue of not keeping enough
information to be useful we could fix that by simply retaining more.

 3) gluster volume heal volname info split-brain will be re-implemented to
 print all the files that are in split-brain instead of the limited 1024
 entries.
- One constant complaint is that even after the file is fixed from
split-brain, it may still show up in the previously cached output. In
this implementation the goal is to remove all the caching and compute the
results afresh.

This seems reasonable too.  I can't help but wonder if it might be worth
tracking split-brain files using a Merkle tree approach like we did with
xtime, so we could track any number of such files efficiently.
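
Something along these lines, purely as an illustration of the idea
(nothing like it exists in the tree today): each directory's hash is
derived from its children, so a scan for split-brain files can skip any
subtree whose hash hasn't changed since the last pass.

    import hashlib

    class Node:
        def __init__(self, name):
            self.name = name
            self.children = {}          # name -> Node
            self.split_brain = False    # set on file (leaf) nodes

        def child(self, name):
            return self.children.setdefault(name, Node(name))

        def digest(self):
            h = hashlib.sha1(self.name.encode())
            h.update(b"1" if self.split_brain else b"0")
            for name in sorted(self.children):
                h.update(self.children[name].digest())
            return h.digest()

    def mark_split_brain(root, path):
        node = root
        for part in path.strip("/").split("/"):
            node = node.child(part)
        node.split_brain = True

    root = Node("/")
    mark_split_brain(root, "dir1/fileA")
    before = root.digest()
    mark_split_brain(root, "dir2/fileB")
    # Only the digests along dir2/fileB change, so an enumeration pass can
    # descend just that path and skip dir1 entirely.
    assert before != root.digest()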
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] User-serviceable snapshots design

2014-05-08 Thread Jeff Darcy
  Overall, it seems like having clients connect *directly* to the
  snapshot volumes once they've been started might have avoided some
  complexity or problems.  Was this considered?

 Can you explain this in more detail? Are you saying that the virtual
 namespace overlay used by the current design can be reused along with
 returning extra info to clients or is this a new approach where you
 make the clients much more intelligent than they are in the current
 approach?

Basically the clients would have the same intelligence that now
resides in snapview-server.  Instead of spinning up a new
protocol/client to talk to a new snapview-server, they'd send a single
RPC to start the snapshot brick daemons, then connect to those itself.

Of course, this exacerbates the problem with dynamically changing
translator graphs on the client side, because now they dynamically added
parts will be whole trees (corresponding to whole volfiles) instead of
single protocol/client translators.  Long term, I think we should
consider *not* handling these overlays as modifications to the main
translator graph, but instead allowing multiple translator graphs to be
active in the glusterfs process concurrently.  For example, this greatly
simplifies the question of how to deal with a graph change after we've
added several overlays.

 * Splice method: graph comparisons must be enhanced to ignore the
   overlays, overlays must be re-added after the graph switch takes
   place, etc.

 * Multiple graph method: just change the main graph (the one that's
   rooted at mount/fuse) and leave the others alone.

Stray thought: does any of this break when we're in an NFS or Samba
daemon instead of a native-mount glusterfs daemon?

  If we're using either on-disk or on-network encryption, client keys
  and certificates must remain on the clients.  They must not be on
  servers.  If the volumes are being proxied through snapview-server,
  it needs those credentials, but letting it have them defeats both
  security mechanisms.
 
  Also, do we need to handle the case where the credentials have
  changed since the snapshot was taken?  This is probably a more
  general problem with snapshots themselves, but still needs to be
  considered.

 Agreed. Very nice point you brought up. We will need to think a bit
 more on this Jeff.

This is what reviews are for.  ;)  Another thought: are there any
interesting security implications because USS allows one user to expose
*other users'* previous versions through the automatically mounted
snapshot?
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] User-serviceable snapshots design

2014-05-08 Thread Jeff Darcy
 client graph is not dynamically modified. the snapview-client and
 protocol/server are inserted by volgen and no further changes are made on
 the client side. I believe Anand was referring to  Adding a protocol/client
 instance to connect to protocol/server at the daemon as an action being
 performed by volgen.

OK, so let's say we create a new volfile including connections for a snapshot
that didn't even exist when the client first mounted.  Are you saying we do
a full graph switch to that new volfile?  That still seems dynamic.  Doesn't
that still mean we need to account for USS state when we regenerate the
next volfile after an add-brick (for example)?  One way or another the
graph's going to change, which creates a lot of state-management issues.
Those need to be addressed in a reviewable design so everyone can think
about it and contribute their thoughts based on their perspectives.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] User-serviceable snapshots design

2014-05-08 Thread Jeff Darcy
 Overall, it seems like having clients connect *directly* to the
 snapshot volumes once they've been started might have avoided some
 complexity or problems. Was this considered?

 Yes this was considered. I have mentioned the two reasons why this was
 dropped in the other mail.

I look forward to the next version of the design which reflects the new
ideas since this email thread started.

 They were: a) snap view generation requires privileged ops to
 glusterd. So moving this task to the server side solves a lot of those
 challenges.

Not really.  A server-side component issuing privileged requests
whenever a client asks it to is no more secure than a client-side
component issuing them directly.  There needs to be some sort of
authentication and authorization at the glusterd level (the only place
these all converge).  This is a more general problem that we've had with
glusterd for a long time.  If security is a sincere concern for USS,
shouldn't we address it by trying to move the general solution forward?
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] User-serviceable snapshots design

2014-05-08 Thread Jeff Darcy
 No graph changes either on client side or server side. The
 snap-view-server will detect availability of new snapshot from
 glusterd, and will spin up a new glfs_t for the corresponding snap,
 and start returning new list of names in readdir(), etc.

I asked if we were dynamically changing the client graph to add new
protocol/client instances.  Here is Varun's answer.

 Adding a protocol/client instance to connect to protocol/server at the
 daemon.

Apparently the addition he mentions wasn't the kind I was asking about,
but something that only occurs at normal volfile-generation time.  Is
that correct?

 No volfile/graph changes at all. Creation/removal of snapshots is
 handled in the form of a dynamic list of glfs_t's on the server side.

So we still have dynamically added graphs, but they're wrapped up in
GFAPI objects?  Let's be sure to capture that nuance in v2 of the spec.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] User-serviceable snapshots design

2014-05-07 Thread Jeff Darcy
 Attached is a basic write-up of the user-serviceable snapshot feature
 design (Avati's). Please take a look and let us know if you have
 questions of any sort...

A few.

The design creates a new type of daemon: snapview-server.

* Where is it started?  One server (selected how) or all?

* How do clients find it?  Are we dynamically changing the client
  side graph to add new protocol/client instances pointing to new
  snapview-servers, or is snapview-client using RPC directly?  Are
  the snapview-server ports managed through the glusterd portmapper
  interface, or patched in some other way?

* Since a snap volume will refer to multiple bricks, we'll need
  more brick daemons as well.  How are *those* managed?

* How does snapview-server manage user credentials for connecting
  to snap bricks?  What if multiple users try to use the same
  snapshot at the same time?  How does any of this interact with
  on-wire or on-disk encryption?

I'm sure I'll come up with more later.  Also, next time it might
be nice to use the upstream feature proposal template *as it was
designed* to make sure that questions like these get addressed
where the whole community can participate in a timely fashion.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Would there be a use for cluster-specific filesystem tools?

2014-05-05 Thread Jeff Darcy
(thanks to brain-dead Zimbra for the empty response before)

 Okay, so interest seems to be there.  What tools would be useful?  So
 far my list consists of:
 
 1) du -sk or -s --si
 2) rm -fr
 3) find (or at least find -print)
 
 What else would you add to this list?

How about grep -r?

___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] Inktank acquisition

2014-04-30 Thread Jeff Darcy
As many of you have probably heard by now, we're joining forces with our good
friends working on Ceph at Inktank.  As one of the community's semi-official
bloggers, here's my own take on this momentous event.

http://pl.atyp.us/2014-04-inktank-acquisition.html

(same thing inline, for convenience)

I know a lot of people are going to be asking me about Red Hat's acquisition of
Inktank, so I've decided to collect some thoughts on the subject.  The very
very simple version is that **I'm delighted**.  Occasional sniping back and
forth notwithstanding, I've always been a huge fan of Ceph and the people
working on it.  This is great news.  More details in a bit, but first I have to
take care of some administrivia.

*Unlike everything else I have ever written here, this post has been submitted
to my employer for approval prior to publication.  I swear to you that it's
still my own sincere thoughts, but I believe it's an ethical requirement for
independent bloggers such as myself to be up front about any such entanglement
no matter how slight the effect might have been.  Now, on with the real
content.*

As readers and conference-goers beyond number can attest, I've always said that
Ceph and GlusterFS are allies in a common fight against common rivals.  First,
we've both stood against proprietary storage appliances, including both
traditional vendors and the latest crop of startups.  A little less obviously,
we've also both stood for Real File Systems.  Both projects have continued to
implement and promote the classic file system API even as other projects (some
even with the gall to put FS in their names) implement various stripped-down
APIs that don't preserve the property of working with every script and library
and application of the last thirty years.  Not having to rewrite applications,
or import/export data between various special-purpose data stores, is a **huge**
benefit to users.

Naturally, these two projects have a lot of similarities.  In addition to the
file system API, both have tried to address object and block APIs as well.
Because of their slightly different architectures and user bases, however,
they've approached those interfaces in slightly different ways.  For example,
GlusterFS is files all the way down whereas Ceph has separate bulk-data and
metadata layers.  GlusterFS distributes cluster management among all servers,
while Ceph limits some of that to a dedicated monitor subset.  Whether it's
because of these technical differences or because of relationships or pure
happenstance, the two projects have experienced different levels of traction in
each of these markets.  This has led to different lessons, and different ideas
embedded in each project's code.

One of the nice things about joining forces is that we each gain even more
freedom than before to borrow each other's ideas.  Yes, they were both open
source, so we could always do some of that, but it's not like we could have used
one project's management console on top of the other's data path.  GlusterFS
using RADOS would have been unthinkable, as would Ceph using GFAPI.  Now, all
things are possible.  In each area, we have the chance to take two sets of ideas
and either converge on the better one or merge the two to come up with something
even better than either was before.  I don't know what the outcomes will be, or
even what all of the pieces are that we'll be looking at, but I do know that
there are some very smart people joining the team I'm on.  Whenever that
happens, all sorts of unpredictable good things tend to follow.

So, welcome to my new neighbors from the Ceph community.  Come on in, make
yourself comfortable by the fire, and let's have a good long chat.
___
Announce mailing list
annou...@gluster.org
http://supercolony.gluster.org/mailman/listinfo/announce
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Glusterfs Rack-Zone Awareness feature...

2014-04-22 Thread Jeff Darcy
 When I create a new replicated volume, using only 2 nodes, I use this command
 line : ‘gluster volume create vol_name replica 2 transport tcp server1
 :/export/brick1/1 server2 :/export/brick1/1’
 server1 and server2 are in 2 different datacenters.
 Now, if I want to expand gluster volume, using 2 new servers (ex : server3
 and server4) , I use those command lines :
 ‘gluster volume add-brick vol_name server3: /export/brick1/1’
 ‘gluster volume add-brick vol_name server4: /export/brick1/1’
 ‘gluster volume rebalance vol_name fix-layout start’
 ‘gluster volume rebalance vol_name start’
 How the rebalance command work ?
 How to be sure that replicated data are not stored on servers hosted in the
 same datacenter ?

Right now, every replica set in a volume must use the same replica count.
Some of the infrastructure we need for the data
classification task in 3.6 will allow us to relax that limitation and
even support multiple replica counts within one volume, but for the
simple/general case that probably won't be until 3.7 or later.  For now,
we ensure when a volume is created or bricks are added that the bricks
within each replica set are not co-located.
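
The check itself is simple.  Roughly (a sketch of mine, not the actual
glusterd code, with the host-to-location mapping left to the caller):

    def replica_sets(bricks, replica):
        # Consecutive groups of bricks, as given on the create/add-brick line.
        return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]

    def colocated_sets(bricks, replica, location_of):
        # Return every replica set whose members share a location.
        bad = []
        for rset in replica_sets(bricks, replica):
            locations = [location_of(b.split(":")[0]) for b in rset]
            if len(set(locations)) < len(locations):
                bad.append(rset)
        return bad

    datacenter = {"server1": "dc1", "server2": "dc2",
                  "server3": "dc1", "server4": "dc2"}
    bricks = ["server1:/export/brick1/1", "server2:/export/brick1/1",
              "server3:/export/brick1/1", "server4:/export/brick1/1"]
    print(colocated_sets(bricks, 2, datacenter.get))  # [] -- nothing co-located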

Also, since this is the first time I've noticed a mention of multiple
*data centers* (as opposed to multiple racks within one data center),
it's important to note that AFR will fall down quite badly if the
latency is greater than a few milliseconds.  NSR will be much better at
handling such environments, but won't be available for a while yet.

Yeah, I'm also frustrated that all the good stuff always seems to be in
the future.  ;)
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Glusterfs Rack-Zone Awareness feature...

2014-04-18 Thread Jeff Darcy
 I do not understand why it could be a problem to place the data's replica on
 a different node group.
 If a group of node become unavailable (due to datacenter failure, for
 example) volume should remain online, using the second group.

I'm not sure what you're getting at here.  If you're talking about initial
placement of replicas, we can place all members of each replica set in
different node groups (e.g. racks).  If you're talking about adding new
replica members when a previous one has failed, then the question is *when*.
Re-populating a new replica can be very expensive.  It's not worth starting
if the previously failed replica is likely to come back before you're done.
We provide the tools (e.g. replace-brick) to deal with longer term or even
permanent failures, but we don't re-replicate automatically.  Is that what
you're talking about?
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Glusterfs Rack-Zone Awareness feature...

2014-04-15 Thread Jeff Darcy
 I have a little question.
 I have read glusterfs documentation looking for a replication management. I
 want to be able to localize replicas on nodes hosted in 2 Datacenters
 (dual-building).
 CouchBase provide the feature, I’m looking for GlusterFs : “Rack-Zone
 Awareness”.
 https://blog.couchbase.com/announcing-couchbase-server-25
 “Rack-Zone Awareness - This feature will allow logical groupings of Couchbase
 Server nodes (where each group is physically located on a rack or an
 availability zone). Couchbase Server will automatically allocate replica
 copies of data on servers that belong to a group different from where the
 active data lives. This significantly increases reliability in case an
 entire rack becomes unavailable. This is of particularly importance for
 customers running deployments in public clouds.”

 Do you know if Glusterfs provide a similar feature ?
 If not, do you plan to develop it, in the near future ?

There are two parts to the answer. Rack-aware placement in general is part of 
the data classification feature planned for the 3.6 release. 

http://www.gluster.org/community/documentation/index.php/Features/data-classification
 

With this feature, files can be placed according to various policies using any 
of several properties associated with objects or physical locations. Rack-aware 
placement would use the physical location of a brick. Tiering would use the 
performance properties of a brick and the access time/frequency of an object. 
Multi-tenancy would use the tenant identity for both bricks and objects. And so 
on. It's all essentially the same infrastructure. 

For replication decisions in particular, there needs to be another piece. Right 
now, the way we use N bricks with a replication factor of R is to define N/R 
replica sets each containing R members. This is sub-optimal in many ways. We 
can still compare the value or fitness of two replica sets for storing a 
particular object, but our options are limited to the replica sets as defined 
last time bricks were added or removed. The differences between one choice and 
another effectively get smoothed out, and the load balancing after a failure is 
less than ideal. To do this right, we need to use more (overlapping) 
combinations of bricks. Some of us have discussed ways that we can do this 
without sacrificing the modularity of having distribution and replication as 
two separate modules, but there's no defined plan or date for that feature 
becoming available. 

BTW, note that using *too many* combinations can also be a problem. Every time 
an object is replicated across a certain set of storage locations, it creates a 
coupling between those locations. Before long, all locations are coupled 
together, so that *any* failure of R-1 locations anywhere in the system will 
result in data loss or unavailability. Many systems, possibly including 
Couchbase Server, have made this mistake and become *less* reliable as a 
result.  Emin Gün Sirer does a better job describing the problem - and 
solutions - than I do, here:

http://hackingdistributed.com/2014/02/14/chainsets/
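
To put a number on the intuition (a toy model of my own, not taken from
that article): count how many distinct replica sets a placement policy
actually uses, then ask what fraction of simultaneous two-brick failures
would wipe out one of them.

    from itertools import combinations

    def fatal_fraction(used_sets, num_bricks, replica=2):
        # Fraction of all possible replica-sized failures that destroy a
        # complete replica set somewhere in the volume.
        all_failures = list(combinations(range(num_bricks), replica))
        fatal = sum(1 for f in all_failures if frozenset(f) in used_sets)
        return float(fatal) / len(all_failures)

    num_bricks = 12

    # Fixed pairing (what we do today): bricks 0+1, 2+3, ... -- N/2 sets.
    fixed = {frozenset((i, i + 1)) for i in range(0, num_bricks, 2)}

    # Fully random placement eventually uses every possible pair of bricks.
    spread = {frozenset(c) for c in combinations(range(num_bricks), 2)}

    print("fixed pairing: %.2f of double failures lose data"
          % fatal_fraction(fixed, num_bricks))
    print("random spread: %.2f of double failures lose data"
          % fatal_fraction(spread, num_bricks))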
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Gluster 3.4.2 on Redhat 6.5

2014-03-27 Thread Jeff Darcy
- Original Message -

 I see two separate bugs there.

 1. A missing package requirement
 2. The process hanging in a reproducible way.
I've submitted a fix for #2. 

http://review.gluster.org/#/c/7360/ 
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Different brick sizes in a volume

2014-03-18 Thread Jeff Darcy
 On Tue, Mar 18, 2014 at 1:42 PM, Greg Waite
  I've been playing around with a 2x2 distributed replicated setup with
  replicating group 1 having a different brick size than replicating group 2.
  I've been running into out of disk errors when the smaller replicating
  pair disks fill up. I know of the minimum free disk feature which should
  prevent this issue. My question is, are there features that allow gluster
  to
  smartly use different brick sizes so extra space on larger bricks do not go
  unused?
 
 It looks like different sized bricks will be a core feature in 3.6
 (coming soon).

Correct.  In fact, a lot of the logic already exists and is even in the tree.

http://review.gluster.org/#/c/3573/

The trick now is to get that logic integrated into the place where
rebalance calculates the new layout.  Until then, you could try running
that script, but I should warn you that it hasn't been looked at for over a
year so you should try it out on a small test volume first to make sure
it's still doing the right thing(s).  I'll be glad to help with that.
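
The core idea in that logic is easy to sketch, though: give each brick a
slice of the hash space proportional to its size.  This is just an
illustration of the weighting, not the code from that review.

    def weighted_layout(brick_sizes, ring=2 ** 32):
        # Assign each brick a hash range proportional to its capacity.
        total = sum(brick_sizes.values())
        items = sorted(brick_sizes.items())
        layout, start = {}, 0
        for i, (brick, size) in enumerate(items):
            if i == len(items) - 1:
                end = ring - 1                  # last brick absorbs rounding
            else:
                end = start + ring * size // total - 1
            layout[brick] = (start, end)
            start = end + 1
        return layout

    for brick, (lo, hi) in weighted_layout(
            {"brick-a": 2000, "brick-b": 1000, "brick-c": 1000}).items():
        print("%s: 0x%08x - 0x%08x" % (brick, lo, hi))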

Another thing you can do is to divide your larger bricks in two - or three,
or whatever's necessary to even things out.  This means more ports, more
glusterfsd processes, quite possibly some performance loss as those contend
with one another, but it's something you can do *right now* that's pretty
easy and bullet-proof.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] PLEASE READ ! We need your opinion. GSOC-2014 and the Gluster community

2014-03-13 Thread Jeff Darcy
 I am a little bit impressed by the lack of action on this topic. I hate to be
 that guy, specially being new here, but it has to be done.
 If I've got this right, we have here a chance of developing Gluster even
 further, sponsored by Google, with a dedicated programmer for the summer.
 In other words, if we play our cards right, we can get a free programmer and
 at least a good start/advance on this fantastic.

Welcome, Carlos.  I think it's great that you're taking initiative here.
However, it's also important to set proper expectations for what a GSoC intern
could reasonably be expected to achieve.  I've seen some amazing stuff out of
GSoC, but if we set the bar too high then we end up with incomplete code and
the student doesn't learn much except frustration.

GlusterFS consists of 430K lines of code in the core project alone.  Most of
it's written in a style that is generally hard for newcomers to pick up -
both callback-oriented and highly concurrent, often using our own unique
interpretation of standard concepts.  It's also in an area (storage) that is
not well taught in most universities.  Given those facts and the short
duration of GSoC, it's important to focus on projects that don't require deep
knowledge of existing code, to keep the learning curve short and productive
time correspondingly high.  With that in mind, let's look at some of your
suggestions.

 I think it would be nice to listen to the COMMUNITY (yes, that means YOU),
 for either suggestions, or at least a vote.

It certainly would have been nice to have you at the community IRC meeting
yesterday, at which we discussed release content for 3.6 based on the
feature proposals here:

   http://www.gluster.org/community/documentation/index.php/Planning36

The results are here:

   http://titanpad.com/glusterfs-3-6-planning

 My opinion, being also my vote, in order of PERSONAL preference:
 1) There is a project going on ( https://forge.gluster.org/disperse ), that
 consists on re-writing the stripe module on gluster. This is specially
 important because it has a HUGE impact on Total Cost of Implementation
 (customer side), Total Cost of Ownership, and also matching what the
 competition has to offer. Among other things, it would allow gluster to
 implement a RAIDZ/RAID5 type of fault tolerance, much more efficient, and
 would, as far as I understand, allow you to use 3 nodes as a minimum
 stripe+replication. This means 25% less money in computer hardware, with
 increased data safety/resilience.

This was decided as a core feature for 3.6.  I'll let Xavier (the feature
owner) answer w.r.t. whether there's any part of it that would be
appropriate for GSoC.

 2) We have a recurring issue with split-brain solution. There is an entry on
 trello asking/suggesting a mechanism that arbitrates this resolution
 automatically. I pretty much think this could come together with another
 solution that is file replication consistency check.

This is also core for 3.6 under the name policy based split brain
resolution:

   http://www.gluster.org/community/documentation/index.php/Features/pbspbr

Implementing this feature requires significant knowledge of AFR, which both
causes split brain and would be involved in its repair.  Because it's also
one of our most complicated components, and the person who just rewrote it
won't be around to offer help, I don't think this project *as a whole*
would be a good fit for GSoC.  On the other hand, there might be specific
pieces of the policy implementation (not execution) that would be a good
fit.

 3) Accelerator node project. Some storage solutions out there offer an
 accelerator node, which is, in short, a, extra node with a lot of RAM,
 eventually fast disks (SSD), and that works like a proxy to the regular
 volumes. active chunks of files are moved there, logs (ZIL style) are
 recorded on fast media, among other things. There is NO active project for
 this, or trello entry, because it is something I started discussing with a
 few fellows just a couple of days ago. I thought of starting to play with
 RAM disks (tmpfs) as scratch disks, but, since we have an opportunity to do
 something more efficient, or at the very least start it, why not ?

Looks like somebody has read the Isilon marketing materials.  ;)

A full production-level implementation of this, with cache consistency and
so on, would be a major project.  However, a non-consistent prototype good
for specific use cases - especially Hadoop, as Jay mentions - would be
pretty easy to build.  Having a GlusterFS server (for the real clients)
also be a GlusterFS client (to the real cluster) is pretty straightforward.
Testing performance would also be a significant component of this, and IMO
that's something more developers should learn about early in their careers.
I encourage you to keep thinking about how this could be turned into a real
GSoC proposal.


Keep the ideas coming!
___
Gluster-users mailing list

Re: [Gluster-users] gfid files which are not hard links anymore

2014-03-12 Thread Jeff Darcy
 Most likely reason is that someone deleted these files manually from the
 brick directories. You must never access/modify the data from the brick
 directories directly

Unfortunately, that's exactly what users must do to resolve split-brain.
Until we implement a mechanism for people to do this through the client
mount, we need to make sure users know how to remove files properly
themselves.  Here are a couple of relevant blog posts.

http://www.gluster.org/2012/07/fixing-split-brain-with-glusterfs-3-3/
http://joejulian.name/blog/glusterfs-split-brain-recovery-made-easy/
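
For reference, the manual procedure those posts describe boils down to
the sketch below.  It is an illustration, not a supported tool: run it
only on the brick whose copy you've decided to throw away, and
double-check the paths first; self-heal then recreates the file from the
surviving replica.

    import os
    import uuid

    def remove_bad_copy(brick_root, rel_path):
        # Delete one brick's copy of a split-brained file, plus the
        # corresponding hard link under .glusterfs, so self-heal can
        # recreate it from the good replica.
        victim = os.path.join(brick_root, rel_path)
        gfid = str(uuid.UUID(bytes=os.getxattr(victim, "trusted.gfid")))
        gfid_link = os.path.join(brick_root, ".glusterfs",
                                 gfid[0:2], gfid[2:4], gfid)
        os.unlink(victim)
        if os.path.exists(gfid_link):
            os.unlink(gfid_link)

    # Hypothetical example:
    #   remove_bad_copy("/export/brick1", "dir/file.txt")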

There are also some efforts under way that should make this better in the
future.

http://www.gluster.org/community/documentation/index.php/Features/pbspbr
http://www.gluster.org/2012/06/healing-split-brain/
http://review.gluster.org/#/c/4132/
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] 3.6 Feature Go/No-go in this week's community meeting

2014-03-10 Thread Jeff Darcy
 Since the feature proposal freeze for 3.6 has happened, I am considering
 to have the 3.6 feature go/no-go decision making as part of this week's
 community meeting on Wednesday. Does that seem acceptable to all? If
 yes, this agenda item can probably consume the entire 60 minutes and we
 might have to move other agenda items to next week.

Works for me.  We really need to get this done.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] Mechanisms for automatic management of Gluster

2014-02-12 Thread Jeff Darcy
 This is along the lines of tools for sysadmins. I plan on using
 these algorithms for puppet-gluster, but will try to maintain them
 separately as a standalone tool.
 
 The problem: Given a set of bricks and servers, if they have a logical
 naming convention, can an algorithm decide the ideal order. This could
 allow parameters such as replica count, and
 chained=true/false/offset#.
 
 The second problem: Given a set of bricks in a volume, if someone adds
 X bricks and removes Y bricks, is this valid, and what is the valid
 sequence of add/remove brick commands.
 
 I've written some code with test cases to try and figure this all out.
 I've left out a lot of corner cases, but the boilerplate is there to
 make it happen. Hopefully it's self explanatory. (gluster.py) Read and
 run it.
 
 Once this all works, the puppet-gluster use case is magic. It will be
 able to take care of these operations for you (if you want).
 
 For non puppet users, this will give admins the confidence to know
 what commands they should _probably_ run in what order. I say probably
 because we assume that if there's an error, they'll stop and inspect
 first.
 
 I haven't yet tried to implement the chained cases, or anything
 involving striping. There are also some corner cases with some of the
 current code. Once you add chaining and striping, etc, I realized it
 was time to step back and ask for help :)
 
 I hope this all makes sense. Comments, code, test cases are appreciated!

It's a good start.  For the chained case, you'd probably want to start
with something like this:

# Convert the input into a list of lists like this:
#   [
#     [ 'host1', [ 'path1', 'path2', ... ] ],
#     [ 'host2', [ 'path1', 'path2', ... ] ],
#     ...
#   ]
def chain_bricks(in_list):
    out_list = []
    while in_list:
        first_host = in_list.pop(0)
        first_path = first_host[1].pop(0)
        # If there are any bricks left on this host, move the host to
        # the end so the next iteration will start with the next host.
        # Otherwise, we've used all bricks from this host so discard.
        if first_host[1]:
            in_list.append(first_host)
        second_host = in_list[0]
        second_path = second_host[1].pop(0)
        # Have we exhausted this host as well?
        if not second_host[1]:
            del in_list[0]
        out_list.append({'host': first_host[0], 'path': first_path})
        out_list.append({'host': second_host[0], 'path': second_path})
    return out_list

(I haven't actually run this.  It's merely illustrative of the algorithm.)

Can you spot the bug?  If one host has more bricks than the others, it might
run out of bricks on other hosts to pair with, so it'll end up pairing with
itself.  For example, consider the following input:

H1P1, H1P2, H1P3, H1P4, H2P1, H2P2, H3P1, H3P2

This algorithm would yield the following replica pairs.

H1P1 + H2P1
H2P2 + H3P1
H3P2 + H1P2
H1P3 + H1P4 (oops)

Instead, we need to find this:

H1P1 + H2P1
H2P2 + H1P2
H1P3 + H3P1
H3P2 + H1P4

I would actually not try to deal with this in the loop above.  Why not?
Because that loop's already going to get a bit hairy when it's enhanced to
handle replica counts greater than two.  Instead, I would deal with the
imbalance cases *up front* - check the number of bricks for each host, then
equalize them e.g. by splitting a host with many bricks into two virtual hosts
separated by enough others that they'll never pair with one another.
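
A rough sketch of that preprocessing step (untested, and the "#a"/"#b"
virtual-host names plus the splitting threshold are arbitrary choices of
mine):

    def split_oversized_hosts(in_list):
        # in_list is [[host, [path, ...]], ...], the same shape used above.
        total = sum(len(paths) for _, paths in in_list)
        avg = max(1, total // len(in_list))
        halves, rest = [], []
        for host, paths in in_list:
            if len(paths) > avg:
                mid = len(paths) // 2
                halves.append([host + "#a", paths[:mid]])
                halves.append([host + "#b", paths[mid:]])
            else:
                rest.append([host, paths])
        # Weave the halves in among the other hosts so the two pieces of
        # one host never sit next to each other (given enough other hosts).
        out, i = [], 0
        gap = max(1, len(rest) // max(1, len(halves)))
        for h in halves:
            out.append(h)
            out.extend(rest[i:i + gap])
            i += gap
        out.extend(rest[i:])
        return out

Feeding the problem input above through this and then the pairing loop
leaves every H1 brick paired with an H2 or H3 brick, i.e. no replica set
confined to a single host.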

Alternatively, one could do a recursive implementation, roughly like this:

    if less than rep_factor hosts left, fail
    pick rep_factor bricks from different hosts
    loop:
        pass remainder to recursive call
        if result is valid, combine and return
        pick a *different* rep_factor bricks from different hosts

That will generate *some* valid order if any exists, but it will tend toward
sub-optimal orders where e.g. all of X's bricks are paired with all of Y's
instead of being spread around.  There might be some sort of optimization
pass we could do that would swap replica-set members to address this, but I'm
sure you can see it's already becoming a hard problem.  I'd have to code up
full versions of both algorithms and run them on many different inputs to say
with any confidence which is better.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Atomic file updates

2014-02-12 Thread Jeff Darcy
 I'm not currently a Gluster user but I'm hoping it's the answer to a
 problem I'm working on.
 
 I manage a private web site that is basically a reporting tool for
 equipment located at several hundred sites. Each site regularly uploads
 zipped XML files to a cloud based server and this also provides a web
 interface to the data using apache/PHP. The problem I need to solve is
 that with a single server disk I/O has become a bottleneck.
 
 The plan is to use a load balancer and multiple web servers with a
 4-node Gluster volume behind to store the data. Data would be replicated
 over 2 nodes.
 
 The uploaded files are stored and then unzipped ready for reading by the
 web interface code. Each file is unzipped into a temporary file and then
 renamed, e.g.
 
 file1.xml.zip --unzip-- uniquename.tmp --rename-- file1.xml
 
 Use of the rename function makes these updates atomic.
 
 How can I achieve atomic updates in this way using a Gluster volume? My
 understanding is that renaming a file on a Gluster volume causes a link
 file to be created and that clearly wouldn't be appropriate where there
 are frequent updates.

Creating a file with one name and then renaming it to another *might*
cause creation of linkfiles, but I think concerns about linkfiles are
often overblown.  The one extra call to create a linkfile isn't much
compared to those for creating the file, writing into it, and then
renaming it even if the rename is local to one brick.  What really
matters is the performance of the entire sequence, with or without the
linkfile.
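
For reference, the write-then-rename idiom itself is nothing
Gluster-specific; in Python it's roughly:

    import os
    import tempfile

    def atomic_write(final_path, data):
        # Readers of final_path see either the old complete file or the
        # new complete file, never a partial write.
        d = os.path.dirname(final_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=d, suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())         # data on disk before the rename
            os.rename(tmp_path, final_path)  # the atomic step
        except Exception:
            os.unlink(tmp_path)
            raise

The .tmp suffix is only there to make the trick described next easy to
apply.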

That said, there's also a trick you can use to avoid creation of a
linkfile.  Other tools, such as rsync and our own object interface,
use the same write-then-rename idiom.  To serve them, there's an
option called extra-hash-regex that can be used to place files on the
right brick according to their final name even though they're created
with another.  Unfortunately, specifying that option via the command line
doesn't seem to work (it creates a malformed volfile) so you have to
mount a bit differently.  For example:

   glusterfs --volfile-server=a_server --volfile-id=a_volume \
   --xlator-option a_volume-dht.extra_hash_regex='(.*+)tmp' \
   /a/mountpoint

The important part is that second line.  That causes any file with a
tmp suffix to be hashed and placed as though only the part in the
first parenthesized part of the regex (i.e. without the tmp) was
there.  Therefore, creating xxxtmp and then renaming it to xxx is
the same as just creating xxx in the first place as far as linkfiles
etc. are concerned.  Note that the excluded part can be anything that
a regex can match, including a unique random number.  If I recall,
rsync uses temp files something like this:

   fubar = .fubar.NNN (where NNN is a random number)

I know this probably seems a little voodoo-ish, but with a little bit
of experimentation to find the right regex you should be able to avoid
those dreaded linkfiles altogether.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Atomic file updates

2014-02-12 Thread Jeff Darcy
 Are you saying that with these mount options I can just write files
 directly without using flock or renaming a temporary file, and that
 other processes trying to read the file will always see a complete and
 consistent view of the file?

For write-once files, the rename is really the key to ensuring that
readers never see an incomplete file.  If you ever rewrite a file in
place, you'll need flock to avoid reading a partially updated (i.e.
inconsistent) file.  Jay's suggestions might also be helpful even
though they both have to do with metadata, because we use attributes
to determine when it's necessary to re-read a file that might have
changed.  It's kind of up to you to determine which combination is
needed to meet your own consistency goals with your own workload.
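
The flock usage in question is ordinary advisory locking on the mount
point; a minimal sketch:

    import fcntl

    def rewrite_in_place(path, data):
        with open(path, "r+b") as f:
            fcntl.flock(f, fcntl.LOCK_EX)       # writer excludes everyone
            try:
                f.seek(0)
                f.write(data)
                f.truncate()
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)

    def read_consistent(path):
        with open(path, "rb") as f:
            fcntl.flock(f, fcntl.LOCK_SH)       # readers exclude writers only
            try:
                return f.read()
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)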
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users

