[Gluster-devel] What's the status of the quota translator?

2009-12-15 Thread Jeff Darcy
It's in the source tree, but doesn't seem to be officially documented or
supported.  I tried to use it for the first time today and the level of
breakage I found suggests that it's not really being kept up to date as
code elsewhere moves forward.  Can anyone shed any light on its status,
or tell me where/whether I should send patches?




[Gluster-devel] Faster hashing for DHT

2010-01-05 Thread Jeff Darcy
While looking at the DHT code, I noticed that it's using a 10-round
Davies-Meyer construction to generate the hashes used for file
placement.  A little surprised by this, I ran it by a couple of friends
who are experts in both cryptography and distributed data storage.  The
consensus seems to be that the hash used for this purpose needs to be
collision resistant but not cryptographically strong.  One theorized
that the choice made in DHT is probably based on prior examples (e.g.
Freenet and Mojo Nation) where cryptographically strong hashes were
chosen, but that the requirements driving those decisions probably don't
apply to GlusterFS.  This is a non-trivial issue because these hashes
are used quite frequently and the current one is quite computationally
expensive.  I note that Hsieh's SuperFastHash is already implemented in
GlusterFS and is used for other purposes.  It's about 3x as fast as the
DM hash, and has better collision resistance as well.  MurmurHash
(http://murmurhash.googlepages.com/) is even faster and more collision
resistant.  For future releases, I suggest dropping the DM hash and
switching to one of these others.
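
To make the role of the hash concrete, here's a toy sketch - illustration
only, not the actual DHT placement logic, which works on per-directory hash
ranges - of a cheap non-cryptographic string hash (FNV-1a as a stand-in)
mapping file names to brick indices.  Any of the hashes mentioned above
could fill the same role:

/* Toy sketch: map a file name to one of n_bricks with a cheap
 * non-cryptographic hash (FNV-1a as a stand-in).  Real DHT placement
 * uses per-directory hash ranges, so this only illustrates the hash's
 * role and cost, not the actual algorithm. */
#include <stdint.h>
#include <stdio.h>

static uint32_t
fnv1a (const char *s)
{
        uint32_t h = 2166136261u;
        while (*s) {
                h ^= (unsigned char) *s++;
                h *= 16777619u;
        }
        return h;
}

int
main (void)
{
        const char *names[] = { "foo.txt", "bar.txt", "baz.txt" };
        unsigned    n_bricks = 4;
        unsigned    i;

        for (i = 0; i < 3; i++)
                printf ("%s -> brick %u\n", names[i],
                        fnv1a (names[i]) % n_bricks);
        return 0;
}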




Re: [Gluster-devel] Faster hashing for DHT

2010-01-06 Thread Jeff Darcy
On 01/05/2010 07:56 PM, Martin Fick wrote:
> Hmm, if it were collision resistant, wouldn't that mean that you would need 
> one server for each file you want to store?  I suspect you want many 
> collisions, just a good even distribution of those collisions,

"Collision resistance" in this context usually refers to avoidance of
*spurious* collisions - i.e. those above the level that would occur with
an ideal distribution - so "even distribution" is a good summary of its
practical effect.




Re: [Gluster-devel] Re: [Gluster-users] I/O fair share to avoid I/O bottlenecks on small clsuters

2010-02-01 Thread Jeff Darcy
On 01/31/2010 09:06 AM, Ran wrote:
> You guys are talking about network IO im taking about the gluster server disk 
> IO
> the idea to shape the trafic does make sence seens the virt machines
> server do use network to get to the disks(gluster)
> but what about if there are say 5 KVM servers(with VPS's) all on
> gluster what do you do then ? its not quite fair share seens every
> server has its own fair share and doesnt see the others .
> 
> Also there are other applications that uses gluster like mail etc..
> and i see that gluster IO is very high very often cousing the all
> storage not to work .
> Its very disturbing .


You bring up a good set of points.  Some of these problems can be
addressed at the hypervisor (i.e. GlusterFS client) level, some can be
addressed by GlusterFS itself, and some can be addressed only at the
local-filesystem or block-device level on the GlusterFS
servers.  Unfortunately, I/O traffic shaping is still in its infancy
compared to what's available for networking - or perhaps even "infancy"
is too generous.  As far as the I/O stack is concerned, all of the
traffic is coming from the glusterfsd process(es) without
differentiation, so even if the functionality to apportion I/O amongst
tasks existed it wouldn't be usable without more information.  Maybe
some day...

What you can do now at the GlusterFS level, though, is make sure that
traffic is distributed across many servers and possibly across many
volumes per server to take advantage of multiple physical disks and/or
interconnects for one server.  That way, a single VM will only use a
small subset of the servers/volumes and will not starve other clients
that are using different servers/volumes (except for network bottlenecks
which are a separate issue).  That's what the "distribute" translator is
for, and it can be combined with replicate or stripe to provide those
functions as well.  Perhaps it would be useful to create and publish
some up-to-date recipes for these sorts of combinations.




Re: [Gluster-devel] Re: [Gluster-users] I/O fair share to avoid I/O bottlenecks on small clsuters

2010-02-01 Thread Jeff Darcy
On 02/01/2010 10:14 AM, Gordan Bobic wrote:
> Optimizing 
> file systems is a relatively complex thing and a lot of the conventional 
> wisdom is just plain wrong at times.

After approximately fifteen years of doing that kind of tuning, I
couldn't agree more.

>> Unfortunately, I/O traffic shaping is still in its infancy
>> compared to what's available for networking - or perhaps even "infancy"
>> is too generous.  As far as the I/O stack is concerned, all of the
>> traffic is coming from the glusterfsd process(es) without
>> differentiation, so even if the functionality to apportion I/O amongst
>> tasks existed it wouldn't be usable without more information.  Maybe
>> some day...
> 
> I don't think this would even be useful. It sounds like seeking more 
> finely grained (sub-process level!) control over disk I/O prioritisation 
> without there even being a clearly presented case about the current 
> functionality (ionice) not being sufficient.

Does such a case really need to be made explicitly?  Dividing processes
into classes is all well and good, but there can still be contention
between processes in the same class.  Being able to resolve that
contention in a fair and/or deterministic way is still useful, and still
unaddressed.

In any case, that might be a moot point.  I interpreted Ran's problem as
VMs running on GlusterFS *clients* causing contention at the GlusterFS
*servers*.  Maybe that was incorrect, but even if Ran doesn't face that
problem others do.  I certainly see and hear about it a lot from where I
sit at Red Hat, and no amount of tweaking at the hypervisor (i.e.
GlusterFS client) level will solve it.

> Hold on, you seem to be talking about something else here. You're 
> talking about clients not distributing their requests evenly across 
> servers. Is that really what the original problem was about?

My reading of Ran's mail at 09:13am on 01/30 says yes, but greater
clarity would certainly be welcome.




Re: [Gluster-devel] ping timeout

2010-03-23 Thread Jeff Darcy
On 03/23/2010 03:23 PM, Ed W wrote:
> I'm not an active Glusterfs user yet, but what worries me about gluster 
> is this very casual attitude to split brain...  Other cluster solutions 
> take outages extremely seriously to the point they fence off the downed 
> server until it's guaranteed back into a synchronised state...

I'm not sure I'd say the attitude is casual, so much as that it
emphasizes availability over consistency.

> Once a machine has gone down then it should be fenced off and not be 
> allowed to serve files again until it's fully synced - otherwise you are 
> just asking for a set of circumstances (however, unlikely) to cause the 
> out of date data to be served...

This is a very common approach to a very common problem in clustered
systems, but it does require server-to-server communication (which
GlusterFS has historically avoided).

> A superb solution would be for the replication tracker to actually log 
> and mark dirty anything it can't fully replicate. When the replication 
> partner comes back up these could then be treated as a priority sync 
> list to get the servers back up to date?

To put a slight twist on that, it would be nice if clients knew which
servers were still in catch-up mode, and not direct traffic to them
except as part of the catch-up process.  That process, in turn, should
be based on precise logging of changes on the survivors so that only an
absolute minimum of files need to be touched.  That's kind of a whole
different replication architecture, but IMO it would be better for local
replication and practically necessary for wide-area.




Re: [Gluster-devel] Transparent encryption in GlusterFS

2011-05-06 Thread Jeff Darcy
On 05/05/2011 04:23 PM, Edward Shishkin wrote:
> The straightforward solution is to serialize read-modify-writes.
> I wonder if GlusterFS has any per-file serialization means,
> that would allow to resolve this problem. Or maybe there are
> possibilities to create such means. Any hints would be highly
> appreciated.

At a first approximation, you could just wrap the read-modify-write in
POSIX locks. That would conflict with other uses of POSIX locks, though,
and might not address the issue of "self-conflict" induced e.g. by some
of the performance translators issuing parallel writes to the same fd.
There is an "oplock" translator in CloudFS which was co-developed with
the encryption translator you're working on and which attempts to
provide the necessary conflict detection without scalability-destroying
serialization. The code does need some improvement, though, as has been
discussed on the cloudfs-devel thread you started at
https://fedorahosted.org/pipermail/cloudfs-devel/2011-May/38.html.
In particular, we need to address not just race conditions but also e.g.
forward-progress guarantees, and (as I said in that thread) I think
judicious use of server-side request queuing is the way to do that.
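
To make that "first approximation" concrete, here's a minimal sketch (mine,
not CloudFS code) of a read-modify-write on one block serialized with a POSIX
byte-range lock via fcntl.  The oplock translator exists precisely because
this kind of serialization is what we want to avoid:

/* Sketch only (not CloudFS/GlusterFS code): serialize a read-modify-write
 * of one 4 KiB block with a POSIX byte-range lock.  This is the naive
 * approach; the oplock translator tries to get the same safety without
 * this serialization. */
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#define BLOCK 4096

int
locked_rmw (int fd, off_t off, void (*modify) (char *))
{
        char          buf[BLOCK];
        ssize_t       n  = -1;
        struct flock  lk = {
                .l_type   = F_WRLCK,
                .l_whence = SEEK_SET,
                .l_start  = off,
                .l_len    = BLOCK,
        };

        if (fcntl (fd, F_SETLKW, &lk) < 0)        /* take the lock (blocking) */
                return -1;

        n = pread (fd, buf, BLOCK, off);          /* read             */
        if (n >= 0) {
                memset (buf + n, 0, BLOCK - n);   /* pad a short read */
                modify (buf);                     /* modify           */
                n = pwrite (fd, buf, BLOCK, off); /* write            */
        }

        lk.l_type = F_UNLCK;                      /* drop the lock    */
        fcntl (fd, F_SETLK, &lk);
        return (n < 0) ? -1 : 0;
}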

This kind of synchronization is needed for other things besides
encryption, by the way. For example, I've considered adding a
data-integrity translator using checksums or hashes. That would run into
exactly the same need for atomic read-modify-write sequences, requiring
exactly the same kind of coordination, so as we design this we should
try to account for the fact that there might be multiple concurrent
users (translators) at different levels.



Re: [Gluster-devel] xattr support

2011-05-06 Thread Jeff Darcy
On 05/06/2011 08:47 AM, Hans K. Rosbach wrote:
> Hi, I am receiving mixed signals on whether GlusterFS actually
> support xattrs on the client side. I am told that it does support
> xattrs but my tests seems to indicate that it does not. I also
> seem to recollect reading a post to this list saying that it does
> not support xattrs, but I am not able to find it now.
> 
> Also a google search pretty much only tells me what I already
> know; that GlusterFS uses xattrs extensively on the server side.
> 
> Here is a test on the clients root filesystem (ext4):
> [root /]# touch testfile
> [root /]# attr -s testattr -V 2 testfile

When I attempt this same operation on an ext4 filesystem, I get
EOPNOTSUPP. When I do "man attr" it tells me why:

attr - extended attributes on XFS filesystem objects

The filesystem-independent programs for manipulating xattrs are getfattr
and setfattr. When I do this it works fine:

setfattr -n user.testattr -v "foo" testfile

Note that, in addition to using setfattr instead of attr, the above
conforms to the namespace requirements of generic xattrs by using the
"user" namespace - which depends on the filesystem being mounted with
the "user_xattr" flag.  Use "man 5 attr" for more information about
how to use non-XFS xattrs.
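
The same check can be done from C; here's a small sketch (mine, not from the
original exchange) using the Linux xattr syscalls.  Note the mandatory
namespace prefix, and that it fails with EOPNOTSUPP without user_xattr:

/* Sketch: set and read back a user-namespace xattr on an existing file.
 * On a filesystem mounted without user_xattr this fails with EOPNOTSUPP. */
#include <sys/xattr.h>
#include <stdio.h>

int
main (void)
{
        char value[64] = "";

        if (setxattr ("testfile", "user.testattr", "foo", 3, 0) < 0) {
                perror ("setxattr");
                return 1;
        }
        if (getxattr ("testfile", "user.testattr", value,
                      sizeof (value) - 1) < 0) {
                perror ("getxattr");
                return 1;
        }
        printf ("user.testattr = %s\n", value);
        return 0;
}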



Re: [Gluster-devel] xattr support

2011-05-06 Thread Jeff Darcy
On 05/06/2011 10:31 AM, Hans K. Rosbach wrote:
> I see what you mean. However this test gives the same results:
> Ext4:
> [root /]# setfattr -n user.testattr -v "foo" testfile
> 
> GlusterFS:
> [root storage1]# setfattr -n user.testattr -v "foo" glusterfile
> setfattr: glusterfile: Operation not supported
> 
> So, does GlusterFS not support xattrs or is my config somehow wrong?

GlusterFS does support xattrs - I've worked extensively with this
functionality myself - so it must be the config. Are you sure that the
server-side bricks which make up your GlusterFS filesystem are on
filesystems mounted with user_xattr?



Re: [Gluster-devel] xattr support

2011-05-06 Thread Jeff Darcy
On 05/06/2011 10:53 AM, Hans K. Rosbach wrote:
>> GlusterFS does support xattrs - I've worked extensively with this 
>> functionality myself - so it must be the config. Are you sure that
>> the server-side bricks which make up your GlusterFS filesystem are
>> on filesystems mounted with user_xattr?
> 
> Thank you, that was indeed the fault.

You're welcome.

> I had thought that GlusterFS would not have worked at all without 
> user_xattr on the server side, so it did not even occur to me to
> check the server side mount. A server log message about this would
> probably have been nice.

The xattrs used internally are all in the "trusted" namespace so they
don't require user_xattr.
http://cloudfs.org/2011/04/glusterfs-extended-attributes/ has some more
information about internal xattr usage; please let me know if there's
information you'd like that's missing.
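
A quick way to see those internal attributes is to look directly at a file on
the brick as root; here's a sketch (mine, just for illustration).  The
user_xattr mount option only gates the "user." namespace, while "trusted."
attributes are always available to privileged processes:

/* Sketch: list the xattrs stored on a brick-side file.  Run as root on
 * the brick filesystem; "trusted." names are only visible to privileged
 * processes, and the user_xattr mount option only gates "user." names. */
#include <sys/xattr.h>
#include <string.h>
#include <stdio.h>

int
main (int argc, char **argv)
{
        char     names[4096];
        ssize_t  len, i;

        if (argc != 2)
                return 1;
        len = listxattr (argv[1], names, sizeof (names));
        if (len < 0) {
                perror ("listxattr");
                return 1;
        }
        for (i = 0; i < len; i += strlen (names + i) + 1)
                printf ("%s\n", names + i);
        return 0;
}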

> I am happy to say that I now have no remaining issues with our 
> glusterfs setup whatsoever after 2 months of extensive testing. The
> problems that did show up were quickly corrected in 3.1.3 and 3.1.4.
> We are now looking at going to production with the system.

That's awesome.  The more the merrier.  :)



[Gluster-devel] Multi-threaded socket code

2011-06-01 Thread Jeff Darcy
This is the infamous "gatling gun" code to allow greater parallelism
within the socket transport. It's based on my SSL-transport patch
(https://github.com/gluster/glusterfs/pull/2). The two are related
because a lot of work occurs within SSL_read/SSL_write and the inherent
single-threading of the socket code via a single polling thread severely
impacts performance in that case. In any case, this is purely a proof of
concept or request for comments at this point, due to several
deficiencies that I'll get to in a moment. To expand a bit on the commit
comment...


Good: this yields a >2x performance improvement in my tests using SSL. On the
24-core/48GB/10GbE machines in the lab, "iozone -r 1m -i 0 -l 24"
improves from 185MB/s to over 400MB/s between a single client and single
server using SSL (850MB/s without is typical) and parallel threads go
from ~2.5 to ~7.5 (even with io-threads in both cases). There might even
be some performance benefit in non-SSL cases, e.g. a single client connecting to
many servers, but that's just icing on the cake.

Bad: the code doesn't clean up on disconnect properly, doesn't work well
with non-blocking I/O (which is rather pointless with this code anyway),
and there seems to be some bad interaction with the glusterd port
mapper. Since CloudFS doesn't use that port mapper for other reasons,
it's not affected and I'm tempted not to care, but I guess I should
debug that some day.

Ugly: the management code is very racy, and those races are tickled by
the new threading milieu that socket_poller introduces. The patch
already fixes one pre-existing race in which glusterfs_mgmt_init sets up
a callback before setting up pointers needed by code within that
callback, but there's at least one other serious problem I'm aware of.
Some of the management code (e.g. sending the commit reply for a "volume
create" request) calls socket_submit_reply and then immediately frees
some of the data being sent, so if the message isn't sent synchronously
then the other side gets an invalid message type. Sending synchronously
is the normal case, and it's unlikely that the socket buffers will fill
up on this low-traffic path so that a deferred send will be necessary,
but it is possible. I haven't gone through the inordinately convoluted
code that assembles these messages to figure out exactly where the error
lies, and frankly I'm not wild about debugging that to deal with a
problem that pre-dates my changes. While a deferred send on this
low-traffic management path is unlikely, it has always been possible, and
code that frees data before it's sure to have been sent has always been
erroneous.
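
For readers who don't want to wade through the patch below, here is a heavily
distilled, hypothetical sketch of the "one polling thread per connection"
idea - plain sockets, no SSL, none of the real transport plumbing:

/* Distilled sketch of the "own thread per connection" idea: each accepted
 * socket gets a thread that blocks in poll()/read() instead of sharing one
 * event loop.  Hypothetical standalone code, not the socket_poller patch. */
#include <poll.h>
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>

static void *
conn_poller (void *arg)
{
        int            fd  = (int) (intptr_t) arg;
        struct pollfd  pfd = { .fd = fd, .events = POLLIN | POLLPRI };
        char           buf[4096];
        ssize_t        n;

        for (;;) {
                if (poll (&pfd, 1, -1) < 0)
                        break;
                if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
                        break;
                n = read (fd, buf, sizeof (buf));
                if (n <= 0)
                        break;
                /* ...hand the bytes to the RPC (or SSL) layer here... */
        }
        close (fd);
        return NULL;
}

/* Called from the accept loop for each new connection. */
int
spawn_conn_thread (int fd)
{
        pthread_t tid;

        if (pthread_create (&tid, NULL, conn_poller,
                            (void *) (intptr_t) fd) != 0)
                return -1;
        return pthread_detach (tid);
}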
From 2cf2bea12c8639355bf3561dc772cad773746026 Mon Sep 17 00:00:00 2001
From: Jeff Darcy 
Date: Tue, 31 May 2011 12:13:30 -0400
Subject: [PATCH] Use separate polling thread for each connection.

Good: 2x performance with SSL transport
Bad: doesn't work with portmapper, doesn't clean up properly on disconnect
Ugly: there's still a race with mgmt code calling submit_reply and then
  freeing our data out from under us if the message isn't sent
  synchronously
---
 glusterfsd/src/glusterfsd-mgmt.c  |3 +-
 rpc/rpc-transport/socket/src/socket.c |  257 +
 rpc/rpc-transport/socket/src/socket.h |2 +
 3 files changed, 167 insertions(+), 95 deletions(-)

diff --git a/glusterfsd/src/glusterfsd-mgmt.c b/glusterfsd/src/glusterfsd-mgmt.c
index 1f5f648..413790b 100644
--- a/glusterfsd/src/glusterfsd-mgmt.c
+++ b/glusterfsd/src/glusterfsd-mgmt.c
@@ -877,6 +877,8 @@ glusterfs_mgmt_init (glusterfs_ctx_t *ctx)
 gf_log ("", GF_LOG_WARNING, "failed to create rpc clnt");
 goto out;
 }
+	/* This is used from within mgmt_rpc_notify, so LET'S SET IT FIRST! */
+ctx->mgmt = rpc;
 
 ret = rpc_clnt_register_notify (rpc, mgmt_rpc_notify, THIS);
 if (ret) {
@@ -894,7 +896,6 @@ glusterfs_mgmt_init (glusterfs_ctx_t *ctx)
 if (ret)
 goto out;
 
-ctx->mgmt = rpc;
 out:
 return ret;
 }
diff --git a/rpc/rpc-transport/socket/src/socket.c b/rpc/rpc-transport/socket/src/socket.c
index dc84da7..31e8eac 100644
--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -1,4 +1,5 @@
 /*
+ * #endif
   Copyright (c) 2010 Gluster, Inc. <http://www.gluster.com>
   This file is part of GlusterFS.
 
@@ -51,6 +52,10 @@
 #define SSL_PRIVATE_KEY_OPT "transport.socket.ssl-private-key"
 #define SSL_CA_LIST_OPT "transport.socket.ssl-ca-list"
 
+#define POLL_MASK_INPUT  (POLLIN | POLLPRI)
+#define POLL_MASK_OUTPUT (POLLOUT)
+#define POLL_MASK_ERROR  (POLLERR | POLLHUP | POLLNVAL)
+
 #define __socket_proto_reset_

Re: [Gluster-devel] limiting client trust

2011-06-08 Thread Jeff Darcy
On 06/08/2011 08:25 AM, Emmanuel Dreyfus wrote:
> Hello
> 
> As far as I understand, a glusterfs server fully trusts the clients
> regarding uid/gid. It behaves just like NFS with -maproot=root.
> 
> It would be interesting to have the ability to limit the trust. 
> For instance, one could say that 192.0.2/24 can only perform file
> operations with calling user uid range within 1000-2000.
> 
> I am ready to contribute a xlator for that.

As an alternative, might I suggest CloudFS? It's essentially a set of
GlusterFS translators, one of which not only limits client operations to
a specific UID/GID range but also dynamically maps between the client
and server UIDs based on the client machine's identity (which itself can
be determined in multiple ways including SSL authentication). In fact,
this translator was just merged up to the CloudFS master branch
yesterday, so now would be an excellent time for someone to try it and
provide feedback.

http://cloudfs.org/cloudfs-overview/
http://git.fedorahosted.org/git/?p=CloudFS.git
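
To illustrate the idea (this is just a sketch of the concept, not the actual
CloudFS translator): limit the acceptable client UID range and map it into a
per-tenant range on the server:

/* Concept sketch only: accept a configured client UID range and map it
 * into a per-tenant range on the server.  Not the CloudFS translator. */
#include <stdint.h>

struct tenant_map {
        uint32_t client_low;    /* e.g. 1000 */
        uint32_t client_high;   /* e.g. 2000 */
        uint32_t server_base;   /* start of this tenant's UID range on the server */
};

/* Returns the mapped server UID, or -1 if the client UID is out of range. */
int64_t
map_client_uid (const struct tenant_map *m, uint32_t client_uid)
{
        if (client_uid < m->client_low || client_uid > m->client_high)
                return -1;
        return (int64_t) m->server_base + (client_uid - m->client_low);
}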




[Gluster-devel] [PATCH BUG:2999 1/1] Add SSL and multi-threading to socket transport.

2011-06-13 Thread Jeff Darcy


Signed-off-by: Jeff Darcy 
---
 rpc/rpc-transport/socket/src/socket.c |  596
+
 rpc/rpc-transport/socket/src/socket.h |   15 +
 2 files changed, 538 insertions(+), 73 deletions(-)

diff --git a/rpc/rpc-transport/socket/src/socket.c
b/rpc/rpc-transport/socket/src/socket.c
index 2948621..b52baaa 100644
--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -47,6 +47,14 @@
 #define GF_LOG_ERRNO(errno) ((errno == ENOTCONN) ? GF_LOG_DEBUG :
GF_LOG_ERROR)
 #define SA(ptr) ((struct sockaddr *)ptr)

+#define SSL_OWN_CERT_OPT"transport.socket.ssl-own-cert"
+#define SSL_PRIVATE_KEY_OPT "transport.socket.ssl-private-key"
+#define SSL_CA_LIST_OPT "transport.socket.ssl-ca-list"
+#define OWN_THREAD_OPT  "transport.socket.own-thread"
+
+#define POLL_MASK_INPUT  (POLLIN | POLLPRI)
+#define POLL_MASK_OUTPUT (POLLOUT)
+#define POLL_MASK_ERROR  (POLLERR | POLLHUP | POLLNVAL)

 #define __socket_proto_reset_pending(priv) do { \
 memset (&priv->incoming.frag.vector, 0, \
@@ -133,9 +141,127 @@
 __socket_proto_update_priv_after_read (priv, ret,
bytes_read); \
 }

-
 int socket_init (rpc_transport_t *this);

+int
+ssl_setup_connection (socket_private_t *priv, int server)
+{
+   X509 *peer;
+   char  peer_CN[256];
+   int   ret;
+
+   priv->ssl_ssl = SSL_new(priv->ssl_ctx);
+   priv->ssl_sbio = BIO_new_socket(priv->sock,BIO_NOCLOSE);
+   SSL_set_bio(priv->ssl_ssl,priv->ssl_sbio,priv->ssl_sbio);
+   if (server) {
+   ret = SSL_accept(priv->ssl_ssl);
+   }
+   else {
+   ret = SSL_connect(priv->ssl_ssl);
+   }
+   if (ret >= 0) {
+   gf_log(__func__,GF_LOG_DEBUG,"verify_result = %lu (%d)",
+  SSL_get_verify_result(priv->ssl_ssl), X509_V_OK);
+   peer = SSL_get_peer_certificate(priv->ssl_ssl);
+   if (peer) {
+   X509_NAME_get_text_by_NID(X509_get_subject_name(peer),
+   NID_commonName, peer_CN, sizeof(peer_CN)-1);
+   peer_CN[sizeof(peer_CN)-1] = '\0';
+   gf_log(__func__,GF_LOG_DEBUG,"peer CN = %s", peer_CN);
+   }
+   }
+   else {
+   unsigned long errnum;
+   char  errbuf[120];
+
+   gf_log(__func__,GF_LOG_ERROR,"connect error %d",
+  SSL_get_error(priv->ssl_ssl,ret));
+   while ((errnum = ERR_get_error())) {
+   ERR_error_string(errnum,errbuf);
+   gf_log(__func__,GF_LOG_ERROR,"  %s",errbuf);
+   }
+   }
+   return ret;
+}
+
+int
+ssl_write_one (socket_private_t *priv, void *buf, size_t len)
+{
+   int   r;
+   struct pollfd pfd;
+
+   for (;;) {
+   r = SSL_write(priv->ssl_ssl,buf,len);
+   switch (SSL_get_error(priv->ssl_ssl,r)) {
+   case SSL_ERROR_NONE:
+   return r;
+   case SSL_ERROR_WANT_READ:
+   pfd.fd = priv->sock;
+   pfd.events = POLLIN;
+   if (poll(&pfd,1,-1) < 0) {
+   gf_log(__func__,GF_LOG_ERROR,"poll error %d",
+  errno);
+   return -1;
+   }
+   break;
+   case SSL_ERROR_WANT_WRITE:
+   pfd.fd = priv->sock;
+   pfd.events = POLLOUT;
+   if (poll(&pfd,1,-1) < 0) {
+   gf_log(__func__,GF_LOG_ERROR,"poll error %d",
+  errno);
+   return -1;
+   }
+   break;
+   default:
+   gf_log(__func__,GF_LOG_ERROR,"SSL error %lu",
+  ERR_peek_error());
+   errno = EIO;
+   return -1;
+   }
+   }
+}
+
+int
+ssl_read_one (socket_private_t *priv, void *buf, size_t len)
+{
+   int   r;
+   struct pollfd pfd;
+
+   for (;;) {
+   r = SSL_read(priv->ssl_ssl,buf,len);
+   switch (SSL_get_error(priv->ssl_ssl,r)) {
+   case SSL_ERROR_NONE:
+   return r;
+   case SSL_ERROR_ZERO_RETURN:
+   return 0;
+   case SSL_ERROR_WANT_READ:
+   pfd.fd = priv->sock;
+   pfd.events = POLLIN;
+   if (poll(&pfd,1,-1) < 0) {
+   gf_lo

[Gluster-devel] [PATCH BUG:3020 1/1] Fix duplicate quota/marker symbols.

2011-06-13 Thread Jeff Darcy


Signed-off-by: Jeff Darcy 
---
 rpc/rpc-transport/socket/src/socket.c |  596
+
 rpc/rpc-transport/socket/src/socket.h |   15 +
 2 files changed, 538 insertions(+), 73 deletions(-)

diff --git a/rpc/rpc-transport/socket/src/socket.c
b/rpc/rpc-transport/socket/src/socket.c
index 2948621..b52baaa 100644
--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -47,6 +47,14 @@
 #define GF_LOG_ERRNO(errno) ((errno == ENOTCONN) ? GF_LOG_DEBUG :
GF_LOG_ERROR)
 #define SA(ptr) ((struct sockaddr *)ptr)

+#define SSL_OWN_CERT_OPT"transport.socket.ssl-own-cert"
+#define SSL_PRIVATE_KEY_OPT "transport.socket.ssl-private-key"
+#define SSL_CA_LIST_OPT "transport.socket.ssl-ca-list"
+#define OWN_THREAD_OPT  "transport.socket.own-thread"
+
+#define POLL_MASK_INPUT  (POLLIN | POLLPRI)
+#define POLL_MASK_OUTPUT (POLLOUT)
+#define POLL_MASK_ERROR  (POLLERR | POLLHUP | POLLNVAL)

 #define __socket_proto_reset_pending(priv) do { \
 memset (&priv->incoming.frag.vector, 0, \
@@ -133,9 +141,127 @@
 __socket_proto_update_priv_after_read (priv, ret,
bytes_read); \
 }

-
 int socket_init (rpc_transport_t *this);

+int
+ssl_setup_connection (socket_private_t *priv, int server)
+{
+   X509 *peer;
+   char  peer_CN[256];
+   int   ret;
+
+   priv->ssl_ssl = SSL_new(priv->ssl_ctx);
+   priv->ssl_sbio = BIO_new_socket(priv->sock,BIO_NOCLOSE);
+   SSL_set_bio(priv->ssl_ssl,priv->ssl_sbio,priv->ssl_sbio);
+   if (server) {
+   ret = SSL_accept(priv->ssl_ssl);
+   }
+   else {
+   ret = SSL_connect(priv->ssl_ssl);
+   }
+   if (ret >= 0) {
+   gf_log(__func__,GF_LOG_DEBUG,"verify_result = %lu (%d)",
+  SSL_get_verify_result(priv->ssl_ssl), X509_V_OK);
+   peer = SSL_get_peer_certificate(priv->ssl_ssl);
+   if (peer) {
+   X509_NAME_get_text_by_NID(X509_get_subject_name(peer),
+   NID_commonName, peer_CN, sizeof(peer_CN)-1);
+   peer_CN[sizeof(peer_CN)-1] = '\0';
+   gf_log(__func__,GF_LOG_DEBUG,"peer CN = %s", peer_CN);
+   }
+   }
+   else {
+   unsigned long errnum;
+   char  errbuf[120];
+
+   gf_log(__func__,GF_LOG_ERROR,"connect error %d",
+  SSL_get_error(priv->ssl_ssl,ret));
+   while ((errnum = ERR_get_error())) {
+   ERR_error_string(errnum,errbuf);
+   gf_log(__func__,GF_LOG_ERROR,"  %s",errbuf);
+   }
+   }
+   return ret;
+}
+
+int
+ssl_write_one (socket_private_t *priv, void *buf, size_t len)
+{
+   int   r;
+   struct pollfd pfd;
+
+   for (;;) {
+   r = SSL_write(priv->ssl_ssl,buf,len);
+   switch (SSL_get_error(priv->ssl_ssl,r)) {
+   case SSL_ERROR_NONE:
+   return r;
+   case SSL_ERROR_WANT_READ:
+   pfd.fd = priv->sock;
+   pfd.events = POLLIN;
+   if (poll(&pfd,1,-1) < 0) {
+   gf_log(__func__,GF_LOG_ERROR,"poll error %d",
+  errno);
+   return -1;
+   }
+   break;
+   case SSL_ERROR_WANT_WRITE:
+   pfd.fd = priv->sock;
+   pfd.events = POLLOUT;
+   if (poll(&pfd,1,-1) < 0) {
+   gf_log(__func__,GF_LOG_ERROR,"poll error %d",
+  errno);
+   return -1;
+   }
+   break;
+   default:
+   gf_log(__func__,GF_LOG_ERROR,"SSL error %lu",
+  ERR_peek_error());
+   errno = EIO;
+   return -1;
+   }
+   }
+}
+
+int
+ssl_read_one (socket_private_t *priv, void *buf, size_t len)
+{
+   int   r;
+   struct pollfd pfd;
+
+   for (;;) {
+   r = SSL_read(priv->ssl_ssl,buf,len);
+   switch (SSL_get_error(priv->ssl_ssl,r)) {
+   case SSL_ERROR_NONE:
+   return r;
+   case SSL_ERROR_ZERO_RETURN:
+   return 0;
+   case SSL_ERROR_WANT_READ:
+   pfd.fd = priv->sock;
+   pfd.events = POLLIN;
+   if (poll(&pfd,1,-1) < 0) {
+   gf_lo

[Gluster-devel] [PATCH BUG:3020 1/1] Fix duplicate quota/marker symbols.

2011-06-13 Thread Jeff Darcy
Resolved by changing some of the marker symbols so they don't conflict.
(Sorry about attaching the wrong patch previously)

Signed-off-by: Jeff Darcy 
---
 xlators/features/marker/src/marker-quota-helper.c |8 
 xlators/features/marker/src/marker-quota-helper.h |4 ++--
 xlators/features/marker/src/marker-quota.c|   16 
 xlators/features/marker/src/marker-quota.h|2 +-
 xlators/features/marker/src/marker.c  |2 +-
 5 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/xlators/features/marker/src/marker-quota-helper.c
b/xlators/features/marker/src/marker-quota-helper.c
index fba2cdd..358531d 100644
--- a/xlators/features/marker/src/marker-quota-helper.c
+++ b/xlators/features/marker/src/marker-quota-helper.c
@@ -28,7 +28,7 @@
 #include "marker-mem-types.h"

 int
-quota_loc_fill (loc_t *loc, inode_t *inode, inode_t *parent, char *path)
+mquota_loc_fill (loc_t *loc, inode_t *inode, inode_t *parent, char *path)
 {
 int ret = -1;

@@ -65,7 +65,7 @@ loc_wipe:


 int32_t
-quota_inode_loc_fill (const char *parent_gfid, inode_t *inode, loc_t *loc)
+mquota_inode_loc_fill (const char *parent_gfid, inode_t *inode, loc_t *loc)
 {
 char*resolvedpath = NULL;
 inode_t *parent   = NULL;
@@ -93,7 +93,7 @@ ignore_parent:
 if (ret < 0)
 goto err;

-ret = quota_loc_fill (loc, inode, parent, resolvedpath);
+ret = mquota_loc_fill (loc, inode, parent, resolvedpath);
 if (ret < 0)
 goto err;

@@ -314,7 +314,7 @@ quota_inode_ctx_new (inode_t * inode, xlator_t *this)
 }

 quota_local_t *
-quota_local_new ()
+mquota_local_new ()
 {
 int32_t ret = -1;
 quota_local_t  *local   = NULL;
diff --git a/xlators/features/marker/src/marker-quota-helper.h
b/xlators/features/marker/src/marker-quota-helper.h
index 9a24c8c..6432351 100644
--- a/xlators/features/marker/src/marker-quota-helper.h
+++ b/xlators/features/marker/src/marker-quota-helper.h
@@ -60,10 +60,10 @@ int32_t
 delete_contribution_node (dict_t *, char *, inode_contribution_t *);

 int32_t
-quota_inode_loc_fill (const char *, inode_t *, loc_t *);
+mquota_inode_loc_fill (const char *, inode_t *, loc_t *);

 quota_local_t *
-quota_local_new ();
+mquota_local_new ();

 quota_local_t *
 quota_local_ref (quota_local_t *);
diff --git a/xlators/features/marker/src/marker-quota.c
b/xlators/features/marker/src/marker-quota.c
index 18d76dc..3464e2a 100644
--- a/xlators/features/marker/src/marker-quota.c
+++ b/xlators/features/marker/src/marker-quota.c
@@ -689,7 +689,7 @@ update_dirty_inode (xlator_t *this,

 mq_assign_lk_owner (this, frame);

-local = quota_local_new ();
+local = mquota_local_new ();
 if (local == NULL)
 goto fr_destroy;

@@ -847,7 +847,7 @@ wind:
 goto err;
 }

-local = quota_local_new ();
+local = mquota_local_new ();
 if (local == NULL)
 goto free_size;

@@ -897,7 +897,7 @@ get_parent_inode_local (xlator_t *this,
quota_local_t *local)

 loc_wipe (&local->parent_loc);

-quota_inode_loc_fill (NULL, local->loc.parent, &local->parent_loc);
+mquota_inode_loc_fill (NULL, local->loc.parent,
&local->parent_loc);

 ret = quota_inode_ctx_get (local->loc.inode, this, &ctx);
 if (ret < 0)
@@ -1434,7 +1434,7 @@ start_quota_txn (xlator_t *this, loc_t *loc,

 mq_assign_lk_owner (this, frame);

-local = quota_local_new ();
+local = mquota_local_new ();
 if (local == NULL)
 goto fr_destroy;

@@ -1444,7 +1444,7 @@ start_quota_txn (xlator_t *this, loc_t *loc,
 if (ret < 0)
 goto fr_destroy;

-ret = quota_inode_loc_fill (NULL, local->loc.parent,
+ret = mquota_inode_loc_fill (NULL, local->loc.parent,
 &local->parent_loc);
 if (ret < 0)
 goto fr_destroy;
@@ -1862,7 +1862,7 @@ reduce_parent_size (xlator_t *this, loc_t *loc)
 if (contribution == NULL)
 goto out;

-local = quota_local_new ();
+local = mquota_local_new ();
 if (local == NULL) {
 ret = -1;
 goto out;
@@ -1875,7 +1875,7 @@ reduce_parent_size (xlator_t *this, loc_t *loc)
 local->ctx = ctx;
 local->contri = contribution;

-ret = quota_inode_loc_fill (NULL, loc->parent, &local->parent_loc);
+ret = mquota_inode_loc_fill (NULL, loc->parent,
&local->parent_loc);
 if (ret < 0)
 goto out;

@@ -1946,7 +1946,7 @@ out:
 }

 int32_t
-quota_forget (xlator_t *this, quota_inode_ctx_t *ctx)
+mquota_forget (xlator_t *this, quota_inode_ctx_t *ctx)
 {
 inode_contribution_t *contri = NULL;
 

[Gluster-devel] [PATCH BUG:2999] Add SSL-based authorization as well as authentication.

2011-06-24 Thread Jeff Darcy
From 621b5e9573179818f23cb6d749794d8f34b5e885 Mon Sep 17 00:00:00 2001

This code checks whether a particular user (as authenticated by SSL)
should be allowed to connect to a particular brick, instead of allowing
any authenticated user to connect to any brick.  This only matters if
multiple bricks are exported through a single protocol/server instance.
When using Gluster tools this won't be the case because volfiles are
written to associate only one brick with each server, so each server can
just use a different valid-certificate list (ssl-ca-list).  With the
CloudFS tools multiple bricks are associated with each server, so that
wouldn't work.  This method also allows unauthorized connections to fail
more cleanly at the gf_auth level with error messages and such, instead
of failing at the SSL level due to lack of an accepted certificate.

Signed-off-by: Jeff Darcy 
---
 rpc/rpc-lib/src/rpc-transport.h|1 +
 rpc/rpc-transport/socket/src/socket.c  |   30 +--
 xlators/protocol/auth/login/src/login.c|   37
---
 xlators/protocol/server/src/server-handshake.c |9 ++
 4 files changed, 56 insertions(+), 21 deletions(-)

diff --git a/rpc/rpc-lib/src/rpc-transport.h
b/rpc/rpc-lib/src/rpc-transport.h
index 3161ec9..99add67 100644
--- a/rpc/rpc-lib/src/rpc-transport.h
+++ b/rpc/rpc-lib/src/rpc-transport.h
@@ -216,6 +216,7 @@ struct rpc_transport {

 struct list_head   list;
 intclient_bind_insecure;
+   char  *ssl_name;
 };

 struct rpc_transport_ops {
diff --git a/rpc/rpc-transport/socket/src/socket.c
b/rpc/rpc-transport/socket/src/socket.c
index 762426d..876add3 100644
--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -143,12 +143,13 @@

 int socket_init (rpc_transport_t *this);

-int
+char *
 ssl_setup_connection (socket_private_t *priv, int server)
 {
-   X509 *peer;
-   char  peer_CN[256];
-   int   ret;
+   X509 *peer  = NULL;
+   char  peer_CN[256]  = "";
+   int   ret   = -1;
+   char *value = NULL;

priv->ssl_ssl = SSL_new(priv->ssl_ctx);
priv->ssl_sbio = BIO_new_socket(priv->sock,BIO_NOCLOSE);
@@ -159,6 +160,7 @@ ssl_setup_connection (socket_private_t *priv, int
server)
else {
ret = SSL_connect(priv->ssl_ssl);
}
+
if (ret >= 0) {
gf_log(__func__,GF_LOG_DEBUG,"verify_result = %lu (%d)",
   SSL_get_verify_result(priv->ssl_ssl), X509_V_OK);
@@ -168,6 +170,8 @@ ssl_setup_connection (socket_private_t *priv, int
server)
NID_commonName, peer_CN, sizeof(peer_CN)-1);
peer_CN[sizeof(peer_CN)-1] = '\0';
gf_log(__func__,GF_LOG_DEBUG,"peer CN = %s", peer_CN);
+   /* Stop complaining, it's already length-limited. */
+   value = gf_strdup(peer_CN);
}
}
else {
@@ -181,7 +185,8 @@ ssl_setup_connection (socket_private_t *priv, int
server)
gf_log(__func__,GF_LOG_ERROR,"  %s",errbuf);
}
}
-   return ret;
+
+   return value;
 }

 int
@@ -2029,15 +2034,16 @@ int
 socket_server_event_handler (int fd, int idx, void *data,
  int poll_in, int poll_out, int poll_err)
 {
-rpc_transport_t *this = NULL;
+rpc_transport_t *this = NULL;
 socket_private_t*priv = NULL;
 int  ret = 0;
 int  new_sock = -1;
-rpc_transport_t *new_trans = NULL;
+rpc_transport_t *new_trans = NULL;
 struct sockaddr_storage  new_sockaddr = {0, };
 socklen_taddrlen = sizeof (new_sockaddr);
 socket_private_t*new_priv = NULL;
 glusterfs_ctx_t *ctx = NULL;
+   char*cname = NULL;

 this = data;
 GF_VALIDATE_OR_GOTO ("socket", this, out);
@@ -2126,12 +2132,14 @@ socket_server_event_handler (int fd, int idx,
void *data,

if (priv->use_ssl) {
new_priv->ssl_ctx = priv->ssl_ctx;
-   if (ssl_setup_connection(new_priv,1) < 0) {
+   cname = ssl_setup_connection(new_priv,1);
+   if (!cname) {
gf_log(this->name,GF_LOG_ERROR,
   "server setup failed");
close(new_sock);
goto unlock;
}
+   new_trans-&

[Gluster-devel] [PATCH BUG:3085] Make backupvolfile-server option actually work.

2011-06-24 Thread Jeff Darcy
From b5e1a48a067ac1f72b7655f5f13bf46d9bde8334 Mon Sep 17 00:00:00 2001

The problem was that glusterfs would return zero (success) as soon as it
forked, before we really knew whether the mount using the primary
volfile server had actually succeeded or failed.  This code actually
checks for the appearance of the volume in our mount table, and retries
using the backup volfile server if it doesn't show up in a reasonable
amount of time.

It's hacky, and I know something better is coming along, but this issue
comes up daily in the IRC channel and not everyone wants to set up RRDNS
for something that a script should be able to handle.  Whoever added the
backupvolfile-server option probably meant for it to help in these
cases, but it wasn't working.

Signed-off-by: Jeff Darcy 
---
 xlators/mount/fuse/utils/mount.glusterfs.in |   18 +++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/xlators/mount/fuse/utils/mount.glusterfs.in
b/xlators/mount/fuse/utils/mount.glusterfs.in
index e429eca..aca43a9 100755
--- a/xlators/mount/fuse/utils/mount.glusterfs.in
+++ b/xlators/mount/fuse/utils/mount.glusterfs.in
@@ -123,13 +123,25 @@ start_glusterfs ()
 err=0;
 $cmd_line;

+found=0
+for i in $(seq 0 10); do
+   sleep 3
+   mount | cut -d" " -f3 | grep "^$mount_point$"
+   if [ $? = 0 ]; then
+   echo "There it is!"
+   found=1
+   break
+fi
+   echo "Still not there..."
+done
+
 # retry the failover
-if [ $? != "0" ]; then
-err=1;
+if [ $found = 0 ]; then
+   echo "Trying backupvolfile_server"
 if [ -n "$cmd_line1" ]; then
 cmd_line1=$(echo "$cmd_line1 $mount_point");
 $cmd_line1
-if [ $? != "0"]; then
+if [ $? != "0" ]; then
 err=1;
 fi
 fi
-- 
1.7.3.4



Re: [Gluster-devel] [PATCH BUG:3085] Make backupvolfile-server option actually work.

2011-06-27 Thread Jeff Darcy
On 06/27/2011 08:09 AM, Devon Miller wrote:
> Mind you I haven't looked at this patch in context of the 
> xlators/mount/fuse/utils/mount.glusterfs.in 
>  file, I've just looked at the patch
> itself. However, it looks like *err* can only be set if *cmd_line1*
> is defined.

Then $err will remain at its previous value (from above the patch) of 0.
 I guess this could be problematic if the first invocation of glusterfs
populated the mount table and subsequently threw an error - almost the
exact opposite of the problem we have now.  In that case it would make
sense to check the error code *and* the mount-table contents, but I
don't see that as critical.



Re: [Gluster-devel] [PATCH BUG:3085] Make backupvolfile-server option actually work.

2011-06-27 Thread Jeff Darcy
On 06/27/2011 11:16 AM, Devon Miller wrote:
> I was thinking more of the case where no backup server is
> configured. If the primary server fails, and a backup is not defined,
> then found=0 and err=0. So, the script will eventually exit with a 0
> when it really should have exited with a 1.

You're right, the code doesn't fix that (pre-existing) hole.  We should
really set err=1 before we check for the definition of a backup server.
 Thanks!



Re: [Gluster-devel] Handling EOBs in CloudFS

2011-07-15 Thread Jeff Darcy
On Fri, 15 Jul 2011 13:13:22 -0400
Devon Miller  wrote:

> Re: Approach 1
> Is there a way for process B to use file locking (or some other
> mechanism) such that it could be guaranteed a consistent view of F?

It shouldn't need to.  With either approach, the encryption translator
(actually a partner translator on the server side) is responsible for
providing any serialization necessary to prevent clients from seeing
intermediate states in the middle of a write.  With approach 1, this
includes the state where the file has been extended but the pad bytes
have not yet been moved into an xattr.  With approach 2, this includes
the state where an extending write has been issued but its result is
not yet known - so that it would be invalid to issue the setxattr to
persist the true EOF.  Much of this is covered in

http://cloudfs.org/dist/design.md

That's the document which had previously been posted on cloudfs-devel,
and to which Edward was responding.  I apologize for his attempt to
continue the discussion on another list without providing appropriate
context.



[Gluster-devel] RFC (HekaFS): improved replication

2011-09-14 Thread Jeff Darcy
John Mark Walker asked me to re-post this here as well as on cloudfs-devel.
Feedback is most welcome, but please be aware that some discussion has already
occurred there.  Here's the archive link to see the early discussion.

https://fedorahosted.org/pipermail/cloudfs-devel/2011-September/000148.html


= HekaFS Improved Replication =

== Background and Requirements ==

One of the most serious internal complaints about GlusterFS is performance for
small synchronous requests when using its filesystem-level replication (AFR).
This problem particularly afflicts virtual-machine-image and database
workloads, reducing performance to about a third of what it "should" be
(compared on a per-server basis to NFS on the same hardware).  The fundamental
problem is that the AFR approach to making writes crash-proof involves the
following operations:

1. Lock on the primary (first) server
2. Record operation-pending state (using extended attributes) on all
   servers
3. Issue write to all servers
4. As writes complete, update operation-pending state on other servers
5. Unlock on primary server

Even with some operations in parallel, this requires a minimum of five network
round trips to/from the primary server - possibly more as step 4 might be
repeated if there are more than two replicas.  Even with pending changes to
AFR, such as coalescing step 4 updates, AFR's per-request latency is likely to
remain terrible.

Externally, users seem to focus on a different problem: the timeliness and
observability of replica repair after a server has failed and been
restored[1][2].  AFR was built on the assumption that on-demand repair of
individual files or directories as they're accessed would be sufficient.  The
message from users ever since has been unequivocal: leaving unknown numbers of
unrepaired files vulnerable to a second failure for an indefinite period is
unacceptable.  These users require immediate repair with explicit notification
of return to a fully protected state, but here they run into a second snag: the
time required to do a full xattr scan of a multi-terabyte filesystem through a
single node is also unacceptable.  Patches were submitted almost a year ago[3]
to implement precise recovery by maintaining a list of files that are partially
written and might therefore require repair, but those have never been adopted.
The recently introduced "proactive self heal" functionality is only slightly
better.  It is triggered automatically and runs inside one of the server
daemons - avoiding many machine-to-machine and user-to-kernel round trips - but
it's still single-threaded and drags all data through one server that might be
neither source nor destination.  Worse, if a second failure occurs while the
lengthy repair process for a previous failure is still ongoing, a new repair
cycle will be scheduled but might not even start for days while the previous
repair scans millions of perfectly healthy files.

The primary requirements, therefore, are:

* Improve performance for synchronous small requests

* Provide efficient "minimal" replica repair with a positive indication
  of replica status

In addition to these requirements, compatibility with planned enhancements to
distribution and wide-area replication would also be highly desirable.

== Proposed Solution ==

The origin of AFR's performance problems is that it requires extra operations
(beyond the necessary N writes) in the non-failure case to ensure correct
operation in the failure case.  The basis of the proposed solution is therefore
to be optimistic instead of pessimistic, expending minimal resources in the
normal case and taking extra steps only after a failure.  The basic write
algorithm becomes:

1. Forward the write to all N replicas
2. If all N replicas indicate success, we're done
3. If any replica fails, add information about the failed request (e.g.
   file, offset, length) to journals on the replicas where it succeeded
4. As part of the startup process, defer completion of startup until
   brought up to date by replaying peers' journals

Because the process relies on a journal, there's no need to maintain a
separate list of files in need of repair; journal contents can be examined at
any time, and if they're empty (the normal case) that serves as a positive
indication that the volume is in a fully protected state.
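
A deliberately over-simplified sketch of that write path (replica I/O and the
journal reduced to hypothetical stubs, two replicas, no locking or retry
logic) may help make the algorithm concrete:

/* Over-simplified sketch of the optimistic write algorithm above.
 * write_to_replica() and journal_append() are hypothetical stand-ins. */
#include <sys/types.h>
#include <stdbool.h>
#include <stdio.h>

#define N_REPLICAS 2

static bool
write_to_replica (int i, off_t off, const void *buf, size_t len)
{
        (void) off; (void) buf; (void) len;
        return (i != 1);        /* simulate replica 1 being down */
}

static void
journal_append (int surviving, int failed, off_t off, size_t len)
{
        printf ("journal on %d: replay off=%lld len=%zu to %d\n",
                surviving, (long long) off, len, failed);
}

static int
replicated_write (off_t off, const void *buf, size_t len)
{
        bool ok[N_REPLICAS];
        int  i, j, n_ok = 0;

        /* 1. Forward the write to all N replicas. */
        for (i = 0; i < N_REPLICAS; i++) {
                ok[i] = write_to_replica (i, off, buf, len);
                n_ok += ok[i];
        }
        if (n_ok == N_REPLICAS)         /* 2. Everyone succeeded: done. */
                return 0;
        if (n_ok == 0)
                return -1;              /* nothing succeeded: report error */

        /* 3. Record the failed request on the replicas where it succeeded,
         *    to be replayed when the failed replica comes back (step 4). */
        for (i = 0; i < N_REPLICAS; i++) {
                if (ok[i])
                        continue;
                for (j = 0; j < N_REPLICAS; j++)
                        if (ok[j])
                                journal_append (j, i, off, len);
        }
        return 0;
}

int
main (void)
{
        char data[4096] = { 0 };

        return (replicated_write (0, data, sizeof (data)) < 0);
}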

Doing repair as part of the startup process means that, if the failure is a
network partition rather than a server failure[4], then neither side will go
through the startup process.  Each server must therefore initiate  repair upon
being notified of another server coming up as well as during startup.  Journal
entries are pushed rather than pulled, from the servers that have them to the
newly booted or reconnected server.  Each server must also be a client, both to
receive peer-status notifications (which currently go only to clie

Re: [Gluster-devel] no dentry for non-root inode

2011-11-07 Thread Jeff Darcy
On Mon, 7 Nov 2011 17:25:24 +0100
m...@netbsd.org (Emmanuel Dreyfus) wrote:

> Emmanuel Dreyfus  wrote:
> 
> > [2011-11-05 03:12:47.856612] W [inode.c:1044:inode_path]
> > 0-/export/wd3e/inode: no dentry for non-root inode
> > -9091625530591748852: d968c71c-9c3f-471e-81d4-0ebfda34dd0c
> (...)
> > Any hint on what this warning is about? 
> 
> No reply on this one?

Whenever the git log shows that only Avati and/or Amar are changing a
piece of code in non-cosmetic ways - as in this case - my "gurus only"
warning bell goes off.  I'm not sure a "reminder" after only a single
working day, during which those two might well have been traveling or
occupied with even-higher-priority tasks, is really called for.  What
kind of response were you expecting, and from whom?

The error message seems a bit misleading: what is missing is not a true
dentry, but a GlusterFS-internal simulacrum of one, which seems to be
gone because the parent inode was forgotten.  I also see that the
message comes from __inode_path, which is only called from three of the
performance translators in non-essential code to dump fd contexts.  The
first thing I'd do is try to figure out which code path is actually
involved, and why anyone's dumping that context in the first place.
I'll bet there's a platform-specific reason, requiring a
platform-specific tweak to avoid the underlying issue.



Re: [Gluster-devel] 0-glusterfs-fuse: xl is NULL

2011-11-10 Thread Jeff Darcy
On 11/10/2011 03:26 AM, Emmanuel Dreyfus wrote:

> I am still tracking my bug on randomly ENOENT for existing files.
> I notice this rare error in the client logs:
> 
> [2011-11-10 08:18:20.170776] E [fuse-bridge.c:2840:fuse_setlk_resume] 
>   0-glusterfs-fuse: xl is NULL
> 
> Is it something of concern, or is it harmless?

That definitely seems like something that should never happen, so I'd be a bit
concerned.  Perhaps more pertinently, ISTRC that some of the code I looked at
in our last round of investigation had to do with inodelk, so there's a decent
chance that they're related.



Re: [Gluster-devel] DHT-based Replication

2011-11-22 Thread Jeff Darcy
On Mon, 21 Nov 2011 21:34:30 -0200
Daniel van Ham Colchete  wrote:

> So, I have a suggestion that fixes this problem. I call it DHT-based
> replication. It is different from DHT-based distribution. I already
> implemented it internally, it already worked, at least here.

Wonderful.  Patches would be most welcome.

> Giving
> the amount of money and energy this idea saves, I think this idea is
> worth a million bucks (on your hands at least). Even though it is
> really simple. I'm giving it to you guys for free. Just please give
> credit if you are going to use it.
> 
> It is very simple: hash(name) to locate the primary server (as
> usual), and use another hash, like hash(name + "#copy2") for the
> second copy and so on. You just have to certify that it doesn't fall
> into the same server, but this is easy: hash(name + "#copy2/try2").

I've discussed this issue in some of my presentations, and there's a
whole range of possible approaches.

* Distribute across pairs (current approach).  You did a good job
  explaining some of the problems here.

* Overlapping pairs around a ring.  This is possible, and has the
  advantage that load for a failed server is distributed between its
  two neighbors instead of one partner (and similarly for N>2).
  Unfortunately, it's a bit incompatible with the way DHT currently
  works, with per-directory layout maps and an assumption that a hashed
  file's parent directory will exist on the same brick (essentially
  meaning directories must exist *everywhere*).  I have in the past
  implemented an alternative DHT translator that uses a Dynamo-style
  global ring and would be much more friendly to this kind of
  replication, but it's nowhere near production quality.  It also
  requires a more dynamic translator-graph infrastructure, because it
  would involve creating AFR translators (or their moral equivalent)
  that aren't explicit in the volfile, including when bricks are added.

* Iterative hashing (your suggestion).  This extends on both the
  advantages and drawbacks of the previous approach.  Because the
  number of AFR combinations now grows exponentially, it also requires
  that the translator-graph infrastructure be much more scalable as
  well as more dynamic.  BTW, Kademlia's XOR-based consistent hashing
  offers similar characteristics to iterative ring-based hashing, and
  might be preferable for reasons too academic to describe here.

As I'm sure you can see, there are a few technical issues that remain
before a simple idea can be turned into a workable reality.  One
compromise I've considered is to assign multiple layout maps per
directory, hashing a file first to a map and then to a position within
that map.  This would provide practically the same load-spreading
advantage of the other approaches, with only slight change to existing
(and working) code.
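
For concreteness, here's a toy sketch of the iterative-hashing placement
described in the quoted mail - a generic string hash stands in for the real
DHT hash, and servers are just numbered, so treat it as an illustration only:

/* Toy sketch of iterative-hash replica placement: hash(name) for the
 * first copy, hash(name"#copyK") for later copies, retrying with
 * "/tryJ" until each copy lands on a distinct server.  Illustration
 * only, not DHT code.  Assumes N_COPIES <= N_SERVERS. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N_SERVERS 8
#define N_COPIES  3

static uint32_t
strhash (const char *s)                 /* FNV-1a as a stand-in */
{
        uint32_t h = 2166136261u;
        while (*s) {
                h ^= (unsigned char) *s++;
                h *= 16777619u;
        }
        return h;
}

static void
place (const char *name, int out[N_COPIES])
{
        char key[512];
        int  copy, attempt, k, srv, clash;

        for (copy = 0; copy < N_COPIES; copy++) {
                for (attempt = 0; ; attempt++) {
                        if (copy == 0)
                                snprintf (key, sizeof (key), "%s", name);
                        else
                                snprintf (key, sizeof (key), "%s#copy%d",
                                          name, copy + 1);
                        if (attempt > 0)
                                snprintf (key + strlen (key),
                                          sizeof (key) - strlen (key),
                                          "/try%d", attempt + 1);
                        srv = strhash (key) % N_SERVERS;
                        clash = 0;
                        for (k = 0; k < copy; k++)
                                if (out[k] == srv)
                                        clash = 1;
                        if (!clash) {
                                out[copy] = srv;
                                break;
                        }
                }
        }
}

int
main (void)
{
        int servers[N_COPIES];
        int i;

        place ("some-file.img", servers);
        for (i = 0; i < N_COPIES; i++)
                printf ("copy %d -> server %d\n", i + 1, servers[i]);
        return 0;
}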



Re: [Gluster-devel] SSL support

2011-12-01 Thread Jeff Darcy
On Thu, 1 Dec 2011 13:24:38 +0100
m...@netbsd.org (Emmanuel Dreyfus) wrote:

> I would like to test again SSL support. Since my last attempt, the
> patch evolved into this: 
> http://review.gluster.com/#change,362
> 
> Two questions:
> - Is there a plan for restricting what DN are allowed to access?

There is a separate authorization patch, which depends on the current
one (authentication, encryption, multi-threading) so I'll be pushing it
as soon as its predecessor is ready.  For now, you can view it here:

https://lists.gnu.org/archive/html/gluster-devel/2011-06/msg00042.html

> - I understand one have to configure SSL in the autogenerated volume
> files in etc/glusterd/vols. Is there a plan for having that derived
> from information contained in etc/glusterfs/glusterd.vol?

Yes, there is a long term plan for that.  The overall plan is to merge
the HekaFS functionality into GlusterFS, but integrating the management
pieces is the most difficult part of that (no convenient translator API
there) so it's going to take a while.



Re: [Gluster-devel] SSL support

2011-12-01 Thread Jeff Darcy
On Thu, 1 Dec 2011 15:40:21 +0100
m...@netbsd.org (Emmanuel Dreyfus) wrote:

> In the meantime, what about setting defaults values for key, cert and
> CA (admins will smlink them to actual location), and just add some
> way of enabling ssl with the gluster command?

Adding defaults is a good idea.  In fact I thought there were defaults
already, but looking at the patch I see none.  Right now, SSL is
enabled iff all three file locations are specified.  Instead, we should
probably add a fourth option to enable SSL, separately from whether
non-default file locations are specified.  Could you please add that as
a review comment on the patch so that I'll remember it as the patch
progresses?

Adding SSL-related commands to the CLI is far more painful.  That's not
to say it shouldn't be done; the only reason I'm reluctant to do it is
that it's connected to a whole lot of other management integration that
needs to happen (e.g. to make the entire management subsystem
understand the concept of a tenant) and I don't want to do it *twice*.
Would it be sufficient just to add the enable option using default
locations, but not the file-location options?



Re: [Gluster-devel] rfc.sh usage

2011-12-16 Thread Jeff Darcy
On 12/16/2011 09:23 AM, Emmanuel Dreyfus wrote:
> I wanted to add an updated patchset to http://review.gluster.com/#change,796
> but gerrit shows identical patchset 1 and 2. How I am supposed to do?
> 
> 
> here is what I did:
> 
> git clone 
> cp -r glusterfs getgroups && cd getgroups
> git branch getgroups
> vi xlator/mount/fuse/src/fuse-helper.c
> commit -s xlator/mount/fuse/src/fuse-helper.c
> ./rfc.sh  -> this produced patchset 1
> vi xlator/mount/fuse/src/fuse-helper.c-> update for patchset 2
> commit -s xlator/mount/fuse/src/fuse-helper.c
> ./rfc.sh  -> this produced patchset 2

It seems odd that it created a new patchset at all.  Normally you're supposed
to use the "--amend" flag on commit, to modify the existing HEAD instead of
creating a new one.  I would have expected the above to result in a new
Gerrit review request dependent on the old one, not a new patchset on the
old request, but maybe there's more git silliness here than meets the eye.

Wait, what am I saying?  This is git, so of course it's behaving irrationally.



Re: [Gluster-devel] Updated Wireshark packages with improved Gluster support

2012-02-06 Thread Jeff Darcy
On 02/06/2012 05:28 AM, Niels de Vos wrote:
> there are updated Wireshark packages available in my Fedora People
> repository at http://repos.fedorapeople.org/repos/devos/wireshark-gluster/
> 
> You can add the Fedora or EPEL .repo file from the above URL in
> /etc/yum.repos.d and test the packages easily. Whenever there are
> substantial changes/improvements, I'll update the packages in the
> repository.
> 
> The packages can currently detect and display quite some RPC-procedures
> that Gluster uses. There is still a lot of work to be done. Some hints
> on how to help out can be found in my latest blog post:
> - http://blog.nixpanic.net/2012/02/improvements-on-displaying-gluster.html
> 
> Suggestions for improvement and feedback on the results you get is much
> appreciated.

I just want to say, as publicly as possible, that this is awesome.  Thanks, 
Niels!




Re: [Gluster-devel] Can function arguments be modified ?

2012-03-11 Thread Jeff Darcy
On 03/10/2012 03:34 PM, Anand Avati wrote:
> There is a no rule written on stone here. It is good practice to make
> copies. Note that for things like iatt structure, you need not
> "allocate" and "free" from the heap. Most of the time you can copy to
> a structure on the stack, modify and return that. You will see that
> for parameters which get modified in the callback (typically
> aggregated from multiple subvolumes), most translators have a
> "modified" copy inside frame->local.

Object-lifecycle management in GlusterFS can be a bit tricky.  The two most
common patterns I see are that dict_t and similar structures will be
*dereferenced* either when the originator's (i.e. FUSE or protocol/server)
STACK_WIND returns or when the callback completes.  Thus, you can usually
ensure their continued existence by doing a dict_ref (or data_ref if you really
only care about a single value).  IIRC, DHT already does this for some layout
xattrs that it aggregates across subvolumes.  In most cases, though, Avati is
correct: good practice would be to make a copy.



Re: [Gluster-devel] glusterfs-3.3.0qa34 released

2012-04-10 Thread Jeff Darcy
On 04/10/2012 03:29 PM, Patrick Matthäi wrote:
> it fails to build from source with hardening build flags enabled:
> 
>  gcc -DHAVE_CONFIG_H -I. -I. -I../../../..
> -I../../../../libglusterfs/src -I../../../../contrib/uuid
> -D_FORTIFY_SOURCE=2 -fPIC -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -Wall
> -DGF_LINUX_HOST_OS -I../../../../libglusterfs/src
> -I../../../../xlators/lib/src -I../../../../rpc/rpc-lib/src -shared
> -nostartfiles -O0 -g -O2 -fstack-protector --param=ssp-buffer-size=4
> -Wformat -Wformat-security -Werror=format-security -Wall -c
> afr-lk-common.c -o afr-lk-common.o >/dev/null 2>&1
>  gcc -DHAVE_CONFIG_H -I. -I. -I../../../..
> -I../../../../libglusterfs/src -I../../../../contrib/uuid
> -D_FORTIFY_SOURCE=2 -fPIC -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -Wall
> -DGF_LINUX_HOST_OS -I../../../../libglusterfs/src
> -I../../../../xlators/lib/src -I../../../../rpc/rpc-lib/src -shared
> -nostartfiles -O0 -g -O2 -fstack-protector --param=ssp-buffer-size=4
> -Wformat -Wformat-security -Werror=format-security -Wall -c
> afr-self-heald.c  -fPIC -DPIC -o .libs/afr-self-heald.o
> afr-self-heald.c: In function '_crawl_proceed':
> afr-self-heald.c:398:17: error: format not a string literal and no
> format arguments [-Werror=format-security]
> afr-self-heald.c:398:17: error: format not a string literal and no
> format arguments [-Werror=format-security]
> cc1: some warnings being treated as errors
> make[6]: *** [afr-self-heald.lo] Error 1

Today I learned that -Werror=format-security generates totally bogus errors.
If you look at the code you'll see it's *no different* security-wise than if it
had been a string literal (which it was one line earlier) and it doesn't
contain any % substitutions anyway.  There are many tools to do this sort of
checking correctly, and I'd be totally in favor of fixing defects that they
report, but working around gcc bugs is pretty irksome.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] glusterfs-3.3.0qa34 released

2012-04-10 Thread Jeff Darcy
On 04/10/2012 03:59 PM, Patrick Matthäi wrote:
> The "problem" is, that the % substitution is missing, so:
> 
> gf_log (this->name, GF_LOG_ERROR, msg);
> should become:
> gf_log (this->name, GF_LOG_ERROR, "%s", msg);
> 
> I didn't check if this was introduced in other places, too.
> 
> In 3.2.5 there was a similar fault, which my co-maintainer of the
> glusterfs packaging has fixed:
> http://review.gluster.com/#change,2598

Yes, it's easy to work around, and patches to do just that would be welcome.
I'll be the first to approve them.  OTOH, false positives are the bane of any
effort to improve software quality via static analysis.  The fact that gcc has
now generated two false positives for the same non-problem suggests that its
format-security diagnostics are not the right basis for such an effort.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] glusterfs-3.3.0qa34 released

2012-04-10 Thread Jeff Darcy
On 04/10/2012 04:42 PM, Patrick Matthäi wrote:
> Ok here they are:
>
> 02-gflog2.diff:
> FTBFS as described
>
> 03-gflog3.diff:
> Same applies here
>
> 03-spelling-errors.diff:
> Multiple spelling errors fixed (mostly @ log messages)
>
> 04-man-warnings.diff:
> A few man warnings fixes (hyphens used as minus signs)

With some timely help from Avati, these have been applied to the 3.3 (master)
tree.  Thanks!

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [RFC] Improved distribution

2012-04-17 Thread Jeff Darcy
On Tue, 17 Apr 2012 01:57:35 +0200
Edward Shishkin  wrote:

> Comment 2. There is a disadvantage: in this approach all files
> 
> /foo
> /dir1/foo
> /dir1/dir2/foo
> ...
> 
> will be accumulated on the same brick. However it is possible to
> "salt" a short file names with gfid (or another id) of respective
> directory before hashing, to avoid possible attacks.

That's the easy problem.  The harder problem is that the "only split
one brick" approach creates imbalances that are *permanent and
accumulative*.  In other words, each change is likely to take us
further from optimal distribution, so that measures such as virtual
nodes become strictly necessary to preserve any semblance of proper
load/capacity balancing.  I had implemented an approach based on moving
files only from one hash range instead of only from one brick (a
demonstrably superior generalization of the idea) and it still exhibits
this behavior.  You had those results before you ever wrote this
proposal.  We still need a rebalance method that restores "perfect"
distribution, even if it's not the only one we use.

>2. Virtual nodes

Virtual nodes are a bad idea.  Even the people who included them in
Dynamo design have said as much since.  The main problem is that
the accumulate infinitely.  If one relies on them too much to fix other
problems, the result is very large numbers of virtual node IDs and very
large lookup tables.  This is compounded in our case by the fact that
information about node IDs or ranges is contained in xattrs, so adding
too many will make fetching that information (a frequent operation)
less efficient.  At the very least, virtual node IDs need to be
aggressively pruned, even if that means incurring some data-movement
cost.  Even better, I think we should stay away from them entirely.  My
favorite alternative is multiple hash rings, as follows:

ring_hash = hash(file_id)
ring_to_use = ring_hash % num_rings
node_hash = hash(ring_hash/num_rings)
node_to_use = lookup(node_hash,lookup_table[ring_to_use])

This approach quickly converges on the same flexibility/efficiency as
virtual node IDs, even with quite small values of num_rings.  With careful
assignment of ranges within each ring, it can also ensure that the load
when a node fails is spread out across up to num_rings successors
instead of just one (as with a single ring even when using virtual
nodes).
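
To make the ring selection concrete, here's a small self-contained sketch.
The table layout, the two-node toy tables, and the FNV/integer-mix hashes are
invented for illustration; real tables would come from the layout xattrs and
the real hash would be whatever DHT settles on:

#include <stdint.h>
#include <stdio.h>

#define NUM_RINGS 4

typedef struct {
        uint32_t start;         /* lowest hash owned by this entry  */
        uint32_t end;           /* highest hash owned by this entry */
        int      node;          /* brick that owns the range        */
} ring_entry_t;

/* Toy tables: each ring splits the hash space in half between two nodes. */
static ring_entry_t ring_tables[NUM_RINGS][2] = {
        { { 0, 0x7fffffff, 0 }, { 0x80000000u, 0xffffffffu, 1 } },
        { { 0, 0x7fffffff, 1 }, { 0x80000000u, 0xffffffffu, 2 } },
        { { 0, 0x7fffffff, 2 }, { 0x80000000u, 0xffffffffu, 3 } },
        { { 0, 0x7fffffff, 3 }, { 0x80000000u, 0xffffffffu, 0 } },
};

static uint32_t
name_hash (const char *s)
{
        uint32_t h = 2166136261u;               /* FNV-1a as a stand-in */
        while (*s) { h ^= (unsigned char) *s++; h *= 16777619u; }
        return h;
}

static uint32_t
rehash (uint32_t x)
{
        /* Cheap stand-in for "hash it again"; any decent integer mix works. */
        x ^= x >> 16;  x *= 0x45d9f3b;  x ^= x >> 16;
        return x;
}

static int
pick_node (const char *file_id)
{
        uint32_t ring_hash = name_hash (file_id);
        int      ring      = ring_hash % NUM_RINGS;
        /* As in the pseudocode: hash the quotient again, then look it up
         * in the table for the chosen ring. */
        uint32_t node_hash = rehash (ring_hash / NUM_RINGS);
        int      i;

        for (i = 0; i < 2; i++) {
                if (node_hash >= ring_tables[ring][i].start &&
                    node_hash <= ring_tables[ring][i].end)
                        return ring_tables[ring][i].node;
        }
        return -1;      /* malformed table: no covering range */
}

int
main (void)
{
        printf ("foo -> node %d\n", pick_node ("foo"));
        return 0;
}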

> To achieve high availability and durability we replicate files on
> multiple bricks. In our case replication can be implemented as a set
> of operations with the same ring R, so we don't create a separate
> translator for replication.

Yes, we do and we will.  Distribution and replication are both
extremely complex.  Our code for both represents years of accumulated
expertise handling all manner of difficult cases, so for basic
modularity/maintainability reasons they will remain separate.

That said, a more nuanced relationship between distribution sets and
replica sets would be welcome.  If we allowed replication across
arbitrary (and possibly overlapping) sets of R nodes instead of
statically partitioning the bricks into 1/R subvolumes of DHT, we'd
gain at least the following benefits.

(1) Ability to support multiple replication levels within a single
volume.

(2) Smoother addition and removal of bricks.

(3) Better distribution of load from a failed brick.

Unfortunately, this requires that DHT be able to "spawn" AFR
translators dynamically, without them being represented directly in the
volfile.  I've written code to do this, verified that it actually
works (or at least did at that time), and published the results to the
cloudfs-devel mailing list.  It still requires a lot of work to make
sure that option changes and other kinds of reconfiguration are handled
correctly, but long term it's the direction we need to go.
>  APPENDIX
> 
> 
> 
> --
> 
> In 3 distributed hash tables with different hashing techniques
> 
> . GlusterFS DHT translator (3.2.5)
> . 64-bit ring with phi based on md5, R=1 (no replication), S=7
> . 64-bit ring with phi based on md5, R=1 (no replication), S=20
> 
> we run the same scenario:
> 
> 1) Create 100 files ("file00", "file01", ..., "file99") in a volume
> composed of 9 bricks:
> 
> "host:/root/exp0",
> "host:/root/exp1",
> ...
> 
> "host:/root/exp8".
> 
> 2) Add one brick "host:/root/exp9";
> 3) re-balance;

These results are barely usable.  When I pushed you to write up and
distribute this proposal instead of just beginning to hack on the code,
I also provided you with scripts that apply different rebalancing
methods and measure the results across different scenarios to generate
highly readable tables showing the data-movement effects.  Why did you
use a weaker methodology generating less readable results?

Also, please use code from master/3.3 for your testing and development.

Re: [Gluster-devel] [RFC] Improved distribution

2012-04-17 Thread Jeff Darcy
On Tue, 17 Apr 2012 11:14:21 -0400 (EDT)
Kaleb Keithley  wrote:

> ISTR Avati and/or Vijay telling us — when we were in BLR — that the
> hash of the filename is salted with the hash of the pathname up to,
> but not including the filename.
> 
> Am I misremembering that? (Of course I haven't looked at the code.)

I just did, and if there's anything but the name included I'm missing
it.  Here's the DHT function that computes the hash.

int
dht_hash_compute (int type, const char *name, uint32_t *hash_p)
{
        char *rsync_friendly_name = NULL;

        MAKE_RSYNC_FRIENDLY_NAME (rsync_friendly_name, name);

        return dht_hash_compute_internal (type, rsync_friendly_name, hash_p);
}

The name comes from dht_subvol_get_hashed (a few levels up), thus:

        subvol = dht_layout_search (this, layout, loc->name);

AFAIK loc->name is just the last part of the name, and there's no
provision anywhere in this path for non-textual input like a parent
hash.  It would probably be a good idea for us to do something like
that, but currently we don't.
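
If we ever did add such a salt, the simplest form would probably be to mix the
parent's GFID into the hashed string.  Something like the sketch below - not
current behavior; dht_hash_compute_salted and the extern declaration are
invented for illustration, and the GFID is hex-encoded so the result stays an
ordinary C string:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern int dht_hash_compute_internal (int type, const char *name,
                                      uint32_t *hash_p);

int
dht_hash_compute_salted (int type, const unsigned char pargfid[16],
                         const char *name, uint32_t *hash_p)
{
        char   *buf;
        size_t  len = 32 + strlen (name) + 1;   /* 32 hex chars + name + NUL */
        int     i, ret;

        buf = malloc (len);
        if (!buf)
                return -1;

        /* Hex-encode the parent GFID, then append the basename. */
        for (i = 0; i < 16; i++)
                snprintf (buf + 2 * i, 3, "%02x", pargfid[i]);
        strcpy (buf + 32, name);

        ret = dht_hash_compute_internal (type, buf, hash_p);
        free (buf);
        return ret;
}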

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [RFC] Improved distribution

2012-04-17 Thread Jeff Darcy
On Tue, 17 Apr 2012 08:33:06 -0700
Anand Avati  wrote:

> The parent directory's textual path is not part of the hash
> computation, but it causes a different hash-range map in the inode
> layout and effectively a different server is picked up for the same
> basename in different directories.

That's basically a rotation, isn't it?  In other words, the same ranges
will be used, but assigned to ABC for one directory and then BCA or CAB
on others?  That's how I interpret dht_fix_layout_of_directory and
dht_selfheal_layout_alloc_start, anyway.  That should be sufficient to
avoid the particular problem Edward mentioned, but doesn't completely
solve some of the other problems around load distribution and data
migration.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Fixing Address family mess

2012-05-07 Thread Jeff Darcy
On 05/07/2012 12:39 AM, Emmanuel Dreyfus wrote:
> Quick summary of the problem: when using transport-type socket with
> transport.address-family unspecified, glusterfs binds sockets with
> AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the
> kernel prefers. At mine it uses AF_INET6, while the machine is not
> configured to use IPv6. As a result, glusterfs client cannot connect
> to glusterfs server.
> 
> A workaround is to use option transport.address-family inet in
> glusterfsd/glusterd.vol but that option must also be specified in
> all volume files for all bricks and FUSE client, which is
> unfortunate because they are automatically generated. I proposed a
> patch so that glusterd transport.address-family setting is propagated
> to various places: http://review.gluster.com/3261
> 
> That did not meet consensus. Jeff Darcy notes that we should be able
> to listen both on AF_INET and AF_INET6 sockets at the same time. I
> had a look at the code, and indeed it could easily be done. The only
> trouble is how to specify the listeners. For now option transport
> defaults to socket,rdma. I suggest we add socket families in that
> specification. We would then have this default:
>option transport socket/inet,socket/inet6,rdma
> 
> With the following semantics:
>socket -> AF_UNSPEC socket (backward compatibility)
>socket/inet -> AF_INET socket
>socket/inet6 -> AF_INET6 socket
>socket/sdp -> AF_SDP socket
>rdma -> same as before
> 
> Any opinion on that plan? Please comment before I write code, it will
> save me some time if the proposal is wrong.

I think it looks like the right solution. I understand that keeping the
address-family multiplexing entirely in the socket code would be more complex,
since it changes the relationship between transport instances and file
descriptors (and threads in the SSL/multi-thread case).  That's unfortunate,
but far from the most unfortunate thing about our transport code.

I do wonder whether we should use '/' as the separator, since it kind of
implies the same relationship between names and paths that we use for
translator names - e.g. cluster/dht is actually used as part of the on-disk
path for dht.so - and in this case that relationship doesn't exist. Another
idea, which I don't actually like any better but which I'll suggest for
completeness, would be to express the list of address families via an option:

option transport.socket.address-family inet6

Now that I think about it, another benefit is that it supports multiple
instances of the same address family with different options, e.g. to support
segregated networks.  Obviously we lack higher-level support for that right
now, but if that should ever change then it would be nice to have the right
low-level infrastructure in place for it.




___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


[Gluster-devel] ZkFarmer

2012-05-07 Thread Jeff Darcy
I've long felt that our ways of dealing with cluster membership and staging of
config changes are not quite as robust and scalable as we might want.
Accordingly, I spent a bit of time a couple of weeks ago looking into the
possibility of using ZooKeeper to do some of this stuff.  Yeah, it brings in a
heavy Java dependency, but when I looked at some lighter-weight alternatives
they all seemed to be lacking in more important ways.  Basically the idea was
to do this:

* Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or
point everyone at an existing ZooKeeper cluster.

* Use ZK ephemeral nodes as a way to track cluster membership ("peer probe"
merely updates ZK, and "peer status" merely reads from it); a rough sketch of
this appears at the end of this message.

* Store config information in ZK *once* instead of regenerating volfiles etc.
on every node (and dealing with the ugly cases where a node was down when the
config change happened).

* Set watches on ZK nodes to be notified when config changes happen, and
respond appropriately.

I eventually ran out of time and moved on to other things, but this or
something like it (e.g. using Riak Core) still seems like a better approach
than what we have.  In that context, it looks like ZkFarmer[1] might be a big
help.  AFAICT someone else was trying to solve almost exactly the same kind of
server/config problem that we have, and wrapped their solution into a library.
 Is this a direction other devs might be interested in pursuing some day,
if/when time allows?


[1] https://github.com/rs/zkfarmer
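
For the ephemeral-node idea in the second bullet above, the client side would
look roughly like this.  It uses the ZooKeeper C client; the paths, the
peer-UUID argument, and the (near-absent) error handling are invented for
illustration:

#include <stdio.h>
#include <string.h>
#include <zookeeper/zookeeper.h>

static void
config_watcher (zhandle_t *zh, int type, int state, const char *path,
                void *ctx)
{
        /* Fires when /gluster/config changes; re-read it and reconfigure. */
        printf ("config event %d on %s\n", type, path ? path : "(none)");
}

int
join_cluster (const char *zk_hosts, const char *peer_uuid)
{
        char        path[256], created[256];
        struct Stat stat;
        zhandle_t  *zh;
        int         ret;

        zh = zookeeper_init (zk_hosts, config_watcher, 30000, NULL, NULL, 0);
        if (!zh)
                return -1;

        /* "peer probe" becomes: create an ephemeral node for ourselves.
         * It vanishes automatically if we die, so "peer status" is just a
         * listing of this directory. */
        snprintf (path, sizeof (path), "/gluster/peers/%s", peer_uuid);
        ret = zoo_create (zh, path, "alive", 5, &ZOO_OPEN_ACL_UNSAFE,
                          ZOO_EPHEMERAL, created, sizeof (created));
        if (ret != ZOK)
                return -1;

        /* Watch the single, central copy of the config. */
        ret = zoo_wexists (zh, "/gluster/config", config_watcher, NULL, &stat);
        return (ret == ZOK || ret == ZNONODE) ? 0 : -1;
}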

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] ZkFarmer

2012-05-07 Thread Jeff Darcy
On 05/07/2012 06:17 PM, Ian Latter wrote:
> Is there anything written up on why you/all want every
> node to be completely conscious of every other node?
> 
> I could see a couple of architectures that might work
> better (be more scalable) if the config minutiae were 
> either not necessary to be shared or shared in only 
> cases where the config minutiae were a dependency.

Well, these aren't exactly minutiae.  Everything at file or directory level is
fully distributed and will remain so.  We're talking only about stuff at the
volume or server level, which is very little data but very broad in scope.
Trying to segregate that only adds complexity and subtracts convenience,
compared to having it equally accessible to (or through) any server.

> RE ZK, I have an issue with it not being a binary at
> the linux distribution level.  This is the reason I don't
> currently have Gluster's geo replication module in
> place ..

What exactly is your objection to interpreted or JIT compiled languages?
Performance?  Security?  It's an unusual position, to say the least.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] ZkFarmer

2012-05-08 Thread Jeff Darcy
On 05/08/2012 12:33 AM, Anand Babu Periasamy wrote:
> Real issue here is: GlusterFS is a fully distributed system. It is
> OK for config files to be in one place (centralized). It is easier to
> manage and backup. Avati still claims that making distributed copies
> are not a problem (volume operations are fast, versioned and
> checksumed).

It's also grossly inefficient at 100-node scale.  I'll also need some
convincing before I believe that nodes which are down during a config change
will catch up automatically and reliably in all cases.

I think this is even more of an issue with membership than with config data.
All-to-all pings are just not acceptable at 100-node or greater scale.  We need
something better, and more importantly designing cluster membership protocols
is just not a business we should even be in.  We shouldn't be devoting our own
time to that when we can just use something designed by people who have that as
their focus.

> Also the code base for replicating 3 way or all-node is
> same. We all need to come to agreement on the demerits of replicating
> the volume spec on every node.

It's somewhat similar to how we replicate data - we need enough copies to
survive a certain number of anticipated failures.

> If we are convinced to keep the config info in one place, ZK is
> certainly one a good idea. I personally hate Java dependency. I still
> struggle with Java dependencies for browser and clojure. I can digest
> that if we are going to adopt Java over Python for future external
> modules. Alternatively we can also look at creating a replicated meta
> system volume. What ever we adopt, we should keep dependencies and
> installation steps to the bare minimum and simple.

I personally hate the Java dependency too.  I'd much rather have something in
C/Go/Python/Erlang but couldn't find anything that had the same (useful)
feature set.  I also considered the idea of storing config in a hand-crafted
GlusterFS volume, using our own mechanisms for distributing/finding and
replicating data.  That's at least an area where we can claim some expertise.
Such layering does create a few interesting issues, but nothing intractable.
The big drawback is that it only solves the config-data problem; a solution
which combines that with cluster membership is IMO preferable.  The development
drag of having to maintain that functionality ourselves, and hook every new
feature into the not-very-convenient APIs that have predictably resulted, is
considerable.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] ZkFarmer

2012-05-08 Thread Jeff Darcy
On 05/08/2012 12:27 AM, Ian Latter wrote:
> The equivalent configuration in a glusterd world (from
> my experiments) pushed all of the distribute knowledge
> out to the client and I haven't had a response as to how
> to add a replicate on distributed volumes in this model,
> so I've lost replicate.

This doesn't seem to be a problem with replicate-first vs. distribute-first,
but with client-side vs. server-side deployment of those translators.  You
*can* construct your own volfiles that do these things on the servers.  It will
work, but you won't get a lot of support for it.  The issue here is that we
have only a finite number of developers, and a near-infinite number of
configurations.  We can't properly qualify everything.  One way we've tried to
limit that space is by preferring distribute over replicate, because replicate
does a better job of shielding distribute from brick failures than vice versa.
Another is to deploy both on the clients, following the scalability rule of
pushing effort to the most numerous components.  The code can support other
arrangements, but the people might not.

BTW, a similar concern exists with respect to replication (i.e. AFR) across
data centers.  Performance is going to be bad, and there's not going to be much
we can do about it.

> But in this world, the client must
> know about everything and the server is simply a set
> of served/presented disks (as volumes).  In this
> glusterd world, then, why does any server need to
> know of any other server, if the clients are doing all of
> the heavy lifting?

First, because config changes have to apply across servers.  Second, because
server machines often spin up client processes for things like repair or
rebalance.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Hide Feature

2012-05-10 Thread Jeff Darcy
On Thu, 10 May 2012 15:47:06 +1000
"Ian Latter"  wrote:

>   I have published an untested "hide" module (compiled 
> against glusterfs-3.2.6); 
> 
> A simple method for hiding an underlying directory 
> structure from parent/up-stream bricks within 
> GlusterFS.  In 2012 this code was spawned from 
> my incomplete 2009 dedupe brick code which used
> this method to protect its internal hash database
> from the user, above.
> 
> http://midnightcode.org/projects/saturn/code/hide-0.5.tgz
> 
> 
>   I am serious when I mean untested - I've not even
> loaded the module under Gluster, it simply compiles.
> 
> 
>   Let me know if there are tweaks that should be made
> or considered.

A couple of comments:

* It should be sufficient to fail lookup for paths that match your
pattern; if lookup fails, the caller will never get to any other fops.  You
can use the quota translator as an example of something like this (a minimal
sketch also appears at the end of this message).

* If you want to continue supporting this yourself, then you can just
leave the code as it is, though in that case you'll want to consider
building it "out of tree" as I describe in my "Translator 101" post[1]
or do for some of my own translators[2].  Otherwise you'll need to
submit it as a patch through Gerrit according to our standard
workflow[3].  You'll also need to fix some of the idiosyncratic
indentation.  I don't remember the current policy wrt copyright
assignment, but that might be required too.

[1]
http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/

[2] https://github.com/jdarcy/negative-lookup

[3]
http://www.gluster.org/community/documentation/index.php/Development_Work_Flow
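
Here's the kind of thing I mean for the first point - a minimal sketch rather
than working code.  The fop signature is roughly the 3.2-era one, and the
pattern stored in this->private is an assumption for illustration:

#include <errno.h>
#include <fnmatch.h>

int32_t
hide_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc,
             dict_t *xattr_req)
{
        const char *pattern = this->private;    /* set up in init(), say */

        if (loc->name && fnmatch (pattern, loc->name, 0) == 0) {
                /* Pretend the entry doesn't exist.  No other fop can be
                 * reached without a successful lookup first. */
                STACK_UNWIND_STRICT (lookup, frame, -1, ENOENT,
                                     NULL, NULL, NULL, NULL);
                return 0;
        }

        STACK_WIND (frame, default_lookup_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->lookup,
                    loc, xattr_req);
        return 0;
}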

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Fuse operations

2012-05-10 Thread Jeff Darcy
On Thu, 10 May 2012 15:55:55 +1000
"Ian Latter"  wrote:

>   So, I guess;
> 1) Are all Fuse/FS ops handled by Gluster?
> 2) Where can I find a complete list of the 
>  Gluster fops, and not just those that have
>  been used in existing modules?

GlusterFS operations for a translator are all defined in an xlator_fops
structure.  When building translators, it can also be convenient to
look at the default_xxx and default_xxx_cbk functions for each fop you
implement.  Also, I forgot to mention in my comments on your "hide"
translator that you can often use the default_xxx_cbk callback when you
call STACK_WIND, instead of having to define your own trivial one.

FUSE operations are listed by the fuse_opcode enum.  You can check for
yourself how closely this matches our list.  They do have a few ops of
their own, we have a few of our own, and a few of theirs actually map
to our xlator_cbks instead of xlator_fops.  The points of
non-correspondence seem to be interrupt, bmap, poll and ioctl.  Maybe
Csaba can elaborate on what we do (or plan to do) about these.

> 3) Is it safe to path match on loc_t? (i.e. is
>  it fully resolved such that I won't find
>  /etc/././././passwd)?  This I could test ..

Name/path resolution is an area that has changed pretty recently, so
I'll let Avati or Amar field that one.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr()

2012-05-17 Thread Jeff Darcy
On 05/17/2012 03:35 AM, Anush Shetty wrote:
> 
> On 05/17/2012 12:56 PM, John Mark Walker wrote:
>> There are close to 600 people now subscribed to gluster-devel - how many
>> of them actually have an account on Gerritt? I honestly have no idea.
>> Another thing this would do is send a subtle message to subscribers that
>> this is not the place to discuss user issues, but perhaps there are better
>> ways to do that.
>> 
>> I've seen many projects do this - as well as send all bugzilla and github
>> notifications, but I could also see some people getting annoyed.
> 
> How about a weekly digest of the same.

Excellent idea.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Gluster internals

2012-05-21 Thread Jeff Darcy
On 05/20/2012 02:12 AM, Ian Latter wrote:
> Hello,
> 
> 
>   Couple of questions that might help make my
> module a little more sane;
> 
> 0) Is there any developer docco?  I've just done
> another quick search and I can't see any.  Let
> me know if there is and I'll try and answer the
> below myself.

Your best bet right now (if I may say so) is the stuff I've posted on
hekafs.org - the "Translator 101" articles plus the API overview at

http://hekafs.org/dist/xlator_api_2.html

> 1) What is the difference between STACK_WIND
> and STACK_WIND_COOKIE?  I.e. I've only
> ever used STACK_WIND, when should I use
> it versus the other?

I see Krishnan has already covered this.

> 2) Is there a way to write linearly within a single
> function within Gluster (or is there a reason
> why I wouldn't want to do that)?  

Any blocking ops would have to be built on top of async ops plus semaphores
etc. because (unlike e.g. an HTTP server) the underlying sockets etc. are
shared/multiplexed between users and activities.  Thus you'd get much more
context switching that way than if you stay within the async/continuation style.

Some day in the distant future, I'd like to work some more on a preprocessor
that turns linear code into async code so that it's easier to write but retains
the performance and resource-efficiency advantages of an essentially async
style.  I did some work (http://pl.atyp.us/ripper/UserGuide.html) in this area
several years ago, but it has probably bit-rotted to hell since then.  With
more recent versions of gcc and LLVM it should be possible to overcome some of
the limitations that version had.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] preparent and postparent?

2012-05-23 Thread Jeff Darcy
On Wed, 23 May 2012 16:58:02 +
Emmanuel Dreyfus  wrote:

> in the protocol/server xlator, there are many occurences where
> callbacks have a struct iatt for preparent and postparent. What are
> these for?

NFS needs them to support its style of caching.




___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] preparent and postparent?

2012-05-24 Thread Jeff Darcy
On 05/24/2012 03:10 AM, Xavier Hernandez wrote:
> preparent and postparent have the attributes (modification time, size, 
> permissions, ...) of the parent directory of the file being modified 
> before and after the modification is done.

Thank you, Xavi.  :)  If you really want to have some fun, you can take a look
at the rename callback, which has pre- and post-attributes for both the old and
new parent.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] submit changes to release-3.3

2012-06-15 Thread Jeff Darcy
On 06/15/2012 01:34 AM, Shishir Gowda wrote:
> Hi Emmanuel,
> 
> Please change to release-3.3 branch on the git.
> 'git checkout -b release-3.3 origin/release-3.3'

A slight alternative is to create the branch as an explicit tracking branch, so
that "git pull" etc. do the right things without needing to specify the repo
and branch every time.

git branch -t release-3.3 origin/release-3.3
git checkout release-3.3

AFAIK the -b and -t functions are not available in a single command, because
that would be convenient and this is git.  ;)

> 
> Apply your changes, and commit them.
> 
> The rfc.sh script will identify it as release-3.3 and do the needfull.
> 
> With regards,
> Shishir
> 
> - Original Message -
> From: "Emmanuel Dreyfus" 
> To: gluster-devel@nongnu.org
> Sent: Friday, June 15, 2012 10:39:33 AM
> Subject: [Gluster-devel] submit changes to release-3.3
> 
> Hi
> 
> Is there some doc explaining how should I submit changes to release-3.3?
> I mean what should I tell to git and rfc.sh...
> 


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] submit changes to release-3.3

2012-06-15 Thread Jeff Darcy
On 06/15/2012 09:29 AM, Niels de Vos wrote:
>> AFAIK the -b and -t functions are not available in a single command, because
>> that would be convenient and this is git.  ;)
> 
> Oh, but you can do that:
> 
>   git checkout -t -b release-3.3 origin/release-3.3

I guess I should have tried it before I said anything.  Last time I did, it
failed (perhaps for an unrelated reason) and I've just applied the obvious
workaround ever since.  Thanks!


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Recent dict changes affecting QEMU-GlusterFS patches

2012-06-18 Thread Jeff Darcy
On Mon, 2012-06-18 at 09:33 +0530, Bharata B Rao wrote:
> Hi,
> 
> I recently posted patches to integrate GlusterFS with QEMU.
> (http://lists.nongnu.org/archive/html/qemu-devel/2012-06/msg01745.html).
> While updating those patches to latest gluster git, I am seeing a
> problem and I tracked that down to this commit:
> 
> e8eb0a9cb6539a7607d4c134daf331400a93d136 (Optimize for small dicts,
> and avoid an overrun).
> 
> With this commit, I see an invalid memory reference in _dict_lookup().
> Some details from gdb are shown below:

I've seen something like this before, when commonly used structures
(like dict_t) change.  It seems like somehow not all dependencies are
getting updated properly, resulting in a mix of code that uses the old
srtucture and code that uses the new one.  I don't know how such a
problem can survive the rpmbuild process, which I always use even during
development, but I have seen the symptoms disappear when I've carefully
nuked all GlusterFS source and binaries from my system to guarantee that
I'm starting fresh.

In any case, I'll look into this a bit further and see if it might be
something else.  The dict_t structure did change with that commit, as
did the usage of some fields, so if your code relies somehow on old
behavior then it's possible that an update is needed.

> [root@bharata qemu]# gdb ./x86_64-softmmu/qemu-system-x86_64
> (gdb) set args --enable-kvm --nographic -m 1024 -smp 4 -drive
> file=gluster:/home/bharata/c-qemu-rpcbypass.vol:/dir1/F16,format=gluster,cache=none
> -net nic,model=virtio -net user -redir tcp:2000::22
> (gdb) r
> Starting program: x86_64-softmmu/qemu-system-x86_64 --enable-kvm
> --nographic -m 1024 -smp 4 -drive
> file=gluster:/home/bharata/c-qemu-rpcbypass.vol:/dir1/F16,format=gluster,cache=none
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x766e8ff6 in __strcmp_sse42 () from /lib64/libc.so.6
> Missing separate debuginfos, use: debuginfo-install
> glib2-2.30.3-1.fc16.x86_64 glibc-2.14.90-24.fc16.7.x86_64
> libuuid-2.20.1-2.3.fc16.x86_64 openssl-1.0.0j-1.fc16.x86_64
> zlib-1.2.5-6.fc16.x86_64
> (gdb) bt
> #0  0x766e8ff6 in __strcmp_sse42 () from /lib64/libc.so.6
> #1  0x77241ab1 in _dict_lookup (key=0x564e11b0 "directory",
> this=) at dict.c:204
> #2  _dict_lookup (this=, key=0x564e11b0
> "directory") at dict.c:192
> #3  0x772427ae in _dict_set (value=0x7534302c, key=
> 0x564e11b0 "directory", this=0x564c6c6c) at dict.c:254
> #4  dict_set (value=0x7534302c, key=, this=0x564c6c6c)
> at dict.c:327
> #5  dict_set (this=0x564c6c6c, key=, value=0x7534302c)
> at dict.c:313
> #6  0x7728c2a8 in volume_option (value=0x564e2470 "/vm", key=
> 0x564e11b0 "directory") at ./graph.y:249
> #7  yyparse () at ./graph.y:76
> #8  0x7728cbbc in glusterfs_graph_construct
> (fp=0x564dcbe0) at ./graph.y:597
> 
> 
> (gdb) up
> #1  0x77241ab1 in _dict_lookup (key=0x564e11b0 "directory",
> this=) at dict.c:204
> 204 if (pair->key && !strcmp (pair->key, key))
> (gdb) p *pair
> $1 = {hash_next = 0x564c6ca4, prev = 0x564dbbfc, next =
> 0x3ff0001, value =
> 0x1, key = 0x54 }
> 
> You can see that pair->key has invalid address.
> 
> I am using QEMU in RPC-bypass  mode and the volume file looks like this:
> # cat c-qemu-rpcbypass.vol
> volume vm
>   type storage/posix
>   option directory /vm
> end-volume
> 
> I am not familiar with this part of the code and hence will need time
> to debug this. Meanwhile if anyone else familiar with this part of the
> code could give some pointers, it will be useful.
> 
> Regards,
> Bharata.



___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Recent dict changes affecting QEMU-GlusterFS patches

2012-06-19 Thread Jeff Darcy
On 06/19/2012 06:41 AM, Bharata B Rao wrote:
> My code is  not dependent directly on any dict changes. All I am doing
> is glusterfs_graph_construct which eventually ends up doing
> _dict_lookup() (when parsing volume options). Will debug this and
> report if I find any clues.

I took a quick look at your patches, and came to approximately the same
conclusion.  I also spent an hour or so trying to reproduce the problem, but
wasn't able to.  Please do let us know what you find.  FWIW, I *always* build
with CFLAGS="-O0 -g" unless I'm specifically building for performance tests, to
avoid all that "value optimized out" and similar nonsense.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Bit-rot functionality

2012-07-23 Thread Jeff Darcy
On 07/23/2012 05:37 AM, Fred van Zwieten wrote:
> Bit-rot detection can be done through check-summing. It should be a very low
> priority job running on one of the bricks. The job walks the complete file
> system and, per file, calculates the check-sum, compares it with the stored
> check-sum (if present, otherwise it stores the check-sum on all involved
> bricks, because it hasn't been checked before).

I think this is basically a good idea, but it could be implemented more
efficiently if we ran processes on *all* bricks, each one calculating checksums
for the files in that brick.  That way all disk accesses are local, which is
important because this kind of "crawl" can take a long time.  We could also
take advantage of the marker/xtime framework to reduce the number of files we
have to check, just like we already use that framework in gsyncd to reduce the
number of files that must be replicated.  Another possibility would be to have
a translator queue a check when a file is closed.
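
As a rough illustration of the per-brick crawl - definitely not a design, and
the xattr name and toy FNV-1a checksum are made up - something along these
lines run locally on each brick would do.  A real implementation would use a
stronger digest and the marker/xtime machinery to skip files that changed
legitimately:

#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <ftw.h>
#include <sys/stat.h>
#include <sys/xattr.h>

#define CSUM_XATTR "user.glusterfs.bitrot-csum"

static uint64_t
file_checksum (const char *path)
{
        uint64_t      h = 0xcbf29ce484222325ULL;     /* FNV-1a offset basis */
        unsigned char buf[65536];
        ssize_t       n, i;
        int           fd = open (path, O_RDONLY);

        if (fd < 0)
                return 0;
        while ((n = read (fd, buf, sizeof (buf))) > 0) {
                for (i = 0; i < n; i++) {
                        h ^= buf[i];
                        h *= 0x100000001b3ULL;       /* FNV-1a prime */
                }
        }
        close (fd);
        return h;
}

static int
check_one (const char *path, const struct stat *sb, int type, struct FTW *ftw)
{
        uint64_t now, stored;

        if (type != FTW_F)
                return 0;                            /* regular files only */

        now = file_checksum (path);
        if (getxattr (path, CSUM_XATTR, &stored, sizeof (stored))
            == sizeof (stored)) {
                if (stored != now)
                        fprintf (stderr, "possible bit-rot: %s\n", path);
        } else {
                /* First visit: remember the checksum for next time. */
                setxattr (path, CSUM_XATTR, &now, sizeof (now), 0);
        }
        return 0;
}

int
main (int argc, char **argv)
{
        return nftw (argc > 1 ? argv[1] : ".", check_one, 16, FTW_PHYS);
}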

> Bit-rot restoration could be implemented by comparing the check-sums of the
> replicas. If there is a mismatch, a more thorough check must be performed, 
> like
> running a check-sum on all replica's for that file again, do
> a bit-wise compare, or whatever. If the files are still the same,
> the check-sum(s) must be replaced. If not, there is actual bit-rot detected.
> Now what to do? Which replica holds the clean version (the thruth?). With an
> uneven number of replicas one could simply make it a democratic process and
> have it fully automated. It should however save the to be replaced version in 
> a
> separate store and notify the admin for verification. Another method would be
> to just notify the admin and do nothing.

If we detect bit-rot on a file, it's almost the same as if we detect pending
operations, and many of the same resolution strategies would apply.  If we have
another replica that's "clean" in either sense we can use it as the source.  If
all replicas have rotted, then it's equivalent to split brain.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] swapcontest usage in syncio.c

2012-08-07 Thread Jeff Darcy
On Tue, 7 Aug 2012 20:13:35 +0200
m...@netbsd.org (Emmanuel Dreyfus) wrote:

> Anand Avati  wrote:
> What was behind the decision to use swapcontext, btw? Why not just
> have a thread for each task?

I should probably let the authors speak for themselves, but I suspect
it's because operations like self-heal and rebalance can be expected to
generate a *lot* of sync calls.  Threads do still consume non-trivial
resources, and switches between them still involve a trip through the
scheduler, even if they share an address space etc.  I just ran a quick
experiment, and ping-ponging between tasks via swapcontext was ~9x as
fast as via pthreads.  I didn't measure the effect on memory
consumption, but it's likely to be at least as large as the effect on
execution time.
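
For the curious, the ping-pong loop was along these lines - a rough sketch of
the idea, not the actual test program - which can then be timed against an
equivalent pthread version using a mutex/condvar pair:

#include <stdio.h>
#include <ucontext.h>

#define ITERATIONS 1000000

static ucontext_t main_ctx, task_ctx;

static void
task_fn (void)
{
        int i;

        for (i = 0; i < ITERATIONS; i++)
                swapcontext (&task_ctx, &main_ctx);   /* yield back */
}

int
main (void)
{
        static char stack[64 * 1024];
        int         i;

        getcontext (&task_ctx);
        task_ctx.uc_stack.ss_sp   = stack;
        task_ctx.uc_stack.ss_size = sizeof (stack);
        task_ctx.uc_link          = &main_ctx;        /* return here on exit */
        makecontext (&task_ctx, task_fn, 0);

        for (i = 0; i < ITERATIONS; i++)
                swapcontext (&main_ctx, &task_ctx);   /* run the task a bit */

        return 0;
}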

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] NetBSD swapcontext() portability fix

2012-08-09 Thread Jeff Darcy
On Thu, 9 Aug 2012 16:32:20 +
Emmanuel Dreyfus  wrote:

> The patch sets SYNCENV_PROC_MAX to 1 for NetBSD. If I understand
> correctly, that address the problem raised here: there is only one
> thread in a syncenv. Did I get it wrong?  

That could be considered an extreme form of enforcing task/thread
affinity, and thus it would avoid the scenario we were discussing, but
it also precludes getting any kind of parallelism through the syncop
subsystem.  I'd rather see an approach that enforces task/thread
affinity (on platforms that need it) but still allows multiple tasks to
run in parallel (on any platform).  Then again, if this is sufficient
for your needs, perhaps we should just merge the patch and not worry
about parallelism unless/until someone on a similar platform expresses
a need for it.

> I asked about the issue on tech-k...@netbsd.org. I get a first reply 
> suggesting thet setjmp()/longjmp() would be better suited than 
> swapcontext() for that job. Any opinions?  

I don't think setjmp/longjmp provide an adequate alternative, since we
really do need separate stacks which are preserved across arbitrarily
many task transitions.  As soon as you do a longjmp, you start
destroying the stack from before.  That's OK for implementing
exceptions, but not for "green threads" or continuations.  That's kind
of why the ucontext calls were created, even though setjmp/longjmp
already existed at the time.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] split brain

2012-08-15 Thread Jeff Darcy
On 08/15/2012 11:27 AM, Emmanuel Dreyfus wrote:
> Attributes:
> trusted.glusterfs.dht 00 00 00 01 00 00 00 00 7f ff ff ff ff ff ff ff
> trusted.afr.gfs33-client-1  00 00 00 00 00 00 00 02 00 00 00 00
> trusted.afr.gfs33-client-0  00 00 00 00 00 00 00 00 00 00 00 00  
> trusted.gfid   29 d1 70 bb 63 91 40 ed b4 c6 27 d8 ca a7 2a 64
> 
> On the other bricks:
> trusted.glusterfs.dht 00 00 00 01 00 00 00 00 00 00 00 00 7f ff ff fe
> trusted.afr.gfs33-client-2   00 00 00 00 00 00 00 00 00 00 00 00
> trusted.afr.gfs33-client-3  00 00 00 00 00 00 00 00 00 00 00 00  
> trusted.gfid   29 d1 70 bb 63 91 40 ed b4 c6 27 d8 ca a7 2a 64
> 
> trusted.glusterfs.dht 00 00 00 01 00 00 00 00 7f ff ff ff ff ff ff ff
> trusted.afr.gfs33-client-1  00 00 00 00 00 00 00 00 00 00 00 00
> trusted.afr.gfs33-client-3  00 00 00 00 00 00 00 00 00 00 00 00  
> trusted.gfid   29 d1 70 bb 63 91 40 ed b4 c6 27 d8 ca a7 2a 64
> 
> trusted.glusterfs.dht 00 00 00 01 00 00 00 00 00 00 00 00 7f ff ff fe
> trusted.afr.gfs33-client-2   00 00 00 00 00 00 00 01 00 00 00 00 
> trusted.afr.gfs33-client-3  00 00 00 00 00 00 00 00 00 00 00 00  
> trusted.gfid   29 d1 70 bb 63 91 40 ed b4 c6 27 d8 ca a7 2a 64
> 
> I tried to understand the code here, It is reading trusted.afr.gfs33-client-*
> and it builds a matrix, which looks like this:
> pending_matrix: [ 0 1 ]
> pending_matrix: [ 2 0 ]
> 
> Then afr_sh_wise_nodes_conflict() decides that nsources = -1. 
> 
> Is there some documentation explaining how it works? Someone call tell me why
> it decides it is split brain?

I really hope the above contains a typo or copy/paste error, because if it
doesn't then ICK.  Without seeing the volfile I have to guess a little, but it
looks as though the first and third bricks above should be client-0 and
client-1 (check the matching values of trusted.glusterfs.dht) while the second
and fourth should be client-2 and client-3.  In the first place, it's odd that
the file even exists in both replica sets.  Is one a linkfile?  In any case, I
think the second and fourth bricks shown above (client-2 and client-3) are
irrelevant.

The next anomaly is the 2 in the pending matrix.  Its position indicates that
it's the second volume in the AFR definition accusing the first, and the first
must be client-1 based on the xattr name, so your volume definition must be
backwards - "subvolumes client-1 client-0" in the volfile.  That's how we get
to [0 0][2 0].  Where does the counter-accusation come from?  One clue might be
that client-1 (the third brick shown above) has xattrs for itself and
*client-3*.  Because it's missing an xattr for client-0, it's considered
ignorant and therefore we bump up other bricks' pending-operation counts for
it.  However, because of the reversed brick order, that should be client-0
(second row) accusing client-1 (first column), getting us to [0 0][3 0], and
that's fully resolvable.  In fact I tried this xattr configuration, in both
directions, on a simple two-brick AFR volume myself, and it healed correctly
both times.

The only thing I can think of is that there's some further confusion or
inconsistency in how your volumes are defined, so that either the handling of
ignorant nodes is being done the wrong way or the pending-operation count from
the fourth brick shown above is being brought in even though it should be
irrelevant.  If I were you I'd double check that the volfiles look the same
everywhere, that the same brick names refer to the same physical locations
everywhere (includes checking /etc/hosts or DNS for inconsistencies), and that
the xattr values really are as reported above.  I don't think this combination
of conditions can occur without there being some kind of inconsistency there.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] split brain

2012-08-15 Thread Jeff Darcy
On 08/15/2012 02:15 PM, Emmanuel Dreyfus wrote:
>> It's odd that the file even exists in both replica sets.  
> 
> It is a directory. Directories should be on all bricks, shouldn't they?

Yes, they should.  That clears up that particular mystery.

>   1: volume gfs33-client-0
>   2: type protocol/client
>   3: option remote-host silo
>   4: option remote-subvolume /export/wd3a
> (...)
>   8: end-volume
>   9: 
>  10: volume gfs33-client-1
>  11: type protocol/client
>  12: option remote-host hangar
>  13: option remote-subvolume /export/wd3a
> (...)
>  17: end-volume
>  18: 
>  19: volume gfs33-client-2
>  20: type protocol/client
>  21: option remote-host hangar
>  22: option remote-subvolume /export/wd1a
> (...)
>  26: end-volume
>  27: 
>  28: volume gfs33-client-3
>  29: type protocol/client
>  30: option remote-host hotstuff
>  31: option remote-subvolume /export/wd1a
> (...)
>  35: end-volume
>  36: 
>  37: volume gfs33-replicate-0
>  38: type cluster/replicate
>  39: subvolumes gfs33-client-0 gfs33-client-1
>  40: end-volume
>  41: 
>  42: volume gfs33-replicate-1
>  43: type cluster/replicate
>  44: subvolumes gfs33-client-2 gfs33-client-3
>  45: end-volume

That all looks perfectly reasonable, which leaves us with a conundrum.  If
client-1 is listed second in the replicate-0 definition, then the 2 should be
in the *second* column of the pending matrix regardless of what's going on with
hosts/DNS.  It's unclear how we get a 2 in the first column or how (without
any "ignorant" bricks) we get another 1 anywhere.  Maybe if you could look at
the actual xattr values when the code enters afr_build_sources we could see
what the pending matrix looks like before we start tweaking it.  That at least
divides the problem space into cases where we have the wrong value when we
start and cases where we create a wrong value within the code.


-- 

ObSig: if you use "ask" as a noun I will ignore you for a week.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] split brain

2012-08-16 Thread Jeff Darcy
On 08/16/2012 04:55 AM, Emmanuel Dreyfus wrote:
> On all bricks, .glusterfs/3e/6b/3e6b026a-b9ed-4845-a5d1-6eb06412b3ca
> is a symlink to the directory 
> .glusterfs/4b/34/4b34a8a2-bff2-4684-b005-a36b069914ab/arch
>
> I am a bit surprised to see a link to a subdir of a .glusterfs hash.
> Is it something that makes sense? Or is it again a link(2) that should
> be replaced by linkat(2) ?

It's part of the GFID-based back end that's new in 3.3.  Amar would be the
expert, but he's on leave right now.  Avati could probably also provide decent
answers.  I admit that I don't understand all of the nuances well enough to do 
so.

> Here are the xattr for the directory:
> 
> gfs33-client-0
> trusted.glusterfs.dht   00017fff
> trusted.afr.gfs33-client-1  0002
> trusted.afr.gfs33-client-0  
> trusted.gfid3e6b026ab9ed4845a5d16eb06412b3ca
> 
> gfs33-client-1
> trusted.glusterfs.dht   00017fff
> trusted.afr.gfs33-client-1  
> trusted.afr.gfs33-client-0  0001
> trusted.gfid3e6b026ab9ed4845a5d16eb06412b3ca
> 
> gfs33-client-2
> trusted.glusterfs.dht   00017ffe
> trusted.afr.gfs33-client-3  
> trusted.afr.gfs33-client-2  
> trusted.gfid3e6b026ab9ed4845a5d16eb06412b3ca
> 
> gfs33-client-3
> trusted.afr.gfs33-client-3  00
> trusted.afr.gfs33-client-2  00
> trusted.glusterfs.dht   00017ffe
> trusted.gfid3e6b026ab9ed4845a5d16eb06412b3ca

OK, here's something I'm much more comfortable with.  Note how this differs
from what you presented earlier, where the non-zero values were on client-0
pointing to client-1 and client-3 pointing to client-2.  Now we still have
client-0 pointing to client-1, but also client-1 pointing to client-0.  That's
a true split brain; operations seem to have completed on each node that didn't
complete on the other, so we don't know which values should take precedence.
The way I'd fix it would be to clear (not remove) one of the non-zero
trusted.afr xattrs, and let self-heal do the rest.

> I understand pending are the trusted.afr from the bricks,
> but what do they represent, by the way?

These two posts explain it about as well as I'm able:

http://hekafs.org/index.php/2011/04/glusterfs-extended-attributes/
http://hekafs.org/index.php/2012/03/glusterfs-algorithms-replication-present/



-- 

ObSig: if you use "ask" as a noun I will ignore you for a week.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] FW: Change in glusterfs[master]: Add support for --enable-debug configure option

2012-08-16 Thread Jeff Darcy
On 08/16/2012 10:36 AM, Kaleb S. KEITHLEY wrote:
> On 08/16/2012 10:32 AM, Kaleb S. KEITHLEY wrote:
>>
>> But note that for RPMs anyway, everything _is_ actually compiled with
>> -g; the debug symbols are stripped from the binaries after the debuginfo
>> RPM is produced. Off hand I'm not sure there's a lot of value in adding
>> an --enable-debug option — just install the debuginfo RPMs instead if
>> you want to debug.
>>
> 
> Obviously that only applies if you're building or installing from RPMs.

FYI, I always build from RPMs but I disable the whole separate-debuginfo part
(which I and AFAICT most developers find obnoxious).  There are many formulae
out there to do this, but in case anyone's curious here's what I have in my
.rpmmacros:

%__arch_install_post   /usr/lib/rpm/check-rpaths   /usr/lib/rpm/check-buildroot
%__strip /bin/true
%debug_package %{nil}

Back to the topic, "just install the debuginfo RPMs instead if you want to
debug" seems to be about -g whereas the real change here is to do with -O0 vs.
-O2 (vs. nothing).  Regardless of whether debug symbols are built in or
available in a separate package, trying to debug code that was built with -O2
is kind of annoying.  Having an easy way to configure with -O0, that works the
same way across .rpm and .deb and .tar builds, seems worthwhile.

-- 

ObSig: if you use "ask" as a noun I will ignore you for a week.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Clean up build system

2012-09-24 Thread Jeff Darcy
On 09/24/2012 03:16 AM, Jan Engelhardt wrote:
> On Monday 2012-09-24 08:55, Anand Avati wrote:
> 
>> Really appreciate this cleanup. Can you please submit the patches at 
>> http://review.gluster.org as described in 
>> http://www.gluster.org/community/documentation/index.php/Development_Work_Flow?
>>  
>> Thanks! Avati
> 
> There are two bugs in your development model:
>  - it requires registration
>  - requires entering patches in some awkward web form
>defeating the whole purpose of git send-email
> 
> That makes it _very_ cumbersome for passers-by.

Perhaps, but the goal is to keep the *entire* process - not just patch
submission but also patch review - more open.  Reviewing patches in gerrit is
much more convenient for most people, and does a better job of facilitating
review conversations than dumping them into the -devel list intermingled with
every other conversation there.  That's why projects much bigger than ours use
gerrit, and use it to good effect.

That said, if you're unwilling to accept a development process that differs one
iota from what you've used elsewhere, there is a precedent for team members
"shepherding" patches from others.  They would show up e.g. with the following
headers (from fe4777660a0a92da6da582103690fa0c2e5c7496).

    Original-author: domwo 
Signed-off-by: domwo 
Signed-off-by: Jeff Darcy 

Note that we absolutely do require the Signed-off-by line to indicate that you
are authorized to yield any copyright on your submissions and thus prevent
those submissions from threatening the livelihood of every other developer
because of IP lawsuits (same reason that the same thing is required on Linux
kernel submissions BTW).  If you prefer to go that route, let us know.





___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Proposed change in Gerrit workflow

2012-09-25 Thread Jeff Darcy
On 09/25/2012 06:43 AM, Vijay Bellur wrote:
> We intend to bring the following change in our gerrit based workflow:
> 
> - Introduce +2 and -2 for Verified in Gerrit
> - +2 for Verified to be necessary for merging a patch
> 
> The intent of this proposed change is to get additional test coverage 
> and reduce the number of regressions that can sneak by. Jenkins would 
> continue to provide +1s for all submitted changes that pass basic smoke 
> tests. An additional +2 would be necessary from somebody who tests the 
> patch. Providing a +2 for Verified would be semantically similar to 
> adding a Tested-by: tag.

I like the idea generally, but I think it would be good to have a bit more
clarity about what testing +2 requires.  Is self-testing OK, or must it be
someone else?  Are manual tests OK, or must it be a (possibly new) part of the
standard functional/regression tests?  If manual tests are OK, what level of
explanation is required w.r.t. what tests were run on what configuration?  I
don't think we need to set the bar especially high right now, but IMO it does
need to be spelled out in our development-process doc.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Build system cleanups (v2)

2012-09-25 Thread Jeff Darcy
On 09/25/2012 03:34 PM, Anand Avati wrote:
> Any volunteers for adopting these patches into gerrit?

I'll volunteer.  That doesn't mean I agree with their content or condone the
way they're being thrown over the fence at us, but at least it will allow a
proper review to occur.



___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Build system cleanups (v2)

2012-10-03 Thread Jeff Darcy
On 09/25/2012 12:40 PM, Jan Engelhardt wrote:
> 
> This patchset of 15 obsoletes the earlier set of 6
> (http://lists.nongnu.org/archive/html/gluster-devel/2012-09/msg00066.html )
> 
> 
> 
> The following changes since commit 373b25827f0250d11461fbe76dd6a0e295069171:
> 
>   core: enable process to return the appropriate error code (2012-09-21 
> 20:43:05 -0700)
> 
> are available in the git repository at:
> 
>   git://git.inai.de/glusterfs master
> 
> for you to fetch changes up to b4b0bb38a01037df70e629a6ba8c195205ca9c27:
> 
>   build: make use of system libuuid (2012-09-25 17:54:27 +0200)
> 
> 
> Jan Engelhardt (15):
>   build: add missing GF_CFLAGS in api/src/
>   build: add missing backslash in api/src/
>   build: more efficient clean
>   init.d: use proper dependencies in SUSE init script
>   init.d: implement reload action for SUSE init script
>   build: consolidate common compilation flags into one variable
>   build: replace INCLUDES by CPPFLAGS
>   build: fix a typo in the python xlator Makefile
>   build: remove two no-op lines from rdma Makefile
>   build: remove -nostartfiles flag
>   build: remove useless explicit -fPIC -shared from CFLAGS
>   build: move -L arguments out of CFLAGS
>   build: split CPPFLAGS from CFLAGS
>   build: libraries must be in LDADD/LIBADD
>   build: make use of system libuuid

With a great deal of manual adjustment and rebasing, eleven of these fifteen
patches have been merged.  The exceptions are:

(3) more efficient clean:
The proposed Solaris-derived "find" syntax is even less portable than what it
replaces, notably w.r.t. the platforms we actually support.

(8) typo in python xlator
The Python binding has been non-functional and deprecated for a while, and
instead of trying to fix it we should just remove it (possibly to be replaced
some day by glupy).

(9) no-op lines in rdma
This removes lines that are clearly intended as continuations of the previous
one, which is simply incorrect.

(15) Use system libuuid.
I'm pretty sure we had a specific exception to the bundling rule for this, but
I don't remember the reason; maybe someone else on the list does.  In any case,
Fedora has already accepted it and they're the authority on the matter.






___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] glusterd crashes when synctask is used in conjunction with inner functions.

2012-10-08 Thread Jeff Darcy
On October 8, 2012 7:02:34 AM Krishnan Parthasarathi 
 wrote:

I have been rewriting some of the volume operations (like volume-start,
volume-add-brick, volume-remove-brick) using synctask library (aka syncops).
This change has the following immediate benefits,

- volume-start would return success/failure depending on the success/failure of
  brick process(es) spawned.
- would make glusterd's epoll thread 'more' available.



While I was making the changes in http://review.gluster.com/3969, I noticed
that whenever the code executing on a synctask called into dict_foreach, which
was supplied a function ptr defined as an inner function, glusterd crashed.
When I rewrote the inner function as a static function, glusterd wouldn't
crash.

Has anyone seen or can explain (or give possible leads to analyse) this
behaviour?

FWIW, inner functions are only available as part of GNU extensions to C. So, I
assumed it is not such a bad thing to move the inner functions 'out' in my
patch.


Ugh.  I noticed this pattern while I was looking at some AFR stuff 
recently.  I thought it was rather clever, and pondered a bit about how 
it might be implemented in gcc/libc.  Apparently it was a bit too 
clever, and the implementation leaves something to be desired.  An 
inner function might be more elegant than defining private structures 
to pass through our own context pointer, but in the interest of both 
portability and not having to debug compiler code I think changing 
these to work the "old fashioned" way would actually be a good thing.




___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] glusterd crashes when synctask is used in conjunction with inner functions.

2012-10-08 Thread Jeff Darcy
On 10/08/2012 01:21 PM, Krishnan Parthasarathi wrote:
> FWIW, inner functions also make it harder to debug with gdb (read 
> breakpoints).
> I am yet to demonstrably establish that inner functions' would
> crash when run in synctask. I hope some of you might be able to shed
> light on how I could resolve that.

I think to answer that, we'd need to examine how inner functions are (or at
least might be) implemented.  The fundamental problem is how the inner function
actually accesses variables within the outer function's local scope.  For a
direct call from outer to inner, it's pretty easy - it's just another set of
stack offsets.  Otherwise, the inner function needs a pointer to the outer
function's stack frame.  How does it get that?  Functions in between might be
totally oblivious to this pointer (it's not in their argument lists) so you
can't just pass it directly.  It can't be stored directly in a function
pointer either, for similar reasons.  The only seemingly workable option would
be to generate a "thunk" dynamically, something like this:

int
real_inner_function (void *outer_frame, int arg)
{
        /* Same as user's function but with extra arg. */
}

int
inner_function (int arg)
{
        return real_inner_function(COMPILE_TIME_CONSTANT,arg);
}

So now a pointer to inner_function is actually a pointer to a specific
invocation within the context of an outer-function call.  What happens when
that outer function returns?  Uh oh.  Is any of this safe in the presence of
both pthreads and ucontext threads?  Uh oh again, and we do use both in our
synctask implementation.  Hmmm.  In a way, we're doing almost everything we can
to trip up gcc's inner-function implementation, so it's kind of no surprise
that we've succeeded.

I haven't actually examined the gcc code to see if that's how inner functions
are implemented.  There are other possibilities, but most are equally
susceptible to our threading shenanigans.  The real proof point is whether the
problems go away if we stop using inner functions.  There aren't too many uses
currently, so it shouldn't be a prohibitively difficult experiment, and if
you're right that inner functions are to blame then we don't need to figure out
why.  We should just stop using them.




___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] glusterd crashes when synctask is used in conjunction with inner functions.

2012-10-08 Thread Jeff Darcy
OK, I couldn't resist.  I've attached an inner-function test program which
weakly confirms my theory.  If you run it with no arguments, the inner function
works.  If you run it with arguments, this causes the stack to be reused and
that seems to include the "thunk" I mentioned.  The result is a jump into
nowhere, followed by SIGSEGV or SIGILL (oddly I've seen both).  If it fails in
the outer-function-return case, I'll bet it fails with ucontext trickery too.

http://gcc.gnu.org/onlinedocs/gcc/Nested-Functions.html
http://stackoverflow.com/questions/2929281/are-nested-functions-a-bad-thing-in-gcc

#include <stdio.h>

typedef void print_fn (void);

print_fn *my_print_fn;

void
outer (void)
{
        int outer_val = 0x5678;

        void
        inner (void)
        {
                printf("in inner function, result = 0x%x\n",outer_val);
        }

        printf("in outer function\n");
        inner();
        my_print_fn = &inner;
}

void
rewrite_stack (void)
{
        char junk[1024] = {0,};

        printf("rewriting stack\n");
}

int
main (int argc, char **argv)
{
        outer();
        if (argc > 1) {
                rewrite_stack();
        }
        (*my_print_fn)();
        return 0;
}

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] glusterd crashes when synctask is used in conjunction with inner functions.

2012-10-08 Thread Jeff Darcy
On 10/08/2012 02:14 PM, Krishnan Parthasarathi wrote:
>>  int
>>  real_inner_function (void *outer_frame, int arg)
>>  {
>>  /* Same as user's function but with extra arg. */
>>  }
>>
>>  int
>>  inner_function (int arg)
>>  {
>>  real_inner_function(COMPILE_TIME_CONSTANT,arg);
> 
> Shouldn't the outer frame's address be RUNTIME_CONSTANT? 
> 
>>  }

For dynamic code generation (even the degenerate sort that's involved in
creating thunks and trampolines), compile time is run time.  ;)



___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] glusterd crashes when synctask is used in conjunction with inner functions.

2012-10-08 Thread Jeff Darcy
On 10/08/2012 02:43 PM, Krishnan Parthasarathi wrote:
> I tried the experiment you had suggested. The following are the 
> changes I made to 'inner' function to take a single integer arg.
> On compiling (gcc inner.c) and running, I didn't see any crash :(

Are you sure you ran it both with and without arguments?  Without arguments it
doesn't overwrite the stack and you won't see a crash with either version.
With arguments it does overwrite the stack and you should see a crash (I did)
with either version.

jdarcy@jdarcy-dt snippets 14:46
$ ./inner
in outer function
in inner function, result = 0x5678:5678
in inner function, result = 0x5678:42
jdarcy@jdarcy-dt snippets 14:46
$ ./inner xxx
in outer function
in inner function, result = 0x5678:5678
rewriting stack
Segmentation fault (core dumped)



___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] glusterd crashes when synctask is used in conjunction with inner functions.

2012-10-09 Thread Jeff Darcy

On 10/08/2012 11:03 PM, Amar (ಅಮರ್ ತುಂಬಳ್ಳಿ) wrote:

It was me who introduced inner functions with patch
http://review.gluster.org/3829.

The reason was to keep the changes as minimal as possible within that
particular patch. I have no objections to remove the inner functions
completely from the codebase.

But it was good exercise to understand the reason for crash though :-)


We geeks need to have fun sometimes.  ;)  Seriously, I think it is an 
elegant concept and way of doing things.  If the gcc implementation 
weren't buggy, I'd encourage use of inner functions in many more places. 
 Sadly, that doesn't seem to be the case.  We'll have to wait until 
we're all writing our code in Go or Rust or something.





___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Questions on SSL in 3.4.0qa2

2012-11-05 Thread Jeff Darcy
On 11/03/2012 02:59 AM, Emmanuel Dreyfus wrote:
> About SSL support, client logs stuff below at mount time. Is the first
> "SSL support is NOT enabled" message relevant?

Not really.  That's the connection we make to glusterd on the remote node to
get the port number for the brick (that's also why the xlator shows as
"glusterfs" instead of "gfs33-client-X") and that connection is not SSL.

> Another problem: if I mount/unmount/mount on a client, the second mount
> fails and servers are stuck in a state where it is not possible to mount
> until they are restarted. It is possible to mount the same filesystem
> from a client using different mount points, though. It will only break
> when the filesystem is mounted for the second time on a given
> mountpoint.

Thanks for reporting that.  I've submitted a patch to fix it.

http://review.gluster.org/4158


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Questions on SSL in 3.4.0qa2

2012-11-06 Thread Jeff Darcy


On Nov 6, 2012, at 12:53 AM, Emmanuel Dreyfus wrote:


Jeff Darcy  wrote:


Thanks for reporting that.  I've submitted a patch to fix it.


Here is it at mount time (100% reproductible)


Are you trying to configure SSL for the management connection too?   
I've verified that we're getting into __socket_disconnect for the  
portmapper connection during mount, but we don't call SSL_clear  
because that connection isn't using SSL.  I've also verified that  
__socket_disconnect works for the main SSL connection, where it does  
call SSL_clear.  It appears that we're getting into SSL_clear for a  
non-SSL connection in your case, which is strange to say the least.




pending frames:
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-11-06 05:46:03
configuration details:
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
spinlock 1
extattr.h 1
xattr.h 1
st_atimespec.tv_nsec 1
package-string: glusterfs 3.4.0qa2

Program terminated with signal 11, Segmentation fault.
#0  0xba389746 in SSL_clear () from /usr/lib/libssl.so.10
(gdb) bt
#0  0xba389746 in SSL_clear () from /usr/lib/libssl.so.10
#1  0xba3abc4c in __socket_disconnect (this=0xbb717000) at socket.c:500
#2  0xba3b36b9 in socket_poller (ctx=0xbb717000) at socket.c:2235
#3  0xbbb584ea in ?? () from /usr/lib/libpthread.so.1
#4  0xbb905ea0 in ___lwp_park50 () from /usr/lib/libc.so.12
#5  0xb780 in ?? ()
#6  0xbbb586fa in pthread_create () from /usr/lib/libpthread.so.1
#7  0xba3aca9f in socket_connect (this=0xbb715c00, port=49152) at
socket.c:2587
#8  0xbbb77b3f in rpc_transport_connect (this=0xbb715c00, port=49152)
   at rpc-transport.c:384
#9  0xbbb790cf in rpc_clnt_reconnect (trans_ptr=0xbb715c00) at
rpc-clnt.c:427
#10 0xbbba2fb4 in gf_timer_proc (ctx=0xbb7011c0) at timer.c:168
#11 0xbbb584ea in ?? () from /usr/lib/libpthread.so.1
#12 0xbb905ea0 in ___lwp_park50 () from /usr/lib/libc.so.12
#13 0xb9a0 in ?? ()
#14 0xbbb9ec5c in default_notify (this=0xbbb8797a, event=1,
data=0xba4ea000)
   at defaults.c:1307
#15 0x0804a03f in ?? ()


--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org



___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Questions on SSL in 3.4.0qa2

2012-11-06 Thread Jeff Darcy
On 11/06/2012 08:11 AM, Emmanuel Dreyfus wrote:
> Jeff Darcy  wrote:
> 
>> Are you trying to configure SSL for the management connection too?  
> 
> I just did this:
> gluster volume set gfs server.ssl true
> gluster volume set gfs client.ssl true
> 
> What is the difference, by the way?

One affects the bricks, and the other affects the clients.  There would never
be a reason to have one set without the other; the fact that they're separate
options is just an artifact of how our option system works, and should be fixed
some day.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Remote volume listing

2012-11-19 Thread Jeff Darcy
On 11/19/2012 10:47 AM, Niels de Vos wrote:
> You can get the list with the 'gluster' command from the glusterfs-server 
> package:
> 
> # gluster --remote-host=storage-01.example.com volume list
> 
> or if you prefer parsing XML:
> 
> # gluster --xml --remote-host=storage-01.example.com volume list
> 
> There is no need for the client system to be in the peer-list of the storage 
> servers, so it is pretty straight forward.

It doesn't need to be in the peer list, but it does need to have the
glusterfs-server package installed (one reason I think the CLI should be separate
BTW).  Also, it's even less secure than the ssh solution.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] File integrity and consistency in geo-replication

2012-12-19 Thread Jeff Darcy

On 12/19/2012 12:57 PM, Natale Vinto wrote:

I saw the Server Quorum feature for the next version, I was wondering
if it is the one from the Duvvuri theory and if could be useful for
that case killing unconsistent bricks.


I hadn't heard of Duvvuri, and all I could find in a quick search was a couple 
of old papers about adaptive leasing.  Do you have any other references?  The 
server quorum feature allows us to avoid inconsistency from writes done without 
local quorum, but has practically no effect on geo-replication.  What are your 
expectations about quorum and consistency in a wide-area environment?



And, what about using Hadoop with the Gluster connector?


Um . . . it works?  Not sure what you're getting at here.


I think that this work would require a massive study and testing (for
me at least!), but it would be very nice do this research trying to
get an international cultural needing working thanks to a big
opensource project, "in perpetuum" :)


I agree.  It would definitely be good for us to understand what your needs are 
with respect to consistency or data integrity, and discuss how our modular 
architecture might allow us to add features that address those needs.




___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] compiling 3.3 , error

2013-01-04 Thread Jeff Darcy
On 1/4/13 4:32 PM, Jay Vyas wrote:
> Hi guys:  Im trying to build glusterfs so I can look into the shared
> object files. However, Im getting this error. 
> 
> Im not quite sure how to run make in verbose mode.   it seems like the
> mode it is running in is quiet.  Also, forgive my ignorance of the
> debugging the C build process if im missing something trivial here:
> thanks in advance ... j

Even with a normal build (I always always always use rpmbuild myself) I usually
see more error info than that.  In any case, you can always try "V=1 make" and
see if it helps.  Alternatively, you could grab an already-built RPM and unpack
it to get at the shared objects.  What exactly are you trying to do, at a
higher level?

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Client side translators doubt

2013-01-15 Thread Jeff Darcy

On 01/15/2013 01:43 PM, Gustavo Bervian Brand wrote:

   I'm trying some volumes configurations with 2 nodes, each one having a
gluster client and server running.

   Both clients have each one a volume related to my translator, which has as
sub volumes two "protocol/client" subvolumes (one subvol pointing to the local
node's IP/vol and another pointing to the remote node IP/vol).

   This works OK, and here comes the problem: when I try to change the local
vol at the client side from a "protocol/client" type to a "posix" type the read
breaks with -1 (operation not permitted).


You don't say what version you're using, but could it be one of these?

https://bugzilla.redhat.com/show_bug.cgi?id=868478
(patch for previous at http://review.gluster.org/#change,4114)
https://bugzilla.redhat.com/show_bug.cgi?id=822995

In general, going directly to storage/posix seems ill-advised.  It bypasses a 
bunch of translators like marker and access-control, for example.  As we go 
forward there are likely to be even more "helper" translators for UID mapping, 
coordination for client-side encryption or erasure coding.  Since it's not 
possible to create such a configuration through the CLI or other supported 
tools, it's not going to work properly when configurations change, either.  Is 
it really worth all that, for what is likely to be a modest performance gain in 
most cases?



___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Client side translators doubt

2013-01-16 Thread Jeff Darcy
On 01/16/2013 12:41 PM, Gustavo Bervian Brand wrote:
>   Finally, let's get back to my original configuration with both
> subvolumes as "protocol/client" types: it works ok until I try something
> unusual, which is pointing at the server side of both nodes their
> "posix" type subvolumes to the same shared path. This path is a mount
> point shared by both nodes through a lustre FS. In this case, both posix
> subvolumes, at the backend, are writing to the same place. Should I
> expect this to work without problems or changes at the posix translator
> would be necessary?

Bricks *must* use separate storage.  Even if there's a reason to create
bricks on top of shared storage (and you've accurately guessed my
reaction to that) they should be in separate subdirectories.  Otherwise
they will definitely step on each other's "private" information in
.glusterfs, and the level of change that would be necessary to make them
work a different way is considerable.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Client side translators doubt

2013-01-16 Thread Jeff Darcy
On 01/16/2013 02:21 PM, Stephan von Krawczynski wrote:
> rename ".glusterfs" in ".glusterfs-" in source

Changing something from a constant to a context-dependent variable is a
bit more complicated than that.  There are nine places where this path
is used as HIDDEN_PATH, ten where it's used as HANDLE_PFX, and a couple
of others.  All of those would have to change, plus we'd have to add
code to ensure uniqueness of brick names, etc.  It's orders of magnitude
less work than putting something in the kernel (for example), but at the
same time it's more than a single change and recompile . . . all to
avoid creating a couple of subdirectories.  That seems like a rather
poor use of everyone's time.

> make ".glusterfs" a dir and store brick-private information in a file
>  and global (shared) information in a file "global".

Maybe you should look at what's already in there and how it's used
before you assume that it could (or should) all be contained in two
plain files.  Neither anonymous file handles nor the index translator
fit such a model.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


[Gluster-devel] Bricks as first-class objects

2013-01-22 Thread Jeff Darcy
Right now, bricks are sort of second-class objects.  They're host:path pairs 
that sort of only exist within the context of the volumes where they're used, 
and they don't have any other attributes.  What if they did have their own 
identity and attributes?  Consider the following:


gluster brick create mybrick server1:/some/random/path

OK, big deal.  Now it gets a bit more interesting.

gluster brick set mybrick storage-type reallyfast

Still doesn't seem all that useful, eh?

gluster brick set otherbrick storage-type reallyfast
gluster volume set placement-pattern '*.mp4:reallyfast'

This is from http://review.gluster.org/#change,4410 which is what inspired this 
line of thinking.  Now things get much more interesting.  We can essentially 
put bricks into "placement groups" and use those to give users more control 
over where their files go.  Some of our competitors already do this.  ;) 
Here's another trick.


gluster brick stop mybrick
gluster brick move mybrick server2:/another/path
gluster brick start mybrick

Pretty obvious what happened here, isn't it?  The user wants to move a brick 
physically from server1 to server2.  This way seems very intuitive, and because 
we retain the brick's identity/attributes throughout it's very easy for us to 
do the right thing - in contrast to the arcane details of current 
replace-brick.  Being able to start/stop individual bricks in a fully 
integrated way will be very handy for testing too.


We could also do top/latency on individual bricks this way some day, and all 
sorts of other tricks too.  It doesn't even seem like it would be all that 
complicated to implement.  Any thoughts?
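
P.S. To be clear about how little machinery the matching side of this would
need, here's roughly what I'm picturing (a sketch only; the structures and
helper below are as made-up as the option names above):

#include <fnmatch.h>
#include <string.h>

struct brick {
        const char *name;
        const char *storage_type;  /* e.g. "reallyfast", set via "gluster brick set" */
};

/* pattern is the volume-level option, e.g. "*.mp4:reallyfast" */
int
brick_matches (const struct brick *b, const char *pattern, const char *filename)
{
        char        glob[256];
        const char *sep = strchr(pattern,':');

        if (!sep || (size_t)(sep - pattern) >= sizeof(glob)) {
                /* malformed pattern: don't restrict placement */
                return 1;
        }

        memcpy(glob,pattern,sep-pattern);
        glob[sep-pattern] = '\0';

        if (fnmatch(glob,filename,0) != 0) {
                /* file doesn't match the pattern: any brick will do */
                return 1;
        }

        return b->storage_type && strcmp(b->storage_type,sep+1) == 0;
}

All the interesting work would be in how DHT consumes the resulting subset of
bricks, not in the matching itself.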




___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Bricks as first-class objects

2013-01-22 Thread Jeff Darcy

On 01/22/2013 01:07 PM, Jay Vyas wrote:

2 - I assume you would rather discourage people from custom coding the brick
logic - since its the volume level of abstraction that you want people normally
to work from - right... ?


I don't know if we should discourage them.  Random placement serves well in a 
great many cases, but the issue of heterogeneous storage and placing particular 
files onto particular bricks comes up all the time.  What we should do is give 
users the maximum flexibility to express their preferred policy, but then make 
the application of that policy as automagical as we can.



3 - Are there optimizations that happen in the way the gluster fuse mounts
work, wherein volumes sort of assume that the bricks aren't moving around
beneath them.?


To the extent that there are, we already need to deal with them (and do) when 
we add, remove, or replace bricks.  None of these operations fundamentally 
change; only the way they're expressed might.




___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Sick but still "alive" nodes

2013-01-25 Thread Jeff Darcy

On 01/25/2013 07:47 AM, jayunit...@gmail.com wrote:

Hi guys: I just saw an issue on the HDFS mailing list that might be a
potential problem in gluster clusters.  It kind of reminds me of
Jeff's idea of bricks as first class objects in the API.

What happens if a gluster brick is on a machine which, although still
alive, performs poorly?

would such scenarios be detected and if so, can the brick be
decommissioned/ignored/moved ? If not it would be a cool feature to
have because I'm sure it happens from time to time.


There's nothing currently in place to detect such a condition, and of 
course if we can't detect it we can't do anything about it.  There are 
also several cases where we might actually manage to make things worse 
if we try to do this ourselves.  For example, consider the case where 
the slowness is because of a short-duration contending activity.  We 
might well react just as that activity subsides, suspending that brick 
just as another brick is "going bad" due to similar transient activity 
there.  Similarly, if the system overall is truly overloaded, suspending 
bricks is a bit like squeezing a water balloon - the "bulge" just 
reappears elsewhere and all we've done is diminish total resources 
available.


I've seen problems like this with other parallel filesystems, and I'm 
pretty sure I've read papers about them too.  IMO the right place to 
deal with such issues is at the job-scheduler or similar level, where 
more of the total system state is known.  What we can do is provide more 
information about our part of the system state, plus levers that they 
can pull when they decide that preparation or correction for a 
higher-level event (that we probably don't even know about) is appropriate.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


[Gluster-devel] opendir/readdir helper

2013-02-01 Thread Jeff Darcy
As we all know, directory-listing performance (or lack thereof) is a bit 
of a sore spot for many GlusterFS users, because it's one of the few 
places where FUSE really does make a difference.  It will probably 
always be a sore spot even with the readdirp changes that are already 
under way.  The next step would be to add a FUSE enhancement to "inject" 
directory entries before they're requested, but that's a lot of work for 
an uncertain outcome.  The FUSE haters among the kernel leadership would 
probably reject such changes without serious consideration, and even if 
I'm wrong about that it's likely to be a long time before they make it 
into the various distributions (not to mention non-Linux platform issues).


So, let's think outside the box for a bit.  What about an LD_PRELOAD 
helper?  Believe me, I know all about the problems with LD_PRELOAD, but 
I still can't think of any reasonable use case that requires readdir to 
work across a fork (for example).  The basic idea is that the LD_PRELOAD 
would catch calls to opendir/readdir and match the requested paths against 
known GlusterFS volumes.  If a match is found, then it would use 
libgfapi to serve the results, without any FUSE involvement and with 
massive prefetching goodness etc.  Without a match, the helper would 
naturally fall back to the system functions.
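
To make that a bit more concrete, here's the shape of the shim I have in mind.
This is just a sketch: path_is_gluster() is a placeholder, and a real version
would also interpose readdir/closedir and keep a cached glfs_t per volume for
the libgfapi side.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <dirent.h>
#include <stddef.h>

static DIR *(*real_opendir) (const char *) = NULL;

static int
path_is_gluster (const char *path)
{
        /* placeholder: consult a table of known GlusterFS volumes/mounts */
        (void) path;
        return 0;
}

DIR *
opendir (const char *name)
{
        if (!real_opendir) {
                real_opendir = (DIR *(*)(const char *))
                               dlsym(RTLD_NEXT,"opendir");
        }

        if (path_is_gluster(name)) {
                /* here we'd open the directory via libgfapi instead and
                 * prefetch its entries aggressively, bypassing FUSE */
        }

        return real_opendir(name);
}

Link it with -ldl (and eventually -lgfapi), point LD_PRELOAD at it, and
nothing on the FUSE side has to change at all.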


I suspect that this approach would make listings on very large single 
directories many times faster than would ever be possible with FUSE. 
For deeply nested directories we'd need to add some more complexity so 
that we're not going through the whole connection-establishment path 
(including authentication etc.) for each directory separately, but 
that's all pretty well understood pain for pretty obvious gain.


Any other thoughts?

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Disperse xlator

2013-02-19 Thread Jeff Darcy
On 02/19/2013 11:47 AM, Xavier Hernandez wrote:
> We have published an initial version (proof of concept) of the
> disperse translator on github (https://github.com/datalab-bcn). It is a
> new GlusterFS translator with a level of fault tolerance configurable at
> creation time, but with a minimal waste of physical disk space (it's
> conceptually similar to RAID5 or RAID6). It mostly works but it's still
> in a development phase and many things will be fixed and improved,
> especially the read/write performance. There is a README where you can
> find how to compile and install it.
> 
> If you want to try it, we will ve very happy to have some feedback.

That's awesome, Xavier.  Congratulations to you and the team.  I look
forward to giving this a try.  :)



___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


[Gluster-devel] 3.4 Beta 1 Tracker

2013-03-14 Thread Jeff Darcy
It's here:

https://bugzilla.redhat.com/show_bug.cgi?id=918917

Mostly it includes the stuff that got bumped from Alpha 2, plus the
Fedora specfile resync, plus some other stuff (e.g. the "volume sync"
fix) that had come up in hallway conversations last week.  If you can
think of any other additions, please let me know.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] 3.4 Beta 1 Tracker

2013-03-18 Thread Jeff Darcy

On 03/18/2013 09:19 AM, krish wrote:

I have added the following bugs to 3.4 beta1 tracker bug.

BZ 922765 - build: package version string should show 3.4
   - Patch(release-3.4): http://review.gluster.org/4673

BZ 920916 - non-ssl sockets perform blocking connect()
  - Patch (upstream): http://review.gluster.com/4670
  - Patch (release-3.4): http://review.gluster.com/4685

I have taken the liberty to assume that the above bugs are 'important'
for the beta1 release. Let me know if I need to undo the BZ tracking of
the above 2 bugs for 3.4 beta1.


These look fine to me.  Thanks!




___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] 3.4 Beta 1 Tracker

2013-03-18 Thread Jeff Darcy

On 03/17/2013 11:18 PM, Vijay Bellur wrote:

I propose that we pull in the following two patches:

http://review.gluster.org/4495
http://review.gluster.org/4583

4495 is needed for quota to work with 3.4. 4583 clears out a logging annoyance.


Done.  Thanks!



___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


[Gluster-devel] Glusterd: A New Hope

2013-03-22 Thread Jeff Darcy
During the Bangalore "architects' summit" a couple of weeks ago, there
was a discussion about making most functions of glusterd into Somebody
Else's Problem.  Examples include cluster membership, storage of volume
configuration, and responding to changes in volume configuration.  For
those who haven't looked at it, glusterd is a bit of a maintenance and
scalability problem with three kinds of RPC (client to glusterd,
glusterd to glusterd, glusterd to glusterfsd) and its own ad-hoc
transaction engine etc.  The need for some change here is keenly felt
right now as we struggle to fix all of the race conditions that have
resulted from the hasty addition of synctasks to make up for poor
performance elsewhere in that 44K lines of C.  Delegating as much as
possible of this functionality to mature code that is mostly maintained
elsewhere would be very beneficial.  I've done some research since those
meetings, and here are some results.

The most basic idea here is to use an existing coordination service to
store cluster configuration and state.  That service would then take
responsibility for maintaining availability and consistency of the data
under its care.  The best known example of such a coordination service
is Apache's ZooKeeper[1], but there are others that don't have the
noxious Java dependency - e.g. doozer[2] written in Go, Arakoon[3]
written in OCaml, ConCoord[4] written in Python.  These all provide a
tightly consistent generally-hierarchical namespace for relatively small
amounts of data.  In addition, there are two other features that might
be useful.

* Watches: register for notification of changes to an object (or
directory/container), without having to poll.

* Ephemerals: certain objects go away when the client that created them
drops its connection to the server(s).

Here's a rough sketch of how we'd use such a service.

* Membership: a certain small set of servers (three or more) would be
manually set up as coordination-service masters, e.g. via "peer probe
xxx as master").  Other servers would connect to these masters, which
would use ephemerals to update a "cluster map" object.  Both clients and
servers could set up watches on the cluster map object to be notified of
servers joining and leaving.

* Configuration: the information we currently store in each volume's
"info" file as the basis for generating volfiles (and perhaps the
volfiles themselves) would be stored in the configuration service.
Again, servers and clients could set watches on these objects to be
notified of changes and do the appropriate graph switches, reconfigures,
quorum actions, etc.

* Maintenance operations: these would still run in glusterd (which isn't
going away).  They would use the coordination service for leader election to
make sure the same activity isn't started twice, and to keep status
updated in a way that allows other nodes to watch for changes.

* Status queries: these would be handled entirely by querying objects
within the coordination service.
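
To make "watches" and "ephemerals" a little more concrete, here's roughly what
the membership piece might look like against the ZooKeeper C API.  This is
purely illustrative - the znode layout under /gluster/peers is invented, and
(as discussed below) ZK itself is probably a non-starter anyway - but these
two primitives are what we'd want from whichever service we pick.

#include <zookeeper/zookeeper.h>
#include <stdio.h>

static void
cluster_watcher (zhandle_t *zh, int type, int state, const char *path, void *ctx)
{
        if (type == ZOO_CHILD_EVENT) {
                /* peer set changed: re-read it, regenerate volfiles, etc. */
                printf("cluster map changed under %s\n", path);
        }
}

int
register_peer (zhandle_t *zh, const char *uuid)
{
        char                 path[256];
        char                 created[256];
        struct String_vector peers;
        int                  ret;

        snprintf(path,sizeof(path),"/gluster/peers/%s",uuid);

        /* ephemeral: the node vanishes if this glusterd dies, so every
         * watcher sees the peer drop out of the cluster map */
        ret = zoo_create(zh,path,"up",2,&ZOO_OPEN_ACL_UNSAFE,
                         ZOO_EPHEMERAL,created,sizeof(created));
        if (ret != ZOK)
                return ret;

        /* watch for peers joining or leaving */
        return zoo_wget_children(zh,"/gluster/peers",cluster_watcher,NULL,&peers);
}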

Of the alternatives available to us, only ZooKeeper directly supports
all of the functionality we'd want.  However, the Java dependency is
decidedly unpleasant for us and would be totally unacceptable to some of
our users.  Doozer seems the closest of the remainder; it supports
watches but not ephemerals, so we'd either have to synthesize those on
top of doozer itself or find another way to handle membership (the only
place where we use that functionality) based on the features it does
have.  The project also seems reasonably mature and active, though we'd
probably still have to devote some time to developing our own local
doozer expertise.

In a similar vein, another possibility would be to use *ourselves* as
the coordination service, via a hand-configured AFR volume.  This is
actually an approach Kaleb and I were seriously considering for HekaFS
at the time of the acquisition, and it's not without its benefits.
Using libgfapi we can prevent this special volume from having to be
mounted, and we already know how to secure the communications paths for
it (something that would require additional work with the other
solutions).  On the other hand, it would probably require additional
translators to provide both ephemerals and watches, and might require
its own non-glusterd solution to issues like failure detection and
self-heal, so it doesn't exactly meet the "make it somebody else's
problem" criterion.

In conclusion, I think our best (long term) way forward would be to
prototype a doozer-based version of glusterd.  I could possibly be
persuaded to try a "gluster on gluster" approach instead, but at this
moment it wouldn't be my first choice.  Are there any other suggestions
or objections before I forge ahead?

[1] http://zookeeper.apache.org/
[2] https://github.com/ha/doozerd
[3] http://arakoon.org/
[4] http://openreplica.org/doc/

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Glusterd: A New Hope

2013-03-22 Thread Jeff Darcy
On 03/22/2013 02:20 PM, Anand Avati wrote:
> The point is that it was never a question of performance - it was to
> just get basic functionality "working".


I stand corrected.  Here's the amended statement.

"The need for some change here is keenly felt right now as we struggle
to fix all of the race conditions that have resulted from the hasty
addition of synctasks to make up for poor event handling elsewhere in
that 44K lines of C."




___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Glusterd: A New Hope

2013-03-23 Thread Jeff Darcy
On 03/22/2013 06:08 PM, Stephan von Krawczynski wrote:
> Why is it you cannot accept that it should be a _filesystem_, and nothing 
> else?
> It would have been a lot better to care about stability, keep it simple and
> feel fine. Concentrate on the strength (client based replication setups) and
> forget the rest.

"Just a filesystem" has historically been an obstacle to deployment of
distributed filesystems, and just doesn't cut it any more.  It's
important to have a coherent notion of which servers are up and which
protocol versions they can accept.  It's essential for configuration
changes to be coordinated and communicated across the cluster, if those
changes are to be non-disruptive, and that's part of glusterd's job.  It
also handles process management (both regular brick daemons and
maintenance-related tasks), quorum enforcement, and other functions.
The trend is for distributed systems to become more autonomous, not less so.

If you want to run things in a 2.x fashion, feel free.  Volfiles still
work, and will continue to do so, though you'll be giving up a lot of
functionality that way.  Nobody else is asking us to turn back the clock
and throw away functionality.  Whatever the problems might be with
glusterd's implementation, the solutions lie ahead of us.  What's behind
us should and will stay that way.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Glusterd: A New Hope

2013-03-24 Thread Jeff Darcy
On 03/24/2013 04:43 AM, Stephan von Krawczynski wrote:
> So the automated distribution of the vol-files does not help
> the majority at all.

It's not just about distribution of volfiles.  It's also about ensuring
their consistency across many servers and clients, and handling dynamic
updates.  For dynamic configuration changes without reboots or even
remounts, you either need glusterd or you need to reinvent a good
portion of it.  Even you should be able to appreciate the value of that,
since dynamic configuration speeds the tuning process.  Then there are a
whole bunch of other features that I mentioned in my last reply, which
you should read this time before rushing to give the broken record
another spin.

> you just
> said you cannot solve this by yourself and want to drop it on external
> know-how

Nobody said anything of the sort.  We *can* solve it for ourselves, but
there's no reason we should expend our own resources to solve the parts
of the problem that are already well solved elsewhere.  There's no
question that there will still be plenty of domain-specific
functionality left for us to deal with ourselves.  Leveraging other
projects is just basic software-engineering good sense, as it allows us
to devote more resources to enhancements in other areas - including
those you've said need improvement.  Reinventing the wheel would be
stupid, and forcing each user to reinvent their own would be worse.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Glusterd: A New Hope

2013-03-24 Thread Jeff Darcy
On 03/24/2013 08:17 PM, Stephan von Krawczynski wrote:
> I have the strong impression you are pressed to release something in a
> timeline and for reasons currently untold.
> What is the reason for this rush of features and increasing instability, nfs
> with server-based replication and so on. All things that the original project
> really never talked about.

We're adding features primarily because there are people asking for
them.  No conspiracy theory is necessary.  We're considering things that
we wouldn't have before because more resources and a lot more users than
we did before.

> You would only do this if you had a clear time limit to reach some goal. If
> your goal would be to make a well-defined, stable and long term  GPL project
> really nobody would ask questions like these. Not in the GPL part of the
> software-engineering world.

Really?  Is that because GPL projects are notable for their focus and
orderly progress toward a single goal at a time?  Or is it because no
GPL project has ever refactored, rewritten, or replaced a core
component?  Silly me, I thought open source was about, y'know, being
open - to participation, to experimentation, to people getting involved
and doing something instead of just ranting.  Apparently you have a
different perspective.

FYI, this particular project (refactoring glusterd) is quite long term.
 It's to ensure that at some point in the future we can scale to
clusters ten times larger than now, and support configurations ten times
as complex because of features that are barely on the road map today.
It would be a real shame if people were trying to deploy at that scale
or use those other features but the management layer didn't give them
the tools to make that work.  It's called thinking ahead.

If we were really as nefarious as you make us out to be, we wouldn't be
having this discussion on a public list.  We'd be having it behind
closed doors within Red Hat, but that's not the way we operate.
Instead, I posted this to a public list so people elsewhere in the
community can be involved at the earliest possible point.  It's sad that
some people try to discourage such openness and thereby weaken the
community.  Sometimes the price of leaving the door open is that not
everyone who walks through it is well intentioned, but we just have to
deal with that as best we can.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Glusterd: A New Hope

2013-03-25 Thread Jeff Darcy
On 03/25/2013 05:38 AM, Vidar Hokstad wrote:
> I see a number of complaints about this as some sort of admission of
> failure.

I wouldn't quite characterize it as failure.  It does work, after all.
However, glusterd has kind of reached its limits.  Moving it forward has
become increasingly difficult, and it must move forward to support
future scale and features.  There's nothing wrong with hand saws and
axes for small jobs, but at a certain point you're going to need a
chainsaw.  We're at that point for glusterd IMO.

> under its care.  The best known example of such a coordination service
> is Apache's ZooKeeper[1], but there are others that don't have the
> noxious Java dependency
> 
> I'm happy you recognise the issue of Java. I'd see having to drag that
> around as a major barrier. One of the major benefits of glusterfs is the
> simplicity of deployment compared to many alternatives, and that benefit
> would be massively diminished if I needed to deal with a Java dependency.

Yeah, I think it's a non-starter.  It's a shame, really, because the
functionality is good and the people working on ZK are doing a good job.
 Nonetheless, I think the Java dependency is a deal killer.  For what
it's worth (and this is more to AB's point) I wouldn't favor *any*
solution that requires users to maintain another component.  I think
anything we use has to be fully embedded, with low resource needs and
management completely "under the covers" as far as users are concerned.
 I don't think that's possible with a big ball of Java like ZK.

> I like the Gluster on Gluster idea you mention later on.

I'm a little surprised by the positive reactions to the "Gluster on
Gluster" approach.  Even though Kaleb and I considered it for HekaFS,
it's still a bit of a hack.  In particular, we'd still have to solve the
problems of keeping that private instance available, restarting daemons
and initiating repair etc. - exactly the problems it's supposed to be
solving for the rest of the system.

> Apart from
> that, have you considered pulling out the parts of Glusterd that you'd
> like to be able to ditch and try to generalize it and see if there'd be
> any interest in it as a standalone project? Or is too much of what
> you're looking for new functionality that is not already covered by part
> of your current codebase?

We don't have anything like ZK ephemerals, and we'd need to add inotify
support (or something equivalent) as well.  Then again, those features
would then be exposed to users as well, so it might be worth it.  Maybe
we should consider how this might be arranged so that parts would be
useful for things other than GlusterFS itself.  Thanks for the idea.

> * Membership: a certain small set of servers (three or more) would be
> manually set up as coordination-service masters, e.g. via "peer probe
> xxx as master").
> 
> Careful here. Again, a big advantage of Gluster to users is to not need
> any "special" servers that require other treatment. I recognise there's
> a  bootstrap problem, but to whatever extent possible, at the very least
> try to make this transparent to users (e.g. have the cluster
> automatically make more of the nodes take on coordination-service roles
> if any are lost etc.). 

I'm a little wary of trying to hide this from users.  The coordination
servers should be chosen to minimize the risk of correlated failure, and
we currently lack the topological awareness (e.g. which server is in
which rack or attached to which switch) to do that properly.  If we just
do something like "first three servers to be configured become
configuration servers" then we run a very high risk of choosing exactly
those servers that are most likely to fail together.  :(  As long as the
extra configuration is limited to one option on "peer probe" is it
really a problem?


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Glusterd: A New Hope

2013-03-25 Thread Jeff Darcy
On 03/25/2013 10:53 AM, Vidar Hokstad wrote:
> as long as the
> chosen option doesn't make day to day troubleshooting and maintenance
> much harder...

Completely agree.

> For a large deployment I agree you _need_ to
> know those kind of things, but for a small one I'd be inclined to just
> make every node hold the configuration data. 

That's what we do now.  It does basically work at small scale, and we'd
like to preserve its simplicity whenever we can.  At larger scale, as
you point out, administrators will have to be a little more aware of the
coordination role and manage coordinator locations a little more
carefully to reduce communication cost and the corresponding potential
for partial success of an update (which could leave configs out of sync).

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Glusterd: A New Hope

2013-03-25 Thread Jeff Darcy

On 03/25/2013 12:44 PM, Alex Attarian wrote:

Adding more complexity means only making it a nightmare for administrators.
I've said this times and times and I will say it again, your documentation has
always been bad, out of respect I'm not calling it shit. If you had taken your
time and grabbed a random admin and watched him set up a system, you would've
cried for him. Until this day I don't understand why you haven't taken the time
to sit down and write a good documentation so more people can use gluster.
Instead what happens is people come look at the site, look at the docs and
examples, and run away.


Look, I'm not here to solve a documentation problem.  I've done more than any 
other developer on that front already.  I'm also not here to explain the 
difference between the GlusterFS community project and the Red Hat Storage 
product, or enumerate the features that *people have demanded* which make the 
project more complex.  Wrong forum, wrong time, maybe wrong guy.  I'm trying to 
solve a specific set of technical problems in a component that most of our 
users appreciate.  Being able to form a cluster and export a volume with three 
commands from one CLI (probe, create, start) is not something we're going to 
throw away.  People who want to build its equivalent themselves are a tiny 
minority insufficient to sustain the project.


If you disagree with the very idea of having glusterd, then *we have nothing to 
talk about*.  If you appreciate the infrastructure it provides, if you want to 
make that infrastructure as robust and scalable and convenient to use as 
possible, then by all means share your ideas or opinions on ideas that have 
already been presented.  The other users who have participated constructively 
don't deserve to be crowded out of the conversation.





___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Glusterd: A New Hope

2013-03-25 Thread Jeff Darcy

On 03/25/2013 02:24 PM, John Mark Walker wrote:

I would also point out that, as the community lead, I would very much
welcome alternative solutions to be played out on gluster.org

I don't think they'll be successful, but what do I know. I just want to
make it clear that alternative solutions are welcome. If there's a subset of
you that want to maintain a 2.x release, you're welcome to do so - and I
will give you whatever tools you need to be successful.


I'm fine with that, too.  It's not the project that I think we need to pursue,
it's not the one I'm assigned to pursue, but if others want to go that route
then I'll be more than happy to give people the information they'll need.

___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Glusterd: A New Hope

2013-03-25 Thread Jeff Darcy

On 03/25/2013 02:44 PM, Alex Attarian wrote:

where do you get the idea that I'm against glusterd? I'm perfectly fine with
3.x versions, those are still maintainable. But if you want to add Zookeper
now, on top of Java requirement, where is it going to end?


I explicitly rejected ZK because of the Java requirement.  That kind of 
complexity and resource load just can't be hidden from the users.



Yes re-inventing is
not the best thing, but sometimes it can be much worse to add a 3rd party
component with some strenuous requirements than re-iventing. Right now things
are very easy to maintain in any of the 3.x versions, right inside glusterd.


I certainly don't think so.  A simple "volume set" command might generate 
dozens of RPC messages spread across several daemons using multiple RPC 
dispatch tables, with state machines and validation stages and all sorts of 
other complexity.  Finding out where in all that a command died can be *very* 
challenging.  Debugging problems from nodes having inconsistent volfiles 
because one died in the middle of that "volume set" command can be even worse. 
 I wouldn't call that maintainable.




Why not keep that? Even all these other functionalities that others want and
you really want to implement for scalability and flexibility, they could all be
built with your cluster on gluster solution.

I really don't want to worry about Zookeeper or Doozer when I run gluster.


You shouldn't have to.  Even if we were to use one of those - and the whole 
point of this discussion is to explore that along with other alternatives - it 
wouldn't be exposed as a separate service.  It would be embedded within 
glusterd, started and stopped when glusterd itself is, etc.  Yes, developers 
might need to learn to navigate some new code, but they would in any event and 
users/administrators shouldn't care at all.  To them it would be the same CLI 
as before, producing logs and other artifacts that are if anything more 
comprehensible than today.


I'm still not opposed to the "GlusterFS on GlusterFS" approach instead, but it 
has its own issues that need to be worked out.  Maybe someone who prefers that 
could sketch out a way to do that without having to retain all of glusterd as 
it is now to manage that special config volume (which obviously can't rely on 
the same services it provides).  There'd still be more layers, there'd still be 
more daemons to manage, and it seems like there'd be two sets of code doing 
essentially the same thing at different levels.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Glusterd: A New Hope

2013-03-26 Thread Jeff Darcy
On 03/25/2013 11:59 PM, Anand Babu Periasamy wrote:
> gluster meta-volume + zeromq for notification (pub/sub) will solve our
> problems largely and still be light weight.  In a large scale
> deployment, it is not a good idea to declare all the servers as
> coordination servers. Since meta-volume is a regular distributed
> replicated gluster volume, it can always be expanded later depending
> on the load and availability requirements.

How do you propose to solve the bootstrap problem?  How do we choose
which subset of all servers would have pieces of this meta-volume?  How
do we coordinate its expansion, or detect and respond to failures of
its components?  What issues does 0MQ introduce with respect to daemons
or network capabilities?  It would be nice to have two or more proposals
that have been thought through to approximately the same level of
detail (which isn't all that high really).  Otherwise we can't really
make rational comparisons and choices.  Every approach to any
non-trivial problem has its pitfalls.  If we don't look for them we risk
choosing an approach with problems which are merely less obvious but no
less severe.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Glusterd: A New Hope

2013-03-26 Thread Jeff Darcy
On 03/26/2013 05:36 PM, Ian Latter wrote:
> From a user perspective the "cluster" establishment is done via text file 
> configuration to direct nodes to network services;
>   http://www.xtreemfs.org/quickstart_distr.php

Looks like an alternative worth considering.  How does it handle online
reconfiguration (e.g. our "volume set")?  Also, how does a client find a
backup DIR if the primary is down?  DNS aliases?  I think we can do
better than that.



___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] glusterd vs glusterfs-server

2013-03-27 Thread Jeff Darcy
On 03/27/2013 11:16 AM, Jay Vyas wrote:
> 1) What are the "rm -rf" incantations I can do to completely purge
> any trace of gluster from my system so that i can start over?

When I really want to remove all traces of GlusterFS from a system,
short of reinstalling, I do this:

killall -9 -r gluster
yum remove $(rpm -qa | grep gluster)
rm -rf /var/lib/glusterd
rm -rf /etc/glusterfs

> 2) What is the difference bettwen glusterd and glusterfs-server?

There are several kinds of GlusterFS server daemons, all part of the
glusterfs-server package:

glusterd = management daemon
glusterfsd = per-brick daemon
glustershd = self-heal daemon
glusterfs = usually client-side, but also NFS on servers

The others are all started from glusterd, in response to volume start
and stop commands.  They're actually all the same executable with
different translators, but there's generally no reason to care about that.

> 3) Do I have to separately install glusterfs-server ?  I don't see
> it any where grepping through the source code.

If you want a system to be a server (even if it's a client as well) you
need to install glusterfs-server.

> 4) Does starting glusterd lead to the startup of glusterfs-server?

When glusterd starts up, it spawns any daemons that "should" be running
(according to which volumes are started, which have NFS or replication
enabled, etc.) and seem to be missing.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

2013-03-28 Thread Jeff Darcy
On 03/28/2013 10:07 AM, Theodore Ts'o wrote:
> Any update about whether Gluster can address this without needing the
> ioctl patch?  Or should we push the ioctl patch into ext4 for the next
> merge window?

We have two approaches that don't require the ioctl patch:

* http://review.gluster.org/#change,4675
This takes the approach of mapping between the underlying filesystems'
d_off values and our own, using a cache.  It works for obvious cases,
but it's a really horrible kludge.

* http://review.gluster.org/#change,4711
This is Avati's and Zach's approach, which "rounds off" the ext4 d_off
values to free up some bits that we can use.  There seems to be a
general consensus (among the people who've discussed it on this list)
that the approach is preferable, but it doesn't quite work yet.
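
For anyone who hasn't waded through 4711, the basic trick looks something like
this.  It's only an illustration of the idea - the real patch's names, bit
budget, and handling of the topmost bit are all different - but it shows what
"rounding off" means: discard a few low-order bits of the backend's hash-based
d_off and reuse them to remember which subvolume the entry came from, on the
theory that losing that little bit of hash precision is harmless in practice.

#include <stdint.h>

static inline uint64_t
encode_doff (uint64_t backend_off, int subvol_idx, int subvol_bits)
{
        uint64_t mask = (1ULL << subvol_bits) - 1;

        /* round off the low bits of the backend offset and reuse them
         * to remember which subvolume this entry came from */
        return (backend_off & ~mask) | (uint64_t)subvol_idx;
}

static inline void
decode_doff (uint64_t wire_off, int subvol_bits,
             uint64_t *backend_off, int *subvol_idx)
{
        uint64_t mask = (1ULL << subvol_bits) - 1;

        *subvol_idx  = (int)(wire_off & mask);
        *backend_off = wire_off & ~mask;  /* rounded, but close enough to seek on */
}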

Between those two and the possibility of "tune2fs -O ^dir_index" I think
we can keep this from affecting our users, but since they're both a bit
unclean in different ways the ioctl might still be desirable.  I'll let
others who've been more involved with that (e.g. Avati/Zach/Eric) give a
more authoritative answer.


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

2013-03-28 Thread Jeff Darcy
On 03/28/2013 02:49 PM, Anand Avati wrote:
> Yes, it should, based on the theory of how ext4 was generating the
> 63bits. But Jeff's test finds that the experiment is not matching the
> theory.

FWIW, I was able to re-run my test in between stuff related to That
Other Problem.  What seems to be happening is that we read correctly
until just after d_off 0x4000, then we suddenly wrap around
- not to the very first d_off we saw, but to a pretty early one (e.g.
0x0041b6340689a32e).  This is all on a single brick, BTW, so it's pretty
easy to line up the back-end and front-end d_off values which match
perfectly up to this point.

I haven't had a chance to ponder what this all means and debug it
further.  Hopefully I'll be able to do so soon, but I figured I'd
mention it in case something about those numbers rang a bell.



___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

2013-03-28 Thread Jeff Darcy
(removed lists other than gluster-devel to avoid spamming too widely)

On 03/28/2013 03:43 PM, Jeff Darcy wrote:
> FWIW, I was able to re-run my test in between stuff related to That 
> Other Problem.  What seems to be happening is that we read correctly 
> until just after d_off 0x4000, then we suddenly wrap
> around - not to the very first d_off we saw, but to a pretty early
> one (e.g. 0x0041b6340689a32e).  This is all on a single brick, BTW,
> so it's pretty easy to line up the back-end and front-end d_off
> values which match perfectly up to this point.

I've just submitted a version of the patch that passes basic tests on
both one- and two-brick configurations, under both native protocol and
NFS.  The core problem seemed to be the calculation of max_bits, but
there were a few other issues with bits at the top or bottom being
preserved or masked off when they shouldn't have been (about what always
happens with this kind of code).  If anyone thinks I've made things less
readable, please let me know on Gerrit; I find this version easier to
follow, but such things are very much a matter of personal taste and IMO
the majority should rule.

http://review.gluster.org/#change,4711


___
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel

