Re: [Gluster-devel] RDMA: Patch to make use of pre registered memory

2015-01-23 Thread Anand Avati
Couple of comments -

1. rdma can register init/fini functions (via pointers) into iobuf_pool;
absolutely no need to introduce an rdma dependency into libglusterfs (see the
sketch at the end of this note).

2. It might be a good idea to take a holistic approach towards zero-copy
with libgfapi + RDMA, rather than the narrow goal of "use pre-registered
memory with RDMA". Do keep the option open of RDMA'ing the user's memory
pointer (passed to glfs_write()) as well.

3. It is better to make io-cache and write-behind use a new iobuf_pool for
caching purposes. There could be an optimization where they just do
iobuf/iobref_ref() when safe - e.g. io-cache can cache with iobuf_ref when
the transport is socket, or write-behind can unwind by holding onto data with
iobuf_ref() when the topmost layer is FUSE or server (i.e. no gfapi).

4. The next step for zero-copy would be the introduction of a new fop
readto(), where the destination pointer is passed in by the caller (gfapi
being the primary use case), as sketched just below. In this situation RDMA
ought to register that memory if necessary and request the server to
RDMA_WRITE into the pointer provided by the gfapi caller.
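
A hypothetical illustration of such a fop's shape (no call like this exists
in gfapi today; the name and parameters below are made up):

#include <sys/types.h>

struct glfs_fd;   /* opaque gfapi file handle */

/* readto-style read: the caller owns the destination buffer, so the
 * transport can land the data there directly. */
ssize_t glfs_readto (struct glfs_fd *fd, void *caller_buf, size_t size,
                     off_t offset, int flags);

/* With RDMA, the transport would register caller_buf (if it is not already
 * registered) and ask the server to RDMA_WRITE straight into it, so the
 * data never passes through an intermediate iobuf. */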

2. and 4. require changes in the code you would be modifying if you were to
just do "pre-registered memory", so it is better we plan for the bigger
picture upfront. Zero-copy can improve performance (especially read) in the
qemu use case.
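
For point 1, a minimal sketch of the kind of hook registration meant there
(all names below are hypothetical, not the actual iobuf API):

#include <stddef.h>

/* iobuf_pool carries optional callbacks that a transport can install at
 * init time, so libglusterfs itself never links against rdma libraries. */
typedef int (*iobuf_arena_hook_t) (void *arena_mem, size_t arena_size,
                                   void *hook_data);

struct iobuf_pool_hooks {
        iobuf_arena_hook_t arena_init;  /* rdma points this at a routine
                                         * that calls ibv_reg_mr()        */
        iobuf_arena_hook_t arena_fini;  /* ...and this at ibv_dereg_mr()  */
        void              *hook_data;   /* e.g. the rdma device context   */
};

/* The rdma transport installs its hooks when it initializes; the iobuf
 * code only ever sees opaque function pointers. */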

Thanks
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] managing of THIS

2015-01-23 Thread Anand Avati
The problem you describe is very specific to glfs_new(), and not to gfapi
in general. I guess we can handle this in glfs_new by initializing an
appropriate value into THIS (save old_THIS and restore it before returning
from glfs_new). That should avoid the need for all those new macros?
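
A minimal sketch of that approach, assuming glusterfs internals (THIS,
xlator_t, struct glfs); gfapi_xl and glfs_new_body are placeholders, not
existing symbols:

extern xlator_t *gfapi_xl;                 /* whatever xlator gfapi wants active */
struct glfs *glfs_new_body (const char *volname);

struct glfs *
glfs_new (const char *volname)
{
        xlator_t    *old_THIS = THIS;      /* e.g. snapview-server when called
                                            * from the snapshot daemon        */
        struct glfs *fs       = NULL;

        THIS = gfapi_xl;                   /* allocations now accounted against
                                            * gfapi's own mem-types           */
        fs = glfs_new_body (volname);      /* the existing body of glfs_new   */
        THIS = old_THIS;                   /* restore before returning        */

        return fs;
}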

Thanks

On Wed, Jan 21, 2015, 23:37 Raghavendra Bhat  wrote:

>
> Hi,
>
> When a glusterfs process comes up, it creates 5 pthread keys (for saving
> THIS, syncop, uuid buf, lkowner buf and syncop ctx). gfapi does the same
> thing in its glfs_new function. But with User Serviceable Snapshots (where
> a glusterfs process spawns multiple gfapi instances, one per snapshot),
> this leads to more and more consumption of pthread keys. In fact the old
> keys are lost (as the same variables are reused for creating the keys) and
> eventually the process runs out of pthread keys after 203 snapshots (the
> maximum allowed is 1024 per process). So to avoid this, pthread key
> creation can be done only once (using pthread_once, which calls the
> globals_init function).
>
> But now a new problem arises. Say glfs_new (or glfs_init etc.) was called
> from the snapview-server xlator. gfapi uses THIS for some of its
> operations, such as properly accounting the memory within the xlator
> while allocating a new structure. But when gfapi reads THIS it gets
> snapview-server's pointer, and since snapview-server does not know about
> gfapi's internal structures, it asserts at allocation time.
>
> For now, a patch has been sent to handle the issue by turning off
> memory-accounting for snapshot daemon.
> (http://review.gluster.org/#/c/9430).
>
> But if memory-accounting has to be turned on for snapshot daemon, then
> the above problem has to be fixed.
> 2 ways that can be used for fixing the issue are:
>
> 1) Add the datastructures that are used by gfapi to libglusterfs (and
> hence their mem-types as well), so that any xlator that is calling gfapi
> functions (such as snapview-server as of now) will be aware of the
> memory types used by gfapi and hence will not cause problems, when
> memory accounting has to be done as part of allocations and frees.
>
> OR
>
> 2) Properly manage THIS by introducing a new macro similar to STACK_WIND
> (for now it can be called STACK_API_WIND). The macro would be much
> simpler than STACK_WIND, as it need not create a new frame before handing
> the call over to the next layer. Before handing the call over to gfapi
> (any call, such as glfs_new, or fops such as glfs_h_open), it saves THIS
> in a variable and calls the gfapi function given as an argument. After
> the function returns, it sets THIS back to the value it had before the
> gfapi function was called.
>
> Ex:
>
> #define STACK_API_WIND(this, fn, ret, params ...)               \
>         do {                                                    \
>                 xlator_t *old_THIS = (this);                    \
>                                                                 \
>                 ret = fn (params);                              \
>                 THIS = old_THIS;                                \
>         } while (0)
>
> A caller (as of now, the snapview-server xlator) would invoke the macro like this:
> STACK_API_WIND (this, glfs_h_open, glfd, fs, object, flags);
>
>
> Please provide feedback; any suggestions or solutions for handling the
> mentioned issue are welcome.
>
> Regards,
> Raghavendra Bhat
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Reg. multi thread epoll NetBSD failures

2015-01-23 Thread Anand Avati
Since all of the epoll code and its multithreading is under ifdefs, NetBSD
should just continue working with single-threaded poll, unaffected by the
patch. If NetBSD's kqueue supports single-shot event delivery and
edge-triggered notification, we could have an equivalent implementation on
NetBSD too. Even if kqueue does not support these features, it might well be
worth implementing a single-threaded, level-triggered kqueue-based event
handler and promoting NetBSD off plain vanilla poll.

Thanks

On Fri, Jan 23, 2015, 19:29 Emmanuel Dreyfus  wrote:

> Ben England  wrote:
>
> > NetBSD may be useful for exposing race conditions, but it's not clear to
> > me that all of these race conditions would happen in a non-NetBSD
> > environment,
>
> Many times, NetBSD has exhibited cases where unspecified, Linux-specific
> behaviors were assumed. I recall my very first finding in Glusterfs:
> Linux lets you use a mutex without calling pthread_mutex_init() first.
> That broke on NetBSD, as expected.
>
> Fixing this kind of issue is worthwhile beyond NetBSD support, since you
> cannot take for granted that an unspecified behavior will not be altered
> in the future.
>
> That said, I am fine if you let NetBSD run without fixing the underlying
> issue, but you have been warned :-)
>
> --
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> m...@netbsd.org
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Reg. multi thread epoll NetBSD failures

2015-01-23 Thread Emmanuel Dreyfus
Ben England  wrote:

> NetBSD may be useful for exposing race conditions, but it's not clear to
> me that all of these race conditions would happen in a non-NetBSD
> environment,

Many times, NetBSD has exhibited cases where unspecified, Linux-specific
behaviors were assumed. I recall my very first finding in Glusterfs: Linux
lets you use a mutex without calling pthread_mutex_init() first. That broke
on NetBSD, as expected.
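
A small illustration of the point in portable C (not from the glusterfs tree):

#include <pthread.h>

/* POSIX requires a mutex to be initialized (statically with
 * PTHREAD_MUTEX_INITIALIZER or at runtime with pthread_mutex_init())
 * before it is locked.  Linux may happen to tolerate an uninitialized
 * mutex; NetBSD does not. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int             counter;

int
locked_increment (void)
{
        int value;

        pthread_mutex_lock (&counter_lock);   /* valid only because the mutex
                                               * above was initialized       */
        value = ++counter;
        pthread_mutex_unlock (&counter_lock);

        return value;
}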

Fixing this kind of issue is worthwhile beyond NetBSD support, since you
cannot take for granted that an unspecified behavior will not be altered in
the future.

That said, I am fine if you let NetBSD run without fixing the underlying
issue, but you have been warned :-)

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] [Gluster-users] lockd: server not responding, timed out

2015-01-23 Thread Peter Auyeung
We have a 6-node gluster cluster running Ubuntu on XFS, sharing gluster
volumes over NFS, and it has been running fine for 3 months.
We restarted glusterfs-server on one of the nodes, and all NFS clients
started getting "lockd: server  not responding, timed out" in
/var/log/messages.

We are still able to read and write, but processes that require a persistent
file lock, such as database exports, fail.

We have an interim fix of remounting the NFS share with the nolock option,
but we need to know why that is suddenly necessary after a glusterfs-server
restart on one of the gluster nodes.

Thanks
Peter
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Reg. multi thread epoll NetBSD failures

2015-01-23 Thread Ben England
Gluster-ians,

Would it be OK to temporarily disable multi-thread-epoll on NetBSD, unless
there is some huge demand for it?  NetBSD may be useful for exposing race
conditions, but it's not clear to me that all of these race conditions would
happen in a non-NetBSD environment, so are we chasing problems that
non-NetBSD users can never see?  What do people think?  If so, why bust our
heads figuring them out for NetBSD right now?

Attached is a tiny, crude and possibly out-of-date patch for making
multi-thread-epoll tunable.  If we make the number of epoll threads settable,
we could add conditional compilation to make GLUSTERFS_EPOLL_MAXTHREADS 1 for
NetBSD without much trouble, while still allowing people to experiment with
it there.
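
Something along these lines would do it (illustrative only, not part of the
attached patch):

/* Pin NetBSD to a single event thread at compile time while leaving other
 * platforms on the multi-threaded default. */
#if defined(__NetBSD__)
#define GLUSTERFS_EPOLL_MAXTHREADS 1
#else
#define GLUSTERFS_EPOLL_MAXTHREADS 2
#endif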

From a performance perspective, let's review why we should go to the trouble
of using the multi-thread-epoll patch.  The original goal was to allow far
greater CPU utilization by Gluster than we typically were seeing.  To do
this, we want multiple Gluster RPC sockets to be read and processed in
parallel by a single process.  This is important to clients (glusterfs,
libgfapi) that have to talk to many bricks (example: JBOD, erasure coding),
and to brick processes (glusterfsd) that have to talk to many clients.  It is
also important for SSD support (cache tiering) because we need the glusterfsd
process to keep up with SSD hardware and caches, which can have orders of
magnitude more IOPS available than a single disk drive or even a RAID LUN,
and the glusterfsd epoll thread is currently the bottleneck in such
configurations.  This multi-thread-epoll enhancement seems similar to a
multi-queue ethernet driver, etc., that spreads load across CPU cores.  RDMA
40-Gbps networking may also encounter this bottleneck.  We don't want a small
fraction of CPU cores (often just 1) to be a bottleneck - we want either
network or storage hardware to be the bottleneck instead.

Finally, is it possible with multi-thread-epoll that we do not need to use the 
io-threads translator (Anand Avati's suggestion) that offloads incoming 
requests to worker threads?  In this case, the epoll threads ARE the 
server-side thread pool.  If so, this could reduce context switching and 
latency further.  I for one look forward to finding out, but I do not want to
invest in more performance testing than we have already done unless it is
going to be upstreamed for use.

thanks for your help,

-Ben England, Red Hat Perf. Engr.


- Original Message -
> From: "Shyam" 
> To: "Emmanuel Dreyfus" 
> Cc: "Gluster Devel" 
> Sent: Friday, January 23, 2015 2:48:14 PM
> Subject: [Gluster-devel] Reg. multi thread epoll NetBSD failures
> 
> Patch: http://review.gluster.org/#/c/3842/
> 
> Manu,
> 
> I was not able to find the NetBSD job mentioned in the last review
> comment provided by you, pointers to that would help.
> 
> Additionally,
> 
> What is the support status of epoll on NetBSD? I thought NetBSD favored
> the kqueue means of event processing over epoll and that epoll was not
> supported on NetBSD (or *BSD).
> 
> I ask this because the patch specifically changes the number of epoll
> threads; as a result, it is possibly having a different effect on NetBSD,
> which should be on either poll or kqueue (to my understanding).
> 
> Could you shed some light on this and on the current status of epoll on
> NetBSD.
> 
> Thanks,
> Shyam
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
--- event-epoll.c.nontunable	2014-09-05 16:27:10.261223176 -0400
+++ event-epoll.c	2014-09-05 16:33:19.818407183 -0400
@@ -612,23 +612,35 @@
 }
 
 
-#define GLUSTERFS_EPOLL_MAXTHREADS 2
+#define GLUSTERFS_EPOLL_MAX_THREADS 8
+#define GLUSTERFS_EPOLL_DEFAULT_THREADS 4
 
+int glusterfs_epoll_threads = -1;
 
 static int
 event_dispatch_epoll (struct event_pool *event_pool)
 {
         int       i = 0;
-        pthread_t pollers[GLUSTERFS_EPOLL_MAXTHREADS];
+        pthread_t pollers[GLUSTERFS_EPOLL_MAX_THREADS];
         int       ret = -1;
+        char     *epoll_thrd_str = getenv ("GLUSTERFS_EPOLL_THREADS");
 
-        for (i = 0; i < GLUSTERFS_EPOLL_MAXTHREADS; i++) {
+        glusterfs_epoll_threads = epoll_thrd_str ?
+                atoi (epoll_thrd_str) : GLUSTERFS_EPOLL_DEFAULT_THREADS;
+
+        if (glusterfs_epoll_threads > GLUSTERFS_EPOLL_MAX_THREADS) {
+                gf_log ("epoll", GF_LOG_ERROR,
+                        "user requested %d threads but limit is %d",
+                        glusterfs_epoll_threads, GLUSTERFS_EPOLL_MAX_THREADS);
+                return EINVAL;
+        }
+        for (i = 0; i < glusterfs_epoll_threads; i++) {
                 ret = pthread_create (&pollers[i], NULL,
                                       event_dispatch_epoll_worker,
                                       event_pool);
         }
 
-        for (i = 0; i < GLUSTERFS_EPOLL_MAXTHREADS; i++)
+        for (i = 0; i < glusterfs_epoll_threads; i++)
                 pthread_join (pollers[i], NULL);
 
         return ret;

Re: [Gluster-devel] Reg. multi thread epoll NetBSD failures

2015-01-23 Thread Emmanuel Dreyfus
Shyam  wrote:

> Patch: http://review.gluster.org/#/c/3842/
> 
> Manu,
> 
> I was not able to find the NetBSD job mentioned in the last review 
> comment provided by you, pointers to that would help.

Yes, sorry, both regression tests hung and I had to reboot the machine,
hence you do not have the reports in gerrit:

http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/731
http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/741

Since I got the same horrible result both times, I did not try retriggering
again; feel free to do so if you need to.

> What is the support status of epoll on NetBSD? I thought NetBSD favored
> the kqueue means of event processing over epoll and that epoll was not
> supported on NetBSD (or *BSD).

No support for epoll on NetBSD. The alternative would indeed be kqueue, but
no code has been written to support it. glusterfs on NetBSD uses the plain
poll code right now.
 
> I ask this because the patch specifically changes the number of epoll
> threads; as a result, it is possibly having a different effect on NetBSD,
> which should be on either poll or kqueue (to my understanding).

I have not looked into the reasons yet, but it must be something outside of
the epoll code, since NetBSD does not use it.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Reg. multi thread epoll NetBSD failures

2015-01-23 Thread Shyam

Patch: http://review.gluster.org/#/c/3842/

Manu,

I was not able to find the NetBSD job mentioned in the last review 
comment provided by you, pointers to that would help.


Additionally,

What is the support status of epoll on NetBSD? I thought NetBSD favored
the kqueue means of event processing over epoll and that epoll was not
supported on NetBSD (or *BSD).


I ask this because the patch specifically changes the number of epoll
threads; as a result, it is possibly having a different effect on NetBSD,
which should be on either poll or kqueue (to my understanding).


Could you shed some light on this and on the current status of epoll on 
NetBSD.


Thanks,
Shyam
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] [Gluster-user] Sybase backup server failed to write to Gluster NFS

2015-01-23 Thread Peter Auyeung
Soumya,

We just added the nolock NFS option on the client and it's now working with
the sybase backup process.

I wonder what could have affected the client's locking against gluster NFS,
and whether the nolock option does any harm for gluster.

Thanks
Peter

From: gluster-users-boun...@gluster.org [gluster-users-boun...@gluster.org] on 
behalf of Peter Auyeung [pauye...@connexity.com]
Sent: Friday, January 23, 2015 10:05 AM
To: Soumya Koduri; gluster-devel@gluster.org; gluster-us...@gluster.org
Subject: Re: [Gluster-users] [Gluster-devel] [Gluster-user] Sybase backup 
server failed to write to Gluster NFS

Hi Soumya,

Yes, this is strange, as the same sybase servers have been able to write
their backups to gluster for the last 3 months.

They can still write as the sybase OS user, but not via the sybase backup
server process.

They were able to mount another NFS server and perform both sybase OS user
read/write and sybase backup server process writes.

Here is the NFS debug log from while we tried to perform a backup via the
sybase process to gluster:

Jan 21 14:08:46 repdb006 kernel: NFS reply getattr
Jan 21 14:08:46 repdb006 kernel: NFS call setacl
Jan 21 14:08:46 repdb006 kernel: NFS reply setacl: 0
Jan 21 14:08:46 repdb006 kernel: NFS call  lookup test~
Jan 21 14:08:46 repdb006 kernel: NFS reply lookup: 0
Jan 21 14:08:46 repdb006 kernel: NFS call  remove test~
Jan 21 14:08:46 repdb006 kernel: NFS reply remove: 0
Jan 21 14:08:46 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-1909463086)
Jan 21 14:08:46 repdb006 kernel: NFS call  getattr
Jan 21 14:08:46 repdb006 kernel: NFS reply getattr
Jan 21 14:08:46 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/1099635544)
Jan 21 14:08:47 repdb006 kernel: NFS call  getattr
Jan 21 14:08:47 repdb006 kernel: NFS reply getattr
Jan 21 14:08:47 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-1009531860)
Jan 21 14:08:47 repdb006 kernel: NFS call  access
Jan 21 14:08:47 repdb006 kernel: NFS reply access, status = 0
Jan 21 14:08:47 repdb006 kernel: NFS call  lookup .test.swp
Jan 21 14:08:47 repdb006 kernel: NFS reply lookup: 0
Jan 21 14:08:47 repdb006 kernel: NFS call  remove .test.swp
Jan 21 14:08:47 repdb006 kernel: NFS reply remove: 0
Jan 21 14:08:47 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/1099635544)
Jan 21 14:08:48 repdb006 kernel: NFS call  access
Jan 21 14:08:48 repdb006 kernel: NFS reply access, status = 0
Jan 21 14:08:48 repdb006 kernel: NFS call  getattr
Jan 21 14:08:48 repdb006 kernel: NFS reply getattr
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-1009531860)
Jan 21 14:08:48 repdb006 kernel: NFS call  access
Jan 21 14:08:48 repdb006 kernel: NFS reply access, status = 0
Jan 21 14:08:48 repdb006 kernel: NFS call  readdirplus 0
Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 2
Jan 21 14:08:48 repdb006 kernel: NFS call  readdirplus 37
Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 1
Jan 21 14:08:48 repdb006 kernel: NFS call  readdirplus 3073
Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 1
Jan 21 14:08:48 repdb006 kernel: NFS call  readdirplus 3074
Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 1
Jan 21 14:08:48 repdb006 kernel: NFS call  readdirplus 3075
Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 1
Jan 21 14:08:48 repdb006 kernel: NFS call  readdirplus 3076
Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 0
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-1009531860, 
32768) = fff5
Jan 21 14:08:48 repdb006 kernel: NFS call getacl
Jan 21 14:08:48 repdb006 kernel: NFS reply getacl: 0
Jan 21 14:08:48 repdb006 kernel: NFS call  getattr
Jan 21 14:08:48 repdb006 kernel: NFS reply getattr
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-1009531860, 
16384) = 
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/487151148, 
32768) = 
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/487151148, 
16384) = 
Jan 21 14:08:48 repdb006 kernel: NFS call  getattr
Jan 21 14:08:48 repdb006 kernel: NFS reply getattr
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-1991882046, 
32768) = 
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-1991882046, 
16384) = 
Jan 21 14:08:48 repdb006 kernel: NFS call  getattr
Jan 21 14:08:48 repdb006 kernel: NFS reply getattr
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/1247752239, 
32768) = 
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/1247752239, 
16384) = 
Jan 21 14:08:48 repdb006 kernel: NFS call  getattr
Jan 21 14:08:48 repdb006 kernel: NFS reply getattr
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/2137268371, 
32768) = 
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/2137268371, 
16384) = 
Jan 21 14:08:48 repdb006 kernel: NFS call  getattr
Jan 21 14:08:48 repdb006 kernel: NFS reply getattr
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_forget_cached_acl

Re: [Gluster-devel] [Gluster-user] Sybase backup server failed to write to Gluster NFS

2015-01-23 Thread Peter Auyeung
Hi Soumya,

Yes, this is strange, as the same sybase servers have been able to write
their backups to gluster for the last 3 months.

They can still write as the sybase OS user, but not via the sybase backup
server process.

They were able to mount another NFS server and perform both sybase OS user
read/write and sybase backup server process writes.

Here is the NFS debug log from while we tried to perform a backup via the
sybase process to gluster:

Jan 21 14:08:46 repdb006 kernel: NFS reply getattr
Jan 21 14:08:46 repdb006 kernel: NFS call setacl
Jan 21 14:08:46 repdb006 kernel: NFS reply setacl: 0
Jan 21 14:08:46 repdb006 kernel: NFS call  lookup test~
Jan 21 14:08:46 repdb006 kernel: NFS reply lookup: 0
Jan 21 14:08:46 repdb006 kernel: NFS call  remove test~
Jan 21 14:08:46 repdb006 kernel: NFS reply remove: 0
Jan 21 14:08:46 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-1909463086)
Jan 21 14:08:46 repdb006 kernel: NFS call  getattr
Jan 21 14:08:46 repdb006 kernel: NFS reply getattr
Jan 21 14:08:46 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/1099635544)
Jan 21 14:08:47 repdb006 kernel: NFS call  getattr
Jan 21 14:08:47 repdb006 kernel: NFS reply getattr
Jan 21 14:08:47 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-1009531860)
Jan 21 14:08:47 repdb006 kernel: NFS call  access
Jan 21 14:08:47 repdb006 kernel: NFS reply access, status = 0
Jan 21 14:08:47 repdb006 kernel: NFS call  lookup .test.swp
Jan 21 14:08:47 repdb006 kernel: NFS reply lookup: 0
Jan 21 14:08:47 repdb006 kernel: NFS call  remove .test.swp
Jan 21 14:08:47 repdb006 kernel: NFS reply remove: 0
Jan 21 14:08:47 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/1099635544)
Jan 21 14:08:48 repdb006 kernel: NFS call  access
Jan 21 14:08:48 repdb006 kernel: NFS reply access, status = 0
Jan 21 14:08:48 repdb006 kernel: NFS call  getattr
Jan 21 14:08:48 repdb006 kernel: NFS reply getattr
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-1009531860)
Jan 21 14:08:48 repdb006 kernel: NFS call  access
Jan 21 14:08:48 repdb006 kernel: NFS reply access, status = 0
Jan 21 14:08:48 repdb006 kernel: NFS call  readdirplus 0
Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 2
Jan 21 14:08:48 repdb006 kernel: NFS call  readdirplus 37
Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 1
Jan 21 14:08:48 repdb006 kernel: NFS call  readdirplus 3073
Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 1
Jan 21 14:08:48 repdb006 kernel: NFS call  readdirplus 3074
Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 1
Jan 21 14:08:48 repdb006 kernel: NFS call  readdirplus 3075
Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 1
Jan 21 14:08:48 repdb006 kernel: NFS call  readdirplus 3076
Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 0
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-1009531860, 
32768) = fff5
Jan 21 14:08:48 repdb006 kernel: NFS call getacl
Jan 21 14:08:48 repdb006 kernel: NFS reply getacl: 0
Jan 21 14:08:48 repdb006 kernel: NFS call  getattr
Jan 21 14:08:48 repdb006 kernel: NFS reply getattr
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-1009531860, 
16384) = 
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/487151148, 
32768) = 
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/487151148, 
16384) = 
Jan 21 14:08:48 repdb006 kernel: NFS call  getattr
Jan 21 14:08:48 repdb006 kernel: NFS reply getattr
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-1991882046, 
32768) = 
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-1991882046, 
16384) = 
Jan 21 14:08:48 repdb006 kernel: NFS call  getattr
Jan 21 14:08:48 repdb006 kernel: NFS reply getattr
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/1247752239, 
32768) = 
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/1247752239, 
16384) = 
Jan 21 14:08:48 repdb006 kernel: NFS call  getattr
Jan 21 14:08:48 repdb006 kernel: NFS reply getattr
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/2137268371, 
32768) = 
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/2137268371, 
16384) = 
Jan 21 14:08:48 repdb006 kernel: NFS call  getattr
Jan 21 14:08:48 repdb006 kernel: NFS reply getattr
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-702684555)
Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-702684555, 
32768) = fff5
Jan 21 14:08:48 repdb006 kernel: NFS call getacl
Jan 21 14:08:48 repdb006 kernel: NFS reply getacl: 0
Jan 21 14:08:49 repdb006 kernel: NFS call  getattr
Jan 21 14:08:49 repdb006 kernel: NFS reply getattr
Jan 21 14:08:49 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-702684555, 
16384) = 
Jan 21 14:09:31 repdb006 kernel: NFS call  getattr
Jan 21 14:09:31 repdb006 kernel: NFS reply getattr
Jan 21 14:09:31 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-1964955375)
Jan 21 14:09:31 repdb006 kernel: NFS cal

Re: [Gluster-devel] RDMA: Patch to make use of pre registered memory

2015-01-23 Thread Ben England
Rafi, great results, thanks.  Your "io-cache off" columns are read tests with
the io-cache translator disabled, correct?  Two things jump out at me from
your numbers:

- io-cache translator destroys RDMA read performance. 
- approach 2i) "register iobuf pool" is the best approach.
-- on reads with io-cache off, 32% better than baseline and 21% better than 1) 
"separate buffer" 
-- on writes, 22% better than baseline and 14% better than 1)

Can someone explain to me why the typical Gluster site wants to use the
io-cache translator, given that FUSE now caches file data?  Should we just
have it turned off by default at this point?  This would buy us time to
change the io-cache implementation to be compatible with RDMA (see option
"2ii" below).

remaining comments inline

-ben

- Original Message -
> From: "Mohammed Rafi K C" 
> To: gluster-devel@gluster.org
> Cc: "Raghavendra Gowdappa" , "Anand Avati" 
> , "Ben Turner"
> , "Ben England" , "Suman Debnath" 
> 
> Sent: Friday, January 23, 2015 7:43:45 AM
> Subject: RDMA: Patch to make use of pre registered memory
> 
> Hi All,
> 
> As I pointed out earlier, for the rdma protocol we need to register the
> memory used during rdma reads and writes with the rdma device, and that is
> a costly operation. To avoid registering memory in the i/o path, we came
> up with two solutions.
> 
> 1) Use a separate pre-registered iobuf_pool for rdma. This approach needs
> an extra level of copying in rdma for each read/write request, i.e. we
> need to copy the content of the memory given by the application into
> rdma's buffers in the rdma code.
> 

copying data defeats the whole point of RDMA, which is to *avoid* copying data. 
  

> 2) Register the default iobuf_pool in glusterfs_ctx with the rdma device
> during rdma initialization. Since the buffers used for read/write then
> come from the default pool, we need neither per-request registration nor
> copying.

This makes far more sense to me.

> But a problem arises when the io-cache translator is turned on: for each
> page fault, io-cache takes a ref on the iobuf of the response buffer to
> cache it, so all the pre-allocated buffers soon get locked up in io-cache.
> Eventually all new requests get iobufs from new iobuf_pools which are not
> registered with rdma, and we are back to registering every iobuf.
> To address this issue, we can:
> 
>   i) Turn off io-cache.
>      (we chose this for testing)
>  ii) Use a separate buffer for io-cache, and offload from the default
>      pool to the io-cache buffer.
>      (new thread to offload)


I think this makes sense, because if you get an io-cache translator cache
hit, then you don't need to go out to the network, so io-cache memory doesn't
have to be registered with RDMA.

> iii) Dynamically register each newly created arena with rdma; for this
>      we need to bring the libglusterfs code and the transport layer code
>      together.
>      (will need changes in packaging and may bring hard dependencies on
>      rdma libs)
>  iv) Increase the default pool size.
>      (will increase the footprint of the glusterfs process)
> 

Registration with RDMA only makes sense to me when data is going to be
sent/received over the RDMA network.  Is it hard to tell in advance which
buffers will need to be transmitted?

> We implemented two approaches, (1) and (2i), to get some performance
> numbers. The setup was a 4x2 distributed-replicated volume using ram
> disks as bricks to avoid the hard-disk bottleneck. The numbers are
> attached to this mail.
> 
> 
> Please provide your thoughts on these approaches.
> 
> Regards
> Rafi KC
> 
> 
> 
         Separate buffer for rdma (1)     No change                        Register default iobuf pool (2i)
         write    read   io-cache off     write    read   io-cache off     write    read   io-cache off
1        373      527    656              343      483    532              446      512    696
2        380      528    668              347      485    540              426      525    715
3        376      527    594              346      482    540              422      526    720
4        381      533    597              348      484    540              413      526    710
5        372      527    479              347      482    538              422      519    719
Note: (varying result)
Average  376.4    528.4  598.8            346.2    483.2  538              425.8    521.6  712

commands:
read:   echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000;
write:  echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync;


vol info"Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f-bf93-4b10-9865-79a129efe2f5
Sta

[Gluster-devel] Netbsd regression needs review

2015-01-23 Thread Emmanuel Dreyfus
Hello

Can someone please have a look at these? Most NetBSD regression failures
are caused by spurious failures that are fixed there.
http://review.gluster.org/9461  (only fixes the test script)
http://review.gluster.org/9074  (this one probably needs a discussion)

I can add a third one fixed this morning:
http://review.gluster.org/9483  (only fixes the test script)

And while I am there, two backports:
http://review.gluster.org/9448
http://review.gluster.org/9484

On Tue, Jan 20, 2015 at 05:28:24AM +0100, Emmanuel Dreyfus wrote:
> NetBSD regression tests suffer a few spurious failures. I have fixes
> for the two worst offenders:
> http://review.gluster.org/9461
> http://review.gluster.org/9074
> 
> Can someone please review?

-- 
Emmanuel Dreyfus
m...@netbsd.org

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Netbsd regression needs review

2015-01-23 Thread Emmanuel Dreyfus
Hello

Can someone please have a look at these? Most NetBSD regression failures
are caused by spurious failures that are fixed there.
http://review.gluster.org/9461  (only fixes the test script)
http://review.gluster.org/9074  (this one probably needs a discussion)

I can add a third one fixed this morning:
http://review.gluster.org/9483  (only fixes the test script)

And while I am there, two backports:
http://review.gluster.org/9448
http://review.gluster.org/9484

On Tue, Jan 20, 2015 at 05:28:24AM +0100, Emmanuel Dreyfus wrote:
> NetBSD regression tests suffer a few spurious failures. I have fixes
> for the two worst offenders:
> http://review.gluster.org/9461
> http://review.gluster.org/9074
> 
> Can someone please review?

-- 
Emmanuel Dreyfus
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-user] Sybase backup server failed to write to Gluster NFS

2015-01-23 Thread Soumya Koduri
In that case, it is most likely an issue with the backup servers you are
using.


Maybe you can first try verifying the NFS client on that machine. Issue 
write fops directly on the NFS mount points used by those servers.


Enable rpcdebug -> "rpcdebug -m nfs all" and check "/var/log/messages" 
for any errors.


Thanks,
Soumya

On 01/22/2015 11:11 PM, Peter Auyeung wrote:

Hi Soumya,

I was able to mount the same volume on another NFS client and do writes.

I got the following nfs.log entries during the write:




[2015-01-22 17:39:03.528405] I 
[afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 
0-sas02-replicate-1:  metadata self heal  is successfully completed,   metadata 
self heal from source sas02-client-2 to sas02-client-3,  metadata - Pending 
matrix:  [ [ 0 0 ] [ 0 0 ] ], on /RepDBSata02
[2015-01-22 17:39:03.529407] I 
[afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 
0-sas02-replicate-2:  metadata self heal  is successfully completed,   metadata 
self heal from source sas02-client-4 to sas02-client-5,  metadata - Pending 
matrix:  [ [ 0 0 ] [ 0 0 ] ], on /RepDBSata02


Thanks
Peter

From: Soumya Koduri [skod...@redhat.com]
Sent: Wednesday, January 21, 2015 9:05 PM
To: Peter Auyeung; gluster-devel@gluster.org; gluster-us...@gluster.org
Subject: Re: [Gluster-devel] [Gluster-user] Sybase backup server failed to 
write to Gluster NFS

Hi Peter,

Can you please try manually mounting those volumes using any/other nfs
client and check if you are able to perform write operations. Also
please collect the gluster nfs log while doing so.

Thanks,
Soumya

On 01/22/2015 08:18 AM, Peter Auyeung wrote:

Hi,

We have had 5 sybase servers doing dump/export to Gluster NFS for a couple
of months, and yesterday it started giving us these errors about not being
able to write files.

The gluster NFS export is not full and we can still move and write files
as sybase unix user from the sybase servers.

There are no errors in the gluster nfs, brick, or etc-glusterfs logs, and
no nfs client errors on the sybase servers either.

The NFS export was a replica 2 volume (3x2)

I created another NFS export from same gluster but a distributed only
volume and still giving out the same error.

Any Clue?

Thanks
Peter

Jan 20 20:04:17 2015: Backup Server: 6.53.1.1: OPERATOR: Volume on
device '/dbbackup01/db/full/pr_rssd_id_repsrv_rssd.F01-20-20-04.e'
cannot be opened for write access. Mount another volume.
Jan 20 20:04:17 2015: Backup Server: 6.78.1.1: EXECUTE sp_volchanged
  @session_id = 87,
  @devname =
'/dbbackup01/db/full/pr_rssd_id_repsrv_rssd.F01-20-20-04.e',
  @action = { 'PROCEED' | 'RETRY' | 'ABORT' }
Jan 20 20:04:26 2015: Backup Server: 6.53.1.1: OPERATOR: Volume on
device '/dbbackup01/db/full/pr_rssd_id_repsrv_rssd.F01-20-20-04.a'
cannot be opened for write access. Mount another volume.
Jan 20 20:04:26 2015: Backup Server: 6.78.1.1: EXECUTE sp_volchanged
  @session_id = 87,
  @devname =
'/dbbackup01/db/full/pr_rssd_id_repsrv_rssd.F01-20-20-04.a',
  @action = { 'PROCEED' | 'RETRY' | 'ABORT' }
Jan 20 20:05:41 2015: Backup Server: 6.53.1.1: OPERATOR: Volume on
device '/dbbackup01/db/full/pr_rssd_id_repsrv_rssd.F01-20-20-04.d'
cannot be opened for write access. Mount another volume.
Jan 20 20:05:41 2015: Backup Server: 6.78.1.1: EXECUTE sp_volchanged
  @session_id = 87,
  @devname =
'/dbbackup01/db/full/pr_rssd_id_repsrv_rssd.F01-20-20-04.d',
  @action = { 'PROCEED' | 'RETRY' | 'ABORT' }



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Requesting review on rdma admin documentation

2015-01-23 Thread RAGHAVENDRA TALUR
Hi All,

I have put up an admin doc on rdma for review at
http://review.gluster.org/#/c/9443/.
Reviews appreciated.

-- 
Raghavendra Talur
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] RDMA: Patch to make use of pre registered memory

2015-01-23 Thread Mohammed Rafi K C
Hi All,

As I pointed out earlier, for the rdma protocol we need to register the
memory used during rdma reads and writes with the rdma device, and that is a
costly operation. To avoid registering memory in the i/o path, we came up
with two solutions.

1) Use a separate pre-registered iobuf_pool for rdma. This approach needs an
extra level of copying in rdma for each read/write request, i.e. we need to
copy the content of the memory given by the application into rdma's buffers
in the rdma code.

2) Register the default iobuf_pool in glusterfs_ctx with the rdma device
during rdma initialization. Since the buffers used for read/write then come
from the default pool, we need neither per-request registration nor copying.
But a problem arises when the io-cache translator is turned on: for each page
fault, io-cache takes a ref on the iobuf of the response buffer to cache it,
so all the pre-allocated buffers soon get locked up in io-cache. Eventually
all new requests get iobufs from new iobuf_pools which are not registered
with rdma, and we are back to registering every iobuf. To address this issue,
we can:

  i) Turn off io-cache.
     (we chose this for testing)
 ii) Use a separate buffer for io-cache, and offload from the default pool
     to the io-cache buffer.
     (new thread to offload)
iii) Dynamically register each newly created arena with rdma; for this we
     need to bring the libglusterfs code and the transport layer code
     together (see the sketch after this list).
     (will need changes in packaging and may bring hard dependencies on
     rdma libs)
 iv) Increase the default pool size.
     (will increase the footprint of the glusterfs process)
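
As a rough illustration of the registration step behind (2) and (iii), the
per-arena call would look something like this (assuming libibverbs; a sketch
only, not the actual patch):

#include <stddef.h>
#include <infiniband/verbs.h>

/* Register one iobuf arena with the rdma device so that reads and writes
 * from that arena need no per-request registration.  Pinning + registering
 * is the costly part we want out of the i/o path; doing it once per arena
 * amortizes the cost. */
struct ibv_mr *
register_arena_with_rdma (struct ibv_pd *pd, void *arena_base,
                          size_t arena_size)
{
        int access = IBV_ACCESS_LOCAL_WRITE |
                     IBV_ACCESS_REMOTE_READ |
                     IBV_ACCESS_REMOTE_WRITE;

        return ibv_reg_mr (pd, arena_base, arena_size, access);
}

/* Matching teardown when an arena is destroyed: */
void
unregister_arena (struct ibv_mr *mr)
{
        if (mr)
                ibv_dereg_mr (mr);
}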

We implemented two approaches, (1) and (2i), to get some performance
numbers. The setup was a 4x2 distributed-replicated volume using ram disks
as bricks to avoid the hard-disk bottleneck. The numbers are attached to
this mail.


Please provide your thoughts on these approaches.

Regards
Rafi KC


         Separate buffer for rdma (1)     No change                        Register default iobuf pool (2i)
         write    read   io-cache off     write    read   io-cache off     write    read   io-cache off
1        373      527    656              343      483    532              446      512    696
2        380      528    668              347      485    540              426      525    715
3        376      527    594              346      482    540              422      526    720
4        381      533    597              348      484    540              413      526    710
5        372      527    479              347      482    538              422      519    719
Note: (varying result)
Average  376.4    528.4  598.8            346.2    483.2  538              425.8    521.6  712

commands:
read:   echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000;
write:  echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync;


vol info"Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f-bf93-4b10-9865-79a129efe2f5
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: rdma
Bricks:
Brick1: 192.168.44.105:/home/ram0/b0
Brick2: 192.168.44.106:/home/ram0/b0
Brick3: 192.168.44.107:/brick/0/b0
Brick4: 192.168.44.108:/brick/0/b0
Brick5: 192.168.44.105:/home/ram1/b1
Brick6: 192.168.44.106:/home/ram1/b1
Brick7: 192.168.44.107:/brick/1/b1
Brick8: 192.168.44.108:/brick/1/b1
Options Reconfigured:
performance.io-cache: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] netbsd build failure

2015-01-23 Thread Emmanuel Dreyfus
Raghavendra Bhat  wrote:

> You have mentioned that patch http://review.gluster.org/#/c/9469/ breaks 
> the build on netbsd. Where is it failing?
> I tried to check the link for netbsd tests for the patch 
> (http://build.gluster.org/job/netbsd6-smoke/2431/). But I got the below 
> error.

Here:
http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/708/console

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Support to distinguish locks set via clients using SYNCOP framework

2015-01-23 Thread Soumya Koduri

Hi,

The server-side lock xlator seems to distinguish the various locks set by
clients on a file using two parameters:


* client UUID
* frame->root->lk_owner

Hence, if the same client sets locks on a file and the server is to treat
them as being from different owners, the client needs to pass different
lk_owners (as 'frame->root->lk_owner') for those locks.


At present, FUSE and gluster-nfs set this field while creating the frame,
whereas this support is missing for clients using the SYNCOP framework.


I have made changes to fix this. In addition, I have added an API in gfapi
to pass the lk_owner to syncopctx.


http://review.gluster.org/#/c/9482/
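
A rough sketch of the shape of such an API (names and the owner-length limit
below are placeholders, not the actual interface added by the patch):

#include <stdint.h>
#include <string.h>

/* A gfapi caller hands in an opaque lock-owner blob, which is copied into
 * the per-thread syncop context, so the server's lock xlator sees a
 * distinct (client UUID, lk_owner) pair per owner. */
#define EXAMPLE_MAX_LK_OWNER_LEN 1024

struct example_syncopctx {
        uint8_t lk_owner[EXAMPLE_MAX_LK_OWNER_LEN];
        int     lk_owner_len;
};

int
example_set_lk_owner (struct example_syncopctx *ctx,
                      const void *owner, int owner_len)
{
        if (!ctx || !owner || owner_len <= 0 ||
            owner_len > EXAMPLE_MAX_LK_OWNER_LEN)
                return -1;

        memcpy (ctx->lk_owner, owner, (size_t) owner_len);
        ctx->lk_owner_len = owner_len;
        return 0;
}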

Kindly review the changes.

Thanks,
Soumya
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] netbsd build failure

2015-01-23 Thread Raghavendra Bhat


Hi Emmanuel,

You have mentioned that patch http://review.gluster.org/#/c/9469/ breaks 
the build on netbsd. Where is it failing?
I tried to check the link for netbsd tests for the patch 
(http://build.gluster.org/job/netbsd6-smoke/2431/). But I got the below 
error.



 Status Code: 404

Exception:
Stacktrace:

(none)


Regards,
Raghavendra Bhat

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel