Re: [Gluster-devel] RDMA: Patch to make use of pre registered memory
Couple of comments:

1. rdma can register init/fini functions (via pointers) into iobuf_pool. There is absolutely no need to introduce an rdma dependency into libglusterfs.

2. It might be a good idea to take a holistic approach towards zero-copy with libgfapi + RDMA, rather than the narrow goal of "use pre-registered memory with RDMA". Do keep the option open of RDMA'ing the user's memory pointer (passed to glfs_write()) as well.

3. It is better to make io-cache and write-behind use a new iobuf_pool for caching purposes. There could be an optimization where they just do iobuf_ref()/iobref_ref() when it is safe - e.g. io-cache can cache with iobuf_ref() when the transport is socket, or write-behind can unwind by holding onto the data with iobuf_ref() when the topmost layer is FUSE or the server (i.e. no gfapi).

4. The next step for zero-copy would be the introduction of a new fop readto(), where the destination pointer is passed in by the caller (gfapi being the primary use case). In this situation RDMA ought to register that memory if necessary and request the server to RDMA_WRITE into the pointer provided by the gfapi caller.

2. and 4. require changes in the same code you would be modifying if you were to just do "pre-registered memory", so it is better to plan for the bigger picture upfront. Zero-copy can improve performance (especially reads) in the qemu use case.

Thanks

___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
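[Editor's illustration of point 1 above - a minimal sketch of what init/fini hooks on iobuf_pool could look like. The struct and function names here are illustrative assumptions, not the existing iobuf_pool API; the point is only that libglusterfs exposes function pointers that the rdma transport fills in, so libglusterfs itself never links against the verbs library.]

    #include <stddef.h>

    /* Hypothetical hook table on iobuf_pool (field names are illustrative
       only).  libglusterfs stays free of any rdma dependency: it just
       invokes whatever callbacks were registered, if any. */
    struct iobuf_init_fini_hooks {
            int  (*arena_init) (void *mem, size_t len, void *data);
            void (*arena_fini) (void *mem, size_t len, void *data);
            void  *data;    /* transport-private state, e.g. a protection domain */
    };

    /* Called by libglusterfs whenever it creates a new arena. */
    static int
    iobuf_arena_created (struct iobuf_init_fini_hooks *hooks,
                         void *mem, size_t len)
    {
            if (hooks && hooks->arena_init)
                    return hooks->arena_init (mem, len, hooks->data);
            return 0;       /* no transport registered hooks: nothing to do */
    }

    /* The rdma transport fills the table in at its own init time, pointing
       arena_init/arena_fini at functions that register/deregister the
       memory with the rdma device. */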
Re: [Gluster-devel] managing of THIS
The problem you describe is very specific to glfs_new(), and not to gfapi in general. I guess we can handle this in glfs_new() by initializing an appropriate value into THIS (save old_THIS and restore it before returning from glfs_new). That should avoid the need for all those new macros?

Thanks

On Wed, Jan 21, 2015, 23:37 Raghavendra Bhat wrote:
>
> Hi,
>
> In glusterfs, at the time the process comes up, it creates 5 pthread
> keys (for saving THIS, syncop, uuid buf, lkowner buf and syncop ctx).
> gfapi does the same thing in its glfs_new function. But with User
> Serviceable Snapshots (where a glusterfs process spawns multiple gfapi
> instances, one per snapshot) this leads to more and more consumption of
> pthread keys. In fact the old keys are lost (as the same variables are
> used for creating the keys) and eventually the process will run out of
> pthread keys after 203 snapshots (the maximum allowed number is 1024 per
> process). So to avoid it, pthread key creation can be done only once
> (using pthread_once, in which the globals_init function is called).
>
> But now a new problem arises. Say glfs_new was called from the
> snapview-server xlator (or glfs_init etc). Now gfapi calls THIS for some
> of its operations, such as properly accounting the memory within the
> xlator while allocating a new structure. But when gfapi calls THIS it
> gets snapview-server's pointer. Since snapview-server does not know
> about gfapi's internal structures, it asserts at the time of allocating.
>
> For now, a patch has been sent to handle the issue by turning off
> memory-accounting for the snapshot daemon
> (http://review.gluster.org/#/c/9430).
>
> But if memory-accounting has to be turned on for the snapshot daemon,
> then the above problem has to be fixed.
> Two ways that can be used for fixing the issue are:
>
> 1) Add the data structures that are used by gfapi to libglusterfs (and
> hence their mem-types as well), so that any xlator that is calling gfapi
> functions (such as snapview-server as of now) will be aware of the
> memory types used by gfapi and hence will not cause problems when
> memory accounting has to be done as part of allocations and frees.
>
> OR
>
> 2) Properly manage THIS by introducing a new macro similar to STACK_WIND
> (for now it can be called STACK_API_WIND). The macro will be much
> simpler than STACK_WIND as it need not create new frames before handing
> over the call to the next layer. Before handing the call over to gfapi
> (any call, such as glfs_new, or fops such as glfs_h_open), it saves THIS
> in a variable and calls the gfapi function given as an argument. After
> the function returns it sets THIS back to the value it had before the
> gfapi function was called.
>
> Ex:
>
> #define STACK_API_WIND(this, fn, ret, params ...)      \
>         do {                                           \
>                 xlator_t *old_THIS = NULL;             \
>                                                        \
>                 old_THIS = this;                       \
>                 ret = fn (params);                     \
>                 THIS = old_THIS;                       \
>         } while (0)
>
> A caller (as of now the snapview-server xlator) would call the macro like this:
>
>         STACK_API_WIND (this, glfs_h_open, glfd, fs, object, flags);
>
> Please provide feedback; any suggestions or solutions to handle the
> mentioned issue are welcome.
>
> Regards,
> Raghavendra Bhat
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>
___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
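[Editor's illustration - a minimal sketch of the save/restore idea inside glfs_new() itself. The helper and xlator names are illustrative placeholders, not the actual gfapi symbols.]

    struct glfs *
    glfs_new (const char *volname)
    {
            struct glfs *fs       = NULL;
            xlator_t    *old_THIS = THIS;  /* the caller's xlator, e.g. snapview-server */

            /* Point THIS at an xlator gfapi owns, so allocations made during
               initialization are accounted against gfapi's own mem-types
               rather than against the caller's. */
            THIS = gfapi_master_xlator;          /* illustrative name */

            fs = gfapi_alloc_and_init (volname); /* the existing setup, illustrative name */

            THIS = old_THIS;                     /* restore before returning to the caller */
            return fs;
    }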
Re: [Gluster-devel] Reg. multi thread epoll NetBSD failures
Since all of the epoll code and its multithreading is under ifdefs, NetBSD should just continue working with single-threaded poll, unaffected by the patch.

If NetBSD's kqueue supports single-shot event delivery and edge-triggered notification, we could have an equivalent implementation on NetBSD too. Even if kqueue does not support these features, it might well be worth implementing a single-threaded, level-triggered, kqueue-based event handler and promoting NetBSD off of plain vanilla poll.

Thanks

On Fri, Jan 23, 2015, 19:29 Emmanuel Dreyfus wrote:
> Ben England wrote:
>
> > NetBSD may be useful for exposing race conditions, but it's not clear to
> > me that all of these race conditions would happen in a non-NetBSD
> > environment,
>
> Many times, NetBSD has exhibited cases where unspecified,
> Linux-specific behaviors were assumed. I recall my very first finding in
> Glusterfs: Linux lets you use a mutex without calling
> pthread_mutex_init() first. That broke on NetBSD, as expected.
>
> Fixing this kind of issue is interesting beyond NetBSD support, since
> you cannot take for granted that an unspecified behavior will not be
> altered in the future.
>
> That said, I am fine if you let NetBSD run without fixing the underlying
> issue, but you have been warned :-)
>
> --
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> m...@netbsd.org
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>
___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
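[Editor's illustration - a rough sketch of what a kqueue-based equivalent could look like on NetBSD: EV_ONESHOT gives single-shot delivery (the event must be re-armed after handling, much like EPOLLONESHOT in the multi-threaded epoll code) and EV_CLEAR gives edge-triggered behaviour. This is illustrative only, not code from any existing patch; handle_fd() stands in for the socket event handler.]

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* Minimal single-threaded kqueue loop with one-shot, edge-triggered
       read events on a fixed set of descriptors. */
    static int
    event_dispatch_kqueue (int *fds, int nfds, void (*handle_fd) (int fd))
    {
            struct kevent ev;
            int           kq = kqueue ();
            int           i  = 0;

            if (kq < 0)
                    return -1;

            for (i = 0; i < nfds; i++) {
                    /* EV_ONESHOT: delivered once, then disabled until re-added.
                       EV_CLEAR: edge-triggered, only new activity is reported. */
                    EV_SET (&ev, fds[i], EVFILT_READ,
                            EV_ADD | EV_ONESHOT | EV_CLEAR, 0, 0, NULL);
                    if (kevent (kq, &ev, 1, NULL, 0, NULL) < 0) {
                            close (kq);
                            return -1;
                    }
            }

            for (;;) {
                    struct kevent out;
                    int           n = kevent (kq, NULL, 0, &out, 1, NULL);

                    if (n < 0) {
                            close (kq);
                            return -1;
                    }
                    if (n == 0)
                            continue;

                    handle_fd ((int) out.ident);

                    /* re-arm the descriptor, as the epoll code does after
                       handling a one-shot event */
                    EV_SET (&ev, out.ident, EVFILT_READ,
                            EV_ADD | EV_ONESHOT | EV_CLEAR, 0, 0, NULL);
                    kevent (kq, &ev, 1, NULL, 0, NULL);
            }
    }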
Re: [Gluster-devel] Reg. multi thread epoll NetBSD failures
Ben England wrote:

> NetBSD may be useful for exposing race conditions, but it's not clear to
> me that all of these race conditions would happen in a non-NetBSD
> environment,

Many times, NetBSD has exhibited cases where unspecified, Linux-specific behaviors were assumed. I recall my very first finding in Glusterfs: Linux lets you use a mutex without calling pthread_mutex_init() first. That broke on NetBSD, as expected.

Fixing this kind of issue is interesting beyond NetBSD support, since you cannot take for granted that an unspecified behavior will not be altered in the future.

That said, I am fine if you let NetBSD run without fixing the underlying issue, but you have been warned :-)

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
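[Editor's illustration of the portability point above: on glibc a zero-filled pthread_mutex_t happens to be a valid unlocked mutex, but POSIX requires explicit initialization, either statically or via pthread_mutex_init(), and NetBSD enforces that.]

    #include <pthread.h>

    /* Portable: static initialization ... */
    static pthread_mutex_t lock_static = PTHREAD_MUTEX_INITIALIZER;

    /* ... or explicit runtime initialization before first use. */
    static pthread_mutex_t lock_dynamic;

    static int
    setup (void)
    {
            /* Skipping this call may appear to work on Linux because a
               zeroed pthread_mutex_t happens to match the static
               initializer there, but it is undefined behavior per POSIX
               and breaks on NetBSD. */
            return pthread_mutex_init (&lock_dynamic, NULL);
    }

    static void
    critical_section (void)
    {
            pthread_mutex_lock (&lock_static);
            /* ... protected work ... */
            pthread_mutex_unlock (&lock_static);
    }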
[Gluster-devel] [Gluster-users] lockd: server not responding, timed out
We have a 6-node Gluster cluster running Ubuntu on XFS, sharing Gluster volumes over NFS; it had been running fine for 3 months. We restarted glusterfs-server on one of the nodes and all NFS clients started getting "lockd: server not responding, timed out" in /var/log/messages.

We are still able to read and write, but processes that require a persistent file lock, such as database exports, fail.

We have an interim fix of remounting the NFS exports with the nolock option, but we need to know why that is suddenly necessary after a "service glusterfs-server restart" on one of the gluster nodes.

Thanks
Peter
___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Reg. multi thread epoll NetBSD failures
Gluster-ians,

Would it be ok to temporarily disable multi-thread-epoll on NetBSD, unless there is some huge demand for it? NetBSD may be useful for exposing race conditions, but it's not clear to me that all of these race conditions would happen in a non-NetBSD environment, so are we chasing problems that non-NetBSD users can never see? What do people think? If yes, why bust our heads figuring them out for NetBSD right now?

Attached is a tiny, crude and possibly out-of-date patch for making multi-thread-epoll tunable. If we make the number of epoll threads settable, we could add conditional compilation to make GLUSTERFS_EPOLL_MAXTHREADS 1 for NetBSD without much trouble, while still allowing people to experiment with it on NetBSD.

From a performance perspective, let's review why we should go to the trouble of using the multi-thread-epoll patch. The original goal was to allow far greater CPU utilization by Gluster than we typically were seeing. To do this, we want multiple Gluster RPC sockets to be read and processed in parallel by a single process. This is important to clients (glusterfs, libgfapi) that have to talk to many bricks (example: JBOD, erasure coding), and to brick processes (glusterfsd) that have to talk to many clients. It is also important for SSD support (cache tiering) because we need the glusterfsd process to keep up with SSD hardware and caches, which can have orders of magnitude more IOPS available than a single disk drive or even a RAID LUN, and the glusterfsd epoll thread is currently the bottleneck in such configurations. This multi-thread-epoll enhancement seems similar to multi-queue ethernet drivers, etc. that spread load across CPU cores. RDMA 40-Gbps networking may also encounter this bottleneck. We don't want a small fraction of CPU cores (often just 1) to be a bottleneck - we want either network or storage hardware to be the bottleneck instead.

Finally, is it possible with multi-thread-epoll that we do not need to use the io-threads translator (Anand Avati's suggestion) that offloads incoming requests to worker threads? In this case, the epoll threads ARE the server-side thread pool. If so, this could reduce context switching and latency further. I for one look forward to finding out, but I do not want to invest in more performance testing than we have already done unless it is going to be upstream for people to use.

thanks for your help,

-Ben England, Red Hat Perf. Engr.

- Original Message -
> From: "Shyam"
> To: "Emmanuel Dreyfus"
> Cc: "Gluster Devel"
> Sent: Friday, January 23, 2015 2:48:14 PM
> Subject: [Gluster-devel] Reg. multi thread epoll NetBSD failures
>
> Patch: http://review.gluster.org/#/c/3842/
>
> Manu,
>
> I was not able to find the NetBSD job mentioned in the last review
> comment provided by you, pointers to that would help.
>
> Additionally,
>
> What is the support status of epoll on NetBSD? I thought NetBSD favored
> the kqueue means of event processing over epoll and that epoll was not
> supported on NetBSD (or *BSD).
>
> I ask this, as this patch specifically changes the number of epoll
> threads; as a result, it possibly has a different effect on
> NetBSD, which should be on either poll or kqueue (to my understanding).
>
> Could you shed some light on this and on the current status of epoll on
> NetBSD.
>
> Thanks,
> Shyam
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>

--- event-epoll.c.nontunable	2014-09-05 16:27:10.261223176 -0400
+++ event-epoll.c	2014-09-05 16:33:19.818407183 -0400
@@ -612,23 +612,35 @@
 }
 
-#define GLUSTERFS_EPOLL_MAXTHREADS 2
+#define GLUSTERFS_EPOLL_MAX_THREADS 8
+#define GLUSTERFS_EPOLL_DEFAULT_THREADS 4
+int glusterfs_epoll_threads = -1;
 
 static int
 event_dispatch_epoll (struct event_pool *event_pool)
 {
         int        i = 0;
-        pthread_t  pollers[GLUSTERFS_EPOLL_MAXTHREADS];
+        pthread_t  pollers[GLUSTERFS_EPOLL_MAX_THREADS];
         int        ret = -1;
+        char      *epoll_thrd_str = getenv("GLUSTERFS_EPOLL_THREADS");
 
-        for (i = 0; i < GLUSTERFS_EPOLL_MAXTHREADS; i++) {
+        glusterfs_epoll_threads =
+                epoll_thrd_str ? atoi(epoll_thrd_str) : GLUSTERFS_EPOLL_DEFAULT_THREADS;
+
+        if (glusterfs_epoll_threads > GLUSTERFS_EPOLL_MAX_THREADS) {
+                gf_log ("epoll", GF_LOG_ERROR,
+                        "user requested %d threads but limit is %d",
+                        glusterfs_epoll_threads, GLUSTERFS_EPOLL_MAX_THREADS);
+                return EINVAL;
+        }
+
+        for (i = 0; i < glusterfs_epoll_threads; i++) {
                 ret = pthread_create (&pollers[i], NULL,
                                       event_dispatch_epoll_worker,
                                       event_pool);
         }
 
-        for (i = 0; i < GLUSTERFS_EPOLL_MAXTHREADS; i++)
+        for (i = 0; i < glusterfs_epoll_threads; i++)
                 pthread_join (pollers[i], NULL);
 
         return ret;
Re: [Gluster-devel] Reg. multi thread epoll NetBSD failures
Shyam wrote:

> Patch: http://review.gluster.org/#/c/3842/
>
> Manu,
>
> I was not able to find the NetBSD job mentioned in the last review
> comment provided by you, pointers to that would help.

Yes, sorry, both regression tests hung and I had to reboot the machine, hence you do not have the reports in gerrit:
http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/731
http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/741

Since I got the same horrible result, I did not try retriggering one more time; feel free to try it if you need to.

> What is the support status of epoll on NetBSD? I thought NetBSD favored
> the kqueue means of event processing over epoll and that epoll was not
> supported on NetBSD (or *BSD).

There is no support for epoll on NetBSD. The alternative would indeed be kqueue, but no code has been written to support it. glusterfs on NetBSD uses the plain poll code right now.

> I ask this, as this patch specifically changes the number of epoll
> threads; as a result, it possibly has a different effect on
> NetBSD, which should be on either poll or kqueue (to my understanding).

I have not looked at the reasons right now, but it must be something outside of the epoll code, since NetBSD does not use it.

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Reg. multi thread epoll NetBSD failures
Patch: http://review.gluster.org/#/c/3842/

Manu,

I was not able to find the NetBSD job mentioned in the last review comment provided by you; pointers to that would help.

Additionally,

What is the support status of epoll on NetBSD? I thought NetBSD favored the kqueue means of event processing over epoll and that epoll was not supported on NetBSD (or *BSD).

I ask this as this patch specifically changes the number of epoll threads; as a result, it possibly has a different effect on NetBSD, which should be on either poll or kqueue (to my understanding).

Could you shed some light on this and on the current status of epoll on NetBSD.

Thanks,
Shyam
___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] [Gluster-user] Sybase backup server failed to write to Gluster NFS
Soumya,

We just added the nolock NFS option on the client and the sybase backup process is now working. I wonder what could affect the client's locking against gluster NFS, and whether the nolock option does any harm for gluster.

Thanks
Peter

From: gluster-users-boun...@gluster.org [gluster-users-boun...@gluster.org] on behalf of Peter Auyeung [pauye...@connexity.com]
Sent: Friday, January 23, 2015 10:05 AM
To: Soumya Koduri; gluster-devel@gluster.org; gluster-us...@gluster.org
Subject: Re: [Gluster-users] [Gluster-devel] [Gluster-user] Sybase backup server failed to write to Gluster NFS
Re: [Gluster-devel] [Gluster-user] Sybase backup server failed to write to Gluster NFS
Hi Soumya, Yes this is strange as the same group of sybase servers been able to write their backup to gluster for the last 3 months. They can still write as sybase user on OS but not the sybase backup server process. They were able to mount another NFS server and perform both sybase OS user read/write and sybase backup server process. Here are the NFS debug log while we tried to perform backup via sybase process to gluster Jan 21 14:08:46 repdb006 kernel: NFS reply getattr Jan 21 14:08:46 repdb006 kernel: NFS call setacl Jan 21 14:08:46 repdb006 kernel: NFS reply setacl: 0 Jan 21 14:08:46 repdb006 kernel: NFS call lookup test~ Jan 21 14:08:46 repdb006 kernel: NFS reply lookup: 0 Jan 21 14:08:46 repdb006 kernel: NFS call remove test~ Jan 21 14:08:46 repdb006 kernel: NFS reply remove: 0 Jan 21 14:08:46 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-1909463086) Jan 21 14:08:46 repdb006 kernel: NFS call getattr Jan 21 14:08:46 repdb006 kernel: NFS reply getattr Jan 21 14:08:46 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/1099635544) Jan 21 14:08:47 repdb006 kernel: NFS call getattr Jan 21 14:08:47 repdb006 kernel: NFS reply getattr Jan 21 14:08:47 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-1009531860) Jan 21 14:08:47 repdb006 kernel: NFS call access Jan 21 14:08:47 repdb006 kernel: NFS reply access, status = 0 Jan 21 14:08:47 repdb006 kernel: NFS call lookup .test.swp Jan 21 14:08:47 repdb006 kernel: NFS reply lookup: 0 Jan 21 14:08:47 repdb006 kernel: NFS call remove .test.swp Jan 21 14:08:47 repdb006 kernel: NFS reply remove: 0 Jan 21 14:08:47 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/1099635544) Jan 21 14:08:48 repdb006 kernel: NFS call access Jan 21 14:08:48 repdb006 kernel: NFS reply access, status = 0 Jan 21 14:08:48 repdb006 kernel: NFS call getattr Jan 21 14:08:48 repdb006 kernel: NFS reply getattr Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-1009531860) Jan 21 14:08:48 repdb006 kernel: NFS call access Jan 21 14:08:48 repdb006 kernel: NFS reply access, status = 0 Jan 21 14:08:48 repdb006 kernel: NFS call readdirplus 0 Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 2 Jan 21 14:08:48 repdb006 kernel: NFS call readdirplus 37 Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 1 Jan 21 14:08:48 repdb006 kernel: NFS call readdirplus 3073 Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 1 Jan 21 14:08:48 repdb006 kernel: NFS call readdirplus 3074 Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 1 Jan 21 14:08:48 repdb006 kernel: NFS call readdirplus 3075 Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 1 Jan 21 14:08:48 repdb006 kernel: NFS call readdirplus 3076 Jan 21 14:08:48 repdb006 kernel: NFS reply readdir: 0 Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-1009531860, 32768) = fff5 Jan 21 14:08:48 repdb006 kernel: NFS call getacl Jan 21 14:08:48 repdb006 kernel: NFS reply getacl: 0 Jan 21 14:08:48 repdb006 kernel: NFS call getattr Jan 21 14:08:48 repdb006 kernel: NFS reply getattr Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-1009531860, 16384) = Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/487151148, 32768) = Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/487151148, 16384) = Jan 21 14:08:48 repdb006 kernel: NFS call getattr Jan 21 14:08:48 repdb006 kernel: NFS reply getattr Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-1991882046, 32768) = Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-1991882046, 16384) = Jan 21 
14:08:48 repdb006 kernel: NFS call getattr Jan 21 14:08:48 repdb006 kernel: NFS reply getattr Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/1247752239, 32768) = Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/1247752239, 16384) = Jan 21 14:08:48 repdb006 kernel: NFS call getattr Jan 21 14:08:48 repdb006 kernel: NFS reply getattr Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/2137268371, 32768) = Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/2137268371, 16384) = Jan 21 14:08:48 repdb006 kernel: NFS call getattr Jan 21 14:08:48 repdb006 kernel: NFS reply getattr Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-702684555) Jan 21 14:08:48 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-702684555, 32768) = fff5 Jan 21 14:08:48 repdb006 kernel: NFS call getacl Jan 21 14:08:48 repdb006 kernel: NFS reply getacl: 0 Jan 21 14:08:49 repdb006 kernel: NFS call getattr Jan 21 14:08:49 repdb006 kernel: NFS reply getattr Jan 21 14:08:49 repdb006 kernel: NFS: nfs3_get_cached_acl(0:13/-702684555, 16384) = Jan 21 14:09:31 repdb006 kernel: NFS call getattr Jan 21 14:09:31 repdb006 kernel: NFS reply getattr Jan 21 14:09:31 repdb006 kernel: NFS: nfs3_forget_cached_acls(0:13/-1964955375) Jan 21 14:09:31 repdb006 kernel: NFS cal
Re: [Gluster-devel] RDMA: Patch to make use of pre registered memory
Rafi, great results, thanks. Your "io-cache off" columns are read tests with the io-cache translator disabled, correct?

What jumps out at me from your numbers are two things:

- The io-cache translator destroys RDMA read performance.
- Approach 2i) "register iobuf pool" is the best approach:
  -- on reads with io-cache off, 32% better than baseline and 21% better than 1) "separate buffer"
  -- on writes, 22% better than baseline and 14% better than 1)

Can someone explain to me why the typical Gluster site wants to use the io-cache translator, given that FUSE now caches file data? Should we just have it turned off by default at this point? This would buy us time to change the io-cache implementation to be compatible with RDMA (see option "2ii" below).

Remaining comments inline.

-ben

- Original Message -
> From: "Mohammed Rafi K C"
> To: gluster-devel@gluster.org
> Cc: "Raghavendra Gowdappa" , "Anand Avati" , "Ben Turner" , "Ben England" , "Suman Debnath"
> Sent: Friday, January 23, 2015 7:43:45 AM
> Subject: RDMA: Patch to make use of pre registered memory
>
> Hi All,
>
> As I pointed out earlier, for the rdma protocol we need to register the
> memory which is used during rdma reads and writes with the rdma device.
> In fact it is a costly operation. To avoid the registration of memory in
> the i/o path, we came up with two solutions.
>
> 1) Use a separate pre-registered iobuf_pool for rdma. This approach
> needs an extra level of copying in rdma for each read/write request,
> i.e. we need to copy the content of the memory given by the application
> into the buffers of rdma in the rdma code.

Copying data defeats the whole point of RDMA, which is to *avoid* copying data.

> 2) Register the default iobuf_pool in glusterfs_ctx with the rdma device
> during rdma initialization. Since we are registering buffers from the
> default pool for read/write, we don't require either registration or
> copying.

This makes far more sense to me.

> But the problem comes when the io-cache translator is turned on; then
> for each page fault, io-cache will take a ref on the iobuf of the
> response buffer to cache it. Due to this, all the pre-allocated buffers
> will get locked up in io-cache very soon. Eventually all new requests
> would get iobufs from new iobuf_pools which are not registered with
> rdma, and we will have to do registration for every iobuf. To address
> this issue, we can:
>
> i) Turn off io-cache
>    (we chose this for testing)
> ii) Use a separate buffer for io-cache, and offload from the
>     default pool to the io-cache buffer.
>     (New thread to offload)

I think this makes sense, because if you get an io-cache translator cache hit, then you don't need to go out to the network, so io-cache memory doesn't have to be registered with RDMA.

> iii) Dynamically register each newly created arena with rdma; for this
>      we need to bring libglusterfs code and transport layer code together.
>      (Will need changes in packaging and may bring hard dependencies on
>      rdma libs)
> iv) Increase the default pool size.
>     (Will increase the footprint of the glusterfs process)

Registration with RDMA only makes sense to me when data is going to be sent/received over the RDMA network. Is it hard to tell in advance which buffers will need to be transmitted?

> We implemented two approaches, (1) and (2i), to get some performance
> numbers. The setup was a 4*2 distributed-replicated volume using ram
> disks as bricks to avoid a hard disk bottleneck. The numbers are
> attached to the mail.
>
> Please provide your thoughts on these approaches.
>
> Regards
> Rafi KC
>

Results of the dd commands below (five runs and the average); columns per configuration: write, read, and read with io-cache off.

            Separate buffer for rdma (1)   No change (baseline)            Register default iobuf pool (2i)
            write   read   io-cache off    write   read   io-cache off     write   read   io-cache off
Run 1       373     527    656             343     483    532              446     512    696
Run 2       380     528    668             347     485    540              426     525    715
Run 3       376     527    594             346     482    540              422     526    720
Run 4       381     533    597             348     484    540              413     526    710
Run 5       372     527    479             347     482    538              422     519    719
Note: (varying result)
Average     376.4   528.4  598.8           346.2   483.2  538              425.8   521.6  712

Commands:
read:  echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000
write: echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync

vol info:
Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f-bf93-4b10-9865-79a129efe2f5
[Gluster-devel] NetBSD regression needs review
Hello,

Can someone please have a look at these? Most NetBSD regression failures are caused by spurious failures that are fixed by these:

http://review.gluster.org/9461 (only fixes the test script)
http://review.gluster.org/9074 (this one probably needs a discussion)

I can add a third one fixed this morning:

http://review.gluster.org/9483 (only fixes the test script)

And while I am there, two backports:

http://review.gluster.org/9448
http://review.gluster.org/9484

On Tue, Jan 20, 2015 at 05:28:24AM +0100, Emmanuel Dreyfus wrote:
> NetBSD regression tests suffer a few spurious failures. I have fixes
> for the two worst offenders:
> http://review.gluster.org/9461
> http://review.gluster.org/9074
>
> Can someone please review?
>
> --
> Emmanuel Dreyfus
> m...@netbsd.org

--
Emmanuel Dreyfus
m...@netbsd.org
___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-user] Sybase backup server failed to write to Gluster NFS
In that case, most likely it is an issue with the backup servers you are using. Maybe you can first try verifying the NFS client on that machine. Issue write fops directly on the NFS mount points used by those servers. Enable rpcdebug -> "rpcdebug -m nfs all" and check "/var/log/messages" for any errors.

Thanks,
Soumya

On 01/22/2015 11:11 PM, Peter Auyeung wrote:

Hi Soumya,

I was able to mount the same volume on another NFS client and do writes; I got the following nfs.log entries when writing:

[2015-01-22 17:39:03.528405] I [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-sas02-replicate-1: metadata self heal is successfully completed, metadata self heal from source sas02-client-2 to sas02-client-3, metadata - Pending matrix: [ [ 0 0 ] [ 0 0 ] ], on /RepDBSata02
[2015-01-22 17:39:03.529407] I [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-sas02-replicate-2: metadata self heal is successfully completed, metadata self heal from source sas02-client-4 to sas02-client-5, metadata - Pending matrix: [ [ 0 0 ] [ 0 0 ] ], on /RepDBSata02

Thanks
Peter

From: Soumya Koduri [skod...@redhat.com]
Sent: Wednesday, January 21, 2015 9:05 PM
To: Peter Auyeung; gluster-devel@gluster.org; gluster-us...@gluster.org
Subject: Re: [Gluster-devel] [Gluster-user] Sybase backup server failed to write to Gluster NFS

Hi Peter,

Can you please try manually mounting those volumes using any other nfs client and check if you are able to perform write operations. Also please collect the gluster nfs log while doing so.

Thanks,
Soumya

On 01/22/2015 08:18 AM, Peter Auyeung wrote:

Hi,

We have had 5 sybase servers doing dump/export to Gluster NFS for a couple of months, and yesterday they started giving us these errors about not being able to write files. The gluster NFS export is not full, and we can still move and write files as the sybase unix user from the sybase servers. There are no error logs on gluster nfs, nor in the brick and etc-glusterfs logs, and no nfs client errors on the sybase servers either. The NFS export was a replica 2 volume (3x2). I created another NFS export from the same gluster, on a distributed-only volume, and it still gives the same error.

Any clue?

Thanks
Peter

Jan 20 20:04:17 2015: Backup Server: 6.53.1.1: OPERATOR: Volume on device '/dbbackup01/db/full/pr_rssd_id_repsrv_rssd.F01-20-20-04.e' cannot be opened for write access. Mount another volume.
Jan 20 20:04:17 2015: Backup Server: 6.78.1.1: EXECUTE sp_volchanged @session_id = 87, @devname = '/dbbackup01/db/full/pr_rssd_id_repsrv_rssd.F01-20-20-04.e', @action = { 'PROCEED' | 'RETRY' | 'ABORT' }
Jan 20 20:04:26 2015: Backup Server: 6.53.1.1: OPERATOR: Volume on device '/dbbackup01/db/full/pr_rssd_id_repsrv_rssd.F01-20-20-04.a' cannot be opened for write access. Mount another volume.
Jan 20 20:04:26 2015: Backup Server: 6.78.1.1: EXECUTE sp_volchanged @session_id = 87, @devname = '/dbbackup01/db/full/pr_rssd_id_repsrv_rssd.F01-20-20-04.a', @action = { 'PROCEED' | 'RETRY' | 'ABORT' }
Jan 20 20:05:41 2015: Backup Server: 6.53.1.1: OPERATOR: Volume on device '/dbbackup01/db/full/pr_rssd_id_repsrv_rssd.F01-20-20-04.d' cannot be opened for write access. Mount another volume.
Jan 20 20:05:41 2015: Backup Server: 6.78.1.1: EXECUTE sp_volchanged @session_id = 87, @devname = '/dbbackup01/db/full/pr_rssd_id_repsrv_rssd.F01-20-20-04.d', @action = { 'PROCEED' | 'RETRY' | 'ABORT' }

___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Requesting review on rdma admin documentation
Hi All,

I have put up an admin doc on rdma for review at http://review.gluster.org/#/c/9443/. Reviews appreciated.

--
Raghavendra Talur
___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] RDMA: Patch to make use of pre registered memory
Hi All,

As I pointed out earlier, for the rdma protocol we need to register the memory which is used during rdma reads and writes with the rdma device. In fact it is a costly operation. To avoid the registration of memory in the i/o path, we came up with two solutions.

1) Use a separate pre-registered iobuf_pool for rdma. This approach needs an extra level of copying in rdma for each read/write request, i.e. we need to copy the content of the memory given by the application into the buffers of rdma in the rdma code.

2) Register the default iobuf_pool in glusterfs_ctx with the rdma device during rdma initialization. Since we are registering buffers from the default pool for read/write, we don't require either registration or copying. But the problem comes when the io-cache translator is turned on; then for each page fault, io-cache will take a ref on the iobuf of the response buffer to cache it. Due to this, all the pre-allocated buffers will get locked up in io-cache very soon. Eventually all new requests would get iobufs from new iobuf_pools which are not registered with rdma, and we will have to do registration for every iobuf. To address this issue, we can:

i) Turn off io-cache
   (we chose this for testing)
ii) Use a separate buffer for io-cache, and offload from the default pool to the io-cache buffer.
   (New thread to offload)
iii) Dynamically register each newly created arena with rdma; for this we need to bring libglusterfs code and transport layer code together.
   (Will need changes in packaging and may bring hard dependencies on rdma libs)
iv) Increase the default pool size.
   (Will increase the footprint of the glusterfs process)

We implemented two approaches, (1) and (2i), to get some performance numbers. The setup was a 4*2 distributed-replicated volume using ram disks as bricks to avoid a hard disk bottleneck. The numbers are attached to the mail.

Please provide your thoughts on these approaches.

Regards
Rafi KC

Results of the dd commands below (five runs and the average); columns per configuration: write, read, and read with io-cache off.

            Separate buffer for rdma (1)   No change (baseline)            Register default iobuf pool (2i)
            write   read   io-cache off    write   read   io-cache off     write   read   io-cache off
Run 1       373     527    656             343     483    532              446     512    696
Run 2       380     528    668             347     485    540              426     525    715
Run 3       376     527    594             346     482    540              422     526    720
Run 4       381     533    597             348     484    540              413     526    710
Run 5       372     527    479             347     482    538              422     519    719
Note: (varying result)
Average     376.4   528.4  598.8           346.2   483.2  538              425.8   521.6  712

Commands:
read:  echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000
write: echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync

vol info:
Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f-bf93-4b10-9865-79a129efe2f5
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: rdma
Bricks:
Brick1: 192.168.44.105:/home/ram0/b0
Brick2: 192.168.44.106:/home/ram0/b0
Brick3: 192.168.44.107:/brick/0/b0
Brick4: 192.168.44.108:/brick/0/b0
Brick5: 192.168.44.105:/home/ram1/b1
Brick6: 192.168.44.106:/home/ram1/b1
Brick7: 192.168.44.107:/brick/1/b1
Brick8: 192.168.44.108:/brick/1/b1
Options Reconfigured:
performance.io-cache: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
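[Editor's illustration of approach (2) above - a sketch of what the registration step at rdma init time could conceptually look like. The arena list and the place where the resulting memory regions are stored are assumptions for illustration, not the actual iobuf/rdma structures; only ibv_reg_mr() is the real verbs call that pins a buffer so it can be used for RDMA reads/writes without per-request registration.]

    #include <stddef.h>
    #include <infiniband/verbs.h>

    /* Sketch: walk every arena of the default iobuf pool once, at rdma
       transport init, and register each one with the protection domain. */
    static int
    rdma_register_iobuf_pool (struct ibv_pd *pd,
                              void **arena_bases, size_t *arena_sizes,
                              struct ibv_mr **mrs, int arena_count)
    {
            int i = 0;

            for (i = 0; i < arena_count; i++) {
                    mrs[i] = ibv_reg_mr (pd, arena_bases[i], arena_sizes[i],
                                         IBV_ACCESS_LOCAL_WRITE |
                                         IBV_ACCESS_REMOTE_READ |
                                         IBV_ACCESS_REMOTE_WRITE);
                    if (!mrs[i])
                            return -1;  /* fall back to per-request registration */
            }
            return 0;
    }

    /* At I/O time, a buffer carved out of a registered arena can be placed
       in a work request directly (sge.lkey = mr->lkey) with no further
       registration and no copying. */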
Re: [Gluster-devel] netbsd build failure
Raghavendra Bhat wrote:

> You have mentioned that patch http://review.gluster.org/#/c/9469/ breaks
> the build on netbsd. Where is it failing?
> I tried to check the link for netbsd tests for the patch
> (http://build.gluster.org/job/netbsd6-smoke/2431/). But I got the below
> error.

Here:
http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/708/console

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Support to distinguish locks set via clients using SYNCOP framework
Hi,

The server locks xlator distinguishes the various locks set by clients on a file using two parameters:

* client UUID
* frame->root->lk_owner

Hence, if the same client sets multiple locks on a file and the server is to treat them as being from different owners, the client needs to pass a different lk_owner (as 'frame->root->lk_owner') for each of those locks.

At present, FUSE and gluster-nfs set this field while creating the frame, whereas this support is missing for clients using the SYNCOP framework. I have made changes to fix this. In addition, I have added an API in gfapi to pass the lk_owner to syncopctx.

http://review.gluster.org/#/c/9482/

Kindly review the changes.

Thanks,
Soumya
___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
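[Editor's illustration of how a gfapi consumer (e.g. an NFS server built on gfapi) would use this. The setter name below is made up and stands in for whatever API the patch actually adds; only glfs_posix_lock() is an existing gfapi call. The point is that the consumer stamps its own lock-owner into the syncop context before issuing locking fops, so the server's locks xlator sees distinct (client UUID, lk_owner) pairs even though every request comes from a single gfapi instance.]

    #include <stdint.h>
    #include <string.h>
    #include <fcntl.h>
    #include <glusterfs/api/glfs.h>   /* gfapi header; path per your installation */

    static int
    lock_on_behalf_of (struct glfs *fs, struct glfs_fd *fd,
                       struct flock *lock, uint64_t owner_id)
    {
            char owner[8];

            /* Derive a per-owner token; an NFS server would use the lock
               owner it received from its own client. */
            memcpy (owner, &owner_id, sizeof (owner_id));

            /* Hypothetical setter (name is illustrative, not the one from
               the patch): record the owner in the syncop context so every
               fop issued from this thread carries it in
               frame->root->lk_owner. */
            glfs_set_lk_owner (fs, owner, sizeof (owner));

            return glfs_posix_lock (fd, F_SETLK, lock);
    }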
[Gluster-devel] netbsd build failure
Hi Emmanuel,

You have mentioned that patch http://review.gluster.org/#/c/9469/ breaks the build on netbsd. Where is it failing?

I tried to check the link for the netbsd tests for the patch (http://build.gluster.org/job/netbsd6-smoke/2431/). But I got the below error.

Status Code: 404
Exception:
Stacktrace:
(none)

Regards,
Raghavendra Bhat
___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel