Re: [Gluster-devel] Feature: FOP Statistics JSON Dumps
Richard, what's great about your patch (besides the lockless counters) is:

- JSON is easier to parse (particularly in Python). Compare that to parsing "gluster volume profile" output, which is much more difficult. This will enable tools to display profiling data in a user-friendly way. It would be nice if you attached a sample output to bz 1261700.

- Client-side capture: the io-stats translator is at the top of the translator stack, so we would see latencies just as the application sees them. "gluster volume profile" provides server-side latencies, but these can be deceptive and fail to report "user experience" latencies.

I'm not that clear on the UI for it. It would be nice if the "gluster volume profile" command could be set up to automatically poll this data at a fixed rate, like many other perf utilities do (example: iostat), so that a user could capture a Gluster profile over time with a single command; at present the support team has to give them a script to do it. This would make it trivial for a user to share what their application is doing from a Gluster perspective, as well as how Gluster is performing from the client's perspective. The /usr/sbin/gluster utility can run on the client now, since it is in the gluster-cli RPM, right?

So in other words it would be great to replace this:

  gluster volume profile $volume_name start
  gluster volume profile $volume_name info > /tmp/past
  for min in `seq 1 $sample_count` ; do
    sleep $sample_interval
    gluster volume profile $volume_name info
  done > gvp.log
  gluster volume profile $volume_name stop

with this:

  gluster volume profile $volume_name $sample_interval $sample_count > gvp.log

and be able to run this command on the client to use your patch there.
thx -ben

----- Original Message -----
> From: "Richard Wareing"
> To: gluster-devel@gluster.org
> Sent: Wednesday, September 9, 2015 10:24:54 PM
> Subject: [Gluster-devel] Feature: FOP Statistics JSON Dumps
>
> Hey all,
>
> I just uploaded a clean patch for our FOP statistics dump feature @
> https://bugzilla.redhat.com/show_bug.cgi?id=1261700 .
>
> It patches cleanly to the v3.6.x/v3.7.x release branches, and also includes
> io-stats support for Intel-arch atomic operations (ifdef'd for portability)
> such that you can collect data 24x7 with a negligible latency hit in the IO
> path. We've been using this for quite some time, and there appeared to be
> some interest at the dev summit in having this in mainline; so here it is.
>
> Take a look, and I hope you find it useful.
>
> Richard

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
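To make the "easier to parse in Python" point concrete, here is a minimal sketch of consuming such a dump. The JSON schema below (the "fop_stats" key and the "count"/"avg_latency_us" field names) is invented for illustration; the real field names would come from the actual output of the patch in bz 1261700.

```python
import json

# Hypothetical per-FOP statistics dump; field names are invented for
# illustration, not taken from the actual patch output.
sample_dump = """
{
  "fop_stats": {
    "LOOKUP": {"count": 1200, "avg_latency_us": 85.0},
    "WRITE":  {"count": 300,  "avg_latency_us": 410.5},
    "READ":   {"count": 450,  "avg_latency_us": 120.2}
  }
}
"""

def fops_by_latency(dump_text):
    """Return (fop_name, avg_latency_us) pairs, slowest FOP first."""
    stats = json.loads(dump_text)["fop_stats"]
    return sorted(((fop, s["avg_latency_us"]) for fop, s in stats.items()),
                  key=lambda pair: pair[1], reverse=True)

ranked = fops_by_latency(sample_dump)
for fop, lat in ranked:
    print("%-8s %8.1f us" % (fop, lat))
```

A tool polling the dump at a fixed interval could diff successive "count" values to get per-FOP rates, which is exactly what is painful to do with the text output of "gluster volume profile".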
Re: [Gluster-devel] High CPU Usage - Glusterfsd
Renchu, I didn't see anything about average file size and read/write mix. One way to observe both of these, as well as latency and throughput, is to run these commands on the server:

  # gluster volume profile your-volume start
  # gluster volume profile your-volume info > /tmp/dontcare
  # sleep 60
  # gluster volume profile your-volume info > profile-for-last-minute.log

There is also a "gluster volume top" command that may be of use to you in understanding what your users are doing with Gluster.

Also, you may want to run "top -H" and see whether any threads in either glusterfsd or smbd are at or near 100% CPU - if so, you really are hitting a CPU bottleneck. Looking at process CPU utilization can be deceptive, since a process may include multiple threads. "sar -n DEV 2" will show you network utilization, and "iostat -mdx /dev/sd? 2" on your server will show block device queue depth (the latter two tools require the sysstat RPM). Together these can help you understand what kind of bottleneck you are seeing.

I don't see how many bricks are in your Gluster volume, but it sounds like you have only one glusterfsd per server. If you have idle cores on your servers, you can harness more CPU power by using multiple bricks per server, which results in multiple glusterfsd processes on each server, allowing greater parallelism. For example, you can do this by presenting individual disk drives as bricks rather than RAID volumes.

Let us know if these suggestions helped.

-ben england

----- Original Message -----
From: Renchu Mathew ren...@cracknell.com
To: gluster-us...@gluster.org
Cc: gluster-devel@gluster.org
Sent: Sunday, February 22, 2015 7:09:09 AM
Subject: [Gluster-devel] High CPU Usage - Glusterfsd

Dear all,

I have implemented glusterfs storage at my company - 2 servers with replication. But glusterfsd shows more than 100% CPU utilization most of the time, so it is very slow to access the gluster volume. My setup is two glusterfs servers with replication.
The gluster volume (almost 10TB of data) is mounted on another server (glusterfs native client) and shared via Samba for the network users to access those files. Is there any way to reduce the processor usage on these servers? Please give a solution ASAP since the users are complaining about the poor performance. I am using glusterfs version 3.6.

Regards

Renchu Mathew | Sr. IT Administrator
CRACKNELL DUBAI | P.O. Box 66231 | United Arab Emirates | T +971 4 3445417 | F +971 4 3493675 | M +971 50 7386484
ABU DHABI | DUBAI | LONDON | MUSCAT | DOHA | JEDDAH
EMAIL ren...@cracknell.com | WEB www.cracknell.com

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
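On the "top -H" point above: per-thread CPU time is what matters, and on Linux the same data can also be read directly from /proc. A rough sketch (Linux-only; the helper name is ours, not a Gluster or sysstat tool):

```python
import os

def thread_cpu_ticks(pid):
    """Return {tid: cpu_ticks} for each thread of a Linux process, summing
    user + system jiffies from /proc/<pid>/task/<tid>/stat.  Sampling this
    twice and diffing shows which thread is pegged, like top -H does."""
    ticks = {}
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        with open("%s/%s/stat" % (task_dir, tid)) as f:
            # comm may contain spaces, so split only after the closing ')'
            fields = f.read().rsplit(")", 1)[1].split()
        # utime and stime are fields 14 and 15 of the full stat line;
        # after stripping pid and comm they are at indexes 11 and 12
        ticks[int(tid)] = int(fields[11]) + int(fields[12])
    return ticks

counts = thread_cpu_ticks(os.getpid())
```

For a busy glusterfsd, you would pass its pid instead of our own; a single thread accumulating almost all the ticks is the single-epoll-thread bottleneck described elsewhere in this list.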
Re: [Gluster-devel] Multi-network support proposal
Hope this message makes as much sense to me on Tuesday as it did at 3 AM in the airport ;-) Inline...

----- Original Message -----
From: Jeff Darcy jda...@redhat.com
To: Ben England bengl...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org, Manoj Pillai mpil...@redhat.com
Sent: Sunday, February 15, 2015 1:49:17 AM
Subject: Re: [Gluster-devel] Multi-network support proposal

It's really important for glusterfs not to require that clients mount volumes using the same subnet that is used by the servers, and clearly your very general-purpose proposal could address that. For example, at a site where non-glusterfs protocols are used, there are already good reasons for using multiple subnets, and we want glusterfs to be able to coexist with non-glusterfs protocols at such a site.

However, is there a simpler way to allow glusterfs clients to connect to servers through more than one subnet? For example, suppose your Gluster volume subnet is 172.17.50.0/24 and your public network used by glusterfs clients is 1.2.3.0/22, but one of the servers also has an interface on subnet 4.5.6.0/24. So at the time that the volume is either created or bricks are added/removed:

- determine what servers are actually in the volume
- ask each server to return the subnet for each of its active network interfaces
- determine the set of subnets that are directly accessible to ALL the volume's servers
- write a glusterfs volfile for each of these subnets and save it

This process is O(N) where N is the number of servers, but it only happens at volume creation or addition/removal of bricks, and these events do not happen very often (do they?). In this example, 1.2.3.0/22 and 172.17.50.0/24 would have glusterfs volfiles, but 4.5.6.0/24 would not.

So now when a client connects, the server knows which subnet the request came through (getsockname()), so it can just return the volfile for that subnet. If there is no volfile for that subnet, the client mount request is rejected. But what about existing Gluster volumes?
When software is upgraded, we should provide a mechanism for triggering this volfile-generation process to open up additional subnets for glusterfs clients. This proposal requires additional work where volfiles are generated and where glusterfs mount processing is done, but does not require any additional configuration commands or extra user knowledge of Gluster. glusterfs clients can then use *any* subnet that is accessible to all the servers.

That does have the advantage of not requiring any special configuration, and might work well enough for front-end traffic, but it has the drawback of not giving any control over back-end traffic. How do *servers* choose which interfaces to use for NSR normal traffic, reconciliation/self-heal, DHT rebalance, and so on? Which network should Ganesha/Samba servers use to communicate with bricks? Even on the front end, what happens when we do get around to adding per-subnet access control or options? For those kinds of use cases we need networks to be explicit parts of our model, not implicit or inferred. So maybe we need to reconcile the two approaches, and hope that the combined result isn't too complicated. I'm open to suggestions.

In defense of your proposal, you are right that it is difficult to manage each node's network configuration independently or by volfile, and it would be useful for a system manager to be able to configure Gluster network behavior across the entire volume. For example, you can use pdsh to issue commands to any subset of Gluster servers, but what if some of them are down at the time the command is issued? How do you make these configuration changes persistent? What happens when you add or remove servers from the volume? That to me is the real selling point of your proposal - if we have a 60-node or even a 1000-node Gluster volume, we could provide a way to control network behavior in a persistent, highly-available, scalable way with as few sysadmin operations as possible.
I have two concerns:

1) Do we have to specify each host's address rewriting as in your example - why not something like this?

  # gluster network add client-net 1.2.3.0/24

glusterd could then use a discovery process, as I described earlier, to determine each server's IP address on that subnet and rewrite volfiles accordingly. The advantage of this subnet-based specification IMHO is that it scales - as you add and remove nodes, you do not have to change the client-net entity; you just make sure that the Gluster servers provide the appropriate network interface with the appropriate IP address and subnet mask.

2) Could we keep the number of roles, and the sysadmin interface in general, from getting too complicated? Here's an oversimplified model of Gluster networking - there are at most 2 kinds of subnets on each server in use by Gluster or apps:

- replication
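The subnet-discovery step sketched above is easy to prototype with Python's ipaddress module. The per-server interface lists here are invented for illustration; the subnets are the ones used as examples in this thread:

```python
import ipaddress

# Hypothetical per-server interface subnets, as each glusterd might report
# them during the discovery step (invented data, matching the example above).
server_subnets = {
    "server1": ["172.17.50.0/24", "1.2.3.0/22"],
    "server2": ["172.17.50.0/24", "1.2.3.0/22", "4.5.6.0/24"],
    "server3": ["172.17.50.0/24", "1.2.3.0/22"],
}

def common_subnets(per_server):
    """Subnets reachable from ALL servers: only these get a volfile."""
    sets = [set(subs) for subs in per_server.values()]
    return set.intersection(*sets)

def volfile_for_client(client_ip, subnets):
    """Pick the volfile whose subnet contains the client's source address,
    or None (mount rejected) if the client is on no common subnet."""
    addr = ipaddress.ip_address(client_ip)
    for net in subnets:
        # strict=False: tolerate a subnet given as e.g. 1.2.3.0/22,
        # whose host bits are nonzero for that prefix length
        if addr in ipaddress.ip_network(net, strict=False):
            return net
    return None

nets = common_subnets(server_subnets)
```

In this model 4.5.6.0/24 drops out of the volfile set because only one server can reach it, which matches the behavior proposed above.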
Re: [Gluster-devel] RDMA: Patch to make use of pre registered memory
Avati, I'm all for your zero-copy RDMA API proposal, but I have a concern about your proposed zero-copy fop below...

----- Original Message -----
From: Anand Avati av...@gluster.org
To: Mohammed Rafi K C rkavu...@redhat.com, Gluster Devel gluster-devel@gluster.org
Cc: Raghavendra Gowdappa rgowd...@redhat.com, Ben Turner btur...@redhat.com, Ben England bengl...@redhat.com, Suman Debnath sdebn...@redhat.com
Sent: Saturday, January 24, 2015 1:15:52 AM
Subject: Re: RDMA: Patch to make use of pre registered memory

Couple of comments - ...

4. Next step for zero-copy would be introduction of a new fop readto() where the destination pointer is passed from the caller (gfapi being the primary use case). In this situation RDMA ought to register that memory if necessary and request the server to RDMA_WRITE into the pointer provided by the gfapi caller.

The readto() API emulates the Linux/Unix read() system call, where the caller passes in the address of the read buffer. That API was created half a century ago in a non-distributed world. IMHO the caller should not specify where the read data should arrive; instead, it should let the read API specify where the data arrived. There should be a pre-registered pool of buffers, which both the sender and receiver *already* know about, that can be used for RDMA reads, and one of these buffers would be passed to the caller as part of the read event or completion. This seems related to the performance results that Rafi KC posted earlier this month.

Why does it matter? With RDMA, the read transfer cannot begin until the OTHER END of the RDMA connection knows where the data will land, and it cannot know this soon enough if we wait until the read API call to specify what address to target. An API where the caller specifies the buffer address *blocks* the sender, introduces latency (transmitting the RDMA-able address to the sender) and prevents pipelined, overlapping activity by sender and receiver.
So a read FOP for RDMA should look more like read_completion_event(buffer ** read_data_delivered). It is possible to change libgfapi to support this, since it does not have to conform rigidly to POSIX. Could this work in the Gluster translator API? The RPC interface?

So then how would the remote sender find out when it was ok to re-use this buffer to service another RDMA read request? Is there an interface, something like read_buffer_consumed(buffer * available_buf), on the read API side that indicates to RDMA that the caller has consumed the buffer and it is ready for re-use, without the added expense of unregistering and re-registering? If so, then you have a pipeline of buffers, each in one of 4 states:

- in transmission by sender to reader
- being consumed by reader
- being returned to sender for re-use
- available to sender - go back to state 1

By increasing the number of buffers sufficiently, we can avoid a situation where round-trip latency prevents you from filling the gigantic 40-Gbps (56-Gbps for FDR IB) RDMA pipeline.

I'm also interested in how writes work - how do we avoid copies on the write path and also avoid having to re-register buffers with each write? BTW, none of these concerns, or the concerns discussed by Rafi KC, are addressed in the Gluster 3.6 RDMA feature page.

-ben (e)

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
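The four-state buffer cycle described above can be sketched as a toy state machine. All names here are invented for illustration; this is a model of the idea, not the Gluster RDMA code:

```python
from collections import deque

# The four buffer states from the pipeline described above.
IN_FLIGHT, CONSUMING, RETURNING, AVAILABLE = range(4)

class BufferPipeline:
    """Toy model of a pre-registered RDMA buffer pool shared by a sender
    and a reader.  The sender can only transmit into AVAILABLE buffers,
    so a slow reader eventually stalls the pipeline."""

    def __init__(self, nbufs):
        # every pre-registered buffer starts out available to the sender
        self.available = deque(range(nbufs))
        self.state = {b: AVAILABLE for b in range(nbufs)}

    def sender_transmit(self):
        """Sender picks an available buffer and RDMA-writes into it;
        returns None when the pipeline is stalled (no free buffers)."""
        if not self.available:
            return None
        buf = self.available.popleft()
        self.state[buf] = IN_FLIGHT
        return buf

    def reader_receive(self, buf):
        """Read completion delivered to the caller (state 1 -> 2)."""
        self.state[buf] = CONSUMING

    def read_buffer_consumed(self, buf):
        """Caller signals the buffer may be reused (states 3 -> 4),
        without any unregister/re-register cycle."""
        self.state[buf] = RETURNING
        self.available.append(buf)
        self.state[buf] = AVAILABLE

pipe = BufferPipeline(nbufs=4)
```

With enough buffers in flight the sender never hits the stalled case, which is the point about sizing the pool to cover the round-trip latency of the return path.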
Re: [Gluster-devel] Reg. multi thread epoll NetBSD failures
Gluster-ians, would it be ok to temporarily disable multi-thread-epoll on NetBSD, unless there is some huge demand for it? NetBSD may be useful for exposing race conditions, but it's not clear to me that all of these race conditions would happen in a non-NetBSD environment, so are we chasing problems that non-NetBSD users can never see? What do people think? If so, why bust our heads figuring them out for NetBSD right now?

Attached is a tiny, crude and possibly out-of-date patch for making multi-thread-epoll tunable. If we make the number of epoll threads settable, we could add conditional compilation to make GLUSTERFS_EPOLL_MAXTHREADS 1 for NetBSD without much trouble, while still allowing people to experiment with it on NetBSD.

From a performance perspective, let's review why we should go to the trouble of using the multi-thread-epoll patch. The original goal was to allow far greater CPU utilization by Gluster than we typically were seeing. To do this, we want multiple Gluster RPC sockets to be read and processed in parallel by a single process. This is important for clients (glusterfs, libgfapi) that have to talk to many bricks (example: JBOD, erasure coding), and for brick processes (glusterfsd) that have to talk to many clients. It is also important for SSD support (cache tiering), because we need the glusterfsd process to keep up with SSD hardware and caches, which can have orders of magnitude more IOPS available than a single disk drive or even a RAID LUN, and the glusterfsd epoll thread is currently the bottleneck in such configurations. This multi-thread-epoll enhancement is similar to a multi-queue Ethernet driver that spreads load across CPU cores. RDMA 40-Gbps networking may also encounter this bottleneck. We don't want a small fraction of the CPU cores (often just 1) to be a bottleneck - we want either the network or the storage hardware to be the bottleneck instead.
Finally, is it possible with multi-thread-epoll that we do not need the io-threads translator (Anand Avati's suggestion), which offloads incoming requests to worker threads? In this case, the epoll threads ARE the server-side thread pool. If so, this could reduce context switching and latency further. I for one look forward to finding out, but I do not want to invest in more performance testing than we have already done unless it is going to make it upstream.

thanks for your help,

-Ben England, Red Hat Perf. Engr.

----- Original Message -----
From: Shyam srang...@redhat.com
To: Emmanuel Dreyfus m...@netbsd.org
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Friday, January 23, 2015 2:48:14 PM
Subject: [Gluster-devel] Reg. multi thread epoll NetBSD failures

Patch: http://review.gluster.org/#/c/3842/

Manu, I was not able to find the NetBSD job mentioned in the last review comment provided by you; pointers to that would help.

Additionally, what is the support status of epoll on NetBSD? I thought NetBSD favored the kqueue means of event processing over epoll, and that epoll was not supported on NetBSD (or *BSD). I ask this because this patch specifically changes the number of epoll threads; as a result, it possibly has a different effect on NetBSD, which should be on either poll or kqueue (to my understanding). Could you shed some light on this and on the current status of epoll on NetBSD?
Thanks,
Shyam

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

--- event-epoll.c.nontunable	2014-09-05 16:27:10.261223176 -0400
+++ event-epoll.c	2014-09-05 16:33:19.818407183 -0400
@@ -612,23 +612,35 @@
 }

-#define GLUSTERFS_EPOLL_MAXTHREADS 2
+#define GLUSTERFS_EPOLL_MAX_THREADS 8
+#define GLUSTERFS_EPOLL_DEFAULT_THREADS 4
+int glusterfs_epoll_threads = -1;

 static int
 event_dispatch_epoll (struct event_pool *event_pool)
 {
         int       i = 0;
-        pthread_t pollers[GLUSTERFS_EPOLL_MAXTHREADS];
+        pthread_t pollers[GLUSTERFS_EPOLL_MAX_THREADS];
         int       ret = -1;
+        char *epoll_thrd_str = getenv("GLUSTERFS_EPOLL_THREADS");

-        for (i = 0; i < GLUSTERFS_EPOLL_MAXTHREADS; i++) {
+        glusterfs_epoll_threads =
+                epoll_thrd_str ? atoi(epoll_thrd_str) : GLUSTERFS_EPOLL_DEFAULT_THREADS;
+
+        if (glusterfs_epoll_threads > GLUSTERFS_EPOLL_MAX_THREADS) {
+                gf_log ("epoll", GF_LOG_ERROR,
+                        "user requested %d threads but limit is %d",
+                        glusterfs_epoll_threads, GLUSTERFS_EPOLL_MAX_THREADS);
+                return EINVAL;
+        }
+
+        for (i = 0; i < glusterfs_epoll_threads; i++) {
                 ret = pthread_create (&pollers[i], NULL,
                                       event_dispatch_epoll_worker,
                                       event_pool);
         }

-        for (i = 0; i < GLUSTERFS_EPOLL_MAXTHREADS; i++)
+        for (i = 0; i < glusterfs_epoll_threads; i++)
                 pthread_join (pollers[i], NULL);

         return ret;
 }
Re: [Gluster-devel] RDMA: Patch to make use of pre registered memory
Rafi, great results, thanks. Your "io-cache off" columns are read tests with the io-cache translator disabled, correct? Two things jump out at me from your numbers:

- The io-cache translator destroys RDMA read performance.
- Approach 2i (register iobuf pool) is the best approach:
  -- on reads with io-cache off, 32% better than baseline and 21% better than approach 1 (separate buffer)
  -- on writes, 22% better than baseline and 14% better than approach 1

Can someone explain to me why the typical Gluster site wants to use the io-cache translator, given that FUSE now caches file data? Should we just have it turned off by default at this point? This would buy us time to change the io-cache implementation to be compatible with RDMA (see option 2ii below).

remaining comments inline -ben

----- Original Message -----
From: Mohammed Rafi K C rkavu...@redhat.com
To: gluster-devel@gluster.org
Cc: Raghavendra Gowdappa rgowd...@redhat.com, Anand Avati av...@gluster.org, Ben Turner btur...@redhat.com, Ben England bengl...@redhat.com, Suman Debnath sdebn...@redhat.com
Sent: Friday, January 23, 2015 7:43:45 AM
Subject: RDMA: Patch to make use of pre registered memory

Hi All,

As I pointed out earlier, for the rdma protocol we need to register the memory used during rdma reads and writes with the rdma device; in fact this is a costly operation. To avoid registering memory in the I/O path, we came up with two solutions:

1) Use a separate pre-registered iobuf_pool for rdma. This approach needs an extra level of copying in rdma for each read/write request; i.e., we need to copy the content of the memory given by the application into the rdma buffers in the rdma code.

copying data defeats the whole point of RDMA, which is to *avoid* copying data.

2) Register the default iobuf_pool in glusterfs_ctx with the rdma device during rdma initialization. Since we are registering buffers from the default pool for read/write, we require neither registration nor copying.

This makes far more sense to me.
But the problem comes when the io-cache translator is turned on; then, for each page fault, io-cache will take a ref on the iobuf of the response buffer in order to cache it. Due to this, all the pre-allocated buffers get locked up in io-cache very soon. Eventually all new requests get iobufs from new iobuf_pools which are not registered with rdma, and we have to do registration for every iobuf. To address this issue, we can:

i) Turn off io-cache (we chose this for testing)

ii) Use a separate buffer for io-cache, and offload from the default pool to the io-cache buffer (new thread to offload)

I think this makes sense, because if you get an io-cache translator cache hit, then you don't need to go out to the network, so io-cache memory doesn't have to be registered with RDMA.

iii) Dynamically register each newly created arena with rdma; for this we need to bring libglusterfs code and transport-layer code together (will need changes in packaging and may bring hard dependencies on rdma libs)

iv) Increase the default pool size (will increase the footprint of the glusterfs process)

registration with RDMA only makes sense to me when data is going to be sent/received over the RDMA network. Is it hard to tell in advance which buffers will need to be transmitted?

We implemented two approaches, (1) and (2i), to get some performance numbers. The setup was a 4x2 distributed-replicated volume using ram disks as bricks to avoid a hard disk bottleneck. The numbers are attached with the mail. Please provide your thoughts on these approaches.
Regards
Rafi KC

        Separate buffer for rdma (1)    No change (baseline)            Register default iobuf pool (2i)
        write   read    io-cache off    write   read    io-cache off    write   read    io-cache off
1       373     527     656             343     483     532             446     512     696
2       380     528     668             347     485     540             426     525     715
3       376     527     594             346     482     540             422     526     720
4       381     533     597             348     484     540             413     526     710
5       372     527     479             347     482     538             422     519     719
Note: (varying result)
Avg     376.4   528.4   598.8           346.2   483.2   538             425.8   521.6   712

commands:

read:  echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000
write: echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync

vol info:
Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f
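As a sanity check on the table, the averages and the improvement percentages quoted in the reply can be recomputed from the five runs (the units are as in Rafi's mail; that they are MB/s from dd is our assumption):

```python
# The five runs per configuration, copied from Rafi's table.
results = {
    "separate_buffer":  {"write": [373, 380, 376, 381, 372],
                         "read":  [527, 528, 527, 533, 527],
                         "io_cache_off": [656, 668, 594, 597, 479]},
    "baseline":         {"write": [343, 347, 346, 348, 347],
                         "read":  [483, 485, 482, 484, 482],
                         "io_cache_off": [532, 540, 540, 540, 538]},
    "register_default": {"write": [446, 426, 422, 413, 422],
                         "read":  [512, 525, 526, 526, 519],
                         "io_cache_off": [696, 715, 720, 710, 719]},
}

def avg(xs):
    return sum(xs) / len(xs)

def pct_gain(new, old):
    """Percent improvement of avg(new) over avg(old)."""
    return 100.0 * (avg(new) - avg(old)) / avg(old)

# read (io-cache off) gain of approach 2i over the baseline
read_gain = pct_gain(results["register_default"]["io_cache_off"],
                     results["baseline"]["io_cache_off"])
write_gain = pct_gain(results["register_default"]["write"],
                      results["baseline"]["write"])
```

This reproduces the ~32% read gain quoted above; the write gain comes out near 23%, in the same ballpark as the 22% figure in the reply.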
Re: [Gluster-devel] Order of server-side xlators
Since we're on the subject of minimizing STAT calls: there was talk in the small-file perf meeting about moving the md-cache translator into the server, just before the POSIX translator, so that all the stat and getxattr calls, etc. could be intercepted there. This would be consistent with not needing to put access-control down near the POSIX translator. We had also discussed introducing negative caching to the md-cache translator (we could call llistxattr() first) so that we would not constantly ask the brick filesystem for non-existent xattrs.

----- Original Message -----
From: Xavier Hernandez xhernan...@datalab.es
To: Anand Avati av...@gluster.org
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Tuesday, January 13, 2015 4:18:15 AM
Subject: Re: [Gluster-devel] Order of server-side xlators

On 01/13/2015 05:45 AM, Anand Avati wrote:

Valid questions. access-control had to be as close to posix as possible in its first implementation (to minimize the cost of the STAT calls originated by it), but since the introduction of posix-acl there are no extra STAT calls, and given the later introduction of quota, it certainly makes sense to have access-control/posix-acl closer to protocol/server. Some general constraints to consider while deciding the order:

- keep io-stats as close to protocol/server as possible
- keep io-threads as close to storage/posix as possible
- any xlator which performs direct filesystem operations (with system calls, not STACK_WIND) is better placed between io-threads and posix, to keep the epoll thread nonblocking (e.g. changelog)

Based on these constraints and the requirements of each xlator, what do you think about this order?

posix
changelog (needs FS access)
index (needs FS access)
marker (needs FS access)
io-threads
barrier (just above io-threads, as per documentation (*))
quota
access-control
locks
io-stats
server

(*) I'm not sure of the requirements/dependencies of the barrier xlator.

Do you think this order makes sense and would be better?
Xavi

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
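The negative-caching idea mentioned above (call llistxattr() once, then answer "no such xattr" locally) can be illustrated with a toy in-memory model. This is a simulation with invented names that counts brick calls; it is not the md-cache code:

```python
class FakeBrick:
    """Stand-in for the brick filesystem; counts xattr 'syscalls'."""
    def __init__(self, xattrs):
        self.xattrs = xattrs          # {path: {xattr_name: value}}
        self.calls = 0

    def llistxattr(self, path):
        self.calls += 1
        return list(self.xattrs.get(path, {}))

    def getxattr(self, path, name):
        self.calls += 1
        return self.xattrs[path][name]

class NegativeXattrCache:
    """On first access to a path, list its xattr names once; afterwards,
    lookups of names known to be absent never touch the brick."""
    def __init__(self, brick):
        self.brick = brick
        self.known = {}               # {path: set of existing xattr names}

    def getxattr(self, path, name):
        if path not in self.known:
            self.known[path] = set(self.brick.llistxattr(path))
        if name not in self.known[path]:
            return None               # negative-cache hit: no brick call
        return self.brick.getxattr(path, name)

brick = FakeBrick({"/f": {"user.a": b"1"}})
cache = NegativeXattrCache(brick)
```

The win is exactly the case described above: repeated probes for xattrs that do not exist (SELinux labels, ACLs, Swift metadata on files that have none) cost one llistxattr() per path instead of one brick call per probe.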
Re: [Gluster-devel] Suggestion needed to make use of iobuf_pool as rdma buffer.
Rafi, it totally makes sense to me that you need to pre-allocate the I/O buffers that will be used by RDMA, and that you don't want to constantly change (i.e. allocate and deallocate) these buffers. Since a remote RDMA controller can be reading and writing to them, we have to be very careful about deallocating in particular. So an arena of pre-registered RDMA buffers makes perfect sense.

Am I understanding you correctly that the io-cache translator is soaking up all the RDMA-related buffers? How important is the io-cache translator to Gluster performance at this point? Given that FUSE caching is now enabled, it seems to me that the io-cache translator would accomplish very little. Should we have it disabled by default? If so, would that solve your problem? And how do the read-ahead and write-behind translators interact with RDMA buffering?

-ben

----- Original Message -----
From: Mohammed Rafi K C rkavu...@redhat.com
To: gluster-devel@gluster.org
Sent: Tuesday, January 13, 2015 9:29:56 AM
Subject: [Gluster-devel] Suggestion needed to make use of iobuf_pool as rdma buffer.

Hi All,

When using the RDMA protocol, we need to register the buffer which is going to be sent through rdma with the rdma device. In fact, this is a costly operation, and a performance killer if it happens in the I/O path. So our current plan is to register the pre-allocated iobuf_arenas from iobuf_pool with rdma when rdma is getting initialized. The problem comes when all the iobufs are exhausted; then we need to dynamically allocate new arenas from the libglusterfs module. Since they are created in libglusterfs, we can't make a call to rdma from libglusterfs, so we will be forced to register each of the iobufs from the newly created arenas with rdma in the I/O path. If io-cache is turned on in the client stack, then all the pre-registered arenas will be used by io-cache as cache buffers, so we have to do the rdma registration on each I/O call for every iobuf, and eventually we cannot make use of the pre-registered arenas.
To address the issue, we have two approaches in mind:

1) Register each dynamically created buffer in iobuf by bringing the transport layer together with libglusterfs.

2) Create a separate buffer for caching, and offload the data from the read response to the cache buffer in the background.

If we could make use of pre-registered memory for every rdma call, then we would see approximately a 20% improvement for writes and a 25% improvement for reads. Please give your thoughts on how to address the issue.

Thanks
Regards
Rafi KC

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] small-file performance feature sub-pages
I've expanded the specifications of two proposals for improving small-file performance as new feature pages, referenced at the bottom of this list (not in priority order, I hope?). Could we possibly review these proposals at a gluster.org meeting this month?

http://www.gluster.org/community/documentation/index.php/Features#Proposed_Features.2FIdeas

The new feature pages under this page are:

http://www.gluster.org/community/documentation/index.php/Features/stat-xattr-cache - proposed enhancement to the POSIX translator for small-file performance

http://www.gluster.org/community/documentation/index.php/Features/composite-operations - changes to reduce round trips for small-file performance

Specifically, the stat-xattr-cache proposal does not require Gluster 4.0 - it could be implemented today. These pages are referenced by the Features/Planning40 page and also by the Features/Feature_Smallfile_Perf page. Comments and feedback are appreciated. There have been other related proposals from Rudra Siva concerning round-trip reduction in http://supercolony.gluster.org/pipermail/gluster-devel/2014-November/042741.html .

-ben

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] How to resolve gfid (and .glusterfs symlink) for a deleted file
Nux,

Those thousands of entries would all match -links 2, but not -links 1. The only entry in .glusterfs that would match is the one for the file you deleted from the brick. That's how hard links work: when you create a regular file, the link count is 1 (the directory entry references the inode), and when you create an additional hard link to the same file, the link count increases to 2. Try this with the "stat your-file" command and watch how the link count changes. The find command that I gave you tracks down just the one hard link that you want and nothing else.

-ben

----- Original Message -----
From: Nux! n...@li.nux.ro
To: Ben England bengl...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Friday, November 21, 2014 11:03:46 AM
Subject: Re: [Gluster-devel] How to resolve gfid (and .glusterfs symlink) for a deleted file

Hi Ben,

I have thousands of entries under /your/brick/directory/.glusterfs .. find would return too many results. How do I find the one I'm looking for? :-)

--
Sent from the Delta quadrant using Borg technology!
Nux! www.nux.ro

----- Original Message -----
From: Ben England bengl...@redhat.com
To: Nux! n...@li.nux.ro
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Friday, 21 November, 2014 16:00:40
Subject: Re: [Gluster-devel] How to resolve gfid (and .glusterfs symlink) for a deleted file

First of all, the links in .glusterfs are HARD links, not symlinks. So the file is not actually deleted, since the local filesystem keeps a count of references to the inode and won't release the inode until the ref count reaches zero. I tried this; it turns out you can find the orphaned link with:

  # find /your/brick/directory/.glusterfs -links 1 -type f

You use -type f because it's a hard link to a file, and you don't want to look at directories or . or .. . Once you find the link, you can copy the file off somewhere, and then delete the link. At that point, regular self-heal could repair it (i.e.
just do ls on the file from a Gluster mountpoint).

----- Original Message -----
From: Nux! n...@li.nux.ro
To: Gluster Devel gluster-devel@gluster.org
Sent: Friday, November 21, 2014 10:34:09 AM
Subject: [Gluster-devel] How to resolve gfid (and .glusterfs symlink) for a deleted file

Hi,

I deleted a file by mistake in a brick. I never managed to find out its gfid, so now I have a rogue symlink in .glusterfs pointing to it (if I got how it works). Any way I can discover which file this is and get rid of it?

--
Sent from the Delta quadrant using Borg technology!
Nux! www.nux.ro

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
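The link-count behavior behind the "find -links 1" trick can be demonstrated in a few lines, using Python's os.link in a temp directory rather than a real brick:

```python
import os
import tempfile

# Demonstrate the st_nlink behavior described above: a fresh file has link
# count 1; adding a hard link (as .glusterfs does for each gfid) raises it
# to 2; deleting the original directory entry drops the surviving link back
# to 1, which is exactly what "find .glusterfs -links 1 -type f" matches.
tmp = tempfile.mkdtemp()
data_file = os.path.join(tmp, "file")        # stands in for the brick file
gfid_link = os.path.join(tmp, "gfid-link")   # stands in for the .glusterfs entry

with open(data_file, "w") as f:
    f.write("hello")
nlink_fresh = os.stat(data_file).st_nlink    # 1: only the directory entry

os.link(data_file, gfid_link)                # like the .glusterfs hard link
nlink_linked = os.stat(data_file).st_nlink   # 2: both entries reference the inode

os.unlink(data_file)                         # "delete the file from the brick"
nlink_orphan = os.stat(gfid_link).st_nlink   # back to 1: the orphaned gfid link
```

Note that the data is still intact through the surviving link, which is why you can copy the file off before deleting the orphaned .glusterfs entry.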
Re: [Gluster-devel] Feature help
inline...

- Original Message -
From: Rudra Siva rudrasiv...@gmail.com
To: gluster-devel@gluster.org
Sent: Saturday, November 1, 2014 10:20:41 AM
Subject: [Gluster-devel] Feature help

> Hi, I'm very interested in helping with this feature by way of development help, testing and/or benchmarking.

I have a parallel-libgfapi benchmark that could be modified to fit the new API and used to test its performance: https://github.com/bengland2/parallel-libgfapi

> Features/Feature Smallfile Perf
>
> One of the things I was looking into was the possibility of adding a few API calls to libgfapi to allow reading and writing multiple small files as objects - just as librados does for Ceph - cutting out FUSE and other semantics that tend to be overhead for really small files. I don't know what else I will have to add for libgfapi to support this.

libgfapi is a good place to prototype; it's easy to extend libgfapi by adding to the existing calls. But this won't help performance as much as you might want unless the Gluster protocol can somehow change to allow combining several separate FOPs, such as LOOKUP+OPEN+READ+RELEASE or LOOKUP+CREATE+WRITE+RELEASE. That's the hard part IMHO. I suggest using wireshark to watch Gluster small-file creates, and then trying to understand what each FOP is doing and why it is there.

Suggestions for protocol enhancement:

- Can we allow CREATE to piggyback write data if it's under 128 KB (or whatever the RPC size limit is), and optionally do a RELEASE after the WRITE? Or just create a new FOP that does that? Can we also specify xattrs that the application might want to set at create time - for example, SMB security-related xattrs or Swift metadata?

- Can we do something like we did for sequential writes with eager-lock, and allow the Gluster client to hang on to a directory lock for a little while, so that we don't have to continually reacquire the lock if we are going to keep creating files in it? Second, if we already have a write lock on the directory, we shouldn't have to do LOOKUP then CREATE - just do CREATE directly.

- Finally, Swift and other apps use the hack of a rename() call after close() so that they can create a file atomically. If we had an API for creating files atomically, these apps would not be forced into using the expensive rename operation.

Can we do these things in an incremental way, so that we can steadily improve performance over time without massive disruption to the code base?

Perhaps the Glusterfs FUSE mount could learn to do something like that as well, with a special mount option that would allow the actual create at the server to be deferred until any one of these 3 conditions occurred:

- 100 msec had passed, or
- the file was closed, or
- at least N KB of data was written (i.e. an RPC's worth)

This is a bit like Nagle's algorithm in TCP, which allows TCP to aggregate more data into segments before it actually transmits them. It technically violates POSIX and creates some semantic issues (how do you tell the user that the file already exists, for example?), but frankly the POSIX filesystem interface is an anachronism; we need to bend it a little to get what we need - NFS already does. This might not be appropriate for all apps, but there could be quite a few cases, like initial data ingest, where this would be a very reasonable thing to do.

> The following is what I was thinking - please feel free to correct me or guide me if someone has already done some ground work on this.
>
> For read, multiple objects can be provided, and they should be separated for read from the appropriate brick based on the DHT flag - this will help avoid multiple lookups from all servers. In the absence of DHT they would be sent to all bricks, but only the ones that contain the object respond (it's more like a multiple-file lookup request).

I think it is very ambitious to batch creates for multiple files, and this greatly complicates the API. Let's just get to a point where we can create a Gluster file and write its data in the same libgfapi call, and have that work efficiently in the Gluster RPC interface -- this would be a huge win.

> For write, same as the case of read: complete object writes (no partial updates, file offsets, etc.)
>
> For delete, most of the lookup and batching logic remains the same.

Delete is not the highest-priority thing here; creates are the worst performers, so we should probably focus on creates. Someday it would be nice to be able to express to the filesystem the thought "delete this directory tree" or "delete all files within this directory", since Gluster could then make that a parallel operation, hence scalable.

> I can help with testing, documentation or benchmarks if someone has already done some work.
>
> -Siva

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
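The Nagle-like deferred-create policy sketched in the message above (flush on 100 msec elapsed, on close, or on an RPC's worth of buffered data) can be modeled in a few lines. This is purely illustrative pseudologic, not Gluster code: the class name, the callback, and the 128 KB RPC limit are all assumptions, and what happens to writes after the first flush is out of scope here.

```python
import time

class DeferredCreate:
    RPC_BYTES = 128 * 1024      # assumed RPC payload limit
    MAX_DELAY = 0.1             # 100 msec

    def __init__(self, path, send, now=time.monotonic):
        self.path = path
        self.send = send        # callback standing in for one combined CREATE+WRITE RPC
        self.now = now          # injectable clock, for testing
        self.buf = bytearray()
        self.opened_at = now()
        self.flushed = False

    def _flush(self, reason):
        # send the create and all buffered data as one message, exactly once
        if not self.flushed:
            self.flushed = True
            self.send(self.path, bytes(self.buf), reason)

    def write(self, data):
        self.buf += data
        if len(self.buf) >= self.RPC_BYTES:
            self._flush("size")
        elif self.now() - self.opened_at >= self.MAX_DELAY:
            self._flush("timeout")

    def close(self):
        self._flush("close")
```

A small file created and closed quickly goes over the wire as a single message; a large or long-lived file flushes as soon as the size or time condition trips.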
Re: [Gluster-devel] Fwd: [ovirt-devel] Using libgfapi on Gluster Storage Domains
The year-old BZ 1016886 describes three problems that any libgfapi application would have, not just KVM. As for the rpc-auth-allow-insecure=on setting, has any progress been made in this area? I think this setting should be unnecessary, and it's really important in general that Gluster have some way of optionally authenticating clients other than by client port number. Do SSL sockets solve this problem of authenticating in the control plane?

Two of these problems can be fixed by just altering the virt group -- the settings that you get with the command

gluster volume set your-volume group virt

And then just document that when you configure a volume for KVM virtualization, you use the above command, right? This is why we have the volume group feature, right? The /var/lib/glusterd/groups/virt file contains this:

[root@g60ds-1 groups]# more virt
quick-read=off
read-ahead=off
io-cache=off
stat-prefetch=off
eager-lock=enable
remote-dio=enable
quorum-type=auto
server-quorum-type=server

Just add one line to it:

allow-insecure=on

And change stat-prefetch to on, and you'll have eliminated 2 of the 3 things that every KVM user has to do to Gluster. eager-lock is the default now, so that can be removed.

-ben

- Original Message -
From: Vijay Bellur vbel...@redhat.com
To: Gluster Devel gluster-devel@gluster.org
Sent: Monday, October 27, 2014 5:38:24 AM
Subject: [Gluster-devel] Fwd: [ovirt-devel] Using libgfapi on Gluster Storage Domains

FYI - if you are interested in trying out libgfapi support with oVirt.
-Vijay

-- Forwarded message --
From: Federico Simoncelli fsimo...@redhat.com
Date: Fri, Oct 24, 2014 at 12:06 AM
Subject: [ovirt-devel] Using libgfapi on Gluster Storage Domains
To: oVirt Development de...@ovirt.org

Hi everyone, if you want to try and use the libgfapi support included in qemu when accessing volumes on gluster storage domains, you can try to apply this patch:

http://gerrit.ovirt.org/33768

As far as I know Jason Brooks already tried it and reported positive feedback. What has been tested so far is:

- qemu uses libgfapi to access the disks on gluster storage domains
- hotplug of disks on gluster storage domains works as expected (libgfapi)
- hotunplug works as expected (no failure when removing a disk that is using libgfapi)
- live snapshots work as expected
- disks of vms started before this patch are not affected (they won't use libgfapi since there's no way to do a hot swap)

One major flow that is as yet untested is live storage migration. Remember that you may need to do some special configuration on your gluster volumes (most notably the allow-insecure ports option) as described here:

http://www.ovirt.org/Features/GlusterFS_Storage_Domain

Please try and test the patch if you're interested, and report your feedback. Thanks,
--
Federico

___
Devel mailing list
de...@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
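Putting Ben's two edits from the message above together, the virt group file would read roughly as follows. This is a sketch, not an authoritative file: option names are taken from the original listing, allow-insecure is written exactly as Ben suggests adding it, and eager-lock is dropped per his note that it is now the default.

```
quick-read=off
read-ahead=off
io-cache=off
stat-prefetch=on
remote-dio=enable
quorum-type=auto
server-quorum-type=server
allow-insecure=on
```

With that in place, `gluster volume set your-volume group virt` would apply the KVM-friendly settings in one step.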
Re: [Gluster-devel] gluster write million of lines: WRITE = -1 (Transport endpoint is not connected)
Sergio,

I agree, excessive logging is a performance issue and can potentially fill a system disk partition or LVM volume over a long enough period of time, resulting in other errors. See bz 1156624 for another example that I encountered. Does it happen in glusterfs-3.6? Is there a logging option/interface that would rate-limit a particular logging call to N messages/sec per unit of time, and when that limit is exceeded, log a message saying that M more events of that type were seen in the last second?

-Ben England

- Original Message -
From: Sergio Traldi sergio.tra...@pd.infn.it
To: gluster-us...@gluster.org, gluster-devel@gluster.org
Sent: Monday, October 27, 2014 9:51:37 AM
Subject: [Gluster-devel] gluster write million of lines: WRITE = -1 (Transport endpoint is not connected)

Hi all,

One server running Red Hat 6 with this rpm set:

[ ~]# rpm -qa | grep gluster | sort
glusterfs-3.5.2-1.el6.x86_64
glusterfs-api-3.5.2-1.el6.x86_64
glusterfs-cli-3.5.2-1.el6.x86_64
glusterfs-fuse-3.5.2-1.el6.x86_64
glusterfs-geo-replication-3.5.2-1.el6.x86_64
glusterfs-libs-3.5.2-1.el6.x86_64
glusterfs-server-3.5.2-1.el6.x86_64

I have a gluster volume with 1 server and 1 brick:

[ ~]# gluster volume info volume-nova-pp
Volume Name: volume-nova-pp
Type: Distribute
Volume ID: b5ec289b-9a54-4df1-9c21-52ca556aeead
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 192.168.61.100:/brick-nova-pp/mpathc
Options Reconfigured:
storage.owner-gid: 162
storage.owner-uid: 162

There are four clients attached to this volume with the same O.S.
and the same fuse gluster rpm set:

[ ~]# rpm -qa | grep gluster | sort
glusterfs-3.5.0-2.el6.x86_64
glusterfs-api-3.5.0-2.el6.x86_64
glusterfs-fuse-3.5.0-2.el6.x86_64
glusterfs-libs-3.5.0-2.el6.x86_6

Last week (it also happened two weeks ago) I found the disk almost full: the gluster log /var/log/glusterfs/var-lib-nova-instances.log had grown to 68GB. In the log there was the initial problem:

[2014-10-10 07:29:43.730792] W [socket.c:522:__socket_rwv] 0-glusterfs: readv on 192.168.61.100:24007 failed (No data available)
[2014-10-10 07:29:54.022608] E [socket.c:2161:socket_connect_finish] 0-glusterfs: connection to 192.168.61.100:24007 failed (Connection refused)
[2014-10-10 07:30:05.271825] W [client-rpc-fops.c:866:client3_3_writev_cbk] 0-volume-nova-pp-client-0: remote operation failed: Input/output error
[2014-10-10 07:30:08.783145] W [fuse-bridge.c:2201:fuse_writev_cbk] 0-glusterfs-fuse: 3661260: WRITE = -1 (Input/output error)
[2014-10-10 07:30:08.783368] W [fuse-bridge.c:2201:fuse_writev_cbk] 0-glusterfs-fuse: 3661262: WRITE = -1 (Input/output error)
[2014-10-10 07:30:08.806553] W [fuse-bridge.c:2201:fuse_writev_cbk] 0-glusterfs-fuse: 3661649: WRITE = -1 (Input/output error)
[2014-10-10 07:30:08.844415] W [fuse-bridge.c:2201:fuse_writev_cbk] 0-glusterfs-fuse: 3662235: WRITE = -1 (Input/output error)

and a lot of these lines:

[2014-10-15 14:41:15.895105] W [fuse-bridge.c:2201:fuse_writev_cbk] 0-glusterfs-fuse: 951700230: WRITE = -1 (Transport endpoint is not connected)
[2014-10-15 14:41:15.896205] W [fuse-bridge.c:2201:fuse_writev_cbk] 0-glusterfs-fuse: 951700232: WRITE = -1 (Transport endpoint is not connected)

This second log line, each time with a different request number, was written every millisecond, so in about one minute 1 GB was written to the O.S. disk. I searched for a solution but didn't find anyone with the same problem.
I think there was a network problem, but why does gluster write millions of lines like this to the logs?

[2014-10-15 14:41:15.895105] W [fuse-bridge.c:2201:fuse_writev_cbk] 0-glusterfs-fuse: 951700230: WRITE = -1 (Transport endpoint is not connected)

Thanks in advance. Cheers
Sergio

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
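The rate-limiting interface Ben asks about could look like the following sketch: at most N messages per second for a given call site, with one summary line reporting how many messages were suppressed once a new window opens. All names here are hypothetical; glusterfs has no such interface in this form, and the real thing would live in libglusterfs's logging code, not Python.

```python
import time
from collections import defaultdict

class RateLimitedLog:
    def __init__(self, max_per_sec=5, now=time.monotonic):
        self.max_per_sec = max_per_sec
        self.now = now                                  # injectable clock, for testing
        # per call site: [window_start, emitted_in_window, suppressed_in_window]
        self.window = defaultdict(lambda: [0.0, 0, 0])
        self.lines = []                                 # stands in for the log file

    def log(self, site, msg):
        st = self.window[site]
        t = self.now()
        if t - st[0] >= 1.0:
            # new one-second window: summarize anything dropped in the old one
            if st[2]:
                self.lines.append(
                    "%s: %d more events of this type in the last second" % (site, st[2]))
            st[0], st[1], st[2] = t, 0, 0
        if st[1] < self.max_per_sec:
            st[1] += 1
            self.lines.append(msg)
        else:
            st[2] += 1
```

A flood like Sergio's fuse_writev_cbk storm would then cost a handful of lines per second instead of a gigabyte per minute.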
[Gluster-devel] Jeff Darcy's objections to multi-thread-epoll and proposal to use own-thread alternative
This e-mail is specifically about use of the multi-thread-epoll optimization (originally prototyped by Anand Avati) to solve a Gluster performance problem: single-threaded reception of protocol messages (for non-SSL sockets), and the consequent inability to fully utilize available CPU on the server. A discussion of its pros and cons follows, along with the alternative suggested by Jeff Darcy, referred to as own-thread below. Thanks to Shyam Ranganathan for helping me to clarify my thoughts on this. Attached is some performance data about multi-thread-epoll.

To see why this threading discussion matters, consider that storage hardware encountered in the enterprise server world is rapidly speeding up, with new hardware such as 40-Gbps networks and SSDs, but CPUs are not speeding up nearly as much; instead, we get more cores per socket. So adequate performance for Gluster will require using enough threads to match CPU throughput to network and storage.

One way to get the server's idle CPU horsepower engaged is JBOD (just a bunch of disks, no RAID), since there is one glusterfsd process, and hence one epoll thread, per brick (disk). But this causes scalability problems for small-file creates (cluster.lookup-unhashed=on is the default), and it limits the throughput of an individual file to the speed of one disk drive, so until these problems are addressed, the utility of the JBOD approach is limited.

- Original Message -
From: Jeff Darcy jda...@redhat.com
To: Gluster Devel gluster-devel@gluster.org
Sent: Wednesday, October 8, 2014 4:20:34 PM
Subject: [Gluster-devel] jdarcy status (October 2014)

Multi-threading is even more controversial. It has also been in the tree for two years (it was developed to address the problem of SSL code slowing down our entire transport stack). This feature, controlled by the own-thread transport option, uses a thread per connection - not my favorite concurrency model, but kind of necessary to deal with the OpenSSL API.
More recently, a *completely separate* approach to multi-threading - multi-threaded epoll - has been getting some attention. Here's what I see as the pros and cons of this new approach.

* PRO: greater parallelism of requests on a single connection. I think the actual performance benefits vs. own-thread are unproven and likely to be small, but they're real. We should try comparing the performance of multi-thread-epoll to own-thread; it shouldn't be hard to hack own-thread into the non-SSL-socket case.

HOWEVER, if own-thread implies a thread per network connection, then as you scale out a Gluster volume with N bricks, you have O(N) clients, and therefore O(N) threads on each glusterfsd (libgfapi adoption would make it far worse)! Suppose we are implementing a 64-brick configuration with 200 clients - not an unreasonably sized Gluster volume for a scalable filesystem. We then have 200 threads per glusterfsd just listening for RPC messages on each brick. On a 60-drive server there can be a lot more than 1 brick per server, so multiply threads/glusterfsd by the brick count! It doesn't make sense to have far more total threads than CPUs, and modern processors make context switching between threads more and more expensive.

Shyam mentioned a refinement to own-thread where we equally partition the set of TCP connections among a pool of threads (own-thread is then a special case of this). But this cannot supply an individual client with more than one thread to receive RPCs, even when most of the CPU cores on the server are idle. Why impose this constraint (see below)?

To see why this is important, consider a common use case: KVM virtualization. SSDs require orders of magnitude more IOPS from glusterfsd and glusterfs than a traditional rotating disk. So even if you dedicate a thread to a single network connection, this thread may still have trouble keeping up with the high-speed network and the SSD. Multi-thread-epoll is the only proposal so far that offers a way to apply enough CPU to this problem.
Consider that some SSDs have throughput on the order of a million IOPS (I/O operations per second). In the past, we have worked around this problem by placing multiple bricks on a single SSD, but this causes other problems (scalability, free-space measurement).

* CON: with greater concurrency comes greater potential to uncover race conditions in other modules that are used to being single-threaded. We've already seen this somewhat with own-thread, and we'd see it more with multi-epoll. On the Gluster server side, because of the io-threads translator, an RPC listener thread effectively just hands each request off to a worker thread and then goes back to read another RPC. With own-thread, although RPC requests are received in order, there is no guarantee that the requests will be processed in the order they were received from the network. On the client side, we have operations such as readdir that will fan out parallel FOPs. If
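The core mechanism being debated can be shown with a toy sketch (not Gluster code): several worker threads share one epoll instance, and EPOLLONESHOT ensures exactly one thread wakes per ready descriptor; the handler re-arms the descriptor when it is done, so any idle worker can pick up the next message. This assumes Linux (select.epoll) and uses a pipe to stand in for a client connection.

```python
import os
import select
import threading
import time

r, w = os.pipe()
os.set_blocking(r, False)

ep = select.epoll()
ep.register(r, select.EPOLLIN | select.EPOLLONESHOT)

received = []
lock = threading.Lock()
stop = threading.Event()

def worker():
    # many threads poll the same epoll fd; ONESHOT means one wake-up per event
    while not stop.is_set():
        for fd, _events in ep.poll(timeout=0.05):
            data = os.read(fd, 4096)    # stands in for decoding one RPC message
            with lock:
                received.append(data)
            # re-arm the fd so the next event can wake any worker
            ep.modify(fd, select.EPOLLIN | select.EPOLLONESHOT)

workers = [threading.Thread(target=worker) for _ in range(4)]
for t in workers:
    t.start()

os.write(w, b"fop-request")
time.sleep(0.25)                        # let one worker consume it
stop.set()
for t in workers:
    t.join()
print(received)
```

Under own-thread, by contrast, the pipe would belong to exactly one dedicated thread, so a single hot connection could never use more than one core for message reception.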
[Gluster-devel] update-link-count-parent POSIX xlator parameter
What is the update-link-count-parent POSIX xlator parameter for? Is it ever set by anyone, and why? It appears to be off by default. I didn't see it documented, but the log message in posix.c says: "update-link-count-parent is enabled. Thus for each file an extended attribute representing the number of hardlinks for that file within the same parent directory is set." Why would this be necessary?

Background: I'm trying to see where the various xattr calls per file read are coming from.

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
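To make the quoted log message concrete, the quantity the xattr apparently stores - for one file, how many entries in its parent directory point at the same inode - can be computed the slow way as follows. This is a hypothetical helper for illustration only, not glusterfs code.

```python
import os
import tempfile

def links_in_parent(path):
    """Count directory entries in path's parent that share path's inode."""
    parent = os.path.dirname(os.path.abspath(path))
    ino = os.lstat(path).st_ino
    count = 0
    for entry in os.scandir(parent):
        if entry.is_file(follow_symlinks=False) and entry.inode() == ino:
            count += 1
    return count

demo = tempfile.mkdtemp()
a = os.path.join(demo, "a")
open(a, "w").close()
single = links_in_parent(a)     # 1: no other entry shares the inode
os.link(a, os.path.join(demo, "b"))
double = links_in_parent(a)     # 2: "a" and "b" are the same inode
print(single, double)
```

Presumably caching this count in an xattr spares some consumer from doing exactly this kind of directory scan, which is the part I'd like to understand.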