Re: [Gluster-devel] Reg. multi thread epoll NetBSD failures
Since all of the epoll code and its multithreading is under ifdefs, NetBSD should just continue working with single-threaded poll, unaffected by the patch. If NetBSD's kqueue supports single-shot event delivery and edge-triggered notification, we could have an equivalent implementation on NetBSD too. Even if kqueue does not support these features, it might well be worth implementing a single-threaded, level-triggered kqueue-based event handler, and spare NetBSD from suffering with vanilla poll. Thanks

On Fri, Jan 23, 2015, 19:29 Emmanuel Dreyfus m...@netbsd.org wrote: Ben England bengl...@redhat.com wrote: NetBSD may be useful for exposing race conditions, but it's not clear to me that all of these race conditions would happen in a non-NetBSD environment. Many times, NetBSD has exhibited cases where unspecified, Linux-specific behaviors were assumed. I recall my very first finding in GlusterFS: Linux lets you use a mutex without calling pthread_mutex_init() first. That broke on NetBSD, as expected. Fixing this kind of issue is worthwhile beyond NetBSD support, since you cannot take for granted that an unspecified behavior will not change in the future. That said, I am fine if you let NetBSD run without fixing the underlying issue, but you have been warned :-) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
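For reference, the two epoll features mentioned map to EPOLLONESHOT and EPOLLET on Linux (kqueue's rough counterparts are EV_ONESHOT and EV_CLEAR). Below is a minimal standalone sketch of the single-shot semantics the multi-threaded event handler relies on -- not GlusterFS code, just an illustration:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Returns the number of events seen on the second wait: with
 * EPOLLONESHOT the fd is disarmed after the first delivery, so a
 * second epoll_wait() reports nothing until the fd is re-armed. */
int oneshot_demo(void)
{
    int epfd = epoll_create1(0);
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN | EPOLLET | EPOLLONESHOT };
    ev.data.fd = fds[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);

    if (write(fds[1], "x", 1) != 1)   /* make the read end readable */
        return -1;

    struct epoll_event out;
    int first = epoll_wait(epfd, &out, 1, 1000);  /* delivers once   */
    int second = epoll_wait(epfd, &out, 1, 0);    /* fd disarmed now */

    close(fds[0]); close(fds[1]); close(epfd);
    return (first == 1) ? second : -1;
}
```

With multiple threads blocked in epoll_wait() on the same epoll fd, one-shot delivery is what guarantees a socket is handled by exactly one thread at a time; the handler re-arms the fd with EPOLL_CTL_MOD when it is done.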
Re: [Gluster-devel] RDMA: Patch to make use of pre registered memory
A couple of comments -

1. rdma can register init/fini functions (via pointers) into iobuf_pool. There is absolutely no need to introduce an rdma dependency into libglusterfs.

2. It might be a good idea to take a holistic approach towards zero-copy with libgfapi + RDMA, rather than the narrow goal of using pre-registered memory with RDMA. Do keep the options open for RDMA'ing the user's memory pointer (passed to glfs_write()) as well.

3. It is better to make io-cache and write-behind use a new iobuf_pool for caching purposes. There could be an optimization where they just do iobuf/iobref_ref() when safe - e.g. io-cache can cache with iobuf_ref when the transport is socket, or write-behind can unwind by holding onto data with iobuf_ref() when the topmost layer is FUSE or server (i.e. no gfapi).

4. The next step for zero-copy would be the introduction of a new fop readto(), where the destination pointer is passed by the caller (gfapi being the primary use case). In this situation RDMA ought to register that memory if necessary and request the server to RDMA_WRITE into the pointer provided by the gfapi caller.

2. and 4. require changes in the code you would be modifying if you were to just do pre-registered memory, so it is better we plan for the bigger picture upfront. Zero-copy can improve performance (especially read) in the qemu use case. Thanks ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
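Point 1 can be done with plain callbacks: libglusterfs keeps two function pointers in the pool and the rdma transport fills them in at init time, so libglusterfs never links against rdma. A sketch of the pattern follows -- the field and function names here are invented for illustration and are not the actual iobuf API:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pool with registration hooks a transport can fill in. */
struct iobuf_pool {
    void *(*arena_init)(void *mem, size_t size, void *priv); /* e.g. wraps ibv_reg_mr   */
    void  (*arena_fini)(void *handle, void *priv);           /* e.g. wraps ibv_dereg_mr */
    void  *priv;                                             /* transport context       */
};

static int registered_arenas; /* stands in for real registration state */

static void *rdma_arena_init(void *mem, size_t size, void *priv)
{
    (void)size; (void)priv;
    registered_arenas++;
    return mem; /* a real hook would return the memory-region handle */
}

static void rdma_arena_fini(void *handle, void *priv)
{
    (void)handle; (void)priv;
    registered_arenas--;
}

/* Called from libglusterfs whenever a new arena is carved out; it knows
 * nothing about rdma, only that a hook may be present. */
static void *iobuf_pool_new_arena(struct iobuf_pool *pool, size_t size)
{
    void *mem = malloc(size);
    if (mem && pool->arena_init)
        pool->arena_init(mem, size, pool->priv);
    return mem;
}

int hook_demo(void)
{
    struct iobuf_pool pool = { rdma_arena_init, rdma_arena_fini, NULL };
    void *arena = iobuf_pool_new_arena(&pool, 4096);
    int seen = registered_arenas;       /* hook ran inside "libglusterfs" */
    pool.arena_fini(arena, pool.priv);
    free(arena);
    return seen;
}
```

The dependency arrow points only one way: rdma knows about iobuf_pool, never the reverse.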
Re: [Gluster-devel] ctime weirdness
I don't think the problem is with the handling of SETATTR in either NetBSD or Linux. I am guessing NetBSD FUSE is _using_ SETATTR to update atime upon open? Linux FUSE just leaves it to the backend filesystem to update atime. Whenever there is a SETATTR fop, ctime is _always_ bumped. Thanks

On Mon Jan 12 2015 at 5:03:51 PM Emmanuel Dreyfus m...@netbsd.org wrote: Hello Here is a NetBSD behavior that looks pathological (it happens on a FUSE mount but not on a native mount): # touch a # stat -x a File: a Size: 0 FileType: Regular File Mode: (0644/-rw-r--r--) Uid: (0/root) Gid: (0/wheel) Device: 203,7 Inode: 13726586830943880794 Links: 1 Access: Tue Jan 13 01:57:25 2015 Modify: Tue Jan 13 01:57:25 2015 Change: Tue Jan 13 01:57:25 2015 # cat a > /dev/null # stat -x a File: a Size: 0 FileType: Regular File Mode: (0644/-rw-r--r--) Uid: (0/root) Gid: (0/wheel) Device: 203,7 Inode: 13726586830943880794 Links: 1 Access: Tue Jan 13 01:57:31 2015 Modify: Tue Jan 13 01:57:25 2015 Change: Tue Jan 13 01:57:31 2015 The NetBSD FUSE implementation does not send ctime with SETATTR. Looking at the glusterfs FUSE xlator, I see the setattr code does not handle ctime either. How does that happen? What is NetBSD's SETATTR doing wrong? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Suggestion needed to make use of iobuf_pool as rdma buffer.
On Tue Jan 13 2015 at 11:57:53 PM Mohammed Rafi K C rkavu...@redhat.com wrote: On 01/14/2015 12:11 AM, Anand Avati wrote: 3) Why not have a separate iobuf pool for RDMA? Since every fop is using the default iobuf_pool, if we go with another iobuf_pool dedicated to rdma, we need to copy that buffer from the default pool to rdma or so, unless we are intelligently allocating the buffers based on the transport we are going to use. It is an extra level of copying in the I/O path.

Not sure what you mean by that. Every fop does not use the default iobuf_pool. Only readv() and writev() do. If you really want to save on memory registration cost, your first target should be the header buffers (which are used in every fop, and currently valloc()ed and ibv_reg_mr()'d per call). Making headers use an iobuf pool, where every arena is registered at arena creation and deregistered at destruction, will get you the biggest overhead savings. Coming to file data iobufs, today iobuf pools are used in a mixed way, i.e., they hold both data being actively transferred/under IO, and also data which is being held long term (cached by io-cache). io-cache just does an iobuf_ref() and holds on to the data. This avoids memory copies in the io-cache layer. However, that may be something we want to reconsider: io-cache could use its own iobuf pool into which data is copied from the transfer iobuf (which is pre-registered with RDMA in bulk, etc.). Thanks

On Tue Jan 13 2015 at 6:30:09 AM Mohammed Rafi K C rkavu...@redhat.com wrote: Hi All, When using the RDMA protocol, we need to register the buffer which is going to be sent through rdma with the rdma device. In fact, it is a costly operation, and a performance killer if it happens in the I/O path. So our current plan is to register pre-allocated iobuf_arenas from iobuf_pool with rdma when rdma is getting initialized. The problem comes when all the iobufs are exhausted; then we need to dynamically allocate new arenas from the libglusterfs module. Since they are created in libglusterfs, we can't make a call to rdma from libglusterfs. So we will be forced to register each of the iobufs from the newly created arenas with rdma in the I/O path. If io-cache is turned on in the client stack, then all the pre-registered arenas will be used by io-cache as cache buffers, so we have to do the registration in rdma for each I/O call for every iobuf; eventually we cannot make use of the pre-registered arenas. To address the issue, we have two approaches in mind: 1) Register each dynamically created buffer in iobuf by bringing the transport layer together with libglusterfs. 2) Create a separate buffer for caching and offload the data from the read response to the cache buffer in the background. If we could make use of pre-registered memory for every rdma call, then we would have approximately a 20% improvement for write and 25% for read. Please give your thoughts to address the issue. Thanks Regards Rafi KC ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
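The trade-off in approach 2 is one memcpy versus keeping RDMA-registered buffers pinned by the cache. A tiny refcount sketch of why that matters -- hypothetical simplified names, the real iobuf has arenas, locks, and an iobref layer:

```c
#include <stdlib.h>
#include <string.h>

struct iobuf {
    int   ref;       /* buffer returns to its pool only at ref == 0 */
    char *data;
};

static struct iobuf *iobuf_get(size_t size)
{
    struct iobuf *b = calloc(1, sizeof(*b));
    b->ref = 1;
    b->data = malloc(size);
    return b;
}

static int iobuf_unref(struct iobuf *b)  /* returns remaining refs */
{
    if (--b->ref == 0) {
        free(b->data);
        free(b);
        return 0;
    }
    return b->ref;
}

int cache_copy_demo(void)
{
    /* Pretend this came from a pre-registered RDMA arena. */
    struct iobuf *xfer = iobuf_get(4096);
    memset(xfer->data, 7, 4096);

    /* Approach 2: io-cache copies into its own (unregistered) pool... */
    struct iobuf *cached = iobuf_get(4096);
    memcpy(cached->data, xfer->data, 4096);

    /* ...so the transfer buffer is released immediately instead of
     * being held by an iobuf_ref() for the lifetime of the cache entry. */
    int remaining = iobuf_unref(xfer);  /* 0: back in the RDMA pool */
    iobuf_unref(cached);
    return remaining;
}
```

If io-cache instead just took a ref on the transfer iobuf (today's behavior), every cached page would keep one registered buffer out of circulation, which is exactly why the pre-registered arenas get exhausted.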
Re: [Gluster-devel] Suggestion needed to make use of iobuf_pool as rdma buffer.
3) Why not have a separate iobuf pool for RDMA?

On Tue Jan 13 2015 at 6:30:09 AM Mohammed Rafi K C rkavu...@redhat.com wrote: Hi All, When using the RDMA protocol, we need to register the buffer which is going to be sent through rdma with the rdma device. In fact, it is a costly operation, and a performance killer if it happens in the I/O path. So our current plan is to register pre-allocated iobuf_arenas from iobuf_pool with rdma when rdma is getting initialized. The problem comes when all the iobufs are exhausted; then we need to dynamically allocate new arenas from the libglusterfs module. Since they are created in libglusterfs, we can't make a call to rdma from libglusterfs. So we will be forced to register each of the iobufs from the newly created arenas with rdma in the I/O path. If io-cache is turned on in the client stack, then all the pre-registered arenas will be used by io-cache as cache buffers, so we have to do the registration in rdma for each I/O call for every iobuf; eventually we cannot make use of the pre-registered arenas. To address the issue, we have two approaches in mind: 1) Register each dynamically created buffer in iobuf by bringing the transport layer together with libglusterfs. 2) Create a separate buffer for caching and offload the data from the read response to the cache buffer in the background. If we could make use of pre-registered memory for every rdma call, then we would have approximately a 20% improvement for write and 25% for read. Please give your thoughts to address the issue. Thanks Regards Rafi KC ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Order of server-side xlators
Valid questions. access-control had to be as close to posix as possible in its first implementation (to minimize the cost of the STAT calls it originated), but since the introduction of posix-acl there are no extra STAT calls, and given the later introduction of quota, it certainly makes sense to have access-control/posix-acl closer to protocol/server. Some general constraints to consider while deciding the order:

- keep io-stats as close to protocol/server as possible
- keep io-threads as close to storage/posix as possible
- any xlator which performs direct filesystem operations (with system calls, not STACK_WIND) is better placed between io-threads and posix, to keep the epoll thread nonblocking (e.g. changelog)

Thanks

On Mon Jan 12 2015 at 5:02:59 AM Xavier Hernandez xhernan...@datalab.es wrote: Hi, looking at the server-side xlator stack created on a generic volume with quota enabled, I see the following xlators: posix, changelog, access-control, locks, io-threads, barrier, index, marker, quota, io-stats, server. The question is why access-control and quota are in this relative order. It would seem more logical to me for them to be in the reverse order, because if an operation is not permitted, it's irrelevant whether there is enough quota to do it or not: gluster should return EPERM or EACCES instead of EDQUOT. Also, index and marker can operate on requests that can later be denied by access-control, having to undo the work done in that case. Wouldn't it be better to use index and marker after having validated all permissions of the request? I'm not very familiar with these xlators, so maybe I'm missing an important detail. Thanks, Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
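Put together, those constraints suggest a brick stack along these lines. This is an illustrative hand-written volfile fragment (option blocks and several xlators omitted), not glusterd-generated output:

```
volume brick-posix
    type storage/posix
end-volume

volume brick-changelog        # direct filesystem syscalls: keep below io-threads
    type features/changelog
    subvolumes brick-posix
end-volume

volume brick-access-control   # historically here; per this thread it could
    type features/access-control   # move up, closer to protocol/server
    subvolumes brick-changelog
end-volume

volume brick-io-threads       # as close to storage/posix as possible
    type performance/io-threads
    subvolumes brick-access-control
end-volume

volume brick-io-stats         # as close to protocol/server as possible
    type debug/io-stats
    subvolumes brick-io-threads
end-volume

volume brick-server
    type protocol/server
    subvolumes brick-io-stats
end-volume
```

Each xlator names the one below it via `subvolumes`, so "order" in this discussion is literally the chain of these declarations.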
Re: [Gluster-devel] mandatory lock
Note that the mandatory locks available in the locks translator are just the mandatory extensions to posix locks - at least one of the apps must be using locks to begin with. What Harmeet is asking for is something different - automatic exclusive access to edit files. i.e., if one app has opened a file for editing, other apps which attempt an open must either fail (EBUSY) or block till the first app closes. We would need to treat open(O_RDONLY) as a read lock and open(O_RDWR|O_WRONLY) as a write lock request (essentially an auto-applied oplock). This is something gluster does not yet have. Thanks

On Thu Jan 08 2015 at 2:49:29 AM Raghavendra Gowdappa rgowd...@redhat.com wrote: - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Harmeet Kalsi kharm...@hotmail.com Cc: Gluster-devel@gluster.org gluster-devel@gluster.org Sent: Thursday, January 8, 2015 4:12:44 PM Subject: Re: [Gluster-devel] mandatory lock - Original Message - From: Harmeet Kalsi kharm...@hotmail.com To: Gluster-devel@gluster.org gluster-devel@gluster.org Sent: Wednesday, January 7, 2015 5:55:43 PM Subject: [Gluster-devel] mandatory lock Dear All. Would it be possible for someone to guide me in the right direction to enable the mandatory lock on a volume please? At the moment two clients can edit the same file at the same time, which is causing issues. I see code related to mandatory locking in the posix-locks xlator (pl_writev, pl_truncate, etc.). To enable it you have to set option mandatory-locks yes in the posix-locks xlator loaded on bricks (/var/lib/glusterd/vols/volname/*.vol). We have no way to set this option through the gluster cli. Also, I am not sure to what extent this feature is tested/used till now. You can try it out and please let us know whether it worked for you :). If mandatory locking doesn't work for you, can you modify your application to use advisory locking, since advisory locking is well tested and has been used for a long time?
Many thanks in advance Kind Regards ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
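Until such an open-as-lock mode exists, the workable route is the advisory locking suggested above: every writer takes an fcntl() write lock before modifying the file. A minimal sketch of the client-side pattern (the demo path is a throwaway temp file):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Try to take an exclusive (write) advisory lock on the whole file.
 * Returns 0 on success, -1 if another process already holds a
 * conflicting lock.  All cooperating clients must do this - advisory
 * locks only constrain processes that ask. */
int lock_whole_file(int fd)
{
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,            /* 0 = to end of file */
    };
    return fcntl(fd, F_SETLK, &fl);  /* use F_SETLKW to block instead */
}

int lock_demo(void)
{
    char path[] = "/tmp/lockdemoXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    int rc = lock_whole_file(fd);    /* first locker: succeeds */
    close(fd);                       /* close releases the lock */
    unlink(path);
    return rc;
}
```

On a FUSE mount these fcntl locks travel through the locks translator to the bricks, so two clients on different machines contend correctly.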
Re: [Gluster-devel] mandatory lock
Or use an rsync-style .filename.rand tempfile: write the new version of the file, then rename that to filename.

On Thu Jan 08 2015 at 12:21:18 PM Anand Avati av...@gluster.org wrote: Ideally you want the clients to coordinate among themselves. Note that this feature cannot be implemented foolproof (theoretically) in a system that supports NFSv3. On Thu Jan 08 2015 at 8:57:48 AM Harmeet Kalsi kharm...@hotmail.com wrote: Hi Anand, that was spot on. Any idea if there will be development on this side in the near future, as multiple clients writing to the same file can cause issues? Regards -- From: av...@gluster.org Date: Thu, 8 Jan 2015 16:07:50 + Subject: Re: [Gluster-devel] mandatory lock To: rgowd...@redhat.com; kharm...@hotmail.com CC: gluster-devel@gluster.org Note that the mandatory locks available in the locks translator are just the mandatory extensions to posix locks - at least one of the apps must be using locks to begin with. What Harmeet is asking for is something different - automatic exclusive access to edit files. i.e., if one app has opened a file for editing, other apps which attempt an open must either fail (EBUSY) or block till the first app closes. We would need to treat open(O_RDONLY) as a read lock and open(O_RDWR|O_WRONLY) as a write lock request (essentially an auto-applied oplock). This is something gluster does not yet have. Thanks On Thu Jan 08 2015 at 2:49:29 AM Raghavendra Gowdappa rgowd...@redhat.com wrote: - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Harmeet Kalsi kharm...@hotmail.com Cc: Gluster-devel@gluster.org gluster-devel@gluster.org Sent: Thursday, January 8, 2015 4:12:44 PM Subject: Re: [Gluster-devel] mandatory lock - Original Message - From: Harmeet Kalsi kharm...@hotmail.com To: Gluster-devel@gluster.org gluster-devel@gluster.org Sent: Wednesday, January 7, 2015 5:55:43 PM Subject: [Gluster-devel] mandatory lock Dear All. Would it be possible for someone to guide me in the right direction to enable the mandatory lock on a volume please? At the moment two clients can edit the same file at the same time, which is causing issues. I see code related to mandatory locking in the posix-locks xlator (pl_writev, pl_truncate, etc.). To enable it you have to set option mandatory-locks yes in the posix-locks xlator loaded on bricks (/var/lib/glusterd/vols/volname/*.vol). We have no way to set this option through the gluster cli. Also, I am not sure to what extent this feature is tested/used till now. You can try it out and please let us know whether it worked for you :). If mandatory locking doesn't work for you, can you modify your application to use advisory locking, since advisory locking is well tested and has been used for a long time? Many thanks in advance Kind Regards ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Appending time to snap name in USS
It would be convenient if the time were appended to the snap name on the fly (when receiving the list of snap names from glusterd?) so that the timezone application can be dynamic (which is what users would expect). Thanks

On Thu Jan 08 2015 at 3:21:15 AM Poornima Gurusiddaiah pguru...@redhat.com wrote: Hi, Windows has a feature called shadow copy. This is widely used by Windows users to view previous versions of a file. For shadow copy to work with a glusterfs backend, the problem was that the clients expect snapshots to contain some format of time in their name. After evaluating the possible ways (asking the user to create snapshots with some format of time in their names, and adding a rename-snapshot command for existing snapshots), the following method seemed simpler. If USS is enabled, then the creation time of the snapshot is appended to the snapname and is listed in the .snaps directory. The actual name of the snapshot is left unmodified, i.e. the snapshot list/info/restore etc. commands work with the original snapname. The patch for the same can be found at http://review.gluster.org/#/c/9371/ The impact is that users would see snapnames in the .snaps folder different from what they created. Also, the current patch does not take care of the scenario where the snapname already has a time in its name. Eg: Without this patch: drwxr-xr-x 4 root root 110 Dec 26 04:14 snap1 drwxr-xr-x 4 root root 110 Dec 26 04:14 snap2 With this patch: drwxr-xr-x 4 root root 110 Dec 26 04:14 snap1@GMT-2014.12.30-05.07.50 drwxr-xr-x 4 root root 110 Dec 26 04:14 snap2@GMT-2014.12.30-23.49.02 Please let me know if you have any suggestions or concerns on the same. Thanks, Poornima ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
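The @GMT-... suffix shown above follows the shadow-copy naming convention Windows clients expect, and can be produced on the fly from the snapshot's creation time. A standalone sketch of the formatting (not the actual glusterd code; the function name is made up):

```c
#include <stdio.h>
#include <time.h>

/* Append the shadow-copy style timestamp to a snapshot name:
 * "snap1" + creation time -> "snap1@GMT-2014.12.30-05.07.50" */
void shadow_copy_name(const char *snap, time_t ctime_, char *out, size_t len)
{
    struct tm tm;
    gmtime_r(&ctime_, &tm);   /* always GMT, per the convention */

    char stamp[64];
    strftime(stamp, sizeof(stamp), "%Y.%m.%d-%H.%M.%S", &tm);
    snprintf(out, len, "%s@GMT-%s", snap, stamp);
}
```

Because the suffix is always rendered in GMT, a client in any timezone can convert it for display, which is the "dynamic timezone application" point above.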
Re: [Gluster-devel] mandatory lock
Ideally you want the clients to coordinate among themselves. Note that this feature cannot be implemented foolproof (theoretically) in a system that supports NFSv3.

On Thu Jan 08 2015 at 8:57:48 AM Harmeet Kalsi kharm...@hotmail.com wrote: Hi Anand, that was spot on. Any idea if there will be development on this side in the near future, as multiple clients writing to the same file can cause issues? Regards -- From: av...@gluster.org Date: Thu, 8 Jan 2015 16:07:50 + Subject: Re: [Gluster-devel] mandatory lock To: rgowd...@redhat.com; kharm...@hotmail.com CC: gluster-devel@gluster.org Note that the mandatory locks available in the locks translator are just the mandatory extensions to posix locks - at least one of the apps must be using locks to begin with. What Harmeet is asking for is something different - automatic exclusive access to edit files. i.e., if one app has opened a file for editing, other apps which attempt an open must either fail (EBUSY) or block till the first app closes. We would need to treat open(O_RDONLY) as a read lock and open(O_RDWR|O_WRONLY) as a write lock request (essentially an auto-applied oplock). This is something gluster does not yet have. Thanks On Thu Jan 08 2015 at 2:49:29 AM Raghavendra Gowdappa rgowd...@redhat.com wrote: - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Harmeet Kalsi kharm...@hotmail.com Cc: Gluster-devel@gluster.org gluster-devel@gluster.org Sent: Thursday, January 8, 2015 4:12:44 PM Subject: Re: [Gluster-devel] mandatory lock - Original Message - From: Harmeet Kalsi kharm...@hotmail.com To: Gluster-devel@gluster.org gluster-devel@gluster.org Sent: Wednesday, January 7, 2015 5:55:43 PM Subject: [Gluster-devel] mandatory lock Dear All. Would it be possible for someone to guide me in the right direction to enable the mandatory lock on a volume please? At the moment two clients can edit the same file at the same time, which is causing issues. I see code related to mandatory locking in the posix-locks xlator (pl_writev, pl_truncate, etc.). To enable it you have to set option mandatory-locks yes in the posix-locks xlator loaded on bricks (/var/lib/glusterd/vols/volname/*.vol). We have no way to set this option through the gluster cli. Also, I am not sure to what extent this feature is tested/used till now. You can try it out and please let us know whether it worked for you :). If mandatory locking doesn't work for you, can you modify your application to use advisory locking, since advisory locking is well tested and has been used for a long time? Many thanks in advance Kind Regards ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Readdir d_off encoding
Please review http://review.gluster.org/9332/, as it undoes the introduction of itransform on d_off in AFR. This does not solve DHT-over-DHT or other future use cases, but at least fixes the regression in 3.6.x. Thanks

On Tue Dec 23 2014 at 10:34:41 AM Anand Avati av...@gluster.org wrote: Using GFID does not work for d_off. The GFID represents an inode, and a d_off represents a directory entry. Therefore using GFID as an alternative to d_off breaks down when you have hardlinks to the same inode in a single directory. On Tue Dec 23 2014 at 2:20:34 AM Xavier Hernandez xhernan...@datalab.es wrote: On 12/22/2014 06:41 PM, Jeff Darcy wrote: An alternative would be to convert directories into regular files from the brick point of view. The benefits of this would be: * d_off would be controlled by gluster, so all bricks would have the same d_off and order. No need to use any d_off mapping or transformation. I don't think a full-out change from real directories to virtual ones is in the cards, but a variant of this idea might be worth exploring further. If we had a *server side* component to map between on-disk d_off values and those we present to clients, then it might be able to do a better job than the local FS of ensuring uniqueness within the bits (e.g. 48 of them) that are left over after we subtract some for a brick ID. This could be enough to make the bit-stealing approach (on the client) viable. There are probably some issues with failing over between replicas, which should have the same files but might not have assigned the same internal d_off values, but those issues might be avoidable if the d_off values are deterministic with respect to GFIDs. Having a server-side xlator seems a better approximation, however I see some problems that need to be solved: The mapper should work on the fly (i.e. it should do the mapping between the local d_off and the client d_off without having full knowledge of the directory contents). This is a good approach for really big directories, because it doesn't require wasting large amounts of memory, but it will be hard to find a way to avoid duplicates, especially if we are limited to ~48 bits. Making it based on the GFID would be a good way to have a common d_off between bricks; however, maintaining order will be harder. It will also be hard to guarantee uniqueness if the mapping is deterministic and the directory is very big. Otherwise it would need to read the full directory contents before returning mapped d_off's. To minimize the collision problem, we need to solve the ordering problem. If we can guarantee that all bricks return directory entries in the same order and with the same d_off, we don't need to reserve any bits in d_off. I think the virtual directories solution should be the one to consider for 4.0. For earlier versions we can try to find an intermediate solution. Following your idea of a server side component, could this be useful? * Keep all directories and their entries in a double linked list stored in an xattr of each inode. * Use this linked list to build the readdir answer. * Use the first 64 (or 63) bits of the gfid as the d_off. * There will be two special offsets: 0 for '.' and 1 for '..' Example (using shorter gfid's for simplicity): Directory root with gfid 0001 Directory 'test1' inside root with gfid Directory 'test2' inside root with gfid Entry 'entry1' inside 'test1' with gfid Entry 'entry2' inside 'test1' with gfid Entry 'entry3' inside 'test2' with gfid Entry 'entry4' inside 'test2' with gfid Entry 'entry5' inside 'test2' with gfid / (0001) test1/ () entry1 () entry2 () test2/ () entry3 () entry4 () entry5 () Note that entry2 and entry3 are hardlinks.
xattrs of root (0001): trusted.dirmap.0001.next = trusted.dirmap.0001.prev = xattrs of 'test1' (): trusted.dirmap.0001.next = trusted.dirmap.0001.prev = 0001 trusted.dirmap..next = trusted.dirmap..prev = xattrs of 'test2' (): trusted.dirmap.0001.next = 0001 trusted.dirmap.0001.prev = trusted.dirmap..next = trusted.dirmap..prev = xattrs of 'entry1' (): trusted.dirmap..next = trusted.dirmap..prev = xattrs of 'entry2'/'entry3' (): trusted.dirmap..next = trusted.dirmap..prev = trusted.dirmap..next = trusted.dirmap..prev = xattrs of 'entry4' (): trusted.dirmap..next = trusted.dirmap..prev = xattrs of 'entry5' (): trusted.dirmap..next = trusted.dirmap..prev = It's easy to enumerate all entries from the beginning of a directory. Also, since we return extra information from each inode in a directory, accessing these new xattrs doesn't represent a big impact
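For context on the "bit-stealing" approach discussed in this thread: a cluster translator with several subvolumes steals the high bits of d_off to remember which subvolume produced an entry, so a later readdir can resume at the right brick. A simplified sketch of the transform (the real itransform in dht/afr differs in detail; the 16-bit split here is illustrative):

```c
#include <stdint.h>

#define SUBVOL_BITS 16   /* high bits stolen for the subvolume id */

/* Combine a brick-local d_off with the id of the subvolume that
 * produced it.  Returns 0 (an invalid d_off) if the local offset is
 * too large to survive losing its high bits -- precisely the failure
 * mode this thread is about: backend filesystems that use large or
 * hash-based d_off values. */
static uint64_t doff_encode(uint64_t local, uint32_t subvol)
{
    if (local >> (64 - SUBVOL_BITS))
        return 0;                        /* would be truncated */
    return ((uint64_t)subvol << (64 - SUBVOL_BITS)) | local;
}

/* Recover both halves when the client resumes a readdir. */
static void doff_decode(uint64_t d_off, uint64_t *local, uint32_t *subvol)
{
    *subvol = (uint32_t)(d_off >> (64 - SUBVOL_BITS));
    *local  = d_off & ((1ULL << (64 - SUBVOL_BITS)) - 1);
}
```

Stacking such translators (DHT over DHT, or AFR under DHT) steals bits twice, which is why the thread looks for server-side mapping or GFID-deterministic schemes instead.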
Re: [Gluster-devel] pthread_mutex misusage in glusterd_op_sm
This is indeed a misuse. A very similar bug used to be there in io-threads, but we moved to using pthread_cond there a while ago. To fix this problem we could use a pthread_mutex/pthread_cond pair + a boolean flag in place of the misused mutex. Or, we could just declare gd_op_sm_lock as a synclock_t to achieve the same result. Thanks

On Tue Nov 25 2014 at 10:26:34 AM Emmanuel Dreyfus m...@netbsd.org wrote: I made a simple fix that addresses the problem: http://review.gluster.org/9197 Are there other places where the same bug could exist? Could anyone familiar with the code tell? Emmanuel Dreyfus m...@netbsd.org wrote: in glusterd_op_sm(), we lock and unlock the gd_op_sm_lock mutex. Unfortunately, locking and unlocking can happen in different threads (a task swap can occur in the handler call). This case is explicitly covered by POSIX: the behavior is undefined. http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutex_lock.html When unlocking from a thread that is not the owner, Linux seems to be fine (though you never know with unspecified operations), while NetBSD returns EPERM, causing a spurious error in tests/basic/pump.t. It can be observed in a few failed tests here: http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/ Fixing it seems far from obvious. I guess it needs to use syncop, but the change would be intrusive. Do we have another option? Is it possible to switch a task to a given thread? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
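The mutex/cond-pair fix works because the pthread_mutex is then only ever locked and unlocked by the thread currently touching the flag; the boolean flag, not the mutex, represents ownership, and a flag may legally be cleared from any thread. A sketch of the idea (type and function names are illustrative; synclock_t in libglusterfs implements essentially this):

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             busy;   /* the "lock" state; any thread may clear it */
} xthread_lock_t;

static void xt_lock(xthread_lock_t *l)
{
    pthread_mutex_lock(&l->mutex);   /* locked+unlocked by same thread */
    while (l->busy)
        pthread_cond_wait(&l->cond, &l->mutex);
    l->busy = 1;
    pthread_mutex_unlock(&l->mutex);
}

static void xt_unlock(xthread_lock_t *l)
{
    pthread_mutex_lock(&l->mutex);
    l->busy = 0;
    pthread_cond_signal(&l->cond);
    pthread_mutex_unlock(&l->mutex); /* fine even from a different thread */
}

static void *unlocker(void *arg)
{
    xt_unlock(arg);                  /* NOT the thread that took the lock */
    return NULL;
}

int xthread_demo(void)
{
    xthread_lock_t l = { PTHREAD_MUTEX_INITIALIZER,
                         PTHREAD_COND_INITIALIZER, 0 };
    pthread_t t;

    xt_lock(&l);                     /* acquired on the main thread */
    pthread_create(&t, NULL, unlocker, &l);
    pthread_join(t, NULL);           /* released on the other thread */
    xt_lock(&l);                     /* succeeds: no EPERM anywhere  */
    xt_unlock(&l);
    return l.busy;                   /* 0 */
}
```

Unlike a raw pthread_mutex, this construction is defined behavior on every POSIX platform, which is exactly what the NetBSD failures call for.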
Re: [Gluster-devel] Single layout at root (Was EHT / DHT)
On Tue Nov 25 2014 at 1:28:59 PM Shyam srang...@redhat.com wrote: On 11/12/2014 01:55 AM, Anand Avati wrote: On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy jda...@redhat.com mailto:jda...@redhat.com wrote: (Personally I would have done this by mixing in the parent GFID to the hash calculation, but that alternative was ignored.) Actually when DHT was implemented, the concept of GFID did not (yet) exist. Due to backward compatibility it has just remained this way even later. Including the GFID into the hash has benefits. I am curious here as this is interesting. So the layout start subvol assignment for a directory to be based on its GFID was provided so that files with the same name distribute better than ending up in the same bricks, right? Right, for e.g we wouldn't want all the README.txt in various directories of a volume to end up on the same server. The way it is achieved today is, the per server hash-range assignment is rotated by a certain amount (how much it is rotated is determined by a separate hash on the directory path) at the time of mkdir. Instead as we _now_ have GFID, we could use that including the name to get a similar/better distribution, or GFID+name to determine hashed subvol. What we could do now is, include the parent directory gfid as an input into the DHT hash function. Today, we do approximately: int hashval = dm_hash (readme.txt) hash_ranges[] = inode_ctx_get (parent_dir) subvol = find_subvol (hash_ranges, hashval) Instead, we could: int hashval = new_hash (readme.txt, parent_dir.gfid) hash_ranges[] = global_value subvol = find_subvol (hash_ranges, hashval) The idea here would be that on dentry creates we would need to generate the GFID and not let the bricks generate the same, so that we can choose the subvol to wind the FOP to. The GFID would be that of the parent (as an entry name is always in the context of a parent directory/inode). Also, the GFID for a new entry is already generated by the client, the brick does not generate a GFID. 
This eliminates the need for a layout per sub-directory and all the (interesting) problems that it comes with and instead can be replaced by a layout at root. Not sure if it handles all use cases and paths that we have now (which needs more understanding). I do understand there is a backward compatibility issue here, but other than this, this sounds better than the current scheme, as there is a single layout to read/optimize/stash/etc. across clients. Can I understand the rationale of this better, as to what you folks are thinking. Am I missing something or over reading on the benefits that this can provide? I think you understand it right. The benefit is one could have a single hash layout for the entire volume and the directory specific-ness is implemented by including the directory gfid into the hash function. The way I see it, the compromise would be something like: Pro per directory range: By having per-directory hash ranges, we can do easier incremental rebalance. Partial progress is well tolerated and does not impact the entire volume. The time a given directory is undergoing rebalance, for that directory alone we need to enter unhashed lookup mode, only for that period of time. Con per directory range: Just the new hash assignment phase (to impact placement of new files/data, not move old data) itself is an extended process, crawling the entire volume with complex per-directory operations. The number of points in the system where things can break (i.e, result in overlaps and holes in ranges) is high. Pro single layout with dir GFID in hash: Avoid the numerous parts (per-dir hash ranges) which can potentially break. Con single layout with dir GFID in hash: Rebalance phase 1 (assigning new layout) is atomic for the entire volume - unhashed lookup has to be on for all dirs for the entire period. 
To mitigate this, we could explore versioning the centralized hash ranges and storing the version used by each directory in its xattrs (updating the version as the rebalance progresses). But now we have more centralized metadata (which may or may not be a worthy compromise - not sure). In summary, including the GFID in the hash calculation does open up interesting possibilities and is worthy of serious consideration. HTH, Avati ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
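The two schemes discussed above can be made concrete with a few lines of self-contained C. This is a toy model, not GlusterFS's DHT: fnv1a() stands in for the real Davies-Meyer hash, subvolume selection is a plain modulo rather than a hash-range lookup, and all names are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in hash (FNV-1a), seedable so a parent GFID can be mixed in. */
static uint32_t fnv1a(const void *data, size_t len, uint32_t seed)
{
        const unsigned char *p = data;
        uint32_t h = seed;
        while (len--) {
                h ^= *p++;
                h *= 16777619u;
        }
        return h;
}

/* Per-directory scheme (today, minus the per-dir range rotation):
 * hash only the entry name. The same name always hashes the same. */
static int subvol_today(const char *name, int nsubvols)
{
        return fnv1a(name, strlen(name), 2166136261u) % nsubvols;
}

/* Proposed scheme: seed the hash with the parent directory's GFID,
 * so one global hash range still spreads identical names that live
 * in different directories. */
static int subvol_proposed(const char *name,
                           const unsigned char gfid[16], int nsubvols)
{
        uint32_t seed = fnv1a(gfid, 16, 2166136261u);
        return fnv1a(name, strlen(name), seed) % nsubvols;
}
```

With, say, four subvolumes, subvol_today("README.txt", 4) yields the same subvolume everywhere, while subvol_proposed() varies with the parent's GFID (usually; collisions are of course possible with only four buckets).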
Re: [Gluster-devel] Invalid DIR * usage in quota xlator
On Tue, Oct 14, 2014 at 7:22 PM, Emmanuel Dreyfus m...@netbsd.org wrote: J. Bruce Fields bfie...@fieldses.org wrote: Is the result on non-Linux really to fail any readdir using an offset not returned from the current open? Yes, but that non-Linux behavior is POSIX compliant. Linux just happens to do more than the standard here. I can't see how NFS READDIR will work on non-Linux platforms in that case. Different case: seekdir/telldir operate at libc level on data cached in userland. NFS READDIR operates on data in the kernel. Is there a way to get hold of the directory entry cookies used by NFS READDIR from user-space? Some sort of NetBSD-specific syscall (like getdents on Linux)? Thanks ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] if/else coding style :-)
On Mon, Oct 13, 2014 at 2:00 PM, Shyam srang...@redhat.com wrote: (apologies, last one on the metrics from me :), as I believe it is more about style than actual numbers at this point) _maybe_ this is better, and it is pretty close to call now ;)

    find -name '*.c' | xargs grep else | wc -l                -> 3719
    find -name '*.c' | xargs grep else | grep '}' | wc -l     -> 1986
    find -name '*.c' | xargs grep else | grep -v '}' | wc -l  -> 1733

Without taking sides: the last grep counts else lines without a closing brace.

    [~/work/glusterfs] sh$ git grep '} else {' | wc -l
    1331
    [~/work/glusterfs] sh$ git grep 'else {' | grep -v '}' | wc -l
    142

So going by just the numbers, '} else {' is 10x more common than '}\n else {'. I also find that believable from familiarity with this pattern in the code. Either way, it is a good idea to stick to one and not allow both in future code. Thanks ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Feature proposal - FS-Cache support in FUSE
On Mon, Sep 1, 2014 at 6:07 AM, Vimal A R arvi...@yahoo.in wrote: Hello fuse-devel / fs-cache / gluster-devel lists, I would like to propose the idea of implementing FS-Cache support in the fuse kernel module, which I am planning to do as part of my UG university course. This proposal is by no means final, since I have just started to look into this. There are several user-space filesystems which are based on the FUSE kernel module. As of now, if I understand correctly, the only networked filesystems having FS-Cache support are NFS and AFS. Implementing support hooks for FS-Cache in the fuse module would give networked filesystems such as GlusterFS the benefit of a client-side caching mechanism, which should decrease access times. If you are planning to test this with GlusterFS, note that one of the first challenges would be to have persistent filehandles in FUSE. While GlusterFS has a notion of a persistent handle (GFID, 128 bit) which is constant across clients and remounts, the FUSE kernel module is presented a transient LONG (64/32 bit) which is specific to the mount instance (actually, the address of the userspace inode_t within the glusterfs process - which allows for constant-time filehandle resolution). This would be a challenge for any FUSE-based filesystem whose persistent filehandles are larger than 64 bits. Thanks When enabled, FS-Cache would maintain a virtual indexing tree to cache the data or object types per network FS. Indices in the tree are used by FS-Cache to find objects faster. The tree or index structure under the main network-FS index depends on the filesystem. Cookies are used to represent the indices, the pages, etc. The tree structure would be as follows:

a) The virtual index tree maintained by FS-Cache would look like:
   * FS-Cache master index - network-filesystem index (NFS/AFS etc.) - per-share indices - file-handle indices - page indices

b) In the case of FUSE-based filesystems, the tree would be similar to:
   * FS-Cache master index - FUSE index - per-FS indices - file-handle indices - page indices

c) In the case of a FUSE-based filesystem such as GlusterFS, the tree would be:
   * FS-Cache master index - FUSE index (fuse.glusterfs) - GlusterFS volume ID (a UUID exists for each volume) - GlusterFS file-handle indices (based on the GFID of a file) - page indices

The idea is to enable FUSE to work with the FS-Cache network filesystem API, which is documented at https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/caching/netfs-api.txt . The implementation of FS-Cache support in NFS can be taken as a guideline to understand and start off. I will reply to this mail with any updates that come up whilst pursuing this further. I request any sort of feedback/suggestions, ideas, pitfalls etc. that can help in taking this further. Thank you, Vimal

References:
* https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/caching/fscache.txt
* https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/caching/netfs-api.txt
* https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/caching/object.txt
* http://people.redhat.com/dhowells/fscache/FS-Cache.pdf
* http://people.redhat.com/steved/fscache/docs/HOWTO.txt
* https://en.wikipedia.org/wiki/CacheFS
* https://lwn.net/Articles/160122/
* http://www.linux-mag.com/id/7378/

___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Transparent encryption in GlusterFS: Implications on manageability
+1 for all the points. On Wed, Aug 13, 2014 at 11:22 AM, Jeff Darcy jda...@redhat.com wrote: I.1 Generating the master volume key The master volume key should be generated by the user on a trusted machine. Recommendations on master-key generation are provided in section 6.2 of the man pages [1]. Generating the master volume key is the user's responsibility. That was fine for an initial implementation, but it's still the single largest obstacle to adoption of this feature. Looking forward, we need to provide full CLI support for generating keys in the necessary format, specifying their location, etc. I.2 Location of the master volume key when mounting a volume At mount time the crypt translator searches for a master volume key on the client machine at the location specified by the respective translator option. If there is no key at the specified location, or the key at the specified location is in an improper format, then the mount will fail. Otherwise, the crypt translator loads the key into its private memory data structures. The location of the master volume key can be specified at volume creation time (see option master-key, section 6.7 of the man pages [1]). However, this option can be overridden by the user at mount time to specify another location; see section 7 of the man pages [1], steps 6, 7, 8. Again, we need to improve on this. We should support this as a volume or mount option in its own right, not rely on the generic --xlator-option mechanism. Adding options to mount.glusterfs isn't hard. Alternatively, we could make this look like a volume option settable once through the CLI, even though the path is stored locally on the client. Or we could provide a separate special-purpose command/script, which again only needs to be run once. It would even be acceptable to treat the path to the key file (not its contents!) as a true volume option, stored on the servers.
Any of these would be better than requiring the user to understand our volfile format and construction so that they can add the necessary option by hand. II. Check the graph of translators on your client machine after mount! During mount, your client machine receives configuration info from the non-trusted server. In particular, this info contains the graph of translators, which can be subjected to tampering so that encryption won't be invoked for your volume at all. So it is highly important to verify this graph. After a successful mount, make sure that the graph of translators contains the crypt translator with proper options (see FAQ#1, section 11 of the man pages [1]). It is important to verify the graph, but not by poking through log files, and not without more information about what to look for. So we got a volfile that includes the crypt translator, with some options. The *code* should ensure that the master-key option has the value from the command line or local config, and not some other. If we have to add special support for this in otherwise-generic graph-initialization code, that's fine. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] how does meta xlator work?
On Tue, Aug 12, 2014 at 9:58 AM, Emmanuel Dreyfus m...@netbsd.org wrote: On Mon, Aug 11, 2014 at 09:53:19PM -0700, Anand Avati wrote: If FUSE implements proper direct_io semantics (somewhat like how the O_DIRECT flag is handled) and allows the mode to be enabled by the FS in open_cbk, then I guess such special handling of 0-byte sizes need not be necessary? At least it hasn't been necessary in the Linux FUSE implementation. I made a patch that lets meta.t pass on NetBSD. Unfortunately, the direct-IO flag gets attached to the vnode and not to the file descriptor, which means it is not possible to have one fd with direct IO and another without. But perhaps it is just good enough. That may or may not work well in practice depending on the number of concurrent apps working on a file. But a good start nonetheless. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] how does meta xlator work?
On Wed, Aug 13, 2014 at 8:55 PM, Emmanuel Dreyfus m...@netbsd.org wrote: Anand Avati av...@gluster.org wrote: That may or may not work well in practice depending on the number of concurrent apps working on a file. I am not sure what could make an FS decide that, for the same file, one file descriptor should use direct I/O and another should not. Keeping the flag at file-descriptor level would require VFS modification in the kernel: the filesystem knows nothing about file descriptors, it just knows vnodes. It could be done, but I expect to meet resistance :-) For example, glusterfs used to enable direct-IO mode for non-read-only FDs (to cut overheads) but disable it for read-only FDs (to leverage readahead). After big_writes was introduced in Linux FUSE this changed. But we should be OK having a vnode-level switch for now, I think. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] how does meta xlator work?
My guess would be that direct_io_mode works differently on *BSD. In Linux (and it appears in OS X as well), the VFS takes a hint from the file size (returned in lookup/stat) to stop itself from read()ing beyond that offset. So if a file's size is returned as 0 in lookup, the read() is never received, even by FUSE. In meta all file sizes are 0 (since the contents of the inode are generated dynamically on open()/read(), the size is unknown during lookup() -- just like /proc). Therefore all meta file open()s are forced into direct_io_mode (http://review.gluster.org/7506) so that read() requests are sent straight to FUSE/glusterfs, bypassing the VFS (the size is ignored, etc.). So my suggestion would be to inspect how direct_io_mode works in those FUSE implementations first. It is unlikely to be any other issue. Thanks On Sun, Aug 10, 2014 at 9:56 PM, Harshavardhana har...@harshavardhana.net wrote: I am working on tests/basic/meta.t on NetBSD. It fails because .meta/frames is empty, like all files in .meta. A quick investigation in the source code shows that the function responsible for filling the file (frames_file_fill) is never called. Same experience here. It does work on OSX, but does not work on FreeBSD for similar reasons; haven't figured out yet what is causing the issue. -- Religious confuse piety with mere ritual, the virtuous confuse regulation with outcomes ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
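The size-based short-circuit described above can be modeled in a few self-contained lines (a toy stand-in, not real VFS or meta-xlator code): with a cached size of 0 the request never reaches the filesystem, while direct_io bypasses the size check entirely.

```c
#include <assert.h>
#include <string.h>

/* The FS generates content lazily on read (like .meta or /proc),
 * so lookup reported a size of 0. */
static const char *fs_content = "frames: ...\n";

/* Toy VFS read path: outside direct_io mode, the VFS clamps the
 * read at the size it cached from lookup -- 0 here -- and the FS
 * is never even asked. With direct_io, the request goes through. */
static size_t vfs_read(int direct_io, size_t cached_size,
                       char *buf, size_t len)
{
        if (!direct_io && cached_size == 0)
                return 0;                 /* "EOF" at offset 0 */

        size_t n = strlen(fs_content);    /* request reached the FS */
        if (n > len)
                n = len;
        memcpy(buf, fs_content, n);
        return n;
}
```

Reading with direct_io=0 and cached_size=0 returns nothing; reading with direct_io=1 returns the generated content, which is the behavior meta relies on.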
Re: [Gluster-devel] how does meta xlator work?
On Mon, Aug 11, 2014 at 7:37 PM, Harshavardhana har...@harshavardhana.net wrote: But there is something I don't get with the fix: - the code forces direct IO if ((state->flags & O_ACCMODE) != O_RDONLY), but here the file is opened read-only, hence I would expect the fuse xlator to do nothing special. direct_io_mode(xdata) is in fact gf_true here for 'meta'. - using direct IO is the kernel's decision, reflecting the flags used by the calling process on open(2). How can the fuse xlator convince the kernel to enable it? fuse is convinced by FOPEN_DIRECT_IO, which is pro-actively set when 'xdata' carries direct-io-mode set to 1 by the meta module. From the FreeBSD fuse4bsd code:

    void
    fuse_vnode_open(struct vnode *vp, int32_t fuse_open_flags, struct thread *td)
    {
            /*
             * Funcation is called for every vnode open.
             * Merge fuse_open_flags it may be 0
             *
             * XXXIP: Handle FOPEN_DIRECT_IO and FOPEN_KEEP_CACHE
             */
            if (vnode_vtype(vp) == VREG) {
                    /* XXXIP prevent getattr, by using cached node size */
                    vnode_create_vobject(vp, 0, td);
            }
    }

fuse4bsd doesn't seem to fully implement this functionality; only the getpages/putpages API seems to implement IO_DIRECT handling. Need to see if indeed such is the case. I think you found it right. fuse4bsd should start handling FOPEN_DIRECT_IO in the open handler. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] how does meta xlator work?
On Mon, Aug 11, 2014 at 9:14 PM, Emmanuel Dreyfus m...@netbsd.org wrote: Anand Avati av...@gluster.org wrote: In meta all file sizes are 0 (since the contents of the inode are generated dynamically on open()/read(), the size is unknown during lookup() -- just like /proc). And therefore all meta file open()s are forced into direct_io_mode (http://review.gluster.org/7506) so that read() requests are sent straight to FUSE/glusterfs, bypassing the VFS (the size is ignored, etc.). I found the code in the kernel that skips the read if it is beyond the known file size. Hence I guess the idea is that on lookup of a file with a size equal to 0, special handling should be done so that reads happen anyway. If FUSE implements proper direct_io semantics (somewhat like how the O_DIRECT flag is handled) and allows the mode to be enabled by the FS in open_cbk, then I guess such special handling of 0-byte sizes need not be necessary? At least it hasn't been necessary in the Linux FUSE implementation. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Fw: Re: Corvid gluster testing
David, Is it possible to profile the app to understand the block sizes used for performing write() (using strace, source-code inspection, etc.)? The block sizes reported by gluster volume profile are measured on the server side and are subject to some aggregation by the client-side write-behind xlator. Typically the biggest hurdle for small-block writes is FUSE context switches, which happen even before reaching the client-side write-behind xlator. You could also enable the io-stats xlator on the client side just below FUSE (before write-behind) and extract data using setfattr. On Wed, Aug 6, 2014 at 10:00 AM, David F. Robinson david.robin...@corvidtec.com wrote: My apologies. I did some additional testing and realized that my timing wasn't right. I believe that after I do the write, NFS caches the data, and until I close and flush the file the timing isn't correct. I believe the appropriate timing is now 38 seconds for NFS and 60 seconds for gluster. I played around with some of the parameters and got it down to 52 seconds with gluster by setting:

    performance.write-behind-window-size: 128MB
    performance.cache-size: 128MB

I couldn't get it closer to the NFS timing on the writes, although the read speeds were slightly better than NFS. I am not sure if this is reasonable, or if I should be able to get write speeds that are more comparable to the NFS mount... Sorry for the confusion I might have caused with my first email... It isn't 25x slower. It is roughly 30% slower for the writes... David -- Original Message -- From: Vijay Bellur vbel...@redhat.com To: David F. Robinson david.robin...@corvidtec.com; gluster-devel@gluster.org Sent: 8/6/2014 12:48:09 PM Subject: Re: [Gluster-devel] Fw: Re: Corvid gluster testing On 08/06/2014 12:11 AM, David F. Robinson wrote: I have been testing some of the fixes that Pranith incorporated into the 3.5.2-beta to see how they performed for moderate levels of i/o.
All of the stability issues that I had seen in previous versions seem to have been fixed in 3.5.2; however, there still seem to be some significant performance issues. Pranith suggested that I send this to the gluster-devel email list, so here goes: I am running an MPI job that saves a restart file to the gluster file system. When I use the following in my fstab to mount the gluster volume, the i/o time for the 2.5GB file is roughly 45 seconds:

    gfsib01a.corvidtec.com:/homegfs /homegfs glusterfs transport=tcp,_netdev 0 0

When I switch this to use the NFS protocol (see below), the i/o time is 2.5 seconds:

    gfsib01a.corvidtec.com:/homegfs /homegfs nfs vers=3,intr,bg,rsize=32768,wsize=32768 0 0

The read times for gluster are 10-20% faster than NFS, but the write times are almost 20x slower. What is the block size of the writes that are being performed? You can expect better throughput and lower latency with block sizes that are close to or greater than 128KB. -Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
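The block-size question matters mostly because of the number of write() round trips (and hence FUSE context switches) per file. A back-of-the-envelope helper, using the 2.5 GB figure from the report above and illustrative block sizes:

```c
#include <assert.h>
#include <stdint.h>

/* Number of write() calls needed to push one file through FUSE at a
 * given block size -- each call is a context switch into the
 * glusterfs client process before write-behind can aggregate it. */
static uint64_t round_trips(uint64_t file_bytes, uint64_t block)
{
        return (file_bytes + block - 1) / block;   /* ceiling division */
}
```

For a 2.5 GB file (2684354560 bytes), that is 655360 round trips at 4 KB blocks but only 20480 at 128 KB, which is why the 128 KB threshold Vijay mentions makes such a difference on the FUSE path.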
Re: [Gluster-devel] regarding fuse mount crash on graph-switch
Can you add more logging to the fd migration failure path as well please (errno and possibly other details)? Thanks! On Wed, Aug 6, 2014 at 9:16 PM, Pranith Kumar Karampuri pkara...@redhat.com wrote: hi, Could you guys review http://review.gluster.com/#/c/8402. This fixes crash reported by JoeJulian. We are yet to find why fd-migration failed. Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] regarding resolution for fuse/server
There are subtle differences between fuse and server. In fuse, the inode table does not use LRU pruning, so expected inodes are guaranteed to be cached. For example, when a mkdir() FOP arrives, fuse would already have checked with a lookup, and the kernel guarantees that another thread could not have created the directory in the meantime (it holds a mutex on the dir). In the server, either because of LRU pruning or because of racing threads, you need to re-evaluate the situation to be sure. RESOLVE_MUST/NOT makes more sense in the server because of this. HTH On Fri, Aug 1, 2014 at 2:43 AM, Pranith Kumar Karampuri pkara...@redhat.com wrote: hi, Does anyone know why there is different code for resolution in fuse vs server? There are some differences too, like server asserts about the resolution types like RESOLVE_MUST/RESOLVE_NOT etc., whereas fuse doesn't do any such thing. Wondering if there is any reason why the code is different in these two xlators. Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
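A minimal sketch of what the server's resolve types buy. The RESOLVE_MUST/RESOLVE_NOT names mirror the enums mentioned above, but the logic is a simplified stand-in for the server's resolver, assuming RESOLVE_MUST means the entry must already exist and RESOLVE_NOT means it must not yet exist (as for create/mkdir):

```c
#include <assert.h>
#include <errno.h>

typedef enum {
        RESOLVE_MUST,   /* entry must already exist (e.g. open, unlink) */
        RESOLVE_NOT,    /* entry must NOT exist yet (e.g. create, mkdir) */
        RESOLVE_MAY     /* either outcome is fine   (e.g. lookup) */
} resolve_type_t;

/* Re-evaluated at FOP time: unlike FUSE, the server cannot trust a
 * prior lookup, because its inode table may have pruned the entry
 * (LRU) or a racing client thread may have created it meanwhile. */
static int resolve_check(resolve_type_t type, int entry_exists)
{
        switch (type) {
        case RESOLVE_MUST:
                return entry_exists ? 0 : -ENOENT;
        case RESOLVE_NOT:
                return entry_exists ? -EEXIST : 0;
        default:
                return 0;
        }
}
```

In fuse, the kernel's locking makes these checks redundant; in the server they are the only defense against the races described above.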
Re: [Gluster-devel] Reuse of frame?
call frames and stacks are re-used from a mem-pool. So pointers might repeat. Can you describe your use case a little more in detail, just to be sure? On Mon, Jul 28, 2014 at 11:27 AM, Matthew McKeen matt...@mmckeen.net wrote: Is it true that different fops will always have a different frame (i.e. different frame pointer) as seen in the translator stack? I've always thought this to be true, but it seems that with the release-3.6.0 branch the two quick getxattr syscalls that the getfattr cli command calls share the same frame pointer. This is causing havoc with a translator of mine, and I was wondering if this was a bug, or expected behaviour. Thanks, Matt -- Matthew McKeen matt...@mmckeen.net ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Reuse of frame?
Does your code wait for both clients to unwind so that it merges the two replies before it unwinds itself? You typically would need to keep a call count (# of winds) and wait for that many _cbk invocations before calling STACK_UNWIND yourself. If you are not waiting for both replies, it is possible that the frame pointer got re-used for the second call before the second callback of the first call arrived. On Mon, Jul 28, 2014 at 12:32 PM, Matthew McKeen matt...@mmckeen.net wrote: I have a translator on the client stack. For a particular getxattr I wind the stack to call two client-translator getxattr fops. The callback for these fops is the same function in the original translator. The getfattr cli command calls two getxattr syscalls in rapid succession, so that I see the callback being hit 4 times and the original getxattr forward fop 2 times. For both the 4 callbacks and the 2 forward fops, the frame pointer for the translator is the same. Therefore, when I try to store a pointer to a dict in frame->local, the dict pointer points to the same dict for both fops, and data set into the dict with the same keys ends up overwriting the values from the previous fop. On Mon, Jul 28, 2014 at 12:19 PM, Anand Avati av...@gluster.org wrote: call frames and stacks are re-used from a mem-pool. So pointers might repeat. Can you describe your use case in a little more detail, just to be sure? On Mon, Jul 28, 2014 at 11:27 AM, Matthew McKeen matt...@mmckeen.net wrote: Is it true that different fops will always have a different frame (i.e. different frame pointer) as seen in the translator stack? I've always thought this to be true, but it seems that with the release-3.6.0 branch the two quick getxattr syscalls that the getfattr cli command calls share the same frame pointer. This is causing havoc with a translator of mine, and I was wondering if this was a bug or expected behaviour.
Thanks, Matt -- Matthew McKeen matt...@mmckeen.net ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
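The wind/unwind accounting Avati describes can be sketched in self-contained C. The types here are stand-ins for call_frame_t/frame->local and the real STACK_WIND/STACK_UNWIND macros; the point is only the call-count bookkeeping:

```c
#include <assert.h>
#include <stdlib.h>

/* Per-frame state, what would normally hang off frame->local. */
typedef struct {
        int call_cnt;   /* # of winds still outstanding */
        int unwound;    /* have we already replied upward? */
} local_t;

/* Callback: the LAST reply performs the unwind; earlier ones just
 * record their result and decrement the count. In the real xlator
 * the decrement must be done under a lock, since the two client
 * callbacks can race on different threads. */
static void getxattr_cbk(local_t *local)
{
        if (--local->call_cnt == 0)
                local->unwound = 1;   /* STACK_UNWIND would go here */
}

/* Fop: set call_cnt BEFORE the first wind -- otherwise an early
 * callback could see a zero count and unwind prematurely, freeing
 * the frame back to the mem-pool while a wind is still in flight. */
static local_t *getxattr_fop(int nchildren)
{
        local_t *local = calloc(1, sizeof(*local));
        local->call_cnt = nchildren;
        for (int i = 0; i < nchildren; i++)
                getxattr_cbk(local);  /* stands in for STACK_WIND */
        return local;
}
```

Without this accounting, the frame is returned to the mem-pool after the first reply, and the pool can hand the same pointer to the next fop, which is exactly the symptom reported here.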
Re: [Gluster-devel] Better organization for code documentation [Was: Developer Documentation for datastructures in gluster]
On Tue, Jul 22, 2014 at 7:35 AM, Kaushal M kshlms...@gmail.com wrote: Hey everyone, While I was writing the documentation for the options framework, I thought up a way to better organize the code documentation we are creating now. I've posted a patch for review that implements this organization. [1] Copying the description from the patch I've posted for review, ``` A new directory hierarchy has been created in doc/code for the code documentation, which follows the general GlusterFS source hierarchy. Each GlusterFS module has an entry in this tree. The source directory of every GlusterFS module has a symlink, 'doc', to its corresponding directory in the doc/code tree. Taking glusterd for example: with this scheme, there will be a doc/code/xlators/mgmt/glusterd directory which will contain the documentation relevant to glusterd. This directory will be symlinked to xlators/mgmt/glusterd/src/doc. This organization should allow for easy reference by developers when developing on GlusterFS and also allow for easy hosting of the documents when we set that up. ``` I haven't read the previous thread, but having a doc dir co-exist with src in each module would encourage (or at least remind) keeping docs updated along with src changes. It is generally recommended not to store symlinks in the source repo (though git supports them, I think). You could create the symlinks from the top-level doc/code to each module (or vice versa) in autogen.sh. Thanks ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Fwd: Re: can not build glusterfs3.5.1 on solaris because of missing sys/cdefs.h
Copying gluster-devel@ Thanks for reporting, Michael. I guess we need to forward-port that old change. Can you please send out a patch to gerrit? Thanks! On 7/16/14, 2:36 AM, 马忠 wrote: Hi Avati, I tried to build the latest glusterfs 3.5.1 on solaris11.1, but it stopped because of a missing sys/cdefs.h. I've checked the ChangeLog and found that commit a5301c874f978570187c3543b0c3a4ceba143c25 had once solved such a problem in the obsolete file libglusterfsclient/src/libglusterfsclient.h. I don't understand why it appeared again in the later-added file api/src/glfs.h. Can you give me any suggestion about this problem? thanks.
--
[root@localhost glusterfs]# git show a5301c874f978570187c3543b0c3a4ceba143c25
commit a5301c874f978570187c3543b0c3a4ceba143c25
Author: Anand V. Avati av...@amp.gluster.com
Date: Mon May 18 17:24:16 2009 +0530

    workaround for not including sys/cdefs.h -- including sys/cdefs.h breaks build on solaris and other platforms

diff --git a/libglusterfsclient/src/libglusterfsclient.h b/libglusterfsclient/src/libglusterfsclient.h
index 1c2441b..5376985 100755
--- a/libglusterfsclient/src/libglusterfsclient.h
+++ b/libglusterfsclient/src/libglusterfsclient.h
@@ -20,7 +20,22 @@
 #ifndef _LIBGLUSTERFSCLIENT_H
 #define _LIBGLUSTERFSCLIENT_H
-#include <sys/cdefs.h>
+#ifndef __BEGIN_DECLS
+#ifdef __cplusplus
+#define __BEGIN_DECLS extern "C" {
+#else
--
root@solaris:~/glusterfs-3.5.1# gmake
gmake[3]: Entering directory `/root/glusterfs-3.5.1/api/src'
CC libgfapi_la-glfs.lo
In file included from glfs.c:50:
glfs.h:41:23: sys/cdefs.h: No such file or directory
In file included from glfs.c:50:
glfs.h:57: error: syntax error before struct
In file included from glfs.c:51:
glfs-internal.h:57: error: syntax error before struct
gmake[3]: *** [libgfapi_la-glfs.lo] Error 1
gmake[3]: Leaving directory `/root/glusterfs-3.5.1/api/src'
gmake[2]: *** [all-recursive] Error 1
gmake[2]: Leaving directory `/root/glusterfs-3.5.1/api'
gmake[1]: *** [all-recursive] Error 1
gmake[1]: Leaving directory `/root/glusterfs-3.5.1'
gmake: *** [all] Error 2
--
Thanks in advance, Michael ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
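The workaround in commit a5301c8 (truncated in the quote above) follows a common portability pattern: instead of including <sys/cdefs.h>, define __BEGIN_DECLS/__END_DECLS when they are missing. A self-contained sketch, where glfs_dummy() is a placeholder rather than a real libgfapi symbol:

```c
#include <assert.h>

/* Fallback for platforms (Solaris, etc.) that lack <sys/cdefs.h>.
 * On glibc these macros already exist and the #ifndef is a no-op. */
#ifndef __BEGIN_DECLS
#ifdef __cplusplus
#define __BEGIN_DECLS extern "C" {
#define __END_DECLS }
#else
#define __BEGIN_DECLS
#define __END_DECLS
#endif
#endif

__BEGIN_DECLS
/* placeholder prototype standing in for the glfs.h declarations */
int glfs_dummy(void);
__END_DECLS

int glfs_dummy(void)
{
        return 0;
}
```

The same header then compiles as both C and C++ on every platform, which is all that including <sys/cdefs.h> was buying in the first place.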
Re: [Gluster-devel] inode linking in GlusterFS NFS server
On Mon, Jul 7, 2014 at 12:48 PM, Raghavendra Bhat rab...@redhat.com wrote: Hi, As per my understanding, the nfs server is not doing inode linking in the readdirp callback. Because of this there might be some errors while dealing with virtual inodes (or gfids). As of now the meta, gfid-access and snapview-server (used for user-serviceable snapshots) xlators make use of virtual inodes with random gfids. The situation is this: say the user-serviceable snapshots feature has been enabled and there are 2 snapshots (snap1 and snap2). Let /mnt/nfs be the nfs mount. The snapshots can be accessed by entering the .snaps directory. Now if the snap1 directory is entered and *ls -l* is done (i.e. cd /mnt/nfs/.snaps/snap1 and then ls -l), the readdirp fop is sent to the snapview-server xlator (which is part of a daemon running for the volume), which talks to the corresponding snapshot volume and gets the dentry list. Before unwinding it would have generated random gfids for those dentries. Now the nfs server, upon getting the readdirp reply, will associate each gfid with the filehandle created for the entry. But without linking the inode, it sends the readdirp reply back to the nfs client. The next time the nfs client makes some operation on one of those filehandles, the nfs server tries to resolve it by finding the inode for the gfid present in the filehandle. But since the inode was not linked in readdirp, the inode_find operation fails, and it tries to do a hard resolution by sending a lookup on that gfid to the normal main graph. (The information on whether the call should be sent to the main graph or to snapview-server would be present in the inode context. But here the lookup has come on a gfid with a newly created inode where the context is not there yet, so the call is sent to the main graph itself.) But since the gfid is a randomly generated virtual gfid (not present on disk), the lookup operation fails with an error.
As per my understanding, this can happen with any xlator that deals with virtual inodes (by generating random gfids). I can think of these 2 methods to handle this: 1) do inode linking for readdirp also in the nfs server; 2) if the lookup operation fails, the snapview-client xlator (which actually redirects fops on the snapshot world to snapview-server by looking into the inode context) should check whether the failed lookup is a nameless lookup. If so, AND the gfid of the inode is NULL, AND the lookup has come from the main graph, then instead of unwinding the lookup with failure, send it to snapview-server, which might be able to find the inode for the gfid (as the gfid was generated by itself, it should be able to find the inode for that gfid unless it has been purged from the inode table). Please let me know if I have missed anything. Please provide feedback. That's right. The NFS server should be linking readdirp_cbk inodes just like FUSE or protocol/server does. It has been OK without virtual gfids thus far. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
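The failure mode can be illustrated with a toy inode table, a tiny stand-in for the real inode_link()/inode_find() keyed by gfid: if readdirp hands out a filehandle without linking the inode, the later find on that gfid misses, and for a virtual gfid the fallback gfid lookup has nowhere to go.

```c
#include <assert.h>
#include <string.h>

/* Toy inode table keyed by 16-byte gfid. */
#define TABLE_MAX 16
static unsigned char table[TABLE_MAX][16];
static int table_used;

static void inode_link_toy(const unsigned char gfid[16])
{
        if (table_used < TABLE_MAX)
                memcpy(table[table_used++], gfid, 16);
}

/* Returns 1 if the gfid resolves from the table (the "soft" resolve);
 * 0 forces the hard resolution that fails for virtual gfids. */
static int inode_find_toy(const unsigned char gfid[16])
{
        for (int i = 0; i < table_used; i++)
                if (memcmp(table[i], gfid, 16) == 0)
                        return 1;
        return 0;
}

/* readdirp handler: 'do_link' mirrors whether the server performs
 * the inode_link() step in readdirp_cbk that this thread argues for. */
static void readdirp_entry(const unsigned char gfid[16], int do_link)
{
        if (do_link)
                inode_link_toy(gfid);
        /* ...a filehandle carrying this gfid goes back to the client... */
}
```

With do_link=0 (today's NFS-server behavior) the subsequent find misses and the request falls through to a gfid lookup on the main graph, which cannot succeed for a randomly generated virtual gfid; with do_link=1 the filehandle resolves locally.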
Re: [Gluster-devel] triggers for sending inode forgets
On Fri, Jul 4, 2014 at 8:17 PM, Pranith Kumar Karampuri pkara...@redhat.com wrote: On 07/05/2014 08:17 AM, Anand Avati wrote: On Fri, Jul 4, 2014 at 7:03 PM, Pranith Kumar Karampuri pkara...@redhat.com wrote: hi, I work on glusterfs and was debugging a memory leak. Need your help in figuring out if something is done properly or not. When a file is looked up for the first time in gluster through fuse, gluster remembers the parent-inode and basename for that inode. Whenever an unlink/rmdir/(lookup giving ENOENT) happens, the corresponding forgetting of parent-inode, basename happens. This is because the path resolver explicitly calls d_invalidate() on a dentry when d_revalidate() fails on it. In all other cases it relies on fuse to send a forget for an inode to release these associations. I was wondering what the trigger points are for fuse sending forgets. Let's say M0 and M1 are fuse mounts of the same volume. 1) Mount M0 creates a file 'a' 2) Mount M1 deletes file 'a' M0 never touches 'a' anymore. Will a forget be sent on the inode of 'a'? If yes, when? Really depends on when the memory manager decides to start reclaiming memory from the dcache due to memory pressure. If the system is not under memory pressure, and if the stale dentry is never encountered by the path resolver, the inode may never receive a forget. To keep a tight utilization limit on the inode/dcache, you will have to proactively fuse_notify_inval_entry on old/deleted files. Thanks for this info, Avati. I see that in fuse-bridge for glusterfs there is a setxattr interface to do that. Is that what you are referring to? In glusterfs, fuse-bridge.c:fuse_invalidate_entry() is the function you want to look at. The setxattr() interface is just for testing the functionality. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] [PATCH] fuse: ignore entry-timeout on LOOKUP_REVAL
The following test case demonstrates the bug:

sh# mount -t glusterfs localhost:meta-test /mnt/one
sh# mount -t glusterfs localhost:meta-test /mnt/two
sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; echo stuff > /mnt/one/file
bash: /mnt/one/file: Stale file handle
sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; sleep 1; echo stuff > /mnt/one/file

On the second open() on /mnt/one, FUSE would have used the old nodeid (file handle) trying to re-open it. Gluster is returning -ESTALE. The ESTALE propagates back to namei.c:filename_lookup() where lookup is re-attempted with LOOKUP_REVAL. The right behavior now would be for FUSE to ignore the entry-timeout and do the up-call revalidation. Instead FUSE is ignoring LOOKUP_REVAL, succeeding the revalidation (because entry-timeout has not passed), and open() is again retried on the old file handle and finally the ESTALE is going back to the application. Fix: if revalidation is happening with LOOKUP_REVAL, then ignore entry-timeout and always do the up-call. Signed-off-by: Anand Avati av...@redhat.com
---
 fs/fuse/dir.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 4219835..4eaa30d 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -198,7 +198,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 	inode = ACCESS_ONCE(entry->d_inode);
 	if (inode && is_bad_inode(inode))
 		goto invalid;
-	else if (fuse_dentry_time(entry) < get_jiffies_64()) {
+	else if (fuse_dentry_time(entry) < get_jiffies_64() || (flags & LOOKUP_REVAL)) {
 		int err;
 		struct fuse_entry_out outarg;
 		struct fuse_req *req;
--
1.7.1
___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
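The one-line change above reduces to a small predicate. Here is a self-contained sketch of the before/after logic — dentry time and jiffies are plain integers here, and while 0x0020 matches the kernel's LOOKUP_REVAL value, treat it as illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define LOOKUP_REVAL 0x0020  /* illustrative; the kernel defines this in namei.h */

/* Before the patch: up-call revalidation only happens when the cached
 * entry has expired. After the patch: also up-call whenever the lookup
 * is a LOOKUP_REVAL retry after an ESTALE, even if the entry-timeout
 * has not yet passed. */
static int
must_revalidate (uint64_t dentry_time, uint64_t now, unsigned int flags)
{
        return (dentry_time < now) || (flags & LOOKUP_REVAL);
}
```

With the old logic, the third case below (unexpired entry, LOOKUP_REVAL set) would skip the up-call and re-return ESTALE to the application.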
Re: [Gluster-devel] Regarding doing away with refkeeper in locks xlator
On 6/3/14, 11:32 PM, Pranith Kumar Karampuri wrote: On 06/04/2014 11:37 AM, Krutika Dhananjay wrote: Hi, Recently there was a crash in locks translator (BZ 1103347, BZ 1097102) with the following backtrace: (gdb) bt #0 uuid_unpack (in=0x8 Address 0x8 out of bounds, uu=0x7fffea6c6a60) at ../../contrib/uuid/unpack.c:44 #1 0x7feeba9e19d6 in uuid_unparse_x (uu=value optimized out, out=0x2350fc0 081bbc7a-7551-44ac-85c7-aad5e2633db9, fmt=0x7feebaa08e00 %08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x) at ../../contrib/uuid/unparse.c:55 #2 0x7feeba9be837 in uuid_utoa (uuid=0x8 Address 0x8 out of bounds) at common-utils.c:2138 #3 0x7feeb06e8a58 in pl_inodelk_log_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:396 #4 pl_inodelk_client_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:428 #5 0x7feeb06ddf3a in pl_client_disconnect_cbk (this=0x230d910, client=value optimized out) at posix.c:2550 #6 0x7feeba9fa2dd in gf_client_disconnect (client=0x27724a0) at client_t.c:368 #7 0x7feeab77ed48 in server_connection_cleanup (this=0x2316390, client=0x27724a0, flags=value optimized out) at server-helpers.c:354 #8 0x7feeab77ae2c in server_rpc_notify (rpc=value optimized out, xl=0x2316390, event=value optimized out, data=0x2bf51c0) at server.c:527 #9 0x7feeba775155 in rpcsvc_handle_disconnect (svc=0x2325980, trans=0x2bf51c0) at rpcsvc.c:720 #10 0x7feeba776c30 in rpcsvc_notify (trans=0x2bf51c0, mydata=value optimized out, event=value optimized out, data=0x2bf51c0) at rpcsvc.c:758 #11 0x7feeba778638 in rpc_transport_notify (this=value optimized out, event=value optimized out, data=value optimized out) at rpc-transport.c:512 #12 0x7feeb115e971 in socket_event_poll_err (fd=value optimized out, idx=value optimized out, data=0x2bf51c0, poll_in=value optimized out, poll_out=0, poll_err=0) at socket.c:1071 #13 socket_event_handler (fd=value optimized out, idx=value optimized out, data=0x2bf51c0, poll_in=value optimized out, poll_out=0, poll_err=0) at socket.c:2240 #14 
0x7feeba9fc6a7 in event_dispatch_epoll_handler (event_pool=0x22e2d00) at event-epoll.c:384 #15 event_dispatch_epoll (event_pool=0x22e2d00) at event-epoll.c:445 #16 0x00407e93 in main (argc=19, argv=0x7fffea6c7f88) at glusterfsd.c:2023 (gdb) f 4 #4 pl_inodelk_client_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:428 428    pl_inodelk_log_cleanup (l); (gdb) p l->pl_inode->refkeeper $1 = (inode_t *) 0x0 (gdb) pl_inode->refkeeper was found to be NULL even when there were some blocked inodelks in a certain domain of the inode, which when dereferenced by the epoll thread in the cleanup codepath led to a crash. On inspecting the code (for want of a consistent reproducer), three things were found: 1. The function where the crash happens (pl_inodelk_log_cleanup()), makes an attempt to resolve the inode to path as can be seen below. But the way inode_path() itself works is to first construct the path based on the given inode's ancestry and place it in the buffer provided. And if all else fails, the gfid of the inode is placed in a certain format (<gfid:%s>). This eliminates the need for statements from line 4 through 7 below, thereby preventing dereferencing of pl_inode->refkeeper. Now, although this change prevents the crash altogether, it still does not fix the race that led to pl_inode->refkeeper becoming NULL, and comes at the cost of printing (null) in the log message on line 9 every time pl_inode->refkeeper is found to be NULL, rendering the logged messages somewhat useless.

code
 0  pl_inode = lock->pl_inode;
 1
 2  inode_path (pl_inode->refkeeper, NULL, &path);
 3
 4  if (path)
 5          file = path;
 6  else
 7          file = uuid_utoa (pl_inode->refkeeper->gfid);
 8
 9  gf_log (THIS->name, GF_LOG_WARNING,
10          "releasing lock on %s held by "
11          "{client=%p, pid=%"PRId64" lk-owner=%s}",
12          file, lock->client, (uint64_t) lock->client_pid,
13          lkowner_utoa (&lock->owner));
\code

I think this logging code is from the days when the gfid-handle concept was not there.
So it wasn't returning <gfid:gfid-str> in cases where the path is not present in the dentries. I believe the else block can be deleted safely now. Pranith 2. There is at least one codepath found that can lead to this crash: Imagine an inode on which an inodelk operation is attempted by a client and is successfully granted too. Now, between the time the lock was granted and pl_update_refkeeper() was called by this thread, the client could send a DISCONNECT event, causing the cleanup codepath to be executed, where the epoll thread crashes on dereferencing pl_inode->refkeeper, which is STILL NULL at this point. Besides, there are still places in the locks xlator where the refkeeper is NOT updated whenever the lists are modified - for instance in the cleanup codepath from a
Re: [Gluster-devel] Regarding doing away with refkeeper in locks xlator
On 6/4/14, 9:43 PM, Krutika Dhananjay wrote: *From: *Pranith Kumar Karampuri pkara...@redhat.com *To: *Krutika Dhananjay kdhan...@redhat.com, Anand Avati aav...@redhat.com *Cc: *gluster-devel@gluster.org *Sent: *Wednesday, June 4, 2014 12:23:59 PM *Subject: *Re: [Gluster-devel] Regarding doing away with refkeeper in locks xlator On 06/04/2014 12:02 PM, Pranith Kumar Karampuri wrote: On 06/04/2014 11:37 AM, Krutika Dhananjay wrote: Hi, Recently there was a crash in locks translator (BZ 1103347, BZ 1097102) with the following backtrace: [...]
Re: [Gluster-devel] Need sensible default value for detecting unclean client disconnects
Niels, This is a good addition. While gluster clients do a reasonably good job at detecting dead/hung servers with ping-timeout, the server-side detection has been rather weak. TCP_KEEPALIVE has helped to some extent, for cases where an idling client (which holds a lock) goes dead. However, if an active client with pending data in the server's socket buffer dies, we have been left waiting for the long TCP retransmission back-off to finish before giving up. The way I see it, this option is complementary to TCP_KEEPALIVE (keepalive works for idle and only idle connections, user_timeout works only when there are pending acknowledgements, thus covering the full spectrum). To that end, it might make sense to present the admin a single timeout configuration value rather than two. It would be very frustrating for the admin to configure one of them to, say, 30 seconds, and then find that the server does not clean up after 30 seconds of a hung client only because the connection was idle (or not idle). Configuring a second timeout for the other case can be very unintuitive. In fact, I would suggest having a single network timeout configuration, which gets applied to all three: ping-timeout on the client, user_timeout on the server, keepalive on both. I think that is what a user would be expecting anyway. Each is for a slightly different technical situation, but all just internal details as far as a user is concerned. Thoughts? On Tue, May 20, 2014 at 4:30 AM, Niels de Vos nde...@redhat.com wrote: Hi all, the last few days I've been looking at a problem [1] where a client locks a file over a FUSE-mount, and a 2nd client tries to grab that lock too. It is expected that the 2nd client gets blocked until the 1st client releases the lock. This all works as long as the 1st client cleanly releases the lock.
Whenever the 1st client crashes (like a kernel panic) or the network is split and the 1st client is unreachable, the 2nd client may not get the lock until the bricks detect that the connection to the 1st client is dead. If there are pending replies, the bricks may need 15-20 minutes until the re-transmissions of the replies have timed out. The current default of 15-20 minutes is quite long for a fail-over scenario. Relatively recently [2], the Linux kernel got a TCP_USER_TIMEOUT socket option (similar to TCP_KEEPALIVE). This option can be used to configure a per-socket timeout, instead of a system-wide configuration through the net.ipv4.tcp_retries2 sysctl. The default network.ping-timeout is set to 42 seconds. I'd like to propose a network.tcp-timeout option that can be set per volume. This option should then set TCP_USER_TIMEOUT for the socket, which causes re-transmission failures to be fatal after the timeout has passed. Now the remaining question: what shall be the default timeout in seconds for this new network.tcp-timeout option? I'm currently thinking of making it high enough (like 5 minutes) to prevent false positives. Thoughts and comments welcome, Niels [1] https://bugzilla.redhat.com/show_bug.cgi?id=1099460 [2] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c7 ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
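Wiring the proposed network.tcp-timeout option onto the socket would look roughly like this. The setsockopt() call and TCP_USER_TIMEOUT constant are the real Linux API (kernel 2.6.37+); the wrapper function name and the seconds-to-milliseconds convention are just a sketch:

```c
#include <assert.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Make unacknowledged pending data fatal after timeout_sec seconds,
 * instead of waiting out the full TCP retransmission back-off
 * (typically 15-20 minutes). Complementary to TCP_KEEPALIVE, which
 * only covers idle connections. */
static int
set_tcp_user_timeout (int sockfd, unsigned int timeout_sec)
{
        unsigned int timeout_ms = timeout_sec * 1000; /* option unit is ms */

        return setsockopt (sockfd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                           &timeout_ms, sizeof (timeout_ms));
}
```

With the suggested 5-minute default, the transport would call set_tcp_user_timeout(fd, 300) right after accepting or connecting the socket.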
Re: [Gluster-devel] New project on the Forge - gstatus
KP, Vipul, It will be awesome to get io-stats-like instrumentation on the client side. Here are some further thoughts on how to implement that. If you have a recent git HEAD build, I would suggest that you explore the latency stats on the client side exposed through meta at $MNT/.meta/graphs/active/$xlator/profile. You can enable latency measurement with echo 1 > $MNT/.meta/measure_latency. I would suggest extending these stats with the extra ones io-stats has, and making glusterfsiostats expose these stats. If you compare libglusterfs/src/latency.c:gf_latency_begin(), gf_latency_end() and gf_latency_update() with the macros in io-stats.c UPDATE_PROFILE_STATS() and START_FOP_LATENCY(), you will quickly realize how a lot of logic is duplicated between io-stats and latency.c. If you can enhance latency.c and make it capture the remaining stats that io-stats is capturing, the benefits of this approach would be:
- stats are already getting captured at all xlator levels, and not just at the position where io-stats is inserted
- the file-like interface makes the stats more easily inspectable and consumable, and updated on the fly
- it conforms with the way the rest of the internals are exposed through $MNT/.meta
In order to do this, you might want to look into:
- latency.c as of today captures fop count, mean latency and total time, whereas io-stats measures these along with min-time, max-time and a block-size histogram
- extend gf_proc_dump_latency_info() to dump the new stats
- either prettify that output like the 'volume profile info' output, or JSONify it like xlators/meta/src/frames-file.c
- add support for cumulative vs interval stats (store an extra copy of this->latencies[]) etc.
Thanks! On Fri, Apr 25, 2014 at 9:09 PM, Krishnan Parthasarathi kpart...@redhat.com wrote: [Resending due to gluster-devel mailing list issue] Apologies for the late reply.
glusterd uses its socket connection with brick processes (where the io-stats xlator is loaded) to gather information from io-stats via an RPC request. This facility is restricted to brick processes as it stands today. Some background ... The io-stats xlator is loaded both in GlusterFS mounts and brick processes. So, we have the capabilities to monitor I/O statistics on both sides. To collect I/O statistics at the server side, we have # gluster volume profile VOLNAME [start | info | stop] AND # gluster volume top VOLNAME info [and other options] We don't have a usable way of gathering I/O statistics (not monitoring, though the counters could be enhanced) at the client side, i.e. for a given mount point. This is the gap glusterfsiostat aims to fill. We need to remember that the machines hosting GlusterFS mounts may not have glusterd installed on them. We are considering rrdtool as a possible statistics database because it seems like a natural choice for storing time-series data. rrdtool is capable of answering high-level statistical queries on statistics that were logged in it by the io-stats xlator, over and above printing running counters periodically. Hope this gives some more clarity on what we are thinking. thanks, Krish - Original Message - Probably me not understanding. The comment "iostats making data available to glusterd over RPC" is what I latched on to. I wondered whether this meant that a socket could be opened that way to get at the iostats data flow. Cheers, PC - Original Message - From: Vipul Nayyar nayyar_vi...@yahoo.com To: Paul Cuzner pcuz...@redhat.com, Krishnan Parthasarathi kpart...@redhat.com Cc: Vijay Bellur vbel...@redhat.com, gluster-devel gluster-de...@nongnu.org Sent: Thursday, 20 February, 2014 5:06:27 AM Subject: Re: [Gluster-devel] New project on the Forge - gstatus Hi Paul, I'm really not sure if this can be done in python (at least comfortably). Maybe we can tread on the same path as Justin's glusterflow in python.
But I don't think all the io-stats counters will be available with the way Justin used Jeff Darcy's previous work to build his tool. I could be wrong. My knowledge is a bit incomplete and based on limited experience as a user and an amateur Gluster developer. Please do correct me if I am wrong. Regards Vipul Nayyar ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
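To make the latency.c extension discussed earlier in this thread concrete: today it tracks fop count, running mean and total, and the proposal is to add io-stats-style min/max. A minimal sketch, with hypothetical struct and function names (the real gf_latency_* code differs):

```c
#include <assert.h>
#include <stdint.h>

/* Per-fop running latency stats: count/mean/total mirror what
 * latency.c keeps today; min/max are the proposed io-stats-style
 * additions. */
typedef struct {
        uint64_t count;
        double   mean;   /* running mean, usec */
        double   total;  /* total elapsed, usec */
        double   min;
        double   max;
} fop_latency_t;

static void
latency_update (fop_latency_t *lat, double elapsed_usec)
{
        if (lat->count == 0 || elapsed_usec < lat->min)
                lat->min = elapsed_usec;
        if (elapsed_usec > lat->max)
                lat->max = elapsed_usec;

        lat->total += elapsed_usec;
        lat->count++;
        lat->mean = lat->total / lat->count;
}
```

Cumulative-vs-interval stats then fall out naturally: keep a second fop_latency_t per fop and zero it at each interval boundary.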
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
On Thu, May 15, 2014 at 5:49 PM, Pranith Kumar Karampuri pkara...@redhat.com wrote: hi, The latest build I fired for review.gluster.com/7766 (http://build.gluster.org/job/regression/4443/console) failed because of a spurious failure. The script doesn't wait for the NFS export to become available. I fixed that, but interestingly I found quite a few scripts with the same problem. Some of the scripts rely on 'sleep 5', which could also lead to spurious failures if the export is not available within 5 seconds. We found that waiting for 20 seconds is better, but 'sleep 20' would unnecessarily delay the build execution. So if you guys are going to write any scripts that have to do NFS mounts, please do it the following way: EXPECT_WITHIN 20 1 is_nfs_export_available; TEST mount -t nfs -o vers=3 $H0:/$V0 $N0; Please also always add 'mount -o soft,intr' in the regression scripts for mounting NFS. It becomes so much easier to clean up any hung mess. We probably need an NFS mounting helper function which can be called like: TEST mount_nfs $H0:/$V0 $N0; Thanks Avati ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Automatically building RPMs upon patch submission?
On Mon, May 12, 2014 at 4:23 PM, Justin Clift jus...@gluster.org wrote: On 12/05/2014, at 9:04 PM, Anand Avati wrote: snip And yeah, the other reason: if a dev pushes a series/set of dependent patches, regression needs to run only on the last one (regression test/voting is cumulative for the set). Running regression on all the individual patches (like a smoke test) would be very wasteful, and tricky to avoid (this was the part which I couldn't solve) What's the manual with intelligence required process we use to do this atm? eg for people wanting to test the combined patch set Side note - I'm mucking around with the Gerrit Trigger plugin in a VM on my desktop running Jenkins. So if you see any strangeness in things with Gerrit comments, it could be me. (feel free to ping me as needed) http://build.gluster.org/job/regression/build - key in the gerrit patch number for the CHANGE_ID field, and click 'Build'. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] dht: selfheal of missing directories on nameless (by GFID) LOOKUP
On Sun, May 4, 2014 at 9:22 AM, Niels de Vos nde...@redhat.com wrote: Hi, bug 1093324 has been opened and we have identified the following cause:
1. an NFS-client does a LOOKUP of a directory on a volume
2. the NFS-client receives a filehandle (contains volume-id + GFID)
3. add-brick is executed, but the new brick does not have any directories yet
4. the NFS-client creates a new file in the directory; this request is in the format of filehandle/filename, the filehandle was received in step 2
5. the NFS-server does a LOOKUP on the parent directory identified by the filehandle - a nameless LOOKUP, only the GFID is known
6. the old brick(s) return successfully
7. the new brick returns ESTALE
8. the NFS-server returns ESTALE to the NFS-client
In this case, the NFS-client should not receive an ESTALE. There is also no ESTALE error passed to the client when this procedure is done over FUSE or samba/libgfapi. Selfhealing a directory entry based only on a GFID is not always possible. Files do not have a unique filename (hardlinks), so it is not trivial to find a filename for a GFID (an expensive operation, and the result could be a list). However, for a directory this is simpler. A directory is not hardlink'd in the .glusterfs directory; directories are maintained as symbolic links. This makes it possible to find the name of a directory when only the GFID is known. Currently DHT is not able to selfheal directories on a nameless LOOKUP. I think that it should be possible to change this, and to fix the ESTALE returned by the NFS-server. At least two changes would be needed, and this is where I would like to hear opinions from others about it: - The posix-xlator should be able to return the directory name when a GFID is given. This can be part of the LOOKUP-reply (dict), and that would add a readlink() syscall for each nameless LOOKUP that finds a directory. Or (suggested by Pranith) add a virtual xattr and handle this specific request with an additional FGETXATTR call.
I think the LOOKUP-reply with readlink() is better, instead of a new over-the-wire FOP. - DHT should selfheal the directory when at least one ESTALE is returned by the bricks. This also makes sense, except if even the parent directory is missing on that server (yet to be healed). Another important point to note is that the directories (with the same GFID) may themselves be present at various locations as various dentries on the many servers. A lookup of dir-gfid/name should succeed transparently, independent of the differing dir-gfid's dentries across servers. However, if you want to heal, the choice of the server from which you select the dir's parent and name becomes important, as the self-heal will impose that on the other servers. E.g. one of the AFR subvolumes may not have healed the parent directories yet. Or, the N-1 servers may each return a different par-gfid/dir-name in the LOOKUP reply. So it can quickly get hairy. As a general approach, using the LOOKUP-reply to send parent info from the posix level makes sense. But we also need a more detailed proposal on how that info is used at the cluster xlator levels to achieve a higher-level goal, like self-heal. When all bricks return ESTALE, the ESTALE is valid and should be passed on to the upper layers (NFS-server -> NFS-client). Yes. Thanks ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
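The readlink()-based idea hinges on the on-disk layout: for a directory, the .glusterfs/<gg>/<gg>/<gfid> entry is a symlink whose target has the form ../../<pp>/<pp>/<parent-gfid>/<basename>, so one readlink() recovers both the parent gfid and the directory's own name. The path layout is the real .glusterfs convention; the parsing function below is an illustrative sketch, not posix-xlator code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse a .glusterfs directory-symlink target of the form
 * "../../ab/cd/<parent-gfid>/<basename>" into its parent gfid and
 * basename components. Returns 0 on success, -1 on a malformed target. */
static int
gfid_link_to_name (const char *link_target,
                   char *parent_gfid, size_t pg_len,
                   char *name, size_t name_len)
{
        const char *slash = strrchr (link_target, '/');
        if (!slash)
                return -1;

        /* basename is everything after the last '/' */
        snprintf (name, name_len, "%s", slash + 1);

        /* parent gfid is the component between the "../../ab/cd/"
         * prefix (fixed 12 chars) and the basename */
        size_t pfx = strlen ("../../ab/cd/");
        if ((size_t)(slash - link_target) <= pfx)
                return -1;
        snprintf (parent_gfid, pg_len, "%.*s",
                  (int)(slash - link_target - pfx), link_target + pfx);
        return 0;
}
```

In the posix-xlator this would run on the buffer returned by readlink() on the gfid handle, and the two components could then be stuffed into the LOOKUP-reply dict for the cluster xlators to consume.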