Re: [Gluster-devel] AFR conservative merge portability
Ravishankar N <ravishan...@redhat.com> wrote:
> The check can be done in metadata selfheal itself, but I don't think
> AFR is to blame if the user space app on NetBSD sends a setattr on
> the parent dir.

That is not exactly the problem. When adding an entry, the parent
directory's mtime/ctime must be updated. On NetBSD, the
filesystem-independent code in the kernel (that is, above the VFS) does
it. On Linux, it seems to be the responsibility of the filesystem
itself (below the VFS). Since this is an in-kernel implementation
detail, no standard says which OS is right: both are. The only blame we
can put on AFR is that it assumes the Linux behavior, hence the
proposal to make it portable.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
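To make the difference concrete, here is a minimal userland sketch
(illustrative only, not AFR or kernel code; the function name is made
up) of the portable approach: after creating an entry, stamp the
parent's times explicitly instead of relying on whichever kernel layer
happens to do it:

    /*
     * Sketch: create an entry, then explicitly update the parent
     * directory's timestamps so the observable result is the same
     * whether the kernel updates them above the VFS (NetBSD) or the
     * filesystem does it below the VFS (Linux).
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <time.h>

    static int create_entry_portable(int parent_fd, const char *name)
    {
        if (mkdirat(parent_fd, name, 0755) != 0) {
            perror("mkdirat");
            return -1;
        }
        /* tv_nsec == UTIME_NOW sets atime/mtime to the current time;
         * the utimensat() call itself bumps the parent's ctime. */
        struct timespec now[2] = {
            { .tv_nsec = UTIME_NOW },
            { .tv_nsec = UTIME_NOW },
        };
        if (utimensat(parent_fd, ".", now, 0) != 0) {
            perror("utimensat");
            return -1;
        }
        return 0;
    }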
Re: [Gluster-devel] Portable filesystem ACLs?
On Sun, Dec 14, 2014 at 07:24:52PM +0530, Soumya Koduri wrote:
> On 12/12/2014 11:06 PM, Niels de Vos wrote:
> > Hi,
> >
> > I started to look into getting some form of support for ACLs in
> > gfapi. After a short discussion with Shyam, some investigation
> > showed that our current implementation of ACLs is not very
> > portable. There definitely seem to be issues with ACLs when a FUSE
> > mount is used on FreeBSD and the bricks are on a Linux system.
> >
> > Our current implementation of (POSIX) ACLs is very much focussed on
> > the Linux behaviour. For example, there is the assumption that ACLs
> > are stored in the system.posix_acl_access extended attribute.
> > FreeBSD uses a system.posix1e.acl_access xattr. Other platforms
> > likely use another variation. The (binary) encoding of the contents
> > most definitely differs per platform as well.
> >
> > In order to provide a good experience with ACLs on different
> > platforms, we could introduce a solution like this:
> >
> >     setfacl
> >       |
> >       v
> >     glusterfs client (like fuse)
> >       |
> >       v
> >     some API, possibly transparent in the posix-acl xlator
> >       |  converts the client-platform specific ACL into
> >       |  a Gluster ACL format
> >       v
> >     Outgoing RPC procedure, a new SET_ACL, or as SETXATTR(gluster.acl)
> >       |
> >       v
> >     [network]
> >       |
> >       v
> >     Incoming RPC procedure on the brick (can be a different platform)
> >       |
> >       v
> >     Conversion from the Gluster ACL format to the platform-specific
> >       |  one, possibly in the storage/posix xlator
> >       v
> >     setfacl() syscall/library call to store the ACL on the filesystem
> >
> > Reading the ACL would be the same, just in reverse.
> >
> > It would be most welcome to have some kind of API that can get
> > exposed in gfapi, so that NFS-Ganesha and other gfapi applications
> > can get/set ACLs in a standardized way. One option is to use the
> > (possibly platform dependent) structures defined by libacl or
> > librichacl.
>
> From a quick look at the code, 'libacl' seems to support only the
> 'system.posix_acl_access' xattr. IMO it may be good to have a
> standard 'acl_obj' structure in the RichACL format itself, which the
> multi-protocol team (CCed) has been looking at and which you have
> been suggesting. This structure can be exposed to the applications
> (NFS-Ganesha/Samba) to fill in ACL data and then call
> 'glfs_set_acl(..)', which in turn converts these 'acl_obj's to
> platform-dependent ACL xattrs and makes a syncop_setxattr call. And
> in the future, if we get RichACL support, we could just extend/modify
> this routine to send those ACLs as-is to the back-end.

Translating automatically between FreeBSD and Linux ACLs makes sense, I
think, as long as the only differences are trivial encoding and naming
issues. Translating between completely different ACL models could
quickly get more complicated than it's worth.

--b.
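For the conversion step, here is a minimal sketch of what the
client-side decoder could look like. The Linux layout shown (a
little-endian 32-bit version header, currently 2, followed by packed
8-byte entries) is the actual system.posix_acl_access format; the
'gluster_ace' type and the function names are hypothetical:

    /*
     * Decode the Linux 'system.posix_acl_access' binary xattr (le32
     * version header, then packed { le16 tag; le16 perm; le32 id; }
     * entries) into a neutral, platform-independent representation.
     * A FreeBSD decoder for 'system.posix1e.acl_access' would sit
     * beside this one.
     */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint16_t tag;   /* ACL_USER_OBJ, ACL_USER, ACL_GROUP_OBJ, ... */
        uint16_t perm;  /* rwx permission bits */
        uint32_t id;    /* uid/gid, or (uint32_t)-1 when unused */
    } gluster_ace;

    static uint16_t le16(const uint8_t *p)
    {
        return (uint16_t)(p[0] | p[1] << 8);
    }

    static uint32_t le32(const uint8_t *p)
    {
        return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
               (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    }

    /* Returns the number of ACEs decoded, or -1 on a malformed blob. */
    static int linux_acl_to_gluster(const void *xattr, size_t len,
                                    gluster_ace *out, int max_aces)
    {
        const uint8_t *p = xattr;
        if (len < 4 || le32(p) != 2)   /* POSIX_ACL_XATTR_VERSION */
            return -1;
        len -= 4;
        p += 4;
        if (len % 8 != 0 || (int)(len / 8) > max_aces)
            return -1;
        int n = 0;
        for (; len >= 8; len -= 8, p += 8, n++) {
            out[n].tag  = le16(p);
            out[n].perm = le16(p + 2);
            out[n].id   = le32(p + 4);
        }
        return n;
    }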
Re: [Gluster-devel] AFR conservative merge portability
> Here is a proposal: we know that at the end of a conservative merge,
> we should end up in the situation where the directory ctime/mtime is
> the ctime of the most recently added child.

Won't the directory mtime change as the result of a rename or unlink?
Neither of those would be reflected in the children's times (in the
unlink case the child no longer exists).

> And fortunately, as the conservative merge happens, the parent
> directory ctime/mtime are updated on each child addition, and we
> finish in the desired state. In other words, after a conservative
> merge, a parent directory metadata split brain involving only
> ctime/mtime can just be cleared by AFR without any harm.

Is there *any* case, not even necessarily involving conservative merge,
where it would be harmful to propagate the latest ctime/mtime to every
replica of a directory?
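A sketch of the resolution rule under discussion (illustrative only,
not AFR source): when replicas disagree only on a directory's
ctime/mtime, take the newest value of each across the replicas:

    #include <time.h>

    struct dir_times { time_t ctime; time_t mtime; };

    /* Clear a ctime/mtime-only split brain by propagating the latest
     * timestamps seen on any replica. */
    static struct dir_times
    resolve_time_split_brain(const struct dir_times *replica, int nreplicas)
    {
        struct dir_times r = replica[0];
        for (int i = 1; i < nreplicas; i++) {
            if (replica[i].ctime > r.ctime) r.ctime = replica[i].ctime;
            if (replica[i].mtime > r.mtime) r.mtime = replica[i].mtime;
        }
        return r;
    }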
[Gluster-devel] Readdir d_off encoding
With the changes present in [1] and [2], we encode the subvol ID in the
d_off, losing n + 1 bits in case the high-order n + 1 bits of the d_off
returned by the underlying xlator are not free. (Best to read the
commit message of [1] for the details :) )

Although not related to the latest patch, here is something to consider
for the future:

We now have DHT, AFR, EC(?), and DHT over DHT (Tier), all of which need
subvol encoding in the returned readdir offset. Due to this, the loss
of bits _may_ cause unwanted offset behavior when used in the current
scheme, as we would end up eating more bits than we do at present. Or,
IOW, we could be invalidating the assumption that both EXT4 and XFS are
tolerant in terms of the accuracy of the value presented back in
seekdir(), i.e. that a seekdir(val) actually seeks to the entry which
has the closest true offset.

Should we reconsider an in-memory _cookie_-like approach that can help
in this case? It would invalidate (some or all of, depending on the
implementation) the following properties that the current design
provides (from [1]):
- Nothing to remember in memory or evict old entries.
- Works fine across NFS server reboots and also NFS head failover.
- Tolerant to seekdir() to arbitrary locations.
But it would provide a more reliable readdir offset for use (when valid
and not evicted, say).

How would NFS adapt to this? Does Ganesha need a better scheme when
doing multi-head NFS failover?

Thoughts?

Shyam

[1] http://review.gluster.org/#/c/4711/
[2] http://review.gluster.org/#/c/8201/
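To make the bit arithmetic concrete, here is a simplified sketch of the
general encoding idea (not the exact scheme from [1]): with m
subvolumes we need n = ceil(log2(m)) ID bits plus one marker bit, so
every layer that transforms the d_off consumes another n + 1 high-order
bits:

    #include <assert.h>
    #include <stdint.h>

    static int bits_for(int nsubvols)   /* n = ceil(log2(nsubvols)) */
    {
        int n = 0;
        while ((1 << n) < nsubvols)
            n++;
        return n;
    }

    /* Fold the subvol index into the top bits; the highest bit marks
     * an encoded offset so 0 (the start offset) stays untouched. */
    static uint64_t doff_encode(uint64_t d_off, int subvol, int nsubvols)
    {
        int n = bits_for(nsubvols);
        assert((d_off >> (63 - n)) == 0);  /* top n + 1 bits must be free */
        return (1ULL << 63) | ((uint64_t)subvol << (63 - n)) | d_off;
    }

    static uint64_t doff_decode(uint64_t d_off, int nsubvols, int *subvol)
    {
        int n = bits_for(nsubvols);
        *subvol = (int)((d_off >> (63 - n)) & ((1ULL << n) - 1));
        return d_off & ((1ULL << (63 - n)) - 1);  /* backend d_off */
    }

With two such layers stacked (say DHT over DHT for Tier), the assert
above fires that much sooner, which is exactly the concern raised.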
Re: [Gluster-devel] Upcalls Infrastructure
> > - Is there a new connection from glusterfsd (upcall xlator) to a
> >   client accessing a file? If so, how does the upcall xlator reuse
> >   connections when the same client accesses multiple files, or does
> >   it?
>
> No. We are using the same connection which the client initiates to
> send in fops. Thanks for initially pointing me to the 'client_t'
> structure. As these connection details are available only in the
> server xlator, I am passing them to the upcall xlator by storing them
> in 'frame->root->client'.
>
> > - In the event of a network separation (i.e. a partition) between a
> >   client and a server, how does the client discover or detect that
> >   the server has 'freed' up its previously registered upcall
> >   notification?
>
> The RPC connection details of each client are stored based on the
> client-uid. So in case of a network partition, when the client comes
> back online, IMO it re-initiates the connection (along with a new
> client-uid).

How would a client discover that a server has purged its upcall
entries? For instance, a client could assume that the server would
notify it about changes as before (while the server has purged the
client's upcall entries) and assume that it still holds the lease/lock.
How would you avoid that? Please correct me if that's not the case.

> So there will be new entries created/added in this xlator. However,
> we still need to decide on how to clean up the old timed-out and
> stale entries:
> * either clean up the entries as and when we find an expired or stale
>   entry (in case the notification fails),
> * or spawn a new thread which periodically scans through this list
>   and cleans up those entries.

There are a couple of things to consider for resource cleanup in this
context. 1) The time of cleanup; e.g. on expiry of a timer. 2) The
order of cleanup; this involves clearly establishing the relationships
among the inode, the upcall entry and the client_t(s). We should
document this.
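Purely as an illustration of the cleanup discussion (these are not the
upcall xlator's actual structures), an upcall entry carrying its client
and an expiry time, reaped lazily while walking the list:

    #include <stdlib.h>
    #include <time.h>

    struct upcall_entry {
        char                 client_uid[64]; /* identifies the client */
        void                *client;         /* would be a ref-counted
                                                client_t */
        time_t               expires_at;     /* registration expiry */
        struct upcall_entry *next;
    };

    /* Lazy cleanup: unlink expired entries while walking the list.
     * A periodic scanner thread could call this on a timer instead. */
    static void upcall_reap_expired(struct upcall_entry **head, time_t now)
    {
        struct upcall_entry **pp = head;
        while (*pp) {
            if ((*pp)->expires_at <= now) {
                struct upcall_entry *dead = *pp;
                *pp = dead->next;
                /* a real implementation would drop its client_t ref
                 * here before freeing */
                free(dead);
            } else {
                pp = &(*pp)->next;
            }
        }
    }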
Re: [Gluster-devel] Volume management proposal (4.0)
> So . . . about that new functionality. The core idea of data
> classification is to apply step 6c repeatedly, with variants of DHT
> that do tiering or various other kinds of intelligent placement
> instead of the hash-based random placement we do now. NUFA and switch
> are already examples of this. In fact, their needs drove some of the
> code structure that makes data classification (DC) possible. The
> trickiest question with DC has always been how the user specifies
> these complex placement policies, which we then turn into volfiles.
> In the interests of maximizing compatibility with existing scripts
> and user habits, what I propose is that we do this by allowing the
> user to combine existing volumes into a new higher-level volume.

I like the idea for its simplicity. Abstracting 'tiers' as volumes is
natural from a manageability point of view, for the following reason:
tiering is about assigning data to partitioned resources based on the
data's needs, which is done by moving data to its best-suited resource
(e.g. a particular kind of disk). The secondary-volumes approach lets
us partition the resources and their management together.

> (D) Secondary volumes may not be started and stopped by the user.
> Instead, a secondary volume is automatically started or stopped along
> with its primary.

Wouldn't it help in some cases to have secondary volumes running while
the primary is not, for some form of maintenance activity?

> (E) The user must specify an explicit option to see the status of
> secondary volumes. Without this option, secondary volumes are hidden,
> and the status of their constituent bricks will be shown as though
> they were (directly) part of the corresponding primary volume.
>
> As it turns out, most of the extra volfiles in step 8 above also have
> their own steps 6d and 7, so implementing step C will probably make
> those paths simpler as well. The one big remaining question is how
> this will work in terms of detecting and responding to volume
> configuration changes. Currently we treat each volfile as a
> completely independent entity, and just compare whole graphs.
> Instead, what we need to do is track dependencies between graphs (a
> graph of graphs?) so that a change to a secondary volume will ripple
> up to its primary, where a new graph can be generated and compared to
> its predecessor.

IIUC, (E) describes the primary volume's volfile being generated with
all secondary-volume references resolved. Wouldn't that preclude the
respective processes from discovering the dependencies?
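A sketch of what the 'graph of graphs' could look like (all types and
names hypothetical, not glusterd code): a registry mapping each
secondary graph to the primaries that embed it, so a change ripples up
and each affected primary is regenerated and re-compared:

    #include <stdio.h>

    #define MAX_DEPS 16

    struct volgraph {
        const char      *name;
        struct volgraph *primaries[MAX_DEPS]; /* graphs embedding this one */
        int              nprimaries;
        void           (*regenerate)(struct volgraph *self);
    };

    static void depend(struct volgraph *secondary, struct volgraph *primary)
    {
        if (secondary->nprimaries < MAX_DEPS)
            secondary->primaries[secondary->nprimaries++] = primary;
    }

    /* Called when a secondary's configuration changes: regenerate each
     * primary and ripple further up (assumes the dependency graph is
     * acyclic). */
    static void on_graph_change(struct volgraph *g)
    {
        for (int i = 0; i < g->nprimaries; i++) {
            struct volgraph *p = g->primaries[i];
            printf("regenerating %s because %s changed\n", p->name, g->name);
            p->regenerate(p);
            on_graph_change(p); /* a primary may itself be a secondary */
        }
    }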
[Gluster-devel] REMINDER: Gluster Community Bug Triage meeting today at 12:00 UTC
Hi all,

Later today we will have another Gluster Community Bug Triage meeting.

Meeting details:
- location: #gluster-meeting on Freenode IRC
- date: every Tuesday
- time: 12:00 UTC, 13:00 CET (in your terminal, run: date -d "12:00 UTC")
- agenda: https://public.pad.fsfe.org/p/gluster-bug-triage

Currently the following items are listed:
* Roll Call
* Status of last week's action items
* Group Triage
* Open Floor

The last two topics have space for additions. If you have a suitable
bug or topic to discuss, please add it to the agenda.

Thanks,
Niels