Re: [PATCH 2/4] Fix mainline filesystems to handle ATTR_KILL_ bits correctly
Jeff Layton wrote: On Tue, 21 Aug 2007 17:21:28 -0400 Josef Sipek <[EMAIL PROTECTED]> wrote: On Tue, Aug 21, 2007 at 07:35:51AM -0400, Jeff Layton wrote: On Tue, 21 Aug 2007 15:35:08 +1000 Timothy Shimmin <[EMAIL PROTECTED]> wrote: Jeff Layton wrote: This should fix all of the filesystems in the mainline kernels to handle ATTR_KILL_SUID and ATTR_KILL_SGID correctly. For most of them, this is just a matter of making sure that they call generic_attrkill early in the setattr inode op. Signed-off-by: Jeff Layton <[EMAIL PROTECTED]> --- fs/xfs/linux-2.6/xfs_iops.c |5 - --- a/fs/xfs/linux-2.6/xfs_iops.c +++ b/fs/xfs/linux-2.6/xfs_iops.c @@ -651,12 +651,15 @@ xfs_vn_setattr( struct iattr*attr) { struct inode*inode = dentry->d_inode; - unsigned intia_valid = attr->ia_valid; + unsigned intia_valid; bhv_vnode_t *vp = vn_from_inode(inode); bhv_vattr_t vattr = { 0 }; int flags = 0; int error; + generic_attrkill(inode->i_mode, attr); + ia_valid = attr->ia_valid; + if (ia_valid & ATTR_UID) { vattr.va_mask |= XFS_AT_UID; vattr.va_uid = attr->ia_uid; Looks reasonable to me for XFS. Acked-by: Tim Shimmin <[EMAIL PROTECTED]> So before, this clearing would happen directly in notify_change() and now this won't happen until notify_change() calls i_op->setattr which for a particular fs it can call generic_attrkill() to do it. So I guess for the cases where i_op->setattr is called outside of via notify_change, we don't normally have ATTR_KILL_SUID/SGID set so that nothing will happen there? Right. If neither ATTR_KILL bit is set then generic_attrkill is a noop. I guess just wondering the effect with having the code on all setattr's. (I'm not familiar with the code path) These bits are referenced in very few places in the current kernel tree -- mostly in the VFS layer. The *only* place I see that they actually get interpreted into a mode change is in notify_change. So places that call setattr ops w/o going through notify_change are not likely to have those bits set. But hypothetically, if a fs did set ATTR_KILL_* and call setattr directly, then the setattr would now include a mode change that clears setuid or setgid bits where it may not have before. I should probably clarify -- in the hypothetical situation above, the setattr function would have to call generic_attrkill (as most filesystems should do with this change). Thanks for the confirmation. That's what it looked like to me but I wanted to know explicitly what the thinking was. It almost sounds like an argument for a new inode op (NULL would use generic_attr_kill). That's not a bad idea at all. I suppose that would be easier than modifying every fs like this, and it does seem like it might be cleaner. I need to mull it over, but that might be the best solution. Yeah, sounds a much more direct way of handling things and as you say wouldn't need most of the filesystems to all be modified calling generic_attrkill. Not sure what the ramifications of adding a new iop are though. Cheers, Tim. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Fix mainline filesystems to handle ATTR_KILL_ bits correctly
Jeff Layton wrote: This should fix all of the filesystems in the mainline kernels to handle ATTR_KILL_SUID and ATTR_KILL_SGID correctly. For most of them, this is just a matter of making sure that they call generic_attrkill early in the setattr inode op. Signed-off-by: Jeff Layton <[EMAIL PROTECTED]> --- fs/xfs/linux-2.6/xfs_iops.c |5 - --- a/fs/xfs/linux-2.6/xfs_iops.c +++ b/fs/xfs/linux-2.6/xfs_iops.c @@ -651,12 +651,15 @@ xfs_vn_setattr( struct iattr*attr) { struct inode*inode = dentry->d_inode; - unsigned intia_valid = attr->ia_valid; + unsigned intia_valid; bhv_vnode_t *vp = vn_from_inode(inode); bhv_vattr_t vattr = { 0 }; int flags = 0; int error; + generic_attrkill(inode->i_mode, attr); + ia_valid = attr->ia_valid; + if (ia_valid & ATTR_UID) { vattr.va_mask |= XFS_AT_UID; vattr.va_uid = attr->ia_uid; Looks reasonable to me for XFS. Acked-by: Tim Shimmin <[EMAIL PROTECTED]> So before, this clearing would happen directly in notify_change() and now this won't happen until notify_change() calls i_op->setattr which for a particular fs it can call generic_attrkill() to do it. So I guess for the cases where i_op->setattr is called outside of via notify_change, we don't normally have ATTR_KILL_SUID/SGID set so that nothing will happen there? I guess just wondering the effect with having the code on all setattr's. (I'm not familiar with the code path) --Tim - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
Amit K. Arora wrote: FA_FL_NO_MTIME 0x10 /* keep same mtime (default change on size, data change) */ FA_FL_NO_CTIME 0x20 /* keep same ctime (default change on size, data change) */ NACK to these aswell. If i_size changes c/mtime need updates, if the size doesn't chamge they don't. No need to add more flags for this. This requirement was from the point of view of HSM applications. Hope you saw Andreas previous post and are keeping that in mind. We use this capability in XFS at the moment. I think this is mainly for DMF (HSM) but is done via the xfs handle interface (xfs_open_by_handle) AFAICT. This sets up a set of invisible operations (xfs_invis_file_operations). xfs_file_ioctl_invis goes on to set IO_INVIS which goes on to set ATTR_DMI which is then tested in xfs_change_file_space() (which handles XFS_IOC_RESVSP & friends) for whether xfs_ichgtime(ip, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG) is called or not. --Tim - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Hi, --On 28 May 2007 12:45:59 PM +1000 David Chinner <[EMAIL PROTECTED]> wrote: On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote: Thanks everyone for your input. There was some very valuable observations in the various emails. I will try to pull most of it together and bring out what seem to be the important points. 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP. Sounds good to me, but how do we test to see if the underlying device supports barriers? Do we just assume that they do and only change behaviour if -o nobarrier is specified in the mount options? I would assume so. Then when the block layer finds that they aren't supported and does non-barrier ones, then it could report a message. We, xfs, I guess can't take much other course of action and we aint doing much now other than not requesting them anymore and printing an error message. 2/ Maybe barriers provide stronger semantics than are required. All write requests are synchronised around a barrier write. This is often more than is required and apparently can cause a measurable slowdown. Also the FUA for the actual commit write might not be needed. It is important for consistency that the preceding writes are in safe storage before the commit write, but it is not so important that the commit write is immediately safe on storage. That isn't needed until a 'sync' or 'fsync' or similar. The use of barriers in XFS assumes the commit write to be on stable storage before it returns. One of the ordering guarantees that we need is that the transaction (commit write) is on disk before the metadata block containing the change in the transaction is written to disk and the current barrier behaviour gives us that. Yep, and that one is what we want the FUA for - for the write into the log. I'm taking it that the FUA write will just guarantee that that particular write has made it to disk on i/o completion (and no write cache flush is done). The other XFS constraint is that we know when the metadata hits the disk so that we can move the tail of the log. And that is what we are effectively getting from the pre-write-flush part of the barrier. It would ensure that any metadata not yet to disk would be on disk before we overwrite the tail of the log. If we could determine cases when we don't have to worry about overwriting the tail of the log, then it would be good if we could just do FUA writes for contraint 1 above. Is that possible? --Tim - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
--On 18 April 2007 6:21:39 PM -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote: Below is an aggregation of the comments in this thread: struct fiemap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ __u32 fe_lun; /* logical storage device number in array */ } struct fiemap { __u64 fm_start; /* logical start offset of mapping (in/out) */ __u64 fm_len; /* logical length of mapping (in/out) */ __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */ __u64 fm_unused; struct fiemap_extent fm_extents[0]; } /* flags for the fiemap request */ # define FIEMAP_FLAG_SYNC 0x0001 /* flush delalloc data to disk*/ # define FIEMAP_FLAG_HSM_READ 0x0002 /* retrieve data from HSM */ # define FIEMAP_FLAG_INCOMPAT0xff00 /* must understand these flags*/ /* flags for the returned extents */ # define FIEMAP_EXTENT_HOLE 0x0001 /* no space allocated */ # define FIEMAP_EXTENT_UNWRITTEN0x0002 /* uninitialized space */ # define FIEMAP_EXTENT_UNKNOWN 0x0004 /* in use, location unknown */ # define FIEMAP_EXTENT_ERROR0x0008 /* error mapping space */ # define FIEMAP_EXTENT_NO_DIRECT0x0010 /* no direct data access */ SUMMARY OF CHANGES == - use fm_* fields directly in request instead of making it a fiemap_extent (though they are layed out identically) I much prefer that - it makes it a lot clearer to me to have fiemap_extent just for fm_extents (no different meanings now). (Don't like the word "offset" in comment without "physical" or some such but whatever;-) I also prefer the flags as separate fields too :) --Tim - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Hi Andreas, --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote: I'm interested in getting input for implementing an ioctl to efficiently map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion times. ... I had come up with a plan independently and was also steered toward XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original plan, though I think the XFS structs used there are a bit bloated. They certainly seem to be (combining entries and header). struct fibmap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ } struct fibmap { struct fibmap_extent fm_start; /* offset, length of desired mapping */ __u32 fm_extent_count; /* number of extents in array */ __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ __u64 unused; struct fibmap_extent fm_extents[0]; } # define FIEMAP_LEN_MASK0xff # define FIEMAP_LEN_HOLE0x01 # define FIEMAP_LEN_UNWRITTEN 0x02 All offsets are in bytes to allow cases where filesystems are not going block-aligned/sized allocations (e.g. tail packing). The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag). The ->fm_extents[] array includes all of the holes in addition to allocated extents because this avoids the need to return both the logical and physical address for every extent and does not make processing any harder. Well, that's what stood out for me. I was wondering where the "fe_block" field had gone - the "physical address". So is your "fe_start; /* starting offset */" actually the disk location (not a logical file offset) _except_ in the header (fibmap) where it is the desired logical offset. Okay, looking at your example use below that's what it looks like. And when you refer to fm_start below, you mean fm_start.fe_start? Sorry, I realise this is just an approximation but this part confused me. So you get rid of all the logical file offsets in the extents because we report holes explicitly (and we know everything is contiguous if you include the holes). --Tim Caller works something like: char buf[4096]; struct fibmap *fm = (struct fibmap *)buf; int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent); fm->fm_extent.fe_start = 0; /* start of file */ fm->fm_extent.fe_len = -1; /* end of file */ fm->fm_extent_count = count; /* max extents in fm_extents[] array */ fm->fm_flags = 0;/* maybe "no DMAPI", etc like XFS */ fd = open(path, O_RDONLY); printf("logical\t\tphysical\t\tbytes\n"); /* The last entry will have less extents than the maximum */ while (fm->fm_extent_count == count) { rc = ioctl(fd, FIEMAP, fm); if (rc) break; /* kernel filled in fm_extents[] array, set fm_extent_count * to be actual number of extents returned, leaves fm_start * alone (unlike XFS_IOC_GETBMAP). */ for (i = 0; i < fm->fm_extent_count; i++) { __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK; __u64 fm_next = fm->fm_start + len; int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE; int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN; printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n", fm->fm_start, fm_next - 1, hole ? 0 : fm->fm_extents[i].fe_start, hole ? 0 : fm->fm_extents[i].fe_start + fm->fm_extents[i].fe_len - 1, len, hole ? "(hole) " : "", unwr ? "(unwritten) " : ""); /* get ready for printing next extent, or next ioctl */ fm->fm_start = fm_next; } } - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PROPOSAL2] Extended Attributes
Andreas writes: > > Hi again, > > > There were some good arguments for adding a few more features to the EA > interface. This new proposal reflects some of the discussion. > > I still decline to support forks/streams through the EA interface. IMHO > that's just the wrong way to go. This doesn't preclude an EA > implementation on top of streams, of course. > > The interface described here also doesn't include Stephen's idea to allow > an ordered list of EA's under the same name. In addition to the append and > prepend operations Stephen suggested, a whole range of other operations > (get/delete/... by index, etc.) might make sense, and stuff like that > could well be added. However, it would complicate the semantics even > further. I'd really like to learn more about the requirements for that. > Back to comments from the XFS team and your 2nd proposed API. * We (XFS) also have an attr_multi() system call which is reminiscent of your sys_ext_attr_XXX() call. It allows one to apply an op to many attributes in the one call. However, it actually takes an op-list, and thus can actually apply different ops to different attributes in the one syscall, instead of taking the one op-code as a parameter . i.e. it takes: attr_multiop_t *oplist, where struct attr_multiop { int am_opcode; /* which operation to perform (see below) */ int am_error; /* [out arg] result of this sub-op (an errno) */ char *am_attrname; /* attribute name to work with */ char *am_attrvalue; /* [in/out arg] attribute value (raw bytes) */ int am_length; /* [in/out arg] length of value */ int am_flags; /* flags (bit-wise OR of #defines above) */ }; opcode = ATTR_OP_GET, ATTR_OP_SET, ATTR_OP_REMOVE The attr_multi() call does not actually provide atomicity over the call - other threads could change the attributes between ops. It also does not provide a log transaction for the set of ops. So, I think it is just saving on multiple system calls. It doesn't have a matching VOP call, it just calls the VOPS for get, set and remove. The only use I've seen in the IRIX tree is where the opcode is the same for the given list - i.e. 1 multi for setting a bunch of attributes and another multi for getting a bunch of attributes This could thus be achieved with your API. * I noticed how you have Stephen's "attrib family" encoded as a "namespace" in your API BUT I didn't see how you've encoded Stephen's "name family" field. BTW, I still don't have a clear understanding of the "name family". Is it like some sort of "typing" for names ? * As mentioned previously, we'd also like a way of optionally following symlinks - such as by another flag value or another system call variant. * FYI our attr_list cursor is of the form: (instead of a single 32 bit offset). The identifies which entry we were last at. One form of the attribute data storage, actually stores with ordered hashed-name entries and so the offset is used to distinguish between equal hash values. In another EA storage form, we don't actually store the hash values, however, the list function still preserves hash-value ordering and so we actually compute the names' hash values. I think the blkno is used in the BTREE form of storage as a suggestion on where to start looking for the . If it finds that the block isn't an EA block or it doesn't have hash values in the right range then we start from the top of the tree and search looking for . The initted field is a 1 bit field (0/1) which says whether the cursor has been initialized yet. I can only see a test for it in the one place, which seems to be just an early sanity test in xfs_attr_list() if ((cursor->initted == 0) && (cursor->hashval || cursor->blkno || cursor->offset)) { return(XFS_ERROR(EINVAL)); } Sorry for the rambling :) --Tim - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [PROPOSAL] Extended attributes for Posix security extensions
Hi, I'm new to this area so please excuse some naive questions. I have no criticisms, just desire clarifications on your "Families" API for EA. I've been looking at your API versus the XFS attrib functionality. 1. It seems that your EA API assumes EA lists, whereas in XFS we seem to base it on EA sets. That is, in your API the ordering of attributes is significant; hence the ATR_APPEND, ATR_PREPEND and an ATR_SET definition which mentions the handling of multiple instances of the same name. In XFS there doesn't seem to be an explicit concept of ordering, and I thought that only one attribute name could exist for a particular file system object (well actually 2 - possibly a User space version and a Root space version) - is this not the case, Curtis ? 2. In your definitions, you mention how only "name" in is used for GET, and in SET you mention, "if multiple instances of the name already exist". I assume that your reference to name existence really means matching on ? 3. How do you remove an attribute or multiple attributes ? Do you do a SET with a null value in ? Thanks muchly, Tim. Stephen wrote: = Our proposed kernel API looks something like this: sys_setattr (char *filename, int attrib_family, int op, struct attrib *old_attribs, int *old_lenp, struct attrib *new_attribs, int new_len); sys_fsetattr(int fd, int attrib_family, int op, struct attrib *old_attribs, int *old_lenp, struct attrib *new_attribs, int new_len); where can be ATR_SET overwrite existing attribute ATR_GET read existing attribute ATR_GETALL read entire ordered attribute list (ignores new val) ATR_PREPEND add new attribute to start of ordered list ATR_APPEND add new attribute to end of ordered list ATR_REPLACE replace entire ordered attribute list and where is a buffer of length bytes of variable length struct attrib records: struct attrib { int rec_len;/* Length of the whole record: should be padded to long alignment */ int name_family;/* Which namespace is the name in? */ int name_len; int val_len; charname[variable]; /* byte-aligned */ charval[variable]; /* byte-aligned */ }; ATR_SET will overwrite an existing attribute, or if the attribute does not already exist, will append the new attribute (ie. it does not override existing ACL controls, in keeping with the Principle of Least Surprise). If multiple instances of the name already exist, then the first one is replaced and subsequent ones deleted. If supplied with an "old" buffer, all old attributes of that name will be returned. For the PREPEND/APPEND/REPLACE operations, the entire old attribute set is returned. For GET, the specification is read and all attributes which match any items in are returned, in the order in which they are specified in . The actual value in is ignored; only the name is used. For GETALL, is ignored entirely. *old_lenp should contain the size of the old attributes buffer on entry. It will contain the number of valid bytes in the old buffer on exit. If the buffer is not sufficiently large to contain all of the attributes, E2BIG is returned. = - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]