Re: [PATCH 2/4] Fix mainline filesystems to handle ATTR_KILL_ bits correctly

2007-08-27 Thread Timothy Shimmin

Jeff Layton wrote:

On Tue, 21 Aug 2007 17:21:28 -0400
Josef Sipek <[EMAIL PROTECTED]> wrote:


On Tue, Aug 21, 2007 at 07:35:51AM -0400, Jeff Layton wrote:

On Tue, 21 Aug 2007 15:35:08 +1000
Timothy Shimmin <[EMAIL PROTECTED]> wrote:


Jeff Layton wrote:

This should fix all of the filesystems in the mainline kernels to handle
ATTR_KILL_SUID and ATTR_KILL_SGID correctly. For most of them, this is
just a matter of making sure that they call generic_attrkill early in
the setattr inode op.

Signed-off-by: Jeff Layton <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_iops.c   |5 -
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -651,12 +651,15 @@ xfs_vn_setattr(
struct iattr*attr)
 {
struct inode*inode = dentry->d_inode;
-   unsigned intia_valid = attr->ia_valid;
+   unsigned intia_valid;
bhv_vnode_t *vp = vn_from_inode(inode);
bhv_vattr_t vattr = { 0 };
int flags = 0;
int error;
 
+	generic_attrkill(inode->i_mode, attr);

+   ia_valid = attr->ia_valid;
+
if (ia_valid & ATTR_UID) {
vattr.va_mask |= XFS_AT_UID;
vattr.va_uid = attr->ia_uid;

Looks reasonable to me for XFS.
Acked-by: Tim Shimmin <[EMAIL PROTECTED]>

So before, this clearing would happen directly in notify_change()
and now this won't happen until notify_change() calls i_op->setattr
which for a particular fs it can call generic_attrkill() to do it.
So I guess for the cases where i_op->setattr is called outside of
via notify_change, we don't normally have ATTR_KILL_SUID/SGID
set so that nothing will happen there?

Right. If neither ATTR_KILL bit is set then generic_attrkill is a
noop.


I guess just wondering the effect with having the code on all
setattr's. (I'm not familiar with the code path)


These bits are referenced in very few places in the current kernel
tree -- mostly in the VFS layer. The *only* place I see that they
actually get interpreted into a mode change is in notify_change. So
places that call setattr ops w/o going through notify_change are
not likely to have those bits set.

But hypothetically, if a fs did set ATTR_KILL_* and call setattr
directly, then the setattr would now include a mode change that
clears setuid or setgid bits where it may not have before.


I should probably clarify -- in the hypothetical situation above,
the setattr function would have to call generic_attrkill (as most
filesystems should do with this change).


Thanks for the confirmation. That's what it looked like to me
but I wanted to know explicitly what the thinking was.


It almost sounds like an argument for a new inode op (NULL would use
generic_attr_kill).



That's not a bad idea at all. I suppose that would be easier than
modifying every fs like this, and it does seem like it might be
cleaner. I need to mull it over, but that might be the best
solution.


Yeah, sounds a much more direct way of handling things and as you
say wouldn't need most of the filesystems to all be modified calling
generic_attrkill.
Not sure what the ramifications of adding a new iop are though.

Cheers,
Tim.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Fix mainline filesystems to handle ATTR_KILL_ bits correctly

2007-08-20 Thread Timothy Shimmin

Jeff Layton wrote:

This should fix all of the filesystems in the mainline kernels to handle
ATTR_KILL_SUID and ATTR_KILL_SGID correctly. For most of them, this is
just a matter of making sure that they call generic_attrkill early in
the setattr inode op.

Signed-off-by: Jeff Layton <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_iops.c   |5 -
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -651,12 +651,15 @@ xfs_vn_setattr(
struct iattr*attr)
 {
struct inode*inode = dentry->d_inode;
-   unsigned intia_valid = attr->ia_valid;
+   unsigned intia_valid;
bhv_vnode_t *vp = vn_from_inode(inode);
bhv_vattr_t vattr = { 0 };
int flags = 0;
int error;
 
+	generic_attrkill(inode->i_mode, attr);

+   ia_valid = attr->ia_valid;
+
if (ia_valid & ATTR_UID) {
vattr.va_mask |= XFS_AT_UID;
vattr.va_uid = attr->ia_uid;


Looks reasonable to me for XFS.
Acked-by: Tim Shimmin <[EMAIL PROTECTED]>

So before, this clearing would happen directly in notify_change()
and now this won't happen until notify_change() calls i_op->setattr
which for a particular fs it can call generic_attrkill() to do it.
So I guess for the cases where i_op->setattr is called outside of
via notify_change, we don't normally have ATTR_KILL_SUID/SGID
set so that nothing will happen there?
I guess just wondering the effect with having the code on all
setattr's. (I'm not familiar with the code path)

--Tim
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-03 Thread Timothy Shimmin

Amit K. Arora wrote:

FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */

NACK to these aswell.  If i_size changes c/mtime need updates, if the size
doesn't chamge they don't.  No need to add more flags for this.


This requirement was from the point of view of HSM applications. Hope
you saw Andreas previous post and are keeping that in mind.


We use this capability in XFS at the moment.
I think this is mainly for DMF (HSM) but is done via the xfs handle interface
(xfs_open_by_handle) AFAICT.

This sets up a set of invisible operations (xfs_invis_file_operations).
xfs_file_ioctl_invis goes on to set IO_INVIS which goes on to set ATTR_DMI
which is then tested in xfs_change_file_space() (which handles XFS_IOC_RESVSP & 
friends)
for whether xfs_ichgtime(ip, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG)
is called or not.

--Tim
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-27 Thread Timothy Shimmin

Hi,

--On 28 May 2007 12:45:59 PM +1000 David Chinner <[EMAIL PROTECTED]> wrote:


On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote:


Thanks everyone for your input.  There was some very valuable
observations in the various emails.
I will try to pull most of it together and bring out what seem to be
the important points.


1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP.


Sounds good to me, but how do we test to see if the underlying
device supports barriers? Do we just assume that they do and
only change behaviour if -o nobarrier is specified in the mount
options?


I would assume so.
Then when the block layer finds that they aren't supported and does
non-barrier ones, then it could report a message.
We, xfs, I guess can't take much other course of action
and we aint doing much now other than not requesting them
anymore and printing an error message.


2/ Maybe barriers provide stronger semantics than are required.

 All write requests are synchronised around a barrier write.  This is
 often more than is required and apparently can cause a measurable
 slowdown.

 Also the FUA for the actual commit write might not be needed.  It is
 important for consistency that the preceding writes are in safe
 storage before the commit write, but it is not so important that the
 commit write is immediately safe on storage.  That isn't needed until
 a 'sync' or 'fsync' or similar.


The use of barriers in XFS assumes the commit write to be on stable
storage before it returns.  One of the ordering guarantees that we
need is that the transaction (commit write) is on disk before the
metadata block containing the change in the transaction is written
to disk and the current barrier behaviour gives us that.


Yep, and that one is what we want the FUA for -
for the write into the log.

I'm taking it that the FUA write will just guarantee that that
particular write has made it to disk on i/o completion
(and no write cache flush is done).

The other XFS constraint is that we know when the metadata hits the disk
so that we can move the tail of the log.
And that is what we are effectively getting from the pre-write-flush
part of the barrier. It would ensure that any metadata not yet to disk would
be on disk before we overwrite the tail of the log.
If we could determine cases when we don't have to worry about overwriting
the tail of the log, then it would be good if we could
just do FUA writes for contraint 1 above. Is that possible?

--Tim


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread Timothy Shimmin

--On 18 April 2007 6:21:39 PM -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote:


Below is an aggregation of the comments in this thread:

struct fiemap_extent {
__u64 fe_start; /* starting offset in bytes */
__u64 fe_len;   /* length in bytes */
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_lun;   /* logical storage device number in array */
}

struct fiemap {
__u64 fm_start; /* logical start offset of mapping (in/out) */
__u64 fm_len;   /* logical length of mapping (in/out) */
__u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
__u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
__u64 fm_unused;
struct fiemap_extent fm_extents[0];
}

/* flags for the fiemap request */
# define FIEMAP_FLAG_SYNC   0x0001  /* flush delalloc data to disk*/
# define FIEMAP_FLAG_HSM_READ   0x0002  /* retrieve data from HSM */
# define FIEMAP_FLAG_INCOMPAT0xff00 /* must understand these flags*/

/* flags for the returned extents */
# define FIEMAP_EXTENT_HOLE 0x0001  /* no space allocated */
# define FIEMAP_EXTENT_UNWRITTEN0x0002  /* uninitialized space 
*/
# define FIEMAP_EXTENT_UNKNOWN  0x0004  /* in use, location unknown */
# define FIEMAP_EXTENT_ERROR0x0008  /* error mapping space */
# define FIEMAP_EXTENT_NO_DIRECT0x0010  /* no direct data 
access */



SUMMARY OF CHANGES
==
- use fm_* fields directly in request instead of making it a fiemap_extent
  (though they are layed out identically)


I much prefer that - it makes it a lot clearer to me to have fiemap_extent
just for fm_extents (no different meanings now).
(Don't like the word "offset" in comment without "physical" or some such but 
whatever;-)
I also prefer the flags as separate fields too :)

--Tim
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-16 Thread Timothy Shimmin

Hi Andreas,

--On 12 April 2007 5:05:50 AM -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote:


I'm interested in getting input for implementing an ioctl to efficiently
map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion
times.

...


I had come up with a plan independently and was also steered toward
XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
plan, though I think the XFS structs used there are a bit bloated.


They certainly seem to be (combining entries and header).



struct fibmap_extent {
__u64 fe_start; /* starting offset in bytes */
__u64 fe_len;   /* length in bytes */
}

struct fibmap {
struct fibmap_extent fm_start;  /* offset, length of desired mapping */
__u32 fm_extent_count;  /* number of extents in array */
__u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
__u64 unused;
struct fibmap_extent fm_extents[0];
}

# define FIEMAP_LEN_MASK0xff
# define FIEMAP_LEN_HOLE0x01
# define FIEMAP_LEN_UNWRITTEN   0x02

All offsets are in bytes to allow cases where filesystems are not going
block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
returned contains the packed list of allocation extents for the file,
including entries for holes (which have fe_start == 0, and a flag).

The ->fm_extents[] array includes all of the holes in addition to
allocated extents because this avoids the need to return both the logical
and physical address for every extent and does not make processing any
harder.


Well, that's what stood out for me. I was wondering where the "fe_block" field
had gone - the "physical address".
So is your "fe_start; /* starting offset */" actually the disk location
(not a logical file offset)
_except_ in the header (fibmap) where it is the desired logical offset.
Okay, looking at your example use below that's what it looks like.
And when you refer to fm_start below, you mean fm_start.fe_start?
Sorry, I realise this is just an approximation but this part confused me.
So you get rid of all the logical file offsets in the extents because we
report holes explicitly (and we know everything is contiguous if you
include the holes).

--Tim



Caller works something like:

char buf[4096];
struct fibmap *fm = (struct fibmap *)buf;
int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);

fm->fm_extent.fe_start = 0; /* start of file */
fm->fm_extent.fe_len = -1;   /* end of file */
fm->fm_extent_count = count; /* max extents in fm_extents[] array */
fm->fm_flags = 0;/* maybe "no DMAPI", etc like XFS */

fd = open(path, O_RDONLY);
printf("logical\t\tphysical\t\tbytes\n");

/* The last entry will have less extents than the maximum */
while (fm->fm_extent_count == count) {
rc = ioctl(fd, FIEMAP, fm);
if (rc)
break;

/* kernel filled in fm_extents[] array, set fm_extent_count
 * to be actual number of extents returned, leaves fm_start
 * alone (unlike XFS_IOC_GETBMAP). */

for (i = 0; i < fm->fm_extent_count; i++) {
__u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
__u64 fm_next = fm->fm_start + len;
int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
int unwr = fm->fm_extents[i].fe_len & 
FIEMAP_LEN_UNWRITTEN;

printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
fm->fm_start, fm_next - 1,
hole ? 0 : fm->fm_extents[i].fe_start,
hole ? 0 : fm->fm_extents[i].fe_start +
   fm->fm_extents[i].fe_len - 1,
len, hole ? "(hole) " : "",
unwr ? "(unwritten) " : "");

/* get ready for printing next extent, or next ioctl */
fm->fm_start = fm_next;
}
}



-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PROPOSAL2] Extended Attributes

2000-11-02 Thread Timothy Shimmin

Andreas writes:
> 
> Hi again,
> 
> 
> There were some good arguments for adding a few more features to the EA
> interface. This new proposal reflects some of the discussion.
> 
> I still decline to support forks/streams through the EA interface. IMHO
> that's just the wrong way to go. This doesn't preclude an EA
> implementation on top of streams, of course.
> 
> The interface described here also doesn't include Stephen's idea to allow
> an ordered list of EA's under the same name. In addition to the append and
> prepend operations Stephen suggested, a whole range of other operations
> (get/delete/... by index, etc.) might make sense, and stuff like that
> could well be added. However, it would complicate the semantics even
> further. I'd really like to learn more about the requirements for that.
> 

Back to comments from the XFS team and your 2nd proposed API.

* We (XFS) also have an attr_multi() system call which is reminiscent of
  your sys_ext_attr_XXX() call. It allows one to apply an op
  to many attributes in the one call. However, it actually takes
  an op-list, and thus can actually apply different ops to different
  attributes in the one syscall, instead of taking the one op-code
  as a parameter .

  i.e. it takes: attr_multiop_t *oplist, 
  where
struct attr_multiop {
int  am_opcode;  /* which operation to perform (see below) */
int  am_error;   /* [out arg] result of this sub-op (an errno) */
char *am_attrname;   /* attribute name to work with */
char *am_attrvalue;  /* [in/out arg] attribute value (raw bytes) */
int  am_length;  /* [in/out arg] length of value */
int  am_flags;   /* flags (bit-wise OR of #defines above) */
};

opcode = ATTR_OP_GET, ATTR_OP_SET, ATTR_OP_REMOVE 
 
  The attr_multi() call does not actually provide atomicity
  over the call - other threads could change the attributes between ops.
  It also does not provide a log transaction for the set of ops.
  So, I think it is just saving on multiple system calls.

  It doesn't have a matching VOP call, it just calls the VOPS for
  get, set and remove.

  The only use I've seen in the IRIX tree is where the opcode is
  the same for the given list - 
  i.e. 1 multi for setting a bunch of attributes and another
   multi for getting a bunch of attributes
  This could thus be achieved with your API.

* I noticed how you have Stephen's "attrib family" encoded as a 
  "namespace" in your API BUT I didn't see how you've encoded 
  Stephen's "name family" field.
  BTW, I still don't have a clear understanding of the "name family".
  Is it like some sort of "typing" for names ? 

* As mentioned previously, we'd also like a way of optionally
  following symlinks - such as by another flag value or another
  system call variant.


* FYI our attr_list cursor is of the form:
  
  (instead of a single 32 bit offset).

  The  identifies which entry we were last at.
  One form of the attribute data storage, actually stores
  with ordered hashed-name entries and so the offset is used
  to distinguish between equal hash values. 
  In another EA storage form, we don't actually 
  store the hash values, however, the list function still 
  preserves hash-value ordering and so we actually compute the
  names' hash values.
  I think the blkno is used in the BTREE form of storage as
  a suggestion on where to start looking for the .
  If it finds that the block isn't an EA block or it doesn't have
  hash values in the right range then we start from the top of
  the tree and search looking for .
  The initted field is a 1 bit field (0/1) which says whether
  the cursor has been initialized yet.
  I can only see a test for it in the one place, which seems
  to be just an early sanity test in xfs_attr_list()
if ((cursor->initted == 0) &&
(cursor->hashval || cursor->blkno || cursor->offset)) {
return(XFS_ERROR(EINVAL));
}
  Sorry for the rambling :)


--Tim
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [PROPOSAL] Extended attributes for Posix security extensions

2000-10-26 Thread Timothy Shimmin

Hi,

I'm new to this area so please excuse some naive questions.
I have no criticisms, just desire clarifications on your
"Families" API for EA.

I've been looking at your API versus the XFS attrib functionality.

1.
It seems that your EA API assumes EA lists, whereas
in XFS we seem to base it on EA sets.
That is, in your API the ordering of attributes is significant;
hence the ATR_APPEND, ATR_PREPEND and an ATR_SET definition which
mentions the handling of multiple instances of the same name.
In XFS there doesn't seem to be an explicit concept of ordering,
and I thought that only one attribute name could exist for 
a particular file system object (well actually 2 - possibly
a User space version and a Root space version) -
is this not the case, Curtis ?

2.
In your definitions, you mention how only "name" in  is 
used for GET, and in SET you mention, 
"if multiple instances of the name already exist". 
I assume that your reference to name existence really means
matching on  ?

3.
How do you remove an attribute or multiple attributes ?
Do you do a SET with a null value in  ?


Thanks muchly,
Tim.


Stephen wrote:
=
Our proposed kernel API looks something like this:

  sys_setattr (char *filename, int attrib_family, int op, 
   struct attrib *old_attribs, int *old_lenp,
   struct attrib *new_attribs, int new_len);

  sys_fsetattr(int fd, int attrib_family, int op, 
   struct attrib *old_attribs, int *old_lenp,
   struct attrib *new_attribs, int new_len);

  where  can be

  ATR_SET overwrite existing attribute
  ATR_GET read existing attribute
  ATR_GETALL  read entire ordered attribute list (ignores new val)
  ATR_PREPEND add new attribute to start of ordered list
  ATR_APPEND  add new attribute to end of ordered list
  ATR_REPLACE replace entire ordered attribute list

  and where  is a buffer of length  bytes of variable
  length struct attrib records:

  struct attrib {
  int rec_len;/* Length of the whole record:
 should be padded to long
 alignment */
  int name_family;/* Which namespace is the name in? */
  int name_len;
  int val_len;
  charname[variable]; /* byte-aligned */
  charval[variable];  /* byte-aligned */
  };

  ATR_SET will overwrite an existing attribute, or if the attribute does
  not already exist, will append the new attribute (ie. it does not
  override existing ACL controls, in keeping with the Principle of Least
  Surprise).  If multiple instances of the name already exist, then the
  first one is replaced and subsequent ones deleted.  If supplied with
  an "old" buffer, all old attributes of that name will be returned.

  For the PREPEND/APPEND/REPLACE operations, the entire old attribute
  set is returned.

  For GET, the  specification is read and all attributes which
  match any items in  are returned, in the order in which they are
  specified in .  The actual value in  is ignored; only the
  name is used.

  For GETALL,  is ignored entirely.

  *old_lenp should contain the size of the old attributes buffer on
  entry.  It will contain the number of valid bytes in the old buffer on
  exit.  If the buffer is not sufficiently large to contain all of the
  attributes, E2BIG is returned.
=
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]