Re: [2.6.24 patch] let EXT4DEV_FS depend on BROKEN

2008-01-02 Thread Trond Myklebust

On Wed, 2008-01-02 at 23:16 +0200, Adrian Bunk wrote:
> On Wed, Jan 02, 2008 at 07:26:29PM +0100, Diego Calleja wrote:
> > El Wed, 2 Jan 2008 03:32:18 +0200, Adrian Bunk <[EMAIL PROTECTED]> escribió:
> > 
> > > It might make sense to offer ext4 in -mm and even in early -rc kernels, 
> > > but I've already seen people using ext4 simply because a stable kernel 
> > > offered it - and that's definitely not intended.
> > 
> > But isn't that the whole purpose of having ext4 snapshots in the stable 
> > kernel - to
> > allow people to try it?
> 
> ext4 has quite an unusual development model for kernel code, other 
> code in the state of ext4 is usually only in -mm and not in stable 
> kernels.

Bullshit... We all do this.

> Stable kernels are mainly meant for usage, not for trying stuff.
> And although I see a point in perhaps shipping some not-yet-perfect 
> device drivers for otherwise unsupported hardware or some
> not-yet-perfect filesystems required for accessing foreign
> (non-Linux) filesystems, I don't see any point in offering a
> WIP Linux-only filesystem in stable kernels.

This breaks with the 2.6.x development model that we've been working
with for several years now. I, for one, do not wish to change that
model.

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/25] move handling of setuid/gid bits from VFS into individual setattr functions (RESEND)

2007-08-07 Thread Trond Myklebust
On Tue, 2007-08-07 at 17:15 -0700, Andrew Morton wrote:

> Is there any way in which we can prevent these problems?  Say

The problem here is that we occasionally DO need to add new flags, and
yes, they MAY be security related. The whole reason why we're now having
to change the semantics of setattr is because somebody tried to hack
their way around the write+suid issue.

I suspect we will see the exact same thing will happen again in a couple
of years with Serge's ATTR_KILL_PRIV flag.

> - rename something so that unconverted filesystems will reliably fail to
>   compile?
> 
> - leave existing filesystems alone, but add a new
>   inode_operations.setattr_jeff, which the networked filesytems can
>   implement, and teach core vfs to call setattr_jeff in preference to
>   setattr?

If you really need to know that the filesystem is handling the flags,
then how about instead having ->setattr() return something which
indicates which flags it actually handled? That is likely to be a far
more intrusive change, but it is one which is future-proof.

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [fuse-devel] [PATCH 01/25] VFS: move attr_kill logic from notify_change into helper function

2007-08-06 Thread Trond Myklebust
On Mon, 2007-08-06 at 21:37 +0200, Miklos Szeredi wrote:
> > > Your patch is changing the API in a very unsafe way, since there will
> > > be no error or warning on an unconverted fs.  And that could lead to
> > > security holes.
> > > 
> > > If we would rename the setattr method to setattr_new as well as
> > > changing it's behavior, that would be fine.  But I guess we do not
> > > want to do that.
> > 
> > Which "unconverted fses"? If we're talking out of tree stuff, then too
> > bad: it is _their_ responsibility to keep up with kernel changes.
> 
> It is usually a good idea to not change the semantics of an API in a
> backward incompatible way without changing the syntax as well.

We're taking two setattr flags ATTR_KILL_SGID, and ATTR_KILL_SUID which
have existed for several years in the VFS, and making them visible to
the filesystems. Out-of-tree filesystems that care can check for them in
a completely backward compatible way: you don't even need to add a
#define.

> This is true regardless of whether we care about out-of-tree code or
> not (and we should care to some degree).  And especially true if the
> change in question is security sensitive.

It is not true "regardless": the in-tree code is being converted.
Out-of-tree code is the only "problem" here, and their only problem is
that they may have to add support for the new flags if they also support
suid/sgid mode bits.

Are you advocating reserving a new filesystem bit every time we need to
add an attribute flag?

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [fuse-devel] [PATCH 01/25] VFS: move attr_kill logic from notify_change into helper function

2007-08-06 Thread Trond Myklebust
On Mon, 2007-08-06 at 20:28 +0200, Miklos Szeredi wrote:

> Your patch is changing the API in a very unsafe way, since there will
> be no error or warning on an unconverted fs.  And that could lead to
> security holes.
> 
> If we would rename the setattr method to setattr_new as well as
> changing it's behavior, that would be fine.  But I guess we do not
> want to do that.

Which "unconverted fses"? If we're talking out of tree stuff, then too
bad: it is _their_ responsibility to keep up with kernel changes.

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-11 Thread Trond Myklebust
On Wed, 2007-07-11 at 09:47 +0100, Christoph Hellwig wrote:
> On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote:
> > This patch is on top of i_version_update_vfs.
> > The i_version field of the inode is set on inode creation and incremented 
> > when
> > the inode is being modified.
> 
> Which is not what i_version is supposed to do.  It'll get you tons of misses
> for NFSv3 filehandles that rely on the generation staying the same for the
> same file.  Please add a new field for the NFSv4 sequence counter instead
> of making i_version unuseable.

Aren't you confusing i_version and i_generation here? Those are two
separate inode fields.

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Trond Myklebust
On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote:
> On Tuesday July 10, [EMAIL PROTECTED] wrote:
> > 
> > Yes, thanks.  It doesn't actually tell us why we want to implement
> > this attribute and it doesn't tell us what the implications of failing
> > to do so are, but I guess we can take that on trust from the NFS guys.
> 
> You would like to think so, but remember NFSv4 was designed by a
> committee :-)
> 
> The 'change' number is used for cache consistency, and as the spec
> makes very strong statements about the 'change' number, it is very
> hard (or impossible) to implement a server correctly without storing a
> change number in stable storage (just one of my grips about V4).

Well... It reflects a requirement that was just as present in the
caching models that we use for NFSv2/v3, but that we glossed over for
other reasons.
The real difference here is that v2/v3 caching model assumes that you
have sufficient resolution in the ctime/mtime to allow clients to detect
any changes to the file/directory contents, whereas NFSv4 allows you to
use an arbitrary variable (that may be the ctime, if it has sufficient
resolution) for the same purposes.

> > But I suspect the ext4 implementation doesn't actually do this.  afaict we
> > won't update i_version for file overwrites (especially if s_time_gran can
> > indeed be 1,000,000,000) and of course for MAP_SHARED modifications.  What
> > would be the implications of this?
> 
> The first part sounds like a bug - i_version should really be updated
> by every call to ->commit_write (if that is still what it is called).
> 
> The MAP_SHARED thing is less obvious.  I guess every time we notice
> that the page might have been changed, we need to increment i_version.

You need to increment is any time that you detect remotely visible
changes.
IOW: any change that posix mandates should result in a ctime update,
must also result in an update of i_version. The only difference is that
i_version must not suffer from the time resolution issues that ctime
does.

> > And how does the NFS server know that the filesystem implements i_version? 
> > Will a zero-value of i_version have special significance, telling the
> > server to not send this attribute, perhaps?
> 
> That is a very important question.  Zero probably makes sense, but
> what ever it is needs to be agreed and documented.
> And just by-the-way, the server doesn't really have the option of not
> sending the attribute.  If i_version isn't defined, it has to fake
> something using mtime, and hope that is good enough.
> 
> Alternately we could mandate that i_version is always kept up-to-date
> and if a filesystem doesn't have anything to load from storage, it
> just sets it to the current time in nanoseconds.
> 
> That would mean that a client would need to flush it's cache whenever
> the inode fell out of cache on the server, but I don't think we can
> reliably do better than that.
> 
> I think I like that approach.
> 
> So my vote is to increment i_version in common code every time any
> change is made to the file, and alloc_inode should initialise it to
> current time, which might be changed by the filesystem before it calls
> unlock_new_inode. 
> ... but doesn't lustre want to control its i_version... so maybe not :-(

If lustre wants to be exportable via pNFS (as Peter Braam has suggested
it should), then it had better be able to return a change attribute that
is compatible with the NFSv4.1 spec...

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-03 Thread Trond Myklebust
On Mon, 2007-07-02 at 10:58 -0400, Mingming Cao wrote:
> Trond or Bruce, can you please review these patch series and ack if you
> agrees? Thanks.
> 
> As to performance concerns that raise before the inode version counter
> (at least for ext4) is done inside ext4_mark_inode_dirty), so there is
> no extra IO work to store this counter to disk.

Hi Mingming,

It looks OK to me, but you might want to strip out the now redundant
i_version updates in add_dirent_to_buf(), ext4_rmdir(), ext4_rename().

I also have some questions about how this will affect the readdir code:
unless I missed something, the filp->f_version is still unsigned long,
so the comparisons and assignments in ext4_readdir()/ext4_dx_readdir()
no longer make sense.

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/2] i_version update

2007-05-31 Thread Trond Myklebust
On Thu, 2007-05-31 at 10:01 +1000, Neil Brown wrote:

> This will provide a change number that normally changes only when the
> file changes and doesn't require any extra storage on disk.
> The change number will change inappropriately only when the inode has
> fallen out of cache and is being reload, which is either after a crash
> (hopefully rare) of when a file hasn't been used for a while, implying
> that it is unlikely that any client has it in cache.

It will also change inappropriately if the server is under heavy load
and needs to reclaim memory by tossing out inodes that are cached and
still in use by the clients. That change will trigger clients to
invalidate their caches and to refetch the data from the server, further
cranking up the load.

Not an ideal solution...

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/2] i_version update - ext4 part

2007-05-30 Thread Trond Myklebust
On Wed, 2007-05-30 at 16:48 -0700, Mingming Cao wrote:
> On Tue, 2007-05-29 at 16:58 -0600, Andreas Dilger wrote:
> > I don't know what the NFS requirements for the version are.  There may
> > also be some complaints from others if the i_version is 64 bits because
> > this contributes to generic inode growth and isn't used for other
> > filesystems.
> > 
> That should benefit for other filesystems, as I thought this NFS
> requirements apply to all filesystems. 

Right. The point here is that the NFS protocol needs to impose certain
requirements on _all_ filesystems that want to be supported, and so it
is in the interest of everyone to have a generic update mechanism
available in the VFS in order to avoid code (and bug) replication.

Now if Lustre doesn't care about NFS compatibility, then I suppose it
would be fairly easy to engineer the i_version update interface to allow
them to use that field in whatever way suits them best.

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [patch 0/3] i_version update for ext4

2007-01-24 Thread Trond Myklebust
On Wed, 2007-01-24 at 09:40 -0800, Mingming Cao wrote:
> Cordenner jean noel wrote:
> > Andreas Dilger a écrit :
> > 
> >> On Jan 23, 2007  18:23 +0100, Cordenner jean noel wrote:
> >>
> >>> I've updated what was previously the change attribute patch for ext4
> >>> initially posted by Alexandre Ratchov. The previous patch was
> >>> introducing a change_attribute field, now it uses the i_version field of
> >>> the inode.
> >>>
> >>> The i_version field is a counter that is set on every inode creation and
> >>> that is incremented every time the inode data is modified (similarly to
> >>> the "ctime" time-stamp).
> >>> The aim is to fulfill NFSv4 requirements for rfc3530.
> >>> For the moent, the counter is only a 32bit value but it is planned to be
> >>> 64bit as required.
> >>>
> >>> The patch is divided into 3 parts, the vfs layer, the ext4 specific code
> >>> and an user part to check i_version changes via stat.
> >>
> >>
> >> Have you had a chance to look at the performance impact of this change
> >> (possible with oprofile)?  Always marking the inodes dirty for ext3
> >> may have some noticable overhead.
> >>
> > 
> > I did some tests using fileop with the previous version of the patch
> > which was very similar. I was surprised that there was no noticable
> > overhead:
> >  http://www.bullopensource.org/ext4/change_attribute/index.html
> > 
> > I will use oprofile to check it again.
> 
> Could we just increment the counter each time the mtime is modifies(not 
> the ctime)? Is that enough to serve NFSv4 need?

No. That would not conform to RFC3530:

5.5.  Mandatory Attributes - Definitions

   Name  #DataType Access   Description
   ___

.

   change3uint64   READ A value created by the
server that the client
can use to determine
if file data,
directory contents or
attributes of the
object have been
modified.  The server
may return the
object's time_metadata
attribute for this
attribute's value but
only if the filesystem
object can not be
updated more
frequently than the
resolution of
time_metadata.

so the change attribute needs to change on both data and metadata
updates.

Cheers,
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rfc: [patch] change attribute for ext3

2006-12-14 Thread Trond Myklebust
On Wed, 2006-12-13 at 20:52 -0500, J. Bruce Fields wrote:
> > What kind of requirements does NFSv4 place on the version?  Monotonic is
> > probably a good bet.
> 
> The only requirement is that it be unique (assuming a file is never
> modified 2^64 times).  Clients can't compare them except for equality.

The other requirement is that they be updated in more or less any
situation where you would normally see a 'ctime' update. In other words
any time when the file metadata or data changes, and any time when the
ACL changes.

(NB: I'm not sure what we should do w.r.t. xattr changes since those are
not really covered by RFC3530.)

Atomicity is not a hard requirement, however the server is required to
know whether or not the update was atomic. If the update is atomic, a
careful client may perform certain optimisations based upon it knowing
that no other changes to the inode have raced with this one. For
instance, if it knows that a file creation atomically updated the change
attribute of the directory, then it can determine that it does not need
to check for other changes to that directory.

> > Does it need to be global for the filesystem
> 
> Nope.
> 
> > or is a per-inode version sufficient?
> 
> Yes.

Yes. If your filesystem wants to support Solaris or Reiser4-like
subfiles, then it is expected that each subfile should have its own
change attribute (whereas changes to the subfile 'directory' will be
reflected by the parent inode's change attribute.

Change attribute values may be reused if the inode number is reused (as
long as the filesystem has something like a generation counter that
allows it to distinguish between different instances of the same inode
number).

> > What functionality of NFSv4 needs the version?
> 
> Clients use it to revalidate their caches.

Yup. It is used to detect changes made on the NFS server itself
(possibly by other NFS clients, possibly by local processes on the
server), so that the client can flush out any stale cached data.

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3 merge status

2006-09-27 Thread Trond Myklebust
On Tue, 2006-09-26 at 13:41 -0700, Andrew Morton wrote:

> I don't know if we'll be merging the fs-cache code this time around - it's
> largely in Trond and Christoph's hands.

I'm OK with it going in.

Cheers,
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rfc: [patch] change attribute for ext3

2006-09-14 Thread Trond Myklebust
On Thu, 2006-09-14 at 09:46 -0400, Peter Staubach wrote:
> Wouldn't the generation count work better than ctime to differentiate 
> between
> instances of files using the same inode number?  That way, there wouldn't be
> a clock resolution issue.

No. This is about distinguishing updates to the metadata/data of the
same instance. It is exactly what ctime was supposed to do, but ctime
relies on a clock which usually has too poor time resolution, may not be
monotonic, etc...

Cheers,
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rfc: [patch] change attribute for ext3

2006-09-13 Thread Trond Myklebust
On Wed, 2006-09-13 at 20:30 +0200, Alexandre Ratchov wrote:
> On Wed, Sep 13, 2006 at 02:11:11PM -0400, Trond Myklebust wrote:
> > On Wed, 2006-09-13 at 18:42 +0200, Alexandre Ratchov wrote:
> > > hello, 
> > > 
> > > here is a small patch that adds the "change attribute" for ext3
> > > file-systems;
> > > 
> > > the change attribute is a simple counter that is reset to zero on
> > > inode creation and that is incremented every time the inode data is
> > > modified (similarly to the "ctime" time-stamp).
> > 
> > I would really have preferred a full-blown 64-bit counter as per
> > RFC3530, but I suppose we could always combine this change attribute
> > with the high word from ctime in order to make up the NFSv4 change
> > attribute. That should keep us safe until someone develops a ramdisk
> > with < 1 nsecond access time.
> > 
> 
> do you mean something like "(ctime.tv_sec << 32) | change_attribute"? this
> would allow 2^32 inode changes per second.

Yes. As I said, that probably ought to suffice for now.

> For ext3 it's hard to find unused bits in the on-disk inode structure, but
> ext4 inode may become larger in the future, allowing a 64bit counter.

In anticipation of that event, could you please change the field in
'struct stat' and 'struct stat64' to be an 'unsigned long long' instead
of the current 'unsigned long'?

All the other fields are internal to the kernel, so a future change of
their size should not matter.

Cheers,
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rfc: [patch] change attribute for ext3

2006-09-13 Thread Trond Myklebust
On Wed, 2006-09-13 at 18:42 +0200, Alexandre Ratchov wrote:
> hello, 
> 
> here is a small patch that adds the "change attribute" for ext3
> file-systems;
> 
> the change attribute is a simple counter that is reset to zero on
> inode creation and that is incremented every time the inode data is
> modified (similarly to the "ctime" time-stamp).

I would really have preferred a full-blown 64-bit counter as per
RFC3530, but I suppose we could always combine this change attribute
with the high word from ctime in order to make up the NFSv4 change
attribute. That should keep us safe until someone develops a ramdisk
with < 1 nsecond access time.

Cheers,
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html