Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Fri, Jun 01 2007, Neil Brown wrote:
> On Friday June 1, [EMAIL PROTECTED] wrote:
> > On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
> > > David Chinner wrote:
> > > > That sounds like a good idea - we can leave the existing
> > > > WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> > > > behaviour that only guarantees ordering. The filesystem can then
> > > > choose which to use where appropriate
> > >
> > > So what if you want a synchronous write, but DON'T care about the order?
> >
> > submit_bio(WRITE_SYNC, bio);
> >
> > Already there, already used by XFS, JFS and direct I/O.
>
> Are you sure?
>
> You seem to be saying that WRITE_SYNC causes the write to be safe on
> media before the request returns. That isn't my understanding.
> I think (from comments near the definition and a quick grep through
> the code) that WRITE_SYNC expedites the delivery of the request
> through the elevator, but doesn't do anything special about getting it
> onto the media.
> It essentially says "Submit this request now, don't wait for more
> requests to bundle with it for better bandwidth utilisation".

That is exactly right. WRITE_SYNC doesn't give any integrity guarantees,
it just makes sure the request goes straight through the io scheduler.

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
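[A userspace analogue may help illustrate the distinction Jens and Neil are drawing. This is an illustrative sketch, not kernel code, and the helper name is made up: getting a write submitted quickly is a scheduling property, while knowing it reached stable media requires an explicit integrity step, which in userspace is fsync(2).]

```c
#include <stdlib.h>
#include <unistd.h>

/* Write data to fd, then force it to stable storage.  write() alone only
 * queues the data -- comparable to a request that is expedited through
 * the elevator but never flushed; fsync() is the step that provides the
 * integrity guarantee. */
static int durable_write(int fd, const void *data, size_t len)
{
	if (write(fd, data, len) != (ssize_t)len)
		return -1;	/* at most queued in the page cache */
	if (fsync(fd) != 0)
		return -1;	/* not durable without this */
	return 0;
}
```

[So a caller that wants "synchronous, but unordered" calls durable_write() on just the one file descriptor, without imposing any ordering on unrelated writes.]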
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Friday June 1, [EMAIL PROTECTED] wrote:
> On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
> > David Chinner wrote:
> > > That sounds like a good idea - we can leave the existing
> > > WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> > > behaviour that only guarantees ordering. The filesystem can then
> > > choose which to use where appropriate
> >
> > So what if you want a synchronous write, but DON'T care about the order?
>
> submit_bio(WRITE_SYNC, bio);
>
> Already there, already used by XFS, JFS and direct I/O.

Are you sure?

You seem to be saying that WRITE_SYNC causes the write to be safe on
media before the request returns. That isn't my understanding.
I think (from comments near the definition and a quick grep through
the code) that WRITE_SYNC expedites the delivery of the request
through the elevator, but doesn't do anything special about getting it
onto the media.
It essentially says "Submit this request now, don't wait for more
requests to bundle with it for better bandwidth utilisation".

NeilBrown
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Fri, 1 Jun 2007, Tejun Heo wrote:
> but one thing we should bear in mind is that harddisks don't have
> humongous caches or a very smart controller / instruction set. No
> matter how relaxed an interface the block layer provides, in the end,
> it just has to issue a whole-sale FLUSH CACHE on the device to
> guarantee data ordering on the media.

If you are talking about individual drives you may be right for the
moment (but 16M of cache on drives is a _lot_ larger than people
imagined would be there a few years ago), but when you consider
self-contained disk arrays it's an entirely different story. You can
easily have a few gig of cache and a complete OS pretending to be a
single drive as far as you are concerned.

And the price of such devices is plummeting (in large part thanks to
Linux moving into this space); you can now readily buy a 10TB array for
$10k that looks like a single drive.

David Lang
Re: [RFC] obsoleting /etc/mtab
On May 31, 2007  17:11 -0700, H. Peter Anvin wrote:
> NFS takes a binary option block anyway. However, that's the exception,
> not the rule.

There was recently a patch submitted to linux-fsdevel to change NFS to
use text option parsing.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Stefan Bader wrote:
> 2007/5/30, Phillip Susi <[EMAIL PROTECTED]>:
>> Stefan Bader wrote:
>> >
>> > Since drive a supports barrier request we don't get -EOPNOTSUPP but
>> > the request with block y might get written before block x since the
>> > disks are independent. I guess the chances of this are quite low since
>> > at some point a barrier request will also hit drive b but for the time
>> > being it might be better to indicate -EOPNOTSUPP right from
>> > device-mapper.
>>
>> The device mapper needs to ensure that ALL underlying devices get a
>> barrier request when one comes down from above, even if it has to
>> construct zero length barriers to send to most of them.
>
> And somehow also make sure all of the barriers have been processed
> before returning the barrier that came in. Plus it would have to queue
> all mapping requests until the barrier is done (if strictly acting
> according to barrier.txt).
>
> But I am wondering a bit whether the requirements on barriers are
> really as tight as described in Tejun's document (a barrier request is
> only started if everything before it is safe, the barrier itself isn't
> returned until it is safe, too, and all requests after the barrier
> aren't started before the barrier is done). Is it really necessary to
> defer any further requests until the barrier has been written to safe
> storage? Or would it be sufficient to guarantee that, if a barrier
> request returns, everything up to (and including) the barrier is on
> safe storage?

Well, what's described in barrier.txt is the currently implemented
semantics and what filesystems expect, so we can't change it underneath
them, but we definitely can introduce new, more relaxed variants. One
thing we should bear in mind, though, is that harddisks don't have
humongous caches or a very smart controller / instruction set. No
matter how relaxed an interface the block layer provides, in the end,
it just has to issue a whole-sale FLUSH CACHE on the device to
guarantee data ordering on the media.

IMHO, we can do better by paying more attention to how we do things in
the request queue, which can be deeper and more intelligent than the
device queue.

Thanks.
--
tejun
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Jens Axboe wrote:
> On Thu, May 31 2007, David Chinner wrote:
>> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
>>> On Thu, May 31 2007, David Chinner wrote:
>>>>>> IOWs, there are two parts to the problem:
>>>>>>
>>>>>> 	1 - guaranteeing I/O ordering
>>>>>> 	2 - guaranteeing blocks are on persistent storage.
>>>>>>
>>>>>> Right now, a single barrier I/O is used to provide both of these
>>>>>> guarantees. In most cases, all we really need to provide is 1); the
>>>>>> need for 2) is a much rarer condition but still needs to be provided.
>>>>>
>>>>> if I am understanding it correctly, the big win for barriers is that
>>>>> you do NOT have to stop and wait until the data is on persistant
>>>>> media before you can continue.
>>>>
>>>> Yes, if we define a barrier to only guarantee 1), then yes this would
>>>> be a big win (esp. for XFS). But that requires all filesystems to
>>>> handle sync writes differently, and sync_blockdev() needs to call
>>>> blkdev_issue_flush() as well.
>>>>
>>>> So, what do we do here? Do we define a barrier I/O to only provide
>>>> ordering, or do we define it to also provide persistent storage
>>>> writeback? Whatever we decide, it needs to be documented.
>>>
>>> The block layer already has a notion of the two types of barriers, with
>>> a very small amount of tweaking we could expose that. There's absolutely
>>> zero reason we can't easily support both types of barriers.
>>
>> That sounds like a good idea - we can leave the existing
>> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
>> behaviour that only guarantees ordering. The filesystem can then
>> choose which to use where appropriate
>
> Precisely. The current definition of barriers are what Chris and I came
> up with many years ago, when solving the problem for reiserfs
> originally. It is by no means the only feasible approach.
>
> I'll add a WRITE_ORDERED command to the #barrier branch, it already
> contains the empty-bio barrier support I posted yesterday (well a
> slightly modified and cleaned up version).

Would that be very different from issuing a barrier and not waiting for
its completion? For ATA and SCSI, we'll have to flush the write back
cache anyway, so I don't see how we can get a performance advantage by
implementing a separate WRITE_ORDERED. I think the zero-length barrier
(haven't looked at the code yet, still recovering from jet lag :-) can
serve as a genuine barrier without the extra write tho.

Thanks.
--
tejun
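[The two guarantees David enumerates map onto a pattern filesystems already use, shown here as a hedged userspace sketch (the function name is made up): ordering is obtained by not issuing the commit record until the data it covers is known stable. With only fsync() available, ordering (guarantee 1) cannot be bought separately from persistence (guarantee 2) -- a full flush sits between the two writes, which is exactly the cost WRITE_ORDERED is meant to avoid.]

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Commit `data` under the ordering discipline discussed above: the
 * commit marker must not reach stable storage before the data it
 * describes.  fsync() between the writes enforces both ordering and
 * persistence at once, just as a WRITE_BARRIER does at the bio level. */
static int journal_commit(int fd, const char *data)
{
	char marker = 'C';

	if (write(fd, data, strlen(data)) < 0)
		return -1;
	if (fsync(fd) != 0)		/* order AND persist the data ... */
		return -1;
	if (write(fd, &marker, 1) != 1)	/* ... before the commit record */
		return -1;
	return fsync(fd);		/* persist the commit record itself */
}
```

[A relaxed ordering-only primitive would let the first fsync() become a cheap "don't reorder across this" hint instead of a full cache flush.]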
Re: [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops
On Thu, May 31, 2007 at 12:05:39AM -0700, Andrew Morton wrote:
> On Thu, 31 May 2007 07:15:39 +0200 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > If you can send that rollup, it would be good. I could try getting
> > everything to compile and do some more testing on it too.
>
> Single patch against 2.6.22-rc3: http://userweb.kernel.org/~akpm/np.gz
>
> broken-out:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2007-05-30-09-30.tar.gz

Thanks, I'll get onto it.
Re: [RFC] obsoleting /etc/mtab
Trond Myklebust wrote:
>>> A lot of these could be fixed all at once by letting the filesystem tell
>>> the VFS to retain the string passed to the original mount. That will
>>
>> Unfortunately, the original option string (from userspace) != real
>> options (in kernel), see NFS. This bug should be fixed -- the kernel
>> has to fully follow mount(2) or ends with EINVAL.
>
> Way ahead of you... See patches 6 and 7 on
>
> http://client.linux-nfs.org/Linux-2.6.x/2.6.22-rc3/

NFS takes a binary option block anyway. However, that's the exception,
not the rule.

-hpa
Re: [RFC] obsoleting /etc/mtab
On Thu, May 31, 2007 at 09:40:49AM -0700, H. Peter Anvin wrote:
> Miklos Szeredi wrote:
> >
> > (2) needs work in the filesystems implicated. I already have patches
> > for ext2, ext3, tmpfs, devpts and hostfs, and it would be nice if the
> > maintainers for others could help out.
> >
>
> A lot of these could be fixed all at once by letting the filesystem tell
> the VFS to retain the string passed to the original mount. That will
> solve *almost* all filesystems which take string options.

Except some filesystems mangle that string as it gets passed around
(i.e. XFS).

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
> David Chinner wrote:
> > That sounds like a good idea - we can leave the existing
> > WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> > behaviour that only guarantees ordering. The filesystem can then
> > choose which to use where appropriate
>
> So what if you want a synchronous write, but DON'T care about the order?

submit_bio(WRITE_SYNC, bio);

Already there, already used by XFS, JFS and direct I/O.

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC] obsoleting /etc/mtab
On Fri, 2007-06-01 at 01:04 +0200, Karel Zak wrote:
> On Thu, May 31, 2007 at 09:40:49AM -0700, H. Peter Anvin wrote:
> > Miklos Szeredi wrote:
> > >
> > > (2) needs work in the filesystems implicated. I already have patches
> > > for ext2, ext3, tmpfs, devpts and hostfs, and it would be nice if the
> > > maintainers for others could help out.
> > >
> >
> > A lot of these could be fixed all at once by letting the filesystem tell
> > the VFS to retain the string passed to the original mount. That will
>
> Unfortunately, the original option string (from userspace) != real
> options (in kernel), see NFS. This bug should be fixed -- the kernel
> has to fully follow mount(2) or ends with EINVAL.

Way ahead of you... See patches 6 and 7 on

http://client.linux-nfs.org/Linux-2.6.x/2.6.22-rc3/

Trond
[PATCH] support larger cifs network reads
With Samba 3.0.26pre it is now possible for a cifs client (one which
supports the newest Unix/Posix cifs extensions) to request up to almost
8MB at a time on a cifs read request. A patch for the cifs client to
support larger reads follows. In this patch, using very large reads is
not the default behavior, since it would require larger buffer
allocations for the large cifs request buffers, but in the future when
cifs can demultiplex reads to a page list in the cifs_demultiplex_thread
(without having to copy to a large temporary buffer) this will be even
more useful.

diff --git a/fs/cifs/README b/fs/cifs/README
index 4d01697..6ad722b 100644
--- a/fs/cifs/README
+++ b/fs/cifs/README
@@ -301,8 +301,19 @@ A partial list of the supported mount op
 		during the local client kernel build will be used.
 		If server does not support Unicode, this parameter is
 		unused.
-  rsize		default read size (usually 16K)
-  wsize		default write size (usually 16K, 32K is often better over GigE)
+  rsize		default read size (usually 16K). The client currently
+		can not use rsize larger than CIFSMaxBufSize. CIFSMaxBufSize
+		defaults to 16K and may be changed (from 8K to the maximum
+		kmalloc size allowed by your kernel) at module install time
+		for cifs.ko. Setting CIFSMaxBufSize to a very large value
+		will cause cifs to use more memory and may reduce performance
+		in some cases. To use rsize greater than 127K (the original
+		cifs protocol maximum) also requires that the server support
+		a new Unix Capability flag (for very large read) which some
+		newer servers (e.g. Samba 3.0.26 or later) do. rsize can be
+		set from a minimum of 2048 to a maximum of 8388608 (depending
+		on the value of CIFSMaxBufSize)
+  wsize		default write size (default 57344)
+		maximum wsize currently allowed by CIFS is 57344 (14 4096 byte pages)
   rw		mount the network share read-write (note that the
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index d38c69b..8c4365d 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -732,10 +732,21 @@ cifs_init_request_bufs(void)
 		/* Buffer size can not be smaller than 2 * PATH_MAX since maximum
 		   Unicode path name has to fit in any SMB/CIFS path based frames */
 		CIFSMaxBufSize = 8192;
-	} else if (CIFSMaxBufSize > 1024*127) {
-		CIFSMaxBufSize = 1024 * 127;
 	} else {
-		CIFSMaxBufSize &= 0x1FE00; /* Round size to even 512 byte mult*/
+		if (CIFSMaxBufSize + MAX_CIFS_HDR_SIZE > KMALLOC_MAX_SIZE) {
+			CIFSMaxBufSize = KMALLOC_MAX_SIZE - MAX_CIFS_HDR_SIZE;
+			cERROR(1, ("CIFSMaxBufSize too large, resetting to %ld",
+				KMALLOC_MAX_SIZE));
+		}
+
+		/* The length field is 3 bytes, but for time being we use only
+		 * 23 of the available 24 length bits */
+		if (CIFSMaxBufSize > 8388608) {
+			CIFSMaxBufSize = 8388608;
+			cERROR(1,
+				("CIFSMaxBufSize set to protocol max 8388608"));
+		} else /* round buffer size to even 512 byte multiple */
+			CIFSMaxBufSize &= 0x7FFE00;
 	}
 	/* cERROR(1,("CIFSMaxBufSize %d 0x%x",CIFSMaxBufSize,CIFSMaxBufSize)); */
 	cifs_req_cachep = kmem_cache_create("cifs_request",
diff --git a/fs/cifs/cifspdu.h b/fs/cifs/cifspdu.h
index d619ca7..6e6cda0 100644
--- a/fs/cifs/cifspdu.h
+++ b/fs/cifs/cifspdu.h
@@ -1885,15 +1885,19 @@ typedef struct {
 #define CIFS_UNIX_POSIX_PATHNAMES_CAP	0x0010 /* Allow POSIX path chars */
 #define CIFS_UNIX_POSIX_PATH_OPS_CAP	0x0020 /* Allow new POSIX path based
 						  calls including posix open
-						  and posix unlink */
+						  and posix unlink */
+#define CIFS_UNIX_LARGE_READ_CAP	0x0040 /* support reads >128K (up
+						  to 0x00 */
+#define CIFS_UNIX_LARGE_WRITE_CAP	0x0080
+
 #ifdef CONFIG_CIFS_POSIX
 /* Can not set pathnames cap yet until we send new posix create SMB since
    otherwise server can treat such handles opened with older ntcreatex
    (by a new client which knows how to send posix path ops)
    as non-posix handles (can affect write behavior with byte range locks.
    We can add back in POSIX_PATH_OPS cap when Posix Create/Mkdir finished */
-/* #define CIFS_UNIX_CAP_MASK	0x003b */
-#define CIFS_UNIX_CAP_MASK	0x001b
+/* #define CIFS_UNIX_CAP_MASK
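[A standalone mirror of the size-clamping logic in the cifsfs.c hunk above may make the bit arithmetic easier to follow. This is a simplified sketch (it omits the KMALLOC_MAX_SIZE check, and `clamp_bufsize` is a made-up name): the mask 0x7FFE00 keeps bits 9-22, i.e. it rounds down to a 512-byte multiple while staying within the 23 usable length bits.]

```c
#include <assert.h>

/* Clamp a requested buffer size the way the patch does: cap at the
 * 8388608 (0x800000) protocol maximum, otherwise round down to an
 * even 512-byte multiple by masking off the low 9 bits. */
static unsigned int clamp_bufsize(unsigned int size)
{
	if (size > 8388608)
		return 8388608;		/* 23-bit length field limit */
	return size & 0x7FFE00;		/* 512-byte multiple, bits 9-22 */
}
```

[For example, clamp_bufsize(130000) yields 129536, the largest 512-byte multiple not exceeding the request.]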
Re: [RFC] obsoleting /etc/mtab
On Thu, May 31, 2007 at 09:40:49AM -0700, H. Peter Anvin wrote:
> Miklos Szeredi wrote:
> >
> > (2) needs work in the filesystems implicated. I already have patches
> > for ext2, ext3, tmpfs, devpts and hostfs, and it would be nice if the
> > maintainers for others could help out.
> >
>
> A lot of these could be fixed all at once by letting the filesystem tell
> the VFS to retain the string passed to the original mount. That will

Unfortunately, the original option string (from userspace) != real
options (in kernel), see NFS. This bug should be fixed -- the kernel
has to fully follow mount(2) or ends with EINVAL.

Karel
--
Karel Zak <[EMAIL PROTECTED]>
Re: [PATCH resend] introduce I_SYNC
On Thu, 31 May 2007 16:25:35 +0200 Jörn Engel <[EMAIL PROTECTED]> wrote:
> On Wed, 16 May 2007 10:15:35 -0700, Andrew Morton wrote:
> >
> > If we're going to do this then please let's get some exhaustive commentary
> > in there so that others have a chance of understanding these flags without
> > having to do the amount of reverse-engineering which you've been put
> > through.
>
> Done. Found and fixed some bugs in the process. By now I feel
> reasonably certain that the patch fixes more than it breaks.
>
> --
> Good warriors cause others to come to them and do not go to others.
> -- Sun Tzu
>
> Introduce I_SYNC.
>
> I_LOCK was used for several unrelated purposes, which caused deadlock
> situations in certain filesystems as a side effect. One of the purposes
> now uses the new I_SYNC bit.

Do we know what those deadlocks were? It's a bit of a mystery patch
otherwise. Put yourself in the position of a random-distro engineer
wondering "should I backport this?".

> Also document the various bits and change their order from historical to
> logical.

What a nice comment you added ;)
Re: [RFC] obsoleting /etc/mtab
Hi Miklos,

On Thu, May 31, 2007 at 06:29:12PM +0200, Miklos Szeredi wrote:
> It's not just mount(8) that reads /etc/mtab, but various other
> utilities, for example df(1).

So do mount.nfs, mount.cifs, mount.ocfs, HAL, am-utils (amd)... and
these utils also write to mtab, although I think many of them already
check for a symlink.

> So the best solution would be if
> /etc/mtab were a symlink to /proc/mounts, and the kernel would be the
> authoritative source of information regarding mounts.

Yes.

> (1) user mounts ("user" or "users" option in /etc/fstab) currently
> need /etc/mtab to keep track of the owner

There are more things:

  loop=        ... umount(8) uses this option for loop device
                   deinitialization, when the device was initialized
                   by mount(8)

  encryption=, offset=, speed=
               ... but nothing uses these options

  uhelper=     ... this one is my baby :-( (Not released by upstream
                   yet. ...according to Google this Fedora patch is
                   already in Mandrake, PCLinuxOS, Pardus, and ???)

From the man page:

  The uhelper (unprivileged umount request helper) is possibly used
  when a non-root user wants to umount a mountpoint which is not
  defined in the /etc/fstab file (e.g. devices mounted by HAL).

GNOME people love it, because it's a way to use command line utils
(umount(8)) for devices that were mounted by desktop daemons.

The umount.nfs also reads many options from mtab, but it seems all
these options are also in /proc/mounts. I know almost nothing about
the other [u]mount dialects (cifs, ...).

> (2) lots of filesystems only present a few mount options (or none) in
> /proc/mounts
>
> (1) can be solved with the new mount owner support in the unprivileged
> mounts patchset. Mount(8) would still have to detect at boot time if
> this is available, and either create the symlink to /proc/mounts or if
> MS_SETOWNER is not available, fall back to using /etc/mtab.

Sounds good, but there should be a way (an option) to disable this
functionality (in case mtab is required for an exotic reason).

> (2) needs work in the filesystems implicated. I already have patches
> for ext2, ext3, tmpfs, devpts and hostfs, and it would be nice if the
> maintainers for others could help out.
>
> It wouldn't even be fatal if some mount options were missing from
> /proc/mounts. Mount options in /etc/mtab have never been perfectly
> accurate, especially after a remount, when they could get totally out
> of sync with the options effective for the filesystem.

The /etc/mtab is almost always useless with NFS (the kernel changes
mount options according to NFS server settings, so it's possible that
you have "rw" in mtab and "ro" in /proc/mounts :-)

> Can someone think of any other problem with getting rid of /etc/mtab?

Crazy idea: make the kernel more promiscuous with mount options -- it
means you could use an arbitrary "foo=" option for mount(2) when the
max length of all options is less than or equal to
/proc/sys/fs/mntopt_max. (well... NACK :-)

I agree that the /etc/mtab file is a badly designed thing where we
duplicate almost everything from /proc/mounts, but loop= and uhelper=
are nice examples that userspace utils are able to capitalize on this
concept. Maybe we need something better than the mtab for userspace
specific options.

Somewhere at people.redhat.com/kzak/ I have a patch that adds LUKS
support to mount(8), and this patch also adds new options to the mtab
file. I can imagine more scenarios where userspace specific options
are a good thing.

> [1] http://lkml.org/lkml/2007/4/27/180

The patches have been postponed by Andrew, right? Or is it already in
-mm?

Karel
--
Karel Zak <[EMAIL PROTECTED]>
Re: [NFS] [PATCH] locks: provide a file lease method enabling cluster-coherent leases
On Thu, 2007-05-31 at 17:40 -0400, J. Bruce Fields wrote:
> From: J. Bruce Fields <[EMAIL PROTECTED]>
>
> Currently leases are only kept locally, so there's no way for a distributed
> filesystem to enforce them against multiple clients. We're particularly
> interested in the case of nfsd exporting a cluster filesystem, in which
> case nfsd needs cluster-coherent leases in order to implement delegations
> correctly.
>
> Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]>
> ---
>  fs/locks.c         |    5 -
>  include/linux/fs.h |    4
>  2 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 3f366e1..40a7f39 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1444,7 +1444,10 @@ int setlease(struct file *filp, long arg, struct file_lock **lease)
>  		return error;
>
>  	lock_kernel();
> -	error = __setlease(filp, arg, lease);
> +	if (filp->f_op && filp->f_op->set_lease)
> +		error = filp->f_op->set_lease(filp, arg, lease);
> +	else
> +		error = __setlease(filp, arg, lease);
>  	unlock_kernel();
>
>  	return error;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7cf0c54..09aefb4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1112,6 +1112,7 @@ struct file_operations {
>  	int (*flock) (struct file *, int, struct file_lock *);
>  	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
>  	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
> +	int (*set_lease)(struct file *, long, struct file_lock **);
>  };
>
>  struct inode_operations {
> @@ -1137,6 +1138,7 @@ struct inode_operations {
>  	ssize_t (*listxattr) (struct dentry *, char *, size_t);
>  	int (*removexattr) (struct dentry *, const char *);
>  	void (*truncate_range)(struct inode *, loff_t, loff_t);
> +	int (*break_lease)(struct inode *, unsigned int);

Splitting the lease into a file_operation part and an inode_operation
part looks really ugly. It also means that you're calling twice down
into the filesystem for every call to may_open() (once for
vfs_permission() and once for break_lease()) and 3 times in
do_sys_truncate().

Would it perhaps make sense to package up the call to vfs_permission()
and break_lease() as a single 'may_open()' inode operation that could be
called by may_open(), do_sys_truncate() and nfsd?

Trond
[PATCH] locks: share more common lease code
From: J. Bruce Fields <[EMAIL PROTECTED]>

Share more code between setlease (used by nfsd) and fcntl.

Also some minor cleanup.

Signed-off-by: "J. Bruce Fields" <[EMAIL PROTECTED]>
---
 fs/locks.c |   14 +++---
 1 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 431a8b8..3f366e1 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1469,14 +1469,6 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
 	struct inode *inode = dentry->d_inode;
 	int error;

-	if ((current->fsuid != inode->i_uid) && !capable(CAP_LEASE))
-		return -EACCES;
-	if (!S_ISREG(inode->i_mode))
-		return -EINVAL;
-	error = security_file_lock(filp, arg);
-	if (error)
-		return error;
-
 	locks_init_lock(&fl);
 	error = lease_init(filp, arg, &fl);
 	if (error)
@@ -1484,15 +1476,15 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)

 	lock_kernel();

-	error = __setlease(filp, arg, &flp);
+	error = setlease(filp, arg, &flp);
 	if (error || arg == F_UNLCK)
 		goto out_unlock;

 	error = fasync_helper(fd, filp, 1, &flp->fl_fasync);
 	if (error < 0) {
-		/* remove lease just inserted by __setlease */
+		/* remove lease just inserted by setlease */
 		flp->fl_type = F_UNLCK | F_INPROGRESS;
-		flp->fl_break_time = jiffies- 10;
+		flp->fl_break_time = jiffies - 10;
 		time_out_leases(inode);
 		goto out_unlock;
 	}
--
1.5.2.rc3
[PATCH] locks: provide a file lease method enabling cluster-coherent leases
From: J. Bruce Fields <[EMAIL PROTECTED]>

Currently leases are only kept locally, so there's no way for a distributed
filesystem to enforce them against multiple clients. We're particularly
interested in the case of nfsd exporting a cluster filesystem, in which
case nfsd needs cluster-coherent leases in order to implement delegations
correctly.

Signed-off-by: J. Bruce Fields <[EMAIL PROTECTED]>
---
 fs/locks.c         |    5 -
 include/linux/fs.h |    4
 2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 3f366e1..40a7f39 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1444,7 +1444,10 @@ int setlease(struct file *filp, long arg, struct file_lock **lease)
 		return error;

 	lock_kernel();
-	error = __setlease(filp, arg, lease);
+	if (filp->f_op && filp->f_op->set_lease)
+		error = filp->f_op->set_lease(filp, arg, lease);
+	else
+		error = __setlease(filp, arg, lease);
 	unlock_kernel();

 	return error;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7cf0c54..09aefb4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1112,6 +1112,7 @@ struct file_operations {
 	int (*flock) (struct file *, int, struct file_lock *);
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
+	int (*set_lease)(struct file *, long, struct file_lock **);
 };

 struct inode_operations {
@@ -1137,6 +1138,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	int (*break_lease)(struct inode *, unsigned int);
 };

 struct seq_file;
@@ -1463,6 +1465,8 @@ static inline int locks_verify_truncate(struct inode *inode,
 static inline int break_lease(struct inode *inode, unsigned int mode)
 {
+	if (inode->i_op && inode->i_op->break_lease)
+		return inode->i_op->break_lease(inode, mode);
 	if (inode->i_flock)
 		return __break_lease(inode, mode);
 	return 0;
--
1.5.2.rc3
cluster-coherent leases
NFSv4 uses leases to offer clients better caching. Unfortunately that
doesn't work well on a cluster filesystem, since leases are acquired
only locally.

So, the following patches add new lease and break methods, and a simple
lease implementation for GFS2 that just refuses to grant any leases for
now.

Opinions?

--b.
[PATCH] gfs2: stop giving out non-cluster-coherent leases
From: Marc Eshel <[EMAIL PROTECTED]>

Since gfs2 can't prevent conflicting opens or leases on other nodes, we
probably shouldn't allow it to give out leases at all. Put the newly
defined lease operation into use in gfs2 by turning off leases, unless
we're using the "nolock" locking module (in which case all locking is
local anyway).

Signed-off-by: Marc Eshel <[EMAIL PROTECTED]>
---
 fs/gfs2/ops_file.c | 26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/fs/gfs2/ops_file.c b/fs/gfs2/ops_file.c
index 064df88..78ac4ac 100644
--- a/fs/gfs2/ops_file.c
+++ b/fs/gfs2/ops_file.c
@@ -489,6 +489,31 @@ static int gfs2_fsync(struct file *file, struct dentry *dentry, int datasync)
 }

 /**
+ * gfs2_set_lease - acquire/release a file lease
+ * @file: the file pointer
+ * @arg: lease type
+ * @fl: file lock
+ *
+ * Returns: errno
+ */
+
+static int gfs2_set_lease(struct file *file, long arg, struct file_lock **fl)
+{
+	struct gfs2_sbd *sdp = GFS2_SB(file->f_mapping->host);
+	int ret = -EAGAIN;
+
+	if (sdp->sd_args.ar_localflocks) {
+		return setlease(file, arg, fl);
+	}
+
+	/* For now fail the delegation request. Cluster file system can not
+	   allow any node in the cluster to get a local lease until it can
+	   be managed centrally by the cluster file system.
+	*/
+	return ret;
+}
+
+/**
  * gfs2_lock - acquire/release a posix lock on a file
  * @file: the file pointer
  * @cmd: either modify or retrieve lock state, possibly wait
@@ -639,6 +664,7 @@ const struct file_operations gfs2_file_fops = {
 	.flock		= gfs2_flock,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.set_lease	= gfs2_set_lease,
 };

 const struct file_operations gfs2_dir_fops = {
--
1.5.2.rc3
Re: [NFS] [PATCH] locks: share more common lease code
On Thu, 2007-05-31 at 17:40 -0400, J. Bruce Fields wrote:
> From: J. Bruce Fields <[EMAIL PROTECTED]>
>
> Share more code between setlease (used by nfsd) and fcntl.
>
> Also some minor cleanup.
>
> Signed-off-by: "J. Bruce Fields" <[EMAIL PROTECTED]>
> ---
>  fs/locks.c | 14 +++---
>  1 files changed, 3 insertions(+), 11 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 431a8b8..3f366e1 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1469,14 +1469,6 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
>  	struct inode *inode = dentry->d_inode;
>  	int error;
>
> -	if ((current->fsuid != inode->i_uid) && !capable(CAP_LEASE))
> -		return -EACCES;
> -	if (!S_ISREG(inode->i_mode))
> -		return -EINVAL;
> -	error = security_file_lock(filp, arg);
> -	if (error)
> -		return error;
> -
>  	locks_init_lock(&fl);
>  	error = lease_init(filp, arg, &fl);
>  	if (error)
> @@ -1484,15 +1476,15 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
>
>  	lock_kernel();
>
> -	error = __setlease(filp, arg, &flp);
> +	error = setlease(filp, arg, &flp);
>  	if (error || arg == F_UNLCK)
>  		goto out_unlock;
>
>  	error = fasync_helper(fd, filp, 1, &flp->fl_fasync);
>  	if (error < 0) {
> -		/* remove lease just inserted by __setlease */
> +		/* remove lease just inserted by setlease */
>  		flp->fl_type = F_UNLCK | F_INPROGRESS;
> -		flp->fl_break_time = jiffies- 10;
> +		flp->fl_break_time = jiffies - 10;
>  		time_out_leases(inode);
>  		goto out_unlock;
>  	}

Why not move the security checks from setlease() into __setlease()? That
way you can continue to avoid the calls to (re)take the BKL, which are
redundant as far as fcntl_setlease() is concerned.

Cheers
  Trond
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31 2007, [EMAIL PROTECTED] wrote:
> On Thu, 31 May 2007, Jens Axboe wrote:
>
> >On Thu, May 31 2007, Phillip Susi wrote:
> >>David Chinner wrote:
> >>>That sounds like a good idea - we can leave the existing
> >>>WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> >>>behaviour that only guarantees ordering. The filesystem can then
> >>>choose which to use where appropriate
> >>
> >>So what if you want a synchronous write, but DON'T care about the order?
> >>They need to be two completely different flags which you can choose
> >>to combine, or use individually.
> >
> >If you have a use case for that, we can easily support it as well...
> >Depending on the drive capabilities (FUA support or not), it may be
> >nearly as slow as a "real" barrier write.
>
> true, but a "real" barrier write could have significant side effects on
> other writes that wouldn't happen with a synchronous write (a sync write
> can have other, unrelated writes re-ordered around it, a barrier write
> can't)

That is true, but the sync write also has side effects at the drive
side, since it may have a varied cost depending on the workload (eg
what already resides in the cache when it is issued), unless FUA is
active. That is also true for the barrier of course, but only for
previously submitted IO, as we don't reorder. I'm not saying that a
SYNC write won't be potentially useful, just that it's definitely not
free even outside of the write itself.

--
Jens Axboe
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, 31 May 2007, Jens Axboe wrote:
> On Thu, May 31 2007, Phillip Susi wrote:
> >David Chinner wrote:
> >>That sounds like a good idea - we can leave the existing
> >>WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> >>behaviour that only guarantees ordering. The filesystem can then
> >>choose which to use where appropriate
> >
> >So what if you want a synchronous write, but DON'T care about the order?
> >They need to be two completely different flags which you can choose
> >to combine, or use individually.
>
> If you have a use case for that, we can easily support it as well...
> Depending on the drive capabilities (FUA support or not), it may be
> nearly as slow as a "real" barrier write.

true, but a "real" barrier write could have significant side effects on
other writes that wouldn't happen with a synchronous write (a sync write
can have other, unrelated writes re-ordered around it, a barrier write
can't)

David Lang
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31 2007, Phillip Susi wrote:
> David Chinner wrote:
> >That sounds like a good idea - we can leave the existing
> >WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> >behaviour that only guarantees ordering. The filesystem can then
> >choose which to use where appropriate
>
> So what if you want a synchronous write, but DON'T care about the order?
> They need to be two completely different flags which you can choose
> to combine, or use individually.

If you have a use case for that, we can easily support it as well...
Depending on the drive capabilities (FUA support or not), it may be
nearly as slow as a "real" barrier write.

--
Jens Axboe
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31 2007, Phillip Susi wrote:
> Jens Axboe wrote:
> >No Stephan is right, the barrier is both an ordering and integrity
> >constraint. If a driver completes a barrier request before that request
> >and previously submitted requests are on STABLE storage, then it
> >violates that principle. Look at the code and the various ordering
> >options.
>
> I am saying that is the wrong thing to do. Barrier should be about
> ordering only. So long as the order they hit the media is maintained,
> the order the requests are completed in can change. barrier.txt bears

But you can't guarantee ordering without flushing the data out as well.
It all depends on the type of cache on the device, of course. If you
look at the ordinary sata/ide drive with write back caching, you can't
just issue the requests in order and pray that the drive cache will make
it to platter. If you don't have write back caching, or if the cache is
battery backed and thus guaranteed to never be lost, maintaining order
alone is enough. Or if the drive can do ordered queued commands, you can
relax the flushing (again depending on the cache type, you may need to
take different paths).

> "Requests in ordered sequence are issued in order, but not required to
> finish in order. Barrier implementation can handle out-of-order
> completion of ordered sequence. IOW, the requests MUST be processed in
> order but the hardware/software completion paths are allowed to reorder
> completion notifications - eg. current SCSI midlayer doesn't preserve
> completion order during error handling."

If you carefully re-read that paragraph, then it just tells you that the
software implementation can deal with reordered completions. It doesn't
relax the constraints on ordering and integrity AT ALL.

--
Jens Axboe
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Jens Axboe wrote:
> No Stephan is right, the barrier is both an ordering and integrity
> constraint. If a driver completes a barrier request before that request
> and previously submitted requests are on STABLE storage, then it
> violates that principle. Look at the code and the various ordering
> options.

I am saying that is the wrong thing to do. Barrier should be about
ordering only. So long as the order they hit the media is maintained,
the order the requests are completed in can change. barrier.txt bears
this out:

"Requests in ordered sequence are issued in order, but not required to
finish in order. Barrier implementation can handle out-of-order
completion of ordered sequence. IOW, the requests MUST be processed in
order but the hardware/software completion paths are allowed to reorder
completion notifications - eg. current SCSI midlayer doesn't preserve
completion order during error handling."
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
David Chinner wrote:
> That sounds like a good idea - we can leave the existing
> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> behaviour that only guarantees ordering. The filesystem can then
> choose which to use where appropriate

So what if you want a synchronous write, but DON'T care about the order?
They need to be two completely different flags which you can choose to
combine, or use individually.
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
David Chinner wrote:
> > you are understanding barriers to be the same as synchronous writes.
> > (and therefore the data is on persistent media before the call
> > returns)
>
> No, I'm describing the high level behaviour that is expected by a
> filesystem. The reasons for this are below

You say no, but then you go on to contradict yourself below.

> Ok, that's my understanding of how *device based barriers* can work,
> but there's more to it than that. As far as the filesystem is concerned
> the barrier write needs to *behave* exactly like a sync write because
> of the guarantees the filesystem has to provide userspace.
> Specifically - sync, sync writes and fsync.

There, you just ascribed the synchronous property to barrier requests.
This is false. Barriers are about ordering, synchronous writes are
another thing entirely. The filesystem is supposed to use barriers to
maintain ordering for journal data. If you are trying to handle a
synchronous write request, that's another flag.

> This is the big problem, right? If we use barriers for commit writes,
> the filesystem can return to userspace after a sync write or fsync()
> and an *ordered barrier device implementation* may not have written
> the blocks to persistent media. If we then pull the plug on the box,
> we've just lost data that sync or fsync said was successfully on disk.
> That's BAD.

That's why for synchronous writes, you set the flag to mark the request
as synchronous, which has nothing at all to do with barriers. You are
trying to use barriers to solve two different problems. Use one flag to
indicate ordering, and another to indicate synchronicity.

> Right now a barrier write on the last block of the fsync/sync write is
> sufficient to prevent that because of the FUA on the barrier block
> write. A purely ordered barrier implementation does not provide this
> guarantee.

This is a side effect of the implementation of the barrier, not part of
the semantics of barriers, so you shouldn't rely on this behavior. You
don't have to use FUA to handle the barrier request, and if you don't,
then the request can be completed while the data is still in the write
cache. You just have to make sure to flush it before any subsequent
requests.

> IOWs, there are two parts to the problem:
>
> 1 - guaranteeing I/O ordering
> 2 - guaranteeing blocks are on persistent storage.
>
> Right now, a single barrier I/O is used to provide both of these
> guarantees. In most cases, all we really need to provide is 1); the
> need for 2) is a much rarer condition but still needs to be provided.

Yep... two problems... two flags.

> Yes, if we define a barrier to only guarantee 1), then yes this would
> be a big win (esp. for XFS). But that requires all filesystems to
> handle sync writes differently, and sync_blockdev() needs to call
> blkdev_issue_flush() as well
>
> So, what do we do here? Do we define a barrier I/O to only provide
> ordering, or do we define it to also provide persistent storage
> writeback? Whatever we decide, it needs to be documented

We do the former or we end up in the same boat as O_DIRECT; where you
have one flag that means several things, and no way to specify you only
need some of those and not the others.
Re: [patch 0/2] i_version update
On Thu, 2007-05-31 at 10:33 +1000, David Chinner wrote: > On Wed, May 30, 2007 at 04:32:57PM -0700, Mingming Cao wrote: > > On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote: > > > On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote: > > > > Hi, > > > > > > > > This is an update of the i_version patch. > > > > The i_version field is a 64bit counter that is set on every inode > > > > creation and that is incremented every time the inode data is modified > > > > (similarly to the "ctime" time-stamp). > > > > > > My understanding (please correct me if I'm wrong) is that the > > > requirements are much more rigorous than simply incrementing an in > > > memory counter on every change. i.e. this counter has to > > > survive server crashes intact so clients never see the counter go > > > backwards. That means version number changes need to be journalled > > > along with the operation that caused the change of the version > > > number. > > > > > Yeah, the i_version is the in memory counter. From the patch it looks > > like the counter is being updated inside ext4_mark_iloc_dirty(), so it > > is being journalled and being flushed to the on-disk ext4 inode structure > > immediately (the on-disk ext4 inode structure is being modified/expanded to > > store the counter in the first patch). > > Ok, that catches most things (I missed that), but the version number still > needs to change on file data changes, right? So if we are overwriting the > file, we're calling __mark_inode_dirty(I_DIRTY_PAGES) which means you don't > get the callout and so the version number doesn't change or get logged. In > that case, the version number is not doing what it needs to do, right? > Hmm, maybe I missed something... but looking at the code again, in the case of overwrite (file data updated), it seems the ctime/mtime is being updated and the inode is being dirtied, so the version number is being updated. vfs_write()->..
  ->__generic_file_aio_write_nolock()
    ->file_update_time()
      ->mark_inode_dirty_sync()
        ->__mark_inode_dirty(I_DIRTY_SYNC)
          ->ext4_dirty_inode()
            ->ext4_mark_inode_dirty()

Regards, Mingming
Re: [PATCH resend] introduce I_SYNC
On Thu, 2007-05-31 at 16:25 +0200, Jörn Engel wrote:
> --- linux-2.6.21logfs/fs/jfs/jfs_txnmgr.c~I_LOCK	2007-05-07 10:28:55.0 +0200
> +++ linux-2.6.21logfs/fs/jfs/jfs_txnmgr.c	2007-05-29 13:10:32.0 +0200
> @@ -1286,7 +1286,14 @@ int txCommit(tid_t tid, /* transaction
>  		 * commit the transaction synchronously, so the last iput
>  		 * will be done by the calling thread (or later)
>  		 */
> -		if (tblk->u.ip->i_state & I_LOCK)
> +		/*
> +		 * I believe this code is no longer needed. Splitting I_LOCK
> +		 * into two bits, I_LOCK and I_SYNC should prevent this
> +		 * deadlock as well. But since I don't have a JFS testload
> +		 * to verify this, only a trivial s/I_LOCK/I_SYNC/ was done.
> +		 * Joern
> +		 */
> +		if (tblk->u.ip->i_state & I_SYNC)
>  			tblk->xflag &= ~COMMIT_LAZY;
>  	}

I think the code is still needed, and I think this change is correct.
The deadlock that this code is avoiding is caused by clear_inode()
calling wait_on_inode(). Since clear_inode() now calls
inode_sync_wait(inode), we want to avoid lazily committing this
transaction when the I_SYNC flag is set.

Unfortunately, recreating the deadlock is hard, and I haven't been able
to recreate it with this code commented out.

Thanks, Shaggy
--
David Kleikamp
IBM Linux Technology Center
Re: [RFC] obsoleting /etc/mtab
> > (2) needs work in the filesystems implicated. I already have patches
> > for ext2, ext3, tmpfs, devpts and hostfs, and it would be nice if the
> > maintainers for others could help out.
>
> A lot of these could be fixed all at once by letting the filesystem tell
> the VFS to retain the string passed to the original mount. That will
> solve *almost* all filesystems which take string options.

On remount some filesystems like ext[234] use the given options as a
delta, not as the new set of options. Others just ignore some of the
options on remount. Yes, /etc/mtab is broken wrt. remount too, but
somehow I feel breaking /proc/mounts this way too would be frowned upon.

> On the other hand, maybe it's cleaner to present a canonical view of the
> options. Note that /etc/mtab does not, however.

Yes, we could emulate /etc/mtab like this for filesystems which don't
have a ->show_options(), but mostly filesystems do have
->show_options(), they are just lazy about updating it with all the new
mount options.

Miklos
Re: [RFC] obsoleting /etc/mtab
Miklos Szeredi wrote:
> (2) needs work in the filesystems implicated. I already have patches
> for ext2, ext3, tmpfs, devpts and hostfs, and it would be nice if the
> maintainers for others could help out.

A lot of these could be fixed all at once by letting the filesystem tell
the VFS to retain the string passed to the original mount. That will
solve *almost* all filesystems which take string options.

On the other hand, maybe it's cleaner to present a canonical view of the
options. Note that /etc/mtab does not, however.

-hpa
[RFC] obsoleting /etc/mtab
I started working on adding support for unprivileged mounts[1] to
util-linux. The big obstacle seems to be the reliance on /etc/mtab,
since that won't be kept up-to-date after mount(2) or umount(2) calls by
unprivileged users.

It's not just mount(8) that reads /etc/mtab, but various other
utilities, for example df(1). So the best solution would be if /etc/mtab
were a symlink to /proc/mounts, and the kernel would be the
authoritative source of information regarding mounts. This works quite
well already, but there are minor problems:

(1) user mounts ("user" or "users" option in /etc/fstab) currently need
/etc/mtab to keep track of the owner

(2) lots of filesystems only present a few mount options (or none) in
/proc/mounts

(1) can be solved with the new mount owner support in the unprivileged
mounts patchset. Mount(8) would still have to detect at boot time if
this is available, and either create the symlink to /proc/mounts or, if
MS_SETOWNER is not available, fall back to using /etc/mtab.

(2) needs work in the filesystems implicated. I already have patches for
ext2, ext3, tmpfs, devpts and hostfs, and it would be nice if the
maintainers for others could help out.

It wouldn't even be fatal if some mount options were missing from
/proc/mounts. Mount options in /etc/mtab have never been perfectly
accurate, especially after a remount, when they could get totally out of
sync with the options effective for the filesystem.

Can someone think of any other problem with getting rid of /etc/mtab?

Thanks, Miklos

[1] http://lkml.org/lkml/2007/4/27/180
Re: [PATCH resend] introduce I_SYNC
On Wed, 16 May 2007 10:15:35 -0700, Andrew Morton wrote: > > If we're going to do this then please let's get some exhaustive commentary > in there so that others have a chance of understanding these flags without > having to do the amount of reverse-engineering which you've been put through. Done. Found and fixed some bugs in the process. By now I feal reasonable certain that the patch fixes more than it breaks. Jörn -- Good warriors cause others to come to them and do not go to others. -- Sun Tzu Introduce I_SYNC. I_LOCK was used for several unrelated purposes, which caused deadlock situations in certain filesystems as a side effect. One of the purposes now uses the new I_SYNC bit. Also document the various bits and change their order from historical to logical. Signed-off-by: Jörn Engel <[EMAIL PROTECTED]> --- fs/fs-writeback.c | 39 +++- fs/hugetlbfs/inode.c|2 - fs/inode.c |6 +-- fs/jfs/jfs_txnmgr.c |9 + fs/xfs/linux-2.6/xfs_iops.c |4 +- include/linux/fs.h | 70 ++-- include/linux/writeback.h |7 mm/page-writeback.c |2 - 8 files changed, 107 insertions(+), 32 deletions(-) --- linux-2.6.21logfs/fs/fs-writeback.c~I_LOCK 2007-05-07 10:28:53.0 +0200 +++ linux-2.6.21logfs/fs/fs-writeback.c 2007-05-07 13:29:35.0 +0200 @@ -99,11 +99,11 @@ void __mark_inode_dirty(struct inode *in inode->i_state |= flags; /* -* If the inode is locked, just update its dirty state. +* If the inode is being synced, just update its dirty state. * The unlocker will place the inode on the appropriate * superblock list, based upon its state. */ - if (inode->i_state & I_LOCK) + if (inode->i_state & I_SYNC) goto out; /* @@ -139,6 +139,15 @@ static int write_inode(struct inode *ino return 0; } +static void inode_sync_complete(struct inode *inode) +{ + /* +* Prevent speculative execution through spin_unlock(&inode_lock); +*/ + smp_mb(); + wake_up_bit(&inode->i_state, __I_SYNC); +} + /* * Write a single inode's dirty pages and inode data out to disk. * If `wait' is set, wait on the writeout. 
@@ -158,11 +167,11 @@ __sync_single_inode(struct inode *inode, int wait = wbc->sync_mode == WB_SYNC_ALL; int ret; - BUG_ON(inode->i_state & I_LOCK); + BUG_ON(inode->i_state & I_SYNC); - /* Set I_LOCK, reset I_DIRTY */ + /* Set I_SYNC, reset I_DIRTY */ dirty = inode->i_state & I_DIRTY; - inode->i_state |= I_LOCK; + inode->i_state |= I_SYNC; inode->i_state &= ~I_DIRTY; spin_unlock(&inode_lock); @@ -183,7 +192,7 @@ __sync_single_inode(struct inode *inode, } spin_lock(&inode_lock); - inode->i_state &= ~I_LOCK; + inode->i_state &= ~I_SYNC; if (!(inode->i_state & I_FREEING)) { if (!(inode->i_state & I_DIRTY) && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) { @@ -231,7 +240,7 @@ __sync_single_inode(struct inode *inode, list_move(&inode->i_list, &inode_unused); } } - wake_up_inode(inode); + inode_sync_complete(inode); return ret; } @@ -250,7 +259,7 @@ __writeback_single_inode(struct inode *i else WARN_ON(inode->i_state & I_WILL_FREE); - if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) { + if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_SYNC)) { struct address_space *mapping = inode->i_mapping; int ret; @@ -269,16 +278,16 @@ __writeback_single_inode(struct inode *i /* * It's a data-integrity sync. We must wait. */ - if (inode->i_state & I_LOCK) { - DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LOCK); + if (inode->i_state & I_SYNC) { + DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC); - wqh = bit_waitqueue(&inode->i_state, __I_LOCK); + wqh = bit_waitqueue(&inode->i_state, __I_SYNC); do { spin_unlock(&inode_lock); __wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE); spin_lock(&inode_lock); - } while (inode->i_state & I_LOCK); + } while (inode->i_state & I_SYNC); } return __sync_single_inode(inode, wbc); } @@ -311,7 +320,7 @@ __writeback_single_inode(struct inode *i * The inodes to be written are parked on sb->s_io. They are moved back onto * sb->s_dirty as they are selected for writing. This way, none can be missed * on the writer
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31 2007, Bill Davidsen wrote: > Jens Axboe wrote: > >On Thu, May 31 2007, David Chinner wrote: > > > >>On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: > >> > >>>On Thu, May 31 2007, David Chinner wrote: > >>> > IOWs, there are two parts to the problem: > > 1 - guaranteeing I/O ordering > 2 - guaranteeing blocks are on persistent storage. > > Right now, a single barrier I/O is used to provide both of these > guarantees. In most cases, all we really need to provide is 1); the > need for 2) is a much rarer condition but still needs to be > provided. > > > >if I am understanding it correctly, the big win for barriers is that > >you do NOT have to stop and wait until the data is on persistant media > >before you can continue. > > > Yes, if we define a barrier to only guarantee 1), then yes this > would be a big win (esp. for XFS). But that requires all filesystems > to handle sync writes differently, and sync_blockdev() needs to > call blkdev_issue_flush() as well > > So, what do we do here? Do we define a barrier I/O to only provide > ordering, or do we define it to also provide persistent storage > writeback? Whatever we decide, it needs to be documented > > >>>The block layer already has a notion of the two types of barriers, with > >>>a very small amount of tweaking we could expose that. There's absolutely > >>>zero reason we can't easily support both types of barriers. > >>> > >>That sounds like a good idea - we can leave the existing > >>WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED > >>behaviour that only guarantees ordering. The filesystem can then > >>choose which to use where appropriate > >> > > > >Precisely. The current definition of barriers are what Chris and I came > >up with many years ago, when solving the problem for reiserfs > >originally. It is by no means the only feasible approach. 
> > > >I'll add a WRITE_ORDERED command to the #barrier branch, it already > >contains the empty-bio barrier support I posted yesterday (well a > >slightly modified and cleaned up version). > > > > > Wait. Do filesystems expect (depend on) anything but ordering now? Does > md? Having users of barriers as they currently behave suddenly getting > SYNC behavior where they expect ORDERED is likely to have a negative > effect on performance. Or do I misread what is actually guaranteed by > WRITE_BARRIER now, and a flush is currently happening in all cases? See the above stuff you quote, it's answered there. It's not a change, this is how the Linux barrier write has always worked since I first implemented it. What David and I are talking about is adding a more relaxed version as well, that just implies ordering. > And will this also be available to user space f/s, since I just proposed > a project which uses one? :-( I see several uses for that, so I'd hope so. > I think the goal is good, more choice is almost always better choice, I > just want to be sure there won't be big disk performance regressions. We can't get more heavy weight than the current barrier, it's about as conservative as you can get. -- Jens Axboe
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Jens Axboe wrote: On Thu, May 31 2007, David Chinner wrote: On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: On Thu, May 31 2007, David Chinner wrote: IOWs, there are two parts to the problem: 1 - guaranteeing I/O ordering 2 - guaranteeing blocks are on persistent storage. Right now, a single barrier I/O is used to provide both of these guarantees. In most cases, all we really need to provide is 1); the need for 2) is a much rarer condition but still needs to be provided. if I am understanding it correctly, the big win for barriers is that you do NOT have to stop and wait until the data is on persistant media before you can continue. Yes, if we define a barrier to only guarantee 1), then yes this would be a big win (esp. for XFS). But that requires all filesystems to handle sync writes differently, and sync_blockdev() needs to call blkdev_issue_flush() as well So, what do we do here? Do we define a barrier I/O to only provide ordering, or do we define it to also provide persistent storage writeback? Whatever we decide, it needs to be documented The block layer already has a notion of the two types of barriers, with a very small amount of tweaking we could expose that. There's absolutely zero reason we can't easily support both types of barriers. That sounds like a good idea - we can leave the existing WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED behaviour that only guarantees ordering. The filesystem can then choose which to use where appropriate Precisely. The current definition of barriers are what Chris and I came up with many years ago, when solving the problem for reiserfs originally. It is by no means the only feasible approach. I'll add a WRITE_ORDERED command to the #barrier branch, it already contains the empty-bio barrier support I posted yesterday (well a slightly modified and cleaned up version). Wait. Do filesystems expect (depend on) anything but ordering now? Does md? 
Having users of barriers as they currently behave suddenly getting SYNC behavior where they expect ORDERED is likely to have a negative effect on performance. Or do I misread what is actually guaranteed by WRITE_BARRIER now, and a flush is currently happening in all cases? And will this also be available to user space f/s, since I just proposed a project which uses one? :-( I think the goal is good, more choice is almost always better choice, I just want to be sure there won't be big disk performance regressions. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: [patch 0/2] i_version update
On Thu, 2007-05-31 at 10:01 +1000, Neil Brown wrote:
> This will provide a change number that normally changes only when the
> file changes and doesn't require any extra storage on disk.
> The change number will change inappropriately only when the inode has
> fallen out of cache and is being reloaded, which is either after a
> crash (hopefully rare) or when a file hasn't been used for a while,
> implying that it is unlikely that any client has it in cache.

It will also change inappropriately if the server is under heavy load
and needs to reclaim memory by tossing out inodes that are cached and
still in use by the clients. That change will trigger clients to
invalidate their caches and to refetch the data from the server, further
cranking up the load. Not an ideal solution...

Trond
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Neil Brown wrote:
> On Monday May 28, [EMAIL PROTECTED] wrote:
> > There are two things I'm not sure you covered.
> >
> > First, disks which don't support flush but do have a "cache dirty"
> > status bit you can poll at times like shutdown. If there are no
> > drivers which support these, it can be ignored.
>
> There are really devices like that? So to implement a flush, you have
> to stop sending writes and wait and poll - maybe poll every
> millisecond?

Yes, there really are (or were). But I don't think that there are drivers, so it's not an issue.

> That wouldn't be very good for performance... maybe you just wouldn't
> bother with barriers on that sort of device?

That is why there are no drivers...

> Which reminds me: What is the best way to turn off barriers?

Several filesystems have "-o nobarriers" or "-o barriers=0", or the inverse. If they can function usefully without, the admin gets to make that choice.

> md/raid currently uses barriers to write metadata, and there is no way
> to turn that off. I'm beginning to wonder if that is best.

I don't see how you can have reliable operation without it, particularly WRT bitmap.

> Maybe barrier support should be a function of the device. i.e. the
> filesystem or whatever always sends barrier requests where it thinks it
> is appropriate, and the block device tries to honour them to the best
> of its ability, but if you run
>    blockdev --enforce-barriers=no /dev/sda
> then you lose some reliability guarantees, but gain some throughput (a
> bit like the 'async' export option for nfsd).

Since this is device dependent, it really should be in the device driver, and requests should have status of success, failure, or feature unavailability.

> > Second, NAS (including nbd?). Is there enough information to handle
> > this "really right"?
> >
> > NAS means lots of things, including NFS and CIFS where this doesn't
> > apply.
>
> Well, we're really talking about network attached devices rather than
> network filesystems. I guess people do lump them together.
>
> For 'nbd', it is entirely up to the protocol. If the protocol allows a
> barrier flag to be sent to the server, then barriers should just work.
> If it doesn't, then either the server disables write-back caching, or
> flushes every request, or you lose all barrier guarantees.

Pretty much agrees with what I said above, it's at a level closer to the device, and status should come back from the physical i/o request.

> For 'iscsi', I guess it works just the same as SCSI...

Hopefully.

-- 
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
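The network-device choice described above can be sketched in a few lines. This is a hypothetical in-process simulation, not real nbd code or the nbd wire protocol: the `BARRIER` flag, the server class, and the ack strings are all invented for illustration. A server whose protocol carries such a flag drains its write-back cache before acknowledging the flagged request; without the flag, it would have to disable caching or flush on every write.

```python
# Hypothetical sketch of a barrier-aware network block server; the flag,
# names, and ack strings are illustrative, not any real nbd protocol.

class RemoteDisk:
    def __init__(self):
        self.cache = []      # write-back cache (unsafe until flushed)
        self.media = []      # "safe storage"

    def flush(self):
        self.media.extend(self.cache)
        self.cache.clear()

class Server:
    BARRIER = 0x1            # hypothetical per-request protocol flag

    def __init__(self, disk):
        self.disk = disk

    def handle_write(self, block, flags=0):
        if flags & self.BARRIER:
            self.disk.flush()              # everything before the barrier...
            self.disk.media.append(block)  # ...then the barrier write itself
            return "ack-safe"              # ack only once it is all on media
        self.disk.cache.append(block)      # normal write: cached, acked early
        return "ack-cached"

disk = RemoteDisk()
srv = Server(disk)
srv.handle_write("x")
srv.handle_write("y")
status = srv.handle_write("journal-commit", flags=Server.BARRIER)
print(status, disk.media)   # the barrier forced x and y onto media first
```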
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
2007/5/30, Phillip Susi <[EMAIL PROTECTED]>:
> Stefan Bader wrote:
> >
> > Since drive a supports barrier requests we don't get -EOPNOTSUPP but
> > the request with block y might get written before block x since the
> > disks are independent. I guess the chances of this are quite low since
> > at some point a barrier request will also hit drive b but for the time
> > being it might be better to indicate -EOPNOTSUPP right from
> > device-mapper.
>
> The device mapper needs to ensure that ALL underlying devices get a
> barrier request when one comes down from above, even if it has to
> construct zero length barriers to send to most of them.

And somehow also make sure all of the barriers have been processed before returning the barrier that came in. Plus it would have to queue all mapping requests until the barrier is done (if strictly acting according to barrier.txt).

But I am wondering a bit whether the requirements on barriers are really that tight as described in Tejun's document (a barrier request is only started if everything before it is safe, the barrier itself isn't returned until it is safe, too, and all requests after the barrier aren't started before the barrier is done). Is it really necessary to defer any further requests until the barrier has been written to safe storage? Or would it be sufficient to guarantee that, if a barrier request returns, everything up to (and including) the barrier is on safe storage?

Stefan
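The fan-out scheme Phillip describes can be sketched as follows. This is an illustrative simulation, not device-mapper code: the `Leg` class, `dm_submit_barrier`, and the log format are invented for the example. The key points it shows are that every leg sees a barrier (a zero-length one where there is no data for that leg) and that the incoming barrier completes only after every leg has acknowledged.

```python
# Illustrative sketch (not device-mapper code) of fanning one incoming
# barrier out to every underlying device, with zero-length barriers for
# the legs that carry no data in this request.

class Leg:
    def __init__(self, name):
        self.name = name
        self.log = []

    def submit(self, req, barrier=False):
        self.log.append(("BARRIER" if barrier else "WRITE", req))
        return True    # acknowledge completion

def dm_submit_barrier(legs, payload_leg, payload):
    acks = 0
    for leg in legs:
        if leg is payload_leg:
            acks += leg.submit(payload, barrier=True)
        else:
            # zero-length barrier: orders the leg's queue, carries no data
            acks += leg.submit(None, barrier=True)
    # only when ALL legs have acknowledged may the barrier complete
    assert acks == len(legs)
    return "barrier complete"

legs = [Leg("sda"), Leg("sdb")]
print(dm_submit_barrier(legs, legs[0], "block-x"))
print(legs[1].log)   # sdb saw a zero-length barrier too
```

A strict reading of barrier.txt would additionally require queuing all later mapping requests until this returns, which is exactly the cost Stefan is questioning.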
Re: [PATCH] AFS: Implement file locking [try #2]
J. Bruce Fields <[EMAIL PROTECTED]> wrote:
> > Yes. I need to get the server lock first, before going to the VFS
> > locking routines.
>
> That doesn't really answer the question. The NFS client has similar
> requirements, but it doesn't have to duplicate the per-inode lists of
> granted locks, for example.

Actually, it might... It's just that they're in the lock manager server, not in the NFS client. As far as I can tell, NFS passes each lock request individually to the lock manager server, which grants them individually. AFS doesn't do that.

David
Cross-chunk reference checking time estimates
Hey all,

I altered Karuna's cref tool to print the number of seconds it would take to check the cross-references for a chunk. The results look good for chunkfs: on my laptop /home file system and a 1 GB chunk size, the per-chunk cross-reference check time would be an average of 5 seconds and a max of 160 seconds in 2013. This is calculated assuming average seek time and rotational latency delay for every cross-reference checked; some simple batching of I/Os could significantly improve that.

The tool is a little dodgy on error handling and other edge cases ATM, but for now, here's the results and the code (attached):

[EMAIL PROTECTED]:~/chunkfs/cref_new$ sudo ./cref.sh /dev/hda3 dump /home 1024
Total size = 19535040 KB
Total data stored = 13998240 KB
Number of files = 445406
Number of directories = 31836
Number of special files = 12156
Size of block groups = 1048576 KB
Inodes per block group = 130304
Intra-file cross references = 63167
Directory-subdirectory references = 429
Directory-file references = 2381
Total directory cross references = 2810
Total cross references = 65977
Average cross references per group = 439
Maximum cross references in a group = 13997
Max group is 4 (0:3, 1:46, 2:282, 3:4996, 5:8445, 6:2, 7:1, 8:27, 9:1, 10:2, 12:1, 13:51, 14:32, 15:99, 16:2, 17:5, 18:2, )
Average additional time to check cross references = 6.77 s
Max additional time to check cross references = 215.55 s
2013 average additional time to check cross references = 4.93 s
2013 max additional time to check cross references = 156.77 s

Questions? Come talk on #linuxfs at irc.oftc.net.

-VAL
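The arithmetic behind those times is easy to check: the tool charges one full random I/O (average seek plus rotational latency) per cross-reference. The drive parameters below are assumptions chosen to match a laptop disk of the era at roughly 15.4 ms per random I/O; they are not values taken from the tool itself.

```python
# Back-of-the-envelope check of the reported per-chunk fsck times:
# one random I/O per cross-reference, at assumed drive parameters.

avg_seek_ms = 11.3        # assumed average seek time (not from the tool)
avg_rot_ms = 4.1          # ~half a revolution at 7200 rpm is 4.17 ms
cost_s = (avg_seek_ms + avg_rot_ms) / 1000.0   # 0.0154 s per reference

avg_refs_per_group = 439      # from the tool output above
max_refs_in_group = 13997     # from the tool output above

avg_time = avg_refs_per_group * cost_s
max_time = max_refs_in_group * cost_s
print(f"avg {avg_time:.2f} s, max {max_time:.2f} s")
# agrees with the reported 6.77 s average and 215.55 s maximum to
# within a centisecond of rounding
```

This also makes VAL's batching point obvious: any sorting or clustering of those I/Os attacks the 15 ms per-reference constant directly.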
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31 2007, David Chinner wrote:
> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
> > On Thu, May 31 2007, David Chinner wrote:
> > > IOWs, there are two parts to the problem:
> > >
> > > 1 - guaranteeing I/O ordering
> > > 2 - guaranteeing blocks are on persistent storage.
> > >
> > > Right now, a single barrier I/O is used to provide both of these
> > > guarantees. In most cases, all we really need to provide is 1); the
> > > need for 2) is a much rarer condition but still needs to be
> > > provided.
> > >
> > > > if I am understanding it correctly, the big win for barriers is that
> > > > you do NOT have to stop and wait until the data is on persistent
> > > > media before you can continue.
> > >
> > > Yes, if we define a barrier to only guarantee 1), then yes this
> > > would be a big win (esp. for XFS). But that requires all filesystems
> > > to handle sync writes differently, and sync_blockdev() needs to
> > > call blkdev_issue_flush() as well
> > >
> > > So, what do we do here? Do we define a barrier I/O to only provide
> > > ordering, or do we define it to also provide persistent storage
> > > writeback? Whatever we decide, it needs to be documented
> >
> > The block layer already has a notion of the two types of barriers, with
> > a very small amount of tweaking we could expose that. There's absolutely
> > zero reason we can't easily support both types of barriers.
>
> That sounds like a good idea - we can leave the existing
> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> behaviour that only guarantees ordering. The filesystem can then
> choose which to use where appropriate

Precisely. The current definition of barriers is what Chris and I came up with many years ago, when solving the problem for reiserfs originally. It is by no means the only feasible approach.
I'll add a WRITE_ORDERED command to the #barrier branch, it already contains the empty-bio barrier support I posted yesterday (well a slightly modified and cleaned up version).

-- 
Jens Axboe
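The two flavours under discussion can be sketched in userspace. This is an illustrative simulation only: WRITE_ORDERED is the proposed (not yet existing) command, and the scheduler and disk classes are invented for the example, not the real block-layer interface. Plain writes may be reordered freely; an ordering-only barrier acts as a fence in the queue (guarantee 1); a full barrier additionally forces the write cache to media (guarantee 2).

```python
# Userspace sketch of ordering-only vs. full barriers; the constants
# mirror the proposal but nothing here is the actual block layer.
import random

WRITE, WRITE_ORDERED, WRITE_BARRIER = "write", "ordered", "barrier"

def schedule(queue):
    """Plain writes may be reordered; ordered/barrier writes are fences."""
    out, run = [], []
    for req in queue:
        if req[0] in (WRITE_ORDERED, WRITE_BARRIER):
            random.shuffle(run)      # reordering allowed within a run...
            out += run + [req]       # ...but never across the fence
            run = []
        else:
            run.append(req)
    random.shuffle(run)
    return out + run

class Disk:
    def __init__(self):
        self.cache, self.media = [], []

    def apply(self, req):
        kind, blk = req
        if kind == WRITE_BARRIER:
            self.media += self.cache + [blk]   # guarantee 2: flush to media
            self.cache = []
        else:
            self.cache.append(blk)   # WRITE/WRITE_ORDERED: cached only
                                     # (WRITE_ORDERED gave guarantee 1 above)

d = Disk()
for req in schedule([(WRITE, "a"), (WRITE, "b"), (WRITE_ORDERED, "commit"),
                     (WRITE_BARRIER, "super")]):
    d.apply(req)
print(d.media)   # a and b (in either order), then commit, then super
```

Note what the model makes visible: the ordering-only commit may still sit in the volatile cache until something flushes it, which is exactly the distinction between guarantees 1) and 2).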
Re: [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops
On Thu, 31 May 2007 07:15:39 +0200 Nick Piggin <[EMAIL PROTECTED]> wrote:
> If you can send that rollup, it would be good. I could try getting
> everything to compile and do some more testing on it too.

Single patch against 2.6.22-rc3:

  http://userweb.kernel.org/~akpm/np.gz

broken-out:

  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2007-05-30-09-30.tar.gz
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
> On Thu, May 31 2007, David Chinner wrote:
> > IOWs, there are two parts to the problem:
> >
> > 1 - guaranteeing I/O ordering
> > 2 - guaranteeing blocks are on persistent storage.
> >
> > Right now, a single barrier I/O is used to provide both of these
> > guarantees. In most cases, all we really need to provide is 1); the
> > need for 2) is a much rarer condition but still needs to be
> > provided.
> >
> > > if I am understanding it correctly, the big win for barriers is that
> > > you do NOT have to stop and wait until the data is on persistent
> > > media before you can continue.
> >
> > Yes, if we define a barrier to only guarantee 1), then yes this
> > would be a big win (esp. for XFS). But that requires all filesystems
> > to handle sync writes differently, and sync_blockdev() needs to
> > call blkdev_issue_flush() as well
> >
> > So, what do we do here? Do we define a barrier I/O to only provide
> > ordering, or do we define it to also provide persistent storage
> > writeback? Whatever we decide, it needs to be documented
>
> The block layer already has a notion of the two types of barriers, with
> a very small amount of tweaking we could expose that. There's absolutely
> zero reason we can't easily support both types of barriers.

That sounds like a good idea - we can leave the existing WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED behaviour that only guarantees ordering. The filesystem can then choose which to use where appropriate.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group