Re: [patch] [possible race in ext2] Re: how to write get_block?
"Stephen C. Tweedie" <[EMAIL PROTECTED]> writes: > Hi, > > On Sun, 10 Oct 1999 16:57:18 +0200 (CEST), Andrea Arcangeli > <[EMAIL PROTECTED]> said: > > > My point was that even being forced to do a lookup before creating > > each empty buffer, will be still faster than 2.2.x as in 2.3.x the hash > > will contain only metadata. Less elements means faster lookups. > > The _fast_ quick fix is to maintain a per-inode list of dirty buffers > and to invalidate that list when we do a delete. This works for > directories if we only support truncate back to zero --- it obviously > gets things wrong if we allow partial truncates of directories (but why > would anyone want to allow that?!) > > This would have minimal performance implication and would also allow > fast fsync() of indirect block metadata for regular files. What about adding to the end of ext2_alloc_block: bh = get_hash_table(inode->i_dev, result, inode->i_sb->s_blocksize); /* something is playing with our fresh block, make them stop. ;-) */ if (bh) { if (buffer_dirty(bh)) { mark_buffer_clean(bh); wait_on_buffer(bh); } bforget(bh); } This is in a relatively slow/uncommon path and also catches races if indirect blocks are allocated to a file, instead of just directory syncs. Ultimately we really want to have indirect blocks, and the directory in the page cache as it should result in more uniform code, and faster partial truncates (as well as faster syncs). Eric
Re: [RFC] truncate() (generic stuff)
On Mon, 11 Oct 1999, Mikulas Patocka wrote: > > I mean the following: assume that call of hpfs_truncate() (via > > ->truncate()) had been moved into the inode_setattr() (i.e. happens > > immediately before the hpfs_write_inode() in hpfs_notify_change()). Will > > it have any bad consequences? Right now the sequence looks so: > > inode->i_size = newsize; > > hpfs_write_inode(inode); > > hpfs_truncate(inode); > > Could we swap the last steps? > > Probably yes, the first hpfs_write_inode is not needed at all. Aha. OK, since with this patch ->truncate() is called only from inode_setattr() I'm leaving the call of hpfs_write_inode() in hpfs_notify_change() and removing it from hpfs_truncate(). > You can do the same just by examining content of dnode (dirents have fnode > numbers). They do, but lookup in icache is going to be more expensive. Besides, we will have to do it for every dirent when we are splitting dnode. And we do not care for inodes not in core. > > ->clear_inode() would exclude the sucker from the lists, all inodes marked > > with I_FREEING (i.e. those that gave NULL upon igrab()) are skipped during > > the cyclic list searches. > > There is another clash if you use async write_inode. Somebody is modifying > directory, btree operation is in progress and now - system decides to call > write_inode. You can't wait in write_inode until directory is in > consistent state, you just must patch dirent somwhere in memory and > return. That means - inode must hold pointer to dirent on disk or in > memory. No problems - in-core inode holds a location on disk, any dirent moving starts from setting it to 0 (with spinlock held - see above) and ends with setting it according to new location. write_inode(): write fnode part; retry: loc = location; if (!loc) return; get the block corresponding to loc; grab the spinlock; if (loc still points to the same block) { update dirent; mark it dirty; release block; release spinlock; return; } /* we had been moved */ release block; release spinlock; goto retry; That's it. This is what I've done in FAT and it doesn't take much code. > It is possible to write it, but HPFS is currently very stable and > race-free (the last race I had was bug in kernel) and such modifications > would surely make a lot of problems. OK, I definitely agree with that. It's worth doing someday, but not now. Anyway, the new variant of truncate patch follows. Please, check whether you have problems with the hpfs part. Cheers, Al diff -urN linux-2.3.20/fs/affs/inode.c linux-bird.truncate/fs/affs/inode.c --- linux-2.3.20/fs/affs/inode.cMon Jun 21 12:36:43 1999 +++ linux-bird.truncate/fs/affs/inode.c Mon Oct 11 10:55:11 1999 @@ -246,7 +246,6 @@ inode->u.affs_i.i_protect = mode_to_prot(attr->ia_mode); inode_setattr(inode, attr); - mark_inode_dirty(inode); error = 0; out: return error; diff -urN linux-2.3.20/fs/attr.c linux-bird.truncate/fs/attr.c --- linux-2.3.20/fs/attr.c Mon Jun 21 12:36:42 1999 +++ linux-bird.truncate/fs/attr.c Mon Oct 11 10:54:44 1999 @@ -62,8 +62,6 @@ inode->i_uid = attr->ia_uid; if (ia_valid & ATTR_GID) inode->i_gid = attr->ia_gid; - if (ia_valid & ATTR_SIZE) - inode->i_size = attr->ia_size; if (ia_valid & ATTR_ATIME) inode->i_atime = attr->ia_atime; if (ia_valid & ATTR_MTIME) @@ -75,6 +73,12 @@ if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID)) inode->i_mode &= ~S_ISGID; } + if (ia_valid & ATTR_SIZE) { + inode->i_size = attr->ia_size; + vmtruncate(inode, attr->ia_size); + if (inode->i_op && inode->i_op->truncate) + inode->i_op->truncate(inode); + } mark_inode_dirty(inode); } diff -urN linux-2.3.20/fs/coda/inode.c linux-bird.truncate/fs/coda/inode.c --- linux-2.3.20/fs/coda/inode.cSun Sep 12 12:45:49 1999 +++ linux-bird.truncate/fs/coda/inode.c Mon Oct 11 10:59:23 1999 @@ -229,6 +229,8 @@ if ( !error ) { coda_vattr_to_iattr(inode, &vattr); coda_cache_clear_inode(inode); + if (iattr->ia_valid && ATTR_SIZE) + vmtruncate(inode, iattr->ia_size); } CDEBUG(D_SUPER, "inode.i_mode %o, error %d\n", inode->i_mode, error); diff -urN linux-2.3.20/fs/hpfs/file.c linux-bird.truncate/fs/hpfs/file.c --- linux-2.3.20/fs/hpfs/file.c Sun Sep 12 06:00:35 1999 +++ linux-bird.truncate/fs/hpfs/file.c Mon Oct 11 17:44:44 1999 @@ -55,7 +55,6 @@ i->i_hpfs_n_secs = 0;
Re: HFS, QNXFS
On Mon, 11 Oct 1999, Jedi/Sector One wrote: > Jeff Garzik wrote: > > IMHO, even if an fs was bug-free in 2.2.x, lack of testing under the new > > 2.3.x VFS implies an experimental and possibly non-working nature. > > The QNXFS is not bug-free in 2.2.x . It is useable to share some files I'll say... > between Linux and QNX, and when QNX is screwed up, this is really a good > help : things can be fixed under Linux a lot easier than with a QNX > rescue system. > I am a coder and maintainer of the QNX FS. The QSSL team gave me a > valid QNX license in order to do some work like that project. But what > they didn't tell me is that this license was limited and a few days > after the QNX FS was in the Kernel, it expired. I tried to got in touch > with QSSL without any success, maybe they don't want to help us in > working on their FS for Linux. OK, to bloody hell with QSSL. Do you have a documentation? If the license had expired, how can you use your code at all (with what)? Do you have valid fs images? > I will try to fix it for 2.3.x (in fact, I am waiting for the ReiserFS > port for that series because my system heavily depends on ReiserFS) ??? > asap. Some features are still missing to fully support the FS (like the > extents list) . I'd be glad to implement them as soon as I will have a > copy of QNX to work with. == when your contacts with QSSL will give some results, that is? I suspect that those guys are less than excited by the interesting trick played by Amiga idi^H^H^Hmanagement. Does their license allow remote logins? Is it per-user or per-CPU or what? If it does you might consider comp.os.linux.announce - somebody might help you.
Re: [RFC] truncate() (generic stuff)
> I mean the following: assume that call of hpfs_truncate() (via > ->truncate()) had been moved into the inode_setattr() (i.e. happens > immediately before the hpfs_write_inode() in hpfs_notify_change()). Will > it have any bad consequences? Right now the sequence looks so: > inode->i_size = newsize; > hpfs_write_inode(inode); > hpfs_truncate(inode); > Could we swap the last steps? Probably yes, the first hpfs_write_inode is not needed at all. > > > I can toss the code for that in - preserving the current behaviour, but > > > allowing to get rid of locking. Would you mind such a variant? It's not > > > like the thing was going to take a lot of time - straightforward copy of > > > FAT version would go. > > > > Do you see any better solution than to go through all btree operations and > > update pointer on every dirent move? > > Just put a cyclic list through the inodes referenced from the same > dnode + hash the lists by dnode block number (hash chain going through one > of the cyclic list members, shifted to the neighbor if we are excluding > the thing from cyclic list; the whole mess protected by spinlock). Then > you can quickly find all in-core inodes that belong to any given dnode. You can do the same just by examining content of dnode (dirents have fnode numbers). > ->clear_inode() would exclude the sucker from the lists, all inodes marked > with I_FREEING (i.e. those that gave NULL upon igrab()) are skipped during > the cyclic list searches. There is another clash if you use async write_inode. Somebody is modifying directory, btree operation is in progress and now - system decides to call write_inode. You can't wait in write_inode until directory is in consistent state, you just must patch dirent somwhere in memory and return. That means - inode must hold pointer to dirent on disk or in memory. It is possible to write it, but HPFS is currently very stable and race-free (the last race I had was bug in kernel) and such modifications would surely make a lot of problems. Mikulas Patocka
Re: HFS, QNXFS
Jeff Garzik wrote: > IMHO, even if an fs was bug-free in 2.2.x, lack of testing under the new > 2.3.x VFS implies an experimental and possibly non-working nature. The QNXFS is not bug-free in 2.2.x . It is useable to share some files between Linux and QNX, and when QNX is screwed up, this is really a good help : things can be fixed under Linux a lot easier than with a QNX rescue system. I am a coder and maintainer of the QNX FS. The QSSL team gave me a valid QNX license in order to do some work like that project. But what they didn't tell me is that this license was limited and a few days after the QNX FS was in the Kernel, it expired. I tried to got in touch with QSSL without any success, maybe they don't want to help us in working on their FS for Linux. I will try to fix it for 2.3.x (in fact, I am waiting for the ReiserFS port for that series because my system heavily depends on ReiserFS) asap. Some features are still missing to fully support the FS (like the extents list) . I'd be glad to implement them as soon as I will have a copy of QNX to work with. Scuba made a very good starting point for that FS implementation, but I've not heard about him a lot of time ago. Still alive Scuba ? Best regards, -- Frank DENIS aka Jedi/Sector One aka DJ Chrysalis <[EMAIL PROTECTED]> -> Software : http://www.jedi.claranet.fr <- -> Music : http://www.mp3.com/chrysalis <-
Re: [RFC] truncate() (generic stuff)
On Mon, 11 Oct 1999, Mikulas Patocka wrote: > > > > hpfs - no problems, but there is something rather odd. Why on the > > > > Earth does it forcibly write inode to disk in the end of notify_change()? > > > > Immediately before that inode is dirtified by inode_setattr(). Mikulas, is > > > > there something special that makes you write it ASAP? > > > > > > HPFS doesn't use write_inode. On HPFS, write_inode needs to lock parent > > > directory to update dirent. write_inode is called asynchronously and > > > locking directory makes deadlocks. So, write_inode is NULL and inode is > > > explicitly written on notify_change and file_release. > > > > I see. OK, will it hurt if would do hpfs_truncate() prior to the call of > > hpfs_file_write()? > > What do you mean? truncate and write together will surely cause bad > things. I mean the following: assume that call of hpfs_truncate() (via ->truncate()) had been moved into the inode_setattr() (i.e. happens immediately before the hpfs_write_inode() in hpfs_notify_change()). Will it have any bad consequences? Right now the sequence looks so: inode->i_size = newsize; hpfs_write_inode(inode); hpfs_truncate(inode); Could we swap the last steps? > > I can toss the code for that in - preserving the current behaviour, but > > allowing to get rid of locking. Would you mind such a variant? It's not > > like the thing was going to take a lot of time - straightforward copy of > > FAT version would go. > > Do you see any better solution than to go through all btree operations and > update pointer on every dirent move? Just put a cyclic list through the inodes referenced from the same dnode + hash the lists by dnode block number (hash chain going through one of the cyclic list members, shifted to the neighbor if we are excluding the thing from cyclic list; the whole mess protected by spinlock). Then you can quickly find all in-core inodes that belong to any given dnode. ->clear_inode() would exclude the sucker from the lists, all inodes marked with I_FREEING (i.e. those that gave NULL upon igrab()) are skipped during the cyclic list searches.
Re: [RFC] truncate() (generic stuff)
> > > hpfs - no problems, but there is something rather odd. Why on the > > > Earth does it forcibly write inode to disk in the end of notify_change()? > > > Immediately before that inode is dirtified by inode_setattr(). Mikulas, is > > > there something special that makes you write it ASAP? > > > > HPFS doesn't use write_inode. On HPFS, write_inode needs to lock parent > > directory to update dirent. write_inode is called asynchronously and > > locking directory makes deadlocks. So, write_inode is NULL and inode is > > explicitly written on notify_change and file_release. > > I see. OK, will it hurt if would do hpfs_truncate() prior to the call of > hpfs_file_write()? What do you mean? truncate and write together will surely cause bad things. > > [ Once you suggested a solution for this - keep in inode pointer to > > directory data, but I'm too lazy to implement it. HPFS is not performance > > critical ] > > I can toss the code for that in - preserving the current behaviour, but > allowing to get rid of locking. Would you mind such a variant? It's not > like the thing was going to take a lot of time - straightforward copy of > FAT version would go. Do you see any better solution than to go through all btree operations and update pointer on every dirent move? Mikulas Patocka
Re: [patch] [possible race in ext2] Re: how to write get_block?
Hi, On Sun, 10 Oct 1999 16:57:18 +0200 (CEST), Andrea Arcangeli <[EMAIL PROTECTED]> said: > My point was that even being forced to do a lookup before creating > each empty buffer, will be still faster than 2.2.x as in 2.3.x the hash > will contain only metadata. Less elements means faster lookups. The _fast_ quick fix is to maintain a per-inode list of dirty buffers and to invalidate that list when we do a delete. This works for directories if we only support truncate back to zero --- it obviously gets things wrong if we allow partial truncates of directories (but why would anyone want to allow that?!) This would have minimal performance implication and would also allow fast fsync() of indirect block metadata for regular files. --Stephen
Re: [patch] [possible race in ext2] Re: how to write get_block?
On Mon, 11 Oct 1999, Stephen C. Tweedie wrote: > Hi, > > On Sun, 10 Oct 1999 16:57:18 +0200 (CEST), Andrea Arcangeli > <[EMAIL PROTECTED]> said: > > > My point was that even being forced to do a lookup before creating > > each empty buffer, will be still faster than 2.2.x as in 2.3.x the hash > > will contain only metadata. Less elements means faster lookups. > > The _fast_ quick fix is to maintain a per-inode list of dirty buffers > and to invalidate that list when we do a delete. This works for > directories if we only support truncate back to zero --- it obviously > gets things wrong if we allow partial truncates of directories (but why > would anyone want to allow that?!) AFFS. OTOH there we are dropping the known bh every time.
Re: [RFC] truncate() (generic stuff)
On Mon, 11 Oct 1999, Mikulas Patocka wrote: > > hpfs - no problems, but there is something rather odd. Why on the > > Earth does it forcibly write inode to disk in the end of notify_change()? > > Immediately before that inode is dirtified by inode_setattr(). Mikulas, is > > there something special that makes you write it ASAP? > > HPFS doesn't use write_inode. On HPFS, write_inode needs to lock parent > directory to update dirent. write_inode is called asynchronously and > locking directory makes deadlocks. So, write_inode is NULL and inode is > explicitly written on notify_change and file_release. I see. OK, will it hurt if would do hpfs_truncate() prior to the call of hpfs_file_write()? > [ Once you suggested a solution for this - keep in inode pointer to > directory data, but I'm too lazy to implement it. HPFS is not performance > critical ] I can toss the code for that in - preserving the current behaviour, but allowing to get rid of locking. Would you mind such a variant? It's not like the thing was going to take a lot of time - straightforward copy of FAT version would go.
Re: [RFC] truncate() (generic stuff)
> hpfs - no problems, but there is something rather odd. Why on the > Earth does it forcibly write inode to disk in the end of notify_change()? > Immediately before that inode is dirtified by inode_setattr(). Mikulas, is > there something special that makes you write it ASAP? HPFS doesn't use write_inode. On HPFS, write_inode needs to lock parent directory to update dirent. write_inode is called asynchronously and locking directory makes deadlocks. So, write_inode is NULL and inode is explicitly written on notify_change and file_release. [ Once you suggested a solution for this - keep in inode pointer to directory data, but I'm too lazy to implement it. HPFS is not performance critical ] Mikulas Patocka
[RFC] truncate() (generic stuff)
I began screwing around the truncate() stuff and the following is a status report/request for comments: a) call of ->truncate() method (and vmtruncate()) had been moved into the notify_change(). It is triggered if ATTR_SIZE is set. Modified places: do_truncate(), nfsd_truncate(), nfsd_setattr() and hpfs_unlink(). Semantics had been changed (fixed, AFAICS) for nfsd_* - truncation happens with capabilities dropped. BTW, nfsd_setattr() didn't call vmtruncate(). Fixed. b) I'ld rather pushed the call of ->truncate() deeper - into the inode_setattr() if possible. Situation being: adfs, efs, isofs, qnx4 and romfs - readonly, no problems at all. devpts, autofs - no regular files, no problems. ext2, ufs, minix, sysv, msdos, vfat, udf, umsdos - no problems. ntfs - very odd. No ->truncate(), but there is ->write(). IOW, call of sys_truncate() will happily pass and just change ->i_size. Steve, could you comment on that? No problems wrt this patch, but something is very wrong. procfs - there may be drago^Wbugs. That is, sys_truncate() silently sets i_size to zero. sys_chmod() is also, erm, interesting. Unrelated to the patch, though. hpfs - no problems, but there is something rather odd. Why on the Earth does it forcibly write inode to disk in the end of notify_change()? Immediately before that inode is dirtified by inode_setattr(). Mikulas, is there something special that makes you write it ASAP? affs - no problems, except the call of make_inode_dirty() immediately after inode_setattr() (which dirtifies inode immediately before exit). I'm removing the call. hfs - probably OK, but I'ld really like to hear the confirmation - is there any chance of troubles (from the HFS side, VFS is OK) if the call of ->truncate() will happen before the hfs_cat_mark_dirty()? (in hfs_notify_change(), that is). coda,nfs,ncpfs,smbfs - I'm bringing the call of vmtruncate() into the end of ->notify_change() (if ATTR_SIZE is set, indeed). It preserves the current behaviour, but I'm not sure that it's OK. Those filesystems don't have ->truncate() and do the equivalent theselves (from ->notify_change()). Maybe vmtruncate() should be called earlier. c) ext2-related parts (and general messing with pagecache for directories) - still too unstable. IOW, crashes the kernel _fast_. The following is the relatively stable part of the patch (no ext2 stuff). Please, look if you have any problems with it. Cheers, Al diff -urN linux-2.3.20/fs/affs/inode.c linux-bird.truncate/fs/affs/inode.c --- linux-2.3.20/fs/affs/inode.cMon Jun 21 12:36:43 1999 +++ linux-bird.truncate/fs/affs/inode.c Mon Oct 11 10:55:11 1999 @@ -246,7 +246,6 @@ inode->u.affs_i.i_protect = mode_to_prot(attr->ia_mode); inode_setattr(inode, attr); - mark_inode_dirty(inode); error = 0; out: return error; diff -urN linux-2.3.20/fs/attr.c linux-bird.truncate/fs/attr.c --- linux-2.3.20/fs/attr.c Mon Jun 21 12:36:42 1999 +++ linux-bird.truncate/fs/attr.c Mon Oct 11 10:54:44 1999 @@ -62,8 +62,6 @@ inode->i_uid = attr->ia_uid; if (ia_valid & ATTR_GID) inode->i_gid = attr->ia_gid; - if (ia_valid & ATTR_SIZE) - inode->i_size = attr->ia_size; if (ia_valid & ATTR_ATIME) inode->i_atime = attr->ia_atime; if (ia_valid & ATTR_MTIME) @@ -75,6 +73,12 @@ if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID)) inode->i_mode &= ~S_ISGID; } + if (ia_valid & ATTR_SIZE) { + inode->i_size = attr->ia_size; + vmtruncate(inode, attr->ia_size); + if (inode->i_op && inode->i_op->truncate) + inode->i_op->truncate(inode); + } mark_inode_dirty(inode); } diff -urN linux-2.3.20/fs/coda/inode.c linux-bird.truncate/fs/coda/inode.c --- linux-2.3.20/fs/coda/inode.cSun Sep 12 12:45:49 1999 +++ linux-bird.truncate/fs/coda/inode.c Mon Oct 11 10:59:23 1999 @@ -229,6 +229,8 @@ if ( !error ) { coda_vattr_to_iattr(inode, &vattr); coda_cache_clear_inode(inode); + if (iattr->ia_valid && ATTR_SIZE) + vmtruncate(inode, iattr->ia_size); } CDEBUG(D_SUPER, "inode.i_mode %o, error %d\n", inode->i_mode, error); diff -urN linux-2.3.20/fs/hpfs/inode.c linux-bird.truncate/fs/hpfs/inode.c --- linux-2.3.20/fs/hpfs/inode.cSun Sep 12 06:00:35 1999 +++ linux-bird.truncate/fs/hpfs/inode.c Mon Oct 11 10:55:25 1999 @@ -364,7 +364,6 @@ if (inode->i_sb->s_hpfs_root == inode->i_ino) return -EINVAL; if ((error = inode_change_ok(inode, attr))) return error; inode_setattr(inode, attr); -
Re: (reiserfs) Re: RE: journal requirements for buffer.c (was: Roma progress report)
Hans Reiser <[EMAIL PROTECTED]> writes: > I feel we should encourage Linus to allow the following: > > * unions in struct buffer_head and struct page containing filesystem specific > fields comparable to the union in struct inode. No. In struct buffer_head I don't have problems. In struct page (you don't want to pay the overhead for every page of memory)... There is the buffer_head *bh pointer that could be made more generic. > > * a filesystem operation which flushes to disk a buffer or page, and leaves it > to the filesystem whether to flush a whole commit along with it, or a whole > bunch of semantically adjacent buffers, or to repack the buffers before writing > them, or to write a whole bunch of semantically nearby buffers, or to fill the > rest of the RAID stripe, or assign a block number to the page, or mark the page > copy_on_write, or whatever else it wants. Or allocate bounce buffers, for high memory pages. . . Though I don't think copy_on_write and (whatever else it wants) are necessarily hot ideas. There need to be some constraints. > Do the rest of you agree? I agree that it is a good idea. Exactly how it should, and if it should be implemented for 2.3 is a second question. There is also the related issue of the page lock not currently working for NFS (it needs some I/O locking). Since I've been pushing for this for a while. I'll see if there is anything that could be considered a ``bug fix'' doable yet in 2.3 The recent patch to improve fdatasync performance is also somewhat related. Eric