[PATCH] VFAT short names in 2.4.0-test5-pre[3,4]

2000-07-24 Thread Petr Vandrovec

Hi Gordon,
  it looks like the recently added 16-bit charset support introduced a
few bugs into the VFAT code. I was really surprised when I found that
the short name for VLAKC.TTR is VLAKCTTR...
  I'm not a VFAT guru, so can you please review this patch and either
fix it up or send it to Linus?
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]




diff -urdN linux/fs/vfat/namei.c linux/fs/vfat/namei.c
--- linux/fs/vfat/namei.c   Mon Jul 24 09:34:50 2000
+++ linux/fs/vfat/namei.c   Mon Jul 24 09:38:51 2000
@@ -453,10 +453,11 @@
 			if (charbuf[chi] != vfat_tolower(nls, c)) return -EINVAL;
 			if (strchr(replace_chars,c)) return -EINVAL;
 			if (c < ' ' || c == ':') return -EINVAL;
-			if (c == '.') break;
+			if (c == '.') goto dot;
 			space = c == ' ';
 		}
 	}
+dot:;
 	if (space) return -EINVAL;
 	if (len && c != '.') {
 		len--;
@@ -464,6 +465,7 @@
 		if (charbuf[0] != '.') return -EINVAL;
 	} else
 		return -EINVAL;
+	c = '.';
 	}
 	if (c == '.') {
 		if (len >= 4) return -EINVAL;
@@ -522,7 +524,7 @@
 		if (chl == 0)
 			return -EINVAL;
 		for (chi = 0; chi < chl; chi++){
-			if (charbuf[chi] == '.') break;
+			if (charbuf[chi] == '.') goto dot;
 			if (!charbuf[chi]) return -EINVAL;
 			if (walk - res == 8) return -EINVAL;
 			if (strchr(replace_chars,charbuf[chi])) return -EINVAL;
@@ -532,6 +534,7 @@
 			walk++;
 		}
 	}
+dot:;
 	if (space) return -EINVAL;
 	if (len <= 0) {
 		while (walk - res < 8) *walk++ = ' ';
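The root cause is a classic C pitfall: `break` leaves only the innermost loop, so when the 16-bit charset support turned the name scan into nested loops, the dot handling silently stopped firing. A minimal user-space sketch of the pitfall and of the `goto dot` fix (hypothetical helper names, not the kernel code):

```c
#include <stddef.h>
#include <string.h>

/* Buggy scan: the name is consumed in fixed-size chunks (as the new
 * multibyte-charset code does); `break` exits only the inner loop, so
 * the outer loop keeps going and the dot's position is lost. */
size_t find_dot_break(const char *name, size_t chunk)
{
	size_t len = strlen(name), i, j;

	for (i = 0; i < len; i += chunk)
		for (j = i; j < i + chunk && j < len; j++)
			if (name[j] == '.')
				break;	/* only leaves the inner loop */
	return len;			/* caller thinks there is no dot */
}

/* Fixed scan, mirroring the patch: `goto dot` leaves both loops. */
size_t find_dot_goto(const char *name, size_t chunk)
{
	size_t len = strlen(name), i, j;

	for (i = 0; i < len; i += chunk)
		for (j = i; j < i + chunk && j < len; j++)
			if (name[j] == '.')
				goto dot;	/* leaves both loops at once */
	return len;
dot:
	return j;
}
```

With the plain `break` the dot in "VLAKC.TTR" is effectively ignored, which is exactly how the short name degenerated into VLAKCTTR.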



cont_prepare_write fix

2000-07-24 Thread Roman Zippel

Hi,

While trying to get affs working again, I noticed that the current
cont_prepare_write hardly works: if you write to a file in steps of
1024 bytes, the bytes pointer is never updated, resulting in an endless
loop. Below is a fixed version that also removes the bogus tricks played
on the bytes ptr. I also "fixed" the comment - any reason we have to be
insulting like this?
Anyway, does anyone see a problem with the fix before I send it to
Linus?

bye, Roman

Index: fs/buffer.c
===
RCS file: /dump/cvs/linux-2.4/fs/buffer.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 buffer.c
--- fs/buffer.c 2000/07/17 10:46:49 1.1.1.3
+++ fs/buffer.c 2000/07/24 11:45:54
@@ -1585,7 +1585,7 @@
 }
 
 /*
- * For moronic filesystems that do not allow holes in file.
+ * For filesystems that do not allow holes in file.
  * We may have to extend the file.
  */
 
@@ -1600,7 +1600,15 @@
 	unsigned blocksize = inode->i_sb->s_blocksize;
 	char *kaddr;
 
-	while(page->index > (pgpos = *bytes >> PAGE_CACHE_SHIFT)) {
+restart:
+	zerofrom = offset;
+	pgpos = *bytes;
+	if (page->index < (pgpos >> PAGE_CACHE_SHIFT))
+		goto skip_extend;
+
+	zerofrom = pgpos & ~PAGE_CACHE_MASK;
+	pgpos >>= PAGE_CACHE_SHIFT;
+	while (page->index > pgpos) {
 		status = -ENOMEM;
 		new_page = grab_cache_page(mapping, pgpos);
 		if (!new_page)
@@ -1609,12 +1617,7 @@
 		if (*bytes >> PAGE_CACHE_SHIFT != pgpos) {
 			UnlockPage(new_page);
 			page_cache_release(new_page);
-			continue;
-		}
-		zerofrom = *bytes & ~PAGE_CACHE_MASK;
-		if (zerofrom & (blocksize-1)) {
-			*bytes |= (blocksize-1);
-			(*bytes)++;
+			goto restart;
 		}
 		status = __block_prepare_write(inode, new_page, zerofrom,
 						PAGE_CACHE_SIZE, get_block);
@@ -1626,35 +1629,26 @@
 		kunmap(new_page);
 		UnlockPage(new_page);
 		page_cache_release(new_page);
+		zerofrom = 0;
+		*bytes = ++pgpos << PAGE_CACHE_SHIFT;
 	}
 
-	if (page->index < pgpos) {
-		/* completely inside the area */
-		zerofrom = offset;
-	} else {
-		/* page covers the boundary, find the boundary offset */
-		zerofrom = *bytes & ~PAGE_CACHE_MASK;
-
-		/* if we will expand the thing last block will be filled */
-		if (to > zerofrom && (zerofrom & (blocksize-1))) {
-			*bytes |= (blocksize-1);
-			(*bytes)++;
-		}
+	pgpos <<= PAGE_CACHE_SHIFT;
+	if (zerofrom < to)
+		*bytes = pgpos + to;
 
-		/* starting below the boundary? Nothing to zero out */
-		if (offset <= zerofrom)
-			zerofrom = offset;
-	}
+skip_extend:
 	status = __block_prepare_write(inode, page, zerofrom, to, get_block);
-	if (status)
-		goto out1;
-	kaddr = (char*)page_address(page);
-	if (zerofrom < offset) {
-		memset(kaddr+zerofrom, 0, offset-zerofrom);
-		__block_commit_write(inode, page, zerofrom, offset);
+	if (!status) {
+		if (zerofrom < offset) {
+			kaddr = (char*)page_address(page);
+			memset(kaddr+zerofrom, 0, offset-zerofrom);
+			__block_commit_write(inode, page, zerofrom, offset);
+			*bytes = pgpos + to;
+		}
+		return 0;
 	}
-	return 0;
-out1:
+
 	ClearPageUptodate(page);
 	kunmap(page);
 	return status;



Re: cont_prepare_write fix

2000-07-24 Thread Alexander Viro



On Mon, 24 Jul 2000, Roman Zippel wrote:

 Hi,
 
 While trying to get affs working again, I noticed that the current
 cont_prepare_write is hardly working, if you write to a file in steps of
 1024, the bytes pointer is never updated resulting in an endless loop.

Testcase, please? It certainly seems to be working on FAT and HPFS -
both are using the same thing.

 Below is a fixed version, that also removes the bogus tricks to the
 bytes ptr. I also "fixed" the comment, any reason we have to be
 insulting like this?

Sheer frustration from the braindead design? _Nothing_ can insult persons
responsible for AFFS and FAT "designs". Sorry.

 Anyway, does anyone see a problem with the fix before I send it to
 Linus?

Yes. In particular, adding the whole PAGE_CACHE_SIZE to *bytes is
_definitely_ wrong. The whole "round up to the block boundary" thing is
there for a purpose - in their native environment these filesystems do
not care to clean the tail of the last allocated block, so we have to do
it for them. And we don't want to repeat such cleaning on _every_
write() - the first one should normalize the situation.
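The round-up being defended here is the `*bytes |= (blocksize-1); (*bytes)++;` idiom from the removed hunk. A user-space sketch of what that idiom does (assuming, as the kernel does, that blocksize is a power of two):

```c
/* Round `bytes` up to the next multiple of `blocksize` (a power of
 * two), but only when it points into a partial block -- an aligned
 * value is left alone, so the cleanup runs at most once per file tail. */
unsigned long round_up_tail(unsigned long bytes, unsigned long blocksize)
{
	if (bytes & (blocksize - 1)) {	/* tail of a partially used block */
		bytes |= blocksize - 1;	/* jump to last byte of the block */
		bytes++;		/* ...and onto the boundary itself */
	}
	return bytes;
}
```

So a file tail at byte 1000 with 512-byte blocks is normalized to 1024 on the first write, and subsequent writes see an aligned value and skip the zero-out entirely.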

If you have problems with affs_get_block() - feel free to ask, I probably
can help. What _is_ the problem? Notice that you can assume that
affs_get_block() will only have to create blocks immediately past the
end-of-file. Please, check fat_get_block() or hpfs_get_block() - they
seem to work. I don't claim that I didn't fsck up there, but I would
certainly like to hear more before I decide that I did.

If you need description of cont_... design - just ask.




Re: cont_prepare_write fix

2000-07-24 Thread Roman Zippel

Hi,

  While trying to get affs working again, I noticed that the current
  cont_prepare_write is hardly working, if you write to a file in steps of
  1024, the bytes pointer is never updated resulting in an endless loop.
 
 Testcase, please? It certainly seems to be working on FAT and HPFS -
 both are using the same thing.

Ok, now I see it - they are updating it themselves. Why isn't that done
by the vfs?

  Below is a fixed version, that also removes the bogus tricks to the
  bytes ptr. I also "fixed" the comment, any reason we have to be
  insulting like this?
 
 Sheer frustration from the braindead design? _Nothing_ can insult persons
 responsible for AFFS and FAT "designs". Sorry.

$ grep cont_prepare_write fs/*/*.c
fs/adfs/inode.c:prepare_write:  cont_prepare_write,
fs/affs/file.c: prepare_write: cont_prepare_write,
fs/hfs/inode.c: prepare_write: cont_prepare_write,
fs/hpfs/file.c: prepare_write: cont_prepare_write,
fs/ntfs/fs.c:   prepare_write: cont_prepare_write,
fs/qnx4/inode.c:prepare_write: cont_prepare_write,

Just because a fs wasn't designed for UNIX doesn't mean it's shit.
Sorry, for me it's a simple matter of respect. A fs isn't "moronic" just
because it doesn't support holes; the real design issues are somewhere
else. Anyway, since I only see mmu_private updates in fat and hpfs, it
looks like adfs, hfs, ntfs and qnx4 are broken as well. BTW wouldn't it
be better to introduce an i_realsize field for that? It could be useful
for other filesystems later as well. E.g. it can be used to fix the
read/truncate race without the BKL (i_realsize only grows while the file
is open, and truncate can be delayed until the file is closed).

 Yes. In particular, adding the whole PAGE_CACHE_SIZE to *bytes is
 _definitely_ wrong. The whole "round up to the block boundary" thing is
 there for purpose - in native environment all of these filesystems do not
 care to clean the tail of the last allocated block. So we have to do it
 for them. And we don't want to repeat such cleaning on _every_ write() -
 the first one should normalize the situation.

Rounding it up to the block size isn't a problem; the question is why we
pass the bytes ptr to cont_prepare_write if get_block has to update it
in the first place. That's probably the point that confused me here.

 If you have problems with affs_get_block() - feel free to ask, I probably
 can help. What _is_ the problem? Notice that you can assume that
 affs_get_block() will only have to create blocks immediately past the
 end-of-file.

Yes, the whole get_block/getblock stuff in affs needs a real cleanup
because of this.

bye, Roman



Re: ext3 on root filesystem problem

2000-07-24 Thread Andreas Dilger

Jeremy writes:
 Hmm...no, I'm not booting from a ramdisk.  The only ramdisk involved is
 to initialize my scsi adapter, and I've already successfully installed
 the journal on my scsi drive's partition.

The problem is that you probably have a journal option in your lilo.conf
file, which is getting passed to the ramdisk and not the real root fs.
If you have already mounted it as ext3, then you don't need any journal
mount options, as the filesystem will mount as ext3 before it tries to
mount as ext2.

 I thought about what you suggest, just putting things in fstab, but then I
 fear not being able to boot.

You need a rescue floppy for sure.  It doesn't need ext3 support, but it
does need debugfs so you can turn off the recovery feature and mount the
filesystem in an emergency.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: cont_prepare_write fix

2000-07-24 Thread Alexander Viro



On Mon, 24 Jul 2000, Roman Zippel wrote:

 Hi,
 
   While trying to get affs working again, I noticed that the current
   cont_prepare_write is hardly working, if you write to a file in steps of
   1024, the bytes pointer is never updated resulting in an endless loop.
  
  Testcase, please? It certainly seems to be working on FAT and HPFS -
  both are using the same thing.
 
 Ok, now I see it, they are updating it themselves, why isn't that done
 by the vfs?

Because the actual _allocation_ unit may have nothing to do with the
blocksize and be invisible to the VFS. Check fatfs - it uses the thing
to trigger allocation, but the actual zero-out goes on a per-sector
basis.

  Sheer frustration from the braindead design? _Nothing_ can insult persons
  responsible for AFFS and FAT "designs". Sorry.
 
 $ grep cont_prepare_write fs/*/*.c
 fs/adfs/inode.c:prepare_write:  cont_prepare_write,
Not on this box. What version are you using?
 fs/affs/file.c: prepare_write: cont_prepare_write,
 fs/hfs/inode.c: prepare_write: cont_prepare_write,
Ditto.
 fs/hpfs/file.c: prepare_write: cont_prepare_write,
Ditto.
 fs/ntfs/fs.c:   prepare_write: cont_prepare_write,
Ditto.
 fs/qnx4/inode.c:prepare_write: cont_prepare_write,
Ditto.
They are _calling_ the thing, but they would simply refuse to compile if
they had such lines - different prototypes.

BTW, when that comment went in it was FAT, HPFS and AFFS.

 Only because a fs wasn't designed for UNIX, means it's shit. Sorry, for
 me it's a simple matter of respect. A fs isn't "moronic", only because
 it doesn't support holes, the real design issues are somewhere else.

And you _bet_ that they are. No, lack of holes per se is not
horrible. But there is a _lot_ of things about AFFS and FAT (and to some
extent HPFS) design that are past "horrible" - "Lovecraftian" would be
more accurate. But that's another rant...

 Anyway, probably as I only see mmu_private updates in fat and hpfs, it
 looks like adfs, hfs, ntfs and qnx4 are broken as well. BTW wouldn't it
From adfs_get_block():
	...
	/* don't support allocation of blocks yet */
	return -EIO;
NTFS is not just broken, it eats filesystems in rw mode.
QNX4 does not allocate blocks ("// to be done", says the driver).
HFS refuses to allocate anything. Probably broken...

 be better to introduce a i_realsize field for that, it can be useful for
 other fs later as well. E.g. it can be used to fix read/truncate race
 without a BKL (i_realsize only grows while file is open and truncate can
 be delayed until file is closed).

Not needed. That race is taken care of - readpage() never does access past
i_size and has the page locked, while vmtruncate() starts with setting
i_size (on a shrinking truncate(), that is), then goes through all pages in
the affected area and grabs the lock on all of them (one by one, not
simultaneously, indeed) and only then calls ->truncate().

So a readpage() in the affected area that began before we've set ->i_size
will be completed prior to the call of ->truncate(), and any attempt after
setting ->i_size will see the new ->i_size and will not bother ->truncate()
at all.

_That_ one is OK. There are very scary dragons around ->truncate(), but
a) they are unrelated to that
b) knowledge of realsize won't help
c) let's take that to email.

 Rounding it up to the block size isn't a problem, the question is why do
 we give the bytes ptr to cont_prepare_write, if get_block has to update
 it in first place. That's probably the point that confused me here.

cont_prepare_write() has to zero out the tail of the last block ;-)

  If you have problems with affs_get_block() - feel free to ask, I probably
  can help. What _is_ the problem? Notice that you can assume that
  affs_get_block() will only have to create blocks immediately past the
  end-of-file.
 
 Yes, the whole get_block/getblock stuff in affs needs a real cleanup
 because of this.

Well, it will get simpler - all the logic for growing the chain shrinks
a lot, AFAICS...




Re: cont_prepare_write fix

2000-07-24 Thread Benjamin C.R. LaHaise

On Mon, 24 Jul 2000, Alexander Viro wrote:

 Not needed. That race is taken care of - readpage() never does access past
 i_size and has the page locked, while vmtruncate() starts with setting
 i_size (on shrinking truncate(), that is), then goes through all pages in
 affected area and grabs the lock on all of them (one by one, not
 simultaneously, indeed) and only then calls ->truncate().

That assumes that i_size is atomically updated, which currently is not the
case on 32 bit platforms.

-ben




Re: cont_prepare_write fix

2000-07-24 Thread Roman Zippel

Hi,

 Because actual _allocation_ unit may have nothing to blocksize and be
 invisible to VFS. Check fatfs - it uses the thing to trigger allocation,
 but actual zero-out goes on per-sector basis.

I tried vfat on an MO drive with 1KByte sectors - it doesn't work. I
also don't know how you want to do it. A fs with a block size smaller
than the sector size can't possibly write directly into the page cache,
and I don't see anywhere where that is taken care of.

  fs/adfs/inode.c:prepare_write:  cont_prepare_write,
  Not on this box. What version are you using?
 They are _calling_ the thing, but they would simply refuse to compile if
 they would have such lines - different prototypes.

Oops, sorry, wrong source tree - it's a 2.4 with some other changes. :)
What I tried was to change the prototypes so the small wrapper routines
in _every_ fs can be removed. It seems to work, except that
cont_prepare_write needs that extra argument (another reason I'd like to
move it to i_realsize). Is there any reason not to do this?

 Not needed. That race is taken care of - readpage() never does access past
 i_size and has the page locked, while vmtruncate() starts with setting
 i_size (on shrinking truncate(), that is), then goes through all pages in
 affected area and grabs the lock on all of them (one by one, not
 simultaneously, indeed) and only then calls ->truncate().

Hmm, then I don't understand the BKL in the get_block functions.

  Rounding it up to the block size isn't a problem, the question is why do
  we give the bytes ptr to cont_prepare_write, if get_block has to update
  it in first place. That's probably the point that confused me here.
 
 cont_prepare_write() has to zero the tail of the last block out ;-

IMO we could simplify that if we did the preallocation business in the
page cache. We only need to support a dummy page structure in the page
cache to delay the allocation of the real page (we need that anyway if
we want to do direct io into/from e.g. pci space), but that's more
something for 2.5 - or is there any reason why it shouldn't work?

bye, Roman



Re: Virtual files

2000-07-24 Thread Pavel Machek

Hi!

 I've been looking at the possibility of cleanly implementing virtual
 files (podfuk/avfs style) with the multiple mount thing in 2.4
 kernels. Here is how it would work:
 
 1) There needs to be a global lookup hook in VFS. If a real lookup
 fails, the global lookup is invoked. The cached lookup path is not
 affected.
 
 2) The global lookup checks whether the name refers to a virtual file
 (contains the magic char && exists in /overlay). If it does, then the
 file/dir from overlay is mounted on top of the negative dentry. The
 dentry is not filled in, so the virtual file remains invisible.
 
 For 1) the VFS needs to be modified, but with infinitesimal
 performance effect. For 2) a kernel module could be used.

Great!

 I've prepared a kernel patch against 2.4.0-test4 (just for comments on
 this, NOT for inclusion in the main kernel), and a kernel module. I've
 tested it with avfs, but it should also work for podfuk.

It looks much cleaner than my previous approaches. There seems to be
something strange with usage counts:

pavel@bug:/tmp/cz1_mbrola.tgz#utar$ ls -al
total 88
drwxr-xr-x     2 pavel users  3586 Mar 26 18:43 ./
drwxrwxrwt    38 root  root  81920 Jul 24 09:57 ../
drwxr-xr-x 42949 pavel users     0 Mar 25 11:16 czech/
pavel@bug:/tmp/cz1_mbrola.tgz#utar$

 Problems:
 
  - The filesystem will be littered with these loopback mounts. This
should be cleared upon unmount, and possibly when the dcache is
shrunk. There was a similar requirement for new autofs IIRC.
 
  - Creation/removal of virtual files are not handled by this code.
 
 Comments?

Yes. You seem to have damaged locking:

 @@ -281,12 +285,18 @@
  		lock_kernel();
  		result = dir->i_op->lookup(dir, dentry);
  		unlock_kernel();
 -		if (result)
 +		if (result) {
  			dput(dentry);
 -		else
 +			up(&dir->i_sem);
 +		}
 +		else {
 +			up(&dir->i_sem);
  			result = dentry;
 +			/* for avfs */
 +			if (!dentry->d_inode && lookup_global)
 +				lookup_global(nd, dentry);
 +		}
  	}
 -	up(&dir->i_sem);

In case of !dentry, up(&dir->i_sem) is not executed at all.

I have some cleanups to the nredir.c module; they are attached (please
apply). Ouch, and there's a slight problem with caching:

You access vmlinux#gz, then load the nredir module, then access
vmlinux#gz again. But the failure is already cached and you are out of
luck :-(.
Pavel
-- 
I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]


--- /tmp/nredir.c   Mon Jul 24 10:03:48 2000
+++ nredir.cMon Jul 24 09:53:05 2000
@@ -1,4 +1,5 @@
 #include <linux/module.h>
+#include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
@@ -7,7 +8,7 @@
 
 #define NREDIR_VERSION "0.1"
 
-#define AVFS_MAGIC_CHAR '@'
+#define AVFS_MAGIC_CHAR '#'
 #define OVERLAY_DIR "/overlay"
 #define OVERLAY_DIR_LEN 8
 
@@ -99,7 +100,7 @@
 return dentry;
 }
 
-int init_module(void)
+static int __init init_nredir(void)
 {
 printk(KERN_INFO "nredir init (version %s)\n", NREDIR_VERSION);
 
@@ -111,11 +112,14 @@
 }
 
 
-void cleanup_module(void)
+static void __exit cleanup_nredir(void)
 {
 printk(KERN_INFO "nredir cleanup\n");
 
-lookup_global = 0;
+lookup_global = NULL;
 
 printk("lookup_global: %x\n", (int) lookup_global);
 }
+
+module_init(init_nredir);
+module_exit(cleanup_nredir);



Re: Virtual files

2000-07-24 Thread Alexander Viro



On Mon, 24 Jul 2000, Pavel Machek wrote:

 Hi!
 
  I've been looking at the possibility of cleanly implementing virtual
  files (podfuk/avfs style) with the multiple mount thing in 2.4
  kernels. Here is how it would work:
  
  1) There needs to be a global lookup hook in VFS. If a real lookup
  fails, the global lookup is invoked. The cached lookup path is not
  affected.
  
  2) The global lookup checks whether the name refers to a virtual file
  (contains the magic char  exists in /overlay). If it does, then the
  file/dir from overlay is mounted on top of the negative dentry. The
  dentry is not filled in, so the virtual file remains invisible.
  
  For 1) the VFS needs to be modified, but with infinitesimal
  performance effect. For 2) a kernel module could be used.

Erm? Negative lookup happens every time you are creating an object.

  Problems:
  
   - The filesystem will be littered with these loopback mounts. This
 should be cleared upon unmount, and possibly when the dcache is
 shrunk. There was a similar requirement for new autofs IIRC.

Not with the dcache shrinking.

   - Creation/removal of virtual files are not handled by this code.
  
  Comments?

How will it interact with modifying the mount-tree?

Comment on autofs is pretty interesting, though. Sigh... Just let me find
some time to do mount-traps - that may very well be the right tool here...




Re: cont_prepare_write fix

2000-07-24 Thread Jan Harkes

On Mon, Jul 24, 2000 at 12:04:39PM -0400, Benjamin C.R. LaHaise wrote:
 On Mon, 24 Jul 2000, Alexander Viro wrote:
 
  Not needed. That race is taken care of - readpage() never does access past
  i_size and has the page locked, while vmtruncate() starts with setting
  i_size (on shrinking truncate(), that is), then goes through all pages in
  affected area and grabs the lock on all of them (one by one, not
  simultaneously, indeed) and only then calls ->truncate().
 
 That assumes that i_size is atomically updated, which currently is not the
 case on 32 bit platforms.
 
   -ben

I found that i_size is in most cases reasonably protected by i_sem.
Except for Coda, where inode->i_size != inode->i_mapping->host->i_size,
because we `borrow' the mapping of the underlying container file.

At the moment I believe that the introduction of i_mapping->host might
have been a mistake. The readpage/writepage operations use
i_mapping->host->i_size, but sys_stat and generic_file_write (for
a file with O_APPEND) are using inode->i_size. So for Coda we are
copying the i_size back to the Coda inode in a wrapper around
generic_file_write.

AFAIK Coda is the only FS that has an address space mapping
associated with another inode, and we need to lock the i_sem of that
inode to avoid problems. Maybe the Coda file write should become
something like:

coda_file_write()
{
	down(&container_inode->i_sem);
	n = generic_file_write(file, ...);
	container_inode->i_size = file->f_dentry->d_inode->i_size;
	up(&container_inode->i_sem);
	return n;
}

That way all writes are protected by the semaphore on the container
file's inode. And the use of i_mapping->host, along with the nasty cast
to (struct inode *), could be removed.

Jan




question about sard and disk profiling patches

2000-07-24 Thread Jeremy Hansen


I build some customized kernel rpms for our in-house distro and I've
incorporated the profiling patches needed by the sard utility.  I'm just
curious whether there is any downside to using this patch, and whether
there are any stability concerns with it.

Thanks
-jeremy

-- 

http://www.xxedgexx.com | [EMAIL PROTECTED]
-









Re: cont_prepare_write fix

2000-07-24 Thread David Gibson

On Mon, Jul 24, 2000 at 02:24:42PM -0400, Jan Harkes wrote:
 On Mon, Jul 24, 2000 at 12:04:39PM -0400, Benjamin C.R. LaHaise wrote:
  On Mon, 24 Jul 2000, Alexander Viro wrote:
  
   Not needed. That race is taken care of - readpage() never does access past
   i_size and has the page locked, while vmtruncate() starts with setting
   i_size (on shrinking truncate(), that is), then goes through all pages in
   affected area and grabs the lock on all of them (one by one, not
   simultaneously, indeed) and only then calls -truncate().
  
  That assumes that i_size is atomically updated, which currently is not the
  case on 32 bit platforms.
 
 I found that i_size is in most cases reasonably protected by i_sem.
 Except for Coda, where inode->i_size != inode->i_mapping->host->i_size,
 because we `borrow' the mapping of the underlying container file.
 
 At the moment I believe that the introduction of i_mapping->host might
 have been a mistake. The readpage/writepage operations use
 i_mapping->host->i_size, but sys_stat and generic_file_write (for
 a file with O_APPEND) are using inode->i_size. So for Coda we are
 copying the i_size back to the Coda inode in a wrapper around
 generic_file_write.
 
 AFAIK Coda is the only FS that has an address space mapping
 associated with another inode, and we need to lock the i_sem of that
 inode to avoid problems. Maybe the Coda file write should become
 something like:
 
 coda_file_write()
 {
 	down(&container_inode->i_sem);
 	n = generic_file_write(file, ...);
 	container_inode->i_size = file->f_dentry->d_inode->i_size;
 	up(&container_inode->i_sem);
 	return n;
 }
 
 That way all writes are protected by the semaphore on the container
 file's inode. And the use of i_mapping->host, along with the nasty cast
 to (struct inode *), could be removed.

This might work for Coda, but I think in general there are a couple of
problems with always using the struct file to get at the inode
structure. IIRC either writepage or prepare_write is called with
file==NULL in some circumstances (when creating a symlink, I think). I
discovered this when I got oopsen in my patch which adds
space/inode/dentry usage limits to ramfs.

There's another problem too: for ramfs limits, I obviously also need
to keep track of when a page is removed from an inode. The only sane
way I've seen to do this is to add a hook to address_space_operations
called when the page is removed (currently I'm calling it from
remove_page_from_inode_queue() but that may not be the best place). At
this point, only the struct page is available, so from that I need to
find the inode (well, actually inode->i_sb, where I store the
limits). Easy with mapping->host, otherwise impossible (AFAICT; it's
entirely possible I'm missing something blindingly obvious).
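The pointer chase David describes can be shown with toy mock-ups of the structures (simplified stand-ins, not the real 2.4 definitions): with mapping->host, getting from a bare struct page back to the superblock is a two-hop walk; without it there is simply no back-pointer to follow.

```c
/* Simplified mock-ups of the 2.4-era structures (illustration only). */
struct super_block { long s_magic; };
struct inode {
	struct super_block *i_sb;
};
struct address_space {
	struct inode *host;	/* the back-pointer under discussion */
};
struct page {
	struct address_space *mapping;
};

/* From a bare struct page, recover the superblock that would hold the
 * ramfs usage limits -- only reachable via mapping->host. */
static struct super_block *page_to_sb(const struct page *page)
{
	return page->mapping->host->i_sb;
}
```

Without the host field, the hook called from remove_page_from_inode_queue() would receive a struct page and have nowhere to go.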

-- 
David Gibson, Technical Support Engineer, Linuxcare, Inc.
+61 2 6262 8990
[EMAIL PROTECTED], http://www.linuxcare.com/ 
Linuxcare. Support for the revolution.