cont_prepare_write fix

2000-07-24 Thread Roman Zippel

Hi,

While trying to get affs working again, I noticed that the current
cont_prepare_write is hardly working: if you write to a file in steps of
1024 bytes, the bytes pointer is never updated, resulting in an endless
loop. Below is a fixed version that also removes the bogus tricks played
on the bytes ptr. I also "fixed" the comment - any reason we have to be
insulting like this?
Anyway, does anyone see a problem with the fix before I send it to
Linus?

bye, Roman

Index: fs/buffer.c
===================================================================
RCS file: /dump/cvs/linux-2.4/fs/buffer.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 buffer.c
--- fs/buffer.c 2000/07/17 10:46:49 1.1.1.3
+++ fs/buffer.c 2000/07/24 11:45:54
@@ -1585,7 +1585,7 @@
 }
 
 /*
- * For moronic filesystems that do not allow holes in file.
+ * For filesystems that do not allow holes in file.
  * We may have to extend the file.
  */
 
@@ -1600,7 +1600,15 @@
 	unsigned blocksize = inode->i_sb->s_blocksize;
 	char *kaddr;
 
-	while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
+restart:
+	zerofrom = offset;
+	pgpos = *bytes;
+	if (page->index < (pgpos >> PAGE_CACHE_SHIFT))
+		goto skip_extend;
+
+	zerofrom = pgpos & ~PAGE_CACHE_MASK;
+	pgpos >>= PAGE_CACHE_SHIFT;
+	while (page->index > pgpos) {
 		status = -ENOMEM;
 		new_page = grab_cache_page(mapping, pgpos);
 		if (!new_page)
@@ -1609,12 +1617,7 @@
 		if (*bytes>>PAGE_CACHE_SHIFT != pgpos) {
 			UnlockPage(new_page);
 			page_cache_release(new_page);
-			continue;
-		}
-		zerofrom = *bytes & ~PAGE_CACHE_MASK;
-		if (zerofrom & (blocksize-1)) {
-			*bytes |= (blocksize-1);
-			(*bytes)++;
+			goto restart;
 		}
 		status = __block_prepare_write(inode, new_page, zerofrom,
 						PAGE_CACHE_SIZE, get_block);
@@ -1626,35 +1629,26 @@
 		kunmap(new_page);
 		UnlockPage(new_page);
 		page_cache_release(new_page);
+		zerofrom = 0;
+		*bytes = ++pgpos << PAGE_CACHE_SHIFT;
 	}
 
-	if (page->index < pgpos) {
-		/* completely inside the area */
-		zerofrom = offset;
-	} else {
-		/* page covers the boundary, find the boundary offset */
-		zerofrom = *bytes & ~PAGE_CACHE_MASK;
-
-		/* if we will expand the thing last block will be filled */
-		if (to > zerofrom && (zerofrom & (blocksize-1))) {
-			*bytes |= (blocksize-1);
-			(*bytes)++;
-		}
+	pgpos <<= PAGE_CACHE_SHIFT;
+	if (zerofrom < to)
+		*bytes = pgpos + to;
 
-		/* starting below the boundary? Nothing to zero out */
-		if (offset <= zerofrom)
-			zerofrom = offset;
-	}
+skip_extend:
 	status = __block_prepare_write(inode, page, zerofrom, to, get_block);
-	if (status)
-		goto out1;
-	kaddr = (char*)page_address(page);
-	if (zerofrom < offset) {
-		memset(kaddr+zerofrom, 0, offset-zerofrom);
-		__block_commit_write(inode, page, zerofrom, offset);
+	if (!status) {
+		if (zerofrom < offset) {
+			kaddr = (char*)page_address(page);
+			memset(kaddr+zerofrom, 0, offset-zerofrom);
+			__block_commit_write(inode, page, zerofrom, offset);
+			*bytes = pgpos + to;
+		}
+		return 0;
 	}
-	return 0;
-out1:
+
ClearPageUptodate(page);
kunmap(page);
return status;



Re: cont_prepare_write fix

2000-07-24 Thread Alexander Viro



On Mon, 24 Jul 2000, Roman Zippel wrote:

 Hi,
 
 While trying to get affs working again, I noticed that the current
 cont_prepare_write is hardly working, if you write to a file in steps of
 1024, the bytes pointer is never updated resulting in an endless loop.

Testcase, please? It certainly seems to be working on FAT and HPFS -
both use the same thing.

 Below is a fixed version, that also removes the bogus tricks to the
 bytes ptr. I also "fixed" the comment, any reason we have to be
 insulting like this?

Sheer frustration from the braindead design? _Nothing_ can insult persons
responsible for AFFS and FAT "designs". Sorry.

 Anyway, does anyone see a problem with the fix before I send it to
 Linus?

Yes. In particular, adding the whole PAGE_CACHE_SIZE to *bytes is
_definitely_ wrong. The whole "round up to the block boundary" thing is
there for a purpose - in their native environment all of these filesystems
do not care to clean the tail of the last allocated block. So we have to do
it for them. And we don't want to repeat such cleaning on _every_ write() -
the first one should normalize the situation.

If you have problems with affs_get_block() - feel free to ask, I probably
can help. What _is_ the problem? Notice that you can assume that
affs_get_block() will only have to create blocks immediately past the
end-of-file. Please, check fat_get_block() or hpfs_get_block() - they
seem to work. I don't claim that I didn't fsck up there, but I would
certainly like to hear more before I decide that I did.

If you need description of cont_... design - just ask.




Re: cont_prepare_write fix

2000-07-24 Thread Roman Zippel

Hi,

  While trying to get affs working again, I noticed that the current
  cont_prepare_write is hardly working, if you write to a file in steps of
  1024, the bytes pointer is never updated resulting in an endless loop.
 
 Testcase, please? It certainly seems to be working on FAT and HPFS -
 both are using the same thing.

Ok, now I see it: they are updating it themselves. Why isn't that done
by the VFS?

  Below is a fixed version, that also removes the bogus tricks to the
  bytes ptr. I also "fixed" the comment, any reason we have to be
  insulting like this?
 
 Sheer frustration from the braindead design? _Nothing_ can insult persons
 responsible for AFFS and FAT "designs". Sorry.

$ grep cont_prepare_write fs/*/*.c
fs/adfs/inode.c:prepare_write:  cont_prepare_write,
fs/affs/file.c: prepare_write: cont_prepare_write,
fs/hfs/inode.c: prepare_write: cont_prepare_write,
fs/hpfs/file.c: prepare_write: cont_prepare_write,
fs/ntfs/fs.c:   prepare_write: cont_prepare_write,
fs/qnx4/inode.c:prepare_write: cont_prepare_write,

Just because a fs wasn't designed for UNIX doesn't mean it's shit. Sorry,
for me it's a simple matter of respect. A fs isn't "moronic" just because
it doesn't support holes; the real design issues are somewhere else.
Anyway, since I only see mmu_private updates in fat and hpfs, it
looks like adfs, hfs, ntfs and qnx4 are broken as well. BTW wouldn't it
be better to introduce an i_realsize field for that? It could be useful for
other filesystems later as well, e.g. to fix the read/truncate race
without the BKL (i_realsize only grows while the file is open, and
truncate can be delayed until the file is closed).

 Yes. In particular, adding the whole PAGE_CACHE_SIZE to *bytes is
 _definitely_ wrong. The whole "round up to the block boundary" thing is
 there for purpose - in native environment all of these filesystems do not
 care to clean the tail of the last allocated block. So we have to do it
 for them. And we don't want to repeat such cleaning on _every_ write() -
 the first one should normalize the situation.

Rounding it up to the block size isn't a problem; the question is why
we pass the bytes ptr to cont_prepare_write if get_block has to update
it in the first place. That's probably the point that confused me here.

 If you have problems with affs_get_block() - feel free to ask, I probably
 can help. What _is_ the problem? Notice that you can assume that
 affs_get_block() will only have to create blocks immediately past the
 end-of-file.

Yes, the whole get_block/getblock stuff in affs needs a real cleanup
because of this.

bye, Roman



Re: cont_prepare_write fix

2000-07-24 Thread Alexander Viro



On Mon, 24 Jul 2000, Roman Zippel wrote:

 Hi,
 
   While trying to get affs working again, I noticed that the current
   cont_prepare_write is hardly working, if you write to a file in steps of
   1024, the bytes pointer is never updated resulting in an endless loop.
  
  Testcase, please? It certainly seems to be working on FAT and HPFS -
  both are using the same thing.
 
 Ok, now I see it, they are updating it themselves, why isn't that done
 by the vfs?

Because the actual _allocation_ unit may have nothing to do with the
blocksize and be invisible to the VFS. Check fatfs - it uses the thing to
trigger allocation, but the actual zero-out goes on a per-sector basis.

  Sheer frustration from the braindead design? _Nothing_ can insult persons
  responsible for AFFS and FAT "designs". Sorry.
 
 $ grep cont_prepare_write fs/*/*.c
 fs/adfs/inode.c:prepare_write:  cont_prepare_write,
Not on this box. What version are you using?
 fs/affs/file.c: prepare_write: cont_prepare_write,
 fs/hfs/inode.c: prepare_write: cont_prepare_write,
Ditto.
 fs/hpfs/file.c: prepare_write: cont_prepare_write,
Ditto.  
 fs/ntfs/fs.c:   prepare_write: cont_prepare_write,
Ditto.
 fs/qnx4/inode.c:prepare_write: cont_prepare_write,
Ditto.
They are _calling_ the thing, but they would simply refuse to compile if
they had such lines - different prototypes.

BTW, when that comment went in it was FAT, HPFS and AFFS.

 Only because a fs wasn't designed for UNIX, means it's shit. Sorry, for
 me it's a simple matter of respect. A fs isn't "moronic", only because
 it doesn't support holes, the real design issues are somewhere else.

And you _bet_ that they are. No, lack of holes per se is not
horrible. But there is a _lot_ of things about AFFS and FAT (and to some
extent HPFS) design that are past "horrible" - "Lovecraftian" would be
more accurate. But that's another rant...

 Anyway, probably as I only see mmu_private updates in fat and hpfs, it
 looks like adfs, hfs, ntfs and qnx4 are broken as well. BTW wouldn't it
From adfs_get_block():
...
/* don't support allocation of blocks yet */
return -EIO;
NTFS is not just broken, it eats filesystems in rw mode.
QNX4 does not allocate blocks ("// to be done" says the driver)
HFS refuses to allocate anything. Probably broken...

 be better to introduce a i_realsize field for that, it can be useful for
 other fs later as well. E.g. it can be used to fix read/truncate race
 without a BKL (i_realsize only grows while file is open and truncate can
 be delayed until file is closed).

Not needed. That race is taken care of - readpage() never accesses past
i_size and has the page locked, while vmtruncate() starts by setting
i_size (on a shrinking truncate(), that is), then goes through all pages
in the affected area and grabs the lock on all of them (one by one, not
simultaneously, indeed) and only then calls ->truncate().

So a readpage() in the affected area that began before we've set ->i_size
will be completed prior to the call of ->truncate(), and any attempt after
setting ->i_size will see the new ->i_size and will not bother
->truncate() at all.

_That_ one is OK. There are very scary dragons around ->truncate(), but
a) they are unrelated to that
b) knowledge of realsize won't help
c) let's take that to email.

 Rounding it up to the block size isn't a problem, the question is why do
 we give the bytes ptr to cont_prepare_write, if get_block has to update
 it in first place. That's probably the point that confused me here.

cont_prepare_write() has to zero the tail of the last block out ;-)

  If you have problems with affs_get_block() - feel free to ask, I probably
  can help. What _is_ the problem? Notice that you can assume that
  affs_get_block() will only have to create blocks immediately past the
  end-of-file.
 
 Yes, the whole get_block/getblock stuff in affs needs a real cleanup
 because of this.

Well, it will get simpler - all the logic for growing the chain shrinks a
lot, AFAICS...




Re: cont_prepare_write fix

2000-07-24 Thread Benjamin C.R. LaHaise

On Mon, 24 Jul 2000, Alexander Viro wrote:

 Not needed. That race is taken care of - readpage() never does access past
 i_size and has the page locked, while vmtruncate() starts with setting
 i_size (on shrinking truncate(), that is), then goes through all pages in
 affected area and grabs the lock on all of them (one by one, not
 simultaneously, indeed) and only then calls ->truncate().

That assumes that i_size is atomically updated, which currently is not the
case on 32 bit platforms.

-ben




Re: cont_prepare_write fix

2000-07-24 Thread Roman Zippel

Hi,

 Because actual _allocation_ unit may have nothing to blocksize and be
 invisible to VFS. Check fatfs - it uses the thing to trigger allocation,
 but actual zero-out goes on per-sector basis.

I tried vfat on an MO drive with 1KByte sectors - it doesn't work. I also
don't know how you want to do it. A fs with a block size smaller than
the sector size can't possibly write directly into the page cache, and
I don't see anywhere that this is taken care of.

  fs/adfs/inode.c:prepare_write:  cont_prepare_write,
  Not on this box. What version are you using?
 They are _calling_ the thing, but they would simply refuse to compile if
 they would have such lines - different prototypes.

Oops, sorry, wrong source tree, it's a 2.4 with some other changes. :)
What I tried was to change the prototypes so that the small caller
routines in _every_ fs can be removed - except that cont_prepare_write
needs that extra argument (another reason I'd like to move it to
i_realsize). It seems to work; is there any reason not to do this?

 Not needed. That race is taken care of - readpage() never does access past
 i_size and has the page locked, while vmtruncate() starts with setting
 i_size (on shrinking truncate(), that is), then goes through all pages in
 affected area and grabs the lock on all of them (one by one, not
 simultaneously, indeed) and only then calls ->truncate().

Hmm, then I don't understand the BKL in the get_block functions.

  Rounding it up to the block size isn't a problem, the question is why do
  we give the bytes ptr to cont_prepare_write, if get_block has to update
  it in first place. That's probably the point that confused me here.
 
 cont_prepare_write() has to zero the tail of the last block out ;-)

IMO we could simplify that if we did the preallocation business in
the page cache. We only need to support a dummy page structure in the
page cache to delay the allocation of the real page (we need that
anyway if we want to do direct I/O into/from e.g. PCI space), but that's
more something for 2.5 - or is there any reason why it shouldn't work?

bye, Roman



Re: cont_prepare_write fix

2000-07-24 Thread Jan Harkes

On Mon, Jul 24, 2000 at 12:04:39PM -0400, Benjamin C.R. LaHaise wrote:
 On Mon, 24 Jul 2000, Alexander Viro wrote:
 
  Not needed. That race is taken care of - readpage() never does access past
  i_size and has the page locked, while vmtruncate() starts with setting
  i_size (on shrinking truncate(), that is), then goes through all pages in
  affected area and grabs the lock on all of them (one by one, not
  simultaneously, indeed) and only then calls ->truncate().
 
 That assumes that i_size is atomically updated, which currently is not the
 case on 32 bit platforms.
 
   -ben

I found that i_size is in most cases reasonably protected by i_sem.
Except for Coda, where inode->i_size != inode->i_mapping->host->i_size,
because we `borrow' the mapping of the underlying container file.

At the moment I believe that the introduction of i_mapping->host might
have been a mistake. The readpage/writepage operations use
i_mapping->host->i_size, but sys_stat and generic_file_write (for
a file with O_APPEND) are using inode->i_size. So for Coda we are
copying the i_size back to the Coda inode in a wrapper around
generic_file_write.

AFAIK Coda is the only FS that has an address space mapping associated
with another inode, and we need to lock the i_sem of that
inode to avoid problems. Maybe the Coda file write should become
something like:

coda_file_write()
{
	down(&container_inode->i_sem);
	n = generic_file_write(file, ...);
	container_inode->i_size = file->f_dentry->d_inode->i_size;
	up(&container_inode->i_sem);
	return n;
}

That way all writes are protected by the semaphore on the container
file's inode, and the use of i_mapping->host, along with the nasty cast
to (struct inode *), could be removed.

Jan




Re: cont_prepare_write fix

2000-07-24 Thread David Gibson

On Mon, Jul 24, 2000 at 02:24:42PM -0400, Jan Harkes wrote:
 On Mon, Jul 24, 2000 at 12:04:39PM -0400, Benjamin C.R. LaHaise wrote:
  On Mon, 24 Jul 2000, Alexander Viro wrote:
  
   Not needed. That race is taken care of - readpage() never does access past
   i_size and has the page locked, while vmtruncate() starts with setting
   i_size (on shrinking truncate(), that is), then goes through all pages in
   affected area and grabs the lock on all of them (one by one, not
   simultaneously, indeed) and only then calls ->truncate().
  
  That assumes that i_size is atomically updated, which currently is not the
  case on 32 bit platforms.
 
 I found that i_size is in most cases reasonably protected by i_sem.
 Except for Coda where inode->i_size != inode->i_mapping->host->i_size,
 because we `borrow' the mapping of the underlying container file.
 
 At the moment I believe that the introduction of i_mapping->host might
 have been a mistake. The readpage/writepage operations use
 i_mapping->host->i_size, but sys_stat and generic_file_write (for
 a file with O_APPEND) are using inode->i_size. So for Coda we are
 copying the i_size back to the Coda inode in a wrapper around
 generic_file_write.
 
 AFAIK Coda is the only FS that has an address space mapping that
 associated with another inode, and we need to lock the i_sem of that
 inode to avoid problems. Maybe the Coda file write should become
 something like:
 
 coda_file_write()
 {
 	down(&container_inode->i_sem);
 	n = generic_file_write(file, ...);
 	container_inode->i_size = file->f_dentry->d_inode->i_size;
 	up(&container_inode->i_sem);
   return n;
 }
 
 That way all writes are protected by the semaphore on the container
 file's inode. And the use of i_mapping->host, along with the nasty cast
 to (struct inode *) could be removed.

This might work for Coda, but I think in general there are a couple of
problems with always using the struct file to get at the inode
structure. IIRC either writepage or prepare_write is called with
file==NULL in some circumstances (when creating a symlink I think). I
discovered this when I got oopsen in my patch which adds
space/inode/dentry usage limits to ramfs.

There's another problem too: for ramfs limits, I obviously also need
to keep track of when a page is removed from an inode. The only sane
way I've seen to do this is to add a hook to address_space_operations
called when the page is removed (currently I'm calling it from
remove_page_from_inode_queue() but that may not be the best place). At
this point, only the struct page is available, so from that I need to
find the inode (well, actually inode->i_sb where I store the
limits). Easy with mapping->host, otherwise impossible (AFAICT, it's
entirely possible I'm missing something blindingly obvious).

-- 
David Gibson, Technical Support Engineer, Linuxcare, Inc.
+61 2 6262 8990
[EMAIL PROTECTED], http://www.linuxcare.com/ 
Linuxcare. Support for the revolution.