Re: [possible race in ext2] Re: how to write get_block?

1999-10-08 Thread Matti Aarnio

  Commenting only on a subset of Alexander's comments:

On Fri, Oct 08, 1999 at 12:35:53PM -0400, Alexander Viro wrote:

> More or less random comments/questions:

> 4) I still think that indirect blocks may go into the page cache - that
> would further simplify truncate(). OTOH the method used by BSD (indirect
> block covering addresses from n to n+block_size*pointers_per_block gets
> an address -n, double indirect block covering the area from n to n+...
> gets -n-block_size, triple indirect - -n-2*block size) will be wasteful
> for situations when block size is smaller than page.

   My LFS patches ( ftp://mea.tmt.tele.fi/linux/LFS/ ) do *allow* you to
   do that, cache index is plus/minus 2G * 512 bytes.  Just to support
   that type of signed magic..

   Earlier version was strictly 0 thru 4G of 512 byte blocks - unsigned.

...
> Comments?

/Matti Aarnio <[EMAIL PROTECTED]>



Re: how to write get_block?

1999-10-08 Thread Manfred Spraul

Andrea Arcangeli wrote:
> I really don't think it make sense to let write(2) to be SMP parallel as
> it would be unreliable and thus useless. So basically allowing more writes
> to enter the critical section looks an useless improvement and lose of
> robustness to me.

a) we must allow concurrent write operations for pipes (O_NONBLOCK).

b) are you sure that these statements are true for a large database?

I think the right aproach would be new functions [public, because needed
by nfsd and arch/*/kernel/sys*.c]

int inode_lock_write(inode, filp, offset, len);
...
int inode_lock_rename(inode_source,inode_dest);
...

and flags so that the f_ops implementation can choose the amount of
synchronization it needs.

Or you could add new function pointers to f_ops/i_ops for
synchronization.

--
Manfred



Re: how to write get_block?

1999-10-08 Thread Manfred Spraul

Manfred Spraul wrote:
> I'll check the POSIX standard, perhaps we can just reject mmap() or
> file-locks in this situation.

I couldn't find it in the POSIX standart, but at least some UNIXes don't
allow it:

http://hoth.stsci.edu/man/man2/fcntl.html: (sun unix)
> The fcntl() function will fail if: [...]
> or the cmd argument is F_SETLK, F_SETLK64, F_SETLKW, or F_SETLKW64,
> mandatory file/record locking is set, and the file is currently
> being mapped to virtual memory using mmap(2). 

http://cando.dwr.co.gov/cgi-bin/man2html/fcntl(2): IRIX 5.2
>  [EAGAIN]   cmd is F_SETLK or F_SETLKW , mandatory file locking bit is
> set for the file, and the file is currently being mapped
> to virtual memory via mmap [see mmap(2)].



Re: how to write get_block?

1999-10-08 Thread Alexander Viro



On Fri, 8 Oct 1999, Manfred Spraul wrote:

> I couldn't find it in the POSIX standart, but at least some UNIXes don't
> allow it:

Sure. We are kinda-sorta preventing it too. But there is a windows when
both the shared mmap() and lockf() may pass (if mmap() comes first,
overlaps the memory area that was the last mmap to the unlinked file
and fcntl() happens before the file is finally removed).

OK, folks, I'm going down - 18 hours of tube time have some unpleasant
effects.



Re: how to write get_block?

1999-10-08 Thread Manfred Spraul

Alexander Viro wrote:
> _That_ protection should be enough to avoid several callers of
> write() messing with each other.

IIRC NfsV2 guarantees atomicity of write operation.

--
Manfred



Re: how to write get_block?

1999-10-08 Thread Alexander Viro



On Fri, 8 Oct 1999, Manfred Spraul wrote:

> I think it's wrong to fix synchonization on a problem-by-problem basis:
> VFS synchonization for read()/write()/truncate()/file lock's is broken,
> and I would try to find one solution instead of patch-work.

There are different things to protect. E.g. i_sem is (ab)used for dcache
race-prevention. Lumping everything together is not a good idea, IMO.

> Alexander Viro wrote:
> > > 1) overlapping concurrent write operations violate Nfsv2.
> > 
> > Yes, that is serious.
> > > 3) do not forget f_pos when you implement a synchonization: even
> > > concurrent reads are not allowed if both use the same file pointer.
> > 
> > Ditto. It's a separate issue.
> 
> These are 2 pending problems which must be fixed before 2.4
> 
> If you fix them, then you will add synchonization to sys_write(), and
> raw-io will be affected since raw-io is accessed through sys_write() to
> a special character device driver.

Wait a minute. It's exactly the reason why we should not do it in
sys_write() - synchronization issues vary depending on _what_ you are
accessing.

> Other pending problems are O_APPEND
writer.

>, mandantory file locks.

broken so badly, that... well, for example there is a nice race with
mmap(). There is a whole crapload of horror in the implementation - I have
some fixes, but they are pretty obsolete now (late July; I was net.dead
in August and beginning of September ;-/). And there is an interesting
stuff in interaction of fs/locks.c with fs/lockd/*. If NFS folks are ready
to comment on their side of code - fine, but IMHO the locking stuff is
not going to be fixed soon.



Re: how to write get_block?

1999-10-08 Thread Manfred Spraul

Alexander Viro wrote:
> On Fri, 8 Oct 1999, Manfred Spraul wrote:
> 
> > I think it's wrong to fix synchonization on a problem-by-problem basis:
> > VFS synchonization for read()/write()/truncate()/file lock's is broken,
> > and I would try to find one solution instead of patch-work.
> 
> There are different things to protect. E.g. i_sem is (ab)used for dcache
> race-prevention. Lumping everything together is not a good idea, IMO.
> 
Of course.
But it's wrong to start fixing on one end of the problem before you see
the complete problem: if you start with a rw-semaphore (called ERESOURCE
in WinNT kernel, if you search for a name), then I'm 99% sure that this
code will die before 2.4.

> Wait a minute. It's exactly the reason why we should not do it in
> sys_write() - synchronization issues vary depending on _what_ you are
> accessing.

think about f_pos: VFS hides the difference between sys_pread() and
sys_read(). Filesystems can't implement the locking.

I think VFS should (and must) offer 3 or 4 locking options through
i_flags, otherwise every filesystem must reinvent the wheel. (do
nothing; protect f_pos; EOF and NvsV2 protection)


> >, mandantory file locks.
> 
> broken so badly, that... well, for example there is a nice race with
> mmap().

mandantory filelocks and mmap() are incompatible, there is no chance to
fix that.

eg. under Windows, all filelocks are mandantory locks. This means that
noone can use mmap() for databases, that's a known and unsolvable
problem.

I'll check the POSIX standard, perhaps we can just reject mmap() or
file-locks in this situation.

Btw, look at default_llseek(): it's not POSIX compliant because it does
not check for file system specific file size limits. As a result, ext2
has it's own llseek-function. A few weeks ago I noticed that UDF has a
special llseek function. That's the wrong aproach - I posted a patch
which fixes default_llseek(), but it was ignored.

--
Manfred



Re: [possible race in ext2] Re: how to write get_block?

1999-10-08 Thread Alexander Viro



On Fri, 8 Oct 1999, Ingo Molnar wrote:

> 
> On Fri, 8 Oct 1999, Alexander Viro wrote:
> 
> > Stephen, Ingo, could you look at the stuff above? Methink it means that we
> > either must separate ext2_truncate() for directories (doing bforget() on
> > the data blocks) _or_ put the directory blocks into the page cache and do
> > block_flush(). I'ld rather prefer the latter. Moreover, we can stuff the
> > indirect blocks into the page cache too (using negative offsets). That
> > would leave us with pure buffer cache only for static structures.
> 
> yep we knew about this problem ... it's not quite an easy hack though. 
> Putting the directory block cache (and symlink block cache) into the page
> cache would be the preferred method - this would also clean up the code
> alot i think.

More or less random comments/questions:

1) Is ext2_bread() vs. bread() separation in fs/ext2/* a first step to
that? (I.e. data blocks vs. inode/bitmap/indirect ones)
2) Sometime ago there was a pretty impressive patch excluding large part
of ext2_find_entry() (it caches the offset into dentry and invalidates it
if needed) and I have a code for cyclic lookups in a directory (cheap
hack, but it really speeds the things up for ls -l and friends). Both
would go nicely into this scheme.
3) We ought to switch from (bh+pointer) to (page+pointer) in namei.c.
4) I still think that indirect blocks may go into the page cache - that
would further simplify truncate(). OTOH the method used by BSD (indirect
block covering addresses from n to n+block_size*pointers_per_block gets
an address -n, double indirect block covering the area from n to n+...
gets -n-block_size, triple indirect - -n-2*block size) will be wasteful
for situations when block size is smaller than page.
5) Who works on that? I would try to do the thing, but if there already is
some code I'ld like to look at it, indeed.

Comments?



Re: how to write get_block?

1999-10-08 Thread Andrea Arcangeli

On Fri, 8 Oct 1999, Alexander Viro wrote:

>That's it. write() does lock the page before writing to it/allocating new
>blocks/etc. _That_ protection should be enough to avoid several callers of
>write() messing with each other.

Except the fact that two writers writing 100byte each one will produce a
100byte long file and not a 200byte long file as the user expected. And
btw NFS needs to be atomic at the write() level, line 415 of rfc1094:

   file after the write has completed.  The write operation is atomic.
   Data from this "WRITE" will not be mixed with data from another
   client's "WRITE".

What I would like is to have a three level semaphore with three kind of
locks:

1)  read
2)  write-serialized (allow-readers)
3)  truncate

Lots of readers can enter at the same time of write-serialized. But only
one write-serialized event can enter the critical section at once.

And the `truncate' locking instead enters only if nobody else is grabbing
locks and forbids everybody else to enter.

I think it's safe as write(2) only enlarges the i_size so any reader won't
risk to read beyond the end of the inode.

During truncate() instead everybody will block and so there won't be any
race.

With this design all reads can be parallel to writes. But only one
write(2) can happen at once so there won't be any i_size race and NFS will
be happy too.

I really don't think it make sense to let write(2) to be SMP parallel as
it would be unreliable and thus useless. So basically allowing more writes
to enter the critical section looks an useless improvement and lose of
robustness to me.

Andrea



[patch] [possible race in ext2] Re: how to write get_block?

1999-10-08 Thread Mikulas Patocka

> Sheesh... Yes, you are right. It was not a problem with the old buffer
> cache, but now... Arrgh. Races in truncate(), film at 11. Pheeewww...
> Unless I'm seriously misunderstanding what happens you've just dug out a
> race in ext2. It doesn't do bforget() on the data blocks in directories.
> If I'm not mistaken the same scenario applies here just fine.

Here goes quick'n'dirty patch. It does bforget(). It should prevent file
corruption.

Mikulas Patocka

--- linux-2.3.19.tar.gz#utar/linux/fs/ext2/truncate.c   Tue Jun 29 18:22:08 1999
+++ linux/fs/ext2/truncate.cFri Oct  8 18:40:36 1999
@@ -101,6 +101,14 @@
  * -- WSH, 1998
  */
 
+/* M.P.: quick & dirty patch to prevent data corruption */
+
+static void brelse_blocks(struct inode *i, unsigned long block_to_free, unsigned long 
+count)
+{
+   while (count--)
+   bforget(get_hash_table(i->i_dev, block_to_free++, 
+i->i_sb->s_blocksize));
+}
+
 /*
  * Check whether any of the slots in an indirect block are
  * still in use, and if not free the block.
@@ -184,14 +192,17 @@
else if (block_to_free == tmp - free_count)
free_count++;
else {
+   if (!S_ISREG(inode->i_mode)) brelse_blocks(inode, 
+block_to_free, free_count);
ext2_free_blocks (inode, block_to_free, free_count);
free_this:
block_to_free = tmp;
free_count = 1;
}
}
-   if (free_count > 0)
+   if (free_count > 0) {
+   if (!S_ISREG(inode->i_mode)) brelse_blocks(inode, block_to_free, 
+free_count);
ext2_free_blocks (inode, block_to_free, free_count);
+   }
return retry;
 }
 
@@ -247,14 +258,17 @@
else if (block_to_free == tmp - free_count)
free_count++;
else {
+   if (!S_ISREG(inode->i_mode)) brelse_blocks(inode, 
+block_to_free, free_count);
ext2_free_blocks (inode, block_to_free, free_count);
free_this:
block_to_free = tmp;
free_count = 1;
}
}
-   if (free_count > 0)
+   if (free_count > 0) {
+   if (!S_ISREG(inode->i_mode)) brelse_blocks(inode, block_to_free, 
+free_count);
ext2_free_blocks (inode, block_to_free, free_count);
+   }
/*
 * Check the block and dispose of the ind_bh buffer.
 */




Re: how to write get_block?

1999-10-08 Thread Manfred Spraul

I think it's wrong to fix synchonization on a problem-by-problem basis:
VFS synchonization for read()/write()/truncate()/file lock's is broken,
and I would try to find one solution instead of patch-work.

Alexander Viro wrote:
> > 1) overlapping concurrent write operations violate Nfsv2.
> 
> Yes, that is serious.
> > 3) do not forget f_pos when you implement a synchonization: even
> > concurrent reads are not allowed if both use the same file pointer.
> 
> Ditto. It's a separate issue.

These are 2 pending problems which must be fixed before 2.4

If you fix them, then you will add synchonization to sys_write(), and
raw-io will be affected since raw-io is accessed through sys_write() to
a special character device driver.

Other pending problems are O_APPEND, mandantory file locks.

Eg mandantory file locks:
process A: backup. It make multi-megabyte read operations. During these
long operations, the kernel schedules around.

process B: uses the database. Mandantory file locks.
it calls lock(), read(), write(), write(), unlock().
^^^
kernel schedules back to process A, it reads the data.

--> data corruption. Mandantory file locks must not return if there are
pending read operations in the affected area.


> > Eg I'm pretty sure that Stephen will not like it if raw-io is slowed
> > down by a multiple reader-single writer synchonization.
> 
> Erm? Writer access is for truncate() and I don't see where the raw-io
> comes into the picture.

Not yet, but if you fix all synchonization, then it will come into the
picture.


> Huh? It's not a sys_write(), it's generic_file_write(). Pipes do not come
> even close to that place.

Not yet, but after you have fixed f_pos, it will come into the picture.

> And I'm talking about 2.3 - 2.2 is _way_ too
> different.
I know that, but 2.3 is broken, 2.2 is ugly.
I wanted to say that we should NOT make the same fault again. Flags are
needed.

--
Manfred



Re: [possible race in ext2] Re: how to write get_block?

1999-10-08 Thread Mikulas Patocka

> yep we knew about this problem ... it's not quite an easy hack though. 
> Putting the directory block cache (and symlink block cache) into the page
> cache would be the preferred method - this would also clean up the code
> alot i think.

If you knew about it, why didn't you fix it?? (or implement temporary
workaround: bforget(get_hash_table(...)) through all blocks if inode is
symlink or directory). It is serious! - it can cause data corruption. Such
bugs should be fixed immediatelly. Now, anybody using 2.3 has to worry
about his data. I'm switching back to good old 2.3.4. 

Mikulas Patocka



Re: how to write get_block?

1999-10-08 Thread Alexander Viro



On Fri, 8 Oct 1999, Manfred Spraul wrote:

> Alexander Viro wrote:
> >All stuff with truncate()/write() is racey right now,
> > AFAICS. The next thing I'm going to do is a (dumb) implementation of
> > blocking rwlocks. write() being reader and truncate() - writer.
> 
> IMHO that's the wrong approach:
> 
> 1) overlapping concurrent write operations violate Nfsv2.

Yes, that is serious.

> 2) the pipe code needs O_NONBLOCK characteristics, ie no synchonization
> is allowed. The 2.2 code uses an ugly hack: sys_write() acquires i_sem,
> and inside the pipe code the semaphore is dropped.

Huh? It's not a sys_write(), it's generic_file_write(). Pipes do not come
even close to that place. And I'm talking about 2.3 - 2.2 is _way_ too
different.

> 3) do not forget f_pos when you implement a synchonization: even
> concurrent reads are not allowed if both use the same file pointer.

Ditto. It's a separate issue.

> What about these ideas:
> a) rename i_sem (*), and add public functions such as
> inode_lock_exclusive(), inode_lock_truncate(), inode_lock_write() etc.
> 
> Currently, everyone directly accesses the VFS synchonization (nfsd,
> drivers/char/tty_io.c; arch/*/kernel/sys*.c).
> 
> b) the requirements of the VFS users differ completely, add flags:
> 0:synchonize nothing, eg pipes.
> I_FPOS_SYNC:  synchonize f_pos access, nothing else
> I_FILE_SYNC:  multiple read, non-overlapping write, truncate.
> 
> Since it will be hidden by inode_lock_??(), it will be easy to optimize
> the implementation as needed.
> 
> Eg I'm pretty sure that Stephen will not like it if raw-io is slowed
> down by a multiple reader-single writer synchonization.

Erm? Writer access is for truncate() and I don't see where the raw-io
comes into the picture.



Re: [possible race in ext2] Re: how to write get_block?

1999-10-08 Thread Ingo Molnar


On Fri, 8 Oct 1999, Alexander Viro wrote:

> Stephen, Ingo, could you look at the stuff above? Methink it means that we
> either must separate ext2_truncate() for directories (doing bforget() on
> the data blocks) _or_ put the directory blocks into the page cache and do
> block_flush(). I'ld rather prefer the latter. Moreover, we can stuff the
> indirect blocks into the page cache too (using negative offsets). That
> would leave us with pure buffer cache only for static structures.

yep we knew about this problem ... it's not quite an easy hack though. 
Putting the directory block cache (and symlink block cache) into the page
cache would be the preferred method - this would also clean up the code
alot i think.

-- mingo




Re: how to write get_block?

1999-10-08 Thread Alexander Viro



On Fri, 8 Oct 1999, Andrea Arcangeli wrote:

> On Fri, 8 Oct 1999, Alexander Viro wrote:
> 
> >AFAICS. The next thing I'm going to do is a (dumb) implementation of
> >blocking rwlocks. write() being reader and truncate() - writer. 
> 
> To make the design of the code simpler write(2) should be a writer too.
> 
> All the VM checks i_size without any lock. So if you don't want to play
> with everything you should grab the writer lock at the VFS layer if you
> know your path may change i_size in any way (and write definitely can).

I don't think so. Look, it's not a matter of protecting i_size - it's a
separate problem. We are seriously abusing i_size in the call of
->truncate() anyway. What I am looking for is
truncate may be sure that blocks belonging to file will remain
that way and no new will appear. IOW, truncate can trust the contents of
inode/indirect blocks/FAT chain/whatever.
That's it. write() does lock the page before writing to it/allocating new
blocks/etc. _That_ protection should be enough to avoid several callers of
write() messing with each other.



Re: how to write get_block?

1999-10-08 Thread Andrea Arcangeli

On Fri, 8 Oct 1999, Mikulas Patocka wrote:

>How do you manage freeing buffers on rmdir? I don't see any bforget there.
>If you don't free buffers pointing to deallocated space, it can cause data
>corruption.

Replacing all bforget with brelse can't harm. It will only be far slower
as you'll risk to waste time flushing garbage to disk and you'll risk to
go into the shrink_mmap path to get some new buffer (while you are caching
buffers that are containing garbage). BTW, in 2.3.x you only need bforget
for the metadata as flushpage will take care of invalidating the dirty
page-cache and to kill their buffers during truncate_inode_pages.

Andrea



Re: how to write get_block?

1999-10-08 Thread Andrea Arcangeli

On Fri, 8 Oct 1999, Alexander Viro wrote:

>AFAICS. The next thing I'm going to do is a (dumb) implementation of
>blocking rwlocks. write() being reader and truncate() - writer. 

To make the design of the code simpler write(2) should be a writer too.

All the VM checks i_size without any lock. So if you don't want to play
with everything you should grab the writer lock at the VFS layer if you
know your path may change i_size in any way (and write definitely can).

Andrea



[possible race in ext2] Re: how to write get_block?

1999-10-08 Thread Alexander Viro



On Fri, 8 Oct 1999, Mikulas Patocka wrote:

> You have a directory and change something in it - there are dirty hashed
> buffers covering directory. 
> 
> You erase that directory - as there is no bforget, there are still dirty
> hashed buffers.
> 
> You alloc file at that place. Buffers containing former directory are
> flushed over the file.
> 
> I think bforget must be called always when you release some structure that
> was accessed using bread/brelse. 

Sheesh... Yes, you are right. It was not a problem with the old buffer
cache, but now... Arrgh. Races in truncate(), film at 11. Pheeewww...
Unless I'm seriously misunderstanding what happens you've just dug out a
race in ext2. It doesn't do bforget() on the data blocks in directories.
If I'm not mistaken the same scenario applies here just fine.

Let's see:
something does rmdir() on a directory. It dirtifies the blocks. Soon after
that iput() calls ext2_delete_inode(). It calls ext2_truncate(). For
simplicity we can assume the directory was short. trunc_direct() happily
calls ext2_free_blocks() and calls it quits, leaving the dirty buffers in
the buffer cache. Later the corresponding on-disk blocks are allocated by
regular file. After that kflushd does its thing and corrupts the data.
Damn.

Stephen, Ingo, could you look at the stuff above? Methink it means that we
either must separate ext2_truncate() for directories (doing bforget() on
the data blocks) _or_ put the directory blocks into the page cache and do
block_flush(). I'ld rather prefer the latter. Moreover, we can stuff the
indirect blocks into the page cache too (using negative offsets). That
would leave us with pure buffer cache only for static structures.

Comments?
Al



Re: [possible race in ext2] Re: how to write get_block?

1999-10-08 Thread Mikulas Patocka

> > You have a directory and change something in it - there are dirty hashed
> > buffers covering directory. 
> > 
> > You erase that directory - as there is no bforget, there are still dirty
> > hashed buffers.
> > 
> > You alloc file at that place. Buffers containing former directory are
> > flushed over the file.
> > 
> > I think bforget must be called always when you release some structure that
> > was accessed using bread/brelse. 

When writing this I thought of fat (assuming that ext2 is already
BugFree(TM) :-), but yes - it applies to ext2 and all other filesystems as
well.

Mikulas Patocka




Re: how to write get_block?

1999-10-08 Thread Manfred Spraul

Alexander Viro wrote:
>All stuff with truncate()/write() is racey right now,
> AFAICS. The next thing I'm going to do is a (dumb) implementation of
> blocking rwlocks. write() being reader and truncate() - writer.

IMHO that's the wrong approach:

1) overlapping concurrent write operations violate Nfsv2.

2) the pipe code needs O_NONBLOCK characteristics, ie no synchonization
is allowed. The 2.2 code uses an ugly hack: sys_write() acquires i_sem,
and inside the pipe code the semaphore is dropped.

3) do not forget f_pos when you implement a synchonization: even
concurrent reads are not allowed if both use the same file pointer.

What about these ideas:
a) rename i_sem (*), and add public functions such as
inode_lock_exclusive(), inode_lock_truncate(), inode_lock_write() etc.

Currently, everyone directly accesses the VFS synchonization (nfsd,
drivers/char/tty_io.c; arch/*/kernel/sys*.c).

b) the requirements of the VFS users differ completely, add flags:
0:  synchonize nothing, eg pipes.
I_FPOS_SYNC:synchonize f_pos access, nothing else
I_FILE_SYNC:multiple read, non-overlapping write, truncate.

Since it will be hidden by inode_lock_??(), it will be easy to optimize
the implementation as needed.

Eg I'm pretty sure that Stephen will not like it if raw-io is slowed
down by a multiple reader-single writer synchonization.


--
Manfred

(*) I prefer compiler errors instead of missing synchonization.
You could rename it back to i_sem later.



Re: how to write get_block?

1999-10-08 Thread Mikulas Patocka

> On what? ITYM truncate(), right? Metadata doesn't end up in the page
> cache, so... ->flushpage() set to block_flushpage() should be OK, but it
> will not be enough. All stuff with truncate()/write() is racey right now,
> AFAICS. The next thing I'm going to do is a (dumb) implementation of
> blocking rwlocks. write() being reader and truncate() - writer. 
> 
> Oh, crap! Add

Doing flushpage is good thing. But there's another posibility of data
corruption.

You have a directory and change something in it - there are dirty hashed
buffers covering directory. 

You erase that directory - as there is no bforget, there are still dirty
hashed buffers.

You alloc file at that place. Buffers containing former directory are
flushed over the file.


I think bforget must be called always when you release some structure that
was accessed using bread/brelse. 

Mikulas Patocka



Re: how to write get_block?

1999-10-08 Thread Alexander Viro



On Fri, 8 Oct 1999, Mikulas Patocka wrote:

> > > Hi.
> > > 
> > > I'm porting my HPFS driver to 2.3.18 and I don't know how to write
> > > get_block properly. When user does lseek(far beyond file end) and
> > > write(), get_block must allocate a lot of sectors and clear them (HPFS
> > > doesn't support holes). The question is: what call should I use to
> > > clear these sectors?
> > 
> > Use the (fixed) variant of FAT code. I.e block_write_cont_page() +
> > equivalent of fat_write_partial_page().
> 
> The only thing this function needs from fat filesystem is i_realsize. What
> about to replace i_realsize with i_size (I guess they have identical
> values at the beginning) and move it to /fs/buffer.c to let other
> filesystems use it?

Hmm... I'm not sure that allowing i_size to change in the middle of
writepage() is a good idea. I'll look at it. i_realsize is the size of
initialized part of file... Maybe it can be merged with i_size, but I'll
have to look at mm/filemap.c to be sure.

[snip]

> How do you manage freeing buffers on rmdir? I don't see any bforget there.
> If you don't free buffers pointing to deallocated space, it can cause data
> corruption.

On what? ITYM truncate(), right? Metadata doesn't end up in the page
cache, so... ->flushpage() set to block_flushpage() should be OK, but it
will not be enough. All stuff with truncate()/write() is racey right now,
AFAICS. The next thing I'm going to do is a (dumb) implementation of
blocking rwlocks. write() being reader and truncate() - writer. 

Oh, crap! Add

--- fs/fat/file.c   Fri Oct  8 09:46:35 1999
+++ fs/fat/file.c.new   Fri Oct  8 09:47:16 1999
@@ -60,7 +60,7 @@
fat_get_block,  /* get_block */
block_read_full_page,   /* readpage */
NULL,   /* writepage */
-   NULL,   /* flushpage */
+   block_flushpage,/* flushpage */
fat_truncate,   /* truncate */
NULL,   /* permission */
NULL,   /* smap */

to the previous patch. Sheesh...



Re: how to write get_block?

1999-10-08 Thread Mikulas Patocka

> > Hi.
> > 
> > I'm porting my HPFS driver to 2.3.18 and I don't know how to write
> > get_block properly. When user does lseek(far beyond file end) and
> > write(), get_block must allocate a lot of sectors and clear them (HPFS
> > doesn't support holes). The question is: what call should I use to
> > clear these sectors?
> 
>   Use the (fixed) variant of FAT code. I.e block_write_cont_page() +
> equivalent of fat_write_partial_page().

The only thing this function needs from fat filesystem is i_realsize. What
about to replace i_realsize with i_size (I guess they have identical
values at the beginning) and move it to /fs/buffer.c to let other
filesystems use it?

> > What should I do with it? Is there any function that I didn't notice? Of
> > course I could write routines to directly touch page cache and attach
> > buffer there, but I think it's dirty. Especially when other filesystem
> > also need to do the same (fat currently crashes when writing beyond file
> > end, other filesystems aren't ported yet) 
> 
>   FAT should be OK now.

How do you manage freeing buffers on rmdir? I don't see any bforget there.
If you don't free buffers pointing to deallocated space, it can cause data
corruption.

Mikulas Patocka