subject:"Re\: The INN\/mmap bug"

Re: The INN/mmap bug

2000-09-19 Thread Daniel Phillips


Daniel Phillips wrote:
> Alexander Viro wrote:
> > On Tue, 19 Sep 2000, Daniel Phillips wrote:
> >
> > >   !Mapped, !Uptodate,  Dirty:  pending map and write
> >
> > Wrong. It's an instant BUG at line 711 in ll_rw_blk.c - remember these
> > reports?

Yes, correct, this state should not escape from prepare_write, though
that's only a limitation of the current code.

Here's the corrected, corrected table:

 Mapped,  Uptodate,  Dirty:  pending write
!Mapped,  Uptodate,  Dirty:  not possible
 Mapped, !Uptodate,  Dirty:  not possible
!Mapped, !Uptodate,  Dirty:  not possible
 Mapped,  Uptodate, !Dirty:  regular block
!Mapped,  Uptodate, !Dirty:  hole of zeroes
 Mapped, !Uptodate, !Dirty:  unread
!Mapped, !Uptodate, !Dirty:  pending map

And here's the punchline: why not have 5 states instead of 3 bits and
set the state with a single function call, instead of the mutiple bit
sets we see now.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-19 Thread Daniel Phillips

Alexander Viro wrote:
> 
> On Tue, 19 Sep 2000, Daniel Phillips wrote:
> 
> > The more I think about it the less clear and ambiguous I find it.
> > When you add the dirty bit into the pot you get:
> >
> >Mapped,  Uptodate,  Dirty:  not possible
> 
> Sure, it is possible - that's how the write happens
> 
> >   !Mapped,  Uptodate,  Dirty:  not possible
> >Mapped, !Uptodate,  Dirty:  pending write
> 
> s/pending write/obvious bug/, damnit. Daniel, just think for a second: we
> have the buffer read to be picked by bdflush and written to disk, while
> the _contents_ _is_ _not_ _uptodate_. Just what can you expect when write
> succeeds? Junk on disk, right? You know, GIGO queue - garbage in, garbage
> out...
> 
> >   !Mapped, !Uptodate,  Dirty:  pending map and write
> 
> Wrong. It's an instant BUG at line 711 in ll_rw_blk.c - remember these
> reports?

It wasn't a conceptual error, it was a typo: here's the corrected
table with the wrongly tagged states you noticed
interchanged:

 Mapped,  Uptodate,  Dirty:  pending write
!Mapped,  Uptodate,  Dirty:  not possible
 Mapped, !Uptodate,  Dirty:  not possible
!Mapped, !Uptodate,  Dirty:  pending map and write
 Mapped,  Uptodate, !Dirty:  regular block
!Mapped,  Uptodate, !Dirty:  hole of zeroes
 Mapped, !Uptodate, !Dirty:  unread
!Mapped, !Uptodate, !Dirty:  pending map

> Sure, the dirty bit is not orthogonal to the rest. You don't need to do
> any complex analysis - it's as simple as
> * if I don't know _where_ to write the data - I'ld better not feed
> the request to ll_rw_block(), or it may get PO'd
> * if I know that data is junk - I don't want it hitting the disk.
> 
> Dirty bit == request fed into the funnel and can be on disk any moment now.
> Locked == already in IO subsystem.
> 
> That's it - completely independent from the rest, except that you don't
> want the whole write mechanism applied to non-uptodate or non-mapped
> pieces.

Right, two obvious bugs, which is the same thing as saying two uneeded
states.  Now, I'm just trying to be tidy and fit this all into a nice
regular model that I can represent with state transition diagrams.  I
don't know for sure why it's good to do that, but it seems good.  I
like to imagine that if I could just get them all down on paper in a
regular form some possible optimizations would just jump right out at
me.  Mind you, this is not necessary for the filesystem work I'm
doing, this is more like a side interest.  I get along just fine with
the existing mechanism.

> Think about IO as a memory bus - cache controller deals with writing the
> cache line to RAM, but you don't want it to try that on lines that don't
> have PA already calculated or have invalid contents. "mark dirty" == point
> the write-behind mechanism to it and let it decide when the thing must be
> written.

I'd also like it to be able to decide on its own when to map the
block.  If I'm fully replacing the contents of a buffer on a page I
would like to be able to just mark the buffer dirty without mapping it
and let bdflush map it later.  Right now it doesn't work because
bdflush tries to feed a null block to ll_rw_block, but a small change
would fix that: the flush daemon just has to notice the buffer is on a
page, then it can call get_block to map it.  Does this accomplish
anything useful?  I *think* so but I'm not sure.  It seems to me that
if you handled this properly you could have, for example, a temporary
file written, read back and deleted, all without ever touching the
disk, or even going into get_block to check for mappings, let alone
fussing around with the allocation bitmaps.

I'm also trying to determine if a 'don't know' mapping state would be
useful for optimizing certain I/O paths.  

> Think how to make the cache indexed by virtual address (not by
> physical, as in case of x86) work correctly. That's what pagecache is -
> software MMU with cache-by-VA architecture.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-19 Thread Alexander Viro

On Tue, 19 Sep 2000, Daniel Phillips wrote:

> The more I think about it the less clear and ambiguous I find it. 
> When you add the dirty bit into the pot you get:
> 
>Mapped,  Uptodate,  Dirty:  not possible

Sure, it is possible - that's how the write happens

>   !Mapped,  Uptodate,  Dirty:  not possible
>Mapped, !Uptodate,  Dirty:  pending write

s/pending write/obvious bug/, damnit. Daniel, just think for a second: we
have the buffer read to be picked by bdflush and written to disk, while
the _contents_ _is_ _not_ _uptodate_. Just what can you expect when write
succeeds? Junk on disk, right? You know, GIGO queue - garbage in, garbage
out...

>   !Mapped, !Uptodate,  Dirty:  pending map and write

Wrong. It's an instant BUG at line 711 in ll_rw_blk.c - remember these
reports?

>Mapped,  Uptodate, !Dirty:  regular block
>   !Mapped,  Uptodate, !Dirty:  hole of zeroes
>Mapped, !Uptodate, !Dirty:  unread
>   !Mapped, !Uptodate, !Dirty:  pending map

> Those two 'not possible' states bother me, they show that the three
> bits are not orthogonal.  We can make arbitrary assignments of meaning
> to them as you did with !Mapped, !Uptodate (pending map) but in that
> case why not just enumerate all the states we need and lose the bit
> assignments?  I think the deep problem here is trying to look at this
> as a bit-flipping problem instead of a state-transition problem.

Sure, the dirty bit is not orthogonal to the rest. You don't need to do
any complex analysis - it's as simple as
* if I don't know _where_ to write the data - I'ld better not feed
the request to ll_rw_block(), or it may get PO'd
* if I know that data is junk - I don't want it hitting the disk.

Dirty bit == request fed into the funnel and can be on disk any moment now.
Locked == already in IO subsystem.

That's it - completely independent from the rest, except that you don't
want the whole write mechanism applied to non-uptodate or non-mapped
pieces.

Think about IO as a memory bus - cache controller deals with writing the
cache line to RAM, but you don't want it to try that on lines that don't
have PA already calculated or have invalid contents. "mark dirty" == point
the write-behind mechanism to it and let it decide when the thing must be
written. Think how to make the cache indexed by virtual address (not by
physical, as in case of x86) work correctly. That's what pagecache is -
software MMU with cache-by-VA architecture.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-19 Thread Daniel Phillips

Linus Torvalds wrote:
> 
> On Mon, 18 Sep 2000, Alexander Viro wrote:
> >   * we have several bh state components and the thing is a big,
> > fscking mess. If we look at the areas outside of the page lock we have:
> 
> It's not a mess at all.
> 
> I don't see why you call things "first stage" etc. It's perfectly
> straightforward. There are two bits that matter:
>  - buffer uptodate
>  - buffer mapped.
> 
> And the state is very clear and unambiguous:
> 
> Mapped, Uptodate: regular block
> !Mapped, Uptodate: hole of zeroes
> Mapped, !Uptodate: unread
> !Mapped, !Uptodate: "pending map"
> 
> No "several states" at all.

The more I think about it the less clear and ambiguous I find it. 
When you add the dirty bit into the pot you get:

 Mapped,  Uptodate,  Dirty:  not possible
!Mapped,  Uptodate,  Dirty:  not possible
 Mapped, !Uptodate,  Dirty:  pending write
!Mapped, !Uptodate,  Dirty:  pending map and write
 Mapped,  Uptodate, !Dirty:  regular block
!Mapped,  Uptodate, !Dirty:  hole of zeroes
 Mapped, !Uptodate, !Dirty:  unread
!Mapped, !Uptodate, !Dirty:  pending map

Those two 'not possible' states bother me, they show that the three
bits are not orthogonal.  We can make arbitrary assignments of meaning
to them as you did with !Mapped, !Uptodate (pending map) but in that
case why not just enumerate all the states we need and lose the bit
assignments?  I think the deep problem here is trying to look at this
as a bit-flipping problem instead of a state-transition problem.

I'm not saying that the current design can't be made to work.  It can,
it's practically working now - it's just confusing as Al said, and
there are reasons for that.  It gets more confusing when you add in
the less-than-simple relationship between buffers and pages and then
try to make sense of the combination of the two sets of states.  The
motivation for making this less confusing is to be able to think about
optimization possibilities instead of just making it work.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-19 Thread Alexander Viro

On Tue, 19 Sep 2000, Rik van Riel wrote:

> IMHO it would be really nice if this problem was solved
> on the /PAGE/ level, so it will work for non-buffer
> filesystems as well ;)

It would be nice if we separated the buffer-cache storage pages from the
pagecache buffer-using ones and from the swap ones - logics is seriously
different and using the same code for all of them... Not good. We can
distinguish between them just by looking at ->mapping, AFAICS.

BTW, I suspect that combination of partial block_flushpage() with
block_truncate_page() should be address_space method (different for NFS,
etc., indeed). Then truncate_inode_pages() would become much simpler
("get rid of pages that went off-limits" rather than "... and do some part
of the work on the partial page". Oh, and block_flushpage() would 
be simpler that way.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-19 Thread Rik van Riel


On Mon, 18 Sep 2000, Alexander Viro wrote:
> On Sun, 17 Sep 2000, Linus Torvalds wrote:
> 
> > And as we've seen, simplifying this area would not necessarily be a bad
> > thing ;)
> > 
> > Right now I'll just do the minimal fix, though, I think.
> 
> Fine with me. I'll do the full analysis tonight, anyway, and try
> to write down the rules for that stuff. One thing that makes me
> seriously uneasy is the fact that VM knows about ->buffers -

This doesn't make me happy either. It doesn't work for
network-based filesystems, won't work with KIOBUF and
will mean extra difficulties with filesystems that have
write ordering constraints

IMHO it would be really nice if this problem was solved
on the /PAGE/ level, so it will work for non-buffer
filesystems as well ;)

(at least, for those problems which aren't specific to
the disk-based filesystems only)

> I'm not proposing it for immediate inclusion, but I don't feel
> good about all this code - current rules are too complex and
> rely on details of VM/buffer interaction too much for my taste.
> It may be correct, but it's not obviously correct...

Indeed...

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Daniel Phillips


Andreas Dilger wrote:
> 
> Daniel writes:
> > Alexander Viro wrote:
> > > On Mon, 18 Sep 2000, Andreas Dilger wrote:
> > > > This may actually be a problem in the future... what about shared access
> > > > block devices like FCAL or a distributed filesystem?  It has to be
> > > > possible for pages to become non-uptodate in a sane way.
> > >
> > > So what the heck do you do when something modifies mmaped page when you
> > > get the change of on-disk one? Say it, writer is notified that write had
> > > been completed, sends packet to you and you flip a bit on a page that
> > > happens to be mmaped on the place where write had happened.
> > >
> > > Write-through-pagecache is OK, but write straight to disk bypassing the
> > > cache? Welcome to the fun with aliases...
> >
> > Yes, I think that's one rule we can write down right now: to update a
> > block on disk you have to go through the buffer.  Not going through
> > the buffer is about the same as accessing a semaphore-protected
> > resource without bothering with the semaphore.
> 
> What I'm getting at is that the local system didn't make the change at all,
> so there is _no way_ to make the write go through the local cache.
> The write to disk happens on a remote system.  The only thing that happens
> on the local system is that it gets a message that a buffer is invalid.
> 
> You don't want to have to re-send each buffer to each system that ever
> read it.  It is much better to simply invalidate the cache, and only if
> the client accesses it again will it be re-read.
> 
> It should be possible to mark a page non-uptodate, so that the next time
> it is accessed locally it will re-read the _new_ data from disk.  Think
> of the fs "revalidate" method.

Yes, this is a cache coherency protocol.  I should have said you have
to go through the *buffer head*, it's what I meant.  You can't just
invalidate the buffer any time you want, you have to wait until there
are no readers/writers first.  Ooh, my head hurts, I was just trying
to write a simple filesystem extension and now I'm getting sucked into
distributed filesystems...

*** goes back to hunting down tailmerge races

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Daniel Phillips


Linus Torvalds wrote:
> 
> On Mon, 18 Sep 2000, Alexander Viro wrote:
> >   That's what makes me unhappy about the current situation + obvious
> > fixes. It works, but the proof is... well, not pretty.
> 
> Oh, agreed. I think we should clean up the code. I looked at it yesterday,
> and it didn't look all that horribly bad, but I lost interest

It is horribly bad, just not unmanageably horribly bad.

> I don't know if it is worth doing before 2.4.x, as the current code certainly
> should work with the small changes already proposed.

Emphatically agreed.   The buffer/cache states are in need of repair,
that's pretty obvious.  Getting them to where they are seen to be
correct will take some time and will hit a lot of code.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Andreas Dilger

Daniel writes:
> Alexander Viro wrote:
> > On Mon, 18 Sep 2000, Andreas Dilger wrote:
> > > This may actually be a problem in the future... what about shared access
> > > block devices like FCAL or a distributed filesystem?  It has to be
> > > possible for pages to become non-uptodate in a sane way.
> > 
> > So what the heck do you do when something modifies mmaped page when you
> > get the change of on-disk one? Say it, writer is notified that write had
> > been completed, sends packet to you and you flip a bit on a page that
> > happens to be mmaped on the place where write had happened.
> > 
> > Write-through-pagecache is OK, but write straight to disk bypassing the
> > cache? Welcome to the fun with aliases...
> 
> Yes, I think that's one rule we can write down right now: to update a
> block on disk you have to go through the buffer.  Not going through
> the buffer is about the same as accessing a semaphore-protected
> resource without bothering with the semaphore.

What I'm getting at is that the local system didn't make the change at all,
so there is _no way_ to make the write go through the local cache.
The write to disk happens on a remote system.  The only thing that happens
on the local system is that it gets a message that a buffer is invalid.

You don't want to have to re-send each buffer to each system that ever
read it.  It is much better to simply invalidate the cache, and only if
the client accesses it again will it be re-read.

It should be possible to mark a page non-uptodate, so that the next time
it is accessed locally it will re-read the _new_ data from disk.  Think
of the fs "revalidate" method.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Daniel Phillips


Alexander Viro wrote:
> 
> On Mon, 18 Sep 2000, Andreas Dilger wrote:
> 
> > Alexander writes:
> > > * uptodate pages should never become non-uptodate.
> > > uptodate .. pages ... never have data _older_ than on disk
> >
> > This may actually be a problem in the future... what about shared access
> > block devices like FCAL or a distributed filesystem?  It has to be
> > possible for pages to become non-uptodate in a sane way.
> 
> So what the heck do you do when something modifies mmaped page when you
> get the change of on-disk one? Say it, writer is notified that write had
> been completed, sends packet to you and you flip a bit on a page that
> happens to be mmaped on the place where write had happened.
> 
> Write-through-pagecache is OK, but write straight to disk bypassing the
> cache? Welcome to the fun with aliases...

Yes, I think that's one rule we can write down right now: to update a
block on disk you have to go through the buffer.  Not going through
the buffer is about the same as accessing a semaphore-protected
resource without bothering with the semaphore.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Linus Torvalds

On Mon, 18 Sep 2000, Alexander Viro wrote:
> > 
> > but should instead be
> > 
> > static int make_buffer_uptodate(struct page *page, struct buffer_head * bh)
> > {
> > if (Page_Uptodate(page)) {
>   ^^^
> > set_bit(BH_Uptodate, &bh->b_state);
> > return;
> > }
> > ll_rw_block(READ, 1, &bh);
> > }
> 
> > Forget about your "stage 1" and "stage 2" complications. They shouldn't
> ^^
> > exist.
> 
>   Erm...

No, but that's just because you _think_ about the problem the wrong way
around.

If you think about it like the above, it's not a "stage" at all. It's just
part of getting the buffer to be up-to-date.

Very logical, very simple, no "fscking mess" at all.

That's my argument. You're trying to make it a problem. It's not. It's
something new with the page cache, yes - the fact that the page cache
drives the logic means that the buffer "uptodate" logic is different, but
if you think about it some, you'll realize that that's actually just a
natural outgrowth of the fact that the buffer cache is slaved to the page
cache these days. 

It's not a "stage 2". It's a basic fact of life - we don't care _how_ the
page got to be up-to-date, because for all we know there are other ways to
get it to be up-to-date than your "stage 1". 

For example, a recvfile() implementation could mark the page up-to-date by
virtue of getting the information off the network. No "stage 1" or "stage
2" there at all: the make_buffer_uptodate() logic does _not_ mean that we
lost the buffers from "stage 1", it might equally well mean that we got
the page up-to-date through some other means.

Just change how you think about the problem, and suddenly it's not a mess,
it's a design.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

[PATCH] Re: The INN/mmap bug

2000-09-18 Thread Alexander Viro



That's to fs/buffer.c, generic_file_write() part will go
separately (it's a separate bug actually - if copy_from_user() in
generic_file_write() fails we mark page as not-uptodate; if memory
pressure will drop the buffer_heads  attempts to read will blow the page
contents away).

Linus, does this look OK with you?

--- buffer.cMon Sep 18 18:56:29 2000
+++ buffer.c.newMon Sep 18 19:11:06 2000
@@ -1495,6 +1495,8 @@
err = get_block(inode, block, bh, 1);
if (err)
goto out;
+   if (Page_Uptodate(page))
+   mark_buffer_uptodate(bh, 1);
if (buffer_new(bh)) {
unmap_underlying_metadata(bh);
if (block_end > to)
@@ -1600,8 +1602,10 @@
continue;
 
if (!buffer_mapped(bh)) {
-   if (iblock < lblock)
-   get_block(inode, iblock, bh, 0);
+   if (iblock < lblock) {
+   if (get_block(inode, iblock, bh, 0))
+   continue;
+   }
if (!buffer_mapped(bh)) {
if (!kaddr)
kaddr = kmap(page);
@@ -1789,7 +1793,11 @@
/* Hole? Nothing to do */
if (buffer_uptodate(bh))
goto unlock;
-   get_block(inode, iblock, bh, 0);
+   err = get_block(inode, iblock, bh, 0);
+   if (err)
+   goto unlock;
+   if (Page_Uptodate(page))
+   mark_buffer_uptodate(bh, 1);
/* Still unmapped? Nothing to do */
if (!buffer_mapped(bh))
goto unlock;

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Alexander Viro

On Mon, 18 Sep 2000, Alexander Viro wrote:

> Except that the only reason why check for page being uptodate is correct
> is that for such pages we are guaranteed that buffer ring will not be
> dropped when page contatins data newer than on disk.

Ugh. Translation:

The only reason why the simple check for page being uptodate is enough to
avoid killing data with bogus rereads is that for pages that are NOT
uptodate we are guaranteed that buffer ring will not be dropped when data
in page is newer than on disk.

Sorry.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Alexander Viro

On Mon, 18 Sep 2000, Linus Torvalds wrote:

> It's not a mess at all.
> 
> I don't see why you call things "first stage" etc. It's perfectly
> straightforward. There are two bits that matter: 
>  - buffer uptodate
>  - buffer mapped.

first stage == page still not uptodate. second stage == page is already
uptodate and if it reverts to not-uptade state - it's a bug. Why? Because
then all nice logics about "no read requests for mapped pages" goes to
hell. We have such bug, BTW (in generic_file_write()).

> And the state is very clear and unambiguous:
> 
>   Mapped, Uptodate: regular block
>   !Mapped, Uptodate: hole of zeroes
>   Mapped, !Uptodate: unread
>   !Mapped, !Uptodate: "pending map"
> 
> No "several states" at all.

Except that the only reason why check for page being uptodate is correct
is that for such pages we are guaranteed that buffer ring will not be
dropped when page contatins data newer than on disk.

Yes, it works. But I'ld really prefer to avoid these extra reads at all -
if the rule is that buffer is read only once during the whole life of
page the life becomes somewhat easier to explain.

Currently (even with the fixes discussed) we _do_ reread them. Yes, it's
harmless (see above).

> It so happens that we forgot an important part of the "read a buffer"
> equation. The "read a buffer" function is NOT just
> 
>   static int make_buffer_uptodate(struct buffer_head * bh)
>   {
>   ll_rw_block(READ, 1, &bh);
>   }
> 
> but should instead be
> 
>   static int make_buffer_uptodate(struct page *page, struct buffer_head * bh)
>   {
>   if (Page_Uptodate(page)) {
^^^
>   set_bit(BH_Uptodate, &bh->b_state);
>   return;
>   }
>   ll_rw_block(READ, 1, &bh);
>   }

> Forget about your "stage 1" and "stage 2" complications. They shouldn't
  ^^
> exist.

Erm...

> > 1st stage, uptodate, !mappedhole. Contents is all-zeroes. It may also
> > be a result of failed attempt to map - we
> > have no way to tell.
> 
> The "uptodate !mapped" case is entirely clear: it's a hole. No ifs, no
> buts, no nothing.
> 
> If a map() failed, that map() will have returned an error, and if
> something turns the buffer uptodate that's a BUG.

Yes. In block_read_full_page(), for one thing.  Oops. Sorry, I
didn't mark it.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Linus Torvalds

On Mon, 18 Sep 2000, Alexander Viro wrote:
>   * we have several bh state components and the thing is a big,
> fscking mess. If we look at the areas outside of the page lock we have:

It's not a mess at all.

I don't see why you call things "first stage" etc. It's perfectly
straightforward. There are two bits that matter: 
 - buffer uptodate
 - buffer mapped.

And the state is very clear and unambiguous:

Mapped, Uptodate: regular block
!Mapped, Uptodate: hole of zeroes
Mapped, !Uptodate: unread
!Mapped, !Uptodate: "pending map"

No "several states" at all.

It so happens that we forgot an important part of the "read a buffer"
equation. The "read a buffer" function is NOT just

static int make_buffer_uptodate(struct buffer_head * bh)
{
ll_rw_block(READ, 1, &bh);
}

but should instead be

static int make_buffer_uptodate(struct page *page, struct buffer_head * bh)
{
if (Page_Uptodate(page)) {
set_bit(BH_Uptodate, &bh->b_state);
return;
}
ll_rw_block(READ, 1, &bh);
}

Forget about your "stage 1" and "stage 2" complications. They shouldn't
exist.

> 1st stage, uptodate, !mapped  hole. Contents is all-zeroes. It may also
>   be a result of failed attempt to map - we
>   have no way to tell.

The "uptodate !mapped" case is entirely clear: it's a hole. No ifs, no
buts, no nothing.

If a map() failed, that map() will have returned an error, and if
something turns the buffer uptodate that's a BUG.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Chris Mason

--On 09/18/00 13:19:27 -0400 Alexander Viro <[EMAIL PROTECTED]> wrote:

> 
> 
> On Mon, 18 Sep 2000, Chris Mason wrote:
> 
>> I'm not trying to put it all into a single get_block call, we have
>> different get_block funcs for different purposes.  What I'm really trying
>> to do is squeeze into block_prepare_write, as a generic setup function
>> for file modifications.
> 
> It is not. Period. Full stop. Check cont_prepare_write() and you will see
> a counterexample in the same fs/buffer.c. Yes, they are used in exactly
> the same manner.
> 
generic setup function.  Not the function that does the writes, but the
function that gets the page ready.  No, block_prepare_write does not have
to be the generic setup function, and it should not be for some
filesystems, and situations.  Right now, for what I need, it is usually
enough (except for writepage ;-)

In other words, I agree with you, even if we are saying it differently.
Clearly, when we go to writing directly to the tail for file_write, we'll
need a completely different prepare_write function.

>Generic function for file modifications is generic_file_write().
> block_prepare_write() lives several layers below and is used only for some
> of the local filesystems (ones where it fits).
> 
>> The other reason why I convert in get_block is that is when I've got the
>> most information about the internals of the file, including a path
>> through the btree to the item that may or may not need converting.  In
>> reiserfs, the only way to find out if the last item of the file is
>> packed is to search the tree for it, and the same searching code is used
>> if it is a packed or an unpacked item.
> 
>? Chris, I hope you've noticed that e.g. ext2_prepare_write()
> lives in fs/ext2/inode.c, not in fs/buffer.c. So I don't see what stops
> you from pulling _any_ data you want.
> 

Sorry, not sure what you mean.  When prepare_write is called, reiserfs
doesn't know if it needs to convert until it has searched the internal
btree.  Once the search is done, 90% of prepare_write is also done, so we
might as well do the conversion there (otherwise we end up searching
twice).  The searching is done in reiserfs_get_block, so the conversion is
done in reiserfs_get_block.

Yes, I can pull almost anything I need out of buffer.c.  I'm trying to make
it so I don't need to, just to keep down the replication (and bugs).  But
if it makes things too nasty, I pull the code out and make it reiserfs
specific.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Alexander Viro



OK, let's see. I've tried to describe what we have now (marking the
bugs) + proposal that would give somewhat saner logics (in the end of
posting). Comments are more than welcome.

* life of the page is clearly divided in two parts - before it can
be mapped to a user context and after that.
* no read requests can be issued in the second stage, by that
time page should be permanently up-to-date [ClearPageUptodate() in
generic_file_write() is a bug]
* buffer ring can be dropped only under the page lock.
* on the first stage buffer ring can be dropped only if the page
doesn't contain data that would be more recent than data in fs.
* on the second stage buffer ring can be dropped if there is no IO
scheduled or in progress [note: buffer_tied() might be handy, meaning
"dirty or under IO"; i.e. when ll_rw_block() does or will need this bh]
* we have several bh state components and the thing is a big,
fscking mess. If we look at the areas outside of the page lock we have:

1st stage, !uptodate, !mapped   contents is either the same as on disk
or it's a junk.
1st stage, !uptodate, mappedfailed attempt to read. contents may be
the same as the last data on disk or
it may be junk
1st stage, uptodate, !mappedhole. Contents is all-zeroes. It may also
be a result of failed attempt to map - we
have no way to tell.
1st stage, uptodate, mapped data. Same as on-disk or newer.
2nd stage, !uptodate, !mapped   contents is same or newer than on disk,
mapping unknown. [Recipe for disaster,
since the current code may try to read it;
should be map-and-be-done-with-that]
2nd stage, !uptodate, mappedfailed attempt to read that should never
happen. [Bug]
2nd stage, uptodate, !mappedhole. Contents may be nonzero due to
access via mmap(). May be a result of
failed attempt to map [we don't handle
such errors]
2nd stage, uptodate, mapped data. Same as on disk or newer.

Something should be done about that - aside of forced uptodate on
the second stage we ought to flag the errors (possibly by BH_Error?)
In __block_read_full_page() page being uptodate is a bug - we are
going to lose data.
Additionally, we have "tied by IO" state - can happen only for
mapped uptodate bh.

We seriously rely on the fact that on stage 1 we drop buffer ring
only if the page has no data in need of write. We can't do that on stage 2
(mmap() and dirtying the page from processes).
Due to the above we can afford re-reading the data on stage 1.
However, I'ld rather see a bitmap (PAGE_CACHE_SIZE/blocksize bits)
describing what is not up-to-date. Then we could use it for deciding
what should not be read.

Comments?

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]

Re: The INN/mmap bug

2000-09-18 Thread Alexander Viro

On Mon, 18 Sep 2000, Chris Mason wrote:

> I'm not trying to put it all into a single get_block call, we have
> different get_block funcs for different purposes.  What I'm really trying
> to do is squeeze into block_prepare_write, as a generic setup function for
> file modifications.

It is not. Period. Full stop. Check cont_prepare_write() and you will see
a counterexample in the same fs/buffer.c. Yes, they are used in exactly
the same manner.

Generic function for file modifications is generic_file_write().
block_prepare_write() lives several layers below and is used only for some
of the local filesystems (ones where it fits).

> The other reason why I convert in get_block is that is when I've got the
> most information about the internals of the file, including a path through
> the btree to the item that may or may not need converting.  In reiserfs,
> the only way to find out if the last item of the file is packed is to
> search the tree for it, and the same searching code is used if it is a
> packed or an unpacked item.

? Chris, I hope you've noticed that e.g. ext2_prepare_write()
lives in fs/ext2/inode.c, not in fs/buffer.c. So I don't see what stops
you from pulling _any_ data you want.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Chris Mason

--On 09/18/00 12:23:43 -0400 Alexander Viro <[EMAIL PROTECTED]> wrote:

> On Mon, 18 Sep 2000, Chris Mason wrote:
> 
>> When reiserfs_get_block is called on the packed data (with create == 1),
>> we unpack it to a full block, so the generic functions can handle
>> dirties/writes from then on.  If the data in the page cache isn't up to
>> date, we need to copy the packed data into the page cache so it doesn't
>> get lost in the conversion.
> 
>  why bother with bh in that case? You can do all this work on the
> ->writepage()/->readpage()/->prepare_write()/->commit_write() level.
> 

For us, the conversion only needs to be done during prepare_write (readpage
reads directly from the tail into the page).  We chose to do it in
get_block because it seemed to fit...prepare_write gives us an offset in
the file, and we give prepare_write a block.  Sometimes that means a tail
conversion is involved, sometimes not.  prepare_write just doesn't care.

>> This isn't for writing directly to the tail, I'm only doing that in
>> writepage, where I always overwrite the tail data with page data.
> 
>Well, duh. Now I can see where the complaints about VFS being
> ext2-friendly at expense of other filesystems come from. 
> Sure thing, if
> you bend your code so that it would use helper functions written for
> different layout model... It will hurt. The question being: what for?
> These functions do not belong to VFS API - they are convenient to
> implement the real thing (address_space_operations) for a class of
> filesystems, but that's it.
> 
>Look how 'no-holes' filesystems are doing that - yup, different
> layout, different helper functions. It would hurt like hell if we would
> try to do e.g. FAT on get_block() level. So that stuff was done on
> ...page() level, originally in fs/fat/inode.c. When it became obvious that
> there are other filesystems that might share the code - well, into
> fs/buffer.c it went. You are in the same situation.
> 

Yes, that's why I posted the reiserfs specific code for writepage and
truncate.  If we can pull a generic function out of that, and the work done
for other fragment supporting filesystems, perfect.

>Moreover, right now I'm testing yet another class (fragment
> support). Currently it's UFS only, but ext2 also might eventually win
> from that. Yup, also work on ..._page() level, doing that in
> ufs_get_frag() would be a horrible PITA.
> 
>I don't see why are you trying to squeeze everything into
> get_block() - it's not even a method anymore...
> 

I'm not trying to put it all into a single get_block call, we have
different get_block funcs for different purposes.  What I'm really trying
to do is squeeze into block_prepare_write, as a generic setup function for
file modifications.

The other reason why I convert in get_block is that is when I've got the
most information about the internals of the file, including a path through
the btree to the item that may or may not need converting.  In reiserfs,
the only way to find out if the last item of the file is packed is to
search the tree for it, and the same searching code is used if it is a
packed or an unpacked item.

In other words, even if reiserfs_prepare_write didn't use
block_prepare_write, I would probably still convert in a function similar
to reiserfs_get_block.

-chris  

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Alexander Viro

On Mon, 18 Sep 2000, Chris Mason wrote:

> When reiserfs_get_block is called on the packed data (with create == 1), we
> unpack it to a full block, so the generic functions can handle
> dirties/writes from then on.  If the data in the page cache isn't up to
> date, we need to copy the packed data into the page cache so it doesn't get
> lost in the conversion.

 why bother with bh in that case? You can do all this work on the
->writepage()/->readpage()/->prepare_write()/->commit_write() level.

> This isn't for writing directly to the tail, I'm only doing that in
> writepage, where I always overwrite the tail data with page data.

Well, duh. Now I can see where the complaints about VFS being
ext2-friendly at expense of other filesystems come from. Sure thing, if
you bend your code so that it would use helper functions written for
different layout model... It will hurt. The question being: what for? These
functions do not belong to VFS API - they are convenient to implement the
real thing (address_space_operations) for a class of filesystems, but
that's it.

Look how 'no-holes' filesystems are doing that - yup, different
layout, different helper functions. It would hurt like hell if we would
try to do e.g. FAT on get_block() level. So that stuff was done on
...page() level, originally in fs/fat/inode.c. When it became obvious that
there are other filesystems that might share the code - well, into
fs/buffer.c it went. You are in the same situation.

Moreover, right now I'm testing yet another class (fragment
support). Currently it's UFS only, but ext2 also might eventually win
from that. Yup, also work on ..._page() level, doing that in ufs_get_frag()
would be a horrible PITA.

I don't see why are you trying to squeeze everything into
get_block() - it's not even a method anymore...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Chris Mason




--On 09/18/00 08:52:12 -0700 Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Mon, 18 Sep 2000, Chris Mason wrote:
>> 
>> ReiserFS depends on the buffer head up to date flag being correct when it
>> is sent to get_block.  When unpacking the tail, we have to know if the
>> packed data on disk should be copied over the data in the page. 
>> 
>> So, the above should work for us if you tested Page_Uptodate before
>> testing buffer_mapped.
> 
> Can't be done.
> 
> A up-to-date but non-mapped buffer is a perfectly valid thing to have -
> it's just a hole.
> 
> So we can't mark the buffer uptodate before marking it mapped - that would
> be a totally different state.
> 

Ok, I can also check to see if the page is up to date during the tail
conversion.  Mostly I wanted to make everyone aware that nastiness like
this exists, and that I haven't found a clean solution yet.

-chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Linus Torvalds




On Mon, 18 Sep 2000, Chris Mason wrote:
> 
> ReiserFS depends on the buffer head up to date flag being correct when it
> is sent to get_block.  When unpacking the tail, we have to know if the
> packed data on disk should be copied over the data in the page. 
> 
> So, the above should work for us if you tested Page_Uptodate before testing
> buffer_mapped.

Can't be done.

A up-to-date but non-mapped buffer is a perfectly valid thing to have -
it's just a hole.

So we can't mark the buffer uptodate before marking it mapped - that would
be a totally different state.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Daniel Phillips

Alexander Viro wrote:
> 
> On Sun, 17 Sep 2000, Linus Torvalds wrote:
> 
> > On Sun, 17 Sep 2000, Alexander Viro wrote:
> > >
> > > Looks sane. But I really wonder if we could just do it in
> > > create_page_buffers() if page is up-to-date. OTOH it would require attempt
> > > to map them all. Comments?
> >
> > That would certainly simplify a lot.
> >
> > And as we've seen, simplifying this area would not necessarily be a bad
> > thing ;)
> >
> > Right now I'll just do the minimal fix, though, I think.
> 
> Fine with me. I'll do the full analysis tonight, anyway, and try to write
> down the rules for that stuff. One thing that makes me seriously uneasy
> is the fact that VM knows about ->buffers - I would be much happier if all
> this stuff would be contained in fs/buffer.c. IOW, I'm not sure that
> block_flushpage()/try_to_free_buffers() is a happy camper.

As you know, I've begun to analyse it;

  Canonic buffer states and transitions
  http://marc.theaimsgroup.com/?l=linux-fsdevel&m=96893011609105&w=2

Since that post I've found one additional useful buffer state (*)
bringing the total to 5:

  0: undef - Neither data on disk or in cache known to be valid
  1: inval - Data on disk known to be invalid (*)
  2: known - Data on disk known to be valid
  3: dirty - Data in cache known to be valid
  4: clean - Data in cache known to match data on disk

This is still a lot less that the current 16 states.  I'm still
waiting for comments on this analysis one way or the other.  I think
it strikes at the heart of the problem.  I already used this way of
thinking to find and fix a similar bug in my tailmerging code where
__block_prepare_write was wrongly clearing part of a dirty buffer just
because ext2_get_block reported it had created a new block.  The quick
fix was to suppress the buffer_new bit coming back from ext2_get_block
if the buffer is dirty.  A better fix would be to eliminate the
buffer_new bit entirely, but of course this can wait.

My contribution to this will be to update the buffer states post and
if nobody thinks I'm completely off-base, do a similar one for page
states.  It seems to me that you and Linus have already found exactly
the crevice that this bug lives in.

> I'm not proposing it for immediate inclusion, but I don't feel good about
> all this code - current rules are too complex and rely on details of
> VM/buffer interaction too much for my taste. It may be correct, but it's
> not obviously correct...

Amen.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Chris Mason

--On 09/18/00 09:16:22 -0400 Alexander Viro <[EMAIL PROTECTED]> wrote:

> 
> 
> On Mon, 18 Sep 2000, Chris Mason wrote:
> 
>> ReiserFS depends on the buffer head up to date flag being correct when it
>> is sent to get_block.  When unpacking the tail, we have to know if the
>> packed data on disk should be copied over the data in the page. 
> 
> ??? Details, please. What the hell are you doing with buffer-heads for a
> piece of data that can be completely unaligned?
> 

When reiserfs_get_block is called on the packed data (with create == 1), we
unpack it to a full block, so the generic functions can handle
dirties/writes from then on.  If the data in the page cache isn't up to
date, we need to copy the packed data into the page cache so it doesn't get
lost in the conversion.

This isn't for writing directly to the tail, I'm only doing that in
writepage, where I always overwrite the tail data with page data.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Alexander Viro




On Mon, 18 Sep 2000, Chris Mason wrote:

> ReiserFS depends on the buffer head up to date flag being correct when it
> is sent to get_block.  When unpacking the tail, we have to know if the
> packed data on disk should be copied over the data in the page. 

??? Details, please. What the hell are you doing with buffer-heads for a
piece of data that can be completely unaligned?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-18 Thread Chris Mason


--On 09/17/00 20:30:29 -0700 Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> Basically, both "truncate()" and "write()" have this bug where they can
> end up re-reading stuff from disk even though the in-memory copy is newer.
> 
> And because write() had this bug, the bug also got into
> block_write_full_page(). Not because block_write_full_page() was buggy in
> itself, but because it used a buggy routine.
> 
> And your patch fixes the corruption, not by fixing the bug, but by
> avoiding the buggy routing in block_write_full_page().
> 
> We need to fix the real bug - otherwise anybody doing both write() and
> shared mmap's to the same file is going to be bit by it later on...
> 
> The easy fix is probably to do something like
> 
>/* Map the buffer */
>if (!buffer_mapped(bh)) {
>...
>}
> + /* If the page is up-to-date, so is the buffer */
> + if (Page_Uptodate(page))
> + set_bit(BH_Uptodate, &bh->b_state);
> 
>/* Ok, now it was _truly_ not uptodate */
>if (!buffer_uptodate(bh))
>ll_rw_block(READ, 1, &bh);
> 
> Comments? The above should fix block_write_full_page() automatically, as
> well as fixing the other cases too - and actually improve performance at
> the same time by getting rid of unnecessary re-reads.
> 

ReiserFS depends on the buffer head up to date flag being correct when it
is sent to get_block.  When unpacking the tail, we have to know if the
packed data on disk should be copied over the data in the page. 

So, the above should work for us if you tested Page_Uptodate before testing
buffer_mapped.

-chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Alexander Viro




On Mon, 18 Sep 2000, Alexander Viro wrote:

> On Sun, 17 Sep 2000, Linus Torvalds wrote:
> 
> > At that point, block_write_full_page() would become something like
> > 
> > unsigned long index = inode->i_size >> PAGE_CACHE_SHIFT;
> > 
> > offset = inode->i_size & (PAGE_CACHE_SIZE-1);
> > if (index > page->index)
> > offset = PAGE_CACHE_SIZE;
> > 
> > create_empty_buffers(page, 0, offset);
> > for_each_buffer(page) {
> > if (buffer_mapped(bh))
> > mark_buffer_dirty(bh);
> > }
> > 
> > which would certainly be simpler than the current code. Hmm.
> 
> At that point it will become prepare_write+commit_write. Which may be a
> good thing in many other respects. I wonder if we need the
> ->writepage() at all - I seriosuly suspect that the reason why I didn't
> kill it back when I added ->prepare_write() was that I _did_ stumble on
> that bug with reads and didn't realize that.

Nope. NFS does things differently in cases of write and pageout.
Ouch. On the write path for local filesystems we generally ignore O_SYNC
for data. And IMO that's firmly on the "must be fixed before
2.4-final" side. Possibly by sync/async flag for ->commit_write()...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Alexander Viro




On Sun, 17 Sep 2000, Linus Torvalds wrote:

> At that point, block_write_full_page() would become something like
> 
>   unsigned long index = inode->i_size >> PAGE_CACHE_SHIFT;
> 
>   offset = inode->i_size & (PAGE_CACHE_SIZE-1);
>   if (index > page->index)
>   offset = PAGE_CACHE_SIZE;
> 
>   create_empty_buffers(page, 0, offset);
>   for_each_buffer(page) {
>   if (buffer_mapped(bh))
>   mark_buffer_dirty(bh);
>   }
> 
> which would certainly be simpler than the current code. Hmm.

At that point it will become prepare_write+commit_write. Which may be a
good thing in many other respects. I wonder if we need the
->writepage() at all - I seriosuly suspect that the reason why I didn't
kill it back when I added ->prepare_write() was that I _did_ stumble on
that bug with reads and didn't realize that.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Alexander Viro

On Sun, 17 Sep 2000, Linus Torvalds wrote:

> On Sun, 17 Sep 2000, Alexander Viro wrote:
> > 
> > Looks sane. But I really wonder if we could just do it in
> > create_page_buffers() if page is up-to-date. OTOH it would require attempt
> > to map them all. Comments?
> 
> That would certainly simplify a lot.
> 
> And as we've seen, simplifying this area would not necessarily be a bad
> thing ;)
> 
> Right now I'll just do the minimal fix, though, I think.

Fine with me. I'll do the full analysis tonight, anyway, and try to write
down the rules for that stuff. One thing that makes me seriously uneasy
is the fact that VM knows about ->buffers - I would be much happier if all
this stuff would be contained in fs/buffer.c. IOW, I'm not sure that
block_flushpage()/try_to_free_buffers() is a happy camper.

I'm not proposing it for immediate inclusion, but I don't feel good about
all this code - current rules are too complex and rely on details of
VM/buffer interaction too much for my taste. It may be correct, but it's
not obviously correct...

[mingo, Daniel and Chris added to cc for obvious reasons]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Linus Torvalds

On Mon, 18 Sep 2000, Alexander Viro wrote:
> 
> We might as well do it in create_empty_buffers(), FWIW. Do you see any
> case when it would mean extra work? I'm thinking about the following:
> 
>   if (page is uptodate)
>   do get_block(...,0) for all buffers and mark them uptodate
> 
> in the end of buffer-ring creation. Comments?

The above _will_ cause extra work, notably for the case of writing to a
hole (first create_empty_buffers() will do the get_block( .., 0) and get a
zero, then the writing will do a get_block( .., 1) to actually allocate
the block.

We could push the "create" thing into create_empty_buffers(), though. We'd
give it a "begin,end" the way writing wants, and a reader (or truncate)
would just pass in 0,0.

Hmmm.. I like that approach - because I suspect that eventually we want to
change the FS mapping function away from the current buffer-based one, and
make it a page-based one to let the low-level filesystem handle the tail
etc issues itself.

And when we do that, we'd have to pass in that "I'm going to want to dirty
this range" information anyway, and the current "create_empty_buffers()"
would in fact become the new "map this page, please" function.

At that point, block_write_full_page() would become something like

unsigned long index = inode->i_size >> PAGE_CACHE_SHIFT;

offset = inode->i_size & (PAGE_CACHE_SIZE-1);
if (index > page->index)
offset = PAGE_CACHE_SIZE;

create_empty_buffers(page, 0, offset);
for_each_buffer(page) {
if (buffer_mapped(bh))
mark_buffer_dirty(bh);
}

which would certainly be simpler than the current code. Hmm.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Linus Torvalds




On Sun, 17 Sep 2000, Alexander Viro wrote:
> 
>   Looks sane. But I really wonder if we could just do it in
> create_page_buffers() if page is up-to-date. OTOH it would require attempt
> to map them all. Comments?

That would certainly simplify a lot.

And as we've seen, simplifying this area would not necessarily be a bad
thing ;)

Right now I'll just do the minimal fix, though, I think.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Alexander Viro




On Sun, 17 Sep 2000, Linus Torvalds wrote:

> Yeah. See my other mail - I don't think the patch is needed at all,
> although I'm convinced that it would actually hide the problem for the
> particular case of INN (zeroing out the partial block is not an issue
> there, as truncate will have done it - it's really only protection against
> badly behaved software and shouldn't matter to any "regular" use).

True, but that way we also get the consistent rules wrt. async/sync.

> > We know that ->writepage() is called and we know that data doesn't get
> > to the disk. How can it happen? If we can... Oh, fsck. Linus, we could
> > very well lose the buffers since the moment when page had been read.
> > See what happens? We recreate the buffer ring for the page and it's
> > nowhere near "everything uptodate" state.
>
> Yeah. See my suggested fix in the other mail. We just must not read in
> buffers for pages that are already up-to-date.

We might as well do it in create_empty_buffers(), FWIW. Do you see any
case when it would mean extra work? I'm thinking about the following:

if (page is uptodate)
do get_block(...,0) for all buffers and mark them uptodate

in the end of buffer-ring creation. Comments?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Alexander Viro




On Sun, 17 Sep 2000, Linus Torvalds wrote:

> The basic problem is that right now we have code that does 
> 
>   if (!page->buffers)
>   create_empty_buffers(page, inode, inode->i_sb->s_blocksize);
>   ...
>   if (!buffer_uptodate(bh))
>   ll_rw_block(READ, 1, &bh);
> 
> And this is WRONG.
> 
> If the whole page is up-to-date, we must NOT try to read the buffer in
> from disk, because the in-memory copy is always more up-to-date.

Yep.

> We need to fix the real bug - otherwise anybody doing both write() and
> shared mmap's to the same file is going to be bit by it later on...
> 
> The easy fix is probably to do something like
> 
>   /* Map the buffer */
>   if (!buffer_mapped(bh)) {
>   ...
>   }
> + /* If the page is up-to-date, so is the buffer */
> + if (Page_Uptodate(page))
> + set_bit(BH_Uptodate, &bh->b_state);
> 
>   /* Ok, now it was _truly_ not uptodate */
>   if (!buffer_uptodate(bh))
>   ll_rw_block(READ, 1, &bh);
> 
> Comments? The above should fix block_write_full_page() automatically, as
> well as fixing the other cases too - and actually improve performance at
> the same time by getting rid of unnecessary re-reads.
> 
> Looks fairly simple. It only happens in __block_prepare_write() and in
> block_truncate_page(), so there's just two places to fix.
> 
> Can you se anything else wrong?

Looks sane. But I really wonder if we could just do it in
create_page_buffers() if page is up-to-date. OTOH it would require attempt
to map them all. Comments?

That way we restore the bh state if buffer ring is recreated -
IMO somewhat simpler...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Linus Torvalds

On Sun, 17 Sep 2000, Alexander Viro wrote:
> 
> Ugh. OK, first of all, patch is _not_ correct. Doesn't zero out the end of
> partial block.
> 

Yeah. See my other mail - I don't think the patch is needed at all,
although I'm convinced that it would actually hide the problem for the
particular case of INN (zeroing out the partial block is not an issue
there, as truncate will have done it - it's really only protection against
badly behaved software and shouldn't matter to any "regular" use).

> We know that ->writepage() is called and we know that data doesn't get
> to the disk. How can it happen? If we can... Oh, fsck. Linus, we could
> very well lose the buffers since the moment when page had been read.
> See what happens? We recreate the buffer ring for the page and it's
> nowhere near "everything uptodate" state.

Yeah. See my suggested fix in the other mail. We just must not read in
buffers for pages that are already up-to-date.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Linus Torvalds

On Sun, 17 Sep 2000, Linus Torvalds wrote:
> 
> I bet your patch fixes the corruption, but I want to understand _why_.
> Call me dense, but __block_commit_write() seems to do everything we want
> done..

Ok, I may be dense, but I see the bug.

And no, your patch was not the right thing either. It _will_ make that
particular corruption go away, but it's a more insidious problem, and your
patch only addreses part of the whole problem.

The basic problem is that right now we have code that does 

if (!page->buffers)
create_empty_buffers(page, inode, inode->i_sb->s_blocksize);
...
if (!buffer_uptodate(bh))
ll_rw_block(READ, 1, &bh);

And this is WRONG.

If the whole page is up-to-date, we must NOT try to read the buffer in
from disk, because the in-memory copy is always more up-to-date.

Normally this bug shows itself only as a small performance issue - if we
only use read() and write(), all changes to the page will be done
"synchronously" with "page->buffers" being held, and as such the page will
never have contents that are newer than the on-disk image. But whenever
somebody writes to the page through a shared mapping, that is no longer
true - we MUST NOT do the buffer read, because it's going to overwrite
data that is newer than the disk contents.

The bug again didn't show up in the trivial test-cases, because it depends
on us "losing" the page buffers and having to re-create them in order to
show up, and that only happens under memory pressure.

Basically, both "truncate()" and "write()" have this bug where they can
end up re-reading stuff from disk even though the in-memory copy is newer.

And because write() had this bug, the bug also got into
block_write_full_page(). Not because block_write_full_page() was buggy in
itself, but because it used a buggy routine.

And your patch fixes the corruption, not by fixing the bug, but by
avoiding the buggy routing in block_write_full_page().

We need to fix the real bug - otherwise anybody doing both write() and
shared mmap's to the same file is going to be bit by it later on...

The easy fix is probably to do something like

/* Map the buffer */
if (!buffer_mapped(bh)) {
...
}
+   /* If the page is up-to-date, so is the buffer */
+   if (Page_Uptodate(page))
+   set_bit(BH_Uptodate, &bh->b_state);

/* Ok, now it was _truly_ not uptodate */
if (!buffer_uptodate(bh))
ll_rw_block(READ, 1, &bh);

Comments? The above should fix block_write_full_page() automatically, as
well as fixing the other cases too - and actually improve performance at
the same time by getting rid of unnecessary re-reads.

Looks fairly simple. It only happens in __block_prepare_write() and in
block_truncate_page(), so there's just two places to fix.

Can you se anything else wrong?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Alexander Viro

On Sun, 17 Sep 2000, Linus Torvalds wrote:

> Hmm.. Ok, I see what you're saying. And I looked at the diff the wrong way
> around - we _lost_ two articles rather than getting two new ones.
> 
> However, what's wrong with the block_write_full_page() logic? Your patch
> looks fine to me, but so does the old logic, I have to admit. 
> 
> Sure, __block_prepare_write() does the "optimize away reads" stuff, but
> this all can only happen for pages that are up-to-date, so that shouldn't
> matter. __block_commit_write() will certainly mark the buffers dirty, so I
> don't see how the data would get lost anywhere. 
> 
> I bet your patch fixes the corruption, but I want to understand _why_.
> Call me dense, but __block_commit_write() seems to do everything we want
> done..
> 
> I wonder if we have some cache aliasing problem somewhere. Or what's the
> bug I overlooked?

Ugh. OK, first of all, patch is _not_ correct. Doesn't zero out the end of
partial block.

We know that ->writepage() is called and we know that data doesn't get to
the disk. How can it happen? If we can... Oh, fsck. Linus, we could very
well lose the buffers since the moment when page had been read. See what
happens? We recreate the buffer ring for the page and it's nowhere near
"everything uptodate" state.

So yes, reading the buffers is the problem. I'll fix the zero-out part
of that and send the corrected variant. From my reading of
__block_write_full_page() it looks like we are OK wrt assumptions about
the bh state there, but we may need to check other places where we use
prepare_write().

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Linus Torvalds

On Sun, 17 Sep 2000, Alexander Viro wrote:
> 
> It _is_ a bug, unfortunately. Look: he had taken a copy when the thing was
> running. I.e. took the copy of in-core page. After that he stopped innd
> and that had triggered block_write_full_page(). So far, so good, right?
> Except that the page contents didn't reach the disk.

Hmm.. Ok, I see what you're saying. And I looked at the diff the wrong way
around - we _lost_ two articles rather than getting two new ones.

However, what's wrong with the block_write_full_page() logic? Your patch
looks fine to me, but so does the old logic, I have to admit. 

Sure, __block_prepare_write() does the "optimize away reads" stuff, but
this all can only happen for pages that are up-to-date, so that shouldn't
matter. __block_commit_write() will certainly mark the buffers dirty, so I
don't see how the data would get lost anywhere. 

I bet your patch fixes the corruption, but I want to understand _why_.
Call me dense, but __block_commit_write() seems to do everything we want
done..

I wonder if we have some cache aliasing problem somewhere. Or what's the
bug I overlooked?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Alexander Viro




On Sun, 17 Sep 2000, Linus Torvalds wrote:

> 
> 
> On Sun, 17 Sep 2000, Marco d'Itri wrote:
> >
> > I added to block_write_full_page() the debugging code suggested by
> > Alexander Viro:
> > 
> > if (inode->i_dev == 0x305 && inode->i_ino == 48991) // Md
> > printk("block_write_full_page: writing page %d, size %Ld\n",
> > page->index, inode->i_size);
> > 
> > and I have just been able to trigger the bug (two times):
> 
> Bug?
> 
> How do you think updates get written back to disk when innd has made
> changes to its in-core file?
> 
> Right. With block_write_full_page.
> 
> > root@wonderland:vc/2:/var/lib/news$diff -u active active.ok
> > --- active  Sun Sep 17 16:09:01 2000
> > +++ active.ok   Sun Sep 17 21:03:16 2000
> > @@ -1,11 +1,11 @@
> >  control 004774 004775 y
> >  control.cancel 021917 021889 n
> >  junk 001777 001768 y
> > -fido.ita.ridere 014632 014600 y
> > +fido.ita.ridere 014634 014600 y
> 
> So? innd got two new articles. 
> 
> The above is what you want. If the active file wouldn't get updated, you'd
> never see any new articles ever again.
> 
> Doesn't look like a bug to me.

It _is_ a bug, unfortunately. Look: he had taken a copy when the thing was
running. I.e. took the copy of in-core page. After that he stopped innd
and that had triggered block_write_full_page(). So far, so good, right?
Except that the page contents didn't reach the disk.

How about the following:
--- buffer.cMon Sep 18 01:13:42 2000
+++ buffer.c.newMon Sep 18 01:28:45 2000
@@ -1411,9 +1411,12 @@
  */
 static int __block_write_full_page(struct inode *inode, struct page *page, 
get_block_t *get_block)
 {
-   int err, i, need_balance_dirty = 0;
+   int err = -EIO, i, need_balance_dirty = 0;
unsigned long block;
struct buffer_head *bh, *head;
+   unsigned long end_block = (inode->i_size+inode->i_sb->s_blocksize-1) >> 
+   inode->i_sb->s_blocksize_bits;
+   unsigned offset;
 
if (!PageLocked(page))
BUG();
@@ -1423,6 +1426,9 @@
head = page->buffers;
 
block = page->index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+   /* Are we completely out? */
+   if (block > end_block)
+   goto out;
 
bh = head;
i = 0;
@@ -1451,6 +1457,8 @@
 
bh = bh->b_this_page;
block++;
+   if (block > end_block)
+   break;
} while (bh != head);
 
if (need_balance_dirty)
@@ -1823,31 +1831,7 @@
 int block_write_full_page(struct page *page, get_block_t *get_block)
 {
struct inode *inode = (struct inode*)page->mapping->host;
-   unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT;
-   unsigned offset;
-   int err;
-
-   /* easy case */
-   if (page->index < end_index)
-   return __block_write_full_page(inode, page, get_block);
-
-   /* things got complicated... */
-   offset = inode->i_size & (PAGE_CACHE_SIZE-1);
-   /* OK, are we completely out? */
-   if (page->index >= end_index+1 || !offset)
-   return -EIO;
-   /* Sigh... will have to work, then... */
-   err = __block_prepare_write(inode, page, 0, offset, get_block);
-   if (!err) {
-   memset(page_address(page) + offset, 0, PAGE_CACHE_SIZE - offset);
-   flush_dcache_page(page);
-   __block_commit_write(inode,page,0,offset);
-done:
-   kunmap(page);
-   return err;
-   }
-   ClearPageUptodate(page);
-   goto done;
+   return __block_write_full_page(inode, page, get_block);
 }
 
 int generic_block_bmap(struct address_space *mapping, long block, get_block_t 
*get_block)

Unlike the current variant it does the right thing with ->b_end_io and is
less intrusive (and shorter, BTW).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The INN/mmap bug

2000-09-17 Thread Linus Torvalds




On Sun, 17 Sep 2000, Marco d'Itri wrote:
>
> I added to block_write_full_page() the debugging code suggested by
> Alexander Viro:
> 
> if (inode->i_dev == 0x305 && inode->i_ino == 48991) // Md
> printk("block_write_full_page: writing page %d, size %Ld\n",
> page->index, inode->i_size);
> 
> and I have just been able to trigger the bug (two times):

Bug?

How do you think updates get written back to disk when innd has made
changes to its in-core file?

Right. With block_write_full_page.

> root@wonderland:vc/2:/var/lib/news$diff -u active active.ok
> --- active  Sun Sep 17 16:09:01 2000
> +++ active.ok   Sun Sep 17 21:03:16 2000
> @@ -1,11 +1,11 @@
>  control 004774 004775 y
>  control.cancel 021917 021889 n
>  junk 001777 001768 y
> -fido.ita.ridere 014632 014600 y
> +fido.ita.ridere 014634 014600 y

So? innd got two new articles. 

The above is what you want. If the active file wouldn't get updated, you'd
never see any new articles ever again.

Doesn't look like a bug to me.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

39 matches

Mail list logo