Re: DVD blockdevice buffers

2001-05-23 Thread Stephen C. Tweedie

Hi,

On Wed, May 23, 2001 at 11:12:00AM -0700, Linus Torvalds wrote:
> 
> On Wed, 23 May 2001, Stephen C. Tweedie wrote:
> No, you can actually do all the "prepare_write()"/"commit_write()" stuff
> that the filesystems already do. And you can do it a lot _better_ than the
> current buffer-cache-based approach. Done right, you can actually do all
> IO in page-sized chunks, BUT fall down on sector-sized things for the
> cases where you want to. 

Right, but you still lose the caching in that case.  The write works,
but the "cache" becomes nothing more than a buffer.

This actually came up recently after the first posting of the
bdev-on-pagecache patches, when somebody was getting lousy
database performance with an application (I think one they had
developed from scratch) --- it was using 512-byte blocks as the
basic write alignment and was relying on the kernel caching them.
In fact, in
that case even our old buffer cache was failing due to the default
blocksize of 1024 bytes, and he had had to add an ioctl to force the
blocksize to 512 bytes before the application would perform at all
well on Linux.
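
(In outline, the kind of ioctl being described looks something like
this --- a sketch only, with a made-up ioctl number, although
set_blocksize() itself is the real 2.4 helper:)

    /* Hypothetical driver ioctl: force a 512-byte buffer-cache
     * blocksize so that the application's 512-byte writes stay
     * cached as single blocks.  BLKBSZSET_FORCE is invented for
     * this illustration. */
    case BLKBSZSET_FORCE:
            if (!capable(CAP_SYS_ADMIN))
                    return -EACCES;
            set_blocksize(inode->i_rdev, 512);
            return 0;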

So we do have at least one real-world example which will fail if we
increase the IO granularity.  We may well decide that the pain is
worth it, but the page cache really cannot deal properly with this
right now without having an uptodate labeling at finer granularity
than the page (which would be unnecessary ugliness in most cases).

--Stephen



Re: DVD blockdevice buffers

2001-05-23 Thread Stephen C. Tweedie

Hi,

On Sat, May 19, 2001 at 07:36:07PM -0700, Linus Torvalds wrote:

> Right now we don't try to aggressively drop streaming pages, but it's
> possible. Using raw devices is a silly work-around that should not be
> needed, and this load shows a real problem in current Linux (one soon to
> be fixed, I think - Andrea already has some experimental patches for the
> page-cache thing).

Right.  I'd like to see buffered IO able to work well --- apart from
the VM issues, it's the easiest way to allow the application to take
advantage of readahead.  However, there's one sticking point we
encountered, which is applications which write to block devices in
units smaller than a page.  Small block writes get magically
transformed into read/modify/write cycles if you shift the block
devices into the page cache.
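
(To make the cost concrete, here is roughly what a 512-byte write
turns into once the device lives in the page cache --- a sketch, not
the real code path; the two helper names are placeholders:)

    page = grab_cache_page(mapping, index);    /* find or add the page */
    if (!Page_Uptodate(page))
            read_page_from_disk(page);         /* <-- the extra read */
    memcpy(page_address(page) + offset, buf, 512);
    mark_page_dirty(page);                     /* whole page written back */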

Of course, we could just say "then don't do that" and be done with it
--- after all, we already have this behaviour when writing to regular
files.

--Stephen



Re: Getting FS access events

2001-05-23 Thread Stephen C. Tweedie

Hi,

On Tue, May 15, 2001 at 04:37:01PM +1200, Chris Wedgwood wrote:
> On Sun, May 13, 2001 at 08:39:23PM -0600, Richard Gooch wrote:
> 
> > Yeah, we need a decent unfragmenter. We can do that now with
> > bmap().
> 
> SCT wrote a defragger for ext2 but it only handles 1k blocks :(

Actually, I wrote it for extfs, and Alexey Vovenko ported it to ext2.
Extfs *really* needed a defragmenter, because it had weird behaviour
patterns which at times included allocating all of the blocks of a
file in descending disk-block order.

Cheers,
 Stephen



Re: Getting FS access events

2001-05-23 Thread Stephen C. Tweedie

Hi,

On Fri, May 18, 2001 at 09:55:14AM +0200, Rogier Wolff wrote:

> The "boot quickly" was an example. "Load netscape quickly" on some
> systems is done by dd-ing the binary to /dev/null. 

This is one of the reasons why some filesystems use extent maps
instead of inode indirection trees.  The problem of caching the
metadata basically just goes away if your mapping information is a few
bytes saying "this file is an extent of a hundred blocks at offset FOO
followed by fifty blocks at offset BAR."

If the mapping metadata is _that_ compact, then your binaries are
almost guaranteed to be either mapped in the inode or in a single
mapping block, so the problem of seeking between indirect blocks
basically just goes away.  You still have to do things like prime the
inode/indirect cache before the first data access if you want
directory scans to go fast, and you still have to preload data pages
for readahead, of course.  
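
(For comparison, the entire mapping for such a file can be a couple of
entries of a structure like this --- a generic sketch, not any
particular filesystem's on-disk format:)

    /* One extent: "len contiguous blocks of the file, starting at
     * logical block lblock, live on disk starting at block pblock". */
    struct extent {
            u32     lblock;         /* first file-relative block */
            u32     pblock;         /* first disk block */
            u32     len;            /* number of contiguous blocks */
    };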

If the objective is "start netscape faster", then the cost of having
to do one synchronous IO to pull in a single indirect extent map block
is going to be negligible next to the other costs.

(Extent maps have their own problems, especially when it comes to
dealing with holes, but that's a different story...)

--Stephen



Re: Getting FS access events

2001-05-23 Thread Stephen C. Tweedie

Hi,

On Sat, May 19, 2001 at 12:47:15PM -0700, Linus Torvalds wrote:
> 
> On Sat, 19 May 2001, Pavel Machek wrote:
> > 
> > > Don't get _too_ hung up about the power-management kind of "invisible
> > > suspend/resume" sequence where you resume the whole kernel state.
> > 
> > Ugh. Now I'm confused. How do you do useful resume from disk when you
> > don't restore complete state? Do you propose something like "write
> > only pagecache to disk"?
> 
> Go back to the original _reason_ for this whole discussion. 
> 
> It's not really a "resume" event, it's a "populate caches really
> efficiently at boot" event.

Then you'd better be sure that the cache (or at least, the saved
image) only contains data which is guaranteed not to be written
between successive restores from the same image.  The big advantage of
just resuming from the state of the previous shutdown (whether it's
cache or the whole kernel state) is that you've got a much higher
expectation that nothing on disk got modified between the save and the
restore.

--Stephen



Re: Ext2, fsync() and MTA's?

2001-05-22 Thread Stephen C. Tweedie

Hi,

On Tue, May 22, 2001 at 11:54:55AM -0500, Oliver Xymoron wrote:

> > > > That's probably the right thing to add.
> > >
> > > I'd vote for an async flag instead.
> >
> > Why???  Why change the default behaviour to be something much slower?
> 
> I was suggesting an async flag _in addition_ to the sync flag, both
> propagating to subdirs. Nice and orthogonal.

The whole problem is that the flag applies to both files and
directories, but we often only want it enforced on directories
(because we already have fsync for files).  Adding another orthogonal
file+dir async flag won't help that at all.

Cheers,
 Stephen



Re: Ext2, fsync() and MTA's?

2001-05-22 Thread Stephen C. Tweedie

Hi,

On Tue, May 22, 2001 at 10:50:51AM -0500, Oliver Xymoron wrote:
> On Mon, 21 May 2001, Theodore Tso wrote:
> 
> > On Mon, May 21, 2001 at 06:47:58PM +0100, Stephen C. Tweedie wrote:
> >
> > > Just set chattr +S on the spool dir.  That's what the flag is for.
> > > The biggest problem with that is that it propagates to subdirectories
> > > and files --- would a version of the flag which applied only to
> > > directories be a help here?
> >
> > That's probably the right thing to add.
> 
> I'd vote for an async flag instead.

Why???  Why change the default behaviour to be something much slower?

Cheers,
 Stephen



Re: Ext2, fsync() and MTA's?

2001-05-21 Thread Stephen C. Tweedie

Hi,

On Sun, May 13, 2001 at 12:53:37AM +1000, Andrew McNamara wrote:

> I seem to recall that in 2.2, fsync behaved like fdatasync, and that
> it's only in 2.4 that it also syncs metadata - is this correct?

No, fsync should be safe on 2.2.  There was a problem with O_SYNC not
syncing all metadata on 2.2 if you were extending a file, but that
never applied to fsync.

> Do the BSD's sync the directory data on an fsync of a file? I guess
> this is the bone of contention

No --- the old BSDs were safe because their directory operations were
fully synchronous so they *never* needed to be sync'ed manually.
According to SuS, an application relying on sync directory updates is
buggy, because SuS simply makes no such guarantees.
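
(The portable recipe for an MTA, then, is to sync the directory by
hand --- a userspace sketch with declarations and error handling
trimmed:)

    fd = open("spool/msg.tmp", O_WRONLY | O_CREAT, 0600);
    write(fd, buf, len);
    fsync(fd);                         /* file data and inode are safe */
    close(fd);
    rename("spool/msg.tmp", "spool/msg");
    dirfd = open("spool", O_RDONLY);   /* now sync the directory itself */
    fsync(dirfd);                      /* the directory entry is safe */
    close(dirfd);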

Just set chattr +S on the spool dir.  That's what the flag is for.
The biggest problem with that is that it propagates to subdirectories
and files --- would a version of the flag which applied only to
directories be a help here?

Cheers,
 Stephen



Re: Ext2, fsync() and MTA's?

2001-05-21 Thread Stephen C. Tweedie

Hi,

On Sat, May 12, 2001 at 03:13:55PM +0100, Alan Cox wrote:

> fsync guarantees the inode data is up to date, fdatasync just the data.

fdatasync guarantees "important" inode data too.  The only thing that
fdatasync is allowed to skip is the timestamps.

--Stephen



Re: [RFC][PATCH] Re: Linux 2.4.4-ac10

2001-05-21 Thread Stephen C. Tweedie

Hi,

On Sun, May 20, 2001 at 07:04:31AM -0300, Rik van Riel wrote:
> On Sun, 20 May 2001, Mike Galbraith wrote:
> > 
> > Looking at the locking and trying to think SMP (grunt) though, I
> > don't like the thought of taking two locks for each page until
> 
> > 100%.  The data in that block is toast anyway.  A big hairy SMP
> > box has to feel reclaim_page(). (they probably feel the zone lock
> > too.. probably would like to allocate blocks)
> 
> Indeed, but this is a separate problem.  Doing per-CPU private
> (small, 8-32 page?) free lists is probably a good idea

Ingo already implemented that for Tux2.

Cheers,
 Stephen



Re: LANANA: To Pending Device Number Registrants

2001-05-21 Thread Stephen C. Tweedie

Hi,

On Sat, May 19, 2001 at 04:20:11PM -0400, Michael Meissner wrote:
> On Fri, May 18, 2001 at 03:17:50PM +0100, Stephen C. Tweedie wrote:

> Presumably, a new UUID is created each time format a partition, which means it
> is a slight bit of hassle if you have to reload a partition from a dump, or
> copy a partition to another disk drive.  In the scheme of things, it is not a
> large hassle perhaps, but it is a hassle.

Right.  Tune2fs can reset the UUID on an existing filesystem, but if
you want something immune from the possible collisions of LABEL
namespaces, you can't really avoid ending up with different IDs on
filesystems after a restore.

Cheers,
 Stephen



Re: LANANA: To Pending Device Number Registrants

2001-05-19 Thread Stephen C. Tweedie

Hi,

On Sat, May 19, 2001 at 05:29:32PM +1200, Chris Wedgwood wrote:
> 
> > Or you can fall back to mounting by UUID, which is globally
> > unique and still avoids referencing physical location.  You also
> > don't need to manually set LABELs for UUID to work: all e2fsprogs
> > over the past couple of years have set UUID on partitions, and
> > e2fsck will create a new UUID if it sees an old filesystem that
> > doesn't already have one.
> 
> Other filesystems such as reiserfs at present don't have such a
> thing. I brought this up a while ago and in theory it's not too hard, we
> just need to get Hans to officially designate part of the SB or
> whatever for the UUID.

There are other ways to deal with it: both md and (I think, in newer
releases) LVM can pick up their logical config from scanning physical
volumes for IDs, and so present a consistent logical device namespace
despite physical devices moving around. 

--Stephen



Re: Linux 2.4.4-ac10

2001-05-18 Thread Stephen C. Tweedie

Hi,

On Fri, May 18, 2001 at 07:44:39PM -0300, Rik van Riel wrote:

> This is the core of why we cannot (IMHO) have a discussion
> of whether a patch introducing new VM tunables can go in:
> there is no clear overview of exactly what would need to be
> tunable and how it would help.

It's worse than that.  The workload on most typical systems is not
static.  The VM *must* be able to cope with dynamic workloads.  You
might twiddle all the knobs on your system to make your database run
faster, but end up in such a situation that the next time a mail flood
arrives for sendmail, the whole box locks up because the VM can no
longer adapt.

That's the main problem with static parameters.  The problem you are
trying to solve is fundamentally dynamic in most cases (which is also
why magic numbers tend to suck in the VM.)

Cheers, 
 Stephen



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-18 Thread Stephen C. Tweedie

Hi,

On Fri, May 11, 2001 at 04:54:44PM +0200, Daniel Phillips wrote:

> The only reasonable way I can think of getting a block-coherent view 
> underneath a mounted fs is to have a reverse map, and update it each 
> time we map a block into the page cache or unmap it.

It's called the "buffer cache", and Ingo's early page-cache code in
2.3 actually did install page-cache backing buffers into the buffer
cache as aliases, mainly for debugging purposes.

Even without that, though, an application can achieve almost-coherency
via invalidation of the buffer cache before reading it.  And yes, this
won't necessarily remain coherent over the lifetime of the application
process, but then unless the filesystem is 100% quiescent then you
don't get that on 2.2 either.

Which is rather the point.  If the filesystem is active, then
coherency cannot be obtained at the block-device level in any case
without knowledge of the fs transaction activity.  If the filesystem
is quiescent, then you can sync it and flush the buffer cache and you
already get the coherency that you need.
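
(The invalidation in question is already available to applications via
the BLKFLSBUF ioctl on the block device --- a userspace sketch:)

    fd = open("/dev/sda1", O_RDONLY);
    sync();                    /* push dirty filesystem state to disk */
    ioctl(fd, BLKFLSBUF, 0);   /* drop cached buffers for the device */
    /* reads through fd now see what is on disk, not stale buffers */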

Cheers,
 Stephen



Re: LANANA: To Pending Device Number Registrants

2001-05-18 Thread Stephen C. Tweedie

Hi,

On Wed, May 16, 2001 at 12:18:15PM -0400, Michael Meissner wrote:

> With the current LABEL= support, you won't be able to mount the disks with
> duplicate labels, but you can still mount them via /dev/sdxxx.

Or you can fall back to mounting by UUID, which is globally unique and
still avoids referencing physical location.  You also don't need to
manually set LABELs for UUID to work: all e2fsprogs over the past
couple of years have set UUID on partitions, and e2fsck will create a
new UUID if it sees an old filesystem that doesn't already have one.

Cheers,
 Stephen



Re: [PATCH] allocation looping + kswapd CPU cycles

2001-05-10 Thread Stephen C. Tweedie

Hi,

On Thu, May 10, 2001 at 03:49:05PM -0300, Marcelo Tosatti wrote:

> Back to the main discussion --- I guess we could make __GFP_FAIL (with
> __GFP_WAIT set :)) allocations actually fail if "try_to_free_pages()" does
> not make any progress (ie returns zero). But maybe thats a bit too
> extreme.

That would seem to be a reasonable interpretation of __GFP_FAIL +
__GFP_WAIT, yes.

--Stephen



Re: [PATCH] allocation looping + kswapd CPU cycles

2001-05-10 Thread Stephen C. Tweedie

Hi,

On Thu, May 10, 2001 at 03:22:57PM -0300, Marcelo Tosatti wrote:

> Initially I thought about __GFP_FAIL to be used by writeout routines which
> want to cluster pages until they can allocate memory without causing any
> pressure to the system. Something like this: 
> 
> while ((page = alloc_page(GFP_FAIL)))
>   add_page_to_cluster(page);
> write_cluster(); 

Isn't that an orthogonal decision?  You can use __GFP_FAIL with or
without __GFP_WAIT or __GFP_IO, whichever is appropriate.
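
(That is, the caller composes the flags to say how much work the
allocator may do before giving up --- a sketch of the proposed usage;
__GFP_FAIL was an experimental flag at this point, not a settled
interface:)

    /* May sleep and start IO, but return NULL rather than loop
     * forever once reclaim stops making progress. */
    page = alloc_page(__GFP_WAIT | __GFP_IO | __GFP_FAIL);

    /* May not sleep at all: take a page only if one is free now. */
    page = alloc_page(__GFP_FAIL);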

Cheers,
 Stephen



Re: [PATCH] allocation looping + kswapd CPU cycles

2001-05-10 Thread Stephen C. Tweedie

Hi,

On Thu, May 10, 2001 at 01:43:46PM -0300, Marcelo Tosatti wrote:

> No. __GFP_FAIL can try to reclaim pages from inactive clean.
> 
> We just want to avoid __GFP_FAIL allocations from going to
> try_to_free_pages().

Why?  __GFP_FAIL is only useful as an indication that the caller has
some magic mechanism for coping with failure.  There's no other
information passed, so a brief call to try_to_free_pages is quite
appropriate.

--Stephen



Re: Swap space deallocation speed. (fwd)

2001-05-04 Thread Stephen C. Tweedie

Hi,

On Thu, May 03, 2001 at 12:03:39AM -0400, Dave Mielke wrote:

> unresponsive. The relevant line in the log, as you can find in the attached
> "crash.log" file, appears to be:
> 
> Unable to handle kernel paging request at virtual address 00020024

> Apr 16 11:23:06 dave kernel: esi: 0002   edi: c14ff5d0   ebp: c3e6a6d0   esp: c142ff30

This looks like a random bit flip in a page table.  That's almost
always a hardware problem.  Stop overclocking if you are doing that;
check that the CPU fan is still working, etc.

Cheers, 
 Stephen



Re: 2.4 and 2GB swap partition limit

2001-05-02 Thread Stephen C. Tweedie

Hi,

On Wed, May 02, 2001 at 01:49:16PM +0100, Hugh Dickins wrote:
> On Wed, 2 May 2001, Stephen C. Tweedie wrote:
> > 
> > So the aim is more complex.  Basically, once we are short on VM, we
> > want to eliminate redundant copies of swap data.  That implies two
> > possible actions, not one --- we can either remove the swap page for
> > data which is already in memory, or we can remove the in-memory copy
> > of data which is already on swap.  Which one is appropriate will
> > depend on whether the ptes in the system point to the swap entry or
> > the memory entry.  If we have ptes pointing to both, then we cannot
> > free either.
> 
> Sorry for stating the obvious, but that last sentence gives up too easily.
> If we have ptes pointing to both, then we cannot free either until we have
> replaced all the references to one by references to the other.

Sure, but it's far from obvious that we need to worry about this.  2.2
has exactly this same behaviour for shared pages, and so if people are
complaining about a 2.4 regression, this particular aspect of the
behaviour is clearly not the underlying problem.

--Stephen



Re: 2.4 and 2GB swap partition limit

2001-05-02 Thread Stephen C. Tweedie

Hi,

On Wed, May 02, 2001 at 12:54:15PM +0200, Rogier Wolff wrote:
> 
> first: Thanks for clearing this up for me. 
> 
> So, there are in fact some more "states" a swap-page can be in:
> 
>   -(0) free
>   -(1) allocated, not in mem. 
>   -(2) on swap, valid copy of memory. 
>   -(3) on swap: invalid copy, allocated to avoid fragmentation, can 
>   be freed on demand if we are close to running out of swap.
> 
> If we running low on (0) swap-pages we can first start to reap the (3)
> pages, and if that runs out, we can start reaping the (2)
> pages. Right?

Yes.  However, there is other state to worry about too.  Anonymous
pages are referenced from process page tables.  As long as the page
tables are referring to the copy in memory, you can free up the copy
on disk.  However, if any ptes point to the copy on disk, you cannot
(and remember, process forks can result in multiple process mm's
pointing to the same anonymous page, and some of those mm's may point
to swap while others point to the in-core page).

So the aim is more complex.  Basically, once we are short on VM, we
want to eliminate redundant copies of swap data.  That implies two
possible actions, not one --- we can either remove the swap page for
data which is already in memory, or we can remove the in-memory copy
of data which is already on swap.  Which one is appropriate will
depend on whether the ptes in the system point to the swap entry or
the memory entry.  If we have ptes pointing to both, then we cannot
free either.

Cheers,
 Stephen



Re: 2.4 and 2GB swap partition limit

2001-05-02 Thread Stephen C. Tweedie

Hi,

On Tue, May 01, 2001 at 06:14:54PM +0200, Rogier Wolff wrote:

> Shouldn't the algorithm be: 
> 
> - If (current_access == write )
>   free (swap_page);
>   else
>   map (page, READONLY)
> 
> and 
>   when a write access happens, we fault again, and free the 
>   swap-page as it is now dirty anyway. 

That's what 2.2 did.  2.4 doesn't have to. 

The trouble is, you really want contiguous virtual memory to remain
contiguous on swap.  Freeing individual pages like this on fault can
cause a great deal of fragmentation in swap.  We'd far rather keep the
swap page reserved for future use by the same page so that the VM
region remains contiguous on disk.

That's fine as far as it goes, but the problem happens if you _never_
free up such pages.  We should reap the unused swap page if we run out
of swap.  We don't, and _that_ is the problem --- not the fact that
the page is left allocated in the first place, but the fact that we
don't do anything about it once we are short on disk.
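
(In outline, the missing piece is a low-water-mark reaper --- pure
pseudocode with invented structure, although delete_from_swap_cache()
is the real 2.4 primitive for releasing a slot:)

    /* Run only when get_swap_page() fails or free swap falls below
     * a low-water mark: release slots whose in-memory copy is the
     * authoritative one. */
    for each page in the swap cache:
            if page is dirty in memory and
               no pte still refers to its swap slot:
                    delete_from_swap_cache(page);   /* frees the slot */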

--Stephen



Re: [Patch] deadlock on write in tmpfs

2001-05-02 Thread Stephen C. Tweedie

Hi,

On Tue, May 01, 2001 at 03:39:47PM +0200, Christoph Rohland wrote:
> 
> tmpfs deadlocks when writing into a file from a mapping of the same
> file. 
> 
> So I see two choices: 
> 
> 1) Do not serialise the whole of shmem_getpage_locked but protect
>    critical paths with the spinlock and do retries after sleeps
> 2) Add another semaphore to serialize shmem_getpage_locked and
>shmem_truncate
> 
> I tried some time to get 1) done but the retry logic became way too
> complicated. So the attached patch implements 2)
> 
> I still think it's ugly to add another semaphore, but it works.

If the locking is for a completely different reason, then a different
semaphore is quite appropriate.  In this case you're trying to lock
the shm internal info structures, which is quite different from the
sort of inode locking which the VFS tries to do itself, so the new
semaphore appears quite clean --- and definitely needed.
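
(For reference, the trigger is as simple as using a mapping of the
file as the source buffer of a write to the same file --- a userspace
sketch, assuming a tmpfs mount at /dev/shm and a 4k page size, with
declarations and error handling omitted:)

    fd = open("/dev/shm/f", O_RDWR | O_CREAT, 0600);
    ftruncate(fd, 8192);
    p = mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    /* write() enters shmem_getpage for the file, then faults on the
     * source buffer p, which re-enters it for the same inode. */
    write(fd, p + 4096, 4096);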

--Stephen



Re: 2.4 and 2GB swap partition limit

2001-05-01 Thread Stephen C. Tweedie

Hi,

On Mon, Apr 30, 2001 at 07:12:12PM +0100, Alan Cox wrote:
> > paging in just released 2.4.4, but in previous kernel, a page that was
> > paged-out, reserves its place in swap even if it is paged-in again, so
> > once you have paged-out all your ram at least once, you can't get any
> > more memory, even if swap is 'empty'.
> 
> This is a bug in the 2.4 VM, nothing more or less. It and the horrible bounce
> buffer bugs are forcing large machines to remain on 2.2. So it has to get 
> fixed

Umm, 2.2 can behave in the same way.  The only difference in the 2.4
behaviour is that 2.4 can maintain the swap cache effect for dirty
pages as well as clean ones.  An application which creates a large
in-core data set and then does not modify it will show exactly the
same behaviour on 2.2.

To call it a "bug" is to imply that "fixing it" is the right thing to
do.  It might be in some cases, but discarding the swap entry has a
cost --- you fragment swap, and if the page in memory is clean, you
end up increasing the amount of swap IO.  

The right fix is to reclaim such pages only when we need to.  To
disable swap caching when we still have enough swap free would hurt
users who have the spare swap to cope with it.

--Stephen



Re: generic_osync_inode/ext2_fsync_inode still not safe

2001-04-20 Thread Stephen C. Tweedie

Hi,

On Wed, Apr 18, 2001 at 06:45:40AM -0300, Marcelo Tosatti wrote:

> As far as I can see, you cannot guarantee that an inode which is unlocked
> _and_ clean (accordingly to the inode->i_state) is safely on disk.
> 
> The reason for that are calls to sync_one() which write the inode
> asynchronously: 
> 
> sync_one(struct inode *inode, int sync) {
> ...
>         /* Don't write the inode if only I_DIRTY_PAGES was set */
>         if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))
>                 write_inode(inode, sync);   <-
> ...
>         inode->i_state &= ~I_LOCK;

Right --- nasty.  But not _too_ nasty, there's a moderately easy way
of dealing with this.

Basically we can't trust the i_state flag for this purpose if we are
allowing async IO to happen on inodes without having proper IO
completion callbacks marking the inodes as unlocked once they are firm
on disk.  However, in this case the filesystem itself will know which
underlying buffer_head contains the inode data, and can check to see
if that buffer is locked and perform a wait if necessary.
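
(A sketch of what that wait might look like for ext2 ---
ext2_inode_block() here is a stand-in for the filesystem's existing
inode-to-block mapping logic, but get_hash_table() and
wait_on_buffer() are the real 2.4 buffer-cache primitives:)

    static void ext2_wait_inode(struct inode *inode)
    {
            struct buffer_head *bh;

            bh = get_hash_table(inode->i_dev,
                                ext2_inode_block(inode),
                                inode->i_sb->s_blocksize);
            if (bh) {
                    wait_on_buffer(bh);   /* wait for in-flight IO */
                    brelse(bh);
            }
    }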

This is somewhat unpleasant in that it may sometimes cause unnecessary
false sharing, given that we have multiple inodes in an inode block.
However, I can't see any simple way around that.

Linus, do you have any preference for how to deal with this?  We can
either do it by adding an s_op->wait_inode() function to complement
write_inode(), and have a wait_inode() implementation in block-device
filesystems which does the buffer lookup and wait; or we can push that
whole logic into the filesystems, so that the I_DIRTY check is removed
from the VFS mid-layer altogether and the filesystem is responsible
for testing both the inode and buffer locked state when we try to wait
for outstanding inode IO to complete.

The second method is a bit cleaner conceptually but it results in more
code duplication.

Cheers,
 Stephen



Re: RFC: pageable kernel-segments

2001-04-20 Thread Stephen C. Tweedie

Hi,

On Fri, Apr 20, 2001 at 03:49:30PM +0100, Alan Cox wrote:

> There is a proposal (several it seems) to make 2.5 replace the conventional
> unix swap with a filesystem of backing store for anonymous objects. That will
> mean each object has its own vm area and inode and thus we can start blowing
> away all user mode page tables when we want.

Not without major VM overhaul.

The problem is MAP_PRIVATE, where a single vma can contain both normal
file-backed pages and anonymous pages at the same time.  You don't
even know whose anonymous page it is --- a process with anon pages can
fork, so that later on some of the child's anon pages actually come
from the parent's anon space instead of the child's.

Right now all of the magic that makes this work is in the page tables.
To remove page tables we'd need additional structures all through the
VM to track anonymous pages, and that's exactly where the FreeBSD VM
starts to get extremely messy compared to ours.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: RFC: pageable kernel-segments

2001-04-20 Thread Stephen C. Tweedie

Hi,

On Tue, Apr 17, 2001 at 12:21:17PM -0700, H. Peter Anvin wrote:

> > Certain parts of drivers could get the __pageable prefix or so
> > (like the __init parts of drivers which get removed) for letting
> > the paging-code know that it can be discarded if memory-pressure
> > demands it.
> 
> VMS does this.  It at least used to have a great tendency to crash
> itself, because it swapped out something that was called from a driver
> that was called by the swapper -- resulting in deadlock.  You need
> iron discipline for this to work right in all circumstances.

Actually, VMS doesn't do this, precisely because it is so hard to get
right.  VMS has both paged and non-paged pools for dynamically
allocated kernel memory, but the kernel code itself is non-pageable.  

The big problem with such pageable memory isn't really device driver
deadlocks --- the easy rule which makes that work is simply never to
use paged pool from a driver which might be involved in swapping. :)
Even more tricky is the handling of kernel locking --- you cannot
access any paged memory with a spinlock held unless you have pinned
the pages in core beforehand.
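
The required ordering looks like this (made-up helper names, purely
to illustrate the rule):

	pin_pages_in_core(buf, len);	/* may sleep, may fault */
	spin_lock(&the_lock);
	use_pinned_data(buf, len);	/* guaranteed not to fault now */
	spin_unlock(&the_lock);
	unpin_pages(buf, len);

Touching buf inside the spinlock without the pin step risks taking a
page fault with the lock held, which is a deadlock waiting to happen.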

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Asynchronous IO

2001-04-20 Thread Stephen C. Tweedie

Hi,

On Fri, Apr 13, 2001 at 04:45:07AM -0400, Dan Maas wrote:
> IIRC the problem with implementing asynchronous *disk* I/O in Linux today is
> that the filesystem code assumes synchronous I/O operations that block the
> whole process/thread. So implementing "real" asynch I/O (without the
> overhead of creating a process context for each operation) would require
> re-writing the filesystems as non-blocking state machines. Last I heard this
> was a long-term goal, but nobody's done the work yet

SGI and Ben LaHaise both have kernel async IO functionality working,
and Ingo Molnar's Tux code has support for doing certain filesystem
lookup operations asynchronously too.  

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [NEED TESTERS] remove swapin_readahead Re: shmem_getpage_locked() / swapin_readahead() race in 2.4.4-pre3

2001-04-17 Thread Stephen C. Tweedie

Hi,

On Sat, Apr 14, 2001 at 08:31:07PM -0300, Marcelo Tosatti wrote:
> On Sat, 14 Apr 2001, Rik van Riel wrote:
> > On Sat, 14 Apr 2001, Marcelo Tosatti wrote:
> > 
> > > There is a nasty race between shmem_getpage_locked() and
> > > swapin_readahead() with the new shmem code (introduced in
> > > 2.4.3-ac3 and merged in the main tree in 2.4.4-pre3):

> Test (multiple shm-stress) runs fine without swapin_readahead(), as
> expected.

> Stephen/Linus? 

I don't see the problem.  shmem_getpage_locked appears to back off
correctly if it encounters an existing swap-cached page which
swapin_readahead has installed first, at least with the code
in 2.4.3-ac5.

There *does* appear to be a race, but it's swapin_readahead racing
with shmem_writepage.  That code does not search for an existing entry
in the swap cache when it decides to move a shmem page to swap, so we
can install the page twice and end up doing a lookup on the wrong
physical page if there is swap readahead going on.

To fix that, shmem_writepage needs to do a swap cache lookup and lock
before installing the new page --- it should probably just copy the
new page into the old one if it finds one already there.
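
Roughly like this (a sketch only, locking and error handling elided;
the helper names exist in the 2.4 tree but their use here is
unverified):

	old = lookup_swap_cache(entry);
	if (old) {
		/* Readahead already instantiated this swap entry:
		 * fold our data into the page that is already in
		 * the swap cache instead of installing a second
		 * page for the same entry. */
		copy_highpage(old, page);
		page_cache_release(old);
	} else {
		add_to_swap_cache(page, entry);
	}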

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: generic_osync_inode/ext2_fsync_inode still not safe

2001-04-17 Thread Stephen C. Tweedie

Hi,

On Sat, Apr 14, 2001 at 07:24:42AM -0300, Marcelo Tosatti wrote:
> 
> As described earlier, code which wants to write an inode cannot rely on
> the I_DIRTY bits (on inode->i_state) being clean to guarantee that the
> inode and its dirty pages, if any, are safely synced on disk.

Indeed --- for all such structures, including pages, buffer_heads and
inodes, you can only assume that the object is safely on disk if you
have checked both the dirty bit AND the locked bit.  If you find it
locked but clean, then a writeout may be in progress, so you need to
do a wait_on_* to be really sure that the write has completed.

> The reason for that is sync_one() --- it cleans the I_DIRTY bits of an
> inode, sets the I_LOCK and starts a writeout. 

As long as it is setting the I_LOCK bit, then that's fine.

> The easy and safe fix is to simply remove the I_DIRTY_* checks from
> generic_osync_inode and ext2_fsync_inode. Easy but slow. Another fix would
> be to make sync_one() unconditionally synchronous... slow.

Just make the *sync functions look for the locked bit and do a wait on
the inode if it is locked.
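
Something like this (a sketch; wait_on_inode() is an assumed helper
that sleeps until I_LOCK clears):

static void inode_wait_for_writeout(struct inode *inode)
{
	if (inode->i_state & I_DIRTY)
		return;		/* caller is about to write it anyway */
	if (inode->i_state & I_LOCK)
		wait_on_inode(inode);	/* writeout in flight: wait */
}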

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] Fix races in 2.4.2-ac22 SysV shared memory

2001-03-25 Thread Stephen C. Tweedie

Hi,

On Sat, Mar 24, 2001 at 10:05:18PM -0300, Rik van Riel wrote:
> On Sun, 25 Mar 2001, Stephen C. Tweedie wrote:
> 
> > Rik, do you think it is really necessary to take the page lock and
> > release it inside lookup_swap_cache?  I may be overlooking something,
> > but I can't see the benefit of it ---
> 
> I don't think we need to do this, except to protect us from
> using a page which isn't up-to-date yet and locked because
> of disk IO.

But it doesn't --- page_launder can try to lock the page after it
checks the refcount, without taking any locks which protect us against
running lookup_swap_cache in parallel.  If we get our reference after
page_launder checks the count, we can find the page getting locked out
from underneath our feet.

> Reclaim_page() takes the pagecache_lock before trying to
> free anything, so there's no reason to lock against that.

Exactly.  We're not in danger of _losing_ the page, because
reclaim_page is locked more aggressively than page_launder.  We still
risk having the page locked against us after lookup_swap_cache does
its own UnlockPage.

So, if lookup_swap_cache doesn't actually ensure that the page is
unlocked, are there any callers which implicitly rely on
lookup_swap_cache() doing a wait_on_page?
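
Any such caller would then need the wait made explicit, along these
lines (sketch only):

static struct page *lookup_swap_cache_stable(swp_entry_t entry)
{
	struct page *page = lookup_swap_cache(entry);

	/* Wait for any in-flight IO ourselves instead of relying
	 * on an implicit wait inside lookup_swap_cache(). */
	if (page)
		wait_on_page(page);
	return page;
}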

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] 2.4.2-ac24 buffer.c oops on highmem

2001-03-24 Thread Stephen C. Tweedie

Hi,

We've just seen a buffer.c oops in:

>>EIP; c013ae4b <__block_prepare_write+2bb/300>   <=
Trace; c013b732 <block_prepare_write+22/70>
Trace; c015dbba <ext2_get_block+a/4e0>
Trace; c012a67e <generic_file_write+3ee/710>
Trace; c015dbba <ext2_get_block+a/4e0>
Trace; c01281c0 <file_read_actor+0/f0>
Trace; c01384a6 <sys_write+96/d0>
Trace; c010910b <system_call+33/38>

__block_prepare_write()'s "out:" error handler tries to do a

memset(bh->b_data, 0, bh->b_size);

even if the buffer's page has already been kmapped for highmem.
Highmem pages will obviously have b_data being NULL.  Patch below.

I had a quick look through the rest of buffer.c and apart from the
initialisation of bh->b_data in set_bh_page(), there are no other
references left to b_data once we fix this.

Cheers,
 Stephen



--- fs/buffer.c.~1~ Sat Mar 24 17:30:13 2001
+++ fs/buffer.c Sat Mar 24 18:16:55 2001
@@ -1629,12 +1629,14 @@
return 0;
 out:
bh = head;
+   block_start = 0;
do {
if (buffer_new(bh) && !buffer_uptodate(bh)) {
-   memset(bh->b_data, 0, bh->b_size);
+   memset(kaddr+block_start, 0, bh->b_size);
set_bit(BH_Uptodate, &bh->b_state);
mark_buffer_dirty(bh);
}
+   block_start += bh->b_size;
bh = bh->b_this_page;
} while (bh != head);
return err;



Re: [PATCH] Fix races in 2.4.2-ac22 SysV shared memory

2001-03-24 Thread Stephen C. Tweedie

Hi,

On Fri, Mar 23, 2001 at 11:58:50AM -0800, Linus Torvalds wrote:

> Ehh.. Sleeping with the spin-lock held? Sounds like a truly bad idea.

Uggh --- the shmem code already does, see:

shmem_truncate->shmem_truncate_part->shmem_free_swp->
lookup_swap_cache->find_lock_page

It looks messy: lookup_swap_cache seems to be abusing the page lock
gratuitously, but there are probably callers of it which rely on the
assumption that it performs an implicit wait_on_page().

Rik, do you think it is really necessary to take the page lock and
release it inside lookup_swap_cache?  I may be overlooking something,
but I can't see the benefit of it --- we can still race against
page_launder, so the page may still get locked behind our backs after
we get the reference from lookup_swap_cache (page_launder explicitly
avoids taking the pagecache hash spinlock which might avoid this
particular race).

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [linux-lvm] EXT2-fs panic (device lvm(58,0)):

2001-03-22 Thread Stephen C. Tweedie

Hi,

On Wed, Mar 07, 2001 at 01:35:05PM -0700, Andreas Dilger wrote:

> The only remote possibility is in ext2_free_blocks() if block+count
> overflows a 32-bit unsigned value.  Only 2 places call ext2_free_blocks()
> with a count != 1, and ext2_free_data() looks to be OK.  The other
> possibility is that i_prealloc_count is bogus - that is it!  Nowhere
> is i_prealloc_count initialized to zero AFAICS.
> 
Did you ever push this to Alan and/or Linus?  This looks pretty
important!

Cheers,
 Stephen

> ==
> diff -ru linux/fs/ext2/ialloc.c.orig linux/fs/ext2/ialloc.c
> --- linux/fs/ext2/ialloc.c.orig   Fri Dec  8 18:35:54 2000
> +++ linux/fs/ext2/ialloc.cWed Mar  7 12:22:11 2001
> @@ -432,6 +444,8 @@
>   inode->u.ext2_i.i_file_acl = 0;
>   inode->u.ext2_i.i_dir_acl = 0;
>   inode->u.ext2_i.i_dtime = 0;
> + inode->u.ext2_i.i_prealloc_count = 0;
>   inode->u.ext2_i.i_block_group = i;
>   if (inode->u.ext2_i.i_flags & EXT2_SYNC_FL)
>   inode->i_flags |= S_SYNC;
> diff -ru linux/fs/ext2/inode.c.orig linux/fs/ext2/inode.c
> --- linux/fs/ext2/inode.c.origTue Jan 16 01:29:29 2001
> +++ linux/fs/ext2/inode.c Wed Mar  7 12:05:47 2001
> @@ -1048,6 +1038,8 @@
>   (((__u64)le32_to_cpu(raw_inode->i_size_high)) << 32);
>   }
>   inode->i_generation = le32_to_cpu(raw_inode->i_generation);
> + inode->u.ext2_i.i_prealloc_count = 0;
>   inode->u.ext2_i.i_block_group = block_group;
>  
>   /*
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] Fix races in 2.4.2-ac22 SysV shared memory

2001-03-22 Thread Stephen C. Tweedie

Hi,

The patch below is for two races in sysV shared memory.

The first (minor) one is in shmem_free_swp:

swap_free (entry);
*ptr = (swp_entry_t){0};
freed++;
if (!(page = lookup_swap_cache(entry)))
continue;
delete_from_swap_cache(page);
page_cache_release(page);

has a window between the first swap_free() and the
lookup_swap_cache().  If the swap_free() frees the last ref to the
swap entry and another cpu allocates and caches the same entry before
the lookup, we'll end up destroying another task's swap cache.

The second is nastier.  shmem_nopage() uses the inode semaphore to
serialise access to shmem_getpage_locked() for paging in shared memory
segments.  Lookups in the page cache and in the shmem swap vector are
done to locate the entry.  _getpage_ can move entries from swap to
page cache under protection of the shmem's info->lock spinlock.

shmem_writepage() is locked via the page lock, and moves shmem pages
from the page cache to the swap cache under protection of the same
info->lock spinlock.

However, shmem_nopage() does not hold this spinlock while doing its
lookups in the page cache and swap vectors, so it can race with
writepage, with one cpu in the middle of moving the page out of the
page cache in writepage and the other cpu then failing to find the
entry either in the page cache or in the shm swap entry vector.

Feedback welcome.

Cheers, 
 Stephen


--- mm/shmem.c.~1~  Fri Mar 23 00:26:49 2001
+++ mm/shmem.c  Fri Mar 23 00:42:21 2001
@@ -121,13 +121,13 @@
if (!ptr->val)
continue;
entry = *ptr;
-   swap_free (entry);
*ptr = (swp_entry_t){0};
freed++;
-   if (!(page = lookup_swap_cache(entry)))
-   continue;
-   delete_from_swap_cache(page);
-   page_cache_release(page);
+   if ((page = lookup_swap_cache(entry)) != NULL) {
+   delete_from_swap_cache(page);
+   page_cache_release(page);   
+   }
+   swap_free (entry);
}
return freed;
 }
@@ -218,15 +218,24 @@
 }
 
 /*
- * Move the page from the page cache to the swap cache
+ * Move the page from the page cache to the swap cache.
+ *
+ * The page lock prevents multiple occurrences of shmem_writepage at
+ * once.  We still need to guard against racing with
+ * shmem_getpage_locked().  
  */
 static int shmem_writepage(struct page * page)
 {
int error;
struct shmem_inode_info *info;
swp_entry_t *entry, swap;
+   struct inode *inode;
 
-   info = &page->mapping->host->u.shmem_i;
+   if (!PageLocked(page))
+   BUG();
+   
+   inode = page->mapping->host;
+   info = &inode->u.shmem_i;
swap = __get_swap_page(2);
if (!swap.val) {
set_page_dirty(page);
@@ -234,11 +243,11 @@
return -ENOMEM;
}
 
-   spin_lock(&info->lock);
-   shmem_recalc_inode(page->mapping->host);
entry = shmem_swp_entry(info, page->index);
if (IS_ERR(entry))  /* this had been allocated on page allocation */
BUG();
+   spin_lock(&info->lock);
+   shmem_recalc_inode(page->mapping->host);
error = -EAGAIN;
if (entry->val) {
__swap_free(swap, 2);
@@ -268,6 +277,10 @@
  * If we allocate a new one we do not mark it dirty. That's up to the
  * vm. If we swap it in we mark it dirty since we also free the swap
  * entry since a page cannot live in both the swap and page cache
+ *
+ * Called with the inode locked, so it cannot race with itself, but we
+ * still need to guard against racing with shm_writepage(), which might
+ * be trying to move the page to the swap cache as we run.
  */
 static struct page * shmem_getpage_locked(struct inode * inode, unsigned long idx)
 {
@@ -276,31 +289,57 @@
struct page * page;
swp_entry_t *entry;
 
-   page = find_lock_page(mapping, idx);;
+   info = &inode->u.shmem_i;
+
+repeat:
+   page = find_lock_page(mapping, idx);
if (page)
return page;
 
-   info = &inode->u.shmem_i;
entry = shmem_swp_entry (info, idx);
if (IS_ERR(entry))
return (void *)entry;
+
+   spin_lock (&info->lock);
+   
+   /* The shmem_swp_entry() call may have blocked, and
+* shmem_writepage may have been moving a page between the page
+* cache and swap cache.  We need to recheck the page cache
+* under the protection of the info->lock spinlock. */
+
+   page = find_lock_page(mapping, idx);
+   if (page) {
+   spin_unlock (&info->lock);
+   return page;
+   }
+   
if (entry->val) {
unsigned long flags;
 
/* Look it up and read it in.. */
page = 

Re: 2.4.2 fs/inode.c

2001-03-22 Thread Stephen C. Tweedie

Hi,

On Thu, Mar 22, 2001 at 01:42:15PM -0500, Jan Harkes wrote:
> 
> I found some code that seems wrong and didn't even match its comment.
> Patch is against 2.4.2, but should go cleanly against 2.4.3-pre6 as well.
 
Patch looks fine to me.  Have you tested it?  If this goes wrong,
things break badly...

>   /* Don't do this for I_DIRTY_PAGES - that doesn't actually dirty the 
>inode itself */
> - if (flags & (I_DIRTY | I_DIRTY_SYNC)) {
> + if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Thinko in kswapd?

2001-03-22 Thread Stephen C. Tweedie

Hi,

On Thu, Mar 22, 2001 at 09:36:48AM -0800, Linus Torvalds wrote:
> On Thu, 22 Mar 2001, Stephen C. Tweedie wrote:
> >
> > There is what appears to be a simple thinko in kswapd.  We really
> > ought to keep kswapd running as long as there is either a free space
> > or an inactive page shortfall; but right now we only keep going if
> > _both_ are short.
> 
> Hmm.. The comment definitely says "or", so changing it to "and" in the
> sources makes the comment be non-sensical.

Indeed.  
 
> I suspect that the comment and the code were true at some point. The
> behaviour of "do_try_to_free_pages()" has changed, though, and I suspect
> your suggested change makes more sense now (it certainly seems to be
> logical to have the reverse condition for sleeping and for when to call
> "do_try_to_free_pages()").

> The only way to know is to test the behaviour. My only real worry is that
> kswapd might end up eating too much CPU time and make the system feel bad,
> but on the other hand the same can certainly be true from _not_ doing this

Yes, it's more the inconsistency between the tests than the tests
themselves that prompted me to try it, and the scale of the interactive
performance
improvement was quite a surprise.

On the other hand, Alan is now reporting that on one of his workloads
it does cause erratic behaviour for interactive loads.  So this is
definitely not a cure-all.

We already do have some problems with excessive swap time being
consumed under some loads: I can reproduce stalls of several seconds
on a PAE box with simple "dd > /dev/sd*".  That's something I need to
follow up further once we've found the source of some SMP data
corruption we're still seeing on big boxes (I'll be sending patches
for a shm race shortly that we found while chasing this.)

I suspect we'll need to instrument the activity of the various lrus in
the VM more accurately before we'll ever understand _why_ the VM works
well or badly in any given situation.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Thinko in kswapd?

2001-03-22 Thread Stephen C. Tweedie

Hi,

There is what appears to be a simple thinko in kswapd.  We really
ought to keep kswapd running as long as there is either a free space
or an inactive page shortfall; but right now we only keep going if
_both_ are short.

Diff below.  With this change, I've got a 64MB box running Applix and
Star Office with multiple open documents plus a few other big apps
running, and switching desktops or going between documents is once
more nice and snappy.  Running a normal heavily populated desktop in
256MB used to be painful, with much apparently unnecessary swapping,
if we had background page-cache intensive operations (eg find|wc)
going on: the patched kernel feels much better interactively,
presumably because kswapd is now doing the work it is supposed to do,
instead of forcing normal apps to go into page stealing mode
themselves.

--Stephen



--- mm/vmscan.c.~1~ Fri Mar 16 15:39:24 2001
+++ mm/vmscan.c Thu Mar 22 13:05:37 2001
@@ -1010,7 +1010,7 @@
 * We go to sleep for one second, but if it's needed
 * we'll be woken up earlier...
 */
-   if (!free_shortage() || !inactive_shortage()) {
+   if (!free_shortage() && !inactive_shortage()) {
interruptible_sleep_on_timeout(&kswapd_wait, HZ);
/*
 * If we couldn't free enough memory, we see if it was



Re: changing mm->mmap_sem (was: Re: system call for process information?)

2001-03-19 Thread Stephen C. Tweedie

Hi,

On Sun, Mar 18, 2001 at 10:34:38AM +0100, Manfred Spraul wrote:

> > The problem is that mmap_sem seems to be protecting the list
> > of VMAs, so taking _only_ the page_table_lock could let a VMA
> > change under us while a page fault is underway ...
> 
> No, that can't happen.

It can.  Page faults often need to block, so they have to be able to
drop the page_table_lock.  Holding the mmap_sem is all that keeps the
vma intact until the IO is complete.
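
Schematically (simplified from what the arch fault handlers do, with
details elided):

static int fault_in_address(struct mm_struct *mm, unsigned long address,
			    int write)
{
	struct vm_area_struct *vma;
	int ret = 0;

	down(&mm->mmap_sem);		/* vma list cannot change now */
	vma = find_vma(mm, address);
	if (vma && vma->vm_start <= address)
		/* may drop the page_table_lock internally and sleep
		 * for IO; the vma survives because mmap_sem is held */
		ret = handle_mm_fault(mm, vma, address, write);
	up(&mm->mmap_sem);
	return ret;
}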

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH]: Only one memory zone for sparc64

2001-03-16 Thread Stephen C. Tweedie

Hi,

On Thu, Mar 15, 2001 at 07:13:52PM +1100, Anton Blanchard wrote:
> 
> On sparc64 we dont care about the different memory zones and iterating
> through them all over the place only serves to waste CPU. I suspect this
> would be the case with some other architectures but for the moment I
> have just enabled it for sparc64.
> 
> With this patch I get close to a 1% improvement in dbench on the dual
> ultra60.

I'd be surprised if dbench was anything other than disk-bound on most
systems.  On any of my machines, the standard error of a single dbench
run is *way* larger than 1%, and I'd expect to have to run the
benchmark a dozen times to get a confidence interval small enough to
detect a 1% performance change: are your runs repeatable enough to be
this sensitive to the effect of the allocator?
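
For scale, the standard error of the mean over n runs is
sqrt(variance/n); a quick way to compute it (plain C, nothing
kernel-specific):

#include <math.h>

double std_error(const double *x, int n)
{
	double mean = 0.0, var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		mean += x[i];
	mean /= n;
	for (i = 0; i < n; i++)
		var += (x[i] - mean) * (x[i] - mean);
	var /= (n - 1);			/* sample variance */
	return sqrt(var / n);		/* std error of the mean */
}

You would want this to come out well under 0.5% of the mean before a
1% delta means anything.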

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: changing mm->mmap_sem (was: Re: system call for process information?)

2001-03-16 Thread Stephen C. Tweedie

Hi,

On Fri, Mar 16, 2001 at 08:50:25AM -0300, Rik van Riel wrote:
> On Fri, 16 Mar 2001, Stephen C. Tweedie wrote:
> 
> > > Write locks would be used in the code where we actually want
> > > to change the VMA list and page faults would use an extra lock
> > > to protect against each other (possibly a per-pagetable lock
> > 
> > Why do we need another lock?  The critical section where we do the
> > final update on the pte _already_ takes the page table spinlock to
> > avoid races against the swapper.
> 
> The problem is that mmap_sem seems to be protecting the list
> of VMAs, so taking _only_ the page_table_lock could let a VMA
> change under us while a page fault is underway ...

Right, I'm not suggesting removing that: making the mmap_sem
read/write is fine, but yes, we still need that semaphore.  But as for
the "page faults would use an extra lock to protect against each
other" bit --- we already have another lock, the page table lock,
which can be used in this way, so ANOTHER lock should be unnecessary.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: O_DSYNC flag for open

2001-03-16 Thread Stephen C. Tweedie

Hi,

On Wed, Mar 14, 2001 at 10:26:42PM -0500, Tom Vier wrote:
> fdatasync() is the same as fsync(), in linux.

No, in 2.4 fdatasync does the right thing and skips the inode flush if
only the timestamps have changed.

> until fdatasync() is
> implimented (ie, syncs the data only)

fdatasync is required to sync more than just the data: it has to sync
the inode too if any fields other than the timestamps have changed.
So, for appending to files or writing new files from scratch, fsync ==
fdatasync (because each write also changes the inode size).  Only for
updating existing files in place does fdatasync behave differently.
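
A userspace illustration of the case where it does differ: an
in-place update of an existing file, where the inode size never
changes (the function name is just for the example):

#include <fcntl.h>
#include <unistd.h>

int update_record(int fd, const char *buf, size_t len, off_t pos)
{
	if (pwrite(fd, buf, len, pos) != (ssize_t) len)
		return -1;
	/* flushes the data blocks, but not a pure timestamp change */
	return fdatasync(fd);
}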

> #ifndef O_DSYNC
> # define O_DSYNC O_SYNC
> #endif

2.4's O_SYNC actually does a fdatasync internally.  This is also the
default behaviour of HPUX, which requires you to set a sysctl variable
if you want O_SYNC to flush timestamp changes to disk.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: changing mm->mmap_sem (was: Re: system call for process information?)

2001-03-16 Thread Stephen C. Tweedie

Hi,

On Thu, Mar 15, 2001 at 09:24:59AM -0300, Rik van Riel wrote:
> On Wed, 14 Mar 2001, Rik van Riel wrote:

> The mmap_sem is used in procfs to prevent the list of VMAs
> from changing. In the page fault code it seems to be used
> to prevent other page faults to happen at the same time with
> the current page fault (and to prevent VMAs from changing
> while a page fault is underway).

The page table spinlock should be quite sufficient to let us avoid
races in the page fault code.  We've had to deal with this before
there was ever a mmap_sem anyway: in ancient times, every page fault
had to do things like check to see if the pte had changed after IO was
complete and once the BKL had been retaken.  We can do the same with
the page fault spinlock without much pain.
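
The discipline is roughly this (a sketch with the surrounding
context stripped out, not lifted from the tree):

	spin_lock(&mm->page_table_lock);
	if (!pte_same(*page_table, orig_pte)) {
		/* someone else serviced this fault while we slept
		 * for IO: back out and let the caller retry */
		spin_unlock(&mm->page_table_lock);
		page_cache_release(page);
		return 1;
	}
	set_pte(page_table, mk_pte(page, vma->vm_page_prot));
	spin_unlock(&mm->page_table_lock);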

> Maybe we should change the mmap_sem into a R/W semaphore ?

Definitely.

> Write locks would be used in the code where we actually want
> to change the VMA list and page faults would use an extra lock
> to protect against each other (possibly a per-pagetable lock

Why do we need another lock?  The critical section where we do the
final update on the pte _already_ takes the page table spinlock to
avoid races against the swapper.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: magic device renumbering was -- Re: Linux 2.4.2ac20

2001-03-16 Thread Stephen C. Tweedie

Hi,

On Wed, Mar 14, 2001 at 02:11:57PM -0500, Lars Kellogg-Stedman wrote:
> > Put LABEL=<label set with e2label> in your fstab in place of the device name.
> 
> Which is great, for filesystems that support labels.  Unfortunately,
> this isn't universally available -- for instance, you cannot mount
> a swap partition by label or uuid, so it is not possible to completely
> isolate yourself from the problems of disk device renumbering.

It's not convenient, but it is certainly possible: use a
single-partition raid0 logical device with raid autostart, and you get
a logical /dev/md* device which corresponds to a single partition and
which has a fixed name which is detected by the kernel at runtime and
mapped to the correct disk, wherever the disk may be.

The IBM EVMS folks are looking to generalise this sort of probing, but
for now there is at least one solution to this problem.  LVM works too
to some extent, but it currently lacks the automatic boot-time/
device-detect-time kernel probing step that the software raid code
has.
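
For example, with the old raidtools a one-disk raidtab along these
lines (device names purely illustrative) gives swap a stable
/dev/md0 name:

	raiddev /dev/md0
		raid-level		0
		nr-raid-disks		1
		persistent-superblock	1
		chunk-size		32
		device			/dev/sda5
		raid-disk		0

After mkraid, /dev/md0 can go into fstab as the swap device, and the
kernel finds the underlying partition by its raid superblock at
autostart time, wherever the disk has moved.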

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: magic device renumbering was -- Re: Linux 2.4.2ac20

2001-03-16 Thread Stephen C. Tweedie

Hi,

On Wed, Mar 14, 2001 at 02:11:57PM -0500, Lars Kellogg-Stedman wrote:
  Put LABEL=label set with e2label in you fstab in place of the device name.
 
 Which is great, for filesystems that support labels.  Unfortunately,
 this isn't universally available -- for instance, you cannot mount
 a swap partition by label or uuid, so it is not possible to completely
 isolate yourself from the problems of disk device renumbering.

It's not convenient, but it is certainly possible: use a
single-partition raid0 logical device with raid autostart, and you get
a logical /dev/md* device which corresponds to a single partition and
which has a fixed name which is detected by the kernel at runtime and
mapped to the correct disk, wherever the disk may be.

The IBM EVMS folks are looking to generalise this sort of probing, but
for now there is at least one solution to this problem.  LVM works too
to some extent, but it currently lacks the automatic boot-time/
device-detect-time kernel probing step that the software raid code
has.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: changing mm-mmap_sem (was: Re: system call for process information?)

2001-03-16 Thread Stephen C. Tweedie

Hi,

On Thu, Mar 15, 2001 at 09:24:59AM -0300, Rik van Riel wrote:
 On Wed, 14 Mar 2001, Rik van Riel wrote:

 The mmap_sem is used in procfs to prevent the list of VMAs
 from changing. In the page fault code it seems to be used
 to prevent other page faults to happen at the same time with
 the current page fault (and to prevent VMAs from changing
 while a page fault is underway).

The page table spinlock should be quite sufficient to let us avoid
races in the page fault code.  We've had to deal with this before
there was ever a mmap_sem anyway: in ancient times, every page fault
had to do things like check to see if the pte had changed after IO was
complete and once the BKL had been retaken.  We can do the same with
the page fault spinlock without much pain.

 Maybe we should change the mmap_sem into a R/W semaphore ?

Definitely.

 Write locks would be used in the code where we actually want
 to change the VMA list and page faults would use an extra lock
 to protect against each other (possibly a per-pagetable lock

Why do we need another lock?  The critical section where we do the
final update on the pte _already_ takes the page table spinlock to
avoid races against the swapper.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: O_DSYNC flag for open

2001-03-16 Thread Stephen C. Tweedie

Hi,

On Wed, Mar 14, 2001 at 10:26:42PM -0500, Tom Vier wrote:
 fdatasync() is the same as fsync(), in linux.

No, in 2.4 fdatasync does the right thing and skips the inode flush if
only the timestamps have changed.

 until fdatasync() is
 implimented (ie, syncs the data only)

fdatasync is required to sync more than just the data: it has to sync
the inode too if any fields other than the timestamps have changed.
So, for appending to files or writing new files from scratch, fsync ==
fdatasync (because each write also changes the inode size).  Only for
updating existing files in place does fdatasync behave differently.

 #ifndef O_DSYNC
 # define O_DSYNC O_SYNC
 #endif

2.4's O_SYNC actually does a fdatasync internally.  This is also the
default behaviour of HPUX, which requires you to set a sysctl variable
if you want O_SYNC to flush timestamp changes to disk.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: changing mm-mmap_sem (was: Re: system call for process information?)

2001-03-16 Thread Stephen C. Tweedie

Hi,

On Fri, Mar 16, 2001 at 08:50:25AM -0300, Rik van Riel wrote:
 On Fri, 16 Mar 2001, Stephen C. Tweedie wrote:
 
   Write locks would be used in the code where we actually want
   to change the VMA list and page faults would use an extra lock
   to protect against each other (possibly a per-pagetable lock
  
  Why do we need another lock?  The critical section where we do the
  final update on the pte _already_ takes the page table spinlock to
  avoid races against the swapper.
 
 The problem is that mmap_sem seems to be protecting the list
 of VMAs, so taking _only_ the page_table_lock could let a VMA
 change under us while a page fault is underway ...

Right, I'm not suggesting removing that: making the mmap_sem
read/write is fine, but yes, we still need that semaphore.  But as for
the "page faults would use an extra lock to protect against each
other" bit --- we already have another lock, the page table lock,
which can be used in this way, so ANOTHER lock should be unnecessary.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH]: Only one memory zone for sparc64

2001-03-16 Thread Stephen C. Tweedie

Hi,

On Thu, Mar 15, 2001 at 07:13:52PM +1100, Anton Blanchard wrote:
> 
> On sparc64 we don't care about the different memory zones and iterating
> through them all over the place only serves to waste CPU. I suspect this
> would be the case with some other architectures but for the moment I
> have just enabled it for sparc64.
> 
> With this patch I get close to a 1% improvement in dbench on the dual
> ultra60.

I'd be surprised if dbench was anything other than disk-bound on most
systems.  On any of my machines, the standard error of a single dbench
run is *way* larger than 1%, and I'd expect to have to run the
benchmark a dozen times to get a confidence interval small enough to
detect a 1% performance change: are your runs repeatable enough to be
this sensitive to the effect of the allocator?
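
For the record, the standard error of the mean shrinks as sigma/sqrt(n),
so resolving a 1% delta needs the per-run scatter to be well below that.
A throwaway sketch of the check (the throughput figures are made up):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double runs[] = { 41.2, 39.8, 42.5, 40.1, 41.9, 38.7 };  /* made-up MB/s */
	int i, n = sizeof(runs) / sizeof(runs[0]);
	double sum = 0.0, sq = 0.0, mean, sem;

	for (i = 0; i < n; i++)
		sum += runs[i];
	mean = sum / n;
	for (i = 0; i < n; i++)
		sq += (runs[i] - mean) * (runs[i] - mean);
	sem = sqrt(sq / (n - 1)) / sqrt(n);	/* standard error of the mean */
	printf("mean %.2f, stderr %.2f (%.1f%% of mean)\n",
	       mean, sem, 100.0 * sem / mean);
	return 0;
}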

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: BUG? race between kswapd and ptrace (access_process_vm )

2001-03-12 Thread Stephen C. Tweedie

Hi,

On Thu, Mar 08, 2001 at 09:12:52PM +0100, Manfred Spraul wrote:
> > 
> Fixing the bug is more difficult than I thought:
> 
> Initially I assumed it would be a two-liner (lock, unlock) but kmap()
> can sleep.
> 
> Can I reuse a kmap_atomic() type or should I add a new type?

I've just tried with the patch below and it seems fine.  You don't
need to hold the spinlock over the kmap() call: you only need to hold
a reference to the page.

Cheers,
 Stephen


--- linux-2.4.2-ac18/kernel/ptrace.c.~1~	Thu Nov  9 03:01:34 2000
+++ linux-2.4.2-ac18/kernel/ptrace.c	Mon Mar 12 11:32:30 2001
@@ -28,6 +28,7 @@
 	struct page *page;
 
 repeat:
+	spin_lock(&mm->page_table_lock);
 	pgdir = pgd_offset(vma->vm_mm, addr);
 	if (pgd_none(*pgdir))
 		goto fault_in_page;
@@ -47,9 +48,13 @@
 
 	/* ZERO_PAGE is special: reads from it are ok even though it's marked reserved */
 	if (page != ZERO_PAGE(addr) || write) {
-		if ((!VALID_PAGE(page)) || PageReserved(page))
+		if ((!VALID_PAGE(page)) || PageReserved(page)) {
+			spin_unlock(&mm->page_table_lock);
 			return 0;
+		}
 	}
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
 	flush_cache_page(vma, addr);
 
 	if (write) {
@@ -64,19 +69,23 @@
 		flush_page_to_ram(page);
 		kunmap(page);
 	}
+	put_page(page);
 	return len;
 
 fault_in_page:
+	spin_unlock(&mm->page_table_lock);
 	/* -1: out of memory. 0 - unmapped page */
 	if (handle_mm_fault(mm, vma, addr, write) > 0)
 		goto repeat;
 	return 0;
 
 bad_pgd:
+	spin_unlock(&mm->page_table_lock);
 	pgd_ERROR(*pgdir);
 	return 0;
 
 bad_pmd:
+	spin_unlock(&mm->page_table_lock);
 	pmd_ERROR(*pgmiddle);
 	return 0;
 }



Re: 64-bit capable block device layer

2001-03-08 Thread Stephen C. Tweedie

Hi,

On Wed, Mar 07, 2001 at 07:53:23PM +0100, Jens Axboe wrote:
> > 
> > OTOH, I'm not sure what problems it could give to make this
> > a compile-time option...
> 
> Plus compile time options are nasty :-). It would probably make
> bigger sense to completely skip all the merging etc for low end
> machines. I think they already do this for embedded kernels (ie
> removing ll_rw_blk.c and elevator.c). That avoids most of the
> 64-bit arithmetic anyway.

It's not just a sector-number and ll_rw_blk/elevator issue.  The limit
goes all the way up to the users of the block device, be they the
filesystem, buffer cache or block read/write layer.  

This is especially true for filesystems like XFS which need a 512-byte
blocksize.  At least with ext2 you can set the blocksize to 4kB and
get some of the benefit of larger block devices without having to
overflow the 32-bit block number.
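
The arithmetic is simple enough to sketch (illustrative only):

#include <stdio.h>

int main(void)
{
	/* A 32-bit block number caps the device at 2^32 * blocksize. */
	unsigned long long nblocks = 1ULL << 32;

	printf("512-byte blocks:  %llu TiB\n", (nblocks * 512) >> 40);	/* 2 */
	printf("4096-byte blocks: %llu TiB\n", (nblocks * 4096) >> 40);	/* 16 */
	return 0;
}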

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scsi vs ide performance on fsync's

2001-03-08 Thread Stephen C. Tweedie

Hi,

On Wed, Mar 07, 2001 at 10:36:38AM -0800, Linus Torvalds wrote:
> On Wed, 7 Mar 2001, Jeremy Hansen wrote:
> > 
> > So in the meantime as this gets worked out on a lower level, we've decided
> > to take the fsync() out of berkeley db for mysql transaction logs and
> > mount the filesystem -o sync.
> > 
> > Can anyone perhaps tell me why this may be a bad idea?
> 
>  - it doesn't help. The disk will _still_ do write buffering. It's the
>DISK, not the OS. It doesn't matter what you do.
>  - your performance will suck.

Added to which, "-o sync" only enables sync metadata updates.  It
still doesn't force an fsync on data writes.
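
So an application wanting durable data writes still has to ask for them
explicitly.  A minimal sketch, using only POSIX calls (and bearing in
mind that the drive's own write cache is a separate problem, as above):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("txn.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0) { perror("open"); return 1; }
	if (write(fd, "commit\n", 7) != 7)
		perror("write");
	if (fsync(fd) < 0)	/* needed regardless of "-o sync" */
		perror("fsync");
	return close(fd);
}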

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scsi vs ide performance on fsync's

2001-03-07 Thread Stephen C. Tweedie

Hi,

On Wed, Mar 07, 2001 at 09:15:36PM +0100, Jens Axboe wrote:
> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > 
> > For most fs'es, that's not an issue.  The fs won't start writeback on
> > the primary disk at all until the journal commit has been acknowledged
> > as firm on disk.
> 
> But do you then force wait on that journal commit?

It doesn't matter too much --- it's only the writeback which is doing
this (ext3 uses a separate journal thread for it), so any sleep is
only there to wait for the moment when writeback can safely begin:
users of the filesystem won't see any stalls.

> A barrier operation is sufficient then. So you're saying don't
> over design, a simple barrier is all you need?

Pretty much so.  The simple barrier is the only thing which can be
effectively optimised at the hardware level with SCSI anyway.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scsi vs ide performance on fsync's

2001-03-07 Thread Stephen C. Tweedie

Hi,

On Wed, Mar 07, 2001 at 07:51:52PM +0100, Jens Axboe wrote:
> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> 
> My bigger concern is when the journalled fs has a log on a different
> queue.

For most fs'es, that's not an issue.  The fs won't start writeback on
the primary disk at all until the journal commit has been acknowledged
as firm on disk.

Certainly for ext3, synchronisation between the log and the primary
disk is no big thing.  What really hurts is writing to the log, where
we have to wait for the log writes to complete before submitting the
commit write (which is sequentially allocated just after the rest of
the log blocks).  Specifying a barrier on the commit block would allow
us to keep the log device streaming, and the fs can deal with
synchronising the primary disk quite happily by itself.
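
Schematically, a sketch only (the barrier flag here is hypothetical,
not a current buffer_head bit):

	/* Stream the log blocks; only the commit record is ordered. */
	for (i = 0; i < nr_log_blocks; i++)
		submit_bh(WRITE, log_bh[i]);

	set_bit(BH_Barrier, &commit_bh->b_state);	/* hypothetical flag */
	submit_bh(WRITE, commit_bh);	/* must not pass the log blocks,
					   but the queue keeps streaming */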

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scsi vs ide performance on fsync's

2001-03-07 Thread Stephen C. Tweedie

Hi,

On Wed, Mar 07, 2001 at 03:12:41PM +0100, Jens Axboe wrote:
> 
> Yep, it's much harder than it seems. Especially because for the barrier
> to be really useful, having inter-request dependencies becomes a
> requirement. So you can say something like 'flush X and Y, but don't
> flush Y before X is done'.

Yes.  Fortunately, the simplest possible barrier is just a matter of
marking a request as non-reorderable, and then making sure that you
both flush the elevator queue before servicing that request, and defer
any subsequent requests until the barrier request has been satisfied.
Once it has gone through, you can let through the deferred requests (in
order, up to the point at which you encounter another barrier).

Only if the queue is empty can you give a barrier request directly to
the driver.  The special optimisation you can do in this case with
SCSI is to continue to allow new requests through even before the
barrier has completed if the disk supports ordered queue tags.  
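
As a sketch of that queueing rule (all names illustrative; this is not
the actual elevator code):

	if (rq_is_barrier(rq)) {		/* illustrative predicate */
		q->draining = 1;		/* nothing new may pass */
		queue_at_tail(q, rq);		/* earlier requests go first */
	} else if (q->draining) {
		defer_request(q, rq);		/* hold until barrier completes */
	} else {
		elevator_merge_or_insert(q, rq); /* normal merging/reordering */
	}
	/* On barrier completion: clear q->draining and release the
	 * deferred requests in order, up to the next barrier. */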

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scsi vs ide performance on fsync's

2001-03-07 Thread Stephen C. Tweedie

Hi,

On Tue, Mar 06, 2001 at 09:37:20PM +0100, Jens Axboe wrote:
> 
> SCSI has ordered tag, which fit the model Alan described quite nicely.
> I've been meaning to implement this for some time, it would be handy
> for journalled fs to use such a barrier. Since ATA doesn't do queueing
> (at least not in current Linux), a synchronize cache is probably the
> only way to go there.

Note that you also have to preserve the position of the barrier in the
elevator queue, and you need to prevent LVM and soft raid from
violating the barrier if different commands end up being sent to
different disks.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scsi vs ide performance on fsync's

2001-03-07 Thread Stephen C. Tweedie

Hi,

On Tue, Mar 06, 2001 at 10:44:34AM -0800, Linus Torvalds wrote:

> On Tue, 6 Mar 2001, Alan Cox wrote:
> > You want a write barrier. Write buffering (at least for short intervals) in
> > the drive is very sensible. The kernel needs to able to send drivers a write
> > barrier which will not be completed with outstanding commands before the
> > barrier.
> 
> But Alan is right - we need a "sync" command or something. I don't know
> if IDE has one (it already might, for all I know).

Sync and barrier are very different models.  With barriers we can
enforce some element of write ordering without actually waiting for the
IOs to complete; with sync, we're explicitly asking to be told when
the data has become persistent.  We can make use of both of these.

SCSI certainly lets us do both of these operations independently.  IDE
has the sync/flush command afaik, but I'm not sure whether the IDE
tagged command stuff has the equivalent of SCSI's ordered tag bits.
Andre?
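
The two primitives, side by side (a sketch; the helper names are
illustrative):

	/* Barrier: pure ordering -- no waiting, no durability claim. */
	submit_ordered(dev, req);	/* earlier writes stay ahead of it */

	/* Sync: durability -- block until the drive says the data is
	 * on the medium (SYNCHRONIZE CACHE on SCSI, FLUSH CACHE on ATA). */
	issue_flush(dev);
	wait_for_flush(dev);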

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


