Re: [PATCH] ext34: ensure do_split leaves enough free space in both blocks

2007-09-18 Thread Stephen C. Tweedie
Hi,

On Mon, 2007-09-17 at 11:06 -0500, Eric Sandeen wrote:
> The do_split() function for htree dir blocks is intended to split a
> leaf block to make room for a new entry.  It sorts the entries in the
> original block by hash value, then moves the last half of the entries to 
> the new block - without accounting for how much space this actually moves.  

Nasty.

> Also add a few comments to the functions involved.

A big improvement.  :)

> + /* Split the existing block in the middle, size-wise */
> + size = 0;
> + move = 0;
> + for (i = count-1; i >= 0; i--) {
> + /* is more than half of this entry in 2nd half of the block? */
> + if (size + map[i].size/2 > blocksize/2)
> + break;
> + size += map[i].size;
> + move++;
> + }
> + /* map index at which we will split */
> + split = count - move;

This is the guts of it, and I spent a while looking just to convince
myself it was getting things right in the corner case and was splitting
at _exactly_ the right point.  If we have large dirents and 1k
blocksize, even getting the split point off-by-one will still leave us
vulnerable to the bug.

But it all looks fine.  My only other comment would be that "move" is
redundant, you could lose it completely and just do

split = i+1;

but I think the logic's easier to understand the way it is. 

Nice catch.

--Stephen


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [188/2many] MAINTAINERS - EXT4 FILE SYSTEM

2007-08-13 Thread Stephen C. Tweedie
Hi,

On Mon, 2007-08-13 at 03:01 -0600, Andreas Dilger wrote:

> To be honest, Stephen and Andrew haven't been directly involved in
> the ext4 development.  It probably makes more sense to have e.g.
> Eric Sandeen, Ted Ts'o, and MingMing Cao in their place.

Works for me.

--Stephen


Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

2007-06-12 Thread Stephen C. Tweedie
Hi,

On Sun, 2007-06-10 at 18:27 +, Pavel Machek wrote:

> > Once a f/s is read-only, there should be NO writing to 
> > it.  Right?
> 
> Linux happily writes to filesystems mounted read-only. It will replay
> journal on them.

Only at mount time, not on unmount; and it does check whether the
underlying device is truly readonly or not first (assuming
bdev_read_only() is working on the device in question.)

--Stephen


Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

2007-06-07 Thread Stephen C. Tweedie
Hi,

On Thu, 2007-06-07 at 12:01 -0400, Mark Lord wrote:

> >>mount /var/lib/mythtv -oremount,ro
> >>sync
> >>umount /var/lib/mythtv
> > 
> > Did this succeed?  If the application is still truncating that file, the
> > umount should have failed.
> 
> Actually, what I expect to happen is for the remount,ro
> to block  until the file deletion completes.  But it doesn't.

No -- all that the remount,ro sees is that there's a fd still open for
write.  It has no idea if or when that fd is going to get closed, so it
should fail.  Returning -EBUSY is the only thing it _can_ do: waiting
for the fs to be remountable might wait forever.

So the fs is still writable after that point.
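A quick way to see this behaviour (hypothetical paths; needs root and a scratch filesystem, so this is only a sketch):

```shell
# Hold a file open for write, then try to flip the fs read-only.
# The open write fd makes the remount fail with EBUSY immediately;
# it does not block waiting for the writer to finish.
sleep 1000 > /mnt/test/held-open &
mount -o remount,ro /mnt/test   # fails with EBUSY
kill %1                         # close the writer...
wait
mount -o remount,ro /mnt/test   # ...and now the remount succeeds
```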

--Stephen


Re: EXT3-fs error (device hda8): ext3_free_blocks: Freeing blocks not in datazone

2005-09-05 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-09-05 at 17:24, Riccardo Castellani wrote:
> I'm using FC3 with Kernel 2.6.12-1.1376.
> After few hours file system on /dev/hda8 EXT3 partition has a problem so it 
> remounted in only read mode.

> Sep  5 17:34:40 mrtg kernel: EXT3-fs error (device hda8): ext3_free_blocks: 
> Freeing blocks not in datazone - block = 134217728, count = 1

That block number is 0x8000000 in hex.  It's a single-bit flip error;
that strongly sounds like hardware, and I'd run memtest86 on that box
next.

> Sep  5 17:34:40 mrtg kernel: Aborting journal on device hda8.
> Sep  5 17:34:40 mrtg kernel: EXT3-fs error (device hda8) in 
> ext3_reserve_inode_write: Journal has aborted
...

> I tried several times to run fsck on this partition and I also tried to 
> remount fs in a new partition, but it happened nothing !

What do you mean?  fsck found nothing wrong?  remount failed?  You _did_
unmount before fscking, didn't you?

--Stephen


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Stephen C. Tweedie
Hi,

On Sun, 2005-09-04 at 21:33, Pavel Machek wrote:

> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> >   no fencing needed for failed node that was mounted as specatator)
> 
> I'd call it "real-read-only", and yes, that's very usefull
> mount. Could we get it for ext3, too?

I don't want to pollute the ext3 paths with extra checks for the case
when there's no journal struct at all.  But a dummy journal struct that
isn't associated with an on-disk journal and that can never, ever go
writable would certainly be pretty easy to do.

But mount -o readonly gives you most of what you want already.  An
always-readonly option would be different in some key ways --- for a
start, it would be impossible to perform journal recovery if that's
needed, as that still needs journal and superblock write access.  That's
not necessarily a good thing.

And you *still* wouldn't get something that could act as a spectator to
a filesystem mounted writable elsewhere on a SAN, because updates on the
other node wouldn't invalidate cached data on the readonly node.  So is
this really a useful combination?

About the only combination I can think of that really makes sense in
this context is if you have a busted filesystem that somehow can't be
recovered --- either the journal is broken or the underlying device is
truly readonly --- and you want to mount without recovery in order to
attempt to see what you can find.  That's asking for data corruption,
but that may be better than getting no data at all.  

But that is something that could be done with a "-o skip-recovery" mount
option, which would necessarily imply always-readonly behaviour.

--Stephen


Re: [PATCH] Ext3 online resizing locking issue

2005-08-31 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-08-31 at 12:35, Glauber de Oliveira Costa wrote:

> At a first look, i thought about locking gdt-related data. But in a
> closer one, it seemed to me that we're in fact modifying a little bit
> more than that in the resize code. But all these modifications seem to
> be somehow related to the ext3 super block specific data in
> ext3_sb_info. My first naive approach would be adding a lock to that
> struct

I took great care when making that code SMP-safe to avoid such locks,
for performance reasons.  See the comments at

 * We need to protect s_groups_count against other CPUs seeing
 * inconsistent state in the superblock.

in fs/ext3/resize.c for the rules.  But basically the way it works is
that we only usually modify data that cannot be in use by other parts of
the kernel --- and that's fairly easy to guarantee, since by definition
extending the fs is something that is touching bits that aren't already
in use.  Only once all the new data is safely installed do we atomically
update the s_groups_count field, which instantly makes the new data
visible.  We enforce this ordering via smp read barriers before reading
s_groups_count and write barriers after modifying it, but we don't
actually have locks as such.

The only use of locking in the resize is hence the superblock lock,
which is not really there to protect the resize from the rest of the fs
--- the s_groups_count barriers do that.  All the sb lock is needed for
is to prevent two resizes from progressing at the same time; and that
could easily be abstracted into a separate resize lock.

Cheers,
 Stephen

Re: [PATCH] Ext3 online resizing locking issue

2005-08-30 Thread Stephen C. Tweedie
Hi,

On Thu, 2005-08-25 at 21:43, Glauber de Oliveira Costa wrote:

> Just a question here. With s_lock held by the remount code, we're
> altering the struct super_block, and believing we're safe. We try to
> acquire it inside the resize functions, because we're trying to modify 
> this same data. Thus, if we rely on another lock, aren't we probably 
> messing  up something ?

The two different uses of the superblock lock are really quite
different; I don't see any particular problem with using two different
locks for the two different things.  Mount and the namespace code are
not locking the same thing --- the fact that the resize code uses the
superblock lock is really a historical side-effect of the fact that we
used to use the same overloaded superblock lock in the ext2/ext3 block
allocation layers to guard bitmap access.

--Stephen



Re: [patch] Real-Time Preemption, -RT-2.6.13-rc4-V0.7.52-01

2005-08-30 Thread Stephen C. Tweedie
Hi,

On Fri, 2005-08-26 at 12:20, Steven Rostedt wrote:

> > could you try a), how clean does it get? Personally i'm much more in 
> > favor of cleanliness. On the vanilla kernel a spinlock is zero bytes on 
> > UP [the most RAM-sensitive platform], and it's a word on typical SMP.

It's a word, maybe; but it's a word used only by ext3 afaik, and it's
getting added to the core buffer_head.  Not very nice.  It certainly
looks like the easiest short-term way out for a development patch
series, though.

--Stephen



Re: [patch] Real-Time Preemption, -RT-2.6.13-rc4-V0.7.52-01

2005-08-30 Thread Stephen C. Tweedie
Hi,

On Fri, 2005-08-26 at 05:24, Steven Rostedt wrote:

> Well, I just spent several hours trying to use the b_update_lock in
> implementing something to replace the bit spinlocks for RT.  It's
> getting really ugly and I just hit a stone wall.
> 
> The problem is that I have two locks to work with. A
> jbd_lock_bh_journal_head and a jbd_lock_bh_state.  ...

For now, yes.

> So, the only other solutions that I can think of is:
> 
> a) add yet another (bloat) lock to the buffer head.

This one looks like the right answer for now, just to get the patch
series running.  I've got a WIP patch under development which removes
the bh_journal_head lock entirely; if that works out, you may find
things get a bit easier.

--Stephen

Re: [PATCH] Ext3 online resizing locking issue

2005-08-25 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-08-24 at 22:03, Glauber de Oliveira Costa wrote:

> This simple patch provides a fix for a locking issue found in the online
> resizing code. The problem actually happened while trying to resize the
> filesystem trough the resize=xxx option in a remount. 

NAK, this is wrong:

> + lock_super(sb);
>   err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
> + unlock_super(sb);

This basically reverses the order of locking between lock_super() and
journal_start() (the latter acts like a lock because it can block on a
resource if the journal is too full for the new transaction.)  That's
the opposite order to normal, and will result in a potential deadlock.

> + {Opt_resize, "resize=%u"},
>   {Opt_err, NULL},
> - {Opt_resize, "resize"},

Right, that's disabled for now.  I guess the easy fix here is just to
remove the code entirely, given that we have locking problems with
trying to fix it!

But the _right_ fix, if you really want to keep that code, is probably
to move all the resize locking to a separate lock that ranks outside the
journal_start.  The easy workaround is to drop the superblock lock and
reacquire it around the journal_start(); it would be pretty easy to make
that work robustly as far as ext3 is concerned, but I suspect there may
be VFS-layer problems if we start dropping the superblock lock in the
middle of the s_ops->remount() call --- Al?

--Stephen

Re: [-mm PATCH 2/32] fs: fix-up schedule_timeout() usage

2005-08-15 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-08-15 at 19:08, Nishanth Aravamudan wrote:

> Description: Use schedule_timeout_{,un}interruptible() instead of
> set_current_state()/schedule_timeout() to reduce kernel size.

> +++ 2.6.13-rc5-mm1-dev/fs/jbd/transaction.c   2005-08-10 15:03:33.0 
> -0700
> @@ -1340,8 +1340,7 @@ int journal_stop(handle_t *handle)
> - set_current_state(TASK_UNINTERRUPTIBLE);
> - schedule_timeout(1);
> + schedule_timeout_uninterruptible(1);

This chunk at least is fine.

--Stephen

Inotify patch missed arch/x86_64/ia32/sys_ia32.c

2005-07-15 Thread Stephen C. Tweedie
Hi,

The inotify patch just added a line

+   fsnotify_open(f->f_dentry);

to sys_open, but it missed the x86_64 compatibility sys32_open()
equivalent in arch/x86_64/ia32/sys_ia32.c.

Andi, perhaps it's time to factor out the guts of sys_open from the flag
munging to keep as much of that code as common as possible, and avoid
this sort of maintenance problem in the future?

--Stephen

Re: [PATCH] kjournald() missing JFS_UNMOUNT check

2005-07-12 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-07-11 at 22:19, Mark Fasheh wrote:

>   Can we please merge this patch? I sent it to ext2-devel for comments
> last week and haven't hear anything back. It seems trivially correct and is
> testing fine - famous last words, I know :)

ACK --- looks fine to me.

--Stephen

Re: [PATCH] Fix race in do_get_write_access()

2005-07-11 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-07-11 at 17:10, Jan Kara wrote:

>   attached patch should fix the following race:
...
>   and we have sent wrong data to disk... We now clean the dirty buffer
> flag under buffer lock in all cases and hence we know that whenever a buffer
> is starting to be journaled we either finish the pending write-out
> before attaching a buffer to a transaction or we won't write the buffer
> until the transaction is going to be committed... Please apply.

Looks good to me.

Btw, how did you find this?  Were you able to reproduce this in
practice?

Cheers,
 Stephen



Re: [PATCH] Fix JBD race in t_forget list handling

2005-07-11 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-07-11 at 16:34, Jan Kara wrote:

>   attached patch should close the possible race between
> journal_commit_transaction() and journal_unmap_buffer() (which adds
> buffers to committing transaction's t_forget list) that could leave
> some buffers on transaction's t_forget list (hence leading to an
> assertion failure later when transaction is dropped). The patch is
> against 2.6.13-rc2 kernel.  The race was really happening to David Wilk
> <[EMAIL PROTECTED]> (thanks for testing) so please apply if you find
> the patch correct.

Indeed we can be adding to the committing transaction's t_forget list
here.  The fix looks correct: we're taking the j_list_lock anyway in
that loop for the checkpoint processing, so we're not taking it any more
frequently by grabbing it for the loop condition itself too --- just
holding it a little longer.

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-18 Thread Stephen C. Tweedie
Hi,

On Fri, 2005-04-15 at 21:32, Mingming Cao wrote:

> Sorry for the delay. I was not in office these days.

No problem.


> > > Also I am concerned about the possible
> > > starvation on writers.
> > In what way?
> I was worried about the rw lock case.:)

OK, so we're both on the same track here. :-)

> > Note that there _is_ some additional complexity here.  It's not entirely
> > free.  Specifically, if we have other tasks waiting on our provisional
> > window, then we need to be very careful about the life expectancy of the
> > wait queues involved, so that if the first task fixes its window and
> > then deletes it before the waiters have woken up, they don't get
> > confused by the provisional window vanishing while being waited for.

> This approach(provisional window) sounds interesting to me, but it's a
> little more complicated than I thought:(

Yep.  Once you have other tasks waiting on your window while it's not
locked, object lifetime becomes a much bigger problem.

> alloc_new_reservation()
> retry:
> lock the tree
>   search for a new space start from the given startblock
>   found a new space
>   if the new space is not a "provisional window" 

I was thinking rather that we _start_ with the window, and _then_ look
for new space.

So we'd start with:

if we already have a window,
    mark it provisional;
else,
    do
        search for a new window;
        if the immediately preceding window is provisional,
            wait for that window;
            continue;
        allocate the window and mark it provisional;
        break

At this point we have a provisional window, and we know that nobody else
is going to allocate either in it, or in the empty space following it
(because if they tried to, they'd bump into our own provisional window
as their predecessor and would wait for us.)  So even if the window
doesn't occupy the _whole_ unreserved area, we can still keep searching
without fear of multiple tasks trying to do so in the same space at the
same time.

--Stephen



Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-13 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

> I have measured the bh refcount before the buffer_uptodate() for a few days.
> I found out that the bh refcount sometimes reached to 0 .
> So, I think following modifications are effective.
> 
> diff -Nru 2.4.30-rc3/fs/jbd/commit.c 2.4.30-rc3_patch/fs/jbd/commit.c
> --- 2.4.30-rc3/fs/jbd/commit.c 2005-04-06 17:14:47.0 +0900
> +++ 2.4.30-rc3_patch/fs/jbd/commit.c  2005-04-06 17:18:49.0 +0900
> @@ -302,11 +303,14 @@
>   if (unlikely(!buffer_uptodate(bh)))
>   err = -EIO;
>   /* the journal_head may have been removed now */
> + put_bh(bh);
>   lock_journal(journal);
>   goto write_out_data;

...

In further testing, this chunk is causing some problems.  It is entirely
legal for a buffer to be !uptodate here, although the path is somewhat
convoluted.  The trouble is, once the IO has completed,
journal_dirty_data() can steal the buffer from the committing
transaction.  And once that happens, journal_unmap_buffer() can then
release the buffer and clear the uptodate bit.  

I've run this under buffer tracing and can show you this exact path on a
trace.  It only takes a few seconds to reproduce under fsx.

For this to have happened, though, journal_dirty_data needs to have
waited on the data, so it has an opportunity to see the !uptodate state
at that point.

I think it's safe enough to sidestep this by double-checking to see
whether the bh is still on the committing transaction after we've waited
for the buffer and retaken journaling locks, but before testing
buffer_uptodate().

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-13 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-13 at 00:27, Mingming Cao wrote:

> > I wonder if there's not a simple solution for this --- mark the window
> > as "provisional", and if any other task tries to allocate in the space
> > immediately following such a window, it needs to block until that window
> > is released.

> Sounds interesting. However that implies we need a write lock to mark
> the window as provisional and block other files looking for windows near
> it: we need to insert the provisional window into the tree and then mark
> it as a temporary window, to really let other file notice this kind of
> "hold".

We need a lock for the tree modification, yes.

> I wonder if the benefit of read/write lock is worth all the hassles now.

The point of the provisional window is that you don't need read/write
locks at all.  The provisional window lets you unlock the tree
completely while you do the bitmap scan, so there's really no reason to
have rwlocks for the tree any more.

> If the new window's neighbor stays the same, we only need to roll
> forward once.  If not, after a successful scan, we need to grab the
> write lock, and make sure the window is still there.

When we take the provisional window, we can make a note of how much
space we have before the next window.  And because all future allocators
will stall if they try to allocate at this point due to the provisional
window, we know that that space will remain outside any other window
until we come to fix the provisional window and potentially roll it
forward to the space we found.

>  If we dropped the
> lock without holding the new space, we have to search the whole tree
> again to find out if the space is still there

As long as the space is within the area between the provisional window
and its successor, we don't need to do that.  (It gets more complex if
the bitmap search returns a bit _beyond_ the next window, though.)

> Also I am concerned about the possible
> starvation on writers.

In what way?

> I am thinking, maybe back to the spin_lock is not that bad with the
> "mark provisional" suggest you made?

Right, that was the intent --- sorry if I didn't make it clear. 

> >  It allows us to mark the new space
> > as provisional if we find a new space (preventing other window searches
> > from running into the same neighborhood). We could release the lock and
> > scan the bitmap without worrying that the new space will be taken by
> > others.

Exactly.

Note that there _is_ some additional complexity here.  It's not entirely
free.  Specifically, if we have other tasks waiting on our provisional
window, then we need to be very careful about the life expectancy of the
wait queues involved, so that if the first task fixes its window and
then deletes it before the waiters have woken up, they don't get
confused by the provisional window vanishing while being waited for.

The easy solution is a global wait queue for that, but that doesn't
scale well.  The complex solution is a per-window wait queue and
reference count, which is obviously a bit of bloat, though probably
worth it for the high-load case.

Cheers,
 Stephen



Re: Problem in log_do_checkpoint()?

2005-04-12 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-04-11 at 12:36, Jan Kara wrote:

> > The prevention of multiple writes in this case should also improve
> > performance a little.
> > 
> > That ought to be pretty straightforward, I think.  The existing cases
> > where we remove buffers from a checkpoint shouldn't have to care about
> > which list_head we're removing from; those cases already handle buffers
> > in both states.  It's only when doing the flush/wait that we have to
> > distinguish the two.
>   Yes, AFAICS the changes should remain local to the checkpointing code
> (plus __unlink_buffer()). Should I write the patch or will you?

Feel free, but please let me know if you start.  I'm doing a bit of
chasing of leaks and dealing with that O_SYNC thing for 2.4 right now,
but I'll get back to the checkpoint code after that if you haven't
started by then.

Cheers,
 Stephen



Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests

2005-04-12 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 02:23, Andrew Morton wrote:

> Nobody has noticed the now-fixed leak since 2.6.6 and this one appears to
> be 100x slower.  Which is fortunate because this one is going to take a
> long time to fix.  I'll poke at it some more.

OK, I'm now at the stage where I can kick off that fsx test on a kernel
without your leak fix, kill it, umount and get

Whoops: found 43 unfreeable buffers still on the superblock debug list
for sb 0100296b2d48.  Tracing one...
buffer trace for buffer at 0x01003edaa9c8 (I am CPU 0)
...

with a trace pointing to journal_unmap_buffer().  I'll try with the fix
in place to see if there are any other cases showing up with the same
problem.

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-12 Thread Stephen C. Tweedie
Hi,

On Tue, 2005-04-12 at 07:41, Mingming Cao wrote:

> > Note that this may improve average case latencies, but it's not likely
> > to improve worst-case ones.  We still need a write lock to install a new
> > window, and that's going to have to wait for us to finish finding a free
> > bit even if that operation starts using a read lock.  
> > 
> Yes indeed. However nothing is free and there are always trade-offs.:) 
> 
> By worse case you mean multiple writes trying to allocate blocks around
> same area?

It doesn't matter where they are; multiple new file opens will all be
looking for a write lock.  You only need one long-held read lock and all
the writers still block.  The worst-case latencies can't be properly
solved with r/w locks --- those let the readers go more quickly
(assuming they are in the majority), which helps the average case, but
writers still have to wait for exclusive access.  We only really help
them by dropping the lock entirely.

> Even if we take out the whole
> reservation, we still possibly run into this kind of latency: the
> bitmap on disk and on journal are extremely inconsistent so we need to
> keep searching them both until we find a free bit on both maps.

Quite possibly.  But as long as that code is running without locks, it's
much easier to deal with those latencies: they won't impact other CPUs,
cond_resched() is easier, and there's even CONFIG_PREEMPT.

> > I'm not really sure what to do about worst-case here.  For that, we
> > really do want to drop the lock entirely while we do the bitmap scan.

> Hmm...if we drop the lock entirely while scan the bitmap, assuming you
> mean drop the read lock, then I am afraid we have to re-check the tree
> (require a read or write lock ) to see if the new window space is still
> there after the scan succeed.

Sure.  You basically start off with a provisional window, and then if
necessary roll it forward just the same way you roll normal windows
forward when they get to their end.  That means you can still drop the
lock while you search for new space.  When you get there, reacquire the
lock and check that the intervening space is still available.

That's really cheap for the common case.  The difficulty is when you
have many parallel allocations hitting the same bg: they allocate
provisional windows, find the same free area later on in the bg, and
then stomp on each other as they try to move their windows there.

I wonder if there's not a simple solution for this --- mark the window
as "provisional", and if any other task tries to allocate in the space
immediately following such a window, it needs to block until that window
is released.

--Stephen



Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-04-11 at 21:46, Andrew Morton wrote:
> "Stephen C. Tweedie" <[EMAIL PROTECTED]> wrote:
> >
> > Andrew, what was the exact illegal state of the pages you were seeing
> >  when fixing that recent leak?  It looks like it's nothing more complex
> >  than dirty buffers on an anon page.
> 
> Correct.

Good --- I think I've got the debugging solid against 2.4 for that case,
patching against 2.6 now.

> >  I think that simply calling
> >  try_to_release_page() for all the remaining buffers at umount time will
> 
> Presumably these pages have no ->mapping, so try_to_release_page() will
> call try_to_free_buffers().

Exactly.

> >  The only thing that would be required on
> >  top of that would be a check that the page is also on the VM LRU lists.
> 
> Why do we have dirty buffers left over at umount time?

In the leak case we're worried about, the buffers are dirty but the page
is anon so there's nothing to clean them up.  The buffers _will_ be
destroyed by unmount, sure; invalidate_bdev() should see to that.  But
I'm doing the bh leak check before we get to the final
invalidate_bdev(), so they should still be available for testing at that
point.

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-04-11 at 19:38, Mingming Cao wrote:

> I agree. We should not skip the home block group of the file.  I guess
> what I was suggesting is, if allocation from the home group failed and
> we continue the linear search through the rest of the block groups, we
> could probably try to skip the block groups without enough usable free
> blocks to make a reservation.

Fair enough.  Once those are the only bgs left, performance is going to
drop pretty quickly, but that's not really avoidable on a very full
filesystem.

> Ah.. I see the win with the read lock now: once a reservation window
> is added, updating it (either winding it forward or searching for an
> available window) is probably the majority of the operations on the
> tree, and using a read lock for that should help reduce the latency.

Right.  The down side is that for things like a kernel "tar xf", we'll
be doing lots of small file unpacks, and hopefully most files will be
just one or two reservations --- so there's little winding forward going
on.  The searching will still be improved in that case.

Note that this may improve average case latencies, but it's not likely
to improve worst-case ones.  We still need a write lock to install a new
window, and that's going to have to wait for us to finish finding a free
bit even if that operation starts using a read lock.  

I'm not really sure what to do about worst-case here.  For that, we
really do want to drop the lock entirely while we do the bitmap scan.

That leaves two options.  Hold a reservation while we do that; or don't.
Holding one poses the problems we discussed before: either you hold a
large reservation (bad for disk layout in the presence of concurrent
allocators), or you hold smaller ones (high cost as you continually
advance the window, which requires some read lock on the tree to avoid
bumping into the next window.)

Just how bad would it be if we didn't hold a lock _or_ a window at all
while doing the search for new window space?  I didn't like that
alternative at first because of the problem when you've got multiple
tasks trying to allocate in the same space at the same time; but given
the locking overhead of the alternatives, I'm wondering if this is
actually the lesser evil.

--Stephen



Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Thu, 2005-04-07 at 16:51, Stephen C. Tweedie wrote:

> I'm currently running with the buffer-trace debug patch, on 2.4, with a
> custom patch to put every buffer jbd ever sees onto a per-superblock
> list, and remove it only when the bh is destroyed in
> put_unused_buffer_head().  At unmount time, we can walk that list to
> find stale buffers attached to data pages (invalidate_buffers() already
> does such a walk for metadata.)

After a 50-process dbench run, unmount is showing ~300 buffers
unclaimed; that seems to be OK, it's a large stress test doing _lots_ of
create/unlink and we expect to see a few buffers around at the end where
they were truncated during commit and couldn't be instantly reclaimed.

The main thing now is to test these buffers to make 100% sure that they
are in a state where the VM can reclaim them.  That looks fine: the
buffers I see are unjournaled, have no jh, and are clean and on the
BUF_CLEAN lru.

Andrew, what was the exact illegal state of the pages you were seeing
when fixing that recent leak?  It looks like it's nothing more complex
than dirty buffers on an anon page.  I think that simply calling
try_to_release_page() for all the remaining buffers at umount time will
be enough to catch these; if that function fails, it tells us that the
VM can't reclaim these pages.  The only thing that would be required on
top of that would be a check that the page is also on the VM LRU lists.

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Fri, 2005-04-08 at 19:10, Mingming Cao wrote:

> > It still needs to be done under locking to prevent us from expanding
> > over the next window, though.  And having to take and drop a spinlock a
> > dozen times or more just to find out that there are no usable free
> > blocks in the current block group is still expensive, even if we're not
> > actually fully unlinking the window each time.

> Isn't this a rare case? The whole group is relatively full and the free
> bits are all reserved by other files.

Well, that's not much different from the case where there _are_ usable
bits but they are all near the end of the bitmap.  And that's common
enough as you fill up a block group with small files, for example.  Once
the bg becomes nearly full, new file opens-for-write will still try to
allocate in that bg (because that's where the inode was allocated), but
as it's a new fd we have no prior knowledge of _where_ in the bh to
allocate, and we'll have to walk it to the end to find any free space. 
This is the access pattern I'd expect of (for example) "tar xvjf
linux-2.6.12.tar.bz2", not exactly a rare case.

>   Probably we should avoid trying
> to make reservation in this block group at the first place

Not in this case, because it's the "home" of the file in question, and
skipping to another bg would just leave useful space empty --- and that
leads to fragmentation.

> You are proposing that we hold the read lock first, do the window search
> and bitmap scan, then once we confirm there is free bit in the candidate
> window, we grab the write lock and update the tree?  

No, I'm suggesting that if we need the write lock for tree updates, we
may still be able to get away with just a read lock when updating an
individual window.  If all we're doing is winding the window forwards a
bit, that's not actually changing the structure of the tree.

> However I am still worried that the rw lock will allow concurrent files
> trying to lock the same window at the same time. 

Adding a new window will need the write lock, always.  But with the read
lock, we can still check the neighbouring windows (the structure is
pinned so those remain valid), so advancing a window with just that lock
can still be done without risking overlapping the next window.

--Stephen



Re: Problem in log_do_checkpoint()?

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Fri, 2005-04-08 at 18:14, Badari Pulavarty wrote:
> I get OOPs  in log_do_checkpoint() while using ext3 quotas.
> Is this anyway related to what you are working on ?
> 
> Unable to handle kernel NULL pointer dereference at virtual address
> 

Doesn't look like it, no.  If we understand it right, Jan's problem
would only ever manifest itself as an assert failure, not as an oops.

--Stephen



Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-04-11 at 21:46, Andrew Morton wrote:
> Stephen C. Tweedie <[EMAIL PROTECTED]> wrote:
> 
> > Andrew, what was the exact illegal state of the pages you were seeing
> > when fixing that recent leak?  It looks like it's nothing more complex
> > than dirty buffers on an anon page.
> 
> Correct.

Good --- I think I've got the debugging solid against 2.4 for that case,
patching against 2.6 now.

> > I think that simply calling
> > try_to_release_page() for all the remaining buffers at umount time will
> 
> Presumably these pages have no ->mapping, so try_to_release_page() will
> call try_to_free_buffers().

Exactly.

> > The only thing that would be required on
> > top of that would be a check that the page is also on the VM LRU lists.
> 
> Why do we have dirty buffers left over at umount time?

In the leak case we're worried about, the buffers are dirty but the page
is anon so there's nothing to clean them up.  The buffers _will_ be
destroyed by unmount, sure; invalidate_bdev() should see to that.  But
I'm doing the bh leak check before we get to the final
invalidate_bdev(), so they should still be available for testing at that
point.

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-08 Thread Stephen C. Tweedie
Hi,

On Fri, 2005-04-08 at 00:37, Mingming Cao wrote:

> Actually, we do not have to do an rbtree link and unlink for every
> window we search.  If the reserved window(old) has no free bit and the
> new reservable window's is right after the old one, no need to unlink
> the old window from the rbtree and then link the new window, just update
> the start and end of the old one.

It still needs to be done under locking to prevent us from expanding
over the next window, though.  And having to take and drop a spinlock a
dozen times or more just to find out that there are no usable free
blocks in the current block group is still expensive, even if we're not
actually fully unlinking the window each time.

I wonder if this can possibly be done safely without locking?  It would
be really good if we could rotate windows forward with no global locks. 
At the very least, a fair rwlock would let us freeze the total layout of
the tree, while still letting us modify individual windows safely.  As
long as we use wmb() to make sure that we always extend the end of the
window before we shrink the start of it, I think we could get away with
such shared locking.  And rw locking is much better for concurrency, so
we might be able to hold it over the whole bitmap search rather than
taking it and dropping it at each window advance.
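
The extend-before-shrink ordering can be sketched with C11 atomics
standing in for the kernel's wmb() (a user-space illustration, not
kernel code; the type and function names are invented for the sketch):

```c
#include <assert.h>
#include <stdatomic.h>

/* A reservation window whose bounds may be read without the tree lock. */
struct window {
    _Atomic unsigned long start;
    _Atomic unsigned long end;
};

/* Rotate [start,end] forward.  The new end is published before the old
 * start is retracted (release ordering plays the role of wmb()), so a
 * concurrent lockless reader can observe a window that is temporarily
 * too large, but never one that has shrunk to nothing mid-update. */
static void rotate_window(struct window *w,
                          unsigned long new_start, unsigned long new_end)
{
    atomic_store_explicit(&w->end, new_end, memory_order_release);
    atomic_store_explicit(&w->start, new_start, memory_order_release);
}
```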

--Stephen



Re: Problem in log_do_checkpoint()?

2005-04-08 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-04-04 at 10:04, Jan Kara wrote:

>   In log_do_checkpoint() we go through the t_checkpoint_list of a
> transaction and call __flush_buffer() on each buffer. Suppose there is
> just one buffer on the list and it is dirty. __flush_buffer() sees it and
> puts it to an array of buffers for flushing. Then the loop finishes,
> retry=0, drop_count=0, batch_count=1. So __flush_batch() is called - we
> drop all locks and sleep. While we are sleeping somebody else comes and
> makes the buffer dirty again (OK, that is not probable, but I think it
> could be possible). 

Yes, there's no reason why not at that point.

> Now we wake up and call __cleanup_transaction().
> It's not able to do anything and returns 0.

I think the _right_ answer here is to have two separate checkpoint
lists: the current one, plus one for which the checkpoint write has
already been submitted.  That way, we can wait for IO completion on
submitted writes without (a) getting conned into doing multiple rewrites
if there's somebody else dirtying the buffer; or (b) getting confused
about how much progress we're making.  Buffers on the pre-write list get
written; buffers on the post-write list get waited for; and both count
as progress (eliminating the false assert-failure when we failed to
detect progress).

The prevention of multiple writes in this case should also improve
performance a little.

That ought to be pretty straightforward, I think.  The existing cases
where we remove buffers from a checkpoint shouldn't have to care about
which list_head we're removing from; those cases already handle buffers
in both states.  It's only when doing the flush/wait that we have to
distinguish the two.
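
As a toy model of the proposal (arrays instead of list_heads, flags
instead of real I/O; none of these names are the jbd ones), the split
into a pre-write and a post-write list looks like:

```c
#include <assert.h>
#include <stdbool.h>

enum { MAXBUF = 16 };

struct buf { bool dirty; bool submitted; };

struct checkpoint {
    struct buf *pre[MAXBUF];  int npre;   /* write not yet submitted */
    struct buf *post[MAXBUF]; int npost;  /* write submitted, awaiting I/O */
};

/* Submit writes for everything on the pre-write list, moving each
 * buffer to the post-write list; returns how many made progress. */
static int flush_checkpoint(struct checkpoint *cp)
{
    int progress = 0;
    while (cp->npre > 0) {
        struct buf *b = cp->pre[--cp->npre];
        b->submitted = true;            /* stands in for submit_bh() */
        cp->post[cp->npost++] = b;
        progress++;
    }
    return progress;
}

/* Wait for submitted writes; a buffer redirtied while we slept still
 * counts as progress, removing the false assert-failure case. */
static int wait_checkpoint(struct checkpoint *cp)
{
    int progress = 0;
    while (cp->npost > 0) {
        struct buf *b = cp->post[--cp->npost];
        b->dirty = false;               /* stands in for wait_on_buffer() */
        progress++;
    }
    return progress;
}
```

Both phases return a progress count, so the caller never confuses
"waiting on already-submitted I/O" with "making no progress at all".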

Sounds good?

--Stephen



Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-07 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 21:10, Stephen C. Tweedie wrote:

> However, 2.6 is suspected of still having leaks in ext3.  To be certain
> that we're not just backporting one of those to 2.4, we need to
> understand who exactly is going to clean up these bh's if they are in
> fact unused once we complete this code and do the final put_bh().  
> 
> I'll give this patch a spin with Andrew's fsx-based leak stress and see
> if anything unpleasant appears.  

I'm currently running with the buffer-trace debug patch, on 2.4, with a
custom patch to put every buffer jbd ever sees onto a per-superblock
list, and remove it only when the bh is destroyed in
put_unused_buffer_head().  At unmount time, we can walk that list to
find stale buffers attached to data pages (invalidate_buffers() already
does such a walk for metadata.)

I just ran it with a quick test and it found 300,000 buffers still
present at unmount.  Whoops, I guess I need to move the check to _after_
the final invalidate_inodes() call. :-)

But this method ought to allow us to test for this sort of leak a lot
more reliably, and to report via the usual buffer history tracing the
most recent activity on any bh that does leak.

Andrew, I'll give this a try on 2.6 once I've got this 2.4 patch tested
for Marcelo.  I've got uptodate buffer_trace for 2.6 anyway, so it
shouldn't be too hard.

--Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-07 Thread Stephen C. Tweedie
Hi,

On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote:

> doesn't the first option also allow searches to be in parallel?

In terms of CPU usage, yes.  But either we use large windows, in which
case we *can't* search remotely near areas of the disk in parallel; or
we use small windows, in which case failure to find a free bit becomes
disproportionately expensive (we have to do an rbtree link and unlink
for each window we search.)

>  This 
> particular one took over 1 msec, so it seems there's fair room for 
> parallelizing multiple writers and thus improving scalability. (or is 
> this codepath serialized globally anyway?)

No, the rbtree ops are serialised per-sb, and allocations within any one
inode are serialised, but the bitmap searches need not be serialised
globally.  (They are in the case of new window searches right now, which
is what we're trying to fix.)

--Stephen



Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-06 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

> I have measured the bh refcount before the buffer_uptodate() for a few days.
> I found out that the bh refcount sometimes reached 0.
> So, I think the following modifications are effective.

Thanks --- it certainly looks like this should fix the dereferencing of
invalid bh's, and this code mirrors what 2.6 does around that area.

However, 2.6 is suspected of still having leaks in ext3.  To be certain
that we're not just backporting one of those to 2.4, we need to
understand who exactly is going to clean up these bh's if they are in
fact unused once we complete this code and do the final put_bh().  

I'll give this patch a spin with Andrew's fsx-based leak stress and see
if anything unpleasant appears.  

Cheers,
 Stephen



Re: ext3 allocate-with-reservation latencies

2005-04-06 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 17:53, Mingming Cao wrote:

> > Possible, but not necessarily nice.  If you've got a nearly-full disk,
> > most bits will be already allocated.  As you scan the bitmaps, it may
> > take quite a while to find a free bit; do you really want to (a) lock
> > the whole block group with a temporary window just to do the scan, or
> > (b) keep allocating multiple smaller windows until you finally find a
> > free bit?  The former is bad for concurrency if you have multiple tasks
> > trying to allocate nearby on disk --- you'll force them into different
> > block groups.  The latter is high overhead.

> I do not quite understand what you mean about (a).  In this proposal, we
> will drop the lock before the scan. 

s/lock/reserve/.  

> And for (b), maybe I did not make myself clear: I am not proposing to
> keeping allocating multiple smaller windows until finally find a free
> bit. I mean, we book the window(just link the node into the tree) before
> we drop the lock, if there is no free bit inside that window, we will go
> back search for another window(call find_next_reserveable_window()),
> inside it, we will remove the temporary window we just created and find
> next window.  So we only have one temporary window at a time. 

And that's the problem.  Either we create small temporary windows, in
which case we may end up thrashing through vast numbers of them before
we find a bit that's available --- very expensive as the disk gets full
--- or we use large windows but get worse layout when there are parallel
allocators going on.

--Stephen



Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-06 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

>  >Certainly it's normal for a short read/write to imply either error or
>  >EOF, without the error necessarily needing to be returned explicitly.
>  >I'm not convinced that the Singleunix language actually requires that,
>  >but it seems the most obvious and consistent behaviour.

> When an O_SYNC flag is set, if commit_write() succeeds but
> generic_osync_inode() returns an error due to I/O failure, write() must
> fail.

Yes.  But it is conventional to interpret a short write as being a
failure.  Returning less bytes than were requested in the write
indicates that the rest failed.  It just doesn't give the exact nature
of the failure (EIO vs ENOSPC etc.)  For regular files, a short write is
never permitted unless there are errors of some description.
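
From the caller's side, that convention means a short write has to be
treated as a partial failure.  A minimal user-space loop that applies it
(ordinary POSIX write(), nothing ext3-specific; the helper name is
invented):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write all of buf, treating a short write as "the remainder failed"
 * and retrying from where it stopped; returns len, or -1 on error
 * (the exact nature -- EIO vs ENOSPC etc. -- is in errno). */
static ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: just retry */
            return -1;          /* real error, e.g. EIO or ENOSPC */
        }
        if (n == 0)
            return -1;          /* no progress: give up */
        done += n;
    }
    return (ssize_t)done;
}
```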

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ext3 allocate-with-reservation latencies

2005-04-06 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 06:35, Mingming Cao wrote:

> It seems we are holding the rsv_block while searching the bitmap for a
> free bit.  

Probably something to avoid!

> In alloc_new_reservation(), we first find an available space to
> create a reservation window, then we check the bitmap to see if it
> contains any free block. If not, we will search for next available
> window, so on and on. During the whole process we are holding the global
> rsv_lock.  We could, and probably should, avoid that.  Just unlock the
> rsv_lock before the bitmap search and re-grab it after it.  We need to
> make sure that the available space that are still available after we re-
> grab the lock. 

Not necessarily.  As long as the windows remain mutually exclusive in
the rbtree, it doesn't matter too much if we occasionally allocate a
full one --- as long as that case is rare, the worst that happens is
that we fail to allocate from the window and have to repeat the window
reserve.

The difficulty will be in keeping it rare.  What we need to avoid is the
case where multiple tasks need a new window, they all drop the lock,
find the same bits free in the bitmap, then all try to take that
window.  One will succeed, the others will fail; but as the files in
that bit of the disk continue to grow, we risk those processes
*continually* repeating the same stomp-on-each-others'-windows operation
and raising the allocation overhead significantly.

> Another option is to hold that available window before we release the
> rsv_lock, and if there is no free bit inside that window, we will remove
> it from the tree in the next round of searching for next available
> window.

Possible, but not necessarily nice.  If you've got a nearly-full disk,
most bits will be already allocated.  As you scan the bitmaps, it may
take quite a while to find a free bit; do you really want to (a) lock
the whole block group with a temporary window just to do the scan, or
(b) keep allocating multiple smaller windows until you finally find a
free bit?  The former is bad for concurrency if you have multiple tasks
trying to allocate nearby on disk --- you'll force them into different
block groups.  The latter is high overhead.
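
The cost being weighed in both options is the bitmap walk itself.  A
minimal sketch in the spirit of the kernel's find_next_zero_bit() (this
is not the kernel implementation) shows why:

```c
#include <limits.h>
#include <stddef.h>

/* Sketch of a block-bitmap scan: bit i set means block i is allocated.
 * On a nearly full disk this walk can be long -- which is exactly why
 * holding rsv_lock, or a temporary window, across the scan is the
 * expensive part of the trade-off above. */
static size_t find_free_bit(const unsigned char *map, size_t nbits,
                            size_t start)
{
    for (size_t i = start; i < nbits; i++)
        if (!(map[i / CHAR_BIT] & (1u << (i % CHAR_BIT))))
            return i;
    return nbits;                       /* no free bit found */
}
```

The fuller the bitmap, the later the first clear bit, and the longer any
lock or reservation held across the loop stays held.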

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-06 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

> I have measured the bh refcount before the buffer_uptodate() for a few days.
> I found out that the bh refcount sometimes reached to 0 .
> So, I think following modifications are effective.

Thanks --- it certainly looks like this should fix the dereferencing of
invalid bh's, and this code mirrors what 2.6 does around that area.

However, 2.6 is suspected of still having leaks in ext3.  To be certain
that we're not just backporting one of those to 2.4, we need to
understand who exactly is going to clean up these bh's if they are in
fact unused once we complete this code and do the final put_bh().  

I'll give this patch a spin with Andrew's fsx-based leak stress and see
if anything unpleasant appears.  

Cheers,
 Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-05 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-03-30 at 12:59, Marcelo Tosatti wrote:

> > I'm not certain that this is right, but it seems possible and would
> > explain the symptoms.  Maybe Stephen or Andrew could comments?
> 
> Andrew, Stephen?

Sorry, was offline for a week last week; I'll try to look at this more
closely tomorrow.  Checking the buffer_uptodate() without either a
refcount or a lock certainly looks unsafe at first glance.

There are lots of ways to pin the bh in that particular bit of the
code.  The important thing will be to do so without causing leaks if
we're truly finished with the buffer after this flush.


> > If some of the write succeeded and some failed, then I believe the
> > correct behaviour is to return the number of bytes that succeeded.
> > However this change to the return status (remember the above patch is
> > a reversal) causes any failure to over-ride any success. This, I
> > think, is wrong.
> 
> Yeap, that part also looks wrong.

Certainly it's normal for a short read/write to imply either error or
EOF, without the error necessarily needing to be returned explicitly. 
I'm not convinced that the Singleunix language actually requires that,
but it seems the most obvious and consistent behaviour.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: OOM problems on 2.6.12-rc1 with many fsx tests

2005-04-05 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-04-04 at 02:35, Andrew Morton wrote:

> Without the below patch it's possible to make ext3 leak at around a
> megabyte per minute by arranging for the fs to run a commit every 50
> milliseconds, btw.

Ouch! 
> (Stephen, please review...)

Doing so now. 

> The patch teaches journal_unmap_buffer() about buffers which are on the
> committing transaction's t_locked_list.  These buffers have been written and
> I/O has completed.  

Agreed.  The key here is that the buffer is locked before
journal_unmap_buffer() is called, so we can indeed rely on it being
safely on disk.  

> We can take them off the transaction and undirty them
> within the context of journal_invalidatepage()->journal_unmap_buffer().

Right - the committing transaction can't be doing any more writes, and
the current transaction has explicitly told us to throw away its own
writes if we get here.  Unfiling the buffer should be safe.

> + if (jh->b_jlist == BJ_Locked) {
> + /*
> +  * The buffer is on the committing transaction's locked
> +  * list.  We have the buffer locked, so I/O has
> +  * completed.  So we can nail the buffer now.
> +  */
> + may_free = __dispose_buffer(jh, transaction);
> + goto zap_buffer;
> + }

ACK.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ext3 journalling BUG on full filesystem

2005-03-24 Thread Stephen C. Tweedie
Hi,

On Thu, 2005-03-24 at 19:38, Chris Wright wrote:

> OK, good to know.  When I last checked you were working on a higher risk
> yet more complete fix, and I thought we'd wait for that one to stabilize.
> Looks like the one Jan attached is the better -stable candidate?

Definitely; it's the one I gave akpm.  The lock reworking is going to
remove one layer of locks, so it's worthwhile from that point of view;
but it's longer-term, and I don't know for sure of any paths to chaos
with that simpler journal_unmap_buffer() fix in place.  (It's just very
hard to _prove_ all cases are correct without the locking rework.)

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ext3 journalling BUG on full filesystem

2005-03-24 Thread Stephen C. Tweedie
Hi,

On Thu, 2005-03-24 at 10:39, Jan Kara wrote:

>   Actually the patch you atached showed in the end as not covering all
> the cases and so Stephen agreed to stay with the first try (attached)
> which should cover all known cases (although it's not so nice).

Right.  The later patch is getting reworked into a proper locking
overhaul for the journal_put_journal_head() code.  The earlier one (that
Jan attached) is the one that's appropriate in the mean time; it covers
all of the holes we know about for sure and has proven robust in
testing.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [CHECKER] ext3 bug in ftruncate() with O_SYNC?

2005-03-23 Thread Stephen C. Tweedie
Hi,

On Tue, 2005-03-22 at 03:51, Andrew Morton wrote:

> The spec says "Write I/O operations on the file descriptor shall complete
> as defined by synchronized I/O file integrity completion".
> 
> Is ftruncate a "write I/O operation"?  No.

SUS seems to be pretty clear on this.  The syscall descriptions for
write(2) and pwrite(2) explicitly describe O_SYNC as requiring
synchronized I/O file integrity completion.  ftruncate() has no such
requirement.

It would certainly be a reasonable thing to do, but I don't think it
strictly counts as a bug that we're not honouring O_SYNC here.
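
The distinction shows up directly at the syscall level.  A sketch (an
illustrative user-space fragment, not from any real codebase): O_SYNC
covers write(), so a caller who also wants the truncate on stable
storage has to ask for it explicitly:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* O_SYNC gives write() synchronized I/O file integrity completion, but
 * per SUS it says nothing about ftruncate(); fsync() makes the size
 * change durable explicitly. */
static int truncate_synced(int fd, off_t len)
{
    if (ftruncate(fd, len) < 0)
        return -1;
    return fsync(fd);           /* explicit, since O_SYNC won't cover it */
}

/* Demo on a scratch file (the path is illustrative only): O_SYNC write,
 * then a synced truncate; returns the resulting file size. */
static off_t demo(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC | O_SYNC, 0600);

    if (fd < 0)
        return -1;
    st.st_size = -1;
    if (write(fd, "payload", 7) == 7 && truncate_synced(fd, 4) == 0)
        if (fstat(fd, &st) < 0)
            st.st_size = -1;
    close(fd);
    unlink(path);
    return st.st_size;
}
```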

--Stephen


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


e2fsprogs bug [was Re: ext2/3 file limits to avoid overflowing i_blocks]

2005-03-17 Thread Stephen C. Tweedie
Hi,

On Thu, 2005-03-17 at 17:23, Stephen C. Tweedie wrote:

> I wrote a small program to calculate the total indirect tree overhead
> for any given file size, and 0x1ff7fffe000 turned out to be the largest
> file we can get without the total i_blocks overflowing 2^32.
> 
> But in testing, that *just* wrapped --- we need to limit the file to be
> one page smaller than that to deal with the possibility of an EA/ACL
> block being accounted against i_blocks.

On a side issue, e2fsck was unable to find any problem on that
filesystem after the i_blocks had wrapped exactly to zero.

The bug seems to be in e2fsck/pass1.c: we do the numblocks checking
inside process_block(), which is called as an inode block iteration
function in check_blocks().  Then, later, check_blocks() does

if (inode->i_file_acl && check_ext_attr(ctx, pctx, block_buf))
pb.num_blocks++;

pb.num_blocks *= (fs->blocksize / 512);

but without any further testing to see if pb.num_blocks has exceeded the
max_blocks.  So by the time we've got to the end of check_blocks(),
we're testing the wrapped i_blocks on disk against the wrapped
num_blocks in memory, and so e2fsck fails to notice anything wrong.
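
A minimal illustration of that wrap (not e2fsck code): both sides of the
comparison are 32-bit, so a true sector count of exactly 2^32 truncates
to zero on both sides and the check that should flag the inode passes.

```c
#include <stdint.h>

/* On-disk i_blocks and e2fsck's recomputed count undergo the same
 * 32-bit truncation, so an exact 2^32 wrap is invisible to a plain
 * equality test. */
static int i_blocks_consistent(uint32_t on_disk, uint64_t true_sectors)
{
    uint32_t recomputed = (uint32_t)true_sectors;   /* same truncation */
    return on_disk == recomputed;
}
```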

The fix may be as simple as just moving the

if (inode->i_file_acl && check_ext_attr(ctx, pctx, block_buf))
pb.num_blocks++;

earlier in the function; Ted, do you see any problems with that?

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Patch] ext2/3 file limits to avoid overflowing i_blocks

2005-03-17 Thread Stephen C. Tweedie
Hi all,

As discussed before, we can overflow i_blocks in ext2/ext3 inodes by
growing a file up to 2TB.  That gives us 2^32 sectors of data in the
file; but once you add on the indirect tree and possible EA/ACL
metadata, i_blocks will wrap beyond 2^32.  Consensus seemed to be that
the best way to avoid this was simply to stop files getting so large
that this was a problem in the first place; anything else would lead to
complications if a sparse file tried to overflow that 2^32 sector limit
while filling in holes.

I wrote a small program to calculate the total indirect tree overhead
for any given file size, and 0x1ff7fffe000 turned out to be the largest
file we can get without the total i_blocks overflowing 2^32.

But in testing, that *just* wrapped --- we need to limit the file to be
one page smaller than that to deal with the possibility of an EA/ACL
block being accounted against i_blocks.

So this patch has been tested, at least on ext3, by letting a file grow
densely to its maximum size permitted by the kernel; at 0x1ff7fffe000,
stat shows the file to have wrapped back exactly to 0 st_blocks, but
with the limit at 0x1ff7fffd000, du shows it occupying the expected
2TB-blocksize bytes.
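
The arithmetic behind the constant can be reproduced in a few lines --- a
sketch in the spirit of the small program mentioned above (not the
original program), assuming a 4k blocksize, 1024 block addresses per
indirect block, and the classic direct/single/double/triple layout:

```c
#include <stdint.h>

#define APB 1024ULL                     /* block addresses per 4k block */

static uint64_t ceil_div(uint64_t a, uint64_t b)
{
    return (a + b - 1) / b;
}

/* 512-byte sector count for a dense file of `blocks` 4k data blocks,
 * including every indirect/double/triple-indirect block -- i.e. what
 * i_blocks should hold (before any EA/ACL block). */
static uint64_t total_sectors(uint64_t blocks)
{
    uint64_t meta = 0, n = blocks;

    n -= (n < 12) ? n : 12;                     /* 12 direct blocks */
    if (n) {                                    /* single indirect */
        uint64_t d = (n < APB) ? n : APB;
        meta += 1;
        n -= d;
    }
    if (n) {                                    /* double indirect */
        uint64_t d = (n < APB * APB) ? n : APB * APB;
        meta += 1 + ceil_div(d, APB);
        n -= d;
    }
    if (n)                                      /* triple indirect */
        meta += 1 + ceil_div(n, APB * APB) + ceil_div(n, APB);

    return (blocks + meta) * 8;                 /* 8 sectors per 4k block */
}
```

At 0x1ff7fffe000 bytes this comes to 2^32 - 8 sectors, so one extra
EA/ACL block lands i_blocks on exactly 2^32 and wraps it to zero; at
0x1ff7fffd000 there is still room for the EA block.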

Cheers,
 Stephen

[Patch] ext2/3 file limits to avoid overflowing i_blocks

ext2/3 can currently overflow i_blocks past the maximum 2^32 sector
(2TB) limit.  The ext[23]_max_size functions try to enforce an upper
limit of 2^32-1 blocks by limiting the file size to 2TB-blocksize, but
this fails to account for indirect block metadata and the possibility of
an extended attribute block also being accounted against i_blocks.

The new file size limit of 0x1ff7fffd000 allows a file to grow to
2^32-blocksize sectors, assuming 4k blocksize.  (We don't have to worry
about smaller blocksizes as those have indirect tree limits that don't
let their size grow near this upper bound in the first place.)

Signed-off-by: Stephen Tweedie <[EMAIL PROTECTED]>

--- linux-2.6-ext3/fs/ext2/super.c.=K0001=.orig
+++ linux-2.6-ext3/fs/ext2/super.c
@@ -518,12 +518,18 @@ static int ext2_check_descriptors (struc
 static loff_t ext2_max_size(int bits)
 {
loff_t res = EXT2_NDIR_BLOCKS;
+   /* This constant is calculated to be the largest file size for a
+* dense, 4k-blocksize file such that the total number of
+* sectors in the file, including data and all indirect blocks,
+* does not exceed 2^32. */
+   const loff_t upper_limit = 0x1ff7fffd000LL;
+
res += 1LL << (bits-2);
res += 1LL << (2*(bits-2));
res += 1LL << (3*(bits-2));
res <<= bits;
-   if (res > (512LL << 32) - (1 << bits))
-   res = (512LL << 32) - (1 << bits);
+   if (res > upper_limit)
+   res = upper_limit;
return res;
 }
 
--- linux-2.6-ext3/fs/ext3/super.c.=K0001=.orig
+++ linux-2.6-ext3/fs/ext3/super.c
@@ -1193,12 +1193,18 @@ static void ext3_orphan_cleanup (struct 
 static loff_t ext3_max_size(int bits)
 {
loff_t res = EXT3_NDIR_BLOCKS;
+   /* This constant is calculated to be the largest file size for a
+* dense, 4k-blocksize file such that the total number of
+* sectors in the file, including data and all indirect blocks,
+* does not exceed 2^32. */
+   const loff_t upper_limit = 0x1ff7fffd000LL;
+
res += 1LL << (bits-2);
res += 1LL << (2*(bits-2));
res += 1LL << (3*(bits-2));
res <<= bits;
-   if (res > (512LL << 32) - (1 << bits))
-   res = (512LL << 32) - (1 << bits);
+   if (res > upper_limit)
+   res = upper_limit;
return res;
 }
 


Re: Devices/Partitions over 2TB

2005-03-16 Thread Stephen C. Tweedie
Hi,

On Tue, 2005-03-15 at 04:54, jmerkey wrote:

> Good Question.  Where are the standard tools in FC2 and FC3 for these types?

For LVM, the lvm2 package contains all the necessary tools.  I know
Alasdair did some kernel fixes for lvm2 striping on >2TB partitions
recently, though, so older kernels might not work perfectly if you're
using stripes.

To use genuine partitions > 2TB, though, you need alternative
partitioning; the GPT disk label supports that, and "parted" can create
and partition such disk labels.  (Note that most x86 BIOSes can't boot
off them, though, so don't do this on your boot disk!)

--Stephen


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext3/jbd race: releasing in-use journal_heads

2005-03-09 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-03-09 at 13:28, Jan Kara wrote:

>   Hmm. I see for example a place at jbd/commit.c, line 287 (which you
> did not change in your patch) which does this and doesn't seem to be
> protected against journal_unmap_buffer() (but maybe I miss something).
> Not that I'd find that race probable but in theory...

Indeed; I can't see why that wouldn't trigger (at least without the
existing, low-risk journal_unmap_buffer() patch.)

Andrew, I think we just go with that simple patch for now --- it catches
the cases we actually see in testing.  And rather than mess with
temporary states where b_transaction goes NULL, I think the *correct*
long-term fix is to hide those states using locking rather than by list
tricks.  Merging the bh_journal_head and bh_state locks really seems
like the safe solution here, as the latter seems to be held nearly
everywhere where we need protection against journal_put_journal_head()
(and where it's not held --- as it wasn't in journal_unmap_buffer() ---
it's a bug.)

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext3/jbd race: releasing in-use journal_heads

2005-03-09 Thread Stephen C. Tweedie
Hi,

On Tue, 2005-03-08 at 15:12, Jan Kara wrote:

>   Isn't also the following scenario dangerous?
> 
>   __journal_unfile_buffer(jh);
>   journal_remove_journal_head(bh);

It depends.  I think the biggest problem here is that there's really no
written rule protecting this stuff universally.  But in testing, we just
don't see this triggering.

Why not?  We actually have a raft of other locks protecting us in
various places.  As long as there's some overlap in the locks, we're
OK.  But because it's not written down, not everybody relies on the same
set of locks; it's only because journal_unmap_buffer() was dropping the
jh without enough locks that we saw a problem there.

So the scenario:

__journal_unfile_buffer(jh);
journal_remove_journal_head(bh);

*might* be dangerous, if called carelessly; but in practice it works out
OK.  Remember, that journal_remove_journal_head() call still takes the
bh_journal_head lock and still checks b_transaction with that held.

I think it's time I went and worked out *why* it works out OK, though,
so that we can write it down and make sure it stays working!  And we may
well be able to simplify the locking when we do this; collapsing the two
bh state locks, for example, may help improve the robustness as well as
improving performance through removing some redundant locking
operations.

Fortunately, there are really only three classes of operation that can
remove a jh.  There are the normal VFS/VM operations, like writepage,
try_to_free_buffer, etc; these are all called with the page lock.  There
are metadata updates, which take a liberal dose of buffer, bh_state and
journal locks.  

Finally there are the commit/checkpoint functions, which run
asynchronously to the rest of the journaling and which don't relate to
the current transaction, but to previous ones.  This is the danger area,
I think.  But as long as at least the bh_state lock is held, we can
protect these operations against the data/metadata update operations.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext3/jbd race: releasing in-use journal_heads

2005-03-08 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-03-07 at 21:08, Stephen C. Tweedie wrote:

> Right, that was what I was thinking might be possible.  But for now I've
> just done the simple patch --- make sure we don't clear
> jh->b_transaction when we're just refiling buffers from one list to
> another.  That should have the desired effect here without dangerous
> messing around with locks.

And here it is; it has had about 3 hours' testing in conjunction with
the O_DIRECT livelock fix so far, running fsx and fsstress in parallel. 
This one is higher-risk than the previous simple fix.  However, it has
other advantages: the original patch closed the race completely in the
one place I know for sure it was showing up, but this version removes
the opportunity for that particular race in the first place by avoiding
temporary jh->b_transaction=NULL conditions completely during temporary
list refiling.

I'll try to get some targeted testing done by the people who were able
to reproduce this in the first place, too.

--Stephen

Fix destruction of in-use journal_head

journal_put_journal_head() can destroy a journal_head at any time as
long as the jh's b_jcount is zero and b_transaction is NULL.  It has no
locking protection against the rest of the journaling code, as the lock
it uses to protect b_jcount and bh->b_private is not used elsewhere in
jbd.

However, there are small windows where b_transaction is getting set
temporarily to NULL during normal operations; typically this is
happening in 

__journal_unfile_buffer(jh);
__journal_file_buffer(jh, ...);

call pairs, as __journal_unfile_buffer() will set b_transaction to NULL
and __journal_file_buffer() re-sets it afterwards.  A truncate running
in parallel can lead to journal_unmap_buffer() destroying the jh if it
occurs between these two calls.
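The window can be modeled in miniature (a toy sketch, not jbd code; the struct and helper names are invented for illustration). The unfile/file pair briefly leaves b_transaction NULL, which is exactly the condition a concurrent journal_unmap_buffer() uses to decide the jh is free to destroy; a "temporary unlink" variant that leaves b_transaction set keeps that window closed:

```c
/* Toy model of the race window: destroying a jh is permitted only when
 * b_transaction is NULL, so any helper that clears b_transaction as a
 * transient side effect of refiling opens the window. */
#include <assert.h>
#include <stddef.h>

struct toy_jh { void *b_transaction; };

/* models __journal_unfile_buffer(): clears b_transaction */
static void toy_unfile(struct toy_jh *jh)
{
    jh->b_transaction = NULL;
}

/* models the temp-unlink variant: unlinks but leaves b_transaction */
static void toy_temp_unlink(struct toy_jh *jh)
{
    (void)jh;  /* list manipulation elided; b_transaction untouched */
}

/* models journal_unmap_buffer()'s destroy-eligibility test */
static int toy_may_destroy(const struct toy_jh *jh)
{
    return jh->b_transaction == NULL;
}
```

After toy_unfile() a concurrent truncate would see a destroyable jh; after toy_temp_unlink() it would not.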

Fix this by adding a variant of __journal_unfile_buffer() which is only
used for these temporary jh unlinks, and which leaves the b_transaction
field intact so that we never leave a window open where b_transaction is
NULL.

Additionally, trap this error if it does occur, by checking against
jh->b_jlist being non-null when we destroy a jh.

Signed-off-by: Stephen Tweedie <[EMAIL PROTECTED]>

--- linux-2.6-ext3/fs/jbd/commit.c.=K0003=.orig
+++ linux-2.6-ext3/fs/jbd/commit.c
@@ -258,7 +258,7 @@ write_out_data:
BUFFER_TRACE(bh, "locked");
if (!inverted_lock(journal, bh))
goto write_out_data;
-   __journal_unfile_buffer(jh);
+   __journal_temp_unlink_buffer(jh);
__journal_file_buffer(jh, commit_transaction,
BJ_Locked);
jbd_unlock_bh_state(bh);
--- linux-2.6-ext3/fs/jbd/journal.c.=K0003=.orig
+++ linux-2.6-ext3/fs/jbd/journal.c
@@ -1767,6 +1767,7 @@ static void __journal_remove_journal_hea
if (jh->b_transaction == NULL &&
jh->b_next_transaction == NULL &&
jh->b_cp_transaction == NULL) {
+   J_ASSERT_JH(jh, jh->b_jlist == BJ_None);
J_ASSERT_BH(bh, buffer_jbd(bh));
J_ASSERT_BH(bh, jh2bh(jh) == bh);
BUFFER_TRACE(bh, "remove journal_head");
--- linux-2.6-ext3/fs/jbd/transaction.c.=K0003=.orig
+++ linux-2.6-ext3/fs/jbd/transaction.c
@@ -1044,7 +1044,12 @@ int journal_dirty_data(handle_t *handle,
/* journal_clean_data_list() may have got there first */
if (jh->b_transaction != NULL) {
JBUFFER_TRACE(jh, "unfile from commit");
-   __journal_unfile_buffer(jh);
+   __journal_temp_unlink_buffer(jh);
+   /* It still points to the committing
+* transaction; move it to this one so
+* that the refile assert checks are
+* happy. */
+   jh->b_transaction = handle->h_transaction;
}
/* The buffer will be refiled below */
 
@@ -1058,7 +1063,8 @@ int journal_dirty_data(handle_t *handle,
if (jh->b_jlist != BJ_SyncData && jh->b_jlist != BJ_Locked) {
JBUFFER_TRACE(jh, "not on correct data list: unfile");
J_ASSERT_JH(jh, jh->b_jlist != BJ_Shadow);
-   __journal_unfile_buffer(jh);
+   __journal_temp_unlink_buffer(jh);
+   jh->b_transaction = handle->h_transaction;
JBUFFER_TRACE(jh, "file as data");
__journal_file_buffer(jh, handle->h_transaction,
BJ_SyncData);
@@ -1233,8 +1239,6 @@ int journal_forget (handle_t *handle, st

Re: Linux 2.6.11-ac1

2005-03-08 Thread Stephen C. Tweedie
Hi,

On Tue, 2005-03-08 at 06:49, Chris Wright wrote:

> Yes, we are intending to pick up bits from -ac (you might have missed
> that in another thread).

There's actually a successor patch to that which I'm just about to get
feedback on here and on ext2-devel.  It's higher-risk than the one Alan
has already picked up, but also hopefully closes the race much more
tightly throughout the whole of the jbd layer.

--Stephen


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] invalidate/o_direct livelock {was Re: [RFC] ext3/jbd race: releasing in-use journal_heads}

2005-03-08 Thread Stephen C. Tweedie
Hi,

On Tue, 2005-03-08 at 09:28, Stephen C. Tweedie wrote:

> I think it should be OK just to move the page->mapping != mapping test
> above the page->index > end test.  Sure, if all the pages have been
> stolen by the time we see them, then we'll repeat without advancing
> "next"; but we're still making progress in that case because pages that
> were previously in this mapping *have* been removed, just by a different
> process.  If we're really concerned about that case we can play the
> trick that invalidate_mapping_pages() tries and do a "next++" in that
> case.

Variant on akpm's last patch with this change, plus one extra
s/page->index/page_index/ that he missed.  I've moved the whole
next=page_index+1 and wrapping check up to just after we've tested the
mapping, so even if we break early all of that updating has already been
done reliably and we don't end up setting next in two places in the
code.

A 400-task fsstress used to livelock reliably within minutes inside
O_DIRECT without any fix; this patch has been running for over an hour
now without problem.

--Stephen

Fix livelock in invalidate_inode_pages2_range

invalidate_inode_pages2_range can loop indefinitely if pagevec_lookup
returns only pages beyond the end of the range being invalidated.  Fix
it by advancing the restart pointer, "next", every time we see a page on
the correct mapping, not just if the page is within the range; and break
out early when we encounter such a page.  Based on a proposed fix by
akpm.

Signed-off-by: Stephen Tweedie <[EMAIL PROTECTED]>


--- linux-2.6-ext3/mm/truncate.c.=K0002=.orig
+++ linux-2.6-ext3/mm/truncate.c
@@ -268,25 +268,31 @@ int invalidate_inode_pages2_range(struct
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
+   pgoff_t page_index;
int was_dirty;
 
lock_page(page);
-   if (page->mapping != mapping || page->index > end) {
+   if (page->mapping != mapping) {
unlock_page(page);
continue;
}
-   wait_on_page_writeback(page);
-   next = page->index + 1;
+   page_index = page->index;
+   next = page_index + 1;
if (next == 0)
wrapped = 1;
+   if (page_index > end) {
+   unlock_page(page);
+   break;
+   }
+   wait_on_page_writeback(page);
while (page_mapped(page)) {
if (!did_range_unmap) {
/*
 * Zap the rest of the file in one hit.
 */
unmap_mapping_range(mapping,
-   page->index << PAGE_CACHE_SHIFT,
-   (end - page->index + 1)
+   page_index << PAGE_CACHE_SHIFT,
+   (end - page_index + 1)
<< PAGE_CACHE_SHIFT,
0);
did_range_unmap = 1;
@@ -295,7 +301,7 @@ int invalidate_inode_pages2_range(struct
 * Just zap this page
 */
unmap_mapping_range(mapping,
- page->index << PAGE_CACHE_SHIFT,
+ page_index << PAGE_CACHE_SHIFT,
  PAGE_CACHE_SIZE, 0);
}
}


Re: [RFC] ext3/jbd race: releasing in-use journal_heads

2005-03-08 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-03-07 at 23:50, Andrew Morton wrote:

> truncate_inode_pages_range() seems to dtrt here.  Can we do it in the same
> manner in invalidate_inode_pages2_range()?
> 
> 
> Something like:

> - if (page->mapping != mapping || page->index > end) {
> + page_index = page->index;
> + if (page_index > end) {
> + next = page_index;
> + unlock_page(page);
> + break;
> + }
> + if (page->mapping != mapping) {
>   unlock_page(page);
>   continue;
>   }

Yes, breaking early seems fine for this.  But don't we need to test
page->mapping == mapping again with the page lock held before we can
reliably break on page_index?

I think it should be OK just to move the page->mapping != mapping test
above the page->index > end test.  Sure, if all the pages have been
stolen by the time we see them, then we'll repeat without advancing
"next"; but we're still making progress in that case because pages that
were previously in this mapping *have* been removed, just by a different
process.  If we're really concerned about that case we can play the
trick that invalidate_mapping_pages() tries and do a "next++" in that
case.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext3/jbd race: releasing in-use journal_heads

2005-03-07 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-03-07 at 21:22, Stephen C. Tweedie wrote:

> altgr-scrlck is showing a range of EIPs all in ext3_direct_IO->
> invalidate_inode_pages2_range().  I'm seeing
> 
> invalidate_inode_pages2_range()->pagevec_lookup()->find_get_pages()

In invalidate_inode_pages2_range(), what happens if we lookup a pagevec,
get a bunch of pages back, but all the pages in the vec are beyond the
end of the range we want?

That's quite possible: pagevec_lookup() gets told how many pages we
want, but not the index of the end of the range.

The loop over the pages in the pagevec contains this:

lock_page(page);
if (page->mapping != mapping || page->index > end) {
unlock_page(page);
continue;
}
wait_on_page_writeback(page);
next = page->index + 1;

Now, if all of the pages have page->index > end, we'll skip them all...
but we'll do it before we've advanced "next" to indicate that we've made
progress.  The next iteration is just going to start in the same place. 

Truncate always invalidates to EOF, which is probably why we haven't
seen this before.  But O_DIRECT is doing limited range cache
invalidates, and is getting stuck here pretty repeatably.
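The failure can be shown in miniature (a toy sketch, not mm/truncate.c): if every page a batch lookup returns lies beyond 'end', the unfixed loop skips them all without advancing 'next', so the next lookup restarts at the same index forever. Advancing 'next' for every page that still belongs to the mapping guarantees forward progress:

```c
/* Toy model of the invalidate_inode_pages2_range() livelock.  Given the
 * page indices returned by one batch lookup, compute the restart index
 * under the broken logic (advance only for in-range pages) versus the
 * fixed logic (advance for every page seen, break once past 'end'). */
#include <assert.h>
#include <stddef.h>

typedef unsigned long pgoff_t;

static pgoff_t scan_batch(const pgoff_t *pages, size_t n, pgoff_t end,
                          pgoff_t next, int fixed)
{
    for (size_t i = 0; i < n; i++) {
        if (fixed)
            next = pages[i] + 1;   /* progress recorded for every page */
        if (pages[i] > end) {
            if (fixed)
                break;             /* past the range: stop early */
            continue;              /* unfixed: skip, 'next' unchanged */
        }
        if (!fixed)
            next = pages[i] + 1;   /* unfixed: advance only in range */
    }
    return next;
}
```

With a batch of pages all beyond the range, the unfixed variant returns the starting index unchanged, so the caller loops on the same lookup indefinitely.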

I'm currently testing the small patch below; it seems to be running OK
so far, although it hasn't been going for long yet.  I'll run a longer
test in the morning.

--Stephen

--- linux-2.6-ext3/mm/truncate.c.=K0002=.orig
+++ linux-2.6-ext3/mm/truncate.c
@@ -271,12 +271,13 @@ int invalidate_inode_pages2_range(struct
int was_dirty;
 
lock_page(page);
+   if (page->mapping == mapping)
+   next = page->index + 1;
if (page->mapping != mapping || page->index > end) {
unlock_page(page);
continue;
}
wait_on_page_writeback(page);
-   next = page->index + 1;
if (next == 0)
wrapped = 1;
while (page_mapped(page)) {


Re: [RFC] ext3/jbd race: releasing in-use journal_heads

2005-03-07 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-03-07 at 21:11, Andrew Morton wrote:

> > I'm having trouble testing it, though --- I seem to be getting livelocks
> >  in O_DIRECT running 400 fsstress processes in parallel; ring any bells? 
> 
> Nope.  I don't think anyone has been that cruel to ext3 for a while.
> I assume this workload used to succeed?

I think so.  I remember getting a lockup a couple of months ago that I
couldn't get to the bottom of and that looked very similar, so it may
just be a matter of it being much more reproducible now.  But I seem to
be able to livelock the directIO bits in minutes now, quite reliably. 
I'll try to narrow things down.

altgr-scrlck is showing a range of EIPs all in ext3_direct_IO->
invalidate_inode_pages2_range().  I'm seeing

invalidate_inode_pages2_range()->pagevec_lookup()->find_get_pages()

as by far the most common trace, but also 

invalidate_inode_pages2_range()->__might_sleep()
invalidate_inode_pages2_range()->unlock_page()
and a few other similar paths.

invalidate_inode_pages2_range() does do a cond_resched(), so the machine
carries on despite having 316 fsstress tasks in system-mode R state at
the moment. :-)  The scheduler is doing a rather impressive job.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext3/jbd race: releasing in-use journal_heads

2005-03-07 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-03-07 at 20:31, Andrew Morton wrote:

> jbd_lock_bh_journal_head() is supposed to be a
> finegrained innermost lock whose mandate is purely for atomicity of adding
> and removing the journal_head and the b_jcount refcounting.  I don't recall
> there being any deeper meaning than that.

> It could be that we can optimise jbd_lock_bh_journal_head() away, as you
> mention.  If we have an assertion in there to check that
> jbd_lock_bh_state() is held...

Right, that was what I was thinking might be possible.  But for now I've
just done the simple patch --- make sure we don't clear
jh->b_transaction when we're just refiling buffers from one list to
another.  That should have the desired effect here without dangerous
messing around with locks.

I'm having trouble testing it, though --- I seem to be getting livelocks
in O_DIRECT running 400 fsstress processes in parallel; ring any bells? 
I get it with today's bk tree without any ext3 changes too, so at least
it's not a regression.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

