Re: [RFC] Parallelize IO for e2fsck

2008-01-17 Thread Valerie Henson
On Jan 17, 2008 5:15 PM, David Chinner <[EMAIL PROTECTED]> wrote:
> On Wed, Jan 16, 2008 at 01:30:43PM -0800, Valerie Henson wrote:
> > Hi y'all,
> >
> > This is a request for comments on the rewrite of the e2fsck IO
> > parallelization patches I sent out a few months ago.  The mechanism is
> > totally different.  Previously IO was parallelized by issuing IOs from
> > multiple threads; now a single thread issues fadvise(WILLNEED) and
> > then uses read() to complete the IO.
>
> Interesting.
>
> We ultimately rejected a similar patch to xfs_repair (pre-population
> the kernel block device cache) mainly because of low memory
> performance issues and it doesn't really enable you to do anything
> particularly smart with optimising I/O patterns for larger, high
> performance RAID arrays.
>
> The low memory problems were particularly bad; the readahead
> thrashing cause a slowdown of 2-3x compared to the baseline and
> often it was due to the repair process requiring all of memory
> to cache stuff it would need later. IIRC, multi-terabyte ext3
> filesystems have similar memory usage problems to XFS, so there's
> a good chance that this patch will see the same sorts of issues.

That was one of my first concerns - how to avoid overflowing memory?
Whenever I screw it up on e2fsck, it does go, oh, 2 times slower due
to the minor detail of every single block being read from disk twice.
:)

I have a partial solution that sort of blindly manages the buffer
cache.  First, the user passes e2fsck a parameter saying how much
memory is available as buffer cache.  The readahead thread reads
things in and immediately throws them away so they are only in buffer
cache (no double-caching).  Then readahead and e2fsck work together so
that readahead only reads in new blocks when the main thread is done
with earlier blocks.  The already-used blocks get kicked out of buffer
cache to make room for the new ones.

What would be nice is to take into account the current total memory
usage of the whole fsck process and factor that in.  I don't think it
would be hard to add to the existing cache management framework.
Thoughts?

> Promising results, though

Thanks!  It's solving a rather simpler problem than XFS check/repair. :)

-VAL
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Parallelize IO for e2fsck

2008-01-17 Thread David Chinner
On Wed, Jan 16, 2008 at 01:30:43PM -0800, Valerie Henson wrote:
> Hi y'all,
> 
> This is a request for comments on the rewrite of the e2fsck IO
> parallelization patches I sent out a few months ago.  The mechanism is
> totally different.  Previously IO was parallelized by issuing IOs from
> multiple threads; now a single thread issues fadvise(WILLNEED) and
> then uses read() to complete the IO.

Interesting.

We ultimately rejected a similar patch to xfs_repair (pre-population
the kernel block device cache) mainly because of low memory
performance issues and it doesn't really enable you to do anything
particularly smart with optimising I/O patterns for larger, high
performance RAID arrays.

The low memory problems were particularly bad; the readahead
thrashing cause a slowdown of 2-3x compared to the baseline and
often it was due to the repair process requiring all of memory
to cache stuff it would need later. IIRC, multi-terabyte ext3
filesystems have similar memory usage problems to XFS, so there's
a good chance that this patch will see the same sorts of issues.

> Single disk performance doesn't change, but elapsed time drops by
> about 50% on a big RAID-5 box.  Passes 1 and 2 are parallelized.  Pass
> 5 is left as an exercise for the reader.

Promising results, though

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Convert ext4_ioctl to an unlocked_ioctl

2008-01-17 Thread Mathieu SEGAUD
Vous m'avez dit récemment :

as well for this one,

> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -17,12 +17,18 @@
>  #include 
>  #include 
>  
> -int ext4_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
> - unsigned long arg)
> +long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>  {
> - struct ext4_inode_info *ei = EXT4_I(inode);
> + struct ext4_inode_info *ei;
> + struct inode *inode;
>   unsigned int flags;
>   unsigned short rsv_window_size;
> + long retval = 0;
> +
> + lock_kernel();
> +
> + inode = filp->f_dentry->d_inode;

this should be inode = filp->f_path.dentry->d_inode;

I will update this one too, sorry for this one as well :-)

-- 
Mathieu
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Convert ext3_ioctl() to an unlocked_ioctl

2008-01-17 Thread Mathieu SEGAUD
Vous m'avez dit récemment :

> --- a/fs/ext3/ioctl.c
> +++ b/fs/ext3/ioctl.c
> @@ -17,12 +17,19 @@
>  #include 
>  #include 
>  
> -int ext3_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
> +long ext3_ioctl(struct file *filp, unsigned int cmd,
>   unsigned long arg)
>  {
> - struct ext3_inode_info *ei = EXT3_I(inode);
> + struct ext3_inode_info *ei;
> + struct inode *inode;
>   unsigned int flags;
>   unsigned short rsv_window_size;
> + long retval = 0;
> +
> + lock_kernel();
> +
> + inode = filp->f_dentry->d_inode;

I guess this should be
inode = filp->f_path.dentry->d_inode;


I will repost a fixed patch
sorry for this one.


> + ei = EXT3_I(inode);
>  
>   ext3_debug ("cmd = %u, arg = %lu\n", cmd, arg);
>  

-- 
Mathieu
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Mel Gorman
On (17/01/08 13:50), Martin Knoblauch didst pronounce:
> > 
> 
> The effect  is  defintely  depending on  the  IO  hardware. I performed the 
> same tests
> on a different box with an AACRAID controller and there things look different.

I take it different also means it does not show this odd performance
behaviour and is similar whether the patch is applied or not?

> Basically
> the "offending" commit helps seingle stream performance on that box, while 
> dual/triple
> stream are not affected. So I suspect that the CCISS is just not behaving 
> well.
> 
> And yes, the tests are usually done on a freshly booted box. Of course, I 
> repeat them
> a few times. On the CCISS box the numbers are very constant. On the AACRAID 
> box
> they vary quite a bit.
> 
>  I can certainly stress the box before doing the tests. Please define "many" 
> for the kernel
> compiles :-)
> 

With 8GiB of RAM, try making 24 copies of the kernel and compiling them
all simultaneously. Running that for for 20-30 minutes should be enough to
randomise the freelists affecting what color of page is used for the dd test.

> > >  OK, the change happened between rc5 and rc6. Just following a
> > > gut feeling, I reverted
> > > 
> > > #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> > > #Author: Mel Gorman 
> > > #Date:   Mon Dec 17 16:20:05 2007 -0800
> > > #
> 
> > > 
> > > This has brought back the good results I observed and reported.
> > > I do not know what to make out of this. At least on the systems
> > > I care about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory,
> > > SmartaArray6i controller with 4x72GB SCSI disks as RAID5 (battery
> > > protected writeback cache enabled) and gigabit networking (tg3)) this
> > > optimisation is a dissaster.
> > > 
> > 
> > That patch was not an optimisation, it was a regression fix
> > against 2.6.23 and I don't believe reverting it is an option. Other IO
> > hardware benefits from having the allocator supply pages in PFN order.
> 
>  I think this late in the 2.6.24 game we just should leave things as they 
> are. But
> we should try to find a way to make CCISS faster, as it apparently can be 
> faster.
> 
> > Your controller would seem to suffer when presented with the same situation
> > but I don't know why that is. I've added James to the cc in case he has 
> > seen this
> > sort of situation before.
> > 
> > > On the other hand, it is not a regression against 2.6.22/23. Those
> > 
>  > had bad IO scaling to. It would just be a shame to loose an apparently
> > 
>  > great performance win.
> > 
> > Could you try running your tests again when the system has been
> > stressed with some other workload first?
> > 
> 
>  Will do.
> 

Thanks

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Mel Gorman <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter 
> Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; [EMAIL PROTECTED]; "linux-ext4@vger.kernel.org" 
> ; Linus Torvalds <[EMAIL PROTECTED]>; [EMAIL 
> PROTECTED]
> Sent: Thursday, January 17, 2008 9:23:57 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On (17/01/08 09:44), Martin Knoblauch didst pronounce:
> > > > > > > > On Wed, Jan 16, 2008 at 01:26:41AM -0800,
> Martin
> 
 Knoblauch wrote:
> > > > > > > For those interested in using your writeback
> improvements
> 
 in
> > > > > > > production sooner rather than later (primarily with
> ext3);
> 
 what
> > > > > > > recommendations do you have?  Just heavily test our
> own
> 
 2.6.24
> > > > > > > evolving "close, but not ready for merge" -mm
> writeback
> 
 patchset?
> > > > > > > 
> > > > > > 
> > > > > >  I can add myself to Mikes question. It would be good to
> know
> 
 a
> > > > > 
> > > > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so
> far
> 
 has
> > > > > been showing quite nice improvement of the overall
> writeback
> 
 situation and
> > > > > it would be sad to see this [partially] gone in 2.6.24-final.
> > > > > Linus apparently already has reverted  "...2250b". I
> will
> 
 definitely
> > > > > repeat my tests  with -rc8. and report.
> > > > > 
> > > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > > > Maybe we can push it to 2.6.24 after your testing.
> > > > 
> > > Hi Fengguang,
> > > 
> > > something really bad has happened between -rc3 and -rc6.
> > > Embarrassingly I did not catch that earlier :-(
> > > Compared to the numbers I posted in
> > > http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec
> > > (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24.
> > > The only test that is still good is mix3, which I attribute to
> > > the per-BDI stuff.
> 
> I suspect that the IO hardware you have is very sensitive to the
> color of the physical page. I wonder, do you boot the system cleanly
> and then run these tests? If so, it would be interesting to know what
> happens if you stress the system first (many kernel compiles for example,
> basically anything that would use a lot of memory in different ways for some
> time) to randomise the free lists a bit and then run your test. You'd need to 
> run
> the test three times for 2.6.23, 2.6.24-rc8 and 2.6.24-rc8 with the patch you
> identified reverted.
>

 The effect  is  defintely  depending on  the  IO  hardware. I performed the 
same tests
on a different box with an AACRAID controller and there things look different. 
Basically
the "offending" commit helps seingle stream performance on that box, while 
dual/triple
stream are not affected. So I suspect that the CCISS is just not behaving well.

 And yes, the tests are usually done on a freshly booted box. Of course, I 
repeat them
a few times. On the CCISS box the numbers are very constant. On the AACRAID box
they vary quite a bit.

 I can certainly stress the box before doing the tests. Please define "many" 
for the kernel
compiles :-)

> > 
> >  OK, the change happened between rc5 and rc6. Just following a
> > gut feeling, I reverted
> > 
> > #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> > #Author: Mel Gorman 
> > #Date:   Mon Dec 17 16:20:05 2007 -0800
> > #

> > 
> > This has brought back the good results I observed and reported.
> > I do not know what to make out of this. At least on the systems
> > I care about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory,
> > SmartaArray6i controller with 4x72GB SCSI disks as RAID5 (battery
> > protected writeback cache enabled) and gigabit networking (tg3)) this
> > optimisation is a dissaster.
> > 
> 
> That patch was not an optimisation, it was a regression fix
> against 2.6.23 and I don't believe reverting it is an option. Other IO
> hardware benefits from having the allocator supply pages in PFN order.

 I think this late in the 2.6.24 game we just should leave things as they are. 
But
we should try to find a way to make CCISS faster, as it apparently can be 
faster.

> Your controller would seem to suffer when presented with the same situation
> but I don't know why that is. I've added James to the cc in case he has seen 
> this
> sort of situation before.
> 
> > On the other hand, it is not a regression against 2.6.22/23. Those
> 
 > had bad IO scaling to. It would just be a shame to loose an apparently
> 
 > great performance win.
> 
> Could you try running your tests again when the system has been
> stressed with some other workload first?
> 

 Will do.

Cheers
Martin



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Mel Gorman
On (17/01/08 09:44), Martin Knoblauch didst pronounce:
> > > > > > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > > > For those interested in using your writeback improvements in
> > > > > > production sooner rather than later (primarily with ext3); what
> > > > > > recommendations do you have?  Just heavily test our own 2.6.24
> > > > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > > > > 
> > > > > 
> > > > >  I can add myself to Mikes question. It would be good to know a
> > > > 
> > > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > > > been showing quite nice improvement of the overall writeback situation 
> > > > and
> > > > it would be sad to see this [partially] gone in 2.6.24-final.
> > > > Linus apparently already has reverted  "...2250b". I will definitely
> > > > repeat my tests  with -rc8. and report.
> > > > 
> > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > > Maybe we can push it to 2.6.24 after your testing.
> > > 
> > Hi Fengguang,
> > 
> > something really bad has happened between -rc3 and -rc6.
> > Embarrassingly I did not catch that earlier :-(
> > Compared to the numbers I posted in
> > http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec
> > (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24.
> > The only test that is still good is mix3, which I attribute to
> > the per-BDI stuff.

I suspect that the IO hardware you have is very sensitive to the color of the
physical page. I wonder, do you boot the system cleanly and then run these
tests? If so, it would be interesting to know what happens if you stress
the system first (many kernel compiles for example, basically anything that
would use a lot of memory in different ways for some time) to randomise the
free lists a bit and then run your test. You'd need to run the test
three times for 2.6.23, 2.6.24-rc8 and 2.6.24-rc8 with the patch you
identified reverted.

> > At the moment I am frantically trying to find when things went down. I
> > did run -rc8 and rc8+yourpatch. No difference to what I see with -rc6.
> > 
> > Sorry that I cannot provide any input to your patch.
> 
>  OK, the change happened between rc5 and rc6. Just following a gut feeling, I 
> reverted
> 
> #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> #Author: Mel Gorman <[EMAIL PROTECTED]>
> #Date:   Mon Dec 17 16:20:05 2007 -0800
> #
> #mm: fix page allocation for larger I/O segments
> #
> #In some cases the IO subsystem is able to merge requests if the pages are
> #adjacent in physical memory.  This was achieved in the allocator by 
> having
> #expand() return pages in physically contiguous order in situations were a
> #large buddy was split.  However, list-based anti-fragmentation changed 
> the
> #order pages were returned in to avoid searching in buffered_rmqueue() 
> for a
> #page of the appropriate migrate type.
> #
> #This patch restores behaviour of rmqueue_bulk() preserving the physical
> #order of pages returned by the allocator without incurring increased 
> search
> #costs for anti-fragmentation.
> #
> #Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
> #Cc: James Bottomley <[EMAIL PROTECTED]>
> #Cc: Jens Axboe <[EMAIL PROTECTED]>
> #Cc: Mark Lord <[EMAIL PROTECTED]
> #Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> #Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
> diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
> --- linux-2.6.24-rc5/mm/page_alloc.c2007-12-21 04:14:11.305633890 +
> +++ linux-2.6.24-rc6/mm/page_alloc.c2007-12-21 04:14:17.746985697 +
> @@ -847,8 +847,19 @@
> struct page *page = __rmqueue(zone, order, migratetype);
> if (unlikely(page == NULL))
> break;
> +
> +   /*
> +* Split buddy pages returned by expand() are received here
> +* in physical page order. The page is added to the callers 
> and
> +* list and the list head then moves forward. From the callers
> +* perspective, the linked list is ordered by page number in
> +* some conditions. This is useful for IO devices that can
> +* merge IO requests if the physical pages are ordered
> +* properly.
> +*/
> list_add(&page->lru, list);
> set_page_private(page, migratetype);
> +   list = &page->lru;
> }
> spin_unlock(&zone->lock);
> return i;
> 
> This has brought back the good results I observed and reported.
> I do not know what to make out of this. At least on the systems I care
> about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory, SmartaArray6i
> controller with 4x72GB SCSI disks as RAID5 (battery protected writeback
> cache enabled) and gigabit networking (tg3)) this optimisation is a

Re: [PATCH] Fix oops in mballoc caused by a variable overflow

2008-01-17 Thread Mingming Cao
On Thu, 2008-01-17 at 21:59 +0530, Aneesh Kumar K.V wrote:
> On Thu, Jan 17, 2008 at 02:09:41PM +0100, Valerie Clement wrote:
> > Aneesh Kumar K.V wrote:
> >> On Thu, Jan 17, 2008 at 10:43:40AM +0100, Valerie Clement wrote:
> >>> Aneesh Kumar K.V wrote:
>  What about this  ? I guess we will overflow start = start << bsbits;
> 
> >>> Hi Aneesh,
> >>> your patch below doesn't fix the issue, because as start_off is also  
> >>> loff_t, start_off = ac->ac_o_ex.fe_logical << bsbits  also overflows.
> >>>
> >>
> >> loff_t is 64 bits.
> >>
> >> typedef __kernel_loff_t loff_t;
> >> typedef long long   __kernel_loff_t;
> >> typedef __u32 ext4_lblk_t;
> >> typedef unsigned long long ext4_fsblk_t
> >>
> >> start_off = ac->ac_o_ex.fe_logical << bsbits;
> >>
> >> In the above line what we are storing in start_off is the offset in 
> >> bytes.So it makes
> >> sense to use the type loff_t. It is neither logical block nor physical 
> >> block.
> >
> > Oh yes, sorry, you're right. I read too quickly.
> >
> > In fact, it's missing a cast :
> >   start_off = (loff_t) ac->ac_o_ex.fe_logical << bsbits;
> >
> > With that change, the test is ok.
> 
> Updated patch below.
> 
Thanks, folded to the mballoc-core patch

Mingming
> -aneesh

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4]JBD2: user of the jiffies rounding code

2008-01-17 Thread Mingming Cao
Ported from JBD changes from Arjan van de Ven <[EMAIL PROTECTED]>
---
Date: Sun, 10 Dec 2006 10:21:26 + (-0800)
Subject: [PATCH] user of the jiffies rounding code: JBD
X-Git-Tag: v2.6.20-rc1~15^2~43
X-Git-Url: 
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=44d306e1508fef6fa7a6eb15a1aba86ef68389a6

[PATCH] user of the jiffies rounding code: JBD

This patch introduces a user: of the round_jiffies() function; the "5 second"
ext3/jbd wakeup.

While "every 5 seconds" doesn't sound as a problem, there can be many of these
(and these timers do add up over all the kernel).  The "5 second" wakeup isn't
really timing sensitive; in addition even with rounding it'll still happen
every 5 seconds (with the exception of the very first time, which is likely to
be rounded up to somewhere closer to 6 seconds)

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---

---
 fs/jbd2/transaction.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.24-rc7/fs/jbd2/transaction.c
===
--- linux-2.6.24-rc7.orig/fs/jbd2/transaction.c 2008-01-16 15:41:14.0 
-0800
+++ linux-2.6.24-rc7/fs/jbd2/transaction.c  2008-01-16 17:49:48.0 
-0800
@@ -54,7 +54,7 @@ jbd2_get_transaction(journal_t *journal,
spin_lock_init(&transaction->t_handle_lock);
 
/* Set up the commit timer for the new transaction. */
-   journal->j_commit_timer.expires = transaction->t_expires;
+   journal->j_commit_timer.expires = round_jiffies(transaction->t_expires);
add_timer(&journal->j_commit_timer);
 
J_ASSERT(journal->j_running_transaction == NULL);

 



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4]JBD2: sparse pointer use of zero as null

2008-01-17 Thread Mingming Cao
Ported from upstream jbd changes to jbd2

sparse pointer use of zero as null

Get rid of sparse related warnings from places that use integer as NULL
pointer.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd2/transaction.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux-2.6.24-rc7/fs/jbd2/transaction.c
===
--- linux-2.6.24-rc7.orig/fs/jbd2/transaction.c 2008-01-16 17:49:48.0 
-0800
+++ linux-2.6.24-rc7/fs/jbd2/transaction.c  2008-01-16 18:06:33.0 
-0800
@@ -1182,7 +1182,7 @@ int jbd2_journal_dirty_metadata(handle_t
}
 
/* That test should have eliminated the following case: */
-   J_ASSERT_JH(jh, jh->b_frozen_data == 0);
+   J_ASSERT_JH(jh, jh->b_frozen_data == NULL);
 
JBUFFER_TRACE(jh, "file as BJ_Metadata");
spin_lock(&journal->j_list_lock);
@@ -1532,7 +1532,7 @@ void __jbd2_journal_temp_unlink_buffer(s
 
J_ASSERT_JH(jh, jh->b_jlist < BJ_Types);
if (jh->b_jlist != BJ_None)
-   J_ASSERT_JH(jh, transaction != 0);
+   J_ASSERT_JH(jh, transaction != NULL);
 
switch (jh->b_jlist) {
case BJ_None:
@@ -1601,11 +1601,11 @@ __journal_try_to_free_buffer(journal_t *
if (buffer_locked(bh) || buffer_dirty(bh))
goto out;
 
-   if (jh->b_next_transaction != 0)
+   if (jh->b_next_transaction != NULL)
goto out;
 
spin_lock(&journal->j_list_lock);
-   if (jh->b_transaction != 0 && jh->b_cp_transaction == 0) {
+   if (jh->b_transaction != NULL && jh->b_cp_transaction == NULL) {
if (jh->b_jlist == BJ_SyncData || jh->b_jlist == BJ_Locked) {
/* A written-back ordered data buffer */
JBUFFER_TRACE(jh, "release data");
@@ -1613,7 +1613,7 @@ __journal_try_to_free_buffer(journal_t *
jbd2_journal_remove_journal_head(bh);
__brelse(bh);
}
-   } else if (jh->b_cp_transaction != 0 && jh->b_transaction == 0) {
+   } else if (jh->b_cp_transaction != NULL && jh->b_transaction == NULL) {
/* written-back checkpointed metadata buffer */
if (jh->b_jlist == BJ_None) {
JBUFFER_TRACE(jh, "remove from checkpoint list");
@@ -1973,7 +1973,7 @@ void __jbd2_journal_file_buffer(struct j
 
J_ASSERT_JH(jh, jh->b_jlist < BJ_Types);
J_ASSERT_JH(jh, jh->b_transaction == transaction ||
-   jh->b_transaction == 0);
+   jh->b_transaction == NULL);
 
if (jh->b_transaction && jh->b_jlist == jlist)
return;

 



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4]JBD2: Group short-lived and reclaimable kernel allocations

2008-01-17 Thread Mingming Cao
JBD2: Group short-lived and reclaimable kernel allocations
From: Mingming Cao <[EMAIL PROTECTED]>
Ported from JBD to JBD2


From: Mel Gorman <[EMAIL PROTECTED]>
Date: Tue, 16 Oct 2007 08:25:52 + (-0700)
Subject: Group short-lived and reclaimable kernel allocations
X-Git-Tag: v2.6.24-rc1~1137
X-Git-Url: 
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=e12ba74d8ff3e2f73a583500d7095e406df4d093

Group short-lived and reclaimable kernel allocations

This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations.  When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.

This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE.  The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved.  i.e.  they can be migrated by deleting
them and re-reading the information from elsewhere.

Cc: Mel Gorman <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---

---
 fs/jbd2/journal.c |4 ++--
 fs/jbd2/revoke.c  |6 --
 2 files changed, 6 insertions(+), 4 deletions(-)

Index: linux-2.6.24-rc7/fs/jbd2/journal.c
===
--- linux-2.6.24-rc7.orig/fs/jbd2/journal.c 2008-01-16 15:02:40.0 
-0800
+++ linux-2.6.24-rc7/fs/jbd2/journal.c  2008-01-16 17:40:24.0 -0800
@@ -1975,7 +1975,7 @@ static int journal_init_jbd2_journal_hea
jbd2_journal_head_cache = kmem_cache_create("jbd2_journal_head",
sizeof(struct journal_head),
0,  /* offset */
-   0,  /* flags */
+   SLAB_TEMPORARY, /* flags */
NULL);  /* ctor */
retval = 0;
if (jbd2_journal_head_cache == 0) {
@@ -2271,7 +2271,7 @@ static int __init journal_init_handle_ca
jbd2_handle_cache = kmem_cache_create("jbd2_journal_handle",
sizeof(handle_t),
0,  /* offset */
-   0,  /* flags */
+   SLAB_TEMPORARY, /* flags */
NULL);  /* ctor */
if (jbd2_handle_cache == NULL) {
printk(KERN_EMERG "JBD: failed to create handle cache\n");
Index: linux-2.6.24-rc7/fs/jbd2/revoke.c
===
--- linux-2.6.24-rc7.orig/fs/jbd2/revoke.c  2008-01-06 13:45:38.0 
-0800
+++ linux-2.6.24-rc7/fs/jbd2/revoke.c   2008-01-16 17:40:24.0 -0800
@@ -171,13 +171,15 @@ int __init jbd2_journal_init_revoke_cach
 {
jbd2_revoke_record_cache = kmem_cache_create("jbd2_revoke_record",
   sizeof(struct jbd2_revoke_record_s),
-  0, SLAB_HWCACHE_ALIGN, NULL);
+  0,
+  SLAB_HWCACHE_ALIGN|SLAB_TEMPORARY,
+  NULL);
if (jbd2_revoke_record_cache == 0)
return -ENOMEM;
 
jbd2_revoke_table_cache = kmem_cache_create("jbd2_revoke_table",
   sizeof(struct jbd2_revoke_table_s),
-  0, 0, NULL);
+  0, SLAB_TEMPORARY, NULL);
if (jbd2_revoke_table_cache == 0) {
kmem_cache_destroy(jbd2_revoke_record_cache);
jbd2_revoke_record_cache = NULL;

 
 


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4]jbd2: port jbd lockdep support to jbd2

2008-01-17 Thread Mingming Cao
Hi Andrew, Ted,

I walked through the linus's git tree history and found 4 patches should
port from ext3/jbd to ext4/jbd2, since the day ext4 was forked
(2006.10.11) to today. I have already queued the ported patches in ext4
patch queue and verified they seems fine. Here is the first one.



jbd2: port jbd lockdep support to jbd2
> Except lockdep doesn't know about journal_start(), which has ranking
> requirements similar to a semaphore.  

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd2/transaction.c |   11 +++
 include/linux/jbd2.h  |4 
 2 files changed, 15 insertions(+)

Index: linux-2.6.24-rc7/fs/jbd2/transaction.c
===
--- linux-2.6.24-rc7.orig/fs/jbd2/transaction.c 2008-01-16 15:30:24.0 
-0800
+++ linux-2.6.24-rc7/fs/jbd2/transaction.c  2008-01-16 15:41:14.0 
-0800
@@ -241,6 +241,8 @@ out:
return ret;
 }
 
+static struct lock_class_key jbd2_handle_key;
+
 /* Allocate a new handle.  This should probably be in a slab... */
 static handle_t *new_handle(int nblocks)
 {
@@ -251,6 +253,9 @@ static handle_t *new_handle(int nblocks)
handle->h_buffer_credits = nblocks;
handle->h_ref = 1;
 
+   lockdep_init_map(&handle->h_lockdep_map, "jbd2_handle",
+   &jbd2_handle_key, 0);
+
return handle;
 }
 
@@ -293,7 +298,11 @@ handle_t *jbd2_journal_start(journal_t *
jbd2_free_handle(handle);
current->journal_info = NULL;
handle = ERR_PTR(err);
+   goto out;
}
+
+   lock_acquire(&handle->h_lockdep_map, 0, 0, 0, 2, _THIS_IP_);
+out:
return handle;
 }
 
@@ -1419,6 +1428,8 @@ int jbd2_journal_stop(handle_t *handle)
spin_unlock(&journal->j_state_lock);
}
 
+   lock_release(&handle->h_lockdep_map, 1, _THIS_IP_);
+
jbd2_free_handle(handle);
return err;
 }
Index: linux-2.6.24-rc7/include/linux/jbd2.h
===
--- linux-2.6.24-rc7.orig/include/linux/jbd2.h  2008-01-16 15:29:03.0 
-0800
+++ linux-2.6.24-rc7/include/linux/jbd2.h   2008-01-16 15:29:54.0 
-0800
@@ -418,6 +418,10 @@ struct handle_s
unsigned inth_sync: 1;  /* sync-on-close */
unsigned inth_jdata:1;  /* force data journaling */
unsigned inth_aborted:  1;  /* fatal error on handle */
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+   struct lockdep_map  h_lockdep_map;
+#endif
 };
 



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Mike Snitzer <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Fengguang Wu <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; 
> "linux-ext4@vger.kernel.org" ; Linus Torvalds 
> <[EMAIL PROTECTED]>
> Sent: Thursday, January 17, 2008 5:11:50 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> 
> I've backported Peter's perbdi patchset to 2.6.22.x.  I can share it
> with anyone who might be interested.
> 
> As expected, it has yielded 2.6.24-rcX level scaling.  Given the test
> result matrix you previously posted, 2.6.22.x+perbdi might give you
> what you're looking for (sans improved writeback that 2.6.24 was
> thought to be providing).  That is, much improved scaling with better
> O_DIRECT and network throughput.  Just a thought...
> 
> Unfortunately, my priorities (and computing resources) have shifted
> and I won't be able to thoroughly test Fengguang's new writeback patch
> on 2.6.24-rc8... whereby missing out on providing
> justification/testing to others on _some_ improved writeback being
> included in 2.6.24 final.
> 
> Not to mention the window for writeback improvement is all but closed
> considering the 2.6.24-rc8 announcement's 2.6.24 final release
> timetable.
> 
Mike,

 thanks for the offer, but the improved throughput is my #1 priority nowadays.
And while the better scaling for different targets is nothing to frown upon, the
much better scaling when writing to the same target would have been the big
winner for me.

 Anyway, I located the "offending" commit. Lets see what the experts say.


Cheers
Martin



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Martin Knoblauch <[EMAIL PROTECTED]>
> To: Fengguang Wu <[EMAIL PROTECTED]>
> Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; 
> "linux-ext4@vger.kernel.org" ; Linus Torvalds 
> <[EMAIL PROTECTED]>
> Sent: Thursday, January 17, 2008 2:52:58 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> - Original Message 
> > From: Fengguang Wu 
> > To: Martin Knoblauch 
> > Cc: Mike Snitzer ; Peter
> Zijlstra
> 
 ; [EMAIL PROTECTED]; Ingo Molnar
> ;
> 
 [EMAIL PROTECTED];
> "linux-ext4@vger.kernel.org"
> 
 ; Linus
> Torvalds
> 
 
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> > 
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have?  Just heavily test our own 2.6.24
> > +
> > 
>  your
> > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > > 
> > > Hi Fengguang, Mike,
> > > 
> > >  I can add myself to Mikes question. It would be good to know
> > a
> > 
>  "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > been
> > 
>  showing quite nice improvement of the overall writeback situation and
> > it
> > 
>  would be sad to see this [partially] gone in 2.6.24-final.
> > Linus
> > 
>  apparently already has reverted  "...2250b". I will definitely
> repeat
> 
 my
> > tests
> > 
>  with -rc8. and report.
> > 
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> > 
> Hi Fengguang,
> 
>  something really bad has happened between -rc3 and
> -rc6.
> 
 Embarrassingly I did not catch that earlier :-(
> 
>  Compared to the numbers I posted
> in
> 
 http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec
> (slight
> 
 plus), while dd2/dd3 suck the same way as in pre 2.6.24. The only
> test
> 
 that is still good is mix3, which I attribute to the per-BDI stuff.
> 
>  At the moment I am frantically trying to find when things went down.
> I
> 
 did run -rc8 and rc8+yourpatch. No difference to what I see with
> -rc6.
> 
 Sorry that I cannot provide any input to your patch.
> 

 OK, the change happened between rc5 and rc6. Just following a gut feeling, I 
reverted

#commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
#Author: Mel Gorman <[EMAIL PROTECTED]>
#Date:   Mon Dec 17 16:20:05 2007 -0800
#
#mm: fix page allocation for larger I/O segments
#
#In some cases the IO subsystem is able to merge requests if the pages are
#adjacent in physical memory.  This was achieved in the allocator by having
#expand() return pages in physically contiguous order in situations were a
#large buddy was split.  However, list-based anti-fragmentation changed the
#order pages were returned in to avoid searching in buffered_rmqueue() for a
#page of the appropriate migrate type.
#
#This patch restores behaviour of rmqueue_bulk() preserving the physical
#order of pages returned by the allocator without incurring increased search
#costs for anti-fragmentation.
#
#Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
#Cc: James Bottomley <[EMAIL PROTECTED]>
#Cc: Jens Axboe <[EMAIL PROTECTED]>
#Cc: Mark Lord <[EMAIL PROTECTED]
#Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
#Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
--- linux-2.6.24-rc5/mm/page_alloc.c2007-12-21 04:14:11.305633890 +
+++ linux-2.6.24-rc6/mm/page_alloc.c2007-12-21 04:14:17.746985697 +
@@ -847,8 +847,19 @@
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
+
+   /*
+* Split buddy pages returned by expand() are received here
+* in physical page order. The page is added to the callers and
+* list and the list head then moves forward. From the callers
+* perspective, the linked list is ordered by page number in
+* some conditions. This is useful for IO devices that can
+* merge IO requests if the physical pages are ordered
+* properly.
+*/
list_add(&page->lru, list);
set_page_private(page, migratetype);
+   list = &page->lru;
}
spin_unlock(&zone->lock);
return i;



This has brought back the good results I observed and reported.
I do not know what to make out of this. At least on the systems I care
about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory, SmartaArray6i
controller with 4x72GB SCSI disks as RAID5 (battery pr

Re: [PATCH] Fix oops in mballoc caused by a variable overflow

2008-01-17 Thread Aneesh Kumar K.V
On Thu, Jan 17, 2008 at 02:09:41PM +0100, Valerie Clement wrote:
> Aneesh Kumar K.V wrote:
>> On Thu, Jan 17, 2008 at 10:43:40AM +0100, Valerie Clement wrote:
>>> Aneesh Kumar K.V wrote:
 What about this  ? I guess we will overflow start = start << bsbits;

>>> Hi Aneesh,
>>> your patch below doesn't fix the issue, because as start_off is also  
>>> loff_t, start_off = ac->ac_o_ex.fe_logical << bsbits  also overflows.
>>>
>>
>> loff_t is 64 bits.
>>
>> typedef __kernel_loff_t loff_t;
>> typedef long long   __kernel_loff_t;
>> typedef __u32 ext4_lblk_t;
>> typedef unsigned long long ext4_fsblk_t
>>
>> start_off = ac->ac_o_ex.fe_logical << bsbits;
>>
>> In the above line what we are storing in start_off is the offset in bytes.So 
>> it makes
>> sense to use the type loff_t. It is neither logical block nor physical block.
>
> Oh yes, sorry, you're right. I read too quickly.
>
> In fact, it's missing a cast :
>   start_off = (loff_t) ac->ac_o_ex.fe_logical << bsbits;
>
> With that change, the test is ok.

Updated patch below.

-aneesh
ext4: Fix overflow in ext4_mb_normalize_request

From: Aneesh Kumar K.V <[EMAIL PROTECTED]>

kernel BUG at fs/ext4/mballoc.c:3148!

The BUG_ON is:
BUG_ON(size <= 0 || size >= EXT4_BLOCKS_PER_GROUP(ac->ac_sb));

where the value of "size" is 4293920768.

This is due to the overflow of the variable "start" in the
ext4_mb_normalize_request() function.

Signed-off-by: Aneesh Kumar K.V <[EMAIL PROTECTED]>
---

 fs/ext4/mballoc.c |   24 +++-
 1 files changed, 11 insertions(+), 13 deletions(-)


diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d8cd81e..d16083c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2998,7 +2998,7 @@ static void ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	int bsbits, max;
 	ext4_lblk_t end;
 	struct list_head *cur;
-	loff_t size, orig_size;
+	loff_t size, orig_size, start_off;
 	ext4_lblk_t start, orig_start;
 	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
 
@@ -3039,7 +3039,7 @@ static void ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 
 	/* first, try to predict filesize */
 	/* XXX: should this table be tunable? */
-	start = 0;
+	start_off = 0;
 	if (size <= 16 * 1024) {
 		size = 16 * 1024;
 	} else if (size <= 32 * 1024) {
@@ -3055,26 +3055,24 @@ static void ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	} else if (size <= 1024 * 1024) {
 		size = 1024 * 1024;
 	} else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, bsbits)) {
-		start = ac->ac_o_ex.fe_logical << bsbits;
-		start = (start / (1024 * 1024)) * (1024 * 1024);
+		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
+		(20 - bsbits)) << 20;
 		size = 1024 * 1024;
 	} else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, bsbits)) {
-		start = ac->ac_o_ex.fe_logical << bsbits;
-		start = (start / (4 * (1024 * 1024))) * 4 * (1024 * 1024);
+		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
+			(22 - bsbits)) << 22;
 		size = 4 * 1024 * 1024;
 	} else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
 	(8<<20)>>bsbits, max, bsbits)) {
-		start = ac->ac_o_ex.fe_logical;
-		start = start << bsbits;
-		start = (start / (8 * (1024 * 1024))) * 8 * (1024 * 1024);
+		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
+			(23 - bsbits)) << 23;
 		size = 8 * 1024 * 1024;
 	} else {
-		start = ac->ac_o_ex.fe_logical;
-		start = start << bsbits;
-		size = ac->ac_o_ex.fe_len << bsbits;
+		start_off = (loff_t)ac->ac_o_ex.fe_logical << bsbits;
+		size	  = ac->ac_o_ex.fe_len << bsbits;
 	}
 	orig_size = size = size >> bsbits;
-	orig_start = start = start >> bsbits;
+	orig_start = start = start_off >> bsbits;
 
 	/* don't cover already allocated blocks in selected range */
 	if (ar->pleft && start <= ar->lleft) {


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Mike Snitzer
On Jan 17, 2008 8:52 AM, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
>
> - Original Message 
> > From: Fengguang Wu <[EMAIL PROTECTED]>
> > To: Martin Knoblauch <[EMAIL PROTECTED]>
> > Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> > [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; 
> > "linux-ext4@vger.kernel.org" ; Linus Torvalds 
> > <[EMAIL PROTECTED]>
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> >
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have?  Just heavily test our own 2.6.24
> > +
> >
>  your
> > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > >
> > > Hi Fengguang, Mike,
> > >
> > >  I can add myself to Mikes question. It would be good to know
> > a
> >
>  "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > been
> >
>  showing quite nice improvement of the overall writeback situation and
> > it
> >
>  would be sad to see this [partially] gone in 2.6.24-final.
> > Linus
> >
>  apparently already has reverted  "...2250b". I will definitely repeat my
> > tests
> >
>  with -rc8. and report.
> >
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> >
> Hi Fengguang,
>
>  something really bad has happened between -rc3 and -rc6. Embarrassingly I 
> did not catch that earlier :-(
>
>  Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , 
> dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in 
> pre 2.6.24. The only test that is still good is mix3, which I attribute to 
> the per-BDI stuff.
>
>  At the moment I am frantically trying to find when things went down. I did 
> run -rc8 and rc8+yourpatch. No difference to what I see with -rc6. Sorry that 
> I cannot provide any input to your patch.
>
> Depressed
> Martin

Martin,

I've backported Peter's perbdi patchset to 2.6.22.x.  I can share it
with anyone who might be interested.

As expected, it has yielded 2.6.24-rcX level scaling.  Given the test
result matrix you previously posted, 2.6.22.x+perbdi might give you
what you're looking for (sans improved writeback that 2.6.24 was
thought to be providing).  That is, much improved scaling with better
O_DIRECT and network throughput.  Just a thought...

Unfortunately, my priorities (and computing resources) have shifted
and I won't be able to thoroughly test Fengguang's new writeback patch
on 2.6.24-rc8... whereby missing out on providing
justification/testing to others on _some_ improved writeback being
included in 2.6.24 final.

Not to mention the window for writeback improvement is all but closed
considering the 2.6.24-rc8 announcement's 2.6.24 final release
timetable.

regards,
Mike
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Fengguang Wu <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; 
> "linux-ext4@vger.kernel.org" ; Linus Torvalds 
> <[EMAIL PROTECTED]>
> Sent: Wednesday, January 16, 2008 1:00:04 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > For those interested in using your writeback improvements in
> > > production sooner rather than later (primarily with ext3); what
> > > recommendations do you have?  Just heavily test our own 2.6.24
> +
> 
 your
> > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > 
> > Hi Fengguang, Mike,
> > 
> >  I can add myself to Mikes question. It would be good to know
> a
> 
 "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> been
> 
 showing quite nice improvement of the overall writeback situation and
> it
> 
 would be sad to see this [partially] gone in 2.6.24-final.
> Linus
> 
 apparently already has reverted  "...2250b". I will definitely repeat my
> tests
> 
 with -rc8. and report.
> 
> Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> Maybe we can push it to 2.6.24 after your testing.
> 
Hi Fengguang,

 something really bad has happened between -rc3 and -rc6. Embarrassingly I did 
not catch that earlier :-(

 Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , dd1 
is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in pre 
2.6.24. The only test that is still good is mix3, which I attribute to the 
per-BDI stuff.

 At the moment I am frantically trying to find when things went down. I did run 
-rc8 and rc8+yourpatch. No difference to what I see with -rc6. Sorry that I 
cannot provide any input to your patch.

Depressed
Martin

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix oops in mballoc caused by a variable overflow

2008-01-17 Thread Valerie Clement

Aneesh Kumar K.V wrote:

On Thu, Jan 17, 2008 at 10:43:40AM +0100, Valerie Clement wrote:

Aneesh Kumar K.V wrote:

What about this  ? I guess we will overflow start = start << bsbits;


Hi Aneesh,
your patch below doesn't fix the issue, because as start_off is also  
loff_t, start_off = ac->ac_o_ex.fe_logical << bsbits  also overflows.




loff_t is 64 bits.

typedef __kernel_loff_t loff_t;
typedef long long   __kernel_loff_t;
typedef __u32 ext4_lblk_t;
typedef unsigned long long ext4_fsblk_t

start_off = ac->ac_o_ex.fe_logical << bsbits;

In the above line what we are storing in start_off is the offset in bytes.So it 
makes
sense to use the type loff_t. It is neither logical block nor physical block.


Oh yes, sorry, you're right. I read too quickly.

In fact, it's missing a cast :
  start_off = (loff_t) ac->ac_o_ex.fe_logical << bsbits;

With that change, the test is ok.

   Valérie


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]

2008-01-17 Thread Abhishek Rai
On Jan 15, 2008 10:28 AM, Theodore Tso <[EMAIL PROTECTED]> wrote:
> Also, it's not just reducing fsck times, although that's the main one.
> The last time this was suggested, the rationale was to speed up the
> "rm dvd.iso" case.  Also, something which *could* be done, if Abhishek
> wants to pursue it, would be to pull in all of the indirect blocks
> when the file is opened, and create an in-memory extent tree that
> would speed up access to the file.  It's rarely worth doing this
> without metaclustering, since it doesn't help for sequential I/O, only
> random I/O, but with metaclustering it would also be a win for
> sequential I/O.  (This would also remove the minor performance
> degradation for sequential I/O imposed by metaclustering, and in fact
> improve it slightly for really big files.)
>
> - Ted
>

Also, since the in memory extent tree will now occupy much less space,
we can keep them cached for a much longer time which will improve
performance of random reads. The new metaclustering patch is more
amenable to this trick since it reduces fragmentation thereby reducing
the number of extents.

Thanks,
Abhishek
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Convert ext3_ioctl() to an unlocked_ioctl

2008-01-17 Thread Mathieu Segaud
Signed-off-by: Mathieu Segaud <[EMAIL PROTECTED]>
---
 fs/ext3/dir.c   |2 +-
 fs/ext3/file.c  |2 +-
 fs/ext3/ioctl.c |  161 ---
 include/linux/ext3_fs.h |3 +-
 4 files changed, 113 insertions(+), 55 deletions(-)

diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 8ca3bfd..5ab6b88 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -42,7 +42,7 @@ const struct file_operations ext3_dir_operations = {
.llseek = generic_file_llseek,
.read   = generic_read_dir,
.readdir= ext3_readdir, /* we take BKL. needed?*/
-   .ioctl  = ext3_ioctl,   /* BKL held */
+   .unlocked_ioctl = ext3_ioctl,   /* BKL held */
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext3_compat_ioctl,
 #endif
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index acc4913..49798ed 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -112,7 +112,7 @@ const struct file_operations ext3_file_operations = {
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
.aio_write  = ext3_file_write,
-   .ioctl  = ext3_ioctl,
+   .unlocked_ioctl = ext3_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext3_compat_ioctl,
 #endif
diff --git a/fs/ext3/ioctl.c b/fs/ext3/ioctl.c
index 023a070..bccb372 100644
--- a/fs/ext3/ioctl.c
+++ b/fs/ext3/ioctl.c
@@ -17,12 +17,19 @@
 #include 
 #include 
 
-int ext3_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
+long ext3_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg)
 {
-   struct ext3_inode_info *ei = EXT3_I(inode);
+   struct ext3_inode_info *ei;
+   struct inode *inode;
unsigned int flags;
unsigned short rsv_window_size;
+   long retval = 0;
+
+   lock_kernel();
+
+   inode = filp->f_dentry->d_inode;
+   ei = EXT3_I(inode);
 
ext3_debug ("cmd = %u, arg = %lu\n", cmd, arg);
 
@@ -30,7 +37,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
case EXT3_IOC_GETFLAGS:
ext3_get_inode_flags(ei);
flags = ei->i_flags & EXT3_FL_USER_VISIBLE;
-   return put_user(flags, (int __user *) arg);
+   retval = put_user(flags, (int __user *) arg);
+   goto out;
case EXT3_IOC_SETFLAGS: {
handle_t *handle = NULL;
int err;
@@ -38,14 +46,20 @@ int ext3_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
unsigned int oldflags;
unsigned int jflag;
 
-   if (IS_RDONLY(inode))
-   return -EROFS;
+   if (IS_RDONLY(inode)) {
+   retval = -EROFS;
+   goto out;
+   }
 
-   if (!is_owner_or_cap(inode))
-   return -EACCES;
+   if (!is_owner_or_cap(inode)) {
+   retval = -EACCES;
+   goto out;
+   }
 
-   if (get_user(flags, (int __user *) arg))
-   return -EFAULT;
+   if (get_user(flags, (int __user *) arg)) {
+   retval = -EFAULT;
+   goto out;
+   }
 
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT3_DIRSYNC_FL;
@@ -54,7 +68,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
/* Is it quota file? Do not allow user to mess with it */
if (IS_NOQUOTA(inode)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
oldflags = ei->i_flags;
 
@@ -70,7 +85,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
if ((flags ^ oldflags) & (EXT3_APPEND_FL | EXT3_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
}
 
@@ -81,7 +97,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
if ((jflag ^ oldflags) & (EXT3_JOURNAL_DATA_FL)) {
if (!capable(CAP_SYS_RESOURCE)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
}
 
@@ -89,7 +106,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
handle = ext3_journal_start(inode, 1);
 

[PATCH] Convert EXT2 to use unlocked_ioctl

2008-01-17 Thread Mathieu Segaud
Change ext_ioctl() to be an unlocked_ioctl(), explicitly
exposing BKL's uses.

Signed-off-by: Mathieu Segaud <[EMAIL PROTECTED]>
---
 fs/ext2/dir.c   |2 +-
 fs/ext2/ext2.h  |3 +-
 fs/ext2/file.c  |4 +-
 fs/ext2/ioctl.c |  103 +--
 4 files changed, 73 insertions(+), 39 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index d868e26..dbee3c9 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -703,7 +703,7 @@ const struct file_operations ext2_dir_operations = {
.llseek = generic_file_llseek,
.read   = generic_read_dir,
.readdir= ext2_readdir,
-   .ioctl  = ext2_ioctl,
+   .unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext2_compat_ioctl,
 #endif
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index c87ae29..bb9948c 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -139,8 +139,7 @@ int __ext2_write_begin(struct file *file, struct 
address_space *mapping,
struct page **pagep, void **fsdata);
 
 /* ioctl.c */
-extern int ext2_ioctl (struct inode *, struct file *, unsigned int,
-  unsigned long);
+extern long ext2_ioctl(struct file *, unsigned int, unsigned long);
 extern long ext2_compat_ioctl(struct file *, unsigned int, unsigned long);
 
 /* namei.c */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index c051798..17fe628 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -48,7 +48,7 @@ const struct file_operations ext2_file_operations = {
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
.aio_write  = generic_file_aio_write,
-   .ioctl  = ext2_ioctl,
+   .unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext2_compat_ioctl,
 #endif
@@ -65,7 +65,7 @@ const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
.read   = xip_file_read,
.write  = xip_file_write,
-   .ioctl  = ext2_ioctl,
+   .unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext2_compat_ioctl,
 #endif
diff --git a/fs/ext2/ioctl.c b/fs/ext2/ioctl.c
index 320b2cb..0f18ac5 100644
--- a/fs/ext2/ioctl.c
+++ b/fs/ext2/ioctl.c
@@ -17,12 +17,18 @@
 #include 
 
 
-int ext2_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
+long ext2_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg)
 {
-   struct ext2_inode_info *ei = EXT2_I(inode);
+   struct ext2_inode_info *ei;
+   struct inode *inode;
unsigned int flags;
unsigned short rsv_window_size;
+   long retval = 0;
+
+   lock_kernel();
+   inode = filp->f_path.dentry->d_inode;
+   ei = EXT2_I(inode);
 
ext2_debug ("cmd = %u, arg = %lu\n", cmd, arg);
 
@@ -30,18 +36,25 @@ int ext2_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
case EXT2_IOC_GETFLAGS:
ext2_get_inode_flags(ei);
flags = ei->i_flags & EXT2_FL_USER_VISIBLE;
-   return put_user(flags, (int __user *) arg);
+   retval = put_user(flags, (int __user *) arg);
+   goto out;
case EXT2_IOC_SETFLAGS: {
unsigned int oldflags;
 
-   if (IS_RDONLY(inode))
-   return -EROFS;
+   if (IS_RDONLY(inode)) {
+   retval = -EROFS;
+   goto out;
+   }
 
-   if (!is_owner_or_cap(inode))
-   return -EACCES;
+   if (!is_owner_or_cap(inode)) {
+   retval = -EACCES;
+   goto out;
+   }
 
-   if (get_user(flags, (int __user *) arg))
-   return -EFAULT;
+   if (get_user(flags, (int __user *) arg)) {
+   retval = -EFAULT;
+   goto out;
+   }
 
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT2_DIRSYNC_FL;
@@ -50,7 +63,8 @@ int ext2_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
/* Is it quota file? Do not allow user to mess with it */
if (IS_NOQUOTA(inode)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
oldflags = ei->i_flags;
 
@@ -63,7 +77,8 @@ int ext2_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
if ((flags ^ oldflags) & (EXT2_APPEND_FL | EXT2_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+

[PATCH] Convert ext4_ioctl to an unlocked_ioctl

2008-01-17 Thread Mathieu Segaud
Signed-off-by: Mathieu Segaud <[EMAIL PROTECTED]>
---
 fs/ext4/dir.c   |2 +-
 fs/ext4/file.c  |2 +-
 fs/ext4/ioctl.c |  161 ---
 include/linux/ext4_fs.h |3 +-
 4 files changed, 112 insertions(+), 56 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index f612bef..8f6677a 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -42,7 +42,7 @@ const struct file_operations ext4_dir_operations = {
.llseek = generic_file_llseek,
.read   = generic_read_dir,
.readdir= ext4_readdir, /* we take BKL. needed?*/
-   .ioctl  = ext4_ioctl,   /* BKL held */
+   .unlocked_ioctl = ext4_ioctl,   /* BKL held */
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext4_compat_ioctl,
 #endif
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a81cd6..34852a9 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -112,7 +112,7 @@ const struct file_operations ext4_file_operations = {
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
.aio_write  = ext4_file_write,
-   .ioctl  = ext4_ioctl,
+   .unlocked_ioctl = ext4_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext4_compat_ioctl,
 #endif
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index e7f894b..27cab25 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -17,12 +17,18 @@
 #include 
 #include 
 
-int ext4_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
-   unsigned long arg)
+long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
-   struct ext4_inode_info *ei = EXT4_I(inode);
+   struct ext4_inode_info *ei;
+   struct inode *inode;
unsigned int flags;
unsigned short rsv_window_size;
+   long retval = 0;
+
+   lock_kernel();
+
+   inode = filp->f_dentry->d_inode;
+   ei = EXT4_I(inode);
 
ext4_debug ("cmd = %u, arg = %lu\n", cmd, arg);
 
@@ -30,7 +36,8 @@ int ext4_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
case EXT4_IOC_GETFLAGS:
ext4_get_inode_flags(ei);
flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
-   return put_user(flags, (int __user *) arg);
+   retval = put_user(flags, (int __user *) arg);
+   goto out;
case EXT4_IOC_SETFLAGS: {
handle_t *handle = NULL;
int err;
@@ -38,14 +45,20 @@ int ext4_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
unsigned int oldflags;
unsigned int jflag;
 
-   if (IS_RDONLY(inode))
-   return -EROFS;
+   if (IS_RDONLY(inode)) {
+   retval = -EROFS;
+   goto out;
+   }
 
-   if (!is_owner_or_cap(inode))
-   return -EACCES;
+   if (!is_owner_or_cap(inode)) {
+   retval = -EACCES;
+   goto out;
+   }
 
-   if (get_user(flags, (int __user *) arg))
-   return -EFAULT;
+   if (get_user(flags, (int __user *) arg)) {
+   retval = -EFAULT;
+   goto out;
+   }
 
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT4_DIRSYNC_FL;
@@ -54,7 +67,8 @@ int ext4_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
/* Is it quota file? Do not allow user to mess with it */
if (IS_NOQUOTA(inode)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
oldflags = ei->i_flags;
 
@@ -70,7 +84,8 @@ int ext4_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
}
 
@@ -81,7 +96,8 @@ int ext4_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL)) {
if (!capable(CAP_SYS_RESOURCE)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
}
 
@@ -89,7 +105,8 @@ int ext4_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
handle = ext4_journal

Re: [PATCH] Convert EXT2 to use unlocked_ioctl

2008-01-17 Thread Mathieu SEGAUD
Vous m'avez dit récemment :

> On Thursday 17 January 2008, Mathieu SEGAUD wrote:
>> yep, they do. I noticed this nested calls. I guess I will add
>> _extX_compat_ioctl() running with no BKL's which would be used by both
>> extX_ioctl() and extX_compat_ioctl().
>> Any comments on such a strategy ? thanks a lot for the reminder :)
>> 
>
> Why not just kill the lock_kernel() in extX_compat_ioctl() ?

sure, I just posted I wasn't that awake enough to see this one :)
by the way thanks for the heads up.
I repost the whole thing corrected.

-- 
Mathieu
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Convert EXT2 to use unlocked_ioctl

2008-01-17 Thread Arnd Bergmann
On Thursday 17 January 2008, Mathieu SEGAUD wrote:
> yep, they do. I noticed this nested calls. I guess I will add
> _extX_compat_ioctl() running with no BKL's which would be used by both
> extX_ioctl() and extX_compat_ioctl().
> Any comments on such a strategy ? thanks a lot for the reminder :)
> 

Why not just kill the lock_kernel() in extX_compat_ioctl()?

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Convert EXT2 to use unlocked_ioctl

2008-01-17 Thread Mathieu SEGAUD
Vous m'avez dit récemment :

> Vous m'avez dit récemment :
>
>> On Thursday 17 January 2008, you wrote:
>>> 
>>> Change ext_ioctl() to be an unlocked_ioctl(), explicitly
>>> exposing BKL's uses.
>>> 
>>> Signed-off-by: Mathieu Segaud <[EMAIL PROTECTED]>
>>
>> You are now calling lock_kernel() twice in case of ext2_compat_ioctl(),
>> which calls back into ext2_ioctl with the BKL already held.
>>
>> This is allowed with the BKL, but really bad style that you should
>> avoid. I assume the ext3 and ext4dev versions of your patch have
>> the same issue, but I didn't check in detail.
>
> yep, they do. I noticed this nested calls. I guess I will add
> _extX_compat_ioctl() running with no BKL's which would be used by both
> extX_ioctl() and extX_compat_ioctl().
> Any comments on such a strategy ? thanks a lot for the reminder :)

Well as I am not awake enough, this would sum up to get rid of
lock/unlock_kernel() around extX_ioctl() calls...
Will repost with theses changes.
Thanks.

-- 
Mathieu
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix oops in mballoc caused by a variable overflow

2008-01-17 Thread Aneesh Kumar K.V
On Thu, Jan 17, 2008 at 10:43:40AM +0100, Valerie Clement wrote:
> Aneesh Kumar K.V wrote:
>> What about this  ? I guess we will overflow start = start << bsbits;
>>
>
> Hi Aneesh,
> your patch below doesn't fix the issue, because as start_off is also  
> loff_t, start_off = ac->ac_o_ex.fe_logical << bsbits  also overflows.
>

loff_t is 64 bits.

typedef __kernel_loff_t loff_t;
typedef long long   __kernel_loff_t;
typedef __u32 ext4_lblk_t;
typedef unsigned long long ext4_fsblk_t

start_off = ac->ac_o_ex.fe_logical << bsbits;

In the above line what we are storing in start_off is the offset in bytes.So it 
makes
sense to use the type loff_t. It is neither logical block nor physical block.


-aneesh
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix oops in mballoc caused by a variable overflow

2008-01-17 Thread Aneesh Kumar K.V
On Thu, Jan 17, 2008 at 10:43:40AM +0100, Valerie Clement wrote:
> Aneesh Kumar K.V wrote:
>> What about this  ? I guess we will overflow start = start << bsbits;
>>
>
> Hi Aneesh,
> your patch below doesn't fix the issue, because as start_off is also  
> loff_t, start_off = ac->ac_o_ex.fe_logical << bsbits  also overflows.
>

loff_t is 64 bits.

typedef long long   __kernel_loff_t;
typedef __u32 ext4_lblk_t;

-aneesh
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Convert EXT2 to use unlocked_ioctl

2008-01-17 Thread Mathieu SEGAUD
Vous m'avez dit récemment :

> On Thursday 17 January 2008, you wrote:
>> 
>> Change ext_ioctl() to be an unlocked_ioctl(), explicitly
>> exposing BKL's uses.
>> 
>> Signed-off-by: Mathieu Segaud <[EMAIL PROTECTED]>
>
> You are now calling lock_kernel() twice in case of ext2_compat_ioctl(),
> which calls back into ext2_ioctl with the BKL already held.
>
> This is allowed with the BKL, but really bad style that you should
> avoid. I assume the ext3 and ext4dev versions of your patch have
> the same issue, but I didn't check in detail.

yep, they do. I noticed this nested calls. I guess I will add
_extX_compat_ioctl() running with no BKL's which would be used by both
extX_ioctl() and extX_compat_ioctl().
Any comments on such a strategy ? thanks a lot for the reminder :)

-- 
Mathieu

ps: I just posted a set of patches for reiserfs that may suffer the
same ugly style
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Convert EXT2 to use unlocked_ioctl

2008-01-17 Thread Arnd Bergmann
On Thursday 17 January 2008, you wrote:
> 
> Change ext_ioctl() to be an unlocked_ioctl(), explicitly
> exposing BKL's uses.
> 
> Signed-off-by: Mathieu Segaud <[EMAIL PROTECTED]>

You are now calling lock_kernel() twice in case of ext2_compat_ioctl(),
which calls back into ext2_ioctl with the BKL already held.

This is allowed with the BKL, but really bad style that you should
avoid. I assume the ext3 and ext4dev versions of your patch have
the same issue, but I didn't check in detail.

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]

2008-01-17 Thread Andreas Dilger
On Jan 15, 2008  23:25 -0500, [EMAIL PROTECTED] wrote:
> I've got multiple boxes across the hall that have 50T of disk on them, in one
> case as one large filesystem, and the users want *more* *bigger* still (damned
> researchers - you put a 15 teraflop supercomputer in the room, and then they
> want someplace to *put* all the numbers that come spewing out of there.. ;)
> 
> There comes a point where that downtime gets too long to be politically
> expedient.  6->2 may not be a biggie, because you can likely get a 6 hour
> window.  24->8 suddenly looks a lot different.
> 
> (Having said that, I'll admit the one 52T filesystem is an SGI Itanium box
> running Suse and using XFS rather than ext3).
> 
> Has anybody done a back-of-envelope of what this would do for fsck times for
> a "max realistically achievable ext3 filesystem" (i.e. 100T-200T or ext3
> design limit, whichever is smaller)?

This is exactly the kind of environment that Lustre is designed for.
Not only do you get parallel IO performance, but you can also do parallel
e2fsck on the individual filesystems when you need it.  Not that we aren't
also working on improving e2fsck performance, but 100 * 4TB e2fsck in
parallel is much better than 1 * 400TB e2fsck (probably not possible on
a system today due to RAM constraints though I haven't really done any
calculations either way).  I know customers were having RAM problems with
3 * 2TB e2fsck in parallel on a 2GB node.

Most customers these days use 2-4 4-8TB filesystems per server
(inexpensive SMP node with 2-4GB RAM instead of a single monstrous SGI box).

We have many Lustre filesystems in the 100-200TB range today, some over 1PB
already and much larger ones being planned.

> (And one of the research crew had a not-totally-on-crack proposal to get a
> petabyte of spinning oxide.  Figuring out how to back that up would probably
> have landed firmly in my lap.  Ouch. ;)

Charge them for 2PB of storage, and use rsync :-).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Convert ext3_ioctl() to an unlocked_ioctl

2008-01-17 Thread Mathieu Segaud
Signed-off-by: Mathieu Segaud <[EMAIL PROTECTED]>
---
 fs/ext3/dir.c   |2 +-
 fs/ext3/file.c  |2 +-
 fs/ext3/ioctl.c |  159 --
 include/linux/ext3_fs.h |3 +-
 4 files changed, 113 insertions(+), 53 deletions(-)

diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 8ca3bfd..5ab6b88 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -42,7 +42,7 @@ const struct file_operations ext3_dir_operations = {
.llseek = generic_file_llseek,
.read   = generic_read_dir,
.readdir= ext3_readdir, /* we take BKL. needed?*/
-   .ioctl  = ext3_ioctl,   /* BKL held */
+   .unlocked_ioctl = ext3_ioctl,   /* BKL held */
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext3_compat_ioctl,
 #endif
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index acc4913..49798ed 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -112,7 +112,7 @@ const struct file_operations ext3_file_operations = {
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
.aio_write  = ext3_file_write,
-   .ioctl  = ext3_ioctl,
+   .unlocked_ioctl = ext3_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext3_compat_ioctl,
 #endif
diff --git a/fs/ext3/ioctl.c b/fs/ext3/ioctl.c
index 023a070..9d57d47 100644
--- a/fs/ext3/ioctl.c
+++ b/fs/ext3/ioctl.c
@@ -17,12 +17,19 @@
 #include 
 #include 
 
-int ext3_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
+long ext3_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg)
 {
-   struct ext3_inode_info *ei = EXT3_I(inode);
+   struct ext3_inode_info *ei;
+   struct inode *inode;
unsigned int flags;
unsigned short rsv_window_size;
+   long retval = 0;
+
+   lock_kernel();
+
+   inode = filp->f_dentry->d_inode;
+   ei = EXT3_I(inode);
 
ext3_debug ("cmd = %u, arg = %lu\n", cmd, arg);
 
@@ -30,7 +37,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
case EXT3_IOC_GETFLAGS:
ext3_get_inode_flags(ei);
flags = ei->i_flags & EXT3_FL_USER_VISIBLE;
-   return put_user(flags, (int __user *) arg);
+   retval = put_user(flags, (int __user *) arg);
+   goto out;
case EXT3_IOC_SETFLAGS: {
handle_t *handle = NULL;
int err;
@@ -38,14 +46,20 @@ int ext3_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
unsigned int oldflags;
unsigned int jflag;
 
-   if (IS_RDONLY(inode))
-   return -EROFS;
+   if (IS_RDONLY(inode)) {
+   retval = -EROFS;
+   goto out;
+   }
 
-   if (!is_owner_or_cap(inode))
-   return -EACCES;
+   if (!is_owner_or_cap(inode)) {
+   retval = -EACCES;
+   goto out;
+   }
 
-   if (get_user(flags, (int __user *) arg))
-   return -EFAULT;
+   if (get_user(flags, (int __user *) arg)) {
+   retval = -EFAULT;
+   goto out;
+   }
 
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT3_DIRSYNC_FL;
@@ -54,7 +68,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
/* Is it quota file? Do not allow user to mess with it */
if (IS_NOQUOTA(inode)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
oldflags = ei->i_flags;
 
@@ -70,7 +85,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
if ((flags ^ oldflags) & (EXT3_APPEND_FL | EXT3_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
}
 
@@ -81,7 +97,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
if ((jflag ^ oldflags) & (EXT3_JOURNAL_DATA_FL)) {
if (!capable(CAP_SYS_RESOURCE)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
}
 
@@ -89,7 +106,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
handle = ext3_journal_start(inode, 1);
  

[PATCH] Convert ext4_ioctl to an unlocked_ioctl

2008-01-17 Thread Mathieu Segaud
Signed-off-by: Mathieu Segaud <[EMAIL PROTECTED]>
---
 fs/ext4/dir.c   |2 +-
 fs/ext4/file.c  |2 +-
 fs/ext4/ioctl.c |  159 ---
 include/linux/ext4_fs.h |3 +-
 4 files changed, 112 insertions(+), 54 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index f612bef..8f6677a 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -42,7 +42,7 @@ const struct file_operations ext4_dir_operations = {
.llseek = generic_file_llseek,
.read   = generic_read_dir,
.readdir= ext4_readdir, /* we take BKL. needed?*/
-   .ioctl  = ext4_ioctl,   /* BKL held */
+   .unlocked_ioctl = ext4_ioctl,   /* BKL held */
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext4_compat_ioctl,
 #endif
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a81cd6..34852a9 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -112,7 +112,7 @@ const struct file_operations ext4_file_operations = {
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
.aio_write  = ext4_file_write,
-   .ioctl  = ext4_ioctl,
+   .unlocked_ioctl = ext4_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext4_compat_ioctl,
 #endif
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index e7f894b..a0ee342 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -17,12 +17,18 @@
 #include 
 #include 
 
-int ext4_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
-   unsigned long arg)
+long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
-   struct ext4_inode_info *ei = EXT4_I(inode);
+   struct ext4_inode_info *ei;
+   struct inode *inode;
unsigned int flags;
unsigned short rsv_window_size;
+   long retval = 0;
+
+   lock_kernel();
+
+   inode = filp->f_dentry->d_inode;
+   ei = EXT4_I(inode);
 
ext4_debug ("cmd = %u, arg = %lu\n", cmd, arg);
 
@@ -30,7 +36,8 @@ int ext4_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
case EXT4_IOC_GETFLAGS:
ext4_get_inode_flags(ei);
flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
-   return put_user(flags, (int __user *) arg);
+   retval = put_user(flags, (int __user *) arg);
+   goto out;
case EXT4_IOC_SETFLAGS: {
handle_t *handle = NULL;
int err;
@@ -38,14 +45,20 @@ int ext4_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
unsigned int oldflags;
unsigned int jflag;
 
-   if (IS_RDONLY(inode))
-   return -EROFS;
+   if (IS_RDONLY(inode)) {
+   retval = -EROFS;
+   goto out;
+   }
 
-   if (!is_owner_or_cap(inode))
-   return -EACCES;
+   if (!is_owner_or_cap(inode)) {
+   retval = -EACCES;
+   goto out;
+   }
 
-   if (get_user(flags, (int __user *) arg))
-   return -EFAULT;
+   if (get_user(flags, (int __user *) arg)) {
+   retval = -EFAULT;
+   goto out;
+   }
 
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT4_DIRSYNC_FL;
@@ -54,7 +67,8 @@ int ext4_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
/* Is it quota file? Do not allow user to mess with it */
if (IS_NOQUOTA(inode)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
oldflags = ei->i_flags;
 
@@ -70,7 +84,8 @@ int ext4_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
}
 
@@ -81,7 +96,8 @@ int ext4_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL)) {
if (!capable(CAP_SYS_RESOURCE)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
}
 
@@ -89,7 +105,8 @@ int ext4_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
handle = ext4_journal

[PATCH] Convert EXT2 to use unlocked_ioctl

2008-01-17 Thread Mathieu Segaud
Change ext_ioctl() to be an unlocked_ioctl(), explicitly
exposing BKL's uses.

Signed-off-by: Mathieu Segaud <[EMAIL PROTECTED]>
---
 fs/ext2/dir.c   |2 +-
 fs/ext2/ext2.h  |3 +-
 fs/ext2/file.c  |4 +-
 fs/ext2/ioctl.c |  101 +-
 4 files changed, 73 insertions(+), 37 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index d868e26..dbee3c9 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -703,7 +703,7 @@ const struct file_operations ext2_dir_operations = {
.llseek = generic_file_llseek,
.read   = generic_read_dir,
.readdir= ext2_readdir,
-   .ioctl  = ext2_ioctl,
+   .unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext2_compat_ioctl,
 #endif
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index c87ae29..bb9948c 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -139,8 +139,7 @@ int __ext2_write_begin(struct file *file, struct 
address_space *mapping,
struct page **pagep, void **fsdata);
 
 /* ioctl.c */
-extern int ext2_ioctl (struct inode *, struct file *, unsigned int,
-  unsigned long);
+extern long ext2_ioctl(struct file *, unsigned int, unsigned long);
 extern long ext2_compat_ioctl(struct file *, unsigned int, unsigned long);
 
 /* namei.c */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index c051798..17fe628 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -48,7 +48,7 @@ const struct file_operations ext2_file_operations = {
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
.aio_write  = generic_file_aio_write,
-   .ioctl  = ext2_ioctl,
+   .unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext2_compat_ioctl,
 #endif
@@ -65,7 +65,7 @@ const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
.read   = xip_file_read,
.write  = xip_file_write,
-   .ioctl  = ext2_ioctl,
+   .unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext2_compat_ioctl,
 #endif
diff --git a/fs/ext2/ioctl.c b/fs/ext2/ioctl.c
index 320b2cb..02ba8e4 100644
--- a/fs/ext2/ioctl.c
+++ b/fs/ext2/ioctl.c
@@ -17,12 +17,18 @@
 #include 
 
 
-int ext2_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
+long ext2_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg)
 {
-   struct ext2_inode_info *ei = EXT2_I(inode);
+   struct ext2_inode_info *ei;
+   struct inode *inode;
unsigned int flags;
unsigned short rsv_window_size;
+   long retval = 0;
+
+   lock_kernel();
+   inode = filp->f_dentry->d_inode;
+   ei = EXT2_I(inode);
 
ext2_debug ("cmd = %u, arg = %lu\n", cmd, arg);
 
@@ -30,18 +36,25 @@ int ext2_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
case EXT2_IOC_GETFLAGS:
ext2_get_inode_flags(ei);
flags = ei->i_flags & EXT2_FL_USER_VISIBLE;
-   return put_user(flags, (int __user *) arg);
+   retval = put_user(flags, (int __user *) arg);
+   goto out;
case EXT2_IOC_SETFLAGS: {
unsigned int oldflags;
 
-   if (IS_RDONLY(inode))
-   return -EROFS;
+   if (IS_RDONLY(inode)) {
+   retval = -EROFS;
+   goto out;
+   }
 
-   if (!is_owner_or_cap(inode))
-   return -EACCES;
+   if (!is_owner_or_cap(inode)) {
+   retval = -EACCES;
+   goto out;
+   }
 
-   if (get_user(flags, (int __user *) arg))
-   return -EFAULT;
+   if (get_user(flags, (int __user *) arg)) {
+   retval = -EFAULT;
+   goto out;
+   }
 
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT2_DIRSYNC_FL;
@@ -50,7 +63,8 @@ int ext2_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
/* Is it quota file? Do not allow user to mess with it */
if (IS_NOQUOTA(inode)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+   goto out;
}
oldflags = ei->i_flags;
 
@@ -63,7 +77,8 @@ int ext2_ioctl (struct inode * inode, struct file * filp, 
unsigned int cmd,
if ((flags ^ oldflags) & (EXT2_APPEND_FL | EXT2_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(&inode->i_mutex);
-   return -EPERM;
+   retval = -EPERM;
+  

Re: [PATCH] Fix oops in mballoc caused by a variable overflow

2008-01-17 Thread Valerie Clement

Aneesh Kumar K.V wrote:
What about this  ? I guess we will overflow 
start = start << bsbits;




Hi Aneesh,
your patch below doesn't fix the issue, because as start_off is also 
loff_t, start_off = ac->ac_o_ex.fe_logical << bsbits  also overflows.



I guess start should be of type loff_t. Patch below

-aneesh

ext4: Fix overflow in ext4_mb_normalize_request

From: Aneesh Kumar K.V <[EMAIL PROTECTED]>

kernel BUG at fs/ext4/mballoc.c:3148!

The BUG_ON is:
BUG_ON(size <= 0 || size >= EXT4_BLOCKS_PER_GROUP(ac->ac_sb));

where the value of "size" is 4293920768.

This is due to the overflow of the variable "start" in the
ext4_mb_normalize_request() function.

Signed-off-by: Aneesh Kumar K.V <[EMAIL PROTECTED]>
---

 fs/ext4/mballoc.c |   21 -
 1 files changed, 8 insertions(+), 13 deletions(-)


diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d8cd81e..d8a2db8 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2998,7 +2998,7 @@ static void ext4_mb_normalize_request(struct 
ext4_allocation_context *ac,
int bsbits, max;
ext4_lblk_t end;
struct list_head *cur;
-   loff_t size, orig_size;
+   loff_t size, orig_size, start_off;
ext4_lblk_t start, orig_start;
struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
 
@@ -3039,7 +3039,7 @@ static void ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 
 	/* first, try to predict filesize */

/* XXX: should this table be tunable? */
-   start = 0;
+   start_off = 0;
if (size <= 16 * 1024) {
size = 16 * 1024;
} else if (size <= 32 * 1024) {
@@ -3055,26 +3055,21 @@ static void ext4_mb_normalize_request(struct 
ext4_allocation_context *ac,
} else if (size <= 1024 * 1024) {
size = 1024 * 1024;
} else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, bsbits)) {
-   start = ac->ac_o_ex.fe_logical << bsbits;
-   start = (start / (1024 * 1024)) * (1024 * 1024);
+   start_off = (ac->ac_o_ex.fe_logical >> (20 - bsbits)) << 20;
size = 1024 * 1024;
} else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, bsbits)) {
-   start = ac->ac_o_ex.fe_logical << bsbits;
-   start = (start / (4 * (1024 * 1024))) * 4 * (1024 * 1024);
+   start_off = (ac->ac_o_ex.fe_logical >> (22 - bsbits)) << 22;
size = 4 * 1024 * 1024;
} else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
(8<<20)>>bsbits, max, bsbits)) {
-   start = ac->ac_o_ex.fe_logical;
-   start = start << bsbits;
-   start = (start / (8 * (1024 * 1024))) * 8 * (1024 * 1024);
+   start_off = (ac->ac_o_ex.fe_logical >> (23 - bsbits)) << 23;
size = 8 * 1024 * 1024;
} else {
-   start = ac->ac_o_ex.fe_logical;
-   start = start << bsbits;
-   size = ac->ac_o_ex.fe_len << bsbits;
+   start_off = ac->ac_o_ex.fe_logical << bsbits;
+   size  = ac->ac_o_ex.fe_len << bsbits;
}
orig_size = size = size >> bsbits;
-   orig_start = start = start >> bsbits;
+   orig_start = start = start_off >> bsbits;
 
 	/* don't cover already allocated blocks in selected range */

if (ar->pleft && start <= ar->lleft) {



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] writeback: speed up writeback of big dirty files

2008-01-17 Thread Fengguang Wu
> On Jan 16, 2008 9:15 AM, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> Fengguang's latest writeback patch applies cleanly, builds, boots on 
> 2.6.24-rc8.

Linus, if possible, I'd suggest this patch be merged for 2.6.24.

It's a safer version of the reverted patch. It was tested on
ext2/ext3/jfs/xfs/reiserfs and won't 100% iowait even without the
other bug fixing patches.

Fengguang
---

writeback: speed up writeback of big dirty files

After making dirty a 100M file, the normal behavior is to
start the writeback for all data after 30s delays. But
sometimes the following happens instead:

- after 30s:~4M
- after 5s: ~4M
- after 5s: all remaining 92M

Some analyze shows that the internal io dispatch queues goes like this:

s_ios_more_io
-
1)  100M,1K 0
2)  1K  96M
3)  0   96M
1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes(BUG)
nr_to_write > 0 in (3) fools the upper layer to think that data have all been
written out. The big dirty file is actually still sitting in s_more_io. We
cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty.  It is also not an option to draw inodes from both
s_more_io and s_dirty, an let the loop go on: this might lead to live locks,
and might also starve other superblocks in sync time(well kupdate may still
starve some superblocks, that's another bug).
We have to return when a full scan of s_io completes. So nr_to_write > 0 does
not necessarily mean that "all data are written". This patch introduces a flag
writeback_control.more_io to indicate that more io should be done. With it the
big dirty file no longer has to wait for the next kupdate invocation 5s later.

In sync_sb_inodes() we only set more_io on super_blocks we actually visited.
This aviods the interaction between two pdflush deamons.

Also in __sync_single_inode() we don't blindly keep requeuing the io if the
filesystem cannot progress. Failing to do so may lead to 100% iowait.

Tested-by: Mike Snitzer <[EMAIL PROTECTED]>
Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 fs/fs-writeback.c |   18 --
 include/linux/writeback.h |1 +
 mm/page-writeback.c   |9 ++---
 3 files changed, 23 insertions(+), 5 deletions(-)

--- linux.orig/fs/fs-writeback.c
+++ linux/fs/fs-writeback.c
@@ -284,7 +284,17 @@ __sync_single_inode(struct inode *inode,
 * soon as the queue becomes uncongested.
 */
inode->i_state |= I_DIRTY_PAGES;
-   requeue_io(inode);
+   if (wbc->nr_to_write <= 0) {
+   /*
+* slice used up: queue for next turn
+*/
+   requeue_io(inode);
+   } else {
+   /*
+* somehow blocked: retry later
+*/
+   redirty_tail(inode);
+   }
} else {
/*
 * Otherwise fully redirty the inode so that
@@ -479,8 +489,12 @@ sync_sb_inodes(struct super_block *sb, s
iput(inode);
cond_resched();
spin_lock(&inode_lock);
-   if (wbc->nr_to_write <= 0)
+   if (wbc->nr_to_write <= 0) {
+   wbc->more_io = 1;
break;
+   }
+   if (!list_empty(&sb->s_more_io))
+   wbc->more_io = 1;
}
return; /* Leave any unwritten inodes on s_io */
 }
--- linux.orig/include/linux/writeback.h
+++ linux/include/linux/writeback.h
@@ -62,6 +62,7 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned for_writepages:1;  /* This is a writepages() call */
unsigned range_cyclic:1;/* range_start is cyclic */
+   unsigned more_io:1; /* more io to be dispatched */
 };
 
 /*
--- linux.orig/mm/page-writeback.c
+++ linux/mm/page-writeback.c
@@ -558,6 +558,7 @@ static void background_writeout(unsigned
global_page_state(NR_UNSTABLE_NFS) < background_thresh
&& min_pages <= 0)
break;
+   wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;