Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-17 Thread Chris Mason
On Wed, 2007-10-17 at 11:57 -0600, Eric W. Biederman wrote:
 Christian Borntraeger [EMAIL PROTECTED] writes:
 
  Eric,
 
  Am Dienstag, 16. Oktober 2007 schrieb Christian Borntraeger:
  Am Dienstag, 16. Oktober 2007 schrieb Eric W. Biederman:
  
   fs/buffer.c |    3 +++
   1 files changed, 3 insertions(+), 0 deletions(-)
   drivers/block/rd.c |   13 +------------
   1 files changed, 1 insertions(+), 12 deletions(-)
  
  Your patches look sane so far. I have applied both patches, and the
  problem seems gone. I will try to get these patches to our testers.
  
  As long as they don't find new problems:
 
  Our testers did only a short test, and then they were stopped by problems
  with reiserfs. At the moment I cannot say for sure if your patch caused this,
  but we got the following BUG
 
 Thanks.
 
  ReiserFS: ram0: warning: Created .reiserfs_priv on ram0 - reserved for xattr
  storage.
  ------------[ cut here ]------------
  kernel BUG at
  /home/autobuild/BUILD/linux-2.6.23-20071017/fs/reiserfs/journal.c:1117!
  illegal operation: 0001 [#1]
  Modules linked in: reiserfs dm_multipath sunrpc dm_mod qeth ccwgroup vmur
  CPU: 3 Not tainted
  Process reiserfs/3 (pid: 2592, task: 77dac418, ksp: 7513ee88)
  Krnl PSW : 070c3000 fb344380 (flush_commit_list+0x808/0x95c [reiserfs])
 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0
  Krnl GPRS: 0002 7411b5c8 002b 
 7b04d000 0001  76d1de00
 7513eec0 0003 0012 77f77680
 7411b608 fb343b7e fb34404a 7513ee50
  Krnl Code: fb344374: a7210002   tmll  %r2,2
             fb344378: a7840004   brc   8,fb344380
             fb34437c: a7f40001   brc   15,fb34437e
             fb344380: 5810d8c2   l     %r1,2242(%r13)
             fb344384: 5820b03c   l     %r2,60(%r11)
             fb344388: 0de1       basr  %r14,%r1
             fb34438a: 5810d90e   l     %r1,2318(%r13)
             fb34438e: 5820b03c   l     %r2,60(%r11)
 
 
  Looking at the code, this really seems related to dirty buffers, so your
  patch is the main suspect at the moment.
 
 Sounds reasonable.
 
  if (!barrier) {
          /* If there was a write error in the journal - we can't commit
           * this transaction - it will be invalid and, if successful,
           * will just end up propagating the write error out to
           * the file system. */
          if (likely(!retval && !reiserfs_is_journal_aborted(journal))) {
                  if (buffer_dirty(jl->j_commit_bh))
                          BUG();  /* journal.c:1117 */
                  mark_buffer_dirty(jl->j_commit_bh);
                  sync_dirty_buffer(jl->j_commit_bh);
          }
  }
 
 Grr.  I'm not certain how to call that.
 
 Given that I should also be able to trigger this case by writing to
 the block device through the buffer cache (to the write spot at the
 write moment) this feels like a reiserfs bug.  Although I feel like
 screaming about filesystems that go BUG instead of remounting read-only
 

In this case, the commit block isn't allowed to be dirty before reiserfs
decides it is safe to write it.  The journal code expects it is the only
spot in the kernel setting buffer heads dirty, and it only does so after
the rest of the log blocks are safely on disk.
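
The ordering rule can be modelled in a few lines of userspace C.  This is
only a toy sketch, not reiserfs code: struct model_journal, outside_dirty()
and flush_commit() are made-up names, and a false return stands in for the
BUG() at journal.c:1117.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_journal {
	int  log_blocks_pending;  /* log blocks not yet on disk */
	bool commit_dirty;        /* models buffer_dirty(jl->j_commit_bh) */
	bool commit_on_disk;      /* commit block has been written */
};

/* Something other than the journal dirties the commit block, for example
 * a ramdisk driver that keeps every buffer head dirty. */
void outside_dirty(struct model_journal *j)
{
	j->commit_dirty = true;
}

/* Returns true if the commit block was written in order, false in the
 * case where the real code would hit BUG(). */
bool flush_commit(struct model_journal *j)
{
	assert(j->log_blocks_pending >= 0);

	/* Step 1: wait until all log blocks are safely on disk. */
	j->log_blocks_pending = 0;

	/* Step 2: the journal expects to be the only one dirtying the
	 * commit block, so it must still be clean here. */
	if (j->commit_dirty)
		return false;

	/* Step 3: mark it dirty and write it synchronously. */
	j->commit_dirty = true;
	j->commit_on_disk = true;
	j->commit_dirty = false;
	return true;
}
```

In this model a call to outside_dirty() before flush_commit() is what the
ramdisk patch effectively does to the commit block.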

Given this is a ramdisk, the check can be ignored, but I'd rather not
sprinkle if (ram_disk) into the FS code.

 At the same time I increasingly don't think we should allow user space
 to dirty or update our filesystem metadata buffer heads.  That seems
 like asking for trouble.
 

Demanding trouble ;)

-chris

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-17 Thread Chris Mason
On Wed, 2007-10-17 at 14:29 -0600, Eric W. Biederman wrote:
 Chris Mason [EMAIL PROTECTED] writes:
 
  In this case, the commit block isn't allowed to be dirty before reiserfs
  decides it is safe to write it.  The journal code expects it is the only
  spot in the kernel setting buffer heads dirty, and it only does so after
  the rest of the log blocks are safely on disk.
 
 Ok.  So the journal code here fundamentally depends on being able to
 control the order of the writes, and something else being able to set
 the buffer head dirty messes up that control.
 

Right.

  At the same time I increasingly don't think we should allow user space
  to dirty or update our filesystem metadata buffer heads.  That seems
  like asking for trouble.
  
 
  Demanding trouble ;)
 
 Looks like it.  There are even comments in jbd about the same class
 of problems.  Apparently dump and tune2fs on mounted filesystems have
 triggered some of these issues.  The practical question is whether any
 of this trouble is worth handling.
 
 Thinking about it, I don't believe anyone has ever intentionally built
 a filesystem tool that depends on being able to modify a filesystem's
 metadata buffer heads while the filesystem is running, and doing that
 would seem to be fragile as it would require a lot of cooperation
 between the tool and the filesystem about how the filesystem uses and
 implements things.
 

That's right.  For example, ext2 is doing directories in the page cache
of the directory inode, so there's a cache alias between the block
device page cache and the directory inode page cache.

 Now I guess I need to see how difficult a patch would be to give
 filesystems magic inodes to keep their metadata buffer heads in.

Not hard, the block device inode is already a magic inode for metadata
buffer heads.  You could just make another one attached to the bdev.

But, I don't think I fully understand the problem you're trying to
solve?

-chris




Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-17 Thread Chris Mason
On Wed, 2007-10-17 at 15:30 -0600, Eric W. Biederman wrote:
 Chris Mason [EMAIL PROTECTED] writes:
 
  Thinking about it, I don't believe anyone has ever intentionally built
  a filesystem tool that depends on being able to modify a filesystem's
  metadata buffer heads while the filesystem is running, and doing that
  would seem to be fragile as it would require a lot of cooperation
  between the tool and the filesystem about how the filesystem uses and
  implements things.
  
 
  That's right.  For example, ext2 is doing directories in the page cache
  of the directory inode, so there's a cache alias between the block
  device page cache and the directory inode page cache.
 
  Now I guess I need to see how difficult a patch would be to give
  filesystems magic inodes to keep their metadata buffer heads in.
 
  Not hard, the block device inode is already a magic inode for metadata
  buffer heads.  You could just make another one attached to the bdev.
 
  But, I don't think I fully understand the problem you're trying to
  solve?
 
 
 So the start:
 When we write buffers from the buffer cache we clear buffer_dirty
 but not PageDirty

 So try_to_free_buffers() will mark clean any page whose buffer_heads
 are all clean, even if the page itself is still dirty.
 
 The ramdisk set pages dirty to keep them from being removed from the
 page cache, just like ramfs.
 
So, the problem is using the Dirty bit to indicate pinned.  You're
completely right that our current setup of buffer heads and pages and
filesystem metadata is complex and difficult.

But, moving the buffer heads off of the page cache pages isn't going to
make it any easier to use dirty as pinned, especially in the face of
buffer_head users for file data pages.
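
To make the failure concrete, here is a toy userspace model of it.  It is
a sketch under assumptions, not kernel code: struct model_page and the
model_*() helpers are hypothetical, and model_try_to_free_buffers() only
mimics the behaviour described above for try_to_free_buffers().

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_BUFFERS_PER_PAGE 4  /* e.g. 1k blocks on a 4k page */

/* Hypothetical model of a page cache page with attached buffer heads. */
struct model_page {
	bool page_dirty;                            /* models PageDirty */
	bool has_buffers;                           /* buffer heads attached */
	bool buffer_dirty[MODEL_BUFFERS_PER_PAGE];  /* models buffer_dirty() */
};

/* Writing a buffer clears its dirty bit but leaves the page flag alone. */
void model_write_buffer(struct model_page *p, int i)
{
	assert(i >= 0 && i < MODEL_BUFFERS_PER_PAGE);
	p->buffer_dirty[i] = false;
}

/* If every buffer is clean, drop the buffers and clean the page as well.
 * Returns true if the buffers were freed. */
bool model_try_to_free_buffers(struct model_page *p)
{
	if (!p->has_buffers)
		return false;
	for (int i = 0; i < MODEL_BUFFERS_PER_PAGE; i++)
		if (p->buffer_dirty[i])
			return false;
	p->has_buffers = false;
	p->page_dirty = false;  /* the "pin" is lost here */
	return true;
}

/* In this model, reclaim may only drop clean pages without buffers. */
bool model_can_reclaim(const struct model_page *p)
{
	return !p->page_dirty && !p->has_buffers;
}
```

A page that was "pinned" by its dirty bit becomes reclaimable as soon as
all of its buffers have been written and the buffers are freed, which is
why dirty does not work as a pin once buffer heads are involved.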

You've already seen Nick's fsblock code, but you can see my general
approach to replacing buffer heads here:

http://oss.oracle.com/mercurial/mason/btrfs-unstable/file/f89e7971692f/extent_map.h

(alpha quality implementation in extent_map.c and users in inode.c)  The
basic idea is to do extent based record keeping for mapping and state of
things in the filesystem, and to avoid attaching these things to the
page.

 Unfortunately when those dirty ramdisk pages get buffers on them and
 those buffers all go clean and we are trying to reclaim buffer_heads
 we drop those pages from the page cache.   Ouch!
 
 We can fix the ramdisk by making certain that buffer_heads
 on ramdisk pages stay dirty as well.  The problem is this breaks
 filesystems like reiserfs and ext3 that expect to be able to make
 buffer_heads clean sometimes.
 
 There are other ways to solve this for ramdisks, such as changing
 where ramdisks are stored.  However fixing the ramdisks this way
 still leaves the general problem that there are other paths to the
 filesystem metadata buffers, and those other paths cause the code
 to be complicated and buggy.
 
 So I'm trying to see if we can untangle this Gordian knot, so the
 code becomes more easily maintainable.
 

Don't get me wrong, I'd love to see a simple and coherent fix for what
reiserfs and ext3 do with buffer head state, but I think for the short
term you're best off pinning the ramdisk pages via some other means.

-chris




Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-17 Thread Chris Mason
On Wed, 2007-10-17 at 17:28 -0600, Eric W. Biederman wrote:
 Chris Mason [EMAIL PROTECTED] writes:
 
  So, the problem is using the Dirty bit to indicate pinned.  You're
  completely right that our current setup of buffer heads and pages and
  filesystem metadata is complex and difficult.
 
  But, moving the buffer heads off of the page cache pages isn't going to
  make it any easier to use dirty as pinned, especially in the face of
  buffer_head users for file data pages.
 
 Let me be specific.  Not moving buffer_heads off of page cache pages,
 moving buffer_heads off of the block device's page cache pages.
 
 My problem is the coupling of how block devices are cached and the
 implementation of buffer heads, and by removing that coupling
 we can generally make things better.  Currently that coupling
 means silly things like all block devices are cached in low memory.
 Which probably isn't what you want if you actually have a use
 for block devices.
 
 For the ramdisk case in particular what this means is that there
 are no more users that create buffer_head mappings on the block
 device cache so using the dirty bit will be safe.

Ok, we move the buffer heads off to a different inode, and that inode
has pages.  The pages on the inode still need to get pinned, how does
that pinning happen?

The problem you described where someone cleans a page because the buffer
heads are clean happens already without help from userland.  So, keeping
the pages away from userland won't save them from cleaning.

Sorry if I'm reading your suggestion wrong...

-chris




Re: [PATCH] rd: Use a private inode for backing storage

2007-10-22 Thread Chris Mason
On Sun, 21 Oct 2007 12:39:30 -0600
[EMAIL PROTECTED] (Eric W. Biederman) wrote:

 Nick Piggin [EMAIL PROTECTED] writes:
 
  On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
  Christian Borntraeger [EMAIL PROTECTED] writes:
 
  Let me put it another way.  Looking at /proc/slabinfo I can get
  37 buffer_heads per page.  I can allocate 10% of memory in
  buffer_heads before we start to reclaim them.  So it requires just
  over 3.7 buffer_heads on every page of low memory to even trigger
  this case.  That is a large 1k filesystem or a weird sized
  partition, that we have written to directly.
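
The arithmetic above can be written out as a small check.  This is only a
sketch: the function names are made up, and the 4k page size used in the
usage note is an assumption, not something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Average buffer_heads per page of memory, in tenths, once buffer_heads
 * occupy percent_limit percent of memory.  With 37 buffer_heads per slab
 * page and a 10% limit this is 37 tenths, i.e. 3.7 per page. */
unsigned bh_per_page_at_limit_tenths(unsigned bh_per_slab_page,
				     unsigned percent_limit)
{
	assert(percent_limit <= 100);
	return bh_per_slab_page * percent_limit / 10;
}

/* buffer_heads needed to map one fully cached page at a given block size. */
unsigned bh_per_cached_page(unsigned page_size, unsigned block_size)
{
	assert(block_size != 0 && page_size % block_size == 0);
	return page_size / block_size;
}

/* True if caching pages at this block size can push the per-page average
 * above the reclaim limit. */
bool can_reach_limit(unsigned page_size, unsigned block_size,
		     unsigned bh_per_slab_page, unsigned percent_limit)
{
	return bh_per_cached_page(page_size, block_size) * 10 >
	       bh_per_page_at_limit_tenths(bh_per_slab_page, percent_limit);
}
```

Assuming 4k pages, a 1k block size needs 4 buffer_heads per page, which is
above 3.7, while 2k and 4k block sizes (2 and 1 per page) stay below it.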
 
  On a highmem machine it could be relatively common.
 
 Possibly.  But the same proportions still hold.  1k filesystems
 are not the default these days and ramdisks are relatively uncommon.
 The memory quantities involved are all low mem.

It is definitely common during run time.  It was seen in practice enough
to be reproducible and get fixed for the non-ramdisk case.

The big underlying question is which ramdisk use case we are
shooting for. Keeping the ramdisk pages off the LRU can certainly help
the VM if larger ramdisks used at runtime are very common.

Otherwise, I'd say to keep it as simple as possible and use Eric's
patch.  By simple I'm not counting lines of code, I'm counting overall
readability between something everyone knows (page cache usage) and
something specific to ramdisks (Nick's patch).

-chris


Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file

2007-10-23 Thread Chris Mason
On Tue, 23 Oct 2007 19:56:20 +0800
Fengguang Wu [EMAIL PROTECTED] wrote:

 On Tue, Oct 23, 2007 at 12:07:07PM +0200, Peter Zijlstra wrote:
  [ adding reiserfs devs to the CC ]
 
 Thank you.
 
 This fix is kind of crude - even though it fixed Maxim's problem and
 survived my stress testing of a lot of patching and kernel compiling.
 I'd be glad to see better solutions.

This should be safe, reiserfs has the buffer heads themselves clean and
the page should get cleaned eventually.  The cancel_dirty_page call was
just an optimization to be VM friendly.

-chris


[CFP] 2008 Linux Storage and Filesystem Workshop

2007-10-24 Thread Chris Mason

Hello everyone,

We are organizing another filesystem and storage workshop in San Jose
next Feb 25 and 26.  You can find some great writeups of last year's
conference on LWN:

http://lwn.net/Articles/226351/

This year we're trying to concentrate on more problem solving sessions,
short term projects and joint sessions.  You can find all the details
on the conference webpages:

http://www.usenix.org/events/lsf08/

Soon there will be a link for submitting your position statement, which
is basically a note to the organizers that you are interested in
attending and which topics you think should be covered.

We're also looking for people to lead the discussion around the major
topics, so please let us know if you're interested in that.  The
discussion leaders will have input into the people that get invited and
the format of the discussion.

Please let me know if there are any questions about the workshop.

Thanks,
Chris


Re: [ANNOUNCE] Btrfs v0.12 released

2008-02-11 Thread Chris Mason
On Sunday 10 February 2008, David Miller wrote:
 From: Chris Mason [EMAIL PROTECTED]
 Date: Wed, 6 Feb 2008 12:00:13 -0500

 This function never returns an error, so the simplest fix was to
 return the hash value which avoids all of the issues.  In attempting
 other schemes to fix this, I found it very difficult to give gcc
 a packed attribute for that u64 * argument other than to create
 some new pseudo structure which would have been ugly.

Many thanks, I clearly didn't put enough thought into the unaligned access 
problems.

 Similar code lives in the btrfs kernel code too, I'll try to get a
 partition at least mounted and working minimally and if successful
 I'll send you patches for that too.

The kernel is actually worse, because the set/get macros are more complex.  
Some live in ctree.h like in the progs, but the nasty ones live in 
struct-funcs.c

-chris


Re: BTRFS only works with PAGE_SIZE <= 4K

2008-02-12 Thread Chris Mason
On Tuesday 12 February 2008, David Miller wrote:
 From: Chris Mason [EMAIL PROTECTED]
 Date: Wed, 6 Feb 2008 12:00:13 -0500

  So, here's v0.12.

 Any page size larger than 4K will not work with btrfs.  All of the
 extent stuff assumes that PAGE_SIZE <= sectorsize.

Yeah, there is definitely clean up to do in that area.


 I confirmed this by forcing mkfs.btrfs to use an 8K sectorsize on
 sparc64 and I was finally able to successfully mount a partition.

Nice


 With 4K there are zeros in the root tree node header, because its
 extent's location on disk is at a sub-PAGE_SIZE multiple and the
 extent code doesn't handle that.

 You really need to start validating this stuff on other platforms.
 Something that isn't little endian and something that doesn't use 4K
 pages.  I'm sure you have some powerpc parts around somewhere. :)

Grin, I think around v0.4 I grabbed a ppc box for a day and got things 
working.  There has been some churn since then...

My first prio is the newest set of disk format changes, and then I'll sit down 
and work on stability on a bunch of arches.


 Anyways, here is a patch for the kernel bits which fixes most of the
 unaligned accesses on sparc64.

Many thanks, I'll try these out here and push them into the tree.

-chris


Re: BTRFS partition usage...

2008-02-12 Thread Chris Mason
On Tuesday 12 February 2008, Jan Engelhardt wrote:
 On Feb 12 2008 09:08, Chris Mason wrote:
  So, if Btrfs starts zeroing at 1k, will that be acceptable for you?
 
  Something looks wrong here. Why would btrfs need to zero at all?
  Superblock at 0, and done. Just like xfs.
  (Yes, I had xfs on sparc before, so it's not like you NEED the
  whitespace at the start of a partition.)
 
 I've had requests to move the super down to 64k to make room for
  bootloaders, which may not matter for sparc, but I don't really plan on
  different locations for different arches.

 In x86, there is even more space for a bootloader (some 28k or so)
 even if your partition table is as closely packed as possible,
 from 0 to 7e00 IIRC.

 For sparc you could have something like

         startlba  endlba  type
   sda1  0         2       1 Boot
   sda2  2         58      3 Whole disk
   sda3  58        9       83 Linux

 and slap the bootloader into MBR, just like on x86.
 Or I am missing something..

It was a request from hpa, and he clearly had something in mind.  He kindly 
offered to review the disk format for bootloaders and other lower level 
issues but I asked him to wait until I firm it up a bit.

From my point of view, 0 is a bad idea because it is very likely to conflict
with other things.  There are lots of things in the FS that need deep
thought, and the perfect system to fully use the first 64k of a 1TB filesystem
isn't quite at the top of my list right now ;)

Regardless of offset, it is a good idea to mop up previous filesystems where 
possible, and a very good idea to align things on some sector boundary.  Even 
going 1MB in wouldn't be a horrible idea to align with erasure blocks on SSD.

-chris


Re: BTRFS partition usage...

2008-02-12 Thread Chris Mason
On Tuesday 12 February 2008, Jan Engelhardt wrote:
 On Feb 12 2008 08:49, Chris Mason wrote:
   This is a real issue on sparc where the default sun disk labels
   created use an initial partition where block zero aliases the disk
   label.  It took me a few iterations before I figured out why every
   btrfs make would zero out my disk label :-/
 
  Actually it seems this is only a problem with mkfs.btrfs, it clears
  out the first 64 4K chunks of the disk for whatever reason.
 
 It is a good idea to remove supers from other filesystems.  I also need to
  add zeroing at the end of the device as well.
 
 Looks like I misread the e2fs zeroing code.  It zeros the whole external
  log device, and I assumed it also zero'd out the start of the main FS.
 
 So, if Btrfs starts zeroing at 1k, will that be acceptable for you?

 Something looks wrong here. Why would btrfs need to zero at all?
 Superblock at 0, and done. Just like xfs.
 (Yes, I had xfs on sparc before, so it's not like you NEED the
 whitespace at the start of a partition.)

I've had requests to move the super down to 64k to make room for bootloaders, 
which may not matter for sparc, but I don't really plan on different 
locations for different arches.

4k aligned is important given that sector sizes are growing.

-chris


Re: BTRFS partition usage...

2008-02-12 Thread Chris Mason
On Tuesday 12 February 2008, Jan Engelhardt wrote:
 On Feb 12 2008 09:35, Chris Mason wrote:
  and slap the bootloader into MBR, just like on x86.
  Or I am missing something..
 
 It was a request from hpa, and he clearly had something in mind.  He
  kindly offered to review the disk format for bootloaders and other lower
  level issues but I asked him to wait until I firm it up a bit.
 
 From my point of view, 0 is a bad idea because it is very likely to
  conflict with other things.  There are lots of things in the FS that need
  deep thought, and the perfect system to fully use the first 64k of a 1TB
  filesystem isn't quite at the top of my list right now ;)
 
 Regardless of offset, it is a good idea to mop up previous filesystems
  where possible, and a very good idea to align things on some sector
  boundary.  Even going 1MB in wouldn't be a horrible idea to align with
  erasure blocks on SSD.

 I still don't like the idea of btrfs trying to be smarter than a user
 who can partition up his system according to
   (a) his likes
   (b) system or hardware requirements or recommendations
 to align the superblock to a specific location.

Will all the users in the world who think about super block location when they 
partition their disks please raise their hands?

The location of the super block needs to be very simple in order for mount and 
friends to find and detect it.  It needs a simple algorithm to try multiple 
locations in case a given copy of the super is corrupt.
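
As a rough illustration of "try a few fixed locations", here is a toy
probe loop.  Everything in it is hypothetical: the magic string, the
16-byte layout, the one-byte checksum and the second offset are invented
for the example; only the 64k offset comes from this thread, and none of
it is the Btrfs disk format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_SUPER_MAGIC "MODELSB1"  /* hypothetical 8-byte magic */
#define MODEL_SUPER_SIZE  16          /* magic + payload + checksum byte */

/* Candidate locations, tried in order.  64k is from the thread; the
 * second offset is made up for the example. */
static const size_t model_super_offsets[] = { 64 * 1024, 128 * 1024 };

/* Toy checksum: byte sum over everything except the last byte. */
uint8_t model_sum(const uint8_t *sb)
{
	unsigned sum = 0;
	for (int i = 0; i < MODEL_SUPER_SIZE - 1; i++)
		sum += sb[i];
	return (uint8_t)sum;
}

/* Write a valid toy super block at the given byte offset. */
void model_write_super(uint8_t *dev, size_t off)
{
	uint8_t *sb = dev + off;
	memset(sb, 0, MODEL_SUPER_SIZE);
	memcpy(sb, MODEL_SUPER_MAGIC, 8);
	sb[MODEL_SUPER_SIZE - 1] = model_sum(sb);
}

/* Return the offset of the first valid copy, or -1 if none is found. */
long model_find_super(const uint8_t *dev, size_t dev_size)
{
	size_t n = sizeof(model_super_offsets) / sizeof(model_super_offsets[0]);

	for (size_t i = 0; i < n; i++) {
		size_t off = model_super_offsets[i];
		const uint8_t *sb = dev + off;

		if (off + MODEL_SUPER_SIZE > dev_size)
			continue;  /* location is beyond this device */
		if (memcmp(sb, MODEL_SUPER_MAGIC, 8) != 0)
			continue;  /* wrong magic, not our super */
		if (sb[MODEL_SUPER_SIZE - 1] != model_sum(sb))
			continue;  /* corrupt copy, try the next one */
		return (long)off;
	}
	return -1;
}
```

The point of keeping the list of locations fixed is that mount and other
tools can run the same loop without knowing anything about how the user
partitioned the disk.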

Design in this case is a bunch of compromises around other users of the 
hardware, ease of programming, and the benefits in performance or usability 
from doing something complex.


 1MB alignment does not always mean 1MB alignment.
 Sector 1 begins at 0x7e00 on x86.
 And with the maximum CHS geometry (255/63), partitions begin
 at 0x7e00+n*8225280 bytes, so the SB is unlikely to ever be on
 a 1048576 boundary.

IO is already aligned on sectors, sometimes we'll have a perfect erasure block 
alignment and sometimes not.  When the location of the super is my biggest 
bottleneck, I'll be a very happy boy.
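
For what it's worth, the numbers quoted above can be checked directly.
The sketch below uses only the 0x7e00 + n*8225280 formula from the quoted
message; the function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_BYTES      512u
#define SECTORS_PER_TRACK 63u
#define HEADS             255u
#define MIB               1048576u

/* Byte offset of a partition that starts on cylinder n, one track in:
 * 0x7e00 + n * 8225280, as in the quoted message. */
uint64_t chs_partition_start(uint64_t n)
{
	const uint64_t track = (uint64_t)SECTORS_PER_TRACK * SECTOR_BYTES;
	const uint64_t cylinder = track * HEADS;

	return track + n * cylinder;
}

bool is_mib_aligned(uint64_t byte_offset)
{
	return byte_offset % MIB == 0;
}

/* First cylinder below limit whose partition start is 1MB aligned,
 * or -1 if there is none. */
int64_t first_aligned_cylinder(uint64_t limit)
{
	assert(limit <= INT64_MAX);
	for (uint64_t n = 0; n < limit; n++)
		if (is_mib_aligned(chs_partition_start(n)))
			return (int64_t)n;
	return -1;
}
```

Every start is sector aligned, but only one cylinder in every 2048 (the
first being cylinder 257) also lands on a 1MB boundary.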

-chris


Re: [ANNOUNCE] Btrfs v0.12 released

2008-02-12 Thread Chris Mason
On Tuesday 12 February 2008, David Miller wrote:
 From: Chris Mason [EMAIL PROTECTED]
 Date: Mon, 11 Feb 2008 08:42:20 -0500

  The kernel is actually worse, because the set/get macros are more
  complex. Some live in ctree.h like in the progs, but the nasty ones live
  in struct-funcs.c

 This is really problematic, because you've got these things called
 btrfs_item_ptr() which really isn't a pointer, it's a relative
 'unsigned long' offset cast to a pointer.  The source of this
 seems to be btrfs_leaf_data().

 And then those things get passed down into the SETGET functions!

Explaining it won't make it pretty, but at least I can tell you what the code 
does.

This is all part of the btrfs code that supports tree block sizes larger than 
a page.  The extent_buffer code (extent_io.c) provides a read/write api into 
an extent_buffer based on offsets from the start of the multi-page buffer.  
That's where the relative unsigned long comes from.

The part where I cast it to pointers is me trying to maintain type checking 
throughout the code.  The pointers themselves are useless, they need to be 
matched with an extent_buffer to actually get to the bytes.

There are a few parts where the SETGET funcs are open coded, mostly in very 
performance critical functions.  Just look for lexxx_to_cpu
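
A stripped-down userspace sketch of the idea, in case it helps.  This is
not the extent_io.c code: struct model_extent_buffer, model_read_eb() and
model_get_le64() are hypothetical names, and the byte-by-byte decode is
just one way to stay safe on machines that trap on unaligned loads.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_MAX_PAGES 4u

/* Hypothetical stand-in for a tree block that spans several pages.
 * The pages do not have to be contiguous in memory. */
struct model_extent_buffer {
	uint8_t *pages[MODEL_MAX_PAGES];
	size_t   len;  /* total bytes in the buffer */
};

/* Copy len bytes starting at byte offset start (relative to the start of
 * the buffer), crossing page boundaries as needed. */
void model_read_eb(const struct model_extent_buffer *eb, void *dst,
		   size_t start, size_t len)
{
	uint8_t *out = dst;

	assert(start + len <= eb->len);
	while (len > 0) {
		size_t page = start / MODEL_PAGE_SIZE;
		size_t off = start % MODEL_PAGE_SIZE;
		size_t chunk = MODEL_PAGE_SIZE - off;

		if (chunk > len)
			chunk = len;
		memcpy(out, eb->pages[page] + off, chunk);
		out += chunk;
		start += chunk;
		len -= chunk;
	}
}

/* Read a little-endian u64 at an arbitrary, possibly unaligned, offset.
 * The offset alone is useless; it only means something together with
 * the buffer it is relative to. */
uint64_t model_get_le64(const struct model_extent_buffer *eb, size_t offset)
{
	uint8_t raw[8];
	uint64_t v = 0;

	model_read_eb(eb, raw, offset, sizeof(raw));
	for (int i = 7; i >= 0; i--)
		v = (v << 8) | raw[i];
	return v;
}
```

The "pointer" in the real code plays the role of the offset argument here:
it carries a type for the compiler but has to be paired with the buffer
before any bytes can be reached.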


 Then deeper down we have terribly inconsistent things like
 btrfs_item_nr_offset() and
 btrfs_item_offset_nr().

Btree blocks have the offset of the item header from the start of the block 
and the offset of the item data.  And, I'm very bad at naming.


 Sigh... I'll see what I can do.

Thanks

-chris


Re: [PATCH] Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer

2012-10-09 Thread Chris Mason
On Mon, Oct 08, 2012 at 07:26:15AM -0600, Wang Sheng-Hui wrote:
 In csum_dirty_buffer, we first get eb from page->private.
 Then we check if the page is the first page of eb. Later
 we check it again. Remove the repeated check here.

You had the right idea here, two checks and one has a warning, so you
kept the warning.  But when the metadata block size is bigger than a
page, the WARN_ON triggers for any page that isn't the first one in the
extent buffer.

I kept this commit but removed the WARN_ON(1)

-chris


[GIT PULL] Btrfs

2012-10-09 Thread Chris Mason
 reservation structure (+39/-22)
Btrfs: fix wrong size for the reservation of the, snapshot creation (+4/-4)
Revert Btrfs: do not do filemap_write_and_wait_range in fsync (+11/-3)
Btrfs: fix file extent discount problem in the, snapshot (+25/-44)
Btrfs: fix orphan transaction on the freezed filesystem (+49/-23)
Btrfs: add a type field for the transaction handle (+21/-42)
Btrfs: fix error path in create_pending_snapshot() (+17/-23)
Btrfs: use a slab for ordered extents allocation (+31/-3)
Btrfs: fix wrong orphan count of the fs/file tree (+1/-1)
Btrfs: fix corrupted metadata in the snapshot (+32/-18)
Btrfs: fix the snapshot that should not exist (+53/-15)
Btrfs: fix memory leak in start_transaction() (+3/-1)
Btrfs: fix unprotected ->log_batch (+9/-11)

Liu Bo (13) commits (+150/-113):
Btrfs: fix a bug in checking whether a inode is already in log (+10/-8)
Btrfs: kill obsolete arguments in btrfs_wait_ordered_extents (+7/-18)
Btrfs: fix a bug in parsing return value in logical resolve (+34/-20)
Btrfs: use larger limit for translation of logical to inode (+5/-4)
Btrfs: check if an inode has no checksum when logging it (+12/-11)
Btrfs: update delayed ref's tracepoints to show sequence (+10/-4)
Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag (+28/-14)
Btrfs: improve fsync by filtering extents that we want (+26/-3)
Btrfs: cleanup for duplicated code in find_free_extent (+0/-4)
Btrfs: cleanup extents after we finish logging inode (+6/-0)
Btrfs: use helper for logical resolve (+3/-16)
Btrfs: fix off-by-one in file clone (+9/-9)
Btrfs: cleanup fs_info->hashers (+0/-2)

Tsutomu Itoh (6) commits (+19/-20):
Btrfs: confirmation of value is added before trace_btrfs_get_extent() is 
called (+2/-1)
Btrfs: remove unnecessary IS_ERR in bio_readpage_error() (+1/-1)
Btrfs: cleanup of error processing in btree_get_extent() (+5/-9)
Btrfs: fix error handling in delete_block_group_cache() (+2/-2)
Btrfs: remove unnecessary code in btree_get_extent() (+1/-7)
Btrfs: check return value of ulist_alloc() properly (+8/-0)

David Sterba (4) commits (+119/-62):
btrfs: allow setting NOCOW for a zero sized file via ioctl (+27/-4)
btrfs: move transaction aborts to the point of failure (+80/-47)
btrfs: return EPERM upon rmdir on a subvolume (+3/-2)
btrfs: polish names of kmem caches (+9/-9)

Sage Weil (3) commits (+18/-2):
Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs() (+0/-2)
Btrfs: pass lockdep rwsem metadata to async commit transaction (+16/-0)
Btrfs: set journal_info in async trans commit worker (+2/-0)

Stefan Behrens (2) commits (+156/-21):
Btrfs: make filesystem read-only when submitting barrier fails (+142/-19)
Btrfs: detect corrupted filesystem after write I/O errors (+14/-2)

Robin Dong (2) commits (+12/-157):
btrfs: remove unused function btrfs_insert_some_items() (+0/-143)
btrfs: move inline function code to header file (+12/-14)

Mark Fasheh (2) commits (+848/-116):
btrfs: extended inode ref iteration (+138/-37)
btrfs: extended inode refs (+710/-79)

Wei Yongjun (2) commits (+3/-6):
Btrfs: fix possible memory leak in scrub_setup_recheck_block() (+1/-0)
Btrfs: using for_each_set_bit_from to simplify the code (+2/-6)

Chris Mason (2) commits (+38/-16):
Btrfs: fix btrfs send for inline items and compression (+37/-15)
btrfs: init ref_index to zero in add_inode_ref (+1/-1)

Jan Schmidt (2) commits (+129/-112):
btrfs: improved readablity for add_inode_ref (+97/-81)
Btrfs: fix gcc warnings for 32bit compiles (+32/-31)

Zach Brown (1) commits (+2/-1):
btrfs: fix min csum item size warnings in 32bit

Daniel J Blueman (1) commits (+11/-11):
btrfs: fix message printing

Anand Jain (1) commits (+7/-5):
Btrfs: write_buf is now callable outside send.c

Kent Overstreet (1) commits (+2/-17):
btrfs: Kill some bi_idx references

Andrei Popa (1) commits (+13/-1):
Btrfs: make compress and nodatacow mount options mutually exclusive

liubo (1) commits (+0/-8):
Btrfs: cleanup for unused ref cache stuff

Wang Sheng-Hui (1) commits (+0/-4):
Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer

Total: (121) commits

 fs/btrfs/backref.c   | 299 +++---
 fs/btrfs/backref.h   |  10 +-
 fs/btrfs/btrfs_inode.h   |  15 +-
 fs/btrfs/check-integrity.c   |  16 +-
 fs/btrfs/compression.c   |  13 +-
 fs/btrfs/ctree.c | 148 +--
 fs/btrfs/ctree.h | 109 +-
 fs/btrfs/delayed-inode.c |   6 +-
 fs/btrfs/disk-io.c   | 230 ++-
 fs/btrfs/disk-io.h   |   2 +
 fs/btrfs/extent-tree.c   | 376 +-
 fs/btrfs/extent_io.c | 128 --
 fs/btrfs/extent_io.h |  23 +-
 fs/btrfs/extent_map.c|  55 ++-
 fs/btrfs/extent_map.h|   8 +-
 fs/btrfs/file-item.c |   5 +-
 fs/btrfs/file.c

[GIT PULL] Btrfs fixes

2012-10-26 Thread Chris Mason
Hi Linus,

My for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Has our series of fixes for the next rc.  The biggest batch is from Jan
Schmidt, fixing up some problems in our subvolume quota code and fixing
btrfs send/receive to work with the new extended inode refs.

My git tree is against 3.6, but these were all retested against your
current git.

Jan Schmidt (7) commits (+149/-76):
Btrfs: don't put removals from push_node_left into tree mod log twice 
(+7/-2)
Btrfs: fix a tree mod logging issue for root replacement operations (+2/-8)
Btrfs: tree mod log's old roots could still be part of the tree (+21/-4)
Btrfs: fix extent buffer reference for tree mod log roots (+1/-1)
Btrfs: extended inode refs support for send mechanism (+94/-58)
Btrfs: comment for loop in tree_mod_log_insert_move (+5/-0)
Btrfs: determine level of old roots (+19/-3)

Josef Bacik (2) commits (+8/-6):
Btrfs: Use btrfs_update_inode_fallback when creating a snapshot (+6/-5)
Btrfs: do not bug when we fail to commit the transaction (+2/-1)

Stefan Behrens (1) commits (+2/-2):
Btrfs: Fix wrong error handling code

Lukas Czerner (1) commits (+2/-1):
btrfs: Return EINVAL when length to trim is less than FSB

Arne Jansen (1) commits (+2/-1):
Btrfs: send correct rdev and mode in btrfs-send

Gabriel de Perthuis (1) commits (+1/-1):
Fix a sign bug causing invalid memory access in the ino_paths ioctl.

Liu Bo (1) commits (+5/-3):
Btrfs: fix memory leak when cloning root's node

Alex Lyakas (1) commits (+13/-14):
Btrfs: Send: preserve ownership (uid and gid) also for symlinks.

Miao Xie (1) commits (+7/-0):
Btrfs: fix deadlock caused by the nested chunk allocation

Tsutomu Itoh (1) commits (+13/-4):
Btrfs: fix memory leak in btrfs_quota_enable()

Total: (17) commits (+202/-108)
 fs/btrfs/backref.c |  28 -
 fs/btrfs/backref.h |   4 ++
 fs/btrfs/ctree.c   |  70 +-
 fs/btrfs/ctree.h   |   3 +
 fs/btrfs/extent_io.c   |   4 +-
 fs/btrfs/inode.c   |   7 +--
 fs/btrfs/ioctl.c   |   6 +-
 fs/btrfs/qgroup.c  |  17 --
 fs/btrfs/send.c| 156 ++---
 fs/btrfs/transaction.c |   2 +-
 fs/btrfs/volumes.c |   7 +++
 11 files changed, 199 insertions(+), 105 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[ANNOUNCE] seekwatcher IO graphing v0.2

2007-07-23 Thread Chris Mason

Hello everyone,

Since doing the initial Btrfs benchmarks, I've made my blktrace graphing
utility a little more generic and tossed it out on oss.oracle.com.

This new version can easily graph two different runs, and has a few
other tweaks that make the graphs look nicer.

Docs, examples and other details are at:

http://oss.oracle.com/~mason/seekwatcher

-chris


[PATCH RFC] extent mapped page cache

2007-07-24 Thread Chris Mason
On Tue, 10 Jul 2007 17:03:26 -0400
Chris Mason [EMAIL PROTECTED] wrote:

 This patch aims to demonstrate one way to replace buffer heads with a
 few extent trees.  Buffer heads provide a few different features:
 
 1) Mapping of logical file offset to blocks on disk
 2) Recording state (dirty, locked etc)
 3) Providing a mechanism to access sub-page sized blocks.
 
 This patch covers #1 and #2, I'll start on #3 a little later next
 week.
 
Well, almost.  I decided to try out an rbtree instead of the radix,
which turned out to be much faster.  Even though individual operations
are slower, the rbtree was able to do many fewer ops to accomplish the
same thing, especially for merging extents together.  It also uses much
less ram.
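To show why merging cuts down the number of records (and so the number of ops), here is a minimal user-space sketch. It is not code from the patch: it assumes a sorted array instead of an rbtree, and the names range_state and merge_ranges are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for one state record: an inclusive
 * byte range [start, end] plus a bitmask of state bits. */
struct range_state {
	uint64_t start;
	uint64_t end;
	unsigned int bits;
};

/*
 * Merge neighbours that touch (prev.end + 1 == next.start) and carry
 * identical bits.  The array must be sorted by start and must not
 * overlap.  Returns the new number of records.
 */
size_t merge_ranges(struct range_state *r, size_t n)
{
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		assert(r[i].start <= r[i].end);
		if (out > 0 &&
		    r[out - 1].bits == r[i].bits &&
		    r[out - 1].end + 1 == r[i].start) {
			/* Extend the previous record instead of keeping two. */
			r[out - 1].end = r[i].end;
		} else {
			r[out++] = r[i];
		}
	}
	return out;
}
```

Two page-sized dirty ranges that sit next to each other collapse into one record, so a later lookup or state change touches one entry instead of two.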

This code still has lots of room for optimization, but it comes in at
around 2-5% more cpu time for ext2 streaming reads and writes.  I
haven't done readpages or writepages yet, so this is more or less a
worst case setup.  I'm comparing against ext2 with readpages and
writepages disabled.

The new code has the added benefit of passing fsx-linux, and not
triggering MCE's on my poor little test box.

The basic idea is to store state in byte ranges in an rbtree, and to
mirror that state down into individual pages.  This allows us to store
arbitrary state outside of the page struct, so we could include the pid
of the process that dirtied a page range for cfq purposes.  The example
readpage and writepage code is probably the easiest way to understand
the basic API.

A separate rbtree stores a mapping of byte offset in the file to byte
offset on disk.  This allows the filesystem to fill in mapping
information in bulk, and reduces the number of metadata lookups
required to do common operations.
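A rough sketch of what one such mapping lets you compute, in plain user-space C. The struct and the function map_file_offset are hypothetical; only the idea (file byte range plus the disk byte that backs its start) comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: one mapping from a file byte range to disk bytes. */
struct extent_map_sketch {
	uint64_t start;        /* first file byte covered (inclusive) */
	uint64_t end;          /* last file byte covered (inclusive) */
	uint64_t block_start;  /* disk byte that backs 'start' */
};

/*
 * Translate a file byte offset to a disk byte offset.
 * Returns 0 on success, -1 if the offset is outside the extent.
 */
int map_file_offset(const struct extent_map_sketch *em, uint64_t file_off,
		    uint64_t *disk_off)
{
	assert(em->start <= em->end);
	if (file_off < em->start || file_off > em->end)
		return -1;
	*disk_off = em->block_start + (file_off - em->start);
	return 0;
}
```

Once a large extent is cached this way, every page inside it can be mapped with one subtraction and one addition, with no further metadata lookup.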

Because the state and mapping information are separate from the page,
pages can come and go and their corresponding metadata can still be
cached (the current code drops mappings as the last page corresponding
to that mapping disappears).

Two patches follow, the core extent_map implementation and a sample
user (ext2).  This is pretty basic, implementing prepare/commit_write,
read/writepage and a few other funcs to exercise the new code.  Longer
term, it should fit in with Nick's other extent work instead of
prepare/commit_write.

My patch sets page->private to 1, really for no good reason.  It is
just a debugging aid I was using to make sure the page took the right
path down the line.  If this catches on, we might set it to a magic
value so you can do if (ExtentPage(page)) or just leave it as null.

-chris




[PATCH RFC] extent mapped page cache main code

2007-07-24 Thread Chris Mason
Core Extentmap implementation

diff -r 126111346f94 -r 53cabea328f7 fs/Makefile
--- a/fs/Makefile   Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/Makefile   Tue Jul 24 15:40:27 2007 -0400
@@ -11,7 +11,7 @@ obj-y :=  open.o read_write.o file_table.
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
-   stack.o
+   stack.o extent_map.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff -r 126111346f94 -r 53cabea328f7 fs/extent_map.c
--- /dev/null   Thu Jan 01 00:00:00 1970 +
+++ b/fs/extent_map.c   Tue Jul 24 15:40:27 2007 -0400
@@ -0,0 +1,1591 @@
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/bio.h>
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/pagemap.h>
+#include <linux/page-flags.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/extent_map.h>
+
+static struct kmem_cache *extent_map_cache;
+static struct kmem_cache *extent_state_cache;
+
+struct tree_entry {
+   u64 start;
+   u64 end;
+   int in_tree;
+   struct rb_node rb_node;
+};
+
+
+/* bits for the extent state */
+#define EXTENT_DIRTY 1
+#define EXTENT_WRITEBACK (1 << 1)
+#define EXTENT_UPTODATE (1 << 2)
+#define EXTENT_LOCKED (1 << 3)
+#define EXTENT_NEW (1 << 4)
+
+#define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
+
+void __init extent_map_init(void)
+{
+   extent_map_cache = kmem_cache_create("extent_map",
+   sizeof(struct extent_map), 0,
+   SLAB_RECLAIM_ACCOUNT |
+   SLAB_DESTROY_BY_RCU,
+   NULL, NULL);
+   extent_state_cache = kmem_cache_create("extent_state",
+   sizeof(struct extent_state), 0,
+   SLAB_RECLAIM_ACCOUNT |
+   SLAB_DESTROY_BY_RCU,
+   NULL, NULL);
+}
+
+void extent_map_tree_init(struct extent_map_tree *tree,
+ struct address_space *mapping, gfp_t mask)
+{
+   tree->map.rb_node = NULL;
+   tree->state.rb_node = NULL;
+   rwlock_init(&tree->lock);
+   tree->mapping = mapping;
+}
+EXPORT_SYMBOL(extent_map_tree_init);
+
+struct extent_map *alloc_extent_map(gfp_t mask)
+{
+   struct extent_map *em;
+   em = kmem_cache_alloc(extent_map_cache, mask);
+   if (!em || IS_ERR(em))
+   return em;
+   em->in_tree = 0;
+   atomic_set(&em->refs, 1);
+   return em;
+}
+EXPORT_SYMBOL(alloc_extent_map);
+
+void free_extent_map(struct extent_map *em)
+{
+   if (atomic_dec_and_test(&em->refs)) {
+   WARN_ON(em->in_tree);
+   kmem_cache_free(extent_map_cache, em);
+   }
+}
+EXPORT_SYMBOL(free_extent_map);
+
+struct extent_state *alloc_extent_state(gfp_t mask)
+{
+   struct extent_state *state;
+   state = kmem_cache_alloc(extent_state_cache, mask);
+   if (!state || IS_ERR(state))
+   return state;
+   state->state = 0;
+   state->in_tree = 0;
+   atomic_set(&state->refs, 1);
+   init_waitqueue_head(&state->wq);
+   return state;
+}
+EXPORT_SYMBOL(alloc_extent_state);
+
+void free_extent_state(struct extent_state *state)
+{
+   if (atomic_dec_and_test(&state->refs)) {
+   WARN_ON(state->in_tree);
+   kmem_cache_free(extent_state_cache, state);
+   }
+}
+EXPORT_SYMBOL(free_extent_state);
+
+static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
+  struct rb_node *node)
+{
+   struct rb_node ** p = &root->rb_node;
+   struct rb_node * parent = NULL;
+   struct tree_entry *entry;
+
+   while(*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct tree_entry, rb_node);
+
+   if (offset < entry->end)
+   p = &(*p)->rb_left;
+   else if (offset > entry->end)
+   p = &(*p)->rb_right;
+   else
+   return parent;
+   }
+
+   entry = rb_entry(node, struct tree_entry, rb_node);
+   entry->in_tree = 1;
+   rb_link_node(node, parent, p);
+   rb_insert_color(node, root);
+   return NULL;
+}
+
+static struct rb_node *__tree_search(struct rb_root *root, u64 offset,
+  struct rb_node **prev_ret)
+{
+   struct rb_node * n = root->rb_node;
+   struct rb_node *prev = NULL;
+   struct tree_entry *entry;
+   struct tree_entry *prev_entry = NULL;
+
+   while(n) {
+   entry = rb_entry(n, struct tree_entry, rb_node);
+   prev = n;
+   prev_entry = entry;
+
+   if (offset  

[PATCH RFC] ext2 extentmap support

2007-07-24 Thread Chris Mason
mount -o extentmap to use the new stuff

diff -r 126111346f94 -r 53cabea328f7 fs/ext2/ext2.h
--- a/fs/ext2/ext2.h   Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/ext2.h   Tue Jul 24 15:40:27 2007 -0400
@@ -1,5 +1,6 @@
 #include <linux/fs.h>
 #include <linux/ext2_fs.h>
+#include <linux/extent_map.h>
 
 /*
  * ext2 mount options
@@ -65,6 +66,7 @@ struct ext2_inode_info {
struct posix_acl *i_default_acl;
 #endif
rwlock_t i_meta_lock;
+   struct extent_map_tree extent_tree;
struct inode vfs_inode;
 };
 
@@ -167,6 +169,7 @@ extern const struct address_space_operat
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
+extern const struct address_space_operations ext2_extent_map_aops;
 
 /* namei.c */
 extern const struct inode_operations ext2_dir_inode_operations;
diff -r 126111346f94 -r 53cabea328f7 fs/ext2/inode.c
--- a/fs/ext2/inode.c   Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/inode.c   Tue Jul 24 15:40:27 2007 -0400
@@ -625,6 +625,84 @@ changed:
goto reread;
 }
 
+/*
+ * simple get_extent implementation using get_block.  This assumes
+ * the get_block function can return something larger than a single block,
+ * but the ext2 implementation doesn't do so.  Just change b_size to
+ * something larger if get_block can return larger extents.
+ */
+struct extent_map *ext2_get_extent(struct inode *inode, struct page *page,
+  size_t page_offset, u64 start, u64 end,
+  int create)
+{
+   struct buffer_head bh;
+   sector_t iblock;
+   struct extent_map *em = NULL;
+   struct extent_map_tree *extent_tree = &EXT2_I(inode)->extent_tree;
+   int ret = 0;
+   u64 max_end = (u64)-1;
+   u64 found_len;
+   u64 bh_start;
+   u64 bh_end;
+
+   bh.b_size = inode->i_sb->s_blocksize;
+   bh.b_state = 0;
+again:
+   em = lookup_extent_mapping(extent_tree, start, end);
+   if (em) {
+   return em;
+   }
+
+   iblock = start >> inode->i_blkbits;
+   if (!buffer_mapped(&bh)) {
+   ret = ext2_get_block(inode, iblock, &bh, create);
+   if (ret)
+   goto out;
+   }
+
+   found_len = min((u64)(bh.b_size), max_end - start);
+   if (!em)
+   em = alloc_extent_map(GFP_NOFS);
+
+   bh_start = start;
+   bh_end = start + found_len - 1;
+   em->start = start;
+   em->end = bh_end;
+   em->bdev = inode->i_sb->s_bdev;
+
+   if (!buffer_mapped(&bh)) {
+   em->block_start = 0;
+   em->block_end = 0;
+   } else {
+   em->block_start = bh.b_blocknr << inode->i_blkbits;
+   em->block_end = em->block_start + found_len - 1;
+   }
+   ret = add_extent_mapping(extent_tree, em);
+   if (ret == -EEXIST) {
+   free_extent_map(em);
+   em = NULL;
+   max_end = end;
+   goto again;
+   }
+out:
+   if (ret) {
+   if (em)
+   free_extent_map(em);
+   return ERR_PTR(ret);
+   } else if (em && buffer_new(&bh)) {
+   set_extent_new(extent_tree, bh_start, bh_end, GFP_NOFS);
+   }
+   return em;
+}
+
+static int ext2_extent_map_writepage(struct page *page,
+struct writeback_control *wbc)
+{
+   struct extent_map_tree *tree;
+   tree = &EXT2_I(page->mapping->host)->extent_tree;
+   return extent_write_full_page(tree, page, ext2_get_extent, wbc);
+}
+
 static int ext2_writepage(struct page *page, struct writeback_control *wbc)
 {
return block_write_full_page(page, ext2_get_block, wbc);
@@ -633,6 +711,42 @@ static int ext2_readpage(struct file *fi
 static int ext2_readpage(struct file *file, struct page *page)
 {
return mpage_readpage(page, ext2_get_block);
+}
+
+static int ext2_extent_map_readpage(struct file *file, struct page *page)
+{
+   struct extent_map_tree *tree;
+   tree = &EXT2_I(page->mapping->host)->extent_tree;
+   return extent_read_full_page(tree, page, ext2_get_extent);
+}
+
+static int ext2_extent_map_releasepage(struct page *page,
+  gfp_t unused_gfp_flags)
+{
+   struct extent_map_tree *tree;
+   int ret;
+
+   if (page->private != 1)
+   return try_to_free_buffers(page);
+   tree = &EXT2_I(page->mapping->host)->extent_tree;
+   ret = try_release_extent_mapping(tree, page);
+   if (ret == 1) {
+   ClearPagePrivate(page);
+   set_page_private(page, 0);
+   page_cache_release(page);
+   }
+   return ret;
+}
+
+
+static void ext2_extent_map_invalidatepage(struct page *page,
+  unsigned long offset)
+{
+   struct extent_map_tree *tree;
+
+   tree = 

Re: [PATCH RFC] extent mapped page cache

2007-07-24 Thread Chris Mason
On Tue, 24 Jul 2007 23:25:43 +0200
Peter Zijlstra [EMAIL PROTECTED] wrote:

 On Tue, 2007-07-24 at 16:13 -0400, Trond Myklebust wrote:
  On Tue, 2007-07-24 at 16:00 -0400, Chris Mason wrote:
   On Tue, 10 Jul 2007 17:03:26 -0400
   Chris Mason [EMAIL PROTECTED] wrote:
   
This patch aims to demonstrate one way to replace buffer heads
with a few extent trees.  Buffer heads provide a few different
features:

1) Mapping of logical file offset to blocks on disk
2) Recording state (dirty, locked etc)
3) Providing a mechanism to access sub-page sized blocks.

This patch covers #1 and #2, I'll start on #3 a little later
next week.

   Well, almost.  I decided to try out an rbtree instead of the
   radix, which turned out to be much faster.  Even though
   individual operations are slower, the rbtree was able to do many
   fewer ops to accomplish the same thing, especially for merging
   extents together.  It also uses much less ram.
  
  The problem with an rbtree is that you can't use it together with
  RCU to do lockless lookups. You can probably modify it to allocate
  nodes dynamically (like the radix tree does) and thus make it
  RCU-compatible, but then you risk losing the two main benefits that
  you list above.

The tree is a critical part of the patch, but it is also the easiest to
rip out and replace.  Basically the code stores a range by inserting
an object at an index corresponding to the end of the range.

Then it does searches by looking forward from the start of the range.
More or less any tree that can search and return the first key >=
the requested key will work.
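As a sketch of that search (not the patch's rbtree code), a sorted array keyed by the end of each range gives the same "first key >= offset" lookup with a binary search.  The names byte_range, first_end_at_or_after and range_overlaps are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct byte_range {
	uint64_t start;  /* inclusive */
	uint64_t end;    /* inclusive; this is the search key */
};

/*
 * Return the index of the first range whose end is >= offset, or n if
 * there is none.  'r' must be sorted by end and non-overlapping.
 */
size_t first_end_at_or_after(const struct byte_range *r, size_t n,
			     uint64_t offset)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (r[mid].end < offset)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Does any stored range intersect [start, end]? */
int range_overlaps(const struct byte_range *r, size_t n,
		   uint64_t start, uint64_t end)
{
	size_t i;

	assert(start <= end);
	i = first_end_at_or_after(r, n, start);
	return i < n && r[i].start <= end;
}
```

Keying on the end means the first hit is the only candidate that can contain or follow the start of the requested range, so one lookup answers the overlap question.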

So, I'd be happy to rip out the tree and replace with something else.
Going completely lockless will be tricky, it's something that will need
deep thought once the rest of the interface is sane.

-chris


Re: [PATCH RFC] extent mapped page cache

2007-07-25 Thread Chris Mason
On Wed, 25 Jul 2007 04:32:17 +0200
Nick Piggin [EMAIL PROTECTED] wrote:

 On Tue, Jul 24, 2007 at 07:25:09PM -0400, Chris Mason wrote:
  On Tue, 24 Jul 2007 23:25:43 +0200
  Peter Zijlstra [EMAIL PROTECTED] wrote:
  
  The tree is a critical part of the patch, but it is also the
  easiest to rip out and replace.  Basically the code stores a range
  by inserting an object at an index corresponding to the end of the
  range.
  
  Then it does searches by looking forward from the start of the
  range. More or less any tree that can search and return the first
  key >= the requested key will work.
  
  So, I'd be happy to rip out the tree and replace with something
  else. Going completely lockless will be tricky, it's something that
  will need deep thought once the rest of the interface is sane.
 
 Just having the other tree and managing it is what makes me a little
 less positive of this approach, especially using it to store pagecache
 state when we already have the pagecache tree.
 
 Having another tree to store block state I think is a good idea as I
 said in the fsblock thread with Dave, but I haven't clicked as to why
 it is a big advantage to use it to manage pagecache state. (and I can
 see some possible disadvantages in locking and tree manipulation
 overhead).

Yes, there are definitely costs with the state tree, it will take some
careful benchmarking to convince me it is a feasible solution. But,
storing all the state in the pages themselves is impossible unless the
block size equals the page size. So, we end up with something like
fsblock/buffer heads or the state tree.

One advantage to the state tree is that it separates the state from
the memory being described, allowing a simple kmap style interface
that covers subpages, highmem and superpages.

It also more naturally matches the way we want to do IO, making for
easy clustering.

O_DIRECT becomes a special case of readpages and writepages: the
memory used for IO just comes from userland instead of the page cache.

The ability to put in additional tracking info like the process that
first dirtied a range is also significant.  So, I think it is worth
trying.

-chris


Re: [PATCH RFC] extent mapped page cache

2007-07-25 Thread Chris Mason
On Thu, 26 Jul 2007 03:37:28 +0200
Nick Piggin [EMAIL PROTECTED] wrote:

  
  One advantage to the state tree is that it separates the state from
  the memory being described, allowing a simple kmap style interface
  that covers subpages, highmem and superpages.
 
 I suppose so, although we should have added those interfaces long
 ago ;) The variants in fsblock are pretty good, and you could always
 do an arbitrary extent (rather than block) based API using the
 pagecache tree if it would be helpful.

Yes, you could use fsblock for the state bits and make a separate API
to map the actual pages.

  
 
  It also more naturally matches the way we want to do IO, making for
  easy clustering.
 
 Well the pagecache tree is used to reasonable effect for that now.
 OK the code isn't beautiful ;). Granted, this might be an area where
 the separate state tree ends up being better. We'll see.
 

One thing it gains us is finding the start of the cluster.  Even if
called by kswapd, the state tree allows writepage to find the start of
the cluster and send down a big bio (provided I implement trylock to
avoid various deadlocks).
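Roughly what "finding the start of the cluster" means, as a user-space sketch over a sorted array of state records.  This is an assumption about the shape of the walk, not code from the patch; state_range, SKETCH_DIRTY and cluster_start are hypothetical names, and locking is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DIRTY 1u  /* illustrative bit, not the kernel's value */

struct state_range {
	uint64_t start;  /* inclusive */
	uint64_t end;    /* inclusive */
	unsigned int bits;
};

/*
 * Given the index of the record that contains the page being written,
 * walk backwards while the previous record is dirty and byte-contiguous.
 * Returns the file offset where the dirty cluster begins.
 * 'r' must be sorted by start and non-overlapping.
 */
uint64_t cluster_start(const struct state_range *r, size_t idx)
{
	assert(r[idx].bits & SKETCH_DIRTY);
	while (idx > 0 &&
	       (r[idx - 1].bits & SKETCH_DIRTY) &&
	       r[idx - 1].end + 1 == r[idx].start)
		idx--;
	return r[idx].start;
}
```

The walk steps over whole records, not pages, so a long dirty run that has already been merged costs a single step.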

  
  O_DIRECT becomes a special case of readpages and writepages: the
  memory used for IO just comes from userland instead of the page
  cache.
 
 Could be, although you'll probably also need to teach the mm about
 the state tree and/or still manipulate the pagecache tree to prevent
 concurrency?

Well, it isn't coded yet, but I should be able to do it from the FS
specific ops.

 
 But isn't the main aim of O_DIRECT to do as little locking and
 synchronisation with the pagecache as possible? I thought this is
 why your race fixing patches got put on the back burner (although
 they did look fairly nice from a correctness POV).

I put the placeholder patches on hold because of a corner case where
userland did O_DIRECT from a mmap'd region of the same file (Linus
pointed it out to me).  Basically my patches had to work in 64k chunks
to avoid a deadlock in get_user_pages.  With the state tree, I can
allow the page to be faulted in but still properly deal with it.

 
 Well I'm kind of handwaving when it comes to O_DIRECT ;) It does look
 like this might be another advantage of the state tree (although you
 aren't allowed to slow down buffered IO to achieve the locking ;)).

;) The O_DIRECT benefit is a fringe thing.  I've long wanted to help
clean up that code, but the real point of the patch is to make general
usage faster and less complex.  If I can't get there, the O_DIRECT
stuff doesn't matter.
 
  
  The ability to put in additional tracking info like the process that
  first dirtied a range is also significant.  So, I think it is worth
  trying.
 
 Definitely, and I'm glad you are. You haven't converted me yet, but
 I look forward to finding the best ideas from our two approaches when
 the patches are further along (ext2 port of fsblock coming along, so
 we'll be able to have races soon :P).

I'm sure we can find some river in Cambridge, winner gets to throw
Axboe in.

-chris


Re: [PATCH RFC] extent mapped page cache

2007-07-26 Thread Chris Mason
On Thu, 26 Jul 2007 04:36:39 +0200
Nick Piggin [EMAIL PROTECTED] wrote:

[ are state trees a good idea? ]

  One thing it gains us is finding the start of the cluster.  Even if
  called by kswapd, the state tree allows writepage to find the start
  of the cluster and send down a big bio (provided I implement
  trylock to avoid various deadlocks).
 
 That's very true, we could potentially also do that with the block
 extent tree that I want to try with fsblock.

If fsblock records an extent of 200MB, and writepage is called on a
page in the middle of the extent, how do you walk the radix backwards
to find the first dirty & up to date page in the range?

 
 I'm looking at cleaning up some of these aops APIs so hopefully
 most of the deadlock problems go away. Should be useful to both our
 efforts. Will post patches hopefully when I get time to finish the
 draft this weekend.

Great

 
 
O_DIRECT becomes a special case of readpages and
writepages: the memory used for IO just comes from userland
instead of the page cache.
   
   Could be, although you'll probably also need to teach the mm about
   the state tree and/or still manipulate the pagecache tree to
   prevent concurrency?
  
  Well, it isn't coded yet, but I should be able to do it from the FS
  specific ops.
 
 Probably, if you invalidate all the pagecache in the range beforehand
 you should be able to do it (and I guess you want to do the invalidate
 anyway). Although, below deadlock issues might still bite somewhere...

Well, O_DIRECT is French for deadlocks.  But I shouldn't have to worry
so much about evicting the pages themselves since I can tag the range.

 
 
   But isn't the main aim of O_DIRECT to do as little locking and
   synchronisation with the pagecache as possible? I thought this is
   why your race fixing patches got put on the back burner (although
   they did look fairly nice from a correctness POV).
  
  I put the placeholder patches on hold because of a corner case
  where userland did O_DIRECT from a mmap'd region of the same file
  (Linus pointed it out to me).  Basically my patches had to work in
  64k chunks to avoid a deadlock in get_user_pages.  With the state
  tree, I can allow the page to be faulted in but still properly deal
  with it.
 
 Oh right, I didn't think of that one. Would you still have similar
 issues with the external state tree? I mean, the filesystem doesn't
 really know why the fault is taken. O_DIRECT read from a file into
 mmapped memory of the same block in the file is almost hopeless I
 think.

Racing is fine as long as we don't deadlock or expose garbage from disk.

The ability to put in additional tracking info like the process
that first dirtied a range is also significant.  So, I think it
is worth trying.
   
   Definitely, and I'm glad you are. You haven't converted me yet,
   but I look forward to finding the best ideas from our two
   approaches when the patches are further along (ext2 port of
   fsblock coming along, so we'll be able to have races soon :P).
  
  I'm sure we can find some river in Cambridge, winner gets to throw
  Axboe in.
 
 Very noble of you to donate your colleague to such a worthy cause.

Jens is always interested in helping solve such debates.  It's a
fantastic service he provides to the community.

-chris


[ANNOUNCE] seekwatcher v0.3 IO graphing and animation

2007-07-27 Thread Chris Mason

Hello everyone,

I've tossed out seekwatcher v0.3.  The major changes are using rolling
averages to smooth out the seek and throughput graphs, and it can
generate mpgs of the IO done by a given trace.
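For anyone curious what a rolling average does to a noisy series, here is a small sketch.  It assumes a simple trailing window; the smoothing seekwatcher actually applies may differ, and rolling_average is a name made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Trailing rolling average: out[i] is the mean of the last 'window'
 * samples ending at i (fewer at the start of the series).
 */
void rolling_average(const double *in, double *out, size_t n, size_t window)
{
	double sum = 0.0;

	assert(window > 0);
	for (size_t i = 0; i < n; i++) {
		sum += in[i];
		if (i >= window)
			sum -= in[i - window];  /* drop the sample leaving the window */
		out[i] = sum / (double)(i + 1 < window ? i + 1 : window);
	}
}
```

A wider window gives a smoother line at the cost of hiding short bursts.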

Here's a sample of the smoother graphs (creating 20 kernel trees):

http://oss.oracle.com/~mason/seekwatcher/ext3_vs_btrfs_vs_xfs.png

There are details and sample movies of the kernel tree run at:

http://oss.oracle.com/~mason/seekwatcher

-chris


Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-22 Thread Chris Mason
On Wed, 22 Aug 2007 09:18:41 +0800
Fengguang Wu [EMAIL PROTECTED] wrote:

 On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
  On Sun, 12 Aug 2007 17:11:20 +0800
  Fengguang Wu [EMAIL PROTECTED] wrote:
  
   Andrew and Ken,
   
   Here are some more experiments on the writeback stuff.
   Comments are highly welcome~ 
  
  I've been doing benchmarks lately to try and trigger fragmentation,
  and one of them is a simulation of make -j N.  It takes a list of
  all the .o files in the kernel tree, randomly sorts them and then
  creates bogus files with the same names and sizes in clean kernel
  trees.
  
  This is basically creating a whole bunch of files in random order
  in a whole bunch of subdirectories.
  
  The results aren't pretty:
  
  http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png
  
  The top graph shows one dot for each write over time.  It shows that
  ext3 is basically writing all over the place the whole time.  But,
  ext3 actually wins the read phase, so the layout isn't horrible.
  My guess is that if we introduce some write clustering by sending a
  group of inodes down at the same time, it'll go much much better.
  
  Andrew has mentioned bringing a few radix trees into the writeback
  paths before, it seems like file servers and other general uses
  will benefit from better clustering here.
  
  I'm hoping to talk you into trying it out ;)
 
 Thank you for the description of problem. So far I have a similar one
 in mind: if we are to delay writeback of atime-dirty-only inodes to
 above 1 hour, some grouping/piggy-backing scenario would be
 beneficial.  (Which I guess does not deserve the complexity now that
 we have Ingo's make-reltime-default patch.)

Good clustering would definitely help some delayed atime writeback
scheme.

 
 My vague idea is to
 - keep the s_io/s_more_io as a FIFO/cyclic writeback dispatching
 queue.
 - convert s_dirty to some radix-tree/rbtree based data structure.
   It would have dual functions: delayed-writeback and
 clustered-writeback. 
 clustered-writeback:
 - Use inode number as clue of locality, hence the key for the sorted
   tree.
 - Drain some more s_dirty inodes into s_io on every kupdate wakeup,
   but do it in the ascending order of inode number instead of
   ->dirtied_when. 
 
 delayed-writeback:
 - Make sure that a full scan of the s_dirty tree takes <=30s, i.e.
   dirty_expire_interval.

I think we should assume a full scan of s_dirty is impossible in the
presence of concurrent writers.  We want to be able to pick a start
time (right now) and find all the inodes older than that start time.
New things will come in while we're scanning.  But perhaps that's what
you're saying...

At any rate, we've got two types of lists now.  One keeps track of age
and the other two keep track of what is currently being written.  I
would try two things:

1) s_dirty stays a list for FIFO.  s_io becomes a radix tree that
indexes by inode number (or some arbitrary field the FS can set in the
inode).  Radix tree tags are used to indicate which things in s_io are
already in progress or are pending (hand waving because I'm not sure
exactly).

inodes are pulled off s_dirty and the corresponding slot in s_io is
tagged to indicate IO has started.  Any nearby inodes in s_io are also
sent down.
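A hand-waving sketch of the "send nearby inodes too" step, with a sorted array standing in for the radix tree and a fixed window standing in for whatever closeness rule we end up with.  nearby_inodes and the window parameter are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * 'ino' is a sorted array of dirty inode numbers (standing in for a
 * radix tree indexed by inode number).  Given the inode that just came
 * off the FIFO, report the half-open index span [*first, *last) of
 * entries within 'window' inode numbers of it.
 */
void nearby_inodes(const uint64_t *ino, size_t n, uint64_t target,
		   uint64_t window, size_t *first, size_t *last)
{
	uint64_t lo = target > window ? target - window : 0;
	uint64_t hi = target <= UINT64_MAX - window ? target + window
						      : UINT64_MAX;
	size_t i = 0;

	assert(first && last);
	while (i < n && ino[i] < lo)
		i++;
	*first = i;
	while (i < n && ino[i] <= hi)
		i++;
	*last = i;
}
```

The FIFO still decides when writeback of a neighbourhood starts; the index only decides which other inodes ride along.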

2) s_dirty and s_io both become radix trees.  s_dirty is indexed by a
sequence number that corresponds to age.  It is treated as a big
circular indexed list that can wrap around over time.  Radix tree tags
are used both on s_dirty and s_io to flag which inodes are in progress.
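For the wrapping sequence number, the comparison has to survive the counter rolling over.  A minimal sketch, assuming a 32-bit counter; seq_before and next_seq are hypothetical helpers, built on the same trick as the kernel's time_before() on jiffies.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Wraparound-safe "a is older than b" for 32-bit sequence numbers.
 * Valid while the two values are less than 2^31 apart.
 */
int seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Hand out the next age stamp; the counter simply wraps at 2^32. */
uint32_t next_seq(uint32_t *counter)
{
	assert(counter);
	return (*counter)++;
}
```

With this, "find everything older than the start of the scan" keeps working after the counter wraps, as long as no inode stays dirty for more than 2^31 stamps.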

 
 Notes:
 (1) I'm not sure inode number is correlated to disk location in
 filesystems other than ext2/3/4. Or parent dir?

In general, it is a better assumption than sorting by time.  It may
make sense to one day let the FS provide a clustering hint
(corresponding to the first block in the file?), but for starters it
makes sense to just go with the inode number.

 (2) It duplicates some function of elevators. Why is it necessary?
 Maybe we have no clue on the exact data location at this time?

The elevator can only sort the pending IO, and we send down a
relatively small window of all the dirty pages at a time.  If we sent
down all the dirty pages and let the elevator sort it out, we wouldn't
need this clustering at all.

But, that has other issues ;)

-chris




Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-23 Thread Chris Mason
On Thu, 23 Aug 2007 12:47:23 +1000
David Chinner [EMAIL PROTECTED] wrote:

 On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
  I think we should assume a full scan of s_dirty is impossible in the
  presence of concurrent writers.  We want to be able to pick a start
  time (right now) and find all the inodes older than that start time.
  New things will come in while we're scanning.  But perhaps that's
  what you're saying...
  
  At any rate, we've got two types of lists now.  One keeps track of
  age and the other two keep track of what is currently being
  written.  I would try two things:
  
  1) s_dirty stays a list for FIFO.  s_io becomes a radix tree that
  indexes by inode number (or some arbitrary field the FS can set in
  the inode).  Radix tree tags are used to indicate which things in
  s_io are already in progress or are pending (hand waving because
  I'm not sure exactly).
  
  inodes are pulled off s_dirty and the corresponding slot in s_io is
  tagged to indicate IO has started.  Any nearby inodes in s_io are
  also sent down.
 
 the problem with this approach is that it only looks at inode
 locality. Data locality is ignored completely here and the data for
 all the inodes that are close together could be splattered all over
 the drive. In that case, clustering by inode location is exactly the
 wrong thing to do.

Usually it won't be more wrong than clustering by time.

 
 For example, XFS changes allocation strategy at 1TB for 32bit inode
 filesystems which makes the data get placed way away from the inodes.
 i.e. inodes in AGs below 1TB, all data in AGs > 1TB. clustering
 by inode number for data writeback is mostly useless in the >1TB
 case.

I agree we'll want a way to let the FS provide the clustering key.  But
for the first cut on the patch, I would suggest keeping it simple.
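
One way to picture a filesystem-provided clustering key is an optional callback that falls back to the inode number. This is only a sketch under that assumption: the struct, the callback type and the function names below are hypothetical and do not come from any patch in this thread.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_inode {
    unsigned long ino;        /* inode number */
    unsigned long first_blk;  /* first data block, if the FS knows it */
};

/* A filesystem may supply its own key; NULL means "use the inode number". */
typedef unsigned long (*cluster_key_fn)(const struct sketch_inode *);

static unsigned long cluster_key(const struct sketch_inode *inode,
                                 cluster_key_fn fs_key)
{
    return fs_key ? fs_key(inode) : inode->ino;
}

/* Example FS-provided key: cluster by data location instead. */
unsigned long key_by_first_block(const struct sketch_inode *inode)
{
    return inode->first_blk;
}

/* Count inodes (including inodes[0]) whose key is within `range` of inodes[0]. */
size_t count_nearby(const struct sketch_inode *inodes, size_t n,
                    unsigned long range, cluster_key_fn fs_key)
{
    size_t hits = 0;
    unsigned long base;

    assert(n > 0);
    base = cluster_key(&inodes[0], fs_key);
    for (size_t i = 0; i < n; i++) {
        unsigned long k = cluster_key(&inodes[i], fs_key);
        unsigned long d = k > base ? k - base : base - k;

        if (d <= range)
            hits++;
    }
    return hits;
}
```

The same set of dirty inodes can cluster differently depending on the key, which is the XFS concern raised above: inodes that are neighbours by number may have data far apart.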

 
 The inode32 for <1TB and inode64 allocators both try to keep data
 close to the inode (i.e. in the same AG) so clustering by inode number
 might work better here.
 
 Also, it might be worthwhile allowing the filesystem to supply a
 hint or mask for closeness for inode clustering. This would help
 the generic code only try to cluster inode writes to inodes that
 fall into the same cluster as the first inode.

Yes, also a good idea after things are working.

 
   Notes:
   (1) I'm not sure inode number is correlated to disk location in
   filesystems other than ext2/3/4. Or parent dir?
  
  In general, it is a better assumption than sorting by time.  It may
  make sense to one day let the FS provide a clustering hint
  (corresponding to the first block in the file?), but for starters it
  makes sense to just go with the inode number.
 
 Perhaps multiple hints are needed - one for data locality and one
 for inode cluster locality.

So, my feature creep idea would have been more data clustering.  I'm
mainly trying to solve this graph:

http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png

Where background writing of the block device inode is making ext3 do
seeky writes while creating directory trees.  My simple idea was to kick
off a 'I've just written block X' call back to the FS, where it may
decide to send down dirty chunks of the block device inode that also
happen to be dirty.

But, maintaining the kupdate max dirty time and congestion limits in
the face of all this clustering gets tricky.  So, I wasn't going to
suggest it until the basic machinery was working.

Fengguang, this isn't a small project ;)  But, lots of people will be
interested in the results.

-chris


Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-24 Thread Chris Mason
On Fri, 24 Aug 2007 21:24:58 +0800
Fengguang Wu [EMAIL PROTECTED] wrote:
 
  2) s_dirty and s_io both become radix trees.  s_dirty is indexed by
  a sequence number that corresponds to age.  It is treated as a big
  circular indexed list that can wrap around over time.  Radix tree
  tags are used both on s_dirty and s_io to flag which inodes are in
  progress.
 
 It's meaningless to convert s_io to radix tree. Because inodes on s_io
 will normally be sent to block layer elevators at the same time.

Not entirely, using a radix tree instead lets you tag things instead of
doing the current backflips across three lists.

 
 Also s_dirty holds 30 seconds of inodes, while s_io only 5 seconds.
 The more inodes, the more chances of good clustering. That's the
 general rule.
 
 s_dirty is the right place to do address-clustering.
 As for the dirty_expire_interval parameter on dirty age,
 we can apply a simple rule: do one full scan/sweep over the
 fs-address-space in every 30s, syncing all inodes encountered,
 and sparing those newly dirtied in less than 5s. With that rule,
 any inode will get synced after being dirtied for 5-35 seconds.

This gives you an O(inodes dirty) behavior instead of the current O(old
inodes).  It might not matter, but walking the radix tree is more
expensive than walking a list.
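
The difference in work can be sketched with two toy walks over inode ages in seconds. This is an illustration only: the array representation, the function names and the thresholds passed in are assumptions, not the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>

/*
 * FIFO walk: ages[] is sorted oldest first.  Stop at the first inode that
 * has not yet expired, so the cost is O(old inodes).
 */
size_t fifo_walk(const unsigned *ages, size_t n, unsigned expire,
                 size_t *synced)
{
    size_t visited = 0;

    assert(synced != NULL);
    *synced = 0;
    for (size_t i = 0; i < n; i++) {
        visited++;
        if (ages[i] < expire)
            break;          /* everything after this is younger */
        (*synced)++;
    }
    return visited;
}

/*
 * Full sweep: ages[] may be in any (address) order.  Every dirty inode is
 * visited and only those dirtied less than min_age seconds ago are spared,
 * so the cost is O(dirty inodes).
 */
size_t full_sweep(const unsigned *ages, size_t n, unsigned min_age,
                  size_t *synced)
{
    size_t visited = 0;

    assert(synced != NULL);
    *synced = 0;
    for (size_t i = 0; i < n; i++) {
        visited++;
        if (ages[i] >= min_age)
            (*synced)++;
    }
    return visited;
}
```

For eight dirty inodes aged 40, 33, 31, 12, 7, 2, 1 and 0 seconds, the FIFO walk with a 30 second expiry touches four inodes and syncs three; the sweep with a 5 second floor touches all eight and syncs five.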

But, I look forward to your patches, we can tune from there.

-chris


Re: [PATCH 0/3] [RFC][PATCH] clustered writeback

2007-08-27 Thread Chris Mason
On Mon, 27 Aug 2007 05:03:36 -0700
Arjan van de Ven [EMAIL PROTECTED] wrote:

 On Mon, 27 Aug 2007 19:21:52 +0800
  
  Because it does the work in small batches of 10 inodes, when the
  system has <=10 dirty inodes, its behavior will reduce to:
  - do a full sweep *at once* on every 25s
  Which means the disk will flicker once every 25s, not bad :)
 
 25 seconds is already not good, though. It takes a disk a
 second or two of no activity to go into low power mode, so every 25
 seconds means you now have at least a 10% constant power cost.
 
 I don't know the right answer (well other than make sure inodes
 aren't dirty, which involves fixing apps to not do as much file
 operations, as well as relatime) but just "every 25s is no big deal"
 isn't really the case ;-(

But fixing this isn't the job of this patch. It needs something like
the laptop mode logic where it says "oh, the disk is awake, let's send
stuff down."

kupdate hitting on the disk isn't really a new problem, I'd rather
address it with a different patch series.

-chris


Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-28 Thread Chris Mason
On Wed, 29 Aug 2007 00:55:30 +1000
David Chinner [EMAIL PROTECTED] wrote:

 On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
  On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
   On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
Notes:
(1) I'm not sure inode number is correlated to disk location in
filesystems other than ext2/3/4. Or parent dir?
   
   They correspond to the exact location on disk on XFS. But, XFS has
   its own inode clustering (see xfs_iflush) and it can't be moved
   up into the generic layers because of locking and integration into
   the transaction subsystem.
  
(2) It duplicates some function of elevators. Why is it
necessary?
   
   The elevators have no clue as to how the filesystem might treat
   adjacent inodes. In XFS, inode clustering is a fundamental
   feature of the inode reading and writing and that is something no
   elevator can hope to achieve.
   
  Thank you. That explains the linear write curve(perfect!) in Chris'
  graph.
  
  I wonder if XFS can benefit any more from the general writeback
  clustering. How large would be a typical XFS cluster?
 
 Depends on inode size. typically they are 8k in size, so anything
 from 4-32 inodes. The inode writeback clustering is pretty tightly
 integrated into the transaction subsystem and has some intricate
 locking, so it's not likely to be easy (or perhaps even possible) to
 make it more generic.

When I talked to hch about this, he said the order file data pages got
written in XFS was still dictated by the order the higher layers sent
things down.  Shouldn't the clustering still help to have delalloc done
in inode order instead of in whatever random order pdflush sends things
down now?

-chris


Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-28 Thread Chris Mason
On Wed, 29 Aug 2007 02:33:08 +1000
David Chinner [EMAIL PROTECTED] wrote:

 On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote:

I wonder if XFS can benefit any more from the general writeback
clustering. How large would be a typical XFS cluster?
   
   Depends on inode size. typically they are 8k in size, so anything
   from 4-32 inodes. The inode writeback clustering is pretty tightly
   integrated into the transaction subsystem and has some intricate
   locking, so it's not likely to be easy (or perhaps even possible)
   to make it more generic.
  
  When I talked to hch about this, he said the order file data pages
  got written in XFS was still dictated by the order the higher
  layers sent things down.
 
 Sure, that's file data. I was talking about the inode writeback, not
 the data writeback.

I think we're trying to gain different things from inode based
clustering...I'm not worried that the inode be next to the data.  I'm
going under the assumption that most of the time, the FS will try to
allocate inodes in groups in a directory, and so most of the time the
data blocks for inode N will be close to inode N+1.

So what I'm really trying for here is data block clustering when
writing multiple inodes at once.  This matters most when files are
relatively small and written in groups, which is a common workload.

It may make the most sense to change the patch to supply some key for
the data block clustering instead of the inode number, but it's an easy
first pass.

-chris


Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-21 Thread Chris Mason
On Sun, 12 Aug 2007 17:11:20 +0800
Fengguang Wu [EMAIL PROTECTED] wrote:

 Andrew and Ken,
 
 Here are some more experiments on the writeback stuff.
 Comments are highly welcome~ 

I've been doing benchmarks lately to try and trigger fragmentation, and
one of them is a simulation of make -j N.  It takes a list of all
the .o files in the kernel tree, randomly sorts them and then
creates bogus files with the same names and sizes in clean kernel trees.

This is basically creating a whole bunch of files in random order in a
whole bunch of subdirectories.
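
The random ordering step of such a simulation can be sketched as a Fisher-Yates shuffle over (name, size) pairs. This is not the benchmark's actual code: the struct, the function names and the small seeded generator are assumptions chosen so the example is repeatable.

```c
#include <assert.h>
#include <stddef.h>

struct bogus_file {
    const char *name;  /* e.g. the path of a .o file */
    size_t size;       /* size the bogus copy should have */
};

/* Small linear congruential generator, used so a run can be repeated. */
static unsigned long lcg_next(unsigned long *state)
{
    *state = (*state * 1664525UL + 1013904223UL) & 0xffffffffUL;
    return *state;
}

/* Fisher-Yates shuffle: put the creation order of files[] in random order. */
void shuffle_files(struct bogus_file *files, size_t n, unsigned long seed)
{
    unsigned long state = seed;

    assert(files != NULL || n == 0);
    for (size_t i = n; i > 1; i--) {
        /* Use the higher bits; pick j with 0 <= j < i (slight modulo bias). */
        size_t j = (size_t)((lcg_next(&state) >> 16) % i);
        struct bogus_file tmp = files[i - 1];

        files[i - 1] = files[j];
        files[j] = tmp;
    }
}
```

Creating the files in the shuffled order is what spreads consecutive writes across many subdirectories.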

The results aren't pretty:

http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png

The top graph shows one dot for each write over time.  It shows that
ext3 is basically writing all over the place the whole time.  But, ext3
actually wins the read phase, so the layout isn't horrible.  My guess
is that if we introduce some write clustering by sending a group of
inodes down at the same time, it'll go much much better.

Andrew has mentioned bringing a few radix trees into the writeback paths
before, it seems like file servers and other general uses will benefit
from better clustering here.

I'm hoping to talk you into trying it out ;)

-chris


Re: [PATCH 00/23] per device dirty throttling -v8

2007-08-06 Thread Chris Mason
On Sun, 5 Aug 2007 11:00:29 -0400
Theodore Tso [EMAIL PROTECTED] wrote:

 On Sun, Aug 05, 2007 at 02:26:53AM +0200, Andi Kleen wrote:
  I always thought the right solution would be to just sync atime only
  very very lazily. This means if an inode is only dirty because of an
  atime update put it on an "only write out when there is nothing to do
  or the memory is really needed" list.
 
 As I've mentioned earlier, the memory balancing issues that arise when
 we add an atime dirty bit scare me a little.  It can be addressed,
 obviously, but at the cost of more code complexity.

ext3 and reiser both use a dirty_inode method to make sure that we
don't actually have dirty inodes.  This way, kswapd doesn't get stuck
on the log and is able to do real work.

It would be interesting to see a comparison of relatime with a kinoded
that is willing to get stuck on the log.  The FS would need a few
tweaks so that write_inode() could know if it really needed to log or
not, but for testing you could just drop ext3_dirty_inode and have
ext3_write_inode do real work.

Then just change kswapd to kick a new kinoded and benchmark away.  A
real patch would have to look for places where mark_inode_dirty was
used and expected the dirty_inode callback to log things right away,
but for testing it's good enough.

-chris


More Large blocksize benchmarks

2007-10-15 Thread Chris Mason
Hello everyone,

I'm stealing the cc list and reviving an old thread because I've
finally got some numbers to go along with the Btrfs variable blocksize
feature.  The basic idea is to create a read/write interface to
map a range of bytes on the address space, and use it in Btrfs for all
metadata operations (file operations have always been extent based).

So, instead of casting buffer_head->b_data to some structure, I read and
write at offsets in a struct extent_buffer.  The extent buffer is very
small and backed by an address space, and I get large block sizes the
same way file_write gets to write to 16k at a time, by finding the
appropriate page in the address space.  This is an oversimplification
since I try to cache these mapping decisions to avoid using too much
CPU, but hopefully you get the idea.
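
A minimal sketch of the idea, assuming a block is backed by an array of fixed-size pages: reads and writes take a byte offset and copy page by page, so a metadata block larger than one page never needs to be contiguous in memory. The struct layout, the sketch_ names and the 4096 byte page size are assumptions for illustration, not the Btrfs definitions, and the mapping cache mentioned above is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

struct sketch_extent_buffer {
    unsigned char **pages;  /* backing pages, SKETCH_PAGE_SIZE bytes each */
    size_t len;             /* logical block size; may span several pages */
};

/* Copy len bytes starting at byte offset start out of the buffer. */
void sketch_read_eb(const struct sketch_extent_buffer *eb, void *dst,
                    size_t start, size_t len)
{
    unsigned char *out = dst;

    assert(start <= eb->len && len <= eb->len - start);
    while (len > 0) {
        size_t page = start / SKETCH_PAGE_SIZE;
        size_t off = start % SKETCH_PAGE_SIZE;
        size_t chunk = SKETCH_PAGE_SIZE - off;  /* bytes left in this page */

        if (chunk > len)
            chunk = len;
        memcpy(out, eb->pages[page] + off, chunk);
        out += chunk;
        start += chunk;
        len -= chunk;
    }
}

/* Copy len bytes from src into the buffer at byte offset start. */
void sketch_write_eb(struct sketch_extent_buffer *eb, const void *src,
                     size_t start, size_t len)
{
    const unsigned char *in = src;

    assert(start <= eb->len && len <= eb->len - start);
    while (len > 0) {
        size_t page = start / SKETCH_PAGE_SIZE;
        size_t off = start % SKETCH_PAGE_SIZE;
        size_t chunk = SKETCH_PAGE_SIZE - off;

        if (chunk > len)
            chunk = len;
        memcpy(eb->pages[page] + off, in, chunk);
        in += chunk;
        start += chunk;
        len -= chunk;
    }
}
```

Callers never see a pointer into the block, only offsets, which is what keeps a change like this inside the filesystem.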

The advantage to this approach is the changes are all inside Btrfs.  No
extra kernel patches were required.

Dave reported that XFS saw much higher write throughput with large
blocksizes, but so far I'm seeing the most benefits during reads.

The next step is a bunch more benchmarks.  I've done the first round
and posted it here:

http://oss.oracle.com/~mason/blocksizes/

The Btrfs code makes it relatively easy to experiment, and so this may
be a good step toward figuring out if some automagic solution is worth
it in general.  I can even use different sizes for nodes and leaves,
although I haven't done much testing at all there yet.

-chris



Re: More Large blocksize benchmarks

2007-10-16 Thread Chris Mason
On Tue, 2007-10-16 at 12:36 +1000, David Chinner wrote:
 On Mon, Oct 15, 2007 at 08:22:31PM -0400, Chris Mason wrote:
  Hello everyone,
  
  I'm stealing the cc list and reviving an old thread because I've
  finally got some numbers to go along with the Btrfs variable blocksize
  feature.  The basic idea is to create a read/write interface to
  map a range of bytes on the address space, and use it in Btrfs for all
  metadata operations (file operations have always been extent based).
  
  So, instead of casting buffer_head->b_data to some structure, I read and
  write at offsets in a struct extent_buffer.  The extent buffer is very
  small and backed by an address space, and I get large block sizes the
  same way file_write gets to write to 16k at a time, by finding the
  appropriate page in the address space.  This is an oversimplification
  since I try to cache these mapping decisions to avoid using too much
  CPU, but hopefully you get the idea.
  
  The advantage to this approach is the changes are all inside Btrfs.  No
  extra kernel patches were required.
  
  Dave reported that XFS saw much higher write throughput with large
  blocksizes, but so far I'm seeing the most benefits during reads.
 
 Apples to oranges, Chris ;)
 

Grin, if the two were the same, there'd be no reason to write a new one.
I didn't expect faster writes on btrfs, at least not for workloads that
did not require reads.  The basic idea is to show there are a variety of
ways the larger blocks can improve (and hurt) performance.

Also, vmap isn't the only implementation path.  It's true the Btrfs
changes for this were huge, but a big chunk of the changes were for
different leaf/node blocksizes, something that may never get used in
practice.

-chris




Re: New SCM and commit list

2005-04-11 Thread Chris Mason
On Monday 11 April 2005 03:38, Ingo Molnar wrote:
 * Linus Torvalds [EMAIL PROTECTED] wrote:
  So anything that got modified in just one tree obviously merges to
  that version. Any file that got modified in two trees will end up just
  being passed to the merge program. See man merge and man diff3.
  The merger gets to fix up any conflicts by hand.

 at that point Chris Mason's rej tool is pretty nifty:

   ftp://ftp.suse.com/pub/people/mason/rej/rej-0.13.tar.gz

 (There is no fully automatic mode in where it would not bother the user
 with the really trivial rejects - but it has an automatic mode where you
 basically have to do nothing - maybe a fully automatic one could be
 added that would resolve low-risk rejects?)


rej -M skips the merge program, so rej -a -M will give you something like 
this:

coffee:/local/linux.p # rej -a -M drivers/ide/ide.c.rej
drivers/ide/ide.c: 1 matched, 0 conflicts remain

But I would want to go over the bit that calculates the conflicts remaining 
more carefully if people plan on trusting this ;)  It'll run on unified diffs 
too, although it will be slower than patch, since the assumption is that the
quick and easy placement patch does has already failed.  (that's easy enough to fix 
though).

 it's really easy to use (but then again i'm a vim user, so i'm biased),
 just try it on a random .rej file you have (rej -a kernel/sched.c.rej
 or whatever).

you can rej -m kdiff3|meld|tkdiff or any program that does a side by side 
comparison of two files. (export REJMERGE=foo sets the diff prog as well)

I use rej frequently to merge patches in here, but that is mostly because 
there is no easy way to get the common ancestor and parent revision of the 
patches I'm merging.

With that info in hand, kdiff3 is pretty nice.  You would have to spoon feed 
it the renames, but it should have most of the other features you're looking 
for, including the 'no gui if all conflicts are auto-solvable'

-chris


Re: New SCM and commit list

2005-04-11 Thread Chris Mason
On Monday 11 April 2005 08:51, Chris Mason wrote:

 rej -M skips the merge program, so rej -a -M will give you something like
 this:

 coffee:/local/linux.p # rej -a -M drivers/ide/ide.c.rej
 drivers/ide/ide.c: 1 matched, 0 conflicts remain

 But I would want to go over the bit that calculates the conflicts remaining
 more carefully if people plan on trusting this ;) 

Ok,  looks like this should be safe.  I changed -q to skip the gui compare 
when rej thinks it has resolved all the conflicts correctly.  With rej 0.14 
(just uploaded now) this should do what you want:

rej -q -a foo.rej 

Download site is here: ftp://ftp.suse.com/pub/people/mason/rej/

Please let me know if you find patches where rej is doing the wrong thing.

-chris


Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Chris Mason
On Tuesday 12 July 2005 20:27, Rob Mueller wrote:
   We're also applying the attached patch.  There's a bug in reiserfs that
   gets tickled by our huge MMAP usage (it's amazing what really busy
   Cyrus daemons can do to a server, ouch).  It's fixed in generic_write,
   so we take the few percent performance hit for something that doesn't
   break!
 
  Interesting - When I got the problem it was on mail servers under high
  load (handling 60,000 emails per hour) with reiserfs as file system. I
  have seen this problem on 5 different servers so I am confident that
  it is not hardware failure.
 
  Sometimes the server load just rises and then the server dies other
  times the load rises but the kernel manages to get it back alive
  filling up syslog with messages like this

 Sounds like a different issue. The patch Bron included before fixes (or at
 least reduces to the point where it fixes it for us) a problem where
 processes get stuck in D state and are unkillable. A reboot is required to
 remove them. Apparently this is a known bug in ReiserFS (see messages
 below). As noted, the same bug exists in ext3. There appears to have been
 some patches to try and fix it for both reiserfs and ext3, but I'm not sure
 if they're in the mainline kernel yet.

 http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/2056.html
 http://hulllug.principalhosting.net/archive/index.php/t-22774.html


There is a much less complex solution that I've just recently gotten working 
in the SUSE kernel.  If reiser3/ext3 don't log the inode during atime 
updates, the problem goes away.

You can solve this now by mounting with -o noatime (although that might not 
play well with cyrus, not sure).  My current patch works around this in ugly 
ways, what I plan on doing during OLS is finding out why ext3 is still 
logging the inode all the time.

For reiser3, this was to avoid kswapd having to log a bunch of inodes in 
response to memory pressure, but that was back in 2.4 when things were 
different.  We shouldn't need to do it anymore...

-chris


Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Chris Mason
On Tuesday 12 July 2005 20:42, Chris Mason wrote:

  Sounds like a different issue. The patch Bron included before fixes (or
  at least reduces to the point where it fixes it for us) a problem where
  processes get stuck in D state and are unkillable. A reboot is required
  to remove them. Apparently this is a known bug in ReiserFS (see messages
  below). As noted, the same bug exists in ext3. There appears to have been
  some patches to try and fix it for both reiserfs and ext3, but I'm not
  sure if they're in the mainline kernel yet.
 
  http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/2056.html
  http://hulllug.principalhosting.net/archive/index.php/t-22774.html

 There is a much less complex solution that I've just recently gotten
 working in the SUSE kernel.  If reiser3/ext3 don't log the inode during
 atime updates, the problem goes away.

The sysrq is huge, and I haven't yet found the person holding the transaction 
open.  But, here's another place that starts a transaction with the mmap sem 
held, and I would guess the transaction writer is waiting on something for 
that mmap sem.

atime updates alone won't fix this one

-chris

imapd D F3159530 0 32412   2292 32413 32411 (NOTLB)
e1cdfdfc 0082 0008 f3159530 0202  c1e5b0a0 c013b1a0
   0034 0202 c301b520 0001 4609 beca3f0b 5566 c301b520
   c315b530 f3159530 f3159654  0001 000e d9309dac 
Call Trace:
 [c013b1a0] free_hot_cold_page+0x20/0xd0
 [c01ac8dd] queue_log_writer+0x5d/0x80
 [c0114b10] default_wake_function+0x0/0x20
 [c01acb8a] do_journal_begin_r+0x1ca/0x2b0
 [c01409e0] truncate_inode_pages+0x290/0x2b0
 [c01ace9e] journal_begin+0x8e/0xe0
 [c0191061] reiserfs_delete_inode+0x51/0xc0
 [c01447fa] unmap_vmas+0x14a/0x260
 [c0191010] reiserfs_delete_inode+0x0/0xc0
 [c016c97d] generic_delete_inode+0x7d/0xe0
 [c016cb83] iput+0x63/0x70
 [c0169db6] dput+0x176/0x1b0
 [c01547cb] __fput+0xcb/0x100
 [c01470ff] remove_vm_struct+0x5f/0x80
 [c014873a] unmap_vma_list+0x1a/0x30
 [c0148a9f] do_munmap+0xdf/0xf0
 [c0148aff] sys_munmap+0x4f/0x70
 [c0102a15] syscall_call+0x7/0xb



Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Chris Mason
On Tuesday 12 July 2005 20:50, Rob Mueller wrote:


 Are you saying that if you mount with noatime *and* use your new patch it
 will fix the problem?

 What about the 2 threads linked to. Did those end up getting anywhere?

Sorry for the confusion, you're hitting the other mmap_sem - transaction lock 
problem.  This one should be solvable with an iget so we make sure not to do 
the final unlink until after the mmap sem is dropped.

Let's see what I can do...

-chris


Re: aio-stress throughput regressions from 2.6.11 to 2.6.12

2005-07-05 Thread Chris Mason
On Friday 01 July 2005 03:56, Suparna Bhattacharya wrote:
 Has anyone else noticed major throughput regressions for random
 reads/writes with aio-stress in 2.6.12 ?
 Or have there been any other FS/IO regressions lately ?

 On one test system I see a degradation from around 17+ MB/s to 11MB/s
 for random O_DIRECT AIO (aio-stress -o3 testext3/rwfile5) from 2.6.11
 to 2.6.12. It doesn't seem filesystem specific. Not good :(

 BTW, Chris/Ben, it doesn't look like the changes to aio.c have had an
 impact (I copied those back to my 2.6.11 tree and tried the runs with no
 effect) So it is something else ...

 Ideas/thoughts/observations ?

Let's try to narrow it down a bit:

aio-stress -o 3 -d 1 will set the depth to 1, (io_submit then wait one request 
at a time).  This doesn't take the aio subsystem out of the picture, but it 
does make the block layer interaction more or less the same as non-aio 
benchmarks.  If this is slow, I would suspect something in the block layer, 
and iozone -I -i 2 -w -f testext3/rwfile5 should also show the regression.

If it doesn't regress, I would suspect something in the aio core.  My first 
attempts at the context switch reduction patches caused this kind of 
regression.  There was too much latency in sending the events up to userland.

Other options:

Try different elevators
Try O_SYNC aio random writes
Try aio random reads
Try buffered random reads

-chris


Re: NFS Client patch

2001-07-20 Thread Chris Mason



On Friday, July 20, 2001 10:50:57 AM +0200 Trond Myklebust 
[EMAIL PROTECTED] wrote:

   == Hans Reiser [EMAIL PROTECTED] writes:
 
   The current code does rely on hidden knowledge of the filesystem
   on the server, and refuses to operate with any FS that does not
   describe a position in a directory as an offset or hash that
   fits into 32 or 64 bits.
 
 I'm not saying that ReiserFS is wrong to question the correctness of
 this. I'm just saying that NFSv2 and v3 are fixed protocols, and that
 it's too late to do anything about them. I read Chris's mail as a
 suggestion of creating yet another NQNFS, and this would IMHO be a
 mistake. Better to concentrate on NFSv4 which is meant to be
 extendible.

Ah, then I was unclear...I think that while we certainly could
make linux (or reiserfs) specific changes to NFSvOld, it would be
a really bad idea.  In my mind, the biggest strength behind NFS
is its cross platform support, and maintaining some extension
would only be slightly more fun than daily visits to the dentist ;-)

I also think it is easy to call NFSv4 poorly designed, but much
harder to design it to exploit the strengths of every FS on every
unix flavor.  Shrug, there are tradeoffs everywhere.  

I don't plan on supporting NFSv4 because it is the best network 
filesystem ever made, but because it is in our best interest to 
be compatible with those kinds of industry standards.

-chris




[PATCH] speedup reiserfs O_SYNC and fsync

2001-07-12 Thread Chris Mason


Hello everyone,

This patch makes reiserfs O_SYNC and fsync faster by only
committing the last transaction a file/dir was included in,
instead of forcing a commit on the current transaction.
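
The bookkeeping behind this can be sketched as follows: each inode remembers the id of the last transaction that touched it, and a sync only forces a commit when that transaction is not yet on disk. This is a simplified model with hypothetical sketch_ names and fields; it is not the patch's implementation of reiserfs_update_inode_transaction or reiserfs_commit_for_inode, whose bodies are not shown in the portion of the diff quoted here.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_journal {
    unsigned long running_id;    /* id of the transaction still open */
    unsigned long committed_id;  /* newest transaction id already on disk */
    unsigned long commits;       /* commits forced by sync calls so far */
};

struct sketch_inode {
    unsigned long last_trans_id; /* last transaction touching it, 0 = none */
};

/* Call right after starting a transaction that will modify the inode. */
void sketch_update_inode_transaction(struct sketch_inode *inode,
                                     const struct sketch_journal *j)
{
    inode->last_trans_id = j->running_id;
}

/* Returns true only if a commit had to be forced for this inode. */
bool sketch_commit_for_inode(const struct sketch_inode *inode,
                             struct sketch_journal *j)
{
    assert(inode->last_trans_id <= j->running_id);

    /* Never journaled, or its transaction is already on disk: nothing to do. */
    if (inode->last_trans_id == 0 || inode->last_trans_id <= j->committed_id)
        return false;

    /* Commit up to the inode's transaction, opening a new one if needed. */
    j->committed_id = inode->last_trans_id;
    if (j->running_id == inode->last_trans_id)
        j->running_id++;
    j->commits++;
    return true;
}
```

An inode whose last transaction has already been committed costs nothing at fsync time in this model, which is where the speedup comes from.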

More speedups are still possible, this patch is fairly conservative.

It is based on 2.4.7-pre6 + the direct-indirect target flushing
patch I just sent.  More testers would be greatly appreciated ;-)

Note, this changes the reiserfs in-core inode.  module users need
to recompile the whole kernel.

-chris

diff -Nru a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c
--- a/fs/reiserfs/dir.c Thu Jul 12 10:46:26 2001
+++ b/fs/reiserfs/dir.c Thu Jul 12 10:46:26 2001
@@ -47,22 +47,10 @@
 };
 
 int reiserfs_dir_fsync(struct file *filp, struct dentry *dentry, int datasync) {
-  int ret = 0 ;
-  int windex ;
-  struct reiserfs_transaction_handle th ;
-
   lock_kernel();
-
-  journal_begin(&th, dentry->d_inode->i_sb, 1) ;
-  windex = push_journal_writer("dir_fsync") ;
-  reiserfs_prepare_for_journal(th.t_super, SB_BUFFER_WITH_SB(th.t_super), 1) ;
-  journal_mark_dirty(&th, dentry->d_inode->i_sb, SB_BUFFER_WITH_SB (dentry->d_inode->i_sb)) ;
-  pop_journal_writer(windex) ;
-  journal_end_sync(&th, dentry->d_inode->i_sb, 1) ;
-
-  unlock_kernel();
-
-  return ret ;
+  reiserfs_commit_for_inode(dentry->d_inode) ;
+  unlock_kernel() ;
+  return 0 ;
 }
 
 
diff -Nru a/fs/reiserfs/file.c b/fs/reiserfs/file.c
--- a/fs/reiserfs/file.cThu Jul 12 10:46:26 2001
+++ b/fs/reiserfs/file.cThu Jul 12 10:46:26 2001
@@ -50,6 +50,7 @@
 lock_kernel() ;
 down (&inode->i_sem); 
 journal_begin(&th, inode->i_sb, JOURNAL_PER_BALANCE_CNT * 3) ;
+reiserfs_update_inode_transaction(inode) ;
 
 #ifdef REISERFS_PREALLOCATE
 reiserfs_discard_prealloc (&th, inode);
@@ -83,10 +84,7 @@
  int datasync
  ) {
   struct inode * p_s_inode = p_s_dentry->d_inode;
-  struct reiserfs_transaction_handle th ;
   int n_err = 0;
-  int windex ;
-  int jbegin_count = 1 ;
 
   lock_kernel() ;
 
@@ -94,14 +92,9 @@
   BUG ();
 
   n_err = fsync_inode_buffers(p_s_inode) ;
-  /* commit the current transaction to flush any metadata
-  ** changes.  sys_fsync takes care of flushing the dirty pages for us
-  */
-  journal_begin(&th, p_s_inode->i_sb, jbegin_count) ;
-  windex = push_journal_writer("sync_file") ;
-  reiserfs_update_sd(&th, p_s_inode);
-  pop_journal_writer(windex) ;
-  journal_end_sync(&th, p_s_inode->i_sb,jbegin_count) ;
+
+  reiserfs_commit_for_inode(p_s_inode) ;
+
   unlock_kernel() ;
   return ( n_err < 0 ) ? -EIO : 0;
 }
diff -Nru a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c   Thu Jul 12 10:46:26 2001
+++ b/fs/reiserfs/inode.c   Thu Jul 12 10:46:26 2001
@@ -41,6 +41,7 @@
down (inode-i_sem); 
 
journal_begin(th, inode-i_sb, jbegin_count) ;
+   reiserfs_update_inode_transaction(inode) ;
windex = push_journal_writer(delete_inode) ;
 
reiserfs_delete_object (th, inode);
@@ -232,6 +233,7 @@
   reiserfs_update_sd(th, inode) ;
   journal_end(th, s, len) ;
   journal_begin(th, s, len) ;
+  reiserfs_update_inode_transaction(inode) ;
 }
 
 // it is called by get_block when create == 0. Returns block number
@@ -567,6 +569,7 @@
  TYPE_ANY, 3/*key length*/);
 if ((new_offset + inode->i_sb->s_blocksize) >= inode->i_size) {
journal_begin(&th, inode->i_sb, jbegin_count) ;
+   reiserfs_update_inode_transaction(inode) ;
transaction_started = 1 ;
 }
  research:
@@ -591,6 +594,7 @@
if (!transaction_started) {
pathrelse(&path) ;
journal_begin(&th, inode->i_sb, jbegin_count) ;
+   reiserfs_update_inode_transaction(inode) ;
transaction_started = 1 ;
goto research ;
}
@@ -658,6 +662,7 @@
*/
pathrelse(&path) ;
journal_begin(&th, inode->i_sb, jbegin_count) ;
+   reiserfs_update_inode_transaction(inode) ;
transaction_started = 1 ;
goto research;
 }
@@ -1277,6 +1282,10 @@
 return ;
 }
 lock_kernel() ;
+
+/* this is really only used for atime updates, so they don't have
+** to be included in O_SYNC or fsync
+*/
 journal_begin(&th, inode->i_sb, 1) ;
 reiserfs_update_sd (&th, inode);
 journal_end(&th, inode->i_sb, 1) ;
@@ -1650,6 +1659,7 @@
 ** (it will unmap bh if it packs).
 */
 journal_begin(&th, p_s_inode->i_sb,  JOURNAL_PER_BALANCE_CNT * 2 ) ;
+reiserfs_update_inode_transaction(p_s_inode) ;
 windex = push_journal_writer("reiserfs_vfs_truncate_file") ;
 reiserfs_do_truncate (&th, p_s_inode, page, update_timestamps) ;
 pop_journal_writer(windex) ;
@@ -1696,6 +1706,7 @@
 start_over:
 lock_kernel() ;
 journal_begin(&th, inode->i_sb, jbegin_count) ;
+reiserfs_update_inode_transaction(inode) ;
 
 make_cpu_key(&key, inode, byte_offset, TYPE_ANY, 3) ;
 
@@ -1927,22 +1938,34 @@
 static int reiserfs_commit_write(struct file *f, struct page *page,

[reiserfs-list] Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

2002-02-01 Thread Chris Mason



On Tuesday, January 29, 2002 01:46:43 PM +0300 Hans Reiser [EMAIL PROTECTED] wrote:

 Alexander Viro wrote:
 
 
 On Tue, 29 Jan 2002, Hans Reiser wrote:
 
 This fails to recover an object (e.g. dcache entry) which is used once, 
 and then spends a year in cache on the same page as an object which is 
 hot all the time.  This means that the hot set of objects becomes 
 diffused over an order of magnitude more pages than if garbage 
 collection squeezes them all together.  That makes for very poor caching.
 
 
 Any GC that is going to move active dentries around is out of question.
 It would need a locking of such strength that you would be the first
 to cry bloody murder - about 5 seconds after you look at the scalability
 benchmarks.
 
 
 
 I don't mean to suggest that the dentry cache locking is an easy problem
 to solve, but the problem discussed is a real one, and it is sufficient to
 illustrate that the unified cache is fundamentally flawed as an algorithm
 compared to using subcache plugins.

It isn't just dentries.  If a subcache object is in use, it can't be moved
to a warmer page without invalidating all existing pointers to it.

If it isn't in use, it can be migrated when the VM asks for the page to
be flushed.

-chris
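
The rule above (an in-use object stays put, an idle one may be moved) can be sketched in a few lines of userspace C. This is only an illustration under stated assumptions: `struct toy_obj` and `toy_try_migrate()` are hypothetical names, the refcount stands in for "somebody holds a pointer", and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subcache object; refcount > 0 means somebody holds a pointer. */
struct toy_obj {
    int refcount;
    int payload;
    int live;      /* 1 if this slot holds a valid object */
};

/*
 * Try to move src into the free slot dst (imagined to sit on a warmer page).
 * Returns dst on success, or NULL if the move is not allowed.
 */
static struct toy_obj *toy_try_migrate(struct toy_obj *src, struct toy_obj *dst)
{
    if (src->refcount > 0)
        return NULL;          /* moving would invalidate existing pointers */
    if (dst->live)
        return NULL;          /* destination slot is already occupied */
    *dst = *src;              /* copy the object, including its live flag */
    src->live = 0;            /* old slot is now free; its page can be flushed */
    return dst;
}
```

A caller would try this when the VM asks for the page to be flushed; a NULL return means the object pins its page for now.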

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: 3.4.4-rt13: btrfs + xfstests 006 = BOOM.. and a bonus rt_mutex deadlock report for absolutely free!

2012-07-13 Thread Chris Mason
On Fri, Jul 13, 2012 at 04:26:26AM -0600, Thomas Gleixner wrote:
 On Fri, 13 Jul 2012, Mike Galbraith wrote:
  On Fri, 2012-07-13 at 11:52 +0200, Thomas Gleixner wrote: 
   On Fri, 13 Jul 2012, Mike Galbraith wrote:
On Thu, 2012-07-12 at 15:31 +0200, Thomas Gleixner wrote: 
 Bingo, that makes it more likely that this is caused by copying w/o
 initializing the lock and then freeing the original structure.
 
 A quick check for memcpy finds that __btrfs_close_devices() does a
 memcpy of btrfs_device structs w/o initializing the lock in the new
 copy, but I have no idea whether that's the place we are looking for.

Thanks a bunch Thomas.  I doubt I would have ever figured out that lala
land resulted from _copying_ a lock.  That's one I won't be forgetting
any time soon.  Box not only survived a few thousand xfstests 006 runs,
dbench seemed disinterested in deadlocking virgin 3.0-rt.
   
   Cute. I think that the lock copying caused the deadlock problem as
   the list pointed to the wrong place, so we might have ended up with
   following down the wrong chain when walking the list as long as the
   original struct was not freed. That beast is freed under RCU so there
   could be a rcu read side critical section fiddling with the old lock
   and cause utter confusion.
  
  Virgin 3.0-rt appears to really be solid.  But then it doesn't have
  pesky rwlocks.
 
 Ah. So 3.0 is not having those rwlock thingies. Bummer.
  
   /me goes and writes a nastigram^W proper changelog
   
btrfs still locks up in my enterprise kernel, so I suppose I had better
plug your fix into 3.4-rt and see what happens, and go beat hell out of
virgin 3.0-rt again to be sure box really really survives dbench.
   
   A test against 3.4-rt sans enterprise mess might be nice as well.
  
  Enterprise is 3.0-stable with um 555 btrfs patches (oh dear).
  
  Virgin 3.4-rt and 3.2-rt deadlock gripe.  Enterprise doesn't gripe, but
  deadlocks, so I have another adventure in my future even if I figure out
  wth to do about rwlocks.
 
 Hrmpf. /me goes to stare into fs/btrfs/ some more.

Please post the deadlocks here, I'll help ;)

-chris
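
For anyone wondering why copying a lock is so nasty, here is a minimal userspace sketch. It is not the btrfs or rt_mutex code: `struct toy_lock`, `struct toy_device` and the helpers are hypothetical stand-ins, and the only assumption is that the lock embeds a list head which points at itself when empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the kernel's types. */
struct wait_list {
    struct wait_list *next, *prev;
};

struct toy_lock {
    int owner;                 /* 0 means unlocked */
    struct wait_list waiters;  /* an empty list points at itself */
};

struct toy_device {
    int devid;
    struct toy_lock lock;
};

static void toy_lock_init(struct toy_lock *l)
{
    l->owner = 0;
    l->waiters.next = &l->waiters;
    l->waiters.prev = &l->waiters;
}

/* True if the empty waiter list points back into this very lock. */
static bool toy_lock_list_is_self_contained(const struct toy_lock *l)
{
    return l->waiters.next == &l->waiters && l->waiters.prev == &l->waiters;
}

/* Buggy: the copy's list pointers still reference the original struct. */
static void copy_device_buggy(struct toy_device *dst, const struct toy_device *src)
{
    memcpy(dst, src, sizeof(*dst));
}

/* Safer: copy the payload, then give the copy a freshly initialised lock. */
static void copy_device_fixed(struct toy_device *dst, const struct toy_device *src)
{
    memcpy(dst, src, sizeof(*dst));
    toy_lock_init(&dst->lock);
}
```

After `copy_device_buggy()` the new lock's list walks into the old struct, so freeing the original leaves the copy pointing at freed memory.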



Re: 3.4.4-rt13: btrfs + xfstests 006 = BOOM.. and a bonus rt_mutex deadlock report for absolutely free!

2012-07-13 Thread Chris Mason
On Wed, Jul 11, 2012 at 11:47:40PM -0600, Mike Galbraith wrote:
 Greetings,

[ deadlocks with btrfs and the recent RT kernels ]

I talked with Thomas about this and I think the problem is the
single-reader nature of the RT rwlocks.  The lockdep report below
mentions that btrfs is calling:

 [  692.963099]  [811fabd2] btrfs_clear_path_blocking+0x32/0x70

In this case, the task has a number of blocking read locks on the btrfs buffers,
and we're trying to turn them back into spinning read locks.  Even
though btrfs is taking the read rwlock, it doesn't think of this as a new
lock operation because we were blocking out new writers.

If the second task has taken the spinning read lock, it is going to
prevent that clear_path_blocking operation from progressing, even though
it would have worked on a non-RT kernel.

The solution should be to make the blocking read locks in btrfs honor the
single-reader semantics.  This means not allowing more than one blocking
reader and not allowing a spinning reader when there is a blocking
reader.  Strictly speaking btrfs shouldn't need recursive readers on a
single lock, so I wouldn't worry about that part.

There is also a chunk of code in btrfs_clear_path_blocking that makes
sure to strictly honor top down locking order during the conversion.  It
only does this when lockdep is enabled because in non-RT kernels we
don't need to worry about it.  For RT we'll want to enable that as well.

I'll give this a shot later today.

-chris
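
The two admission rules proposed above (at most one blocking reader, and no spinning reader while a blocking reader exists) can be written down as a tiny state model. This is a single-threaded sketch of the rules only, not btrfs locking code: `struct toy_eb_lock` and the `toy_*` functions are hypothetical, and a real implementation would need atomics and a way to wait instead of returning false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one extent buffer lock's reader bookkeeping. */
struct toy_eb_lock {
    int blocking_readers;
    int spinning_readers;
};

/* A spinning reader is refused while any blocking reader exists. */
static bool toy_read_lock_spinning(struct toy_eb_lock *l)
{
    if (l->blocking_readers > 0)
        return false;
    l->spinning_readers++;
    return true;
}

static void toy_read_unlock_spinning(struct toy_eb_lock *l)
{
    assert(l->spinning_readers > 0);
    l->spinning_readers--;
}

/* Only one blocking reader is admitted at a time. */
static bool toy_read_lock_blocking(struct toy_eb_lock *l)
{
    if (l->blocking_readers > 0)
        return false;
    l->blocking_readers++;
    return true;
}

static void toy_read_unlock_blocking(struct toy_eb_lock *l)
{
    assert(l->blocking_readers > 0);
    l->blocking_readers--;
}
```

In a real lock the false returns would become waits, which is where the top down ordering mentioned above starts to matter.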



Re: 3.4.4-rt13: btrfs + xfstests 006 = BOOM.. and a bonus rt_mutex deadlock report for absolutely free!

2012-07-15 Thread Chris Mason
On Sat, Jul 14, 2012 at 04:14:43AM -0600, Mike Galbraith wrote:
 On Fri, 2012-07-13 at 08:50 -0400, Chris Mason wrote: 
  On Wed, Jul 11, 2012 at 11:47:40PM -0600, Mike Galbraith wrote:
   Greetings,
  
  [ deadlocks with btrfs and the recent RT kernels ]
  
  I talked with Thomas about this and I think the problem is the
  single-reader nature of the RT rwlocks.  The lockdep report below
  mentions that btrfs is calling:
  
   [  692.963099]  [811fabd2] btrfs_clear_path_blocking+0x32/0x70
  
  In this case, the task has a number of blocking read locks on the btrfs 
  buffers,
  and we're trying to turn them back into spinning read locks.  Even
  though btrfs is taking the read rwlock, it doesn't think of this as a new
  lock operation because we were blocking out new writers.
  
  If the second task has taken the spinning read lock, it is going to
  prevent that clear_path_blocking operation from progressing, even though
  it would have worked on a non-RT kernel.
  
  The solution should be to make the blocking read locks in btrfs honor the
  single-reader semantics.  This means not allowing more than one blocking
  reader and not allowing a spinning reader when there is a blocking
  reader.  Strictly speaking btrfs shouldn't need recursive readers on a
  single lock, so I wouldn't worry about that part.
  
  There is also a chunk of code in btrfs_clear_path_blocking that makes
  sure to strictly honor top down locking order during the conversion.  It
  only does this when lockdep is enabled because in non-RT kernels we
  don't need to worry about it.  For RT we'll want to enable that as well.
  
  I'll give this a shot later today.
 
 I took a poke at it.  Did I do something similar to what you had in
 mind, or just hide behind performance stealing paranoid trylock loops?
 Box survived 1000 x xfstests 006 and dbench [-s] massive right off the
 bat, so it gets posted despite skepticism.

Great, thanks!  I got stuck in bug land on Friday.  You mentioned
performance problems earlier on Saturday, did this improve performance?

One other question:

  again:
 +#ifdef CONFIG_PREEMPT_RT_BASE
 + while (atomic_read(&eb->blocking_readers))
 + cpu_chill();
 + while(!read_trylock(&eb->lock))
 + cpu_chill();
 + if (atomic_read(&eb->blocking_readers)) {
 + read_unlock(&eb->lock);
 + goto again;
 + }

Why use read_trylock() in a loop instead of just trying to take the
lock?  Is this an RTism or are there other reasons?  

-chris


Re: 3.4.4-rt13: btrfs + xfstests 006 = BOOM.. and a bonus rt_mutex deadlock report for absolutely free!

2012-07-16 Thread Chris Mason
On Mon, Jul 16, 2012 at 04:55:44AM -0600, Mike Galbraith wrote:
 On Sat, 2012-07-14 at 12:14 +0200, Mike Galbraith wrote: 
  On Fri, 2012-07-13 at 08:50 -0400, Chris Mason wrote: 
   On Wed, Jul 11, 2012 at 11:47:40PM -0600, Mike Galbraith wrote:
Greetings,
   
   [ deadlocks with btrfs and the recent RT kernels ]
   
   I talked with Thomas about this and I think the problem is the
   single-reader nature of the RT rwlocks.  The lockdep report below
   mentions that btrfs is calling:
   
[  692.963099]  [811fabd2] btrfs_clear_path_blocking+0x32/0x70
   
   In this case, the task has a number of blocking read locks on the btrfs 
   buffers,
   and we're trying to turn them back into spinning read locks.  Even
   though btrfs is taking the read rwlock, it doesn't think of this as a new
   lock operation because we were blocking out new writers.
   
   If the second task has taken the spinning read lock, it is going to
   prevent that clear_path_blocking operation from progressing, even though
   it would have worked on a non-RT kernel.
   
   The solution should be to make the blocking read locks in btrfs honor the
   single-reader semantics.  This means not allowing more than one blocking
   reader and not allowing a spinning reader when there is a blocking
   reader.  Strictly speaking btrfs shouldn't need recursive readers on a
   single lock, so I wouldn't worry about that part.
   
   There is also a chunk of code in btrfs_clear_path_blocking that makes
   sure to strictly honor top down locking order during the conversion.  It
   only does this when lockdep is enabled because in non-RT kernels we
   don't need to worry about it.  For RT we'll want to enable that as well.
   
   I'll give this a shot later today.
  
  I took a poke at it.  Did I do something similar to what you had in
  mind, or just hide behind performance stealing paranoid trylock loops?
  Box survived 1000 x xfstests 006 and dbench [-s] massive right off the
  bat, so it gets posted despite skepticism.
 
 Seems btrfs isn't entirely convinced either.
 
 [ 2292.336229] use_block_rsv: 1810 callbacks suppressed
 [ 2292.336231] [ cut here ]
 [ 2292.336255] WARNING: at fs/btrfs/extent-tree.c:6344 
 use_block_rsv+0x17d/0x190 [btrfs]()
 [ 2292.336257] Hardware name: System x3550 M3 -[7944K3G]-
 [ 2292.336259] btrfs: block rsv returned -28

This is unrelated.  You got far enough into the benchmark to hit an
ENOSPC warning.  This can be ignored (I just deleted it when we used 3.0
for oracle).

re: dbench performance.  dbench tends to penalize fairness.  I can
imagine RT making it slower in general.

It also triggers lots of lock contention in btrfs because the dataset is
fairly small and the trees don't fan out a lot.

-chris



Re: 3.4.4-rt13: btrfs + xfstests 006 = BOOM.. and a bonus rt_mutex deadlock report for absolutely free!

2012-07-16 Thread Chris Mason
On Mon, Jul 16, 2012 at 10:26:08AM -0600, Mike Galbraith wrote:
 On Mon, 2012-07-16 at 12:02 -0400, Steven Rostedt wrote: 
  On Mon, 2012-07-16 at 04:02 +0200, Mike Galbraith wrote:
  
Great, thanks!  I got stuck in bug land on Friday.  You mentioned
performance problems earlier on Saturday, did this improve performance?
   
   Yeah, the read_trylock() seems to improve throughput.  That's not
   heavily tested, but it certainly looks like it does.  No idea why.
  
  Ouch, you just turned the rt_read_lock() into a spin lock. If a higher
  priority process preempted a lower priority process that holds the same
  lock, it will deadlock.
 
 Hm, how, it's doing cpu_chill()?
 
  I'm not sure why you would get a performance benefit from this, as the
  mutex used is an adaptive one (failure to acquire the lock will only
  sleep if preempted or if the owner is not running).
 
 I'm not attached to it, can whack it in a heartbeat.. especially so if
 the thing can deadlock.  I've seen enough of those of late.
 
  We should look at why this performs better (if it really does).
 
 Not sure it really does, there's variance, but it looked like it did.
 

I'd use a benchmark that is more consistent than dbench for this.  I
love dbench for generating load (and the occasional deadlock) but it
tends to steer you in the wrong direction on performance.

-chris



Re: [BUG REPORT] Kernel panic on 3.9.0-rc7-4-gbb33db7

2013-04-19 Thread Chris Mason
Quoting Tejun Heo (2013-04-19 01:57:54)
 
 Ewweehh
 
 No wonder this thing crashes.  Chris, can't the original bio carry
 bbio in bi_private and let end_bio_extent_readpage() free the bbio
 instead of abusing bi_bdev like this?

Yes, we can definitely carry bbio up higher in the stack.  I'll patch it
up right now.  I do agree that it'll be too big for -final, but we'll
have it either way.

-chris
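
A rough userspace sketch of the shape Tejun is asking for: the per-request context rides in the private pointer and the device pointer is left alone. Everything here is hypothetical (`struct toy_bio`, `struct toy_bbio`, `toy_submit()`, `toy_complete()`); it only borrows the field names `bi_bdev` and `bi_private` and is not the btrfs patch.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bdev { int id; };

struct toy_bio {
    struct toy_bdev *bi_bdev;   /* must always point at a real device */
    void *bi_private;           /* owner-defined context, opaque to others */
    void (*bi_end_io)(struct toy_bio *bio, int err);
};

/* Hypothetical per-request context carried through the I/O. */
struct toy_bbio {
    int mirror_num;
    int completed;
    int last_err;
};

/* Completion handler: recover the context from bi_private, not bi_bdev. */
static void toy_end_io(struct toy_bio *bio, int err)
{
    struct toy_bbio *bbio = bio->bi_private;

    bbio->last_err = err;
    bbio->completed = 1;
}

static void toy_submit(struct toy_bio *bio, struct toy_bdev *dev,
                       struct toy_bbio *bbio)
{
    bio->bi_bdev = dev;
    bio->bi_private = bbio;
    bio->bi_end_io = toy_end_io;
}

/* Generic completion path: it may still dereference bi_bdev safely. */
static void toy_complete(struct toy_bio *bio, int err)
{
    assert(bio->bi_bdev != NULL);
    bio->bi_end_io(bio, err);
}
```

The point is that anything between submit and completion can keep trusting `bi_bdev`, because the filesystem never stores unrelated data there.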



Re: [BUG REPORT] Kernel panic on 3.9.0-rc7-4-gbb33db7

2013-04-19 Thread Chris Mason
Quoting Jens Axboe (2013-04-19 09:32:50)
  
  No wonder this thing crashes.  Chris, can't the original bio carry
  bbio in bi_private and let end_bio_extent_readpage() free the bbio
  instead of abusing bi_bdev like this?
 
 Ugh, wtf.
 
 Chris, time for a swim in the bay :-)

Yeah, I can't really defend this one.  We needed a space for an int and
I assumed end_io meant the FS was free to do horrible things.

Really though, I'll just take a quick dip in the lake and patch this out
of btrfs. 

Jan is probably right about changing around our endio callbacks to
explicitly pass the mirror, it should be less complex and cleaner.

Many thanks to everyone here that tracked it down.

-chris


[GIT PULL] One more btrfs

2013-04-13 Thread Chris Mason
Hi Linus

My for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Has a recent fix from Josef for our tree log replay code.  It fixes
problems where the inode counter for the number of bytes in the file
wasn't getting updated properly during fsync replay.

The commit did get rebased this morning, but it was only to clean up the
subject line.  The code hasn't changed.

Josef Bacik (1) commits (+42/-6):
Btrfs: make sure nbytes are right after log replay

Total: (1) commits (+42/-6)

 fs/btrfs/tree-log.c | 48 ++--
 1 file changed, 42 insertions(+), 6 deletions(-)


[GIT PULL] Btrfs updates

2013-03-29 Thread Chris Mason
Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

We've had a busy two weeks of bug fixing.  The biggest patches in here
are some long standing early-enospc problems (Josef) and a very old race
where compression and mmap combine forces to lose writes (me).  I'm
fairly sure the mmap bug goes all the way back to the introduction of
the compression code, which is proof that fsx doesn't trigger every
possible mmap corner after all.

I'm sure you'll notice one of these is from this morning, it's a small
and isolated use-after-free fix in our scrub error reporting.  I double
checked it here.

Josef Bacik (6) commits (+90/-18):
Btrfs: hold the ordered operations mutex when waiting on ordered extents (+2/-0)
Btrfs: don't drop path when printing out tree errors in scrub (+2/-1)
Btrfs: fix space leak when we fail to reserve metadata space (+41/-6)
Btrfs: fix space accounting for unlink and rename (+2/-4)
Btrfs: limit the global reserve to 512mb (+1/-1)
Btrfs: handle a bogus chunk tree nicely (+42/-6)

Jan Schmidt (2) commits (+24/-16):
Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes (+4/-6)
Btrfs: fix locking on ROOT_REPLACE operations in tree mod log (+20/-10)

Wang Shilong (2) commits (+10/-2):
Btrfs: fix double free in the btrfs_qgroup_account_ref() (+1/-2)
Btrfs: fix missing qgroup reservation before fallocating (+9/-0)

Miao Xie (2) commits (+5/-3):
Btrfs: fix wrong return value of btrfs_lookup_csum() (+3/-1)
Btrfs: fix wrong reservation of csums (+2/-2)

Chris Mason (1) commits (+49/-0):
Btrfs: fix race between mmap writes and compression

Liu Bo (1) commits (+1/-1):
Btrfs: update to use fs_state bit

Tsutomu Itoh (1) commits (+9/-3):
Btrfs: fix memory leak in btrfs_create_tree()

Total: (15) commits

 fs/btrfs/ctree.c        | 30 --
 fs/btrfs/disk-io.c      | 14 ++---
 fs/btrfs/extent-tree.c  | 84 ++---
 fs/btrfs/extent_io.c    | 33 +++
 fs/btrfs/extent_io.h    |  2 ++
 fs/btrfs/file-item.c    |  6 ++--
 fs/btrfs/file.c         |  9 ++
 fs/btrfs/inode.c        | 22 ++---
 fs/btrfs/ordered-data.c |  2 ++
 fs/btrfs/qgroup.c       |  3 +-
 fs/btrfs/scrub.c        |  3 +-
 fs/btrfs/send.c         | 10 +++---
 fs/btrfs/volumes.c      | 13 +++-
 13 files changed, 188 insertions(+), 43 deletions(-)


[GIT PULL] Btrfs fixes

2013-03-17 Thread Chris Mason
Hi Linus,

My for-linus branch has some btrfs fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Eric's rcu barrier patch fixes a long standing problem with our unmount
code hanging on to devices in workqueue helpers.  Liu Bo nailed down a
difficult assertion for in-memory extent mappings.

Liu Bo (4) commits (+9/-7):
Btrfs: get better concurrency for snapshot-aware defrag work (+3/-0)
Btrfs: fix warning when creating snapshots (+5/-6)
Btrfs: fix warning of free_extent_map (+1/-0)
Btrfs: remove btrfs_try_spin_lock (+0/-1)

Josef Bacik (1) commits (+4/-1):
Btrfs: return EIO if we have extent tree corruption

Eric Sandeen (1) commits (+6/-0):
btrfs: use rcu_barrier() to wait for bdev puts at unmount

Wang Shilong (1) commits (+6/-4):
Btrfs: return as soon as possible when edquot happens

Total: (7) commits (+25/-12)

 fs/btrfs/extent-tree.c |  5 -
 fs/btrfs/file.c        |  1 +
 fs/btrfs/inode.c       |  3 +++
 fs/btrfs/locking.h     |  1 -
 fs/btrfs/qgroup.c      | 10 ++
 fs/btrfs/transaction.c | 11 +--
 fs/btrfs/volumes.c     |  6 ++
 7 files changed, 25 insertions(+), 12 deletions(-)


[GIT PULL] Btrfs updates

2013-03-08 Thread Chris Mason
Hi Linus,

Please grab my for-linus:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

These are scattered fixes and one performance improvement.  The biggest
functional change is in how we throttle metadata changes.  The new code
bumps our average file creation rate up by ~13% in fs_mark, and lowers
CPU usage.

Stefan bisected out a regression in our allocation code that made
balance loop on extents larger than 256MB.

Liu Bo (6) commits (+71/-19):
Btrfs: build up error handling for merge_reloc_roots (+35/-12)
Btrfs: check for NULL pointer in updating reloc roots (+2/-0)
Btrfs: avoid deadlock on transaction waiting list (+7/-0)
Btrfs: free all recorded tree blocks on error (+6/-3)
Btrfs: do not BUG_ON on aborted situation (+12/-3)
Btrfs: do not BUG_ON in prepare_to_reloc (+9/-1)

Chris Mason (2) commits (+96/-63):
Btrfs: enforce min_bytes parameter during extent allocation (+4/-2)
Btrfs: improve the delayed inode throttling (+92/-61)

Miao Xie (2) commits (+45/-39):
Btrfs: fix unclosed transaction handler when the async transaction commitment fails (+4/-0)
Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails (+41/-39)

Stefan Behrens (1) commits (+0/-8):
Btrfs: allow running defrag in parallel to administrative tasks

Ilya Dryomov (1) commits (+5/-0):
Btrfs: fix a mismerge in btrfs_balance()

Josef Bacik (1) commits (+4/-1):
Btrfs: use set_nlink if our i_nlink is 0

Total: (13) commits (+221/-130)

 fs/btrfs/delayed-inode.c | 151 ---
 fs/btrfs/delayed-inode.h |   2 +
 fs/btrfs/disk-io.c       |  16 +++--
 fs/btrfs/inode.c         |   6 +-
 fs/btrfs/ioctl.c         |  18 ++
 fs/btrfs/relocation.c    |  74 +--
 fs/btrfs/transaction.c   |  65 
 fs/btrfs/tree-log.c      |   5 +-
 fs/btrfs/volumes.c       |  14 -
 9 files changed, 221 insertions(+), 130 deletions(-)


Re: [GIT PULL] Btrfs

2013-03-02 Thread Chris Mason
On Sat, Mar 02, 2013 at 05:45:41PM -0700, Linus Torvalds wrote:
 On Sat, Mar 2, 2013 at 7:15 AM, Chris Mason chris.ma...@fusionio.com wrote:
 
  Our set of btrfs features, fixes and cleanups are in my for-linus
  branch:
 
 I *really* wish that big pull requests like this would come in earlier
 in the merge window. I hate seeing them the day before I close the
 window - really.  A number of the latter commits are done in the last
 few days, which also smells bad.

Definitely, I wanted to send this earlier in the merge window.  But I
was out last week and also didn't want to send the big stuff (raid 5/6
and the fsync work) to you right before I left on vacation.

So instead I sent things off to linux-next, and everyone on the btrfs
list collected fixes while I was gone.

-chris


Re: [PATCH] btrfs/raid56: Add missing #include <linux/vmalloc.h>

2013-03-03 Thread Chris Mason
On Sun, Mar 03, 2013 at 04:44:41AM -0700, Geert Uytterhoeven wrote:
 tilegx_defconfig:
 
 fs/btrfs/raid56.c: In function 'btrfs_alloc_stripe_hash_table':
 fs/btrfs/raid56.c:206:3: error: implicit declaration of function 'vzalloc' 
 [-Werror=implicit-function-declaration]
 fs/btrfs/raid56.c:206:9: warning: assignment makes pointer from integer 
 without a cast [enabled by default]
 fs/btrfs/raid56.c:226:4: error: implicit declaration of function 'vfree' 
 [-Werror=implicit-function-declaration]

Thanks, I've got this one in my for-linus now.  It'll go with the next
pull.

-chris


[GIT PULL] Btrfs fixup

2013-03-03 Thread Chris Mason
Hi Linus,

Geert and James both sent this one in, sorry guys.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Geert Uytterhoeven (1) commits (+1/-0):
btrfs/raid56: Add missing #include <linux/vmalloc.h>

Total: (1) commits (+1/-0)

 fs/btrfs/raid56.c | 1 +
 1 file changed, 1 insertion(+)


Re: [PATCH v2 0/4] ipc: reduce ipc lock contention

2013-03-07 Thread Chris Mason
On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote:
 On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
 
  Indeed.  Though how well my patches will work with Oracle will
  depend a lot on what kind of semctl syscalls they are doing.
  
  Does Oracle typically do one semop per semctl syscall, or does
  it pass in a whole bunch at once?
 
 https://oss.oracle.com/~mason/sembench.c
 
 I think Chris wrote that to match a particular pattern of semaphore
 operations the database engine in question does. I haven't checked to
 see if it triggers the case in point though.
 
 Also, Chris since left Oracle but maybe he knows who to poke.
 

Dave Kleikamp (cc'd) took over my patches and did the most recent
benchmarking.  Ported against 3.0:

https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c

The current versions are still in the 2.6.32 oracle kernel, but it looks
like they reverted this 3.0 commit.  I think with Manfred's upstream
work my more complex approach wasn't required anymore, but hopefully
Dave can fill in details.

Here is some of the original discussion around the patch:

https://lkml.org/lkml/2010/4/12/257

In terms of how oracle uses IPC, the part that shows up in profiles is
using semtimedop for bulk wakeups.  They can configure things to use
either a bunch of small arrays or a huge single array (and anything in
between). 

There is one IPC semaphore per process and they use this to wait for
some event (like a log commit).  When the event comes in, everyone
waiting is woken in bulk via a semtimedop call.

So, single proc waking many waiters at once.

-chris
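
To make the "single proc waking many waiters at once" pattern concrete, here is a small sketch using the System V semaphore calls. It is an assumption-laden illustration, not the database's code: the `toy_*` helpers and `union toy_semun` are made up for this example, it uses `semop()` (which behaves like `semtimedop()` without a timeout), and each waiter is imagined to sleep with a `sem_op` of -1 on its own semaphore in the array.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* The caller has to define the semctl() argument union itself. */
union toy_semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a private array of nsems semaphores, each set to 0.
 * Returns the semaphore id, or -1 on failure. */
static int toy_create_waiters(int nsems)
{
    int id = semget(IPC_PRIVATE, nsems, IPC_CREAT | 0600);

    if (id < 0)
        return -1;
    for (int i = 0; i < nsems; i++) {
        union toy_semun arg = { .val = 0 };

        if (semctl(id, i, SETVAL, arg) < 0) {
            semctl(id, 0, IPC_RMID);
            return -1;
        }
    }
    return id;
}

/* Post every semaphore in the array with one semop() call, so all
 * waiters (one per semaphore) are released by a single syscall. */
static int toy_bulk_wakeup(int semid, int nsems)
{
    struct sembuf *ops = calloc((size_t)nsems, sizeof(*ops));
    int ret;

    if (!ops)
        return -1;
    for (int i = 0; i < nsems; i++) {
        ops[i].sem_num = (unsigned short)i;
        ops[i].sem_op = 1;
        ops[i].sem_flg = 0;
    }
    ret = semop(semid, ops, (size_t)nsems);
    free(ops);
    return ret;
}

static int toy_sem_value(int semid, int idx)
{
    return semctl(semid, idx, GETVAL);
}

static void toy_destroy(int semid)
{
    semctl(semid, 0, IPC_RMID);
}
```

The number of operations per call is capped by the kernel's SEMOPM limit, so a very large array would need several calls.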



Re: [PATCH v2 0/4] ipc: reduce ipc lock contention

2013-03-07 Thread Chris Mason
On Thu, Mar 07, 2013 at 08:54:55AM -0700, Dave Kleikamp wrote:
 On 03/07/2013 06:55 AM, Chris Mason wrote:
  On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote:
  On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
 
  Indeed.  Though how well my patches will work with Oracle will
  depend a lot on what kind of semctl syscalls they are doing.
 
  Does Oracle typically do one semop per semctl syscall, or does
  it pass in a whole bunch at once?
 
  https://oss.oracle.com/~mason/sembench.c
 
  I think Chris wrote that to match a particular pattern of semaphore
  operations the database engine in question does. I haven't checked to
  see if it triggers the case in point though.
 
  Also, Chris since left Oracle but maybe he knows who to poke.
 
  
  Dave Kleikamp (cc'd) took over my patches and did the most recent
  benchmarking.  Ported against 3.0:
  
  https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c
  
  The current versions are still in the 2.6.32 oracle kernel, but it looks
  like they reverted this 3.0 commit.  I think with Manfred's upstream
  work my more complex approach wasn't required anymore, but hopefully
  Dave can fill in details.
 
 From what I recall, I could never get better performance from your
 patches that we saw with Manfred's work alone. I can't remember the
 reasons for including and then reverting the patches from the 3.0
 (2.6.39) Oracle kernel, but in the end we weren't able to justify their
 inclusion.

Ok, so after this commit, oracle was happy:

commit fd5db42254518fbf241dc454e918598fbe494fa2
Author: Manfred Spraul manf...@colorfullife.com
Date:   Wed May 26 14:43:40 2010 -0700

ipc/sem.c: optimize update_queue() for bulk wakeup calls

But that doesn't explain why Davidlohr saw semtimedop at the top of the
oracle profiles in his runs.

Looking through the patches in this thread, I don't see anything that
I'd expect to slow down oracle TPC numbers.

I dealt with the ipc_perm lock a little differently:

https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commitdiff;h=78fe45325c8e2e3f4b6ebb1ee15b6c2e8af5ddb1;hp=8102e1ff9d667661b581209323faaf7a84f0f528

My code switched the ipc_rcu_hdr refcount to an atomic, which changed
where I needed the spinlock.  It may make things easier in patches 3/4
and 4/4.

(some of this code was Jens, but at the time he made me promise to
pretend he never touched it)

-chris
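
The idea of an atomic refcount can be sketched in userspace with C11 atomics. This is only an illustration of the technique, not the ipc code from the commit above: `struct toy_ipc_hdr` and the `toy_hdr_*` helpers are hypothetical names, and the kernel's own atomic primitives are not used here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical header holding only the reference count. */
struct toy_ipc_hdr {
    atomic_int refcount;
};

static void toy_hdr_init(struct toy_ipc_hdr *h)
{
    atomic_init(&h->refcount, 1);
}

/* Take a reference without any spinlock, unless the count already hit 0. */
static bool toy_hdr_get(struct toy_ipc_hdr *h)
{
    int old = atomic_load(&h->refcount);

    while (old > 0) {
        /* On failure the CAS reloads old, so the loop simply retries. */
        if (atomic_compare_exchange_weak(&h->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool toy_hdr_put(struct toy_ipc_hdr *h)
{
    return atomic_fetch_sub(&h->refcount, 1) == 1;
}
```

With get and put lock-free, a spinlock is only needed for the object's other state, which is the kind of shift in lock placement described above.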



Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-29 Thread Chris Mason
On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
 On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  But the fact that the code wants to do things like
 
  block = (sector_t)page-index  (PAGE_CACHE_SHIFT - bbits);
 
  seriously seems to be the main thing that keeps us using
  'inode-i_blkbits'. Calculating bbits from bh-b_size is just costly
  enough to hurt (not everywhere, but on some machines).
 
  Very annoying.
 
 Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
 whole ilog2 thing, but at the same time, in other cases it actually
 seems to improve code generation (ie gets rid of the whole unnecessary
 two dereferences through page->mapping->host just to get the block
 size, when we have it in the buffer-head that we have to touch
 *anyway*).
 
 Comments? Again, untested.

Jumping in based on Linus' original patch, which is doing something like
this:

set_blocksize() {
	block new calls to writepage, prepare/commit_write
	set the block size
	unblock

	--- can race in here and find bad buffers ---

	sync_blockdev()
	kill_bdev()

	--- now we're safe ---
}

We could add a second semaphore and a page_mkwrite call:

set_blocksize() {

	block new calls to prepare/commit_write and page_mkwrite(), but
	leave writepage unblocked.

	sync_blockdev()

	--- now we're safe.  There are no dirty pages and no ways to
	make new ones ---

	block new calls to readpage (writepage too for good luck?)

	kill_bdev()
	set the block size

	unblock readpage/writepage
	unblock prepare/commit_write and page_mkwrite

}

Another way to look at things:

As Linus said in a different email, we don't need to drop the pages, just
the buffers.  Once we've blocked prepare/commit_write,
there is no way to make a partially up to date page with dirty data.
We may make fully uptodate dirty pages, but for those we can
just create dirty buffers for the whole page.

As long as we had prepare/commit write blocked while we ran
sync_blockdev, we can blindly detach any buffers that are the wrong size
and just make new ones.
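
A rough userspace sketch of that last idea, in case it helps to see it
spelled out.  Everything here is made up for illustration: struct
toy_page, struct toy_buffer and the toy_* helpers are hypothetical
stand-ins, not the kernel's struct page or buffer_head code, and the
sketch assumes the page is fully uptodate so every new buffer can
simply inherit the page's dirty state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096u

/* Hypothetical stand-in for a buffer_head: only the fields we need. */
struct toy_buffer {
	size_t size;
	bool dirty;
	struct toy_buffer *next;
};

/* Hypothetical stand-in for a page with a list of attached buffers. */
struct toy_page {
	bool dirty;
	struct toy_buffer *buffers;
};

/* Return true when every buffer on the page has the expected size. */
bool toy_buffers_match(const struct toy_page *page, size_t blocksize)
{
	const struct toy_buffer *b;

	for (b = page->buffers; b; b = b->next)
		if (b->size != blocksize)
			return false;
	return true;
}

/* Free every buffer attached to the page. */
void toy_detach_buffers(struct toy_page *page)
{
	struct toy_buffer *b = page->buffers;

	while (b) {
		struct toy_buffer *next = b->next;

		free(b);
		b = next;
	}
	page->buffers = NULL;
}

/*
 * Make sure the page carries buffers of the requested size.  Buffers of
 * the wrong size are thrown away and replaced; the new ones cover the
 * whole page and copy the page's dirty flag.
 * Returns 0 on success, -1 on a bad size or allocation failure.
 */
int toy_rebuild_buffers(struct toy_page *page, size_t blocksize)
{
	size_t i, count;

	if (blocksize == 0 || blocksize > TOY_PAGE_SIZE)
		return -1;
	if (page->buffers && toy_buffers_match(page, blocksize))
		return 0;

	toy_detach_buffers(page);
	count = TOY_PAGE_SIZE / blocksize;
	for (i = 0; i < count; i++) {
		struct toy_buffer *b = malloc(sizeof(*b));

		if (!b) {
			toy_detach_buffers(page);
			return -1;
		}
		b->size = blocksize;
		b->dirty = page->dirty;
		b->next = page->buffers;
		page->buffers = b;
	}
	return 0;
}
```

The point of the sketch is only the ordering: check the size first,
and rebuild when it doesn't match, rather than trusting whatever
buffers happen to be on the page.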

This may or may not apply to loop.c, I'd have to read that more
carefully.

-chris



Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 07:12:49AM -0700, Chris Mason wrote:
 On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
  On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
  torva...@linux-foundation.org wrote:
  
   But the fact that the code wants to do things like
  
   block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
  
   seriously seems to be the main thing that keeps us using
   'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
   enough to hurt (not everywhere, but on some machines).
  
   Very annoying.
  
  Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
  whole ilog2 thing, but at the same time, in other cases it actually
  seems to improve code generation (ie gets rid of the whole unnecessary
  two dereferences through page->mapping->host just to get the block
  size, when we have it in the buffer-head that we have to touch
  *anyway*).
  
  Comments? Again, untested.
 
 Jumping in based on Linus' original patch, which is doing something like
 this:
 
 set_blocksize() {
   block new calls to writepage, prepare/commit_write
   set the block size
   unblock
 
--- can race in here and find bad buffers ---
 
   sync_blockdev()
   kill_bdev() 
   
--- now we're safe --- 
 }
 
 We could add a second semaphore and a page_mkwrite call:
 
 set_blocksize() {
 
   block new calls to prepare/commit_write and page_mkwrite(), but
   leave writepage unblocked.
 
   sync_blockdev()
 
   --- now we're safe.  There are no dirty pages and no ways to
   make new ones --- 
 
   block new calls to readpage (writepage too for good luck?)
 
   kill_bdev()

Whoops, kill_bdev needs the page lock, which sends us into ABBA when
readpage does the down_read.  So, slight modification: unblock
readpage/writepage before the kill_bdev.  The risk is that readpage can
then find buffers with the wrong size, so it would need to be changed to
discard them.

The patch below is based on Linus' original and doesn't deal with the
readpage race.  But it does get the rest of the idea across.  It boots
and survives banging on blockdev --setbsz with mkfs, but I definitely
wouldn't trust it.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..1377171 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -116,8 +116,6 @@ EXPORT_SYMBOL(invalidate_bdev);
 
 int set_blocksize(struct block_device *bdev, int size)
 {
-	struct address_space *mapping;
-
 	/* Size must be a power of two, and between 512 and PAGE_SIZE */
 	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
@@ -126,28 +124,40 @@ int set_blocksize(struct block_device *bdev, int size)
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
-	/* Prevent starting I/O or mapping the device */
-	percpu_down_write(&bdev->bd_block_size_semaphore);
-
-	/* Check that the block device is not memory mapped */
-	mapping = bdev->bd_inode->i_mapping;
-	mutex_lock(&mapping->i_mmap_mutex);
-	if (mapping_mapped(mapping)) {
-		mutex_unlock(&mapping->i_mmap_mutex);
-		percpu_up_write(&bdev->bd_block_size_semaphore);
-		return -EBUSY;
-	}
-	mutex_unlock(&mapping->i_mmap_mutex);
-
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
+		/* block all modifications via writing and page_mkwrite */
+		percpu_down_write(&bdev->bd_block_size_semaphore);
+
+		/* write everything that was dirty */
 		sync_blockdev(bdev);
+
+		/* block readpage and writepage */
+		percpu_down_write(&bdev->bd_page_semaphore);
+
 		bdev->bd_block_size = size;
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
+
+		/* we can't call kill_bdev with the page_semaphore down
+		 * because we'll deadlock against readpage.
+		 * The block_size_semaphore should prevent any new
+		 * pages from being dirty, but readpage can jump
+		 * in once we up the bd_page_sem and find a
+		 * page with buffers from the old size.
+		 *
+		 * The kill_bdev call below is going to get rid
+		 * of those buffers, but we do have a race here.
+		 * readpage needs to deal with it and verify
+		 * any buffers on the page are the right size
+		 */
+		percpu_up_write(&bdev->bd_page_semaphore);
+
+		/* drop all the pages and all the buffers */
 		kill_bdev(bdev);
-	}
 
-	percpu_up_write(&bdev->bd_block_size_semaphore);
+		/* open the gates and let everyone back in */
+		percpu_up_write(&bdev->bd_block_size_semaphore);
+	}
 
 	return 0

Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 10:26:56AM -0700, Linus Torvalds wrote:
 On Thu, Nov 29, 2012 at 6:12 AM, Chris Mason chris.ma...@fusionio.com wrote:
 
  Jumping in based on Linus' original patch, which is doing something like
  this:
 
  set_blocksize() {
  block new calls to writepage, prepare/commit_write
  set the block size
  unblock
 
   --- can race in here and find bad buffers ---
 
  sync_blockdev()
  kill_bdev()
 
   --- now we're safe --- 
  }
 
  We could add a second semaphore and a page_mkwrite call:
 
 Yeah, we could be fancy, but the more I think about it, the less I can
 say I care.
 
 After all, the only things that do the whole set_blocksize() thing should be:
 
  - filesystems at mount-time
 
  - things like loop/md at block device init time.
 
 and quite frankly, if there are any *concurrent* writes with either of
 the above, I really *really* don't think we should care. I mean,
 seriously.
 
 So the _only_ real reason for the locking in the first place is to
 make sure of internal kernel consistency. We do not want to oops or
 corrupt memory if people do odd things. But we really *really* don't
 care if somebody writes to a partition at the same time as somebody
 else mounts it. Not enough to do extra work to please insane people.
 
 It's also worth noting that NONE OF THIS HAS EVER WORKED IN THE PAST.
 The whole sequence always used to be unlocked. The locking is entirely
 new. There is certainly not any legacy users that can possibly rely on
 "I did writes at the same time as the mount with no serialization, and
 it worked". It never has worked.
 
 So I think this is a case of "perfect is the enemy of good".
 Especially since I think that with the fs/buffer.c approach, we don't
 actually need any locking at all at higher levels.

The bigger question is do we have users that expect to be able to set
the blocksize after mmaping the block device (no writes required)?  I
actually feel a little bad for taking up internet bandwidth asking, but
it is a change in behaviour.

Regardless, changing mmap for a race in the page cache is just backwards, and
with the current 3.7 code, we can still trigger the race with fadvise ->
readpage in the middle of set_blocksize().

Obviously nobody does any of this, otherwise we'd have tons of reports
from those handy WARN_ONs in fs/buffer.c.  So it's definitely hard to be
worried one way or another.

-chris


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 12:02:17PM -0700, Linus Torvalds wrote:
 On Thu, Nov 29, 2012 at 9:19 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  I think I'll apply this for 3.7 (since it's too late to do anything
  fancier), and then for 3.8 I will rip out all the locking entirely,
  because looking at the fs/buffer.c patch I wrote up, it's all totally
  unnecessary.
 
  Adding a ACCESS_ONCE() to the read of the i_blkbits value (when
  creating new buffers) simply makes the whole locking thing pointless.
  Just make the page lock protect the block size, and make it per-page,
  and we're done.
 
 There's a 'block-dev' branch in my git tree, if you guys want to play
 around with it.
 
 It actually reverts fs/block-dev.c back to the 3.6 state (except for
 some whitespace damage that I refused to re-introduce), so that part
 of the changes should be pretty safe and well tested.
 
 The fs/buffer.c changes, of course, are new. It's largely the same
 patch I already sent out, with a small helper function to simplify it,
 and to keep the whole ACCESS_ONCE() thing in just a single place.

The fs/buffer.c part makes sense during a quick read.  But
fs/direct-io.c plays with i_blkbits too.  The semaphore was fixing real
bugs there.

-chris


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 12:26:06PM -0700, Linus Torvalds wrote:
 On Thu, Nov 29, 2012 at 11:15 AM, Chris Mason chris.ma...@fusionio.com 
 wrote:
 
  The fs/buffer.c part makes sense during a quick read.  But
  fs/direct-io.c plays with i_blkbits too.  The semaphore was fixing real
  bugs there.
 
 Ugh. I _hate_ direct-IO. What a mess. And yeah, it seems to be
 incestuously playing games that should be in fs/buffer.c. I thought it
 was doing the sane thing with the page cache.
 
 (I now realize that Mikulas was talking about this mess, while I
 thought he was talking about the AIO code which is largely sane).

It was all a trick to get you to say the AIO code was sane.

It looks like we could use the private copy of i_blkbits that DIO is
already recording.

blkdev_get_blocks (called during DIO) is also checking i_blkbits, but I
really don't get why that isn't byte based instead.  DIO is already
doing the shift & mask game.

I think only clean_blockdev_aliases is intentionally using the inode's
i_blkbits, but again that shouldn't be changing for filesystems so it
seems safe to use the DIO copy.
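
To make the "private copy" idea concrete, here is a small userspace
sketch.  The names (struct toy_inode, struct toy_dio and the toy_dio_*
helpers) are hypothetical and only loosely modelled on what
fs/direct-io.c does; the one thing it is meant to show is that the
shift and the mask both come from a value read once at setup, so a
later change to the inode's i_blkbits can't make them disagree.

```c
#include <stdint.h>

/* Hypothetical inode: the block size bits may change under us. */
struct toy_inode {
	volatile unsigned int i_blkbits;
};

/* Hypothetical per-request state with its own copy of blkbits. */
struct toy_dio {
	unsigned int blkbits;	/* snapshot taken once in toy_dio_init() */
	uint64_t offset;	/* byte offset of the request */
};

/* Read i_blkbits exactly once and remember it for the whole request. */
void toy_dio_init(struct toy_dio *dio, const struct toy_inode *inode,
		  uint64_t offset)
{
	dio->blkbits = inode->i_blkbits;
	dio->offset = offset;
}

/* The "shift" half: which block does the offset fall in? */
uint64_t toy_dio_block(const struct toy_dio *dio)
{
	return dio->offset >> dio->blkbits;
}

/* The "mask" half: how far into that block are we? */
uint64_t toy_dio_offset_in_block(const struct toy_dio *dio)
{
	return dio->offset & ((UINT64_C(1) << dio->blkbits) - 1);
}
```

With 512 byte blocks, byte offset 5000 is block 9 at offset 392, and
that stays true for this request even if someone flips the inode to
4096 byte blocks afterwards.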

-chris



Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 01:52:22PM -0700, Linus Torvalds wrote:
 On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason chris.ma...@fusionio.com 
 wrote:
 
  It was all a trick to get you to say the AIO code was sane.
 
 It's only sane compared to the DIO code.
 
 That said, I hate AIO much less these days that we've largely merged
 the code with the regular IO. It's still a horrible interface, but at
 least it is no longer a really disgusting separate implementation in
 the kernel of that horrible interface.
 
 So yeah, I guess AIO really is pretty sane these days.
 
  It looks like we could use the private copy of i_blkbits that DIO is
  already recording.
 
 Yes. But that didn't fix the blkdev_get_blocks() mess you pointed out.
 
 I've pushed out two more commits to the 'block-dev' branch at
 
   git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux block-dev
 
 in case anybody wants to take a look.
 
 It is - as usual - entirely untested. It compiles, and I *think* that
 blkdev_get_blocks() makes a whole lot more sense this way - as you
 said, it should be byte-based (although it actually does the block
 number conversion because I worried about overflow - probably
 unnecessarily).
 
 Comments?

Your blkdev_get_blocks emails were great reading while at the dentist,
thanks for helping me pass the time.

Just reading the new blkdev_get_blocks, it looks like we're mixing
shifts.  In direct-io.c map_bh->b_size is how much we'd like to map, and
it has no relation at all to the actual block size of the device.  The
interface is abusing b_size to ask for as large a mapping as possible.

Most importantly, it has no relation to the fs_startblk that we pass in,
which is based on inode->i_blkbits.

So your new check in blkdev_get_blocks:

   if (iblock >= end_block) {

Is wrong because iblock and end_block are based on different sizes.  I
think we have to do the eof checks inside fs/direct-io.c or change the
get_blocks interface completely.
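
A toy illustration of why mixing the two units breaks the check.  This
is not the code from the block-dev branch; toy_past_eof_mixed() and
toy_past_eof_bytes() are hypothetical helpers, and the second one
ignores the overflow that a real version would have to worry about.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Mixed units: iblock counts blocks of one size, but end_block is
 * computed with a different shift (other_bits), so the comparison
 * is between numbers that don't mean the same thing.
 */
bool toy_past_eof_mixed(uint64_t iblock, uint64_t dev_bytes,
			unsigned int other_bits)
{
	uint64_t end_block = dev_bytes >> other_bits;

	return iblock >= end_block;
}

/*
 * Consistent units: convert the block number to bytes with the same
 * blkbits that produced it, then compare against the size in bytes.
 */
bool toy_past_eof_bytes(uint64_t iblock, unsigned int blkbits,
			uint64_t dev_bytes)
{
	return (iblock << blkbits) >= dev_bytes;
}
```

For a 1MB device and 4K blocks, block 100 is well inside the device.
If end_block is computed from a 64K mapping request it comes out as
16, and the mixed check calls block 100 past EOF.  Going the other
way, block 300 is past the end, but an end_block computed with 512
byte units is 2048 and the mixed check lets it through.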

I really thought fs/direct-io.c was already doing eof checks, but I'm
reading harder.

-chris



Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 03:36:38PM -0700, Linus Torvalds wrote:
 On Thu, Nov 29, 2012 at 2:16 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  But you're right. The direct-IO code really *is* violating that, and
  knows that get_block() ends up being defined in i_blkbits regardless
  of b_size.
 
 It turns out fs/ioctl.c does the same - it fills in the buffer head
 with some random bh->b_size too. I think it's not even a power of two
 in that case.
 
 And I guess it's understandable - they don't actually *use* the
 buffer, they just want the offset. So the b_size field really is just
 random crap to the users of the get_block interfaces, since they've
 never cared before.
 
 Ugh, this was definitely a dark and disgusting underbelly of the VFS
 layer. We've not had to really touch it for a *looong* time..

I searched through filemap.c for the magic i_size check that would let
us get away with ignoring i_blkbits in get_blocks, but it's just not
there.  The whole fallback-to-buffered scheme seems to rely on
get_blocks checking for i_size.  I really hope I'm just missing
something.

If we're going to change this, I'd vote for something non-bh based.  I
didn't check every single FS, but I don't think direct-IO really wants
or needs buffer heads at all.

One less wart in direct-io.c would really be nice, but I'm assuming
it'll take us at least one full release to hammer out a shiny new
get_blocks.  Passing i_blkbits would be more mechanical, since all the
filesystems would just ignore it.

-chris


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 07:13:02PM -0700, Linus Torvalds wrote:
 On Thu, Nov 29, 2012 at 5:16 PM, Chris Mason chris.ma...@fusionio.com wrote:
 
  I searched through filemap.c for the magic i_size check that would let
  us get away with ignoring i_blkbits in get_blocks, but it's just not
  there.  The whole fallback-to-buffered scheme seems to rely on
  get_blocks checking for i_size.  I really hope I'm just missing
  something.
 
 So generic_write_checks() limits the size to i_size at for writes (and
 for isblk).

Great, that's what I was missing.

 
 Sure, then it will do the buffered part after that, but that should
 all be fine anyway, since by then we use the normal page cache.
 
 For reads, generic_file_aio_read() will check pos < size, but doesn't
 seem to actually limit the size of the iovec.

I couldn't explain that either.

 
 I'm not sure why it doesn't just do iov_shorten().
 
 Anyway, having looked at actually passing in the block size to
 get_block(), I can say that is a horrible idea. There are tons of
 get_block functions (for various filesystems), and *none* of them
 really want the block size, because they tend to work on block
 indexes. And if they do want the block size, they'll just get it from
 the inode or sb, since they are filesystems and it's all stable.
 
 So the *only* of the places that would want the block size is
 fs/block_dev.c. And the callers really already seem to do the i_size
 check, although they sometimes do it badly. And since there are fewer
 callers than there are get_block() implementations, I think we should
 just fix the callers and be done with it.
 
 Those generic_file_aio_read/write() functions in fs/direct-io.c really
 just seem to be badly written. The fact that they may depend on the
 i_size check in get_blocks() is sad, but I think we should fix it and
 just remove the check for block devices. That's going to simplify so
 much..
 
 I updated the 'block-dev' branch to have that simpler fs/block_dev.c
 model instead. I'll look at the iovec shortening later. It's a
 non-fast-forward thing, look out!
 
 (I actually think we should just add the max-offset check to
 rw_copy_check_uvector(). That one already does the MAX_RW_COUNT thing,
 and we could make it do a max_offset check as well).

This is definitely easier, and I can't see any reason not to do it.  I'm
used to get_block being expensive and so it didn't even cross my mind.

We can benchmark things just to make sure.
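
For reference, the caller-side check amounts to something like the
sketch below.  toy_clamp_to_size() is a made-up helper, not
generic_write_checks() or iov_shorten(); it only shows the arithmetic
of trimming a (pos, count) request so it never reaches past the end of
the device.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Trim a request of count bytes starting at pos so that it stays
 * inside an object of size bytes.  Returns the number of bytes that
 * may still be transferred, or 0 when pos is at or past the end.
 */
size_t toy_clamp_to_size(uint64_t pos, size_t count, uint64_t size)
{
	uint64_t left;

	if (pos >= size)
		return 0;

	left = size - pos;
	return (uint64_t)count < left ? count : (size_t)left;
}
```

Once every caller does this before calling get_block(), the get_block
side no longer needs to know the size at all.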

-chris



Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-30 Thread Chris Mason
On Thu, Nov 29, 2012 at 07:49:10PM -0700, Dave Chinner wrote:
 On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote:
  On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason chris.ma...@fusionio.com 
  wrote:
  
   Just reading the new blkdev_get_blocks, it looks like we're mixing
   shifts.  In direct-io.c map_bh->b_size is how much we'd like to map, and
   it has no relation at all to the actual block size of the device.  The
   interface is abusing b_size to ask for as large a mapping as possible.
  
  Ugh. That's a big violation of how buffer-heads are supposed to work:
  the block number is very much defined to be in multiples of b_size
  (see for example submit_bh() that turns it into a sector number).
  
  But you're right. The direct-IO code really *is* violating that, and
  knows that get_block() ends up being defined in i_blkbits regardless
  of b_size.
 
 Same with mpage_readpages(), so it's not just direct IO that has
 this problem

I guess the good news is that block devices don't have readpages.  The
bad news would be that we can't put readpages in without much bigger
changes.

-chris


[GIT PULL] Btrfs fixes

2013-02-06 Thread Chris Mason
[ sorry, my lbdb seems to really like linux-ker...@vger.kerrnel.org,
fixed for real this time ]

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

We've got corner cases for updating i_size that ceph was hitting, error
handling for quotas when we run out of space, a very subtle snapshot
deletion race, a crash while removing devices, and one deadlock between
subvolume creation and the sb_internal code (thanks lockdep).

Josef Bacik (3) commits (+12/-4):
Btrfs: do not merge logged extents if we've removed them from the tree (+2/-1)
Btrfs: fix possible stale data exposure (+1/-1)
Btrfs: fix missing i_size update (+9/-2)

Miao Xie (2) commits (+21/-9):
Btrfs: fix missing release of the space/qgroup reservation in start_transaction() (+19/-8)
Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() (+2/-1)

Jan Schmidt (1) commits (+10/-12):
Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata

Liu Bo (1) commits (+38/-9):
Btrfs: fix race between snapshot deletion and getting inode

Chris Mason (1) commits (+4/-1):
Btrfs: move d_instantiate outside the transaction during mksubvol

Eric Sandeen (1) commits (+2/-1):
btrfs: don't try to notify udev about missing devices

Total: (9) commits

 fs/btrfs/extent-tree.c  | 22 ++
 fs/btrfs/extent_map.c   |  3 ++-
 fs/btrfs/file.c | 25 -
 fs/btrfs/ioctl.c|  5 -
 fs/btrfs/ordered-data.c | 13 ++---
 fs/btrfs/scrub.c| 25 -
 fs/btrfs/transaction.c  | 27 +++
 fs/btrfs/volumes.c  |  3 ++-
 8 files changed, 87 insertions(+), 36 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Oops when mounting btrfs partition

2013-02-02 Thread Chris Mason
Hi Arnd,

First things first, nospace_cache is a safe thing to use.  It is slow
because it's finding free extents, but it's just a cache and always safe
to discard.  With your other errors, I'd just mount it readonly
and then you won't waste time on atime updates.

I'll take a look at the BUG you got during log recovery.  We've fixed a
few of those during the 3.8 rc cycle.

 Feb  1 22:57:37 localhost kernel: [ 8561.599482] Kernel BUG at 
 a01fdcf7 [verbose debug info unavailable]

 Jan 14 19:18:42 localhost kernel: [1060055.746373] btrfs csum failed ino 
 15619835 off 454656 csum 2755731641 private 864823192
 Jan 14 19:18:42 localhost kernel: [1060055.746381] btrfs: bdev /dev/sdb1 
 errs: wr 0, rd 0, flush 0, corrupt 17, gen 0
 ...
 Jan 21 16:35:40 localhost kernel: [1655047.701147] parent transid verify 
 failed on 17006399488 wanted 54700 found 54764

These aren't good.  With a few exceptions for really tight races in fsx
use cases, csum errors are bad data from the disk.  The "parent transid
verify failed" message shows we wanted to find a metadata block from
generation 54700 but found 54764 instead:

54700 = 0xD5AC
54764 = 0xD5EC

This same bad block comes up a few different times.

 Jan 21 16:35:40 localhost kernel: [1655047.752692] btrfs read error 
 corrected: ino 1 off 17006399488 (dev /dev/sdb1 sector 64689288)

This shows we pulled from the second copy of this block and got the
right answer, and then wrote the right answer to the duplicate.
Inode 1 means it was metadata.

But for some reason it still aborted the transaction.  It could have been
an EIO on the correction, but the auto correction code in 3.5 did work
well.

I think your plan to pull the data off and reformat is a good one.  I'd
also look hard at your ram since drives don't usually send back single bit
errors.
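
The single bit claim is easy to check by hand: 0xD5AC ^ 0xD5EC is
0x40, which has exactly one bit set.  A small sketch of the same test
in C, with hypothetical helper names (toy_popcount64 and
toy_is_single_bit_flip are not btrfs functions):

```c
#include <stdbool.h>
#include <stdint.h>

/* Count the set bits in v, clearing the lowest set bit each round. */
unsigned int toy_popcount64(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1;
		n++;
	}
	return n;
}

/* True when wanted and found differ in exactly one bit position. */
bool toy_is_single_bit_flip(uint64_t wanted, uint64_t found)
{
	return toy_popcount64(wanted ^ found) == 1;
}
```

Calling toy_is_single_bit_flip(54700, 54764) returns true, which is
the pattern that usually points at memory rather than the drive.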

-chris



[GIT PULL] Btrfs

2013-03-02 Thread Chris Mason
 mapping if we fail to add chunk (+12/-2)
Btrfs: relax the block group size limit for bitmaps (+9/-3)
Btrfs: cleanup orphan reservation if truncate fails (+2/-0)
Btrfs: make sure NODATACOW also gets NODATASUM set (+2/-1)
Btrfs: don't re-enter when allocating a chunk (+9/-0)
Btrfs: remove unused extent io tree ops V2 (+11/-27)
Btrfs: fix chunk allocation error handling (+22/-10)

Liu Bo (14) commits (+796/-109):
Btrfs: kill unused argument of btrfs_pin_extent_for_log_replay (+3/-6)
Btrfs: fix cleaner thread not working with inode cache option (+8/-1)
Btrfs: use token to avoid times mapping extent buffer (+35/-28)
Btrfs: extend the checksum item as much as possible (+46/-21)
Btrfs: fix NULL pointer after aborting a transaction (+7/-1)
Btrfs: use reserved space for creating a snapshot (+2/-0)
Btrfs: kill unused argument of update_block_group (+5/-7)
Btrfs: kill unused arguments of cache_block_group (+5/-8)
Btrfs: do not change inode flags in rename (+0/-25)
Btrfs: record first logical byte in memory (+20/-1)
Btrfs: fix memory leak of log roots (+9/-2)
Btrfs: remove deprecated comments (+0/-6)
Btrfs: snapshot-aware defrag (+654/-0)
Btrfs: save us a read_lock (+2/-3)

Eric Sandeen (11) commits (+58/-108):
btrfs: ensure we don't overrun devices_info[] in __btrfs_alloc_chunk (+5/-1)
btrfs: remove unused item in btrfs_insert_delayed_item() (+0/-2)
btrfs: remove unused fs_info from btrfs_decode_error() (+4/-5)
btrfs: remove cache only arguments from defrag path (+32/-82)
btrfs: remove unnecessary DEFINE_WAIT() declarations (+0/-2)
btrfs: annotate intentional switch case fallthroughs (+2/-0)
btrfs: add missing break in btrfs_print_leaf() (+1/-0)
btrfs: remove unused fd in btrfs_ioctl_send() (+0/-3)
btrfs: handle null fs_info in btrfs_panic() (+7/-4)
btrfs: fix varargs in __btrfs_std_error (+7/-7)
btrfs: list_entry can't return NULL (+0/-2)

Chris Mason (7) commits (+561/-30):
Btrfs: reduce CPU contention while waiting for delayed extent operations (+70/-5)
Btrfs: remove conflicting check for minimum number of devices in raid56 (+0/-8)
Btrfs: reduce lock contention on extent buffer locks (+16/-0)
Btrfs: add a plugging callback to raid56 writes (+124/-4)
Btrfs: fix cluster alignment for mount -o ssd (+6/-1)
Btrfs: fix max chunk size on raid5/6 (+21/-4)
Btrfs: Add a stripe cache to raid56 (+324/-8)

Wang Shilong (6) commits (+78/-68):
Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree (+0/-3)
Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic (+38/-44)
Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails (+9/-3)
Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails (+6/-5)
Btrfs: fix missing deleted items in btrfs_clean_quota_tree (+21/-13)
Btrfs: fix missing check before disabling quota (+4/-0)

David Sterba (6) commits (+131/-42):
btrfs: access superblock via pagecache in scan_one_device (+64/-6)
btrfs: put some enospc messages under enospc_debug (+15/-11)
btrfs: try harder to allocate raid56 stripe cache (+26/-7)
btrfs: use only inline_pages from extent buffer (+7/-17)
btrfs: remove a printk from scan_one_device (+0/-1)
btrfs: add cancellation points to defrag (+19/-0)

Zach Brown (2) commits (+9/-12):
btrfs: limit fallocate extent reservation to 256MB (+4/-3)
btrfs: define BTRFS_MAGIC as a u64 value (+5/-9)

David Woodhouse (2) commits (+2294/-113):
Btrfs: add rw argument to merge_bio_hook() (+11/-11)
Btrfs: RAID5 and RAID6 (+2283/-102)

Ilya Dryomov (2) commits (+6/-6):
Btrfs: allow for selecting only completely empty chunks (+1/-1)
Btrfs: eliminate a use-after-free in btrfs_balance() (+5/-5)

jeff.liu (2) commits (+67/-0):
Btrfs: Add a new ioctl to get the label of a mounted file system (+23/-0)
Btrfs: set/change the label of a mounted file system (+44/-0)

Filipe Brandenburger (1) commits (+19/-11):
Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h

Mark Fasheh (1) commits (+54/-4):
btrfs: add no file data flag to btrfs send ioctl

Alexandre Oliva (1) commits (+3/-3):
clear chunk_alloc flag on retryable failure

Thomas Gleixner (1) commits (+1/-0):
btrfs: Init io_lock after cloning btrfs device struct

Paul Gortmaker (1) commits (+1/-4):
btrfs: fixup/remove module.h usage as required

Tomasz Torcz (1) commits (+1/-0):
Btrfs: select XOR_BLOCKS in Kconfig

Jan Schmidt (1) commits (+1/-4):
Btrfs: fix backref walking race with tree deletions

Qu Wenruo (1) commits (+25/-38):
btrfs: cleanup for open-coded alignment

Kusanagi Kouichi (1) commits (+1/-1):
Btrfs: Check CAP_DAC_READ_SEARCH for BTRFS_IOC_INO_PATHS

Arne Jansen (1) commits (+1/-1):
Btrfs: fix crash in log replay with qgroups enabled

Total: (118) commits

 fs/btrfs/Kconfig

Re: [GIT PULL] Btrfs fixes

2013-01-24 Thread Chris Mason
On Tue, Jan 22, 2013 at 05:48:33PM -0700, Chris Mason wrote:
 Hi Linus,
 
 My for-linus branch has our batch of btrfs fixes:
 
 git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus
 
 We've been hammering away at a crc corruption as well, which I was
 really hoping to get into this pull.  It isn't nailed down yet, but we
 were finally able to get a solid way to reproduce.  The only good
 news is it isn't a recent regression.

Update on this, we've tracked down the crc errors and are doing final
checks on the patches.  Linus are you planning on taking this pull?  If
not I can just fold the new stuff into a bigger request.

-chris


[GIT PULL] Btrfs fixes (v2)

2013-01-24 Thread Chris Mason
Hi Linus,

My for-linus branch has our batch of btrfs fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

It turns out that we had two crc bugs when running fsx-linux in a
loop.  Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
all down.  Miao also has a new OOM fix in this v2 pull as well.

Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
and resuming a running balance across drives.

Josef's orphan truncate patch fixes an obscure corruption we'd see
during xfstests.

Arne's patches address problems with subvolume quotas.  If the user
destroys quota groups incorrectly the FS will refuse to mount.

The rest are smaller fixes and plugs for memory leaks.

Miao Xie (8) commits (+76/-24):
Btrfs: fix missing write access release in btrfs_ioctl_resize() (+1/-0)
Btrfs: do not delete a subvolume which is in a R/O subvolume (+5/-5)
Btrfs: Add ACCESS_ONCE() to transaction-abort accesses (+3/-2)
Btrfs: fix wrong max device number for single profile (+1/-1)
Btrfs: fix repeated delalloc work allocation (+41/-14)
Btrfs: fix missed transaction-aborted check (+16/-0)
Btrfs: fix resize a readonly device (+4/-2)
Btrfs: disable qgroup id 0 (+5/-0)

Ilya Dryomov (6) commits (+94/-32):
Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag (+9/-8)
Btrfs: fix mutually exclusive op is running error code (+4/-4)
Btrfs: fix a regression in balance usage filter (+8/-1)
Btrfs: bring back balance pause/resume logic (+71/-17)
Btrfs: fix unlock order in btrfs_ioctl_rm_dev (+1/-1)
Btrfs: fix unlock order in btrfs_ioctl_resize (+1/-1)

Liu Bo (5) commits (+23/-7):
Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents (+14/-6)
Btrfs: use right range to find checksum for compressed extents (+5/-0)
Btrfs: let allocation start from the right raid type (+1/-1)
Btrfs: reset path lock state to zero (+2/-0)
Btrfs: fix off-by-one in lseek (+1/-0)

Josef Bacik (5) commits (+69/-29):
Btrfs: do not allow logged extents to be merged or removed (+16/-3)
Btrfs: add orphan before truncating pagecache (+38/-15)
Btrfs: set flushing if we're limited flushing (+1/-1)
Btrfs: put csums on the right ordered extent (+2/-2)
Btrfs: fix panic when recovering tree log (+12/-8)

Arne Jansen (2) commits (+19/-1):
Btrfs: prevent qgroup destroy when there are still relations (+12/-1)
Btrfs: ignore orphan qgroup relations (+7/-0)

Zach Brown (1) commits (+1/-0):
btrfs: fix btrfs_cont_expand() freeing IS_ERR em

Lukas Czerner (1) commits (+1/-1):
btrfs: get the device in write mode when deleting it

Eric Sandeen (1) commits (+14/-3):
btrfs: update timestamps on truncate()

Tsutomu Itoh (1) commits (+3/-1):
Btrfs: fix memory leak in name_cache_insert()

Total: (30) commits (+300/-98)

 fs/btrfs/extent-tree.c      |   6 +-
 fs/btrfs/extent_map.c       |  13 -
 fs/btrfs/extent_map.h       |   1 +
 fs/btrfs/file-item.c        |   4 +-
 fs/btrfs/file.c             |  10 +++-
 fs/btrfs/free-space-cache.c |  20 ---
 fs/btrfs/inode.c            | 137 +---
 fs/btrfs/ioctl.c            | 129 ++---
 fs/btrfs/qgroup.c           |  20 ++-
 fs/btrfs/send.c             |   4 +-
 fs/btrfs/super.c            |   2 +-
 fs/btrfs/transaction.c      |  19 +-
 fs/btrfs/tree-log.c         |  10 +++-
 fs/btrfs/volumes.c          |  23 ++--
 14 files changed, 300 insertions(+), 98 deletions(-)


[GIT PULL] Btrfs fixes

2013-01-22 Thread Chris Mason
Hi Linus,

My for-linus branch has our batch of btrfs fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

We've been hammering away at a crc corruption as well, which I was
really hoping to get into this pull.  It isn't nailed down yet, but we
were finally able to get a solid way to reproduce.  The only good
news is it isn't a recent regression.

The most important batch of fixes in here come from Ilya.  They address
a regression Liu Bo found in the balance ioctls for pausing and resuming
a running balance across drives.

Josef's orphan truncate patch fixes an obscure corruption we'd see
during xfstests.

Arne's patches address problems with subvolume quotas.  If the user
destroys quota groups incorrectly the FS will refuse to mount.

The rest are smaller fixes and plugs for memory leaks.

Ilya Dryomov (6) commits (+94/-32):
Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag (+9/-8)
Btrfs: fix mutually exclusive op is running error code (+4/-4)
Btrfs: fix a regression in balance usage filter (+8/-1)
Btrfs: bring back balance pause/resume logic (+71/-17)
Btrfs: fix unlock order in btrfs_ioctl_rm_dev (+1/-1)
Btrfs: fix unlock order in btrfs_ioctl_resize (+1/-1)

Liu Bo (4) commits (+18/-7):
Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents (+14/-6)
Btrfs: let allocation start from the right raid type (+1/-1)
Btrfs: reset path lock state to zero (+2/-0)
Btrfs: fix off-by-one in lseek (+1/-0)

Miao Xie (4) commits (+15/-7):
Btrfs: fix missing write access release in btrfs_ioctl_resize() (+1/-0)
Btrfs: do not delete a subvolume which is in a R/O subvolume (+5/-5)
Btrfs: fix resize a readonly device (+4/-2)
Btrfs: disable qgroup id 0 (+5/-0)

Arne Jansen (2) commits (+19/-1):
Btrfs: prevent qgroup destroy when there are still relations (+12/-1)
Btrfs: ignore orphan qgroup relations (+7/-0)

Josef Bacik (2) commits (+39/-16):
Btrfs: add orphan before truncating pagecache (+38/-15)
Btrfs: set flushing if we're limited flushing (+1/-1)

Zach Brown (1) commits (+1/-0):
btrfs: fix btrfs_cont_expand() freeing IS_ERR em

Lukas Czerner (1) commits (+1/-1):
btrfs: get the device in write mode when deleting it

Eric Sandeen (1) commits (+14/-3):
btrfs: update timestamps on truncate()

Tsutomu Itoh (1) commits (+3/-1):
Btrfs: fix memory leak in name_cache_insert()

Total: (22) commits

 fs/btrfs/extent-tree.c |   6 ++-
 fs/btrfs/file.c        |  10 ++--
 fs/btrfs/inode.c       |  82 +++
 fs/btrfs/ioctl.c       | 129 +++--
 fs/btrfs/qgroup.c      |  20 +++-
 fs/btrfs/send.c        |   4 +-
 fs/btrfs/volumes.c     |  21 ++--
 7 files changed, 204 insertions(+), 68 deletions(-)


Re: [GIT PULL] Btrfs fixes

2013-01-22 Thread Chris Mason
On Tue, Jan 22, 2013 at 06:28:21PM -0700, Liu Bo wrote:
 On Tue, Jan 22, 2013 at 07:48:33PM -0500, Chris Mason wrote:
  Hi Linus,
  
  My for-linus branch has our batch of btrfs fixes:
  
  git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
  for-linus
  
  We've been hammering away at a crc corruption as well, which I was
  really hoping to get into this pull.  It isn't nailed down yet, but we
  were finally able to get a solid way to reproduce.  The only good
  news is it isn't a recent regression.
  
  The most important batch of fixes in here come from Ilya.  They address
  a regression Liu Bo found in the balance ioctls for pausing and resuming
  a running balance across drives.
  
  Josef's orphan truncate patch fixes an obscure corruption we'd see
  during xfstests.
  
  Arne's patches address problems with subvolume quotas.  If the user
  destroys quota groups incorrectly the FS will refuse to mount.
  
  The rest are smaller fixes and plugs for memory leaks.
 
 Hi,
 
 Any chance of getting these in this round?  I think they're good fixes:
 a memory leak fix and a warning fix, both found with xfstests.

I'll get these tested in the next pull.

-chris


dma engine bugs

2013-01-17 Thread Chris Mason
Hi Dan,

I'm doing some benchmarking on MD raid5/6 on 4 fusionio cards in an HP
DL380p.  I'm doing 128K random writes on a 4 drive raid6 with a 64K
stripe size per drive.  I have 4 fio processes sending down the aio/dio,
and a high queue depth (8192).
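
For reference, the geometry above works out as follows.  This is only a
minimal sketch: it assumes the "64K stripe size per drive" is the MD chunk
size and that RAID6 spends two drives' worth of every stripe on parity
(P and Q).  The function name is made up for illustration and is not part
of MD.

```c
#include <stddef.h>

/*
 * Data carried by one full stripe, in KiB.
 *   drives:        total member devices in the array
 *   parity_drives: 2 for RAID6 (P and Q), 1 for RAID5
 *   chunk_kib:     per-drive chunk size in KiB
 * Returns 0 if the layout leaves no room for data.
 */
static size_t full_stripe_data_kib(size_t drives, size_t parity_drives,
                                   size_t chunk_kib)
{
    if (drives <= parity_drives)
        return 0;
    return (drives - parity_drives) * chunk_kib;
}
```

With 4 drives, 2 of them parity, and a 64K chunk this gives 128K, so each
128K write in the benchmark is the size of one full stripe of data.  If the
writes are also stripe-aligned, they should not need a read-modify-write
cycle.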

When I bump up the MD raid stripe cache size, I'm running into
soft lockups in the async memcopy code:

[34336.959645] BUG: soft lockup - CPU#6 stuck for 22s! [fio:38296]
[34336.959648] BUG: soft lockup - CPU#9 stuck for 22s! [md0_raid6:5172]
[34336.959704] Modules linked in: raid456 async_raid6_recov async_pq
async_xor async_memcpy async_tx iomemory_vsl(O) binfmt_misc
cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq
mperf loop dm_mod coretemp kvm_intel kvm ghash_clmulni_intel sr_mod
cdrom aesni_intel ablk_helper cryptd lrw aes_x86_64 ata_generic xts
gf128mul ioatdma sb_edac gpio_ich ata_piix hid_generic dca edac_core
lpc_ich microcode serio_raw mfd_core hpilo hpwdt button container tg3 sg
acpi_power_meter usbhid mgag200 ttm drm_kms_helper drm i2c_algo_bit
sysimgblt sysfillrect syscopyarea uhci_hcd crc32c_intel ehci_hcd hpsa
processor thermal_sys scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc
scsi_dh_alua scsi_dh btrfs raid6_pq zlib_deflate xor libcrc32c asix
usbnet usbcore usb_common

[34336.959709] CPU 9
[34336.959709] Pid: 5172, comm: md0_raid6 Tainted: GW  O 
3.7.1-1-default #2 HP ProLiant DL380p Gen8
[34336.959720] RIP: 0010:[815381ad]  [815381ad] 
_raw_spin_unlock_irqrestore+0xd/0x20
[34336.959721] RSP: 0018:8807af6db858  EFLAGS: 0292
[34336.959722] RAX: 1000 RBX: 8810176fd000 RCX: 0292
[34336.959723] RDX: 1000 RSI: 0292 RDI: 0292
[34336.959724] RBP: 8807af6db858 R08: 881017e40440 R09: 880f554fabc0
[34336.959725] R10: 2000 R11:  R12: 881017e40460
[34336.959726] R13: 0040 R14: 0001 R15: 881017e40480
[34336.959728] FS:  () GS:88103f66() 
knlGS:
[34336.959729] CS:  0010 DS:  ES:  CR0: 80050033
[34336.959730] CR2: 035cf458 CR3: 01a0b000 CR4: 000407e0
[34336.959731] DR0:  DR1:  DR2: 
[34336.959733] DR3:  DR6: 0ff0 DR7: 0400
[34336.959734] Process md0_raid6 (pid: 5172, threadinfo 8807af6da000, task 
88077d7725c0)
[34336.959735] Stack:
[34336.959738]  8807af6db898 8114f287 8807af6db8b8 

[34336.959740]   005bd84a 881015f2fa18 
881017632a38
[34336.959742]  8807af6db8e8 a057adf4  
881015f2fa18
[34336.959743] Call Trace:
[34336.959750]  [8114f287] dma_pool_alloc+0x67/0x270
[34336.959758]  [a057adf4] ioat2_alloc_ring_ent+0x34/0xc0 [ioatdma]
[34336.959761]  [a057afc5] reshape_ring+0x145/0x370 [ioatdma]
[34336.959764]  [8153841d] ? _raw_spin_lock_bh+0x2d/0x40
[34336.959767]  [a057b2d9] ioat2_check_space_lock+0xe9/0x240 [ioatdma]
[34336.959768]  [81538381] ? _raw_spin_unlock_bh+0x11/0x20
[34336.959771]  [a057b48c] ioat2_dma_prep_memcpy_lock+0x5c/0x280 
[ioatdma]
[34336.959773]  [a03102df] ? do_async_gen_syndrome+0x29f/0x3d0 
[async_pq]
[34336.959775]  [81538381] ? _raw_spin_unlock_bh+0x11/0x20
[34336.959790]  [a057ac22] ? ioat2_tx_submit_unlock+0x92/0x100 
[ioatdma]
[34336.959792]  [a02f8207] async_memcpy+0x207/0x1000 [async_memcpy]
[34336.959795]  [a031f67d] async_copy_data+0x9d/0x150 [raid456]
[34336.959797]  [a03206ba] __raid_run_ops+0x4ca/0x990 [raid456]
[34336.959802]  [811b7c42] ? __aio_put_req+0x102/0x150
[34336.959805]  [a031c7ae] ?  handle_stripe_dirtying+0x30e/0x440 
[raid456]
[34336.959807]  [a03217a8] handle_stripe+0x528/0x10b0 [raid456]
[34336.959810]  [a03226f0] handle_active_stripes+0x1e0/0x270 [raid456]
[34336.959814]  [81293bb3] ? blk_flush_plug_list+0xb3/0x220
[34336.959817]  [a03229a0] raid5d+0x220/0x3c0 [raid456]
[34336.959822]  [81413b0e] md_thread+0x12e/0x160
[34336.959828]  [8106bfa0] ? wake_up_bit+0x40/0x40
[34336.959829]  [814139e0] ? md_rdev_init+0x110/0x110
[34336.959831]  [8106b806] kthread+0xc6/0xd0
[34336.959834]  [8106b740] ?  kthread_freezable_should_stop+0x70/0x70
[34336.959849]  [8154047c] ret_from_fork+0x7c/0xb0
[34336.959851]  [8106b740] ?  kthread_freezable_should_stop+0x70/0x70

Since I'm running on fast cards, I assumed MD was just hammering on this
path so much that MD needed a cond_resched().  But now that I've
sprinkled conditional pixie dust everywhere I'm still seeing exactly the
same trace, and the lockups keep flowing forever, even after I've
stopped all new IO.

Looking at ioat2_check_space_lock(), it is looping when the ring
allocation fails.  We're trying to 

Re: dma engine bugs

2013-01-17 Thread Chris Mason
On Thu, Jan 17, 2013 at 07:53:18PM -0700, Dan Williams wrote:
 On Thu, Jan 17, 2013 at 6:38 AM, Chris Mason chris.ma...@fusionio.com wrote:
  [ Sorry resend with the right address for Dan ]
 
  Hi Dan,
 
  I'm doing some benchmarking on MD raid5/6 on 4 fusionio cards in an HP
  DL380p.  I'm doing 128K random writes on a 4 drive raid6 with a 64K
  stripe size per drive.  I have 4 fio processes sending down the aio/dio,
  and a high queue depth (8192).
 
  When I bump up the MD raid stripe cache size, I'm running into
  soft lockups in the async memcopy code:
 
  [34336.959645] BUG: soft lockup - CPU#6 stuck for 22s! [fio:38296]
  [34336.959648] BUG: soft lockup - CPU#9 stuck for 22s! [md0_raid6:5172]
  [34336.959704] Modules linked in: raid456 async_raid6_recov async_pq
  async_xor async_memcpy async_tx iomemory_vsl(O) binfmt_misc
  cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq
  mperf loop dm_mod coretemp kvm_intel kvm ghash_clmulni_intel sr_mod
  cdrom aesni_intel ablk_helper cryptd lrw aes_x86_64 ata_generic xts
  gf128mul ioatdma sb_edac gpio_ich ata_piix hid_generic dca edac_core
  lpc_ich microcode serio_raw mfd_core hpilo hpwdt button container tg3 sg
  acpi_power_meter usbhid mgag200 ttm drm_kms_helper drm i2c_algo_bit
  sysimgblt sysfillrect syscopyarea uhci_hcd crc32c_intel ehci_hcd hpsa
  processor thermal_sys scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc
  scsi_dh_alua scsi_dh btrfs raid6_pq zlib_deflate xor libcrc32c asix
  usbnet usbcore usb_common
 
  [34336.959709] CPU 9
  [34336.959709] Pid: 5172, comm: md0_raid6 Tainted: GW  O 
  3.7.1-1-default #2 HP ProLiant DL380p Gen8
  [34336.959720] RIP: 0010:[815381ad]  [815381ad] 
  _raw_spin_unlock_irqrestore+0xd/0x20
  [34336.959721] RSP: 0018:8807af6db858  EFLAGS: 0292
  [34336.959722] RAX: 1000 RBX: 8810176fd000 RCX: 
  0292
  [34336.959723] RDX: 1000 RSI: 0292 RDI: 
  0292
  [34336.959724] RBP: 8807af6db858 R08: 881017e40440 R09: 
  880f554fabc0
  [34336.959725] R10: 2000 R11:  R12: 
  881017e40460
  [34336.959726] R13: 0040 R14: 0001 R15: 
  881017e40480
  [34336.959728] FS:  () GS:88103f66() 
  knlGS:
  [34336.959729] CS:  0010 DS:  ES:  CR0: 80050033
  [34336.959730] CR2: 035cf458 CR3: 01a0b000 CR4: 
  000407e0
  [34336.959731] DR0:  DR1:  DR2: 
  
  [34336.959733] DR3:  DR6: 0ff0 DR7: 
  0400
  [34336.959734] Process md0_raid6 (pid: 5172, threadinfo 8807af6da000, 
  task 88077d7725c0)
  [34336.959735] Stack:
  [34336.959738]  8807af6db898 8114f287 8807af6db8b8 
  
  [34336.959740]   005bd84a 881015f2fa18 
  881017632a38
  [34336.959742]  8807af6db8e8 a057adf4  
  881015f2fa18
  [34336.959743] Call Trace:
  [34336.959750]  [8114f287] dma_pool_alloc+0x67/0x270
  [34336.959758]  [a057adf4] ioat2_alloc_ring_ent+0x34/0xc0 
  [ioatdma]
  [34336.959761]  [a057afc5] reshape_ring+0x145/0x370 [ioatdma]
  [34336.959764]  [8153841d] ? _raw_spin_lock_bh+0x2d/0x40
  [34336.959767]  [a057b2d9] ioat2_check_space_lock+0xe9/0x240 
  [ioatdma]
  [34336.959768]  [81538381] ? _raw_spin_unlock_bh+0x11/0x20
  [34336.959771]  [a057b48c] ioat2_dma_prep_memcpy_lock+0x5c/0x280 
  [ioatdma]
  [34336.959773]  [a03102df] ? do_async_gen_syndrome+0x29f/0x3d0 
  [async_pq]
  [34336.959775]  [81538381] ? _raw_spin_unlock_bh+0x11/0x20
  [34336.959790]  [a057ac22] ? ioat2_tx_submit_unlock+0x92/0x100 
  [ioatdma]
  [34336.959792]  [a02f8207] async_memcpy+0x207/0x1000 
  [async_memcpy]
  [34336.959795]  [a031f67d] async_copy_data+0x9d/0x150 [raid456]
  [34336.959797]  [a03206ba] __raid_run_ops+0x4ca/0x990 [raid456]
  [34336.959802]  [811b7c42] ? __aio_put_req+0x102/0x150
  [34336.959805]  [a031c7ae] ?  handle_stripe_dirtying+0x30e/0x440 
  [raid456]
  [34336.959807]  [a03217a8] handle_stripe+0x528/0x10b0 [raid456]
  [34336.959810]  [a03226f0] handle_active_stripes+0x1e0/0x270 
  [raid456]
  [34336.959814]  [81293bb3] ? blk_flush_plug_list+0xb3/0x220
  [34336.959817]  [a03229a0] raid5d+0x220/0x3c0 [raid456]
  [34336.959822]  [81413b0e] md_thread+0x12e/0x160
  [34336.959828]  [8106bfa0] ? wake_up_bit+0x40/0x40
  [34336.959829]  [814139e0] ? md_rdev_init+0x110/0x110
  [34336.959831]  [8106b806] kthread+0xc6/0xd0
  [34336.959834]  [8106b740] ?  
  kthread_freezable_should_stop+0x70/0x70
  [34336.959849]  [8154047c] ret_from_fork+0x7c/0xb0
  [34336.959851]  [8106b740] ?  
  kthread_freezable_should_stop+0x70/0x70

[GIT PULL] Btrfs updates

2013-02-15 Thread Chris Mason
Hi Linus,

If you're doing another RC, please grab these two.  Otherwise I'll send
them off to -stable.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This fixes a long standing problem where the btrfs scan ioctl was racing
with mkfs.btrfs and dropping dirty pages created by mkfs.  It also fixes
a crash during tree log replay with quota enabled.

David Sterba (1) commits (+64/-6):
btrfs: access superblock via pagecache in scan_one_device

Arne Jansen (1) commits (+1/-1):
Btrfs: fix crash in log replay with qgroups enabled

Total: (2) commits (+65/-7)

 fs/btrfs/ctree.c   |  2 +-
 fs/btrfs/volumes.c | 70 +-
 2 files changed, 65 insertions(+), 7 deletions(-)



Re: O_DIRECT question

2007-01-12 Thread Chris Mason
On Fri, Jan 12, 2007 at 10:06:22AM -0800, Linus Torvalds wrote:

  looking at the splice(2) api it seems like it'll be difficult to implement 
  O_DIRECT pread/pwrite from userland using splice... so there'd need to be 
  some help there.
 
 You'd use vmsplice() to put the write buffers into kernel space (user 
 space sees it's a pipe file descriptor, but you should just ignore that: 
 it's really just a kernel buffer). And then splice the resulting kernel 
 buffers to the destination.
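
The call sequence described in the quoted text can be sketched roughly as
below.  This is an illustration only, not code from the thread:
write_via_splice() is a hypothetical helper, error handling is minimal, and
it shows just the vmsplice()/splice() pairing rather than anything with
O_DIRECT semantics.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write len bytes from buf to out_fd by way of a pipe:
 * vmsplice() attaches the user pages to the pipe, then splice()
 * moves the pipe contents to the destination descriptor.
 * The caller must leave buf untouched until this returns.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t write_via_splice(int out_fd, const char *buf, size_t len)
{
    int p[2];
    size_t done = 0;
    ssize_t ret = -1;

    if (pipe(p) < 0)
        return -1;

    while (done < len) {
        struct iovec iov;
        ssize_t queued, left;

        iov.iov_base = (void *)(buf + done);
        iov.iov_len = len - done;

        /* May queue less than requested: the pipe holds a limited
         * number of buffers, so loop until everything is sent. */
        queued = vmsplice(p[1], &iov, 1, 0);
        if (queued <= 0)
            goto out;

        /* Drain exactly what was queued.  SPLICE_F_MOVE is a hint. */
        for (left = queued; left > 0; ) {
            ssize_t moved = splice(p[0], NULL, out_fd, NULL,
                                   (size_t)left, SPLICE_F_MOVE);
            if (moved <= 0)
                goto out;
            left -= moved;
        }
        done += (size_t)queued;
    }
    ret = (ssize_t)done;
out:
    close(p[0]);
    close(p[1]);
    return ret;
}
```

Note that the pipe limits how much can be in flight at once, so a large
buffer is moved in several rounds.  That matches the point made below that
the splice code works in page sized chunks.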

I recently spent some time trying to integrate O_DIRECT locking with
page cache locking.  The basic theory is that instead of using
semaphores for solving O_DIRECT vs buffered races, you put something
into the radix tree (I call it a placeholder) to keep the page cache
users out, and lock any existing pages that are present.

O_DIRECT saves cpu by avoiding copies, but it also saves cpu by doing
fewer radix tree operations during massive IOs.  The cost of radix tree
insertion/deletion on 1MB O_DIRECT ios added ~10% system time on
my tiny little dual core box.  I'm sure it would be much worse if there
was lock contention on a big numa machine, and it grows as the io grows
(SGI does massive O_DIRECT ios).

To help reduce radix churn, I made it possible for a single placeholder
entry to lock down a range in the radix:

http://thread.gmane.org/gmane.linux.file-systems/12263
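
To put a rough number on the radix churn, here is a small sketch.  It
assumes 4K pages (the page size is not stated above) and one radix tree
slot per page; pages_spanned() is a hypothetical helper, not a kernel
function.

```c
#include <stdint.h>

/* Number of page cache slots touched by an IO of len bytes at byte offset off. */
static uint64_t pages_spanned(uint64_t off, uint64_t len, uint64_t page_size)
{
    uint64_t first, last;

    if (len == 0)
        return 0;
    first = off / page_size;
    last = (off + len - 1) / page_size;
    return last - first + 1;
}
```

Under those assumptions a page-aligned 1MB IO spans 256 slots, so a
placeholder per page would mean 256 insertions and 256 deletions per IO,
while a single range placeholder needs one of each.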

It looks to me as though vmsplice is going to have the same issues as my
early patches.  The current splice code can avoid the copy but is still
working in page sized chunks.  Also, splice doesn't support zero copy on
things smaller than page sized chunks.

The compromise my patch makes is to hide placeholders from almost
everything except the DIO code.  It may be worthwhile to turn the
placeholders into an IO marker that can be useful to filemap_fdatawrite
and friends.

It should be able to:

record the userland/kernel pages involved in a given io
map blocks from the FS for making a bio
start the io
wake people up when the io is done

This would allow splice to operate without stealing the userland page
(stealing would still be an option of course), and could get rid of big
chunks of fs/direct-io.c.

-chris


Re: [rfc patch] optimize o_direct on block device

2006-12-01 Thread Chris Mason
On Thu, Nov 30, 2006 at 10:16:53PM -0800, Chen, Kenneth W wrote:
 Zach Brown wrote on Thursday, November 30, 2006 1:45 PM
   At that time, a patch was written for raw device to demonstrate that
   large performance head room is achievable (at ~20% speedup for micro-
   benchmark and ~2% for db transaction processing benchmark) with a  
   tight I/O submission processing loop.
  
  Where exactly does the benefit come from?  icache misses?  atomic  
  ops leading to pipeline flushes?
 
 It benefits from a shorter path length.  It takes much less time to process
 one I/O request, both in the submit and completion path.  I always think in
 terms of how many instructions, or clock ticks does it take to convert user
 request into bio, submit it and in the return path, to process the bio call
 back function and do the appropriate io completion (sync or async).  The
 stock 2.6.19 kernel takes about 5.17 micro-seconds to process one 4K aligned
 DIO (just the submit and completion path, less disk I/O latency).  With the
 patch, the time is reduced to 4.26 us.
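
As a quick check on the figures quoted above, the two timings can be turned
into percentages.  This is plain arithmetic, not a measurement, and the
helper names are made up for illustration.

```c
/* Percent of per-IO CPU time removed when the cost drops from before_us to after_us. */
static double time_saved_pct(double before_us, double after_us)
{
    return (before_us - after_us) / before_us * 100.0;
}

/* Percent more IOs one CPU could process per second for the same drop. */
static double rate_gain_pct(double before_us, double after_us)
{
    return (before_us / after_us - 1.0) * 100.0;
}
```

For 5.17 us and 4.26 us these come to about 17.6% less time per IO and
about 21.4% more IOs per CPU-second, which is in line with the ~20%
micro-benchmark speedup mentioned in the quote.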

I'm not completely against a minimal DIO implementation for the block
device, but right now we get block device QA for free when we test the
rest of the DIO code.  Splitting the code base makes DIO (already a
special case) that much harder to test.

It's obvious there's a lot less code in your patch than fs/direct-io.c,
but I'm still interested in which part of the fs/direct-io.c path is
taking the most time.  I would guess it is allocating the dio?

I don't think we should cut out fs/direct-io.c until we understand
exactly where the hit is coming from.  I know you've done lots of
instrumentation already, but can you share some percentages on the hot
paths?

-chris


[GIT PULL] a btrfs fix

2012-09-15 Thread Chris Mason
Hi Linus,

My for-linus branch has one revert in the new quota code:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

We're building up more fixes at etc for the next merge window, but I'm
keeping them out unless they are bigger regressions or have a huge
impact.

Chris Mason (1):
  Revert Btrfs: fix some error codes in btrfs_qgroup_inherit()

 fs/btrfs/qgroup.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)


[GIT PULL] a large btrfs update

2012-07-26 Thread Chris Mason
 subvol uuids and times (+292/-15)
Btrfs: don't update atime on RO subvolumes (+7/-0)
Btrfs: add btrfs_compare_trees function (+440/-0)
Btrfs: make iref_to_path non static (+9/-5)

Chris Mason (5) commits (+22/-9):
Btrfs: call the ordered free operation without any locks held (+8/-1)
Btrfs: don't wait around for new log writers on an SSD (+2/-1)
Btrfs: add a barrier before a waitqueue_active check (+1/-0)
Btrfs: reduce calls to wake_up on uncontended locks (+9/-5)
Btrfs: uninit variable fixes in send/receive (+2/-2)

Stefan Behrens (3) commits (+9/-4):
Btrfs: avoid I/O repair BUG() from btree_read_extent_buffer_pages() (+1/-1)
Btrfs: remove unwanted printk() for btrfs device I/O stats (+0/-3)
Btrfs: suppress printk() if all device I/O stats are zero (+8/-0)

Li Zefan (3) commits (+159/-122):
Btrfs: kill free_space pointer from inode structure (+10/-19)
Btrfs: zero unused bytes in inode item (+3/-0)
Btrfs: rewrite BTRFS_SETGET_FUNCS (+146/-103)

Ilya Dryomov (2) commits (+3/-3):
Btrfs: do not ignore errors from btrfs_cleanup_fs_roots() when mounting (+2/-2)
Btrfs: do not return EINVAL instead of ENOMEM from open_ctree() (+1/-1)

Dan Carpenter (2) commits (+4/-3):
Btrfs: small naming cleanup in join_transaction() (+2/-2)
Btrfs: fix error handling in __add_reloc_root() (+2/-1)

David Sterba (2) commits (+23/-18):
btrfs: allow cross-subvolume file clone (+8/-3)
btrfs: join DEV_STATS ioctls to one (+15/-15)

Arnd Hannemann (1) commits (+8/-1):
Btrfs: allow mount -o remount,compress=no

Anand Jain (1) commits (+1/-1):
btrfs read error corrected message floods the console during recovery

Mitch Harder (1) commits (+20/-14):
Btrfs: Check INCOMPAT flags on remount and add helper function

Tsutomu Itoh (1) commits (+3/-3):
Btrfs: return error of btrfs_update_inode() to caller

Andrew Mahone (1) commits (+5/-3):
btrfs: ignore unfragmented file checks in defrag when compression enabled - rebased

Total: (65) commits

 fs/btrfs/Makefile           |    2 +-
 fs/btrfs/async-thread.c     |    9 +-
 fs/btrfs/backref.c          |   40 +-
 fs/btrfs/backref.h          |    7 +-
 fs/btrfs/btrfs_inode.h      |   14 +-
 fs/btrfs/check-integrity.c  |    7 +-
 fs/btrfs/ctree.c            |  775 +++-
 fs/btrfs/ctree.h            |  368 +++-
 fs/btrfs/delayed-inode.c    |   23 +-
 fs/btrfs/delayed-inode.h    |    2 +
 fs/btrfs/delayed-ref.c      |   56 +-
 fs/btrfs/delayed-ref.h      |   62 +-
 fs/btrfs/disk-io.c          |  150 +-
 fs/btrfs/disk-io.h          |    6 +
 fs/btrfs/extent-tree.c      |  358 ++--
 fs/btrfs/extent_io.c        |   58 +-
 fs/btrfs/file-item.c        |    4 +-
 fs/btrfs/free-space-cache.c |    2 +-
 fs/btrfs/inode.c            |   42 +-
 fs/btrfs/ioctl.c            |  471 -
 fs/btrfs/ioctl.h            |   97 +-
 fs/btrfs/locking.c          |   14 +-
 fs/btrfs/qgroup.c           | 1571 +++
 fs/btrfs/relocation.c       |    3 +-
 fs/btrfs/root-tree.c        |  107 +-
 fs/btrfs/send.c             | 4570 +++
 fs/btrfs/send.h             |  133 ++
 fs/btrfs/struct-funcs.c     |  196 +-
 fs/btrfs/super.c            |   28 +-
 fs/btrfs/transaction.c      |  101 +-
 fs/btrfs/transaction.h      |   12 +
 fs/btrfs/tree-log.c         |    4 +-
 fs/btrfs/volumes.c          |   25 +-
 fs/btrfs/volumes.h          |    4 +-
 fs/inode.c                  |    2 +
 35 files changed, 8690 insertions(+), 633 deletions(-)


[no subject]

2012-08-29 Thread Chris Mason
Hi Linus,

I've split out the big send/receive update from my last pull request and
now have just the fixes in my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

For anyone who wants send/receive updates, they are maintained as well.
But it has enough cleanups (without fixes) that we shouldn't be asking
Linus to take it right now.  The send/recv branch will wander over to
linux-next shortly though.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git send-recv

The largest patches in this pull are Josef's patches to fix DIO locking
problems and his patch to fix a crash during balance.
They are both well tested.

The rest are smaller fixes that we've had queued.  The last rc came out
while I was hacking new and exciting ways to recover from a misplaced rm
-rf on my dev box, so these missed rc3.

Josef Bacik (9) commits (+322/-216):
Btrfs: don't allocate a seperate csums array for direct reads (+19/-32)
Btrfs: do not use missing devices when showing devname (+2/-0)
Btrfs: fix enospc problems when deleting a subvol (+1/-1)
Btrfs: increase the size of the free space cache (+7/-8)
Btrfs: lock extents as we map them in DIO (+127/-129)
Btrfs: fix deadlock with freeze and sync V2 (+9/-4)
Btrfs: allow delayed refs to be merged (+142/-27)
Btrfs: do not strdup non existent strings (+5/-3)
Btrfs: barrier before waitqueue_active (+10/-12)

Stefan Behrens (5) commits (+16/-77):
Btrfs: fix that repair code is spuriously executed for transid failures (+6/-2)
Btrfs: revert checksum error statistic which can cause a BUG() (+2/-39)
Btrfs: fix a misplaced address operator in a condition (+1/-1)
Btrfs: remove superblock writing after fatal error (+5/-33)
Btrfs: fix that error value is changed by mistake (+2/-2)

Dan Carpenter (4) commits (+16/-8):
Btrfs: unlock on error in btrfs_delalloc_reserve_metadata() (+3/-1)
Btrfs: fix some error codes in btrfs_qgroup_inherit() (+6/-2)
Btrfs: fix some endian bugs handling the root times (+4/-4)
Btrfs: checking for NULL instead of IS_ERR (+3/-1)

Liu Bo (2) commits (+25/-6):
Btrfs: fix ordered extent leak when failing to start a transaction (+5/-2)
Btrfs: fix a dio write regression (+20/-4)

Arne Jansen (2) commits (+38/-73):
Btrfs: fix deadlock in wait_for_more_refs (+21/-73)
Btrfs: fix race in run_clustered_refs (+17/-0)

Chris Mason (1) commits (+3/-0):
Btrfs: don't run __tree_mod_log_free_eb on leaves

Fengguang Wu (1) commits (+3/-2):
btrfs: fix second lock in btrfs_delete_delayed_items()

Miao Xie (1) commits (+1/-0):
Btrfs: fix wrong mtime and ctime when creating snapshots

Total: (25) commits (+424/-382)

 fs/btrfs/backref.c       |   4 +-
 fs/btrfs/compression.c   |   1 +
 fs/btrfs/ctree.c         |   9 +-
 fs/btrfs/ctree.h         |   3 +-
 fs/btrfs/delayed-inode.c |  12 +-
 fs/btrfs/delayed-ref.c   | 163 +++-
 fs/btrfs/delayed-ref.h   |   4 +
 fs/btrfs/disk-io.c       |  53 ++--
 fs/btrfs/disk-io.h       |   2 +-
 fs/btrfs/extent-tree.c   | 123 +-
 fs/btrfs/extent_io.c     |  17 +--
 fs/btrfs/file-item.c     |   4 +-
 fs/btrfs/inode.c         | 326 ---
 fs/btrfs/ioctl.c         |   2 +-
 fs/btrfs/locking.c       |   2 +-
 fs/btrfs/qgroup.c        |  12 +-
 fs/btrfs/root-tree.c     |   4 +-
 fs/btrfs/super.c         |  15 ++-
 fs/btrfs/transaction.c   |   3 +-
 fs/btrfs/volumes.c       |  33 +
 fs/btrfs/volumes.h       |   2 -
 21 files changed, 418 insertions(+), 376 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Btrfs updates

2012-08-29 Thread Chris Mason
Hi Linus,

I've split out the big send/receive update from my last pull request and
now have just the fixes in my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

For anyone who wants send/receive updates, they are maintained as well.
But it has enough cleanups (without fixes) that we shouldn't be asking
Linus to take it right now.  The send/recv branch will wander over to
linux-next shortly though.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git send-recv

The largest patches in this pull are Josef's patches to fix DIO locking
problems and his patch to fix a crash during balance.
They are both well tested.

The rest are smaller fixes that we've had queued.  The last rc came out
while I was hacking new and exciting ways to recover from a misplaced rm
-rf on my dev box, so these missed rc3.

Josef Bacik (9) commits (+322/-216):
Btrfs: don't allocate a seperate csums array for direct reads (+19/-32)
Btrfs: do not use missing devices when showing devname (+2/-0)
Btrfs: fix enospc problems when deleting a subvol (+1/-1)
Btrfs: increase the size of the free space cache (+7/-8)
Btrfs: lock extents as we map them in DIO (+127/-129)
Btrfs: fix deadlock with freeze and sync V2 (+9/-4)
Btrfs: allow delayed refs to be merged (+142/-27)
Btrfs: do not strdup non existent strings (+5/-3)
Btrfs: barrier before waitqueue_active (+10/-12)

Stefan Behrens (5) commits (+16/-77):
Btrfs: fix that repair code is spuriously executed for transid failures (+6/-2)
Btrfs: revert checksum error statistic which can cause a BUG() (+2/-39)
Btrfs: fix a misplaced address operator in a condition (+1/-1)
Btrfs: remove superblock writing after fatal error (+5/-33)
Btrfs: fix that error value is changed by mistake (+2/-2)

Dan Carpenter (4) commits (+16/-8):
Btrfs: unlock on error in btrfs_delalloc_reserve_metadata() (+3/-1)
Btrfs: fix some error codes in btrfs_qgroup_inherit() (+6/-2)
Btrfs: fix some endian bugs handling the root times (+4/-4)
Btrfs: checking for NULL instead of IS_ERR (+3/-1)

Liu Bo (2) commits (+25/-6):
Btrfs: fix ordered extent leak when failing to start a transaction (+5/-2)
Btrfs: fix a dio write regression (+20/-4)

Arne Jansen (2) commits (+38/-73):
Btrfs: fix deadlock in wait_for_more_refs (+21/-73)
Btrfs: fix race in run_clustered_refs (+17/-0)

Chris Mason (1) commits (+3/-0):
Btrfs: don't run __tree_mod_log_free_eb on leaves

Fengguang Wu (1) commits (+3/-2):
btrfs: fix second lock in btrfs_delete_delayed_items()

Miao Xie (1) commits (+1/-0):
Btrfs: fix wrong mtime and ctime when creating snapshots

Total: (25) commits (+424/-382)

 fs/btrfs/backref.c       |   4 +-
 fs/btrfs/compression.c   |   1 +
 fs/btrfs/ctree.c         |   9 +-
 fs/btrfs/ctree.h         |   3 +-
 fs/btrfs/delayed-inode.c |  12 +-
 fs/btrfs/delayed-ref.c   | 163 +++-
 fs/btrfs/delayed-ref.h   |   4 +
 fs/btrfs/disk-io.c       |  53 ++--
 fs/btrfs/disk-io.h       |   2 +-
 fs/btrfs/extent-tree.c   | 123 +-
 fs/btrfs/extent_io.c     |  17 +--
 fs/btrfs/file-item.c     |   4 +-
 fs/btrfs/inode.c         | 326 ---
 fs/btrfs/ioctl.c         |   2 +-
 fs/btrfs/locking.c       |   2 +-
 fs/btrfs/qgroup.c        |  12 +-
 fs/btrfs/root-tree.c     |   4 +-
 fs/btrfs/super.c         |  15 ++-
 fs/btrfs/transaction.c   |   3 +-
 fs/btrfs/volumes.c       |  33 +
 fs/btrfs/volumes.h       |   2 -
 21 files changed, 418 insertions(+), 376 deletions(-)


[GIT PULL 1/2] Btrfs fixes

2012-08-09 Thread Chris Mason
Hi everyone,

This first pull is the bulk of our changes for the next rc.  It is
against the 3.5 kernel so people testing the new features have a stable
point to work against.  This was tested against Linus' current tree as
well.

The second pull is just one fix against 3.6-rc1 (in another email).

Linus, please grab my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Most of these fixes are against the new send/receive code.  Alexander
fixed a number of bugs in there and I found a few more while backing up my
laptop.  It does nightly incremental runs now about 3x faster than
rsync, so things are looking pretty good.

On top of that we have fixes for some long standing bugs in the delayed
reference code (a few more of these are still being worked on),
deadlocks and other small fixes.

Alexander Block (23) commits (+482/-419):
Btrfs: don't treat top/root directory inode as deleted/reused (+20/-1)
Btrfs: fix use of radix_tree for name_cache in send/receive (+37/-39)
Btrfs: rename backref_ctx::found_in_send_root to found_itself (+4/-4)
Btrfs: pass root instead of parent_root to iterate_inode_ref (+2/-2)
Btrfs: add correct parent to check_dirs when dir got moved (+11/-0)
Btrfs: add missing check for dir != tmp_dir to is_first_ref (+1/-1)
Btrfs: fix check for changed extent in is_extent_unchanged (+2/-2)
Btrfs: free nce and nce_head on error in name_cache_insert (+5/-1)
Btrfs: don't break in the final loop of find_extent_clone (+0/-1)
Btrfs: fix cur_ino < parent_ino case for send/receive (+146/-244)
Btrfs: add/fix comments/documentation for send/receive (+134/-6)
Btrfs: use normal return path for root == send_root case (+0/-6)
Btrfs: fix memory leak for name_cache in send/receive (+1/-0)
Btrfs: use kmalloc instead of stack for backref_ctx (+18/-11)
Btrfs: remove unused use_list from send/receive code (+0/-2)
Btrfs: remove unused tmp_path from iterate_dir_item (+0/-8)
Btrfs: add rdev to get_inode_info in send/receive (+17/-13)
Btrfs: use <= instead of < in is_extent_unchanged (+1/-1)
Btrfs: update send_progress at correct places (+20/-6)
Btrfs: ignore non-FS inodes for send/receive (+5/-0)
Btrfs: code cleanups for send/receive (+35/-48)
Btrfs: make aux field of ulist 64 bit (+21/-23)
Btrfs: remove unused code with #if 0 (+2/-0)

Josef Bacik (9) commits (+325/-215):
Btrfs: don't allocate a seperate csums array for direct reads (+19/-32)
Btrfs: do not use missing devices when showing devname (+2/-0)
Btrfs: fix enospc problems when deleting a subvol (+1/-1)
Btrfs: increase the size of the free space cache (+7/-8)
Btrfs: lock extents as we map them in DIO (+127/-129)
Btrfs: allow delayed refs to be merged (+142/-27)
Btrfs: do not strdup non existent strings (+5/-3)
Btrfs: barrier before waitqueue_active (+10/-12)
Btrfs: use a slab for btrfs_dio_private (+12/-3)

Dan Carpenter (4) commits (+16/-8):
Btrfs: unlock on error in btrfs_delalloc_reserve_metadata() (+3/-1)
Btrfs: fix some error codes in btrfs_qgroup_inherit() (+6/-2)
Btrfs: fix some endian bugs handling the root times (+4/-4)
Btrfs: checking for NULL instead of IS_ERR (+3/-1)

Stefan Behrens (3) commits (+8/-36):
Btrfs: fix a misplaced address operator in a condition (+1/-1)
Btrfs: remove superblock writing after fatal error (+5/-33)
Btrfs: fix that error value is changed by mistake (+2/-2)

Chris Mason (2) commits (+40/-15):
Btrfs: fix btrfs send for inline items and compression (+37/-15)
Btrfs: don't run __tree_mod_log_free_eb on leaves (+3/-0)

Fengguang Wu (2) commits (+4/-6):
btrfs: fix second lock in btrfs_delete_delayed_items() (+3/-2)
btrfs: Use PTR_RET in btrfs_resume_balance_async() (+1/-4)

Arne Jansen (2) commits (+38/-73):
Btrfs: fix deadlock in wait_for_more_refs (+21/-73)
Btrfs: fix race in run_clustered_refs (+17/-0)

Miao Xie (1) commits (+1/-0):
Btrfs: fix wrong mtime and ctime when creating snapshots

Total: (46) commits

 fs/btrfs/backref.c       |  12 +-
 fs/btrfs/compression.c   |   1 +
 fs/btrfs/ctree.c         |  14 +-
 fs/btrfs/ctree.h         |   3 +-
 fs/btrfs/delayed-inode.c |  12 +-
 fs/btrfs/delayed-ref.c   | 163 +++--
 fs/btrfs/delayed-ref.h   |   4 +
 fs/btrfs/disk-io.c       |  45 +--
 fs/btrfs/disk-io.h       |   2 +-
 fs/btrfs/extent-tree.c   | 123 +++
 fs/btrfs/extent_io.c     |   1 -
 fs/btrfs/file-item.c     |   4 +-
 fs/btrfs/inode.c         | 318 -
 fs/btrfs/ioctl.c         |   2 +-
 fs/btrfs/locking.c       |   2 +-
 fs/btrfs/qgroup.c        |  32 +-
 fs/btrfs/root-tree.c     |   4 +-
 fs/btrfs/send.c          | 895 ++-
 fs/btrfs/super.c         |   2 +
 fs/btrfs/transaction.c   |   3 +-
 fs/btrfs/ulist.c         |   7 +-
 fs/btrfs/ulist.h         |   9 +-
 fs/btrfs/volumes.c       |  16 +-
 23 files changed, 908 insertions

[GIT PULL 2/2] Btrfs merge fix

2012-08-09 Thread Chris Mason
Hi Linus,

Please pull my for-linus-3.6 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-3.6

It fixes a merging error in rc1.  The calls to mnt_want_write should
have been removed.

Alexander Block (1):
  Btrfs: remove mnt_want_write call in btrfs_mksubvol

 fs/btrfs/ioctl.c | 5 -
 1 file changed, 5 deletions(-)


Re: [GIT PULL 1/2] Btrfs fixes

2012-08-21 Thread Chris Mason
On Mon, Aug 20, 2012 at 07:55:59PM -0600, Linus Torvalds wrote:
 On Mon, Aug 20, 2012 at 6:53 PM, Chris Samuel ch...@csamuel.org wrote:
 
  This pull request with a whole heap of btrfs fixes (46 commits) appears
  not to have been merged yet, does anyone know if it was rejected or just
  missed ?
 
 Read my -rc2 release notes.
 
 TL;DR: I rejected big pull requests that didn't convince me. Make a
 damn good case for it, or send minimal fixes instead.
 
 I'm tired of these "oops, what we sent you for -rc1 wasn't ready, so
 here's a thousand lines of changes" crap.

When just the second pull went in, I wasn't sure if it was waiting for
vacation or you felt it was too big, but when I saw rc2 it was pretty
clear.

So I'm working up an rc3 pull with longer explanations.  The bulk of my
last pull was send/receive fixes.  The rc1 send/recv worked fine for me
on my test box, but larger scale use on well aged filesystems showed
some problems.

It's fair to say send/receive wasn't ready.  I did expect some fixes for
rc2 but not that many.  More details will be in my pull this afternoon,
but with our current code it is working very well for me.

-chris



[ANNOUNCE] Btrfs v0.12 released

2008-02-06 Thread Chris Mason
Hello everyone,

I wasn't planning on releasing v0.12 yet, and it was supposed to have some 
initial support for multiple devices.  But, I have made a number of 
performance fixes and small bug fixes, and I wanted to get them out there 
before the (destabilizing) work on multiple-devices took over.

So, here's v0.12.  It comes with a shiny new disk format (sorry), but the gain 
is dramatically better random writes to existing files.  In testing here, the 
random write phase of tiobench went from 1MB/s to 30MB/s.  The fix was to 
change the way back references for file extents were hashed.

Other changes:

Insert and delete multiple items at once in the btree where possible.  Back 
references added more tree balances, and it showed up in a few benchmarks.  
With v0.12, backrefs have no real impact on performance.

Optimize bio end_io routines.  Btrfs was spending way too much CPU time in the 
bio end_io routines, leading to lock contention and other problems.

Optimize read ahead during transaction commit.  The old code was trying to 
read far too much at once, which made the end_io problems really stand out.

mount -o ssd option, which clusters file data writes together regardless of 
the directory the files belong to.  There are a number of other performance 
tweaks for SSD, aimed at clustering metadata and data writes to better take 
advantage of the hardware.

mount -o max_inline=size option, to override the default max inline file data 
size (default is 8k).  Any value up to the leaf size is allowed (default 
16k).

Simple -ENOSPC handling.  Emphasis on simple, but it prevents accidentally 
filling the disk most of the time.  With enough threads/procs banging on 
things, you can still easily crash the box.

-chris


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-30 Thread Chris Mason
On Wednesday 30 January 2008, Al Boldi wrote:
 Jan Kara wrote:
   Chris Snook wrote:
Al Boldi wrote:
 This RFC proposes to introduce a tunable which allows to disable
 fsync and changes ordered into writeback writeout on a per-process
 basis like this:

   echo 1 > /proc/`pidof process`/softsync
   
This is basically a kernel workaround for stupid app behavior.
  
   Exactly right to some extent, but don't forget the underlying
   data=ordered starvation problem, which looks like a genuinely deep
   problem maybe related to blockIO.
 
It is a problem with the way how ext3 does fsync (at least that's what
  we ended up with in that konqueror problem)... It has to flush the
  current transaction which means that app doing fsync() has to wait till
  all dirty data of all files on the filesystem are written (if we are in
  ordered mode). And that takes quite some time... There are possibilities
  how to avoid that but especially with freshly created files, it's tough
  and I don't see a way how to do it without some fundamental changes to
  JBD.

 Ok, but keep in mind that this starvation occurs even in the absence of
 fsync, as the benchmarks show.

 And, a quick test of successive 1sec delayed syncs shows no hangs until
 about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
 hangs for minutes on end, and io-wait shows almost 100%.

Do you see this on older kernels as well?  The first thing we need to 
understand is if this particular stall is new.

-chris


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-31 Thread Chris Mason
On Thursday 31 January 2008, Jan Kara wrote:
 On Thu 31-01-08 11:56:01, Chris Mason wrote:
  On Thursday 31 January 2008, Al Boldi wrote:
   Andreas Dilger wrote:
On Wednesday 30 January 2008, Al Boldi wrote:
 And, a quick test of successive 1sec delayed syncs shows no hangs
 until about 1 minute (~180mb) of db-writeout activity, when the
 sync abruptly hangs for minutes on end, and io-wait shows almost
 100%.
   
How large is the journal in this filesystem?  You can check via
debugfs -R 'stat 8' /dev/XXX.
  
   32mb.
  
Is this affected by increasing
the journal size?  You can set the journal size via mke2fs -J
size=400 at format time, or on an unmounted filesystem by running
tune2fs -O ^has_journal /dev/XXX then tune2fs -J size=400
/dev/XXX.
  
   Setting size=400 doesn't help, nor does size=4.
  
I suspect that the stall is caused by the journal filling up, and
then waiting while the entire journal is checkpointed back to the
filesystem before the next transaction can start.
   
It is possible to improve this behaviour in JBD by reducing the
amount of space that is cleared if the journal becomes full, and
also doing journal checkpointing before it becomes full.  While that
may reduce performance a small amount, it would help avoid such huge
latency problems. I believe we have such a patch in one of the Lustre
branches already, and while I'm not sure what kernel it is for the
JBD code rarely changes much
  
   The big difference between ordered and writeback is that once the
   slowdown starts, ordered goes into ~100% iowait, whereas writeback
   continues 100% user.
 
  Does data=ordered write buffers in the order they were dirtied?  This
  might explain the extreme problems in transactional workloads.

   Well, it does but we submit them to block layer all at once so elevator
 should sort the requests for us...

nr_requests is fairly small, so a long stream of random requests should still 
end up being random IO.

Al, could you please compare the write throughput from vmstat for the 
data=ordered vs data=writeback runs?  I would guess the data=ordered one has 
a lower overall write throughput.
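
A rough simulation of the nr_requests point, assuming the elevator can only sort within batches of nr_requests queued requests. This is a simplification of real elevator behaviour; the batch size of 128, the sector range, and the function names are illustrative assumptions, not measurements from this workload.

```python
import random

def seek_distance(sectors):
    """Total head movement when servicing sectors in the given order."""
    pos, total = 0, 0
    for s in sectors:
        total += abs(s - pos)
        pos = s
    return total

def elevator(sectors, nr_requests):
    """Sort requests only within each batch of nr_requests queued requests."""
    out = []
    for i in range(0, len(sectors), nr_requests):
        out.extend(sorted(sectors[i:i + nr_requests]))
    return out

rng = random.Random(0)
# A long stream of random writes, as a transactional workload might produce.
stream = [rng.randrange(10**8) for _ in range(100_000)]

unsorted_cost = seek_distance(stream)                 # no sorting at all
windowed_cost = seek_distance(elevator(stream, 128))  # sorted per batch only
ideal_cost = seek_distance(sorted(stream))            # one sweep over the disk
print(unsorted_cost, windowed_cost, ideal_cost)
```

In this model each batch still sweeps most of the disk, so the windowed cost stays far above the single-sweep ideal even though it beats the unsorted order.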

-chris


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-31 Thread Chris Mason
On Thursday 31 January 2008, Al Boldi wrote:
 Andreas Dilger wrote:
  On Wednesday 30 January 2008, Al Boldi wrote:
   And, a quick test of successive 1sec delayed syncs shows no hangs until
   about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
   hangs for minutes on end, and io-wait shows almost 100%.
 
  How large is the journal in this filesystem?  You can check via
  debugfs -R 'stat 8' /dev/XXX.

 32mb.

  Is this affected by increasing
  the journal size?  You can set the journal size via mke2fs -J size=400
  at format time, or on an unmounted filesystem by running
  tune2fs -O ^has_journal /dev/XXX then tune2fs -J size=400 /dev/XXX.

 Setting size=400 doesn't help, nor does size=4.

  I suspect that the stall is caused by the journal filling up, and then
  waiting while the entire journal is checkpointed back to the filesystem
  before the next transaction can start.
 
  It is possible to improve this behaviour in JBD by reducing the amount
  of space that is cleared if the journal becomes full, and also doing
  journal checkpointing before it becomes full.  While that may reduce
  performance a small amount, it would help avoid such huge latency
  problems. I believe we have such a patch in one of the Lustre branches
  already, and while I'm not sure what kernel it is for the JBD code rarely
  changes much

 The big difference between ordered and writeback is that once the slowdown
 starts, ordered goes into ~100% iowait, whereas writeback continues 100%
 user.

Does data=ordered write buffers in the order they were dirtied?  This might 
explain the extreme problems in transactional workloads.

-chris


Re: [PATCH][RFC] fast file mapping for loop

2008-01-15 Thread Chris Mason
On Tue, 15 Jan 2008 11:07:40 +0100
Jens Axboe [EMAIL PROTECTED] wrote:

   I split and merged the patch into five bits (added ext3 support),
   so perhaps that would be easier for people to read/review.
   Attached and also exist in the loop-extent_map branch here:

Thanks!

   
   http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=loop-extent_map
  
  Seems my ext3 version doesn't work, it craps out in
  ext3_get_blocks_handle() triggering this bug:
  
  J_ASSERT(handle != NULL || create == 0);
  
  I'll see if I can fix that, being fairly fs ignorant...
 
 This works, but probably pretty suboptimal (should end the new journal
 in map_io_complete()?). And yes I know the >> 9 isn't correct, since
 the fs block size is larger. Just making sure that we always have
 enough blocks.

You can use DIO_CREDITS instead of len >> 9, just like the ext3
O_DIRECT code does.  Your current patch is fine, except it breaks
data=ordered rules.  My plan to work within data=ordered:

1) Inside ext3_map_extent (while the transaction was running), increment
a counter in the ext3 journal for number of pending IOs.  Then end the
transaction handle.

2) Drop this counter inside the IO completion call

3) Change the ext3 commit code to wait for the IO count to be zero.
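
The three steps above can be sketched in user space as a counter guarded by a condition variable. This is only a toy model under stated assumptions: the Journal class and its method names are hypothetical stand-ins, not the ext3 or JBD interfaces.

```python
import threading

class Journal:
    """Toy stand-in for the ext3 journal; it only tracks pending data IOs."""

    def __init__(self):
        self._cond = threading.Condition()
        self.pending_ios = 0
        self.committed = False

    def map_extent(self):
        # Step 1: called while the transaction handle is still open.
        with self._cond:
            self.pending_ios += 1

    def io_complete(self):
        # Step 2: called from the IO completion path.
        with self._cond:
            self.pending_ios -= 1
            if self.pending_ios == 0:
                self._cond.notify_all()

    def commit(self):
        # Step 3: the commit waits until every mapped IO has finished.
        with self._cond:
            self._cond.wait_for(lambda: self.pending_ios == 0)
            self.committed = True

journal = Journal()
journal.map_extent()
journal.map_extent()

committer = threading.Thread(target=journal.commit)
committer.start()
# The commit cannot finish while two IOs are outstanding.
still_waiting = not journal.committed

journal.io_complete()
journal.io_complete()
committer.join()
print(still_waiting, journal.committed)
```

Holding the commit until the count reaches zero is what would preserve the data=ordered guarantee: no metadata commits before the data it points at is on disk.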

I'll give it a shot later this week, until then your current patch is
just data=writeback, which is good enough for testing.

-chris


[ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)

2008-01-15 Thread Chris Mason

Hello everyone,

Btrfs v0.10 is now available for download from:

http://oss.oracle.com/projects/btrfs/

Btrfs is still in an early alpha state, and the disk format is not finalized.
v0.10 introduces a new disk format, and is not compatible with v0.9.

The core of this release is explicit back references for all metadata blocks,
data extents, and directory items.  These are a crucial building block for
future features such as online fsck and migration between devices.  The back
references are verified during deletes, and the extent back references are
checked by the existing offline fsck tool.

For all of the details of how the back references are maintained, please
see the design document:

http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html

Other new features (described in detail below):

* Online resizing (including shrinking)
* In place conversion from Ext3 to Btrfs
* data=ordered support
* Mount options to disable data COW and checksumming 
* Barrier support for SATA and IDE drives

[ Resizing ]

In order to demonstrate and test the back references, I've added an online
resizer, which can both grow and shrink the filesystem:

mount -t btrfs /dev/xxx /mnt

# add 2GB to the FS
btrfsctl -r +2g /mnt

# shrink the FS by 4GB
btrfsctl -r -4g /mnt

# Explicitly set the FS size
btrfsctl -r 20g /mnt

# Use 'max' to grow the FS to the limit of the device
btrfsctl -r max /mnt
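
The four forms of the resize argument shown above (relative grow, relative shrink, absolute size, and 'max') can be summarized in a small sketch. The helper names are hypothetical and the k/m/g suffixes are assumed to be powers of 1024; this is not the btrfsctl source.

```python
# Assumed suffix multipliers (powers of 1024).
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}

def parse_size(text):
    """Turn a string such as '2g' or '4096' into a byte count."""
    text = text.lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)

def resolve_resize(arg, current_size, device_size):
    """Return the new filesystem size in bytes for a resize argument."""
    if arg == "max":
        return device_size                          # grow to the device limit
    if arg[0] == "+":
        return current_size + parse_size(arg[1:])   # relative grow
    if arg[0] == "-":
        return current_size - parse_size(arg[1:])   # relative shrink
    return parse_size(arg)                          # explicit size

GIB = 1024**3
for arg in ("+2g", "-4g", "20g", "max"):
    print(arg, resolve_resize(arg, 10 * GIB, 40 * GIB) // GIB)
```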

[ Conversion from Ext3 ]

This is an offline, in place, conversion program written by Yan Zheng.  It
has been through basic testing, but should not be trusted with critical data.

To build the conversion program, run 'make convert' in the btrfs-progs
tree. It depends on libe2fs and acl development libraries.

The conversion program uses the copy on write nature of Btrfs to preserve the
original Ext3 FS, sharing the data blocks between Btrfs and Ext3 metadata.
Btrfs metadata is created inside the free space of the Ext3 filesystem, and it
is possible to either make the conversion permanent (reclaiming the space used
by Ext3) or roll back the conversion to the original Ext3 filesystem.

More details and example usage of the conversion program can be found here:

http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-converter.html

Thanks to Yan Zheng for all of his work on the converter.

[ New mount options ]

mount -o nodatacsum disables checksumming on data extents

mount -o nodatacow disables copy on write of data extents, unless a given
extent is referenced by more than one snapshot.  This is targeted at database
workloads, where copy on write is not optimal for performance.

The explicit back references allow the nodatacow code to make sure copy
on write is done when multiple snapshots reference the same file, maintaining
snapshot consistency.
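
The nodatacow rule can be sketched as a check on how many snapshots hold a back reference to an extent. The Extent type and must_cow helper below are hypothetical illustrations, not Btrfs data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Extent:
    # Each back reference names a snapshot (root) that points at this extent.
    backref_roots: set = field(default_factory=set)

def must_cow(extent, nodatacow):
    """Decide whether a data write to this extent has to be copy on write."""
    if not nodatacow:
        return True                       # default behaviour: always COW
    return len(extent.backref_roots) > 1  # shared by snapshots: still COW

private = Extent({"root-5"})
shared = Extent({"root-5", "snap-256"})
print(must_cow(private, nodatacow=True), must_cow(shared, nodatacow=True))
```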

mount -o alloc_start=num forces allocation hints to start at least num bytes
into the disk.  This was introduced to test the resizer.  Example usage:

mount -o alloc_start=16g /dev/ /mnt
(do something to the FS)
btrfsctl -r 12g /mnt

The btrfsctl command will resize the FS down to 12GB in size.  Because
the FS was mounted with -o alloc_start=16g, any allocations done after
mounting will need to be relocated by the resizer.

It is safe to specify a number past the end of the FS; if the alloc_start is too
large, it is ignored.

mount -o nobarrier disables cache flushes during commit.

-chris


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-15 Thread Chris Mason
On Tue, 15 Jan 2008 20:24:27 -0500
Daniel Phillips [EMAIL PROTECTED] wrote:

 On Jan 15, 2008 7:15 PM, Alan Cox [EMAIL PROTECTED] wrote:
   Writeback cache on disk in itself is not bad, it only gets bad
   if the disk is not engineered to save all its dirty cache on
   power loss, using the disk motor as a generator or alternatively
   a small battery. It would be awfully nice to know which brands
   fail here, if any, because writeback cache is a big performance
   booster.
 
  AFAIK no drive saves the cache. The worst case cache flush for
  drives is several seconds with no retries and a couple of minutes
  if something really bad happens.
 
  This is why the kernel has some knowledge of barriers and uses them
  to issue flushes when needed.
 
 Indeed, you are right, which is supported by actual measurements:
 
 http://sr5tech.com/write_back_cache_experiments.htm
 
 Sorry for implying that anybody has engineered a drive that can do
 such a nice thing with writeback cache.
 
 The disk motor as a generator tale may not be purely folklore.  When
 an IDE drive is not in writeback mode, something special needs to be done
 to ensure the last write to media is not a scribble.
 
 A small UPS can make writeback mode actually reliable, provided the
 system is smart enough to take the drives out of writeback mode when
 the line power is off.

We've had mount -o barrier=1 for ext3 for a while now; it makes
writeback caching safe.  XFS has this on by default, as does reiserfs.

-chris


Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)

2008-01-17 Thread Chris mason
On Tuesday 15 January 2008, Chris Mason wrote:
 Hello everyone,

 Btrfs v0.10 is now available for download from:

 http://oss.oracle.com/projects/btrfs/

Well, it turns out this release had a few small problems:

* data=ordered deadlock on older kernels (including 2.6.23)
* Compile problems when ACLs were not enabled in the kernel

So, I've put v0.11 out there.  It fixes those two problems and will also 
compile on older (2.6.18) enterprise kernels.

v0.11 does not have any disk format changes.

-chris




Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)

2008-01-17 Thread Chris mason
On Thursday 17 January 2008, Daniel Phillips wrote:
 On Jan 17, 2008 1:25 PM, Chris mason [EMAIL PROTECTED] wrote:
  So, I've put v0.11 out there.  It fixes those two problems and will also
  compile on older (2.6.18) enterprise kernels.
 
  v0.11 does not have any disk format changes.

 Hi Chris,

 First, massive congratulations for bringing this to fruition in such a
 short time.

 Now back to the regular carping: why even support older kernels?

The general answer is the backports are small and easy.  I don't test them 
heavily, and I don't go out of my way to make things work. 

But, they do make it easier for people to try out, and to figure how to use 
all these new features to solve problems.  Small changes that enable more 
testers are always welcome.

In general, the core parts of the kernel that btrfs uses haven't had many 
interface changes since 2.6.18, so this isn't a huge deal.

-chris


Re: konqueror deadlocks on 2.6.22

2008-01-22 Thread Chris Mason
On Tuesday 22 January 2008, Al Boldi wrote:
 Ingo Molnar wrote:
  * Oliver Pinter (Pintér Olivér) [EMAIL PROTECTED] wrote:
   and then please update to CFS-v24.1
   http://people.redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.6.22.15-v24.1.patch
  
Yes with CFSv20.4, as in the log.
   
It also hangs on 2.6.23.13
 
  my feeling is that this is some sort of timing dependent race in
  konqueror/kde/qt that is exposed when a different scheduler is put in.
 
  If it disappears with CFS-v24.1 it is probably just because the timings
  will change again. Would be nice to debug this on the konqueror side and
  analyze why it fails and how. You can probably tune the timings by
  enabling SCHED_DEBUG and tweaking /proc/sys/kernel/*sched* values - in
  particular sched_latency and the granularity settings. Setting wakeup
  granularity to 0 might be one of the things that could make a
  difference.

 Thanks Ingo, but Mike suggested that data=writeback may make a difference,
 which it does indeed.

 So the bug seems to be related to data=ordered, although I haven't gotten
 any feedback from the ext3 gurus yet.

 Seems rather critical though, as data=writeback is a dangerous mode to run.

Running fsync in data=ordered means that all of the dirty blocks on the FS 
will get written before fsync returns.  Your original stack trace shows 
everyone either performing writeback for a log commit or waiting for the log 
commit to return.

The key task in your trace is kjournald, stuck in get_request_wait.  It could 
be a block layer bug, not giving him requests quickly enough, or it could be 
the scheduler not giving him back the cpu fast enough.

At any rate, that's where to concentrate the debugging.  You should be able to 
simulate this by running a few instances of the below loop and looking for 
stalls:

# each pass writes 4 x 50MB with synchronous output and prints the elapsed time
while true; do
time dd if=/dev/zero of=foo bs=50M count=4 oflag=sync
done



Re: konqueror deadlocks on 2.6.22

2008-01-22 Thread Chris Mason
On Tuesday 22 January 2008, Al Boldi wrote:
 Chris Mason wrote:
  Running fsync in data=ordered means that all of the dirty blocks on the
  FS will get written before fsync returns.

 Hm, that's strange, I expected this kind of behaviour from data=journal.

 data=writeback should return immediately, which seems it does, but
 data=ordered should only wait for metadata flush, it shouldn't wait for
 filedata flush.  Are you sure it waits for both?

I oversimplified.  data=ordered means that all data blocks are written before 
the metadata that references them commits.

So, if you add 1GB to a fileA in a transaction and then run fsync(fileB) in 
the same transaction, the 1GB from fileA is sent to disk (and waited on) 
before the fsync on fileB returns.
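
A toy model of the fileA/fileB case, assuming a single running transaction that tracks dirty data per file. The Transaction class is hypothetical and ignores metadata entirely; it only counts how much file data has to reach disk before fsync returns.

```python
class Transaction:
    """Toy model of one running journal transaction."""

    def __init__(self):
        self.dirty = {}  # file name -> bytes of dirty data in this transaction

    def write(self, name, nbytes):
        self.dirty[name] = self.dirty.get(name, 0) + nbytes

    def fsync(self, name, mode="ordered"):
        """Return the bytes of file data flushed before fsync(name) returns."""
        if mode == "ordered":
            # The commit must write every data block the transaction references.
            flushed = sum(self.dirty.values())
            self.dirty.clear()
        else:
            # writeback: only this file's own data is waited on.
            flushed = self.dirty.pop(name, 0)
        return flushed

GB = 1024**3
txn = Transaction()
txn.write("fileA", GB)
txn.write("fileB", 4096)
ordered_cost = txn.fsync("fileB", mode="ordered")
print(ordered_cost)
```

In this model the fsync of a 4k file pays for the full 1GB written to fileA, which matches the stalls described in this thread.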

-chris

