Re: [PATCH] rd: Mark ramdisk buffers heads dirty
On Wed, 2007-10-17 at 11:57 -0600, Eric W. Biederman wrote:
> Christian Borntraeger [EMAIL PROTECTED] writes:
>
> > Eric,
> >
> > On Tuesday, 16 October 2007, Christian Borntraeger wrote:
> > > On Tuesday, 16 October 2007, Eric W. Biederman wrote:
> > > >  fs/buffer.c |3 +++
> > > >  1 files changed, 3 insertions(+), 0 deletions(-)
> > > >  drivers/block/rd.c | 13 +
> > > >  1 files changed, 1 insertions(+), 12 deletions(-)
> > >
> > > Your patches look sane so far. I have applied both patches, and the
> > > problem seems gone. I will try to get these patches to our testers.
> > >
> > > As long as they don't find new problems:
> >
> > Our testers did only a short test, and then they were stopped by
> > problems with reiserfs. At the moment I cannot say for sure if your
> > patch caused this, but we got the following BUG
>
> Thanks.
>
> > ReiserFS: ram0: warning: Created .reiserfs_priv on ram0 - reserved for
> > xattr storage.
> > [ cut here ]
> > kernel BUG at
> > /home/autobuild/BUILD/linux-2.6.23-20071017/fs/reiserfs/journal.c:1117!
> > illegal operation: 0001 [#1]
> > Modules linked in: reiserfs dm_multipath sunrpc dm_mod qeth ccwgroup vmur
> > CPU: 3 Not tainted
> > Process reiserfs/3 (pid: 2592, task: 77dac418, ksp: 7513ee88)
> > Krnl PSW : 070c3000 fb344380 (flush_commit_list+0x808/0x95c [reiserfs])
> >    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0
> > Krnl GPRS: 0002 7411b5c8 002b 7b04d000 0001 76d1de00 7513eec0
> >    0003 0012 77f77680 7411b608 fb343b7e fb34404a 7513ee50
> > Krnl Code: fb344374: a7210002  tmll %r2,2
> >            fb344378: a7840004  brc  8,fb344380
> >            fb34437c: a7f40001  brc  15,fb34437e
> >            fb344380: 5810d8c2  l    %r1,2242(%r13)
> >            fb344384: 5820b03c  l    %r2,60(%r11)
> >            fb344388: 0de1      basr %r14,%r1
> >            fb34438a: 5810d90e  l    %r1,2318(%r13)
> >            fb34438e: 5820b03c  l    %r2,60(%r11)
> >
> > Looking at the code, this really seems related to dirty buffers, so
> > your patch is the main suspect at the moment.
>
> Sounds reasonable.
>
> > if (!barrier) {
> >         /* If there was a write error in the journal - we can't commit
> >          * this transaction - it will be invalid and, if successful,
> >          * will just end up propagating the write error out to
> >          * the file system.
> >          */
> >         if (likely(!retval && !reiserfs_is_journal_aborted (journal))) {
> >                 if (buffer_dirty(jl->j_commit_bh))
> >                         BUG();          /* journal.c line 1117 */
> >                 mark_buffer_dirty(jl->j_commit_bh) ;
> >                 sync_dirty_buffer(jl->j_commit_bh) ;
> >         }
> > }
>
> Grr. I'm not certain how to call that.
>
> Given that I should also be able to trigger this case by writing to
> the block device through the buffer cache (to the right spot at the
> right moment) this feels like a reiserfs bug. Although I feel like
> screaming about filesystems that go BUG instead of remounting
> read-only.

In this case, the commit block isn't allowed to be dirty before reiserfs
decides it is safe to write it. The journal code expects that it is the
only spot in the kernel setting buffer heads dirty, and it only does so
after the rest of the log blocks are safely on disk.

Given this is a ramdisk, the check can be ignored, but I'd rather not
sprinkle if (ram_disk) into the FS code.

> At the same time I increasingly don't think we should allow user space
> to dirty or update our filesystem metadata buffer heads. That seems
> like asking for trouble.

Demanding trouble ;)

-chris

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
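Editor's illustration of the ordering rule discussed in this message, as a minimal Python sketch rather than anything taken from reiserfs. `BufferHead`, `JournalAborted` and `flush_commit` are hypothetical names, and raising an exception where the kernel calls BUG() is an assumption made so the toy can keep running.

```python
class BufferHead:
    """Toy stand-in for a kernel buffer_head: a name and a dirty bit."""

    def __init__(self, name):
        self.name = name
        self.dirty = False


class JournalAborted(Exception):
    """Raised where the real code calls BUG() (hypothetical softer choice)."""


def flush_commit(log_blocks, commit_bh, disk):
    """Write the log blocks first, then the commit block.

    The commit block may only become dirty once every log block is on
    "disk"; finding it dirty earlier means someone else touched it.
    """
    for bh in log_blocks:
        disk.append(bh.name)
        bh.dirty = False
    if commit_bh.dirty:
        raise JournalAborted("commit block dirtied outside the journal")
    commit_bh.dirty = True
    disk.append(commit_bh.name)   # stands in for sync_dirty_buffer()
    commit_bh.dirty = False


disk = []
logs = [BufferHead("log0"), BufferHead("log1")]
commit = BufferHead("commit")
flush_commit(logs, commit, disk)      # normal case: log blocks, then commit

commit.dirty = True                   # e.g. a ramdisk keeping every buffer dirty
try:
    flush_commit(logs, commit, disk)
except JournalAborted as err:
    outcome = str(err)
```

In the second call the log blocks still reach the list but the commit block does not, which is the situation the check at journal.c:1117 guards against.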
Re: [PATCH] rd: Mark ramdisk buffers heads dirty
On Wed, 2007-10-17 at 14:29 -0600, Eric W. Biederman wrote:
> Chris Mason [EMAIL PROTECTED] writes:
>
> > In this case, the commit block isn't allowed to be dirty before
> > reiserfs decides it is safe to write it. The journal code expects it
> > is the only spot in the kernel setting buffer heads dirty, and it
> > only does so after the rest of the log blocks are safely on disk.
>
> Ok. So the journal code here fundamentally depends on being able to
> control the order of the writes, and something else being able to set
> the buffer head dirty messes up that control.

Right.

> > > At the same time I increasingly don't think we should allow user
> > > space to dirty or update our filesystem metadata buffer heads. That
> > > seems like asking for trouble.
> >
> > Demanding trouble ;)
>
> Looks like it. There are even comments in jbd about the same class of
> problems. Apparently dump and tune2fs on mounted filesystems have
> triggered some of these issues. The practical question is whether any
> of this trouble is worth handling.
>
> Thinking about it. I don't believe anyone has ever intentionally built
> a filesystem tool that depends on being able to modify a filesystem's
> metadata buffer heads while the filesystem is running, and doing that
> would seem to be fragile as it would require a lot of cooperation
> between the tool and the filesystem about how the filesystem uses and
> implements things.

That's right. For example, ext2 is doing directories in the page cache
of the directory inode, so there's a cache alias between the block
device page cache and the directory inode page cache.

> Now I guess I need to see how difficult a patch would be to give
> filesystems magic inodes to keep their metadata buffer heads in.

Not hard, the block device inode is already a magic inode for metadata
buffer heads. You could just make another one attached to the bdev.

But, I don't think I fully understand the problem you're trying to
solve?
-chris
Re: [PATCH] rd: Mark ramdisk buffers heads dirty
On Wed, 2007-10-17 at 15:30 -0600, Eric W. Biederman wrote:
> Chris Mason [EMAIL PROTECTED] writes:
>
> > > Thinking about it. I don't believe anyone has ever intentionally
> > > built a filesystem tool that depends on being able to modify a
> > > filesystem's metadata buffer heads while the filesystem is running,
> > > and doing that would seem to be fragile as it would require a lot
> > > of cooperation between the tool and the filesystem about how the
> > > filesystem uses and implements things.
> >
> > That's right. For example, ext2 is doing directories in the page
> > cache of the directory inode, so there's a cache alias between the
> > block device page cache and the directory inode page cache.
> >
> > > Now I guess I need to see how difficult a patch would be to give
> > > filesystems magic inodes to keep their metadata buffer heads in.
> >
> > Not hard, the block device inode is already a magic inode for
> > metadata buffer heads. You could just make another one attached to
> > the bdev.
> >
> > But, I don't think I fully understand the problem you're trying to
> > solve?
>
> So the start:
> When we write buffers from the buffer cache we clear buffer_dirty
> but not PageDirty.
>
> So try_to_free_buffers() will mark any page with clean buffer_heads
> that is not clean itself clean.
>
> The ramdisk sets pages dirty to keep them from being removed from the
> page cache, just like ramfs.

So, the problem is using the Dirty bit to indicate pinned. You're
completely right that our current setup of buffer heads and pages and
filesystem metadata is complex and difficult.

But, moving the buffer heads off of the page cache pages isn't going to
make it any easier to use dirty as pinned, especially in the face of
buffer_head users for file data pages.

You've already seen Nick's fsblock code, but you can see my general
approach to replacing buffer heads here:

http://oss.oracle.com/mercurial/mason/btrfs-unstable/file/f89e7971692f/extent_map.h

(alpha quality implementation in extent_map.c and users in inode.c) The
basic idea is to do extent based record keeping for mapping and state of
things in the filesystem, and to avoid attaching these things to the
page.

> Unfortunately when those dirty ramdisk pages get buffers on them and
> those buffers all go clean and we are trying to reclaim buffer_heads
> we drop those pages from the page cache. Ouch!
>
> We can fix the ramdisk by making certain that buffer_heads on ramdisk
> pages stay dirty as well. The problem is this breaks filesystems like
> reiserfs and ext3 that expect to be able to make buffer_heads clean
> sometimes.
>
> There are other ways to solve this for ramdisks, such as changing
> where ramdisks are stored. However fixing the ramdisks this way still
> leaves the general problem that there are other paths to the
> filesystem metadata buffers, and those other paths cause the code to
> be complicated and buggy.
>
> So I'm trying to see if we can untangle this Gordian knot, so the code
> becomes more easily maintainable.

Don't get me wrong, I'd love to see a simple and coherent fix for what
reiserfs and ext3 do with buffer head state, but I think for the short
term you're best off pinning the ramdisk pages via some other means.

-chris
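Editor's illustration of the failure sequence described in this message, as a toy Python model and not kernel code. `Page`, `try_to_free_buffers` and `reclaim` here are simplified stand-ins that only track the dirty bits; the real functions do considerably more.

```python
class Page:
    """Toy page cache page: payload, a dirty bit, per-buffer dirty bits."""

    def __init__(self, data):
        self.data = data
        self.dirty = False
        self.buffers = []      # one bool per buffer_head, True = dirty


def try_to_free_buffers(page):
    """Drop the buffers if all are clean, and clean the page with them."""
    if not page.buffers or any(page.buffers):
        return False
    page.buffers = []
    page.dirty = False         # this is what undoes "dirty means pinned"
    return True


def reclaim(page_cache):
    """Drop every clean page, as memory pressure eventually would."""
    for index in [i for i, p in page_cache.items() if not p.dirty]:
        del page_cache[index]


page_cache = {0: Page(b"ramdisk block 0"), 1: Page(b"ramdisk block 1")}
for page in page_cache.values():
    page.dirty = True          # the ramdisk pins its pages by dirtying them

# A filesystem attaches buffers to page 0, dirties one, then writes it out.
page_cache[0].buffers = [True, False, False, False]
page_cache[0].buffers = [False] * 4    # write-out clears buffer_dirty only

try_to_free_buffers(page_cache[0])
reclaim(page_cache)
```

After the run only page 1 is left in `page_cache`: page 0 lost its pin when its clean buffers were freed, so its contents are gone.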
Re: [PATCH] rd: Mark ramdisk buffers heads dirty
On Wed, 2007-10-17 at 17:28 -0600, Eric W. Biederman wrote:
> Chris Mason [EMAIL PROTECTED] writes:
>
> > So, the problem is using the Dirty bit to indicate pinned. You're
> > completely right that our current setup of buffer heads and pages
> > and filesystem metadata is complex and difficult.
> >
> > But, moving the buffer heads off of the page cache pages isn't going
> > to make it any easier to use dirty as pinned, especially in the face
> > of buffer_head users for file data pages.
>
> Let me be specific. Not moving buffer_heads off of page cache pages,
> moving buffer_heads off of the block device's page cache pages.
>
> My problem is the coupling of how block devices are cached and the
> implementation of buffer heads, and by removing that coupling we can
> generally make things better. Currently that coupling means silly
> things like all block devices are cached in low memory. Which probably
> isn't what you want if you actually have a use for block devices.
>
> For the ramdisk case in particular what this means is that there are
> no more users that create buffer_head mappings on the block device
> cache so using the dirty bit will be safe.

Ok, we move the buffer heads off to a different inode, and that inode
has pages. The pages on the inode still need to get pinned, how does
that pinning happen?

The problem you described where someone cleans a page because the
buffer heads are clean happens already without help from userland. So,
keeping the pages away from userland won't save them from cleaning.

Sorry if I'm reading your suggestion wrong...

-chris
Re: [PATCH] rd: Use a private inode for backing storage
On Sun, 21 Oct 2007 12:39:30 -0600
[EMAIL PROTECTED] (Eric W. Biederman) wrote:

> Nick Piggin [EMAIL PROTECTED] writes:
>
> > On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
> > > Christian Borntraeger [EMAIL PROTECTED] writes:
> > >
> > > Let me put it another way. Looking at /proc/slabinfo I can get
> > > 37 buffer_heads per page. I can allocate 10% of memory in
> > > buffer_heads before we start to reclaim them. So it requires just
> > > over 3.7 buffer_heads on every page of low memory to even trigger
> > > this case. That is a large 1k filesystem or a weird sized
> > > partition, that we have written to directly.
> >
> > On a highmem machine it could be relatively common.
>
> Possibly. But the same proportions still hold. 1k filesystems are not
> the default these days and ramdisks are relatively uncommon. The
> memory quantities involved are all low mem.

It is definitely common during run time. It was seen in practice enough
to be reproducible and get fixed for the non-ramdisk case.

The big underlying question is which ramdisk usage case we are shooting
for. Keeping the ram disk pages off the LRU can certainly help the VM if
larger ramdisks used at runtime are very common.

Otherwise, I'd say to keep it as simple as possible and use Eric's
patch. By simple I'm not counting lines of code, I'm counting overall
readability between something everyone knows (page cache usage) and
something specific to ramdisks (Nick's patch).

-chris
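Editor's worked check of the 3.7 figure quoted in this message, as a small Python sketch. The 37 and 10% values come from the mail; the 4096-byte page size and the helper `buffers_per_page` are assumptions added for illustration.

```python
from fractions import Fraction

BUFFER_HEADS_PER_SLAB_PAGE = 37          # from /proc/slabinfo, per the mail
RECLAIM_FRACTION = Fraction(1, 10)       # 10% of memory held in buffer_heads

# Average number of buffer_heads per page of memory at the reclaim limit.
threshold = BUFFER_HEADS_PER_SLAB_PAGE * RECLAIM_FRACTION


def buffers_per_page(page_size, block_size):
    """How many buffer_heads a fully mapped page carries."""
    return page_size // block_size


PAGE_SIZE = 4096                         # assumed, not stated in the mail
exceeds = {bs: buffers_per_page(PAGE_SIZE, bs) > threshold
           for bs in (1024, 2048, 4096)}
```

Under that page-size assumption only 1k blocks put more than 3.7 buffer_heads on a page (4 per page), which matches the remark about a large 1k filesystem.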
Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file
On Tue, 23 Oct 2007 19:56:20 +0800
Fengguang Wu [EMAIL PROTECTED] wrote:

> On Tue, Oct 23, 2007 at 12:07:07PM +0200, Peter Zijlstra wrote:
> > [ adding reiserfs devs to the CC ]
>
> Thank you.
>
> This fix is kind of crude - even when it fixed Maxim's problem, and
> survived my stress testing of a lot of patching and kernel compiling.
> I'd be glad to see better solutions.

This should be safe, reiserfs has the buffer heads themselves clean and
the page should get cleaned eventually. The cancel_dirty_page call was
just an optimization to be VM friendly.

-chris
[CFP] 2008 Linux Storage and Filesystem Workshop
Hello everyone,

We are organizing another filesystem and storage workshop in San Jose
next Feb 25 and 26. You can find some great writeups of last year's
conference on LWN:

http://lwn.net/Articles/226351/

This year we're trying to concentrate on more problem solving sessions,
short term projects and joint sessions. You can find all the details on
the conference webpages:

http://www.usenix.org/events/lsf08/

Soon there will be a link for submitting your position statement, which
is basically a note to the organizers that you are interested in
attending and which topics you think should be covered.

We're also looking for people to lead the discussion around the major
topics, so please let us know if you're interested in that. The
discussion leaders will have input into the people that get invited and
the format of the discussion.

Please let me know if there are any questions about the workshop.

Thanks,
Chris
Re: [ANNOUNCE] Btrfs v0.12 released
On Sunday 10 February 2008, David Miller wrote:
> From: Chris Mason [EMAIL PROTECTED]
> Date: Wed, 6 Feb 2008 12:00:13 -0500
>
> This function never returns an error, so the simplest fix was to
> return the hash value which avoids all of the issues.
>
> In attempting other schemes to fix this, I found it very difficult to
> give gcc a packed attribute for that u64 * argument other than to
> create some new pseudo structure which would have been ugly.

Many thanks, I clearly didn't put enough thought into the unaligned
access problems.

> Similar code lives in the btrfs kernel code too, I'll try to get a
> partition at least mounted and working minimally and if successful
> I'll send you patches for that too.

The kernel is actually worse, because the set/get macros are more
complex. Some live in ctree.h like in the progs, but the nasty ones live
in struct-funcs.c

-chris
Re: BTRFS only works with PAGE_SIZE = 4K
On Tuesday 12 February 2008, David Miller wrote:
> From: Chris Mason [EMAIL PROTECTED]
> Date: Wed, 6 Feb 2008 12:00:13 -0500
>
> > So, here's v0.12.
>
> Any page size larger than 4K will not work with btrfs. All of the
> extent stuff assumes that PAGE_SIZE = sectorsize.

Yeah, there is definitely clean up to do in that area.

> I confirmed this by forcing mkfs.btrfs to use an 8K sectorsize on
> sparc64 and I was finally able to successfully mount a partition.

Nice

> With 4K there are zeros in the root tree node header, because its
> extent's location on disk is at a sub-PAGE_SIZE multiple and the
> extent code doesn't handle that.
>
> You really need to start validating this stuff on other platforms.
> Something that isn't little endian and something that doesn't use 4K
> pages. I'm sure you have some powerpc parts around somewhere. :)

Grin, I think around v0.4 I grabbed a ppc box for a day and got things
working. There has been some churn since then...

My first prio is the newest set of disk format changes, and then I'll
sit down and work on stability on a bunch of arches.

> Anyways, here is a patch for the kernel bits which fixes most of the
> unaligned accesses on sparc64.

Many thanks, I'll try these out here and push them into the tree.

-chris
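Editor's illustration of the "sub-PAGE_SIZE multiple" remark in this message, as a minimal Python sketch. The helper names are hypothetical and the 8192-byte page size is an assumption based on the 8K figure mentioned for sparc64.

```python
def offset_in_page(disk_offset, page_size):
    """Byte position of an extent start inside the page that holds it."""
    return disk_offset % page_size


def misaligned_extents(sectorsize, page_size, count):
    """Extent starts (multiples of sectorsize) that do not begin a page."""
    starts = [n * sectorsize for n in range(count)]
    return [s for s in starts if offset_in_page(s, page_size) != 0]


PAGE_SIZE = 8192   # assumed page size for the sparc64 case

with_4k_sectors = misaligned_extents(4096, PAGE_SIZE, 4)   # [4096, 12288]
with_8k_sectors = misaligned_extents(8192, PAGE_SIZE, 4)   # []
```

With 4K sectors every second extent start lands in the middle of a page, so code that assumes extents begin on a page boundary looks at the wrong bytes; with 8K sectors that cannot happen.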
Re: BTRFS partition usage...
On Tuesday 12 February 2008, Jan Engelhardt wrote:
> On Feb 12 2008 09:08, Chris Mason wrote:
> > > > So, if Btrfs starts zeroing at 1k, will that be acceptable for
> > > > you?
> > >
> > > Something looks wrong here. Why would btrfs need to zero at all?
> > > Superblock at 0, and done. Just like xfs.
> > > (Yes, I had xfs on sparc before, so it's not like you NEED the
> > > whitespace at the start of a partition.)
> >
> > I've had requests to move the super down to 64k to make room for
> > bootloaders, which may not matter for sparc, but I don't really plan
> > on different locations for different arches.
>
> In x86, there is even more space for a bootloader (some 28k or so)
> even if your partition table is as closely packed as possible,
> from 0 to 7e00 IIRC.
>
> For sparc you could have something like
>
>         startlba  endlba  type
>   sda1         0       2     1  Boot
>   sda2         2      58     3  Whole disk
>   sda3        58       9    83  Linux
>
> and slap the bootloader into MBR, just like on x86.
> Or I am missing something..

It was a request from hpa, and he clearly had something in mind. He
kindly offered to review the disk format for bootloaders and other lower
level issues but I asked him to wait until I firm it up a bit.

From my point of view, 0 is a bad idea because it is very likely to
conflict with other things. There are lots of things in the FS that need
deep thought, and the perfect system to fully use the first 64k of a 1TB
filesystem isn't quite at the top of my list right now ;)

Regardless of offset, it is a good idea to mop up previous filesystems
where possible, and a very good idea to align things on some sector
boundary. Even going 1MB in wouldn't be a horrible idea to align with
erasure blocks on SSD.

-chris
Re: BTRFS partition usage...
On Tuesday 12 February 2008, Jan Engelhardt wrote:
> On Feb 12 2008 08:49, Chris Mason wrote:
> > > This is a real issue on sparc where the default sun disk labels
> > > created use an initial partition where block zero aliases the disk
> > > label. It took me a few iterations before I figured out why every
> > > btrfs make would zero out my disk label :-/
> > >
> > > Actually it seems this is only a problem with mkfs.btrfs, it clears
> > > out the first 64 4K chunks of the disk for whatever reason.
> >
> > It is a good idea to remove supers from other filesystems. I also
> > need to add zeroing at the end of the device as well.
> >
> > Looks like I misread the e2fs zeroing code. It zeros the whole
> > external log device, and I assumed it also zero'd out the start of
> > the main FS.
> >
> > So, if Btrfs starts zeroing at 1k, will that be acceptable for you?
>
> Something looks wrong here. Why would btrfs need to zero at all?
> Superblock at 0, and done. Just like xfs.
> (Yes, I had xfs on sparc before, so it's not like you NEED the
> whitespace at the start of a partition.)

I've had requests to move the super down to 64k to make room for
bootloaders, which may not matter for sparc, but I don't really plan on
different locations for different arches.

4k aligned is important given that sector sizes are growing.

-chris
Re: BTRFS partition usage...
On Tuesday 12 February 2008, Jan Engelhardt wrote:
> On Feb 12 2008 09:35, Chris Mason wrote:
> > > and slap the bootloader into MBR, just like on x86.
> > > Or I am missing something..
> >
> > It was a request from hpa, and he clearly had something in mind. He
> > kindly offered to review the disk format for bootloaders and other
> > lower level issues but I asked him to wait until I firm it up a bit.
> >
> > From my point of view, 0 is a bad idea because it is very likely to
> > conflict with other things. There are lots of things in the FS that
> > need deep thought, and the perfect system to fully use the first 64k
> > of a 1TB filesystem isn't quite at the top of my list right now ;)
> >
> > Regardless of offset, it is a good idea to mop up previous
> > filesystems where possible, and a very good idea to align things on
> > some sector boundary. Even going 1MB in wouldn't be a horrible idea
> > to align with erasure blocks on SSD.
>
> I still don't like the idea of btrfs trying to be smarter than a user
> who can partition up his system according to (a) his likes (b) system
> or hardware requirements or recommendations to align the superblock
> to a specific location.

Will all the users in the world who think about super block location
when they partition their disks please raise their hands?

The location of the super block needs to be very simple in order for
mount and friends to find and detect it. It needs a simple algorithm to
try multiple locations in case a given copy of the super is corrupt.

Design in this case is a bunch of compromises around other users of the
hardware, ease of programming, and the benefits in performance or
usability from doing something complex.

> 1MB alignment does not always mean 1MB alignment. Sector 1 begins at
> 0x7e00 on x86. And with the maximum CHS geometry (255/63), partitions
> begin at 0x7e00+n*8225280 bytes, so the SB is unlikely to ever be on a
> 1048576 boundary.

IO is already aligned on sectors, sometimes we'll have a perfect erasure
block alignment and sometimes not.

When the location of the super is my biggest bottleneck, I'll be a very
happy boy.

-chris
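Editor's worked check of the alignment arithmetic quoted in this message, as a small Python sketch. The constants 0x7e00, 255/63 and 8225280 come from the mail; the 512-byte sector size and the function names are assumptions added for illustration.

```python
SECTOR = 512                                      # assumed sector size
HEADS, SECTORS_PER_TRACK = 255, 63                # maximum CHS geometry
FIRST_PARTITION = 0x7e00                          # byte offset from the mail
CYLINDER = HEADS * SECTORS_PER_TRACK * SECTOR     # 8225280 bytes
MIB = 1 << 20


def partition_start(n):
    """Byte offset of a partition that begins n cylinders further in."""
    return FIRST_PARTITION + n * CYLINDER


def aligned_starts(limit):
    """Values of n below limit whose partition start is 1 MiB aligned."""
    return [n for n in range(limit) if partition_start(n) % MIB == 0]


hits = aligned_starts(2048)
```

Every start is a multiple of 512 bytes, but only one value of n in any run of 2048 lands on a 1 MiB boundary, so a fixed offset inside the partition is indeed rarely erase-block aligned on such a layout.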
Re: [ANNOUNCE] Btrfs v0.12 released
On Tuesday 12 February 2008, David Miller wrote:
> From: Chris Mason [EMAIL PROTECTED]
> Date: Mon, 11 Feb 2008 08:42:20 -0500
>
> > The kernel is actually worse, because the set/get macros are more
> > complex. Some live in ctree.h like in the progs, but the nasty ones
> > live in struct-funcs.c
>
> This is really problematic, because you've got these things called
> btrfs_item_ptr() which really isn't a pointer, it's a relative
> 'unsigned long' offset cast to a pointer. The source of this seems to
> be btrfs_leaf_data().
>
> And then those things get passed down into the SETGET functions!

Explaining it won't make it pretty, but at least I can tell you what the
code does. This is all part of the btrfs code that supports tree block
sizes larger than a page.

The extent_buffer code (extent_io.c) provides a read/write api into an
extent_buffer based on offsets from the start of the multi-page buffer.
That's where the relative unsigned long comes from.

The part where I cast it to pointers is me trying to maintain type
checking throughout the code. The pointers themselves are useless, they
need to be matched with an extent_buffer to actually get to the bytes.

There are a few parts where the SETGET funcs are open coded, mostly in
very performance critical functions. Just look for lexxx_to_cpu

> Then deeper down we have terribly inconsistent things like
> btrfs_item_nr_offset() and btrfs_item_offset_nr().

Btree blocks have the offset of the item header from the start of the
block and the offset of the item data. And, I'm very bad at naming.

> Sigh... I'll see what I can do.

Thanks

-chris
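Editor's illustration of offset-based access into a multi-page buffer, as a minimal Python sketch of the idea described in this message. The class `ExtentBuffer` and its methods are hypothetical and are not the btrfs API; the 4096-byte page size is an assumption.

```python
PAGE_SIZE = 4096   # assumed page size for the sketch


class ExtentBuffer:
    """Hypothetical tree block spread over several separate pages."""

    def __init__(self, length):
        npages = (length + PAGE_SIZE - 1) // PAGE_SIZE
        self.pages = [bytearray(PAGE_SIZE) for _ in range(npages)]

    def write(self, offset, data):
        """Copy bytes in one at a time so a field may cross a page edge."""
        for i, byte in enumerate(data):
            page, off = divmod(offset + i, PAGE_SIZE)
            self.pages[page][off] = byte

    def read(self, offset, size):
        """Gather size bytes starting at offset from the start of the block."""
        out = bytearray()
        for i in range(size):
            page, off = divmod(offset + i, PAGE_SIZE)
            out.append(self.pages[page][off])
        return bytes(out)

    def get_le64(self, offset):
        """Assemble a little-endian 64-bit value without an aligned load."""
        return int.from_bytes(self.read(offset, 8), "little")

    def set_le64(self, offset, value):
        self.write(offset, value.to_bytes(8, "little"))


eb = ExtentBuffer(16384)
eb.set_le64(PAGE_SIZE - 3, 0x1122334455667788)   # unaligned, straddles pages
value = eb.get_le64(PAGE_SIZE - 3)
```

An offset only means something together with the buffer it indexes, which is the point made above about the cast pointers. Copying byte by byte is also one way to avoid the unaligned loads that trap on sparc64.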
Re: [PATCH] Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer
On Mon, Oct 08, 2012 at 07:26:15AM -0600, Wang Sheng-Hui wrote:
> In csum_dirty_buffer, we first get eb from page->private.
> Then we check if the page is the first page of eb. Later we
> check it again. Remove the repeated check here.

You had the right idea here, two checks and one has a warning, so you
kept the warning. But when the metadata block size is bigger than a
page, the WARN_ON triggers for any page that isn't the first one in the
extent buffer.

I kept this commit but removed the WARN_ON(1)

-chris
[GIT PULL] Btrfs
    reservation structure (+39/-22)
    Btrfs: fix wrong size for the reservation of the, snapshot creation (+4/-4)
    Revert Btrfs: do not do filemap_write_and_wait_range in fsync (+11/-3)
    Btrfs: fix file extent discount problem in the, snapshot (+25/-44)
    Btrfs: fix orphan transaction on the freezed filesystem (+49/-23)
    Btrfs: add a type field for the transaction handle (+21/-42)
    Btrfs: fix error path in create_pending_snapshot() (+17/-23)
    Btrfs: use a slab for ordered extents allocation (+31/-3)
    Btrfs: fix wrong orphan count of the fs/file tree (+1/-1)
    Btrfs: fix corrupted metadata in the snapshot (+32/-18)
    Btrfs: fix the snapshot that should not exist (+53/-15)
    Btrfs: fix memory leak in start_transaction() (+3/-1)
    Btrfs: fix unprotected ->log_batch (+9/-11)

Liu Bo (13) commits (+150/-113):
    Btrfs: fix a bug in checking whether a inode is already in log (+10/-8)
    Btrfs: kill obsolete arguments in btrfs_wait_ordered_extents (+7/-18)
    Btrfs: fix a bug in parsing return value in logical resolve (+34/-20)
    Btrfs: use larger limit for translation of logical to inode (+5/-4)
    Btrfs: check if an inode has no checksum when logging it (+12/-11)
    Btrfs: update delayed ref's tracepoints to show sequence (+10/-4)
    Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag (+28/-14)
    Btrfs: improve fsync by filtering extents that we want (+26/-3)
    Btrfs: cleanup for duplicated code in find_free_extent (+0/-4)
    Btrfs: cleanup extents after we finish logging inode (+6/-0)
    Btrfs: use helper for logical resolve (+3/-16)
    Btrfs: fix off-by-one in file clone (+9/-9)
    Btrfs: cleanup fs_info->hashers (+0/-2)

Tsutomu Itoh (6) commits (+19/-20):
    Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called (+2/-1)
    Btrfs: remove unnecessary IS_ERR in bio_readpage_error() (+1/-1)
    Btrfs: cleanup of error processing in btree_get_extent() (+5/-9)
    Btrfs: fix error handling in delete_block_group_cache() (+2/-2)
    Btrfs: remove unnecessary code in btree_get_extent() (+1/-7)
    Btrfs: check return value of ulist_alloc() properly (+8/-0)

David Sterba (4) commits (+119/-62):
    btrfs: allow setting NOCOW for a zero sized file via ioctl (+27/-4)
    btrfs: move transaction aborts to the point of failure (+80/-47)
    btrfs: return EPERM upon rmdir on a subvolume (+3/-2)
    btrfs: polish names of kmem caches (+9/-9)

Sage Weil (3) commits (+18/-2):
    Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs() (+0/-2)
    Btrfs: pass lockdep rwsem metadata to async commit transaction (+16/-0)
    Btrfs: set journal_info in async trans commit worker (+2/-0)

Stefan Behrens (2) commits (+156/-21):
    Btrfs: make filesystem read-only when submitting barrier fails (+142/-19)
    Btrfs: detect corrupted filesystem after write I/O errors (+14/-2)

Robin Dong (2) commits (+12/-157):
    btrfs: remove unused function btrfs_insert_some_items() (+0/-143)
    btrfs: move inline function code to header file (+12/-14)

Mark Fasheh (2) commits (+848/-116):
    btrfs: extended inode ref iteration (+138/-37)
    btrfs: extended inode refs (+710/-79)

Wei Yongjun (2) commits (+3/-6):
    Btrfs: fix possible memory leak in scrub_setup_recheck_block() (+1/-0)
    Btrfs: using for_each_set_bit_from to simplify the code (+2/-6)

Chris Mason (2) commits (+38/-16):
    Btrfs: fix btrfs send for inline items and compression (+37/-15)
    btrfs: init ref_index to zero in add_inode_ref (+1/-1)

Jan Schmidt (2) commits (+129/-112):
    btrfs: improved readablity for add_inode_ref (+97/-81)
    Btrfs: fix gcc warnings for 32bit compiles (+32/-31)

Zach Brown (1) commits (+2/-1):
    btrfs: fix min csum item size warnings in 32bit

Daniel J Blueman (1) commits (+11/-11):
    btrfs: fix message printing

Anand Jain (1) commits (+7/-5):
    Btrfs: write_buf is now callable outside send.c

Kent Overstreet (1) commits (+2/-17):
    btrfs: Kill some bi_idx references

Andrei Popa (1) commits (+13/-1):
    Btrfs: make compress and nodatacow mount options mutually exclusive

liubo (1) commits (+0/-8):
    Btrfs: cleanup for unused ref cache stuff

Wang Sheng-Hui (1) commits (+0/-4):
    Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer

Total: (121) commits

 fs/btrfs/backref.c         | 299 +++---
 fs/btrfs/backref.h         |  10 +-
 fs/btrfs/btrfs_inode.h     |  15 +-
 fs/btrfs/check-integrity.c |  16 +-
 fs/btrfs/compression.c     |  13 +-
 fs/btrfs/ctree.c           | 148 +--
 fs/btrfs/ctree.h           | 109 +-
 fs/btrfs/delayed-inode.c   |   6 +-
 fs/btrfs/disk-io.c         | 230 ++-
 fs/btrfs/disk-io.h         |   2 +
 fs/btrfs/extent-tree.c     | 376 +-
 fs/btrfs/extent_io.c       | 128 --
 fs/btrfs/extent_io.h       |  23 +-
 fs/btrfs/extent_map.c      |  55 ++-
 fs/btrfs/extent_map.h      |   8 +-
 fs/btrfs/file-item.c       |   5 +-
 fs/btrfs/file.c
[GIT PULL] Btrfs fixes
Hi Linus,

My for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Has our series of fixes for the next rc. The biggest batch is from Jan
Schmidt, fixing up some problems in our subvolume quota code and fixing
btrfs send/receive to work with the new extended inode refs.

My git tree is against 3.6, but these were all retested against your
current git.

Jan Schmidt (7) commits (+149/-76):
    Btrfs: don't put removals from push_node_left into tree mod log twice (+7/-2)
    Btrfs: fix a tree mod logging issue for root replacement operations (+2/-8)
    Btrfs: tree mod log's old roots could still be part of the tree (+21/-4)
    Btrfs: fix extent buffer reference for tree mod log roots (+1/-1)
    Btrfs: extended inode refs support for send mechanism (+94/-58)
    Btrfs: comment for loop in tree_mod_log_insert_move (+5/-0)
    Btrfs: determine level of old roots (+19/-3)

Josef Bacik (2) commits (+8/-6):
    Btrfs: Use btrfs_update_inode_fallback when creating a snapshot (+6/-5)
    Btrfs: do not bug when we fail to commit the transaction (+2/-1)

Stefan Behrens (1) commits (+2/-2):
    Btrfs: Fix wrong error handling code

Lukas Czerner (1) commits (+2/-1):
    btrfs: Return EINVAL when length to trim is less than FSB

Arne Jansen (1) commits (+2/-1):
    Btrfs: send correct rdev and mode in btrfs-send

Gabriel de Perthuis (1) commits (+1/-1):
    Fix a sign bug causing invalid memory access in the ino_paths ioctl.

Liu Bo (1) commits (+5/-3):
    Btrfs: fix memory leak when cloning root's node

Alex Lyakas (1) commits (+13/-14):
    Btrfs: Send: preserve ownership (uid and gid) also for symlinks.

Miao Xie (1) commits (+7/-0):
    Btrfs: fix deadlock caused by the nested chunk allocation

Tsutomu Itoh (1) commits (+13/-4):
    Btrfs: fix memory leak in btrfs_quota_enable()

Total: (17) commits (+202/-108)

 fs/btrfs/backref.c     |  28 -
 fs/btrfs/backref.h     |   4 ++
 fs/btrfs/ctree.c       |  70 +-
 fs/btrfs/ctree.h       |   3 +
 fs/btrfs/extent_io.c   |   4 +-
 fs/btrfs/inode.c       |   7 +--
 fs/btrfs/ioctl.c       |   6 +-
 fs/btrfs/qgroup.c      |  17 --
 fs/btrfs/send.c        | 156 ++---
 fs/btrfs/transaction.c |   2 +-
 fs/btrfs/volumes.c     |   7 +++
 11 files changed, 199 insertions(+), 105 deletions(-)
[ANNOUNCE] seekwatcher IO graphing v0.2
Hello everyone, Since doing the initial Btrfs benchmarks, I've made my blktrace graphing utility a little more generic and tossed it out on oss.oracle.com. This new version can easily graph two different runs, and has a few other tweaks that make the graphs look nicer. Docs, examples and other details are at: http://oss.oracle.com/~mason/seekwatcher -chris
[PATCH RFC] extent mapped page cache
On Tue, 10 Jul 2007 17:03:26 -0400 Chris Mason [EMAIL PROTECTED] wrote:

> This patch aims to demonstrate one way to replace buffer heads with a few extent trees. Buffer heads provide a few different features:
>
> 1) Mapping of logical file offset to blocks on disk
> 2) Recording state (dirty, locked etc)
> 3) Providing a mechanism to access sub-page sized blocks.
>
> This patch covers #1 and #2, I'll start on #3 a little later next week.

Well, almost. I decided to try out an rbtree instead of the radix, which turned out to be much faster. Even though individual operations are slower, the rbtree was able to do many fewer ops to accomplish the same thing, especially for merging extents together. It also uses much less ram.

This code still has lots of room for optimization, but it comes in at around 2-5% more cpu time for ext2 streaming reads and writes. I haven't done readpages or writepages yet, so this is more or less a worst case setup. I'm comparing against ext2 with readpages and writepages disabled.

The new code has the added benefit of passing fsx-linux, and not triggering MCE's on my poor little test box.

The basic idea is to store state in byte ranges in an rbtree, and to mirror that state down into individual pages. This allows us to store arbitrary state outside of the page struct, so we could include the pid of the process that dirtied a page range for cfq purposes. The example readpage and writepage code is probably the easiest way to understand the basic API.

A separate rbtree stores a mapping of byte offset in the file to byte offset on disk. This allows the filesystem to fill in mapping information in bulk, and reduces the number of metadata lookups required to do common operations.

Because the state and mapping information are separate from the page, pages can come and go and their corresponding metadata can still be cached (the current code drops mappings as the last page corresponding to that mapping disappears).

Two patches follow, the core extent_map implementation and a sample user (ext2). This is pretty basic, implementing prepare/commit_write, read/writepage and a few other funcs to exercise the new code. Longer term, it should fit in with Nick's other extent work instead of prepare/commit_write.

My patch sets page->private to 1, really for no good reason. It is just a debugging aid I was using to make sure the page took the right path down the line. If this catches on, we might set it to a magic value so you can do if (ExtentPage(page)), or just leave it as null.

-chris
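The "state in byte ranges" idea, and why merging extents keeps the number of tree operations low, can be sketched in userspace C. This is only an illustration under stated assumptions: a small sorted array stands in for the rbtree, only a single "dirty" state is tracked, and the names range_set and range_set_dirty are hypothetical and do not appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_RANGES 64

/* One dirty byte range, inclusive on both ends (like start/end in the patch). */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/* Sorted, non-overlapping, non-adjacent ranges. Stand-in for the rbtree. */
struct range_set {
	struct byte_range r[MAX_RANGES];
	size_t nr;
};

/*
 * Mark [start, end] dirty, merging with every stored range that overlaps
 * or touches it. Returns 0 on success, -1 if the fixed-size array is full.
 */
int range_set_dirty(struct range_set *set, uint64_t start, uint64_t end)
{
	size_t i = 0, j;

	/* Skip ranges that end before the new one and do not touch it. */
	while (i < set->nr && set->r[i].end < start &&
	       start - set->r[i].end > 1)
		i++;

	/* Absorb every range that overlaps or is adjacent to [start, end]. */
	j = i;
	while (j < set->nr &&
	       (set->r[j].start <= end || set->r[j].start - end == 1)) {
		if (set->r[j].start < start)
			start = set->r[j].start;
		if (set->r[j].end > end)
			end = set->r[j].end;
		j++;
	}

	if (j == i) {
		/* Nothing to merge with: open a slot at index i. */
		if (set->nr == MAX_RANGES)
			return -1;
		memmove(&set->r[i + 1], &set->r[i],
			(set->nr - i) * sizeof(set->r[0]));
		set->nr++;
	} else {
		/* Ranges i..j-1 collapse into one entry at index i. */
		memmove(&set->r[i + 1], &set->r[j],
			(set->nr - j) * sizeof(set->r[0]));
		set->nr -= j - i - 1;
	}
	set->r[i].start = start;
	set->r[i].end = end;
	return 0;
}
```

Dirtying three consecutive pages in any order leaves a single entry covering all of them, so a later lookup or writeback pass touches one record instead of three.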
[PATCH RFC] extent mapped page cache main code
Core Extentmap implementation

diff -r 126111346f94 -r 53cabea328f7 fs/Makefile
--- a/fs/Makefile	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/Makefile	Tue Jul 24 15:40:27 2007 -0400
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.
 		attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
-		stack.o
+		stack.o extent_map.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff -r 126111346f94 -r 53cabea328f7 fs/extent_map.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/fs/extent_map.c	Tue Jul 24 15:40:27 2007 -0400
@@ -0,0 +1,1591 @@
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/bio.h>
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/pagemap.h>
+#include <linux/page-flags.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/extent_map.h>
+
+static struct kmem_cache *extent_map_cache;
+static struct kmem_cache *extent_state_cache;
+
+struct tree_entry {
+	u64 start;
+	u64 end;
+	int in_tree;
+	struct rb_node rb_node;
+};
+
+
+/* bits for the extent state */
+#define EXTENT_DIRTY 1
+#define EXTENT_WRITEBACK (1 << 1)
+#define EXTENT_UPTODATE (1 << 2)
+#define EXTENT_LOCKED (1 << 3)
+#define EXTENT_NEW (1 << 4)
+
+#define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
+
+void __init extent_map_init(void)
+{
+	extent_map_cache = kmem_cache_create("extent_map",
+					    sizeof(struct extent_map), 0,
+					    SLAB_RECLAIM_ACCOUNT |
+					    SLAB_DESTROY_BY_RCU,
+					    NULL, NULL);
+	extent_state_cache = kmem_cache_create("extent_state",
+					    sizeof(struct extent_state), 0,
+					    SLAB_RECLAIM_ACCOUNT |
+					    SLAB_DESTROY_BY_RCU,
+					    NULL, NULL);
+}
+
+void extent_map_tree_init(struct extent_map_tree *tree,
+			  struct address_space *mapping, gfp_t mask)
+{
+	tree->map.rb_node = NULL;
+	tree->state.rb_node = NULL;
+	rwlock_init(&tree->lock);
+	tree->mapping = mapping;
+}
+EXPORT_SYMBOL(extent_map_tree_init);
+
+struct extent_map *alloc_extent_map(gfp_t mask)
+{
+	struct extent_map *em;
+	em = kmem_cache_alloc(extent_map_cache, mask);
+	if (!em || IS_ERR(em))
+		return em;
+	em->in_tree = 0;
+	atomic_set(&em->refs, 1);
+	return em;
+}
+EXPORT_SYMBOL(alloc_extent_map);
+
+void free_extent_map(struct extent_map *em)
+{
+	if (atomic_dec_and_test(&em->refs)) {
+		WARN_ON(em->in_tree);
+		kmem_cache_free(extent_map_cache, em);
+	}
+}
+EXPORT_SYMBOL(free_extent_map);
+
+struct extent_state *alloc_extent_state(gfp_t mask)
+{
+	struct extent_state *state;
+	state = kmem_cache_alloc(extent_state_cache, mask);
+	if (!state || IS_ERR(state))
+		return state;
+	state->state = 0;
+	state->in_tree = 0;
+	atomic_set(&state->refs, 1);
+	init_waitqueue_head(&state->wq);
+	return state;
+}
+EXPORT_SYMBOL(alloc_extent_state);
+
+void free_extent_state(struct extent_state *state)
+{
+	if (atomic_dec_and_test(&state->refs)) {
+		WARN_ON(state->in_tree);
+		kmem_cache_free(extent_state_cache, state);
+	}
+}
+EXPORT_SYMBOL(free_extent_state);
+
+static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
+				   struct rb_node *node)
+{
+	struct rb_node ** p = &root->rb_node;
+	struct rb_node * parent = NULL;
+	struct tree_entry *entry;
+
+	while(*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct tree_entry, rb_node);
+
+		if (offset < entry->end)
+			p = &(*p)->rb_left;
+		else if (offset > entry->end)
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	entry = rb_entry(node, struct tree_entry, rb_node);
+	entry->in_tree = 1;
+	rb_link_node(node, parent, p);
+	rb_insert_color(node, root);
+	return NULL;
+}
+
+static struct rb_node *__tree_search(struct rb_root *root, u64 offset,
+				     struct rb_node **prev_ret)
+{
+	struct rb_node * n = root->rb_node;
+	struct rb_node *prev = NULL;
+	struct tree_entry *entry;
+	struct tree_entry *prev_entry = NULL;
+
+	while(n) {
+		entry = rb_entry(n, struct tree_entry, rb_node);
+		prev = n;
+		prev_entry = entry;
+
+		if (offset
[PATCH RFC] ext2 extentmap support
mount -o extentmap to use the new stuff

diff -r 126111346f94 -r 53cabea328f7 fs/ext2/ext2.h
--- a/fs/ext2/ext2.h	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/ext2.h	Tue Jul 24 15:40:27 2007 -0400
@@ -1,5 +1,6 @@
 #include <linux/fs.h>
 #include <linux/ext2_fs.h>
+#include <linux/extent_map.h>
 
 /*
  * ext2 mount options
@@ -65,6 +66,7 @@ struct ext2_inode_info {
 	struct posix_acl	*i_default_acl;
 #endif
 	rwlock_t i_meta_lock;
+	struct extent_map_tree extent_tree;
 	struct inode	vfs_inode;
 };
 
@@ -167,6 +169,7 @@ extern const struct address_space_operat
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
+extern const struct address_space_operations ext2_extent_map_aops;
 
 /* namei.c */
 extern const struct inode_operations ext2_dir_inode_operations;
diff -r 126111346f94 -r 53cabea328f7 fs/ext2/inode.c
--- a/fs/ext2/inode.c	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/inode.c	Tue Jul 24 15:40:27 2007 -0400
@@ -625,6 +625,84 @@ changed:
 	goto reread;
 }
 
+/*
+ * simple get_extent implementation using get_block. This assumes
+ * the get_block function can return something larger than a single block,
+ * but the ext2 implementation doesn't do so. Just change b_size to
+ * something larger if get_block can return larger extents.
+ */
+struct extent_map *ext2_get_extent(struct inode *inode, struct page *page,
+				   size_t page_offset, u64 start, u64 end,
+				   int create)
+{
+	struct buffer_head bh;
+	sector_t iblock;
+	struct extent_map *em = NULL;
+	struct extent_map_tree *extent_tree = &EXT2_I(inode)->extent_tree;
+	int ret = 0;
+	u64 max_end = (u64)-1;
+	u64 found_len;
+	u64 bh_start;
+	u64 bh_end;
+
+	bh.b_size = inode->i_sb->s_blocksize;
+	bh.b_state = 0;
+again:
+	em = lookup_extent_mapping(extent_tree, start, end);
+	if (em) {
+		return em;
+	}
+
+	iblock = start >> inode->i_blkbits;
+	if (!buffer_mapped(&bh)) {
+		ret = ext2_get_block(inode, iblock, &bh, create);
+		if (ret)
+			goto out;
+	}
+
+	found_len = min((u64)(bh.b_size), max_end - start);
+	if (!em)
+		em = alloc_extent_map(GFP_NOFS);
+
+	bh_start = start;
+	bh_end = start + found_len - 1;
+	em->start = start;
+	em->end = bh_end;
+	em->bdev = inode->i_sb->s_bdev;
+
+	if (!buffer_mapped(&bh)) {
+		em->block_start = 0;
+		em->block_end = 0;
+	} else {
+		em->block_start = bh.b_blocknr << inode->i_blkbits;
+		em->block_end = em->block_start + found_len - 1;
+	}
+	ret = add_extent_mapping(extent_tree, em);
+	if (ret == -EEXIST) {
+		free_extent_map(em);
+		em = NULL;
+		max_end = end;
+		goto again;
+	}
+out:
+	if (ret) {
+		if (em)
+			free_extent_map(em);
+		return ERR_PTR(ret);
+	} else if (em && buffer_new(&bh)) {
+		set_extent_new(extent_tree, bh_start, bh_end, GFP_NOFS);
+	}
+	return em;
+}
+
+static int ext2_extent_map_writepage(struct page *page,
+				     struct writeback_control *wbc)
+{
+	struct extent_map_tree *tree;
+	tree = &EXT2_I(page->mapping->host)->extent_tree;
+	return extent_write_full_page(tree, page, ext2_get_extent, wbc);
+}
+
 static int ext2_writepage(struct page *page, struct writeback_control *wbc)
 {
 	return block_write_full_page(page, ext2_get_block, wbc);
@@ -633,6 +711,42 @@ static int ext2_readpage(struct file *fi
 static int ext2_readpage(struct file *file, struct page *page)
 {
 	return mpage_readpage(page, ext2_get_block);
+}
+
+static int ext2_extent_map_readpage(struct file *file, struct page *page)
+{
+	struct extent_map_tree *tree;
+	tree = &EXT2_I(page->mapping->host)->extent_tree;
+	return extent_read_full_page(tree, page, ext2_get_extent);
+}
+
+static int ext2_extent_map_releasepage(struct page *page,
+				       gfp_t unused_gfp_flags)
+{
+	struct extent_map_tree *tree;
+	int ret;
+
+	if (page->private != 1)
+		return try_to_free_buffers(page);
+	tree = &EXT2_I(page->mapping->host)->extent_tree;
+	ret = try_release_extent_mapping(tree, page);
+	if (ret == 1) {
+		ClearPagePrivate(page);
+		set_page_private(page, 0);
+		page_cache_release(page);
+	}
+	return ret;
+}
+
+
+static void ext2_extent_map_invalidatepage(struct page *page,
+				unsigned long offset)
+{
+	struct extent_map_tree *tree;
+
+	tree =
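The unit conversions in ext2_get_extent are easy to misread: the extent map works in byte offsets, while get_block works in filesystem blocks. The following userspace sketch restates that arithmetic only. The helper names are hypothetical, and blkbits is assumed to be log2 of the block size, as with i_blkbits.

```c
#include <stdint.h>

/* File byte offset -> logical block index, like "start >> inode->i_blkbits". */
uint64_t byte_to_block(uint64_t byte_offset, unsigned int blkbits)
{
	return byte_offset >> blkbits;
}

/* Disk block number -> byte offset on disk, like "b_blocknr << i_blkbits". */
uint64_t block_to_byte(uint64_t blocknr, unsigned int blkbits)
{
	return blocknr << blkbits;
}

/* Inclusive last byte of an extent, like "start + found_len - 1". */
uint64_t extent_end(uint64_t start, uint64_t len)
{
	return start + len - 1;
}
```

With 4096-byte blocks (blkbits of 12), file offset 8192 falls in logical block 2, and disk block 100 starts at byte 409600 and ends at byte 413695.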
Re: [PATCH RFC] extent mapped page cache
On Tue, 24 Jul 2007 23:25:43 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

> On Tue, 2007-07-24 at 16:13 -0400, Trond Myklebust wrote:
>> On Tue, 2007-07-24 at 16:00 -0400, Chris Mason wrote:
>>> On Tue, 10 Jul 2007 17:03:26 -0400 Chris Mason [EMAIL PROTECTED] wrote:
>>>> This patch aims to demonstrate one way to replace buffer heads with a few extent trees. Buffer heads provide a few different features:
>>>>
>>>> 1) Mapping of logical file offset to blocks on disk
>>>> 2) Recording state (dirty, locked etc)
>>>> 3) Providing a mechanism to access sub-page sized blocks.
>>>>
>>>> This patch covers #1 and #2, I'll start on #3 a little later next week.
>>>
>>> Well, almost. I decided to try out an rbtree instead of the radix, which turned out to be much faster. Even though individual operations are slower, the rbtree was able to do many fewer ops to accomplish the same thing, especially for merging extents together. It also uses much less ram.
>>
>> The problem with an rbtree is that you can't use it together with RCU to do lockless lookups. You can probably modify it to allocate nodes dynamically (like the radix tree does) and thus make it RCU-compatible, but then you risk losing the two main benefits that you list above.

The tree is a critical part of the patch, but it is also the easiest to rip out and replace. Basically the code stores a range by inserting an object at an index corresponding to the end of the range. Then it does searches by looking forward from the start of the range. More or less any tree that can search and return the first key >= the requested key will work.

So, I'd be happy to rip out the tree and replace it with something else. Going completely lockless will be tricky, it's something that will need deep thought once the rest of the interface is sane.

-chris
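The lookup rule described above (index each range by its end, then search forward from the start of the requested range) can be sketched with a plain sorted array in place of the tree. This is a minimal illustration, not the code from the patch: struct extent and both function names are hypothetical, and the extents are assumed to be non-overlapping and sorted by end.

```c
#include <stddef.h>
#include <stdint.h>

/* A byte range, inclusive on both ends, indexed by its end offset. */
struct extent {
	uint64_t start;
	uint64_t end;
};

/*
 * Return the index of the first extent whose end is >= offset, or nr if
 * there is none. ext[] must be sorted by end and must not overlap.
 */
size_t first_end_at_or_after(const struct extent *ext, size_t nr,
			     uint64_t offset)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ext[mid].end < offset)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Return the first extent overlapping [start, end], or NULL if none does. */
const struct extent *lookup_range(const struct extent *ext, size_t nr,
				  uint64_t start, uint64_t end)
{
	size_t i = first_end_at_or_after(ext, nr, start);

	/* The candidate ends at or after start; it overlaps if it begins by end. */
	if (i < nr && ext[i].start <= end)
		return &ext[i];
	return NULL;
}
```

Any ordered structure that offers this "first key >= requested key" search (an rbtree, a B-tree, a sorted array) supports the same lookup, which is the reason the tree itself is replaceable.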
Re: [PATCH RFC] extent mapped page cache
On Wed, 25 Jul 2007 04:32:17 +0200 Nick Piggin [EMAIL PROTECTED] wrote:

> On Tue, Jul 24, 2007 at 07:25:09PM -0400, Chris Mason wrote:
>> On Tue, 24 Jul 2007 23:25:43 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
>>
>> The tree is a critical part of the patch, but it is also the easiest to rip out and replace. Basically the code stores a range by inserting an object at an index corresponding to the end of the range. Then it does searches by looking forward from the start of the range. More or less any tree that can search and return the first key >= the requested key will work.
>>
>> So, I'd be happy to rip out the tree and replace it with something else. Going completely lockless will be tricky, it's something that will need deep thought once the rest of the interface is sane.
>
> Just having the other tree and managing it is what makes me a little less positive of this approach, especially using it to store pagecache state when we already have the pagecache tree.
>
> Having another tree to store block state I think is a good idea as I said in the fsblock thread with Dave, but I haven't clicked as to why it is a big advantage to use it to manage pagecache state. (and I can see some possible disadvantages in locking and tree manipulation overhead).

Yes, there are definitely costs with the state tree, it will take some careful benchmarking to convince me it is a feasible solution. But, storing all the state in the pages themselves is impossible unless the block size equals the page size. So, we end up with something like fsblock/buffer heads or the state tree.

One advantage to the state tree is that it separates the state from the memory being described, allowing a simple kmap style interface that covers subpages, highmem and superpages.

It also more naturally matches the way we want to do IO, making for easy clustering.

O_DIRECT becomes a special case of readpages and writepages... the memory used for IO just comes from userland instead of the page cache.

The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying.

-chris
Re: [PATCH RFC] extent mapped page cache
On Thu, 26 Jul 2007 03:37:28 +0200 Nick Piggin [EMAIL PROTECTED] wrote:

>> One advantage to the state tree is that it separates the state from the memory being described, allowing a simple kmap style interface that covers subpages, highmem and superpages.
>
> I suppose so, although we should have added those interfaces long ago ;) The variants in fsblock are pretty good, and you could always do an arbitrary extent (rather than block) based API using the pagecache tree if it would be helpful.

Yes, you could use fsblock for the state bits and make a separate API to map the actual pages.

>> It also more naturally matches the way we want to do IO, making for easy clustering.
>
> Well the pagecache tree is used to reasonable effect for that now. OK the code isn't beautiful ;). Granted, this might be an area where the separate state tree ends up being better. We'll see.

One thing it gains us is finding the start of the cluster. Even if called by kswapd, the state tree allows writepage to find the start of the cluster and send down a big bio (provided I implement trylock to avoid various deadlocks).

>> O_DIRECT becomes a special case of readpages and writepages... the memory used for IO just comes from userland instead of the page cache.
>
> Could be, although you'll probably also need to teach the mm about the state tree and/or still manipulate the pagecache tree to prevent concurrency?

Well, it isn't coded yet, but I should be able to do it from the FS specific ops.

> But isn't the main aim of O_DIRECT to do as little locking and synchronisation with the pagecache as possible? I thought this is why your race fixing patches got put on the back burner (although they did look fairly nice from a correctness POV).

I put the placeholder patches on hold because of a corner case where userland did O_DIRECT from a mmap'd region of the same file (Linus pointed it out to me). Basically my patches had to work in 64k chunks to avoid a deadlock in get_user_pages. With the state tree, I can allow the page to be faulted in but still properly deal with it.

> Well I'm kind of handwaving when it comes to O_DIRECT ;) It does look like this might be another advantage of the state tree (although you aren't allowed to slow down buffered IO to achieve the locking ;)).

;) The O_DIRECT benefit is a fringe thing. I've long wanted to help clean up that code, but the real point of the patch is to make general usage faster and less complex. If I can't get there, the O_DIRECT stuff doesn't matter.

>> The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying.
>
> Definitely, and I'm glad you are. You haven't converted me yet, but I look forward to finding the best ideas from our two approaches when the patches are further along (ext2 port of fsblock coming along, so we'll be able to have races soon :P).

I'm sure we can find some river in Cambridge, winner gets to throw Axboe in.

-chris
Re: [PATCH RFC] extent mapped page cache
On Thu, 26 Jul 2007 04:36:39 +0200 Nick Piggin [EMAIL PROTECTED] wrote:

> [ are state trees a good idea? ]
>
>> One thing it gains us is finding the start of the cluster. Even if called by kswapd, the state tree allows writepage to find the start of the cluster and send down a big bio (provided I implement trylock to avoid various deadlocks).
>
> That's very true, we could potentially also do that with the block extent tree that I want to try with fsblock.

If fsblock records an extent of 200MB, and writepage is called on a page in the middle of the extent, how do you walk the radix backwards to find the first dirty up to date page in the range?

> I'm looking at cleaning up some of these aops APIs so hopefully most of the deadlock problems go away. Should be useful to both our efforts. Will post patches hopefully when I get time to finish the draft this weekend.

Great

>>>> O_DIRECT becomes a special case of readpages and writepages... the memory used for IO just comes from userland instead of the page cache.
>>>
>>> Could be, although you'll probably also need to teach the mm about the state tree and/or still manipulate the pagecache tree to prevent concurrency?
>>
>> Well, it isn't coded yet, but I should be able to do it from the FS specific ops.
>
> Probably, if you invalidate all the pagecache in the range beforehand you should be able to do it (and I guess you want to do the invalidate anyway). Although, below deadlock issues might still bite somewhere...

Well, O_DIRECT is French for deadlocks. But I shouldn't have to worry so much about evicting the pages themselves since I can tag the range.

>>> But isn't the main aim of O_DIRECT to do as little locking and synchronisation with the pagecache as possible? I thought this is why your race fixing patches got put on the back burner (although they did look fairly nice from a correctness POV).
>>
>> I put the placeholder patches on hold because of a corner case where userland did O_DIRECT from a mmap'd region of the same file (Linus pointed it out to me). Basically my patches had to work in 64k chunks to avoid a deadlock in get_user_pages. With the state tree, I can allow the page to be faulted in but still properly deal with it.
>
> Oh right, I didn't think of that one. Would you still have similar issues with the external state tree? I mean, the filesystem doesn't really know why the fault is taken. O_DIRECT read from a file into mmapped memory of the same block in the file is almost hopeless I think.

Racing is fine as long as we don't deadlock or expose garbage from disk.

>>>> The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying.
>>>
>>> Definitely, and I'm glad you are. You haven't converted me yet, but I look forward to finding the best ideas from our two approaches when the patches are further along (ext2 port of fsblock coming along, so we'll be able to have races soon :P).
>>
>> I'm sure we can find some river in Cambridge, winner gets to throw Axboe in.
>
> Very noble of you to donate your colleague to such a worthy cause.

Jens is always interested in helping solve such debates. It's a fantastic service he provides to the community.

-chris
[ANNOUNCE] seekwatcher v0.3 IO graphing and animation
Hello everyone, I've tossed out seekwatcher v0.3. The major changes are using rolling averages to smooth out the seek and throughput graphs, and it can generate mpgs of the IO done by a given trace. Here's a sample of the smoother graphs (creating 20 kernel trees): http://oss.oracle.com/~mason/seekwatcher/ext3_vs_btrfs_vs_xfs.png There are details and sample movies of the kernel tree run at: http://oss.oracle.com/~mason/seekwatcher -chris
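For readers unfamiliar with the smoothing mentioned above, a rolling average replaces each sample with the mean of the last few samples. The announcement does not say which window or which variant seekwatcher uses, so the sketch below assumes a simple trailing window; rolling_average is a hypothetical name and is not taken from seekwatcher.

```c
#include <stddef.h>

/*
 * Trailing rolling average: out[i] is the mean of the last `window`
 * samples ending at in[i] (fewer at the start of the series).
 * A window of 0 is treated as 1, which copies the input unchanged.
 */
void rolling_average(const double *in, double *out, size_t n, size_t window)
{
	double sum = 0.0;
	size_t i, count;

	if (window == 0)
		window = 1;

	for (i = 0; i < n; i++) {
		sum += in[i];
		if (i >= window)
			sum -= in[i - window];	/* drop the sample leaving the window */
		count = (i + 1 < window) ? i + 1 : window;
		out[i] = sum / (double)count;
	}
}
```

A single outlier (one very long seek, say) is spread across `window` points instead of producing a spike, which is what makes the plotted curves easier to compare between runs.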
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Wed, 22 Aug 2007 09:18:41 +0800 Fengguang Wu [EMAIL PROTECTED] wrote:

> On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
>> On Sun, 12 Aug 2007 17:11:20 +0800 Fengguang Wu [EMAIL PROTECTED] wrote:
>>> Andrew and Ken, Here are some more experiments on the writeback stuff. Comments are highly welcome~
>>
>> I've been doing benchmarks lately to try and trigger fragmentation, and one of them is a simulation of make -j N. It takes a list of all the .o files in the kernel tree, randomly sorts them and then creates bogus files with the same names and sizes in clean kernel trees.
>>
>> This is basically creating a whole bunch of files in random order in a whole bunch of subdirectories.
>>
>> The results aren't pretty: http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png
>>
>> The top graph shows one dot for each write over time. It shows that ext3 is basically writing all over the place the whole time. But, ext3 actually wins the read phase, so the layout isn't horrible. My guess is that if we introduce some write clustering by sending a group of inodes down at the same time, it'll go much much better.
>>
>> Andrew has mentioned bringing a few radix trees into the writeback paths before, it seems like file servers and other general uses will benefit from better clustering here.
>>
>> I'm hoping to talk you into trying it out ;)
>
> Thank you for the description of problem. So far I have a similar one in mind: if we are to delay writeback of atime-dirty-only inodes to above 1 hour, some grouping/piggy-backing scenario would be beneficial. (Which I guess does not deserve the complexity now that we have Ingo's make-relatime-default patch.)

Good clustering would definitely help some delayed atime writeback scheme.

> My vague idea is to
> - keep the s_io/s_more_io as a FIFO/cyclic writeback dispatching queue.
> - convert s_dirty to some radix-tree/rbtree based data structure. It would have dual functions: delayed-writeback and clustered-writeback.
>
> clustered-writeback:
> - Use inode number as clue of locality, hence the key for the sorted tree.
> - Drain some more s_dirty inodes into s_io on every kupdate wakeup, but do it in the ascending order of inode number instead of ->dirtied_when.
>
> delayed-writeback:
> - Make sure that a full scan of the s_dirty tree takes <=30s, i.e. dirty_expire_interval.

I think we should assume a full scan of s_dirty is impossible in the presence of concurrent writers. We want to be able to pick a start time (right now) and find all the inodes older than that start time. New things will come in while we're scanning. But perhaps that's what you're saying...

At any rate, we've got two types of lists now. One keeps track of age and the other two keep track of what is currently being written. I would try two things:

1) s_dirty stays a list for FIFO. s_io becomes a radix tree that indexes by inode number (or some arbitrary field the FS can set in the inode). Radix tree tags are used to indicate which things in s_io are already in progress or are pending (hand waving because I'm not sure exactly). Inodes are pulled off s_dirty and the corresponding slot in s_io is tagged to indicate IO has started. Any nearby inodes in s_io are also sent down.

2) s_dirty and s_io both become radix trees. s_dirty is indexed by a sequence number that corresponds to age. It is treated as a big circular indexed list that can wrap around over time. Radix tree tags are used both on s_dirty and s_io to flag which inodes are in progress.

> Notes:
> (1) I'm not sure inode number is correlated to disk location in filesystems other than ext2/3/4. Or parent dir?

In general, it is a better assumption than sorting by time. It may make sense to one day let the FS provide a clustering hint (corresponding to the first block in the file?), but for starters it makes sense to just go with the inode number.

> (2) It duplicates some function of elevators. Why is it necessary? Maybe we have no clue on the exact data location at this time?

The elevator can only sort the pending IO, and we send down a relatively small window of all the dirty pages at a time. If we sent down all the dirty pages and let the elevator sort it out, we wouldn't need this clustering at all.

But, that has other issues ;)

-chris
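Option 1 above (keep age order for deciding what to write next, but send down nearby inodes together) can be modelled in a few lines of userspace C. This is only a sketch of the selection policy under stated assumptions: plain arrays stand in for s_dirty and the radix tree, "nearby" is taken to mean within a fixed window of inode numbers, and struct dirty_inode and pick_cluster are hypothetical names rather than kernel interfaces.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

struct dirty_inode {
	unsigned long ino;		/* inode number, used as the locality hint */
	unsigned long dirtied_when;	/* time the inode was first dirtied */
};

/* Order two dirty inodes by inode number, for qsort(). */
static int cmp_ino(const void *a, const void *b)
{
	const struct dirty_inode *x = a, *y = b;

	return (x->ino > y->ino) - (x->ino < y->ino);
}

/*
 * Pick the oldest dirty inode, then every other dirty inode whose number
 * is within `window` of it. The batch is written to out[] (which must have
 * room for nr entries) in ascending inode order. Returns the batch size.
 */
size_t pick_cluster(const struct dirty_inode *dirty, size_t nr,
		    unsigned long window, struct dirty_inode *out)
{
	size_t i, oldest = 0, n = 0;
	unsigned long lo, hi;

	if (nr == 0)
		return 0;

	/* The oldest inode decides which neighbourhood gets written. */
	for (i = 1; i < nr; i++)
		if (dirty[i].dirtied_when < dirty[oldest].dirtied_when)
			oldest = i;

	/* Clamp the window so it cannot wrap around either end. */
	lo = dirty[oldest].ino > window ? dirty[oldest].ino - window : 0;
	hi = dirty[oldest].ino > ULONG_MAX - window ?
		ULONG_MAX : dirty[oldest].ino + window;

	for (i = 0; i < nr; i++)
		if (dirty[i].ino >= lo && dirty[i].ino <= hi)
			out[n++] = dirty[i];

	qsort(out, n, sizeof(out[0]), cmp_ino);
	return n;
}
```

The oldest inode still bounds how long anything waits, while its neighbours ride along even though they are younger, which is the piggy-backing effect discussed in the thread.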
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Thu, 23 Aug 2007 12:47:23 +1000 David Chinner [EMAIL PROTECTED] wrote:

> On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
>> I think we should assume a full scan of s_dirty is impossible in the presence of concurrent writers. We want to be able to pick a start time (right now) and find all the inodes older than that start time. New things will come in while we're scanning. But perhaps that's what you're saying...
>>
>> At any rate, we've got two types of lists now. One keeps track of age and the other two keep track of what is currently being written. I would try two things:
>>
>> 1) s_dirty stays a list for FIFO. s_io becomes a radix tree that indexes by inode number (or some arbitrary field the FS can set in the inode). Radix tree tags are used to indicate which things in s_io are already in progress or are pending (hand waving because I'm not sure exactly). Inodes are pulled off s_dirty and the corresponding slot in s_io is tagged to indicate IO has started. Any nearby inodes in s_io are also sent down.
>
> The problem with this approach is that it only looks at inode locality. Data locality is ignored completely here and the data for all the inodes that are close together could be splattered all over the drive. In that case, clustering by inode location is exactly the wrong thing to do.

Usually it won't be less wrong than clustering by time.

> For example, XFS changes allocation strategy at 1TB for 32bit inode filesystems which makes the data get placed way away from the inodes. i.e. inodes in AGs below 1TB, all data in AGs > 1TB. Clustering by inode number for data writeback is mostly useless in the >1TB case.

I agree we'll want a way to let the FS provide the clustering key. But for the first cut on the patch, I would suggest keeping it simple.

> The inode32 for <1TB and inode64 allocators both try to keep data close to the inode (i.e. in the same AG) so clustering by inode number might work better here.
>
> Also, it might be worthwhile allowing the filesystem to supply a hint or mask for closeness for inode clustering. This would help the generic code only try to cluster inode writes to inodes that fall into the same cluster as the first inode

Yes, also a good idea after things are working.

>>> Notes:
>>> (1) I'm not sure inode number is correlated to disk location in filesystems other than ext2/3/4. Or parent dir?
>>
>> In general, it is a better assumption than sorting by time. It may make sense to one day let the FS provide a clustering hint (corresponding to the first block in the file?), but for starters it makes sense to just go with the inode number.
>
> Perhaps multiple hints are needed - one for data locality and one for inode cluster locality.

So, my feature creep idea would have been more data clustering. I'm mainly trying to solve this graph:

http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png

Where background writing of the block device inode is making ext3 do seeky writes while creating directory trees. My simple idea was to kick off a 'I've just written block X' call back to the FS, where it may decide to send down dirty chunks of the block device inode that also happen to be dirty.

But, maintaining the kupdate max dirty time and congestion limits in the face of all this clustering gets tricky. So, I wasn't going to suggest it until the basic machinery was working.

Fengguang, this isn't a small project ;) But, lots of people will be interested in the results.

-chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Fri, 24 Aug 2007 21:24:58 +0800 Fengguang Wu [EMAIL PROTECTED] wrote:

>> 2) s_dirty and s_io both become radix trees. s_dirty is indexed by a sequence number that corresponds to age. It is treated as a big circular indexed list that can wrap around over time. Radix tree tags are used both on s_dirty and s_io to flag which inodes are in progress.
>
> It's meaningless to convert s_io to radix tree. Because inodes on s_io will normally be sent to block layer elevators at the same time.

Not entirely, using a radix tree instead lets you tag things instead of doing the current backflips across three lists.

> Also s_dirty holds 30 seconds of inodes, while s_io only 5 seconds. The more inodes, the more chances of good clustering. That's the general rule.
>
> s_dirty is the right place to do address-clustering. As for the dirty_expire_interval parameter on dirty age, we can apply a simple rule: do one full scan/sweep over the fs-address-space in every 30s, syncing all inodes encountered, and sparing those newly dirtied in less than 5s. With that rule, any inode will get synced after being dirtied for 5-35 seconds.

This gives you an O(inodes dirty) behavior instead of the current O(old inodes). It might not matter, but walking the radix tree is more expensive than walking a list.

But, I look forward to your patches, we can tune from there.

-chris
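The 5-35 second figure quoted above follows from simple arithmetic, which the sketch below makes explicit. It is a simplified model rather than the proposed implementation: each sweep is assumed to happen instantaneously at multiples of the period, times are whole seconds, and first_sync_time and sync_delay are hypothetical names.

```c
/*
 * Time of the first sweep at which an inode dirtied at `dirtied_at` is at
 * least `min_age` old. Sweeps are assumed to run at multiples of `period`,
 * which must be non-zero.
 */
unsigned long first_sync_time(unsigned long dirtied_at, unsigned long period,
			      unsigned long min_age)
{
	unsigned long earliest = dirtied_at + min_age;

	/* Round up to the next sweep boundary. */
	return ((earliest + period - 1) / period) * period;
}

/* How long the inode stays dirty before that sweep picks it up. */
unsigned long sync_delay(unsigned long dirtied_at, unsigned long period,
			 unsigned long min_age)
{
	return first_sync_time(dirtied_at, period, min_age) - dirtied_at;
}
```

With a 30s period and a 5s minimum age, an inode dirtied at t=25 is synced by the sweep at t=30 (a 5s delay), while one dirtied at t=26 is spared by that sweep and waits for t=60 (a 34s delay, approaching 35s as the time resolution gets finer).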
Re: [PATCH 0/3] [RFC][PATCH] clustered writeback
On Mon, 27 Aug 2007 05:03:36 -0700 Arjan van de Ven [EMAIL PROTECTED] wrote: On Mon, 27 Aug 2007 19:21:52 +0800 Because it does the work in small batches of 10 inodes, when the system has <=10 dirty inodes, its behavior will reduce to: - do a full sweep *at once* on every 25s Which means the disk will flicker once every 25s, not bad :) 25 seconds is quite not good already though it takes a disk a second or two of no activity to go into low power mode, every 25 seconds means you now have at least a 10% constant power cost I don't know the right answer (well other than make sure inodes aren't dirty, which involves fixing apps to not do as much file operations, as well as relatime) but just every 25s is no big deal isn't really the case ;-( But fixing this isn't the job of this patch. It needs something like the laptop mode logic where it says oh, the disk is awake, let's send stuff down. kupdate hitting on the disk isn't really a new problem, I'd rather address it with a different patch series. -chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Wed, 29 Aug 2007 00:55:30 +1000 David Chinner [EMAIL PROTECTED] wrote: On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote: On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote: On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote: On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote: Notes: (1) I'm not sure inode number is correlated to disk location in filesystems other than ext2/3/4. Or parent dir? They correspond to the exact location on disk on XFS. But, XFS has its own inode clustering (see xfs_iflush) and it can't be moved up into the generic layers because of locking and integration into the transaction subsystem. (2) It duplicates some function of elevators. Why is it necessary? The elevators have no clue as to how the filesystem might treat adjacent inodes. In XFS, inode clustering is a fundamental feature of the inode reading and writing and that is something no elevator can hope to achieve. Thank you. That explains the linear write curve (perfect!) in Chris' graph. I wonder if XFS can benefit any more from the general writeback clustering. How large would be a typical XFS cluster? Depends on inode size. Typically they are 8k in size, so anything from 4-32 inodes. The inode writeback clustering is pretty tightly integrated into the transaction subsystem and has some intricate locking, so it's not likely to be easy (or perhaps even possible) to make it more generic. When I talked to hch about this, he said the order file data pages got written in XFS was still dictated by the order the higher layers sent things down. Shouldn't the clustering still help to have delalloc done in inode order instead of in whatever random order pdflush sends things down now? -chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Wed, 29 Aug 2007 02:33:08 +1000 David Chinner [EMAIL PROTECTED] wrote: On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote: I wonder if XFS can benefit any more from the general writeback clustering. How large would be a typical XFS cluster? Depends on inode size. Typically they are 8k in size, so anything from 4-32 inodes. The inode writeback clustering is pretty tightly integrated into the transaction subsystem and has some intricate locking, so it's not likely to be easy (or perhaps even possible) to make it more generic. When I talked to hch about this, he said the order file data pages got written in XFS was still dictated by the order the higher layers sent things down. Sure, that's file data. I was talking about the inode writeback, not the data writeback. I think we're trying to gain different things from inode based clustering...I'm not worried that the inode be next to the data. I'm going under the assumption that most of the time, the FS will try to allocate inodes in groups in a directory, and so most of the time the data blocks for inode N will be close to inode N+1. So what I'm really trying for here is data block clustering when writing multiple inodes at once. This matters most when files are relatively small and written in groups, which is a common workload. It may make the most sense to change the patch to supply some key for the data block clustering instead of the inode number, but it's an easy first pass. -chris
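The "easy first pass" described above amounts to ordering a batch of dirty inodes by some clustering key before sending them down. A minimal userspace sketch follows, assuming the key is simply the inode number; the struct and function names are hypothetical and not taken from any posted patch, and a filesystem-supplied hint could later replace the key without changing the sort.

```c
#include <stdlib.h>

/* Hypothetical record for one dirty inode in a writeback batch. */
struct sketch_dirty_inode {
	unsigned long ino;          /* inode number */
	unsigned long cluster_key;  /* first pass: same as ino; an FS hint later */
	unsigned long dirtied_when; /* age information, unused by the sort */
};

/* qsort comparator: ascending cluster_key, written to avoid overflow. */
static int sketch_cmp_cluster_key(const void *a, const void *b)
{
	const struct sketch_dirty_inode *x = a;
	const struct sketch_dirty_inode *y = b;

	return (x->cluster_key > y->cluster_key) -
	       (x->cluster_key < y->cluster_key);
}

/* Reorder a batch so neighbouring keys are written back to back. */
static void sketch_order_for_writeback(struct sketch_dirty_inode *batch,
				       size_t n)
{
	qsort(batch, n, sizeof(*batch), sketch_cmp_cluster_key);
}
```

Whether neighbouring inode numbers really mean neighbouring data blocks depends on the filesystem's allocator, which is exactly the assumption being debated in this thread.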
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Sun, 12 Aug 2007 17:11:20 +0800 Fengguang Wu [EMAIL PROTECTED] wrote: Andrew and Ken, Here are some more experiments on the writeback stuff. Comments are highly welcome~ I've been doing benchmarks lately to try and trigger fragmentation, and one of them is a simulation of make -j N. It takes a list of all the .o files in the kernel tree, randomly sorts them and then creates bogus files with the same names and sizes in clean kernel trees. This is basically creating a whole bunch of files in random order in a whole bunch of subdirectories. The results aren't pretty: http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png The top graph shows one dot for each write over time. It shows that ext3 is basically writing all over the place the whole time. But, ext3 actually wins the read phase, so the layout isn't horrible. My guess is that if we introduce some write clustering by sending a group of inodes down at the same time, it'll go much much better. Andrew has mentioned bringing a few radix trees into the writeback paths before, it seems like file servers and other general uses will benefit from better clustering here. I'm hoping to talk you into trying it out ;) -chris
Re: [PATCH 00/23] per device dirty throttling -v8
On Sun, 5 Aug 2007 11:00:29 -0400 Theodore Tso [EMAIL PROTECTED] wrote: On Sun, Aug 05, 2007 at 02:26:53AM +0200, Andi Kleen wrote: I always thought the right solution would be to just sync atime only very very lazily. This means if an inode is only dirty because of an atime update put it on a only write out when there is nothing to do or the memory is really needed list. As I've mentioned earlier, the memory balancing issues that arise when we add an atime dirty bit scare me a little. It can be addressed, obviously, but at the cost of more code complexity. ext3 and reiser both use a dirty_inode method to make sure that we don't actually have dirty inodes. This way, kswapd doesn't get stuck on the log and is able to do real work. It would be interesting to see a comparison of relatime with a kinoded that is willing to get stuck on the log. The FS would need a few tweaks so that write_inode() could know if it really needed to log or not, but for testing you could just drop ext3_dirty_inode and have ext3_write_inode do real work. Then just change kswapd to kick a new kinoded and benchmark away. A real patch would have to look for places where mark_inode_dirty was used and expected the dirty_inode callback to log things right away, but for testing it's good enough. -chris
More Large blocksize benchmarks
Hello everyone, I'm stealing the cc list and reviving an old thread because I've finally got some numbers to go along with the Btrfs variable blocksize feature. The basic idea is to create a read/write interface to map a range of bytes on the address space, and use it in Btrfs for all metadata operations (file operations have always been extent based). So, instead of casting buffer_head->b_data to some structure, I read and write at offsets in a struct extent_buffer. The extent buffer is very small and backed by an address space, and I get large block sizes the same way file_write gets to write to 16k at a time, by finding the appropriate page in the address space. This is an oversimplification since I try to cache these mapping decisions to avoid using too much CPU, but hopefully you get the idea. The advantage to this approach is the changes are all inside Btrfs. No extra kernel patches were required. Dave reported that XFS saw much higher write throughput with large blocksizes, but so far I'm seeing the most benefits during reads. The next step is a bunch more benchmarks. I've done the first round and posted it here: http://oss.oracle.com/~mason/blocksizes/ The Btrfs code makes it relatively easy to experiment, and so this may be a good step toward figuring out if some automagic solution is worth it in general. I can even use different sizes for nodes and leaves, although I haven't done much testing at all there yet. -chris
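To make the "read and write at offsets" idea above concrete, here is a small userspace sketch of a buffer that spans several pages and is accessed by byte offset, one page-sized chunk at a time. It is an illustration only: the sketch_ names, the fixed 4096-byte page size and the plain array of page pointers are assumptions, and the real Btrfs extent_buffer (backed by an address space, with cached mappings) is not shown in this mail.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for an extent buffer: a block made of pages. */
struct sketch_extent_buffer {
	unsigned char **pages; /* one pointer per backing page */
	size_t len;            /* total block size in bytes */
};

/* Copy between a flat buffer and the pages, splitting at page borders. */
static int sketch_eb_copy(const struct sketch_extent_buffer *eb,
			  unsigned char *flat, size_t start, size_t len,
			  int to_eb)
{
	if (start > eb->len || len > eb->len - start)
		return -1; /* range falls outside the block */

	while (len > 0) {
		size_t index = start / SKETCH_PAGE_SIZE; /* which page */
		size_t off = start % SKETCH_PAGE_SIZE;   /* offset inside it */
		size_t chunk = SKETCH_PAGE_SIZE - off;

		if (chunk > len)
			chunk = len;
		if (to_eb)
			memcpy(eb->pages[index] + off, flat, chunk);
		else
			memcpy(flat, eb->pages[index] + off, chunk);
		flat += chunk;
		start += chunk;
		len -= chunk;
	}
	return 0;
}

/* Read len bytes starting at byte offset start of the block. */
static int sketch_read_eb(const struct sketch_extent_buffer *eb, void *dst,
			  size_t start, size_t len)
{
	return sketch_eb_copy(eb, dst, start, len, 0);
}

/* Write len bytes starting at byte offset start of the block. */
static int sketch_write_eb(const struct sketch_extent_buffer *eb,
			   const void *src, size_t start, size_t len)
{
	return sketch_eb_copy(eb, (unsigned char *)src, start, len, 1);
}
```

Because callers only ever name an offset and a length, the block size can exceed the page size without any structure being cast directly onto a single page's memory.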
Re: More Large blocksize benchmarks
On Tue, 2007-10-16 at 12:36 +1000, David Chinner wrote: On Mon, Oct 15, 2007 at 08:22:31PM -0400, Chris Mason wrote: Hello everyone, I'm stealing the cc list and reviving an old thread because I've finally got some numbers to go along with the Btrfs variable blocksize feature. The basic idea is to create a read/write interface to map a range of bytes on the address space, and use it in Btrfs for all metadata operations (file operations have always been extent based). So, instead of casting buffer_head->b_data to some structure, I read and write at offsets in a struct extent_buffer. The extent buffer is very small and backed by an address space, and I get large block sizes the same way file_write gets to write to 16k at a time, by finding the appropriate page in the address space. This is an oversimplification since I try to cache these mapping decisions to avoid using too much CPU, but hopefully you get the idea. The advantage to this approach is the changes are all inside Btrfs. No extra kernel patches were required. Dave reported that XFS saw much higher write throughput with large blocksizes, but so far I'm seeing the most benefits during reads. Apples to oranges, Chris ;) Grin, if the two were the same, there'd be no reason to write a new one. I didn't expect faster writes on btrfs, at least not for workloads that did not require reads. The basic idea is to show there are a variety of ways the larger blocks can improve (and hurt) performance. Also, vmap isn't the only implementation path. It's true the Btrfs changes for this were huge, but a big chunk of the changes were for different leaf/node blocksizes, something that may never get used in practice. -chris
Re: New SCM and commit list
On Monday 11 April 2005 03:38, Ingo Molnar wrote: * Linus Torvalds [EMAIL PROTECTED] wrote: So anything that got modified in just one tree obviously merges to that version. Any file that got modified in two trees will end up just being passed to the merge program. See man merge and man diff3. The merger gets to fix up any conflicts by hand. at that point Chris Mason's rej tool is pretty nifty: ftp://ftp.suse.com/pub/people/mason/rej/rej-0.13.tar.gz (There is no fully automatic mode where it would not bother the user with the really trivial rejects - but it has an automatic mode where you basically have to do nothing - maybe a fully automatic one could be added that would resolve low-risk rejects?) rej -M skips the merge program, so rej -a -M will give you something like this: coffee:/local/linux.p # rej -a -M drivers/ide/ide.c.rej drivers/ide/ide.c: 1 matched, 0 conflicts remain But I would want to go over the bit that calculates the conflicts remaining more carefully if people plan on trusting this ;) It'll run on unified diffs too, although it will be slower than patch since the assumption is that the quick and easy placement patch does has already failed. (that's easy enough to fix though). it's really easy to use (but then again i'm a vim user, so i'm biased), just try it on a random .rej file you have (rej -a kernel/sched.c.rej or whatever). you can rej -m kdiff3|meld|tkdiff or any program that does a side by side comparison of two files. (export REJMERGE=foo sets the diff prog as well) I use rej frequently to merge patches in here, but that is mostly because there is no easy way to get the common ancestor and parent revision of the patches I'm merging. With that info in hand, kdiff3 is pretty nice.
You would have to spoon feed it the renames, but it should have most of the other features you're looking for, including the 'no gui if all conflicts are auto-solvable' -chris
Re: New SCM and commit list
On Monday 11 April 2005 08:51, Chris Mason wrote: rej -M skips the merge program, so rej -a -M will give you something like this: coffee:/local/linux.p # rej -a -M drivers/ide/ide.c.rej drivers/ide/ide.c: 1 matched, 0 conflicts remain But I would want to go over the bit that calculates the conflicts remaining more carefully if people plan on trusting this ;) Ok, looks like this should be safe. I changed -q to skip the gui compare when rej thinks it has resolved all the conflicts correctly. With rej 0.14 (just uploaded now) this should do what you want: rej -q -a foo.rej Download site is here: ftp://ftp.suse.com/pub/people/mason/rej/ Please let me know if you find patches where rej is doing the wrong thing. -chris
Re: 2.6.12.2 dies after 24 hours
On Tuesday 12 July 2005 20:27, Rob Mueller wrote: We're also applying the attached patch. There's a bug in reiserfs that gets tickled by our huge MMAP usage (it's amazing what really busy Cyrus daemons can do to a server, ouch). It's fixed in generic_write, so we take the few percent performance hit for something that doesn't break! Interesting - When I got the problem it was on mail servers under high load (handling 60,000 emails per hour) with reiserfs as file system. I have seen this problem on 5 different servers so I am confident that it is not hardware failure. Sometimes the server load just rises and then the server dies, other times the load rises but the kernel manages to get it back alive, filling up syslog with messages like this Sounds like a different issue. The patch Bron included before fixes (or at least reduces to the point where it fixes it for us) a problem where processes get stuck in D state and are unkillable. A reboot is required to remove them. Apparently this is a known bug in ReiserFS (see messages below). As noted, the same bug exists in ext3. There appear to have been some patches to try and fix it for both reiserfs and ext3, but I'm not sure if they're in the mainline kernel yet. http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/2056.html http://hulllug.principalhosting.net/archive/index.php/t-22774.html There is a much less complex solution that I've just recently gotten working in the SUSE kernel. If reiser3/ext3 don't log the inode during atime updates, the problem goes away. You can solve this now by mounting with -o noatime (although that might not play well with cyrus, not sure). My current patch works around this in ugly ways, what I plan on doing during OLS is finding out why ext3 is still logging the inode all the time. For reiser3, this was to avoid kswapd having to log a bunch of inodes in response to memory pressure, but that was back in 2.4 when things were different. We shouldn't need to do it anymore...
-chris
Re: 2.6.12.2 dies after 24 hours
On Tuesday 12 July 2005 20:42, Chris Mason wrote: Sounds like a different issue. The patch Bron included before fixes (or at least reduces to the point where it fixes it for us) a problem where processes get stuck in D state and are unkillable. A reboot is required to remove them. Apparently this is a known bug in ReiserFS (see messages below). As noted, the same bug exists in ext3. There appears to have been some patches to try and fix it for both reiserfs and ext3, but I'm not sure if they're in the mainline kernel yet. http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/2056.html http://hulllug.principalhosting.net/archive/index.php/t-22774.html There is a much less complex solution that I've just recently gotten working in the SUSE kernel. If reiser3/ext3 don't log the inode during atime updates, the problem goes away. The sysrq is huge, and I haven't yet found the person holding the transaction open. But, here's another place that starts a transaction with the mmap sem held, and I would guess the transaction writer is waiting on something for that mmap sem. 
atime updates alone won't fix this one

-chris

imapd D F3159530 0 32412 2292 32413 32411 (NOTLB)
e1cdfdfc 0082 0008 f3159530 0202 c1e5b0a0 c013b1a0 0034 0202 c301b520 0001 4609 beca3f0b 5566 c301b520 c315b530 f3159530 f3159654 0001 000e d9309dac
Call Trace:
 [c013b1a0] free_hot_cold_page+0x20/0xd0
 [c01ac8dd] queue_log_writer+0x5d/0x80
 [c0114b10] default_wake_function+0x0/0x20
 [c01acb8a] do_journal_begin_r+0x1ca/0x2b0
 [c01409e0] truncate_inode_pages+0x290/0x2b0
 [c01ace9e] journal_begin+0x8e/0xe0
 [c0191061] reiserfs_delete_inode+0x51/0xc0
 [c01447fa] unmap_vmas+0x14a/0x260
 [c0191010] reiserfs_delete_inode+0x0/0xc0
 [c016c97d] generic_delete_inode+0x7d/0xe0
 [c016cb83] iput+0x63/0x70
 [c0169db6] dput+0x176/0x1b0
 [c01547cb] __fput+0xcb/0x100
 [c01470ff] remove_vm_struct+0x5f/0x80
 [c014873a] unmap_vma_list+0x1a/0x30
 [c0148a9f] do_munmap+0xdf/0xf0
 [c0148aff] sys_munmap+0x4f/0x70
 [c0102a15] syscall_call+0x7/0xb
Re: 2.6.12.2 dies after 24 hours
On Tuesday 12 July 2005 20:50, Rob Mueller wrote: Are you saying that if you mount with noatime *and* use your new patch it will fix the problem? What about the 2 threads linked to. Did those end up getting anywhere? Sorry for the confusion, you're hitting the other mmap_sem - transaction lock problem. This one should be solvable with an iget so we make sure not to do the final unlink until after the mmap sem is dropped. Let's see what I can do... -chris
Re: aio-stress throughput regressions from 2.6.11 to 2.6.12
On Friday 01 July 2005 03:56, Suparna Bhattacharya wrote: Has anyone else noticed major throughput regressions for random reads/writes with aio-stress in 2.6.12 ? Or have there been any other FS/IO regressions lately ? On one test system I see a degradation from around 17+ MB/s to 11MB/s for random O_DIRECT AIO (aio-stress -o3 testext3/rwfile5) from 2.6.11 to 2.6.12. It doesn't seem filesystem specific. Not good :( BTW, Chris/Ben, it doesn't look like the changes to aio.c have had an impact (I copied those back to my 2.6.11 tree and tried the runs with no effect) So it is something else ... Ideas/thoughts/observations ? Let's try to narrow it down a bit: aio-stress -o 3 -d 1 will set the depth to 1, (io_submit then wait one request at a time). This doesn't take the aio subsystem out of the picture, but it does make the block layer interaction more or less the same as non-aio benchmarks. If this is slow, I would suspect something in the block layer, and iozone -I -i 2 -w -f testext3/rwfile5 should also show the regression. If it doesn't regress, I would suspect something in the aio core. My first attempts at the context switch reduction patches caused this kind of regression. There was too much latency in sending the events up to userland. Other options: try different elevators; try O_SYNC aio random writes; try aio random reads; try buffered random reads. -chris
Re: NFS Client patch
On Friday, July 20, 2001 10:50:57 AM +0200 Trond Myklebust [EMAIL PROTECTED] wrote: == Hans Reiser [EMAIL PROTECTED] writes: The current code does rely on hidden knowledge of the filesystem on the server, and refuses to operate with any FS that does not describe a position in a directory as an offset or hash that fits into 32 or 64 bits. I'm not saying that ReiserFS is wrong to question the correctness of this. I'm just saying that NFSv2 and v3 are fixed protocols, and that it's too late to do anything about them. I read Chris' mail as a suggestion of creating yet another NQNFS, and this would IMHO be a mistake. Better to concentrate on NFSv4 which is meant to be extendible. Ah, then I was unclear...I think that while we certainly could make linux (or reiserfs) specific changes to NFSvOld, it would be a really bad idea. In my mind, the biggest strength behind NFS is its cross platform support, and maintaining some extension would only be slightly more fun than daily visits to the dentist ;-) I also think it is easy to call NFSv4 poorly designed, but much harder to design it to exploit the strengths of every FS on every unix flavor. Shrug, there are tradeoffs everywhere. I don't plan on supporting NFSv4 because it is the best network filesystem ever made, but because it is in our best interest to be compatible with those kinds of industry standards. -chris
[PATCH] speedup reiserfs O_SYNC and fsync
Hello everyone,

This patch makes reiserfs O_SYNC and fsync faster by only committing the last transaction a file/dir was included in, instead of forcing a commit on the current transaction.

More speedups are still possible, this patch is fairly conservative. It is based on 2.4.7-pre6 + the direct-indirect target flushing patch I just sent. More testers would be greatly appreciated ;-)

Note, this changes the reiserfs in-core inode. Module users need to recompile the whole kernel.

-chris

diff -Nru a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c
--- a/fs/reiserfs/dir.c Thu Jul 12 10:46:26 2001
+++ b/fs/reiserfs/dir.c Thu Jul 12 10:46:26 2001
@@ -47,22 +47,10 @@
 };
 int reiserfs_dir_fsync(struct file *filp, struct dentry *dentry, int datasync) {
-  int ret = 0 ;
-  int windex ;
-  struct reiserfs_transaction_handle th ;
-
   lock_kernel();
-
-  journal_begin(&th, dentry->d_inode->i_sb, 1) ;
-  windex = push_journal_writer("dir_fsync") ;
-  reiserfs_prepare_for_journal(th.t_super, SB_BUFFER_WITH_SB(th.t_super), 1) ;
-  journal_mark_dirty(&th, dentry->d_inode->i_sb, SB_BUFFER_WITH_SB (dentry->d_inode->i_sb)) ;
-  pop_journal_writer(windex) ;
-  journal_end_sync(&th, dentry->d_inode->i_sb, 1) ;
-
-  unlock_kernel();
-
-  return ret ;
+  reiserfs_commit_for_inode(dentry->d_inode) ;
+  unlock_kernel() ;
+  return 0 ;
 }
diff -Nru a/fs/reiserfs/file.c b/fs/reiserfs/file.c
--- a/fs/reiserfs/file.c Thu Jul 12 10:46:26 2001
+++ b/fs/reiserfs/file.c Thu Jul 12 10:46:26 2001
@@ -50,6 +50,7 @@
     lock_kernel() ;
     down (&inode->i_sem);
     journal_begin(&th, inode->i_sb, JOURNAL_PER_BALANCE_CNT * 3) ;
+    reiserfs_update_inode_transaction(inode) ;
 #ifdef REISERFS_PREALLOCATE
     reiserfs_discard_prealloc (&th, inode);
@@ -83,10 +84,7 @@
   int datasync
   ) {
   struct inode * p_s_inode = p_s_dentry->d_inode;
-  struct reiserfs_transaction_handle th ;
   int n_err = 0;
-  int windex ;
-  int jbegin_count = 1 ;
   lock_kernel() ;
@@ -94,14 +92,9 @@
       BUG ();
   n_err = fsync_inode_buffers(p_s_inode) ;
-  /* commit the current transaction to flush any metadata
-  ** changes. sys_fsync takes care of flushing the dirty pages for us
-  */
-  journal_begin(&th, p_s_inode->i_sb, jbegin_count) ;
-  windex = push_journal_writer("sync_file") ;
-  reiserfs_update_sd(&th, p_s_inode);
-  pop_journal_writer(windex) ;
-  journal_end_sync(&th, p_s_inode->i_sb,jbegin_count) ;
+
+  reiserfs_commit_for_inode(p_s_inode) ;
+
   unlock_kernel() ;
   return ( n_err < 0 ) ? -EIO : 0;
 }
diff -Nru a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c Thu Jul 12 10:46:26 2001
+++ b/fs/reiserfs/inode.c Thu Jul 12 10:46:26 2001
@@ -41,6 +41,7 @@
     down (&inode->i_sem);
     journal_begin(&th, inode->i_sb, jbegin_count) ;
+    reiserfs_update_inode_transaction(inode) ;
     windex = push_journal_writer("delete_inode") ;
     reiserfs_delete_object (&th, inode);
@@ -232,6 +233,7 @@
   reiserfs_update_sd(th, inode) ;
   journal_end(th, s, len) ;
   journal_begin(th, s, len) ;
+  reiserfs_update_inode_transaction(inode) ;
 }
 // it is called by get_block when create == 0. Returns block number
@@ -567,6 +569,7 @@
                   TYPE_ANY, 3/*key length*/);
     if ((new_offset + inode->i_sb->s_blocksize) >= inode->i_size) {
         journal_begin(&th, inode->i_sb, jbegin_count) ;
+        reiserfs_update_inode_transaction(inode) ;
         transaction_started = 1 ;
     }
  research:
@@ -591,6 +594,7 @@
         if (!transaction_started) {
             pathrelse(&path) ;
             journal_begin(&th, inode->i_sb, jbegin_count) ;
+            reiserfs_update_inode_transaction(inode) ;
             transaction_started = 1 ;
             goto research ;
         }
@@ -658,6 +662,7 @@
                 */
                 pathrelse(&path) ;
                 journal_begin(&th, inode->i_sb, jbegin_count) ;
+                reiserfs_update_inode_transaction(inode) ;
                 transaction_started = 1 ;
                 goto research;
             }
@@ -1277,6 +1282,10 @@
         return ;
     }
     lock_kernel() ;
+
+    /* this is really only used for atime updates, so they don't have
+    ** to be included in O_SYNC or fsync
+    */
     journal_begin(&th, inode->i_sb, 1) ;
     reiserfs_update_sd (&th, inode);
     journal_end(&th, inode->i_sb, 1) ;
@@ -1650,6 +1659,7 @@
     ** (it will unmap bh if it packs).
     */
     journal_begin(&th, p_s_inode->i_sb, JOURNAL_PER_BALANCE_CNT * 2 ) ;
+    reiserfs_update_inode_transaction(p_s_inode) ;
     windex = push_journal_writer("reiserfs_vfs_truncate_file") ;
     reiserfs_do_truncate (&th, p_s_inode, page, update_timestamps) ;
     pop_journal_writer(windex) ;
@@ -1696,6 +1706,7 @@
 start_over:
     lock_kernel() ;
     journal_begin(&th, inode->i_sb, jbegin_count) ;
+    reiserfs_update_inode_transaction(inode) ;
     make_cpu_key(&key, inode, byte_offset, TYPE_ANY, 3) ;
@@ -1927,22 +1938,34 @@
 static int reiserfs_commit_write(struct file *f, struct page *page,
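The archived patch is cut off before the hunks that define reiserfs_update_inode_transaction() and reiserfs_commit_for_inode(), so the following is only a guess at the shape of the idea, not the reiserfs code. It is a userspace sketch with hypothetical sketch_ names: each inode remembers the id of the last transaction that touched it, and fsync commits only up to that id instead of forcing out the current transaction.

```c
/* Hypothetical journal state: ids grow by one per transaction. */
struct sketch_journal {
	unsigned long current_trans;  /* id of the running transaction */
	unsigned long last_committed; /* newest id already on disk */
	unsigned long commits_done;   /* counts commit work, for illustration */
};

/* Hypothetical in-core inode field; 0 means "never in a transaction". */
struct sketch_inode {
	unsigned long trans_id;
};

/* Call after journal_begin(): note which transaction touched the inode. */
static void sketch_update_inode_transaction(const struct sketch_journal *j,
					    struct sketch_inode *inode)
{
	inode->trans_id = j->current_trans;
}

/* Commit every transaction up to and including id. */
static void sketch_commit_up_to(struct sketch_journal *j, unsigned long id)
{
	while (j->last_committed < id) {
		j->last_committed++;
		j->commits_done++;
	}
	/* If the running transaction was committed, a new one starts. */
	if (j->current_trans <= j->last_committed)
		j->current_trans = j->last_committed + 1;
}

/* fsync path: returns 1 if a commit was needed, 0 if already on disk. */
static int sketch_commit_for_inode(struct sketch_journal *j,
				   const struct sketch_inode *inode)
{
	if (inode->trans_id == 0 || inode->trans_id <= j->last_committed)
		return 0;
	sketch_commit_up_to(j, inode->trans_id);
	return 1;
}
```

The saving comes from the first branch in sketch_commit_for_inode(): if the inode's last transaction is already on disk, fsync has no journal work to do, no matter how much unrelated activity sits in the running transaction.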
[reiserfs-list] Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure
On Tuesday, January 29, 2002 01:46:43 PM +0300 Hans Reiser [EMAIL PROTECTED] wrote: Alexander Viro wrote: On Tue, 29 Jan 2002, Hans Reiser wrote: This fails to recover an object (e.g. dcache entry) which is used once, and then spends a year in cache on the same page as an object which is hot all the time. This means that the hot set of objects becomes diffused over an order of magnitude more pages than if garbage collection squeezes them all together. That makes for very poor caching. Any GC that is going to move active dentries around is out of question. It would need a locking of such strength that you would be the first to cry bloody murder - about 5 seconds after you look at the scalability benchmarks. I don't mean to suggest that the dentry cache locking is an easy problem to solve, but the problem discussed is a real one, and it is sufficient to illustrate that the unified cache is fundamentally flawed as an algorithm compared to using subcache plugins. It isn't just dentries. If a subcache object is in use, it can't be moved to a warmer page without invalidating all existing pointers to it. If it isn't in use, it can be migrated when the VM asks for the page to be flushed. -chris
Re: 3.4.4-rt13: btrfs + xfstests 006 = BOOM.. and a bonus rt_mutex deadlock report for absolutely free!
On Fri, Jul 13, 2012 at 04:26:26AM -0600, Thomas Gleixner wrote: On Fri, 13 Jul 2012, Mike Galbraith wrote: On Fri, 2012-07-13 at 11:52 +0200, Thomas Gleixner wrote: On Fri, 13 Jul 2012, Mike Galbraith wrote: On Thu, 2012-07-12 at 15:31 +0200, Thomas Gleixner wrote: Bingo, that makes it more likely that this is caused by copying w/o initializing the lock and then freeing the original structure. A quick check for memcpy finds that __btrfs_close_devices() does a memcpy of btrfs_device structs w/o initializing the lock in the new copy, but I have no idea whether that's the place we are looking for. Thanks a bunch Thomas. I doubt I would have ever figured out that lala land resulted from _copying_ a lock. That's one I won't be forgetting any time soon. Box not only survived a few thousand xfstests 006 runs, dbench seemed disinterested in deadlocking virgin 3.0-rt. Cute. It think that the lock copying caused the deadlock problem as the list pointed to the wrong place, so we might have ended up with following down the wrong chain when walking the list as long as the original struct was not freed. That beast is freed under RCU so there could be a rcu read side critical section fiddling with the old lock and cause utter confusion. Virgin 3.0-rt appears to really be solid. But then it doesn't have pesky rwlocks. Ah. So 3.0 is not having those rwlock thingies. Bummer. /me goes and writes a nastigram^W proper changelog btrfs still locks up in my enterprise kernel, so I suppose I had better plug your fix into 3.4-rt and see what happens, and go beat hell out of virgin 3.0-rt again to be sure box really really survives dbench. A test against 3.4-rt sans enterprise mess might be nice as well. Enterprise is 3.0-stable with um 555 btrfs patches (oh dear). Virgin 3.4-rt and 3.2-rt deadlock gripe. Enterprise doesn't gripe, but deadlocks, so I have another adventure in my future even if I figure out wth to do about rwlocks. Hrmpf. /me goes to stare into fs/btrfs/ some more. 
Please post the deadlocks here, I'll help ;) -chris
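The lock-copying problem Thomas describes in the quoted text is easy to reproduce in miniature. The sketch below is a userspace model, not the btrfs or rt_mutex code: it assumes a lock whose wait list is an embedded, self-referential list head (all sketch_ names are hypothetical). A raw memcpy leaves the copy's list pointing into the original structure; re-initializing the lock after the copy repairs it.

```c
#include <string.h>

/* Minimal circular list head, in the style of the kernel's list_head. */
struct sketch_list {
	struct sketch_list *next, *prev;
};

/* Hypothetical lock whose waiters hang off an embedded list head. */
struct sketch_lock {
	struct sketch_list waiters;
};

/* Hypothetical device structure that embeds such a lock. */
struct sketch_device {
	int id;
	struct sketch_lock lock;
};

/* An idle lock has an empty wait list that points at itself. */
static void sketch_lock_init(struct sketch_lock *l)
{
	l->waiters.next = &l->waiters;
	l->waiters.prev = &l->waiters;
}

/* True if an idle lock's list head still refers to its own storage. */
static int sketch_lock_is_sane(const struct sketch_lock *l)
{
	return l->waiters.next == &l->waiters &&
	       l->waiters.prev == &l->waiters;
}

/* Copy the payload, then give the copy a freshly initialized lock. */
static void sketch_clone_device(struct sketch_device *dst,
				const struct sketch_device *src)
{
	memcpy(dst, src, sizeof(*dst));
	sketch_lock_init(&dst->lock);
}
```

After a bare memcpy, sketch_lock_is_sane() fails on the copy because its next and prev still hold the address of the original's list head, and once the original is freed anything walking that list is following stale pointers.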
Re: 3.4.4-rt13: btrfs + xfstests 006 = BOOM.. and a bonus rt_mutex deadlock report for absolutely free!
On Wed, Jul 11, 2012 at 11:47:40PM -0600, Mike Galbraith wrote: Greetings, [ deadlocks with btrfs and the recent RT kernels ] I talked with Thomas about this and I think the problem is the single-reader nature of the RW rwlocks. The lockdep report below mentions that btrfs is calling: [ 692.963099] [811fabd2] btrfs_clear_path_blocking+0x32/0x70 In this case, the task has a number of blocking read locks on the btrfs buffers, and we're trying to turn them back into spinning read locks. Even though btrfs is taking the read rwlock, it doesn't think of this as a new lock operation because we were blocking out new writers. If the second task has taken the spinning read lock, it is going to prevent that clear_path_blocking operation from progressing, even though it would have worked on a non-RT kernel. The solution should be to make the blocking read locks in btrfs honor the single-reader semantics. This means not allowing more than one blocking reader and not allowing a spinning reader when there is a blocking reader. Strictly speaking btrfs shouldn't need recursive readers on a single lock, so I wouldn't worry about that part. There is also a chunk of code in btrfs_clear_path_blocking that makes sure to strictly honor top down locking order during the conversion. It only does this when lockdep is enabled because in non-RT kernels we don't need to worry about it. For RT we'll want to enable that as well. I'll give this a shot later today. -chris -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.4.4-rt13: btrfs + xfstests 006 = BOOM.. and a bonus rt_mutex deadlock report for absolutely free!
On Sat, Jul 14, 2012 at 04:14:43AM -0600, Mike Galbraith wrote: On Fri, 2012-07-13 at 08:50 -0400, Chris Mason wrote: On Wed, Jul 11, 2012 at 11:47:40PM -0600, Mike Galbraith wrote: Greetings, [ deadlocks with btrfs and the recent RT kernels ] I talked with Thomas about this and I think the problem is the single-reader nature of the RW rwlocks. The lockdep report below mentions that btrfs is calling: [ 692.963099] [811fabd2] btrfs_clear_path_blocking+0x32/0x70 In this case, the task has a number of blocking read locks on the btrfs buffers, and we're trying to turn them back into spinning read locks. Even though btrfs is taking the read rwlock, it doesn't think of this as a new lock operation because we were blocking out new writers. If the second task has taken the spinning read lock, it is going to prevent that clear_path_blocking operation from progressing, even though it would have worked on a non-RT kernel. The solution should be to make the blocking read locks in btrfs honor the single-reader semantics. This means not allowing more than one blocking reader and not allowing a spinning reader when there is a blocking reader. Strictly speaking btrfs shouldn't need recursive readers on a single lock, so I wouldn't worry about that part. There is also a chunk of code in btrfs_clear_path_blocking that makes sure to strictly honor top down locking order during the conversion. It only does this when lockdep is enabled because in non-RT kernels we don't need to worry about it. For RT we'll want to enable that as well. I'll give this a shot later today. I took a poke at it. Did I do something similar to what you had in mind, or just hide behind performance stealing paranoid trylock loops? Box survived 1000 x xfstests 006 and dbench [-s] massive right off the bat, so it gets posted despite skepticism. Great, thanks! I got stuck in bug land on Friday. You mentioned performance problems earlier on Saturday, did this improve performance? 
One other question:

> again:
> +#ifdef CONFIG_PREEMPT_RT_BASE
> +	while (atomic_read(&eb->blocking_readers))
> +		cpu_chill();
> +	while(!read_trylock(&eb->lock))
> +		cpu_chill();
> +	if (atomic_read(&eb->blocking_readers)) {
> +		read_unlock(&eb->lock);
> +		goto again;
> +	}

Why use read_trylock() in a loop instead of just trying to take the lock? Is this an RTism or are there other reasons?

-chris
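The wait, trylock, recheck pattern in the quoted hunk can be sketched in userspace roughly as follows. This is an illustration under stated assumptions, not the btrfs patch: struct toy_eb and the toy_* helpers are hypothetical, the lock is a plain C11 atomic counter, and sched_yield() stands in for cpu_chill().

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant extent_buffer fields. */
struct toy_eb {
	atomic_int blocking_readers;	/* readers holding the lock while sleeping */
	atomic_int lock;		/* >0: spinning readers, -1: writer, 0: free */
};

static bool toy_read_trylock(struct toy_eb *eb)
{
	int cur = atomic_load(&eb->lock);

	while (cur >= 0) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&eb->lock, &cur, cur + 1))
			return true;
	}
	return false;	/* a writer holds the lock */
}

static void toy_read_unlock(struct toy_eb *eb)
{
	int prev = atomic_fetch_sub(&eb->lock, 1);

	assert(prev > 0);
	(void)prev;
}

/* Take a spinning read lock only while no blocking reader exists. */
static void toy_tree_read_lock(struct toy_eb *eb)
{
	for (;;) {
		while (atomic_load(&eb->blocking_readers))
			sched_yield();	/* stand-in for cpu_chill() */
		while (!toy_read_trylock(eb))
			sched_yield();
		/* A blocking reader may have appeared in between: recheck. */
		if (!atomic_load(&eb->blocking_readers))
			return;
		toy_read_unlock(eb);
	}
}
```

The recheck after acquiring the lock is what closes the window between the first wait loop and the trylock. Whether polling like this is acceptable on an RT kernel is exactly what the question above is asking.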
Re: 3.4.4-rt13: btrfs + xfstests 006 = BOOM.. and a bonus rt_mutex deadlock report for absolutely free!
On Mon, Jul 16, 2012 at 04:55:44AM -0600, Mike Galbraith wrote: On Sat, 2012-07-14 at 12:14 +0200, Mike Galbraith wrote: On Fri, 2012-07-13 at 08:50 -0400, Chris Mason wrote: On Wed, Jul 11, 2012 at 11:47:40PM -0600, Mike Galbraith wrote: Greetings, [ deadlocks with btrfs and the recent RT kernels ] I talked with Thomas about this and I think the problem is the single-reader nature of the RW rwlocks. The lockdep report below mentions that btrfs is calling: [ 692.963099] [811fabd2] btrfs_clear_path_blocking+0x32/0x70 In this case, the task has a number of blocking read locks on the btrfs buffers, and we're trying to turn them back into spinning read locks. Even though btrfs is taking the read rwlock, it doesn't think of this as a new lock operation because we were blocking out new writers. If the second task has taken the spinning read lock, it is going to prevent that clear_path_blocking operation from progressing, even though it would have worked on a non-RT kernel. The solution should be to make the blocking read locks in btrfs honor the single-reader semantics. This means not allowing more than one blocking reader and not allowing a spinning reader when there is a blocking reader. Strictly speaking btrfs shouldn't need recursive readers on a single lock, so I wouldn't worry about that part. There is also a chunk of code in btrfs_clear_path_blocking that makes sure to strictly honor top down locking order during the conversion. It only does this when lockdep is enabled because in non-RT kernels we don't need to worry about it. For RT we'll want to enable that as well. I'll give this a shot later today. I took a poke at it. Did I do something similar to what you had in mind, or just hide behind performance stealing paranoid trylock loops? Box survived 1000 x xfstests 006 and dbench [-s] massive right off the bat, so it gets posted despite skepticism. Seems btrfs isn't entirely convinced either. 
[ 2292.336229] use_block_rsv: 1810 callbacks suppressed [ 2292.336231] [ cut here ] [ 2292.336255] WARNING: at fs/btrfs/extent-tree.c:6344 use_block_rsv+0x17d/0x190 [btrfs]() [ 2292.336257] Hardware name: System x3550 M3 -[7944K3G]- [ 2292.336259] btrfs: block rsv returned -28 This is unrelated. You got far enough into the benchmark to hit an ENOSPC warning. This can be ignored (I just deleted it when we used 3.0 for oracle). re: dbench performance. dbench tends to penalize fairness. I can imagine RT making it slower in general. It also triggers lots of lock contention in btrfs because the dataset is fairly small and the trees don't fan out a lot. -chris -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.4.4-rt13: btrfs + xfstests 006 = BOOM.. and a bonus rt_mutex deadlock report for absolutely free!
On Mon, Jul 16, 2012 at 10:26:08AM -0600, Mike Galbraith wrote: On Mon, 2012-07-16 at 12:02 -0400, Steven Rostedt wrote: On Mon, 2012-07-16 at 04:02 +0200, Mike Galbraith wrote: Great, thanks! I got stuck in bug land on Friday. You mentioned performance problems earlier on Saturday, did this improve performance? Yeah, the read_trylock() seems to improve throughput. That's not heavily tested, but it certainly looks like it does. No idea why. Ouch, you just turned the rt_read_lock() into a spin lock. If a higher priority process preempted a lower priority process that holds the same lock, it will deadlock. Hm, how, it's doing cpu_chill()? I'm not sure why you would get a performance benefit from this, as the mutex used is an adaptive one (failure to acquire the lock will only sleep if preempted or if the owner is not running). I'm not attached to it, can whack it in a heartbeat.. especially so it the thing can deadlock. I've seen enough of those of late. We should look at why this performs better (if it really does). Not sure it really does, there's variance, but it looked like it did. I'd use a benchmark that is more consistent than dbench for this. I love dbench for generating load (and the occasional deadlock) but it tends to steer you in the wrong direction on performance. -chris -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG REPORT] Kernel panic on 3.9.0-rc7-4-gbb33db7
Quoting Tejun Heo (2013-04-19 01:57:54)
> Ewweehh
>
> No wonder this thing crashes. Chris, can't the original bio carry bbio in bi_private and let end_bio_extent_readpage() free the bbio instead of abusing bi_bdev like this?

Yes, we can definitely carry bbio up higher in the stack. I'll patch it up right now.

I do agree that it'll be too big for -final, but we'll have it either way.

-chris
Re: [BUG REPORT] Kernel panic on 3.9.0-rc7-4-gbb33db7
Quoting Jens Axboe (2013-04-19 09:32:50)
> > No wonder this thing crashes. Chris, can't the original bio carry bbio in bi_private and let end_bio_extent_readpage() free the bbio instead of abusing bi_bdev like this?
>
> Ugh, wtf. Chris, time for a swim in the bay :-)

Yeah, I can't really defend this one. We needed a space for an int and I assumed end_io meant the FS was free to do horrible things.

Really though, I'll just take a quick dip in the lake and patch this out of btrfs. Jan is probably right about changing around our endio callbacks to explicitly pass the mirror, it should be less complex and cleaner.

Many thanks to everyone here that tracked it down.

-chris
[GIT PULL] One more btrfs
Hi Linus

My for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Has a recent fix from Josef for our tree log replay code. It fixes problems where the inode counter for the number of bytes in the file wasn't getting updated properly during fsync replay.

The commit did get rebased this morning, but it was only to clean up the subject line. The code hasn't changed.

Josef Bacik (1) commits (+42/-6):
    Btrfs: make sure nbytes are right after log replay

Total: (1) commits (+42/-6)

 fs/btrfs/tree-log.c | 48 ++--
 1 file changed, 42 insertions(+), 6 deletions(-)
[GIT PULL] Btrfs updates
Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

We've had a busy two weeks of bug fixing. The biggest patches in here are some long standing early-enospc problems (Josef) and a very old race where compression and mmap combine forces to lose writes (me). I'm fairly sure the mmap bug goes all the way back to the introduction of the compression code, which is proof that fsx doesn't trigger every possible mmap corner after all.

I'm sure you'll notice one of these is from this morning, it's a small and isolated use-after-free fix in our scrub error reporting. I double checked it here.

Josef Bacik (6) commits (+90/-18):
    Btrfs: hold the ordered operations mutex when waiting on ordered extents (+2/-0)
    Btrfs: don't drop path when printing out tree errors in scrub (+2/-1)
    Btrfs: fix space leak when we fail to reserve metadata space (+41/-6)
    Btrfs: fix space accounting for unlink and rename (+2/-4)
    Btrfs: limit the global reserve to 512mb (+1/-1)
    Btrfs: handle a bogus chunk tree nicely (+42/-6)

Jan Schmidt (2) commits (+24/-16):
    Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes (+4/-6)
    Btrfs: fix locking on ROOT_REPLACE operations in tree mod log (+20/-10)

Wang Shilong (2) commits (+10/-2):
    Btrfs: fix double free in the btrfs_qgroup_account_ref() (+1/-2)
    Btrfs: fix missing qgroup reservation before fallocating (+9/-0)

Miao Xie (2) commits (+5/-3):
    Btrfs: fix wrong return value of btrfs_lookup_csum() (+3/-1)
    Btrfs: fix wrong reservation of csums (+2/-2)

Chris Mason (1) commits (+49/-0):
    Btrfs: fix race between mmap writes and compression

Liu Bo (1) commits (+1/-1):
    Btrfs: update to use fs_state bit

Tsutomu Itoh (1) commits (+9/-3):
    Btrfs: fix memory leak in btrfs_create_tree()

Total: (15) commits

 fs/btrfs/ctree.c        | 30 --
 fs/btrfs/disk-io.c      | 14 ++---
 fs/btrfs/extent-tree.c  | 84 ++---
 fs/btrfs/extent_io.c    | 33 +++
 fs/btrfs/extent_io.h    |  2 ++
 fs/btrfs/file-item.c    |  6 ++--
 fs/btrfs/file.c         |  9 ++
 fs/btrfs/inode.c        | 22 ++---
 fs/btrfs/ordered-data.c |  2 ++
 fs/btrfs/qgroup.c       |  3 +-
 fs/btrfs/scrub.c        |  3 +-
 fs/btrfs/send.c         | 10 +++---
 fs/btrfs/volumes.c      | 13 +++-
 13 files changed, 188 insertions(+), 43 deletions(-)
[GIT PULL] Btrfs fixes
Hi Linus,

My for-linus branch has some btrfs fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Eric's rcu barrier patch fixes a long standing problem with our unmount code hanging on to devices in workqueue helpers. Liu Bo nailed down a difficult assertion for in-memory extent mappings.

Liu Bo (4) commits (+9/-7):
    Btrfs: get better concurrency for snapshot-aware defrag work (+3/-0)
    Btrfs: fix warning when creating snapshots (+5/-6)
    Btrfs: fix warning of free_extent_map (+1/-0)
    Btrfs: remove btrfs_try_spin_lock (+0/-1)

Josef Bacik (1) commits (+4/-1):
    Btrfs: return EIO if we have extent tree corruption

Eric Sandeen (1) commits (+6/-0):
    btrfs: use rcu_barrier() to wait for bdev puts at unmount

Wang Shilong (1) commits (+6/-4):
    Btrfs: return as soon as possible when edquot happens

Total: (7) commits (+25/-12)

 fs/btrfs/extent-tree.c |  5 -
 fs/btrfs/file.c        |  1 +
 fs/btrfs/inode.c       |  3 +++
 fs/btrfs/locking.h     |  1 -
 fs/btrfs/qgroup.c      | 10 ++
 fs/btrfs/transaction.c | 11 +--
 fs/btrfs/volumes.c     |  6 ++
 7 files changed, 25 insertions(+), 12 deletions(-)
[GIT PULL] Btrfs updates
Hi Linus,

Please grab my for-linus:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

These are scattered fixes and one performance improvement. The biggest functional change is in how we throttle metadata changes. The new code bumps our average file creation rate up by ~13% in fs_mark, and lowers CPU usage.

Stefan bisected out a regression in our allocation code that made balance loop on extents larger than 256MB.

Liu Bo (6) commits (+71/-19):
    Btrfs: build up error handling for merge_reloc_roots (+35/-12)
    Btrfs: check for NULL pointer in updating reloc roots (+2/-0)
    Btrfs: avoid deadlock on transaction waiting list (+7/-0)
    Btrfs: free all recorded tree blocks on error (+6/-3)
    Btrfs: do not BUG_ON on aborted situation (+12/-3)
    Btrfs: do not BUG_ON in prepare_to_reloc (+9/-1)

Chris Mason (2) commits (+96/-63):
    Btrfs: enforce min_bytes parameter during extent allocation (+4/-2)
    Btrfs: improve the delayed inode throttling (+92/-61)

Miao Xie (2) commits (+45/-39):
    Btrfs: fix unclosed transaction handler when the async transaction commitment fails (+4/-0)
    Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails (+41/-39)

Stefan Behrens (1) commits (+0/-8):
    Btrfs: allow running defrag in parallel to administrative tasks

Ilya Dryomov (1) commits (+5/-0):
    Btrfs: fix a mismerge in btrfs_balance()

Josef Bacik (1) commits (+4/-1):
    Btrfs: use set_nlink if our i_nlink is 0

Total: (13) commits (+221/-130)

 fs/btrfs/delayed-inode.c | 151 ---
 fs/btrfs/delayed-inode.h |   2 +
 fs/btrfs/disk-io.c       |  16 +++--
 fs/btrfs/inode.c         |   6 +-
 fs/btrfs/ioctl.c         |  18 ++
 fs/btrfs/relocation.c    |  74 +--
 fs/btrfs/transaction.c   |  65
 fs/btrfs/tree-log.c      |   5 +-
 fs/btrfs/volumes.c       |  14 -
 9 files changed, 221 insertions(+), 130 deletions(-)
Re: [GIT PULL] Btrfs
On Sat, Mar 02, 2013 at 05:45:41PM -0700, Linus Torvalds wrote: On Sat, Mar 2, 2013 at 7:15 AM, Chris Mason chris.ma...@fusionio.com wrote: Our set of btrfs features, fixes and cleanups are in my for-linus branch: I *really* wish that big pull requests like this would come in earlier in the merge window. I hate seeing them the day before I close the window - really. A number of the latter commits are done in the last few days, which also smells bad. Definitely, I wanted to send this earlier in the merge window. But I was out last week and also didn't want to send the big stuff (raid 5/6 and the fsync work) to you right before I left on vacation. So instead I sent things off to linux-next, and everyone on the btrfs list collected fixes while I was gone. -chris -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] btrfs/raid56: Add missing #include <linux/vmalloc.h>
On Sun, Mar 03, 2013 at 04:44:41AM -0700, Geert Uytterhoeven wrote:
> tilegx_defconfig:
>
> fs/btrfs/raid56.c: In function 'btrfs_alloc_stripe_hash_table':
> fs/btrfs/raid56.c:206:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
> fs/btrfs/raid56.c:206:9: warning: assignment makes pointer from integer without a cast [enabled by default]
> fs/btrfs/raid56.c:226:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]

Thanks, I've got this one in my for-linus now. It'll go with the next pull.

-chris
[GIT PULL] Btrfs fixup
Hi Linus,

Geert and James both sent this one in, sorry guys.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Geert Uytterhoeven (1) commits (+1/-0):
    btrfs/raid56: Add missing #include <linux/vmalloc.h>

Total: (1) commits (+1/-0)

 fs/btrfs/raid56.c | 1 +
 1 file changed, 1 insertion(+)
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote: On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote: Indeed. Though how well my patches will work with Oracle will depend a lot on what kind of semctl syscalls they are doing. Does Oracle typically do one semop per semctl syscall, or does it pass in a whole bunch at once? https://oss.oracle.com/~mason/sembench.c I think Chris wrote that to match a particular pattern of semaphore operations the database engine in question does. I haven't checked to see if it triggers the case in point though. Also, Chris since left Oracle but maybe he knows who to poke. Dave Kleikamp (cc'd) took over my patches and did the most recent benchmarking. Ported against 3.0: https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c The current versions are still in the 2.6.32 oracle kernel, but it looks like they reverted this 3.0 commit. I think with Manfred's upstream work my more complex approach wasn't required anymore, but hopefully Dave can fill in details. Here is some of the original discussion around the patch: https://lkml.org/lkml/2010/4/12/257 In terms of how oracle uses IPC, the part that shows up in profiles is using semtimedop for bulk wakeups. They can configure things to use either a bunch of small arrays or a huge single array (and anything in between). There is one IPC semaphore per process and they use this to wait for some event (like a log commit). When the event comes in, everyone waiting is woken in bulk via a semtimedop call. So, single proc waking many waiters at once. -chris -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
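The bulk wakeup pattern described above (one semaphore per waiting process, one caller waking them all) can be sketched with the System V semaphore calls. This is a hedged illustration, not Oracle's code or sembench.c: the helper names are hypothetical, and it uses semop() for the waker where the thread mentions semtimedop(), since the waker needs no timeout. The kernel caps the number of operations per call (SEMOPM on Linux), so a real waker may need several calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Create a private array with one semaphore per (hypothetical) waiter. */
static int waiter_array_create(int nwaiters)
{
	return semget(IPC_PRIVATE, nwaiters, IPC_CREAT | 0600);
}

/*
 * Wake every waiter with a single semop() call: one sembuf per semaphore,
 * each adding 1. A waiter would be sleeping in semop() or semtimedop()
 * with sem_op = -1 on its own semaphore.
 */
static int waiter_array_wake_all(int semid, int nwaiters)
{
	struct sembuf *ops;
	int i, ret;

	assert(nwaiters > 0);
	ops = calloc((size_t)nwaiters, sizeof(*ops));
	if (!ops)
		return -1;
	for (i = 0; i < nwaiters; i++) {
		ops[i].sem_num = (unsigned short)i;
		ops[i].sem_op = 1;
		ops[i].sem_flg = 0;
	}
	ret = semop(semid, ops, (size_t)nwaiters);
	free(ops);
	return ret;
}
```

All the increments in one call are applied atomically as a group, which is why a single process waking many waiters concentrates its work on the array's lock.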
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Thu, Mar 07, 2013 at 08:54:55AM -0700, Dave Kleikamp wrote: On 03/07/2013 06:55 AM, Chris Mason wrote: On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote: On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote: Indeed. Though how well my patches will work with Oracle will depend a lot on what kind of semctl syscalls they are doing. Does Oracle typically do one semop per semctl syscall, or does it pass in a whole bunch at once? https://oss.oracle.com/~mason/sembench.c I think Chris wrote that to match a particular pattern of semaphore operations the database engine in question does. I haven't checked to see if it triggers the case in point though. Also, Chris since left Oracle but maybe he knows who to poke. Dave Kleikamp (cc'd) took over my patches and did the most recent benchmarking. Ported against 3.0: https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c The current versions are still in the 2.6.32 oracle kernel, but it looks like they reverted this 3.0 commit. I think with Manfred's upstream work my more complex approach wasn't required anymore, but hopefully Dave can fill in details. From what I recall, I could never get better performance from your patches that we saw with Manfred's work alone. I can't remember the reasons for including and then reverting the patches from the 3.0 (2.6.39) Oracle kernel, but in the end we weren't able to justify their inclusion. Ok, so after this commit, oracle was happy: commit fd5db42254518fbf241dc454e918598fbe494fa2 Author: Manfred Spraul manf...@colorfullife.com Date: Wed May 26 14:43:40 2010 -0700 ipc/sem.c: optimize update_queue() for bulk wakeup calls But that doesn't explain why Davidlohr saw semtimedop at the top of the oracle profiles in his runs. Looking through the patches in this thread, I don't see anything that I'd expect to slow down oracle TPC numbers. 
I dealt with the ipc_perm lock a little differently:

https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commitdiff;h=78fe45325c8e2e3f4b6ebb1ee15b6c2e8af5ddb1;hp=8102e1ff9d667661b581209323faaf7a84f0f528

My code switched the ipc_rcu_hdr refcount to an atomic, which changed where I needed the spinlock. It may make things easier in patches 3/4 and 4/4.

(some of this code was Jens, but at the time he made me promise to pretend he never touched it)

-chris
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds torva...@linux-foundation.org wrote:
> >
> > But the fact that the code wants to do things like
> >
> >     block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> >
> > seriously seems to be the main thing that keeps us using 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly enough to hurt (not everywhere, but on some machines).
> >
> > Very annoying.
>
> Hmm. Here's a patch that does that anyway. I'm not 100% happy with the whole ilog2 thing, but at the same time, in other cases it actually seems to improve code generation (ie gets rid of the whole unnecessary two dereferences through page->mapping->host just to get the block size, when we have it in the buffer-head that we have to touch *anyway*).
>
> Comments? Again, untested.

Jumping in based on Linus' original patch, which is doing something like this:

set_blocksize() {
	block new calls to writepage, prepare/commit_write
	set the block size
	unblock

	--- can race in here and find bad buffers ---

	sync_blockdev()
	kill_bdev()

	--- now we're safe ---
}

We could add a second semaphore and a page_mkwrite call:

set_blocksize() {
	block new calls to prepare/commit_write and page_mkwrite(), but leave writepage unblocked.

	sync_blockdev()

	--- now we're safe. There are no dirty pages and no ways to make new ones ---

	block new calls to readpage (writepage too for good luck?)

	kill_bdev()
	set the block size

	unblock readpage/writepage
	unblock prepare/commit_write and page_mkwrite
}

Another way to look at things:

As Linus said in a different email, we don't need to drop the pages, just the buffers. Once we've blocked prepare/commit_write, there is no way to make a partially up to date page with dirty data. We may make fully uptodate dirty pages, but for those we can just create dirty buffers for the whole page.

As long as we had prepare/commit write blocked while we ran sync_blockdev, we can blindly detach any buffers that are the wrong size and just make new ones.

This may or may not apply to loop.c, I'd have to read that more carefully.

-chris
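The block-number calculation quoted from Linus at the top of this message, with bbits derived from the buffer size instead of inode->i_blkbits, can be sketched as below. This is a userspace illustration under stated assumptions, not his patch: toy_ilog2() stands in for the kernel's ilog2(), TOY_PAGE_SHIFT assumes 4 KiB pages in place of PAGE_CACHE_SHIFT, and first_block_of_page() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Integer log2 of a power of two, e.g. 512 -> 9. */
static unsigned int toy_ilog2(uint32_t v)
{
	unsigned int bits = 0;

	assert(v != 0 && (v & (v - 1)) == 0);
	while (v >>= 1)
		bits++;
	return bits;
}

/* First block number covered by a page, derived from the buffer size alone. */
static uint64_t first_block_of_page(uint64_t page_index, uint32_t b_size)
{
	unsigned int bbits = toy_ilog2(b_size);

	return page_index << (TOY_PAGE_SHIFT - bbits);
}
```

With 512-byte buffers each page holds eight blocks, so page 3 starts at block 24; with 4096-byte buffers the page index and block number coincide. The loop in toy_ilog2() is the kind of extra work the quoted text calls "just costly enough to hurt" on some machines.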
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Thu, Nov 29, 2012 at 07:12:49AM -0700, Chris Mason wrote: On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote: On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds torva...@linux-foundation.org wrote: But the fact that the code wants to do things like block = (sector_t)page-index (PAGE_CACHE_SHIFT - bbits); seriously seems to be the main thing that keeps us using 'inode-i_blkbits'. Calculating bbits from bh-b_size is just costly enough to hurt (not everywhere, but on some machines). Very annoying. Hmm. Here's a patch that does that anyway. I'm not 100% happy with the whole ilog2 thing, but at the same time, in other cases it actually seems to improve code generation (ie gets rid of the whole unnecessary two dereferences through page-mapping-host just to get the block size, when we have it in the buffer-head that we have to touch *anyway*). Comments? Again, untested. Jumping in based on Linus original patch, which is doing something like this: set_blocksize() { block new calls to writepage, prepare/commit_write set the block size unblock --- can race in here and find bad buffers --- sync_blockdev() kill_bdev() --- now we're safe --- } We could add a second semaphore and a page_mkwrite call: set_blocksize() { block new calls to prepare/commit_write and page_mkwrite(), but leave writepage unblocked. sync_blockev() --- now we're safe. There are no dirty pages and no ways to make new ones --- block new calls to readpage (writepage too for good luck?) kill_bdev() Whoops, kill_bdev needs the page lock, which sends us into ABBA when readpage does the down_read. So, slight modification, unblock readpage/writepage before the kill_bdev. We'd need to change readpage to discard buffers with the wrong size. The risk is that readpage can find buffers with the wrong size, and would need to be changed to discard them. The patch below is based on Linus' original and doesn't deal with the readpage race. But it does get the rest of the idea across. 
It boots and survives banging on blockdev --setbsz with mkfs, but I definitely wouldn't trust it.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..1377171 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -116,8 +116,6 @@ EXPORT_SYMBOL(invalidate_bdev);
 
 int set_blocksize(struct block_device *bdev, int size)
 {
-	struct address_space *mapping;
-
 	/* Size must be a power of two, and between 512 and PAGE_SIZE */
 	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
@@ -126,28 +124,40 @@ int set_blocksize(struct block_device *bdev, int size)
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
-	/* Prevent starting I/O or mapping the device */
-	percpu_down_write(&bdev->bd_block_size_semaphore);
-
-	/* Check that the block device is not memory mapped */
-	mapping = bdev->bd_inode->i_mapping;
-	mutex_lock(&mapping->i_mmap_mutex);
-	if (mapping_mapped(mapping)) {
-		mutex_unlock(&mapping->i_mmap_mutex);
-		percpu_up_write(&bdev->bd_block_size_semaphore);
-		return -EBUSY;
-	}
-	mutex_unlock(&mapping->i_mmap_mutex);
-
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
+		/* block all modifications via writing and page_mkwrite */
+		percpu_down_write(&bdev->bd_block_size_semaphore);
+
+		/* write everything that was dirty */
 		sync_blockdev(bdev);
+
+		/* block readpage and writepage */
+		percpu_down_write(&bdev->bd_page_semaphore);
+
 		bdev->bd_block_size = size;
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
+
+		/* we can't call kill_bdev with the page_semaphore down
+		 * because we'll deadlock against readpage.
+		 * The block_size_semaphore should prevent any new
+		 * pages from being dirty, but readpage can jump
+		 * in once we up the bd_page_sem and find a
+		 * page with buffers from the old size.
+		 *
+		 * The kill_bdev call below is going to get rid
+		 * of those buffers, but we do have a race here.
+		 * readpage needs to deal with it and verify
+		 * any buffers on the page are the right size
+		 */
+		percpu_up_write(&bdev->bd_page_semaphore);
+
+		/* drop all the pages and all the buffers */
 		kill_bdev(bdev);
-	}
 
-	percpu_up_write(&bdev->bd_block_size_semaphore);
+		/* open the gates and let everyone back in */
+		percpu_up_write(&bdev->bd_block_size_semaphore);
+	}
 
 	return 0
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Thu, Nov 29, 2012 at 10:26:56AM -0700, Linus Torvalds wrote: On Thu, Nov 29, 2012 at 6:12 AM, Chris Mason chris.ma...@fusionio.com wrote: Jumping in based on Linus original patch, which is doing something like this: set_blocksize() { block new calls to writepage, prepare/commit_write set the block size unblock --- can race in here and find bad buffers --- sync_blockdev() kill_bdev() --- now we're safe --- } We could add a second semaphore and a page_mkwrite call: Yeah, we could be fancy, but the more I think about it, the less I can say I care. After all, the only things that do the whole set_blocksize() thing should be: - filesystems at mount-time - things like loop/md at block device init time. and quite frankly, if there are any *concurrent* writes with either of the above, I really *really* don't think we should care. I mean, seriously. So the _only_ real reason for the locking in the first place is to make sure of internal kernel consistency. We do not want to oops or corrupt memory if people do odd things. But we really *really* don't care if somebody writes to a partition at the same time as somebody else mounts it. Not enough to do extra work to please insane people. It's also worth noting that NONE OF THIS HAS EVER WORKED IN THE PAST. The whole sequence always used to be unlocked. The locking is entirely new. There is certainly not any legacy users that can possibly rely on I did writes at the same time as the mount with no serialization, and it worked. It never has worked. So I think this is a case of perfect is the enemy of good. Especially since I think that with the fs/buffer.c approach, we don't actually need any locking at all at higher levels. The bigger question is do we have users that expect to be able to set the blocksize after mmaping the block device (no writes required)? I actually feel a little bad for taking up internet bandwidth asking, but it is a change in behaviour. 
Regardless, changing mmap for a race in the page cache is just backwards, and with the current 3.7 code, we can still trigger the race with fadvise -> readpage in the middle of set_blocksize(). Obviously nobody does any of this, otherwise we'd have tons of reports from those handy WARN_ONs in fs/buffer.c. So it's definitely hard to be worried one way or another. -chris -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 12:02:17PM -0700, Linus Torvalds wrote: On Thu, Nov 29, 2012 at 9:19 AM, Linus Torvalds torva...@linux-foundation.org wrote: I think I'll apply this for 3.7 (since it's too late to do anything fancier), and then for 3.8 I will rip out all the locking entirely, because looking at the fs/buffer.c patch I wrote up, it's all totally unnecessary. Adding an ACCESS_ONCE() to the read of the i_blkbits value (when creating new buffers) simply makes the whole locking thing pointless. Just make the page lock protect the block size, and make it per-page, and we're done. There's a 'block-dev' branch in my git tree, if you guys want to play around with it. It actually reverts fs/block_dev.c back to the 3.6 state (except for some whitespace damage that I refused to re-introduce), so that part of the changes should be pretty safe and well tested. The fs/buffer.c changes, of course, are new. It's largely the same patch I already sent out, with a small helper function to simplify it, and to keep the whole ACCESS_ONCE() thing in just a single place. The fs/buffer.c part makes sense during a quick read. But fs/direct-io.c plays with i_blkbits too. The semaphore was fixing real bugs there. -chris
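The idea of reading i_blkbits exactly once and deriving everything else from that local copy can be sketched in userspace C. This is only an illustration under stated assumptions: struct demo_inode and the demo_* helpers are hypothetical names, the ACCESS_ONCE() definition is a userspace stand-in that relies on the GCC/Clang __typeof__ extension, and none of this is the helper from the 'block-dev' branch.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel macro: force a single load of x. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical, trimmed-down inode holding only the field under discussion. */
struct demo_inode {
    unsigned int i_blkbits;   /* log2 of the block size; may change concurrently */
};

/* Single place that reads i_blkbits, so callers never re-read the field. */
static inline unsigned int demo_blkbits_snapshot(struct demo_inode *inode)
{
    return ACCESS_ONCE(inode->i_blkbits);
}

/* Derive the number of buffers per page from one consistent snapshot. */
static size_t demo_buffers_per_page(struct demo_inode *inode, size_t page_size)
{
    unsigned int bits = demo_blkbits_snapshot(inode);
    size_t blocksize = (size_t)1 << bits;

    return page_size / blocksize;
}
```

Because the block size and the buffer count both come from the same local value, a concurrent change to i_blkbits cannot make the two disagree within one call.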
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 12:26:06PM -0700, Linus Torvalds wrote: On Thu, Nov 29, 2012 at 11:15 AM, Chris Mason chris.ma...@fusionio.com wrote: The fs/buffer.c part makes sense during a quick read. But fs/direct-io.c plays with i_blkbits too. The semaphore was fixing real bugs there. Ugh. I _hate_ direct-IO. What a mess. And yeah, it seems to be incestuously playing games that should be in fs/buffer.c. I thought it was doing the sane thing with the page cache. (I now realize that Mikulas was talking about this mess, while I thought he was talking about the AIO code which is largely sane). It was all a trick to get you to say the AIO code was sane. It looks like we could use the private copy of i_blkbits that DIO is already recording. blkdev_get_blocks (called during DIO) is also checking i_blkbits, but I really don't get why that isn't byte based instead. DIO is already doing the shift mask game. I think only clean_blockdev_aliases is intentionally using the inode's i_blkbits, but again that shouldn't be changing for filesystems so it seems safe to use the DIO copy. -chris
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 01:52:22PM -0700, Linus Torvalds wrote: On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason chris.ma...@fusionio.com wrote: It was all a trick to get you to say the AIO code was sane. It's only sane compared to the DIO code. That said, I hate AIO much less these days that we've largely merged the code with the regular IO. It's still a horrible interface, but at least it is no longer a really disgusting separate implementation in the kernel of that horrible interface. So yeah, I guess AIO really is pretty sane these days. It looks like we could use the private copy of i_blkbits that DIO is already recording. Yes. But that didn't fix the blkdev_get_blocks() mess you pointed out. I've pushed out two more commits to the 'block-dev' branch at git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux block-dev in case anybody wants to take a look. It is - as usual - entirely untested. It compiles, and I *think* that blkdev_get_blocks() makes a whole lot more sense this way - as you said, it should be byte-based (although it actually does the block number conversion because I worried about overflow - probably unnecessarily). Comments? Your blkdev_get_blocks emails were great reading while at the dentist, thanks for helping me pass the time. Just reading the new blkdev_get_blocks, it looks like we're mixing shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and it has no relation at all to the actual block size of the device. The interface is abusing b_size to ask for as large a mapping as possible. Most importantly, it has no relation to the fs_startblk that we pass in, which is based on inode->i_blkbits. So your new check in blkdev_get_blocks: if (iblock >= end_block) { is wrong because iblock and end_block are based on different sizes. I think we have to do the eof checks inside fs/direct-io.c or change the get_blocks interface completely. I really thought fs/direct-io.c was already doing eof checks, but I'm reading harder.
-chris
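The "mixing shifts" problem can be shown with a small userspace C sketch. It assumes the end-of-device check compares a block index against the device size converted to blocks; demo_block_past_eof and its parameters are hypothetical names for illustration, not the code from the 'block-dev' branch. The comparison is only meaningful when both sides use the same shift.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when block index `iblock`, counted in units of
 * (1 << blkbits) bytes, starts at or beyond `size_bytes`.
 * The caller must pass the shift that `iblock` was actually computed with.
 */
static bool demo_block_past_eof(uint64_t iblock, unsigned int blkbits,
                                uint64_t size_bytes)
{
    uint64_t end_block = size_bytes >> blkbits;   /* same unit as iblock */

    return iblock >= end_block;
}
```

For a 1 MiB device, block 300 counted in 512-byte units sits at byte offset 153600 and is well inside the device. If the same index is compared against an end_block derived from a 4096-byte shift (256 blocks), the check wrongly reports it as past the end, which is the kind of unit mismatch described in the message above.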
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 03:36:38PM -0700, Linus Torvalds wrote: On Thu, Nov 29, 2012 at 2:16 PM, Linus Torvalds torva...@linux-foundation.org wrote: But you're right. The direct-IO code really *is* violating that, and knows that get_block() ends up being defined in i_blkbits regardless of b_size. It turns out fs/ioctl.c does the same - it fills in the buffer head with some random bh->b_size too. I think it's not even a power of two in that case. And I guess it's understandable - they don't actually *use* the buffer, they just want the offset. So the b_size field really is just random crap to the users of the get_block interfaces, since they've never cared before. Ugh, this was definitely a dark and disgusting underbelly of the VFS layer. We've not had to really touch it for a *looong* time.. I searched through filemap.c for the magic i_size check that would let us get away with ignoring i_blkbits in get_blocks, but it's just not there. The whole fallback-to-buffered scheme seems to rely on get_blocks checking for i_size. I really hope I'm just missing something. If we're going to change this, I'd vote for something non-bh based. I didn't check every single FS, but I don't think direct-IO really wants or needs buffer heads at all. One less wart in direct-io.c would really be nice, but I'm assuming it'll take us at least one full release to hammer out a shiny new get_blocks. Passing i_blkbits would be more mechanical, since all the filesystems would just ignore it. -chris
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 07:13:02PM -0700, Linus Torvalds wrote: On Thu, Nov 29, 2012 at 5:16 PM, Chris Mason chris.ma...@fusionio.com wrote: I searched through filemap.c for the magic i_size check that would let us get away with ignoring i_blkbits in get_blocks, but it's just not there. The whole fallback-to-buffered scheme seems to rely on get_blocks checking for i_size. I really hope I'm just missing something. So generic_write_checks() limits the size to i_size for writes (and for isblk). Great, that's what I was missing. Sure, then it will do the buffered part after that, but that should all be fine anyway, since by then we use the normal page cache. For reads, generic_file_aio_read() will check pos < size, but doesn't seem to actually limit the size of the iovec. I couldn't explain that either. I'm not sure why it doesn't just do iov_shorten(). Anyway, having looked at actually passing in the block size to get_block(), I can say that is a horrible idea. There are tons of get_block functions (for various filesystems), and *none* of them really want the block size, because they tend to work on block indexes. And if they do want the block size, they'll just get it from the inode or sb, since they are filesystems and it's all stable. So the *only* one of the places that would want the block size is fs/block_dev.c. And the callers really already seem to do the i_size check, although they sometimes do it badly. And since there are fewer callers than there are get_block() implementations, I think we should just fix the callers and be done with it. Those generic_file_aio_read/write() functions in fs/direct-io.c really just seem to be badly written. The fact that they may depend on the i_size check in get_blocks() is sad, but I think we should fix it and just remove the check for block devices. That's going to simplify so much.. I updated the 'block-dev' branch to have that simpler fs/block_dev.c model instead. I'll look at the iovec shortening later.
It's a non-fast-forward thing, look out! (I actually think we should just add the max-offset check to rw_copy_check_uvector(). That one already does the MAX_RW_COUNT thing, and we could make it do a max_offset check as well). This is definitely easier, and I can't see any reason not to do it. I'm used to get_block being expensive and so it didn't even cross my mind. We can benchmark things just to make sure. -chris
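Limiting a request in the caller, rather than relying on an i_size check inside get_block(), amounts to clamping the byte count against the device size before any mapping happens. The following is a minimal userspace C sketch of that clamp; demo_clamp_count is a hypothetical helper and is not the kernel's generic_write_checks(), iov_shorten(), or rw_copy_check_uvector().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Clamp a request of `count` bytes starting at byte offset `pos` so that
 * it never extends past `dev_size`. Returns the number of bytes that may
 * still be transferred (0 when pos is already at or past the end).
 */
static size_t demo_clamp_count(uint64_t pos, size_t count, uint64_t dev_size)
{
    uint64_t left;

    if (pos >= dev_size)
        return 0;

    left = dev_size - pos;
    return count < left ? count : (size_t)left;
}
```

With a clamp like this applied once in the caller, the per-filesystem get_block() implementations would not need to know the device size at all, which matches the reasoning in the message above that there are fewer callers than get_block() implementations.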
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 07:49:10PM -0700, Dave Chinner wrote: On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote: On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason chris.ma...@fusionio.com wrote: Just reading the new blkdev_get_blocks, it looks like we're mixing shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and it has no relation at all to the actual block size of the device. The interface is abusing b_size to ask for as large a mapping as possible. Ugh. That's a big violation of how buffer-heads are supposed to work: the block number is very much defined to be in multiples of b_size (see for example submit_bh() that turns it into a sector number). But you're right. The direct-IO code really *is* violating that, and knows that get_block() ends up being defined in i_blkbits regardless of b_size. Same with mpage_readpages(), so it's not just direct IO that has this problem. I guess the good news is that block devices don't have readpages. The bad news would be that we can't put readpages in without much bigger changes. -chris
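The statement that a buffer head's block number is defined in multiples of b_size can be made concrete with the conversion to 512-byte sectors that the message attributes to submit_bh(). The sketch below is a userspace C illustration; demo_bh_to_sector is a hypothetical name, and the arithmetic (block number times b_size divided by 512) is given as an assumption about the conversion rather than a copy of kernel code.

```c
#include <stdint.h>

/*
 * Convert a block number expressed in units of `b_size` bytes into a
 * 512-byte sector number. b_size is assumed to be a multiple of 512.
 */
static uint64_t demo_bh_to_sector(uint64_t blocknr, uint32_t b_size)
{
    return blocknr * (b_size >> 9);
}
```

The same block number maps to different sectors depending on b_size: block 10 is sector 80 with 4096-byte blocks but sector 20 with 1024-byte blocks. A caller that fills b_size with an unrelated "how much to map" value therefore cannot hand the result to code that interprets the block number this way.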
[GIT PULL] Btrfs fixes
[ sorry, my lbdb seems to really like linux-ker...@vger.kerrnel.org, fixed for real this time ]

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

We've got corner cases for updating i_size that ceph was hitting, error handling for quotas when we run out of space, a very subtle snapshot deletion race, a crash while removing devices, and one deadlock between subvolume creation and the sb_internal code (thanks lockdep).

Josef Bacik (3) commits (+12/-4):
    Btrfs: do not merge logged extents if we've removed them from the tree (+2/-1)
    Btrfs: fix possible stale data exposure (+1/-1)
    Btrfs: fix missing i_size update (+9/-2)

Miao Xie (2) commits (+21/-9):
    Btrfs: fix missing release of the space/qgroup reservation in start_transaction() (+19/-8)
    Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() (+2/-1)

Jan Schmidt (1) commits (+10/-12):
    Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata

Liu Bo (1) commits (+38/-9):
    Btrfs: fix race between snapshot deletion and getting inode

Chris Mason (1) commits (+4/-1):
    Btrfs: move d_instantiate outside the transaction during mksubvol

Eric Sandeen (1) commits (+2/-1):
    btrfs: don't try to notify udev about missing devices

Total: (9) commits

 fs/btrfs/extent-tree.c  | 22 ++
 fs/btrfs/extent_map.c   |  3 ++-
 fs/btrfs/file.c         | 25 -
 fs/btrfs/ioctl.c        |  5 -
 fs/btrfs/ordered-data.c | 13 ++---
 fs/btrfs/scrub.c        | 25 -
 fs/btrfs/transaction.c  | 27 +++
 fs/btrfs/volumes.c      |  3 ++-
 8 files changed, 87 insertions(+), 36 deletions(-)

-- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Oops when mounting btrfs partition
Hi Arnd,

First things first, nospace_cache is a safe thing to use. It is slow because it's finding free extents, but it's just a cache and always safe to discard. With your other errors, I'd just mount it readonly and then you won't waste time on atime updates.

I'll take a look at the BUG you got during log recovery. We've fixed a few of those during the 3.8 rc cycle.

Feb 1 22:57:37 localhost kernel: [ 8561.599482] Kernel BUG at a01fdcf7 [verbose debug info unavailable]
Jan 14 19:18:42 localhost kernel: [1060055.746373] btrfs csum failed ino 15619835 off 454656 csum 2755731641 private 864823192
Jan 14 19:18:42 localhost kernel: [1060055.746381] btrfs: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 17, gen 0
...
Jan 21 16:35:40 localhost kernel: [1655047.701147] parent transid verify failed on 17006399488 wanted 54700 found 54764

These aren't good. With a few exceptions for really tight races in fsx use cases, csum errors are bad data from the disk. The transid verify failed message shows we wanted to find a metadata block from generation 54700 but found 54764 instead:

54700 = 0xD5AC
54764 = 0xD5EC

This same bad block comes up a few different times.

Jan 21 16:35:40 localhost kernel: [1655047.752692] btrfs read error corrected: ino 1 off 17006399488 (dev /dev/sdb1 sector 64689288)

This shows we pulled from the second copy of this block and got the right answer, and then wrote the right answer to the duplicate. Inode 1 means it was metadata. But for some reason it still aborted the transaction. It could have been an EIO on the correction, but the auto correction code in 3.5 did work well.

I think your plan to pull the data off and reformat is a good one. I'd also look hard at your RAM since drives don't usually send back single bit errors.
-chris
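The two generation numbers from the transid message, 54700 (0xD5AC) and 54764 (0xD5EC), differ in exactly one bit, which is why the advice points at memory rather than the drive. A short userspace C sketch of that check follows; demo_single_bit_diff is a hypothetical helper written for this illustration and is not part of btrfs.

```c
#include <stdint.h>

/*
 * If `wanted` and `found` differ in exactly one bit, return that bit's
 * index (0 = least significant). Return -1 when the values are equal or
 * differ in more than one bit.
 */
static int demo_single_bit_diff(uint64_t wanted, uint64_t found)
{
    uint64_t diff = wanted ^ found;
    int bit = 0;

    /* A value with exactly one bit set satisfies diff & (diff - 1) == 0. */
    if (diff == 0 || (diff & (diff - 1)) != 0)
        return -1;

    while ((diff >> bit) != 1)
        bit++;

    return bit;
}
```

For the values in the log, 0xD5AC XOR 0xD5EC is 0x40, so the helper reports bit 6 as the only difference.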
[GIT PULL] Btrfs
mapping if we fail to add chunk (+12/-2) Btrfs: relax the block group size limit for bitmaps (+9/-3) Btrfs: cleanup orphan reservation if truncate fails (+2/-0) Btrfs: make sure NODATACOW also gets NODATASUM set (+2/-1) Btrfs: don't re-enter when allocating a chunk (+9/-0) Btrfs: remove unused extent io tree ops V2 (+11/-27) Btrfs: fix chunk allocation error handling (+22/-10) Liu Bo (14) commits (+796/-109): Btrfs: kill unused argument of btrfs_pin_extent_for_log_replay (+3/-6) Btrfs: fix cleaner thread not working with inode cache option (+8/-1) Btrfs: use token to avoid times mapping extent buffer (+35/-28) Btrfs: extend the checksum item as much as possible (+46/-21) Btrfs: fix NULL pointer after aborting a transaction (+7/-1) Btrfs: use reserved space for creating a snapshot (+2/-0) Btrfs: kill unused argument of update_block_group (+5/-7) Btrfs: kill unused arguments of cache_block_group (+5/-8) Btrfs: do not change inode flags in rename (+0/-25) Btrfs: record first logical byte in memory (+20/-1) Btrfs: fix memory leak of log roots (+9/-2) Btrfs: remove deprecated comments (+0/-6) Btrfs: snapshot-aware defrag (+654/-0) Btrfs: save us a read_lock (+2/-3) Eric Sandeen (11) commits (+58/-108): btrfs: ensure we don't overrun devices_info[] in __btrfs_alloc_chunk (+5/-1) btrfs: remove unused item in btrfs_insert_delayed_item() (+0/-2) btrfs: remove unused fs_info from btrfs_decode_error() (+4/-5) btrfs: remove cache only arguments from defrag path (+32/-82) btrfs: remove unnecessary DEFINE_WAIT() declarations (+0/-2) btrfs: annotate intentional switch case fallthroughs (+2/-0) btrfs: add missing break in btrfs_print_leaf() (+1/-0) btrfs: remove unused fd in btrfs_ioctl_send() (+0/-3) btrfs: handle null fs_info in btrfs_panic() (+7/-4) btrfs: fix varargs in __btrfs_std_error (+7/-7) btrfs: list_entry can't return NULL (+0/-2) Chris Mason (7) commits (+561/-30): Btrfs: reduce CPU contention while waiting for delayed extent operations (+70/-5) Btrfs: remove 
conflicting check for minimum number of devices in raid56 (+0/-8) Btrfs: reduce lock contention on extent buffer locks (+16/-0) Btrfs: add a plugging callback to raid56 writes (+124/-4) Btrfs: fix cluster alignment for mount -o ssd (+6/-1) Btrfs: fix max chunk size on raid5/6 (+21/-4) Btrfs: Add a stripe cache to raid56 (+324/-8) Wang Shilong (6) commits (+78/-68): Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree (+0/-3) Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic (+38/-44) Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails (+9/-3) Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails (+6/-5) Btrfs: fix missing deleted items in btrfs_clean_quota_tree (+21/-13) Btrfs: fix missing check before disabling quota (+4/-0) David Sterba (6) commits (+131/-42): btrfs: access superblock via pagecache in scan_one_device (+64/-6) btrfs: put some enospc messages under enospc_debug (+15/-11) btrfs: try harder to allocate raid56 stripe cache (+26/-7) btrfs: use only inline_pages from extent buffer (+7/-17) btrfs: remove a printk from scan_one_device (+0/-1) btrfs: add cancellation points to defrag (+19/-0) Zach Brown (2) commits (+9/-12): btrfs: limit fallocate extent reservation to 256MB (+4/-3) btrfs: define BTRFS_MAGIC as a u64 value (+5/-9) David Woodhouse (2) commits (+2294/-113): Btrfs: add rw argument to merge_bio_hook() (+11/-11) Btrfs: RAID5 and RAID6 (+2283/-102) Ilya Dryomov (2) commits (+6/-6): Btrfs: allow for selecting only completely empty chunks (+1/-1) Btrfs: eliminate a use-after-free in btrfs_balance() (+5/-5) jeff.liu (2) commits (+67/-0): Btrfs: Add a new ioctl to get the label of a mounted file system (+23/-0) Btrfs: set/change the label of a mounted file system (+44/-0) Filipe Brandenburger (1) commits (+19/-11): Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h Mark Fasheh (1) commits (+54/-4): btrfs: add no file data flag to btrfs send ioctl 
Alexandre Oliva (1) commits (+3/-3): clear chunk_alloc flag on retryable failure Thomas Gleixner (1) commits (+1/-0): btrfs: Init io_lock after cloning btrfs device struct Paul Gortmaker (1) commits (+1/-4): btrfs: fixup/remove module.h usage as required Tomasz Torcz (1) commits (+1/-0): Btrfs: select XOR_BLOCKS in Kconfig Jan Schmidt (1) commits (+1/-4): Btrfs: fix backref walking race with tree deletions Qu Wenruo (1) commits (+25/-38): btrfs: cleanup for open-coded alignment Kusanagi Kouichi (1) commits (+1/-1): Btrfs: Check CAP_DAC_READ_SEARCH for BTRFS_IOC_INO_PATHS Arne Jansen (1) commits (+1/-1): Btrfs: fix crash in log replay with qgroups enabled Total: (118) commits fs/btrfs/Kconfig
Re: [GIT PULL] Btrfs fixes
On Tue, Jan 22, 2013 at 05:48:33PM -0700, Chris Mason wrote: Hi Linus, My for-linus branch has our batch of btrfs fixes: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus We've been hammering away at a crc corruption as well, which I was really hoping to get into this pull. It isn't nailed down yet, but we were finally able to get a solid way to reproduce. The only good news is it isn't a recent regression. Update on this: we've tracked down the crc errors and are doing final checks on the patches. Linus, are you planning on taking this pull? If not I can just fold the new stuff into a bigger request. -chris
[GIT PULL] Btrfs fixes (v2)
Hi Linus, My for-linus branch has our batch of btrfs fixes: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus It turns out that we had two crc bugs when running fsx-linux in a loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it all down. Miao also has a new OOM fix in this v2 pull as well. Ilya fixed a regression Liu Bo found in the balance ioctls for pausing and resuming a running balance across drives. Josef's orphan truncate patch fixes an obscure corruption we'd see during xfstests. Arne's patches address problems with subvolume quotas. If the user destroys quota groups incorrectly the FS will refuse to mount. The rest are smaller fixes and plugs for memory leaks. Miao Xie (8) commits (+76/-24): Btrfs: fix missing write access release in btrfs_ioctl_resize() (+1/-0) Btrfs: do not delete a subvolume which is in a R/O subvolume (+5/-5) Btrfs: Add ACCESS_ONCE() to transaction-abort accesses (+3/-2) Btrfs: fix wrong max device number for single profile (+1/-1) Btrfs: fix repeated delalloc work allocation (+41/-14) Btrfs: fix missed transaction-aborted check (+16/-0) Btrfs: fix resize a readonly device (+4/-2) Btrfs: disable qgroup id 0 (+5/-0) Ilya Dryomov (6) commits (+94/-32): Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag (+9/-8) Btrfs: fix mutually exclusive op is running error code (+4/-4) Btrfs: fix a regression in balance usage filter (+8/-1) Btrfs: bring back balance pause/resume logic (+71/-17) Btrfs: fix unlock order in btrfs_ioctl_rm_dev (+1/-1) Btrfs: fix unlock order in btrfs_ioctl_resize (+1/-1) Liu Bo (5) commits (+23/-7): Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents (+14/-6) Btrfs: use right range to find checksum for compressed extents (+5/-0) Btrfs: let allocation start from the right raid type (+1/-1) Btrfs: reset path lock state to zero (+2/-0) Btrfs: fix off-by-one in lseek (+1/-0) Josef Bacik (5) commits (+69/-29): Btrfs: do not allow logged extents to be 
merged or removed (+16/-3) Btrfs: add orphan before truncating pagecache (+38/-15) Btrfs: set flushing if we're limited flushing (+1/-1) Btrfs: put csums on the right ordered extent (+2/-2) Btrfs: fix panic when recovering tree log (+12/-8) Arne Jansen (2) commits (+19/-1): Btrfs: prevent qgroup destroy when there are still relations (+12/-1) Btrfs: ignore orphan qgroup relations (+7/-0) Zach Brown (1) commits (+1/-0): btrfs: fix btrfs_cont_expand() freeing IS_ERR em Lukas Czerner (1) commits (+1/-1): btrfs: get the device in write mode when deleting it Eric Sandeen (1) commits (+14/-3): btrfs: update timestamps on truncate() Tsutomu Itoh (1) commits (+3/-1): Btrfs: fix memory leak in name_cache_insert() Total: (30) commits (+300/-98) fs/btrfs/extent-tree.c | 6 +- fs/btrfs/extent_map.c | 13 - fs/btrfs/extent_map.h | 1 + fs/btrfs/file-item.c| 4 +- fs/btrfs/file.c | 10 +++- fs/btrfs/free-space-cache.c | 20 --- fs/btrfs/inode.c| 137 +--- fs/btrfs/ioctl.c| 129 ++--- fs/btrfs/qgroup.c | 20 ++- fs/btrfs/send.c | 4 +- fs/btrfs/super.c| 2 +- fs/btrfs/transaction.c | 19 +- fs/btrfs/tree-log.c | 10 +++- fs/btrfs/volumes.c | 23 ++-- 14 files changed, 300 insertions(+), 98 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[GIT PULL] Btrfs fixes
Hi Linus, My for-linus branch has our batch of btrfs fixes: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus We've been hammering away at a crc corruption as well, which I was really hoping to get into this pull. It isn't nailed down yet, but we were finally able to get a solid way to reproduce. The only good news is it isn't a recent regression. The most important batch of fixes in here come from Ilya. They address a regression Liu Bo found in the balance ioctls for pausing and resuming a running balance across drives. Josef's orphan truncate patch fixes an obscure corruption we'd see during xfstests. Arne's patches address problems with subvolume quotas. If the user destroys quota groups incorrectly the FS will refuse to mount. The rest are smaller fixes and plugs for memory leaks. Ilya Dryomov (6) commits (+94/-32): Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag (+9/-8) Btrfs: fix mutually exclusive op is running error code (+4/-4) Btrfs: fix a regression in balance usage filter (+8/-1) Btrfs: bring back balance pause/resume logic (+71/-17) Btrfs: fix unlock order in btrfs_ioctl_rm_dev (+1/-1) Btrfs: fix unlock order in btrfs_ioctl_resize (+1/-1) Liu Bo (4) commits (+18/-7): Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents (+14/-6) Btrfs: let allocation start from the right raid type (+1/-1) Btrfs: reset path lock state to zero (+2/-0) Btrfs: fix off-by-one in lseek (+1/-0) Miao Xie (4) commits (+15/-7): Btrfs: fix missing write access release in btrfs_ioctl_resize() (+1/-0) Btrfs: do not delete a subvolume which is in a R/O subvolume (+5/-5) Btrfs: fix resize a readonly device (+4/-2) Btrfs: disable qgroup id 0 (+5/-0) Arne Jansen (2) commits (+19/-1): Btrfs: prevent qgroup destroy when there are still relations (+12/-1) Btrfs: ignore orphan qgroup relations (+7/-0) Josef Bacik (2) commits (+39/-16): Btrfs: add orphan before truncating pagecache (+38/-15) Btrfs: set flushing if we're limited 
flushing (+1/-1) Zach Brown (1) commits (+1/-0): btrfs: fix btrfs_cont_expand() freeing IS_ERR em Lukas Czerner (1) commits (+1/-1): btrfs: get the device in write mode when deleting it Eric Sandeen (1) commits (+14/-3): btrfs: update timestamps on truncate() Tsutomu Itoh (1) commits (+3/-1): Btrfs: fix memory leak in name_cache_insert() Total: (22) commits fs/btrfs/extent-tree.c | 6 ++- fs/btrfs/file.c| 10 ++-- fs/btrfs/inode.c | 82 +++ fs/btrfs/ioctl.c | 129 +++-- fs/btrfs/qgroup.c | 20 +++- fs/btrfs/send.c| 4 +- fs/btrfs/volumes.c | 21 ++-- 7 files changed, 204 insertions(+), 68 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] Btrfs fixes
On Tue, Jan 22, 2013 at 06:28:21PM -0700, Liu Bo wrote: On Tue, Jan 22, 2013 at 07:48:33PM -0500, Chris Mason wrote: Hi Linus, My for-linus branch has our batch of btrfs fixes: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus We've been hammering away at a crc corruption as well, which I was really hoping to get into this pull. It isn't nailed down yet, but we were finally able to get a solid way to reproduce. The only good news is it isn't a recent regression. The most important batch of fixes in here come from Ilya. They address a regression Liu Bo found in the balance ioctls for pausing and resuming a running balance across drives. Josef's orphan truncate patch fixes an obscure corruption we'd see during xfstests. Arne's patches address problems with subvolume quotas. If the user destroys quota groups incorrectly the FS will refuse to mount. The rest are smaller fixes and plugs for memory leaks. Hi, Any chance to get these in this round? I think they're good fixes, a memory leak and a warning fix, both found with xfstests. I'll get these tested in the next pull. -chris
dma engine bugs
Hi Dan,

I'm doing some benchmarking on MD raid5/6 on 4 fusionio cards in an HP DL380p. I'm doing 128K random writes on a 4 drive raid6 with a 64K stripe size per drive. I have 4 fio processes sending down the aio/dio, and a high queue depth (8192).

When I bump up the MD raid stripe cache size, I'm running into soft lockups in the async memcopy code:

[34336.959645] BUG: soft lockup - CPU#6 stuck for 22s! [fio:38296]
[34336.959648] BUG: soft lockup - CPU#9 stuck for 22s! [md0_raid6:5172]
[34336.959704] Modules linked in: raid456 async_raid6_recov async_pq async_xor async_memcpy async_tx iomemory_vsl(O) binfmt_misc cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf loop dm_mod coretemp kvm_intel kvm ghash_clmulni_intel sr_mod cdrom aesni_intel ablk_helper cryptd lrw aes_x86_64 ata_generic xts gf128mul ioatdma sb_edac gpio_ich ata_piix hid_generic dca edac_core lpc_ich microcode serio_raw mfd_core hpilo hpwdt button container tg3 sg acpi_power_meter usbhid mgag200 ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect syscopyarea uhci_hcd crc32c_intel ehci_hcd hpsa processor thermal_sys scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh btrfs raid6_pq zlib_deflate xor libcrc32c asix usbnet usbcore usb_common
[34336.959709] CPU 9
[34336.959709] Pid: 5172, comm: md0_raid6 Tainted: GW O 3.7.1-1-default #2 HP ProLiant DL380p Gen8
[34336.959720] RIP: 0010:[815381ad] [815381ad] _raw_spin_unlock_irqrestore+0xd/0x20
[34336.959721] RSP: 0018:8807af6db858 EFLAGS: 0292
[34336.959722] RAX: 1000 RBX: 8810176fd000 RCX: 0292
[34336.959723] RDX: 1000 RSI: 0292 RDI: 0292
[34336.959724] RBP: 8807af6db858 R08: 881017e40440 R09: 880f554fabc0
[34336.959725] R10: 2000 R11: R12: 881017e40460
[34336.959726] R13: 0040 R14: 0001 R15: 881017e40480
[34336.959728] FS: () GS:88103f66() knlGS:
[34336.959729] CS: 0010 DS: ES: CR0: 80050033
[34336.959730] CR2: 035cf458 CR3: 01a0b000 CR4: 000407e0
[34336.959731] DR0: DR1: DR2:
[34336.959733] DR3: DR6: 0ff0 DR7: 0400
[34336.959734] Process md0_raid6 (pid: 5172, threadinfo 8807af6da000, task 88077d7725c0)
[34336.959735] Stack:
[34336.959738] 8807af6db898 8114f287 8807af6db8b8
[34336.959740] 005bd84a 881015f2fa18 881017632a38
[34336.959742] 8807af6db8e8 a057adf4 881015f2fa18
[34336.959743] Call Trace:
[34336.959750] [8114f287] dma_pool_alloc+0x67/0x270
[34336.959758] [a057adf4] ioat2_alloc_ring_ent+0x34/0xc0 [ioatdma]
[34336.959761] [a057afc5] reshape_ring+0x145/0x370 [ioatdma]
[34336.959764] [8153841d] ? _raw_spin_lock_bh+0x2d/0x40
[34336.959767] [a057b2d9] ioat2_check_space_lock+0xe9/0x240 [ioatdma]
[34336.959768] [81538381] ? _raw_spin_unlock_bh+0x11/0x20
[34336.959771] [a057b48c] ioat2_dma_prep_memcpy_lock+0x5c/0x280 [ioatdma]
[34336.959773] [a03102df] ? do_async_gen_syndrome+0x29f/0x3d0 [async_pq]
[34336.959775] [81538381] ? _raw_spin_unlock_bh+0x11/0x20
[34336.959790] [a057ac22] ? ioat2_tx_submit_unlock+0x92/0x100 [ioatdma]
[34336.959792] [a02f8207] async_memcpy+0x207/0x1000 [async_memcpy]
[34336.959795] [a031f67d] async_copy_data+0x9d/0x150 [raid456]
[34336.959797] [a03206ba] __raid_run_ops+0x4ca/0x990 [raid456]
[34336.959802] [811b7c42] ? __aio_put_req+0x102/0x150
[34336.959805] [a031c7ae] ? handle_stripe_dirtying+0x30e/0x440 [raid456]
[34336.959807] [a03217a8] handle_stripe+0x528/0x10b0 [raid456]
[34336.959810] [a03226f0] handle_active_stripes+0x1e0/0x270 [raid456]
[34336.959814] [81293bb3] ? blk_flush_plug_list+0xb3/0x220
[34336.959817] [a03229a0] raid5d+0x220/0x3c0 [raid456]
[34336.959822] [81413b0e] md_thread+0x12e/0x160
[34336.959828] [8106bfa0] ? wake_up_bit+0x40/0x40
[34336.959829] [814139e0] ? md_rdev_init+0x110/0x110
[34336.959831] [8106b806] kthread+0xc6/0xd0
[34336.959834] [8106b740] ? kthread_freezable_should_stop+0x70/0x70
[34336.959849] [8154047c] ret_from_fork+0x7c/0xb0
[34336.959851] [8106b740] ? kthread_freezable_should_stop+0x70/0x70

Since I'm running on fast cards, I assumed MD was just hammering on this path so much that MD needed a cond_resched(). But now that I've sprinkled conditional pixie dust everywhere I'm still seeing exactly the same trace, and the lockups keep flowing forever, even after I've stopped all new IO.

Looking at ioat2_check_space_lock(), it is looping when the ring allocation fails. We're trying to
Re: dma engine bugs
On Thu, Jan 17, 2013 at 07:53:18PM -0700, Dan Williams wrote: On Thu, Jan 17, 2013 at 6:38 AM, Chris Mason chris.ma...@fusionio.com wrote: [ Sorry resend with the right address for Dan ]
[GIT PULL] Btrfs updates
Hi Linus,

If you're doing another RC, please grab these two. Otherwise I'll send them off to -stable.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This fixes a long standing problem where the btrfs scan ioctl was racing with mkfs.btrfs and dropping dirty pages created by mkfs. It also fixes a crash during tree log replay with quota enabled.

David Sterba (1) commits (+64/-6):
    btrfs: access superblock via pagecache in scan_one_device

Arne Jansen (1) commits (+1/-1):
    Btrfs: fix crash in log replay with qgroups enabled

Total: (2) commits (+65/-7)

 fs/btrfs/ctree.c   |  2 +-
 fs/btrfs/volumes.c | 70 +-
 2 files changed, 65 insertions(+), 7 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
On Fri, Jan 12, 2007 at 10:06:22AM -0800, Linus Torvalds wrote:
> > looking at the splice(2) api it seems like it'll be difficult to implement
> > O_DIRECT pread/pwrite from userland using splice... so there'd need to be
> > some help there.
>
> You'd use vmsplice() to put the write buffers into kernel space (user space
> sees it's a pipe file descriptor, but you should just ignore that: it's
> really just a kernel buffer). And then splice the resulting kernel buffers
> to the destination.

I recently spent some time trying to integrate O_DIRECT locking with page cache locking. The basic theory is that instead of using semaphores for solving O_DIRECT vs buffered races, you put something into the radix tree (I call it a placeholder) to keep the page cache users out, and lock any existing pages that are present.

O_DIRECT does save cpu from avoiding copies, but it also saves cpu from fewer radix tree operations during massive IOs. The cost of radix tree insertion/deletion on 1MB O_DIRECT ios added ~10% system time on my tiny little dual core box. I'm sure it would be much worse if there was lock contention on a big numa machine, and it grows as the io grows (SGI does massive O_DIRECT ios).

To help reduce radix churn, I made it possible for a single placeholder entry to lock down a range in the radix:

http://thread.gmane.org/gmane.linux.file-systems/12263

It looks to me as though vmsplice is going to have the same issues as my early patches. The current splice code can avoid the copy but is still working in page sized chunks. Also, splice doesn't support zero copy on things smaller than page sized chunks.

The compromise my patch makes is to hide placeholders from almost everything except the DIO code. It may be worthwhile to turn the placeholders into an IO marker that can be useful to filemap_fdatawrite and friends.

It should be able to:

    record the userland/kernel pages involved in a given io
    map blocks from the FS for making a bio
    start the io
    wake people up when the io is done

This would allow splice to operate without stealing the userland page (stealing would still be an option of course), and could get rid of big chunks of fs/direct-io.c.

-chris
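To put rough numbers on the radix churn described in this message, here is a minimal sketch. It is not code from the placeholder patch: the 4K page size and the cost model (one insertion plus one deletion per placeholder) are assumptions chosen only for illustration.

```python
# Back-of-the-envelope count of page cache radix tree operations for one
# O_DIRECT IO. PAGE_SIZE and the "one insert plus one delete per placeholder"
# cost model are assumptions for illustration, not measurements.
PAGE_SIZE = 4096


def radix_ops_per_page(io_bytes, page_size=PAGE_SIZE):
    """One placeholder inserted and later deleted for every page in the IO."""
    pages = -(-io_bytes // page_size)  # ceiling division
    return 2 * pages


def radix_ops_ranged(io_bytes):
    """A single placeholder covers the whole range: one insert, one delete."""
    return 2 if io_bytes > 0 else 0


for size in (4096, 1 << 20, 16 << 20):
    print(size, radix_ops_per_page(size), radix_ops_ranged(size))
```

Under these assumptions a 1MB IO costs 512 radix operations when handled page by page and 2 with a ranged placeholder, and the per-page cost keeps growing with the IO size while the ranged cost does not.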
Re: [rfc patch] optimize o_direct on block device
On Thu, Nov 30, 2006 at 10:16:53PM -0800, Chen, Kenneth W wrote:
> Zach Brown wrote on Thursday, November 30, 2006 1:45 PM
> > > At that time, a patch was written for raw device to demonstrate that
> > > large performance head room is achievable (at ~20% speedup for micro-
> > > benchmark and ~2% for db transaction processing benchmark) with a tight
> > > I/O submission processing loop.
> >
> > Where exactly does the benefit come from? icache misses? atomic ops
> > leading to pipeline flushes?
>
> It benefit from shorter path length. It takes much shorter time to process
> one I/O request, both in the submit and completion path. I always think in
> terms of how many instructions, or clock ticks does it take to convert user
> request into bio, submit it and in the return path, to process the bio call
> back function and do the appropriate io completion (sync or async). The
> stock 2.6.19 kernel takes about 5.17 micro-seconds to process one 4K
> aligned DIO (just the submit and completion path, less disk I/O latency).
> With the patch, the time is reduced to 4.26 us.

I'm not completely against a minimal DIO implementation for the block device, but right now we get block device QA for free when we test the rest of the DIO code. Splitting the code base makes DIO (already a special case) that much harder to test.

It's obvious there's a lot less code in your patch than fs/direct-io.c, but I'm still interested in which part of the fs/direct-io.c path is taking the most time. I would guess it is allocating the dio?

I don't think we should cut out fs/direct-io.c until we understand exactly where the hit is coming from. I know you've done lots of instrumentation already, but can you share some percentages on the hot paths?

-chris
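For reference, the two per-IO timings quoted in this message can be turned into percentages with a few lines of arithmetic. This is only a sketch of the calculation; the function names are made up here, and the "speedup" figure assumes a purely CPU-bound loop with no disk latency.

```python
# Per-IO CPU cost quoted in the thread (submit + completion path only).
STOCK_US = 5.17    # stock 2.6.19, one 4K aligned DIO
PATCHED_US = 4.26  # with the minimal block device DIO patch


def reduction_pct(before, after):
    """Percentage of per-IO time removed by the patch."""
    return 100.0 * (before - after) / before


def speedup_pct(before, after):
    """Extra IOs per second if the loop were purely CPU bound (assumption)."""
    return 100.0 * (before / after - 1.0)


saved = STOCK_US - PATCHED_US
print(f"saved per IO: {saved:.2f} us")
print(f"path length reduction: {reduction_pct(STOCK_US, PATCHED_US):.1f}%")
print(f"CPU-bound speedup: {speedup_pct(STOCK_US, PATCHED_US):.1f}%")
```

That works out to roughly 0.91 us saved per IO, about a 17.6% shorter path, or about 21% more IOs per second in a CPU-bound loop, which lines up with the ~20% micro-benchmark number quoted above. It says nothing about where in fs/direct-io.c the time goes.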
[GIT PULL] a btrfs fix
Hi Linus,

My for-linus branch has one revert in the new quota code:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

We're building up more fixes for the next merge window, but I'm keeping them out unless they are bigger regressions or have a huge impact.

Chris Mason (1):
    Revert "Btrfs: fix some error codes in btrfs_qgroup_inherit()"

 fs/btrfs/qgroup.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)
[GIT PULL] a large btrfs update
    subvol uuids and times (+292/-15)
    Btrfs: don't update atime on RO subvolumes (+7/-0)
    Btrfs: add btrfs_compare_trees function (+440/-0)
    Btrfs: make iref_to_path non static (+9/-5)

Chris Mason (5) commits (+22/-9):
    Btrfs: call the ordered free operation without any locks held (+8/-1)
    Btrfs: don't wait around for new log writers on an SSD (+2/-1)
    Btrfs: add a barrier before a waitqueue_active check (+1/-0)
    Btrfs: reduce calls to wake_up on uncontended locks (+9/-5)
    Btrfs: uninit variable fixes in send/receive (+2/-2)

Stefan Behrens (3) commits (+9/-4):
    Btrfs: avoid I/O repair BUG() from btree_read_extent_buffer_pages() (+1/-1)
    Btrfs: remove unwanted printk() for btrfs device I/O stats (+0/-3)
    Btrfs: suppress printk() if all device I/O stats are zero (+8/-0)

Li Zefan (3) commits (+159/-122):
    Btrfs: kill free_space pointer from inode structure (+10/-19)
    Btrfs: zero unused bytes in inode item (+3/-0)
    Btrfs: rewrite BTRFS_SETGET_FUNCS (+146/-103)

Ilya Dryomov (2) commits (+3/-3):
    Btrfs: do not ignore errors from btrfs_cleanup_fs_roots() when mounting (+2/-2)
    Btrfs: do not return EINVAL instead of ENOMEM from open_ctree() (+1/-1)

Dan Carpenter (2) commits (+4/-3):
    Btrfs: small naming cleanup in join_transaction() (+2/-2)
    Btrfs: fix error handling in __add_reloc_root() (+2/-1)

David Sterba (2) commits (+23/-18):
    btrfs: allow cross-subvolume file clone (+8/-3)
    btrfs: join DEV_STATS ioctls to one (+15/-15)

Arnd Hannemann (1) commits (+8/-1):
    Btrfs: allow mount -o remount,compress=no

Anand Jain (1) commits (+1/-1):
    btrfs read error corrected message floods the console during recovery

Mitch Harder (1) commits (+20/-14):
    Btrfs: Check INCOMPAT flags on remount and add helper function

Tsutomu Itoh (1) commits (+3/-3):
    Btrfs: return error of btrfs_update_inode() to caller

Andrew Mahone (1) commits (+5/-3):
    btrfs: ignore unfragmented file checks in defrag when compression enabled - rebased

Total: (65) commits

 fs/btrfs/Makefile           |    2 +-
 fs/btrfs/async-thread.c     |    9 +-
 fs/btrfs/backref.c          |   40 +-
 fs/btrfs/backref.h          |    7 +-
 fs/btrfs/btrfs_inode.h      |   14 +-
 fs/btrfs/check-integrity.c  |    7 +-
 fs/btrfs/ctree.c            |  775 +++-
 fs/btrfs/ctree.h            |  368 +++-
 fs/btrfs/delayed-inode.c    |   23 +-
 fs/btrfs/delayed-inode.h    |    2 +
 fs/btrfs/delayed-ref.c      |   56 +-
 fs/btrfs/delayed-ref.h      |   62 +-
 fs/btrfs/disk-io.c          |  150 +-
 fs/btrfs/disk-io.h          |    6 +
 fs/btrfs/extent-tree.c      |  358 ++--
 fs/btrfs/extent_io.c        |   58 +-
 fs/btrfs/file-item.c        |    4 +-
 fs/btrfs/free-space-cache.c |    2 +-
 fs/btrfs/inode.c            |   42 +-
 fs/btrfs/ioctl.c            |  471 -
 fs/btrfs/ioctl.h            |   97 +-
 fs/btrfs/locking.c          |   14 +-
 fs/btrfs/qgroup.c           | 1571 +++
 fs/btrfs/relocation.c       |    3 +-
 fs/btrfs/root-tree.c        |  107 +-
 fs/btrfs/send.c             | 4570 +++
 fs/btrfs/send.h             |  133 ++
 fs/btrfs/struct-funcs.c     |  196 +-
 fs/btrfs/super.c            |   28 +-
 fs/btrfs/transaction.c      |  101 +-
 fs/btrfs/transaction.h      |   12 +
 fs/btrfs/tree-log.c         |    4 +-
 fs/btrfs/volumes.c          |   25 +-
 fs/btrfs/volumes.h          |    4 +-
 fs/inode.c                  |    2 +
 35 files changed, 8690 insertions(+), 633 deletions(-)
[no subject]
Hi Linus,

I've split out the big send/receive update from my last pull request and now have just the fixes in my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

For anyone who wants send/receive updates, they are maintained as well. But it has enough cleanups (without fixes) that we shouldn't be asking Linus to take it right now. The send/recv branch will wander over to linux-next shortly though.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git send-recv

The largest patches in this pull are Josef's patches to fix DIO locking problems and his patch to fix a crash during balance. They are both well tested.

The rest are smaller fixes that we've had queued. The last rc came out while I was hacking new and exciting ways to recover from a misplaced rm -rf on my dev box, so these missed rc3.

Josef Bacik (9) commits (+322/-216):
    Btrfs: don't allocate a seperate csums array for direct reads (+19/-32)
    Btrfs: do not use missing devices when showing devname (+2/-0)
    Btrfs: fix enospc problems when deleting a subvol (+1/-1)
    Btrfs: increase the size of the free space cache (+7/-8)
    Btrfs: lock extents as we map them in DIO (+127/-129)
    Btrfs: fix deadlock with freeze and sync V2 (+9/-4)
    Btrfs: allow delayed refs to be merged (+142/-27)
    Btrfs: do not strdup non existent strings (+5/-3)
    Btrfs: barrier before waitqueue_active (+10/-12)

Stefan Behrens (5) commits (+16/-77):
    Btrfs: fix that repair code is spuriously executed for transid failures (+6/-2)
    Btrfs: revert checksum error statistic which can cause a BUG() (+2/-39)
    Btrfs: fix a misplaced address operator in a condition (+1/-1)
    Btrfs: remove superblock writing after fatal error (+5/-33)
    Btrfs: fix that error value is changed by mistake (+2/-2)

Dan Carpenter (4) commits (+16/-8):
    Btrfs: unlock on error in btrfs_delalloc_reserve_metadata() (+3/-1)
    Btrfs: fix some error codes in btrfs_qgroup_inherit() (+6/-2)
    Btrfs: fix some endian bugs handling the root times (+4/-4)
    Btrfs: checking for NULL instead of IS_ERR (+3/-1)

Liu Bo (2) commits (+25/-6):
    Btrfs: fix ordered extent leak when failing to start a transaction (+5/-2)
    Btrfs: fix a dio write regression (+20/-4)

Arne Jansen (2) commits (+38/-73):
    Btrfs: fix deadlock in wait_for_more_refs (+21/-73)
    Btrfs: fix race in run_clustered_refs (+17/-0)

Chris Mason (1) commits (+3/-0):
    Btrfs: don't run __tree_mod_log_free_eb on leaves

Fengguang Wu (1) commits (+3/-2):
    btrfs: fix second lock in btrfs_delete_delayed_items()

Miao Xie (1) commits (+1/-0):
    Btrfs: fix wrong mtime and ctime when creating snapshots

Total: (25) commits (+424/-382)

 fs/btrfs/backref.c       |   4 +-
 fs/btrfs/compression.c   |   1 +
 fs/btrfs/ctree.c         |   9 +-
 fs/btrfs/ctree.h         |   3 +-
 fs/btrfs/delayed-inode.c |  12 +-
 fs/btrfs/delayed-ref.c   | 163 +++-
 fs/btrfs/delayed-ref.h   |   4 +
 fs/btrfs/disk-io.c       |  53 ++--
 fs/btrfs/disk-io.h       |   2 +-
 fs/btrfs/extent-tree.c   | 123 +-
 fs/btrfs/extent_io.c     |  17 +--
 fs/btrfs/file-item.c     |   4 +-
 fs/btrfs/inode.c         | 326 ---
 fs/btrfs/ioctl.c         |   2 +-
 fs/btrfs/locking.c       |   2 +-
 fs/btrfs/qgroup.c        |  12 +-
 fs/btrfs/root-tree.c     |   4 +-
 fs/btrfs/super.c         |  15 ++-
 fs/btrfs/transaction.c   |   3 +-
 fs/btrfs/volumes.c       |  33 +
 fs/btrfs/volumes.h       |   2 -
 21 files changed, 418 insertions(+), 376 deletions(-)
[GIT PULL 1/2] Btrfs fixes
Hi everyone,

This first pull is the bulk of our changes for the next rc. It is against the 3.5 kernel so people testing the new features have a stable point to work against. This was tested against Linus' current tree as well.

The second pull is just one fix against 3.6-rc1 (in another email).

Linus, please grab my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Most of these fixes are against the new send/receive code. Alexander fixed a number of bugs in there and I found a few more while backing up my laptop. It does nightly incremental runs now about 3x faster than rsync, so things are looking pretty good.

On top of that we have fixes for some long standing bugs in the delayed reference code (a few more of these are still being worked on), deadlocks and other small fixes.

Alexander Block (23) commits (+482/-419):
    Btrfs: don't treat top/root directory inode as deleted/reused (+20/-1)
    Btrfs: fix use of radix_tree for name_cache in send/receive (+37/-39)
    Btrfs: rename backref_ctx::found_in_send_root to found_itself (+4/-4)
    Btrfs: pass root instead of parent_root to iterate_inode_ref (+2/-2)
    Btrfs: add correct parent to check_dirs when dir got moved (+11/-0)
    Btrfs: add missing check for dir != tmp_dir to is_first_ref (+1/-1)
    Btrfs: fix check for changed extent in is_extent_unchanged (+2/-2)
    Btrfs: free nce and nce_head on error in name_cache_insert (+5/-1)
    Btrfs: don't break in the final loop of find_extent_clone (+0/-1)
    Btrfs: fix cur_ino parent_ino case for send/receive (+146/-244)
    Btrfs: add/fix comments/documentation for send/receive (+134/-6)
    Btrfs: use normal return path for root == send_root case (+0/-6)
    Btrfs: fix memory leak for name_cache in send/receive (+1/-0)
    Btrfs: use kmalloc instead of stack for backref_ctx (+18/-11)
    Btrfs: remove unused use_list from send/receive code (+0/-2)
    Btrfs: remove unused tmp_path from iterate_dir_item (+0/-8)
    Btrfs: add rdev to get_inode_info in send/receive (+17/-13)
    Btrfs: use = instead of in is_extent_unchanged (+1/-1)
    Btrfs: update send_progress at correct places (+20/-6)
    Btrfs: ignore non-FS inodes for send/receive (+5/-0)
    Btrfs: code cleanups for send/receive (+35/-48)
    Btrfs: make aux field of ulist 64 bit (+21/-23)
    Btrfs: remove unused code with #if 0 (+2/-0)

Josef Bacik (9) commits (+325/-215):
    Btrfs: don't allocate a seperate csums array for direct reads (+19/-32)
    Btrfs: do not use missing devices when showing devname (+2/-0)
    Btrfs: fix enospc problems when deleting a subvol (+1/-1)
    Btrfs: increase the size of the free space cache (+7/-8)
    Btrfs: lock extents as we map them in DIO (+127/-129)
    Btrfs: allow delayed refs to be merged (+142/-27)
    Btrfs: do not strdup non existent strings (+5/-3)
    Btrfs: barrier before waitqueue_active (+10/-12)
    Btrfs: use a slab for btrfs_dio_private (+12/-3)

Dan Carpenter (4) commits (+16/-8):
    Btrfs: unlock on error in btrfs_delalloc_reserve_metadata() (+3/-1)
    Btrfs: fix some error codes in btrfs_qgroup_inherit() (+6/-2)
    Btrfs: fix some endian bugs handling the root times (+4/-4)
    Btrfs: checking for NULL instead of IS_ERR (+3/-1)

Stefan Behrens (3) commits (+8/-36):
    Btrfs: fix a misplaced address operator in a condition (+1/-1)
    Btrfs: remove superblock writing after fatal error (+5/-33)
    Btrfs: fix that error value is changed by mistake (+2/-2)

Chris Mason (2) commits (+40/-15):
    Btrfs: fix btrfs send for inline items and compression (+37/-15)
    Btrfs: don't run __tree_mod_log_free_eb on leaves (+3/-0)

Fengguang Wu (2) commits (+4/-6):
    btrfs: fix second lock in btrfs_delete_delayed_items() (+3/-2)
    btrfs: Use PTR_RET in btrfs_resume_balance_async() (+1/-4)

Arne Jansen (2) commits (+38/-73):
    Btrfs: fix deadlock in wait_for_more_refs (+21/-73)
    Btrfs: fix race in run_clustered_refs (+17/-0)

Miao Xie (1) commits (+1/-0):
    Btrfs: fix wrong mtime and ctime when creating snapshots

Total: (46) commits

 fs/btrfs/backref.c       |  12 +-
 fs/btrfs/compression.c   |   1 +
 fs/btrfs/ctree.c         |  14 +-
 fs/btrfs/ctree.h         |   3 +-
 fs/btrfs/delayed-inode.c |  12 +-
 fs/btrfs/delayed-ref.c   | 163 +++--
 fs/btrfs/delayed-ref.h   |   4 +
 fs/btrfs/disk-io.c       |  45 +--
 fs/btrfs/disk-io.h       |   2 +-
 fs/btrfs/extent-tree.c   | 123 +++
 fs/btrfs/extent_io.c     |   1 -
 fs/btrfs/file-item.c     |   4 +-
 fs/btrfs/inode.c         | 318 -
 fs/btrfs/ioctl.c         |   2 +-
 fs/btrfs/locking.c       |   2 +-
 fs/btrfs/qgroup.c        |  32 +-
 fs/btrfs/root-tree.c     |   4 +-
 fs/btrfs/send.c          | 895 ++-
 fs/btrfs/super.c         |   2 +
 fs/btrfs/transaction.c   |   3 +-
 fs/btrfs/ulist.c         |   7 +-
 fs/btrfs/ulist.h         |   9 +-
 fs/btrfs/volumes.c       |  16 +-
 23 files changed, 908 insertions
[GIT PULL 2/2] Btrfs merge fix
Hi Linus,

Please pull my for-linus-3.6 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-3.6

It fixes a merging error in rc1. The calls to mnt_want_write should have been removed.

Alexander Block (1):
    Btrfs: remove mnt_want_write call in btrfs_mksubvol

 fs/btrfs/ioctl.c | 5 -
 1 file changed, 5 deletions(-)
Re: [GIT PULL 1/2] Btrfs fixes
On Mon, Aug 20, 2012 at 07:55:59PM -0600, Linus Torvalds wrote:
> On Mon, Aug 20, 2012 at 6:53 PM, Chris Samuel ch...@csamuel.org wrote:
> >
> > This pull request with a whole heap of btrfs fixes (46 commits) appears
> > not to have been merged yet, does anyone know if it was rejected or just
> > missed ?
>
> Read my -rc2 release notes.
>
> TL;DR: I rejected big pull requests that didn't convince me. Make a damn
> good case for it, or send minimal fixes instead. I'm tired of these
> "oops, what we sent you for -rc1 wasn't ready, so here's a thousand lines
> of changes" crap.

When just the second pull went in, I wasn't sure if it was waiting for vacation or you felt it was too big, but when I saw rc2 it was pretty clear. So I'm working up an rc3 pull with longer explanations.

The bulk of my last pull was send/receive fixes. The rc1 send/recv worked fine for me on my test box, but larger scale use on well aged filesystems showed some problems. It's fair to say send/receive wasn't ready. I did expect some fixes for rc2 but not that many.

More details will be in my pull this afternoon, but with our current code it is working very well for me.

-chris
[ANNOUNCE] Btrfs v0.12 released
Hello everyone,

I wasn't planning on releasing v0.12 yet, and it was supposed to have some initial support for multiple devices. But, I have made a number of performance fixes and small bug fixes, and I wanted to get them out there before the (destabilizing) work on multiple-devices took over.

So, here's v0.12. It comes with a shiny new disk format (sorry), but the gain is dramatically better random writes to existing files. In testing here, the random write phase of tiobench went from 1MB/s to 30MB/s. The fix was to change the way back references for file extents were hashed.

Other changes:

Insert and delete multiple items at once in the btree where possible. Back references added more tree balances, and it showed up in a few benchmarks. With v0.12, backrefs have no real impact on performance.

Optimize bio end_io routines. Btrfs was spending way too much CPU time in the bio end_io routines, leading to lock contention and other problems.

Optimize read ahead during transaction commit. The old code was trying to read far too much at once, which made the end_io problems really stand out.

mount -o ssd option, which clusters file data writes together regardless of the directory the files belong to. There are a number of other performance tweaks for SSD, aimed at clustering metadata and data writes to better take advantage of the hardware.

mount -o max_inline=size option, to override the default max inline file data size (default is 8k). Any value up to the leaf size is allowed (default 16k).

Simple -ENOSPC handling. Emphasis on simple, but it prevents accidentally filling the disk most of the time. With enough threads/procs banging on things, you can still easily crash the box.

-chris
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
On Wednesday 30 January 2008, Al Boldi wrote: Jan Kara wrote: Chris Snook wrote: Al Boldi wrote: This RFC proposes to introduce a tunable which allows to disable fsync and changes ordered into writeback writeout on a per-process basis like this: echo 1 > /proc/`pidof process`/softsync This is basically a kernel workaround for stupid app behavior. Exactly right to some extent, but don't forget the underlying data=ordered starvation problem, which looks like a genuinely deep problem maybe related to blockIO. It is a problem with the way how ext3 does fsync (at least that's what we ended up with in that konqueror problem)... It has to flush the current transaction which means that app doing fsync() has to wait till all dirty data of all files on the filesystem are written (if we are in ordered mode). And that takes quite some time... There are possibilities how to avoid that but especially with freshly created files, it's tough and I don't see a way how to do it without some fundamental changes to JBD. Ok, but keep in mind that this starvation occurs even in the absence of fsync, as the benchmarks show. And, a quick test of successive 1sec delayed syncs shows no hangs until about 1 minute (~180mb) of db-writeout activity, when the sync abruptly hangs for minutes on end, and io-wait shows almost 100%. Do you see this on older kernels as well? The first thing we need to understand is if this particular stall is new. -chris
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
On Thursday 31 January 2008, Jan Kara wrote: On Thu 31-01-08 11:56:01, Chris Mason wrote: On Thursday 31 January 2008, Al Boldi wrote: Andreas Dilger wrote: On Wednesday 30 January 2008, Al Boldi wrote: And, a quick test of successive 1sec delayed syncs shows no hangs until about 1 minute (~180mb) of db-writeout activity, when the sync abruptly hangs for minutes on end, and io-wait shows almost 100%. How large is the journal in this filesystem? You can check via debugfs -R 'stat 8' /dev/XXX. 32mb. Is this affected by increasing the journal size? You can set the journal size via mke2fs -J size=400 at format time, or on an unmounted filesystem by running tune2fs -O ^has_journal /dev/XXX then tune2fs -J size=400 /dev/XXX. Setting size=400 doesn't help, nor does size=4. I suspect that the stall is caused by the journal filling up, and then waiting while the entire journal is checkpointed back to the filesystem before the next transaction can start. It is possible to improve this behaviour in JBD by reducing the amount of space that is cleared if the journal becomes full, and also doing journal checkpointing before it becomes full. While that may reduce performance a small amount, it would help avoid such huge latency problems. I believe we have such a patch in one of the Lustre branches already, and while I'm not sure what kernel it is for the JBD code rarely changes much The big difference between ordered and writeback is that once the slowdown starts, ordered goes into ~100% iowait, whereas writeback continues 100% user. Does data=ordered write buffers in the order they were dirtied? This might explain the extreme problems in transactional workloads. Well, it does but we submit them to block layer all at once so elevator should sort the requests for us... nr_requests is fairly small, so a long stream of random requests should still end up being random IO. Al, could you please compare the write throughput from vmstat for the data=ordered vs data=writeback runs? 
I would guess the data=ordered one has a lower overall write throughput. -chris
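One way to make the requested vmstat comparison concrete is to average the "bo" (blocks out) column from a captured run of each mode. The following is only a sketch: the function name is made up for illustration, and it assumes the usual vmstat layout in which a header line names the columns and each sample line holds whitespace-separated integers.

```python
def mean_blocks_out(vmstat_text):
    """Average the 'bo' column of captured `vmstat <interval>` output."""
    column = None   # index of the 'bo' column, learned from the header line
    samples = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "bo" in fields:
            # Column header line; vmstat repeats it periodically.
            column = fields.index("bo")
        elif column is not None and len(fields) > column and fields[column].isdigit():
            # Sample line: record the blocks-out value for this interval.
            samples.append(int(fields[column]))
    if not samples:
        raise ValueError("no vmstat samples found")
    return sum(samples) / len(samples)
```

Running it over one capture per mount mode gives two numbers to compare. Note that the first sample vmstat prints is an average since boot, so it may be worth dropping that line from each capture before comparing.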
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
On Thursday 31 January 2008, Al Boldi wrote: Andreas Dilger wrote: On Wednesday 30 January 2008, Al Boldi wrote: And, a quick test of successive 1sec delayed syncs shows no hangs until about 1 minute (~180mb) of db-writeout activity, when the sync abruptly hangs for minutes on end, and io-wait shows almost 100%. How large is the journal in this filesystem? You can check via debugfs -R 'stat 8' /dev/XXX. 32mb. Is this affected by increasing the journal size? You can set the journal size via mke2fs -J size=400 at format time, or on an unmounted filesystem by running tune2fs -O ^has_journal /dev/XXX then tune2fs -J size=400 /dev/XXX. Setting size=400 doesn't help, nor does size=4. I suspect that the stall is caused by the journal filling up, and then waiting while the entire journal is checkpointed back to the filesystem before the next transaction can start. It is possible to improve this behaviour in JBD by reducing the amount of space that is cleared if the journal becomes full, and also doing journal checkpointing before it becomes full. While that may reduce performance a small amount, it would help avoid such huge latency problems. I believe we have such a patch in one of the Lustre branches already, and while I'm not sure what kernel it is for the JBD code rarely changes much The big difference between ordered and writeback is that once the slowdown starts, ordered goes into ~100% iowait, whereas writeback continues 100% user. Does data=ordered write buffers in the order they were dirtied? This might explain the extreme problems in transactional workloads. -chris
Re: [PATCH][RFC] fast file mapping for loop
On Tue, 15 Jan 2008 11:07:40 +0100 Jens Axboe [EMAIL PROTECTED] wrote: I split and merged the patch into five bits (added ext3 support), so perhaps that would be easier for people to read/review. Attached and also exist in the loop-extent_map branch here: Thanks! http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=loop-extent_map Seems my ext3 version doesn't work, it craps out in ext3_get_blocks_handle() triggering this bug: J_ASSERT(handle != NULL || create == 0); I'll see if I can fix that, being fairly fs ignorant... This works, but probably pretty suboptimal (should end the new journal in map_io_complete()?). And yes I know the 9 isn't correct, since the fs block size is larger. Just making sure that we always have enough blocks.

You can use DIO_CREDITS instead of len 9, just like the ext3 O_DIRECT code does. Your current patch is fine, except it breaks data=ordered rules. My plan to work within data=ordered:

1) Inside ext3_map_extent (while the transaction was running), increment a counter in the ext3 journal for number of pending IOs. Then end the transaction handle.
2) Drop this counter inside the IO completion call
3) Change the ext3 commit code to wait for the IO count to be zero.

I'll give it a shot later this week, until then your current patch is just data=writeback, which is good enough for testing. -chris
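The three-step plan above (count pending IOs, drop the count on completion, make commit wait for zero) can be sketched in user space with a condition variable. This is only an illustration of the synchronization pattern, not ext3 or JBD code; the class and method names are hypothetical.

```python
import threading


class PendingIOCounter:
    """Hypothetical stand-in for a per-journal count of in-flight data IOs."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0

    def io_submitted(self):
        # Step 1: bump the count while the transaction handle is still open.
        with self._cond:
            self._pending += 1

    def io_completed(self):
        # Step 2: drop the count from the IO completion path.
        with self._cond:
            self._pending -= 1
            if self._pending == 0:
                self._cond.notify_all()

    def wait_for_zero(self, timeout=None):
        # Step 3: the commit path blocks until no data IO is outstanding.
        with self._cond:
            return self._cond.wait_for(lambda: self._pending == 0, timeout)


def commit(counter, log):
    """Toy commit: record the commit only after all pending IO has finished."""
    counter.wait_for_zero()
    log.append("commit")
```

The point of the pattern is that the transaction handle can be closed right after submission while the ordering guarantee is preserved, because the commit cannot proceed past wait_for_zero() until every counted IO has completed.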
[ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
Hello everyone,

Btrfs v0.10 is now available for download from:

http://oss.oracle.com/projects/btrfs/

Btrfs is still in an early alpha state, and the disk format is not finalized. v0.10 introduces a new disk format, and is not compatible with v0.9.

The core of this release is explicit back references for all metadata blocks, data extents, and directory items. These are a crucial building block for future features such as online fsck and migration between devices. The back references are verified during deletes, and the extent back references are checked by the existing offline fsck tool.

For all of the details of how the back references are maintained, please see the design document:

http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html

Other new features (described in detail below):

* Online resizing (including shrinking)
* In place conversion from Ext3 to Btrfs
* data=ordered support
* Mount options to disable data COW and checksumming
* Barrier support for sata and IDE drives

[ Resizing ]

In order to demonstrate and test the back references, I've added an online resizer, which can both grow and shrink the filesystem:

    mount -t btrfs /dev/xxx /mnt

    # add 2GB to the FS
    btrfsctl -r +2g /mnt

    # shrink the FS by 4GB
    btrfsctl -r -4g /mnt

    # Explicitly set the FS size
    btrfsctl -r 20g /mnt

    # Use 'max' to grow the FS to the limit of the device
    btrfsctl -r max /mnt

[ Conversion from Ext3 ]

This is an offline, in place, conversion program written by Yan Zheng. It has been through basic testing, but should not be trusted with critical data.

To build the conversion program, run 'make convert' in the btrfs-progs tree. It depends on libe2fs and acl development libraries.

The conversion program uses the copy on write nature of Btrfs to preserve the original Ext3 FS, sharing the data blocks between Btrfs and Ext3 metadata. Btrfs metadata is created inside the free space of the Ext3 filesystem, and it is possible to either make the conversion permanent (reclaiming the space used by Ext3) or roll back the conversion to the original Ext3 filesystem.

More details and example usage of the conversion program can be found here:

http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-converter.html

Thanks to Yan Zheng for all of his work on the converter.

[ New mount options ]

mount -o nodatacsum disables checksumming on data extents

mount -o nodatacow disables copy on write of data extents, unless a given extent is referenced by more than one snapshot. This is targeted at database workloads, where copy on write is not optimal for performance. The explicit back references allow the nodatacow code to make sure copy on write is done when multiple snapshots reference the same file, maintaining snapshot consistency.

mount -o alloc_start=num forces allocation hints to start at least num bytes into the disk. This was introduced to test the resizer. Example usage:

    mount -o alloc_start=16g /dev/ /mnt
    (do something to the FS)
    btrfsctl -r 12g /mnt

The btrfsctl command will resize the FS down to 12GB in size. Because the FS was mounted with -o alloc_start=16g, any allocations done after mounting will need to be relocated by the resizer. It is safe to specify a number past the end of the FS, if the alloc_start is too large, it is ignored.

mount -o nobarrier disables cache flushes during commit.

-chris
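The resize arguments shown in the announcement (+2g, -4g, 20g, max) follow a simple relative/absolute convention. As a rough sketch of how such an argument could be turned into a target size, the function below resolves it against a current and a device size. It is not btrfsctl's actual parser: the function name, the accepted unit suffixes, and the bounds check are all assumptions made for illustration.

```python
import re

# Assumed unit suffixes; the announcement itself only shows 'g'.
UNITS = {"": 1, "k": 1 << 10, "m": 1 << 20, "g": 1 << 30}


def resolve_resize(arg, current_size, device_size):
    """Return the target filesystem size in bytes for a resize argument."""
    if arg == "max":
        # Grow to the limit of the device.
        return device_size
    match = re.fullmatch(r"([+-]?)(\d+)([kmg]?)", arg.lower())
    if match is None:
        raise ValueError(f"unrecognised resize argument: {arg!r}")
    sign, digits, unit = match.groups()
    amount = int(digits) * UNITS[unit]
    if sign == "+":
        target = current_size + amount   # relative grow
    elif sign == "-":
        target = current_size - amount   # relative shrink
    else:
        target = amount                  # explicit size
    if not 0 < target <= device_size:
        raise ValueError("target size is outside the device")
    return target
```

With a 10GB filesystem on a 40GB device, "+2g" would resolve to 12GB, "-4g" to 6GB, "20g" to 20GB and "max" to 40GB under these assumptions.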
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
On Tue, 15 Jan 2008 20:24:27 -0500 Daniel Phillips [EMAIL PROTECTED] wrote: On Jan 15, 2008 7:15 PM, Alan Cox [EMAIL PROTECTED] wrote: Writeback cache on disk in itself is not bad, it only gets bad if the disk is not engineered to save all its dirty cache on power loss, using the disk motor as a generator or alternatively a small battery. It would be awfully nice to know which brands fail here, if any, because writeback cache is a big performance booster. AFAIK no drive saves the cache. The worst case cache flush for drives is several seconds with no retries and a couple of minutes if something really bad happens. This is why the kernel has some knowledge of barriers and uses them to issue flushes when needed. Indeed, you are right, which is supported by actual measurements: http://sr5tech.com/write_back_cache_experiments.htm Sorry for implying that anybody has engineered a drive that can do such a nice thing with writeback cache. The disk motor as a generator tale may not be purely folklore. When an IDE drive is not in writeback mode, something special needs to be done to ensure the last write to media is not a scribble. A small UPS can make writeback mode actually reliable, provided the system is smart enough to take the drives out of writeback mode when the line power is off. We've had mount -o barrier=1 for ext3 for a while now, it makes writeback caching safe. XFS has this on by default, as does reiserfs. -chris
Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
On Tuesday 15 January 2008, Chris Mason wrote: Hello everyone, Btrfs v0.10 is now available for download from: http://oss.oracle.com/projects/btrfs/ Well, it turns out this release had a few small problems: * data=ordered deadlock on older kernels (including 2.6.23) * Compile problems when ACLs were not enabled in the kernel So, I've put v0.11 out there. It fixes those two problems and will also compile on older (2.6.18) enterprise kernels. v0.11 does not have any disk format changes. -chris
Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
On Thursday 17 January 2008, Daniel Phillips wrote: On Jan 17, 2008 1:25 PM, Chris Mason [EMAIL PROTECTED] wrote: So, I've put v0.11 out there. It fixes those two problems and will also compile on older (2.6.18) enterprise kernels. v0.11 does not have any disk format changes. Hi Chris, First, massive congratulations for bringing this to fruition in such a short time. Now back to the regular carping: why even support older kernels? The general answer is the backports are small and easy. I don't test them heavily, and I don't go out of my way to make things work. But, they do make it easier for people to try out, and to figure out how to use all these new features to solve problems. Small changes that enable more testers are always welcome. In general, the core parts of the kernel that btrfs uses haven't had many interface changes since 2.6.18, so this isn't a huge deal. -chris
Re: konqueror deadlocks on 2.6.22
On Tuesday 22 January 2008, Al Boldi wrote: Ingo Molnar wrote: * Oliver Pinter (Pintér Olivér) [EMAIL PROTECTED] wrote: and then please update to CFS-v24.1 http://people.redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.6.22.15-v24.1.patch Yes with CFSv20.4, as in the log. It also hangs on 2.6.23.13 my feeling is that this is some sort of timing dependent race in konqueror/kde/qt that is exposed when a different scheduler is put in. If it disappears with CFS-v24.1 it is probably just because the timings will change again. Would be nice to debug this on the konqueror side and analyze why it fails and how. You can probably tune the timings by enabling SCHED_DEBUG and tweaking /proc/sys/kernel/*sched* values - in particular sched_latency and the granularity settings. Setting wakeup granularity to 0 might be one of the things that could make a difference. Thanks Ingo, but Mike suggested that data=writeback may make a difference, which it does indeed. So the bug seems to be related to data=ordered, although I haven't gotten any feedback from the ext3 gurus yet. Seems rather critical though, as data=writeback is a dangerous mode to run. Running fsync in data=ordered means that all of the dirty blocks on the FS will get written before fsync returns. Your original stack trace shows everyone either performing writeback for a log commit or waiting for the log commit to return. The key task in your trace is kjournald, stuck in get_request_wait. It could be a block layer bug, not giving him requests quickly enough, or it could be the scheduler not giving him back the cpu fast enough. At any rate, that's where to concentrate the debugging.
You should be able to simulate this by running a few instances of the below loop and looking for stalls:

    while true; do time dd if=/dev/zero of=foo bs=50M count=4 oflag=sync; done
Re: konqueror deadlocks on 2.6.22
On Tuesday 22 January 2008, Al Boldi wrote: Chris Mason wrote: Running fsync in data=ordered means that all of the dirty blocks on the FS will get written before fsync returns. Hm, that's strange, I expected this kind of behaviour from data=journal. data=writeback should return immediately, which seems it does, but data=ordered should only wait for metadata flush, it shouldn't wait for filedata flush. Are you sure it waits for both? I oversimplified. data=ordered means that all data blocks are written before the metadata that references them commits. So, if you add 1GB to a fileA in a transaction and then run fsync(fileB) in the same transaction, the 1GB from fileA is sent to disk (and waited on) before the fsync on fileB returns. -chris
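The fileA/fileB example above can be restated as a toy model. The sketch below is not how ext3 or JBD is implemented; it only models the rule that committing a transaction in data=ordered mode first writes out every file's dirty data that belongs to that transaction. The class and attribute names are invented for illustration.

```python
class OrderedTransaction:
    """Toy model of one running journal transaction in data=ordered mode."""

    def __init__(self):
        self.dirty = {}     # file name -> bytes of dirty data in this transaction
        self.on_disk = {}   # file name -> bytes already written out

    def write(self, name, nbytes):
        # Buffered write: the data is only marked dirty, nothing hits the disk yet.
        self.dirty[name] = self.dirty.get(name, 0) + nbytes

    def fsync(self, name):
        """Commit the transaction; return how many data bytes had to be flushed.

        The `name` argument is deliberately unused: in this model, fsync on any
        file forces out the dirty data of every file in the transaction before
        the metadata commit.
        """
        flushed = 0
        for fname, nbytes in self.dirty.items():
            self.on_disk[fname] = self.on_disk.get(fname, 0) + nbytes
            flushed += nbytes
        self.dirty.clear()
        return flushed
```

In this model, writing 1GB to fileA and a few kilobytes to fileB and then calling fsync("fileB") reports a flush of the full 1GB plus fileB's data, which is the latency effect described in the thread.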