Re: New data=ordered code pushed out to btrfs-unstable
On Fri, 2008-07-18 at 16:09 -0400, Ric Wheeler wrote:
> Just to kick the tires, I tried the same test that I ran last week on
> ext4. Everything was going great, I decided to kill it after 6 million
> files or so and restart. The unmount has taken a very, very long time -
> seems like we are cleaning up the pending transactions at a very slow
> rate:

For many workloads, the long unmount time is now fixed in btrfs-unstable
as well (thanks to a cache from Yan Zheng). For fs_mark, I created 10
million files (20k each) and was able to unmount in 2s.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: New data=ordered code pushed out to btrfs-unstable
On Mon, 2008-07-21 at 15:23 -0400, Ric Wheeler wrote:
> > > [ lock timeouts and stalls ]
> >
> > Ok, I've made a few changes that should lower overall contention on
> > the allocation mutex. I'm getting better performance on a 3 million
> > file run, please give it a shot.
>
> After an update, clean rebuild reboot, the test is running along and
> has hit about 10 million files. I still see some messages like:
>
> INFO: task pdflush:4051 blocked for more than 120 seconds.

The latest code in btrfs-unstable has everything I can safely do right
now :) Basically the stalls come from someone doing IO with the
allocation mutex held. It is surprising that we should be stalling for
such a long time; it is probably a mixture of elevator starvation and
btrfs fun.

But, btrfs-unstable also has code to replace the page lock with a
per-tree block mutex, which will allow me to get rid of the big
allocation mutex over the long term. I was able to break up most of the
long operations and have them drop/reacquire the allocation mutex to
prevent this starvation most of the time.

-chris
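[Editor's note: the anti-starvation idea Chris describes above, breaking up long operations so they drop and reacquire the allocation mutex, is a generic locking pattern. A minimal stand-in sketch in Python; the names `alloc_mutex`, `BATCH`, and `long_operation` are illustrative inventions, not btrfs identifiers.]

```python
import threading

alloc_mutex = threading.Lock()
BATCH = 64  # items to process per lock hold (illustrative tuning knob)

def long_operation(items):
    """Process items under alloc_mutex, yielding the lock between batches
    so other waiters are not starved for the whole run."""
    done = 0
    alloc_mutex.acquire()
    try:
        for i, item in enumerate(items):
            # ... work that needs alloc_mutex held ...
            done += 1
            if (i + 1) % BATCH == 0:
                # Drop the lock briefly; any thread blocked on
                # alloc_mutex gets a chance to run before we continue.
                alloc_mutex.release()
                alloc_mutex.acquire()
    finally:
        alloc_mutex.release()
    return done
```

The trade-off is the same one the mail describes: each drop/reacquire adds overhead, but bounds how long any other task can sit blocked behind one long holder.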
Re: New data=ordered code pushed out to btrfs-unstable
On Sun, 2008-07-20 at 09:46 -0400, Ric Wheeler wrote:
> > > Just to kick the tires, I tried the same test that I ran last week
> > > on ext4. Everything was going great, I decided to kill it after 6
> > > million files or so and restart.
> >
> > Well, it looks like I neglected to push all the changesets,
> > especially the last one that made it less racy. So, I've just done
> > another push, sorry. For the fs_mark workload, it shouldn't change
> > anything. This code still hasn't really survived an overnight run,
> > hopefully this commit will.
>
> The test is still running, but slowly, with a (slow) stream of
> messages about:

[ lock timeouts and stalls ]

Ok, I've made a few changes that should lower overall contention on the
allocation mutex. I'm getting better performance on a 3 million file
run, please give it a shot.

-chris
Re: New data=ordered code pushed out to btrfs-unstable
Chris Mason wrote:
> On Mon, 2008-07-21 at 14:29 -0400, Ric Wheeler wrote:
> > Chris Mason wrote:
> > > On Sun, 2008-07-20 at 09:46 -0400, Ric Wheeler wrote:
> > > > Just to kick the tires, I tried the same test that I ran last
> > > > week on ext4. Everything was going great, I decided to kill it
> > > > after 6 million files or so and restart.
> > >
> > > Well, it looks like I neglected to push all the changesets,
> > > especially the last one that made it less racy. So, I've just done
> > > another push, sorry. For the fs_mark workload, it shouldn't change
> > > anything. This code still hasn't really survived an overnight run,
> > > hopefully this commit will.
> >
> > The test is still running, but slowly, with a (slow) stream of
> > messages about:
> >
> > [ lock timeouts and stalls ]
> >
> > > Ok, I've made a few changes that should lower overall contention
> > > on the allocation mutex. I'm getting better performance on a 3
> > > million file run, please give it a shot.
> > >
> > > -chris
> >
> > Hi Chris,
> >
> > After an update, clean rebuild reboot, the test is running along and
> > has hit about 10 million files. I still see some messages like:
> >
> > INFO: task pdflush:4051 blocked for more than 120 seconds.
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > pdflush D 8129c5b0 0 4051 2
> > 81002ae77870 0046 81002ae77834 0001
> > 814b2280 814b2280 00010001 81003f188000
> > 81003fac5980 81003f188350
> >
> > but not as many as before. I will attach the messages file,
>
> I'll try running with soft-lockup detection here, see if I can hunt
> down the cause of these stalls. Good to know I've made progress
> though ;)
>
> -chris

This is an 8 core box, so it might be more prone to hitting these
things ;-)

ric
Re: New data=ordered code pushed out to btrfs-unstable
Chris Mason wrote:
> On Fri, 2008-07-18 at 18:35 -0400, Ric Wheeler wrote:
> > Chris Mason wrote:
> > > On Fri, 2008-07-18 at 16:09 -0400, Ric Wheeler wrote:
> > > > Just to kick the tires, I tried the same test that I ran last
> > > > week on ext4. Everything was going great, I decided to kill it
> > > > after 6 million files or so and restart.
> > >
> > > Well, it looks like I neglected to push all the changesets,
> > > especially the last one that made it less racy. So, I've just done
> > > another push, sorry. For the fs_mark workload, it shouldn't change
> > > anything. This code still hasn't really survived an overnight run,
> > > hopefully this commit will.
> > >
> > > -chris

The test is still running, but slowly, with a (slow) stream of messages
about:

Jul 19 10:55:38 localhost kernel: INFO: task btrfs:448 blocked for more than 120 seconds.
Jul 19 10:55:38 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 19 10:55:38 localhost kernel: btrfs D 8129c5b0 0 448 2
Jul 19 10:55:38 localhost kernel: 8100283dfc50 0046 a0514254
Jul 19 10:55:38 localhost kernel: 810012c061c0 814b2280 814b2280 81000b5ce0a0
Jul 19 10:55:38 localhost kernel: 81003f182cc0 81003fac 81003f183010
Jul 19 10:55:38 localhost kernel: Call Trace:
Jul 19 10:55:38 localhost kernel: [a0514254] ? :btrfs:free_extent_state+0x69/0x6e
Jul 19 10:55:38 localhost kernel: [8127ff47] __mutex_lock_slowpath+0x6b/0xa2
Jul 19 10:55:38 localhost kernel: [8127fdd2] mutex_lock+0x2f/0x33
Jul 19 10:55:38 localhost kernel: [a04f6e96] :btrfs:maybe_lock_mutex+0x29/0x2b
Jul 19 10:55:38 localhost kernel: [a04fae0c] :btrfs:btrfs_alloc_reserved_extent+0x2c/0x67
Jul 19 10:55:38 localhost kernel: [a0512106] ? :btrfs:btrfs_lookup_ordered_extent+0x139/0x148
Jul 19 10:55:38 localhost kernel: [a050635f] :btrfs:btrfs_finish_ordered_io+0x102/0x2a8
Jul 19 10:55:38 localhost kernel: [a050] :btrfs:btrfs_writepage_end_io_hook+0x10/0x12
Jul 19 10:55:38 localhost kernel: [a0516b62] :btrfs:end_bio_extent_writepage+0xbe/0x28d
Jul 19 10:55:38 localhost kernel: [810c6f5d] bio_endio+0x2b/0x2d
Jul 19 10:55:38 localhost kernel: [a0500d19] :btrfs:end_workqueue_fn+0x103/0x110
Jul 19 10:55:38 localhost kernel: [a051b43c] :btrfs:worker_loop+0x63/0x13e
Jul 19 10:55:38 localhost kernel: [a051b3d9] ? :btrfs:worker_loop+0x0/0x13e
Jul 19 10:55:38 localhost kernel: [810460e3] kthread+0x49/0x76
Jul 19 10:55:38 localhost kernel: [8100cda8] child_rip+0xa/0x12
Jul 19 10:55:38 localhost kernel: [8104609a] ? kthread+0x0/0x76
Jul 19 10:55:38 localhost kernel: [8100cd9e] ? child_rip+0x0/0x12
Jul 19 10:55:38 localhost kernel:

ric
Re: New data=ordered code pushed out to btrfs-unstable
On Sun, 2008-07-20 at 08:19 -0400, Ric Wheeler wrote:
> Chris Mason wrote:
> > On Fri, 2008-07-18 at 18:35 -0400, Ric Wheeler wrote:
> > > Chris Mason wrote:
> > > > On Fri, 2008-07-18 at 16:09 -0400, Ric Wheeler wrote:
> > > > > Just to kick the tires, I tried the same test that I ran last
> > > > > week on ext4. Everything was going great, I decided to kill it
> > > > > after 6 million files or so and restart.
> > > >
> > > > Well, it looks like I neglected to push all the changesets,
> > > > especially the last one that made it less racy. So, I've just
> > > > done another push, sorry. For the fs_mark workload, it shouldn't
> > > > change anything. This code still hasn't really survived an
> > > > overnight run, hopefully this commit will.
> > > >
> > > > -chris
>
> The test is still running, but slowly, with a (slow) stream of
> messages about:

Could you please grab the sysrq-w if this is still running?

-chris
Re: New data=ordered code pushed out to btrfs-unstable
Chris Mason wrote:
> On Sun, 2008-07-20 at 08:19 -0400, Ric Wheeler wrote:
> > Chris Mason wrote:
> > > On Fri, 2008-07-18 at 18:35 -0400, Ric Wheeler wrote:
> > > > Chris Mason wrote:
> > > > > On Fri, 2008-07-18 at 16:09 -0400, Ric Wheeler wrote:
> > > > > > Just to kick the tires, I tried the same test that I ran
> > > > > > last week on ext4. Everything was going great, I decided to
> > > > > > kill it after 6 million files or so and restart.
> > > > >
> > > > > Well, it looks like I neglected to push all the changesets,
> > > > > especially the last one that made it less racy. So, I've just
> > > > > done another push, sorry. For the fs_mark workload, it
> > > > > shouldn't change anything. This code still hasn't really
> > > > > survived an overnight run, hopefully this commit will.
> > > > >
> > > > > -chris
> >
> > The test is still running, but slowly, with a (slow) stream of
> > messages about:
>
> Could you please grab the sysrq-w if this is still running?
>
> -chris

Attached

ric

Jul 20 09:44:02 localhost kernel: btrfs D 8129c5b0 0 454 2
Jul 20 09:44:02 localhost kernel: 810036d3f890 0046 81003edf61e0
Jul 20 09:44:02 localhost kernel: 81003edf61f0 814b2280 814b2280 0204
Jul 20 09:44:02 localhost kernel: 0001 81002595 81003fac5980 810025950350
Jul 20 09:44:02 localhost kernel: Call Trace:
Jul 20 09:44:02 localhost kernel: [81011a3d] ? read_tsc+0x9/0x1c
Jul 20 09:44:02 localhost kernel: [8127f7ea] io_schedule+0x63/0xa5
Jul 20 09:44:02 localhost kernel: [810756ad] sync_page+0x3c/0x40
Jul 20 09:44:02 localhost kernel: [8127fbdf] __wait_on_bit_lock+0x45/0x79
Jul 20 09:44:02 localhost kernel: [81075671] ? sync_page+0x0/0x40
Jul 20 09:44:02 localhost kernel: [8107565d] __lock_page+0x63/0x6a
Jul 20 09:44:02 localhost kernel: [81046444] ? wake_bit_function+0x0/0x2a
Jul 20 09:44:02 localhost kernel: [810758a5] find_lock_page+0x66/0x93
Jul 20 09:44:02 localhost kernel: [81076167] find_or_create_page+0x27/0x79
Jul 20 09:44:02 localhost kernel: [a0514900] :btrfs:alloc_extent_buffer+0xee/0x23b
Jul 20 09:44:02 localhost kernel: [a04ffc3b] :btrfs:read_tree_block+0x37/0x62
Jul 20 09:44:02 localhost kernel: [a04f5db1] :btrfs:btrfs_search_slot+0x179b/0x18d7
Jul 20 09:44:02 localhost kernel: [a04fdb3d] :btrfs:btrfs_lookup_file_extent+0x38/0x3a
Jul 20 09:44:02 localhost kernel: [a0509231] :btrfs:btrfs_drop_extents+0xd4/0xa8c
Jul 20 09:44:02 localhost kernel: [8102a42d] ? wake_up_process+0x10/0x12
Jul 20 09:44:02 localhost kernel: [812800c4] ? __mutex_unlock_slowpath+0x31/0x39
Jul 20 09:44:02 localhost kernel: [8127fd88] ? mutex_unlock+0xe/0x10
Jul 20 09:44:02 localhost kernel: [a04f6ec1] ? :btrfs:maybe_unlock_mutex+0x29/0x2b
Jul 20 09:44:02 localhost kernel: [a04fae36] ? :btrfs:btrfs_alloc_reserved_extent+0x56/0x67
Jul 20 09:44:02 localhost kernel: [a0512106] ? :btrfs:btrfs_lookup_ordered_extent+0x139/0x148
Jul 20 09:44:02 localhost kernel: [a05063a9] :btrfs:btrfs_finish_ordered_io+0x14c/0x2a8
Jul 20 09:44:02 localhost kernel: [a050] :btrfs:btrfs_writepage_end_io_hook+0x10/0x12
Jul 20 09:44:02 localhost kernel: [a0516b62] :btrfs:end_bio_extent_writepage+0xbe/0x28d
Jul 20 09:44:02 localhost kernel: [810c6f5d] bio_endio+0x2b/0x2d
Jul 20 09:44:02 localhost kernel: [a0500d19] :btrfs:end_workqueue_fn+0x103/0x110
Jul 20 09:44:02 localhost kernel: [a051b43c] :btrfs:worker_loop+0x63/0x13e
Jul 20 09:44:02 localhost kernel: [a051b3d9] ? :btrfs:worker_loop+0x0/0x13e
Jul 20 09:44:02 localhost kernel: [810460e3] kthread+0x49/0x76
Jul 20 09:44:02 localhost kernel: [8100cda8] child_rip+0xa/0x12
Jul 20 09:44:02 localhost kernel: [8104609a] ? kthread+0x0/0x76
Jul 20 09:44:02 localhost kernel: [8100cd9e] ? child_rip+0x0/0x12
Jul 20 09:44:02 localhost kernel:
Jul 20 09:44:02 localhost kernel: btrfs-cleaner D 8129c5b0 0 456 2
Jul 20 09:44:02 localhost kernel: 81003d959a10 0046 8100154bdea0
Jul 20 09:44:02 localhost kernel: 8100154bdea0 814b2280 814b2280 0002
Jul 20 09:44:02 localhost kernel: 0034 810025952cc0 81003fb62cc0 810025953010
Jul 20 09:44:02 localhost kernel: Call Trace:
Jul 20 09:44:02 localhost kernel: [8127ff47] __mutex_lock_slowpath+0x6b/0xa2
Jul 20 09:44:02 localhost kernel: [8127fdd2] mutex_lock+0x2f/0x33
Jul 20 09:44:02 localhost kernel: [a04f6e96] :btrfs:maybe_lock_mutex+0x29/0x2b
Jul 20 09:44:02 localhost kernel: [a04fabbd] :btrfs:btrfs_alloc_extent+0x31/0xa8
Jul 20 09:44:02 localhost kernel: [a04fac9e] :btrfs:btrfs_alloc_free_block+0x6a/0x1ac
Jul 20 09:44:02 localhost
Re: New data=ordered code pushed out to btrfs-unstable
Chris Mason wrote:
> Hello everyone,
>
> It took me much longer to chase down races in my new data=ordered
> code, but I think I've finally got it, and have pushed it out to the
> unstable trees. There are no disk format changes included. I need to
> make minor mods to the resizing and balancing code, but I wanted to
> get this stuff out the door.
>
> In general, I'll call data=ordered any system that prevents seeing
> stale data on the disk after a crash. This would include null bytes
> from areas not yet written when we crashed and the contents of old
> blocks the filesystem had freed in the past.
>
> The old data=ordered code worked something like this:
>
> file_write:
> * modify pages in page cache
> * set delayed allocation bits
> * update in-memory and on-disk i_size
>
> writepage:
> * collect a large delalloc region
> * allocate new extent
> * drop existing extents from the metadata
> * insert new extent
> * start the page io
>
> transaction commit:
> * write and wait on any dirty file data to finish
> * commit the new btree pointers
>
> The end result was very large latencies during transaction commit
> because it had to wait on all the file data. A fsync of a single file
> was forced to write out all the dirty metadata and dirty data on the
> FS. This is how ext3 works today; xfs does something smarter. ext4 is
> moving to something similar to xfs.
>
> With the new code, metadata is not modified in the btree until new
> extents are fully on disk.
> It now looks something like this:
>
> file write (start, len):
> * wait on pending ordered extents for the start, len range
> * modify pages in the page cache
> * set delayed allocation bits
> * update in-memory-only i_size
>
> writepage:
> * collect a large delalloc extent
> * reserve an extent on disk in the allocation tree
> * create an ordered extent record
> * start the page io
>
> At IO completion (done in a kthread):
> * find the corresponding ordered extent record
> * if fully written, remove old extents from the tree, add new extents
>   to the tree, update on-disk i_size
>
> At commit time:
> * just do the metadata IO
>
> The end result of all of this is lower commit latencies and a
> smoother system.
>
> -chris

Just to kick the tires, I tried the same test that I ran last week on
ext4. Everything was going great, I decided to kill it after 6 million
files or so and restart. The unmount has taken a very, very long time -
seems like we are cleaning up the pending transactions at a very slow
rate:

Jul 18 16:06:04 localhost kernel: cleaner awake
Jul 18 16:06:04 localhost kernel: cleaner done
Jul 18 16:06:34 localhost kernel: trans 188 in commit
Jul 18 16:06:35 localhost kernel: trans 188 done in commit
Jul 18 16:06:35 localhost kernel: cleaner awake
Jul 18 16:06:35 localhost kernel: cleaner done
Jul 18 16:07:05 localhost kernel: trans 189 in commit
Jul 18 16:07:06 localhost kernel: trans 189 done in commit
Jul 18 16:07:06 localhost kernel: cleaner awake
Jul 18 16:07:06 localhost kernel: cleaner done
Jul 18 16:07:36 localhost kernel: trans 190 in commit
Jul 18 16:07:37 localhost kernel: trans 190 done in commit
Jul 18 16:07:37 localhost kernel: cleaner awake
Jul 18 16:07:37 localhost kernel: cleaner done
Jul 18 16:08:07 localhost kernel: trans 191 in commit
Jul 18 16:08:09 localhost kernel: trans 191 done in commit
Jul 18 16:08:09 localhost kernel: cleaner awake
Jul 18 16:08:09 localhost kernel: cleaner done
Jul 18 16:08:39 localhost kernel: trans 192 in commit
Jul 18 16:08:39 localhost kernel: trans 192 done in commit
Jul 18 16:08:39 localhost kernel: cleaner awake
Jul 18 16:08:39 localhost kernel: cleaner done

The command I ran was:

fs_mark -d /mnt/test -D 256 -n 10 -t 4 -s 20480 -F -S 0 -l btrfs_new.txt

(No fsyncs involved here)

ric
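[Editor's note: the new write path Chris outlines in this message can be modeled with a toy sketch. This is an illustrative Python model, not btrfs code; every name here (`OrderedExtent`, `Fs`, `PAGE`, and so on) is invented for the example. The point it demonstrates is the ordering guarantee: disk space is reserved at writepage time, but the btree metadata and on-disk i_size are updated only at IO completion, once every page of the ordered extent has hit disk, so the commit never has to wait on file data.]

```python
PAGE = 4096  # page size assumed for the model

class OrderedExtent:
    """One pending data write: space reserved on disk, pages still in flight."""
    def __init__(self, start, length):
        self.start = start
        self.length = length
        # page offsets whose IO has not yet completed
        self.pending = set(range(start, start + length, PAGE))

class Fs:
    def __init__(self):
        self.ordered = []        # ordered extent records with IO in flight
        self.btree_extents = []  # metadata: only fully written extents land here
        self.disk_isize = 0      # on-disk i_size, updated lazily

    def writepage_region(self, start, length):
        # writepage: reserve an extent and create an ordered extent record.
        # The btree is deliberately NOT touched yet.
        self.ordered.append(OrderedExtent(start, length))

    def end_io(self, extent, page_offset):
        # IO completion (a kthread in the real design): mark the page done;
        # only when the whole extent is written do we update metadata.
        extent.pending.discard(page_offset)
        if not extent.pending:
            self.ordered.remove(extent)
            self.btree_extents.append((extent.start, extent.length))
            self.disk_isize = max(self.disk_isize,
                                  extent.start + extent.length)

    def commit(self):
        # Transaction commit: metadata only, no waiting on data IO.
        return list(self.btree_extents), self.disk_isize
```

A crash before the last `end_io` of an extent leaves the btree pointing only at old, fully written data, which is exactly the stale-data guarantee data=ordered is after.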