Re: New data=ordered code pushed out to btrfs-unstable

2008-07-28 Thread Chris Mason
On Fri, 2008-07-18 at 16:09 -0400, Ric Wheeler wrote:

 Just to kick the tires, I tried the same test that I ran last week on 
 ext4. Everything was going great, I decided to kill it after 6 million 
 files or so and restart.
 
 The unmount has taken a very, very long time - seems like we are 
 cleaning up the pending transactions at a very slow rate:

For many workloads, the long unmount time is now fixed in btrfs-unstable
as well (thanks to a cache from Yan Zheng).

For fs_mark, I created 10 million files (20k each) and was able to
unmount in 2s.

-chris





Re: New data=ordered code pushed out to btrfs-unstable

2008-07-25 Thread Chris Mason
On Mon, 2008-07-21 at 15:23 -0400, Ric Wheeler wrote:

  [ lock timeouts and stalls ]
 
 
  Ok, I've made a few changes that should lower overall contention on the
  allocation mutex.  I'm getting better performance on a 3 million file
  run; please give it a shot.
 
  After an update, clean rebuild & reboot, the test is running along and 
  has hit about 10 million files. I still see some messages like:
 
  INFO: task pdflush:4051 blocked for more than 120 seconds.

The latest code in btrfs-unstable has everything I can safely do right
now :)  

Basically the stalls come from someone doing IO with the allocation
mutex held.  It is surprising that we should be stalling for such a long
time; it is probably a mixture of elevator starvation and btrfs fun.

But, btrfs-unstable also has code to replace the page lock with a
per-tree block mutex, which will allow me to get rid of the big
allocation mutex over the long term.  I was able to break up most of the
long operations and have them drop/reacquire the allocation mutex to
prevent this starvation most of the time.
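
In case it helps to picture it, here is a toy userspace sketch of that
drop/reacquire pattern (pthreads and made-up names like alloc_mutex and
slow_io, not the real btrfs locking code):

/*
 * Never hold the big allocation mutex across something that can block
 * on IO, so other allocators don't pile up behind it.
 */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t alloc_mutex = PTHREAD_MUTEX_INITIALIZER;

static void slow_io(void)            { usleep(10000); }  /* stands in for a disk read */
static int  need_io(int i)           { return (i % 4) == 0; }
static void short_tree_update(int i) { (void)i; }        /* cheap in-memory work */

static void long_operation(void)
{
    pthread_mutex_lock(&alloc_mutex);
    for (int i = 0; i < 100; i++) {
        short_tree_update(i);
        if (need_io(i)) {
            /* drop the mutex before the IO ... */
            pthread_mutex_unlock(&alloc_mutex);
            slow_io();
            /* ... and take it again afterwards */
            pthread_mutex_lock(&alloc_mutex);
        }
    }
    pthread_mutex_unlock(&alloc_mutex);
}

int main(void)
{
    long_operation();
    return 0;
}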

-chris




Re: New data=ordered code pushed out to btrfs-unstable

2008-07-21 Thread Chris Mason
On Sun, 2008-07-20 at 09:46 -0400, Ric Wheeler wrote:

  Just to kick the tires, I tried the same test that I ran last week on 
  ext4. Everything was going great, I decided to kill it after 6 million 
  files or so and restart.
  
  
  Well, it looks like I neglected to push all the changesets, especially
  the last one that made it less racey.  So, I've just done another push,
  sorry.  For the fs_mark workload, it shouldn't change anything.
 
  This code still hasn't really survived an overnight run, hopefully this
  commit will.
 
  The test is still running, but slowly, with a (slow) stream of messages 
  about:

[ lock timeouts and stalls ]


Ok, I've made a few changes that should lower overall contention on the
allocation mutex.  I'm getting better performance on a 3 million file
run; please give it a shot.

-chris




Re: New data=ordered code pushed out to btrfs-unstable

2008-07-21 Thread Ric Wheeler

Chris Mason wrote:

On Mon, 2008-07-21 at 14:29 -0400, Ric Wheeler wrote:
  

Chris Mason wrote:


On Sun, 2008-07-20 at 09:46 -0400, Ric Wheeler wrote:
  
  
   


Just to kick the tires, I tried the same test that I ran last week on 
ext4. Everything was going great, I decided to kill it after 6 million 
files or so and restart.





Well, it looks like I neglected to push all the changesets, especially
the last one that made it less racey.  So, I've just done another push,
sorry.  For the fs_mark workload, it shouldn't change anything.

This code still hasn't really survived an overnight run, hopefully this
commit will.

  
  
The test is still running, but slowly, with a (slow) stream of messages 
about:



[ lock timeouts and stalls ]


Ok, I've made a few changes that should lower overall contention on the
allocation mutex.  I'm getting better performance on a 3 million file
run; please give it a shot.

-chris

  
  

Hi Chris,

After an update, clean rebuild & reboot, the test is running along and 
has hit about 10 million files. I still see some messages like:


INFO: task pdflush:4051 blocked for more than 120 seconds.
echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
pdflush   D 8129c5b0 0  4051  2
 81002ae77870 0046  81002ae77834
 0001 814b2280 814b2280 00010001
  81003f188000 81003fac5980 81003f188350

but not as many as before.

I will attach the messages file,



I'll try running with soft-lockup detection here, see if I can hunt down
the cause of these stalls.  Good to know I've made progress though ;)

-chris

  
This is an 8 core box, so it might be more prone to hitting these 
things ;-)


ric



Re: New data=ordered code pushed out to btrfs-unstable

2008-07-20 Thread Ric Wheeler

Chris Mason wrote:

On Fri, 2008-07-18 at 18:35 -0400, Ric Wheeler wrote:
  

Chris Mason wrote:


On Fri, 2008-07-18 at 16:09 -0400, Ric Wheeler wrote:

  
Just to kick the tires, I tried the same test that I ran last week on 
ext4. Everything was going great, I decided to kill it after 6 million 
files or so and restart.



Well, it looks like I neglected to push all the changesets, especially
the last one that made it less racey.  So, I've just done another push,
sorry.  For the fs_mark workload, it shouldn't change anything.

This code still hasn't really survived an overnight run, hopefully this
commit will.

-chris


  
The test is still running, but slowly, with a (slow) stream of messages 
about:


Jul 19 10:55:38 localhost kernel: INFO: task btrfs:448 blocked for more 
than 120 seconds.
Jul 19 10:55:38 localhost kernel: echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs disables this message.
Jul 19 10:55:38 localhost kernel: btrfs D 8129c5b0 
0   448  2
Jul 19 10:55:38 localhost kernel:  8100283dfc50 0046 
 a0514254
Jul 19 10:55:38 localhost kernel:  810012c061c0 814b2280 
814b2280 81000b5ce0a0
Jul 19 10:55:38 localhost kernel:   81003f182cc0 
81003fac 81003f183010

Jul 19 10:55:38 localhost kernel: Call Trace:
Jul 19 10:55:38 localhost kernel:  [a0514254] ? 
:btrfs:free_extent_state+0x69/0x6e
Jul 19 10:55:38 localhost kernel:  [8127ff47] 
__mutex_lock_slowpath+0x6b/0xa2

Jul 19 10:55:38 localhost kernel:  [8127fdd2] mutex_lock+0x2f/0x33
Jul 19 10:55:38 localhost kernel:  [a04f6e96] 
:btrfs:maybe_lock_mutex+0x29/0x2b
Jul 19 10:55:38 localhost kernel:  [a04fae0c] 
:btrfs:btrfs_alloc_reserved_extent+0x2c/0x67
Jul 19 10:55:38 localhost kernel:  [a0512106] ? 
:btrfs:btrfs_lookup_ordered_extent+0x139/0x148
Jul 19 10:55:38 localhost kernel:  [a050635f] 
:btrfs:btrfs_finish_ordered_io+0x102/0x2a8
Jul 19 10:55:38 localhost kernel:  [a050] 
:btrfs:btrfs_writepage_end_io_hook+0x10/0x12
Jul 19 10:55:38 localhost kernel:  [a0516b62] 
:btrfs:end_bio_extent_writepage+0xbe/0x28d

Jul 19 10:55:38 localhost kernel:  [810c6f5d] bio_endio+0x2b/0x2d
Jul 19 10:55:38 localhost kernel:  [a0500d19] 
:btrfs:end_workqueue_fn+0x103/0x110
Jul 19 10:55:38 localhost kernel:  [a051b43c] 
:btrfs:worker_loop+0x63/0x13e
Jul 19 10:55:38 localhost kernel:  [a051b3d9] ? 
:btrfs:worker_loop+0x0/0x13e

Jul 19 10:55:38 localhost kernel:  [810460e3] kthread+0x49/0x76
Jul 19 10:55:38 localhost kernel:  [8100cda8] child_rip+0xa/0x12
Jul 19 10:55:38 localhost kernel:  [8104609a] ? kthread+0x0/0x76
Jul 19 10:55:38 localhost kernel:  [8100cd9e] ? child_rip+0x0/0x12
Jul 19 10:55:38 localhost kernel:

ric



Re: New data=ordered code pushed out to btrfs-unstable

2008-07-20 Thread Chris Mason
On Sun, 2008-07-20 at 08:19 -0400, Ric Wheeler wrote:
 Chris Mason wrote:
  On Fri, 2008-07-18 at 18:35 -0400, Ric Wheeler wrote:

  Chris Mason wrote:
  
  On Fri, 2008-07-18 at 16:09 -0400, Ric Wheeler wrote:
 

  Just to kick the tires, I tried the same test that I ran last week on 
  ext4. Everything was going great, I decided to kill it after 6 million 
  files or so and restart.
  
 
  Well, it looks like I neglected to push all the changesets, especially
  the last one that made it less racey.  So, I've just done another push,
  sorry.  For the fs_mark workload, it shouldn't change anything.
 
  This code still hasn't really survived an overnight run, hopefully this
  commit will.
 
  -chris
 
 

 The test is still running, but slowly, with a (slow) stream of messages 
 about:
 

Could you please grab the sysrq-w if this is still running?

-chris




Re: New data=ordered code pushed out to btrfs-unstable

2008-07-20 Thread Ric Wheeler

Chris Mason wrote:

On Sun, 2008-07-20 at 08:19 -0400, Ric Wheeler wrote:
  

Chris Mason wrote:


On Fri, 2008-07-18 at 18:35 -0400, Ric Wheeler wrote:
  
  

Chris Mason wrote:



On Fri, 2008-07-18 at 16:09 -0400, Ric Wheeler wrote:

  
  
Just to kick the tires, I tried the same test that I ran last week on 
ext4. Everything was going great, I decided to kill it after 6 million 
files or so and restart.



Well, it looks like I neglected to push all the changesets, especially
the last one that made it less racey.  So, I've just done another push,
sorry.  For the fs_mark workload, it shouldn't change anything.

This code still hasn't really survived an overnight run, hopefully this
commit will.

-chris


  
  
The test is still running, but slowly, with a (slow) stream of messages 
about:





Could you please grab the sysrq-w if this is still running?

-chris


  

Attached

ric


Jul 20 09:44:02 localhost kernel: btrfs D 8129c5b0 0   454  
2
Jul 20 09:44:02 localhost kernel:  810036d3f890 0046 
 81003edf61e0
Jul 20 09:44:02 localhost kernel:  81003edf61f0 814b2280 
814b2280 0204
Jul 20 09:44:02 localhost kernel:  0001 81002595 
81003fac5980 810025950350
Jul 20 09:44:02 localhost kernel: Call Trace:
Jul 20 09:44:02 localhost kernel:  [81011a3d] ? read_tsc+0x9/0x1c
Jul 20 09:44:02 localhost kernel:  [8127f7ea] io_schedule+0x63/0xa5
Jul 20 09:44:02 localhost kernel:  [810756ad] sync_page+0x3c/0x40
Jul 20 09:44:02 localhost kernel:  [8127fbdf] 
__wait_on_bit_lock+0x45/0x79
Jul 20 09:44:02 localhost kernel:  [81075671] ? sync_page+0x0/0x40
Jul 20 09:44:02 localhost kernel:  [8107565d] __lock_page+0x63/0x6a
Jul 20 09:44:02 localhost kernel:  [81046444] ? 
wake_bit_function+0x0/0x2a
Jul 20 09:44:02 localhost kernel:  [810758a5] find_lock_page+0x66/0x93
Jul 20 09:44:02 localhost kernel:  [81076167] 
find_or_create_page+0x27/0x79
Jul 20 09:44:02 localhost kernel:  [a0514900] 
:btrfs:alloc_extent_buffer+0xee/0x23b
Jul 20 09:44:02 localhost kernel:  [a04ffc3b] 
:btrfs:read_tree_block+0x37/0x62
Jul 20 09:44:02 localhost kernel:  [a04f5db1] 
:btrfs:btrfs_search_slot+0x179b/0x18d7
Jul 20 09:44:02 localhost kernel:  [a04fdb3d] 
:btrfs:btrfs_lookup_file_extent+0x38/0x3a
Jul 20 09:44:02 localhost kernel:  [a0509231] 
:btrfs:btrfs_drop_extents+0xd4/0xa8c
Jul 20 09:44:02 localhost kernel:  [8102a42d] ? 
wake_up_process+0x10/0x12
Jul 20 09:44:02 localhost kernel:  [812800c4] ? 
__mutex_unlock_slowpath+0x31/0x39
Jul 20 09:44:02 localhost kernel:  [8127fd88] ? mutex_unlock+0xe/0x10
Jul 20 09:44:02 localhost kernel:  [a04f6ec1] ? 
:btrfs:maybe_unlock_mutex+0x29/0x2b
Jul 20 09:44:02 localhost kernel:  [a04fae36] ? 
:btrfs:btrfs_alloc_reserved_extent+0x56/0x67
Jul 20 09:44:02 localhost kernel:  [a0512106] ? 
:btrfs:btrfs_lookup_ordered_extent+0x139/0x148
Jul 20 09:44:02 localhost kernel:  [a05063a9] 
:btrfs:btrfs_finish_ordered_io+0x14c/0x2a8
Jul 20 09:44:02 localhost kernel:  [a050] 
:btrfs:btrfs_writepage_end_io_hook+0x10/0x12
Jul 20 09:44:02 localhost kernel:  [a0516b62] 
:btrfs:end_bio_extent_writepage+0xbe/0x28d
Jul 20 09:44:02 localhost kernel:  [810c6f5d] bio_endio+0x2b/0x2d
Jul 20 09:44:02 localhost kernel:  [a0500d19] 
:btrfs:end_workqueue_fn+0x103/0x110
Jul 20 09:44:02 localhost kernel:  [a051b43c] 
:btrfs:worker_loop+0x63/0x13e
Jul 20 09:44:02 localhost kernel:  [a051b3d9] ? 
:btrfs:worker_loop+0x0/0x13e
Jul 20 09:44:02 localhost kernel:  [810460e3] kthread+0x49/0x76
Jul 20 09:44:02 localhost kernel:  [8100cda8] child_rip+0xa/0x12
Jul 20 09:44:02 localhost kernel:  [8104609a] ? kthread+0x0/0x76
Jul 20 09:44:02 localhost kernel:  [8100cd9e] ? child_rip+0x0/0x12
Jul 20 09:44:02 localhost kernel: 
Jul 20 09:44:02 localhost kernel: btrfs-cleaner D 8129c5b0 0   456  
2
Jul 20 09:44:02 localhost kernel:  81003d959a10 0046 
 8100154bdea0
Jul 20 09:44:02 localhost kernel:  8100154bdea0 814b2280 
814b2280 0002
Jul 20 09:44:02 localhost kernel:  0034 810025952cc0 
81003fb62cc0 810025953010
Jul 20 09:44:02 localhost kernel: Call Trace:
Jul 20 09:44:02 localhost kernel:  [8127ff47] 
__mutex_lock_slowpath+0x6b/0xa2
Jul 20 09:44:02 localhost kernel:  [8127fdd2] mutex_lock+0x2f/0x33
Jul 20 09:44:02 localhost kernel:  [a04f6e96] 
:btrfs:maybe_lock_mutex+0x29/0x2b
Jul 20 09:44:02 localhost kernel:  [a04fabbd] 
:btrfs:btrfs_alloc_extent+0x31/0xa8
Jul 20 09:44:02 localhost kernel:  [a04fac9e] 
:btrfs:btrfs_alloc_free_block+0x6a/0x1ac
Jul 20 09:44:02 localhost 

Re: New data=ordered code pushed out to btrfs-unstable

2008-07-18 Thread Ric Wheeler

Chris Mason wrote:

Hello everyone,

It took me much longer to chase down races in my new data=ordered code,
but I think I've finally got it, and have pushed it out to the unstable
trees.

There are no disk format changes included.  I need to make minor mods to
the resizing and balancing code, but I wanted to get this stuff out the
door.

In general, I'll call data=ordered any system that prevents seeing stale
data on the disk after a crash.  This would include null bytes from
areas not yet written when we crashed and the contents of old blocks the
filesystem had freed in the past.

The old data=ordered code worked something like this:

file_write:
* modify pages in page cache
* set delayed allocation bits
* Update in memory and on-disk i_size

writepage:
* collect a large delalloc region
* allocate new extent
* drop existing extents from the metadata
* insert new extent
* start the page io

transaction commit:
* write and wait on any dirty file data to finish
* commit the new btree pointers

The end result was very large latencies during transaction commit
because it had to wait on all the file data.  A fsync of a single file
was forced to write out all the dirty metadata and dirty data on the FS.
This is how ext3 works today; xfs does something smarter, and ext4 is
moving to something similar to xfs.
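
Spelled out as a toy (illustrative names only, not the actual btrfs
functions), the old commit path was basically:

/* Every commit, and therefore every fsync, paid for all dirty data. */
#include <stdio.h>

static void write_and_wait_on_all_dirty_data(void)
{
    printf("flushing every dirty data page on the FS\n");
}

static void write_btree_blocks(void)
{
    printf("writing the new btree pointers\n");
}

static void old_transaction_commit(void)
{
    write_and_wait_on_all_dirty_data();  /* this is where the latency came from */
    write_btree_blocks();
}

int main(void)
{
    old_transaction_commit();
    return 0;
}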

With the new code, metadata is not modified in the btree until new
extents are fully on disk.  It now looks something like this:

file write (start, len):
* wait on pending ordered extents for the start, len range
* modify pages in the page cache
* set delayed allocation bits
* Update in memory only i_size

writepage:
* collect a large delalloc extent
* reserve an extent on disk in the allocation tree
* create an ordered extent record
* start the page io

At IO completion (done in a kthread):
* find the corresponding ordered extent record
* if fully written, remove old extents from the tree,
  add new extents to the tree, update on disk i_size

At commit time:
* Do only metadata IO

The end result of all of this is lower commit latencies and a smoother
system.
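
If it helps to see the bookkeeping, here is a toy model of that flow
(the struct and function names are made up for illustration, they are
not the real btrfs ones):

#include <stdio.h>
#include <stdlib.h>

struct ordered_extent {
    unsigned long long start;   /* file offset of the reserved extent */
    unsigned long long len;
    int io_done;                /* set once the data IO has finished  */
    struct ordered_extent *next;
};

static struct ordered_extent *ordered_list;
static unsigned long long disk_i_size;  /* on-disk i_size, updated late */

/* writepage: reserve space and record the pending ordered extent,
 * but do not touch the metadata tree yet. */
static struct ordered_extent *writepage(unsigned long long start,
                                        unsigned long long len)
{
    struct ordered_extent *oe = calloc(1, sizeof(*oe));
    if (!oe)
        exit(1);
    oe->start = start;
    oe->len = len;
    oe->next = ordered_list;
    ordered_list = oe;
    printf("writepage: reserved %llu+%llu, data IO started\n", start, len);
    return oe;
}

/* IO completion (a kthread in the real code): only now are old extents
 * dropped, the new one inserted, and the on-disk i_size bumped. */
static void end_io(struct ordered_extent *oe)
{
    oe->io_done = 1;
    if (oe->start + oe->len > disk_i_size)
        disk_i_size = oe->start + oe->len;
    printf("end_io: metadata updated, disk i_size now %llu\n", disk_i_size);
}

/* commit: nothing to wait for on the data side, metadata only. */
static void commit(void)
{
    for (struct ordered_extent *oe = ordered_list; oe; oe = oe->next)
        if (!oe->io_done)
            printf("commit: extent at %llu still in flight, not in the tree yet\n",
                   oe->start);
    printf("commit: metadata IO only\n");
}

int main(void)
{
    struct ordered_extent *oe = writepage(0, 20480);
    commit();   /* data IO not done yet, but commit does not wait on it */
    end_io(oe);
    commit();
    free(oe);
    return 0;
}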

-chris
  


Just to kick the tires, I tried the same test that I ran last week on 
ext4. Everything was going great, I decided to kill it after 6 million 
files or so and restart.


The unmount has taken a very, very long time - seems like we are 
cleaning up the pending transactions at a very slow rate:


Jul 18 16:06:04 localhost kernel: cleaner awake
Jul 18 16:06:04 localhost kernel: cleaner done
Jul 18 16:06:34 localhost kernel: trans 188 in commit
Jul 18 16:06:35 localhost kernel: trans 188 done in commit
Jul 18 16:06:35 localhost kernel: cleaner awake
Jul 18 16:06:35 localhost kernel: cleaner done
Jul 18 16:07:05 localhost kernel: trans 189 in commit
Jul 18 16:07:06 localhost kernel: trans 189 done in commit
Jul 18 16:07:06 localhost kernel: cleaner awake
Jul 18 16:07:06 localhost kernel: cleaner done
Jul 18 16:07:36 localhost kernel: trans 190 in commit
Jul 18 16:07:37 localhost kernel: trans 190 done in commit
Jul 18 16:07:37 localhost kernel: cleaner awake
Jul 18 16:07:37 localhost kernel: cleaner done
Jul 18 16:08:07 localhost kernel: trans 191 in commit
Jul 18 16:08:09 localhost kernel: trans 191 done in commit
Jul 18 16:08:09 localhost kernel: cleaner awake
Jul 18 16:08:09 localhost kernel: cleaner done
Jul 18 16:08:39 localhost kernel: trans 192 in commit
Jul 18 16:08:39 localhost kernel: trans 192 done in commit
Jul 18 16:08:39 localhost kernel: cleaner awake
Jul 18 16:08:39 localhost kernel: cleaner done

The command I ran was:

fs_mark  -d  /mnt/test  -D  256  -n  10  -t  4  -s  20480  -F  -S  0 
-l btrfs_new.txt


(No fsyncs involved here)

ric


