I am trying to get very fast disk performance and I am seeing some
interesting bottlenecks. We are trying to get 800 MB/sec or more
(yes, that is megabytes per second). We are currently using a
16-drive SATA RAID card on PCI-Express. We have achieved that speed,
but only through the SG (SCSI generic) driver. This is running the
stock 2.6.10 kernel, and the device is not mounted as a file system.
I also set the read-ahead on the device to 16K sectors, which is 8 MB
since --setra counts 512-byte sectors (this speeds things up a lot):
blockdev --setra 16384 /dev/sdb
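(In case anyone wants to double-check what read-ahead the kernel actually
ended up with, here is a rough sketch that reads the setting back through
the BLKRAGET ioctl, which is what blockdev --getra uses; the name rahead.c
is just my example.)

/* rahead.c -- print a block device's read-ahead setting (in 512-byte
 * sectors) via the BLKRAGET ioctl.  Sketch only; build with
 * "gcc -o rahead rahead.c" and run e.g. "./rahead /dev/sdb". */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
    long ra = 0;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <block device>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, BLKRAGET, &ra) < 0) {
        perror("BLKRAGET");
        return 1;
    }
    printf("read-ahead: %ld sectors (%ld KB)\n", ra, ra / 2);
    close(fd);
    return 0;
}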
So here are the results:
$ time dd if=/dev/sdb of=/dev/null bs=64k count=1000000
1000000+0 records in
1000000+0 records out
0.27user 86.19system 2:40.68elapsed 53%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+177minor)pagefaults 0swaps
64k * 1000000 / 160.68 = 398.3 MB/sec
Using sg_dd just to make sure it works the same:
$ time sg_dd if=/dev/sdb of=/dev/null bs=64k count=1000000
1000000+0 records in
1000000+0 records out
0.05user 144.27system 2:41.55elapsed 89%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (17major+5375minor)pagefaults 0swaps
Pretty much the same speed. Now using the SG device (sg1 is tied to
sdb):
$ time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000
Reducing read to 16 blocks per loop
1000000+0 records in
1000000+0 records out
0.22user 66.21system 1:10.23elapsed 94%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+2327minor)pagefaults 0swaps
64k * 1000000 / 70.23 = 911.3 MB/sec
Now that's more like the speed we expected. I understand that the
SG device uses direct I/O and/or memory mapped from the kernel. What I
cannot believe is that there is that much overhead in going through
the Linux buffer/page cache.
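One thing I still want to try is opening /dev/sdb with O_DIRECT from a
plain user-space program, to see how much of the gap is just the copy
through the page cache. Roughly the loop I have in mind is below; it is
an untested sketch, and the name directread.c and the 4 KB buffer
alignment are just my guesses (O_DIRECT wants sector-aligned buffers and
transfer sizes):

/* directread.c -- read a block device sequentially with O_DIRECT,
 * bypassing the page cache, and report throughput.  Untested sketch.
 * Build with "gcc -O2 -o directread directread.c" and run e.g.
 * "./directread /dev/sdb 65536 1000000" to mimic the dd runs above. */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    size_t bs;
    long count, i;
    void *buf;
    int fd;
    ssize_t got;
    struct timeval t0, t1;
    double secs, mb;

    if (argc != 4) {
        fprintf(stderr, "usage: %s <device> <block size> <count>\n", argv[0]);
        return 1;
    }
    bs = (size_t)atol(argv[2]);
    count = atol(argv[3]);

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* O_DIRECT needs an aligned buffer; 4 KB alignment should be safe. */
    if (posix_memalign(&buf, 4096, bs) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    gettimeofday(&t0, NULL);
    for (i = 0; i < count; i++) {
        got = read(fd, buf, bs);
        if (got < 0) {
            perror("read");
            return 1;
        }
        if (got == 0)           /* hit the end of the device */
            break;
    }
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    mb = (double)i * bs / 1e6;
    printf("%.1f MB in %.2f sec = %.1f MB/sec\n", mb, secs, mb / secs);

    free(buf);
    close(fd);
    return 0;
}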
We also tried going through a file system (JFS, XFS, ReiserFS, Ext3).
They all seem to bottleneck at around 400 MB/sec, much like /dev/sdb
does. We also have a "real" SCSI RAID system, which bottlenecks right
at 400 MB/sec as well. Under Windows XP both of these systems run at
650 (SCSI) or 800 (SATA) MB/sec.
Other variations I've tried: setting the read-ahead to larger and
smaller values (1K, 2K, 4K, 8K, 16K, 32K, 64K); 8K or 16K seems to be
optimal. Using different block sizes in the dd command (again 1, 2, 4,
8, 16, 32, 64 KB); 16, 32 and 64 KB are pretty much identical and fastest.
Below is a (truncated) oprofile of the same dd run on /dev/sdb.
So is the overhead really that high? Hopefully there's a bottleneck
in there that no one has come across yet and it can be optimized.
Anyone else trying to pull close to 1 GB/sec from disk? :) The kernel
has changed a lot since I last really worked with it (2.2), so
any suggestions are appreciated.
Ian Godin
Senior Software Developer
DTS/Lowry Digital Images
---
CPU: P4 / Xeon, speed 3402.13 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        symbol name
848185 8.3510 __copy_to_user_ll
772172 7.6026 do_anonymous_page
701579 6.9076 _spin_lock_irq
579024 5.7009 __copy_user_intel
361634 3.5606 _spin_lock
343018 3.3773 _spin_lock_irqsave
307462 3.0272 kmap_atomic
193327 1.9035 page_fault
181040 1.7825 schedule
174502 1.7181 radix_tree_delete
158967 1.5652 end_buffer_async_read
124124 1.2221 free_hot_cold_page
119057 1.1722 sysenter_past_esp
117384 1.1557 shrink_list
112762 1.1102 buffered_rmqueue
105490 1.0386 smp_call_function
101568 1.0000 kmem_cache_alloc
97404 0.9590 kmem_cache_free
95826 0.9435 __rmqueue
95443 0.9397 __copy_from_user_ll
93181 0.9174 free_pages_bulk
92732 0.9130 release_pages
86912 0.8557 shrink_cache
85896 0.8457 block_read_full_page
79629 0.7840 free_block
78304 0.7710 mempool_free
72264 0.7115 create_empty_buffers
71303 0.7020 do_syslog
70769 0.6968 emit_log_char
66413 0.6539 mark_offset_tsc
64333 0.6334 vprintk
63468 0.6249 file_read_actor
63292 0.6232 add_to_page_cache
62281 0.6132 unlock_page
61655 0.6070 _spin_unlock_irqrestore
59486 0.5857 find_get_page
58901 0.5799 drop_buffers
58775 0.5787 do_generic_mapping_read
55070 0.5422 __wake_up_bit
48681 0.4793 __end_that_request_first
47121 0.4639 bad_range
47102 0.4638 submit_bh
45009 0.4431 journal_add_journal_head
41270 0.4063 __alloc_pages
41247 0.4061 page_waitqueue
39520 0.3891 generic_file_buffered_write
38520 0.3793 __pagevec_lru_add
38142 0.3755 do_select
38105 0.3752 do_mpage_readpage
37020 0.3645 vsnprintf
36541 0.3598 __clear_page_buffers
35932 0.3538 journal_put_journal_head
35769 0.3522 radix_tree_lookup
35636 0.3509 bio_put
34904 0.3437 jfs_get_blocks
34865 0.3433 mark_page_accessed
33686 0.3317 b