Re: Drive performance bottleneck
Andrew Morton wrote:
>> Lincoln Dale <[EMAIL PROTECTED]> wrote:
>>
>> sg_dd uses a window into a kernel DMA window. as such, two of the four
>> memory accesses are cut out (1. DMA from HBA to RAM, 2. userspace
>> accessing data).
>> 1.6GB/s / 2 = 800MB/s -- or roughly what Ian was seeing with sg_dd.
>
> Right. That's a fancy way of saying "cheating" ;)
>
> But from the oprofile output it appears to me that there is plenty of CPU
> capacity left over. Maybe I'm misreading it due to oprofile adding in the
> SMP factor (25% CPU on a 4-way means we've exhausted CPU capacity).

sg_dd is lying or /dev/sg* is broken. Try that sg_dd test on any single
drive and you'll get 20 times the performance you're supposed to achieve:

puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000 time=1
Reducing read to 64 blocks per loop
time to transfer data was 69.784784 secs, 939.12 MB/sec
1000000+0 records in
1000000+0 records out

This is a single SATA drive. I'm lucky, am I not? ;-)

Regards,
Nuno Silva
Re: Drive performance bottleneck
Lincoln Dale <[EMAIL PROTECTED]> wrote:
>
> sg_dd uses a window into a kernel DMA window. as such, two of the four
> memory accesses are cut out (1. DMA from HBA to RAM, 2. userspace
> accessing data).
> 1.6GB/s / 2 = 800MB/s -- or roughly what Ian was seeing with sg_dd.

Right. That's a fancy way of saying "cheating" ;)

But from the oprofile output it appears to me that there is plenty of CPU
capacity left over. Maybe I'm misreading it due to oprofile adding in the
SMP factor (25% CPU on a 4-way means we've exhausted CPU capacity).

> DIRECT_IO should achieve similar numbers to sg_dd, but perhaps not quite
> as efficient.

Probably so. There are various tools in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz which are
more useful than dd, btw. `odread' and `odwrite' are useful for this sort
of thing.
Re: Drive performance bottleneck
At 08:32 PM 4/02/2005, Andrew Morton wrote:
> Something funny is happening here - it looks like there's plenty of CPU
> capacity left over.
[..]
> Could you monitor the CPU load during the various tests? If the `dd'
> workload isn't pegging the CPU then it could be that there's something
> wrong with the I/O submission patterns.

as an educated guess, i'd say that the workload is running out of memory
bandwidth.

let's say the RAM is single-channel DDR400: peak bandwidth = 3.2GB/s
(400 x 10^6 x 64 bits / 8).

it's fair to say that peak bandwidth is a pretty rare thing to achieve with
SDRAM given real-world access patterns -- let's take a conservative "it'll
be 50% efficient" -- so DDR400 realistic peak = 1.6GB/s.

as far as memory accesses go, a standard user-space read() from disk
results in 4 memory accesses (1. DMA from HBA to RAM, 2. read in
copy_to_user(), 3. write in copy_to_user(), 4. userspace accessing that
data). 1.6GB/s / 4 = 400MB/s -- or roughly what Ian was seeing.

sg_dd uses a window into a kernel DMA window. as such, two of the four
memory accesses are cut out (1. DMA from HBA to RAM, 2. userspace accessing
data). 1.6GB/s / 2 = 800MB/s -- or roughly what Ian was seeing with sg_dd.

DIRECT_IO should achieve similar numbers to sg_dd, but perhaps not quite as
efficient.

cheers,
lincoln.
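(An illustrative aside, not from the original mail: the 1.6GB/s "realistic
peak" above can be roughly sanity-checked from userspace with two crude
copy tests. The device name is the /dev/sdb from the thread; the block
size and count are arbitrary, and the numbers only give an upper bound on
the buffered-read copy path, not a real bandwidth measurement.)

# Reading /dev/zero into a user buffer and discarding it exercises the
# kernel->user copy path with no disk involved; the rate dd reports is a
# crude ceiling for what a buffered dd read could ever achieve on this box.
dd if=/dev/zero of=/dev/null bs=64k count=1000000

# hdparm's cached-read timing gives another rough ceiling for the
# buffer-cache -> userspace path on the same device:
hdparm -T /dev/sdb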
Re: Drive performance bottleneck
On Thu, Feb 03, 2005 at 07:03:48PM +0000, Paulo Marques wrote:
> FYI there was a patch running around last April that made a new option
> for "dd" to make it use O_DIRECT. You can get it here:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=108135935629589&w=2
>
> Unfortunately this hasn't made it into coreutils.

Follow down the thread and you'll see that it was merged to coreutils CVS
(message-id <[EMAIL PROTECTED]>), but there apparently hasn't been a
coreutils release since then. Nor has the patch been added to the Debian
package.

-andy
Re: Drive performance bottleneck
Ian Godin <[EMAIL PROTECTED]> wrote:
>
> I am trying to get very fast disk drive performance and I am seeing
> some interesting bottlenecks. We are trying to get 800 MB/sec or more
> (yes, that is megabytes per second). We are currently using
> PCI-Express with a 16 drive raid card (SATA drives). We have achieved
> that speed, but only through the SG (SCSI generic) driver. This is
> running the stock 2.6.10 kernel. And the device is not mounted as a
> file system. I also set the read ahead size on the device to 16KB
> (which speeds things up a lot):
> ...
> samples  %        symbol name
> 848185   8.3510   __copy_to_user_ll
> 772172   7.6026   do_anonymous_page
> 701579   6.9076   _spin_lock_irq
> 579024   5.7009   __copy_user_intel
> 361634   3.5606   _spin_lock
> 343018   3.3773   _spin_lock_irqsave
> 307462   3.0272   kmap_atomic
> 193327   1.9035   page_fault

Something funny is happening here - it looks like there's plenty of CPU
capacity left over.

It's odd that you're getting a lot of pagefaults in this test but not with
the sg_dd test, too. I wonder why dd is getting so many pagefaults? (I
recall that sg_dd did something cheaty, but I forget what it was).

Could you monitor the CPU load during the various tests? If the `dd'
workload isn't pegging the CPU then it could be that there's something
wrong with the I/O submission patterns.
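(One straightforward way to gather what Andrew is asking for -- the
particular tools are a suggestion, not something specified in the thread --
is to run the usual observation commands in a second terminal while the dd
is going.)

# Overall user/system/idle/iowait split and context-switch rate, once a second:
vmstat 1

# Extended per-device statistics: request sizes, queue depth, %util:
iostat -x 1

# Per-process view; on a 4-way box, watch whether dd pins one CPU at ~100%
# (which oprofile would report as ~25% of total):
top -d 1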
Re: Drive performance bottleneck
Ian Godin wrote:
> [...]
> Definitely have been able to repeat that here, so the SG driver
> definitely appears to be broken. At least I'm glad I am not going
> insane, I was starting to wonder :) I'll run some more tests with
> O_DIRECT and such things, see if I can figure out what the REAL max
> speed is.

FYI there was a patch running around last April that made a new option for
"dd" to make it use O_DIRECT. You can get it here:

http://marc.theaimsgroup.com/?l=linux-kernel&m=108135935629589&w=2

Unfortunately this hasn't made it into coreutils. IIRC there were issues
about dd being multi-platform and the way O_DIRECT was done in other
systems.

Anyway, you can patch dd yourself and have a tool for debugging with
O_DIRECT.

I hope this helps,

--
Paulo Marques - www.grupopie.com

"All that is necessary for the triumph of evil is that good men do
nothing." Edmund Burke (1729 - 1797)
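(For reference: later coreutils releases spell this option iflag=direct /
oflag=direct. Whether the CVS version discussed here used exactly that
spelling is an assumption; the command below is a sketch of the comparison
with a dd that understands O_DIRECT, reusing the device and sizes from the
thread.)

# Buffered vs. O_DIRECT read of the same 64 GB; note that with O_DIRECT
# there is no readahead, so a much larger bs (e.g. 1M) may be needed to
# keep a 16-drive array busy.
dd if=/dev/sdb of=/dev/null bs=64k count=1000000
dd if=/dev/sdb of=/dev/null bs=64k count=1000000 iflag=direct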
Re: Drive performance bottleneck
Ian Godin wrote:
> I am trying to get very fast disk drive performance and I am seeing
> some interesting bottlenecks. We are trying to get 800 MB/sec or more
> (yes, that is megabytes per second). We are currently using
> PCI-Express with a 16 drive raid card (SATA drives). We have achieved
> that speed, but only through the SG (SCSI generic) driver. This is
> running the stock 2.6.10 kernel. And the device is not mounted as a
> file system. I also set the read ahead size on the device to 16KB
> (which speeds things up a lot):

I was trying to reproduce but got distracted by this:
(use page down, if you just want to see the odd result)

puma:/tmp/dd# sg_map
/dev/sg0  /dev/sda
/dev/sg1  /dev/sdb
/dev/sg2  /dev/scd0
/dev/sg3  /dev/sdc

puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/tmp/dd/sg1 bs=64k count=1000
Reducing read to 64 blocks per loop
1000+0 records in
1000+0 records out

real    0m0.187s
user    0m0.001s
sys     0m0.141s

puma:/tmp/dd# time dd if=/dev/sdb of=/tmp/dd/sdb bs=64k count=1000
1000+0 records in
1000+0 records out
65536000 bytes transferred in 1.203468 seconds (54455956 bytes/sec)

real    0m1.219s
user    0m0.001s
sys     0m0.138s

puma:/tmp/dd# ls -l
total 128000
-rw-r--r--  1 root root 65536000 Feb  3 17:16 sdb
-rw-r--r--  1 root root 65536000 Feb  3 17:16 sg1

puma:/tmp/dd# md5sum *
ec31224970ddd3fb74501c8e68327e7b  sdb
60d4689227d60e6122f1ffe0ec1b2ad7  sg1

See? dd from sdb is not the same as sg1! Is this supposed to happen?

About the 900MB/sec: this same sg1 (= sdb, which is a single Hitachi SATA
hdd) performs like this:

puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000 time=1
Reducing read to 64 blocks per loop
time to transfer data was 69.784784 secs, 939.12 MB/sec
1000000+0 records in
1000000+0 records out

real    1m9.787s
user    0m0.063s
sys     0m58.115s

I can assure you that this drive can't do more than 60MB/sec sustained.

My only conclusion is that sg (or sg_dd) is broken? ;)

Peace,
Nuno Silva
Re: Drive performance bottleneck
On Feb 3, 2005, at 9:40 AM, Nuno Silva wrote:
> [...]
> I can assure you that this drive can't do more than 60MB/sec sustained.
>
> My only conclusion is that sg (or sg_dd) is broken? ;)

Definitely have been able to repeat that here, so the SG driver definitely
appears to be broken. At least I'm glad I am not going insane, I was
starting to wonder :)

I'll run some more tests with O_DIRECT and such things, see if I can
figure out what the REAL max speed is.

Thanks for the help everyone,
Ian.
Re: Drive performance bottleneck
On Feb 2, 2005, at 7:56 PM, Bernd Eckenfels wrote:
> In article <[EMAIL PROTECTED]> you wrote:
>> Below is an oprofile (truncated) of (the same) dd running on /dev/sdb.
>
> do you also have the oprofile of the sg_dd handy?

Just ran it on the sg_dd (using /dev/sg1):

CPU: P4 / Xeon, speed 3402.13 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        symbol name
2145136  89.7474  __copy_to_user_ll
64720     2.7077  lock_kernel
17883     0.7482  mark_offset_tsc
16212     0.6783  page_address
8091      0.3385  schedule
5380      0.2251  timer_interrupt
5314      0.2223  _spin_lock
5311      0.2222  mwait_idle
4034      0.1688  sysenter_past_esp
3569      0.1493  do_anonymous_page
3530      0.1477  apic_timer_interrupt
3368      0.1409  _spin_lock_irqsave
3150      0.1318  sg_read_xfer
3043      0.1273  kmem_cache_alloc
2590      0.1084  find_busiest_group
2553      0.1068  scheduler_tick
2319      0.0970  __copy_from_user_ll
2299      0.0962  sg_ioctl
2131      0.0892  irq_entries_start
1905      0.0797  sys_ioctl
1773      0.0742  copy_page_range
1678      0.0702  fget
1648      0.0689  __switch_to
1648      0.0689  scsi_block_when_processing_errors
1632      0.0683  sg_start_req
1569      0.0656  increment_tail
1511      0.0632  try_to_wake_up
1506      0.0630  update_one_process
1454      0.0608  fput
1397      0.0584  do_gettimeofday
1396      0.0584  zap_pte_range
1371      0.0574  recalc_task_prio
1357      0.0568  sched_clock
1352      0.0566  sg_link_reserve
1282      0.0536  __elv_add_request
1229      0.0514  kmap_atomic
1195      0.0500  page_fault

Oddly enough it appears to also be copying to user space...

Ian.
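(Roughly how a per-symbol profile like the one above would have been
collected with the oprofile userspace tools of that era. The exact flags
are from memory and may differ between oprofile versions, and the vmlinux
path and sample count are assumptions -- check opcontrol(1) and opreport(1)
before relying on them.)

opcontrol --vmlinux=/usr/src/linux/vmlinux \
          --event=GLOBAL_POWER_EVENTS:100000:1:1:1
opcontrol --start
sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000    # the workload
opcontrol --stop
opcontrol --dump
opreport -l        # per-symbol listing like the one shown above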
Re: Drive performance bottleneck
On Wed, Feb 02 2005, Ian Godin wrote:
>
> I am trying to get very fast disk drive performance and I am seeing
> some interesting bottlenecks. We are trying to get 800 MB/sec or more
> (yes, that is megabytes per second). We are currently using
> PCI-Express with a 16 drive raid card (SATA drives). We have achieved
> that speed, but only through the SG (SCSI generic) driver. This is
> running the stock 2.6.10 kernel. And the device is not mounted as a
> file system. I also set the read ahead size on the device to 16KB
> (which speeds things up a lot):
>
> blockdev --setra 16834 /dev/sdb
>
> So here are the results:
>
> $ time dd if=/dev/sdb of=/dev/null bs=64k count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 0.27user 86.19system 2:40.68elapsed 53%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+177minor)pagefaults 0swaps
>
> 64k * 1000000 / 160.68 = 398.3 MB/sec
>
> Using sg_dd just to make sure it works the same:
>
> $ time sg_dd if=/dev/sdb of=/dev/null bs=64k count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 0.05user 144.27system 2:41.55elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (17major+5375minor)pagefaults 0swaps
>
> Pretty much the same speed. Now using the SG device (sg1 is tied to sdb):
>
> $ time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000
> Reducing read to 16 blocks per loop
> 1000000+0 records in
> 1000000+0 records out
> 0.22user 66.21system 1:10.23elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+2327minor)pagefaults 0swaps
>
> 64k * 1000000 / 70.23 = 911.3 MB/sec
>
> Now that's more like the speeds we expected. I understand that the SG
> device uses direct I/O and/or mmap memory from the kernel. What I
> cannot believe is that there is that much overhead in going through the
> page buffer/cache system in Linux.

It's not going through the page cache that is the problem, it's the copying
to user space.

Have you tried using O_DIRECT? What kind of speeds are you getting with
that?

--
Jens Axboe
Re: Drive performance bottleneck
In article <[EMAIL PROTECTED]> you wrote:
> Below is an oprofile (truncated) of (the same) dd running on /dev/sdb.

do you also have the oprofile of the sg_dd handy?

Greetings
Bernd
Drive performance bottleneck
  I am trying to get very fast disk drive performance and I am seeing some
interesting bottlenecks. We are trying to get 800 MB/sec or more (yes, that
is megabytes per second). We are currently using PCI-Express with a 16
drive raid card (SATA drives). We have achieved that speed, but only
through the SG (SCSI generic) driver. This is running the stock 2.6.10
kernel. And the device is not mounted as a file system. I also set the read
ahead size on the device to 16KB (which speeds things up a lot):

blockdev --setra 16834 /dev/sdb

So here are the results:

$ time dd if=/dev/sdb of=/dev/null bs=64k count=1000000
1000000+0 records in
1000000+0 records out
0.27user 86.19system 2:40.68elapsed 53%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+177minor)pagefaults 0swaps

64k * 1000000 / 160.68 = 398.3 MB/sec

Using sg_dd just to make sure it works the same:

$ time sg_dd if=/dev/sdb of=/dev/null bs=64k count=1000000
1000000+0 records in
1000000+0 records out
0.05user 144.27system 2:41.55elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (17major+5375minor)pagefaults 0swaps

Pretty much the same speed. Now using the SG device (sg1 is tied to sdb):

$ time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000
Reducing read to 16 blocks per loop
1000000+0 records in
1000000+0 records out
0.22user 66.21system 1:10.23elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2327minor)pagefaults 0swaps

64k * 1000000 / 70.23 = 911.3 MB/sec

Now that's more like the speeds we expected. I understand that the SG
device uses direct I/O and/or mmap memory from the kernel. What I cannot
believe is that there is that much overhead in going through the page
buffer/cache system in Linux.

We also tried going through a file system (various ones: JFS, XFS, Reiser,
Ext3). They all seem to bottleneck at around 400 MB/sec, much like /dev/sdb
does. We also have a "real" SCSI raid system which also bottlenecks right
at 400 MB/sec. Under Windows (XP) both of these systems run at 650 (SCSI)
or 800 (SATA) MB/sec.

Other variations I've tried: setting the read ahead to larger or smaller
values (1, 2, 4, 8, 16, 32, 64 KB) -- 8 or 16 seems to be optimal; and
using different block sizes in the dd command (again 1, 2, 4, 8, 16, 32,
64) -- 16, 32 and 64 are pretty much identical and fastest.

Below is an oprofile (truncated) of (the same) dd running on /dev/sdb. So
is the overhead really that high? Hopefully there's a bottleneck in there
that no one has come across yet, and it can be optimized. Anyone else
trying to pull close to 1 GB/sec from disk? :)

The kernel has changed a lot since the last time I really worked with it
(2.2), so any suggestions are appreciated.

Ian Godin
Senior Software Developer
DTS/Lowry Digital Images

---
CPU: P4 / Xeon, speed 3402.13 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        symbol name
848185   8.3510   __copy_to_user_ll
772172   7.6026   do_anonymous_page
701579   6.9076   _spin_lock_irq
579024   5.7009   __copy_user_intel
361634   3.5606   _spin_lock
343018   3.3773   _spin_lock_irqsave
307462   3.0272   kmap_atomic
193327   1.9035   page_fault
181040   1.7825   schedule
174502   1.7181   radix_tree_delete
158967   1.5652   end_buffer_async_read
124124   1.2221   free_hot_cold_page
119057   1.1722   sysenter_past_esp
117384   1.1557   shrink_list
112762   1.1102   buffered_rmqueue
105490   1.0386   smp_call_function
101568   1.0000   kmem_cache_alloc
97404    0.9590   kmem_cache_free
95826    0.9435   __rmqueue
95443    0.9397   __copy_from_user_ll
93181    0.9174   free_pages_bulk
92732    0.9130   release_pages
86912    0.8557   shrink_cache
85896    0.8457   block_read_full_page
79629    0.7840   free_block
78304    0.7710   mempool_free
72264    0.7115   create_empty_buffers
71303    0.7020   do_syslog
70769    0.6968   emit_log_char
66413    0.6539   mark_offset_tsc
64333    0.6334   vprintk
63468    0.6249   file_read_actor
63292    0.6232   add_to_page_cache
62281    0.6132   unlock_page
61655    0.6070   _spin_unlock_irqrestore
59486    0.5857   find_get_page
58901    0.5799   drop_buffers
58775    0.5787   do_generic_mapping_read
55070    0.5422   __wake_up_bit
48681    0.4793   __end_that_request_first
47121    0.4639   bad_range
47102    0.4638   submit_bh
45009    0.4431   journal_add_journal_head
41270    0.4063   __alloc_pages
41247    0.4061   page_waitqueue
39520    0.3891   generic_file_buffered_write
38520    0.3793   __pagevec_lru_add
38142    0.3755   do_select
38105    0.3752   do_mpage_readpage
37020    0.3645   vsnprintf
36541    0.3598   __clear_page_buffers
35932    0.3538   journal_put_journal_head
35769    0.3522   radix_tree_lookup
35636    0.3509   bio_put
34904    0.3437   jfs_get_blocks
34865    0.3433   mark_page_accessed
33686    0.3317   bio_alloc
33273    0.3276
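(The block-size comparison described above lends itself to a small script.
This is only a sketch -- the loop itself is not from the thread -- reusing
the device, the readahead setting and the ~64 GB transfer size from the
message.)

blockdev --setra 16834 /dev/sdb        # readahead setting from the message
for bs in 16 32 64; do                 # block sizes that were compared
    count=$((64000000 / bs))           # keep the total read at ~64 GB
    echo "bs=${bs}k count=$count"
    time dd if=/dev/sdb of=/dev/null bs=${bs}k count=$count
done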