Re: IBM blade server abysmal disk write performances
Hi,

Quick follow-up on this. As I mentioned in a previous email, we have moved to SATA drives and the SAS drives have been shelved for now. The current project will be using those, so further tests on SAS have been postponed to an undefined date.

Thanks,

Karim.

PS: I'll keep the SAS tests in my back pocket so I get a head start when we get around to SAS testing again.

On 18/01/2013 6:32 PM, Karim Fodil-Lemelin wrote:

On 18/01/2013 5:42 PM, Matthew Jacob wrote:

This is all turning into a bikeshed discussion. As far as I can tell, the basic original question was why a *SAS* (not a SATA) drive was not performing as well as expected based upon experiences with Linux. I still don't know whether reads or writes were being used for dd. This morning, I ran a fio test with a single-threaded read component and a multithreaded write component to see if there were differences. All I had connected to my MPT system were ATA drives (Seagate 500GBs) and I'm remote now and won't be back until Sunday to put in one of my 'good' SAS drives (140 GB Seagates, i.e., real SAS 15K RPM drives, not fat SATA bs drives). The numbers were pretty much the same for both FreeBSD and Linux. In fact, FreeBSD was slightly faster. I won't report the exact numbers right now, but only mention this as a piece of information that, at least in my case, the difference between the OS platforms involved is negligible. This would, at least in my case, rule out issues based upon different platform access methods and different drivers. All of this other discussion about WCE and whatnot is nice but, for all the purposes it serves, could be moved to *-advocacy.

Thanks for the clarifications! I did mention at some point that those were write speeds and reads were just fine, and those were either writes to the filesystem or direct access (only on SAS again).
Here is what I am planning to do next week when I get the chance:

0) I plan on focusing on the SAS drive tests _only_, since SATA is working as expected so there is nothing to report there.

1) Look carefully at how the drives are physically connected. Although it feels like if SATA works fine then SAS should also, I'll check anyway.

2) Boot verbose with boot -v and send the dmesg output. The mpt driver might give us a clue.

3) Run gstat -abc in a loop for the test duration. Although I would think ctlstat(8) might be more interesting here, so I'll run it too for good measure :).

Please note that in all tests write caching was enabled, as I think this is the default with FBSD 9.1 GENERIC, but I'll confirm this with camcontrol(8). I've also seen quite a lot of 'quirks' for tagged command queuing in the source code (/sys/cam/scsi/scsi_xpt.c) but a particular one got my attention (thanks to whomever writes good comments in source code :) :

/*
 * Slow when tagged queueing is enabled. Write performance
 * steadily drops off with more and more concurrent
 * transactions. Best sequential write performance with
 * tagged queueing turned off and write caching turned on.
 *
 * PR:            kern/10398
 * Submitted by:  Hideaki Okada hok...@isl.melco.co.jp
 * Drive:         DCAS-34330 w/ S65A firmware.
 *
 * The drive with the problem had the S65A firmware
 * revision, and has also been reported (by Stephen J.
 * Roznowski s...@home.net) for a drive with the S61A
 * firmware revision.
 *
 * Although no one has reported problems with the 2 gig
 * version of the DCAS drive, the assumption is that it
 * has the same problems as the 4 gig version. Therefore
 * this quirk entries disables tagged queueing for all
 * DCAS drives.
 */
{ T_DIRECT, SIP_MEDIA_FIXED, IBM, DCAS*, * }, /*quirks*/0, /*mintags*/0, /*maxtags*/0

So I looked at the kern/10398 PR and got some feeling of 'deja vu'. Although the original problem was on FreeBSD 3.1, so it's most likely not that, I thought I would mention it. The issue described is awfully familiar.
Basically the SAS drive (scsi back then) is slow on writes but fast on reads with dd. Could be a coincidence or a ghost from the past who knows... Cheers, Karim. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
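For reference, steps 2) and 3) of the plan above boil down to a handful of commands (a sketch only; da0 as the SAS device is an assumption, and exact flags may differ on your system):

```shell
# 2) verbose boot: set boot_verbose="YES" in /boot/loader.conf (or
#    type `boot -v` at the loader prompt), then pull the mpt lines:
dmesg | grep -i mpt

# 3) I/O statistics while the dd test runs:
gstat -abc
ctlstat            # for good measure, as noted above

# confirm the write-cache default: the WCE bit on caching mode page 8
camcontrol modepage da0 -m 8 | grep WCE
```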
Re: IBM blade server abysmal disk write performances
> Interesting. Is there a way to tell, other than coming up with some way to actually test it, whether a particular drive waits until ...

My crappy laptop hard drive behaves the same no matter if I turn the write cache on, off, or leave the default. Seems like it is always on.
Re: IBM blade server abysmal disk write performances
> With SATA vs SAS, the gap is much narrower. The TCQ command set (still used by SAS) is still better than the NCQ command set, but the ...

In what respect exactly is TCQ better than SATA NCQ?
Re: IBM blade server abysmal disk write performances
> I've had my share of sudden UPS failures over the years. Probably more ...

Everything can fail. That's why serious sysadmins do proper backups, no matter what safety features are used in their servers.
Re: IBM blade server abysmal disk write performances
> Turning the write cache off eliminates the risk of having the write cache on.

This sentence sounds like saying that not having a car eliminates the risks of driving.
Re: IBM blade server abysmal disk write performances
Am 19.01.2013 00:32, schrieb Karim Fodil-Lemelin:

> * Although no one has reported problems with the 2 gig
> * version of the DCAS drive, the assumption is that it
> * has the same problems as the 4 gig version. Therefore
> * this quirk entries disables tagged queueing for all
> * DCAS drives.
> */
> { T_DIRECT, SIP_MEDIA_FIXED, IBM, DCAS*, * }, /*quirks*/0, /*mintags*/0, /*maxtags*/0
>
> So I looked at the kern/10398 PR and got some feeling of 'deja vu'. Although the original problem was on FreeBSD 3.1, so it's most likely not that, I thought I would mention it. The issue described is awfully familiar. Basically the SAS drive (scsi back then) is slow on writes but fast on reads with dd. Could be a coincidence or a ghost from the past, who knows...

I remember those drives from some 20 years ago. Before that time, SCSI and IDE drives were independently developed and SCSI drives offered way better performance and reliability. But at about this time there were SCSI and IDE drives that differed only in their interface electronics. And from that time and those models I remember several SCSI quirks in IBM drives (DCAS and DORS), often with regard to tagged commands.

I seem to remember that drives of that time required the write cache to be enabled to get any speed-up from tagged commands. This was no risk with SCSI drives, since the cache did not make the drives lie about command completion (i.e. the status for the write was only returned when the cached data had been written to disk, independently of the write cache enable).

Regards, STefan
Re: IBM blade server abysmal disk write performances
> I remember those drives from some 20 years ago. Before that time, SCSI and IDE drives were independently developed and SCSI drives offered way ...

Yes, 20 years ago it was true. Even in 1995, when I had a SCSI controller in my 486, it was great compared to ATA. Today SATA and SAS are mostly the same; just the protocols are different. The main difference is that SATA is simpler and has fewer problems.
Re: IBM blade server abysmal disk write performances
Stefan writes:

> I seem to remember that drives of that time required the write cache to be enabled to get any speed-up from tagged commands. This was no risk with SCSI drives, since the cache did not make the drives lie about command completion (i.e. the status for the write was only returned when the cached data had been written to disk, independently of the write cache enable).

Interesting. Is there a way to tell, other than coming up with some way to actually test it, whether a particular drive waits until the data has been written to non-volatile memory (the platters in conventional disks) before sending the command completion message? I'm having thoughts of putting sensing resistors in the disk's power cable, attaching an oscilloscope, and displaying the timing of data on the data cable along with power usage from seeking.
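Short of oscilloscope heroics, a crude software-side probe is possible (a sketch only, assuming your dd supports conv=fsync, as GNU dd and newer FreeBSD dd do): time a burst of small synchronous writes and compare the average against one platter revolution.

```shell
# Time N single-sector writes, each followed by an fsync (via dd's
# conv=fsync). If the average completion time is well below one
# platter revolution (~8.3 ms at 7200 RPM, ~4 ms at 15K RPM), the
# drive is almost certainly reporting completion from its cache.
# File-backed stand-in shown; put the file on the filesystem that
# lives on the drive you want to probe.
f=/tmp/syncprobe.tmp
n=500
start=$(date +%s)
i=0
while [ "$i" -lt "$n" ]; do
    dd if=/dev/zero of="$f" bs=512 count=1 conv=fsync 2>/dev/null
    i=$((i + 1))
done
end=$(date +%s)
echo "$n synchronous writes in $((end - start)) s"
rm -f "$f"
```

With honest completions on a 7200 RPM disk, 500 such writes need at least about 4 seconds; if they finish nearly instantly, something in the path is acknowledging writes from a cache.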
Re: IBM blade server abysmal disk write performances
> ... to be enabled to get any speed-up from tagged commands. This was no risk with SCSI drives, since the cache did not make the drives lie ...

I see no correlation between interface type and the possibility of lying about command completion.
Re: IBM blade server abysmal disk write performances
On Jan 19, 2013, at 4:33 PM, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl wrote:

> I see no correlation between interface type and possibility of lying about command completion.

Any interface that enables write cache will lie about write completions. This is true for SAS, SATA, SCSI, and PATA (and probably FC and iSCSI). That's the whole point of the write cache =-)

Where things got interesting was in the days of SCSI vs PATA. There was no tagged queuing for PATA, except for a hack that allowed CDROMs to disconnect from the shared bus. So you only got 1 command at a time, and you paid a serialized latency penalty. The only way to get reasonable write performance on PATA was to enable the write cache. Meanwhile, SCSI had TCQ and could amortize the latency penalty to the point where performance with TCQ and no WC was almost as good as with WC. This made SCSI the clear choice for performance + data safety.

With SATA vs SAS, the gap is much narrower. The TCQ command set (still used by SAS) is still better than the NCQ command set, but the differences are minor enough that it doesn't matter for most applications.

Scott
Re: IBM blade server abysmal disk write performances
On 18 Jan, Wojciech Puchar wrote:

> If computer have UPS then write caching is fine. even if FreeBSD crash, disk would write data

I've had my share of sudden UPS failures over the years. Probably more than half have been during an automatic battery self test. UPS goes on battery, and then *boom*, everything shuts down. At that point the UPS helpfully indicates that the battery needs to be replaced. This seems to happen more frequently once the batteries get to be about 4 years old. I've started replacing them after 3 years. My next big build will have redundant PSUs, each connected to a separate UPS.
Re: IBM blade server abysmal disk write performances
On 19 Jan, Stefan Esser wrote:

> I seem to remember that drives of that time required the write cache to be enabled to get any speed-up from tagged commands. This was no risk with SCSI drives, since the cache did not make the drives lie about command completion (i.e. the status for the write was only returned when the cached data had been written to disk, independently of the write cache enable).

For a very long time, all of the SCSI drives that I have purchased have come with the WCE bit turned on. I always had to remember to use camcontrol to turn it off. When I last benchmarked it quite a few years ago, buildworld times were about the same with either setting, and my filesystems were a lot safer with WCE off, since UFS+SU depends on honest write completions. I've also seen drives dynamically drop the number of supported tags when WCE was on and the write cache started getting full, which made CAM unhappy.

I've been using SCSI for anything important for all these years except on my laptop. I haven't yet switched to SATA because I haven't put together a new system since NCQ support made it into -STABLE. The hard drives in my -CURRENT machine are cast-offs from my primary machine. Just doin' my part to make sure legacy support isn't broken ...
Re: IBM blade server abysmal disk write performances
On Thu, 17 Jan 2013 16:12:17 -0600, Karim Fodil-Lemelin fodillemlinka...@gmail.com wrote:

> SAS controllers may connect to SATA devices, either directly connected using native SATA protocol or through SAS expanders using SATA Tunneled Protocol (STP). The system is currently put in place using SATA instead of SAS, although it's using the same interface and backplane connectors and the drives (SATA) show as da0 in BSD _but_ with the SATA drive we get *much* better performance. I am thinking that something fancy in that SAS drive is not being handled correctly by the FreeBSD driver. I am planning to revisit the SAS drive issue at a later point (sometime next week).

Your SATA drives are connected directly, not with an interposer such as the LSISS9252, correct? If so, this might be the cause of your problems. Mixing SAS and SATA drives is known to cause serious performance issues for almost every JBOD/controller/expander/what-have-you. Change your configuration so there is only one protocol being spoken on the bus (SAS) by putting your SATA drives behind interposers which translate SAS to SATA just before the disk. This will solve many problems.
Re: IBM blade server abysmal disk write performances
Try adding the following to /boot/loader.conf and reboot:

hw.mpt.enable_sata_wc=1

The default value, -1, instructs the driver to leave the SATA drives at their configuration default. Oftentimes this means that the MPT BIOS will turn off the write cache on every system boot sequence. IT DOES THIS FOR A GOOD REASON! An enabled write cache is counter to data reliability. Yes, it helps make benchmarks look really good, and it's acceptable if your data can be safely thrown away (for example, you're just caching from a slower source, and the cache can be rebuilt if it gets corrupted). And yes, Linux has many tricks to make this benchmark look really good. The tricks range from buffering the raw device to having 'dd' recognize the requested task and short-circuit the process of going to /dev/null or pulling from /dev/zero. I can't tell you how bogus these tests are and how completely irrelevant they are in predicting actual workload performance. But I'm not going to stop anyone from trying, so give the above tunable a try and let me know how it works.

Btw, I'm not subscribed to the hackers mailing list, so please redistribute this email as needed.

Scott

From: Dieter BSD dieter...@gmail.com
To: freebsd-hackers@freebsd.org
Cc: mja...@freebsd.org; gi...@freebsd.org; sco...@freebsd.org
Sent: Thursday, January 17, 2013 9:03 PM
Subject: Re: IBM blade server abysmal disk write performances

> I am thinking that something fancy in that SAS drive is not being handled correctly by the FreeBSD driver.

I think so too, and I think the something fancy is tagged command queuing. The driver prints da0: Command Queueing enabled and yet your SAS drive is only getting 1 write per rev, and queuing should get you more than that. Your SATA drive is getting the expected performance, which means that NCQ must be working.

> Please let me know if there is anything you would like me to run on the BSD 9.1 system to help diagnose this issue?

Looking at the mpt driver, a verbose boot may give more info. Looks like you can set a debug device hint, but I don't see any documentation on what to set it to. I think it is time to ask the driver wizards why TCQ isn't working, so I'm cc-ing the authors listed on the mpt man page.
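For concreteness, a sketch of the loader.conf fragment being suggested (value semantics per the description above; this is a tunable of the mpt(4) driver):

```shell
# /boot/loader.conf
# -1 (default): leave SATA drives at their configuration default,
#               which the MPT BIOS often resets to write cache off
#  0:           force the write cache off
#  1:           force the write cache on (benchmarks look better,
#               data reliability suffers -- see the caveats above)
hw.mpt.enable_sata_wc=1
```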
Re: IBM blade server abysmal disk write performances
> The default value, -1, instructs the driver to leave the SATA drives at their configuration default. [...] But I'm not going to stop anyone from trying, so give the above tunable a try and let me know how it works.

If the computer has a UPS then write caching is fine. Even if FreeBSD crashes, the disk would still write the data.
Re: IBM blade server abysmal disk write performances
----- Original Message -----
From: Wojciech Puchar woj...@wojtek.tensor.gdynia.pl
To: Scott Long scott4l...@yahoo.com
Cc: Dieter BSD dieter...@gmail.com; freebsd-hackers@freebsd.org; gi...@freebsd.org; sco...@freebsd.org; mja...@freebsd.org
Sent: Friday, January 18, 2013 11:10 AM
Subject: Re: IBM blade server abysmal disk write performances

> If the computer has a UPS then write caching is fine. Even if FreeBSD crashes, the disk would still write the data.

I suspect that I'm encountering situations right now at Netflix where this advice is not true. I have drives that are seeing intermittent errors, then being forced into reset after a timeout, and then coming back up with filesystem problems. It's only a suspicion at this point, not a confirmed case.

Scott
Re: IBM blade server abysmal disk write performances
> I suspect that I'm encountering situations right now at Netflix where this advice is not true. I have drives that are seeing intermittent errors, then being forced into reset after a timeout, and then coming back up with filesystem problems. It's only a suspicion at this point, not a confirmed case.

True. I just assumed that anywhere it matters one would use gmirror. As for myself, I always prefer to use drives from different manufacturers for gmirror, or at least drives not manufactured at a similar time. Two failing at the same moment is rather unlikely. Of course, everything is possible, so I do proper backups to remote sites. Remote means another city.
Re: IBM blade server abysmal disk write performances
Wojciech writes:

> If computer have UPS then write caching is fine. even if FreeBSD crash, disk would write data

That is incorrect. A UPS reduces the risk, but does not eliminate it. It is impossible to completely eliminate the risk of having the write cache on. If you care about your data you must turn the disk's write cache off. If you are using the drive in an application where the data does not matter, or can easily be regenerated (e.g. disk duplication; if it fails, just start over), then turning the write cache on for that one drive can be OK. There is a patch that allows turning the write cache on and off on a per-drive basis. The patch is for ata(4), but should be possible with other drivers. camcontrol(8) may work for SCSI and SAS drives. I have yet to see a USB-to-*ATA bridge that allows turning the write cache off, so USB disks are useless for most applications.

But for most applications, you must have the write cache off, and you need queuing (e.g. TCQ or NCQ) for performance. If you have queuing, there is no need to turn the write cache on.

It is inexcusable that FreeBSD defaults to leaving the write cache on for SATA/PATA drives. At least the admin can easily fix this by adding hw.ata.wc=0 to /boot/loader.conf. The bigger problem is that FreeBSD does not support queuing on all controllers that support it. Not something that admins can fix, and inexcusable for an OS that claims to care about performance.
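A sketch of the knobs mentioned above, for ATA drives via the loader tunable and for SCSI/SAS drives via camcontrol(8) (device name da0 is an assumption; camcontrol edits mode pages interactively through $EDITOR):

```shell
# ATA/SATA under ata(4): disable the write cache at the next boot
#   echo 'hw.ata.wc=0' >> /boot/loader.conf

# SCSI/SAS: show the caching mode page; WCE is the write-cache bit
camcontrol modepage da0 -m 8

# ...and clear WCE by editing the page (interactive, uses $EDITOR)
camcontrol modepage da0 -m 8 -e
```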
Re: IBM blade server abysmal disk write performances
On Fri, 2013-01-18 at 20:37 +0100, Wojciech Puchar wrote:

> True. I just assumed that anywhere it matters one would use gmirror. As for myself - i always prefer to put different manufacturers drives for gmirror or at least - not manufactured at similar time. 2 fails at the same moment is rather unlikely. Of course - everything is possible so i do proper backups to remote sites. Remote means another city.

That is good advice. I bought six 1TB drives at the same time a few years ago and received drives with consecutive serial numbers. They were all part of the same array, and they all failed (click of death) within a six-hour timespan of each other. Luckily I noticed the clicking right away and was able to get all the data copied to another array within a few hours, before they all died.

-- Ian
Re: IBM blade server abysmal disk write performances
> That is incorrect. A UPS reduces the risk, but does not eliminate it.

Nothing eliminates all risks.

> But for most applications, you must have the write cache off, and you need queuing (e.g. TCQ or NCQ) for performance. If you have queuing, there is no need to turn the write cache on.

Did you test the above claim? I have SATA drives everywhere, all in AHCI mode, all with NCQ active.

> It is inexcusable that FreeBSD defaults to leaving the write cache on for SATA/PATA drives. At least the admin can easily fix this by adding hw.ata.wc=0 to /boot/loader.conf. The bigger problem is that FreeBSD does not support queuing on all controllers that support it.

I must be happy, as I never had a case of not seeing adaX: Command Queueing enabled on my machines.
Re: IBM blade server abysmal disk write performances
On Jan 18, 2013, at 1:12 PM, Dieter BSD dieter...@gmail.com wrote:

> It is inexcusable that FreeBSD defaults to leaving the write cache on for SATA/PATA drives.

This was completely driven by the need to satisfy idiotic benchmarkers, tech writers, and system administrators. It was a huge deal for FreeBSD 4.4, IIRC. It had been silently enabled, we turned it off, released 4.4, and then got murdered in the press for being slow. If I had my way, the WC would be off, everyone would be using SAS, and anyone who enabled SATA WC or complained about I/O slowness would be forced into Siberian salt mines for the remainder of their lives.

> At least the admin can easily fix this by adding hw.ata.wc=0 to /boot/loader.conf. The bigger problem is that FreeBSD does not support queuing on all controllers that support it. Not something that admins can fix, and inexcusable for an OS that claims to care about performance.

You keep saying this, but I'm unclear on what you mean. Can you explain?

Scott
Re: IBM blade server abysmal disk write performances
> and anyone who enabled SATA WC or complained about I/O slowness would be forced into Siberian salt mines for the remainder of their lives.

So reserve a place for me there.
Re: IBM blade server abysmal disk write performances
On Fri, 2013-01-18 at 22:18 +0100, Wojciech Puchar wrote:

> > and anyone who enabled SATA WC or complained about I/O slowness would be forced into Siberian salt mines for the remainder of their lives.
>
> so reserve a place for me there.

Yeah, me too. I prefer to go for all-out performance with separate risk mitigation strategies. I wouldn't set up a client datacenter that way, but it's wholly appropriate for what I do with this machine.

-- Ian
Re: IBM blade server abysmal disk write performances
On 2013-Jan-18 12:12:11 -0800, Dieter BSD dieter...@gmail.com wrote:

> adding hw.ata.wc=0 to /boot/loader.conf. The bigger problem is that FreeBSD does not support queuing on all controllers that support it. Not something that admins can fix, and inexcusable for an OS that claims to care about performance.

Apart from continuous whinging and whining on mailing lists, what have you done to add support for queuing?

-- Peter Jeremy
Re: IBM blade server abysmal disk write performances
On 18/01/2013 10:16 AM, Mark Felder wrote:

> Your SATA drives are connected directly, not with an interposer such as the LSISS9252, correct? If so, this might be the cause of your problems. Mixing SAS and SATA drives is known to cause serious performance issues for almost every JBOD/controller/expander/what-have-you. Change your configuration so there is only one protocol being spoken on the bus (SAS) by putting your SATA drives behind interposers which translate SAS to SATA just before the disk. This will solve many problems.

Not sure what you mean by this, but isn't the mpt detecting an interposer in this line:

mpt0: LSILogic SAS/SATA Adapter port 0x1000-0x10ff mem 0x9991-0x99913fff,0x9990-0x9990 irq 28 at device 0.0 on pci11
mpt0: MPI Version=1.5.20.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 0 Active Volumes (2 Max)
mpt0: 0 Hidden Drive Members (14 Max)

Also please note that SATA speed in that same hardware setup works just fine. In any case I will have a look.

Thanks,

Karim.
Re: IBM blade server abysmal disk write performances
This is all turning into a bikeshed discussion. As far as I can tell, the basic original question was why a *SAS* (not a SATA) drive was not performing as well as expected based upon experiences with Linux. I still don't know whether reads or writes were being used for dd.

This morning, I ran a fio test with a single-threaded read component and a multithreaded write component to see if there were differences. All I had connected to my MPT system were ATA drives (Seagate 500GBs) and I'm remote now and won't be back until Sunday to put in one of my 'good' SAS drives (140 GB Seagates, i.e., real SAS 15K RPM drives, not fat SATA bs drives). The numbers were pretty much the same for both FreeBSD and Linux. In fact, FreeBSD was slightly faster. I won't report the exact numbers right now, but only mention this as a piece of information that, at least in my case, the difference between the OS platforms involved is negligible. This would, at least in my case, rule out issues based upon different platform access methods and different drivers.

All of this other discussion about WCE and whatnot is nice but, for all the purposes it serves, could be moved to *-advocacy.
Re: IBM blade server abysmal disk write performances
> mpt0: LSILogic SAS/SATA Adapter port 0x1000-0x10ff mem 0x9991-0x99913fff,0x9990-0x9990 irq 28 at device 0.0 on pci11
> mpt0: MPI Version=1.5.20.0
> mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
> mpt0: 0 Active Volumes (2 Max)
> mpt0: 0 Hidden Drive Members (14 Max)

Ah. Historically IBM systems (the 335, for one) have been very slow with the Integrated Raid software, at least on FreeBSD.
Re: IBM blade server abysmal disk write performances
Scott writes: If I had my way, the WC would be off, everyone would be using SAS, and anyone who enabled SATA WC or complained about I/O slowness would be forced into Siberian salt mines for the remainder of their lives. Actually, if you are running SAS, having SATA WC on or off wouldn't matter; it would be SCSI's WC you'd care about. :-) The bigger problem is that FreeBSD does not support queuing on all controllers that support it. Not something that admins can fix, and inexcusable for an OS that claims to care about performance. You keep saying this, but I'm unclear on what you mean. Can you explain? For most applications you need the write cache to be off. Having the write cache off is fine as long as you have queuing. But with the write cache off, if you don't have queuing, performance sucks. Like getting only 6% of the performance you should be getting. Some of the early SATA controllers didn't have NCQ. Knowing that queuing was very important, I made sure to choose a mainboard with NCQ, giving up other useful features to get it. But FreeBSD does not support NCQ on the nforce4-ultra's SATA controllers. Even the sad joke of an OS Linux has had NCQ on nforce4 since Oct 2006. But Linux is such crap it is unusable. Linux is slowly improving, but I don't expect to live long enough to see it become usable. Seriously. I've tried it several times but I have completely given up on it. Anyway, even after all these years the supposedly performance-oriented FreeBSD still does not support NCQ on nforce4, which isn't some obscure chip; they sold a lot of them. I've added 3 additional SATA controllers on expansion cards, and FreeBSD supports NCQ on them, so the slow controllers limited by PCIe x1 have much better write performance than the much faster controllers in the chipset with all the bandwidth they need. I can't add more controllers; there aren't any free slots.
The nforce will remain in service for years; aside from the monetary cost, silicon has a huge amount of environmental cost: embedded energy, water, pollution, etc. And there are a lot of them. Wojciech writes: That is incorrect. A UPS reduces the risk, but does not eliminate it. Nothing eliminates all risks. Turning the write cache off eliminates the risk of having the write cache on. Yes, you can still lose data for other reasons. Backups are still a good idea. But for most applications you must have the write cache off, and you need queuing (e.g. TCQ or NCQ) for performance. If you have queuing, there is no need to turn the write cache on. Did you test the above claim? I have SATA drives everywhere, all in AHCI mode, all with NCQ active. Yes, turn the write cache off and NCQ will give you the performance. As long as you have queuing you can have the best of both worlds. Which is why Karim's problem is so odd. The driver says there is queuing, but performance (1 write per rev) looks exactly like there is no queuing. Maybe there is something else that causes only 1 write per rev, but I don't know what that might be. Peter writes: Apart from continuous whinging and whining on mailing lists, what have you done to add support for queuing? Submitted a PR; it was closed without being fixed. Looked at the code, but it's Greek to me, even though I have successfully modified a BSD-based device driver in the past, giving a major performance improvement. If I were a C-level exec of a Fortune 500 company I'd just hire some device driver wizard.
Re: IBM blade server abysmal disk write performances
On 18/01/2013 5:42 PM, Matthew Jacob wrote: This is all turning into a bikeshed discussion. As far as I can tell, the basic original question was why a *SAS* (not a SATA) drive was not performing as well as expected based upon experiences with Linux. I still don't know whether reads or writes were being used for dd. This morning, I ran a fio test with a single-threaded read component and a multithreaded write component to see if there were differences. All I had connected to my MPT system were ATA drives (Seagate 500GBs), and I'm remote now and won't be back until Sunday to put in one of my 'good' SAS drives (140 GB Seagates, i.e., real SAS 15K RPM drives, not fat SATA bs drives). The numbers were pretty much the same for both FreeBSD and Linux. In fact, FreeBSD was slightly faster. I won't report the exact numbers right now, but only mention this as a piece of information that, at least in my case, the difference between the OS platforms involved is negligible. This would, at least in my case, rule out issues based upon different platform access methods and different drivers. All of this other discussion about WCE and whatnot is nice, but for all intents and purposes could be moved to *-advocacy. Thanks for the clarifications! I did mention at some point that those were write speeds; reads were just fine. Those were either writes to the filesystem or direct access (only on SAS again). Here is what I am planning to do next week when I get the chance:
0) I plan on focusing on the SAS drive tests _only_, since SATA is working as expected, so nothing to report there.
1) Look carefully at how the drives are physically connected. Although it feels like if SATA works fine then SAS should also, I'll check anyway.
2) Boot verbose with boot -v and send the dmesg output. The mpt driver might give us a clue.
3) Run gstat -abc in a loop for the test duration. Although I would think ctlstat(8) might be more interesting here, so I'll run it too for good measure :).
Please note that in all tests write caching was enabled, as I think this is the default with FreeBSD 9.1 GENERIC, but I'll confirm this with camcontrol(8). I've also seen quite a lot of 'quirks' for tagged command queuing in the source code (/sys/cam/scsi/scsi_xpt.c), but a particular one got my attention (thanks to whoever writes good comments in source code :) :

/*
 * Slow when tagged queueing is enabled. Write performance
 * steadily drops off with more and more concurrent
 * transactions. Best sequential write performance with
 * tagged queueing turned off and write caching turned on.
 *
 * PR: kern/10398
 * Submitted by: Hideaki Okada hok...@isl.melco.co.jp
 * Drive: DCAS-34330 w/ S65A firmware.
 *
 * The drive with the problem had the S65A firmware
 * revision, and has also been reported (by Stephen J.
 * Roznowski s...@home.net) for a drive with the S61A
 * firmware revision.
 *
 * Although no one has reported problems with the 2 gig
 * version of the DCAS drive, the assumption is that it
 * has the same problems as the 4 gig version. Therefore
 * this quirk entry disables tagged queueing for all
 * DCAS drives.
 */
{ T_DIRECT, SIP_MEDIA_FIXED, "IBM", "DCAS*", "*" },
/*quirks*/0, /*mintags*/0, /*maxtags*/0

So I looked at the kern/10398 PR and got a feeling of 'deja vu', although the original problem was on FreeBSD 3.1, so it's most likely not that, but I thought I would mention it. The issue described is awfully familiar: basically the SAS drive (SCSI back then) is slow on writes but fast on reads with dd. Could be a coincidence or a ghost from the past, who knows... Cheers, Karim.
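Karim's planned camcontrol(8) check can be scripted. The sketch below is an assumption, not output from the thread: "da0" is a placeholder device name, and since camcontrol only exists on FreeBSD the commands are guarded:

```shell
# Sketch: check write-cache (WCE) and tagged-queueing state with camcontrol(8).
# "da0" is an assumed device name; adjust for the real system.
DISK=da0
if command -v camcontrol >/dev/null 2>&1; then
    camcontrol modepage "$DISK" -m 8    # caching mode page; look for the WCE bit
    camcontrol tags "$DISK" -v          # current and allowed tagged openings
else
    echo "camcontrol not found; run this on the FreeBSD 9.1 target"
fi
```

Adding `-e` to the modepage command (`camcontrol modepage da0 -m 8 -e`) opens the page in an editor so WCE can be toggled for an A/B test.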
Re: IBM blade server abysmal disk write performances
Matthew writes: There is also no information in the original email as to which direction the I/O was being sent. In one of the followups, Karim reported:
# dd if=/dev/zero of=foo count=10 bs=1024000
10+0 records in
10+0 records out
10240000 bytes transferred in 19.615134 secs (522046 bytes/sec)
522 KB/s is pathetic.
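The figures in that dd output are just count × bs and bytes ÷ elapsed time, which can be rehearsed against a scratch file (the /tmp path is arbitrary):

```shell
# 10 records x 1024000 bytes = 10240000 bytes total; dd's reported rate
# is total bytes divided by elapsed time.
dd if=/dev/zero of=/tmp/dd_probe.bin bs=1024000 count=10 2>/dev/null
wc -c < /tmp/dd_probe.bin                                  # 10240000
awk 'BEGIN { printf "%.0f bytes/sec\n", 10240000 / 19.615134 }'
```

The awk line reproduces the quoted 522046 bytes/sec from the quoted byte count and elapsed time.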
Re: IBM blade server abysmal disk write performances
On 18 January 2013 19:11, Dieter BSD dieter...@gmail.com wrote: Matthew writes: There is also no information in the original email as to which direction the I/O was being sent. In one of the followups, Karim reported:
# dd if=/dev/zero of=foo count=10 bs=1024000
10+0 records in
10+0 records out
10240000 bytes transferred in 19.615134 secs (522046 bytes/sec)
522 KB/s is pathetic.
When this is running, use gstat and see exactly how many IOPS there are and what the average I/O size is. Yes, 522 kbytes/sec is really pathetic, but there are a lot of potential reasons for that. adrian
Re: IBM blade server abysmal disk write performances
Note that the driver says Command Queueing enabled without specifying which. If the driver is trying to use SATA's NCQ but the drive only speaks SCSI's TCQ, that could explain it. Or if the TCQ isn't working for some other reason. Even without TCQ/NCQ and write cache, the write speed is really terrible.
Re: IBM blade server abysmal disk write performances
On 16/01/2013 2:48 AM, Dieter BSD wrote: Karim writes: It is quite obvious that something is awfully slow on SAS drives, whatever it is and regardless of OS comparison. We swapped the SAS drives for SATA and we're seeing much higher speeds, basically on par with what we were expecting (roughly 300 to 400 times faster than what we see with SAS...). Major clue there! According to Wikipedia: Most SAS drives provide tagged command queuing, while most newer SATA drives provide native command queuing [1] Note that the driver says Command Queueing enabled without specifying which. If the driver is trying to use SATA's NCQ but the drive only speaks SCSI's TCQ, that could explain it. Or if the TCQ isn't working for some other reason. See if there are any error messages in dmesg or /var/log. If not, perhaps the driver has extra debugging you could turn on. Get TCQ working and make sure your partitions are aligned on 4 KiB boundaries (in case the drive actually has 4 KiB sectors), and you should get the expected performance. [1] http://en.wikipedia.org/wiki/Serial_attached_SCSI Thanks for the wiki article reference; it is very interesting and confirms our current setup. I'm mostly thinking about this line: SAS controllers may connect to SATA devices, either directly connected using native SATA protocol or through SAS expanders using SATA Tunneled Protocol (STP). The system is currently set up using SATA instead of SAS, although it's using the same interface and backplane connectors and the drives (SATA) show as da0 in BSD, _but_ with the SATA drive we get *much* better performance. I am thinking that something fancy in that SAS drive is not being handled correctly by the FreeBSD driver. I am planning to revisit the SAS drive issue at a later point (sometime next week).
Here is some trimmed and hopefully relevant information (from dmesg):

SAS drive:
mpt0: LSILogic SAS/SATA Adapter port 0x1000-0x10ff mem 0x9991-0x99913fff,0x9990-0x9990 irq 28 at device 0.0 on pci11
mpt0: MPI Version=1.5.20.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 0 Active Volumes (2 Max)
mpt0: 0 Hidden Drive Members (14 Max)
...
da0 at mpt0 bus 0 scbus0 target 1 lun 0
da0: IBM-ESXS HUC106030CSS60 D3A6 Fixed Direct Access SCSI-6 device
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 286102MB (585937500 512 byte sectors: 255H 63S/T 36472C)
...
GEOM: da0: the primary GPT table is corrupt or invalid.
GEOM: da0: using the secondary instead -- recovery strongly advised.

SATA drive:
mpt0: LSILogic SAS/SATA Adapter port 0x1000-0x10ff mem 0x9b91-0x9b913fff,0x9b90-0x9b90 irq 28 at device 0.0 on pci11
mpt0: MPI Version=1.5.20.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 0 Active Volumes (2 Max)
mpt0: 0 Hidden Drive Members (14 Max)
...
da0 at mpt0 bus 0 scbus0 target 2 lun 0
da0: ATA ST91000640NS SN03 Fixed Direct Access SCSI-5 device
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
...
GEOM: da0s1: geometry does not match label (16h,63s != 255h,63s).

Please let me know if there is anything you would like me to run on the BSD 9.1 system to help diagnose this issue. Thank you, Karim.
Re: IBM blade server abysmal disk write performances
I am thinking that something fancy in that SAS drive is not being handled correctly by the FreeBSD driver. I think so too, and I think the something fancy is tagged command queuing. The driver prints da0: Command Queueing enabled and yet your SAS drive is only getting 1 write per rev, and queuing should get you more than that. Your SATA drive is getting the expected performance, which means that NCQ must be working. Please let me know if there is anything you would like me to run on the BSD 9.1 system to help diagnose this issue? Looking at the mpt driver, a verbose boot may give more info. Looks like you can set a debug device hint, but I don't see any documentation on what to set it to. I think it is time to ask the driver wizards why TCQ isn't working, so I'm cc-ing the authors listed on the mpt man page.
Re: IBM blade server abysmal disk write performances
When you run gstat, how many ops/sec are you seeing? Adrian On 17 January 2013 20:03, Dieter BSD dieter...@gmail.com wrote: I am thinking that something fancy in that SAS drive is not being handled correctly by the FreeBSD driver. I think so too, and I think the something fancy is tagged command queuing. The driver prints da0: Command Queueing enabled and yet your SAS drive is only getting 1 write per rev, and queuing should get you more than that. Your SATA drive is getting the expected performance, which means that NCQ must be working. Please let me know if there is anything you would like me to run on the BSD 9.1 system to help diagnose this issue? Looking at the mpt driver, a verbose boot may give more info. Looks like you can set a debug device hint, but I don't see any documentation on what to set it to. I think it is time to ask the driver wizards why TCQ isn't working, so I'm cc-ing the authors listed on the mpt man page.
Re: IBM blade server abysmal disk write performances
On 1/17/2013 8:03 PM, Dieter BSD wrote: I think it is time to ask the driver wizards why TCQ isn't working, so I'm cc-ing the authors listed on the mpt man page. It is the MPT firmware that implements SATL, but there are probably tweaks that the FreeBSD driver doesn't do that the Linux driver does do. The MPT driver was also worked on years ago and for a variety of reasons is unloved. In general ATA drives have caching enabled, and in fact it is difficult to turn off. There is no info in the email trail that says what the state of the SAS drive is wrt cache enable. There is also no information in the original email as to which direction the I/O was being sent. Let's also get a grip about Linux vs. FreeBSD: using 'dd' is not necessarily an apples-to-apples comparison where writes are concerned, because of Linux's heavy write-behind policy (plugging I/Os until it gets a large xfer built up and then releasing, which gets larger xfers), while FreeBSD will use the block size you tell it to (whether that's optimal or not). I'll see if I can generate some A/B numbers using fio here and report back.
Re: IBM blade server abysmal disk write performances
On Tue, Jan 15, 2013 at 09:12:14AM -0500 I heard the voice of Karim Fodil-Lemelin, and lo! it spake thus: da0: IBM-ESXS HUC106030CSS60 D3A6 Fixed Direct Access SCSI-6 device That's a 10k RPM drive.
FreeBSD 9.1:
10000+0 records in
10000+0 records out
5120000 bytes transferred in 60.024997 secs (85298 bytes/sec)
10000 ops in 60 seconds is practically the definition of a 10k drive.
CentOS:
10000+0 records in
10000+0 records out
51200000 bytes (51 MB) copied, 1.97883 s, 25.9 MB/s
10k ops in 2 seconds is 300k per second. You could make a flat-out *KILLING* if you could sell a platter drive that can pull that off. Presumably this is an instance of Linux only having block devices for hard drives, not character devices, so you're getting your writes all buffered over there. Which is to say, nothing's wrong; you're just not measuring the same thing. -- Matthew Fuller (MF4839) | fulle...@over-yonder.net Systems/Network Administrator | http://www.over-yonder.net/~fullermd/ On the Internet, nobody can hear you scream.
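Fuller's back-of-the-envelope figure can be made explicit. A sketch, assuming one synchronous 512-byte write per platter revolution at 10,000 RPM:

```shell
# One write per revolution at 10k RPM:
#   10000 rev/min / 60 s = 166.67 writes/s; times 512 bytes ~= 85 KB/s,
# which lands almost exactly on the observed 85298 bytes/sec.
awk 'BEGIN {
    rpm = 10000; bs = 512
    wps = rpm / 60
    printf "%.1f writes/s, %d bytes/s\n", wps, wps * bs
}'
```

The tiny gap between the modeled 85333 bytes/s and the observed 85298 bytes/s is just dd's fixed per-run overhead spread over 60 seconds.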
Re: IBM blade server abysmal disk write performances
Dur... 10k ops in 2 seconds is 300k per second. RPM I mean...
Re: IBM blade server abysmal disk write performances
On Tue, 15 Jan 2013 08:12:14 -0600, Karim Fodil-Lemelin fodillemlinka...@gmail.com wrote: Hi, I'm struggling getting FreeBSD 9.1 to work properly on an IBM blade server (HS22). Here is a dd output from Linux CentOS vs FreeBSD 9.1. GNU dd is heavily buffered unless you tell it not to be. There really is no reason why you should want dd to be buffered by default. How can you trust that your attempt at writing raw data to a device actually completed if it's buffered?
Re: IBM blade server abysmal disk write performances
10000+0 records out
5120000 bytes transferred in 60.024997 secs (85298 bytes/sec)
10000 ops in 60 seconds is practically the definition of a 10k drive.
Nonsense.
Re: IBM blade server abysmal disk write performances
On Jan 15, 2013, at 6:12 AM, Karim Fodil-Lemelin wrote: Hi, I'm struggling getting FreeBSD 9.1 to work properly on an IBM blade server (HS22). Here is a dd output from Linux CentOS vs FreeBSD 9.1.
CentOS:
10000+0 records in
10000+0 records out
51200000 bytes (51 MB) copied, 1.97883 s, 25.9 MB/s
FreeBSD 9.1:
10000+0 records in
10000+0 records out
5120000 bytes transferred in 60.024997 secs (85298 bytes/sec)
What exactly was the 'dd' command you used? In particular, what block size did you specify? Can you strace the 'dd' command on CentOS to verify that it's using the actual block size you specified? Some programs (I've written at least one) cheat by actually doing larger I/O operations than you request. This makes a big difference in performance. So this could reflect optimizations in GNU dd more than any difference in the actual disk I/O. If you want to do a more robust comparison, look for one of the disk benchmarking programs in ports and see if it's available (in the same version) for CentOS. Tim
Re: IBM blade server abysmal disk write performances
10000+0 records in
10000+0 records out
5120000 bytes transferred in 60.024997 secs (85298 bytes/sec)
What exactly was the 'dd' command you used? In particular, what block size did you specify?
5120000/10000 = 512, the default. If it takes one revolution for one write, it means that write caching is disabled. That's all. Linux always uses buffered devices; only relatively recently was a special option added to get raw ones. Complete nonsense, but it's Linux.
Re: IBM blade server abysmal disk write performances
On 15/01/2013 3:03 PM, Dieter BSD wrote: Disabling the disk's write cache is *required* for data integrity. One op per rev means write caching is disabled and no queueing. But dmesg claims Command Queueing enabled, so you should be getting more than one op per rev, and writes should be fast. Is this dd to the raw drive, or to a filesystem? (FFS? ZFS? other?) Are you running compression, encryption, or some other feature that might slow things down? Also, try dd with a larger block size, like bs=1m. Hi, Thanks to everyone who answered so far. Here is a follow up: dd to the raw drive and no compression/encryption or other features, just a naive boot off a live 9.1 CD, then dd (see below). The following results have been gathered on the FreeBSD 9.1 system:

# dd if=/dev/zero of=toto count=100
100+0 records in
100+0 records out
51200 bytes transferred in 1.057507 secs (48416 bytes/sec)
# dd if=/dev/zero of=toto count=100 bs=104
100+0 records in
100+0 records out
10400 bytes transferred in 1.209524 secs (8598 bytes/sec)
# dd if=/dev/zero of=toto count=100 bs=1024
100+0 records in
100+0 records out
102400 bytes transferred in 0.844302 secs (121284 bytes/sec)
# dd if=/dev/zero of=toto count=100 bs=10240
100+0 records in
100+0 records out
1024000 bytes transferred in 2.173532 secs (471123 bytes/sec)
# dd if=/dev/zero of=toto count=100 bs=102400
100+0 records in
100+0 records out
10240000 bytes transferred in 19.915159 secs (514181 bytes/sec)
# dd if=/dev/zero of=toto count=100
100+0 records in
100+0 records out
51200 bytes transferred in 1.070473 secs (47829 bytes/sec)
# dd if=/dev/zero of=foo count=100
100+0 records in
100+0 records out
51200 bytes transferred in 0.683736 secs (74883 bytes/sec)
# dd if=/dev/zero of=foo count=100 bs=1024
100+0 records in
100+0 records out
102400 bytes transferred in 0.682579 secs (150019 bytes/sec)
# dd if=/dev/zero of=foo count=100 bs=10240
100+0 records in
100+0 records out
1024000 bytes transferred in 2.431012 secs (421224 bytes/sec)
# dd if=/dev/zero of=foo count=100 bs=102400
100+0 records in
100+0 records out
10240000 bytes transferred in 19.963030 secs (512948 bytes/sec)
# dd if=/dev/zero of=foo count=10 bs=1024000
10+0 records in
10+0 records out
10240000 bytes transferred in 19.615134 secs (522046 bytes/sec)
# dd if=/dev/zero of=foo count=1 bs=10240000
1+0 records in
1+0 records out
10240000 bytes transferred in 19.579077 secs (523007 bytes/sec)

Best regards, Karim.
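One way to read the plateau near 520 KB/s in these runs: FreeBSD clusters file writes into 128 KiB (MAXPHYS) transfers, as Wojciech points out later in the thread, so the observed ceiling corresponds to only a handful of physical writes per second. A sketch of that arithmetic, under those assumptions:

```shell
# ~522 KB/s divided by 128 KiB clustered transfers ~= 4 writes/s.
# (A 10k RPM drive doing even one write per revolution would manage
# about 166 transfers/s, so this is far below the no-queuing floor.)
awk 'BEGIN {
    rate = 522046           # bytes/s at large block sizes, from the runs above
    cluster = 128 * 1024    # assumed default MAXPHYS transfer size
    printf "%.1f writes/s\n", rate / cluster
}'
```

That ~4 writes/s figure matches Wojciech's "seems to do 4 writes per second" reading of the same data.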
Re: IBM blade server abysmal disk write performances
Hi, You're only doing one IO at the end. That's just plain silly. There's all kinds of overhead that could show up, that would be amortized over doing many IOs. You should also realise that the raw disk IO on Linux is by default buffered, so you're hitting the buffer cache. The results aren't going to match, not unless you exhaust physical memory and start falling behind on disk IO. At that point you'll see what the fuss is about. Adrian
Re: IBM blade server abysmal disk write performances
# dd if=/dev/zero of=foo count=1 bs=10240000
1+0 records in
1+0 records out
10240000 bytes transferred in 19.579077 secs (523007 bytes/sec)
You write to a file, not the device, so it will be clustered anyway by FreeBSD: 128kB by default, more if you put options MAXPHYS=... in the kernel config and recompile. Even with the hard drive's write cache disabled it should do about one write per revolution, but this seems to do 4 writes per second. So probably it is not that, but a much worse failure. Did you test read speed?
dd if=/dev/disk of=/dev/null bs=512
dd if=/dev/disk of=/dev/null bs=4k
dd if=/dev/disk of=/dev/null bs=128k
?
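The suggested read-speed commands can be rehearsed against a scratch file before pointing if= at the real /dev/da0 (the file name and sizes here are arbitrary choices, not from the thread):

```shell
# Create a 1 MiB scratch file, then read it back through dd the same
# way the suggested read-speed tests would read the raw device.
dd if=/dev/zero of=/tmp/read_probe.bin bs=131072 count=8 2>/dev/null
dd if=/tmp/read_probe.bin of=/dev/null bs=4096 2>/dev/null
wc -c < /tmp/read_probe.bin     # 1048576
```

On the real system the interesting part is the bytes/sec line dd prints for each block size, since a healthy drive should show read rates near the diskinfo transfer rates quoted elsewhere in the thread.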
Re: IBM blade server abysmal disk write performances
On Tue, Jan 15, 2013 at 12:03:33PM -0800 I heard the voice of Dieter BSD, and lo! it spake thus: But dmesg claims Command Queueing enabled, so you should be getting more than one op per rev, and writes should be fast. Queueing would only help if your load threw multiple ops at the drive before waiting for any of them to complete. I'd expect a dd to a raw device to throw a single op, wait for it to return complete, then throw the next, leading to no more than 1 op per rev. (Possibly less, with sufficiently fast revs and a sufficiently slow system, but that's a pretty unlikely combo with platter drives and remotely modern hardware unless it's under serious load otherwise.)
Re: IBM blade server abysmal disk write performances
On 15/01/2013 3:55 PM, Adrian Chadd wrote: You're only doing one IO at the end. That's just plain silly. There's all kinds of overhead that could show up, that would be amortized over doing many IOs. You should also realise that the raw disk IO on Linux is by default buffered, so you're hitting the buffer cache. The results aren't going to match, not unless you exhaust physical memory and start falling behind on disk IO. At that point you'll see what the fuss is about. To put it simply and maybe give a bit more context, here is what we're doing:
1) Boot the OS (Linux or FreeBSD in this case).
2) dd some image over to the SAS drive.
3) Rinse and repeat X times.
4) Profit.
In this case, if step 1) is done with Linux we get 100 times more profit. I was wondering if we could close the gap. Karim.
Re: IBM blade server abysmal disk write performances
On 15/01/2013 4:54 PM, Wojciech Puchar wrote:
# dd if=/dev/zero of=foo count=1 bs=10240000
1+0 records in
1+0 records out
10240000 bytes transferred in 19.579077 secs (523007 bytes/sec)
You write to a file, not the device, so it will be clustered anyway by FreeBSD: 128kB by default, more if you put options MAXPHYS=... in the kernel config and recompile. Even with the hard drive's write cache disabled it should do about one write per revolution, but this seems to do 4 writes per second. So probably it is not that, but a much worse failure. Did you test read speed?
dd if=/dev/disk of=/dev/null bs=512
dd if=/dev/disk of=/dev/null bs=4k
dd if=/dev/disk of=/dev/null bs=128k
As you mentioned, the dd file tests were done on UFS and not on the raw device. I will get those numbers for you. Thanks, Karim.
Re: IBM blade server abysmal disk write performances
Karim writes: dd to the raw drive and no compression/encryption or other features, just a naive boot off a live 9.1 CD, then dd (see below). The following results have been gathered on the FreeBSD 9.1 system:
# dd if=/dev/zero of=toto count=100
100+0 records in
100+0 records out
51200 bytes transferred in 1.057507 secs (48416 bytes/sec)
By raw drive I meant something like dd if=/dev/zero of=/dev/da0 bs=1m count=1000. of=toto implies that you are using a filesystem. (FFS? ZFS? other?) Matthew writes: But dmesg claims Command Queueing enabled, so you should be getting more than one op per rev, and writes should be fast. Queueing would only help if your load threw multiple ops at the drive before waiting for any of them to complete. I'd expect a dd to a raw device to throw a single, wait for it to return complete, then throw the next, leading to no more than 1 op per rev. I see a huge speedup from NCQ on both raw disks and with FFS/su. Without NCQ I only get 6% of the expected performance, even with a large blocksize. The kernel must be doing write-behind even to a raw disk; otherwise waiting for write(2) to return before issuing the next write would slow it down as Matthew suggests. Writing an entire 3TB disk (raw disk, no fs) gives: 21378.98 real 2.00 user 440.98 sys, or 140 MB/s (133 MiB/s), on a slow controller in a PCIe x1 slot. The same test on the same make/model disk on a much faster controller in the chipset takes over 10x as long, because FreeBSD does not support NCQ on that controller. :-( Karim's data sure looks like 1 op per rev. Either it isn't really doing NCQ or the filesystem is doing something to keep NCQ from being effective. For example, mounting the fs with the sync option would probably have that effect.
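Dieter's 140 MB/s figure is consistent with his timing, assuming a nominal 3 TB = 3×10^12 bytes written over the 21378.98 wall-clock seconds reported by time(1):

```shell
# 3e12 bytes / 21378.98 s / 1e6 ~= 140 MB/s (decimal megabytes,
# matching the "140 MB/s (133 MiB/s)" quoted above).
awk 'BEGIN { printf "%.0f MB/s\n", 3e12 / 21378.98 / 1e6 }'
```

The user+sys total (about 443 s) being roughly 2% of real time also fits his point that the run was disk-bound, not CPU-bound.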
Re: IBM blade server abysmal disk write performances
I wrote: The kernel must be doing write-behind even to a raw disk, otherwise waiting for write(2) to return before issuing the next write would slow it down as Matthew suggests. And a minute after hitting send, I remembered that FreeBSD does not provide the traditional raw disk devices, e.g. /dev/rda0 with an 'r'. (Now if I could just remember *why* it doesn't.)
Re: IBM blade server abysmal disk write performances
25.9 MB/s Even Linux is pretty slow.
Transfer rates:
outside: 102400 kbytes in 0.685483 sec = 149384 kbytes/sec
middle: 102400 kbytes in 0.747424 sec = 137004 kbytes/sec
inside: 102400 kbytes in 1.051036 sec = 97428 kbytes/sec
That's more like it. I assume these numbers are reads. You should get numbers nearly this high when writing. Can you try writing to the bare drive without a filesystem?
time dd if=/dev/da0 of=/dev/null bs=124k count=25
time (dd if=/dev/zero of=/dev/da0 bs=124k count=25; sync)
Between writing more data than the size of memory and the sync, this should hopefully reduce any buffering effects down into the noise and make the numbers more comparable between FreeBSD and Linux (and more honest). It also eliminates any effect from the filesystem, which will be different between FreeBSD and Linux. Writing should be almost as fast as reading. Is the disk healthy? Smartctl might give a clue. If the disk is healthy and you still get numbers that indicate one write per rev without a filesystem, then the question is why the driver claims queueing but doesn't deliver it.
Re: IBM blade server abysmal disk write performances
On 15/01/2013 4:54 PM, Wojciech Puchar wrote:

>> # dd if=/dev/zero of=foo count=1 bs=1024
>> 1+0 records in
>> 1+0 records out
>> 1024 bytes transferred in 19.579077 secs (52 bytes/sec)
>
> you write to a file, not a device, so it will be clustered anyway by FreeBSD. 128kB by default, more if you put options MAXPHYS=... in the kernel config and recompile. Even with the hard drive write cache disabled, it should do about one write per revolution, but it seems to do 4 writes per second. So probably it is not that, but a much worse failure. Did you test read speed?
>
> dd if=/dev/disk of=/dev/null bs=512
> dd if=/dev/disk of=/dev/null bs=4k
> dd if=/dev/disk of=/dev/null bs=128k

I'll do the read test as well, but if I recall correctly it seemed pretty decent.

It is quite obvious that something is awfully slow on the SAS drives, whatever it is and regardless of OS comparison. We swapped the SAS drives for SATA and we're seeing much higher speeds, basically on par with what we were expecting (roughly 300 to 400 times faster than what we see with SAS...).

I find it strange that diskinfo reports these transfer rates:

    Transfer rates:
    outside: 102400 kbytes in 0.685483 sec = 149384 kbytes/sec
    middle: 102400 kbytes in 0.747424 sec = 137004 kbytes/sec
    inside: 102400 kbytes in 1.051036 sec = 97428 kbytes/sec

yet we get only a tiny fraction of those (it takes 20 seconds to transfer 10MB!) when using dd. I also doubt it's dd's behavior, since how else can we explain the performance going up with SATA when doing the same test? Unfortunately, we'll have to move on soon and we're about to write off SAS and use SATA instead.

Thanks,

Karim.
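Wojciech's three read tests can be rolled into one loop. A sketch using a scratch file as a stand-in for the real device node (substitute /dev/da0, or whatever the SAS drive appears as, when running against hardware):

```shell
#!/bin/sh
# Time sequential reads at several block sizes to expose per-op overhead:
# if throughput scales with block size, each I/O carries a fixed cost
# (seek/rev wait); if it is flat, the bottleneck is elsewhere.
# DEV is a placeholder scratch file, not a value from the thread.
DEV=/tmp/readtest.$$
dd if=/dev/zero of="$DEV" bs=64k count=16 2>/dev/null   # 1 MB of test data

for bs in 512 4k 128k; do
    printf 'bs=%s: ' "$bs"
    dd if="$DEV" of=/dev/null bs="$bs" 2>&1 | tail -1
done
rm -f "$DEV"
```

On a healthy drive the 128k pass should run far faster than the 512-byte pass; if all three crawl equally, the problem is per-command latency rather than media speed.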
Re: IBM blade server abysmal disk write performances
On Tue, 2013-01-15 at 15:28 -0500, Karim Fodil-Lemelin wrote:

> On 15/01/2013 3:03 PM, Dieter BSD wrote:
>> Disabling the disk's write cache is *required* for data integrity. One op per rev means write caching is disabled and no queueing. But dmesg claims Command Queueing enabled, so you should be getting more than one op per rev, and writes should be fast. Is this dd to the raw drive, or to a filesystem? (FFS? ZFS? other?) Are you running compression, encryption, or some other feature that might slow things down? Also, try dd with a larger block size, like bs=1m.
>
> Hi, Thanks to everyone that answered so far. Here is a follow up. dd to the raw drive and no compression/encryption or some other features, just a naive boot off a live 9.1 CD then dd (see below). The following results have been gathered on the FreeBSD 9.1 system:

You say dd with a raw drive, but as several people have pointed out, Linux dd doesn't go directly to the drive by default. It looks like you can make it do so with the direct flag, which should make it behave the same as FreeBSD's dd does by default (I think, I'm no Linux expert). For example, using a USB thumb drive:

    th2 # dd if=/dev/sdb4 of=/dev/null count=100
    100+0 records in
    100+0 records out
    51200 bytes (51 kB) copied, 0.0142396 s, 3.6 MB/s

    th2 # dd if=/dev/sdb4 of=/dev/null count=100 iflag=direct
    100+0 records in
    100+0 records out
    51200 bytes (51 kB) copied, 0.0628582 s, 815 kB/s

Hmm, just before hitting send I saw your other response that the SAS drives behave badly while SATA are fine. That does seem to point away from dd behavior. It might still be interesting to see if the direct flag on Linux drops performance into the same horrible range as FreeBSD with SAS.

-- Ian
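Ian's buffered-vs-direct comparison can be reproduced without a thumb drive. A sketch against a scratch file (the filename is a placeholder standing in for Ian's /dev/sdb4; note that O_DIRECT requires filesystem support and block-aligned transfer sizes, hence bs=4k):

```shell
#!/bin/sh
# Buffered vs O_DIRECT read with GNU (Linux) dd. A second run of the
# buffered case will come from the page cache; iflag=direct bypasses
# the cache and shows the device's real per-op cost.
FILE=./direct_test.$$
dd if=/dev/zero of="$FILE" bs=4k count=256 2>/dev/null   # 1 MB of data

echo "buffered:"
dd if="$FILE" of=/dev/null bs=4k 2>&1 | tail -1

echo "direct:"   # prints dd's error line if the fs lacks O_DIRECT support
dd if="$FILE" of=/dev/null bs=4k iflag=direct 2>&1 | tail -1

rm -f "$FILE"
```

If the direct-path number on Linux collapses to the FreeBSD SAS figure, the two systems are behaving the same and only the default caching differs.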
Re: IBM blade server abysmal disk write performances
>> The kernel must be doing write-behind even to a raw disk, otherwise waiting for write(2) to return before issuing the next write would slow it down as Matthew suggests.
>
> And a minute after hitting send, I remembered that FreeBSD does not provide the traditional raw disk devices, e.g. /dev/rda0 with an 'r'. (Now if I could just remember *why* it doesn't.)

Because they are all raw devices; caching is done in the filesystem.
Re: IBM blade server abysmal disk write performances
> Transfer rates:
> outside: 102400 kbytes in 0.685483 sec = 149384 kbytes/sec
> middle: 102400 kbytes in 0.747424 sec = 137004 kbytes/sec
> inside: 102400 kbytes in 1.051036 sec = 97428 kbytes/sec

This is right.

> Yet we get only a tiny fraction of those (it takes 20 seconds to transfer 10MB!) when using dd.

dd is fine. The hardware configuration isn't.
Re: IBM blade server abysmal disk write performances
Karim writes:

> It is quite obvious that something is awfully slow on SAS drives, whatever it is and regardless of OS comparison. We swapped the SAS drives for SATA and we're seeing much higher speeds. Basically on par with what we were expecting (roughly 300 to 400 times faster than what we see with SAS...).

Major clue there! According to Wikipedia: "Most SAS drives provide tagged command queuing, while most newer SATA drives provide native command queuing" [1]

Note that the driver says Command Queueing enabled without specifying which. If the driver is trying to use SATA's NCQ but the drive only speaks SCSI's TCQ, that could explain it. Or the TCQ isn't working for some other reason. See if there are any error messages in dmesg or /var/log. If not, perhaps the driver has extra debugging you could turn on.

Get TCQ working and make sure your partitions are aligned on 4 KiB boundaries (in case the drive actually has 4 KiB sectors), and you should get the expected performance.

[1] http://en.wikipedia.org/wiki/Serial_attached_SCSI
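The 4 KiB alignment check is simple arithmetic. A sketch (the starting sector here is illustrative; the real value would come from the partition table, e.g. gpart(8) output on FreeBSD):

```shell
#!/bin/sh
# Check whether a partition's starting LBA is 4 KiB-aligned. With
# 512-byte logical sectors, 4096 / 512 = 8, so the start must fall
# on a multiple of 8 sectors. START_SECTOR is a placeholder value.
START_SECTOR=2048

if [ $((START_SECTOR % 8)) -eq 0 ]; then
    echo "sector $START_SECTOR: aligned to 4 KiB"
else
    echo "sector $START_SECTOR: MISALIGNED (off by $((START_SECTOR % 8)) sectors)"
fi
```

A misaligned partition on a 4 KiB-sector drive turns every write into a read-modify-write cycle, which compounds any queueing problem.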