Re: IBM blade server abysmal disk write performances

2013-01-25 Thread Karim Fodil-Lemelin

Hi,

Quick follow up on this. As I mentioned in a previous email, we have 
moved to SATA drives and the SAS drives have been shelved for now. The 
current project will be using the SATA drives, so further tests on SAS 
have been postponed to an undefined date.


Thanks,

Karim.

PS: I'll keep the SAS tests in my back pocket so I get a head start when 
we get around to SAS testing again.






Re: IBM blade server abysmal disk write performances

2013-01-21 Thread Wojciech Puchar


Interesting.  Is there a way to tell, other than coming up with
some way to actually test it, whether a particular drive waits until


My crappy laptop hard drive behaves the same no matter whether I turn the 
write cache on, off, or leave the default. It seems like it is always on.



Re: IBM blade server abysmal disk write performances

2013-01-21 Thread Wojciech Puchar

With SATA vs SAS, the gap is much narrower.  The TCQ command set
(still used by SAS) is still better than the NCQ command set, but the


In what way exactly is TCQ better than SATA NCQ?




Re: IBM blade server abysmal disk write performances

2013-01-21 Thread Wojciech Puchar

I've had my share of sudden UPS failures over the years.  Probably more


Everything can fail.

That's why serious sysadmins do proper backups, no matter what safety 
features are used in their servers.



Re: IBM blade server abysmal disk write performances

2013-01-19 Thread Wojciech Puchar

Turning the write cache off eliminates the risk of having the write cache
on.
This sentence sounds like saying that not having a car eliminates the 
risks of driving.



Re: IBM blade server abysmal disk write performances

2013-01-19 Thread Stefan Esser
On 19.01.2013 00:32, Karim Fodil-Lemelin wrote:
  * Although no one has reported problems with the 2 gig
  * version of the DCAS drive, the assumption is that it
  * has the same problems as the 4 gig version.  Therefore
  * this quirk entry disables tagged queueing for all
  * DCAS drives.
  */
 { T_DIRECT, SIP_MEDIA_FIXED, "IBM", "DCAS*", "*" },
 /*quirks*/0, /*mintags*/0, /*maxtags*/0
 
 So I looked at the kern/10398 PR and got some feeling of 'deja vu',
 although the original problem was on FreeBSD 3.1, so it's most likely not
 that, but I thought I would mention it. The issue described is awfully
 familiar. Basically the SAS drive (SCSI back then) is slow on writes but
 fast on reads with dd. Could be a coincidence or a ghost from the past,
 who knows...

I remember those drives from some 20 years ago. Before that time, SCSI
and IDE drives were independently developed, and SCSI drives offered way
better performance and reliability. But at about that time there were
SCSI and IDE drives that differed only in their interface electronics.
And it is from the drives of that era (models like DCAS and DORS) that
I remember several SCSI quirks in IBM drives, often with regard to
tagged commands.

I seem to remember that drives of that time required the write cache
to be enabled to get any speed-up from tagged commands. This was no
risk with SCSI drives, since the cache did not make the drives lie
about command completion (i.e., the status for the write was only
returned when the cached data had been written to disk, independent
of the write cache setting).

Regards, Stefan


Re: IBM blade server abysmal disk write performances

2013-01-19 Thread Wojciech Puchar


I remember those drives from some 20 years ago. Before that time, SCSI
and IDE drives were independently developed and SCSI drives offered way


Yes, 20 years ago it was true, even in 1995, when I had a SCSI controller in 
my 486 and it was great compared to ATA.


Today SATA and SAS are mostly the same; just the protocols are different.
The main difference is that SATA is simpler and has fewer problems.



Re: IBM blade server abysmal disk write performances

2013-01-19 Thread Dieter BSD
Stefan writes:
 I seem to remember that drives of that time required the write cache
 to be enabled to get any speed-up from tagged commands. This was no
 risk with SCSI drives, since the cache did not make the drives lie
 about command completion (i.e., the status for the write was only
 returned when the cached data had been written to disk, independent
 of the write cache setting).

Interesting.  Is there a way to tell, other than coming up with
some way to actually test it, whether a particular drive waits until
the data has been written to non-volatile memory (the platters in
conventional disks) before sending the command completion message?

I'm having thoughts of putting sensing resistors in the disk's
power cable, attaching an oscilloscope, and displaying the
timing of data on the data cable along with power usage from
seeking.
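
A software-only cross-check is also possible; this is a sketch, not a
tested recipe: read the drive's caching mode page with camcontrol(8),
toggle WCE, and compare small-block write rates against the
one-write-per-revolution floor (about 166 writes/sec on a 10k RPM drive):

# camcontrol modepage da0 -m 8
# camcontrol modepage da0 -m 8 -e
# dd if=/dev/zero of=testfile bs=512 count=10000

The first command dumps the caching page (look for the WCE field), the
second opens it in $EDITOR so WCE can be flipped, and the dd re-runs the
write test (da0 and the file name are placeholders). If the rate barely
changes with WCE off, the drive (or a bridge in front of it) may be
acknowledging writes before they reach the platters.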


Re: IBM blade server abysmal disk write performances

2013-01-19 Thread Wojciech Puchar

to be enabled to get any speed-up from tagged commands. This was no
risk with SCSI drives, since the cache did not make the drives lye


I see no correlation between interface type and the possibility of lying 
about command completion.




Re: IBM blade server abysmal disk write performances

2013-01-19 Thread Scott Long

On Jan 19, 2013, at 4:33 PM, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl 
wrote:

 to be enabled to get any speed-up from tagged commands. This was no
 risk with SCSI drives, since the cache did not make the drives lie
 
 i see no correlation between interface type and possibility of lying about 
 command completion.
 

Any interface that enables write cache will lie about write completions.  This
is true for SAS, SATA, SCSI, and PATA (and probably FC and iSCSI).  That's
the whole point of the write cache =-)

Where things got interesting was in the days of SCSI vs PATA.  There was
no tagged queuing for PATA, except for a hack that allowed CDROMs to
disconnect from the shared bus.  So you only got 1 command at a time, and
you paid a serialized latency penalty.  The only way to get reasonable
write performance on PATA was to enable the write cache.  Meanwhile,
SCSI had TCQ and could amortize the latency penalty to the point where
performance with TCQ and no WC was almost as good as with WC.  This
made SCSI the clear choice for performance + data safety.

With SATA vs SAS, the gap is much narrower.  The TCQ command set
(still used by SAS) is still better than the NCQ command set, but the
differences are minor enough that it doesn't matter for most applications.
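
(For anyone wanting to see what queue depth CAM is actually granting a
given drive, camcontrol(8) can report it; a quick check, assuming the
drive is da0:

# camcontrol tags da0 -v

The verbose output includes fields such as dev_openings and maxtags,
i.e. how many concurrent transactions the device is being allowed.)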

Scott



Re: IBM blade server abysmal disk write performances

2013-01-19 Thread Don Lewis
On 18 Jan, Wojciech Puchar wrote:

 If the computer has a UPS then write caching is fine. Even if FreeBSD 
 crashes, the disk would write the data

I've had my share of sudden UPS failures over the years.  Probably more
than half have been during an automatic battery self test.  UPS goes on
battery, and then *boom*, everything shuts down.  At that point the UPS
helpfully indicates that the battery needs to be replaced.  This seems
to happen more frequently once the batteries get to be about 4 years
old.  I've started replacing them after 3 years.

My next big build will have redundant PSUs, each connected to a separate
UPS.



Re: IBM blade server abysmal disk write performances

2013-01-19 Thread Don Lewis
On 19 Jan, Stefan Esser wrote:
 
  I seem to remember that drives of that time required the write cache
  to be enabled to get any speed-up from tagged commands. This was no
  risk with SCSI drives, since the cache did not make the drives lie
  about command completion (i.e., the status for the write was only
  returned when the cached data had been written to disk, independent
  of the write cache setting).

For a very long time, all of the SCSI drives that I have purchased have
come with the WCE bit turned on.  I always had to remember to use
camcontrol to turn it off.  When I last benchmarked it quite a few years
ago, buildworld times were about the same with either setting, and my
filesystems were a lot safer with WCE off, which UFS+SU depends on. I've
also seen drives dynamically drop the number of supported tags when WCE
was on and the write cache started getting full, which made CAM unhappy.

I've been using SCSI for anything important for all these years except
on my laptop.  I haven't yet switched to SATA because I haven't put
together a new system since NCQ support made it into -STABLE.  The hard
drives in my -CURRENT machine are cast-offs from my primary machine.
Just doin' my part to make sure legacy support isn't broken ...




Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Mark Felder
On Thu, 17 Jan 2013 16:12:17 -0600, Karim Fodil-Lemelin  
fodillemlinka...@gmail.com wrote:


SAS controllers may connect to SATA devices, either directly connected  
using native SATA protocol or through SAS expanders using SATA Tunneled  
Protocol (STP).
 The system is currently put in place using SATA instead of SAS,  
although it's using the same interface and backplane connectors, and the  
drives (SATA) show as da0 in BSD _but_ with the SATA drive we get *much*  
better performance. I am thinking that something fancy in that SAS  
drive is not being handled correctly by the FreeBSD driver. I am  
planning to revisit the SAS drive issue at a later point (sometime next  
week).


Your SATA drives are connected directly, not with an interposer such as the  
LSISS9252, correct? If so, this might be the cause of your problems.  
Mixing SAS and SATA drives is known to cause serious performance issues  
for almost every JBOD/controller/expander/what-have-you. Change your  
configuration so there is only one protocol being spoken on the bus (SAS)  
by putting your SATA drives behind interposers, which translate SAS to SATA  
just before the disk. This will solve many problems.



Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Scott Long
Try adding the following to /boot/loader.conf and reboot:

hw.mpt.enable_sata_wc=1

The default value, -1, instructs the driver to leave the SATA drives at their 
configuration default.  Oftentimes this means that the MPT BIOS will turn off 
the write cache on every system boot sequence.  IT DOES THIS FOR A GOOD REASON! 
 An enabled write cache is counter to data reliability.  Yes, it helps make 
benchmarks look really good, and it's acceptable if your data can be safely 
thrown away (for example, you're just caching from a slower source, and the 
cache can be rebuilt if it gets corrupted).  And yes, Linux has many tricks to 
make this benchmark look really good.  The tricks range from buffering the raw 
device to having 'dd' recognize the requested task and short-circuit the 
process of going to /dev/null or pulling from /dev/zero.  I can't tell you how 
bogus these tests are and how completely irrelevant they are in predicting 
actual workload performance.  But, I'm not going to stop anyone from trying, so 
give the above tunable a try and let me know how it works.

Btw, I'm not subscribed to the hackers mailing list, so please redistribute 
this email as needed.

Scott





 From: Dieter BSD dieter...@gmail.com
To: freebsd-hackers@freebsd.org 
Cc: mja...@freebsd.org; gi...@freebsd.org; sco...@freebsd.org 
Sent: Thursday, January 17, 2013 9:03 PM
Subject: Re: IBM blade server abysmal disk write performances
 
 I am thinking that something fancy in that SAS drive is
 not being handled correctly by the FreeBSD driver.

I think so too, and I think the something fancy is tagged command queuing.
The driver prints "da0: Command Queueing enabled" and yet your SAS drive
is only getting 1 write per rev, and queuing should get you more than that.
Your SATA drive is getting the expected performance, which means that NCQ
must be working.

 Please let me know if there is anything you would like me to run on the
 BSD 9.1 system to help diagnose this issue?

Looking at the mpt driver, a verbose boot may give more info.
Looks like you can set a debug device hint, but I don't
see any documentation on what to set it to.

I think it is time to ask the driver wizards why TCQ isn't working,
so I'm cc-ing the authors listed on the mpt man page.





Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Wojciech Puchar


The default value, -1, instructs the driver to leave the SATA drives at their 
configuration default.  Oftentimes this means that the MPT BIOS will turn off 
the write cache on every system boot sequence.  IT DOES THIS FOR A GOOD REASON! 
 An enabled write cache is counter to data reliability.  Yes, it helps make 
benchmarks look really good, and it's acceptable if your data can be safely 
thrown away (for example, you're just caching from a slower source, and the 
cache can be rebuilt if it gets corrupted).  And yes, Linux has many tricks to 
make this benchmark look really good.  The tricks range from buffering the raw 
device to having 'dd' recognize the requested task and short-circuit the 
process of going to /dev/null or pulling from /dev/zero.  I can't tell you how 
bogus these tests are and how completely irrelevant they are in predicting 
actual workload performance.  But, I'm not going to stop anyone from trying, so 
give the above tunable a try and let me know how it works.

If the computer has a UPS then write caching is fine. Even if FreeBSD 
crashes, the disk would write the data.

Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Scott Long
- Original Message -

 From: Wojciech Puchar woj...@wojtek.tensor.gdynia.pl
 To: Scott Long scott4l...@yahoo.com
 Cc: Dieter BSD dieter...@gmail.com; freebsd-hackers@freebsd.org 
 freebsd-hackers@freebsd.org; gi...@freebsd.org gi...@freebsd.org; 
 sco...@freebsd.org sco...@freebsd.org; mja...@freebsd.org 
 mja...@freebsd.org
 Sent: Friday, January 18, 2013 11:10 AM
 Subject: Re: IBM blade server abysmal disk write performances
 
 
  The default value, -1, instructs the driver to leave the SATA drives at 
 their configuration default.  Oftentimes this means that the MPT BIOS will turn 
 off the write cache on every system boot sequence.  IT DOES THIS FOR A GOOD 
 REASON!  An enabled write cache is counter to data reliability.  Yes, it 
 helps 
 make benchmarks look really good, and it's acceptable if your data can be 
 safely thrown away (for example, you're just caching from a slower source, 
 and the cache can be rebuilt if it gets corrupted).  And yes, Linux has many 
 tricks to make this benchmark look really good.  The tricks range from 
 buffering 
 the raw device to having 'dd' recognize the requested task and 
 short-circuit the process of going to /dev/null or pulling from /dev/zero.  I 
 can't tell you how bogus these tests are and how completely irrelevant they 
 are in predicting actual workload performance.  But, I'm not going to stop 
 anyone from trying, so give the above tunable a try
  and let me know how it works.
 
 If the computer has a UPS then write caching is fine. Even if FreeBSD 
 crashes, the disk would write the data.
 

I suspect that I'm encountering situations right now at Netflix where this 
advice is not true.  I have drives that are seeing intermittent errors, then 
being forced into reset after a timeout, and then coming back up with 
filesystem problems.  It's only a suspicion at this point, not a confirmed case.

Scott



Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Wojciech Puchar

the disk would write the data



I suspect that I'm encountering situations right now at Netflix where this 
advice is not true.  I have drives that are seeing intermittent errors, then 
being forced into reset after a timeout, and then coming back up with 
filesystem problems.  It's only a suspicion at this point, not a confirmed case.

True. I just assumed that anywhere it matters one would use gmirror.
As for myself, I always prefer to use drives from different manufacturers 
for gmirror, or at least drives not manufactured at a similar time.


Two failures at the same moment are rather unlikely. Of course, everything 
is possible, so I do proper backups to remote sites. Remote means another 
city.
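
(A minimal gmirror(8) setup along those lines, with hypothetical disk
names da1 and da2:

# gmirror load
# gmirror label -v -b round-robin gm0 /dev/da1 /dev/da2
# newfs -U /dev/mirror/gm0

Adding geom_mirror_load="YES" to /boot/loader.conf makes the mirror
assemble at boot.)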

Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Dieter BSD
Wojciech writes:
 If the computer has a UPS then write caching is fine. Even if FreeBSD 
 crashes, the disk would write the data

That is incorrect.  A UPS reduces the risk, but does not eliminate it.
It is impossible to completely eliminate the risk of having the
write cache on.  If you care about your data you must turn the disk's
write cache off.

If you are using the drive in an application where the data does
not matter, or can easily be regenerated (e.g. disk duplication,
if it fails, just start over), then turning the write cache on
for that one drive can be ok. There is a patch that allows turning
the write cache on and off on a per drive basis. The patch is for
ata(4), but should be possible with other drivers.  camcontrol(8)
may work for SCSI and SAS drives. I have yet to see a USB-to-*ATA
bridge that allows turning the write cache off, so USB disks are
useless for most applications.

But for most applications, you must have the write cache off,
and you need queuing (e.g. TCQ or NCQ) for performance.  If
you have queuing, there is no need to turn the write cache
on.

It is inexcusable that FreeBSD defaults to leaving the write cache on
for SATA and PATA drives.  At least the admin can easily fix this by
adding hw.ata.wc=0 to /boot/loader.conf.  The bigger problem is that
FreeBSD does not support queuing on all controllers that support it.
Not something that admins can fix, and inexcusable for an OS that
claims to care about performance.
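
(The line in question, plus a read-back after reboot; a sketch:

# echo 'hw.ata.wc=0' >> /boot/loader.conf
# shutdown -r now
...
# kenv hw.ata.wc
0

kenv(1) just confirms the loader picked up the variable; the loader.conf
setting itself is what the text above describes.)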


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Ian Lepore
On Fri, 2013-01-18 at 20:37 +0100, Wojciech Puchar wrote:
  disk would write data
 
 
  I suspect that I'm encountering situations right now at netflix where this 
  advice is not true.  I have drives that are seeing intermittent errors, 
  then being forced into reset after a timeout, and then coming back up with 
  filesystem problems.  It's only a suspicion at this point, not a confirmed 
  case.
 true. I just assumed that anywhere it matters one would use gmirror.
 As for myself - i always prefer to put different manufacturers drives for 
 gmirror or at least - not manufactured at similar time.
 

That is good advice.  I bought six 1TB drives at the same time a few
years ago and received drives with consecutive serial numbers.  They
were all part of the same array, and they all failed (click of death)
within a six-hour timespan of each other.  Luckily I noticed the
clicking right away and was able to get all the data copied to another
array within a few hours, before they all died.

-- Ian

 Two failures at the same moment are rather unlikely. Of course, everything 
 is possible, so I do proper backups to remote sites. Remote means another 
 city.




Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Wojciech Puchar


That is incorrect.  A UPS reduces the risk, but does not eliminate it.


Nothing eliminates all risks.


But for most applications, you must have the write cache off,
and you need queuing (e.g. TCQ or NCQ) for performance.  If
you have queuing, there is no need to turn the write cache
on.
Did you test the above claim? I have SATA drives everywhere, all in AHCI 
mode, all with NCQ active.






It is inexcusable that FreeBSD defaults to leaving the write cache on
for SATA and PATA drives.  At least the admin can easily fix this by
adding hw.ata.wc=0 to /boot/loader.conf.  The bigger problem is that
FreeBSD does not support queuing on all controllers that support it.


I must be lucky, as I have never had a case of not seeing
adaX: Command Queueing enabled
on my machines.


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Scott Long

On Jan 18, 2013, at 1:12 PM, Dieter BSD dieter...@gmail.com wrote:
 It is inexcusable that FreeBSD defaults to leaving the write cache on
 for SATA and PATA drives.

This was completely driven by the need to satisfy idiotic benchmarkers,
tech writers, and system administrators.  It was a huge deal for FreeBSD
4.4, IIRC.  It had been silently enabled, we turned it off, released 4.4,
and then got murdered in the press for being slow.

If I had my way, the WC would be off, everyone would be using SAS,
and anyone who enabled SATA WC or complained about I/O slowness
would be forced into Siberian salt mines for the remainder of their lives.


  At least the admin can easily fix this by
 adding hw.ata.wc=0 to /boot/loader.conf.  The bigger problem is that
 FreeBSD does not support queuing on all controllers that support it.
 Not something that admins can fix, and inexcusable for an OS that
 claims to care about performance.

You keep saying this, but I'm unclear on what you mean.  Can you
explain?

Scott



Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Wojciech Puchar

and anyone who enabled SATA WC or complained about I/O slowness
would be forced into Siberian salt mines for the remainder of their lives.


so reserve a place for me there.


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Ian Lepore
On Fri, 2013-01-18 at 22:18 +0100, Wojciech Puchar wrote:
  and anyone who enabled SATA WC or complained about I/O slowness
  would be forced into Siberian salt mines for the remainder of their lives.
 
 so reserve a place for me there.

Yeah, me too.  I prefer to go for all-out performance with separate risk
mitigation strategies.  I wouldn't set up a client datacenter that way,
but it's wholly appropriate for what I do with this machine.

-- Ian




Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Peter Jeremy
On 2013-Jan-18 12:12:11 -0800, Dieter BSD dieter...@gmail.com wrote:
adding hw.ata.wc=0 to /boot/loader.conf.  The bigger problem is that
FreeBSD does not support queuing on all controllers that support it.
Not something that admins can fix, and inexcusable for an OS that
claims to care about performance.

Apart from continuous whinging and whining on mailing lists, what have
you done to add support for queuing?

-- 
Peter Jeremy




Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Karim Fodil-Lemelin

On 18/01/2013 10:16 AM, Mark Felder wrote:
On Thu, 17 Jan 2013 16:12:17 -0600, Karim Fodil-Lemelin 
fodillemlinka...@gmail.com wrote:


SAS controllers may connect to SATA devices, either directly 
connected using native SATA protocol or through SAS expanders using 
SATA Tunneled Protocol (STP).
 The system is currently put in place using SATA instead of SAS, 
although it's using the same interface and backplane connectors, and 
the drives (SATA) show as da0 in BSD _but_ with the SATA drive we get 
*much* better performance. I am thinking that something fancy in 
that SAS drive is not being handled correctly by the FreeBSD driver. 
I am planning to revisit the SAS drive issue at a later point 
(sometime next week).


Your SATA drives are connected directly, not with an interposer such as 
the LSISS9252, correct? If so, this might be the cause of your 
problems. Mixing SAS and SATA drives is known to cause serious 
performance issues for almost every 
JBOD/controller/expander/what-have-you. Change your configuration so 
there is only one protocol being spoken on the bus (SAS) by putting 
your SATA drives behind interposers, which translate SAS to SATA just 
before the disk. This will solve many problems.
Not sure what you mean by this, but isn't the mpt driver detecting an 
interposer in this line:


mpt0: LSILogic SAS/SATA Adapter port 0x1000-0x10ff mem 
0x9991-0x99913fff,0x9990-0x9990 irq 28 at device 0.0 on pci11

mpt0: MPI Version=1.5.20.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 0 Active Volumes (2 Max)
mpt0: 0 Hidden Drive Members (14 Max)

Also please note that SATA speed in that same hardware setup works just 
fine. In any case I will have a look.


Thanks,

Karim.


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Matthew Jacob
This is all turning into a bikeshed discussion. As far as I can tell, 
the basic original question was why a *SAS* (not a SATA) drive was not 
performing as well as expected based upon experiences with Linux. I 
still don't know whether reads or writes were being used for dd.


This morning, I ran a fio test with a single threaded read component and 
a multithreaded write component to see if there were differences. All I 
had connected to my MPT system were ATA drives (Seagate 500GBs), and I'm 
remote now and won't be back until Sunday to put in one of my 'good' SAS 
drives (140 GB Seagates, i.e., real SAS 15K RPM drives, not fat SATA 
bs drives).
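
(For reference, a fio job file approximating that mix might look like the
sketch below; the file name, size, block size, and runtime are illustrative
guesses, not the job actually used:

[global]
filename=/tmp/fio.test
size=1g
runtime=60
time_based
ioengine=psync
bs=128k

[reader]
rw=read
numjobs=1

[writers]
rw=write
numjobs=4
)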


The numbers were pretty much the same for both FreeBSD and Linux. In 
fact, FreeBSD was slightly faster. I won't report the exact numbers 
right now, but only mention this as a piece of information that, at least 
in my case, the difference between the OS platforms involved is 
negligible. This would, at least in my case, rule out issues based upon 
different platform access methods and different drivers.


All of this other discussion, about WCE and whatnot, is nice, but for 
all intents and purposes it could be moved to *-advocacy.




Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Matthew Jacob




mpt0: LSILogic SAS/SATA Adapter port 0x1000-0x10ff mem 
0x9991-0x99913fff,0x9990-0x9990 irq 28 at device 0.0 on pci11

mpt0: MPI Version=1.5.20.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 0 Active Volumes (2 Max)
mpt0: 0 Hidden Drive Members (14 Max)
Ah. Historically IBM systems (the 335, for one) have been very slow with 
the Integrated RAID software, at least on FreeBSD.



Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Dieter BSD
Scott writes:
 If I had my way, the WC would be off, everyone would be using SAS,
 and anyone who enabled SATA WC or complained about I/O slowness
 would be forced into Siberian salt mines for the remainder of their lives.

Actually, if you are running SAS, having SATA WC on or off wouldn't
matter; it would be SCSI's WC you'd care about.  :-)

 The bigger problem is that
 FreeBSD does not support queuing on all controllers that support it.
 Not something that admins can fix, and inexcusable for an OS that
 claims to care about performance.

 You keep saying this, but I'm unclear on what you mean.  Can you
 explain?

For most applications you need the write cache to be off.
Having the write cache off is fine as long as you have queuing.
But with the write cache off, if you don't have queuing, performance
sucks. Like getting only 6% of the performance you should be getting.

Some of the early SATA controllers didn't have NCQ.  Knowing that
queuing was very important, I made sure to choose a mainboard with
NCQ, giving up other useful features to get it.  But FreeBSD does
not support NCQ on the nforce4-ultra's SATA controllers.  Even the
sad joke of an OS Linux has had NCQ on nforce4 since Oct 2006.
But Linux is such crap it is unusable.  Linux is slowly improving,
but I don't expect to live long enough to see it become usable.
Seriously, I've tried it several times but I have completely
given up on it.

Anyway, even after all these years the supposedly
performance-oriented FreeBSD still does not support NCQ on nforce4,
which isn't some obscure chip: they sold a lot of them.  I've added
3 additional SATA controllers on expansion cards, and FreeBSD
supports NCQ on them, so the slow controllers limited by PCIe x1
have much better write performance than the much faster controllers
in the chipset with all the bandwidth they need.  I can't add
more controllers; there aren't any free slots.  The nforce
will remain in service for years: aside from the monetary cost,
silicon has a huge amount of environmental cost (embedded energy,
water, pollution, etc.), and there are a lot of them.

Wojciech writes:
 That is incorrect.  A UPS reduces the risk, but does not eliminate it.

 Nothing eliminates all risks.

Turning the write cache off eliminates the risk of having the write cache
on.  Yes you can still lose data for other reasons.  Backups are still a
good idea.

 But for most applications, you must have the write cache off,
 and you need queuing (e.g. TCQ or NCQ) for performance.  If
 you have queuing, there is no need to turn the write cache
 on.

 did you tested the above claim? i have SATA drives everywhere, all in ahci
 mode, all with NCQ active.

Yes, turn the write cache off and NCQ will give you the performance.
As long as you have queuing you can have the best of both worlds.

Which is why Karim's problem is so odd.  Driver says there is queuing,
but performance (1 write per rev) looks exactly like there is no queuing.
Maybe there is something else that causes only 1 write per rev but
I don't know what that might be.

Peter writes:
 Apart from continuous whinging and whining on mailing lists, what have
 you done to add support for queuing?

Submitted a PR; it was closed without being fixed.  Looked at the code,
but it was Greek to me, even though I have successfully modified a BSD-based
device driver in the past, giving a major performance improvement.  If I were
a C-level exec of a Fortune 500 company I'd just hire some device driver
wizard.


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Karim Fodil-Lemelin

On 18/01/2013 5:42 PM, Matthew Jacob wrote:
This is all turning into a bikeshed discussion. As far as I can tell, 
the basic original question was why a *SAS* (not a SATA) drive was not 
performing as well as expected based upon experiences with Linux. I 
still don't know whether reads or writes were being used for dd.


This morning, I ran a fio test with a single threaded read component 
and a multithreaded write component to see if there were differences. 
All I had connected to my MPT system were ATA drives (Seagate 500GBs), 
and I'm remote now and won't be back until Sunday to put in one of my 
'good' SAS drives (140 GB Seagates, i.e., real SAS 15K RPM drives, not 
fat SATA bs drives).


The numbers were pretty much the same for both FreeBSD and Linux. In 
fact, FreeBSD was slightly faster. I won't report the exact numbers 
right now, but only mention this as a piece of information that, at 
least in my case, the difference between the OS platforms involved is 
negligible. This would, at least in my case, rule out issues based 
upon different platform access methods and different drivers.


All of this other discussion, about WCE and whatnot, is nice, but for 
all intents and purposes it could be moved to *-advocacy.



Thanks for the clarifications!

I did mention at some point that those were write speeds and reads were 
just fine, and those were either writes to the filesystem or direct access 
(only on SAS again).


Here is what I am planning to do next week when I get the chance:

0) I plan on focusing on the SAS driver tests _only_ since SATA is 
working as expected, so there is nothing to report there.
1) Look carefully at how the drives are physically connected. Although 
it feels like if the SATA works fine the SAS should also, I'll check 
anyway.
2) Boot verbose with boot -v and send the dmesg output. The mpt driver 
might give us a clue.
3) Run gstat -abc in a loop for the test duration. Although I would 
think ctlstat(8) might be more interesting here, so I'll run it too for 
good measure :).


Please note that in all tests write caching was enabled, as I think this 
is the default with FBSD 9.1 GENERIC, but I'll confirm this with 
camcontrol(8).
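
(For reference, a quick way to check, assuming the drive is da0:

# camcontrol modepage da0 -m 8

Mode page 8 is the caching page; a line reading WCE: 1 means the write 
cache is enabled.)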


I've also seen quite a lot of 'quirks' for tagged command queuing in the 
source code (/sys/cam/scsi/scsi_xpt.c), but a particular one got my 
attention (thanks to whoever writes good comments in source code :) :


/*
 * Slow when tagged queueing is enabled. Write performance
 * steadily drops off with more and more concurrent
 * transactions.  Best sequential write performance with
 * tagged queueing turned off and write caching turned on.
 *
 * PR:  kern/10398
 * Submitted by:  Hideaki Okada hok...@isl.melco.co.jp
 * Drive:  DCAS-34330 w/ S65A firmware.
 *
 * The drive with the problem had the S65A firmware
 * revision, and has also been reported (by Stephen J.
 * Roznowski s...@home.net) for a drive with the S61A
 * firmware revision.
 *
 * Although no one has reported problems with the 2 gig
 * version of the DCAS drive, the assumption is that it
 * has the same problems as the 4 gig version.  Therefore
 * this quirk entry disables tagged queueing for all
 * DCAS drives.
 */
{ T_DIRECT, SIP_MEDIA_FIXED, "IBM", "DCAS*", "*" },
/*quirks*/0, /*mintags*/0, /*maxtags*/0

So I looked at the kern/10398 PR and got some feeling of 'deja vu', 
although the original problem was on FreeBSD 3.1, so it's most likely not 
that, but I thought I would mention it. The issue described is awfully 
familiar. Basically the SAS drive (SCSI back then) is slow on writes but 
fast on reads with dd. Could be a coincidence or a ghost from the past, 
who knows...


Cheers,

Karim.


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Dieter BSD
Matthew writes:
 There is also no information in the original email as to which direction
 the I/O was being sent.

In one of the followups, Karim reported:
  # dd if=/dev/zero of=foo count=10 bs=1024000
  10+0 records in
  10+0 records out
  10240000 bytes transferred in 19.615134 secs (522046 bytes/sec)

522 KB/s is pathetic.


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Adrian Chadd
On 18 January 2013 19:11, Dieter BSD dieter...@gmail.com wrote:
 Matthew writes:
 There is also no information in the original email as to which direction
 the I/O was being sent.

 In one of the followups, Karim reported:
   # dd if=/dev/zero of=foo count=10 bs=1024000
   10+0 records in
   10+0 records out
   10240000 bytes transferred in 19.615134 secs (522046 bytes/sec)

 522 KB/s is pathetic.

When this is running, use gstat and see exactly how many IOPS/sec
there are and the average io size is.
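
For example, something along these lines (the dd job and device filter are 
placeholders):

# dd if=/dev/zero of=foo bs=1024000 count=100 &
# gstat -b -I 1s -f '^da0'

In batch mode gstat prints one sample per interval; w/s is the write op 
rate and kBps the throughput, so kBps divided by w/s gives the average 
I/O size.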

Yes, 522 kbytes/sec is really pathetic, but there are a lot of potential
reasons for that.


adrian


Re: IBM blade server abysmal disk write performances

2013-01-17 Thread Wojciech Puchar


Note that the driver says Command Queueing enabled without
specifying which.  If the driver is trying to use SATA's NCQ but
the drive only speaks SCSI's TCQ, that could explain it. Or if
the TCQ isn't working for some other reason.


Even without TCQ/NCQ and the write cache, the write speed is really terrible.


Re: IBM blade server abysmal disk write performances

2013-01-17 Thread Karim Fodil-Lemelin

On 16/01/2013 2:48 AM, Dieter BSD wrote:

Karim writes:

It is quite obvious that something is awfully slow on SAS drives,
whatever it is and regardless of OS comparison. We swapped the SAS
drives for SATA and we're seeing much higher speeds. Basically on par
with what we were expecting (roughly 300 to 400 times faster than what
we see with SAS...).

Major clue there!  According to Wikipedia: "Most SAS drives provide
tagged command queuing, while most newer SATA drives provide native
command queuing" [1]

Note that the driver says Command Queueing enabled without
specifying which.  If the driver is trying to use SATA's NCQ but
the drive only speaks SCSI's TCQ, that could explain it. Or if
the TCQ isn't working for some other reason.

See if there are any error messages in dmesg or /var/log.
If not, perhaps the driver has extra debugging you could turn on.

Get TCQ working and make sure your partitions are aligned on
4 KiB boundaries (in case the drive actually has 4 KiB sectors),
and you should get the expected performance.

[1] http://en.wikipedia.org/wiki/Serial_attached_SCSI

Thanks for the wiki article reference; it is very interesting and 
confirms our current setup. I'm mostly thinking about this line:


"SAS controllers may connect to SATA devices, either directly connected 
using native SATA protocol or through SAS expanders using SATA Tunneled 
Protocol (STP)."


The system is currently put in place using SATA instead of SAS, although 
it's using the same interface and backplane connectors, and the drives 
(SATA) show as da0 in BSD _but_ with the SATA drive we get *much* better 
performance. I am thinking that something fancy in that SAS drive is 
not being handled correctly by the FreeBSD driver. I am planning to 
revisit the SAS drive issue at a later point (sometime next week).


Here is some trimmed and hopefully relevant information (from dmesg):

SAS drive:

mpt0: LSILogic SAS/SATA Adapter port 0x1000-0x10ff mem 
0x9991-0x99913fff,0x9990-0x9990 irq 28 at device 0.0 on pci11

mpt0: MPI Version=1.5.20.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 0 Active Volumes (2 Max)
mpt0: 0 Hidden Drive Members (14 Max)
...

da0 at mpt0 bus 0 scbus0 target 1 lun 0
da0: IBM-ESXS HUC106030CSS60 D3A6 Fixed Direct Access SCSI-6 device
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 286102MB (585937500 512 byte sectors: 255H 63S/T 36472C)
...
GEOM: da0: the primary GPT table is corrupt or invalid.
GEOM: da0: using the secondary instead -- recovery strongly advised.

SATA drive:

mpt0: LSILogic SAS/SATA Adapter port 0x1000-0x10ff mem 
0x9b91-0x9b913fff,0x9b90-0x9b90 irq 28 at device 0.0 on pci11

mpt0: MPI Version=1.5.20.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 0 Active Volumes (2 Max)
mpt0: 0 Hidden Drive Members (14 Max)
...
da0 at mpt0 bus 0 scbus0 target 2 lun 0
da0: ATA ST91000640NS SN03 Fixed Direct Access SCSI-5 device
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
...
GEOM: da0s1: geometry does not match label (16h,63s != 255h,63s).

Please let me know if there is anything you would like me to run on the 
BSD 9.1 system to help diagnose this issue?


Thank you,

Karim.


Re: IBM blade server abysmal disk write performances

2013-01-17 Thread Dieter BSD
 I am thinking that something fancy in that SAS drive is
 not being handled correctly by the FreeBSD driver.

I think so too, and I think the something fancy is tagged command queuing.
The driver prints "da0: Command Queueing enabled" and yet your SAS drive
is only getting 1 write per rev, and queuing should get you more than that.
Your SATA drive is getting the expected performance, which means that NCQ
must be working.

 Please let me know if there is anything you would like me to run on the
 BSD 9.1 system to help diagnose this issue?

Looking at the mpt driver, a verbose boot may give more info.
Looks like you can set a debug device hint, but I don't
see any documentation on what to set it to.

I think it is time to ask the driver wizards why TCQ isn't working,
so I'm cc-ing the authors listed on the mpt man page.


Re: IBM blade server abysmal disk write performances

2013-01-17 Thread Adrian Chadd
When you run gstat, how many ops/sec are you seeing?




Adrian


On 17 January 2013 20:03, Dieter BSD dieter...@gmail.com wrote:
 I am thinking that something fancy in that SAS drive is
 not being handled correctly by the FreeBSD driver.

 I think so too, and I think the something fancy is tagged command queuing.
  The driver prints "da0: Command Queueing enabled" and yet your SAS drive
 is only getting 1 write per rev, and queuing should get you more than that.
 Your SATA drive is getting the expected performance, which means that NCQ
 must be working.

 Please let me know if there is anything you would like me to run on the
 BSD 9.1 system to help diagnose this issue?

 Looking at the mpt driver, a verbose boot may give more info.
 Looks like you can set a debug device hint, but I don't
 see any documentation on what to set it to.

 I think it is time to ask the driver wizards why TCQ isn't working,
 so I'm cc-ing the authors listed on the mpt man page.


Re: IBM blade server abysmal disk write performances

2013-01-17 Thread Matthew Jacob

On 1/17/2013 8:03 PM, Dieter BSD wrote:
I think it is time to ask the driver wizards why TCQ isn't working, so 
I'm cc-ing the authors listed on the mpt man page. 


It is the MPT firmware that implements SATL, but there are probably 
tweaks that the FreeBSD driver doesn't do that the Linux driver does 
do.  The MPT driver was also worked on years ago and for a variety of 
reasons is unloved.


In general ATA drives have caching enabled, and in fact it is difficult 
to turn off.  There is no info in the email trail that says what the 
state of the SAS drive is wrt cache enable.


There is also no information in the original email as to which direction 
the I/O was being sent.


Let's also get a grip about Linux vs. FreeBSD: using 'dd' is not 
necessarily an apples-to-apples comparison where writes are concerned, 
because of the Linux heavy write-behind policy (plugging I/Os until it 
gets a large xfer built up and then releasing them, which gets larger 
xfers), while FreeBSD will use the block size you tell it to (whether 
that's optimal or not).


I'll see if I can generate some A/B numbers using fio here and report back.


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Matthew D. Fuller
On Tue, Jan 15, 2013 at 09:12:14AM -0500 I heard the voice of
Karim Fodil-Lemelin, and lo! it spake thus:

 da0: IBM-ESXS HUC106030CSS60 D3A6 Fixed Direct Access SCSI-6 device 

That's a 10k RPM drive.


 FreeBSD 9.1:
 
 10000+0 records in
 10000+0 records out
 5120000 bytes transferred in 60.024997 secs (85298 bytes/sec)

10000 ops in 60 seconds is practically the definition of a 10k drive.
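
The arithmetic behind that claim, spelled out:

10000 RPM / 60 = 166.7 revolutions per second
166.7 revs/sec * 512 bytes (one write per rev) = ~85333 bytes/sec

which is almost exactly the 85298 bytes/sec reported above: one uncached
512-byte write per platter revolution.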


 CentOS:
 
 100000+0 records in
 100000+0 records out
 51200000 bytes (51 MB) copied, 1.97883 s, 25.9 MB/s

100k ops in 2 seconds is 3M per second.  You could make a flat-out
*KILLING* if you could sell a platter drive that can pull that off.

Presumably this is an instance of "Linux only has block devices for
hard drives, not character devices", so you're getting your writes all
buffered over there.  Which is to say, nothing's wrong, you're just
not measuring the same thing.


-- 
Matthew Fuller (MF4839)   |  fulle...@over-yonder.net
Systems/Network Administrator |  http://www.over-yonder.net/~fullermd/
   On the Internet, nobody can hear you scream.


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Matthew D. Fuller
Dur...

 100k ops in 2 seconds is 3M per second.
   RPM I mean...


-- 
Matthew Fuller (MF4839)   |  fulle...@over-yonder.net
Systems/Network Administrator |  http://www.over-yonder.net/~fullermd/
   On the Internet, nobody can hear you scream.


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Mark Felder
On Tue, 15 Jan 2013 08:12:14 -0600, Karim Fodil-Lemelin  
fodillemlinka...@gmail.com wrote:



Hi,

I'm struggling to get FreeBSD 9.1 to work properly on an IBM blade server  
(HS22). Here is a dd output from Linux CentOS vs FreeBSD 9.1.



GNU dd is heavily buffered unless you tell it not to be. There really is  
no reason why you should want dd to be buffered by default. How can you  
trust that your attempt at writing raw data to a device actually completed  
if it's buffered?



Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Wojciech Puchar

10000+0 records out
5120000 bytes transferred in 60.024997 secs (85298 bytes/sec)


10000 ops in 60 seconds is practically the definition of a 10k drive.


nonsense.



Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Tim Kientzle

On Jan 15, 2013, at 6:12 AM, Karim Fodil-Lemelin wrote:

 Hi,
 
 I'm struggling to get FreeBSD 9.1 to work properly on an IBM blade server 
 (HS22). Here is a dd output from Linux CentOS vs FreeBSD 9.1.
 
 CentOS:
 
 100000+0 records in
 100000+0 records out
 51200000 bytes (51 MB) copied, 1.97883 s, 25.9 MB/s
 
 
 FreeBSD 9.1:
 
 10000+0 records in
 10000+0 records out
 5120000 bytes transferred in 60.024997 secs (85298 bytes/sec)

What exactly was the 'dd' command you used?

In particular, what block size did you specify?

Can you strace the 'dd' command on CentOS to
verify that it's using the actual block size you
specified?
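
Something like the following would show it (illustrative invocation only):

$ strace -e trace=write dd if=/dev/zero of=foo bs=1024000 count=10

Each write(2) line in the trace reports the byte count dd actually handed
to the kernel, which is the number that matters here.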

Some programs (I've written at least one) cheat
by actually doing larger I/O operations than you
request.  This makes a big difference in performance.

So this could reflect optimizations in GNU dd
more than any difference in the actual disk I/O.

If you want to do a more robust comparison, look
for one of the disk benchmarking programs in ports
and see if it's available (in the same version) for
CentOS.

Tim



Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Wojciech Puchar


10000+0 records in
10000+0 records out
5120000 bytes transferred in 60.024997 secs (85298 bytes/sec)


What exactly was the 'dd' command you used?

In particular, what block size did you specify?


5120000/10000=512

default

If it takes one revolution for one write, it means that write caching is 
disabled.


That's all.

Linux always uses buffered devices; only relatively recently was a special 
option added to have raw ones. Complete nonsense, but it's Linux.



Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Karim Fodil-Lemelin

On 15/01/2013 3:03 PM, Dieter BSD wrote:

Disabling the disk's write cache is *required* for data integrity.
One op per rev means write caching is disabled and no queueing.
But dmesg claims Command Queueing enabled, so you should be getting
more than one op per rev, and writes should be fast.
Is this dd to the raw drive, to a filesystem? (FFS? ZFS? other?)
Are you running compression, encryption, or some other feature
that might slow things down? Also, try dd with a larger block size,
like bs=1m.

Hi,

Thanks to everyone who answered so far. Here is a follow-up: dd to the 
raw drive and no compression/encryption or other such features, just a 
naive boot off a live 9.1 CD and then dd (see below). The following 
results have been gathered on the FreeBSD 9.1 system:


# dd if=/dev/zero of=toto count=100
100+0 records in
100+0 records out
51200 bytes transferred in 1.057507 secs (48416 bytes/sec)

# dd if=/dev/zero of=toto count=100 bs=104
100+0 records in
100+0 records out
10400 bytes transferred in 1.209524 secs (8598 bytes/sec)

# dd if=/dev/zero of=toto count=100 bs=1024
100+0 records in
100+0 records out
102400 bytes transferred in 0.844302 secs (121284 bytes/sec)

# dd if=/dev/zero of=toto count=100 bs=10240
100+0 records in
100+0 records out
1024000 bytes transferred in 2.173532 secs (471123 bytes/sec)

# dd if=/dev/zero of=toto count=100 bs=102400
100+0 records in
100+0 records out
10240000 bytes transferred in 19.915159 secs (514181 bytes/sec)

# dd if=/dev/zero of=toto count=100
100+0 records in
100+0 records out
51200 bytes transferred in 1.070473 secs (47829 bytes/sec)

# dd if=/dev/zero of=foo count=100
100+0 records in
100+0 records out
51200 bytes transferred in 0.683736 secs (74883 bytes/sec)

# dd if=/dev/zero of=foo count=100 bs=1024
100+0 records in
100+0 records out
102400 bytes transferred in 0.682579 secs (150019 bytes/sec)

# dd if=/dev/zero of=foo count=100 bs=10240
100+0 records in
100+0 records out
1024000 bytes transferred in 2.431012 secs (421224 bytes/sec)

# dd if=/dev/zero of=foo count=100 bs=102400
100+0 records in
100+0 records out
10240000 bytes transferred in 19.963030 secs (512948 bytes/sec)

# dd if=/dev/zero of=foo count=10 bs=1024000
10+0 records in
10+0 records out
10240000 bytes transferred in 19.615134 secs (522046 bytes/sec)

# dd if=/dev/zero of=foo count=1 bs=10240000
1+0 records in
1+0 records out
10240000 bytes transferred in 19.579077 secs (523007 bytes/sec)

Best regards,

Karim.



Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Adrian Chadd
Hi,

You're only doing one IO at the end. That's just plain silly. There's
all kinds of overhead that could show up, that would be amortized over
doing many IOs.

You should also realise that the raw disk IO on Linux is by default
buffered, so you're hitting the buffer cache. The results aren't going
to match, not unless you exhaust physical memory and start falling
behind on disk IO. At that point you'll see what the fuss is about.
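
If you want the Linux run to bypass its buffer cache so the comparison
is apples to apples, GNU dd can be told to use O_DIRECT (device name
hypothetical, and this overwrites it):

dd if=/dev/zero of=/dev/sdb bs=512 count=10000 oflag=direct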



Adrian


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Wojciech Puchar


# dd if=/dev/zero of=foo count=1 bs=10240000
1+0 records in
1+0 records out
10240000 bytes transferred in 19.579077 secs (523007 bytes/sec)


you write to a file, not a device, so it will be clustered by FreeBSD anyway.

128kB by default, more if you put options MAXPHYS=... in kernel config and 
recompile.
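
For example (an untested sketch; the value is only illustrative), in
the kernel config file:

options MAXPHYS=(1024*1024)

then rebuild and install the kernel the usual way and reboot.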


Even with the hard drive write cache disabled, it should do about one 
write per revolution, but it seems to do 4 writes per second.


so probably it is not that but much worse failure.

Did you test read speed?

dd if=/dev/disk of=/dev/null bs=512

dd if=/dev/disk of=/dev/null bs=4k

dd if=/dev/disk of=/dev/null bs=128k

?



Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Matthew D. Fuller
On Tue, Jan 15, 2013 at 12:03:33PM -0800 I heard the voice of
Dieter BSD, and lo! it spake thus:
 
 But dmesg claims Command Queueing enabled, so you should be
 getting more than one op per rev, and writes should be fast.

Queueing would only help if your load threw multiple ops at the drive
before waiting for any of them to complete.  I'd expect a dd to a raw
device to throw a single, wait for it to return complete, then throw
the next, leading to no more than 1 op per rev.

(possibly less, with sufficiently fast revs and a sufficiently slow
system, but that's a pretty unlikely combo with platter drives and
remotely modern hardware unless it's under serious load otherwise)
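
(A quick way to actually see queueing kick in, if you're curious, is
to keep several writers in flight at once.  An untested sketch, and
destructive to whatever is on da0:

for i in 1 2 3 4; do
    dd if=/dev/zero of=/dev/da0 bs=512 count=100000 seek=$((i * 1000000)) &
done; wait

With four writes outstanding the drive can reorder and coalesce them,
so the aggregate rate should climb well above one op per rev.)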


-- 
Matthew Fuller (MF4839)   |  fulle...@over-yonder.net
Systems/Network Administrator |  http://www.over-yonder.net/~fullermd/
   On the Internet, nobody can hear you scream.


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Karim Fodil-Lemelin

On 15/01/2013 3:55 PM, Adrian Chadd wrote:

You're only doing one IO at the end. That's just plain silly. There's
all kinds of overhead that could show up, that would be amortized over
doing many IOs.

You should also realise that the raw disk IO on Linux is by default
buffered, so you're hitting the buffer cache. The results aren't going
to match, not unless you exhaust physical memory and start falling
behind on disk IO. At that point you'll see what the fuss is about.

To put it simply, and maybe give a bit more context, here is what we're 
doing:


1) Boot OS (Linux or FreeBSD in this case)
2) dd some image over to the SAS drive.
3) rinse and repeat for X times.
4) profit.

In this case if step 1) is done with Linux we get 100 times more profit. 
I was wondering if we could close the gap.


Karim.


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Karim Fodil-Lemelin

On 15/01/2013 4:54 PM, Wojciech Puchar wrote:


# dd if=/dev/zero of=foo count=1 bs=10240000
1+0 records in
1+0 records out
10240000 bytes transferred in 19.579077 secs (523007 bytes/sec)


you write to a file, not a device, so it will be clustered by FreeBSD anyway.

128kB by default, more if you put options MAXPHYS=... in kernel config 
and recompile.


Even with the hard drive write cache disabled, it should do about one 
write per revolution, but it seems to do 4 writes per second.


so probably it is not that but much worse failure.

Did you test read speed?

dd if=/dev/disk of=/dev/null bs=512

dd if=/dev/disk of=/dev/null bs=4k

dd if=/dev/disk of=/dev/null bs=128k

As you mentioned, the dd file tests were done on UFS and not on the raw 
device. I will get those numbers for you.


Thanks,

Karim.


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Dieter BSD
Karim writes:
 dd to the
 raw drive, no compression/encryption or other such features, just a
 naive boot off a live 9.1 CD and then dd (see below). The following results
 have been gathered on the FreeBSD 9.1 system:

 # dd if=/dev/zero of=toto count=100
 100+0 records in
 100+0 records out
 51200 bytes transferred in 1.057507 secs (48416 bytes/sec)

By raw drive I meant something like
dd if=/dev/zero of=/dev/da0 bs=1m count=1000
of=toto implies that you are using a filesystem. (FFS? ZFS? other?)

Matthew writes:
 But dmesg claims Command Queueing enabled, so you should be
 getting more than one op per rev, and writes should be fast.

 Queueing would only help if your load threw multiple ops at the drive
 before waiting for any of them to complete.  I'd expect a dd to a raw
 device to throw a single, wait for it to return complete, then throw
 the next, leading to no more than 1 op per rev.

I see a huge speedup from NCQ on both raw disks and with FFS/su.
Without NCQ I only get 6% of the expected performance, even with
a large blocksize.
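
If you want to measure that effect directly, camcontrol can clamp the
tag depth (ada0 is hypothetical, the dd runs overwrite the disk, and
this is an untested sketch).  First with the queue clamped to a single
tag, then restored:

camcontrol tags ada0 -N 1
dd if=/dev/zero of=/dev/ada0 bs=128k count=10000
camcontrol tags ada0 -N 32
dd if=/dev/zero of=/dev/ada0 bs=128k count=10000

The difference between the two dd rates is roughly the queueing
contribution.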

The kernel must be doing write-behind even to a raw disk, otherwise
waiting for write(2) to return before issuing the next write would
slow it down as Matthew suggests.

Writing an entire 3TB disk (raw disk, no fs) gives:
21378.98 real 2.00 user   440.98 sys
or 140 MB/s (133 MiB/s) on slow controller in PCIe x1 slot.
The same test on the same make & model disk on a much faster controller
in the chipset takes over 10x as long, because FreeBSD does not support
NCQ on that controller. :-(

Karim's data sure looks like 1 op per rev. Either it isn't really doing
NCQ or the filesystem is doing something to keep NCQ from being effective.
For example, mounting the fs with the sync option would probably have
that effect.
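
Checking that is cheap (the mount point is hypothetical): mount -v
prints the options in effect, and an already-mounted FFS can be
remounted for a throwaway test:

mount -v | grep /mnt
mount -u -o async /mnt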


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Dieter BSD
I wrote:
 The kernel must be doing write-behind even to a raw disk, otherwise
 waiting for write(2) to return before issuing the next write would
 slow it down as Matthew suggests.

And a minute after hitting send, I remembered that FreeBSD does not
provide the traditional raw disk devices, e.g. /dev/rda0 with an 'r'.
(Now if I could just remember *why* it doesn't.)


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Dieter BSD
 25.9 MB/s

Even Linux is pretty slow.

 Transfer rates:
 outside:   102400 kbytes in   0.685483 sec = 149384 kbytes/sec
 middle:102400 kbytes in   0.747424 sec = 137004 kbytes/sec
 inside:102400 kbytes in   1.051036 sec = 97428 kbytes/sec

That's more like it.  I assume these numbers are reading.  You should get
numbers nearly this high when writing.

Can you try writing to the bare drive without a filesystem?

time dd if=/dev/da0 of=/dev/null bs=124k count=250000
time (dd if=/dev/zero of=/dev/da0 bs=124k count=250000; sync)

Between writing more data than the size of memory and the sync,
this should hopefully reduce any buffering effects down into the noise
and make the numbers more comparable between FreeBSD and Linux
(and more honest).  It also eliminates any effect from the filesystem,
which will be different between FreeBSD and Linux.

Writing should be almost as fast as reading.

Is the disk healthy?  Smartctl might give a clue.
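
For example (da0 is hypothetical; smartctl comes from the
smartmontools port/package):

smartctl -a /dev/da0

and look at the error counters and the self-test log.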

If the disk is healthy and you still get numbers that indicate one
write per rev without a filesystem, then the question is why does
the driver claim queueing but not deliver it?


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Karim Fodil-Lemelin

On 15/01/2013 4:54 PM, Wojciech Puchar wrote:


# dd if=/dev/zero of=foo count=1 bs=10240000
1+0 records in
1+0 records out
10240000 bytes transferred in 19.579077 secs (523007 bytes/sec)


you write to a file, not a device, so it will be clustered by FreeBSD anyway.

128kB by default, more if you put options MAXPHYS=... in kernel config 
and recompile.


Even with the hard drive write cache disabled, it should do about one 
write per revolution, but it seems to do 4 writes per second.


so probably it is not that but much worse failure.

Did you test read speed?

dd if=/dev/disk of=/dev/null bs=512

dd if=/dev/disk of=/dev/null bs=4k

dd if=/dev/disk of=/dev/null bs=128k

?

I'll do the read test as well but if I recall correctly it seemed pretty 
decent.


It is quite obvious that something is awfully slow on the SAS drives, 
whatever it is and regardless of the OS comparison. We swapped the SAS 
drives for SATA and we're seeing much higher speeds, basically on par 
with what we were expecting (roughly 300 to 400 times faster than what 
we see with SAS...).


I find it strange that diskinfo reports those transfer rates:

Transfer rates:
outside:   102400 kbytes in   0.685483 sec = 149384 kbytes/sec
middle:102400 kbytes in   0.747424 sec = 137004 kbytes/sec
inside:102400 kbytes in   1.051036 sec = 97428 kbytes/sec

Yet we get only a tiny fraction of those rates (it takes 20 seconds to 
transfer 10 MB!) when using dd. I also doubt it's dd's behavior, since how 
else can we explain the performance going up with SATA when doing the same test?


Unfortunately, we'll have to move on soon and we're about to write off 
SAS and use SATA instead.


Thanks,

Karim.


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Ian Lepore
On Tue, 2013-01-15 at 15:28 -0500, Karim Fodil-Lemelin wrote:
 On 15/01/2013 3:03 PM, Dieter BSD wrote:
  Disabling the disk's write cache is *required* for data integrity.
  One op per rev means write caching is disabled and no queueing.
  But dmesg claims Command Queueing enabled, so you should be
 getting
  more than one op per rev, and writes should be fast.
  Is this dd to the raw drive, or to a filesystem? (FFS? ZFS? other?)
  Are you running compression, encryption, or some other feature
  that might slow things down? Also, try dd with a larger block size,
  like bs=1m.
 Hi,
 
 Thanks to everyone who answered so far. Here is a follow-up: dd to the 
 raw drive, no compression/encryption or other such features, just a 
 naive boot off a live 9.1 CD and then dd (see below). The following 
 results have been gathered on the FreeBSD 9.1 system:

You say dd with a raw drive, but as several people have pointed out,
linux dd doesn't go directly to the drive by default.  It looks like you
can make it do so with the direct option, which should make it behave
the same as freebsd's dd does by default (I think; I'm no linux
expert).

For example, using a usb thumb drive:

th2 # dd if=/dev/sdb4 of=/dev/null count=100 
100+0 records in
100+0 records out
51200 bytes (51 kB) copied, 0.0142396 s, 3.6 MB/s

th2 # dd if=/dev/sdb4 of=/dev/null count=100 iflag=direct
100+0 records in
100+0 records out
51200 bytes (51 kB) copied, 0.0628582 s, 815 kB/s

Hmm, just before hitting send I saw your other response that SAS drives
behave badly, SATA are fine.  That does seem to point away from dd
behavior.  It might still be interesting to see if the direct flag on
linux drops performance into the same horrible range as freebsd with
SAS.

-- Ian




Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Wojciech Puchar

The kernel must be doing write-behind even to a raw disk, otherwise
waiting for write(2) to return before issuing the next write would
slow it down as Matthew suggests.


And a minute after hitting send, I remembered that FreeBSD does not
provide the traditional raw disk devices, e.g. /dev/rda0 with an 'r'.
(Now if I could just remember *why* it doesn't.)

because they are only raw devices; caching is done in the filesystem.


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Wojciech Puchar

Transfer rates:
   outside:   102400 kbytes in   0.685483 sec = 149384 kbytes/sec
   middle:102400 kbytes in   0.747424 sec = 137004 kbytes/sec
   inside:102400 kbytes in   1.051036 sec = 97428 kbytes/sec


this is right.
Yet we get only a tiny fraction of those rates (it takes 20 seconds to transfer 
10 MB!) when using dd. I also doubt it's dd's behavior, since how else can we explain 

dd is fine; the hardware configuration isn't.


Re: IBM blade server abysmal disk write performances

2013-01-15 Thread Dieter BSD
Karim writes:
 It is quite obvious that something is awfully slow on SAS drives,
 whatever it is and regardless of OS comparison. We swapped the SAS
 drives for SATA and we're seeing much higher speeds. Basically on par
 with what we were expecting (roughly 300 to 400 times faster than what
 we see with SAS...).

Major clue there!  According to Wikipedia: "Most SAS drives provide
tagged command queuing, while most newer SATA drives provide native
command queuing." [1]

Note that the driver says Command Queueing enabled without
specifying which.  If the driver is trying to use SATA's NCQ but
the drive only speaks SCSI's TCQ, that could explain it. Or if
the TCQ isn't working for some other reason.
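
One thing worth checking (da0 hypothetical): camcontrol reports the
tag depth actually in use, which should show whether tagged queueing
is really in play:

camcontrol tags da0 -v

If dev_openings stays pinned at 1, the drive is effectively running
untagged no matter what the boot message said.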

See if there are any error messages in dmesg or /var/log.
If not, perhaps the driver has extra debugging you could turn on.

Get TCQ working and make sure your partitions are aligned on
4 KiB boundaries (in case the drive actually has 4 KiB sectors),
and you should get the expected performance.
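
With gpart the alignment is just a flag at partition-creation time
(da0 and the partition type are hypothetical, and a partition scheme
is assumed to exist already):

gpart add -t freebsd-ufs -a 4k da0
gpart show da0

gpart show prints offsets in sectors; multiples of 8 (512-byte)
sectors mean the partition starts on a 4 KiB boundary.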

[1] http://en.wikipedia.org/wiki/Serial_attached_SCSI