Re: [zfs-discuss] Problems with zfs and a STK RAID INT SAS HBA

2010-04-05 Thread Ragnar Sundblad

On 5 apr 2010, at 04.35, Edward Ned Harvey wrote:

 When running the card in copyback write cache mode, I got horrible
 performance (with zfs), much worse than with copyback disabled
 (which I believe should mean it does write-through), when tested
 with filebench.
 
 When I benchmark my disks, I also find that the system is slower with
 WriteBack enabled.  I would not call it much worse, I'd estimate about 10%
 worse.

Yes, I oversimplified - I have been benchmarking with filebench,
just running the tests shipped with the OS, trimmed a little
according to http://www.solarisinternals.com/wiki/index.php/FileBench.
For most tests I typically get slightly worse performance with
writeback enabled (or copyback, as they call it on this card);
maybe about 10% on average would be about right for these tests too.

The interesting part is that with these tests and writeback disabled,
on a 4-way stripe of Sun stock 2.5" 146 GB 10k RPM drives, the test
takes 2 hours and 18 minutes (138 minutes) to complete, but with
writeback enabled it takes 16 hours 57 minutes (1017 minutes), or
over 7.3 times as long!

I can't (yet) explain the large difference in test time combined with
the small difference in test results.

Maybe a hardware - or driver - problem plays a part in this.

I have made a few simple tests with these cards before and was
not really impressed; even with all the bells and whistles turned off
they mostly just seemed to be an IOPS and perhaps bandwidth bottleneck,
but the above just does not seem right.

  This, naturally, is counterintuitive.  I do have an explanation,
 however, which is partly conjecture:  With the WriteBack enabled, when the
 OS tells the HBA to write something, it seems to complete instantly.  So the
 OS will issue another, and another, and another.  The HBA has no knowledge
 of the underlying pool data structure, so it cannot consolidate the smaller
 writes into larger sequential ones.  It will brainlessly (or
 less-brainfully) do as it was told, and write the blocks to precisely the
 addresses that it was instructed to write.  Even if those are many small
 writes, scattered throughout the platters.  ZFS is smarter than that.  It's
 able to consolidate a zillion tiny writes, as well as some larger writes,
 all into a larger sequential transaction.  ZFS has flexibility, in choosing
 precisely how large a transaction it will create, before sending it to disk.
 One of the variables used to decide how large the transaction should be is
 ... Is the disk busy writing, right now?  If the disks are still busy, I
 might as well wait a little longer and continue building up my next
 sequential block of data to write.  If it appears to have completed the
 previous transaction already, no need to wait any longer.  Don't let the
 disks sit idle.  Just send another small write to the disk.
 
 Long story short, I think, ZFS simply does a better job of write buffering
 than the HBA could possibly do.  So you benefit by disabling the WriteBack,
 in order to allow ZFS to handle that instead.

You would think that ZIL writes should get a speedup from the
writeback cache, meaning more sync operations per second - and in some
cases that seems to be true - and that the card should be designed to
handle the intermittent load of the txg completion bursts
(typically every 30 seconds), but something strange obviously happens,
at least on this setup.
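
To make that intuition concrete, here is a toy back-of-the-envelope model
(plain Python, nothing ZFS- or card-specific; every latency, size and
bandwidth below is an invented assumption, not a measurement from this box).
It only illustrates why a writeback cache should, in principle, help
latency-bound ZIL-style sync writes a lot, while helping much less with a
txg burst that is larger than the cache:

# Toy model only - illustrates the reasoning above, not the actual behavior
# of the STK RAID INT card or of ZFS. All numbers are invented assumptions.

DISK_WRITE_MS   = 5.0    # assumed latency for a small write to reach the platters
CACHE_ACK_MS    = 0.1    # assumed latency for the HBA to ack into its writeback cache
TXG_BURST_MB    = 512    # assumed data flushed per txg completion (every ~30 s)
HBA_CACHE_MB    = 256    # assumed size of the HBA writeback cache
STRIPE_MB_PER_S = 300.0  # assumed streaming bandwidth of the 4-way stripe

def sync_iops(ack_ms):
    # ZIL-style sync writes are latency-bound: each write must be
    # acknowledged before the application can continue.
    return 1000.0 / ack_ms

def txg_drain_seconds(burst_mb, cache_mb, stripe_mb_s):
    # The cache can only hide the first cache_mb of each burst; the rest
    # still has to drain at disk speed.
    return max(0.0, burst_mb - cache_mb) / stripe_mb_s

print("sync IOPS, write-through:", sync_iops(DISK_WRITE_MS))
print("sync IOPS, writeback    :", sync_iops(CACHE_ACK_MS))
print("txg drain (s), write-through:",
      txg_drain_seconds(TXG_BURST_MB, 0, STRIPE_MB_PER_S))
print("txg drain (s), writeback    :",
      txg_drain_seconds(TXG_BURST_MB, HBA_CACHE_MB, STRIPE_MB_PER_S))

With numbers like these the cache should be a clear win for sync-heavy
loads, which is exactly why the measured 7x slowdown is so hard to explain.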

(Actually I'd prefer to be able to conclude that there is no use for
writeback caching HBAs - I'd like these machines to be as stable as
they possibly can be, and therefore as plain and simple as possible,
and for us to be able to just quickly move the disks if one machine should
break - with some data stuck in some silly writeback cache inside an HBA
that may or may not cooperate depending on its state of mind, mood and the
moon phase, that can't be done and I'd need a much more complicated
(= error- and mistake-prone) setup. But my tests so far just don't seem
right and probably can't be used to conclude anything.
I'd rather use slogs, and I have a few Intel X25-Es to test with, but
then I just recently read on this list that X25-Es aren't supported for
slog anymore! Maybe because they always have their writeback cache
turned on by default and ignore cache flush commands (and that is not a
bug - is the design from outer space?), I don't know yet.
(Don't know why I am stubbornly fooling around with this Intel junk - right
now they manage to annoy me with a crappy (or broken) PCI-PCI bridge,
a crappy HBA and crappy SSD drives...))

/ragge

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problems with zfs and a STK RAID INT SAS HBA

2010-04-04 Thread Edward Ned Harvey
 When running the card in copyback write cache mode, I got horrible
 performance (with zfs), much worse than with copyback disabled
 (which I believe should mean it does write-through), when tested
 with filebench.

When I benchmark my disks, I also find that the system is slower with
WriteBack enabled.  I would not call it much worse, I'd estimate about 10%
worse.  This, naturally, is counterintuitive.  I do have an explanation,
however, which is partly conjecture:  With the WriteBack enabled, when the
OS tells the HBA to write something, it seems to complete instantly.  So the
OS will issue another, and another, and another.  The HBA has no knowledge
of the underlying pool data structure, so it cannot consolidate the smaller
writes into larger sequential ones.  It will brainlessly (or
less-brainfully) do as it was told, and write the blocks to precisely the
addresses that it was instructed to write.  Even if those are many small
writes, scattered throughout the platters.  ZFS is smarter than that.  It's
able to consolidate a zillion tiny writes, as well as some larger writes,
all into a larger sequential transaction.  ZFS has flexibility, in choosing
precisely how large a transaction it will create, before sending it to disk.
One of the variables used to decide how large the transaction should be is
... Is the disk busy writing, right now?  If the disks are still busy, I
might as well wait a little longer and continue building up my next
sequential block of data to write.  If it appears to have completed the
previous transaction already, no need to wait any longer.  Don't let the
disks sit idle.  Just send another small write to the disk.

Long story short, I think, ZFS simply does a better job of write buffering
than the HBA could possibly do.  So you benefit by disabling the WriteBack,
in order to allow ZFS to handle that instead.
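
As a side note, the argument can be boiled down to a small conceptual
sketch (plain Python, not ZFS source; the busy indicator, the workload and
all numbers are illustrative stand-ins). A writeback HBA has to issue every
small write to exactly the address it was given, while a coalescing writer
keeps accumulating as long as the disks look busy and then flushes the
whole batch as one large sequential transaction:

# Conceptual sketch only - not ZFS internals. A writeback HBA issues each
# small write as-is; a coalescing writer buffers while the disks look busy
# and flushes everything accumulated as one large sequential transaction
# once they look idle. The busy indicator and workload are made up.

def coalesced_transactions(n_small_writes, looks_busy):
    transactions = 0
    pending = 0
    for i in range(n_small_writes):
        pending += 1
        if not looks_busy(i):      # previous transaction finished: flush now
            transactions += 1      # one sequential write replaces `pending` small ones
            pending = 0
    if pending:
        transactions += 1          # flush whatever is left at the end
    return transactions

N = 1000
print("device I/Os, writeback HBA (no pool knowledge):", N)
print("device I/Os, coalescing writer (disks busy 9 steps out of 10):",
      coalesced_transactions(N, lambda i: i % 10 != 9))

Which is just another way of saying that the filesystem has information
about the write stream that the HBA cannot possibly have.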

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Problems with zfs and a STK RAID INT SAS HBA

2010-04-03 Thread Ragnar Sundblad

Hello,

Maybe this question should be put on another list, but since there
are a lot of people here using all kinds of HBAs, this could be the
right place anyway:

I have an X4150 running snv_134. It was shipped with an STK RAID INT
(Adaptec/Intel/StorageTek/Sun) SAS HBA.

When running the card in copyback write cache mode, I got horrible
performance (with zfs), much worse than with copyback disabled
(which I believe should mean it does write-through), when tested
with filebench.
This could actually be expected, depending on how good or bad the
card is, but I am still not sure what to expect.

It logs some errors, as shown with fmdump -e(V).
Most often it is a PCI bridge error (I think), about five to ten
times an hour, and occasionally a problem with accessing a
mode page on the disks for enabling/disabling the write cache,
one error for each disk, about every three hours.
I don't believe the two have to be related.

I am not sure if the PCI-PCI bridge is on the RAID board itself
or in the host.

I haven't seen this problem on other more or less identical
machines running sol10.

Is this a known software problem, or do I have faulty hardware?

Thanks!

/ragge

--

% fmdump -e
...
Apr 04 01:21:53.2244 ereport.io.pci.fabric   
Apr 04 01:30:00.6999 ereport.io.pci.fabric   
Apr 04 01:30:23.4647 ereport.io.scsi.cmd.disk.dev.uderr
Apr 04 01:30:23.4651 ereport.io.scsi.cmd.disk.dev.uderr
...
% fmdump -eV
Apr 04 2010 01:21:53.224492765 ereport.io.pci.fabric
nvlist version: 0
class = ereport.io.pci.fabric
ena = 0xd6a00a43be800c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /p...@0,0/pci8086,2...@4
(end detector)

bdf = 0x20
device_id = 0x25f8
vendor_id = 0x8086
rev_id = 0xb1
dev_type = 0x40
pcie_off = 0x6c
pcix_off = 0x0
aer_off = 0x100
ecc_ver = 0x0
pci_status = 0x10
pci_command = 0x147
pci_bdg_sec_status = 0x0
pci_bdg_ctrl = 0x3
pcie_status = 0x0
pcie_command = 0x2027
pcie_dev_cap = 0xfc1
pcie_adv_ctl = 0x0
pcie_ue_status = 0x0
pcie_ue_mask = 0x10
pcie_ue_sev = 0x62031
pcie_ue_hdr0 = 0x0
pcie_ue_hdr1 = 0x0
pcie_ue_hdr2 = 0x0
pcie_ue_hdr3 = 0x0
pcie_ce_status = 0x0
pcie_ce_mask = 0x0
pcie_rp_status = 0x0
pcie_rp_control = 0x7
pcie_adv_rp_status = 0x0
pcie_adv_rp_command = 0x7
pcie_adv_rp_ce_src_id = 0x0
pcie_adv_rp_ue_src_id = 0x0
remainder = 0x0
severity = 0x1
__ttl = 0x1
__tod = 0x4bb7cd91 0xd617cdd
...
Apr 04 2010 01:30:23.464768275 ereport.io.scsi.cmd.disk.dev.uderr
nvlist version: 0
class = ereport.io.scsi.cmd.disk.dev.uderr
ena = 0xde0cd54f84201c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /p...@0,0/pci8086,2...@4/pci108e,2...@0/d...@5,0
devid = id1,s...@tsun_stk_raid_intea4b6f24
(end detector)

driver-assessment = fail
op-code = 0x1a
cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
pkt-reason = 0x0
pkt-state = 0x1f
pkt-stats = 0x0
stat-code = 0x0
un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0

un-decode-value =
__ttl = 0x1
__tod = 0x4bb7cf8f 0x1bb3cd13
...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss