[zfs-discuss] iscsi share on different subnet?

2009-10-20 Thread Kent Watsen


I have a ZFS/Xen server for my home network.  The box itself has two 
physical NICs.  I want Dom0 to be on my management network and the 
guest domains to be on the DMZ and private networks.  The private 
network is where all my home computers are, and I would like to export 
iSCSI volumes directly to them - without having to create a firewall 
rule granting them access to the management network.  After some 
searching, I have yet to find a way to specify the subnet an iSCSI 
target is visible to - is there any way to do that?


Another idea, I suppose, would be to have one of the guest domains mount 
the volume and then export it itself, but this would be less performant 
and more complicated...
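
One thing I stumbled across but haven't tried: the legacy iscsitadm 
target seems to support target portal groups, which look like they can 
tie a target to a specific local IP.  If I'm reading the man page right, 
it would be something along these lines - the address and target name 
below are just made-up examples:

# iscsitadm create tpgt 1
# iscsitadm modify tpgt -i 192.168.1.10 1
# iscsitadm modify target -p 1 tank/iscsi/homevol

Can anyone confirm whether that actually restricts which subnet the 
target is visible on?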


Thanks,
Kent



[zfs-discuss] how to destroy a pool by id?

2009-06-20 Thread Kent Watsen





Over the course of multiple OpenSolaris installs, I first created a
pool called "tank" and then, later, reusing some of the same drives,
I created another pool also called "tank".  I can `zpool export tank`,
but when I `zpool import tank`, I get:

bash-3.2# zpool import tank
  cannot import 'tank': more than one matching pool
  import by numeric ID instead


Then, using just `zpool import` I see the IDs:

bash-3.2# zpool import 
   pool: tank
   id: 15608629750614119537
  state: ONLINE
  action: The pool can be imported using its name or numeric
identifier.
  config:
  
   tank ONLINE
   raidz2 ONLINE
   c3t0d0 ONLINE
   c3t4d0 ONLINE
   c4t0d0 ONLINE
   c4t4d0 ONLINE
   c5t0d0 ONLINE
   c5t4d0 ONLINE
   raidz2 ONLINE
   c3t1d0 ONLINE
   c3t5d0 ONLINE
   c4t1d0 ONLINE
   c4t5d0 ONLINE
   c5t1d0 ONLINE
   c5t5d0 ONLINE
  
   pool: tank
   id: 3280066346390919920
  state: ONLINE
  status: The pool was last accessed by another system.
  action: The pool can be imported using its name or numeric
identifier and
   the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
  config:
  
   tank ONLINE
   raidz2 ONLINE
   c4t1d0p0 ONLINE
   c3t1d0p0 ONLINE
   c4t4d0p0 ONLINE
   c3t4d0p0 ONLINE
   c3t5d0p0 ONLINE
   c3t0d0p0 ONLINE


How can I destroy the pool with ID 3280066346390919920 so that I do not
have to specify the ID when importing tank in the future?
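
I'm assuming the answer is to import the stale pool by its numeric ID
under a temporary name and then destroy it - something like the
following (oldtank is just a scratch name):

# zpool import -f 3280066346390919920 oldtank
# zpool destroy oldtank

...but I'd like confirmation before pointing anything destructive at
these disks.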

Thanks,
kent






Re: [zfs-discuss] ATA UDMA data parity error

2008-01-22 Thread Kent Watsen

For the archive, I swapped the mobo and all is good now...  (I copied 
100GB into the pool without a crash)

One problem I had was that Solaris would hang whenever booting - even 
when all the aoc-sat2-mv8 cards were pulled out.  Turns out that 
switching the BIOS field "USB 2.0 Controller Mode" from "HiSpeed" to 
"FullSpeed" makes the difference - any ideas why?

Thanks,
Kent



Re: [zfs-discuss] ATA UDMA data parity error

2008-01-18 Thread Kent Watsen


Thanks for the note Anton.  I let memtest86 run overnight and it found 
no issues.  I've also now moved the cards around and have confirmed that 
slot #3 on the mobo is bad (all my aoc-sat2-mv8 cards, cables, and 
backplanes are OK).

However, I think it's more than just slot #3 that has a fault, because 
when I have all three cards plugged into mobo slots other than #3, they 
all work fine individually, but when I run the exact same per-card tests 
in parallel, the system crashes.

I'm now going to have the system integrator that built my system send me 
a new mobo.  (ugh!)

Thanks again,
Kent


Anton B. Rang wrote:
 Definitely a hardware problem (possibly compounded by a bug).  Some key 
 phrases and routines:

   ATA UDMA data parity error

 This one actually looks like a misnomer.  At least, I'd normally expect a data 
 parity error not to crash the system!  (It should result in a retry or EIO.)

   PCI(-X) Express Fatal Error

 This one's more of an issue -- it indicates that the PCI Express bus had an 
 error.

   pcie_pci:pepb_err_msi_intr

 This indicates an error on the PCI bus which has been reflected through to 
 the PCI Express bus. There should be more detail, but it's hard to figure it 
 out from what's below. (The report is showing multiple errors, including both 
 parity errors and system errors, which seems unlikely unless there's a hardware 
 design flaw or a software bug.)

 Others have suggested the power supply or memory, but in my experience these 
 types of errors are more often due to a faulty system backplane or card (and 
 occasionally a bad bridge chip).
  
  



[zfs-discuss] ATA UDMA data parity error

2008-01-17 Thread Kent Watsen

Hey all,

I'm not sure if this is a ZFS bug or a hardware issue I'm having - any 
pointers would be great!


Following contents include:
  - high-level info about my system
  - my first thought to debugging this
  - stack trace
  - format output
  - zpool status output
  - dmesg output



High-Level Info About My System
-
- fresh install of b78
- first time trying to do anything IO-intensive with ZFS
 - command was `cp -r /cdrom /tank/sxce_b78_disk1`
 - but this also fails `cp -r /usr /tank/usr`
- system has 24 SATA/SAS drive bays, but only 12 of them are populated
- system has three AOC-SAT2-MV8 cards plugged into 6 mini-sas backplanes
- card1 (c3)
- bp1  (c3t0d0, c3t1d0)
- bp2 (c3t4d0, c3t5d0)
- card2 (c4)
- bp1  (c4t0d0, c4t1d0)
- bp2 (c4t4d0, c4t5d0)
- card3 (c5)
- bp1  (c5t0d0, c5t1d0)
- bp2 (c5t4d0, c5t5d0)
- system has one Barcelona Opteron (step BA)
  - the one with the potential look-aside cache bug...
  - though it's not clear this is related...



My First Thought To Debugging This

After crashing my system several times (using `cp -r /usr /tank/usr`) 
and comparing the outputs, I noticed that the stack trace always points 
to device-path=/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED],1, 
which corresponds to all the drives connected to aoc-sat2-mv8 card #3 (i.e. c5)

But looking at the `format` output, this device path only differs from 
the other devices in that there is a ,1 trailing the 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED] 
part.  Further, again looking at 
the `format` output,  c3 devices have 4/disk, c4 devices have 
6/disk, and c5 devices also have 6/disk.

The only other thing I can add to this is that if I boot a Xen kernel, 
which I was *not* using for all these tests, the following IRQ errors 
are reported:

SunOS Release 5.11 Version snv_78 64-bit
Copyright 1983-2007 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: san
NOTICE: IRQ17 is shared
Reading ZFS config: done
Mounting ZFS filesystems: (1/1)
NOTICE: IRQ20 is shared
NOTICE: IRQ21 is shared
NOTICE: IRQ22 is shared
 
Any ideas?
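
(I'm assuming that, after the reboot, the saved telemetry can also be
pulled back out of the FMA error log with something like `fmdump -eV`,
in case the ereport summary below isn't enough to go on.)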




Stack Trace   (note: I've done this a few times and it's always 
/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED],1)
---
ATA UDMA data parity error
SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
EVENT-TIME: 0x478f39ab.0x160dc688 (0x5344bb5958)
PLATFORM: i86pc, CSN: -, HOSTNAME: san
SOURCE: SunOS, REV: 5.11 snv_78
DESC: Errors have been detected that require a reboot to ensure system
integrity.  See http://www.sun.com/msg/SUNOS-8000-0G for more information.
AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry
IMPACT: The system will sync files, save a crash dump if needed, and reboot
REC-ACTION: Save the error summary below in case telemetry cannot be saved


panic[cpu3]/thread=ff000f7c2c80: pcie_pci-0: PCI(-X) Express Fatal Error

ff000f7c2bc0 pcie_pci:pepb_err_msi_intr+d2 ()
ff000f7c2c20 unix:av_dispatch_autovect+78 ()
ff000f7c2c60 unix:dispatch_hardint+2f ()
ff000fd09fd0 unix:switch_sp_and_call+13 ()
ff000fd0a020 unix:do_interrupt+a0 ()
ff000fd0a030 unix:cmnint+ba ()
ff000fd0a130 genunix:avl_first+1e ()
ff000fd0a1f0 zfs:metaslab_group_alloc+d1 ()
ff000fd0a2c0 zfs:metaslab_alloc_dva+1b7 ()
ff000fd0a360 zfs:metaslab_alloc+82 ()
ff000fd0a3b0 zfs:zio_dva_allocate+8a ()
ff000fd0a3d0 zfs:zio_next_stage+b3 ()
ff000fd0a400 zfs:zio_checksum_generate+6e ()
ff000fd0a420 zfs:zio_next_stage+b3 ()
ff000fd0a490 zfs:zio_write_compress+239 ()
ff000fd0a4b0 zfs:zio_next_stage+b3 ()
ff000fd0a500 zfs:zio_wait_for_children+5d ()
ff000fd0a520 zfs:zio_wait_children_ready+20 ()
ff000fd0a540 zfs:zio_next_stage_async+bb ()
ff000fd0a560 zfs:zio_nowait+11 ()
ff000fd0a870 zfs:dbuf_sync_leaf+1ac ()
ff000fd0a8b0 zfs:dbuf_sync_list+51 ()
ff000fd0a900 zfs:dbuf_sync_indirect+cd ()
ff000fd0a940 zfs:dbuf_sync_list+5e ()
ff000fd0a9b0 zfs:dnode_sync+23b ()
ff000fd0a9f0 zfs:dmu_objset_sync_dnodes+55 ()
ff000fd0aa70 zfs:dmu_objset_sync+13d ()
ff000fd0aac0 zfs:dsl_dataset_sync+5d ()
ff000fd0ab30 zfs:dsl_pool_sync+b5 ()
ff000fd0abd0 zfs:spa_sync+208 ()
ff000fd0ac60 zfs:txg_sync_thread+19a ()
ff000fd0ac70 unix:thread_start+8 ()

syncing file systems... 1 1 done
ereport.io.pciex.rc.fe-msg ena=5344b8176c00c01 detector=[ version=0 scheme=
 

Re: [zfs-discuss] ATA UDMA data parity error

2008-01-17 Thread Kent Watsen

On a lark, I decided to create a new pool not including any devices 
connected to card #3 (i.e. c5)

It crashes again, but this time with a slightly different dump (see below)
  - actually, there are two dumps below, the first is using the xVM 
kernel and the second is not

Any ideas?

Kent



[NOTE: this one using xVM kernel - see below for dump without xVM kernel]

# zpool destroy tank
# zpool status
no pools available
# zpool create tank raidz2 c3t0d0 c3t4d0 c4t0d0 c4t4d0 raidz2 c3t1d0 
c3t5d0 c4t1d0 c4t5d0
# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
tankONLINE   0 0 0
  raidz2ONLINE   0 0 0
c3t0d0  ONLINE   0 0 0
c3t4d0  ONLINE   0 0 0
c4t0d0  ONLINE   0 0 0
c4t4d0  ONLINE   0 0 0
  raidz2ONLINE   0 0 0
c3t1d0  ONLINE   0 0 0
c3t5d0  ONLINE   0 0 0
c4t1d0  ONLINE   0 0 0
c4t5d0  ONLINE   0 0 0

errors: No known data errors
# ls /tank
# cp -r /usr /tank/usr
Jan 17 08:48:53 san sata: NOTICE: /[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]:
Jan 17 08:48:53 san  port 5: device reset
Jan 17 08:48:53 san sata: NOTICE: /[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]:
Jan 17 08:48:53 san  port 5: link lost
Jan 17 08:48:53 san sata: NOTICE: /[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]:
Jan 17 08:48:53 san  port 5: link established
Jan 17 08:48:55 san marvell88sx: WARNING: marvell88sx1: port 4: DMA completed after timed out
Jan 17 08:48:55 san last message repeated 14 times
Jan 17 08:48:55 san sata: NOTICE: /[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]:
Jan 17 08:48:55 san  port 4: device reset
Jan 17 08:48:55 san sata: NOTICE: /[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]:
Jan 17 08:48:55 san  port 4: link lost
Jan 17 08:48:55 san sata: NOTICE: /[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]:
Jan 17 08:48:55 san  port 4: link established
Jan 17 08:48:55 san scsi: WARNING: /[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd15):
Jan 17 08:48:55 san Error for Command: write    Error Level: Retryable
Jan 17 08:48:55 san scsi:   Requested Block: 11893    Error Block: 11893
Jan 17 08:48:55 san scsi:   Vendor: ATA    Serial Number:
Jan 17 08:48:55 san scsi:   Sense Key: No_Additional_Sense
Jan 17 08:48:55 san scsi:   ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Jan 17 08:48:55 san scsi: WARNING: /[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd15):
Jan 17 08:48:55 san Error for Command: write    Error Level: Retryable
Jan 17 08:48:55 san scsi:   Requested Block: 11983    Error Block: 11983
Jan 17 08:48:55 san scsi:   Vendor: ATA    Serial Number:
Jan 17 08:48:55 san scsi:   Sense Key: No_Additional_Sense
Jan 17 08:48:55 san scsi:   ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Jan 17 08:48:55 san scsi: WARNING: /[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd15):
Jan 17 08:48:55 san Error for Command: write    Error Level: Retryable
Jan 17 08:48:55 san scsi:   Requested Block: 12988    Error Block: 12988
Jan 17 08:48:55 san scsi:   Vendor: ATA    Serial Number:
Jan 17 08:48:55 san scsi:   Sense Key: No_Additional_Sense
Jan 17 08:48:55 san scsi:   ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Jan 17 08:48:55 san scsi: WARNING: /[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd15):
Jan 17 08:48:55 WARNING: marvell88sx1: error on port 4:
ATA UDMA data parity error
WARNING: marvell88sx1: error on port 4:
ATA UDMA data parity error
WARNING: marvell88sx1: error on port 4:
ATA UDMA data parity error
WARNING: marvell88sx1: error on port 4:
ATA UDMA data parity error
WARNING: marvell88sx1: error on port 4:
ATA UDMA data parity error
WARNING: marvell88sx1: error on port 4:
ATA UDMA data parity error
WARNING: marvell88sx1: error on port 4:
ATA UDMA data parity error
WARNING: marvell88sx1: error on port 4:

Re: [zfs-discuss] ATA UDMA data parity error

2008-01-17 Thread Kent Watsen



Below I create zpools isolating one card at a time
 - when just card#1 - it works
 - when just card #2 - it fails
 - when just card #3 - it works

And then again using the two cards that seem to work:
 - when cards #1 and #3 - it fails

So, at first I thought I narrowed it down to a card, but my last test 
shows that it still fails when the zpool uses two cards that succeed 
individually...


The only thing I can think to point out here is that those two cards are 
on different buses - one connected to a NEC uPD720400 and the other 
connected to an AIC-7902, which itself is then connected to the NEC uPD720400


Any ideas?

Thanks,
Kent





OK, doing it again using just card #1 (i.e. c3) works!

   # zpool destroy tank
   # zpool create tank raidz2 c3t0d0 c3t4d0 c3t1d0 c3t5d0
   # cp -r /usr /tank/usr
   cp: cycle detected: /usr/ccs/lib/link_audit/32
   cp: cannot access /usr/lib/amd64/libdbus-1.so.2


Doing it again using just card #2 (i.e. c4) still fails:

   # zpool destroy tank
   # zpool create tank raidz2 c4t0d0 c4t4d0 c4t1d0 c4t5d0   
   # cp -r /usr /tank/usr

   cp: cycle detected: /usr/ccs/lib/link_audit/32
   cp: cannot access /usr/lib/amd64/libdbus-1.so.2
   WARNING: marvell88sx1: error on port 1:
   ATA UDMA data parity error
   WARNING: marvell88sx1: error on port 1:
   ATA UDMA data parity error
   WARNING: marvell88sx1: error on port 1:
   ATA UDMA data parity error
   WARNING: marvell88sx1: error on port 1:
   ATA UDMA data parity error
   WARNING: marvell88sx1: error on port 1:
   ATA UDMA data parity error
   WARNING: marvell88sx1: error on port 1:
   ATA UDMA data parity error

   SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
   EVENT-TIME: 0x478f6148.0x376ebd4b (0xbf8f86652d)
   PLATFORM: i86pc, CSN: -, HOSTNAME: san
   SOURCE: SunOS, REV: 5.11 snv_78
   DESC: Errors have been detected that require a reboot to ensure system
   integrity.  See http://www.sun.com/msg/SUNOS-8000-0G for more
   information.
   AUTO-RESPONSE: Solaris will attempt to save and diagnose the error
   telemetry
   IMPACT: The system will sync files, save a crash dump if needed, and
   reboot
   REC-ACTION: Save the error summary below in case telemetry cannot be
   saved


   panic[cpu3]/thread=ff000f7bcc80: pcie_pci-0: PCI(-X) Express
   Fatal Error

   ff000f7bcbc0 pcie_pci:pepb_err_msi_intr+d2 ()
   ff000f7bcc20 unix:av_dispatch_autovect+78 ()
   ff000f7bcc60 unix:dispatch_hardint+2f ()
   ff000f786ac0 unix:switch_sp_and_call+13 ()
   ff000f786b10 unix:do_interrupt+a0 ()
   ff000f786b20 unix:cmnint+ba ()
   ff000f786c10 unix:mach_cpu_idle+b ()
   ff000f786c40 unix:cpu_idle+c8 ()
   ff000f786c60 unix:idle+10e ()
   ff000f786c70 unix:thread_start+8 ()

   syncing file systems... done
   ereport.io.pciex.rc.fe-msg ena=bf8f828ea700c01 detector=[ version=0 scheme=dev device-path=/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED] ] rc-status=87c source-id=200 source-valid=1

   ereport.io.pciex.rc.mue-msg ena=bf8f828ea700c01 detector=[ version=0 scheme=dev device-path=/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED] ] rc-status=87c

   ereport.io.pci.sec-rserr ena=bf8f828ea700c01 detector=[ version=0 scheme=dev device-path=/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED] ] pci-sec-status=6000 pci-bdg-ctrl=3

   ereport.io.pci.sec-ma ena=bf8f828ea700c01 detector=[ version=0 scheme=dev device-path=/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED] ] pci-sec-status=6000 pci-bdg-ctrl=3

   ereport.io.pciex.bdg.sec-perr ena=bf8f828ea700c01 detector=[ version=0 scheme=dev device-path=/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED] ] sue-status=1800 source-id=200 source-valid=1

   ereport.io.pciex.bdg.sec-serr ena=bf8f828ea700c01 detector=[ version=0 scheme=dev device-path=/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED] ] sue-status=1800

   ereport.io.pci.sec-rserr ena=bf8f828ea700c01 detector=[ version=0 scheme=dev device-path=/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED] ] pci-sec-status=6420 pci-bdg-ctrl=7

   dumping to /dev/dsk/c2t0d0s1, offset 215547904, content: kernel
   NOTICE: /[EMAIL PROTECTED],0/pci15d9,[EMAIL PROTECTED]:
port 0: device reset

   100% done:


And doing it again using just card #3 (i.e. c5) works!

   # zpool destroy tank
   cannot open 'tank': no such pool
 (interesting)
   # zpool create tank raidz2 c5t0d0 c5t4d0 c5t1d0 c5t5d0   
   # cp -r /usr /tank/usr





And doing it again using cards #1 and #3 (i.e. c3 and c5) fails!

   # zpool destroy tank
   # zpool create tank raidz2 c3t0d0 c3t4d0 c3t1d0 c3t5d0 raidz2 c5t0d0
   c5t4d0 c5t1d0 c5t5d0
   # cp -r /usr /tank/usr
   cp: cycle detected: /usr/ccs/lib/link_audit/32
   cp: cannot access /usr/lib/amd64/libdbus-1.so.2
   

Re: [zfs-discuss] how to create whole disk links?

2007-12-28 Thread Kent Watsen

Eric Schrock wrote:
 Or just let ZFS work its magic ;-)
   


Oh, I didn't realize that `zpool create` could be fed vdevs that didn't 
exist in /dev/dsk/ - and, as a bonus, it also creates the /dev/dsk/ links!

# zpool create -f tank raidz2 c3t0d0 c3t4d0 c4t0d0 c4t4d0 c5t0d0 c5t4d0z
# ls -l /dev/dsk/ | grep :wd$
lrwxrwxrwx   1 root root  78 Dec 27 17:32 c3t0d0 -> ../../devices/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED],1/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:wd
lrwxrwxrwx   1 root root  78 Dec 27 17:32 c3t1d0 -> ../../devices/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED],1/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:wd
lrwxrwxrwx   1 root root  78 Dec 27 17:32 c3t2d0 -> ../../devices/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED],1/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:wd
lrwxrwxrwx   1 root root  78 Dec 27 17:32 c3t3d0 -> ../../devices/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED],1/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:wd
lrwxrwxrwx   1 root root  78 Dec 27 17:32 c3t4d0 -> ../../devices/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED],1/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:wd
lrwxrwxrwx   1 root root  78 Dec 27 17:32 c3t5d0 -> ../../devices/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED],1/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:wd
lrwxrwxrwx   1 root root  76 Dec 28 12:45 c4t0d0 -> ../../devices/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:wd
lrwxrwxrwx   1 root root  76 Dec 27 22:38 c4t1d0 -> ../../devices/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:wd
lrwxrwxrwx   1 root root  76 Dec 27 22:38 c4t4d0 -> ../../devices/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:wd
lrwxrwxrwx   1 root root  76 Dec 28 12:45 c5t0d0 -> ../../devices/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:wd
lrwxrwxrwx   1 root root  76 Dec 28 12:45 c5t4d0 -> ../../devices/[EMAIL PROTECTED],0/pci10de,[EMAIL PROTECTED]/pci1033,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:wd



Thanks for the pointer!

Kent





[zfs-discuss] aoc-sat2-mv8 (was: LSI SAS3081E = unstable drive numbers?)

2007-12-18 Thread Kent Watsen


Kent Watsen wrote:
 So, I picked up an AOC-SAT2-MV8 off eBay for not too much and then I got 
 a 4xSATA to one SFF-8087 cable to connect it to one one my six 
 backplanes.  But, as fortune would have it, the cable I bought has SATA 
 connectors that are physically too big to plug into the AOC-SAT2-MV8 - 
 since the AOC-SAT2-MV8 stacks two SATA connectors on top of each other...
   


As a temporary solution, I hooked up the reverse breakout cable using 
ports 1, 3, 5, and 7 on the aoc-sat2-mv8 - the cables fit this way 
because it uses only one port from each stacked pair.  Anyway, the good 
news is that the drives showed up in Solaris right away and their IDs are 
stable between hot-swaps and reboots.  So I'll be keeping the aoc-sat2-mv8 
(anybody want a SAS3081E?)

I've already ordered more cables for the aoc-sat2-mv8 and will report 
which ones work when I get them.


Thanks,
Kent



Re: [zfs-discuss] LSI SAS3081E = unstable drive numbers?

2007-12-16 Thread Kent Watsen



Paul Jochum wrote:
What the lsiutil does for me is clear the persistent mapping for 
 all of the drives on a card.  
Since James confirms that I'm doomed to ad hoc methods and tracking 
device-ids to bays, I'm interested in knowing if your ability to 
clear the persistent mapping for *all* of the drives on the card somehow 
gets you to a stable state?  Seems to me that, if the tool is run from 
the BIOS, like the LSI Configuration Utility, then it kind of defeats the 
uptime that I'm trying to achieve by having a RAID system in the first 
place...  Is it, by chance, a userland program?

 I don't know of a way to disable the mapping completely (but that does 
 sound like a nice option).  Since SUN is reselling this card now (that 
 is how I got my cards), I wonder if they can put in a request to LSI 
 to provide this enhancement?
Yes, please! - can someone from SUN please request LSI to add a feature 
to selectively disable their card's persistent drive mapping feature?


Thanks,
Kent








Re: [zfs-discuss] LSI SAS3081E = unstable drive numbers?

2007-12-15 Thread Kent Watsen
Kent Watsen wrote:
 Given that manually tracking shifting ids doesn't sound appealing to 
 me,  would using a SATA controller like the AOC-SAT2-MV8 resolve the 
 issue?  Given that I currently only have one LSI HBA - I'd need to get 2 
 more for all 24 drives ---or--- I could get 3 of these SATA controllers 
 plus 6 discrete-to-8087 reverse breakout cables.  Going down the 
 LSI-route would cost about $600 while going down the AOC-SAT2-MV8 would 
 cost about $400.  I understand that the SATA controllers are less 
 performant, but I'd gladly exchange some performance that I'm likely to 
 never need to simplify my administrative overhead...
   
So, I picked up an AOC-SAT2-MV8 off eBay for not too much and then I got 
a 4xSATA to one SFF-8087 cable to connect it to one of my six 
backplanes.  But, as fortune would have it, the cable I bought has SATA 
connectors that are physically too big to plug into the AOC-SAT2-MV8 - 
since the AOC-SAT2-MV8 stacks two SATA connectors on top of each other...

Any chance anyone knows of a 4xSATA to SFF-8087 reverse breakout cable 
that will plug into the AOC-SAT2-MV8?  Alternatively, anyone know how 
hard it would be to splice the SATA cables that came with the 
AOC-SAT2-MV8 into the cable I bought?

Also, the cable I got actually had 5 cables going into the SFF-8087 - 4 
SATA and another that is rectangular and has holes for either 7 or 8 
pins (not sure if it's for 7 or 8 pins as one of the holes appears a bit 
different).  Any idea what this fifth cable is for?

Thanks,
Kent




Re: [zfs-discuss] LSI SAS3081E = unstable drive numbers?

2007-12-15 Thread Kent Watsen
Eric Schrock wrote:
 For x86 systems, you can use ipmitool to manipulate the led state
 (ipmitool sunoem led ...).   On older galaxy systems, you can only set the
 fail LED ('io.hdd0.led'), as the ok2rm LED is not physically connected
 to anything.  On newer systems, you can set both the 'fail' and 'okrm'
 LEDs.  You cannot change the activity LED except by manually sending the
 'set sensor reading' IPMI command (not available via ipmitool).

 For external enclosures, you'll need a SES control program.

 Both of these problems are being worked on under the FMA sensor
 framework to create a unified view through libtopo.  Until that's
 complete, you'll be stuck using ad hoc methods.
   
Hi Eric,

I've looked at your blog and have tried your suggestions, but I need a 
little more advice.

I am on an x86 system running SXCE snv_74 - the system has 6 SAS 
backplanes but, according to the integrator, no real SCSI enclosure 
services.  According to the man page, I found that I could use `sdr 
elist generic` to list LEDs, but that command doesn't return any output:

# ipmitool sdr elist generic
#   /* there was no output */
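
Based on the blog, what I was expecting to be able to do next was
something like the following - I'm guessing at the exact syntax, and the
io.hdd0.led sensor may not even exist on my box:

# ipmitool sunoem led get io.hdd0.led
# ipmitool sunoem led set io.hdd0.led fast

...but with no sensor ids in the sdr listing, I have nothing to point
those at.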

Is it not returning sensor ids because I don't have real scsi enclosure 
services?  Is there anything I can do or am I doomed to ad hoc methods 
forever?

Thanks,
Kent




[zfs-discuss] LSI SAS3081E = unstable drive numbers?

2007-12-12 Thread Kent Watsen

Based on recommendations from this list, I asked the company that built 
my box to use an LSI SAS3081E controller.

The first problem I noticed was that the drive-numbers were ordered 
incorrectly.  That is, given that my system has 24 bays (6 rows, 4 
bays/row), the drive numbers from top-to-bottom and left-to-right were 6, 
1, 0, 2, 4, 5 - even though when the system boots, each drive is scanned 
in perfect order (I can tell by watching the LEDs blink).

I contacted LSI tech support and they explained:

start response
SAS treats device IDs differently than SCSI.  LSI SAS controllers
remember devices in the order they were discovered by the controller.
This memory is persistent across power cycles.  It is based on the world
wide name (WWN) given uniquely to every SAS device.  This allows your
boot device to remain your boot device no matter where it migrates in
the SAS topology.

In order to clear the memory of existing devices you need at least one
device that will not be present in your final configuration.  Re-boot
the machine and enter the LSI configuration utility (CTRL-C).  Then find
your way to SAS Topology.  To see more options, press CTRL-M.  Choose
the option to clear all non-present device IDs.  This clears the
persistent memory of all devices not present at that time.  Exchange the
drives.  The system will now remember the order it finds the drives
after the next boot cycle. 
end response 


Sure enough, I was able to physically reorder my drives so they were 0, 1, 
2, 4, 5, 6 - so, apparently, the company that put my system together 
moved the drives around after they were initially scanned.  But where is 
3?  (answer below).  Then I tried another test:

  1. make first disk blink

# run dd if=/dev/dsk/c2t0d0p0 of=/dev/null count=10
10+0 records in
10+0 records out

  2. pull disk '0' out and replace it with a brand new disk

# run dd if=/dev/dsk/c2t0d0p0 of=/dev/null count=10
dd: /dev/dsk/c2t0d0p0: open: No such file or directory

  3. scratch head and try again with '3'  (I had previously cleared the 
LSI's controllers memory)

# run dd if=/dev/dsk/c2t3d0p0 of=/dev/null count=10
10+0 records in
10+0 records out

So, it seems my SAS controller is being too smart for its own good - it 
tracks the drives themselves, not the drive-bays.  If I hot-swap a brand 
new drive into a bay, Solaris will see it as a new disk, not a 
replacement for the old disk.  How can ZFS support this?  I asked the 
LSI tech support again and got:

start quote
I don't have the knowledge to answer that, so I'll just say
this:  most vendors, including Sun, set up the SAS HBA to use
enclosure/slot naming, which means that if a drive is
swapped, it does NOT get a new name (after all, the enclosure
and slot did not change).
end quote


So, now I turn to you...  Here is some information about my system:


Specs:

   Motherboard: SuperMicro H8DME-2 Rev 2.01
   - BIOS: AMI v2.58
   HBA: LSI SAS3081E (SN: P068170707) installed in Slot #5
   - LSI Configuration Utility v6.16.00.00 (2007-05-07)
   Backplane: CI-Design 12-6412-01BR
   HBA connected to BP via two SFF-8087-SFF-8087 cables
   OS: SXCE b74


Details:

   * Chassis has 24 SAS/SATA bays
   * There are 6 backplanes - one for each *row* of drives
   * I currently have only 6 drives installed (see pic)
   * The LSI card is plugged into backplanes 1 and 2
   * The LSI card is NOT configured to do any RAID - it's only
  JBOD as I'm using Solaris's ZFS (software-RAID)

Question:

   * I only plan to use SATA drives, would using a SATA controller like 
Supermicro's
  AOC-SAT2-MV8 help?



Thanks again,
Kent


Re: [zfs-discuss] LSI SAS3081E = unstable drive numbers?

2007-12-12 Thread Kent Watsen

Wow, how fortunate for me that you are on this list!

I guess I do have a follow-up question...   If each new drive gets a new 
id when plugged into the system - and I learn to discover that drive's 
id using dmesg or iostat and use `zpool replace` correctly - when a 
drive fails, what will it take for me to physically find it?  I'm hoping 
there is a command, like dd, that I can use to make that drive's LED 
blink, but I don't know if I can trust that `dd` will work at all when 
the drive is failed!  Since I don't have enclosure services, does that 
mean my best option is to manually track id-to-bay mappings? (envision a 
clip-board hanging on the rack)
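
The best idea I've come up with so far is to hammer the suspect disk 
with reads and watch for the solid activity LED - something like:

# while true; do dd if=/dev/dsk/c2t3d0p0 of=/dev/null bs=1024k count=1024; done

...but that obviously assumes the drive is still healthy enough to 
answer reads, which is exactly the case where I least need it.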

Given that manually tracking shifting ids doesn't sound appealing to 
me,  would using a SATA controller like the AOC-SAT2-MV8 resolve the 
issue?  Given that I currently only have one LSI HBA - I'd need to get 2 
more for all 24 drives ---or--- I could get 3 of these SATA controllers 
plus 6 discrete-to-8087 reverse breakout cables.  Going down the 
LSI-route would cost about $600 while going down the AOC-SAT2-MV8 would 
cost about $400.  I understand that the SATA controllers are less 
performant, but I'd gladly exchange some performance that I'm likely to 
never need to simplify my administrative overhead...

Thanks,
Kent




James C. McPherson wrote:

 Hi Kent,
 I'm one of the team that works on Solaris' mpt driver, which we
 recently enhanced to deliver mpxio support with SAS. I have a bit
 of knowledge about your issue :-)

 Kent Watsen wrote:
 Based on recommendations from this list, I asked the company that 
 built my box to use an LSI SAS3081E controller.

 The first problem I noticed was that the drive-numbers were ordered 
 incorrectly.  That is, given that my system has 24 bays (6 rows, 4 
 bays/row), the drive numbers from top-to-bottom  left-to-right were 
 6, 1, 0, 2, 4, 5 - even though when the system boots, each drive is 
 scanned in perfect order (I can tell by watching the LEDs blink).

 I contacted LSI tech support and they explained:

 start response
 SAS treats device IDs differently than SCSI.  LSI SAS controllers
 remember devices in the order they were discovered by the controller.
 This memory is persistent across power cycles.  It is based on the world
 wide name (WWN) given uniquely to every SAS device.  This allows your
 boot device to remain your boot device no matter where it migrates in
 the SAS topology.

 In order to clear the memory of existing devices you need at least one
 device that will not be present in your final configuration.  Re-boot
 the machine and enter the LSI configuration utility (CTRL-C).  Then find
 your way to SAS Topology.  To see more options, press CTRL-M.  Choose
 the option to clear all non-present device IDs.  This clears the
 persistent memory of all devices not present at that time.  Exchange the
 drives.  The system will now remember the order it finds the drives
 after the next boot cycle. end response 


 Firstly, yes, the LSI SAS hbas do use persistent mapping,
 with a logical target id by default. This is where the
 hba does the translation between the physical disk device's
 SAS address (which you'll see in prtconf -v as the devid),
 and an essentially arbitrary target number which gets passed
 up to the OS - in this case Solaris.

 The support person @ LSI was correct about deleting all those
 mappings.

 Yes, the controller is being smart and tracking the actual
 device rather than a particular bay/slot mapping. This isn't
 so bad, mostly. The effect for you is that you can't assume
 that the replaced device is going to have the same target
 number as the old one (in fact, I'd call that quite unlikely)
 so you'll have to see what the new device name is by checking
 your dmesg or iostat -En output.




 Sure enough, I was able to physical reorder my drives so they were 0, 
 1, 2, 4, 5, 6 - so, appearantly, the company that put my system 
 together moved the drives around after they were initially scanned.  
 But where is 3?  (answer below).  Then I tried another test:

   1. make first disk blink

 # run dd if=/dev/dsk/c2t0d0p0 of=/dev/null count=10
 10+0 records in
 10+0 records out

   2. pull disk '0' out and replace it with a brand new disk

 # run dd if=/dev/dsk/c2t0d0p0 of=/dev/null count=10
 dd: /dev/dsk/c2t0d0p0: open: No such file or directory

   3. scratch head and try again with '3'  (I had previously cleared 
 the LSI's controllers memory)

 # run dd if=/dev/dsk/c2t3d0p0 of=/dev/null count=10
 10+0 records in
 10+0 records out

 So, it seems my SAS controller is being too smart for its own good - 
 it tracks the drives themselves, not the drive-bays.  If I hot-swap a 
 brand new drive into a bay, Solaris will see it as a new disk, not a 
 replacement for the old disk.  How can ZFS support this?  I asked the 
 LSI tech support again and got:

 start quote
 I don't have the knowledge to answer that, so I'll just

Re: [zfs-discuss] LSI SAS3081E = unstable drive numbers?

2007-12-12 Thread Kent Watsen

Hi Paul,

Already in my LSI Configuration Utility I have an option to clear the 
persistent mapping for drives not present, but then the card resumes its 
normal persistent-mapping logic.  What I really want is to disable the 
persistent mapping logic completely - is `lsiutil` doing that for you?

Thanks,
Kent



Paul Jochum wrote:
 Hi Kent:

 I have run into the same problem before, and have worked with LSI and SUN 
 support to fix it.  LSI calls this persistent drive mapping, and here is 
 how to clear it:

 1) obtain the latest version of the program lsiutil from LSI.  They don't 
 seem to have the Solaris versions on their website, but I got it by email 
 when entering a ticket into their support system.  I know that they have a 
 version for Solaris x86 (and I believe a Sparc version also).  The version I 
 currently have is: LSI Logic MPT Configuration Utility, Version 1.52, 
 September 7, 2007

 2) Execute the lsiutil program on your target box.
a) first it will ask you to select which card to use (I have multiple 
 cards in my machine, don't know if it will ask if you only have 1 card in 
 your box)
 b) then you need to select option 15 (it is a hidden option, not shown on 
 the menu)
 c) then you select option 10 (Clear all persistant mappings)
 d) then option 0 multiple times to get out of the program
 e) I normally then reboot the box, and the next time it comes up, the 
 drives are back in order.
 e) or (instead of rebooting) option 99, to reset the chip (causes new 
 mappings to be established), then option 8 (to verify lower target IDs), then 
 devfsadm.  After devfsadm completes, lsiutil option 42 should display valid 
 device names (in /dev/rdsk), and format should find the devices so that you 
 can label them.

 Hope this helps.  I happened to need it last night again (I normally have to 
 run it after re-imaging a box, assuming that I don't want to save the data 
 that was on those drives).

 Paul Jochum
  
  



Re: [zfs-discuss] ZFS Solaris 10u5 Proposed Changes

2007-09-18 Thread Kent Watsen

How does one access the PSARC database to look up the descriptions of 
these features?

Sorry if this has been asked before! - I tried Google before posting 
this  :-[

Kent


George Wilson wrote:
 ZFS Fans,

 Here's a list of features that we are proposing for Solaris 10u5. Keep 
 in mind that this is subject to change.

 Features:
 PSARC 2007/142 zfs rename -r
 PSARC 2007/171 ZFS Separate Intent Log
 PSARC 2007/197 ZFS hotplug
 PSARC 2007/199 zfs {create,clone,rename} -p
 PSARC 2007/283 FMA for ZFS Phase 2
 PSARC/2006/465 ZFS Delegated Administration
 PSARC/2006/577 zpool property to disable delegation
 PSARC/2006/625 Enhancements to zpool history
 PSARC/2007/121 zfs set copies
 PSARC/2007/228 ZFS delegation amendments
 PSARC/2007/295 ZFS Delegated Administration Addendum
 PSARC/2007/328 zfs upgrade

 Stay tuned for a finalized list of RFEs and fixes.

 Thanks,
 George



Re: [zfs-discuss] hardware sizing for a zfs-based system?

2007-09-17 Thread Kent Watsen






  Probably not, my box has 10 drives and two very thirsty FX74 processors
and it draws 450W max.

At 1500W, I'd be more concerned about power bills and cooling than the UPS!
  


Yeah - good point, but I need my TV! - or so I tell my wife so I can
play with all this gear  :-X 

Cheers,
Kent





Re: [zfs-discuss] [xen-discuss] hardware sizing for a zfs-based system?

2007-09-16 Thread Kent Watsen
David Edmondson wrote:
 One option I'm still holding on to is to also use the ZFS system as a
 Xen-server - that is OpenSolaris would be running in Dom0...  Given that
 the Xen hypervisor has a pretty small cpu/memory footprint, do you think
 it could share 2-cores + 4Gb with ZFS or should I allocate 3 cores of
 Dom0 and bump the memory up 512MB?

 A dom0 with 4G and 2 cores should be plenty to run ZFS and the support 
 necessary for a reasonable (16) paravirtualised domains. If the guest 
 domains end up using HVM then the dom0 load is higher, but we haven't 
 done the work to quantify this properly yet.

A tasty insight - a million thanks!

I think if I get 2 quad-cores and 16Gb mem, I'd be able to stomach the 
overhead of 25% cpu and 25% mem going to the host - as having a dedicated 
SAN plus another totally-redundant Xen box would be more expensive

Cheers!
Kent



Re: [zfs-discuss] hardware sizing for a zfs-based system?

2007-09-16 Thread Kent Watsen

 I know what you are saying, but I , wonder if it would be noticeable?  I 

 Well, noticeable again comes back to your workflow. As you point out 
 to Richard, it's (theoretically) 2x IOPS difference, which can be very 
 significant for some people.
Yeah, but my point is if it would be noticeable to *me* (yes, I am a bit 
self-centered)

 I would say no, not even close to pushing it. Remember, we're 
 measuring performance in MBytes/s, and video throughput is measured in 
 Mbit/s (and even then, I imagine that a 27 Mbit/s stream over the air 
 is going to be pretty rare). So I'm figuring you're just scratching 
 the surface of even a minimal array.

 Put it this way: can a single, modern hard drive keep up with an 
 ADSL2+ (24 Mbit/s) connection?
 Throw 24 spindles at the problem, and I'd say you have headroom for a 
 *lot* of streams.
Sweet!  I should probably hang up this thread now, but there are too 
many other juicy bits to respond to...

 I wasn't sure, with your workload. I know with mine, I'm seeing the 
 data store as being mostly temporary. With that much data streaming in 
 and out, are you planning on archiving *everything*? Cos that's only 
 one month's worth of HD video.
Well, not to down-play the importance of my TV recordings - which is 
really a laugh because I'm not really a big TV watcher - I simply don't 
want to ever have to think about this again after getting it set up

 I'd consider tuning a portion of the array for high throughput, and 
 another for high redundancy as an archive for whatever you don't want 
 to lose. Whether that's by setting copies=2, or by having a mirrored 
 zpool (smart for an archive, because you'll be less sensitive to the 
 write performance that suffers there), it's up to you...
 ZFS gives us a *lot* of choices. (But then you knew that, and it's 
 what brought you to the list :)
All true, but if 4(4+2) serves all my needs, I think that it's simpler to 
administer, as I can arbitrarily allocate space as needed without 
needing to worry about what kind of space it is - all the space is good 
and fast space...
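
(For the record, the copies route would just be a per-dataset property 
on whatever filesystem holds the stuff I can't lose - e.g. something 
like:

# zfs set copies=2 tank/photos

...where tank/photos is only an example dataset - so I could always 
bolt that on later without re-doing the pool layout.)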

 I also committed to having at least one hot spare, which, after 
 staring at relling's graphs for days on end, seems to be the cheapest, 
 easiest way of upping the MTTDL for any array. I'd recommend it.
No doubt that a hot-spare gives you a bump in MTTDL, but double-parity 
trumps it big time - check out Richard's blog...

 As I understand it, 5(2+1) would scale to better IOPS performance than 
 4(4+2), and IOPS represents the performance baseline; as you ask the 
 array to do more and more at once, it'll look more like random seeks.

 What you get from those bigger zvol groups of 4+2 is higher 
 performance per zvol. That said, with my few datapoints on 4+1 RAID-Z 
 groups (running on 2 controllers) suggest that that configuration runs 
 into a bottleneck somewhere, and underperforms from what's expected.
Er?  Can anyone fill in the blank here?


 Oh, the bus will far exceed your needs, I think.
 The exercise is to specify something that handles what you need 
 without breaking the bank, no?
Bank, smank - I build a system every 5+ years and I want it to kick ass 
all the way until I build the next one - cheers!


 BTW, where are these HDTV streams coming from/going to? Ethernet? A 
 capture card? (and which ones will work with Solaris?)
Glad you asked, for the lists sake, I'm using two HDHomeRun tuners 
(http://www.silicondust.com/wiki/products/hdhomerun) - actually, I 
bought 3 of them because I felt like I needed a spare :-D


 Yeah, perhaps I've been a bit too circumspect about it, but I haven't 
 been all that impressed with my PCI-X bus configuration. Knowing what 
 I know now, I might've spec'd something different. Of all the 
 suggestions that've gone out on the list, I was most impressed with 
 Tim Cook's:

 Won't come cheap, but this mobo comes with 6x pci-x slots... should 
 get the job done :)

 http://www.supermicro.com/products/motherboard/Xeon1333/5000P/X7DBE-X.cfm 


 That has 3x 133MHz PCI-X slots each connected to the Southbridge via a 
 different PCIe bus, which sounds worthy of being the core of the 
 demi-Thumper you propose.
Yeah, but getting back to PCIe, I see these tasty SAS/SATA HBAs from LSI: 
http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/lsisas3081er/index.html 
(note, LSI also sells matching PCI-X HBA controllers, in case you need 
to balance your mobo's architecture)

 ...But It all depends what you intend to spend. (This is what I 
 was going to say in my next blog entry on the system:) We're talking 
 about benchmarks that are really far past what you say is your most 
 taxing work load. I say I'm disappointed with the contention on my 
 bus putting limits on maximum throughputs, but really, what I have far 
 outstrips my ability to get data into or out of the system.
So moving to the PCIe-based cards should fix that - no?

 So all of my disappointment is in theory.
Seems like this 

Re: [zfs-discuss] hardware sizing for a zfs-based system?

2007-09-15 Thread Kent Watsen

Hey Adam,

 My first posting contained my use-cases, but I'd say that video 
 recording/serving will dominate the disk utilization - thats why I'm 
 pushing for 4 striped sets of RAIDZ2 - I think that it would be all 
 around goodness

 It sounds good, that way, but (in theory), you'll see random I/O 
 suffer a bit when using RAID-Z2: the extra parity will drag 
 performance down a bit.
I know what you are saying, but I wonder if it would be noticeable?  I 
think my worst case scenario would be 3 myth frontends watching 1080p 
content while 4 tuners are recording 1080p content - with each 1080p 
stream being 27Mb/s, that would be 108Mb/s writes and 81Mb/s reads (all 
sequential I/O) - does that sound like it would even come close to 
pushing a 4(4+2) array?
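
(Writing the numbers out: 4 tuners x 27Mb/s = 108Mb/s of writes, plus 
3 frontends x 27Mb/s = 81Mb/s of reads, is about 189Mb/s total, or 
roughly 24MB/s of mostly-sequential I/O.)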



 The RAS guys will flinch at this, but have you considered 8*(2+1) 
 RAID-Z1?
That configuration showed up in the output of the program I posted back 
in July 
(http://mail.opensolaris.org/pipermail/zfs-discuss/2007-July/041778.html):

24 bays w/ 500 GB drives having MTBF=5 years
  - can have 8 (2+1) w/ 0 spares providing 8000 GB with MTTDL of
95.05 years
  - can have 6 (2+2) w/ 0 spares providing 6000 GB with MTTDL of
28911.68 years
  - can have 4 (4+1) w/ 4 spares providing 8000 GB with MTTDL of
684.38 years
  - can have 4 (4+2) w/ 0 spares providing 8000 GB with MTTDL of
8673.50 years
  - can have 2 (8+1) w/ 6 spares providing 8000 GB with MTTDL of
380.21 years
  - can have 2 (8+2) w/ 4 spares providing 8000 GB with MTTDL of
416328.12 years

But it is 91 times more likely to fail and this system will contain data 
that  I don't want to risk losing



 I don't want to over-pimp my links, but I do think my blogged 
 experiences with my server (also linked in another thread) might give 
 you something to think about:
  http://lindsay.at/blog/archive/tag/zfs-performance/
I see that you also set up a video server (myth?).  From your blog, I 
think you are doing 5(2+1) (plus a hot-spare?) - this is what my 
program says about a 16-bay system:

16 bays w/ 500 GB drives having MTBF=5 years
  - can have 5 (2+1) w/ 1 spares providing 5000 GB with MTTDL of
1825.00 years
  - can have 4 (2+2) w/ 0 spares providing 4000 GB with MTTDL of
43367.51 years
  - can have 3 (4+1) w/ 1 spares providing 6000 GB with MTTDL of
912.50 years
  - can have 2 (4+2) w/ 4 spares providing 4000 GB with MTTDL of
2497968.75 years
  - can have 1 (8+1) w/ 7 spares providing 4000 GB with MTTDL of
760.42 years
  - can have 1 (8+2) w/ 6 spares providing 4000 GB with MTTDL of
832656.25 years

Note that your MTTDL isn't quite as bad as 8(2+1)'s since you have three 
fewer stripes.  Also, it's interesting for me to note that you have 5 
stripes and my 4(4+2) setup would have just one less - so the question to 
answer is whether your extra stripe is better than my 2 extra disks in 
each raid-set?



 Testing 16 disks locally, however, I do run into noticeable I/O 
 bottlenecks, and I believe it's down to the top limits of the PCI-X bus.
Yes, too bad Supermicro doesn't make a PCIe-based version...   But 
still, the limit of a 64-bit, 133.3MHz PCI-X bus is 1067 MB/s whereas a 
64-bit, 100MHz, PCI-X bus is 800MB/s - either way, its much faster than 
my worst case scenario from above where 7 1080p streams would be 189Mb/s...
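
(Spelling that out: 189Mb/s is roughly 24MB/s, i.e. only about 2-3% of 
the 800-1067MB/s the PCI-X bus can theoretically move.)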



  As far as a mobo with good PCI-X architecture - check out
 the latest from Tyan 
 (http://tyan.com/product_board_detail.aspx?pid=523) - it has three 
 133/100MHz PCI-X slots

 I use a Tyan in my server, and have looked at a lot of variations, but 
 I hadn't noticed that one. It has some potential.

 Still, though, take a look at the block diagram on the datasheet: that 
 actually looks like 1x PCI-X 133MHz slot and a bridge sharing 2x 
 100MHz slots. My benchmarks so far show that putting a controller on a 
 100MHz slot is measurably slower than 133MHz, but contention over a 
 single bridge can be even worse.
Hmmm, I hadn't thought about that...  Here is another new mobo from Tyan 
(http://tyan.com/product_board_detail.aspx?pid=517) - its datasheet 
shows the PCI-X buses configured the same way as your S3892:



Thanks!
Kent



Re: [zfs-discuss] hardware sizing for a zfs-based system?

2007-09-15 Thread Kent Watsen

 Nit: small, random read I/O may suffer.  Large random read or any random
 write workloads should be ok.
Given that video-serving is all sequential reads, is it correct that 
raidz2, specifically 4(4+2), would be just fine?

 For 24 data disks there are enough combinations that it is not easy to
 pick from.  The attached RAIDoptimizer output may help you decide on
 the trade-offs.  
Wow! - thanks for running it with 24 disks!

 For description of the theory behind it, see my blog
 http://blogs.sun.com/relling
I used your theory to write my own program (posted in July), but yours 
is way more complete

 I recommend loading it into StarOffice 
Nice little plug  ;-)

 and using graphs or sorts to
 reorder the data, based on your priorities.  
Interesting, my 4(4+2) has 282 iops, whereas 8(2+1) has 565 iops - 
exactly double, which is kind of expected given that it has twice as 
many stripes...  Also, it helps to see that the iops extremes are 
12(raid1) with 1694 iops and 2(10+2) with 141 iops - so 4(4+2) is not a 
great 24-disk performer, but isn't 282 iops probably overkill for my 
home network?

Yes, I (obviously :-) recommend

 http://www.sun.com/storagetek/storage_networking/hba/sas/specs.xml
Very nice - think I'll be getting 3 of these!


Thanks,
Kent






Re: [zfs-discuss] hardware sizing for a zfs-based system?

2007-09-15 Thread Kent Watsen


 Sorry, but looking again at the RMP page, I see that the chassis I 
 recommended is actually different than the one we have.  I can't find 
 this chassis only online, but here's what we bought:

 http://www.siliconmechanics.com/i10561/intel-storage-server.php?cat=625
That is such a cool looking case!

 From their picture gallery, you can't see the back, but it has space 
 for 3.5 drives in the back.  You can put hot swap trays back there 
 for your OS drives.  The guys at Silicon Mechanics are great, so you 
 could probably call them to ask who makes this chassis.  They may also 
 be able to build you a partial system, if you like.
An excellent suggestion, but after configuring the nServ K501 (because I 
want quad-core AMD) the way I want it, their price is almost exactly the 
same as my thrifty-shopper price, unlike RackMountPro which seems to add 
about 20% overhead - so I'll probably order the whole system from them, 
sans the Host Bus Adapter, as I'll use the Sun card Richard suggested



Thanks!
Kent


Re: [zfs-discuss] hardware sizing for a zfs-based system?

2007-09-15 Thread Kent Watsen

[CC-ing xen-discuss regarding question below]


 Probably a 64 bit dual core with 4GB of (ECC) RAM would be a good
 starting point.

 Agreed.
So I was completely out of the ball-park - I hope the ZFS Wiki can be 
updated to contain some sensible hardware-sizing information...

One option I'm still holding on to is to also use the ZFS system as a 
Xen server - that is, OpenSolaris would be running in Dom0...  Given that 
the Xen hypervisor has a pretty small cpu/memory footprint, do you think 
it could share 2 cores + 4Gb with ZFS, or should I allocate 3 cores to 
Dom0 and bump the memory up by 512MB?
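
(I'm also assuming I could keep ZFS from eating all of Dom0's 4Gb by 
capping the ARC in /etc/system - something like the line below, sized to 
taste, though someone should correct me if that tunable isn't the right 
knob on these builds:

set zfs:zfs_arc_max = 0x40000000

i.e. cap the ARC at 1GB.)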


Thanks,
Kent



Re: [zfs-discuss] hardware sizing for a zfs-based system?

2007-09-14 Thread Kent Watsen


 I will only comment on the chassis, as this is made by AIC (short for 
 American Industrial Computer), and I have three of these in service at 
 my work.  These chassis are quite well made, but I have experienced 
 the following two problems:

 snip
Oh my, thanks for the heads-up!  Charlie at RMP said that they were the 
most popular - so I assumed that they were solid...


 For all new systems, I've gone with this chassis instead (I just 
 noticed Rackmount Pro sells 'em also):

 http://rackmountpro.com/productpage.php?prodid=2043
But I was hoping for resiliency and easy replacement for the OS drive - 
hot-swap RAID1 seemed like a no-brainer...  This case has one internal and 
one external 3.5" drive bay.  I could use a CF reader for resiliency 
and reduce the need for replacement - assuming I spool logs to the internal 
drive so as to not burn out the CF.  Alternatively, I could put a couple of 
2.5" drives into a single 3.5" bay for RAID1 resiliency, but I'd have to 
shut down to replace a drive...  What do you recommend?


 One other thing, that you may know already.  Rackmount Pro will try to 
 sell you 3ware cards, which work great in the Linux/Windows 
 environment, but aren't supported in Open Solaris, even in JBOD mode.  
 You will need alternate SATA host adapters for this application.
Indeed, but why pay for a RAID controller when you only need SATA 
ports?  That's why I was thinking of picking up three of these bad boys 
(http://www.supermicro.com/products/accessories/addon/AoC-SAT2-MV8.cfm) 
for about $100 each

 Good luck,
Getting there - can anybody clue me in on how much CPU/memory ZFS needs?
I have an old 1.2GHz box with 1GB of memory lying around - would it be sufficient?


Thanks!
Kent








Re: [zfs-discuss] hardware sizing for a zfs-based system?

2007-09-14 Thread Kent Watsen

 Fun exercise! :)

Indeed! - though my wife and kids don't seem to appreciate it so much  ;)


 I'm thinking about using this 26-disk case:  [FYI: 2-disk RAID1 for 
 the OS  4*(4+2) RAIDZ2 for SAN]

 What are you *most* interested in for this server? Reliability? 
 Capacity? High Performance? Reading or writing? Large contiguous reads 
 or small seeks?

 One thing that I did that got a good feedback from this list was 
 picking apart the requirements of the most demanding workflow I 
 imagined for the machine I was speccing out.
My first posting contained my use-cases, but I'd say that video 
recording/serving will dominate the disk utilization - that's why I'm 
pushing for 4 striped sets of RAIDZ2 - I think it would be all-around 
goodness


 I'm learning more and more about this subject as I test the server 
 (not all that dissimilar to what you've described, except with only 18 
 disks) I now have. I'm frustrated at the relative unavailability of 
 PCIe SATA controller cards that are ZFS-friendly (i.e., JBOD), and the 
 relative unavailability of motherboards that support both the latest 
 CPUs as well as have a good PCI-X architecture.
Good point - another reply I just sent noted a PCI-X SATA controller 
card, but I'd prefer a PCIe card - do you have a recommendation on a 
PCIe card?  As far as a mobo with good PCI-X architecture goes - check out 
the latest from Tyan (http://tyan.com/product_board_detail.aspx?pid=523) 
- it has three 133/100MHz PCI-X slots

 If you come across some potential solutions, I think a lot of people 
 here will thank you for sharing...
Will keep the list posted!


Thanks,
Kent





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] hardware sizing for a zfs-based system?

2007-09-13 Thread Kent Watsen

Hi all,

I'm putting together an OpenSolaris ZFS-based system and need help 
picking hardware.

I'm thinking about using this 26-disk case:  [FYI: 2-disk RAID1 for the 
OS + 4*(4+2) RAIDZ2 for SAN]

http://rackmountpro.com/productpage.php?prodid=2418

Regarding the mobo, CPUs, and memory - I searched Google and the ZFS 
site, and all I came up with so far is that, for a dedicated iSCSI-based 
SAN, I'll need about 1 GB of memory and a low-end processor - can anyone 
clarify exactly how much memory/CPU I'd need to be in the safe zone?  
Also, are there any mobo/chipsets that are particularly well suited to 
a dedicated iSCSI-based SAN?

This is for my home network, which includes internet/intranet services 
(mail, web, ldap, samba, netatalk, code-repository), build/test 
environments (for my cross-platform projects), and a video server 
(mythtv-backend). 

Right now, the aforementioned run on two separate machines, but I'm 
planning to consolidate them into a single Xen-based server.  One idea I 
have is to host a Xen server on this same machine - that is, an 
OpenSolaris-based Dom0 serving ZFS-based volumes to the DomU guest 
machines.  But if I go this way, then I'd be looking at a 4-socket Opteron 
mobo to use with AMD's just-released quad-core CPUs and tons of memory.  
My biggest concern with this approach is getting PSUs large enough to 
power it all - if anyone has experience on this front, I'd love to hear 
about it too.

Thanks!
Kent





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] pool analysis

2007-07-11 Thread Kent Watsen

Richard's blog analyzes MTTDL as a function of N+P+S:
http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl

But to understand how to best utilize an array with a fixed number of 
drives, I add the following constraints:
  - N+P should follow ZFS best-practice rule of N={2,4,8} and P={1,2}
  - all sets in an array should be configured similarly
  - the MTTDL for S sets is equal to (MTTDL for one set)/S
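
For reference, the per-set numbers below appear to follow the standard 
single- and double-parity MTTDL model from Richard's blog; this is my 
reconstruction rather than a quote, with T = N + P drives per set and 
MTBF/MTTR in hours:

    MTTDL(P=1) = MTBF^2 / (T * (T-1) * MTTR)
    MTTDL(P=2) = MTBF^3 / (T * (T-1) * (T-2) * MTTR^2)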

I got the following results by varying the NUM_BAYS parameter in the 
source code below:

4 bays w/ 300 GB drives having MTBF=4 years
  - can have 1 (2+1) w/ 1 spares providing 600 GB with MTTDL of
5840.00 years
  - can have 1 (2+2) w/ 0 spares providing 600 GB with MTTDL of
799350.00 years
  - can have 0 (4+1) w/ 4 spares providing 0 GB with MTTDL of Inf years
  - can have 0 (4+2) w/ 4 spares providing 0 GB with MTTDL of Inf years
  - can have 0 (8+1) w/ 4 spares providing 0 GB with MTTDL of Inf years
  - can have 0 (8+2) w/ 4 spares providing 0 GB with MTTDL of Inf years

8 bays w/ 300 GB drives having MTBF=4 years
  - can have 2 (2+1) w/ 2 spares providing 1200 GB with MTTDL of
2920.00 years
  - can have 2 (2+2) w/ 0 spares providing 1200 GB with MTTDL of
399675.00 years
  - can have 1 (4+1) w/ 3 spares providing 1200 GB with MTTDL of
1752.00 years
  - can have 1 (4+2) w/ 2 spares providing 1200 GB with MTTDL of
2557920.00 years
  - can have 0 (8+1) w/ 8 spares providing 0 GB with MTTDL of Inf years
  - can have 0 (8+2) w/ 8 spares providing 0 GB with MTTDL of Inf years

12 bays w/ 300 GB drives having MTBF=4 years
  - can have 4 (2+1) w/ 0 spares providing 2400 GB with MTTDL of
365.00 years
  - can have 3 (2+2) w/ 0 spares providing 1800 GB with MTTDL of
266450.00 years
  - can have 2 (4+1) w/ 2 spares providing 2400 GB with MTTDL of
876.00 years
  - can have 2 (4+2) w/ 0 spares providing 2400 GB with MTTDL of
79935.00 years
  - can have 1 (8+1) w/ 3 spares providing 2400 GB with MTTDL of
486.67 years
  - can have 1 (8+2) w/ 2 spares providing 2400 GB with MTTDL of
426320.00 years

16 bays w/ 300 GB drives having MTBF=4 years
  - can have 5 (2+1) w/ 1 spares providing 3000 GB with MTTDL of
1168.00 years
  - can have 4 (2+2) w/ 0 spares providing 2400 GB with MTTDL of
199837.50 years
  - can have 3 (4+1) w/ 1 spares providing 3600 GB with MTTDL of
584.00 years
  - can have 2 (4+2) w/ 4 spares providing 2400 GB with MTTDL of
1278960.00 years
  - can have 1 (8+1) w/ 7 spares providing 2400 GB with MTTDL of
486.67 years
  - can have 1 (8+2) w/ 6 spares providing 2400 GB with MTTDL of
426320.00 years

20 bays w/ 300 GB drives having MTBF=4 years
  - can have 6 (2+1) w/ 2 spares providing 3600 GB with MTTDL of
973.33 years
  - can have 5 (2+2) w/ 0 spares providing 3000 GB with MTTDL of
159870.00 years
  - can have 4 (4+1) w/ 0 spares providing 4800 GB with MTTDL of
109.50 years
  - can have 3 (4+2) w/ 2 spares providing 3600 GB with MTTDL of
852640.00 years
  - can have 2 (8+1) w/ 2 spares providing 4800 GB with MTTDL of
243.33 years
  - can have 2 (8+2) w/ 0 spares providing 4800 GB with MTTDL of
13322.50 years

24 bays w/ 300 GB drives having MTBF=4 years
  - can have 8 (2+1) w/ 0 spares providing 4800 GB with MTTDL of
182.50 years
  - can have 6 (2+2) w/ 0 spares providing 3600 GB with MTTDL of
133225.00 years
  - can have 4 (4+1) w/ 4 spares providing 4800 GB with MTTDL of
438.00 years
  - can have 4 (4+2) w/ 0 spares providing 4800 GB with MTTDL of
39967.50 years
  - can have 2 (8+1) w/ 6 spares providing 4800 GB with MTTDL of
243.33 years
  - can have 2 (8+2) w/ 4 spares providing 4800 GB with MTTDL of
213160.00 years

While it's true that RAIDZ2 is much safer than RAIDZ, it seems that 
any RAIDZ configuration will outlive me, and so I conclude that RAIDZ2 
is unnecessary in a practical sense...  This conclusion surprises me 
given the amount of attention people give to double-parity solutions - 
what am I overlooking?

Thanks,
Kent



Source Code  (compile with: cc -std=c99 -lm filename) [it's more 
than 80 columns - sorry!]

#include <stdio.h>
#include <math.h>

#define NUM_BAYS 24
#define DRIVE_SIZE_GB 300
#define MTBF_YEARS 4
#define MTTR_HOURS_NO_SPARE 16
#define MTTR_HOURS_SPARE 4

int main() {

    printf("\n");
    printf("%u bays w/ %u GB drives having MTBF=%u years\n", NUM_BAYS, DRIVE_SIZE_GB, MTBF_YEARS);
    for (int num_drives=2; num_drives<=8; num_drives*=2) {
        for (int num_parity=1; num_parity<=2; num_parity++) {
            double  mttdl;

            int     mtbf_hours       = MTBF_YEARS * 365 * 24;
            int     total_num_drives = num_drives + num_parity;
            int     num_instances    = NUM_BAYS / total_num_drives;
            int     num_spares       = NUM_BAYS % total_num_drives;
            double  mttr             = num_spares==0 ? MTTR_HOURS_NO_SPARE : MTTR_HOURS_SPARE;

            /* NOTE: the archived listing was cut off at this point; the MTTDL
               math below is reconstructed from the model in Richard's blog,
               not taken from the original posting */
            if (num_parity == 1)
                mttdl = pow(mtbf_hours, 2) / (total_num_drives * (total_num_drives-1) * mttr);
            else
                mttdl = pow(mtbf_hours, 3) / (total_num_drives * (total_num_drives-1) * (total_num_drives-2) * mttr * mttr);
            mttdl = mttdl / num_instances / (365 * 24);  /* divide across S sets, then hours -> years */

            printf("  - can have %u (%u+%u) w/ %u spares providing %u GB with MTTDL of %.2f years\n",
                   num_instances, num_drives, num_parity, num_spares,
                   num_instances * num_drives * DRIVE_SIZE_GB, mttdl);
        }
    }
    return 0;
}

Re: [zfs-discuss] pool analysis

2007-07-11 Thread Kent Watsen

 But to understand how to best utilize an array with a fixed number of 
 drives, I add the following constraints:
   - N+P should follow ZFS best-practice rule of N={2,4,8} and P={1,2}
   - all sets in an array should be configured similarly
   - the MTTDL for S sets is equal to (MTTDL for one set)/S

 Yes, these are reasonable and will reduce the problem space, somewhat.
Actually, I wish I could get more insight into why N can only be 2, 4, or 
8.  In contemplating a 16-bay array, I often think that 3 (3+2) + 1 
spare would be perfect, but I have no understanding of what N=3 implies...




 While it's true that RAIDZ2 is much safer than RAIDZ, it seems that 
 any RAIDZ configuration will outlive me and so I conclude that 
 RAIDZ2 is unnecessary in a practical sense...  This conclusion 
 surprises me given the amount of attention people give to 
 double-parity solutions - what am I overlooking?

 You are overlooking statistics :-).  As I discuss in
 http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent
 the MTBF (F == death) of children aged 5-14 in the US is 4,807 years, but
 clearly no child will live anywhere close to 4,807 years.  
Thanks - I hadn't seen that blog entry yet...


 #define MTTR_HOURS_NO_SPARE 16

 I think this is optimistic :-)
Not really for me as the array is in my basement - so I assume that I'll 
swap in a drive when I get home from work  ;)


 There are many more facets of looking at these sorts of analysis, 
 which is
 why I wrote RAIDoptimizer.  
Is RAIDoptimizer the name of a spreadsheet you developed - is it 
publicly available?


Thanks,
Kent
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] 8+2 or 8+1+spare?

2007-07-09 Thread Kent Watsen

Hi all,

I'm new here and to ZFS, but I've been lurking for quite some time...  My 
question is simple: which is better, 8+2 or 8+1+spare?   Both follow the 
(N+P) N={2,4,8} P={1,2} rule, but 8+2 results in a total of 10 disks, 
which is one disk more than the 3 <= num-disks <= 9 rule.   But 8+2 has a much 
better MTTDL than 8+1+spare, and so I'm trying to understand how bad it 
would really be - what doesn't work/scale?
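
For concreteness, the two layouts I'm weighing would look roughly like 
this (device names are just placeholders):

    # 8+2: one ten-disk double-parity set
    zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
                             c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0

    # 8+1+spare: one nine-disk single-parity set plus a hot spare
    zpool create tank raidz  c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
                             c1t5d0 c1t6d0 c1t7d0 c1t8d0 spare c1t9d0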

Thanks,
Kent

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 8+2 or 8+1+spare?

2007-07-09 Thread Kent Watsen

 I think that the 3 <= num-disks <= 9 rule only applies to RAIDZ and it was
 changed to 4 <= num-disks <= 10 for RAIDZ2, but I might be remembering wrong.
   
Can anybody confirm that the 3 <= num-disks <= 9 rule only applies to RAIDZ 
and that 4 <= num-disks <= 10 applies to RAIDZ2?

Thanks,
Kent

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 8+2 or 8+1+spare?

2007-07-09 Thread Kent Watsen

 Don't confuse vdevs with pools.  If you add two 4+1 vdevs to a single pool it
 still appears to be one place to put things.  ;)
   
Newbie oversight - thanks!

Kent

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 8+2 or 8+1+spare?

2007-07-09 Thread Kent Watsen

 Another reason to recommend spares is when you have multiple top-level 
 vdevs
 and want to amortize the spare cost over multiple sets.  For example, if
 you have 19 disks then 2x 8+1 raidz + spare amortizes the cost of the 
 spare
 across two raidz sets.
  -- richard

Interesting - I hadn't realized that a spare could be used across sets
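
A rough sketch of Richard's 19-disk example, as I now understand it - two 
8+1 raidz sets sharing one hot spare (device names are placeholders):

    zpool create tank \
        raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 \
        raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c2t8d0 \
        spare c3t0d0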

Thanks!
Kent

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 8+2 or 8+1+spare?

2007-07-09 Thread Kent Watsen
Rob Logan wrote:

  which is better 8+2 or 8+1+spare?

 8+2 is safer for the same speed
 8+2 requires a little more math, so it's slower in theory (unlikely to be seen).
 (4+1)*2 is 2x faster, and in theory is less likely to have wasted space
 in a transaction group (unlikely to be seen).

I keep reading that (4+1)*2 is 2x faster, but if all the data I care 
about is in one of the two sets, does it follow that my access to just 
that data is also 2x faster?  - or is it more that simultaneous 
read/write of the entire array is (globally) 2x faster?

Thanks,
Kent

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 8+2 or 8+1+spare?

2007-07-09 Thread Kent Watsen
John-Paul Drawneek wrote:
 Your data gets striped across the two sets, so what you get is a raidz stripe 
 giving you the 2x speedup.

 tank
 ---raidz
 --devices
 ---raidz
 --devices

 sorry for the diagram.

 So you got your zpool tank with raidz stripe.
Thanks - I think you all have hammered this point home for me now - all 
this confusion stems from my not realizing that sets are merged into a 
single striped pool... ugh!  ;)

Kent


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss