Re: [zfs-discuss] ZFS no longer working with FC devices.

2010-05-24 Thread Demian Phillips
On Sun, May 23, 2010 at 12:02 PM, Torrey McMahon tmcmah...@yahoo.com wrote:
  On 5/23/2010 11:49 AM, Richard Elling wrote:

 FWIW, the A5100 went end-of-life (EOL) in 2001 and end-of-service-life
 (EOSL) in 2006. Personally, I  hate them with a passion and would like to
 extend an offer to use my tractor to bury the beast :-).

 I'm sure I can get some others to help. Can I smash the gbics? Those were my
 favorite. :-)


I'd be more than happy to take someone up on the offer, but I'd need a
good deal on a more current FC array. Since this is my home environment
I am limited by my insignificant pay and the wife factor (who does
indulge me from time to time). Without a corporate IT budget I make do
with everything from free to what I can afford used.

To be honest I'd rather be using an IBM DS4K series array.

The current stress test is creating 700 1GB files (50% of the array's
capacity) from /dev/urandom, and then I will scrub.
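
Roughly, the file-creation step looks like this, assuming the pool's
default /share mountpoint (names are illustrative):

# Write 700 x 1 GB files of random data, then scrub.
i=1
while [ $i -le 700 ]; do
        dd if=/dev/urandom of=/share/stress.$i bs=1024k count=1024
        i=`expr $i + 1`
done
zpool scrub share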

If all goes well it's back to u8 and tuning it.


Re: [zfs-discuss] ZFS no longer working with FC devices.

2010-05-24 Thread Richard Elling
On May 24, 2010, at 4:06 AM, Demian Phillips wrote:
 On Sun, May 23, 2010 at 12:02 PM, Torrey McMahon tmcmah...@yahoo.com wrote:
  On 5/23/2010 11:49 AM, Richard Elling wrote:
 
 FWIW, the A5100 went end-of-life (EOL) in 2001 and end-of-service-life
 (EOSL) in 2006. Personally, I  hate them with a passion and would like to
 extend an offer to use my tractor to bury the beast :-).
 
 I'm sure I can get some others to help. Can I smash the gbics? Those were my
 favorite. :-)
 
 
 I'd be more than happy to take someone up on the offer, but I'd need a
 good deal on a more current FC array. Since this is my home environment
 I am limited by my insignificant pay and the wife factor (who does
 indulge me from time to time). Without a corporate IT budget I make do
 with everything from free to what I can afford used.
 
 To be honest I'd rather be using an IBM DS4K series array.
 
 The current stress test is creating 700 1GB files (50% of the array's
 capacity) from /dev/urandom, and then I will scrub.

Unfortunately, /dev/urandom is too slow for direct stress testing. It can be
used as a seed for random data files that are then used for stress testing.
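
One way to do that, as a rough sketch (paths are illustrative): pay the
/dev/urandom cost once for a seed file, then fan it out with plain
copies so the array, not the PRNG, is the bottleneck.

# Generate a single 1 GB seed of random data (the slow part, done once).
dd if=/dev/urandom of=/share/seed.bin bs=1024k count=1024

# Replicate the seed to build the rest of the test set; cp runs at
# array speed rather than /dev/urandom speed.
i=1
while [ $i -le 699 ]; do
        cp /share/seed.bin /share/stress.$i
        i=`expr $i + 1`
done

On a stock Solaris 10 pool compression is off and there is no dedup, so
the identical copies still turn into full-size writes.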
 -- richard

-- 
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/


Re: [zfs-discuss] ZFS no longer working with FC devices.

2010-05-23 Thread Demian Phillips
On Sat, May 22, 2010 at 11:33 AM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Fri, 21 May 2010, Demian Phillips wrote:

 For years I have been running a zpool using a Fibre Channel array with
 no problems. I would scrub every so often and dump huge amounts of
 data (tens or hundreds of GB) around and it never had a problem
 outside of one confirmed (by the array) disk failure.

 I upgraded to sol10x86 05/09 last year, and since then I have
 discovered that any sufficiently high I/O load from ZFS starts causing
 timeouts and off-lining disks. Long term this leads to failure (once
 rebooted and cleaned, all is well) because you can no longer scrub
 reliably.

 The problem could be with the device driver, your FC card, or the array
 itself.  In my case, issues I thought were to blame on my motherboard or
 Solaris were due to a defective FC card and replacing the card resolved the
 problem.

 If the problem is that your storage array is becoming overloaded with
 requests, then try adding this to your /etc/system file:

 * Set device I/O maximum concurrency
 *
 http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
 set zfs:zfs_vdev_max_pending = 5

 Bob
 --
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


I've gone back to Solaris 10 11/06.
It's working fine, but I notice some differences in performance that
are, I think, key to the problem.

With the latest Solaris 10 (u8), throughput according to zpool iostat
was hitting about 115 MB/sec, sometimes a little higher.

With 11/06 it maxes out at 40 MB/sec.

Both setups are using mpio devices as far as I can tell.
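
For what it's worth, a quick way to confirm the multipathing state on
both builds (the disk name is illustrative, and mpathadm may not be
present on the older update):

# Device name mapping when MPxIO (STMS) is enabled.
stmsboot -L

# On builds that ship mpathadm: path states for one LU; both fp ports
# should show up as operational paths.
mpathadm show lu /dev/rdsk/c0t50050767190B6C76d0s2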

Next is to go back to u8 and see if the tuning you suggested will
help. It really looks to me like the OS is asking too much of the FC
chain I have.

The really puzzling thing is I just got told about a brand new Dell
Solaris x86 production box, using current and supported FC devices and
a supported SAN, that gets the same kind of problems when a scrub is
run. I'm going to investigate that and see if we can get a fix from
Oracle, since that box does have a support contract. It may shed some
light on the issue I am seeing on the older hardware.


Re: [zfs-discuss] ZFS no longer working with FC devices.

2010-05-23 Thread Richard Elling
On May 23, 2010, at 6:01 AM, Demian Phillips wrote:
 On Sat, May 22, 2010 at 11:33 AM, Bob Friesenhahn
 bfrie...@simple.dallas.tx.us wrote:
 On Fri, 21 May 2010, Demian Phillips wrote:
 
 For years I have been running a zpool using a Fibre Channel array with
 no problems. I would scrub every so often and dump huge amounts of
 data (tens or hundreds of GB) around and it never had a problem
 outside of one confirmed (by the array) disk failure.
 
 I upgraded to sol10x86 05/09 last year, and since then I have
 discovered that any sufficiently high I/O load from ZFS starts causing
 timeouts and off-lining disks. Long term this leads to failure (once
 rebooted and cleaned, all is well) because you can no longer scrub
 reliably.
 
 The problem could be with the device driver, your FC card, or the array
 itself.  In my case, issues I thought were to blame on my motherboard or
 Solaris were due to a defective FC card and replacing the card resolved the
 problem.
 
 If the problem is that your storage array is becoming overloaded with
 requests, then try adding this to your /etc/system file:
 
 * Set device I/O maximum concurrency
 *
 http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
 set zfs:zfs_vdev_max_pending = 5

I would lower it even farther.  Perhaps 2.
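
For reference, a sketch of both ways to apply that: the persistent
/etc/system entry, and an mdb -kw live tweak for trying values without
a reboot.

# Persistent: add to /etc/system and reboot.
#   set zfs:zfs_vdev_max_pending = 2

# Live, on the running kernel: print the current value, then write 2
# (0t2 is decimal 2).
echo zfs_vdev_max_pending/D | mdb -k
echo zfs_vdev_max_pending/W0t2 | mdb -kw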

 I've gone back to Solaris 10 11/06.
 It's working fine, but I notice some differences in performance that
 are, I think, key to the problem.

Yep, lots of performance improvements were added later.

 With the latest Solaris 10 (u8), throughput according to zpool iostat
 was hitting about 115 MB/sec, sometimes a little higher.

That should be about right for the A5100.

 With 11/06 it maxes out at 40 MB/sec.
 
 Both setups are using mpio devices as far as I can tell.
 
 Next is to go back to u8 and see if the tuning you suggested will
 help. It really looks to me like the OS is asking too much of the FC
 chain I have.

I think that is a nice way of saying it.

 The really puzzling thing is I just got told about a brand new Dell
 Solaris x86 production box, using current and supported FC devices and
 a supported SAN, that gets the same kind of problems when a scrub is
 run. I'm going to investigate that and see if we can get a fix from
 Oracle, since that box does have a support contract. It may shed some
 light on the issue I am seeing on the older hardware.

The scrub workload is no different than any other stress test. I'm sure you
can run a benchmark or three on the raw device and get the same error
messages.
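
Something along these lines would put a comparable sequential read load
on the raw devices, with ZFS out of the picture entirely (disk names
are illustrative):

# Sequential reads straight off the raw devices; run several in
# parallel to approximate the concurrency of a scrub.
dd if=/dev/rdsk/c0t50050767190B6C76d0s2 of=/dev/null bs=1024k &
dd if=/dev/rdsk/c0t500507671908E72Bd0s2 of=/dev/null bs=1024k &
wait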

FWIW, the A5100 went end-of-life (EOL) in 2001 and end-of-service-life 
(EOSL) in 2006. Personally, I  hate them with a passion and would like to 
extend an offer to use my tractor to bury the beast :-). 
 -- richard

-- 
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/



Re: [zfs-discuss] ZFS no longer working with FC devices.

2010-05-23 Thread Torrey McMahon

 On 5/23/2010 11:49 AM, Richard Elling wrote:

FWIW, the A5100 went end-of-life (EOL) in 2001 and end-of-service-life
(EOSL) in 2006. Personally, I  hate them with a passion and would like to
extend an offer to use my tractor to bury the beast :-).


I'm sure I can get some others to help. Can I smash the gbics? Those 
were my favorite. :-)



Re: [zfs-discuss] ZFS no longer working with FC devices.

2010-05-22 Thread Bob Friesenhahn

On Fri, 21 May 2010, Demian Phillips wrote:


For years I have been running a zpool using a Fibre Channel array with
no problems. I would scrub every so often and dump huge amounts of
data (tens or hundreds of GB) around and it never had a problem
outside of one confirmed (by the array) disk failure.

 I upgraded to sol10x86 05/09 last year, and since then I have
 discovered that any sufficiently high I/O load from ZFS starts causing
 timeouts and off-lining disks. Long term this leads to failure (once
 rebooted and cleaned, all is well) because you can no longer scrub
 reliably.


The problem could be with the device driver, your FC card, or the 
array itself.  In my case, issues I thought were to blame on my 
motherboard or Solaris were due to a defective FC card and replacing 
the card resolved the problem.
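
As a sketch rather than a full procedure, a couple of quick checks that
help separate a bad card or GBIC from an overloaded array:

# Per-port FC link error statistics (loss of sync, invalid CRCs, etc.);
# a flaky HBA or GBIC usually shows up here.
fcinfo hba-port -l

# Per-device soft/hard/transport error counts accumulated so far.
iostat -En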


If the problem is that your storage array is becoming overloaded with 
requests, then try adding this to your /etc/system file:


* Set device I/O maximum concurrency
* 
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
set zfs:zfs_vdev_max_pending = 5

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


[zfs-discuss] ZFS no longer working with FC devices.

2010-05-21 Thread Demian Phillips
For years I have been running a zpool using a Fibre Channel array with
no problems. I would scrub every so often and dump huge amounts of
data (tens or hundreds of GB) around and it never had a problem
outside of one confirmed (by the array) disk failure.

I upgraded to sol10x86 05/09 last year, and since then I have
discovered that any sufficiently high I/O load from ZFS starts causing
timeouts and off-lining disks. Long term this leads to failure (once
rebooted and cleaned, all is well) because you can no longer scrub
reliably.

ATA, SATA and SAS do not seem to suffer this problem.

I tried upgrading, and then doing a fresh load of U8 and the problem persists.

My FC hardware is:
Sun A5100 (14 disk) array.
Hitachi 146GB FC disks (started with 9GB SUN disks, moved to 36 GB
disks from a variety of manufacturers, and then to 72 GB IBM disks
before this last capacity upgrade).
Sun-branded QLogic 2310 FC cards (375-3102), with the Sun qlc drivers
and MPIO enabled.
The rest of the system:
2-CPU Opteron board and chips (2 GHz), 8 GB RAM.

When a hard drive fails in the enclosure, it bypasses the bad drive
and turns on a light to let me know a disk failure has happened. That
never happens during these events, which points to a software problem.

Once it goes off the rails and starts off-lining disks, the system
itself struggles: a user login takes forever (40 minutes minimum to
get past the last-login message), and any command touching storage or
zfs/zpool hangs for just as long.

I can reliably reproduce the issue by either copying a large amount of
data into the pool or running a scrub.
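
The scrub case is the easiest to trigger and watch; roughly:

# Start a scrub, watch per-vdev throughput, and follow the system log
# for the SCSI timeout warnings as they appear.
zpool scrub share
zpool iostat -v share 5 &
tail -f /var/adm/messages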

All disks test fine via destructive tests in format.

I just reproduced it by clearing and creating a new pool called share:

# zpool status share
  pool: share
 state: ONLINE
 scrub: none requested
config:

NAME   STATE READ WRITE CKSUM
share  ONLINE   0 0 0
  raidz2   ONLINE   0 0 0
c0t50050767190B6C76d0  ONLINE   0 0 0
c0t500507671908E72Bd0  ONLINE   0 0 0
c0t500507671907A32Ad0  ONLINE   0 0 0
c0t50050767190C4CFDd0  ONLINE   0 0 0
c0t500507671906704Dd0  ONLINE   0 0 0
c0t500507671918892Ad0  ONLINE   0 0 0
  raidz2   ONLINE   0 0 0
c0t50050767190D11E4d0  ONLINE   0 0 0
c0t500507671915CABEd0  ONLINE   0 0 0
c0t50050767191371C7d0  ONLINE   0 0 0
c0t5005076719125EDBd0  ONLINE   0 0 0
c0t50050767190E4DABd0  ONLINE   0 0 0
c0t5005076719147ECAd0  ONLINE   0 0 0

errors: No known data errors


The system log (/var/adm/messages) shows something like the following:


May 21 15:27:54 solarisfc scsi: [ID 243001 kern.warning] WARNING:
/scsi_vhci (scsi_vhci0):
May 21 15:27:54 solarisfc   /scsi_vhci/d...@g50050767191371c7
(sd2): Command Timeout on path
/p...@0,0/pci1022,7...@a/pci1077,1...@3/f...@0,0 (fp1)
May 21 15:27:54 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:27:54 solarisfc   SCSI transport failed: reason
'timeout': retrying command
May 21 15:27:54 solarisfc scsi: [ID 243001 kern.warning] WARNING:
/scsi_vhci (scsi_vhci0):
May 21 15:27:54 solarisfc   /scsi_vhci/d...@g50050767191371c7
(sd2): Command Timeout on path
/p...@0,0/pci1022,7...@a/pci1077,1...@2/f...@0,0 (fp0)
May 21 15:28:54 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:28:54 solarisfc   SCSI transport failed: reason
'timeout': giving up
May 21 15:32:54 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:32:54 solarisfc   SYNCHRONIZE CACHE command failed (5)
May 21 15:40:54 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:40:54 solarisfc   drive offline
May 21 15:48:55 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:48:55 solarisfc   drive offline
May 21 15:56:55 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:56:55 solarisfc   drive offline
May 21 16:04:55 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 16:04:55 solarisfc   drive offline
May 21 16:04:56 solarisfc fmd: [ID 441519 daemon.error] SUNW-MSG-ID:
ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
May 21 16:04:56 solarisfc EVENT-TIME: Fri May 21 16:04:56 EDT 2010
May 21 16:04:56 solarisfc PLATFORM: To Be Filled By O.E.M., CSN: To Be
Filled By O.E.M., HOSTNAME: solarisfc
May 21 16:04:56 solarisfc SOURCE: zfs-diagnosis, REV: 1.0
May 21 16:04:56 solarisfc EVENT-ID: 295d7729-9a93-47f1-de9d-ba3a08b2d477
May 21 16:04:56 solarisfc DESC: The