Re: [zfs-discuss] ZFS no longer working with FC devices.

2010-05-24 Thread Demian Phillips
On Sun, May 23, 2010 at 12:02 PM, Torrey McMahon  wrote:
>  On 5/23/2010 11:49 AM, Richard Elling wrote:
>>
>> FWIW, the A5100 went end-of-life (EOL) in 2001 and end-of-service-life
>> (EOSL) in 2006. Personally, I  hate them with a passion and would like to
>> extend an offer to use my tractor to bury the beast:-).
>
> I'm sure I can get some others to help. Can I smash the gbics? Those were my
> favorite. :-)

I'd be more than happy to take someone up on the offer, but I'd need a
good deal on a more current FC array. Since this is my home environment
I am limited by my insignificant pay and the wife factor (who does
indulge me from time to time). Without a corporate IT budget I make do
with everything from free to whatever I can afford used.

To be honest I'd rather be using an IBM DS4K series array.

The current stress test is creating 700 1 GB files (50% of array
capacity) from /dev/urandom, and then I will scrub.

If all goes well, it's back to U8 and tuning it.


Re: [zfs-discuss] ZFS no longer working with FC devices.

2010-05-23 Thread Demian Phillips
On Sat, May 22, 2010 at 11:33 AM, Bob Friesenhahn
 wrote:
> On Fri, 21 May 2010, Demian Phillips wrote:
>
>> For years I have been running a zpool using a Fibre Channel array with
>> no problems. I would scrub every so often and dump huge amounts of
>> data (tens or hundreds of GB) around, and it never had a problem
>> outside of one confirmed (by the array) disk failure.
>>
>> I upgraded to sol10x86 05/09 last year, and since then I have
>> discovered that any sufficiently high I/O load from ZFS starts causing
>> timeouts and off-lining disks. This leads to failure in the long term
>> (once rebooted and cleaned up, all is well) because you can no longer
>> scrub reliably.
>
> The problem could be with the device driver, your FC card, or the array
> itself. In my case, issues I had blamed on my motherboard or Solaris
> were due to a defective FC card, and replacing the card resolved the
> problem.
>
> If the problem is that your storage array is becoming overloaded with
> requests, then try adding this to your /etc/system file:
>
> * Set device I/O maximum concurrency
> *
> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
> set zfs:zfs_vdev_max_pending = 5
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
>

I've gone back to Solaris 10 11/06.
It's working fine, but I notice some differences in performance that I
think are key to the problem.

With the latest Solaris 10 (U8), throughput according to zpool iostat
was hitting about 115 MB/s, sometimes a little higher.

With 11/06 it maxes out at 40 MB/s.

Both setups are using MPxIO devices, as far as I can tell.

Next is to go back to U8 and see if the tuning you suggested helps. It
really looks to me as though the OS is asking too much of the FC chain
I have.
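
For the record, the plan is along these lines (a sketch based on the
Evil Tuning Guide entry above; the value 5 is Bob's suggestion, not
something I have validated yet):

# echo zfs_vdev_max_pending/D | mdb -k
(shows the current per-vdev queue depth)
# echo zfs_vdev_max_pending/W0t5 | mdb -kw
(changes it live, without a reboot)

and, to make it persist across reboots, the /etc/system line from the
guide:

set zfs:zfs_vdev_max_pending = 5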

The really puzzling thing is that I have just been told about a
brand-new Dell Solaris x86 production box, using current and supported
FC devices and a supported SAN, that gets the same kind of problems
when a scrub is run. I'm going to investigate that and see if we can
get a fix from Oracle, since that box does have a support contract. It
may shed some light on the issue I am seeing on the older hardware.
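
If nothing else, comparing the link-level and per-device error
counters on the two systems should be informative; a sketch of the
sort of thing I mean:

# fcinfo hba-port -l
(link error statistics for each FC HBA port)
# iostat -En
(soft/hard/transport error counts per disk as seen by sd)
# fmdump -eV | more
(raw FMA ereports logged around the time of the scrub)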


[zfs-discuss] ZFS no longer working with FC devices.

2010-05-21 Thread Demian Phillips
For years I have been running a zpool using a Fibre Channel array with
no problems. I would scrub every so often and dump huge amounts of
data (tens or hundreds of GB) around, and it never had a problem
outside of one confirmed (by the array) disk failure.

I upgraded to sol10x86 05/09 last year, and since then I have
discovered that any sufficiently high I/O load from ZFS starts causing
timeouts and off-lining disks. This leads to failure in the long term
(once rebooted and cleaned up, all is well) because you can no longer
scrub reliably.

ATA, SATA, and SAS devices do not seem to suffer from this problem.

I tried upgrading, and then doing a fresh install of U8, and the problem persists.

My FC hardware is:
Sun A5100 (14-disk) array.
Hitachi 146 GB FC disks (started with 9 GB Sun disks, moved to 36 GB
disks from a variety of manufacturers, and then to 72 GB IBM disks
before this last capacity upgrade).
Sun-branded QLogic 2310 FC cards (375-3102), with Sun qlc drivers and
MPxIO enabled.
The rest of the system:
2-CPU Opteron board (>2 GHz), 8 GB RAM.
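
For anyone trying to reproduce this, the multipath and HBA port state
can be checked with something like the following (a sketch, using the
disk that gets off-lined in the log further down):

# mpathadm list lu
(path count and state for every multipathed LUN)
# mpathadm show lu /dev/rdsk/c0t50050767191371C7d0s2
(per-path detail for a single disk)
# luxadm -e port
(state of each FC HBA port)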

When a hard drive fails in the enclosure, it bypasses the bad drive
and turns on a light to let me know a disk failure has happened. That
never happens during these events, which points to a software problem.

Once it goes off the rails and starts off-lining disks, the system
develops problems: logging in takes forever (40 minutes minimum to get
past the last-login message), and any command touching storage or
zfs/zpool hangs for just as long.

I can reliably reproduce the issue by either copying a large amount of
data into the pool or running a scrub.

All disks test fine via destructive tests in format.

I just reproduced it by clearing and creating a new pool called share:

# zpool status share
  pool: share
 state: ONLINE
 scrub: none requested
config:

NAME   STATE READ WRITE CKSUM
share  ONLINE   0 0 0
  raidz2   ONLINE   0 0 0
c0t50050767190B6C76d0  ONLINE   0 0 0
c0t500507671908E72Bd0  ONLINE   0 0 0
c0t500507671907A32Ad0  ONLINE   0 0 0
c0t50050767190C4CFDd0  ONLINE   0 0 0
c0t500507671906704Dd0  ONLINE   0 0 0
c0t500507671918892Ad0  ONLINE   0 0 0
  raidz2   ONLINE   0 0 0
c0t50050767190D11E4d0  ONLINE   0 0 0
c0t500507671915CABEd0  ONLINE   0 0 0
c0t50050767191371C7d0  ONLINE   0 0 0
c0t5005076719125EDBd0  ONLINE   0 0 0
c0t50050767190E4DABd0  ONLINE   0 0 0
c0t5005076719147ECAd0  ONLINE   0 0 0

errors: No known data errors


/var/adm/messages logs something like the following:


May 21 15:27:54 solarisfc scsi: [ID 243001 kern.warning] WARNING:
/scsi_vhci (scsi_vhci0):
May 21 15:27:54 solarisfc   /scsi_vhci/d...@g50050767191371c7
(sd2): Command Timeout on path
/p...@0,0/pci1022,7...@a/pci1077,1...@3/f...@0,0 (fp1)
May 21 15:27:54 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:27:54 solarisfc   SCSI transport failed: reason
'timeout': retrying command
May 21 15:27:54 solarisfc scsi: [ID 243001 kern.warning] WARNING:
/scsi_vhci (scsi_vhci0):
May 21 15:27:54 solarisfc   /scsi_vhci/d...@g50050767191371c7
(sd2): Command Timeout on path
/p...@0,0/pci1022,7...@a/pci1077,1...@2/f...@0,0 (fp0)
May 21 15:28:54 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:28:54 solarisfc   SCSI transport failed: reason
'timeout': giving up
May 21 15:32:54 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:32:54 solarisfc   SYNCHRONIZE CACHE command failed (5)
May 21 15:40:54 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:40:54 solarisfc   drive offline
May 21 15:48:55 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:48:55 solarisfc   drive offline
May 21 15:56:55 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:56:55 solarisfc   drive offline
May 21 16:04:55 solarisfc scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 16:04:55 solarisfc   drive offline
May 21 16:04:56 solarisfc fmd: [ID 441519 daemon.error] SUNW-MSG-ID:
ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
May 21 16:04:56 solarisfc EVENT-TIME: Fri May 21 16:04:56 EDT 2010
May 21 16:04:56 solarisfc PLATFORM: To Be Filled By O.E.M., CSN: To Be
Filled By O.E.M., HOSTNAME: solarisfc
May 21 16:04:56 solarisfc SOURCE: zfs-diagnosis, REV: 1.0
May 21 16:04:56 solarisfc EVENT-ID: 295d7729-9a93-47f1-de9d-ba3a08b2d477
May 21 16:04:56 solarisfc DESC: The nu

[zfs-discuss] Pool recovery from replaced disks.

2010-05-18 Thread Demian Phillips
Is it possible to recover a pool (as it was) from a set of disks that
were replaced during a capacity upgrade?


[zfs-discuss] behavior of disk identifiers and zpools.

2008-07-01 Thread Demian Phillips
I am using an LSI PCI-X dual-port HBA in a 2-socket Opteron system.
Connected to the HBA is a Sun StorageTek A1000 populated with 14 36 GB disks.

I have two questions that I think are related.

Initially I set up two raidz2 vdevs, one on each channel, so the pool looked like this:

share
  raidz2 
c2t3d0   
c2t4d0   
c2t5d0   
c2t6d0   
c2t7d0   
c2t8d0   
  raidz2 
c3t9d0   
c3t10d0  
c3t11d0  
c3t12d0  
c3t13d0  
c3t14d0  
spares
  c3t15d0AVAIL
  c2t2d0 AVAIL

With the mpt driver and alternate pathing turned on, I could sustain
100 MB/s of throughput into the file systems I created on it.

I was learning the zpool commands and features when I unmounted the
file systems and exported the pool. That worked, and I ran the import
according to the documentation, which also worked, but it brought all
the disks in on c2 instead of half on c2 and half on c3 as I had
before. Now I am back down to 40 MB/s throughput at best.

Why did it do that, and how, in a setup like this, can I export and
import while keeping the paths the way I want them?
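
For reference, the sequence was just the plain export and import from
the docs, something like:

# zpool export share
# zpool import share

and the import is what came back with every disk on c2.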

The next question is about a more recent issue.

I posted here asking about replacing the disk, but didn't really find
out whether I needed to do any work on the OS side.

I had a disk fail and the hot spare took over. I had another spare
disk in the array, so I ran the replace using it (I removed it first).
I then spun down the bad disk and popped in a replacement.

After bringing it back up, I could not add the new disk to the pool
(as a replacement for the spare I used for the replace), even after
running the proper utilities to rescan the bus (and they did run and
work).

So I shut down and rebooted.

The system comes back up fine, but before I go to add the disk I run a
zpool status and notice that after the boot the disks in the pool have
rearranged themselves.

Original zpool:
share
 raidz2
  c2t3d0
  c2t4d0
  c2t5d0
  c2t6d0
  c2t7d0
  c2t8d0  

Re: [zfs-discuss] Proper way to do disk replacement in an A1000 storage array and raidz2.

2008-06-30 Thread Demian Phillips
Thanks. I have another spare, so I replaced with that, and it put the
used spare back to spare status.

I assume that at this point, once I replace the failed disk, I just
need to let Solaris see the change and then add it back into the pool
as a spare (to replace the spare I took out and used in the replace)?
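
In other words, something like the following, if I have the syntax
right (a sketch; c2t2d0 is just the slot the old spare lived in and
may not be what the new disk shows up as):

# devfsadm -Cv
(rebuild /dev links so the new disk shows up; -C also prunes stale ones)
# zpool add share spare c2t2d0
# zpool status share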

I see some odd behavior related to the FC array and controller, but
that is not ZFS-related, so I will have to post elsewhere about that
fun.
 
 


[zfs-discuss] Proper way to do disk replacement in an A1000 storage array and raidz2.

2008-06-28 Thread Demian Phillips
I'm using ZFS and a drive has failed.
I am quite new to Solaris and, frankly, I seem to know more about ZFS
and how it works than I do about the OS.

The hot spare has taken over for the failed disk. From here, do I need
to remove the disk on the OS side (and if so, what is the proper way),
or do I need to take action on the ZFS side first?
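
If it helps to be concrete, what I think the ZFS side looks like is
roughly this (a sketch from the zpool man page, with c2t3d0 standing
in for the failed disk, replaced in the same physical slot):

# zpool replace share c2t3d0
(resilvers onto the new disk; the hot spare should detach back to
AVAIL on its own once the resilver completes)
# zpool status share

That still leaves the OS-side part of the question, though.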
 
 