Re: [zfs-discuss] ZFS no longer working with FC devices.
On Sun, May 23, 2010 at 12:02 PM, Torrey McMahon wrote:
> On 5/23/2010 11:49 AM, Richard Elling wrote:
>> FWIW, the A5100 went end-of-life (EOL) in 2001 and end-of-service-life
>> (EOSL) in 2006. Personally, I hate them with a passion and would like to
>> extend an offer to use my tractor to bury the beast :-).
>
> I'm sure I can get some others to help. Can I smash the GBICs? Those were
> my favorite. :-)

I'd be more than happy to take someone up on the offer, but I'd need a good deal on a more current FC array. Since this is my home environment, I am limited by my insignificant pay and the wife factor (who does indulge me from time to time). Without a corporate IT budget I make do with everything from free to what I can afford used. To be honest, I'd rather be using an IBM DS4K series array.

The current stress test is creating 700 1GB files (50% of array capacity) from /dev/urandom, after which I will scrub. If all goes well, it's back to u8 and tuning.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
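The stress test described above can be sketched as a small POSIX shell loop. This is my own sketch, not the poster's actual script: the function name and parameters are made up here, and the 700 x 1GB figures come from the post.

```shell
#!/bin/sh
# Sketch of the stress test described above: fill a directory with
# fixed-size files of pseudo-random data, then (on a real pool) scrub.
# make_test_files is a hypothetical helper; dir/count/size are parameters.

make_test_files() {
    dir=$1; count=$2; size_mb=$3
    mkdir -p "$dir"
    i=1
    while [ "$i" -le "$count" ]; do
        # write size_mb 1MB blocks of /dev/urandom data per file
        dd if=/dev/urandom of="$dir/testfile.$i" bs=1048576 count="$size_mb" 2>/dev/null
        i=$((i + 1))
    done
}

# The run from the post would look roughly like (pool mount point assumed):
# make_test_files /share 700 1024
# zpool scrub share
```

Generating the data from /dev/urandom rather than /dev/zero matters for this kind of test: ZFS can compress or otherwise shortcut highly regular data, while random data forces real writes to every disk.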
Re: [zfs-discuss] ZFS no longer working with FC devices.
On Sat, May 22, 2010 at 11:33 AM, Bob Friesenhahn wrote:
> On Fri, 21 May 2010, Demian Phillips wrote:
>> For years I have been running a zpool using a Fibre Channel array with
>> no problems. I would scrub every so often and dump huge amounts of
>> data (tens or hundreds of GB) around and it never had a problem
>> outside of one confirmed (by the array) disk failure.
>>
>> I upgraded to sol10x86 05/09 last year and since then I have
>> discovered any sufficiently high I/O from ZFS starts causing timeouts
>> and off-lining disks. This leads to failure (once rebooted and cleaned
>> all is well) long term because you can no longer scrub reliably.
>
> The problem could be with the device driver, your FC card, or the array
> itself. In my case, issues I thought were to blame on my motherboard or
> Solaris were due to a defective FC card, and replacing the card resolved
> the problem.
>
> If the problem is that your storage array is becoming overloaded with
> requests, then try adding this to your /etc/system file:
>
> * Set device I/O maximum concurrency
> * http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
> set zfs:zfs_vdev_max_pending = 5
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

I've gone back to Solaris 10 11/06. It's working fine, but I notice some differences in performance that I think are key to the problem. With the latest Solaris 10 (u8), throughput according to zpool iostat was hitting about 115MB/sec, sometimes a little higher. With 11/06 it maxes out at 40MB/sec. Both setups are using MPIO devices as far as I can tell. Next is to go back to u8 and see if the tuning you suggested will help. It really looks to me like the OS is asking too much of the FC chain I have.
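For reference, a sketch of how Bob's suggestion is applied in practice. The /etc/system line is persistent but only takes effect at boot; on Solaris 10 the live value of the tunable can also be inspected and changed with mdb -k, avoiding a reboot while experimenting. The command names are real Solaris tools, but treat the exact syntax as an assumption to verify against your release's documentation.

```
# /etc/system (persistent; requires a reboot to take effect):
set zfs:zfs_vdev_max_pending = 5

# Check the current live value (run as root):
echo 'zfs_vdev_max_pending/D' | mdb -k

# Change it on the fly for testing (0t prefix = decimal):
echo 'zfs_vdev_max_pending/W 0t5' | mdb -k
```

The idea behind lowering zfs_vdev_max_pending is to cap how many concurrent I/Os ZFS queues to each device, which can keep an older array or loop (like the A5100's) from being flooded into command timeouts.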
The really puzzling thing is that I just got told about a brand new Dell Solaris x86 production box, using current and supported FC devices and a supported SAN, that gets the same kind of problems when a scrub is run. I'm going to investigate that and see if we can get a fix from Oracle, as that box does have a support contract. It may shed some light on the issue I am seeing on the older hardware.
[zfs-discuss] ZFS no longer working with FC devices.
For years I have been running a zpool using a Fibre Channel array with no problems. I would scrub every so often and dump huge amounts of data (tens or hundreds of GB) around, and it never had a problem outside of one confirmed (by the array) disk failure.

I upgraded to sol10x86 05/09 last year, and since then I have discovered that any sufficiently high I/O from ZFS starts causing timeouts and off-lining disks. This leads to failure long term (once rebooted and cleaned, all is well) because you can no longer scrub reliably. ATA, SATA and SAS do not seem to suffer this problem. I tried upgrading, and then doing a fresh load of U8, and the problem persists.

My FC hardware is: a Sun A5100 (14 disk) array; Hitachi 146GB FC disks (started with 9GB Sun disks, moved to 36GB disks from a variety of manufacturers, and then to 72GB IBM disks before this last capacity upgrade); Sun-branded QLogic 2310 FC cards (375-3102), using the Sun qlc drivers with MPIO enabled. The rest of the system: 2-CPU Opteron board and chips (>2GHz), 8GB RAM.

When a hard drive fails in the enclosure, the enclosure bypasses the bad drive and turns on a light to let me know a disk failure has happened. That never happens with this event, which points to a software problem. Once it goes off the rails and starts off-lining disks, the system itself has problems: login for a user takes forever (40 minutes minimum to pass the last-login message), and any command touching storage or zfs/zpool hangs for just as long. I can reliably reproduce the issue by either copying a large amount of data into the pool or running a scrub. All disks test fine via destructive tests in format.
I just reproduced it by clearing and creating a new pool called share:

# zpool status share
  pool: share
 state: ONLINE
 scrub: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        share                      ONLINE       0     0     0
          raidz2                   ONLINE       0     0     0
            c0t50050767190B6C76d0  ONLINE       0     0     0
            c0t500507671908E72Bd0  ONLINE       0     0     0
            c0t500507671907A32Ad0  ONLINE       0     0     0
            c0t50050767190C4CFDd0  ONLINE       0     0     0
            c0t500507671906704Dd0  ONLINE       0     0     0
            c0t500507671918892Ad0  ONLINE       0     0     0
          raidz2                   ONLINE       0     0     0
            c0t50050767190D11E4d0  ONLINE       0     0     0
            c0t500507671915CABEd0  ONLINE       0     0     0
            c0t50050767191371C7d0  ONLINE       0     0     0
            c0t5005076719125EDBd0  ONLINE       0     0     0
            c0t50050767190E4DABd0  ONLINE       0     0     0
            c0t5005076719147ECAd0  ONLINE       0     0     0

errors: No known data errors

The messages log shows something like the following:

May 21 15:27:54 solarisfc scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
May 21 15:27:54 solarisfc   /scsi_vhci/d...@g50050767191371c7 (sd2): Command Timeout on path /p...@0,0/pci1022,7...@a/pci1077,1...@3/f...@0,0 (fp1)
May 21 15:27:54 solarisfc scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:27:54 solarisfc   SCSI transport failed: reason 'timeout': retrying command
May 21 15:27:54 solarisfc scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
May 21 15:27:54 solarisfc   /scsi_vhci/d...@g50050767191371c7 (sd2): Command Timeout on path /p...@0,0/pci1022,7...@a/pci1077,1...@2/f...@0,0 (fp0)
May 21 15:28:54 solarisfc scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:28:54 solarisfc   SCSI transport failed: reason 'timeout': giving up
May 21 15:32:54 solarisfc scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:32:54 solarisfc   SYNCHRONIZE CACHE command failed (5)
May 21 15:40:54 solarisfc scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:40:54 solarisfc   drive offline
May 21 15:48:55 solarisfc scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:48:55 solarisfc   drive offline
May 21 15:56:55 solarisfc scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 15:56:55 solarisfc   drive offline
May 21 16:04:55 solarisfc scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/d...@g50050767191371c7 (sd2):
May 21 16:04:55 solarisfc   drive offline
May 21 16:04:56 solarisfc fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
May 21 16:04:56 solarisfc EVENT-TIME: Fri May 21 16:04:56 EDT 2010
May 21 16:04:56 solarisfc PLATFORM: To Be Filled By O.E.M., CSN: To Be Filled By O.E.M., HOSTNAME: solarisfc
May 21 16:04:56 solarisfc SOURCE: zfs-diagnosis, REV: 1.0
May 21 16:04:56 solarisfc EVENT-ID: 295d7729-9a93-47f1-de9d-ba3a08b2d477
May 21 16:04:56 solarisfc DESC: The nu
[zfs-discuss] Pool recovery from replaced disks.
Is it possible to recover a pool (as it was) from a set of disks that were replaced during a capacity upgrade?
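The usual first step for a question like this would be to attach only the old disks and see what ZFS can still find on them. A sketch, with the caveat that whether it works depends heavily on how the disks were replaced: zpool replace detaches each old disk after its resilver completes, which invalidates that disk's pool label, so the old pool may simply no longer be importable. The pool name "oldshare" below is a made-up example.

```
# Scan a specific device directory for importable pools
# (-d defaults to /dev/dsk; point it wherever the old disks appear):
zpool import -d /dev/dsk

# If the old pool shows up (possibly listed by numeric id), import it
# under a new name so it does not collide with the current pool:
zpool import -d /dev/dsk <pool_or_id> oldshare
```

zpool import -D additionally lists pools that were explicitly destroyed, which is a different situation from detached disks but worth knowing about in recovery scenarios.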
[zfs-discuss] behavior of disk identifiers and zpools.
I am using an LSI PCI-X dual-port HBA in a 2-chip Opteron system. Connected to the HBA is a Sun StorageTek A1000 populated with 14 36GB disks. I have two questions that I think are related.

Initially I set up one pool with two raidz2 vdevs, one on each channel, so the pool looked like this:

  share
    raidz2   c2t3d0  c2t4d0   c2t5d0   c2t6d0   c2t7d0   c2t8d0
    raidz2   c3t9d0  c3t10d0  c3t11d0  c3t12d0  c3t13d0  c3t14d0
  spares
    c3t15d0  AVAIL
    c2t2d0   AVAIL

With the mpt driver and alternate pathing turned on, I could sustain 100MB/s throughput into the file systems I created on it. While learning the zpool commands and features, I unmounted the file systems and exported the pool. That worked, and I ran the import according to the documentation, which also worked, but it added all the disks on c2 instead of half on c2 and half on c3 like I had before. Now I am back down to 40MB/s throughput at best. Why did it do that, and how can I export and import in such a setup while keeping my paths how I want them?

The next question is a more recent issue. I posted here asking about replacing a disk, but didn't really find out if I needed to do any work on the OS side. I had a disk fail and the hot spare took over. I had another spare disk in the array, so I removed it from the pool and ran the replace using it. I then spun down the bad disk and popped in a replacement. Bringing it back up, I could not add the new disk into the pool (as a replacement for the spare I used for the replace), even after running the proper utils to scan the bus (and they did run and work). So I shut down and rebooted. The system comes back up fine, but before I go to add the disk I do a zpool status and notice that after the boot the disks in the pools have re-arranged themselves.

Original zpool:

  share
    raidz2   c2t3d0  c2t4d0  c2t5d0  c2t6d0  c2t7d0  c2t8d0
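On the first question, one technique sometimes used to steer which device path each vdev is opened through is to import from a directory that contains links only to the preferred paths, since zpool import takes the first matching device it finds in the directory it scans. This is a hedged sketch, not a tested recipe for this hardware: the link targets below are examples, and the exact slice naming (d0 vs. d0s0) depends on how the disks were labeled.

```
zpool export share

# Build a directory holding links only to the paths we want each disk
# opened through (half on c2, half on c3), then point import at it:
mkdir /tmp/paths
ln -s /dev/dsk/c2t3d0s0 /tmp/paths/   # ...repeat for the other c2 disks
ln -s /dev/dsk/c3t9d0s0 /tmp/paths/   # ...repeat for the other c3 disks

zpool import -d /tmp/paths share
```

With a true multipathing driver (MPxIO/scsi_vhci) both physical paths collapse into one device node and load-balancing is handled below ZFS, which sidesteps this problem entirely; the manual link-directory approach only matters when the two paths show up as separate cN targets, as they do here.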
Re: [zfs-discuss] Proper way to do disk replacement in an A1000 storage array and raidz2.
Thanks. I have another spare, so I replaced with that, and it put the used spare back to spare status. I assume at this point, once I replace the failed disk, I just need to let Solaris see the change and then add it back into the pool as a spare (to replace the spare I took out and used in the replace)?

I see some odd behavior related to the FC array and controller, but that is not ZFS-related, so I will have to post elsewhere about that fun.
[zfs-discuss] Proper way to do disk replacement in an A1000 storage array and raidz2.
I'm using ZFS and a drive has failed. I am quite new to Solaris, and frankly I seem to know more about ZFS and how it works than I do the OS. I have the hot spare taking over the failed disk. From here, do I need to remove the disk on the OS side first (and if so, what is the proper way), or do I need to take action on the ZFS side first?
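For readers finding this thread later, one common sequence on Solaris 10 for a raidz pool with an active hot spare looks roughly like the sketch below. The zpool and cfgadm commands are real, but the attachment point (Ap_Id) and device names are hypothetical examples; the right Ap_Id comes from cfgadm -al on your own system.

```
# The spare has already taken over; confirm the pool state first:
zpool status share

# OS side: unconfigure the failed disk so it can be physically pulled
# (the Ap_Id "c2::dsk/c2t3d0" is a made-up example):
cfgadm -c unconfigure c2::dsk/c2t3d0

# Swap the drive, then bring the new one back under OS control:
cfgadm -c configure c2::dsk/c2t3d0
devfsadm

# ZFS side: replace the failed vdev with the new disk. Once the
# resilver completes, the hot spare returns to the spare list:
zpool replace share c2t3d0
```

If instead you want to promote the spare permanently, detaching the original failed device (zpool detach share c2t3d0) makes the spare a full member of the raidz group.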