OK, after doing a lot more testing of this I've found it's not the Supermicro controller causing the problems. It's purely ZFS, and the problems are major! I've even found one scenario that appears to cause huge data loss without any warning from ZFS: up to 30,000 files and 100MB of data missing after a reboot, with ZFS reporting that the pool is OK.
***********************************************************************

1. Solaris handles USB and SATA hot plug fine

If disks are not in use by ZFS, you can unplug USB or SATA devices and cfgadm will recognise the disconnection. USB devices are recognised automatically as you reconnect them; SATA devices need reconfiguring. Cfgadm even recognises the SATA device as an empty bay:

# cfgadm
Ap_Id          Type          Receptacle    Occupant      Condition
sata1/7        sata-port     empty         unconfigured  ok
usb1/3         unknown       empty         unconfigured  ok

-- insert devices --

# cfgadm
Ap_Id          Type          Receptacle    Occupant      Condition
sata1/7        disk          connected     unconfigured  unknown
usb1/3         usb-storage   connected     configured    ok

To bring the SATA drive online it's just a case of running:

# cfgadm -c configure sata1/7

***********************************************************************

2. If ZFS is using a hot plug device, disconnecting it will hang all ZFS status tools

While pools remain accessible, any attempt to run "zpool status" will hang, and I don't know of any way to recover these tools once this happens. This is a pretty big problem in itself, but it also makes me worry whether other types of error could have the same effect. I can see this leaving a server in a state where you know there are errors in a pool, but have no way of finding out what those errors are without rebooting the server.

***********************************************************************

3. Once the ZFS status tools are hung, the computer will not shut down

The only way I've found to recover from this is to physically power down the server. The Solaris shutdown process simply hangs.

***********************************************************************

4. While reading an offline disk causes errors, writing does not!  *** CAUSES DATA LOSS ***

This is a big one: ZFS can continue writing to an unavailable pool. It doesn't always generate errors (I've seen it copy over 100MB before erroring), and if not spotted, this *will* cause data loss after you reboot.

I discovered this while testing how ZFS coped with the removal of a hot plug SATA drive. I knew that the ZFS admin tools were hanging, but that redundant pools remained available, and I wanted to see whether it was just the admin tools that were failing or whether ZFS was also failing to send appropriate error messages back to the OS. These are the tests I carried out:

Zpool:     a single-drive pool, consisting of one 250GB SATA drive in a hot plug bay
Test data: a folder tree containing 19,160 items, 71.1MB in total

TEST 1: I opened File Browser and copied the test data to the pool. Half way through the copy I pulled the drive. THE COPY COMPLETED WITHOUT ERROR. zpool list reported the pool as online; zpool status, however, hung as expected. Not quite believing the results, I rebooted and tried again.

TEST 2: I opened File Browser and copied the data to the pool, pulling the drive half way through. The copy again finished without error, and checking the properties showed 19,160 files in the copy. zfs list again showed the filesystem as ONLINE.

Now I decided to see how many files I could copy before it errored, so I started the copy again. File Browser managed a further 9,171 files before it stopped - that's nearly 30,000 files before any error was detected. Again, despite the copy having finally errored, zpool list showed the pool as online even though zpool status hung.

I then rebooted the server and found that my first copy contained just 10,952 items, and my second copy was completely missing. That's a loss of almost 20,000 files, yet zpool status reports NO ERRORS.
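For anyone who wants to try reproducing tests 1 and 2 from the command line (I used File Browser, but the effect should be the same), the procedure is roughly the one below. The pool name, device and source directory are just placeholders for my setup, so substitute your own, and obviously only try this with a disk and data you can afford to lose:

# zpool create testpool c1t7d0
# cp -r /export/testdata /testpool &

-- pull the drive while the copy is still running; the copy finishes without error --

# find /testpool -type f | wc -l       (the file count looks complete)
# zpool list testpool                  (still reports the pool as ONLINE)
# zpool status testpool                (hangs)

-- reboot, then repeat the find to see how many files actually made it onto the disk --

Before the reboot everything looks healthy; it's only the count after the reboot that shows how much has silently disappeared.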
For the third test, I decided to see whether these files were actually accessible before the reboot:

TEST 3: This time I pulled the drive *before* starting the copy. The copy started much more slowly and only got to 2,939 files before reporting an error. At this point I copied the files that had made it into the pool over to another pool, and then rebooted. After the reboot, the folder in the test pool had disappeared completely, but the copy I took before rebooting was fine and contained 2,938 items, approximately 12MB of data. Again, zpool status reported no errors.

Further tests revealed that reading the pool produces an error almost immediately; writing to the pool appears very inconsistent.

This is a huge problem. Data can be written without error and is still served to users, and it is only later that the server begins to issue errors - but by that point the ZFS admin tools are useless. The only possible recovery is a server reboot, and that loses the recently written data without any warning at all from ZFS. Needless to say, I have a lot less faith in ZFS's error checking after seeing it lose 30,000 files without complaint.

***********************************************************************

5. If you are using CIFS and pull a drive from the volume, the whole server hangs!

This appears to be the original problem I found. While ZFS doesn't handle drive removal well, the combination of ZFS and CIFS is worse. If you pull a drive from a ZFS pool (redundant or not) which is serving CIFS data, the entire server freezes until you re-insert the drive. Note that ZFS itself does not recover once the drive is back; the admin tools will still hang. Of course, you still need a physical reboot to get the ZFS admin tools back, but in the meantime the data is accessible again.
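For anyone wanting to reproduce the CIFS hang, a minimal setup along these lines should do. I've sketched it with the in-kernel CIFS server, and the pool and share names are just placeholders; however your shares are actually configured, it's the drive pull that matters:

# svcadm enable -r smb/server
# zfs create -o sharesmb=on testpool/share

-- map the share from a Windows client and start reading or copying files --
-- pull a drive from the pool: the whole server freezes until the drive goes back in --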