[zfs-discuss] scsi messages and mpt warning in log - harmless, or indicating a problem?
This afternoon, messages like the following started appearing in /var/adm/messages:

May 18 13:46:37 fs8 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,2...@1/pci15d9,a...@0 (mpt0):
May 18 13:46:37 fs8    Log info 0x3108 received for target 5.
May 18 13:46:37 fs8    scsi_status=0x0, ioc_status=0x804b, scsi_state=0x1
May 18 13:46:38 fs8 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,2...@1/pci15d9,a...@0 (mpt0):
May 18 13:46:38 fs8    Log info 0x3108 received for target 5.
May 18 13:46:38 fs8    scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0
May 18 13:46:40 fs8 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,2...@1/pci15d9,a...@0 (mpt0):
May 18 13:46:40 fs8    Log info 0x3108 received for target 5.
May 18 13:46:40 fs8    scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0

The pool has no errors, so I don't know if these represent a potential problem or not. During this time I was copying files from one fileset to another in the same pool, so it was fairly I/O intensive. Typically you get one every 1-5 seconds for 10 to 20 seconds, sometimes longer, and then it is quiet for many minutes before they occur again. Is this indicating a problem, or just a harmless message?

I just kicked off a scrub on the pool as I was writing this, and I am seeing a lot of these messages. I see that zpool status shows c4t5d0 has 12.5K repaired already. The scrub has been in progress for just 6 minutes, it says I have 170629h54m to go, and the estimate gets longer every time I check the status. I ran a scrub on this pool a few weeks ago and had no such problem.

I also see two warnings earlier today:

May 18 19:14:09 fs8 scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@1/pci15d9,a...@0 (mpt0):
May 18 19:14:09 fs8    mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31110900
May 18 19:14:09 fs8 scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@1/pci15d9,a...@0 (mpt0):
May 18 19:14:09 fs8    mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31110900

and two more of these 1 minute and 10 seconds later. So, is my system in trouble or not?

Particulars of my system:

% uname -a
SunOS fs8 5.11 snv_134 i86pc i386 i86pc

The hardware is an Asus server motherboard carrying 4GB of ECC memory and a current Xeon CPU, plus a SuperMicro AOC-USASLP-L8I card (it uses the 1068E) with 8 Samsung Spinpoint F3EG HD203WI 2TB disks attached.
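A sketch of the checks that should show whether the disk itself is racking up errors (this assumes target 5 corresponds to c4t5d0, and that I have the fmdump time syntax right):

# Per-device soft/hard/transport error counters and, where available, vendor,
# model and serial number -- a steadily climbing count on one disk points at
# that disk or its cabling:
iostat -En

# Anything FMA has actually diagnosed as a fault (an empty list means nothing
# has been declared faulty yet, even if ereports are being logged):
fmadm faulty

# Raw error reports from the last day, to see whether they all name the same
# target:
fmdump -e -t 24hour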
Re: [zfs-discuss] Setting up ZFS on AHCI disks
I solved the mystery - an astounding 7 out of the 10 brand new disks I was using were bad. I was using 4 at a time, and it wasn't until a good one got in the mix that I realized what was wrong. FYI, these were Western Digital WD15EADS and Samsung HD154UI. Each brand was mostly bad, with one or two good disks. The bad ones are functional enough that the BIOS can tell what type they are, but I got a lot of errors when I plugged them into a Linux box to check them. The whole thing is bizarre enough that I wonder if they got damaged in shipping or if my machine somehow damaged them.
Re: [zfs-discuss] Setting up ZFS on AHCI disks
isainfo -k returns amd64, so I don't think that is the answer.
Re: [zfs-discuss] Setting up ZFS on AHCI disks
> There should be no need to create partitions. Something simple like this
> should work:
> zpool create junkfooblah c13t0d0
>
> And if it doesn't work, try "zpool status" just to verify for certain that
> the device is not already part of any pool.

It is not part of any pool. I get the same "cannot label" message, and dmesg still shows the task file error messages that I mentioned before. The drives are new, and I don't think they are bad. Likewise, the motherboard is new, although I see the last BIOS release was September 2008, so the design has been out for a while.
Re: [zfs-discuss] Setting up ZFS on AHCI disks
No Areca controller on this machine. It is a different box, and the drives are just plugged into the SATA ports on the motherboard. I'm running build snv_133, too. The drives are recent - 1.5TB drives, 3 Western Digital and 1 Seagate, if I recall correctly. They ought to support SATA-2. They are brand new and haven't been used before. I have the feeling I'm missing some simple, obvious step because I'm still pretty new to OpenSolaris.
Re: [zfs-discuss] Setting up ZFS on AHCI disks
devfsadm -Cv gave a lot of "removing file" messages, apparently for items that were not relevant. cfgadm -al says this about the disks:

sata0/0::dsk/c13t0d0   disk   connected   configured   ok
sata0/1::dsk/c13t1d0   disk   connected   configured   ok
sata0/2::dsk/c13t2d0   disk   connected   configured   ok
sata0/3::dsk/c13t3d0   disk   connected   configured   ok

I still get the same error message, but I'm guessing now that means I have to create a partition on the device. However, I am still stymied for the time being. fdisk can't open any of the /dev/rdsk/c13t*d0p0 devices. I tried running format, and get this:

AVAILABLE DISK SELECTIONS:
       0. c12d1
          /p...@0,0/pci-...@1f,1/i...@0/c...@1,0
       1. c13t0d0
          /p...@0,0/pci1043,8...@1f,2/d...@0,0
       2. c13t1d0
          /p...@0,0/pci1043,8...@1f,2/d...@1,0
       3. c13t2d0
          /p...@0,0/pci1043,8...@1f,2/d...@2,0
       4. c13t3d0
          /p...@0,0/pci1043,8...@1f,2/d...@3,0
Specify disk (enter its number): 1
Error: can't open disk '/dev/rdsk/c13t0d0p0'.

AVAILABLE DRIVE TYPES:
       0. Auto configure
       1. other
Specify disk type (enter its number): 0
Auto configure failed
No Solaris fdisk partition found.

At this point, I'm not sure whether to run fdisk, format or something else. I tried fdisk, partition and label, but got the message "Current Disk Type is not set." I expect this is a problem because of the "drive type unknown" appearing on the drives. I gather from another thread that I need to run fdisk, but I haven't been able to do it.
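For the record, the sequence I understand should work once the device can actually be opened is roughly this (a sketch only -- c13t0d0 is the first data disk from the listing above, fdisk -B destroys any existing partition table, and the underlying open/power-up errors would still have to be resolved first):

# Write a default fdisk table with a single Solaris partition spanning the
# whole disk (run as root):
fdisk -B /dev/rdsk/c13t0d0p0

# Then either label the disk from format in expert mode, or just hand the
# whole disk to zpool and let it write an EFI label itself:
format -e c13t0d0
zpool create testpool c13t0d0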
[zfs-discuss] Setting up ZFS on AHCI disks
I'm trying to set up a raidz pool on 4 disks attached to an Asus P5BV-M motherboard with an Intel ICH7R. The BIOS lets me pick IDE, RAID, or AHCI for the disks. I'm not interested in the motherboard's RAID, and from reading previous posts it sounded like there were performance advantages to picking AHCI. However, I am getting errors and I am unable to create the pool. Running format tells me:

AVAILABLE DISK SELECTIONS:
       0. c12d1
          /p...@0,0/pci-...@1f,1/i...@0/c...@1,0
       1. c13t0d0
          /p...@0,0/pci1043,8...@1f,2/d...@0,0
       2. c13t1d0
          /p...@0,0/pci1043,8...@1f,2/d...@1,0
       3. c13t2d0
          /p...@0,0/pci1043,8...@1f,2/d...@2,0
       4. c13t3d0
          /p...@0,0/pci1043,8...@1f,2/d...@3,0

The first disk is an IDE disk containing the OS, and the other four are for the pool. Then:

# zpool create mypool raidz c13t0d0 c13t1d0 c13t2d0 c13t3d0
cannot label 'c13t0d0': try using fdisk(1M) and then provide a specific slice

When doing this, dmesg says:

Apr 15 17:14:15 fs8 ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 0 has task file error
Apr 15 17:14:15 fs8 ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 0 is trying to do error recovery
Apr 15 17:14:15 fs8 ahci: [ID 551337 kern.warning] WARNING: ahci0:
Apr 15 17:14:15 fs8 ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 0 task_file_status = 0x451
Apr 15 17:14:15 fs8 genunix: [ID 353554 kern.warning] WARNING: Device /p...@0,0/pci1043,8...@1f,2/d...@0,0 failed to power up.

I find reports from 2006 that the ICH7R is well supported, so I'm not sure what the problem is. Any suggestions?
Re: [zfs-discuss] Why would zfs have too many errors when underlying raid array is fine?
I've got a Supermicro AOC-USAS-L8I on the way because I gather from these forums that it works well. I'll just wait for that, then try 8 disks on that card and 4 on the motherboard SATA ports.
Re: [zfs-discuss] Why would zfs have too many errors when underlying raid array is fine?
As I mentioned earlier, I removed the hardware-based RAID-6 array, changed all the disks to passthrough disks, and made a raidz2 pool using all the disks. I used my backup program to copy 55GB of data to the pool, and now I have errors all over the place.

# zpool status -v
  pool: bigraid
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h4m with 0 errors on Wed Apr 14 22:56:36 2010
config:

        NAME        STATE     READ WRITE CKSUM
        bigraid     DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0    24
            c4t0d0  ONLINE       0     0     3
            c4t0d1  ONLINE       0     0     2
            c4t0d2  ONLINE       0     0     2
            c4t0d3  DEGRADED     0     0     2  too many errors
            c4t0d4  ONLINE       0     0     2
            c4t0d5  ONLINE       0     0     2
            c4t0d6  ONLINE       0     0     1
            c4t0d7  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t1d1  ONLINE       0     0     2
            c4t1d2  ONLINE       0     0     2
            c4t1d3  ONLINE       0     0     4

errors: No known data errors

So, zfs on hardware-supported RAID was fine, but zfs on passthrough disks is not. I'm at a loss to explain it. Any ideas?
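One thing that at least tells you whether the errors keep accumulating or were a one-off is a clear-and-rescrub cycle under load (just a sketch; bigraid is the pool name from the status output above):

# Reset the error counters and the DEGRADED markers:
zpool clear bigraid

# Re-read and verify every block in the pool while it is otherwise busy:
zpool scrub bigraid

# Watch whether the CKSUM column starts climbing again:
zpool status -v bigraid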
Re: [zfs-discuss] Why would zfs have too many errors when underlying raid array is fine?
These are all good reasons to switch back to letting ZFS handle it. I did put about 600GB of data on the pool as configured with RAID-6 on the card, verified the data, and scrubbed it a couple of times in the process, and there were no problems, so it appears that the firmware upgrade fixed my problems. However, I'm going to switch it back to passthrough disks, remake the pool, and try it again.
Re: [zfs-discuss] Why would zfs have too many errors when underlying raid array is fine?
I upgraded to the latest firmware. When I rebooted the machine, the pool was back, with no errors. I was surprised. I will work with it more and see if it stays good. I've done a scrub, so now I'll put more data on it and stress it some more.

If the firmware upgrade fixed everything, then I've got a question about what I am better off doing: keep it as-is, with the RAID card providing redundancy, or turn it all back into passthrough drives and let ZFS handle it, making the Areca card just a really expensive way of getting a bunch of SATA interfaces?
Re: [zfs-discuss] Why would zfs have too many errors when underlying raid array is fine?
I was wondering if the controller itself has problems. My card's firmware is version 1.42, and the firmware on the website is up to 1.48. I see the firmware released last September says:

"Fix Opensolaris+ZFS to add device to mirror set in JBOD or passthrough mode"
"Fix SATA raid controller seagate HDD error handling"

I'm not using mirroring, but I am using Seagate drives. Looks like I should do a firmware upgrade.
Re: [zfs-discuss] Why would zfs have too many errors when underlying raid array is fine?
Just a message 7 hours earlier warning that an IRQ shared by drivers with different interrupt levels might result in reduced performance.
Re: [zfs-discuss] Why would zfs have too many errors when underlying raid array is fine?
It is a Corsair 650W modular power supply, with 2 or 3 disks per cable. However, the Areca card is not reporting any errors, so I think power to the disks is unlikely to be a problem. Here's what is in /var/adm/messages:

Apr 11 22:37:41 fs9 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
Apr 11 22:37:41 fs9 EVENT-TIME: Sun Apr 11 22:37:41 CDT 2010
Apr 11 22:37:41 fs9 PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: fs9
Apr 11 22:37:41 fs9 SOURCE: zfs-diagnosis, REV: 1.0
Apr 11 22:37:41 fs9 EVENT-ID: f6d2aef7-d5fc-e302-a68e-a50a91e81d2d
Apr 11 22:37:41 fs9 DESC: The number of checksum errors associated with a ZFS device
Apr 11 22:37:41 fs9 exceeded acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-GH for more information.
Apr 11 22:37:41 fs9 AUTO-RESPONSE: The device has been marked as degraded.  An attempt
Apr 11 22:37:41 fs9 will be made to activate a hot spare if available.
Apr 11 22:37:41 fs9 IMPACT: Fault tolerance of the pool may be compromised.
Apr 11 22:37:41 fs9 REC-ACTION: Run 'zpool status -x' and replace the bad device.
Apr 11 22:37:42 fs9 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-HC, TYPE: Error, VER: 1, SEVERITY: Major
Apr 11 22:37:42 fs9 EVENT-TIME: Sun Apr 11 22:37:42 CDT 2010
Apr 11 22:37:42 fs9 PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: fs9
Apr 11 22:37:42 fs9 SOURCE: zfs-diagnosis, REV: 1.0
Apr 11 22:37:42 fs9 EVENT-ID: 89b2ef1c-c689-66a0-a7f7-d015a1b7f260
Apr 11 22:37:42 fs9 DESC: The ZFS pool has experienced currently unrecoverable I/O
Apr 11 22:37:42 fs9 failures.  Refer to http://sun.com/msg/ZFS-8000-HC for more information.
Apr 11 22:37:42 fs9 AUTO-RESPONSE: No automated response will be taken.
Apr 11 22:37:42 fs9 IMPACT: Read and write I/Os cannot be serviced.
Apr 11 22:37:42 fs9 REC-ACTION: Make sure the affected devices are connected, then run
Apr 11 22:37:42 fs9 'zpool clear'.
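Since fmd logged EVENT-IDs for both messages, the full diagnosis behind them can be pulled up; something along these lines (a sketch -- the UUID is just the EVENT-ID copied from the first message above):

# List any faults FMA currently considers active, with their suspect devices:
fmadm faulty

# Show the detailed diagnosis event for one specific EVENT-ID:
fmdump -v -u f6d2aef7-d5fc-e302-a68e-a50a91e81d2d

# And the underlying error telemetry (ereports) that led to that diagnosis:
fmdump -e -u f6d2aef7-d5fc-e302-a68e-a50a91e81d2d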
[zfs-discuss] Why would zfs have too many errors when underlying raid array is fine?
I'm struggling to get a reliable OpenSolaris system on a file server. I'm running an Asus P5BV-C/4L server motherboard, 4GB ECC RAM, an E3110 processor, and an Areca 1230 with 12 1-TB disks attached. In a previous posting, it looked like the RAM or the power supply might be a problem, so I ended up upgrading everything except the RAID card and the disks. I'm running OpenSolaris preview build 134.

I started off by setting up all the disks to be passthrough disks and tried to make a raidz2 array using all of them. It would work for a while, then suddenly every disk in the array would have too many errors and the system would fail. I don't know why the sudden failure, but eventually I gave up. Instead, I used the Areca card to create a RAID-6 array with a hot spare, and created a pool directly on the 8TB disk the RAID card exposed. I'll let the card handle the redundancy, and ZFS just the file system. Disk performance is noticeably faster, by the way, compared to software RAID.

I have been testing the system, and it suddenly failed again:

# zpool status -v
  pool: bigraid
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        bigraid     DEGRADED     0     0     7
          c4t0d0    DEGRADED     0     0    34  too many errors

errors: Permanent errors have been detected in the following files:

        :<0x1>
        :<0x18>
        bigraid:<0x3>

The RAID card says the array is fine - no errors - so something is going on with ZFS. I'm out of ideas at this point, except that build 134 might be unstable and I should install an earlier, more stable version. Is there anything I'm missing that I should check?
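One possible check (a sketch, and it assumes the Areca 1230 attaches through the arcmsr driver): compare what FMA logged at the moment of the failure with what the system log says about the driver, since "faulted in response to IO failures" usually means whole commands timed out or were rejected rather than individual blocks failing checksum.

# ZFS/FMA error telemetry from around the failure window, newest last:
fmdump -e -t 24hour

# Anything the arcmsr driver or the SCSI layer complained about at that time:
grep -iE 'arcmsr|scsi|fatal|timeout' /var/adm/messages | tail -50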
Re: [zfs-discuss] Diagnosing Permanent Errors
Yes, I was hoping to find the serial numbers. Unfortunately, it doesn't show any serial numbers for the disks attached to the Areca raid card.
Re: [zfs-discuss] Diagnosing Permanent Errors
Memtest didn't show any errors, but between Frank, early in the thread, saying that he had found memory errors that memtest didn't catch, and removing DIMMs apparently fixing the problem, I jumped too soon to the conclusion that it was the memory. Certainly there are other explanations.

I see that I have a spare Corsair 620W power supply that I could try. There is a Corsair supply of some wattage in there now. If I recall properly, the steady-state power draw is between 150 and 200 watts.

By the way, I see that now one of the disks is listed as degraded - too many errors. Is there a good way to identify exactly which of the disks it is?
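If the Areca passthrough devices report their identity to the OS, something like this should narrow it down (a sketch -- whether the controller actually passes serial numbers through is an open question):

# Vendor/product/serial (when the controller passes them through) plus
# per-device error counters -- match the device name against 'zpool status':
iostat -En

# Physical device paths for each target, which at least shows which
# controller port the degraded c4tXdY sits behind:
format < /dev/null

Failing that, the c4tXdY numbering can usually be matched against the card's own channel/slot numbering in its BIOS utility or web interface.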
Re: [zfs-discuss] Diagnosing Permanent Errors
It certainly has symptoms that match a marginal power supply, but I measured the power consumption some time ago and found it comfortably within the power supply's capacity. I've also wondered whether the RAM is actually fine and there is just some kind of flaky interaction between the RAM configuration I had and the motherboard.
Re: [zfs-discuss] Diagnosing Permanent Errors
Looks like it was RAM. I ran memtest+ 4.00, and it found no problems. I removed 2 of the 3 sticks of RAM, ran a backup, and had no errors. I'm running more extensive tests, but it looks like that was it. A new motherboard, CPU and ECC RAM are on the way to me now.
Re: [zfs-discuss] Diagnosing Permanent Errors
Yeah, this morning I concluded I really should be running ECC RAM. I sometimes wonder why people don't run ECC RAM more frequently. I remember a decade ago, when RAM was much, much less dense, people fretted about alpha particles randomly flipping bits, but that concern seems to have died down. I know, of course, there is some added expense, but browsing on Newegg, the additional RAM cost is pretty minimal - I see 2GB ECC sticks going for about $12 more than similar non-ECC sticks. It's the motherboards that can handle ECC which are the expensive part. Now I've got to see what is a good motherboard for a file server.
[zfs-discuss] Diagnosing Permanent Errors
I would like to get some help diagnosing permanent errors on my files. The machine in question has 12 1TB disks connected to an Areca raid card. I installed OpenSolaris build 134 and, according to zpool history, created a pool with:

zpool create bigraid raidz2 c4t0d0 c4t0d1 c4t0d2 c4t0d3 c4t0d4 c4t0d5 c4t0d6 c4t0d7 c4t1d0 c4t1d1 c4t1d2 c4t1d3

I then backed up 806G of files to the machine and had the backup program verify the files. It failed. The check is continuing to run, but so far it has found 4 files where the checksums of the backup copies don't match the checksums of the originals. zpool status shows problems:

$ sudo zpool status -v
  pool: bigraid
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        bigraid     DEGRADED     0     0   536
          raidz2-0  DEGRADED     0     0 3.14K
            c4t0d0  ONLINE       0     0     0
            c4t0d1  ONLINE       0     0     0
            c4t0d2  ONLINE       0     0     0
            c4t0d3  ONLINE       0     0     0
            c4t0d4  ONLINE       0     0     0
            c4t0d5  ONLINE       0     0     0
            c4t0d6  ONLINE       0     0     0
            c4t0d7  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t1d1  ONLINE       0     0     0
            c4t1d2  ONLINE       0     0     0
            c4t1d3  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

        :<0x18>
        :<0x3a>

So it appears that one of the disks is bad, but if only one disk failed, how would a raidz2 pool develop permanent errors? The numbers in the CKSUM column are continuing to grow, but is that because the backup verification is tickling the errors as it runs?

Previous postings on permanent errors said to look at fmdump -eV, but that output has 437543 lines, and I don't really know how to interpret what I see. I did check the vdev_path with "fmdump -eV | grep vdev_path | sort | uniq -c" to see if it was only certain disks, but every disk in the array is listed, albeit with different frequencies:

   2189    vdev_path = /dev/dsk/c4t0d0s0
   1077    vdev_path = /dev/dsk/c4t0d1s0
   1077    vdev_path = /dev/dsk/c4t0d2s0
   1097    vdev_path = /dev/dsk/c4t0d3s0
     25    vdev_path = /dev/dsk/c4t0d4s0
     25    vdev_path = /dev/dsk/c4t0d5s0
     20    vdev_path = /dev/dsk/c4t0d6s0
   1072    vdev_path = /dev/dsk/c4t0d7s0
   1092    vdev_path = /dev/dsk/c4t1d0s0
           vdev_path = /dev/dsk/c4t1d1s0
   2221    vdev_path = /dev/dsk/c4t1d2s0
   1149    vdev_path = /dev/dsk/c4t1d3s0

What should I make of this? All the disks are bad? That seems unlikely. I found another thread (http://opensolaris.org/jive/thread.jspa?messageID=399988) where it finally came down to bad memory, so I'll test that. Any other suggestions?
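A small addition that may make the 437543-line fmdump output more digestible: summarizing the ereports by class as well as by vdev separates checksum errors (bad data returned) from outright I/O failures (commands failing), which tend to implicate the controller or cabling rather than the platters. A sketch, assuming I have the ZFS ereport class names right:

# How many error reports of each kind have been logged:
fmdump -e | awk '{print $NF}' | sort | uniq -c | sort -rn

# Checksum ereports only, broken down per device path:
fmdump -eV -c ereport.fs.zfs.checksum | grep vdev_path | sort | uniq -c

# I/O failure ereports only, same breakdown:
fmdump -eV -c ereport.fs.zfs.io | grep vdev_path | sort | uniq -c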