Hey, Dennis -

I can't help but wonder if the failure is a result of ZFS itself finding some problems post-restart...

Is there anything in your FMA logs?

  fmstat

for a summary and

  fmdump

for a summary of the diagnosed faults

eg:
drteeth:/tmp # fmdump
TIME                 UUID                                 SUNW-MSG-ID
Nov 03 13:57:29.4190 e28210d7-b7aa-42e0-a3e8-9ba21332d1c7 ZFS-8000-D3
Nov 03 13:57:29.9921 916ce3e2-0c5c-e335-d317-ba1e8a93742e ZFS-8000-D3
Nov 03 14:04:58.8973 ff2f60f8-2906-676a-bfb7-ccbd9c7f957d ZFS-8000-CS
Mar 05 18:04:40.7116 ff2f60f8-2906-676a-bfb7-ccbd9c7f957d FMD-8000-4M Repaired
Mar 05 18:04:40.7875 ff2f60f8-2906-676a-bfb7-ccbd9c7f957d FMD-8000-6U Resolved
Mar 05 18:04:41.0052 e28210d7-b7aa-42e0-a3e8-9ba21332d1c7 FMD-8000-4M Repaired
Mar 05 18:04:41.0760 e28210d7-b7aa-42e0-a3e8-9ba21332d1c7 FMD-8000-6U Resolved

then for example,

  fmdump -vu e28210d7-b7aa-42e0-a3e8-9ba21332d1c7

and

  fmdump -Vvu e28210d7-b7aa-42e0-a3e8-9ba21332d1c7

will show more and more information about the event. Note that some of it might seem like rubbish, but the important bits should be obvious - things like the SUNW message ID (ZFS-8000-D3, for example), which can be pumped into

  sun.com/msg

to see what exactly it's going on about.
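
If you're doing this often, a snippet like the one below will pull the message ID off the most recent fault and build the lookup URL for you. This is just a sketch typed off the top of my head - it assumes the output format shown above, with the message ID in the fifth whitespace-separated field, so adjust the awk field number if your output differs:

  # grab the SUNW-MSG-ID from the most recent fault record
  msgid=`fmdump | tail -1 | awk '{print $5}'`
  # and turn it into the knowledge article URL
  echo "http://www.sun.com/msg/${msgid}"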

Note that there should also be something interesting in the /var/adm/messages log to match any 'faulted' devices.
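
A rough way to fish those out (just a sketch - grep for whatever string suits your setup):

  # look for fault manager / device chatter in the system log
  grep -i fault /var/adm/messages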

You might also find an

  fmdump -e

and

  fmdump -eV

to be interesting - this is the *error* log as opposed to the *fault* log. (Every 'thing that goes wrong' is an error; only those that are diagnosed are considered a fault.)

Note that in all of these fm[dump|stat] commands, you are really only looking at two sets of data: the errors - that is, the telemetry coming into FMA - and the faults. If you include a -e, you view the errors; otherwise, you are looking at the faults.
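
Spelled out, the two views look like this (same commands as above, side by side):

  fmdump        # the *fault* log - what the diagnosis engines concluded
  fmdump -e     # the *error* log - the raw ereport telemetry coming into FMA
  fmdump -eV    # the error log again, with every payload member expanded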

By the way - sun.com/msg has a great PDF about the predictive self-healing technologies in Solaris 10, which will offer more interesting information.

Would be interesting to see *why* ZFS / FMA is feeling the need to fault your devices.

I was interested to see on one of my boxes that I have actually had a *lot* of errors, which I'm now going to have to investigate... Looks like I have a dud rocket in my system... :)

Oh - And I saw this:

Nov 03 14:04:31.2783 ereport.fs.zfs.checksum

Score one more for ZFS! This box has a measly 300GB mirrored, and I have already seen dud data. (heh... It's also got non-ECC memory... ;)
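
If you're curious how often that's been happening, a one-liner like this (again, just a sketch) will count the checksum ereports in the error log:

  fmdump -e | grep -c ereport.fs.zfs.checksum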

Cheers!

Nathan.


Dennis Clarke wrote:
>> On Tue, 24 Mar 2009, Dennis Clarke wrote:
>>> You would think so eh?
>>> But a transient problem that only occurs after a power failure?
>>
>> Transient problems are most common after a power failure or during
>> initialization.
>
> Well the issue here is that power was on for ten minutes before I tried
> to do a boot from the ok prompt.
>
> Regardless, the point is that the ZPool shows no faults at boot time and
> then shows phantom faults *after* I go to init 3.
>
> That does seem odd.
>
> Dennis



--
//////////////////////////////////////////////////////////////////
// Nathan Kroenert              nathan.kroen...@sun.com         //
// Systems Engineer             Phone:  +61 3 9869-6255         //
// Sun Microsystems             Fax:    +61 3 9869-6288         //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456            //
// Melbourne 3004   Victoria    Australia                       //
//////////////////////////////////////////////////////////////////
