Note: IANATZD (I Am Not A Team-ZFS Dude)

Speaking as a Hardware Guy, knowing that something is happening, has happened or is indicated to happen is a Good Thing (tm).

Begin unlikely, but possible scenario:

If, for instance, I'm getting a cluster of read errors (or, perhaps, bad blocks), I could:

- See it as it's happening
- See the block number for each error
- Already know the rate at which the errors are happening
- Be able to determine that it's not good, and it's time to replace the disk.
- You get the picture...

And based on this information, I could feel confident that I have the right information at hand to be able to determine that it is or is not time to replace this disk.

Of course, that assumes:

- I know anything about disks
- I know anything about the error messages
- I have some sort of logging tool that recognises the errors (and does not just throw out the 'retryable ones', as most I have seen are configured to do)
- I care
- The folks watching the logs in the enterprise management tool care
- My storage even bothers to report the errors

Certainly, for some organisations, all of the above are exactly how it works, and it works well for them.

Looking at the ZFS/FMA approach, it certainly is somewhat different.

The (very) rough concept is that FMA gets pretty much all errors reported to it. It logs them in a persistent store, which is always available to view. It also makes diagnoses on the errors, based on the rules that exist for that particular style of error. Once enough (or the right type of) errors happen, it'll then make a Fault Diagnosis for that component, and log a message, loud and proud, into the syslog. It may also take other actions, like retire a page of memory, offline a CPU, panic the box, etc.

So - That's the rough overview.

It's worth noting up front that we can *observe* every event that has happened. Using fmdump and fmstat we can immediately see if anything interesting has been happening, or we can wait for a Fault Diagnosis, in which case, we can just watch /var/adm/messages.

I also *believe* (though am not certain - Perhaps someone else on the list might be?)
it would be possible to have each *event* (so - the individual events that lead to a Fault Diagnosis) generate a message if it was required, though I have never taken the time to do that one...

There are many advantages to this approach - It does not rely on logfiles, offsets into logfiles, counters of previously processed messages and all of the other doom and gloom that comes with scraping logfiles. It's something you can simply ask: Any issues, chief? The answer is there in a flash.

You will also be less likely to have the messages rolled out of the logs before you get to them (another classic...).

And - You get some great details from fmdump showing you what's really going on, and it's something that's really easy to parse to look for patterns.

All of this said, I understand if you feel that, with things being 'hidden' from you until something is *actually* busted, you are having some of your forward vision obscured 'in the name of a quiet logfile'. I felt much the same way for a period of time. (Though, I live more in the CPU / Memory camp...)

But - Once I realised what I could do with fmstat and fmdump, I was not the slightest bit unhappy (Actually, that's not quite true... Even once I knew what they could do, it still took me a while to work out the options I cared about for fmdump / fmstat), but I now trust FMA to look after my CPU / Memory issues better than I would in real life. I can still get what I need when I want to, and the data is actually more accessible and interesting. I just needed to know where to go looking.

All this being said, I was not actually aware that many of our disk / target drivers were actually FMA'd up yet. heh - Shows what I know.

Does any of this make you feel any better (or worse)?

Nathan.

Mark A. Carlson wrote:
> fmd(1M) can log faults to syslogd that are already diagnosed. Why
> would you want the random spew as well?
>
> -- mark
>
> Carson Gaspar wrote:
>> [EMAIL PROTECTED] wrote:
>>
>>> It's not safe to jump to this conclusion.
>>> Disk drivers that support FMA
>>> won't log error messages to /var/adm/messages. As more support for I/O
>>> FMA shows up, you won't see random spew in the messages file any more.
>>>
>>
>> <mode="large financial institution paying support customer">
>> That is a Very Bad Idea. Please convey this to whoever thinks that
>> they're "helping" by not syslogging I/O errors. If this shows up in
>> Solaris 11, we will Not Be Amused. Lack of off-box error logging will
>> directly cause loss of revenue.
>> </mode>
>>
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss