> <div id="jive-html-wrapper-div">
>
> Charles,
>
> Just like UNIX, there are several ways to drill down on the problem. I
> would probably start with a live crash dump (savecore -L) when you see
> the problem. Another method would be to grab multiple "stats" commands
> during the problem to see where you can drill down later. I would
> probably use this method if the problem lasts for a while, and then drill
> down with dtrace based on what I saw. But each method is going to
> depend on your skill when looking at the problem.
>
> Dave

Dave,

After running clean since my last post, the problem occurred again today. This time I was able to gather some data while it was going on. The only thing that jumps out at me so far is the output of echo ::zio_state | mdb -k.

Under normal operations this usually looks like this:

ADDRESS          TYPE  STAGE            WAITER

ffffff090eb69328 NULL  OPEN             -
ffffff090eb69c88 NULL  OPEN             -

Here are a couple of samples taken while the issue was happening:

ADDRESS          TYPE  STAGE            WAITER

ffffff0bfe8c59b0 NULL  CHECKSUM_VERIFY  ffffff003e2f2c60
ffffff090eb69328 NULL  OPEN             -
ffffff090eb69c88 NULL  OPEN             -

ADDRESS          TYPE  STAGE            WAITER

ffffff09bb12a040 NULL  CHECKSUM_VERIFY  ffffff003d6acc60
ffffff0bfe8c59b0 NULL  CHECKSUM_VERIFY  ffffff003e2f2c60
ffffff090eb69328 NULL  OPEN             -
ffffff090eb69c88 NULL  OPEN             -

Operating under the assumption that the WAITER column references kernel threads, I went looking for those addresses in the thread list.
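
For anyone who wants to repeat that lookup against saved output, here is a minimal sketch in Python. It assumes the ::zio_state output was saved as text in the four-column layout shown above; the function names parse_zio_state and waiting_threads are hypothetical helpers made up for this illustration and are not part of mdb.

```python
import re

# Kernel addresses as mdb prints them, e.g. ffffff0bfe8c59b0
ADDR = re.compile(r"^[0-9a-f]{8,16}$")


def parse_zio_state(text):
    """Return (address, type, stage, waiter) tuples from saved ::zio_state text.

    Assumes four whitespace-separated columns; a waiter of "-" becomes None.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the ADDRESS/TYPE/STAGE/WAITER header row.
        if len(fields) != 4 or not ADDR.match(fields[0]):
            continue
        addr, ztype, stage, waiter = fields
        rows.append((addr, ztype, stage, None if waiter == "-" else waiter))
    return rows


def waiting_threads(rows):
    """Map each waiter address to the (zio address, stage) it is waiting on."""
    return {w: (addr, stage) for addr, _, stage, w in rows if w is not None}


# Second sample from the post, pasted in as saved text.
sample = """\
ADDRESS          TYPE  STAGE            WAITER

ffffff09bb12a040 NULL  CHECKSUM_VERIFY  ffffff003d6acc60
ffffff0bfe8c59b0 NULL  CHECKSUM_VERIFY  ffffff003e2f2c60
ffffff090eb69328 NULL  OPEN             -
ffffff090eb69c88 NULL  OPEN             -
"""

for thread, (zio, stage) in waiting_threads(parse_zio_state(sample)).items():
    print(thread, "waits on zio", zio, "in stage", stage)
```

The waiter addresses it prints are the ones to search for in the saved ::threadlist -v output.
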
Here are the threadlist entries for ffffff003d6acc60 and ffffff003e2f2c60 from the example directly above, taken at about the same time as that output:

ffffff003d6acc60 ffffff0930d8c700 ffffff09172f9de0 2 0 ffffff09bb12a348
PC: _resume_from_idle+0xf1 CMD: zpool-pool0
stack pointer for thread ffffff003d6acc60: ffffff003d6ac360
[ ffffff003d6ac360 _resume_from_idle+0xf1() ]
swtch+0x145()
cv_wait+0x61()
zio_wait+0x5d()
dbuf_read+0x1e8()
dmu_buf_hold+0x93()
zap_get_leaf_byblk+0x56()
zap_deref_leaf+0x78()
fzap_length+0x42()
zap_length_uint64+0x84()
ddt_zap_lookup+0x4b()
ddt_object_lookup+0x6d()
ddt_lookup+0x115()
zio_ddt_free+0x42()
zio_execute+0x8d()
taskq_thread+0x248()
thread_start+8()

ffffff003e2f2c60 fffffffffbc2dbb0 0 0 60 ffffff0bfe8c5cb8
PC: _resume_from_idle+0xf1 THREAD: txg_sync_thread()
stack pointer for thread ffffff003e2f2c60: ffffff003e2f2a40
[ ffffff003e2f2a40 _resume_from_idle+0xf1() ]
swtch+0x145()
cv_wait+0x61()
zio_wait+0x5d()
spa_sync+0x40c()
txg_sync_thread+0x24a()
thread_start+8()

I'm not sure whether any of that sheds light on the problem. I also have a live dump from the period when the problem was happening, a bunch of iostat and mpstat output, and ::arc, ::spa, ::zio_state, and ::threadlist -v output from mdb -k at several points during the issue.

If you have any advice on how to proceed from here in debugging this issue, I'd greatly appreciate it. So you know, I'm generally very comfortable with UNIX, but DTrace and the Solaris kernel are unfamiliar territory.

In any event, thanks again for all the help thus far.

-Charles
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss