On Fri, Jul 29, 2011 at 08:30:28PM +0000, Alexander Motin wrote: > Author: mav > Date: Fri Jul 29 20:30:28 2011 > New Revision: 224496 > URL: http://svn.freebsd.org/changeset/base/224496 > > Log: > In some cases failed SATA disks may report their presence, but don't > respond to any commands. I've found that because of multiple command > retries, each of which cause 30s timeout, bus reset and another retry or > requeue for many commands, it may take ages to eventually drop the > failed device. The odd thing is that those retries continue even after > XPT considered device as dead and invalidated it. > > This patch makes cam_periph_error() to block any command retries after > periph was marked as invalid. With that patch all activity completes in > 1-2 minutes, just after several timeouts, required to consider device > death. This should make ZFS, gmirror, graid, etc. operation more robust. > > Reviewed by: mjacob@ on scsi@ > > Approved by: re (kib) > > Modified: > head/sys/cam/cam_periph.c Amusingly, this commit makes my test machine to not boot. This is Ibex Peak PCH, with two SATA disks on the channels 0 and 1.
It seems that geom thread 100012 owns GEOM topology lock, while sleeping in adaclose->cam_periph_getccb() : db> bt 100012 Tracing pid 12 tid 100012 td 0xfffffe00028a2000 sched_switch() at 0xffffffff8034a0c7 = sched_switch+0x157 mi_switch() at 0xffffffff803291fb = mi_switch+0x2eb sleepq_switch() at 0xffffffff803631f3 = sleepq_switch+0x123 sleepq_wait() at 0xffffffff80363eed = sleepq_wait+0x4d _sleep() at 0xffffffff80329b59 = _sleep+0x3b9 cam_periph_getccb() at 0xffffffff817ffc50 = cam_periph_getccb+0xa0 adaclose() at 0xffffffff8182c484 = adaclose+0xc4 g_disk_access() at 0xffffffff802bea74 = g_disk_access+0x1e4 g_access() at 0xffffffff802c519a = g_access+0x1ba g_dev_attrchanged() at 0xffffffff802bd1f6 = g_dev_attrchanged+0x96 g_dev_taste() at 0xffffffff802bd574 = g_dev_taste+0x284 g_new_provider_event() at 0xffffffff802c4ecd = g_new_provider_event+0xad g_run_events() at 0xffffffff802c0750 = g_run_events+0x250 fork_exit() at 0xffffffff802f0d99 = fork_exit+0x189 fork_trampoline() at 0xffffffff804ee3be = fork_trampoline+0xe --- trap 0, rip = 0, rsp = 0xffffff800025fd00, rbp = 0 --- (gdb) list *cam_periph_getccb+0xa0 0x1c50 is in cam_periph_getccb (/usr/home/kostik/work/build/bsd/DEV/src/sys/modules/cam/../../cam/cam_periph.c:883). 882 883 while (SLIST_FIRST(&periph->ccb_list) == NULL) { 884 if (periph->immediate_priority > priority) Reverting the rev. or not loading ahci.ko allows machine to boot.
pgplNNdHYGW37.pgp
Description: PGP signature