On Fri, Jul 29, 2011 at 08:30:28PM +0000, Alexander Motin wrote:
> Author: mav
> Date: Fri Jul 29 20:30:28 2011
> New Revision: 224496
> URL: http://svn.freebsd.org/changeset/base/224496
> 
> Log:
>   In some cases failed SATA disks may report their presence, but don't
>   respond to any commands. I've found that because of multiple command
>   retries, each of which cause 30s timeout, bus reset and another retry or
>   requeue for many commands, it may take ages to eventually drop the
>   failed device. The odd thing is that those retries continue even after
>   XPT considered device as dead and invalidated it.
>   
>   This patch makes cam_periph_error() to block any command retries after
>   periph was marked as invalid. With that patch all activity completes in
>   1-2 minutes, just after several timeouts, required to consider device
>   death. This should make ZFS, gmirror, graid, etc. operation more robust.
>   
>   Reviewed by:        mjacob@ on scsi@
>   
>   Approved by:        re (kib)
> 
> Modified:
>   head/sys/cam/cam_periph.c
Amusingly, this commit prevents my test machine from booting.
The machine has an Ibex Peak PCH, with two SATA disks on channels 0 and 1.

It seems that geom thread 100012 holds the GEOM topology lock while
sleeping in adaclose() -> cam_periph_getccb():

db> bt 100012
Tracing pid 12 tid 100012 td 0xfffffe00028a2000
sched_switch() at 0xffffffff8034a0c7 = sched_switch+0x157
mi_switch() at 0xffffffff803291fb = mi_switch+0x2eb
sleepq_switch() at 0xffffffff803631f3 = sleepq_switch+0x123
sleepq_wait() at 0xffffffff80363eed = sleepq_wait+0x4d
_sleep() at 0xffffffff80329b59 = _sleep+0x3b9
cam_periph_getccb() at 0xffffffff817ffc50 = cam_periph_getccb+0xa0
adaclose() at 0xffffffff8182c484 = adaclose+0xc4
g_disk_access() at 0xffffffff802bea74 = g_disk_access+0x1e4
g_access() at 0xffffffff802c519a = g_access+0x1ba
g_dev_attrchanged() at 0xffffffff802bd1f6 = g_dev_attrchanged+0x96
g_dev_taste() at 0xffffffff802bd574 = g_dev_taste+0x284
g_new_provider_event() at 0xffffffff802c4ecd = g_new_provider_event+0xad
g_run_events() at 0xffffffff802c0750 = g_run_events+0x250
fork_exit() at 0xffffffff802f0d99 = fork_exit+0x189
fork_trampoline() at 0xffffffff804ee3be = fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff800025fd00, rbp = 0 ---

(gdb) list *cam_periph_getccb+0xa0
0x1c50 is in cam_periph_getccb (/usr/home/kostik/work/build/bsd/DEV/src/sys/modules/cam/../../cam/cam_periph.c:883).
882
883             while (SLIST_FIRST(&periph->ccb_list) == NULL) {
884                     if (periph->immediate_priority > priority)

Reverting the revision or not loading ahci.ko allows the machine to boot.
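
For reference, the policy described in the quoted log (no further retries
once the periph has been marked invalid) can be modelled roughly as below.
This is only a sketch and not the code from cam_periph.c: the struct, the
flag and the function name are hypothetical, and the retry counter is an
assumption made for illustration.

```c
#include <stdbool.h>

/* Hypothetical flag and struct for illustration; not the real CAM definitions. */
#define SKETCH_PERIPH_INVALID	0x01u

struct sketch_periph {
	unsigned int flags;	/* state bits, e.g. SKETCH_PERIPH_INVALID */
};

/*
 * Decide whether a failed command may be retried.
 * An invalidated periph never retries, regardless of the remaining budget;
 * a valid periph retries only while retries_left is positive.
 */
static bool
sketch_may_retry(const struct sketch_periph *p, int retries_left)
{
	if (p->flags & SKETCH_PERIPH_INVALID)
		return (false);
	return (retries_left > 0);
}
```

With this model, a command failing on a valid periph is retried until its
budget runs out, while the same failure on an invalidated periph is
completed with an error at once.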
