Remco [re...@d-compu.dyndns.org] wrote:
> Chris Cappuccio wrote:
>
> > here is the key error message. it means your whole ahci disk has
> > disappeared (and anything you can still run is happening from cache.)
> >
> > --
> > ahci0: stopping the port, softreset slot 31 was still active.
> > ahci0: failed to reset port during timeout handling, disabling it
> > --
> >
> > likely a reboot will fix it. this is a known problem with ahci driver and
> > intel ahci controllers.
>
> I am not so sure this is a driver problem.
>
> I think I accidentally "emulated" this problem the other day on my desktop
> system (not a 6501):
> Nov 28 16:38:44 ws0001 /bsd: ahci1: stopping the port, softreset slot 31 was
> still active.
> Nov 28 16:38:44 ws0001 /bsd: ahci1: failed to reset port during timeout
> handling, disabling it
>
> I have this external drive bay connected through e-SATA. After unmounting
> the drive I switched off the external drive's power. Running disklabel on
> the drive resulted in the above failures, which I guess makes sense, after
> all, I made the drive "disappear".
>
i "emulated" it with softraid on intel ahci, plus a ridiculously heavy disk
load with load averages above 20 for 24 hours per day, and got virtually the
same error. it went away with softraid in ide mode. (sounds like my problem and
yours were totally different.)

the online git history for dragonfly's ahci driver has some interesting things
that we might want to pay attention to:
http://gitweb.dragonflybsd.org/dragonfly.git/history/HEAD:/sys/dev/disk/ahci/ahci.c
but here are a few of dfly's interesting commits (in the context of softreset
errors and other various bugfixes):
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/5f8c1efd092f1039f63ce166e87fce744882a390
"* The softreset code did not properly initialize ccb_xa.flags, causing
the softreset FIS's to sometimes get queued as an NCQ command instead of
as a non-NCQ command.
* Make ahci_poll() a bit more robust. Properly set ccb_xa.state on
timeout, check for unexpected completions, and check to see if the
ccb was put on a queue (though the latter should never happen since
active/sactive is cleared by ahci_get_err_ccb())."
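
to make the first point concrete, here is a minimal sketch (not dfly's code and not ours) of how a recycled ccb with stale flags could get a softreset FIS treated as NCQ, and how setting ccb_xa.flags explicitly avoids it. the struct layouts, the XA_F_NCQ bit and both function names are made up for illustration; only the ccb_xa.flags field name comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical flag bit, not taken from any real header */
#define XA_F_NCQ 0x01u

/* stripped-down stand-ins for the driver structures */
struct xa_stub {
	unsigned int flags;
};

struct ccb_stub {
	struct xa_stub ccb_xa;
};

/* how an issue path might choose between NCQ and non-NCQ queueing */
static int
issued_as_ncq(const struct ccb_stub *ccb)
{
	assert(ccb != NULL);
	return (ccb->ccb_xa.flags & XA_F_NCQ) != 0;
}

/*
 * prepare a ccb for a softreset FIS. the ccb may have carried an NCQ
 * transfer before, so set flags explicitly instead of trusting
 * whatever the previous user left behind.
 */
static void
prep_softreset_ccb(struct ccb_stub *ccb)
{
	assert(ccb != NULL);
	ccb->ccb_xa.flags = 0;
}
```

without the explicit assignment in prep_softreset_ccb(), a ccb last used for an NCQ command would still report issued_as_ncq() as true, which is the kind of misqueueing that commit describes.
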
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b089d0bfa66d195e24630869579f0cbc1a48e459
"* Add a small delay after sending the RESET FIS in softreset before
sending the second FIS.
* Add a small delay after the device successfully unbusies before starting
normal commands."
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/cf5f3a81b0df75cd4844c3c89f008e209d59b218
"* Change the reset sequence. If the first hardreset fails do a second
hardreset. If that fails then try doing a softreset. This seems to
catch all the cases. It is unclear why the reset sequence fails at
random points but it seems to be a combination of the port command
processor state and the device state. COMRESET does not actually reset
everything like it's supposed to.
* Temporarily set ap_state to AP_S_NORMAL when starting a reset
sequence so commands do not just fail due to a previously failed
condition on the port.
* Restoration of command register state now depends on whether the
reset succeeded or failed.
* Note that only SERR_DIAG_X needs to be cleared to allow for the
next TFD update. These updates are serialized by the controller
and there may be more than one. Add a function ahci_flush_tfd() which
flushes all of them.
* Add ahci_port_hardstop() for dealing with failed ports and device
removals, instead of using ahci_port_hardreset(). This function
tries to do multiple transitions via section 10.10.1. These
transitions are not well documented by the standard.
* Fix ahci_poll() to not queue a command if the port is in a failed
state, as this really messes up our port processing state machine."
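
the reset ordering in the first bullet (hardreset, second hardreset, then softreset) can be sketched as below. this is only an illustration of the fallback order against a simulated port, assuming each reset returns 0 on success; struct fake_port, fake_hardreset(), fake_softreset() and reset_with_fallback() are hypothetical names and do not exist in either driver.

```c
#include <assert.h>
#include <stddef.h>

/* simulated port with scripted reset outcomes (hypothetical) */
struct fake_port {
	int hard_fail_left;	/* hardresets that fail before one works */
	int soft_ok;		/* nonzero if a softreset would succeed */
	int hard_calls;		/* number of hardresets attempted */
	int soft_calls;		/* number of softresets attempted */
};

/* returns 0 on success, -1 on failure */
static int
fake_hardreset(struct fake_port *p)
{
	p->hard_calls++;
	if (p->hard_fail_left > 0) {
		p->hard_fail_left--;
		return -1;
	}
	return 0;
}

/* returns 0 on success, -1 on failure */
static int
fake_softreset(struct fake_port *p)
{
	p->soft_calls++;
	return p->soft_ok ? 0 : -1;
}

/*
 * try a hardreset, then a second hardreset, then a softreset.
 * returns 1, 2 or 3 for the attempt that succeeded, 0 if all failed.
 */
static int
reset_with_fallback(struct fake_port *p)
{
	assert(p != NULL);
	if (fake_hardreset(p) == 0)
		return 1;
	if (fake_hardreset(p) == 0)
		return 2;
	if (fake_softreset(p) == 0)
		return 3;
	return 0;
}
```

a return of 0 is where a real driver would presumably give up on the port, which matches the "disabling it" message quoted at the top of this thread.
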
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/8bf6a33f591ca1c66a0a44d51c5535c5cb1b
"Not only does the DHRS interrupt not stop command processing (which means
that ahci_pm_read() needs to be single-threaded by the way, which we only do
by happenstance atm), but we were stopping and starting the port without
reloading commands in-progress."
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/dbef6246a8d336160c284860369a8a39d4902440
"* The IS register was not being properly masked for the fall-through."
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/4c339a5f388e695f1464ef583b1d0f20b03c5233
"* Stopping a port with ahci_port_stop() is problematic if the port has not
already been stopped by the command processor (CR is inactive), because
command completions can race our saved CI register, leading to
double-issues. This creates issues with both NCQ and FBSS support.
Change the timeout code to idle the port by allowing commands to
complete normally until the only commands remaining are expired.
Then the port can be safely stopped.
The timeout code also no longer perform