   Date: Sun, 3 Apr 2016 09:51:08 -0400
   From: "D'Arcy J.M. Cain" <da...@netbsd.org>

   This led me to the following PR.

   http://gnats.netbsd.org/39016

   There is a bit of discussion and then it was closed with "This
   particular problem has been fixed. Other problems that lead to
   "tstile syndrome" still exist, because "tstile syndrome" is any
   generic deadlock."  It doesn't say what the fix was.  Could this be
   some sort of code regression?

Every mutex in the kernel is supposed to be held for at most some
constant duration.  When someone tries to acquire a mutex that is
already held, it will wait with wchan `tstile'.  There are hundreds or
thousands of different mutexes in any given system -- a bug with any
one of them could manifest that way.

Was your system completely locked up and unresponsive, or just the
services that mattered?  Can you get a stack trace from crash(8) for
the processes that are wedged?  If not, can you enter ddb, e.g. by
typing C-A-ESC, and do it there?

From either crash(8) or ddb, you can list the processes with `show
proc' and get a stack trace for any individual one with `bt 0t<pid>'.
(`0t' is the notation for decimal; ddb reads input as hexadecimal by
default, for whatever reason.)

   I am copying tech-kern as we seem to be getting deeper into the
   kernel.  Replies set there as well.

   Meanwhile I am running the following script, a modification of one
   suggested by Robert Elz.

   ...
           case "${wchan}" in
           tstile*)
               x="`ps -p "${pid}" | grep tstile`"
               if [ "X$x" = "X" ]; then continue; fi
               dt=`date`
               echo "TSTILE: ${dt} $x"
               ;;

If you can get a stack trace out of crash(8), that would be more
helpful.  Maybe something like:

	printf 'bt 0t%d\n' "${pid}" | crash

Usually the culprit is *not* the process or thread that is stuck in
tstile, but that stack trace will help to find what mutex is at issue.
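
To make that concrete, here is a rough, untested sketch of how the wchan check and the crash(8) call might be combined.  The function names and the CRASH variable are made up for illustration, and it assumes crash(8) reads its commands from standard input as in the one-liner above.

```shell
#!/bin/sh
# Sketch only: tstile_pids, trace_pids and CRASH are hypothetical names,
# not part of any existing script or tool.

# Read "pid wchan" pairs on stdin and print the pids whose wchan
# starts with "tstile".
tstile_pids() {
    while read -r pid wchan; do
        case "${wchan}" in
        tstile*) echo "${pid}" ;;
        esac
    done
}

# Read pids on stdin and request a backtrace for each one.
# The "0t" prefix marks the pid as decimal rather than hexadecimal.
# CRASH defaults to crash(8); set CRASH=cat for a dry run that only
# prints the commands that would be sent.
trace_pids() {
    while read -r pid; do
        echo "=== pid ${pid} ==="
        printf 'bt 0t%d\n' "${pid}" | ${CRASH:-crash}
    done
}
```

It could then be driven from ps(1), presumably as root so that crash(8) can read kernel memory:

	ps -ax -o pid= -o wchan= | tstile_pids | trace_pids

The backtraces show where each waiter is blocked; as noted above, the thread actually holding the mutex is usually a different one.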