[ Adding tech-kern. The relevant earlier mails start at
http://mail-index.netbsd.org/current-users/2015/10/19/msg028233.html
This is about a default-installed amd64 GENERIC 7.0 kernel.
Replies are better in tech-kern, I think, so I set Reply-To
accordingly. ]
On Fri 23 Oct 2015 at 00:46:57 +0200, Rhialto wrote:
> This problem is very repeatable, usually within a few hours, just now it
> happened within half an hour.
>
> It seems to me that somehow the nfs_reqq list gets corrupted. Then
> either there is a crash when traversing it in nfs_timer() (occurring in
> nfs_sigintr() due to being called with a bogus pointer), or there is a
> hang when one of the NFS requests gets lost and never retried.
Looking into this: the occurrences of nfs_reqq are as follows:
fs/nfs/client/nfs_clvnops.c: * nfs_reqq_mtx : Global lock, protects the nfs_reqq list.
Since there is no other mention of nfs_reqq_mtx in the whole syssrc
tarball, this comment looks wrong. It also immediately raises the
suspicion that the list isn't in fact protected at all.
nfs/nfs.h:extern TAILQ_HEAD(nfsreqhead, nfsreq) nfs_reqq;
nfs/nfs_clntsocket.c: TAILQ_FOREACH(rep, &nfs_reqq, r_chain) {
nfs/nfs_clntsocket.c: TAILQ_INSERT_TAIL(&nfs_reqq, rep, r_chain);
nfs/nfs_clntsocket.c: TAILQ_REMOVE(&nfs_reqq, rep, r_chain);
Matches #2 and #3 are protected with
s = splsoftnet();
but match #1 does not seem to be protected by anything I can see
nearby. Maybe it is covered by
error = nfs_rcvlock(nmp, myrep);
if that makes any sense.
That function definitely uses neither splsoftnet() nor
mutex_enter(softnet_lock).
nfs/nfs_socket.c:struct nfsreqhead nfs_reqq;
nfs/nfs_socket.c: TAILQ_FOREACH(rp, &nfs_reqq, r_chain) {
nfs/nfs_socket.c: TAILQ_FOREACH(rep, &nfs_reqq, r_chain) {
Match #3 is protected with
mutex_enter(softnet_lock); /* XXX PR 40491 */
but neither of the others is (visibly, nearby).
#2 is called from nfs_receive(), which uses nfs_sndlock(), which also
uses neither splsoftnet() nor mutex_enter(softnet_lock).
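As a loose user-space analogy for the concern above (not kernel code):
two code paths that guard the same data with different mechanisms do
not exclude each other. The sketch below uses two hypothetical pthread
mutexes, lock_a and lock_b, standing in for two unrelated protection
schemes. splsoftnet() is not a mutex, so this only illustrates the
general principle and makes no claim about what the kernel does.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical locks standing in for two unrelated protection schemes. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/*
 * Returns 1 if lock_b can be acquired while lock_a is held, i.e. the
 * two locks provide no mutual exclusion against each other.
 */
int different_locks_overlap(void)
{
    int overlap;

    pthread_mutex_lock(&lock_a);
    overlap = (pthread_mutex_trylock(&lock_b) == 0);
    if (overlap)
        pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return overlap;
}

/*
 * Returns 1 if a second acquisition attempt on the same lock is
 * refused with EBUSY, i.e. a single shared lock does exclude.
 */
int same_lock_excludes(void)
{
    int rc;

    pthread_mutex_lock(&lock_a);
    rc = pthread_mutex_trylock(&lock_a);
    if (rc == 0)
        pthread_mutex_unlock(&lock_a); /* not expected for a normal mutex */
    pthread_mutex_unlock(&lock_a);
    return rc == EBUSY;
}
```

Both functions are expected to return 1: holding lock_a does nothing to
stop someone who only takes lock_b, while a second taker of lock_a is
turned away.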
nfs/nfs_subs.c: TAILQ_INIT(&nfs_reqq);
presumably doesn't need any extra protection.
softnet_lock is allocated as
./kern/uipc_socket.c:kmutex_t *softnet_lock;
./kern/uipc_socket.c: softnet_lock = mutex_obj_alloc(MUTEX_DEFAULT, IPL_NONE);
IPL_NONE seems inconsistent with splsoftnet().
I have never studied the inner details of kernel locking, but the
diversity of protections of this list doesn't inspire trust at first
sight...
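For comparison, a minimal user-space sketch of what consistent
protection could look like: every insert, remove and traversal of a
TAILQ goes through one and the same lock. This uses pthreads and
<sys/queue.h>, not the kernel primitives; struct req, reqq, reqq_mtx
and the req_*() helpers are hypothetical names for illustration, not
the actual NFS code or a proposed patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical request record; only the list linkage matters here. */
struct req {
    int id;
    TAILQ_ENTRY(req) r_chain;
};

static TAILQ_HEAD(reqhead, req) reqq = TAILQ_HEAD_INITIALIZER(reqq);
static pthread_mutex_t reqq_mtx = PTHREAD_MUTEX_INITIALIZER;

void req_enqueue(struct req *rep)
{
    pthread_mutex_lock(&reqq_mtx);
    TAILQ_INSERT_TAIL(&reqq, rep, r_chain);
    pthread_mutex_unlock(&reqq_mtx);
}

void req_dequeue(struct req *rep)
{
    pthread_mutex_lock(&reqq_mtx);
    TAILQ_REMOVE(&reqq, rep, r_chain);
    pthread_mutex_unlock(&reqq_mtx);
}

/* Traverses the list under the same lock and returns its length. */
int req_count(void)
{
    struct req *rep;
    int n = 0;

    pthread_mutex_lock(&reqq_mtx);
    TAILQ_FOREACH(rep, &reqq, r_chain)
        n++;
    pthread_mutex_unlock(&reqq_mtx);
    return n;
}

static void *worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < 1000; i++) {
        struct req *rep = malloc(sizeof(*rep));

        if (rep == NULL)
            abort();
        rep->id = i;
        req_enqueue(rep);
        (void)req_count();  /* concurrent traversal, like a timer walk */
        req_dequeue(rep);
        free(rep);
    }
    return NULL;
}

/* Runs nthreads workers concurrently; returns the final list length. */
int run_workers(int nthreads)
{
    pthread_t tid[16];
    int i;

    assert(nthreads > 0 && nthreads <= 16);
    for (i = 0; i < nthreads; i++)
        if (pthread_create(&tid[i], NULL, worker, NULL) != 0)
            abort();
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return req_count();
}
```

Calling run_workers(4) should leave the list empty. If the lock were
dropped from any one of the three helpers, the same test could corrupt
the list links, which is the kind of failure suspected above.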
-Olaf.
--
___ Olaf 'Rhialto' Seibert -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl -- 'this bath is too hot.'