Re: NFS related panics and hangs

2015-11-05 Thread J. Hannken-Illjes
On 05 Nov 2015, at 21:48, Rhialto  wrote:


> Looking into this:
> 
> the occurrences of nfs_reqq are as follows:
> 
> fs/nfs/client/nfs_clvnops.c: * nfs_reqq_mtx : Global lock, protects the 
> nfs_reqq list.
> 
> Since there is no other mention of nfs_reqq_mtx in the whole syssrc
> tarball, this looks wrong.  It also immediately causes the suspicion
> that the list isn't in fact protected at all.

This file (fs/nfs/client/nfs_clvnops.c) is part of a second (dead) nfs
implementation from FreeBSD.  It is not part of any kernel.

Our nfs lives in sys/nfs.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: NFS related panics and hangs

2015-11-05 Thread Rhialto
On Thu 05 Nov 2015 at 22:30:51 +0100, J. Hannken-Illjes wrote:
> This file (fs/nfs/client/nfs_clvnops.c) is part of a second (dead) nfs
> implementation from FreeBSD.  It is not part of any kernel.
> 
> Our nfs lives in sys/nfs.

Ok, why is it included in syssrc.tgz then?
I'd say it should not be there.

My other observations still stand, it seems, since they concern files in
sys/nfs.

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'


signature.asc
Description: PGP signature


Re: NFS related panics and hangs

2015-11-05 Thread Rhialto
[ Adding tech-kern. The relevant earlier mails start at
  http://mail-index.netbsd.org/current-users/2015/10/19/msg028233.html
  This is about a default-installed amd64 GENERIC 7.0 kernel.
  Replies are better in tech-kern, I think, so I set Reply-To
  accordingly.  ]


On Fri 23 Oct 2015 at 00:46:57 +0200, Rhialto wrote:
> This problem is very repeatable, usually within a few hours, just now it
> happened within half an hour.
> 
> It seems to me that somehow the nfs_reqq list gets corrupted. Then
> either there is a crash when traversing it in nfs_timer() (occurring in
> nfs_sigintr() due to being called with a bogus pointer), or there is a
> hang when one of the NFS requests gets lost and never retried.

Looking into this:

the occurrences of nfs_reqq are as follows:

fs/nfs/client/nfs_clvnops.c: * nfs_reqq_mtx : Global lock, protects the 
nfs_reqq list.

Since there is no other mention of nfs_reqq_mtx in the whole syssrc
tarball, this looks wrong.  It also immediately causes the suspicion
that the list isn't in fact protected at all.

nfs/nfs.h:extern TAILQ_HEAD(nfsreqhead, nfsreq) nfs_reqq;

nfs/nfs_clntsocket.c: TAILQ_FOREACH(rep, _reqq, r_chain) {
nfs/nfs_clntsocket.c: TAILQ_INSERT_TAIL(_reqq, rep, r_chain);
nfs/nfs_clntsocket.c: TAILQ_REMOVE(_reqq, rep, r_chain);

Protected with

s = splsoftnet();

for match #2 and #3 but #1 seems not protected by anything I can see
nearby. Maybe it is

error = nfs_rcvlock(nmp, myrep);

if that makes any sense.
That function definitely does not use either splsoftnet() OR
mutex_enter(softnet_lock).

nfs/nfs_socket.c:struct nfsreqhead nfs_reqq;
nfs/nfs_socket.c: TAILQ_FOREACH(rp, _reqq, r_chain) {
nfs/nfs_socket.c: TAILQ_FOREACH(rep, _reqq, r_chain) {

match #3 is protected with

mutex_enter(softnet_lock);  /* XXX PR 40491 */

but none of the others (visibly nearby).

#2 is called from nfs_receive() which uses nfs_sndlock() which also
doesn't use either splsoftnet() OR mutex_enter(softnet_lock).

nfs/nfs_subs.c:   TAILQ_INIT(_reqq);

presumably doesn't need any extra protection.

softnet_lock is allocated as

./kern/uipc_socket.c:kmutex_t   *softnet_lock;
./kern/uipc_socket.c:   softnet_lock = mutex_obj_alloc(MUTEX_DEFAULT, IPL_NONE);

IPL_NONE seems inconsistent with splsoftnet().

I never studied the inner details of kernel locking, but the diversity
of protections of this list doesn't inspire trust at first sight...

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'


signature.asc
Description: PGP signature


Re: NFS related panics and hangs

2015-11-05 Thread David Holland
On Thu, Nov 05, 2015 at 10:46:17PM +0100, Rhialto wrote:
 > > This file (fs/nfs/client/nfs_clvnops.c) is part of a second (dead) nfs
 > > implementation from FreeBSD.  It is not part of any kernel.
 > > 
 > > Our nfs lives in sys/nfs.
 > 
 > Ok, why is it included in syssrc.tgz then?
 > I'd say it should not be there.
 > 
 > My other observations still stand, it seems, since they concern files in
 > sys/nfs.

Because migrating to that nfs is the long-term plan because it'll get
us nfsv4.

(anyone who wants to help work on this, please help...)


-- 
David A. Holland
dholl...@netbsd.org