Re: panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 in 12.0-CURRENT r308447

2016-11-14 Thread Don Lewis
On 13 Nov, Mateusz Guzik wrote:
> On Sat, Nov 12, 2016 at 05:11:57PM -0800, Don Lewis wrote:

>> What protects nc_flag, the lock for the list that it resides on?
>> 
> 
> It is supposed to be the hot list lock and I think this uncovers a bug
> here.
> 
> Consider a NCF_HOTNEGATIVE entry which is being evicted. It sets the
> NCV_DVDROP flag without the lock held, but the entry is still not
> removed from negative lists. So in principle we can either lose the
> newly set flag or the information that hotnegative is unset.
> 
> That said, I think the fix would be to remove from negative entries
> prior to setting the NCV_DVDROP flag.
> 
> Normally the flag is protected by the hotlist lock.
> 
> Untested, but should do the trick:
> 
> --- vfs_cache.c.old   2016-11-13 09:37:50.09600 +0100
> +++ vfs_cache.c.new   2016-11-13 09:39:45.00400 +0100
> @@ -868,6 +868,13 @@
>   nc_get_name(ncp), ncp->nc_neghits);
>   }
>   LIST_REMOVE(ncp, nc_hash);
> + if (!(ncp->nc_flag & NCF_NEGATIVE)) {
> + TAILQ_REMOVE(>nc_vp->v_cache_dst, ncp, nc_dst);
> + if (ncp == ncp->nc_vp->v_cache_dd)
> + ncp->nc_vp->v_cache_dd = NULL;
> + } else {
> + cache_negative_remove(ncp, neg_locked);
> + }
>   if (ncp->nc_flag & NCF_ISDOTDOT) {
>   if (ncp == ncp->nc_dvp->v_cache_dd)
>   ncp->nc_dvp->v_cache_dd = NULL;
> @@ -878,13 +885,6 @@
>   atomic_subtract_rel_long(, 1);
>   }
>   }
> - if (!(ncp->nc_flag & NCF_NEGATIVE)) {
> - TAILQ_REMOVE(>nc_vp->v_cache_dst, ncp, nc_dst);
> - if (ncp == ncp->nc_vp->v_cache_dd)
> - ncp->nc_vp->v_cache_dd = NULL;
> - } else {
> - cache_negative_remove(ncp, neg_locked);
> - }
>   atomic_subtract_rel_long(, 1);
>  }

No obvious regressions with this patch along with a couple of the
assertions that I had previously added.  It survived a 14 hour poudriere
run building my default package set for FreeBSD 10 i386, which paniced
several times earlier.

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 in 12.0-CURRENT r308447

2016-11-13 Thread Don Lewis
The machine managed to survive 21 hours of building FreeBSD 12 ports.
All of the panic happened while building FreeBSD 10 ports, which build
faster.  That may be enough to affect timing and affect the likelyhood
of a race condition triggering the panic.

On 13 Nov, Mateusz Guzik wrote:
> On Sat, Nov 12, 2016 at 05:11:57PM -0800, Don Lewis wrote:
>> This !neg_locked code in cache_negative_remove() looks suspicious to me:
>> 
>> if (!neg_locked) {
>> if (ncp->nc_flag & NCF_HOTNEGATIVE) {
>> hot_locked = true;
>> mtx_lock(_hot.nl_lock);
>> if (!(ncp->nc_flag & NCF_HOTNEGATIVE)) {
>> list_locked = true;
>> mtx_lock(>nl_lock);
>> }
>> } else {
>> list_locked = true;
>> mtx_lock(>nl_lock);
>> }
>> 
>> It looks like you handle NCF_HOTNEGATIVE going away while waiting for
>> the ncneg_hot.nl_lock, but don't handle the possible appearance of
>> NCF_HOTNEGATIVE while waiting for neglist->nl_lock.
>> 
> 
> Promotions from regular to hot are only possible on a hit, which is
> prevented by the caller holding both the vnode and bucket lock. The only
> way to see the entry at this stage is from the shrinker, which can
> demote it and then take all the locks to try to remove it. But it checks
> after locking if the node is still there.
> 
>> What protects nc_flag, the lock for the list that it resides on?
>> 
> 
> It is supposed to be the hot list lock and I think this uncovers a bug
> here.
> 
> Consider a NCF_HOTNEGATIVE entry which is being evicted. It sets the
> NCV_DVDROP flag without the lock held, but the entry is still not
> removed from negative lists. So in principle we can either lose the
> newly set flag or the information that hotnegative is unset.
> 
> That said, I think the fix would be to remove from negative entries
> prior to setting the NCV_DVDROP flag.
> 
> Normally the flag is protected by the hotlist lock.
> 
> Untested, but should do the trick:
> 
> --- vfs_cache.c.old   2016-11-13 09:37:50.09600 +0100
> +++ vfs_cache.c.new   2016-11-13 09:39:45.00400 +0100
> @@ -868,6 +868,13 @@
>   nc_get_name(ncp), ncp->nc_neghits);
>   }
>   LIST_REMOVE(ncp, nc_hash);
> + if (!(ncp->nc_flag & NCF_NEGATIVE)) {
> + TAILQ_REMOVE(>nc_vp->v_cache_dst, ncp, nc_dst);
> + if (ncp == ncp->nc_vp->v_cache_dd)
> + ncp->nc_vp->v_cache_dd = NULL;
> + } else {
> + cache_negative_remove(ncp, neg_locked);
> + }
>   if (ncp->nc_flag & NCF_ISDOTDOT) {
>   if (ncp == ncp->nc_dvp->v_cache_dd)
>   ncp->nc_dvp->v_cache_dd = NULL;
> @@ -878,13 +885,6 @@
>   atomic_subtract_rel_long(, 1);
>   }
>   }
> - if (!(ncp->nc_flag & NCF_NEGATIVE)) {
> - TAILQ_REMOVE(>nc_vp->v_cache_dst, ncp, nc_dst);
> - if (ncp == ncp->nc_vp->v_cache_dd)
> - ncp->nc_vp->v_cache_dd = NULL;
> - } else {
> - cache_negative_remove(ncp, neg_locked);
> - }
>   atomic_subtract_rel_long(, 1);
>  }

I will give this a try.

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 in 12.0-CURRENT r308447

2016-11-13 Thread Mateusz Guzik
On Sat, Nov 12, 2016 at 05:11:57PM -0800, Don Lewis wrote:
> This !neg_locked code in cache_negative_remove() looks suspicious to me:
> 
> if (!neg_locked) {
> if (ncp->nc_flag & NCF_HOTNEGATIVE) {
> hot_locked = true;
> mtx_lock(_hot.nl_lock);
> if (!(ncp->nc_flag & NCF_HOTNEGATIVE)) {
> list_locked = true;
> mtx_lock(>nl_lock);
> }
> } else {
> list_locked = true;
> mtx_lock(>nl_lock);
> }
> 
> It looks like you handle NCF_HOTNEGATIVE going away while waiting for
> the ncneg_hot.nl_lock, but don't handle the possible appearance of
> NCF_HOTNEGATIVE while waiting for neglist->nl_lock.
> 

Promotions from regular to hot are only possible on a hit, which is
prevented by the caller holding both the vnode and bucket lock. The only
way to see the entry at this stage is from the shrinker, which can
demote it and then take all the locks to try to remove it. But it checks
after locking if the node is still there.

> What protects nc_flag, the lock for the list that it resides on?
> 

It is supposed to be the hot list lock and I think this uncovers a bug
here.

Consider a NCF_HOTNEGATIVE entry which is being evicted. It sets the
NCV_DVDROP flag without the lock held, but the entry is still not
removed from negative lists. So in principle we can either lose the
newly set flag or the information that hotnegative is unset.

That said, I think the fix would be to remove from negative entries
prior to setting the NCV_DVDROP flag.

Normally the flag is protected by the hotlist lock.

Untested, but should do the trick:

--- vfs_cache.c.old 2016-11-13 09:37:50.09600 +0100
+++ vfs_cache.c.new 2016-11-13 09:39:45.00400 +0100
@@ -868,6 +868,13 @@
nc_get_name(ncp), ncp->nc_neghits);
}
LIST_REMOVE(ncp, nc_hash);
+   if (!(ncp->nc_flag & NCF_NEGATIVE)) {
+   TAILQ_REMOVE(>nc_vp->v_cache_dst, ncp, nc_dst);
+   if (ncp == ncp->nc_vp->v_cache_dd)
+   ncp->nc_vp->v_cache_dd = NULL;
+   } else {
+   cache_negative_remove(ncp, neg_locked);
+   }
if (ncp->nc_flag & NCF_ISDOTDOT) {
if (ncp == ncp->nc_dvp->v_cache_dd)
ncp->nc_dvp->v_cache_dd = NULL;
@@ -878,13 +885,6 @@
atomic_subtract_rel_long(, 1);
}
}
-   if (!(ncp->nc_flag & NCF_NEGATIVE)) {
-   TAILQ_REMOVE(>nc_vp->v_cache_dst, ncp, nc_dst);
-   if (ncp == ncp->nc_vp->v_cache_dd)
-   ncp->nc_vp->v_cache_dd = NULL;
-   } else {
-   cache_negative_remove(ncp, neg_locked);
-   }
atomic_subtract_rel_long(, 1);
 }
 
-- 
Mateusz Guzik 
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 in 12.0-CURRENT r308447

2016-11-12 Thread Don Lewis
On 12 Nov, Mateusz Guzik wrote:
> On Fri, Nov 11, 2016 at 11:29:48PM -0800, Don Lewis wrote:
>> On 11 Nov, To: freebsd-current@FreeBSD.org wrote:
>> > Yesterday I upgraded my ports builder box from r306349 to r308477.  It
>> > crashed overnight during a poudriere run.
>> > 
>> > panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743
>> 
>> This problem is very reproduceable for me, so I added some assertions to
>> the code to try to track it down:
>> 
>> panic: Bad link elm 0xf802a59cb690 prev->next != elm
>> #13 0x80b0d55a in cache_negative_remove (ncp=0xf802a59cb690, 
>> neg_locked=) at /usr/src/sys/kern/vfs_cache.c:747
>> #14 0x80b0d84a in cache_zap_locked (ncp=0xf802a59cb690, 
>> neg_locked=false) at /usr/src/sys/kern/vfs_cache.c:890
>> #15 0x80b0ab98 in cache_zap_rlocked_bucket (ncp=0xf802a59cb690, 
>> blp=) at /usr/src/sys/kern/vfs_cache.c:1021
>> #16 cache_lookup (dvp=, vpp=, 
>> cnp=, tsp=, ticksp=)
>> at /usr/src/sys/kern/vfs_cache.c:1308
>> #17 0x80b0e97c in vfs_cache_lookup (ap=)
>> at /usr/src/sys/kern/vfs_cache.c:2024
>> #18 0x8103fcea in VOP_LOOKUP_APV (vop=, 
>> a=) at vnode_if.c:127
>> #19 0x80b17ed2 in VOP_LOOKUP (vpp=0xfe085bfa86e8, 
>> cnp=0xfe085bfa8710, dvp=) at ./vnode_if.h:54
>> #20 lookup (ndp=) at /usr/src/sys/kern/vfs_lookup.c:830
>> #21 0x80b1748e in namei (ndp=)
>> at /usr/src/sys/kern/vfs_lookup.c:397
>> #22 0x80b30bc4 in kern_renameat (td=, oldfd=10, 
>> old=0x28e72a59 , 
>> newfd=10, 
>> new=0x28e72601 , 
>> pathseg=) at /usr/src/sys/kern/vfs_syscalls.c:3515
>> #23 0x80ff98a0 in syscallenter (td=0xf801ebfd8500, 
>> sa=)
>> at /usr/src/sys/amd64/ia32/../../kern/subr_syscall.c:135
>> #24 ia32_syscall (frame=0xfe085bfa8a40)
>> at /usr/src/sys/amd64/ia32/ia32_syscall.c:187
>> #25 
>> 
>> 
>> #13 0x80b0d55a in cache_negative_remove (ncp=0xf802a59cb690, 
>> neg_locked=) at /usr/src/sys/kern/vfs_cache.c:747
>> 747  TAILQ_REMOVE(_hot.nl_list, ncp, nc_dst);
>> (kgdb) list
>> 742  } else {
>> 743  mtx_assert(>nl_lock, MA_OWNED);
>> 744  mtx_assert(_hot.nl_lock, MA_OWNED);
>> 745  }
>> 746  if (ncp->nc_flag & NCF_HOTNEGATIVE) {
>> 747  TAILQ_REMOVE(_hot.nl_list, ncp, nc_dst);
>> 748  } else {
>> 749  TAILQ_REMOVE(>nl_list, ncp, nc_dst);
>> 750  }
>> 751  if (list_locked)
>> 
>> (kgdb) print ncp
>> $1 = (struct namecache *) 0xf802a59cb690
>> 
>> (kgdb) print *ncp
>> $2 = {nc_hash = {le_next = 0x0, le_prev = 0xfe0001814178}, nc_src = {
>> le_next = 0xf80415e22290, le_prev = 0xf8048e5597a8}, nc_dst = {
>> tqe_next = 0xf80421744a10, tqe_prev = 0xf80421744a30}, 
>>   nc_dvp = 0xf8048e559760, n_un = {nu_vp = 0x0, nu_neghits = 0}, 
>>   nc_flag = 32 ' ', nc_nlen = 25 '\031', 
>>   nc_name = 0xf802a59cb6d2 "gtk3-icon-browser.desktop"}
>> (kgdb) print ncp->nc_dst->tqe_prev   
>> $8 = (struct namecache **) 0xf80421744a30
>> (kgdb) print *(ncp->nc_dst->tqe_prev)
>> $9 = (struct namecache *) 0x0
>> 
>> 
>> The line number for the failed assertion doesn't make sense.  The value
>> of nc_flag is 32 (0x20), but NCF_HOTNEGATIVE is defined as 0x40 and
>> NCF_NEGATIVE is defined as 0x20.  In any case, the list seems to be
>> messed up.
>> 
> 
> Can you check if the problem persists after reverting r307650 and
> r307685?
> 
> I got another report of trouble with poudriere, but so far can neither
> reproduce nor diagnose what happens here.
> 
> For instance with this crash, in principle there could be a race where
> the entry is demoted and loses the HOTNEGATIVE. But the demotion code
> has the hot list locked and so does this code if it spots the flag.
> What's more, the race is handled by re-checking the flag after the lock
> was taken.
> 
> That said, there is something seriously fishing going on here.

This !neg_locked code in cache_negative_remove() looks suspicious to me:

if (!neg_locked) {
if (ncp->nc_flag & NCF_HOTNEGATIVE) {
hot_locked = true;
mtx_lock(_hot.nl_lock);
if (!(ncp->nc_flag & NCF_HOTNEGATIVE)) {
list_locked = true;
mtx_lock(>nl_lock);
}
} else {
list_locked = true;
mtx_lock(>nl_lock);
}

It looks like you handle NCF_HOTNEGATIVE going away while waiting for
the ncneg_hot.nl_lock, but don't handle the possible appearance of
NCF_HOTNEGATIVE while waiting for neglist->nl_lock.

What protects nc_flag, the lock for the list that it resides on?

I think the only safe thing to do here is to always grab both locks.


My package builder is now up to 18 1/2 hours of runtime 

Re: panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 in 12.0-CURRENT r308447

2016-11-12 Thread Don Lewis
On 12 Nov, Mateusz Guzik wrote:
> On Fri, Nov 11, 2016 at 11:29:48PM -0800, Don Lewis wrote:
>> On 11 Nov, To: freebsd-current@FreeBSD.org wrote:
>> > Yesterday I upgraded my ports builder box from r306349 to r308477.  It
>> > crashed overnight during a poudriere run.
>> > 
>> > panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743
>> 
>> This problem is very reproduceable for me, so I added some assertions to
>> the code to try to track it down:
>> 
>> panic: Bad link elm 0xf802a59cb690 prev->next != elm
>> #13 0x80b0d55a in cache_negative_remove (ncp=0xf802a59cb690, 
>> neg_locked=) at /usr/src/sys/kern/vfs_cache.c:747
>> #14 0x80b0d84a in cache_zap_locked (ncp=0xf802a59cb690, 
>> neg_locked=false) at /usr/src/sys/kern/vfs_cache.c:890
>> #15 0x80b0ab98 in cache_zap_rlocked_bucket (ncp=0xf802a59cb690, 
>> blp=) at /usr/src/sys/kern/vfs_cache.c:1021
>> #16 cache_lookup (dvp=, vpp=, 
>> cnp=, tsp=, ticksp=)
>> at /usr/src/sys/kern/vfs_cache.c:1308
>> #17 0x80b0e97c in vfs_cache_lookup (ap=)
>> at /usr/src/sys/kern/vfs_cache.c:2024
>> #18 0x8103fcea in VOP_LOOKUP_APV (vop=, 
>> a=) at vnode_if.c:127
>> #19 0x80b17ed2 in VOP_LOOKUP (vpp=0xfe085bfa86e8, 
>> cnp=0xfe085bfa8710, dvp=) at ./vnode_if.h:54
>> #20 lookup (ndp=) at /usr/src/sys/kern/vfs_lookup.c:830
>> #21 0x80b1748e in namei (ndp=)
>> at /usr/src/sys/kern/vfs_lookup.c:397
>> #22 0x80b30bc4 in kern_renameat (td=, oldfd=10, 
>> old=0x28e72a59 , 
>> newfd=10, 
>> new=0x28e72601 , 
>> pathseg=) at /usr/src/sys/kern/vfs_syscalls.c:3515
>> #23 0x80ff98a0 in syscallenter (td=0xf801ebfd8500, 
>> sa=)
>> at /usr/src/sys/amd64/ia32/../../kern/subr_syscall.c:135
>> #24 ia32_syscall (frame=0xfe085bfa8a40)
>> at /usr/src/sys/amd64/ia32/ia32_syscall.c:187
>> #25 
>> 
>> 
>> #13 0x80b0d55a in cache_negative_remove (ncp=0xf802a59cb690, 
>> neg_locked=) at /usr/src/sys/kern/vfs_cache.c:747
>> 747  TAILQ_REMOVE(_hot.nl_list, ncp, nc_dst);
>> (kgdb) list
>> 742  } else {
>> 743  mtx_assert(>nl_lock, MA_OWNED);
>> 744  mtx_assert(_hot.nl_lock, MA_OWNED);
>> 745  }
>> 746  if (ncp->nc_flag & NCF_HOTNEGATIVE) {
>> 747  TAILQ_REMOVE(_hot.nl_list, ncp, nc_dst);
>> 748  } else {
>> 749  TAILQ_REMOVE(>nl_list, ncp, nc_dst);
>> 750  }
>> 751  if (list_locked)
>> 
>> (kgdb) print ncp
>> $1 = (struct namecache *) 0xf802a59cb690
>> 
>> (kgdb) print *ncp
>> $2 = {nc_hash = {le_next = 0x0, le_prev = 0xfe0001814178}, nc_src = {
>> le_next = 0xf80415e22290, le_prev = 0xf8048e5597a8}, nc_dst = {
>> tqe_next = 0xf80421744a10, tqe_prev = 0xf80421744a30}, 
>>   nc_dvp = 0xf8048e559760, n_un = {nu_vp = 0x0, nu_neghits = 0}, 
>>   nc_flag = 32 ' ', nc_nlen = 25 '\031', 
>>   nc_name = 0xf802a59cb6d2 "gtk3-icon-browser.desktop"}
>> (kgdb) print ncp->nc_dst->tqe_prev   
>> $8 = (struct namecache **) 0xf80421744a30
>> (kgdb) print *(ncp->nc_dst->tqe_prev)
>> $9 = (struct namecache *) 0x0
>> 
>> 
>> The line number for the failed assertion doesn't make sense.  The value
>> of nc_flag is 32 (0x20), but NCF_HOTNEGATIVE is defined as 0x40 and
>> NCF_NEGATIVE is defined as 0x20.  In any case, the list seems to be
>> messed up.
>> 
> 
> Can you check if the problem persists after reverting r307650 and
> r307685?
> 
> I got another report of trouble with poudriere, but so far can neither
> reproduce nor diagnose what happens here.
> 
> For instance with this crash, in principle there could be a race where
> the entry is demoted and loses the HOTNEGATIVE. But the demotion code
> has the hot list locked and so does this code if it spots the flag.
> What's more, the race is handled by re-checking the flag after the lock
> was taken.
> 
> That said, there is something seriously fishing going on here.

I can try later.  After several crashes in fairly short order, the
machine currently has 11+ hours of uptime while running poudriere.  I
made no kernel changes this time.

The earlier panics seemed to indicate that an entry was on the wrong
list.  cache_negative_zap_one() calls cache_negative_shrink_select(),
which updates ncp and neglist, with the latter locked if ncp is not
NULL.  Later on, it drops the locks and relocks various things including
neglist before calling cache_zap_locked(), which in turn calls
cache_negative_remove().  The latter calls NCP2NEGLIST(ncp) to recompute
neglist, which it finds is not locked.  One of asserts that I added to
cache_negative_zap_one() is a sanity check for the value of neglist
returned by cache_negative_shrink_select().  The neglist value the
latter returns is the index of the neglist array where it found the
entry, not the result of calling NCP2NEGLIST() on the entry.  That
assert fired, indicating 

Re: panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 in 12.0-CURRENT r308447

2016-11-12 Thread Mateusz Guzik
On Fri, Nov 11, 2016 at 11:29:48PM -0800, Don Lewis wrote:
> On 11 Nov, To: freebsd-current@FreeBSD.org wrote:
> > Yesterday I upgraded my ports builder box from r306349 to r308477.  It
> > crashed overnight during a poudriere run.
> > 
> > panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743
> 
> This problem is very reproduceable for me, so I added some assertions to
> the code to try to track it down:
> 
> panic: Bad link elm 0xf802a59cb690 prev->next != elm
> #13 0x80b0d55a in cache_negative_remove (ncp=0xf802a59cb690, 
> neg_locked=) at /usr/src/sys/kern/vfs_cache.c:747
> #14 0x80b0d84a in cache_zap_locked (ncp=0xf802a59cb690, 
> neg_locked=false) at /usr/src/sys/kern/vfs_cache.c:890
> #15 0x80b0ab98 in cache_zap_rlocked_bucket (ncp=0xf802a59cb690, 
> blp=) at /usr/src/sys/kern/vfs_cache.c:1021
> #16 cache_lookup (dvp=, vpp=, 
> cnp=, tsp=, ticksp=)
> at /usr/src/sys/kern/vfs_cache.c:1308
> #17 0x80b0e97c in vfs_cache_lookup (ap=)
> at /usr/src/sys/kern/vfs_cache.c:2024
> #18 0x8103fcea in VOP_LOOKUP_APV (vop=, 
> a=) at vnode_if.c:127
> #19 0x80b17ed2 in VOP_LOOKUP (vpp=0xfe085bfa86e8, 
> cnp=0xfe085bfa8710, dvp=) at ./vnode_if.h:54
> #20 lookup (ndp=) at /usr/src/sys/kern/vfs_lookup.c:830
> #21 0x80b1748e in namei (ndp=)
> at /usr/src/sys/kern/vfs_lookup.c:397
> #22 0x80b30bc4 in kern_renameat (td=, oldfd=10, 
> old=0x28e72a59 , 
> newfd=10, 
> new=0x28e72601 , 
> pathseg=) at /usr/src/sys/kern/vfs_syscalls.c:3515
> #23 0x80ff98a0 in syscallenter (td=0xf801ebfd8500, 
> sa=)
> at /usr/src/sys/amd64/ia32/../../kern/subr_syscall.c:135
> #24 ia32_syscall (frame=0xfe085bfa8a40)
> at /usr/src/sys/amd64/ia32/ia32_syscall.c:187
> #25 
> 
> 
> #13 0x80b0d55a in cache_negative_remove (ncp=0xf802a59cb690, 
> neg_locked=) at /usr/src/sys/kern/vfs_cache.c:747
> 747   TAILQ_REMOVE(_hot.nl_list, ncp, nc_dst);
> (kgdb) list
> 742   } else {
> 743   mtx_assert(>nl_lock, MA_OWNED);
> 744   mtx_assert(_hot.nl_lock, MA_OWNED);
> 745   }
> 746   if (ncp->nc_flag & NCF_HOTNEGATIVE) {
> 747   TAILQ_REMOVE(_hot.nl_list, ncp, nc_dst);
> 748   } else {
> 749   TAILQ_REMOVE(>nl_list, ncp, nc_dst);
> 750   }
> 751   if (list_locked)
> 
> (kgdb) print ncp
> $1 = (struct namecache *) 0xf802a59cb690
> 
> (kgdb) print *ncp
> $2 = {nc_hash = {le_next = 0x0, le_prev = 0xfe0001814178}, nc_src = {
> le_next = 0xf80415e22290, le_prev = 0xf8048e5597a8}, nc_dst = {
> tqe_next = 0xf80421744a10, tqe_prev = 0xf80421744a30}, 
>   nc_dvp = 0xf8048e559760, n_un = {nu_vp = 0x0, nu_neghits = 0}, 
>   nc_flag = 32 ' ', nc_nlen = 25 '\031', 
>   nc_name = 0xf802a59cb6d2 "gtk3-icon-browser.desktop"}
> (kgdb) print ncp->nc_dst->tqe_prev   
> $8 = (struct namecache **) 0xf80421744a30
> (kgdb) print *(ncp->nc_dst->tqe_prev)
> $9 = (struct namecache *) 0x0
> 
> 
> The line number for the failed assertion doesn't make sense.  The value
> of nc_flag is 32 (0x20), but NCF_HOTNEGATIVE is defined as 0x40 and
> NCF_NEGATIVE is defined as 0x20.  In any case, the list seems to be
> messed up.
> 

Can you check if the problem persists after reverting r307650 and
r307685?

I got another report of trouble with poudriere, but so far can neither
reproduce nor diagnose what happens here.

For instance with this crash, in principle there could be a race where
the entry is demoted and loses the HOTNEGATIVE. But the demotion code
has the hot list locked and so does this code if it spots the flag.
What's more, the race is handled by re-checking the flag after the lock
was taken.

That said, there is something seriously fishing going on here.
-- 
Mateusz Guzik 
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 in 12.0-CURRENT r308447

2016-11-11 Thread Don Lewis
On 11 Nov, To: freebsd-current@FreeBSD.org wrote:
> Yesterday I upgraded my ports builder box from r306349 to r308477.  It
> crashed overnight during a poudriere run.
> 
> panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743
> cpuid = 2
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe085c2a4100
> vpanic() at vpanic+0x182/frame 0xfe085c2a4180
> panic() at panic+0x43/frame 0xfe085c2a41e0
> __mtx_assert() at __mtx_assert+0xc1/frame 0xfe085c2a41f0
> cache_negative_remove() at cache_negative_remove+0x65/frame 0xfe085c2a4230
> cache_zap_locked() at cache_zap_locked+0x1ca/frame 0xfe085c2a4250
> cache_enter_time() at cache_enter_time+0x1c20/frame 0xfe085c2a4330
> tmpfs_lookup() at tmpfs_lookup+0x47a/frame 0xfe085c2a4390
> VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0xda/frame 0xfe085c2a43c0
> vfs_cache_lookup() at vfs_cache_lookup+0xd6/frame 0xfe085c2a4420
> VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0xda/frame 0xfe085c2a4450
> lookup() at lookup+0x6a2/frame 0xfe085c2a44e0
> namei() at namei+0x57e/frame 0xfe085c2a45a0
> vn_open_cred() at vn_open_cred+0x25b/frame 0xfe085c2a4710
> kern_openat() at kern_openat+0x25c/frame 0xfe085c2a48a0
> ia32_syscall() at ia32_syscall+0x330/frame 0xfe085c2a4a30
> Xint0x80_syscall() at Xint0x80_syscall+0x95/frame 0xfe085c2a4a30
> --- syscall (499, FreeBSD ELF32, sys_openat), rip = 0x97b2f57, rsp = 
> 0x7698, rbp = 0x76b0 ---
> KDB: enter: panic
> 
> 
> I was able to get a crash dump.  This is the kgdb backtrace:
> 
> #0  __curthread () at ./machine/pcpu.h:222
> #1  doadump (textdump=1546271680) at /usr/src/sys/kern/kern_shutdown.c:298
> #2  0x80396db6 in db_fncall_generic (nargs=0, addr=, 
> rv=, args=)
> at /usr/src/sys/ddb/db_command.c:581
> #3  db_fncall (dummy1=, dummy2=, 
> dummy3=, dummy4=)
> at /usr/src/sys/ddb/db_command.c:629
> #4  0x80396919 in db_command (last_cmdp=, 
> cmd_table=, dopager=)
> at /usr/src/sys/ddb/db_command.c:453
> #5  0x80396674 in db_command_loop ()
> at /usr/src/sys/ddb/db_command.c:506
> #6  0x8039972f in db_trap (type=, code=)
> at /usr/src/sys/ddb/db_main.c:248
> #7  0x80aa0a13 in kdb_trap (type=, 
> code=, tf=)
> at /usr/src/sys/kern/subr_kdb.c:654
> #8  0x80ed8cc4 in trap (frame=0xfe085c2a4030)
> at /usr/src/sys/amd64/amd64/trap.c:537
> #9  
> #10 kdb_enter (why=0x8140e5d1 "panic", 
> msg=0x80 )
> at /usr/src/sys/kern/subr_kdb.c:444
> #11 0x80a5eb1f in vpanic (fmt=, ap=0xfe085c2a41c0)
> at /usr/src/sys/kern/kern_shutdown.c:752
> #12 0x80a5eb83 in panic (fmt=0x81bf1de0  
> "\004")
> at /usr/src/sys/kern/kern_shutdown.c:690
> #13 0x80a40191 in __mtx_assert (c=, 
> what=, file=0xfe085c2a3fe0 "", line=0)
> at /usr/src/sys/kern/kern_mutex.c:937
> #14 0x80b0d3d5 in cache_negative_remove (ncp=0xf8062bb8dee0, 
> neg_locked=true) at /usr/src/sys/kern/vfs_cache.c:743
> #15 0x80b0d87a in cache_zap_locked (ncp=0xf8062bb8dee0, 
> neg_locked=true) at /usr/src/sys/kern/vfs_cache.c:886
> #16 0x80b0d210 in cache_negative_zap_one ()
> at /usr/src/sys/kern/vfs_cache.c:835
> #17 cache_enter_time (dvp=, vp=, 
> cnp=, tsp=, dtsp=)
> at /usr/src/sys/kern/vfs_cache.c:1741
> #18 0x82893b3a in tmpfs_lookup (v=)
> at /usr/src/sys/modules/tmpfs/../../fs/tmpfs/tmpfs_vnops.c:199
> #19 0x8103ee4a in VOP_CACHEDLOOKUP_APV (vop=, 
> a=) at vnode_if.c:195
> #20 0x80b0e9d6 in VOP_CACHEDLOOKUP (dvp=0xf8023f3693b0, 
> vpp=0xfe085c2a47d8, cnp=0xfe085c2a4800) at ./vnode_if.h:80
> #21 vfs_cache_lookup (ap=) at 
> /usr/src/sys/kern/vfs_cache.c:2022
> #22 0x8103ecea in VOP_LOOKUP_APV (vop=, 
> a=) at vnode_if.c:127
> #23 0x80b17df2 in VOP_LOOKUP (vpp=0xfe085c2a47d8, 
> cnp=0xfe085c2a4800, dvp=) at ./vnode_if.h:54
> #24 lookup (ndp=) at /usr/src/sys/kern/vfs_lookup.c:830
> #25 0x80b173ae in namei (ndp=)
> at /usr/src/sys/kern/vfs_lookup.c:397
> #26 0x80b333db in vn_open_cred (ndp=, 
> flagp=0xfe085c2a485c, cmode=0, vn_open_flags=, 
> cred=, fp=0x1d0) at /usr/src/sys/kern/vfs_vnops.c:277
> #27 0x80b2c72c in kern_openat (td=0xf8001d68d000, fd=-100, 
> path=0x2c9e71e8 , 
> pathseg=UIO_USERSPACE, 
> flags=, 
> mode=) at /usr/src/sys/kern/vfs_syscalls.c:1016
> #28 0x80ff88a0 in syscallenter (td=0xf8001d68d000, 
> sa=)
> at /usr/src/sys/amd64/ia32/../../kern/subr_syscall.c:135
> #29 ia32_syscall (frame=0xfe085c2a4a40)
> at /usr/src/sys/amd64/ia32/ia32_syscall.c:187
> #30 
> #31 0x097b2f57 in ?? ()
> Backtrace stopped: Cannot access memory at address 0x7698
> 
> 
> I don't see any obvious issues in the code.

This problem is very reproduceable for me, so I added 

panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 in 12.0-CURRENT r308447

2016-11-11 Thread Don Lewis
Yesterday I upgraded my ports builder box from r306349 to r308477.  It
crashed overnight during a poudriere run.

panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743
cpuid = 2
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe085c2a4100
vpanic() at vpanic+0x182/frame 0xfe085c2a4180
panic() at panic+0x43/frame 0xfe085c2a41e0
__mtx_assert() at __mtx_assert+0xc1/frame 0xfe085c2a41f0
cache_negative_remove() at cache_negative_remove+0x65/frame 0xfe085c2a4230
cache_zap_locked() at cache_zap_locked+0x1ca/frame 0xfe085c2a4250
cache_enter_time() at cache_enter_time+0x1c20/frame 0xfe085c2a4330
tmpfs_lookup() at tmpfs_lookup+0x47a/frame 0xfe085c2a4390
VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0xda/frame 0xfe085c2a43c0
vfs_cache_lookup() at vfs_cache_lookup+0xd6/frame 0xfe085c2a4420
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0xda/frame 0xfe085c2a4450
lookup() at lookup+0x6a2/frame 0xfe085c2a44e0
namei() at namei+0x57e/frame 0xfe085c2a45a0
vn_open_cred() at vn_open_cred+0x25b/frame 0xfe085c2a4710
kern_openat() at kern_openat+0x25c/frame 0xfe085c2a48a0
ia32_syscall() at ia32_syscall+0x330/frame 0xfe085c2a4a30
Xint0x80_syscall() at Xint0x80_syscall+0x95/frame 0xfe085c2a4a30
--- syscall (499, FreeBSD ELF32, sys_openat), rip = 0x97b2f57, rsp = 
0x7698, rbp = 0x76b0 ---
KDB: enter: panic


I was able to get a crash dump.  This is the kgdb backtrace:

#0  __curthread () at ./machine/pcpu.h:222
#1  doadump (textdump=1546271680) at /usr/src/sys/kern/kern_shutdown.c:298
#2  0x80396db6 in db_fncall_generic (nargs=0, addr=, 
rv=, args=)
at /usr/src/sys/ddb/db_command.c:581
#3  db_fncall (dummy1=, dummy2=, 
dummy3=, dummy4=)
at /usr/src/sys/ddb/db_command.c:629
#4  0x80396919 in db_command (last_cmdp=, 
cmd_table=, dopager=)
at /usr/src/sys/ddb/db_command.c:453
#5  0x80396674 in db_command_loop ()
at /usr/src/sys/ddb/db_command.c:506
#6  0x8039972f in db_trap (type=, code=)
at /usr/src/sys/ddb/db_main.c:248
#7  0x80aa0a13 in kdb_trap (type=, 
code=, tf=)
at /usr/src/sys/kern/subr_kdb.c:654
#8  0x80ed8cc4 in trap (frame=0xfe085c2a4030)
at /usr/src/sys/amd64/amd64/trap.c:537
#9  
#10 kdb_enter (why=0x8140e5d1 "panic", 
msg=0x80 )
at /usr/src/sys/kern/subr_kdb.c:444
#11 0x80a5eb1f in vpanic (fmt=, ap=0xfe085c2a41c0)
at /usr/src/sys/kern/kern_shutdown.c:752
#12 0x80a5eb83 in panic (fmt=0x81bf1de0  "\004")
at /usr/src/sys/kern/kern_shutdown.c:690
#13 0x80a40191 in __mtx_assert (c=, 
what=, file=0xfe085c2a3fe0 "", line=0)
at /usr/src/sys/kern/kern_mutex.c:937
#14 0x80b0d3d5 in cache_negative_remove (ncp=0xf8062bb8dee0, 
neg_locked=true) at /usr/src/sys/kern/vfs_cache.c:743
#15 0x80b0d87a in cache_zap_locked (ncp=0xf8062bb8dee0, 
neg_locked=true) at /usr/src/sys/kern/vfs_cache.c:886
#16 0x80b0d210 in cache_negative_zap_one ()
at /usr/src/sys/kern/vfs_cache.c:835
#17 cache_enter_time (dvp=, vp=, 
cnp=, tsp=, dtsp=)
at /usr/src/sys/kern/vfs_cache.c:1741
#18 0x82893b3a in tmpfs_lookup (v=)
at /usr/src/sys/modules/tmpfs/../../fs/tmpfs/tmpfs_vnops.c:199
#19 0x8103ee4a in VOP_CACHEDLOOKUP_APV (vop=, 
a=) at vnode_if.c:195
#20 0x80b0e9d6 in VOP_CACHEDLOOKUP (dvp=0xf8023f3693b0, 
vpp=0xfe085c2a47d8, cnp=0xfe085c2a4800) at ./vnode_if.h:80
#21 vfs_cache_lookup (ap=) at /usr/src/sys/kern/vfs_cache.c:2022
#22 0x8103ecea in VOP_LOOKUP_APV (vop=, 
a=) at vnode_if.c:127
#23 0x80b17df2 in VOP_LOOKUP (vpp=0xfe085c2a47d8, 
cnp=0xfe085c2a4800, dvp=) at ./vnode_if.h:54
#24 lookup (ndp=) at /usr/src/sys/kern/vfs_lookup.c:830
#25 0x80b173ae in namei (ndp=)
at /usr/src/sys/kern/vfs_lookup.c:397
#26 0x80b333db in vn_open_cred (ndp=, 
flagp=0xfe085c2a485c, cmode=0, vn_open_flags=, 
cred=, fp=0x1d0) at /usr/src/sys/kern/vfs_vnops.c:277
#27 0x80b2c72c in kern_openat (td=0xf8001d68d000, fd=-100, 
path=0x2c9e71e8 , 
pathseg=UIO_USERSPACE, 
flags=, 
mode=) at /usr/src/sys/kern/vfs_syscalls.c:1016
#28 0x80ff88a0 in syscallenter (td=0xf8001d68d000, 
sa=)
at /usr/src/sys/amd64/ia32/../../kern/subr_syscall.c:135
#29 ia32_syscall (frame=0xfe085c2a4a40)
at /usr/src/sys/amd64/ia32/ia32_syscall.c:187
#30 
#31 0x097b2f57 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7698


I don't see any obvious issues in the code.

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"