Re: CVS commit: src/sys/netinet

2017-12-25 Thread Tom Ivar Helbekkmo
Ryota Ozaki  writes:

> One possible fix has been committed.
>
> Can you update the source code and try a new kernel?

Will do.

Meanwhile, before I got around to building a kernel with debug options
enabled, I had another hang.  Got it into DDB successfully, but then I
managed, while looking for a way to switch between CPUs (the man page is
wrong in this respect), to get DDB to look at something it shouldn't
have, so it just said "fatal pag", and that was that.

I did get a backtrace of CPU 0, though, and it looked interesting:

pool_catchup()
pool_get()
pool_cache_get_slow()
pool_cache_get_paddr()
m_get()
m_gethdr()
wm_add_rxbuf()
wm_rxeof()
wm_intr_legacy()
intr_biglock_wrapper()
Xintr_ioapic_level2()
--- interrupt ---
Xspllower()
uvm_km_kmem_alloc()
pool_page_alloc()
pool_grow()
pool_catchup()
pool_get()
pool_cache_get_slow()
pool_cache_get_paddr()
m_get()
m_gethdr()
tcp_output()
tcp_send_wrapper()
sosend()
soo_write()
dofilewrite()
sys_write()
syscall()
--- syscall (number 4) ---

-tih
-- 
Most people who graduate with CS degrees don't understand the significance
of Lisp.  Lisp is the most important idea in computer science.  --Alan Kay


Re: CVS commit: src/sys/netinet6

2017-12-25 Thread Paul Goyette

On Tue, 26 Dec 2017, Ryota Ozaki wrote:


Well, since the lock _might_ be released (and subsequently reacquired)
by callout_halt(), it might be easiest to modify all the callers to
just unlock it before calling nd6_dad_stoptimer(), and reacquire the
mutex after it returns (as well as re-establishing any values that
might have changed while the mutex was released).

The callers needs to be prepared for the release/reacquire situation
anyway, so the change should not be significant.  As noted in the
callout_halt(9) man page

   ...to avoid race conditions the caller should always assume that
   [the] interlock has been released and reacquired, and act
   accordingly.


Alternatively, you could modify all the callers to always acquire the
mutex before calling nd6_dad_stoptimer().


I mean such changes are tough and mess up many codes which we don't hope.


Yes, the changes are not trivial.  But the first option above should 
already have been done when you changed from callout_stop() to _halt().



But making a run-time decision with mutex_owned() is not a good idea.


If it must be respected we can back to callout_stop because it's unsafe
but NetBSD 7 uses it without any issues; using callout_stop in
nd6_dad_stoptimer might be safe somehow (if not NET_MPSAFE).


IMHO, we definitely do not want to use mutex_owned() in this way.

I do not believe that going backwards to callout_stop() is the right 
approach.


Again IMHO, the _right_ thing to do is to continue using callout_halt() 
and modify the callers.  Yes, it might be a lot of work, and it might 
initially introduce new errors, but in the long run it is the correct 
approach.  IMHO.



+--+--++
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
| (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+--+--++


Re: CVS commit: src/sys/netinet6

2017-12-25 Thread Ryota Ozaki
On Tue, Dec 26, 2017 at 12:37 PM, Paul Goyette  wrote:
> On Tue, 26 Dec 2017, Ryota Ozaki wrote:
>
>> On Tue, Dec 26, 2017 at 11:35 AM, Paul Goyette  wrote:
>>>
>>>
 To generate a diff of this commit:
>>>
>>>
>>>
>>> # cvs rdiff -u -r1.139 -r1.140 src/sys/netinet6/nd6_nbr.c
>>> @@ -1097,7 +1097,11 @@ nd6_dad_stoptimer(struct dadq *dp)
>>>  #ifdef NET_MPSAFE
>>> callout_halt(>dad_timer_ch, NULL);
>>>  #else
>>> -   callout_halt(>dad_timer_ch, softnet_lock);
>>> +   /* XXX still need the trick for softnet_lock */
>>> +   if (mutex_owned(softnet_lock))
>>> +   callout_halt(>dad_timer_ch, softnet_lock);
>>> +   else
>>> +   callout_halt(>dad_timer_ch, NULL);
>>>  #endif
>>>  }
>>>
>>> This goes against the restriction noted in the mutex(9) man page:
>>>
>>>[mutex_owned()] should not be used to make locking decisions at run
>>>time.  ...
>>>
>>> Please find a different way to make this run-time decision.
>>
>>
>> I know the restriction well, but for softnet_lock, following the
>> restriction
>> is sometimes hard. I don't have an option to statically decide if
>> softnet_lock is held or not without messing up many codes.
>>
>> An option we can have here is to give up using callout_halt and use
>> callout_stop that may be unsafe.
>>
>> Which is better for us?
>
>
> Well, since the lock _might_ be released (and subsequently reacquired)
> by callout_halt(), it might be easiest to modify all the callers to
> just unlock it before calling nd6_dad_stoptimer(), and reacquire the
> mutex after it returns (as well as re-establishing any values that
> might have changed while the mutex was released).
>
> The callers needs to be prepared for the release/reacquire situation
> anyway, so the change should not be significant.  As noted in the
> callout_halt(9) man page
>
>...to avoid race conditions the caller should always assume that
>[the] interlock has been released and reacquired, and act
>accordingly.
>
>
> Alternatively, you could modify all the callers to always acquire the
> mutex before calling nd6_dad_stoptimer().

I mean such changes are tough and mess up many codes which we don't hope.

>
> But making a run-time decision with mutex_owned() is not a good idea.

If it must be respected we can back to callout_stop because it's unsafe
but NetBSD 7 uses it without any issues; using callout_stop in
nd6_dad_stoptimer might be safe somehow (if not NET_MPSAFE).

  ozaki-r


Re: CVS commit: src/sys/netinet6

2017-12-25 Thread Paul Goyette

On Tue, 26 Dec 2017, Ryota Ozaki wrote:


On Tue, Dec 26, 2017 at 11:35 AM, Paul Goyette  wrote:



To generate a diff of this commit:



# cvs rdiff -u -r1.139 -r1.140 src/sys/netinet6/nd6_nbr.c
@@ -1097,7 +1097,11 @@ nd6_dad_stoptimer(struct dadq *dp)
 #ifdef NET_MPSAFE
callout_halt(>dad_timer_ch, NULL);
 #else
-   callout_halt(>dad_timer_ch, softnet_lock);
+   /* XXX still need the trick for softnet_lock */
+   if (mutex_owned(softnet_lock))
+   callout_halt(>dad_timer_ch, softnet_lock);
+   else
+   callout_halt(>dad_timer_ch, NULL);
 #endif
 }

This goes against the restriction noted in the mutex(9) man page:

   [mutex_owned()] should not be used to make locking decisions at run
   time.  ...

Please find a different way to make this run-time decision.


I know the restriction well, but for softnet_lock, following the restriction
is sometimes hard. I don't have an option to statically decide if
softnet_lock is held or not without messing up many codes.

An option we can have here is to give up using callout_halt and use
callout_stop that may be unsafe.

Which is better for us?


Well, since the lock _might_ be released (and subsequently reacquired)
by callout_halt(), it might be easiest to modify all the callers to
just unlock it before calling nd6_dad_stoptimer(), and reacquire the
mutex after it returns (as well as re-establishing any values that
might have changed while the mutex was released).

The callers needs to be prepared for the release/reacquire situation
anyway, so the change should not be significant.  As noted in the
callout_halt(9) man page

   ...to avoid race conditions the caller should always assume that
   [the] interlock has been released and reacquired, and act
   accordingly.


Alternatively, you could modify all the callers to always acquire the
mutex before calling nd6_dad_stoptimer().

But making a run-time decision with mutex_owned() is not a good idea.


+--+--++
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
| (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+--+--++


Re: CVS commit: src/sys/netinet

2017-12-25 Thread Ryota Ozaki
One possible fix has been committed.

Can you update the source code and try a new kernel?

Thanks,
  ozaki-r


On Tue, Dec 26, 2017 at 1:00 AM, Ryota Ozaki  wrote:
> On Mon, Dec 25, 2017 at 8:31 PM, Tom Ivar Helbekkmo
>  wrote:
>> Martin Husemann  writes:
>>
>>> The sparc64 hang happened with:
>>>
>>>  $NetBSD: ip_output.c,v 1.288 2017/12/22 11:22:37 ozaki-r Exp $
>>
>> My amd64 that experienced hangs has:
>>
>> $NetBSD: ip_output.c,v 1.287 2017/12/15 04:03:46 ozaki-r Exp $
>>
>> ...and was last updated from CVS on December 19th, which made it much
>> more stable than a kernel from a few days previous to that (from the
>> 12th, I think).  After upgrading, I was trying to update a lot of
>> installed packages from pkgsrc, with pkgsrc, distfiles, and the binary
>> package repository NFS mounted, and a local pkgobj for building in.  I
>> gave up in the end, because of the hangs, but managed to build
>> everything after updating on the 19th.  Since then, I've only had the
>> one hard hang.
>
> My recent changes:
> - 12/6-7:   IFNET_LOCK changes #1
> - 12/14-15: IFNET_LOCK changes #2
> - 12/19:Disable IFEF_MPSAFE by default
>
> So your experiences looks synchronized to the changes.
>
> Can you run the kernel with LOCKDEBUG and DEBUG enabled? And if once your
> system hangs again, can you enter the DDB and collect threads stuck on
> turnstile?
>
> Thanks,
>   ozaki-r


Re: CVS commit: src/sys/netinet6

2017-12-25 Thread Ryota Ozaki
On Tue, Dec 26, 2017 at 11:35 AM, Paul Goyette  wrote:
>
>> To generate a diff of this commit:
>
>
> # cvs rdiff -u -r1.139 -r1.140 src/sys/netinet6/nd6_nbr.c
> @@ -1097,7 +1097,11 @@ nd6_dad_stoptimer(struct dadq *dp)
>  #ifdef NET_MPSAFE
> callout_halt(>dad_timer_ch, NULL);
>  #else
> -   callout_halt(>dad_timer_ch, softnet_lock);
> +   /* XXX still need the trick for softnet_lock */
> +   if (mutex_owned(softnet_lock))
> +   callout_halt(>dad_timer_ch, softnet_lock);
> +   else
> +   callout_halt(>dad_timer_ch, NULL);
>  #endif
>  }
>
> This goes against the restriction noted in the mutex(9) man page:
>
>[mutex_owned()] should not be used to make locking decisions at run
>time.  ...
>
> Please find a different way to make this run-time decision.

I know the restriction well, but for softnet_lock, following the restriction
is sometimes hard. I don't have an option to statically decide if
softnet_lock is held or not without messing up many codes.

An option we can have here is to give up using callout_halt and use
callout_stop that may be unsafe.

Which is better for us?

  ozaki-r


Re: CVS commit: src/sys/netinet6

2017-12-25 Thread Paul Goyette



To generate a diff of this commit:


# cvs rdiff -u -r1.139 -r1.140 src/sys/netinet6/nd6_nbr.c
@@ -1097,7 +1097,11 @@ nd6_dad_stoptimer(struct dadq *dp)
 #ifdef NET_MPSAFE
callout_halt(>dad_timer_ch, NULL);
 #else
-   callout_halt(>dad_timer_ch, softnet_lock);
+   /* XXX still need the trick for softnet_lock */
+   if (mutex_owned(softnet_lock))
+   callout_halt(>dad_timer_ch, softnet_lock);
+   else
+   callout_halt(>dad_timer_ch, NULL);
 #endif
 }

This goes against the restriction noted in the mutex(9) man page:

   [mutex_owned()] should not be used to make locking decisions at run
   time.  ...

Please find a different way to make this run-time decision.



+--+--++
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
| (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+--+--++


Re: CVS commit: src/sys/netinet

2017-12-25 Thread Ryota Ozaki
On Mon, Dec 25, 2017 at 8:31 PM, Tom Ivar Helbekkmo
 wrote:
> Martin Husemann  writes:
>
>> The sparc64 hang happened with:
>>
>>  $NetBSD: ip_output.c,v 1.288 2017/12/22 11:22:37 ozaki-r Exp $
>
> My amd64 that experienced hangs has:
>
> $NetBSD: ip_output.c,v 1.287 2017/12/15 04:03:46 ozaki-r Exp $
>
> ...and was last updated from CVS on December 19th, which made it much
> more stable than a kernel from a few days previous to that (from the
> 12th, I think).  After upgrading, I was trying to update a lot of
> installed packages from pkgsrc, with pkgsrc, distfiles, and the binary
> package repository NFS mounted, and a local pkgobj for building in.  I
> gave up in the end, because of the hangs, but managed to build
> everything after updating on the 19th.  Since then, I've only had the
> one hard hang.

My recent changes:
- 12/6-7:   IFNET_LOCK changes #1
- 12/14-15: IFNET_LOCK changes #2
- 12/19:Disable IFEF_MPSAFE by default

So your experiences looks synchronized to the changes.

Can you run the kernel with LOCKDEBUG and DEBUG enabled? And if once your
system hangs again, can you enter the DDB and collect threads stuck on
turnstile?

Thanks,
  ozaki-r


Re: CVS commit: src/sys/netinet

2017-12-25 Thread Ryota Ozaki
On Mon, Dec 25, 2017 at 8:16 PM, Martin Husemann  wrote:
> On Mon, Dec 25, 2017 at 12:05:28PM +0900, Ryota Ozaki wrote:
>> Anyway I annoy that we often cannot have suspect commits because of cvs
>> (you know the kernel version doesn't work at all for the purpose). I wish
>> we had revision IDs of svn or commit IDs of git to know them.
>
> The sparc64 hang happened with:
>
>  $NetBSD: ip_output.c,v 1.288 2017/12/22 11:22:37 ozaki-r Exp $

Thanks. Hmm, I have no idea why and how the change hangs the system.
If it's a deadlock can you run ps in DDB and gather threads stuck on
turnstile? If not, ...I have no idea :-/

  ozaki-r


Re: CVS commit: src/distrib/ews4800mips/floppies/ramdisk

2017-12-25 Thread Christos Zoulas
In article <20171225061550.e13a3f...@cvs.netbsd.org>,
Rin Okuyama  wrote:
>-=-=-=-=-=-
>
>Module Name:   src
>Committed By:  rin
>Date:  Mon Dec 25 06:15:50 UTC 2017
>
>Modified Files:
>   src/distrib/ews4800mips/floppies/ramdisk: list
>
>Log Message:
>Drop the following features, which reduces ramdisk.bin about 70KB:
>- shutdown, rcmd, rcp
>- support for byte-swapped FFS and Apple UFS in fsck_ffs and newfs
>- support for byte-swapped label and interactive editor in disklabel
>OK tsutsui

You didn't need to do this, because the real issue was that the kernel
grew (which I fixed by removing DDB and making compat a library since
there are no compat options in the install kernel). So now we should
have space to put back DDB or those :-)

christos



Re: CVS commit: src/sys/netinet

2017-12-25 Thread Tom Ivar Helbekkmo
Martin Husemann  writes:

> The sparc64 hang happened with:
>
>  $NetBSD: ip_output.c,v 1.288 2017/12/22 11:22:37 ozaki-r Exp $

My amd64 that experienced hangs has:

$NetBSD: ip_output.c,v 1.287 2017/12/15 04:03:46 ozaki-r Exp $

...and was last updated from CVS on December 19th, which made it much
more stable than a kernel from a few days previous to that (from the
12th, I think).  After upgrading, I was trying to update a lot of
installed packages from pkgsrc, with pkgsrc, distfiles, and the binary
package repository NFS mounted, and a local pkgobj for building in.  I
gave up in the end, because of the hangs, but managed to build
everything after updating on the 19th.  Since then, I've only had the
one hard hang.

-tih
-- 
Most people who graduate with CS degrees don't understand the significance
of Lisp.  Lisp is the most important idea in computer science.  --Alan Kay


Re: CVS commit: src/sys/netinet

2017-12-25 Thread Martin Husemann
On Mon, Dec 25, 2017 at 12:05:28PM +0900, Ryota Ozaki wrote:
> Anyway I annoy that we often cannot have suspect commits because of cvs
> (you know the kernel version doesn't work at all for the purpose). I wish
> we had revision IDs of svn or commit IDs of git to know them.

The sparc64 hang happened with:

 $NetBSD: ip_output.c,v 1.288 2017/12/22 11:22:37 ozaki-r Exp $

Martin


Re: CVS commit: src/sys

2017-12-25 Thread Kamil Rytarowski
On 25.12.2017 02:21, m...@netbsd.org wrote:
> On Tue, Dec 19, 2017 at 07:40:04PM +, Kamil Rytarowski wrote:
>> Module Name: src
>> Committed By:kamil
>> Date:Tue Dec 19 19:40:04 UTC 2017
>>
>> Modified Files:
>>  src/sys/compat/netbsd32: netbsd32_netbsd.c netbsd32_syscall.h
>>  netbsd32_syscallargs.h netbsd32_syscalls.c
>>  netbsd32_syscalls_autoload.c netbsd32_sysent.c
>>  netbsd32_systrace_args.c syscalls.master
>>  src/sys/kern: init_sysent.c syscalls.c syscalls.master
>>  syscalls_autoload.c systrace_args.c
>>  src/sys/rump/include/rump: rump_syscalls.h
>>  src/sys/rump/librump/rumpkern: rump_syscalls.c
>>  src/sys/sys: syscall.h syscallargs.h
>>  src/sys/uvm: uvm_unix.c
>>
>> Log Message:
>> Drop SYS_vadvise
>>
>> The (o)vadvise syscall is dummy since the beginning of NetBSD.
>>
>> It is an obsolete remnant from the old UNIX.
> 
> I think this removes a symbol from libc too
> 

This is correct.

$ nm /usr/lib/libc.so|grep vadvise


0003b980 T vadvise

I will fix it.



signature.asc
Description: OpenPGP digital signature