Re: pserialize(9) vs. TAILQ

2014-11-23 Thread Dennis Ferguson

On 22 Nov, 2014, at 00:03 , Masao Uebayashi  wrote:

> On Thu, Nov 20, 2014 at 4:45 PM, Dennis Ferguson
>  wrote:
> OK, I overlooked the obvious "new elm's next" -> "prev elm's next" ordering.
> 
>> The conclusion is that while the current TAILQ_REMOVE() macro will work
>> unchanged to maintain a list with concurrent readers (given no requirement
>> to run on an MP DEC Alpha) the TAILQ_INSERT_*() macros will not, even
>> though they look correct.  The mutex and/or the use of pserialize() does
>> nothing to change this.  If you want to insert things into a TAILQ list
>> without blocking concurrent readers you must use alternative macros that
>> have membar_producer() calls in the right spots.
> 
> I completely disagree with your conclusion.
> 
> Your basic idea seems that ``member_{prroducer,consumer}() are cheaper
> than membar_{enter,leave}, let's use cheaper ones''.  That way you end
> up concluding that you need memory barrier in every iteration.  Thant
> is contradictory to what you previously said.

You need to fix the store ordering in TAILQ_INSERT_*().  If you want
to use a different barrier to do it that's fine with me.

I was happy with no barrier in the reader when I thought providing
support for the DEC Alpha architecture's unique cache quirk wasn't
necessary.  When it turned out support for the Alpha was necessary
the reader needed to have a barrier inserted that looks like it gets
executed after every read pointer read.  Note, however, that this is
a barrier which does nothing at all on any machine other than an
Alpha. Nothing has changed, nothing has been contradicted, no cost
has been added to readers.  Taylor might like to see barrier
operations paired up in C code but there is no pairing at the
hardware instruction level.  The reader barrier has no cost on any
machine other than the Alpha, the only machine which has an
architectural requirement to generate an instruction there.

I understand why you are unhappy with the barrier.  It seems
stupid to have to execute a not-inexpensive hardware instruction
over and over and over again in the reader to deal with something
that a writer is almost never going to do, but since you have to do
this on an Alpha (and only an Alpha) to make this code correct,
why can't the conclusion be that it sucks to be a DEC Alpha,
and just get on with it?  This is probably a perfect example
of why no processor other than the Alpha has had this quirk in
its cache architecture, nor is there likely to be another in
future; this quirk has a cost that is annoying.  Indeed, judged
from the fact that people think NetBSD runs on an MP Alpha
okay without this barrier, it may be that even later DEC Alpha
implementations omitted this misfeature.

> point to update E_new.next in the "update" section.  If E_new.next has
> been updated *long enough before* E_p.next is updated, that ordering
> is not a problem at all.

"long enough" is not a specific length of time.  If you can
find a spec for the amount of time which is "long enough" you
can wait that long instead.  Just calling pserialize_perform()
when inserting entries (where no pseralize_perform() is required)
because pserialize_perform() is a fantastically expensive function
and takes a long time, doesn't do anything if you don't know
how long is "long enough", not to mention that you'd be executing
this extra-expensive DELAY() on all archectures to eliminate a cost
elsewhere that is unique to the DEC Alpha.

This is only a problem for the Alpha.  The barrier fixes it for
the Alpha without costing other architectures anything.  That seems
like the right thing to do.

Dennis Ferguson



Re: pserialize(9) vs. TAILQ

2014-11-23 Thread Masao Uebayashi
On Sat, Nov 22, 2014 at 2:24 PM, Dennis Ferguson
 wrote:
> I'll guess one problem is in sparc/mutex.h, here:
>
> #define MUTEX_RECEIVE(mtx)  /* nothing */
> #define MUTEX_GIVE(mtx) /* nothing */

I think that these macros should be replaced with
membar_enter()/member_exit() respectively in sys/kern/kern_mutex.c.

The only reason, which I can think of, that these exist is these
predate membar ops.


Should sys_unmount() FOLLOW?

2014-11-23 Thread Emmanuel Dreyfus
Hi

I ran into this strange bug with glusterfs NFS server, which is possible
because it allows the mounted filesystem root vnode to be VLNK (NetBSD's
native mountd prevents this situation and therefore the bug does not
happen with our native NFS server):

bacasel# mount
bacasel.net.espci.fr:/patchy/symlink1 on /mnt/nfs/0 type nfs

bacasel# ls -l /mnt/nfs
lrwxrwxrwx  1 root  wheel4 Nov 23 10:03 0 -> dir1

That is possible because on the exported filesystem, symlink1 is a
symlink to dir1.

It looks funny and harmless, at least until one try to unmount while the
NFS server is down. I do this using the most reliable way: umount -f -R
/mnt/nfs/0 

umount(8) will quickly call unmount(2), passing the /mnt/nfs/0 path
without trying anything fancy. But at the beginning of sys_unmount() in
the kernel, we can find:

NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | TRYEMULROOT, pb);
if ((error = namei(&nd)) != 0) {
pathbuf_destroy(pb);
return error;
}

The FOLLOW flag will cause the symlink to be resolved before
sys_unmount() can proceed. Since the NFS server is down, the namei call
will never return, and umount(8) get stuck in kernel with this
backtrace:
sleepq_block
kpause
nfs_reconnect
nfs_request
nfs_readlinkrpc
nfs_readlink
VOP_READLINK
namei_tryemulroot
namei
sys_unmount
syscall

This is annoying because umount -f should really unmount. Moreover,
stuck process will crop on the system because umount(8) holds a vnode
lock forever.

I see two way of fixing this:
1) remove FOLLOW flag in sys_unmount(). After all, unless the -R flag is
given, umount(8) should have resolved the patch before calling
unmount(2).

2) Desfine a new MNT_RAW mount flag. If umount(8) is called with -R,
pass that flag to unmount(8), and in sys_unmount(), do not use FOLLOW if
MNT_RAW is set.

Opinions?

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: Should sys_unmount() FOLLOW?

2014-11-23 Thread J. Hannken-Illjes
On 23 Nov 2014, at 17:47, Emmanuel Dreyfus  wrote:

> Hi
> 
> I ran into this strange bug with glusterfs NFS server, which is possible
> because it allows the mounted filesystem root vnode to be VLNK (NetBSD's
> native mountd prevents this situation and therefore the bug does not
> happen with our native NFS server):
> 
> bacasel# mount
> bacasel.net.espci.fr:/patchy/symlink1 on /mnt/nfs/0 type nfs
> 
> bacasel# ls -l /mnt/nfs
> lrwxrwxrwx  1 root  wheel4 Nov 23 10:03 0 -> dir1
> 
> That is possible because on the exported filesystem, symlink1 is a
> symlink to dir1.
> 
> It looks funny and harmless, at least until one try to unmount while the
> NFS server is down. I do this using the most reliable way: umount -f -R
> /mnt/nfs/0
>  
> 
> umount(8) will quickly call unmount(2), passing the /mnt/nfs/0 path
> without trying anything fancy. But at the beginning of sys_unmount() in
> the kernel, we can find:
> 
>NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | TRYEMULROOT, pb);
>if ((error = namei(&nd)) != 0) {
>pathbuf_destroy(pb);
>return error;
>}
> 
> The FOLLOW flag will cause the symlink to be resolved before
> sys_unmount() can proceed. Since the NFS server is down, the namei call
> will never return, and umount(8) get stuck in kernel with this
> backtrace:
> sleepq_block
> kpause
> nfs_reconnect
> nfs_request
> nfs_readlinkrpc
> nfs_readlink
> VOP_READLINK
> namei_tryemulroot
> namei
> sys_unmount
> syscall
> 
> This is annoying because umount -f should really unmount. Moreover,
> stuck process will crop on the system because umount(8) holds a vnode
> lock forever.
> 
> I see two way of fixing this:
> 1) remove FOLLOW flag in sys_unmount(). After all, unless the -R flag is
> given, umount(8) should have resolved the patch before calling
> unmount(2).

Did you try -- last time I tried a forced unmount with NFS server down
it didn't work even with root being a directory because the namei() call
would hang in VOP_LOOKUP().  Does it work these days?

> 2) Desfine a new MNT_RAW mount flag. If umount(8) is called with -R,
> pass that flag to unmount(8), and in sys_unmount(), do not use FOLLOW if
> MNT_RAW is set.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



posix_madvise(2) should fail with ENOMEM for invalid adresses range

2014-11-23 Thread Nicolas Joly

Hi,

According the OpenGroup online document for posix_madvise[1], it
should fail with ENOMEM for invalid addresses ranges :

[ENOMEM]
Addresses in the range starting at addr and continuing for len
bytes are partly or completely outside the range allowed for the
address space of the calling process.

But we currently fail with EINVAL (returned value from range_check()
function).

Ok to apply the attached patch to fix posix_madvise/madvise ?

Thanks.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_madvise.html

-- 
Nicolas Joly

Biology IT Center
Institut Pasteur, Paris.
Index: sys/uvm/uvm_mmap.c
===
RCS file: /cvsroot/src/sys/uvm/uvm_mmap.c,v
retrieving revision 1.149
diff -u -p -r1.149 uvm_mmap.c
--- sys/uvm/uvm_mmap.c  5 Sep 2014 09:24:48 -   1.149
+++ sys/uvm/uvm_mmap.c  11 Nov 2014 11:21:23 -
@@ -843,7 +843,7 @@ sys_madvise(struct lwp *l, const struct 
 
error = range_test(addr, size, false);
if (error)
-   return error;
+   return ENOMEM;
 
switch (advice) {
case MADV_NORMAL:
Index: lib/libc/sys/madvise.2
===
RCS file: /cvsroot/src/lib/libc/sys/madvise.2,v
retrieving revision 1.28
diff -u -p -r1.28 madvise.2
--- lib/libc/sys/madvise.2  19 Jul 2014 19:26:47 -  1.28
+++ lib/libc/sys/madvise.2  11 Nov 2014 11:21:23 -
@@ -105,6 +105,13 @@ will fail if:
 .Bl -tag -width Er
 .It Bq Er EINVAL
 Invalid parameters were provided.
+.It Bq Er ENOMEM
+Addresses in the range starting at
+.Fa addr
+and continuing for
+.Fa len
+bytes are partly or completely outside the range allowed for the address
+space of the calling process.
 .El
 .Sh SEE ALSO
 .Xr mincore 2 ,


Re: Should sys_unmount() FOLLOW?

2014-11-23 Thread Emmanuel Dreyfus
J. Hannken-Illjes  wrote:

> Did you try -- last time I tried a forced unmount with NFS server down
> it didn't work even with root being a directory because the namei() call
> would hang in VOP_LOOKUP().  Does it work these days?

It works on netbsd-7:

bacasel# mountd
bacasel# nfsd
bacasel# showmount -e 
Exports list on localhost:
/mnt/test/dir  127.0.0.1 
/mnt/test  127.0.0.1 
bacasel# mount -t nfs 127.0.0.1:/mnt/test /mnt2
bacasel# mount
/dev/xbd0a on / type ffs (log, NFS exported, local)
/dev/xbd1a on /d type ffs (extattr, log, local)
kernfs on /kern type kernfs (local)
ptyfs on /dev/pts type ptyfs (local)
procfs on /proc type procfs (local)
tmpfs on /var/shm type tmpfs (local)
127.0.0.1:/mnt/test on /mnt2 type nfs
bacasel# pkill -9 mountd nfsd
bacasel# ps -ax|egrep '(nfsd|mountd)'
bacasel# umount -f /mnt2
bacasel# mount
/dev/xbd0a on / type ffs (log, NFS exported, local)
/dev/xbd1a on /d type ffs (extattr, log, local)
kernfs on /kern type kernfs (local)
ptyfs on /dev/pts type ptyfs (local)
procfs on /proc type procfs (local)
tmpfs on /var/shm type tmpfs (local)


-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: posix_madvise(2) should fail with ENOMEM for invalid adresses range

2014-11-23 Thread Justin Cormack
On Sun, Nov 23, 2014 at 4:37 PM, Nicolas Joly  wrote:
>
> Hi,
>
> According the OpenGroup online document for posix_madvise[1], it
> should fail with ENOMEM for invalid addresses ranges :
>
> [ENOMEM]
> Addresses in the range starting at addr and continuing for len
> bytes are partly or completely outside the range allowed for the
> address space of the calling process.
>
> But we currently fail with EINVAL (returned value from range_check()
> function).
>
> Ok to apply the attached patch to fix posix_madvise/madvise ?
>
> Thanks.
>
> [1] 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_madvise.html

There was some discussion on PR 48910

http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=48910

Which was slightly inconclusive.

Justin


Re: struct ifnet and ifaddr handling [was: Re: Making global variables of if.c MPSAFE]

2014-11-23 Thread Masao Uebayashi
http://marc.info/?t=14167055271&r=1&w=2

Following the ideas raised in that thread:

- Allocate callout_t dynamically.  struct ifnet only has a pointer to struct
  callout, which will not be read by netstat(1) anyway.

- Prefer the name "slowtimo" to "watchdog", because those callbacks do a
  little more than what "watchdog" suggests.

Index: sys/net/if.c
===
RCS file: /cvsroot/src/sys/net/if.c,v
retrieving revision 1.293
diff -p -u -r1.293 if.c
--- sys/net/if.c17 Nov 2014 13:58:53 -  1.293
+++ sys/net/if.c24 Nov 2014 01:49:14 -
@@ -168,8 +168,6 @@ static kmutex_t if_clone_mtx;
 
 static struct ifaddr **ifnet_addrs = NULL;
 
-static callout_t   if_slowtimo_ch;
-
 struct ifnet *lo0ifp;
 intifqmaxlen = IFQ_MAXLEN;
 
@@ -194,6 +192,7 @@ static void ifnet_lock_exit(struct ifnet
 static void if_detach_queues(struct ifnet *, struct ifqueue *);
 static void sysctl_sndq_setup(struct sysctllog **, const char *,
 struct ifaltq *);
+static void if_slowtimo(void *);
 
 #if defined(INET) || defined(INET6)
 static void sysctl_net_pktq_setup(struct sysctllog **, int);
@@ -235,9 +234,6 @@ ifinit(void)
sysctl_net_pktq_setup(NULL, PF_INET6);
 #endif
 
-   callout_init(&if_slowtimo_ch, 0);
-   if_slowtimo(NULL);
-
if_listener = kauth_listen_scope(KAUTH_SCOPE_NETWORK,
if_listener_cb, NULL);
 
@@ -337,7 +333,7 @@ if_nullstop(struct ifnet *ifp, int disab
 }
 
 void
-if_nullwatchdog(struct ifnet *ifp)
+if_nullslowtimo(struct ifnet *ifp)
 {
 
/* Nothing. */
@@ -637,6 +633,13 @@ if_attach(ifnet_t *ifp)
 
/* Announce the interface. */
rt_ifannouncemsg(ifp, IFAN_ARRIVAL);
+
+   if (ifp->if_slowtimo != NULL) {
+   ifp->if_slowtimo_ch =
+   kmem_zalloc(sizeof(*ifp->if_slowtimo_ch), KM_SLEEP);
+   callout_init(ifp->if_slowtimo_ch, 0);
+   if_slowtimo(ifp);
+   }
 }
 
 void
@@ -687,7 +690,7 @@ if_deactivate(struct ifnet *ifp)
ifp->if_ioctl= if_nullioctl;
ifp->if_init = if_nullinit;
ifp->if_stop = if_nullstop;
-   ifp->if_watchdog = if_nullwatchdog;
+   ifp->if_slowtimo = if_nullslowtimo;
ifp->if_drain= if_nulldrain;
 
/* No more packets may be enqueued. */
@@ -736,6 +739,12 @@ if_detach(struct ifnet *ifp)
 
s = splnet();
 
+   if (ifp->if_slowtimo != NULL) {
+   callout_halt(ifp->if_slowtimo_ch, NULL);
+   callout_destroy(ifp->if_slowtimo_ch);
+   kmem_free(ifp->if_slowtimo_ch, sizeof(*ifp->if_slowtimo_ch));
+   }
+
/*
 * Do an if_down() to give protocols a chance to do something.
 */
@@ -1493,24 +1502,23 @@ if_up(struct ifnet *ifp)
 }
 
 /*
- * Handle interface watchdog timer routines.  Called
- * from softclock, we decrement timers (if set) and
+ * Handle interface slow timeout routine.  Called
+ * from softclock, we decrement timer (if set) and
  * call the appropriate interface routine on expiration.
  */
-void
+static void
 if_slowtimo(void *arg)
 {
-   struct ifnet *ifp;
+   struct ifnet *ifp = arg;
int s = splnet();
 
-   IFNET_FOREACH(ifp) {
-   if (ifp->if_timer == 0 || --ifp->if_timer)
-   continue;
-   if (ifp->if_watchdog != NULL)
-   (*ifp->if_watchdog)(ifp);
-   }
+   KASSERT(ifp->if_slowtimo != NULL);
+
+   if (ifp->if_timer != 0 && --ifp->if_timer == 0)
+   (*ifp->if_slowtimo)(ifp);
+
splx(s);
-   callout_reset(&if_slowtimo_ch, hz / IFNET_SLOWHZ, if_slowtimo, NULL);
+   callout_reset(ifp->if_slowtimo_ch, hz / IFNET_SLOWHZ, if_slowtimo, ifp);
 }
 
 /*
Index: sys/net/if.h
===
RCS file: /cvsroot/src/sys/net/if.h,v
retrieving revision 1.175
diff -p -u -r1.175 if.h
--- sys/net/if.h9 Sep 2014 20:16:12 -   1.175
+++ sys/net/if.h24 Nov 2014 01:49:14 -
@@ -244,6 +244,7 @@ TAILQ_HEAD(ifnet_head, ifnet);  /* the a
 
 struct bridge_softc;
 struct bridge_iflist;
+struct callout;
 
 typedef struct ifnet {
void*if_softc;  /* lower-level data for this if */
@@ -253,7 +254,7 @@ typedef struct ifnet {
int if_pcount;  /* number of promiscuous listeners */
struct bpf_if *if_bpf;  /* packet filter structure */
u_short if_index;   /* numeric abbreviation for this if */
-   short   if_timer;   /* time 'til if_watchdog called */
+   short   if_timer;   /* time 'til if_slowtimo called */
short   if_flags;   /* up/down, broadcast, etc. */
short   if__pad1;   /* be nice to m68k ports */
struct  if_data if_data;/* statistics and other data about if */

Re: struct ifnet and ifaddr handling [was: Re: Making global variables of if.c MPSAFE]

2014-11-23 Thread Taylor R Campbell
   Date: Mon, 24 Nov 2014 11:23:28 +0900
   From: Masao Uebayashi 

   http://marc.info/?t=14167055271&r=1&w=2

   Following the ideas raised in that thread:

   - Allocate callout_t dynamically.  struct ifnet only has a pointer to struct
 callout, which will not be read by netstat(1) anyway.

   - Prefer the name "slowtimo" to "watchdog", because those callbacks do a
 little more than what "watchdog" suggests.

Can you (or ozaki-r@, whose earlier patch I missed until now) explain
specifically what this accomplishes?  I have two guesses about the
primary goal of this change: either

(a) to obviate the need to run if_watchdog/if_slowtimo callbacks
inside IFNET_FOREACH, or

(b) to obviate the need to run IFNET_FOREACH in a callout,

with a minor added bonus that each interface's watchdog/slowtimo can
run in parallel once we flip the CALLOUT_MPSAFE switch on them.

In case (a), what might an interface do in an if_watchdog/if_slowtimo
callback that is safe in a callout but not safe inside a pserialized
reader?  Is it simply that it's sort-of-kind-of OK for a callout to
block a little, but absolutely not OK for a pserialized reader to
block and thus switch?

In case (b), if we're going to use pserialized reads for IFNET_FOREACH
anyway, why would it matter that we don't run IFNET_FOREACH at a
callout?  Callouts are at IPL_SOFTCLOCK, and so are pserialized
readers.

If it's just the `minor added bonus' and general disentanglement of
the network stack, that sounds good to me too -- I'm just curious to
see a statement of what the purpose is.

Also, it seems the reason you're allocating the struct callout
dynamically is to allow userland to read struct ifnet without having
to see the guts of the callout.  But does that matter?  netstat(1) is
one of the last programs that uses kvm(3) to report kernel statistics,
and it seems to me we ought to publish the information via sysctl
instead.

   -callout_reset(&if_slowtimo_ch, hz / IFNET_SLOWHZ, if_slowtimo, NULL);
   +callout_reset(ifp->if_slowtimo_ch, hz / IFNET_SLOWHZ, if_slowtimo, ifp);

Prefer using callout_setfunc up front and using callout_schedule in
if_slowtimo, rather than callout_reset.  Makes it easier to be sure
what the callout is doing.


Re: Should sys_unmount() FOLLOW?

2014-11-23 Thread Emmanuel Dreyfus
J. Hannken-Illjes  wrote:

> > 1) remove FOLLOW flag in sys_unmount(). After all, unless the -R flag is
> > given, umount(8) should have resolved the patch before calling
> > unmount(2).
> 
> Did you try -- last time I tried a forced unmount with NFS server down
> it didn't work even with root being a directory because the namei() call
> would hang in VOP_LOOKUP().  Does it work these days?

As noted in my previous message, yes it works fine in the normal case
where the root is a directory. Well, except if you already had a process
stuck trying to talk to the NFS server (wchan nfscn2), in that case,
umount -f does not really try to unmount as it is stuck awaiting for a
vnode lock (wchan tstile).

I tested removing the FOLLOW flag in sys_unmount(), and it solves my bug
if I use umount -f -R (with the same exception as in the normal case: no
process already in nfscn2). We do not seem to loose functionnality, as
umount(8) without -R already resolves symlinks. In fact the changes
improves umount -R behavior, as the kernel will not perform the symlink
resolution that umount(8) was instructed to skip.

Is there any reason to not commit the change below?

Index: sys/kern/vfs_syscalls.c
===
RCS file: /cvsroot/src/sys/kern/vfs_syscalls.c,v
retrieving revision 1.490
diff -U 4 -r1.490 vfs_syscalls.c
--- sys/kern/vfs_syscalls.c 25 Jul 2014 16:28:12 -  1.490
+++ sys/kern/vfs_syscalls.c 24 Nov 2014 05:04:54 -
@@ -573,9 +573,9 @@
if (error) {
return error;
}
 
-   NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | TRYEMULROOT, pb);
+   NDINIT(&nd, LOOKUP, LOCKLEAF | TRYEMULROOT, pb);
if ((error = namei(&nd)) != 0) {
pathbuf_destroy(pb);
return error;
}


-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: struct ifnet and ifaddr handling [was: Re: Making global variables of if.c MPSAFE]

2014-11-23 Thread Masao Uebayashi
(I'm trying, but I can't follow up all mails soon, because I need more
than x2 energy & time to write English than you.)

On Mon, Nov 24, 2014 at 12:12 PM, Taylor R Campbell
 wrote:
>Date: Mon, 24 Nov 2014 11:23:28 +0900
>From: Masao Uebayashi 
>
>http://marc.info/?t=14167055271&r=1&w=2
>
>Following the ideas raised in that thread:
>
>- Allocate callout_t dynamically.  struct ifnet only has a pointer to 
> struct
>  callout, which will not be read by netstat(1) anyway.
>
>- Prefer the name "slowtimo" to "watchdog", because those callbacks do a
>  little more than what "watchdog" suggests.
>
> Can you (or ozaki-r@, whose earlier patch I missed until now) explain
> specifically what this accomplishes?  I have two guesses about the
> primary goal of this change: either
>
> (a) to obviate the need to run if_watchdog/if_slowtimo callbacks
> inside IFNET_FOREACH, or

I was thinking of this first.

> (b) to obviate the need to run IFNET_FOREACH in a callout,
>
> with a minor added bonus that each interface's watchdog/slowtimo can
> run in parallel once we flip the CALLOUT_MPSAFE switch on them.

Good point.  There must be a way for drivers to declare if
CALLOUT_MPSAFE or not.  Need to extend if_flags.

> In case (a), what might an interface do in an if_watchdog/if_slowtimo
> callback that is safe in a callout but not safe inside a pserialized
> reader?  Is it simply that it's sort-of-kind-of OK for a callout to
> block a little, but absolutely not OK for a pserialized reader to
> block and thus switch?

I have believed that pserialize(9) reader-side is a critical section.
pserialize(9) relies on scheduler to notify that readers have passed
throught those pserialize(9) protected code paths, by calling
pserialize_switchpoint() from mi_switch().  This obviously implies
that threads can't sleep from within those pserialize(9) protected
code paths.  Otherwise that notification has a different meaning.

I think pserialize_read_{enter,exit}() should explicitly call
KPREEMPT_{DISABLE,ENABLE}(), as is done in percpu_{getref,putref}().

> In case (b), if we're going to use pserialized reads for IFNET_FOREACH
> anyway, why would it matter that we don't run IFNET_FOREACH at a
> callout?  Callouts are at IPL_SOFTCLOCK, and so are pserialized
> readers.

I think it's fine to enter pserialize(9)'ed code in callouts.

> If it's just the `minor added bonus' and general disentanglement of
> the network stack, that sounds good to me too -- I'm just curious to
> see a statement of what the purpose is.
>
> Also, it seems the reason you're allocating the struct callout
> dynamically is to allow userland to read struct ifnet without having
> to see the guts of the callout.  But does that matter?  netstat(1) is
> one of the last programs that uses kvm(3) to report kernel statistics,
> and it seems to me we ought to publish the information via sysctl
> instead.

Yes, netstat(1) uses sysctl(3) to fetch statistics numbers.  But
netstat(1) needs to know struct type information (by building if.c) to
read kernel cores via kvm(3).  Maybe making kvm(3) to understand DWARF
type information helps?

>-callout_reset(&if_slowtimo_ch, hz / IFNET_SLOWHZ, if_slowtimo, NULL);
>+callout_reset(ifp->if_slowtimo_ch, hz / IFNET_SLOWHZ, if_slowtimo, 
> ifp);
>
> Prefer using callout_setfunc up front and using callout_schedule in
> if_slowtimo, rather than callout_reset.  Makes it easier to be sure
> what the callout is doing.

Applied locally.  Thanks!


Re: struct ifnet and ifaddr handling [was: Re: Making global variables of if.c MPSAFE]

2014-11-23 Thread Masao Uebayashi
On Mon, Nov 24, 2014 at 3:26 PM, Masao Uebayashi  wrote:
> netstat(1) needs to know struct type information (by building if.c) to

s/ (by building if.c) / (by including if.h) /