stats collection in sys/kern/vfs_cache.c

2014-12-01 Thread Dennis Ferguson
I thought I would pass by some proposed changes to the statistics
collection in sys/kern/vfs_cache.c to see if I am over-doing (or
under-doing) it, or just doing it wrong.  Statistics are annoying.

Since this code was made MP-capable the basic method of stats
collection has been to provide each processor with a separate
counter structure which is incremented only by threads running on
that processor.  When cache_reclaim() runs (at least once per second)
the per-processor counts are then collected, added into the subsystem
totals and zeroed.  This is a nice method: the counters are generally
well-cached on the cpu doing the increments, with no contention.  It
would be nice if networking stats could work like this.

This original code was, however, a bit sloppy about it.  Some of
the counters were incremented inside locked code sections which
prevented cache_reclaim() from running concurrently, but some
were incremented at points where no lock was held.  In the
latter case the increments were just done anyway, non-atomically.
This means that counting races might cause a second's worth of
increments to be counted twice occasionally, as well as dropped
increments.  Worse, however, was that it constrained the counters
to be longs to at least ensure that writes were atomic, and a long
on a 32 bit machine isn't long enough for some of these (my build
machine incremented one of the counters by more than 20 billion
in less than 3 days of uptime).

The more recent modification to the code fixed the sloppiest part
of this by adding locking around every counter increment.  The
problem with this is that since the primary purpose of the locks
is operational they can sometimes be held for a long time, blocking
forward progress for the sake of a stats counter increment.  And
while it added a structure with 64 bit counters for the sysctl(3)
interface it left the internal structure into which the stats
are tallied with the long counters.  I'd like to get the cost of
stats collection back closer to that of the original sloppy code by
removing those locks, while maybe not being as sloppy as the original.

It was suggested that I use atomic operations for the counters
instead, but I'm not fond of that.  Since the code is already going
to the expense of keeping per-cpu counts it should be possible to
leverage that to avoid atomics, and I particularly don't want to
burden cache_reclaim() stats harvesting with atomic operations
since cache_reclaim() has enough to do already.  Here's what I'd
like to do instead:

- Make the subsystem total counts maintained by cache_reclaim()
  (nchstats) 64 bits long on all machines, the same as the
  sysctl interface.  Make the per-cpu counts 32 bits on all
  machines.  They only count a second's worth of activity on a
  single cpu, for which 32 bits is plenty, and this makes
  per-cpu counter writes atomic on all supported machines.

- Make cache_reclaim() refrain from zeroing the per-cpu counts in
  favor of computing differences between the current count and the
  count it saw last time it ran.  This makes its stats gathering
  lockless while avoiding atomic operations at the expense of
  space to store the last counts.

- Have every thread increment the per-cpu stats of the cpu it
  is currently running on, and do non-atomic increments even
  when no lock is held.  For the latter to fail
  requires that the thread doing the outside-a-lock increment
  be interrupted by another thread running on the same CPU which
  goes on to increment the same counter.  I suspect (but can't
  quite prove) this can't happen, but if it does the worst
  consequence of it seems to be the occasional loss of a single
  increment.

- cache_stat_sysctl() can now do the minimal locking needed to
  hold off a concurrent run of cache_reclaim() while it copies
  the values the latter updates.  It can make do with a copy of
  cache_reclaim()'s 1 second stats, or update that to current
  values by polling the per-cpu stats itself (the latter makes
  'systat vmstat' output less blocky).
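The first two items together are what make the harvest lockless.  A
minimal user-space sketch of the idea (the counter name, the NCPU
constant, and all function names here are illustrative stand-ins,
not the patch's actual code):

```c
#include <stdint.h>

#define NCPU 4  /* hypothetical CPU count, just for this sketch */

/* Illustrative per-cpu counter block; the real structure has more fields. */
struct nchstats_cpu {
    uint32_t ncs_goodhits;  /* 32 bits: one second on one cpu fits easily */
};

static struct nchstats_cpu cpu_stats[NCPU];      /* incremented per cpu */
static struct nchstats_cpu cpu_stats_last[NCPU]; /* harvester's last snapshot */
static uint64_t nchstats_goodhits;               /* 64-bit subsystem total */

/*
 * Lockless harvest, as cache_reclaim() would do it: never zero the
 * per-cpu counters, just take unsigned 32-bit differences against the
 * previous snapshot.  Modular arithmetic makes the delta correct even
 * if a counter wrapped (at most once) since the last harvest.
 */
static void
stats_harvest(void)
{
    for (unsigned i = 0; i < NCPU; i++) {
        uint32_t cur = cpu_stats[i].ncs_goodhits;
        nchstats_goodhits += (uint32_t)(cur - cpu_stats_last[i].ncs_goodhits);
        cpu_stats_last[i].ncs_goodhits = cur;
    }
}
```

Because only the harvester reads and writes the snapshot array, it
needs neither locks nor atomics; writers only ever do plain 32-bit
stores to their own cpu's slot.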

The attached patch may do this.  It also changes the name of the
struct (but with no change to its members) passed via sysctl(3)
from 'struct nchstats_sysctl' to 'struct nchstats', and for 32 bit
machines changes the size of the 'struct nchstats nchstats' which
vmstat can be told to read from a dead kernel's core.  Would either
of those changes be enough to require a version bump to the kernel?

Dennis Ferguson



cache_stats.patch
Description: Binary data


Re: struct ifnet and ifaddr handling [was: Re: Making global variables of if.c MPSAFE]

2014-12-01 Thread Ryota Ozaki
Hi,

I found a defect in ifnet object initialization.

- if_attach and if_alloc_sadl are called in each
  interface XXXattach function
  - if_alloc_sadl may be called via XXX_ifattach
(e.g., ether_ifattach)
- if_attach initializes an ifnet object, but
  the initialization is incomplete and if_alloc_sadl
  does the rest
  - for example, ifp->if_dl and ifp->if_sadl
- if_attach enables entering ioctl (doifioctl)
- ioctl can be called between if_attach and
  if_alloc_sadl
- ioctl may touch uninitialized member values
  of ifnet
  - then boom
- sysctl also has the same issue

We have to somehow prevent ioctl from being called until
if_alloc_sadl has finished.  We can achieve this by postponing
enabling ioctl, or registering an ifnet object on ifnet_list,
until if_alloc_sadl has finished.  To this end, we need to
call something after if_alloc_sadl.

My approach is to split if_attach into if_init and
if_attach: put the initializations of if_attach into
if_init, and put the registrations (e.g., adding an ifnet
object to ifnet_list and sysctl setups) into if_attach.
Then, call the new if_attach after if_alloc_sadl so that
we can ensure ioctl and sysctl are only called after all
initializations are done.

[before]
  if_attach();
  if_alloc_sadl();

[after]
  if_init();
  if_alloc_sadl();
  if_attach();
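As a toy model of why the ordering matters (all names and the boolean
flags below are illustrative stand-ins, not the actual NetBSD
structures): "registered" stands for being on ifnet_list, i.e.
visible to doifioctl, and "sadl_done" for if_sadl/if_dl being set up.

```c
#include <assert.h>
#include <stdbool.h>

struct ifnet_model {
    bool sadl_done;   /* if_sadl/if_dl initialized */
    bool registered;  /* on ifnet_list, reachable from ioctl */
};

static void if_init(struct ifnet_model *ifp)       { ifp->sadl_done = false; ifp->registered = false; }
static void if_alloc_sadl(struct ifnet_model *ifp) { ifp->sadl_done = true; }
static void if_attach(struct ifnet_model *ifp)     { ifp->registered = true; }

/* The invariant the split restores: ioctl can only find an
 * interface once it is fully initialized. */
static bool
ioctl_is_safe(const struct ifnet_model *ifp)
{
    return !ifp->registered || ifp->sadl_done;
}

static void
xxx_attach(struct ifnet_model *ifp)
{
    if_init(ifp);        /* core initialization, not yet visible */
    assert(ioctl_is_safe(ifp));
    if_alloc_sadl(ifp);  /* completes if_dl / if_sadl */
    assert(ioctl_is_safe(ifp));
    if_attach(ifp);      /* only now added to ifnet_list */
    assert(ioctl_is_safe(ifp));
}
```

In the old ordering, the invariant is violated in the window between
registration and if_alloc_sadl; in the new one it holds throughout.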

A concern with this approach is that it requires
modifications to every place where if_attach
is used (i.e., if_*).

Here is a patch: http://www.netbsd.org/~ozaki-r/if_attach.diff
(This patch includes if_attach modifications of only some if_*,
not all, yet.)

Any comments?

Thanks,
  ozaki-r


driver concurrency

2014-12-01 Thread David Holland
How many drivers are there (hardware-level drivers, not things like
raidframe) where it really matters for more than one lwp to be able to
be running (not stopped) in the driver at once?

I'm thinking probably network cards but not much else.

(This question is supposed to provoke a discussion; I have a reason
for asking, which I don't want to state to avoid skewing things.)
-- 
David A. Holland
dholl...@netbsd.org


Re: driver concurrency

2014-12-01 Thread Manuel Bouyer
On Mon, Dec 01, 2014 at 03:02:39PM +, David Holland wrote:
> How many drivers are there (hardware-level drivers, not things like
> raidframe) where it really matters for more than one lwp to be able to
> be running (not stopped) in the driver at once?
> 
> I'm thinking probably network cards but not much else.

disk controllers which support lots of concurrent transactions could
probably benefit from this too.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: Reuse strtonum(3) and reallocarray(3) from OpenBSD

2014-12-01 Thread Alan Barrett

On Sat, 29 Nov 2014, Kamil Rytarowski wrote:
My proposition is to add a new header in src/sys/sys/overflow.h 
(/usr/include/sys/overflow.h) with the following content:


operator_XaddY_overflow()
operator_XsubY_overflow()
operator_XmulY_overflow()

X = optional s (signed)
Y = optional l,ll, etc
[* see comment]


OK, so you have told us the names of the proposed functions.  But what
are their semantics, and why would they be useful?
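(For what it's worth, one plausible reading, guessing purely from the
names, is the semantics of GCC's __builtin_add_overflow() and C23's
ckd_add(): store the possibly-wrapped result and report whether
overflow occurred.  A hypothetical sketch of the signed-long add
variant, not anything the proposal actually specifies:)

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical semantics for operator_saddl_overflow() (signed long
 * add), modeled on __builtin_add_overflow()/ckd_add(): store the
 * (possibly wrapped) sum through *res and return true iff the
 * mathematical result did not fit in a long.
 */
static bool
operator_saddl_overflow(long a, long b, long *res)
{
    if ((b > 0 && a > LONG_MAX - b) || (b < 0 && a < LONG_MIN - b)) {
        /* two's-complement wrapped value, computed without signed UB */
        *res = (long)((unsigned long)a + (unsigned long)b);
        return true;
    }
    *res = a + b;
    return false;
}
```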

Last but not least please stop enforcing 
programmers' fancy to produce this kind of art: 
https://github.com/ivmai/bdwgc/commit/83231d0ab5ed60015797c3d1ad9056295ac3b2bb 
:-)


Please don't assume that people reading your email messages have
convenient internet access.  It's fine to give URLs that expand on what
you have said, but if you give the URL without any explanation then I
have no idea what you are talking about.

--apb (Alan Barrett)


status of Linux ptrace on amd64?

2014-12-01 Thread Alexander Nasonov
Hi,

While trying to make some Linux instrumentation tool work on -current
amd64 I noticed that ptrace support in compat_linux and compat_linux32
have some notable differences and compat_linux32 has a better support.
For instance, I can debug 32bit Linux binaries with NetBSD's gdb but
any attempt to set a breakpoint on 64bit Linux results in SIGTRAP.

Neither compat_linux32 nor compat_linux work for that particular tool
because they don't support LINUX_PEEKUSER.

So, I wonder how much work it is to improve ptrace in compat_linux and
compat_linux32?

Alex


Re: driver concurrency

2014-12-01 Thread Thor Lancelot Simon
On Mon, Dec 01, 2014 at 05:53:12PM +0100, Manuel Bouyer wrote:
> On Mon, Dec 01, 2014 at 03:02:39PM +, David Holland wrote:
> > How many drivers are there (hardware-level drivers, not things like
> > raidframe) where it really matters for more than one lwp to be able to
> > be running (not stopped) in the driver at once?
> > 
> > I'm thinking probably network cards but not much else.
> 
> disk controllers which supports lots of concurent transactions could
> probably benefit from this too.

They would, and many are simple enough to make this reasonably easy to do,
but in practice, the giant locking of our SCSI code makes it pointless.

I have experience with a popular (in the high performance crypto market,
anyway) crypto accelerator which can be configured to present multiple
command queues to allow CPU affinity for requests.  When run that way the
driver still needs locks (one per queue) but the locks are almost always
uncontended.  It's a big win.
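The pattern is simple enough to sketch (the queue count, names, and
cpu-steering here are made up for illustration; the real driver is
hardware-specific):

```c
#include <pthread.h>
#include <stdint.h>

#define NQUEUES 4  /* hypothetical: one command queue per cpu */

/* One lock per command queue.  With requests steered by cpu, each
 * cpu nearly always takes its own queue's lock, so the lock is held
 * briefly and almost never contended. */
struct cmd_queue {
    pthread_mutex_t lock;
    uint64_t enqueued;  /* stand-in for a ring of pending commands */
};

static struct cmd_queue queues[NQUEUES];

static void
queues_init(void)
{
    for (int i = 0; i < NQUEUES; i++)
        pthread_mutex_init(&queues[i].lock, NULL);
}

static void
submit(unsigned cpu)
{
    struct cmd_queue *q = &queues[cpu % NQUEUES];

    pthread_mutex_lock(&q->lock);  /* per-queue, not a giant lock */
    q->enqueued++;                 /* place the command on q's ring */
    pthread_mutex_unlock(&q->lock);
}
```

The locks are still there for correctness (completions or other cpus
may touch a queue), but contention drops to nearly zero.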

Thor


Re: driver concurrency

2014-12-01 Thread Manuel Bouyer
On Mon, Dec 01, 2014 at 02:28:04PM -0500, Thor Lancelot Simon wrote:
> They would, and many are simple enough to make this reasonably easy to do,
> but in practice, the giant locking of our SCSI code makes it pointless.

Sure, but we could also make the scsi code run without the giant lock.
Also, some of them don't use the scsi layer, but present an ld(4)
interface (although it seems that most recent ones present a scsi interface).

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: driver concurrency

2014-12-01 Thread Thor Lancelot Simon
On Mon, Dec 01, 2014 at 08:42:08PM +0100, Manuel Bouyer wrote:
> On Mon, Dec 01, 2014 at 02:28:04PM -0500, Thor Lancelot Simon wrote:
> > They would, and many are simple enough to make this reasonably easy to do,
> > but in practice, the giant locking of our SCSI code makes it pointless.
> 
> Sure, but we could also make the scsi code run without the giant lock.

Try it!  It's quite a mess, particularly around target and bus attach/detach.

Thor


Re: driver concurrency

2014-12-01 Thread Justin Cormack
On Mon, Dec 1, 2014 at 7:42 PM, Manuel Bouyer  wrote:
> On Mon, Dec 01, 2014 at 02:28:04PM -0500, Thor Lancelot Simon wrote:
>> They would, and many are simple enough to make this reasonably easy to do,
>> but in practice, the giant locking of our SCSI code makes it pointless.
>
> Sure, but we could also make the scsi code run without the giant lock.
> Also, some of them don't use the scsi layer, but present a ld(4)
> interface (although it seems that most recent ones present a scsi interface).

NVMexpress is a new non-SCSI storage standard with per CPU queues. I
was planning to port the FreeBSD driver when the hardware becomes a
bit more available.

Justin


Re: status of Linux ptrace on amd64?

2014-12-01 Thread Christos Zoulas
In article <20141201172756.GA29051@neva>,
Alexander Nasonov   wrote:
>Hi,
>
>While trying to make some Linux instrumentation tool work on -current
>amd64 I noticed that ptrace support in compat_linux and compat_linux32
>have some notable differences and compat_linux32 has a better support.
>For instance, I can debug 32bit Linux binaries with NetBSD's gdb but
>any attempt to set a breakpoint on 64bit Linux results in SIGTRAP.
>
>Neither compat_linux32 nor compat_linux work for that particular tool
>because they don't support LINUX_PEEKUSER.
>
>So, I wonder how much work it is to improve ptrace in compat_linux and
>compat_linux32?

Should not be that hard, but what is that tool reading from PEEKUSER?
Registers?

christos



Re: status of Linux ptrace on amd64?

2014-12-01 Thread Alexander Nasonov
Christos Zoulas wrote:
> Should not be that hard, but what is that tool reading from PEEKUSER?
> Registers?

In both cases it reads from user_regs_struct, if I understand everything
correctly.  But that's only the first step; the tool would definitely try
other things if PEEKUSER didn't fail.  In fact, I'm not sure that thing
would work at all, because it's a dynamic instrumentation tool called Pin,
which updates code on the fly.

I've got one more question.  Is it possible for a Linux emulated binary
to control a NetBSD native binary with ptrace?  I see that it can attach,
pokedata, and continue, but would it be able to do more advanced things
correctly?

Alex


Re: status of Linux ptrace on amd64?

2014-12-01 Thread Christos Zoulas
On Dec 1,  9:06pm, al...@yandex.ru (Alexander Nasonov) wrote:
-- Subject: Re: status of Linux ptrace on amd64?

| Christos Zoulas wrote:
| > Should not be that hard, but what is that tool reading from PEEKUSER?
| > Registers?
| 
| In both cases it reads from user_regs_struct, if I understand everything
| correctly. But it's the first step, the tool would definitely try other
| things if PEEKUSER didn't fail. In fact, I'm not sure that thing would
| work at all because it's a dynamic instrumentation tool called Pin. It
| updates code on the fly.
| 
| I've got one more question. Is it possible for Linux emulated binary
| to control NetBSD native binary with ptrace? I see that it can attach,
| pokedata and continue but would it be able to do more advanced things
| correctly?

It might, but most likely it will not due to differences in the ptrace
syscall implementation and bugs in the emulation. I remember trying to
use the linux gdb to debug a linux and a netbsd binary. Neither worked
too well.

christos


Re: driver concurrency

2014-12-01 Thread Masao Uebayashi
On Tue, Dec 2, 2014 at 5:32 AM, Thor Lancelot Simon  wrote:
> Try it!  It's quite a mess, particularly around target and bus attach/detach.

scsipi and wscons are the worst in that respect.

(Who wants to be a hero?)


Re: driver concurrency

2014-12-01 Thread Thor Lancelot Simon
On Tue, Dec 02, 2014 at 12:52:51PM +0900, Masao Uebayashi wrote:
> On Tue, Dec 2, 2014 at 5:32 AM, Thor Lancelot Simon  wrote:
> > Try it!  It's quite a mess, particularly around target and bus 
> > attach/detach.
> 
> scsipi and wscons are the worst in that respect.
> 
> (Who wants to be a hero?)

Dear Mr. Hero,

I have a couple of NetBSD kernel projects every winter, to give me
something to do during my commute (in the winter, I take the train; in
the summer, I bike).

The way I pick them is that I have to be able to restart the project
every week or so, having forgotten where I was, and figure out what I
was doing again in 20 minutes or less -- my commute is about 50 minutes.
That way I can get at least 30 minutes of stuff done any day I don't
have something else to do.

I tried "devise locking scheme for SCSI code" as one of those
projects for about a week last winter.  Utterly hopeless.  Perhaps
someone more clever or dedicated can make some progress, but I would
suggest it is the kind of work that will require several consecutive
days and a lot of planning just to get started -- and hours of effort
at a time, over several weeks to finish.

My best guess using my OOMA Engineering Manager's Manual is that
actually getting any meaningful concurrency in the SCSI code, for disks,
is about 20 or 30 hours of work total.  So if you are lucky enough to
have 3 consecutive hours of spare time every day, 5 days a week, and you
trust my guesswork, you can assume you are looking at somewhere between 1
and 2 weeks of hacking.

I have a wide array of SCSI controllers and targets that would let you
exercise many of the drivers.  If you're serious about this, send me a
project plan to indicate your interest and I will send them to you or
provide access.

-- 
 Thor Lancelot Simon  t...@panix.com
"From the tooth paste you use in the morning to the salt on your evening meal,
it's easy to take for granted the many products brought to us with explosives."
- Institute of Manufacturers of Explosives, "Explosives Make It Possible" 


Re: driver concurrency

2014-12-01 Thread Paul Goyette

...
actually getting any meaningful concurrency in the SCSI code, for disks,
...


Without, of course, sacrificing functionality for other things that 
attach via SCSI (tapes? scanners? cd?)


:)



-
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:   |
| (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com|
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net |
| Kernel Developer |  | pgoyette at netbsd.org  |
-