Re: New cpu_count races; seqlocks?

2019-12-17 Thread Andrew Doran
A bit off on a tangent, but my intent was to keep this stuff very simple
and distinct from the general (evcnt) type case, because to my mind these
counters are special, given their well-established use and how clustered
they are.

On Tue, Dec 17, 2019 at 05:07:12AM +, Taylor R Campbell wrote:

> It is great to get rid of global contention over a bunch of counters
> that don't need to be read exactly.  However, the new cpu_count logic
> is racy -- particularly on 32-bit platforms, where it's not simply a
> matter of reading stale values but a matter of reading total garbage:

I'm not averse to solving the problem at all, but my thinking is that in
most cases it doesn't matter too much, since these are sampled with some
regularity -- which is how we've been running, without a stable/reliable
view of these numbers, for 10+ years now.  However, having thought it over
again, perhaps we've grown a new problem that we simply don't need.

On a 32-bit platform, I don't see why any of these actually need to be
64-bit.  They've been wrappable since we've had them, they're still
wrappable in 64-bit form, and I'd be quite happy to say that if you're
running 32-bit then you get 32-bit wraparound.  Making them uintptr_t
internally in the implementation would at least mean we don't introduce
any new problems for the user or for ourselves.
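
(A rough sketch of what I mean -- the identifiers below are illustrative
only, not the actual cpu_count internals.  Each CPU keeps its counters at
native word width and only the owning CPU writes them, so a reader only
ever does word-sized loads, which can be stale but never torn, and on
32-bit you simply get 32-bit wraparound.)

#include <sys/param.h>
#include <sys/atomic.h>
#include <sys/cpu.h>

/* Word-sized on every platform; wraps at 32 bits on 32-bit ports. */
typedef unsigned long cpu_count_word_t;

/*
 * Hypothetical per-CPU layout, one slot per counter per CPU;
 * MAXCPUS stands in for whatever the port's CPU limit is.
 */
static cpu_count_word_t cpu_count_words[MAXCPUS][CPU_COUNT_MAX];

static inline void
cpu_count_word_inc(struct cpu_info *ci, enum cpu_count idx)
{

	/* Only the owning CPU updates its own slot; no atomics needed. */
	cpu_count_words[cpu_index(ci)][idx]++;
}

static cpu_count_word_t
cpu_count_word_sum(enum cpu_count idx)
{
	CPU_INFO_ITERATOR cii;
	struct cpu_info *ci;
	cpu_count_word_t sum = 0;

	for (CPU_INFO_FOREACH(cii, ci)) {
		/* Word-sized load: may be slightly stale, never torn. */
		sum += atomic_load_relaxed(&cpu_count_words[cpu_index(ci)][idx]);
	}
	return sum;
}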

> 	/* Claim exclusive access for modifications.  */
> 	do {
> 		while (1 & (gen = atomic_load_relaxed(&cpu_count_gen)))
> 			SPINLOCK_BACKOFF_HOOK;
> 	} while (atomic_cas_32(&cpu_count_gen, gen, gen | 1) != gen);

I actually quite like this from the POV of making sure that no more than
one LWP at a time tries to update the sum totals, since the cost of doing
so is huge.

> Perhaps we should have a seqlock(9) API for broader use.  I drafted an
> implementation of Linux's seqlock API for a drm import from Linux
> 4.19:
> 
> https://github.com/riastradh/netbsd-src/blob/riastradh-drm419/sys/external/bsd/drm2/include/linux/seqlock.h

I think we are very likely to run into a need for such a thing for the file
system and network code.

Andrew


New cpu_count races; seqlocks?

2019-12-16 Thread Taylor R Campbell
It is great to get rid of global contention over a bunch of counters
that don't need to be read exactly.  However, the new cpu_count logic
is racy -- particularly on 32-bit platforms, where it's not simply a
matter of reading stale values but a matter of reading total garbage:

/* sys/cpu_data.h */
static inline int64_t
cpu_count_get(enum cpu_count idx)
{
	extern int64_t cpu_counts[];
	return cpu_counts[idx];
}

On a 32-bit platform, this will generally be implemented by a sequence
of two 32-bit loads from memory, which can be torn and interrupted in
the middle.  For example, cpu_count_get(CPU_COUNT_NINTR) generates the
following i386 code (confirmed in procfs_linux.c, procfs_docpustat):

 65a:	8b 0d 20 00 00 00 	mov    0x20,%ecx
			65c: R_386_32	cpu_counts
 660:	8b 1d 24 00 00 00 	mov    0x24,%ebx
			662: R_386_32	cpu_counts

Perhaps the words will be written in an order such that on x86 CPUs
we'll read them in the correct order without garbage, but that's not
reliable in general.  To make this reliable in general, for quantities
that may overflow the 32-bit range, we might use a seqlock -- perhaps
conditionalized on !_LP64, with the cheaper atomic_load/store_relaxed
on _LP64 instead:

/* kern_cpu.c */
volatile uint32_t cpu_count_gen;
int64_t cpu_counts[...];

void
cpu_count_sync_all(void)
{
	uint32_t gen;

	/* Claim exclusive access for modifications.  */
	do {
		while (1 & (gen = atomic_load_relaxed(&cpu_count_gen)))
			SPINLOCK_BACKOFF_HOOK;
	} while (atomic_cas_32(&cpu_count_gen, gen, gen | 1) != gen);

	membar_producer();	/* store cpu_count_gen then cpu_counts[i] */
	...
	cpu_counts[i] = ...
	...
	membar_producer();	/* store cpu_counts[i] then cpu_count_gen */

	/* Relinquish exclusive access for modifications.  */
	KASSERT(atomic_load_relaxed(&cpu_count_gen) == (gen | 1));
	atomic_store_relaxed(&cpu_count_gen, (gen | 1) + 1);
}

int64_t
cpu_count_get(enum cpu_count i)
{
	uint32_t gen;
	int64_t value;

	do {
		/* Wait for modifications to settle.  */
		while (1 & (gen = atomic_load_relaxed(&cpu_count_gen)))
			SPINLOCK_BACKOFF_HOOK;

		membar_consumer();	/* load cpu_count_gen then cpu_counts[i] */
		value = cpu_counts[i];
		membar_consumer();	/* load cpu_counts[i] then cpu_count_gen */

		/* Retry if modified in the interim.  */
	} while (__predict_false(gen != atomic_load_relaxed(&cpu_count_gen)));

	return value;
}
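
(For the _LP64 side of that conditionalization, a sketch of what the
reader could reduce to: on LP64 an aligned 64-bit load is a single
instruction, so the generation counter and barriers can be compiled out
entirely, and the writer would likewise just atomic_store_relaxed each
cpu_counts[i].)

#ifdef _LP64
int64_t
cpu_count_get(enum cpu_count i)
{
	extern int64_t cpu_counts[];

	/* Aligned 64-bit loads are atomic on LP64; no seqlock needed. */
	return atomic_load_relaxed(&cpu_counts[i]);
}
#endif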

There is, of course, a chance that a 32-bit modification counter
(cpu_count_gen) can roll over, particularly if we are preempted in
cpu_count_get.  The probability of that happening may be low, but it
might also be prudent to kpreempt_disable during this read.
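
(Concretely, something like the sketch below -- the only change from the
reader above is the kpreempt_disable/kpreempt_enable bracket, which bounds
how long a single read can straddle updates of cpu_count_gen.)

int64_t
cpu_count_get(enum cpu_count i)
{
	uint32_t gen;
	int64_t value;

	kpreempt_disable();
	do {
		/* Wait for modifications to settle.  */
		while (1 & (gen = atomic_load_relaxed(&cpu_count_gen)))
			SPINLOCK_BACKOFF_HOOK;

		membar_consumer();
		value = cpu_counts[i];
		membar_consumer();

		/* Retry if modified in the interim.  */
	} while (__predict_false(gen != atomic_load_relaxed(&cpu_count_gen)));
	kpreempt_enable();

	return value;
}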

Perhaps we should have a seqlock(9) API for broader use.  I drafted an
implementation of Linux's seqlock API for a drm import from Linux
4.19:

https://github.com/riastradh/netbsd-src/blob/riastradh-drm419/sys/external/bsd/drm2/include/linux/seqlock.h
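
(For reference, usage under the Linux-style API in that header looks
roughly like the sketch below; a native seqlock(9) would presumably offer
something equivalent.  The names follow Linux, and the example identifiers
are made up for illustration.)

#include <linux/seqlock.h>

static seqlock_t stats_lock;
static int64_t stats_value;

void
stats_init(void)
{

	seqlock_init(&stats_lock);
}

/* Single writer (or writers serialized elsewhere). */
void
stats_update(int64_t v)
{

	write_seqlock(&stats_lock);
	stats_value = v;
	write_sequnlock(&stats_lock);
}

/* Lockless readers retry if a write raced with them. */
int64_t
stats_read(void)
{
	unsigned seq;
	int64_t v;

	do {
		seq = read_seqbegin(&stats_lock);
		v = stats_value;
	} while (read_seqretry(&stats_lock, seq));

	return v;
}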