Re: svn commit: r252032 - head/sys/amd64/include

2013-06-26 Thread Bruce Evans

On Wed, 26 Jun 2013, Gleb Smirnoff wrote:


On Wed, Jun 26, 2013 at 11:42:39AM +1000, Bruce Evans wrote:
B> > Anyway, as Gleb said, there is no point in
B> > optimizing the i386 kernel.
B>
B> I said that there is every point in optimizing the i386 kernel.  This
B> applies even more to other 32-bit arches.  Some CPUs are much slower
B> than modern x86's.  They shouldn't be slowed down more by inefficient
B> KPIs.

I didn't mean that the i386 arch is a relic and should be ignored entirely.

What I actually meant is that the performance drop due to cache poisoning,
and the loss of statistics with a simple "+=" operation, can be observed
only at extremely high event rates, with multiple processors involved.


I think you already fixed cache poisoning, and it has nothing to do with
whether the access is a simple '+=' operation.  amd64 still uses a simple
'+=' operation (written in asm so that it uses the right instructions),
so the slow cmpxchg8b used on i386 can't be doing anything good for the
cache.
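
For reference, a minimal sketch of what that single-instruction per-CPU add
amounts to on amd64, assuming %gs points at the pcpu area (an illustration of
the idea, not necessarily the exact committed code; counter_u64_add_sketch is
a made-up name):

static inline void
counter_u64_add_sketch(counter_u64_t c, int64_t inc)
{

	/*
	 * One addq to %gs-relative memory.  The read-modify-write is not
	 * atomic against other CPUs, but it doesn't need to be: the word
	 * belongs to this CPU, and a single instruction can't be split by
	 * an interrupt or by preemption on the same CPU.
	 */
	__asm __volatile("addq\t%1,%%gs:(%0)"
	    :
	    : "r" ((char *)c - (char *)&__pcpu[0]), "ri" (inc)
	    : "memory", "cc");
}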


counter(9) is the solution for these conditions.  Thus we are interested
in optimising amd64, not i386.  The latter is affected neither positively
nor negatively by these changes, simply because the last i386 CPUs can't
reach the event rates where the need for counter(9) arises.  Yes, you can
tweak the implementation and obtain better results in microbenchmarks, but
I bet that no change in the counter(9) implementation will affect the
packet forwarding rate on any i386.  What we claim for i386 (and all other
arches) is that counter(9) is lossless, and that's all.

I second Konstantin: we have no objections to any changes to the i386
part of counter(9), including a daemon, but the changes shouldn't affect
amd64.


amd64 should be changed too, to use 32-bit pcpu counters to avoid ifdefs
and to use less cache.  You can't reach event rates that overflow 32-bit
counters faster than a daemon can accumulate them.  For example, with
10 Gbps ethernet the maximum packet rate is about 14 Mpps.  Suppose that
is all handled and counted on 1 CPU, which isn't possible yet.  The
daemon must run once every 286 seconds to keep up with that.  Byte counts
are more interesting.  Counting 1 G/second of anything requires running
the daemon every 4 seconds.
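
A rough sketch of that scheme, with hypothetical names and a generic
compare-and-swap standing in for atomic_cmpset (the races around the owning
CPU's plain add are the same ones discussed elsewhere in this thread):

#include <stdint.h>

#define	MAXCPU	32

/* Hypothetical layout: 32-bit per-CPU parts plus a 64-bit total. */
struct counter32 {
	volatile uint32_t	c_pcpu[MAXCPU];	/* each updated by one addl */
	uint64_t		c_total;	/* owned by the daemon/fetcher */
};

/*
 * Housekeeping pass, run every few seconds or minutes: fold each per-CPU
 * part into the 64-bit total long before it can overflow.  At ~14 Mpps a
 * u_int part has several minutes of headroom; at 1 G of bytes per second,
 * about 4 seconds.
 */
static void
counter32_fold(struct counter32 *c, int ncpus)
{
	uint32_t v;
	int i;

	for (i = 0; i < ncpus; i++) {
		v = c->c_pcpu[i];
		/* Take the current value and reset the per-CPU part to 0. */
		while (!__atomic_compare_exchange_n(&c->c_pcpu[i], &v, 0,
		    0, __ATOMIC_RELAXED, __ATOMIC_RELAXED))
			;	/* v was refreshed with the current value */
		c->c_total += v;
	}
}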

I don't remember how you distributed the counters to avoid cache poisoning.
Is it one pcpu counter per cache line, so that counters never poison nor
benefit from caching for other counters for the same CPU?  Or is it just
separate cache lines for each CPU?  I think the latter.  So there can be
several 64-bit counters per cache line, or twice as many 32-bit counters.
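
For concreteness, the two layouts being compared, with hypothetical names and
a 64-byte line assumed:

#include <stdio.h>
#include <stdint.h>

#define	CACHE_LINE_SIZE	64

/*
 * Layout A: one counter per cache line per CPU, so a counter never shares
 * a line with any other counter.
 */
struct counter_padded {
	uint64_t	val;
	char		pad[CACHE_LINE_SIZE - sizeof(uint64_t)];
};

/*
 * Layout B: each CPU gets its own line(s), but that CPU's counters are
 * packed together, so one line holds 8 64-bit or 16 32-bit counters.
 */
struct counters_packed {
	uint64_t	c64[CACHE_LINE_SIZE / sizeof(uint64_t)];
	uint32_t	c32[CACHE_LINE_SIZE / sizeof(uint32_t)];
};

int
main(void)
{

	printf("padded counter: %zu bytes each\n",
	    sizeof(struct counter_padded));
	printf("packed 64-bit counters per line: %zu\n",
	    CACHE_LINE_SIZE / sizeof(uint64_t));
	printf("packed 32-bit counters per line: %zu\n",
	    CACHE_LINE_SIZE / sizeof(uint32_t));
	return (0);
}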

Bruce


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-26 Thread Dmitry Morozovsky
On Tue, 25 Jun 2013, Konstantin Belousov wrote:

> > > Updates to the counter cannot be done from the interrupt context.
> > 
> > This is fragile, however.  It prevents using counters for things like
> > counting interrupts.  Most interrupt counting is now done directly
> > and doesn't use PCPU_INC().  i386/i386 has just 1 use of PCPU_INC().
> > It is to count traps in machdep.c.  Traps include nonmaskable
> > interrupts.  Even hard-disabling of interrupts doesn't prevent these.
> > Otherwise, PCPU is used mainly for vm counters.  E.g., for pagefaults.
> > Now the trap is not an interrupt, so it shouldn't occur in the middle
> > of the counter update and the PCPU_INC() can safely be replaced by
> > a counter, but this is not clear.
> Traps are not performance critical in the sense that there is no need to count
> up to 1-10G traps per second.  Anyway, as Gleb said, there is no point in
> optimizing the i386 kernel.  

Hmm, don't we count semi-embedded routers? Or, do we think they are either 
amd64 or arm/mips?

[snip all the rest]


-- 
Sincerely,
D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer: ma...@freebsd.org ]

*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- ma...@rinet.ru ***



Re: svn commit: r252032 - head/sys/amd64/include

2013-06-26 Thread Dmitry Morozovsky
On Wed, 26 Jun 2013, Gleb Smirnoff wrote:

> On Wed, Jun 26, 2013 at 11:42:39AM +1000, Bruce Evans wrote:
> B> > Anyway, as Gleb said, there is no point in
> B> > optimizing the i386 kernel.
> B> 
> B> I said that there is every point in optimizing the i386 kernel.  This
> B> applies even more to other 32-bit arches.  Some CPUs are much slower
> B> than modern x86's.  They shouldn't be slowed down more by inefficient
> B> KPIs.
> 
> I didn't mean that the i386 arch is a relic and should be ignored entirely.
> 
> What I actually meant is that the performance drop due to cache poisoning,
> and the loss of statistics with a simple "+=" operation, can be observed
> only at extremely high event rates, with multiple processors involved.
> 
> counter(9) is the solution for these conditions.  Thus we are interested
> in optimising amd64, not i386.  The latter is affected neither positively
> nor negatively by these changes, simply because the last i386 CPUs can't
> reach the event rates where the need for counter(9) arises.  Yes, you can
> tweak the implementation and obtain better results in microbenchmarks, but
> I bet that no change in the counter(9) implementation will affect the
> packet forwarding rate on any i386.  What we claim for i386 (and all other
> arches) is that counter(9) is lossless, and that's all.
> 
> I second Konstantin: we have no objections to any changes to the i386
> part of counter(9), including a daemon, but the changes shouldn't affect
> amd64.

Ah, apparently this mostly answers the question I've just asked ;)

-- 
Sincerely,
D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer: ma...@freebsd.org ]

*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- ma...@rinet.ru ***



Re: svn commit: r252032 - head/sys/amd64/include

2013-06-26 Thread Gleb Smirnoff
  Bruce,

On Wed, Jun 26, 2013 at 11:42:39AM +1000, Bruce Evans wrote:
B> > Anyway, as Gleb said, there is no point in
B> > optimizing the i386 kernel.
B> 
B> I said that there is every point in optimizing the i386 kernel.  This
B> applies even more to other 32-bit arches.  Some CPUs are much slower
B> than modern x86's.  They shouldn't be slowed down more by inefficient
B> KPIs.

I didn't mean that the i386 arch is a relic and should be ignored entirely.

What I actually meant is that the performance drop due to cache poisoning,
and the loss of statistics with a simple "+=" operation, can be observed
only at extremely high event rates, with multiple processors involved.

counter(9) is the solution for these conditions.  Thus we are interested
in optimising amd64, not i386.  The latter is affected neither positively
nor negatively by these changes, simply because the last i386 CPUs can't
reach the event rates where the need for counter(9) arises.  Yes, you can
tweak the implementation and obtain better results in microbenchmarks, but
I bet that no change in the counter(9) implementation will affect the
packet forwarding rate on any i386.  What we claim for i386 (and all other
arches) is that counter(9) is lossless, and that's all.

I second Konstantin: we have no objections to any changes to the i386
part of counter(9), including a daemon, but the changes shouldn't affect
amd64.

-- 
Totus tuus, Glebius.


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-25 Thread Bruce Evans

On Tue, 25 Jun 2013, Konstantin Belousov wrote:


On Tue, Jun 25, 2013 at 08:14:41PM +1000, Bruce Evans wrote:

On Tue, 25 Jun 2013, Konstantin Belousov wrote:


Updates to the counter cannot be done from the interrupt context.


This is fragile, however.  It prevents using counters for things like
counting interrupts.  Most interrupt counting is now done directly
and doesn't use PCPU_INC().  i386/i386 has just 1 use of PCPU_INC().
It is to count traps in machdep.c.  Traps include nonmaskable
interrupts.  Even hard-disabling of interrupts doesn't prevent these.
...

Traps are not performance critical in the sense that there is no need to count
up to 1-10G traps per second.


The trap count increment is not a performance problem, but a correctness one.
Disabling interrupts for a multi-word increment doesn't work for NMIs, even
in the !SMP case or for pcpu accesses in the SMP case.


Anyway, as Gleb said, there is no point in
optimizing the i386 kernel.


I said that there is every point in optimizing the i386 kernel.  This
applies even more to other 32-bit arches.  Some CPUs are much slower
than modern x86's.  They shouldn't be slowed down more by inefficient
KPIs.


The fetch problem of sometimes missing the carry bit is present on all
32bit arches now, I think.


Also the zeroing problem.  Since there are no atomic ops in sight, zeroing
has several races.  My transferring code uses atomic ops on only 1 CPU,
so it has half of these races.  See below.

Right, zeroing should also use atomic 64bit writes on i386.  Updated
patch is at end.


It's getting even more complicated and further from what I want.  It also
doesn't work.  Disabling interrupts doesn't work for the SMP case.  The
zeroing needs to run in a per-CPU daemon.


The arm implementation of atomic.h is also relevant and is especially
interesting.  I looked at it after mdf@ pointed out ARM_RAS_START.
This seems to only be used by arm in userland.  It is even larger and
slower and more complicated than critical sections, so it doesn't seem
to be useful for counters.  In the kernel, arm has several
implementations depending on machine capabilities.  The ifdefs are
hard to understand.  On some machine, it seems to use the equivalent
of a cmpxchg loop.  On others, it just disables interrupts.  Disabling
interrupts is presumably better than a critical section, and userland
has to use ARM_RAS_START for some arches because neither disabling
interrupts nor critical sections are available in userland.  None
of this care helps avoid the races in pcpu.h, since pcpu.h intentionally
avoids using atomic ops.  None of this care helps avoid the races in
counter.h or make counter.h efficient, since counter.h uses its own
methods.  To reach the same quality as atomic.h, counter.h would have
to copy half of the ifdefs and methods in atomic.h.

BTW, why do you claim that disabling interrupts is a faster and better method
to disable preemption than a critical section?


For arm, it is because if the critical section method were better, then
arm should use it for atomic ops too.

Of course not.  Atomics must be atomic WRT the interrupts as well.
Using critical sections only provides atomicity against preemption
but not against interruption.


Indeed, atomics would be fragile if they couldn't be used in interrupt
handlers.  So in a non-optimized implementation it is safe to disable
interrupts in hardware.  An optimized solution would accept the fragility
like the counter increment does, and use special atomic ops that
do more for the few variables that are accessed by interrupt handlers.


The arm looks quite deficient anyway.  From the arm/include/atomic.h,
it seems that the arch did not have (usable) atomics until v6.  And
ARM does not support SMP in the stock HEAD still.


Disabling interrupts only gives atomicity for non-SMP.  Thus the cases
where this method is used indicate an arch that doesn't support SMP.
More interestingly for the current discussion, arm and several
other arches apparently don't have interrupt-safe load-modify-store
instructions even for register-sized memory variables.  Thus they
need to either disable interrupts or use a cmpxchg-type instruction
even to add to register-sized variables.


% static inline void
% counter_u64_add(counter_u64_t c, int64_t inc)
% {
%
% #if 1
%   /* 20 cycles on A64. */
%   /* 31 cycles on A64 with lock prefix. */
%   /* 35 cycles on A64 with mfence prefix. */
%   if ((cpu_feature & 1) == 1) {
%   counter_64_inc_8b(c, inc);
%   }

Old tests said that this takes 18 cycles.  It now takes 20.

Adding a lock or mfence prefix to the cmpxchg8b gives the times stated
above.  This shows how bad using cmpxch8b is.  We could use atomic ops
for everything and only be 50% slower.  But we shouldn't use either
cmpxchg8b or atomic ops.  We should use 32-bit addl, which takes about
4 cycles.
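
A crude userland comparison of the two costs (a hypothetical microbenchmark;
absolute numbers vary a lot with CPU, compiler and flags, and the 64-bit case
here carries a lock prefix, which the per-CPU kernel code would not need):

#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define	ITERS	100000000ULL

static volatile uint32_t c32;	/* volatile keeps the loop from folding */
static uint64_t c64;

static double
elapsed(struct timespec a, struct timespec b)
{

	return ((b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
}

int
main(void)
{
	struct timespec t0, t1, t2;
	uint64_t i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++)
		c32 += 1;		/* a plain 32-bit add to memory */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	for (i = 0; i < ITERS; i++)	/* locked; a cmpxchg8b loop on i386 */
		__atomic_fetch_add(&c64, 1, __ATOMIC_RELAXED);
	clock_gettime(CLOCK_MONOTONIC, &t2);

	printf("plain 32-bit add:  %.2f ns/op\n", elapsed(t0, t1) / ITERS * 1e9);
	printf("atomic 64-bit add: %.2f ns/op\n", elapsed(t1, t2) / ITERS * 1e9);
	printf("%u %llu\n", (unsigned)c32, (unsigned long long)c64);
	return (0);
}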

Use of cmpxchg8b provides the soundness to the algorithm

Re: svn commit: r252032 - head/sys/amd64/include

2013-06-25 Thread Konstantin Belousov
On Tue, Jun 25, 2013 at 08:14:41PM +1000, Bruce Evans wrote:
> On Tue, 25 Jun 2013, Konstantin Belousov wrote:
> 
> > Updates to the counter cannot be done from the interrupt context.
> 
> This is fragile, however.  It prevents using counters for things like
> counting interrupts.  Most interrupt counting is now done directly
> and doesn't use PCPU_INC().  i386/i386 has just 1 use of PCPU_INC().
> It is to count traps in machdep.c.  Traps include nonmaskable
> interrupts.  Even hard-disabling of interrupts doesn't prevent these.
> Otherwise, PCPU is used mainly for vm counters.  E.g., for pagefaults.
> Now the trap is not an interrupt, so it shouldn't occur in the middle
> of the counter update and the PCPU_INC() can safely be replaced by
> a counter, but this is not clear.
Traps are not performance critical in the sense that there is no need to count
up to 1-10G traps per second.  Anyway, as Gleb said, there is no point in
optimizing the i386 kernel.  

> 
> > The fetch problem of sometimes missing the carry bit is present on all
> > 32bit arches now, I think.
> 
> Also the zeroing problem.  Since there are no atomic ops in sight, zeroing
> has several races.  My transferring code uses atomic ops on only 1 CPU,
> so it has half of these races.  See below.
Right, zeroing should also use atomic 64bit writes on i386.  Updated
patch is at end.

> 
> >> The arm implementation of atomic.h is also relevant and is especially
> >> interesting.  I looked at it after mdf@ pointed out ARM_RAS_START.
> >> This seems to only be used by arm in userland.  It is even larger and
> >> slower and more complicated than critical sections, so it doesn't seem
> >> to be useful for counters.  In the kernel, arm has several
> >> implementations depending on machine capabilities.  The ifdefs are
> >> hard to understand.  On some machine, it seems to use the equivalent
> >> of a cmpxchg loop.  On others, it just disables interrupts.  Disabling
> >> interrupts is presumably better than a critical section, and userland
> >> has to use ARM_RAS_START for some arches because neither disabling
> >> interrupts nor critical sections are available in userland.  None
> >> of this care helps avoid the races in pcpu.h, since pcpu.h 
> >> intentionally
> >> avoids using atomic ops.  None of this care helps avoid the races in
> >> counter.h or make counter.h efficient, since counter.h uses its own
> >> methods.  To reach the same quality as atomic.h, counter.h would have
> >> to copy half of the ifdefs and methods in atomic.h.
> > BTW, why do you claim that disabling interrupts is a faster and better method
> > to disable preemption than a critical section?
> 
> For arm, it is because if the critical section method were better, then
> arm should use it for atomic ops too.
Of course not.  Atomics must be atomic WRT the interrupts as well.
Using critical sections only provides atomicity against preemption
but not against interruption.

The arm looks quite deficient anyway.  From the arm/include/atomic.h,
it seems that the arch did not have (usable) atomics until v6.  And
ARM does not support SMP in the stock HEAD still.

> 
> For i386, it is because it reduces the window in which the value is invalid
> (other CPUs can see the invalid value), and is easier to use and shorter
> (2-3 bytes) and thus better suited to inline code.  My examples of inline
> code mostly used it in rarely-executed cases where its speed doesn't matter
> and it is just easier to use.  Actual testing shows that it is fairly
> efficient, so can be used in all cases.  On Athlon64, it is faster than
> cmpxchg8b but slower than an inline optimized critical section.  This
> testing is in userland, where there may be an extra penalty for using
> cli/sti/pushfl/popfl:
> 
> % static inline void
> % counter_u64_add(counter_u64_t c, int64_t inc)
> % {
> % 
> % #if 1
> % /* 20 cycles on A64. */
> % /* 31 cycles on A64 with lock prefix. */
> % /* 35 cycles on A64 with mfence prefix. */
> % if ((cpu_feature & 1) == 1) {
> % counter_64_inc_8b(c, inc);
> % }
> 
> Old tests said that this takes 18 cycles.  It now takes 20.
> 
> Adding a lock or mfence prefix to the cmpxchg8b gives the times stated
> above.  This shows how bad using cmpxchg8b is.  We could use atomic ops
> for everything and only be 50% slower.  But we shouldn't use either
> cmpxchg8b or atomic ops.  We should use 32-bit addl, which takes about
> 4 cycles.
Use of cmpxchg8b provides the soundness to the algorithm, by using
atomic 64 bit loads and stores.

As I stated, I am not going to spend time micro-optimizing for i386.
The current implementation is good enough.

> 
> % struct thread
> % {
> % uint32_t td_critnest;
> % uint32_t td_owepreempt;
> % } td0;
> % 
> % volatile struct thread *td = &td0;
> % 
> % void
> % dummy_accum(void)
> % {
> % }
> % 
> % void
> % fake_mi_switch(void)
> % {
> % }
> 
> Added more emulation for a compl

Re: svn commit: r252032 - head/sys/amd64/include

2013-06-25 Thread Bruce Evans

On Tue, 25 Jun 2013, Gleb Smirnoff wrote:


On Mon, Jun 24, 2013 at 11:16:33PM +1000, Bruce Evans wrote:
B> > K> This is quite an interesting idea, but I still have not decided if it
B> > K> is acceptable.  The issue is that we could add the carry to the other
B> > K> processor counter, if the preemption kicks in at the right time between
B> > K> two instructions.  I did not find any argument why it would be
B> > K> wrong; the races for the fetch operation seem to be the same with either
B> > K> local update method.
B> >
B> > This would be wrong since the update isn't locked. Thus, if we are moved to
B> > another CPU between the two instructions, and the second instruction updates
B> > another CPU's counter simultaneously with the original CPU we were on,
B> > then we lose an update.
B>
B> Hmm, this is subtle.  The update is never lost, but is applied to a
B> different counter, non-atomically at a later time.  Non-atomicity
B> only matters when there is a carry since the counter only goes
B> transiently backwards in this case.  For example: initial state:
B>
B>  CPU1 counter:  00000000 ffffffff
B>  CPU2 counter:  00000000 fffffffe
B>
B> Start adding 1 to the first counter, doing it non-atomically by
B> incrementing the low word first.
B>
B>  CPU1 counter:  00000000 00000000  (carry in CPU1 eflags)
B>  CPU2 counter:  00000000 fffffffe
B>
B> Get preempted at this point:
B>
B>  CPU1 counter:  00000000 00000000
B>  CPU2 counter:  00000000 fffffffe  (carry in CPU2 eflags)

Nope, the non-atomicity isn't specific to the case where we carry a
bit from the least significant part to the most significant one.


Sure, but the other parts already work correctly.


The non-atomicity is much simpler: two CPUs write to the same memory address
without using a lock prefix.


No, they can't, since these are per-CPU counters.  Except for broken
per-CPU counters and invalid accesses by CPUs that don't own the counter
in upper layer code.


Thus, any update mechanism that first calculates the absolute per-CPU address,
then writes to it without a lock, requires a critical section, because
preemption between the calculation and the write, followed by execution on
another CPU, leads to two CPUs writing to the same address.


Only broken per-CPU counters do that.  I already pointed out that all
arches except amd64 and i386 have buggy PCPU_ADD() and other PCPU_*()
access functions because they do do that.  ia64, powerpc and sparc64
even document that that they do this in an XXX comment.  Actually,
ia64 and sparc64 and perhaps some others are not quite this bad, since
their pcpu pointer is in a special register that should get switched
by context switches.  It still takes a lot of magic for the compiler
to use this register for the accesses and not copy it to another
register that doesn't get context switched.  Their pcpu access bugs
may be limited to the compiler possibly generating non-atomic
load-modify-store instructions.

Their XXX comment has wording bugs either way.  It says that "this"
operation should be made atomic, but the scope of the comment is 4
operations.  The first operation is PCPU_ADD(), and it used to be the
only one that does load-modify-store.  Now it is followed by the
PCPU_INC() operation which also does load-modify-store.  Aligned word-sized
operations have a chance of being atomic enough for the other operations.

Bruce


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-25 Thread Gleb Smirnoff
On Mon, Jun 24, 2013 at 11:16:33PM +1000, Bruce Evans wrote:
B> > K> This is quite an interesting idea, but I still have not decided if it
B> > K> is acceptable.  The issue is that we could add the carry to the other
B> > K> processor counter, if the preemption kicks in at the right time between
B> > K> two instructions.  I did not find any argument why it would be
B> > K> wrong; the races for the fetch operation seem to be the same with either
B> > K> local update method.
B> >
B> > This would be wrong since the update isn't locked. Thus, if we are moved to
B> > another CPU between the two instructions, and the second instruction updates
B> > another CPU's counter simultaneously with the original CPU we were on,
B> > then we lose an update.
B> 
B> Hmm, this is subtle.  The update is never lost, but is applied to a
B> different counter, non-atomically at a later time.  Non-atomicity
B> only matters when there is a carry since the counter only goes
B> transiently backwards in this case.  For example: initial state:
B> 
B>  CPU1 counter:  00000000 ffffffff
B>  CPU2 counter:  00000000 fffffffe
B>
B> Start adding 1 to the first counter, doing it non-atomically by
B> incrementing the low word first.
B>
B>  CPU1 counter:  00000000 00000000  (carry in CPU1 eflags)
B>  CPU2 counter:  00000000 fffffffe
B>
B> Get preempted at this point:
B>
B>  CPU1 counter:  00000000 00000000
B>  CPU2 counter:  00000000 fffffffe  (carry in CPU2 eflags)

Nope, the non-atomicity isn't specific to the case where we carry a
bit from the least significant part to the most significant one.

The non-atomicity is much simpler: two CPUs write to the same memory address
without using a lock prefix.

Thus, any update mechanism that first calculates the absolute per-CPU address,
then writes to it without a lock, requires a critical section, because
preemption between the calculation and the write, followed by execution on
another CPU, leads to two CPUs writing to the same address.

-- 
Totus tuus, Glebius.


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-25 Thread Bruce Evans

On Tue, 25 Jun 2013, Konstantin Belousov wrote:


On Tue, Jun 25, 2013 at 12:45:36PM +1000, Bruce Evans wrote:

On Mon, 24 Jun 2013, Konstantin Belousov wrote:

...

The following is the prototype for the x86. The other 64bit
architectures are handled exactly like amd64. For 32bit !x86 arches,
i.e. arm, mips32 and powerpc32, some more investigation is needed.


That works of course, but is too complicated, machine-dependent and slow
for me.
...
The non-optimized  avoids these problems using a
critical section.  I'm not sure that works.  Anyway, counter.h shouldn't
have to understand pcpu better than pcpu.h does in order to work.

Updates to the counter cannot be done from the interrupt context.


This is fragile, however.  It prevents using counters for things like
counting interrupts.  Most interrupt counting is now done directly
and doesn't use PCPU_INC().  i386/i386 has just 1 use of PCPU_INC().
It is to count traps in machdep.c.  Traps include nonmaskable
interrupts.  Even hard-disabling of interrupts doesn't prevent these.
Otherwise, PCPU is used mainly for vm counters.  E.g., for pagefaults.
Now the trap is not an interrupt, so it shouldn't occur in the middle
of the counter update and the PCPU_INC() can safely be replaced by
a counter, but this is not clear.


The fetch problem of sometimes missing the carry bit is present on all
32bit arches now, I think.


Also the zeroing problem.  Since there are no atomic ops in sight, zeroing
has several races.  My transferring code uses atomic ops on only 1 CPU,
so it has half of these races.  See below.


The arm implementation of atomic.h is also relevant and is especially
interesting.  I looked at it after mdf@ pointed out ARM_RAS_START.
This seems to only be used by arm in userland.  It is even larger and
slower and more complicated than critical sections, so it doesn't seem
to be useful for counters.  In the kernel, arm has several
implementations depending on machine capabilities.  The ifdefs are
hard to understand.  On some machine, it seems to use the equivalent
of a cmpxchg loop.  On others, it just disables interrupts.  Disabling
interrupts is presumably better than a critical section, and userland
has to use ARM_RAS_START for some arches because neither disabling
interrupts nor critical sections are available in userland.  None
of this care helps avoid the races in pcpu.h, since pcpu.h intentionally
avoids using atomic ops.  None of this care helps avoid the races in
counter.h or make counter.h efficient, since counter.h uses its own
methods.  To reach the same quality as atomic.h, counter.h would have
to copy half of the ifdefs and methods in atomic.h.

BTW, why do you claim that disabling interrupts is a faster and better method
to disable preemption than a critical section?


For arm, it is because if the critical section method were better, then
arm should use it for atomic ops too.

For i386, it is because it reduces the window in which the value is invalid
(other CPUs can see the invalid value), and is easier to use and shorter
(2-3 bytes) and thus better suited to inline code.  My examples of inline
code mostly used it in rarely-executed cases where its speed doesn't matter
and it is just easier to use.  Actual testing shows that it is fairly
efficient, so can be used in all cases.  On Athlon64, it is faster than
cmpxchg8b but slower than an inline optimized critical section.  This
testing is in userland, where there may be an extra penalty for using
cli/sti/pushfl/popfl:

% static inline void
% counter_u64_add(counter_u64_t c, int64_t inc)
% {
% 
% #if 1

%   /* 20 cycles on A64. */
%   /* 31 cycles on A64 with lock prefix. */
%   /* 35 cycles on A64 with mfence prefix. */
%   if ((cpu_feature & 1) == 1) {
%   counter_64_inc_8b(c, inc);
%   }

Old tests said that this takes 18 cycles.  It now takes 20.

Adding a lock or mfence prefix to the cmpxchg8b gives the times stated
above.  This shows how bad using cmpxchg8b is.  We could use atomic ops
for everything and only be 50% slower.  But we shouldn't use either
cmpxchg8b or atomic ops.  We should use 32-bit addl, which takes about
4 cycles.

% struct thread
% {
%   uint32_t td_critnest;
%   uint32_t td_owepreempt;
% } td0;
% 
% volatile struct thread *td = &td0;
% 
% void

% dummy_accum(void)
% {
% }
% 
% void

% fake_mi_switch(void)
% {
% }

Added more emulation for a complete critical section.

% static inline void
% alt_counter_u64_add(counter_u64_t c, int64_t inc)
% {
% #if 0
%   /* 8.5 cycles on A64. */
%   td->td_critnest++;
%   __asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
%   td->td_critnest++;
% #elif 0
%   /* 7 cycles on A64. */
%   uint32_t ocritnest;
% 
% 	ocritnest = td->td_critnest;

%   td->td_critnest = ocritnest + 1;
%   __asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
%   td->td_c

Re: svn commit: r252032 - head/sys/amd64/include

2013-06-24 Thread Konstantin Belousov
On Tue, Jun 25, 2013 at 12:45:36PM +1000, Bruce Evans wrote:
> On Mon, 24 Jun 2013, Konstantin Belousov wrote:
> 
> > On Sun, Jun 23, 2013 at 07:57:57PM +1000, Bruce Evans wrote:
> >> The case that can't be fixed by rereading the counters is when fetching
> >> code runs in between the stores.  If the stores are on another CPU
> >> that is currently executing them, then we can keep checking that the
> >> counters don't change for the maximum time that any pair of stores
> >> take to complete.  This time is quite long and difficult to determine
> >> (but it can be bounded by hard-disabling interrupts in the storer, and
> >> limited by prefetching).  If the stores are on the same CPU, then we
> >> have no good way to detect that another thread is in the middle of
> >> them, or to switch back to that thread.  So we currently prevent this
> >> case using slow methods.
> >
> > We are already explicit about the fetched value potentially not
> > reflecting any real moment value, but indeed, having the fetch result
> > off by 2^32 is not acceptable.
> >
> > We need atomic 64bit loads to fix this, which is provided by the same
> > cmpxchg8b instruction on the 8-byte aligned elements.  Intel guarantees
> > that 8-byte aligned loads and stores are atomic.
> >
> > The following is the prototype for the x86. The other 64bit
> > architectures are handled exactly like amd64. For 32bit !x86 arches,
> > i.e. arm, mips32 and powerpc32, some more investigation is needed.
> 
> That works of course, but is too complicated, machine-dependent and slow
> for me.
> (I just checked all the pcpu.h implementations and found that
>  all of them except amd64 and i386 still use the quick and racy
>  PCPU_PTR(member) += (value) for PCPU_ADD().  This has 1 sure
>  race and 1 compiler/machine-dependent race:
>  - PCPU_PTR() is only usable if preemption (resulting in rescheduling
>to run on a different CPU) is prevented in the caller.  Otherwise,
>the pointer may become invalid immediately after it is initialized.
>  - the compiler may generate a racy load-add-store operation for +=.
>Using PCPU_PTR() makes this a smaller problem than using a thread-
>local pointer.  It prevents the store going to a different CPU
>than the load.
>  PCPU_GET() and PCPU_SET() have the first of these problems on all
>  arches except amd64 and i386.  Even curthread wouldn't work in the
>  SMP case if it has the default implementation of PCPU_GET(curthread).
>  However, it has a non-default implementation on all arches except
>  arm and mips.  So even curthread seems to be broken for arm and mips.)
Brokenness (or, from another PoV, a restriction on the use of PCPU_GET/SET/INC)
is well known.  For ARM, I believe the in-tree kernels do not support SMP
at all, so curthread as-is is not broken.  For mips, it seems that the
pcpup is located on a wired page whose mapping is context-switched, which
should make curthread functional.

> 
> The non-optimized  avoids these problems using a
> critical section.  I'm not sure that works.  Anyway, counter.h shouldn't
> have to understand pcpu better than pcpu.h does in order to work.
Updates to the counter cannot be done from the interrupt context.
The fetch problem of sometimes missing the carry bit is present on all
32bit arches now, I think.

> 
> The arm implementation of atomic.h is also relevant and is especially
> interesting.  I looked at it after mdf@ pointed out ARM_RAS_START.
> This seems to only be used by arm in userland.  It is even larger and
> slower and more complicated than critical sections, so it doesn't seem
> to be useful for counters.  In the kernel, arm has several
> implementations depending on machine capabilities.  The ifdefs are
> hard to understand.  On some machine, it seems to use the equivalent
> of a cmpxchg loop.  On others, it just disables interrupts.  Disabling
> interrupts is presumably better than a critical section, and userland
> has to use ARM_RAS_START for some arches because neither disabling
> interrupts nor critical sections are available in userland.  None
> of this care helps avoid the races in pcpu.h, since pcpu.h intentionally
> avoids using atomic ops.  None of this care helps avoid the races in
> counter.h or make counter.h efficient, since counter.h uses its own
> methods.  To reach the same quality as atomic.h, counter.h would have
> to copy half of the ifdefs and methods in atomic.h.
BTW, why do you claim that disabling interrupts is a faster and better method
to disable preemption than a critical section?

> 
> The slowness that I care about is only in the counter update.  Anything
> complicated there will be slow.
> 
> My current best design:
> - use ordinary mutexes to protect counter fetches in non-per-CPU contexts.
> - use native-sized or always 32-bit counters.  Counter updates are done
>by a single addl on i386. 

Re: svn commit: r252032 - head/sys/amd64/include

2013-06-24 Thread Daniel O'Connor

On 25/06/2013, at 12:54, Bruce Evans  wrote:
>> - run a daemon every few minutes to fetch all the counters, so that
>> the native-sized counters are in no danger of overflowing on systems
>> that don't run statistics programs often enough to fetch the counters
>> to actually use.
> 
> There is at least 1 counter decrement (add -1) in tcp, so the native counters
> need to be signed.

You could convert decrements into an increment of a separate counter and then 
subtract that value from the others when collecting them all.
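
A sketch of that suggestion (hypothetical names; the per-CPU and atomicity
details are omitted):

#include <stdint.h>

/*
 * Keep separate, only-ever-increasing increment and decrement counters and
 * subtract at collection time, so the hot path never stores a negative or
 * wrapped value.
 */
struct split_counter {
	uint64_t	inc;
	uint64_t	dec;
};

static inline void
split_counter_add(struct split_counter *c, int64_t v)
{

	if (v >= 0)
		c->inc += (uint64_t)v;
	else
		c->dec += -(uint64_t)v;	/* well-defined unsigned negation */
}

static inline uint64_t
split_counter_fetch(const struct split_counter *c)
{

	return (c->inc - c->dec);	/* correct modulo 2^64 */
}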

--
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
"The nice thing about standards is that there
are so many of them to choose from."
  -- Andrew Tanenbaum
GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C








Re: svn commit: r252032 - head/sys/amd64/include

2013-06-24 Thread Bruce Evans

On Tue, 25 Jun 2013, I wrote:


My current best design:
- use ordinary mutexes to protect counter fetches in non-per-CPU contexts.
- use native-sized or always 32-bit counters.  Counter updates are done
 by a single addl on i386.  Fix pcpu.h on arches other than amd64 and
 i386 and use the same method as there.
- counter fetches add the native-sized pcpu counters to 64-bit non-pcpu
 counters, when the native-size counters are in danger of overflowing
 or always, under the mutex.  Transferring data uses an ordinary
 atomic_cmpset.  To avoid ifdefs, always use u_int pcpu counters.
 The 64-bit non-pcpu counters can easily be changed to pcpu counters
 if the API is extended to support pcpu data.
- run a daemon every few minutes to fetch all the counters, so that
 the native-sized counters are in no danger of overflowing on systems
 that don't run statistics programs often enough to fetch the counters
 to actually use.


There is at least 1 counter decrement (add -1) in tcp, so the native counters
need to be signed.


...
With my design:

extern struct mtx counter_locks[];
extern uint64_t counters[];


This is pseudo-code.  The extra structure must be dynamically allocated
with each counter.  I'm not sure how to do that.  uint64_pcpu_zone
is specialized for pcpu counters, and a non-pcpu part is needed.


uint64_t r;
volatile u_int *p;
u_int v;


Change to int.


int cnum;

cnum = ctonum(c);
mtx_lock(&counter_locks[cnum]); /* might not need 1 per counter */
r = counters[cnum];
for (i = 0; i < mp_ncpus; i++) {
p = (u_int *)((char *)c + sizeof(struct pcpu) * i);


Change to int *.


v = *p; /* don't care if it is stale */
if (v >= 0x80000000) {


Change the critical level to 2 critical levels, 0x40000000 for positive
values and -0x40000000 for negative values.


/* Transfer only when above critical level. */
while (atomic_cmpset_rel_int(p, v, 0) == 0)
v = *p; /* still don't care if it is stale */
counters[cnum] += v;


Even though full counter values are not expected to become negative,
the native counters can easily become negative when a decrement occurs
after the transfer resets them to 0.


}
r += v;
}
mtx_unlock(&counter_locks[cnum]);
return (r);

Mutexes give some slowness in the fetching code, but fetches are expected
to be relatively rare.

I added the complication to usually avoid atomic ops at the last minute.
The complication might not be worth it.


The complication is to test v so as to normally avoid doing the transfer
and its atomic ops (and to not use atomic ops for loading v).  The
complication is larger with 2 thresholds.  If we always transferred,
then *p would usually be small and often 0, so that decrementing it
would often make it -1.  This -1 must be transferred by adding -1, not
by adding 0xffffffff.  Changing the type of the native counter to int
gives this.
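
A small illustration of why the native counter must be signed once
decrements are allowed (hypothetical userland arithmetic only):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t total_u = 10, total_s = 10;	/* already-accumulated totals */
	uint32_t native_u = 0;			/* freshly zeroed by a transfer */
	int32_t native_s = 0;

	/* A decrement (add -1) lands on the freshly zeroed native counter. */
	native_u += (uint32_t)-1;		/* wraps to 0xffffffff */
	native_s += -1;				/* stays -1 */

	/* Transfer into the 64-bit totals, as the fetch/daemon code would. */
	total_u += native_u;			/* adds 4294967295: wrong */
	total_s += (int64_t)native_s;		/* adds -1: right */

	printf("with u_int native counter: total = %ju\n", (uintmax_t)total_u);
	printf("with int native counter:   total = %ju\n", (uintmax_t)total_s);
	return (0);
}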

Bruce


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-24 Thread Bruce Evans

On Mon, 24 Jun 2013, Konstantin Belousov wrote:


On Sun, Jun 23, 2013 at 07:57:57PM +1000, Bruce Evans wrote:

The case that can't be fixed by rereading the counters is when fetching
code runs in between the stores.  If the stores are on another CPU
that is currently executing them, then we can keep checking that the
counters don't change for the maximum time that any pair of stores
take to complete.  This time is quite long and difficult to determine
(but it can be bounded by hard-disabling interrupts in the storer, and
limited by prefetching).  If the stores are on the same CPU, then we
have no good way to detect that another thread is in the middle of
them, or to switch back to that thread.  So we currently prevent this
case using slow methods.


We are already explicit about the fetched value potentially not
reflecting any real moment value, but indeed, having the fetch result
off by 2^32 is not acceptable.

We need atomic 64bit loads to fix this, which is provided by the same
cmpxchg8b instruction on the 8-byte aligned elements.  Intel guarantees
that 8-byte aligned loads and stores are atomic.

The following is the prototype for the x86. The other 64bit
architectures are handled exactly like amd64. For 32bit !x86 arches,
i.e. arm, mips32 and powerpc32, some more investigation is needed.


That works of course, but is too complicated, machine-dependent and slow
for me.
   (I just checked all the pcpu.h implementations and found that
all of them except amd64 and i386 still use the quick and racy
PCPU_PTR(member) += (value) for PCPU_ADD().  This has 1 sure
race and 1 compiler/machine-dependent race:
- PCPU_PTR() is only usable if preemption (resulting in rescheduling
  to run on a different CPU) is prevented in the caller.  Otherwise,
   the pointer may become invalid immediately after it is initialized.
- the compiler may generate a racy load-add-store operation for +=.
  Using PCPU_PTR() makes this a smaller problem than using a thread-
  local pointer.  It prevents the store going to a different CPU
  than the load.
PCPU_GET() and PCPU_SET() have the first of these problems on all
arches except amd64 and i386.  Even curthread wouldn't work in the
SMP case if it has the default implementation of PCPU_GET(curthread).
However, it has a non-default implementation on all arches except
arm and mips.  So even curthread seems to be broken for arm and mips.)
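
The two races just listed, spelled out on a hypothetical flat layout
(curcpu_hint stands in for however the arch finds its pcpu area):

#include <stdint.h>

#define	NCPU	2

static uint64_t pcpu_counter[NCPU];
static int curcpu_hint;		/* stand-in for "which CPU am I on" */

/*
 * What the racy PCPU_ADD() boils down to: take a pointer into the current
 * CPU's area, then do a compiler-generated load-add-store through it.
 */
void
racy_pcpu_add(uint64_t value)
{
	uint64_t *p, tmp;

	p = &pcpu_counter[curcpu_hint];
	/*
	 * Race 1: preemption and migration here leave p pointing at the old
	 * CPU's slot, so two CPUs can end up storing to the same counter.
	 */
	tmp = *p;
	/*
	 * Race 2: if the slot's owner adds to it between this load and the
	 * store below, that addition is overwritten and lost.
	 */
	*p = tmp + value;
}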

   The non-optimized  avoids these problems using a
   critical section.  I'm not sure that works.  Anyway, counter.h shouldn't
   have to understand pcpu better than pcpu.h does in order to work.

   The arm implementation of atomic.h is also relevant and is especially
   interesting.  I looked at it after mdf@ pointed out ARM_RAS_START.
   This seems to only be used by arm in userland.  It is even larger and
   slower and more complicated than critical sections, so it doesn't seem
   to be useful for counters.  In the kernel, arm has several
   implementations depending on machine capabilities.  The ifdefs are
   hard to understand.  On some machine, it seems to use the equivalent
   of a cmpxchg loop.  On others, it just disables interrupts.  Disabling
   interrupts is presumably better than a critical section, and userland
   has to use ARM_RAS_START for some arches because neither disabling
    interrupts nor critical sections are available in userland.  None
   of this care helps avoid the races in pcpu.h, since pcpu.h intentionally
   avoids using atomic ops.  None of this care helps avoid the races in
   counter.h or make counter.h efficient, since counter.h uses its own
   methods.  To reach the same quality as atomic.h, counter.h would have
   to copy half of the ifdefs and methods in atomic.h.

The slowness that I care about is only in the counter update.  Anything
complicated there will be slow.

My current best design:
- use ordinary mutexes to protect counter fetches in non-per-CPU contexts.
- use native-sized or always 32-bit counters.  Counter updates are done
  by a single addl on i386.  Fix pcpu.h on arches other than amd64 and
  i386 and use the same method as there.
- counter fetches add the native-sized pcpu counters to 64-bit non-pcpu
  counters, when the native-size counters are in danger of overflowing
  or always, under the mutex.  Transferring data uses an ordinary
  atomic_cmpset.  To avoid ifdefs, always use u_int pcpu counters.
  The 64-bit non-pcpu counters can easily be changed to pcpu counters
  if the API is extended to support pcpu data.
- run a daemon every few minutes to fetch all the counters, so that
  the native-sized counters are in no danger of overflowing on systems
  that don't run statistics programs often enough to fetch the counters
  to actually use.


...
diff --git a/sys/kern/subr_counter.c b/sys/kern/subr_counter.c
index a98ed40..3222881 100644
--- a/sys/kern/subr_counter.c
+++ b/sys/kern/subr_counter.c
...
@@ -49,14 +51,8 @@ counter_u64_zero(

Re: svn commit: r252032 - head/sys/amd64/include

2013-06-24 Thread Konstantin Belousov
On Sun, Jun 23, 2013 at 07:57:57PM +1000, Bruce Evans wrote:
> The case that can't be fixed by rereading the counters is when fetching
> code runs in between the stores.  If the stores are on another CPU
> that is currently executing them, then we can keep checking that the
> counters don't change for the maximum time that any pair of stores
> take to complete.  This time is quite long and difficult to determine
> (but it can be bounded by hard-disabling interrupts in the storer, and
> limited by prefetching).  If the stores are on the same CPU, then we
> have no good way to detect that another thread is in the middle of
> them, or to switch back to that thread.  So we currently prevent this
> case using slow methods.

We are already explicit about the fetched value potentially not
reflecting any real moment value, but indeed, having the fetch result
off by 2^32 is not acceptable.

We need atomic 64bit loads to fix this, which is provided by the same
cmpxchg8b instruction on the 8-byte aligned elements.  Intel guarantees
that 8-byte aligned loads and stores are atomic.

The following is the prototype for the x86. The other 64bit
architectures are handled exactly like amd64. For 32bit !x86 arches,
i.e. arm, mips32 and powerpc32, some more investigation is needed.

diff --git a/sys/amd64/include/counter.h b/sys/amd64/include/counter.h
index b37a4b8..ba302a3 100644
--- a/sys/amd64/include/counter.h
+++ b/sys/amd64/include/counter.h
@@ -36,6 +36,28 @@ extern struct pcpu __pcpu[1];
 #define counter_enter() do {} while (0)
 #define counter_exit()  do {} while (0)
 
+#ifdef IN_SUBR_COUNTER_C
+static inline uint64_t
+counter_u64_fetch_inline(uint64_t *p)
+{
+   uint64_t r;
+   int i;
+
+   r = 0;
+   for (i = 0; i < mp_ncpus; i++)
+   r += counter_u64_read_one(p, i);
+
+   return (r);
+}
+#endif
+
+static inline uint64_t
+counter_u64_read_one(uint64_t *p, int cpu)
+{
+
+   return (*(uint64_t *)((char *)p + sizeof(struct pcpu) * cpu));
+}
+
 #define counter_u64_add_protected(c, i) counter_u64_add(c, i)
 
 static inline void
diff --git a/sys/i386/include/counter.h b/sys/i386/include/counter.h
index 3e93b36..94dbee3 100644
--- a/sys/i386/include/counter.h
+++ b/sys/i386/include/counter.h
@@ -67,6 +67,50 @@ counter_64_inc_8b(uint64_t *p, int64_t inc)
: "memory", "cc", "eax", "edx", "ebx", "ecx");
 }
 
+static inline uint64_t
+counter_u64_read_one_8b(uint64_t *p, int cpu)
+{
+   uint32_t res_lo, res_high;
+
+   __asm __volatile(
+   "movl   %%eax,%%ebx\n\t"
+   "movl   %%edx,%%ecx\n\t"
+   "cmpxchg8b  (%%esi)"
+   : "=a" (res_lo), "=d"(res_high)
+   : "S" ((char *)p + sizeof(struct pcpu) * cpu)
+   : "cc", "ebx", "ecx");
+   return (res_lo + ((uint64_t)res_high << 32));
+}
+
+#ifdef IN_SUBR_COUNTER_C
+static inline uint64_t
+counter_u64_fetch_inline(uint64_t *p)
+{
+   uint64_t res;
+   int i;
+
+   res = 0;
+   if ((cpu_feature & CPUID_CX8) == 0) {
+   /*
+* The machines without cmpxchg8b are not SMP.
+* Disabling the preemption provides atomicity of the
+* counter reading, since update is done in the
+* critical section as well.
+*/
+   critical_enter();
+   for (i = 0; i < mp_ncpus; i++) {
+   res += *(uint64_t *)((char *)p +
+   sizeof(struct pcpu) * i);
+   }
+   critical_exit();
+   } else {
+   for (i = 0; i < mp_ncpus; i++)
+   res += counter_u64_read_one_8b(p, i);
+   }
+   return (res);
+}
+#endif
+
 #define counter_u64_add_protected(c, inc)   do {\
if ((cpu_feature & CPUID_CX8) == 0) {   \
CRITICAL_ASSERT(curthread); \
diff --git a/sys/kern/subr_counter.c b/sys/kern/subr_counter.c
index a98ed40..3222881 100644
--- a/sys/kern/subr_counter.c
+++ b/sys/kern/subr_counter.c
@@ -29,11 +29,13 @@ __FBSDID("$FreeBSD$");
 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
+
+#define IN_SUBR_COUNTER_C
+#include 
  
 static uma_zone_t uint64_pcpu_zone;
 
@@ -49,14 +51,8 @@ counter_u64_zero(counter_u64_t c)
 uint64_t
 counter_u64_fetch(counter_u64_t c)
 {
-   uint64_t r;
-   int i;
 
-   r = 0;
-   for (i = 0; i < mp_ncpus; i++)
-   r += *(uint64_t *)((char *)c + sizeof(struct pcpu) * i);
-
-   return (r);
+   return (counter_u64_fetch_inline(c));
 }
 
 counter_u64_t




Re: svn commit: r252032 - head/sys/amd64/include

2013-06-24 Thread mdf
[snipping everything about counter64, atomic ops, cycles, etc.]

I wonder if the idea explained in this paper:

http://static.usenix.org/event/usenix03/tech/freenix03/full_papers/mcgarry/mcgarry_html/

Which seems to be used in FreeBSD for some ARM atomics:

http://svnweb.freebsd.org/base/head/sys/arm/include/atomic.h?view=annotate
, look for ARM_RAS_START

would be more efficient.

To summarize: one marks a section of code such that if a thread is
interrupted during the code it restarts at the beginning instead of where
it was interrupted.  This has been used to implement atomic increment on
some hardware without the necessary instructions.  Here it could be used to
implement atomic increment on the per-cpu counter without the overhead of
an atomic instruction.

It's multiple stores to mark the section of code doing the increment, but
they're all per-cpu or per thread.  That may be cheaper than an atomic
increment, at least on 32-bit platforms that are doing an atomic 64-bit
increment.

I haven't benchmarked this (ENOTIME, plus I'm on vacation right now), but
using restartable sections requires three stores (add an item to a linked
list, 64-bit increment for the counter, remove an item from a linked list).
 Some of the penalty is paid in the context switch code, which has to
check if the instruction pointer is in one of these critical sections.  I
haven't checked to see if this code is enabled on all architectures or just
ARM.  But if context switches are less frequent than counter increments in
the networking code, it's still a win for most uses.
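
A structural sketch of the idea (the names are made up, and the restart
itself is only described in comments: the real work is in the context-switch
path, which must notice an interrupted PC inside the marked region and move
it back to the start):

#include <stdint.h>
#include <stddef.h>

static __thread const void *ras_start;	/* region registration stores */
static __thread const void *ras_end;
static __thread uint64_t pcpu_counter;	/* stand-in for the per-CPU slot */

void
ras_counter_add(uint64_t inc)
{

	/* Mark the restartable region (GNU C label-address extension). */
	ras_start = &&seq_begin;
	ras_end = &&seq_end;
seq_begin:
	/*
	 * The sequence proper: a plain, non-atomic 64-bit increment (two
	 * 32-bit halves on i386).  If the thread were preempted in here,
	 * the switch code would roll its PC back to seq_begin before
	 * resuming it, so a half-done update is never left behind.
	 */
	pcpu_counter += inc;
seq_end:
	ras_start = NULL;		/* the final store: leave the region */
}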

Thanks,
matthew


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-24 Thread Bruce Evans

On Mon, 24 Jun 2013, Gleb Smirnoff wrote:


On Sun, Jun 23, 2013 at 10:33:43AM +0300, Konstantin Belousov wrote:
K> On Sat, Jun 22, 2013 at 06:58:15PM +1000, Bruce Evans wrote:
K> > So the i386 version be simply "addl; adcl" to memory.  Each store in
K> > this is atomic at the per-CPU level.  If there is no carry, then the
K> > separate stores are equivalent to adding separate nonnegative values and
K> > the counter value is valid at all times.  If there is carry, then the
K> > separate stores are equivalent to adding a negative value followed by
K> > a larger positive value.  The counter transiently goes backwards, but
K> > no information is lost and the counter is eventually set to the correct
K> > forward-going value.
K>
K> This is quite an interesting idea, but I still have not decided if it
K> is acceptable.  The issue is that we could add the carry to the other
K> processor counter, if the preemption kicks in at the right time between
K> two instructions.  I did not find any argument why it would be
K> wrong; the races for the fetch operation seem to be the same with either
K> local update method.

This would be wrong since the update isn't locked. Thus, if we are moved to
another CPU between the two instructions, and the second instruction updates
another CPU's counter simultaneously with the original CPU we were on,
then we lose an update.


Hmm, this is subtle.  The update is never lost, but is applied to a
different counter, non-atomically at a later time.  Non-atomicity
only matters when there is a carry since the counter only goes
transiently backwards in this case.  For example: initial state:

CPU1 counter:  00000000 ffffffff
CPU2 counter:  00000000 fffffffe

Start adding 1 to the first counter, doing it non-atomically by
incrementing the low word first.

CPU1 counter:  00000000 00000000  (carry in CPU1 eflags)
CPU2 counter:  00000000 fffffffe

Get preempted at this point:

CPU1 counter:  00000000 00000000
CPU2 counter:  00000000 fffffffe  (carry in CPU2 eflags)

The fetching code running at this point will see a wrong value, but
has more serious bugs.  Once it is fixed, it can try using a heuristic
to detect this case, or the fix might involve a heuristic that detects
this case too.  I was thinking of keeping a global count of the previous
value for each counter, and adding 0x100000000 when the count goes backwards.
However, in the worst case each component of a counter may be making the
transition from 0xffffffff to 0x100000000, with the carry bit in eflags
for all CPUs.  Detecting this requires watching the low word of all
components of all counters.  Increments must be limited to considerably
less than 32 bits, and negative increments should be disallowed, so
that the low word doesn't wrap faster than it can be watched.

CPU2 runs and completes the increment by adding the carry to the high
word:

CPU1 counter:  00000000 00000000
CPU2 counter:  00000001 fffffffe

The total count becomes correct once the second word is written.
Statistics for each CPU should be correct, but losing this race is
so rare that maybe we especially don't care about this subcase.  There
is currently no API for exporting the counter statistics for each CPU.

Similarly if the preemption is to another thread on the same CPU, or
to another thread that doesn't touch this counter and then back to the
original thread (much later) on the same CPU.  There may be any number
of updates to the low word before the corresponding updates to the high
word, mixed in any order except for the low word being updated before
the high word for each increment.  State for the half-done updates is
saved in carry flags and registers in any number of threads.  If large
increments are allowed, the value may appear to have had many negative
increments.
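
The same scenario as a small simulation (a hypothetical userland rendering of
the per-CPU counters as explicit 32-bit halves, incremented low half first):

#include <stdio.h>
#include <stdint.h>

struct counter_halves {
	uint32_t	lo;
	uint32_t	hi;
};

static uint64_t
total(struct counter_halves *c, int n)
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i < n; i++)
		sum += ((uint64_t)c[i].hi << 32) | c[i].lo;
	return (sum);
}

int
main(void)
{
	struct counter_halves cpu[2] = {
		{ 0xffffffff, 0x00000000 },	/* CPU1: 00000000 ffffffff */
		{ 0xfffffffe, 0x00000000 },	/* CPU2: 00000000 fffffffe */
	};
	int carry;

	printf("before:             %#018llx\n",
	    (unsigned long long)total(cpu, 2));

	/* addl: low word of CPU1's counter; carry left pending in eflags. */
	carry = (cpu[0].lo == 0xffffffff);
	cpu[0].lo += 1;
	printf("after addl:         %#018llx  (transiently backwards)\n",
	    (unsigned long long)total(cpu, 2));

	/* Preempted and migrated; adcl lands on CPU2's counter's high word. */
	cpu[1].hi += carry;
	printf("after adcl on CPU2: %#018llx  (total correct again)\n",
	    (unsigned long long)total(cpu, 2));
	return (0);
}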


Yes, the probability of such an event is extremely low, but it is still possible.



The idea of counter(9) is that although fetching might not be precise,
the internal value of a counter is consistent and doesn't lose even a
single update.  This would later allow using counter(9) as a cheap
reference counter.


The fetching code has essentially the same race (in the 32-bit case) that
is laboriously worked around in the update code, irrespective of how
atomic the update code is, even with 1 CPU:

CPU1 counter:  00000000 ffffffff

Start reading it on CPU1, using 32-bit load of the low word (the compiler
could make this worse or just different by loading the high word first):

	ffffffff  (read this value)

Get preempted; counter update occurs and adds 1 perfectly atomically:

CPU1 counter:  00000001 00000000

Back to fetching code on CPU1; read high word:

	00000001 ffffffff  (read this value; it is garbage)

If the compiler loads the high word first, it will see the value off by
~2**32 in the opposite direction (00000000 00000000).

Like I said before, this can probably be handled by rereading each
pcpu counter until it doesn't change.  Or if large

Re: svn commit: r252032 - head/sys/amd64/include

2013-06-24 Thread Bruce Evans

On Mon, 24 Jun 2013, Gleb Smirnoff wrote:


 did you run your benchmarks in userland or in kernel? How many
parallel threads were updating the same counter?

 Can you please share your benchmarks?


Only userland, with 1 thread.

I don't have any more benchmarks than the test program in previous mail.

I don't see how threads have anything to do with efficiency of counter
incrementation, unless slow locking is used.  With threads for the same
process on the same CPU, the accesses are not really different from
accesses by a single thread.  With threads for different processes on
the same CPU, switching the address space will thrash the cache for
user threads, but the pcpu area in kernel memory shouldn't be switched.
It seems difficult to simulate pcpu in user address space.  With threads
for the same or different processes on different CPUs, there is no
contention for pcpu counters.
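
One way to approximate pcpu counters in such a userland microbenchmark,
since the kernel pcpu area isn't available there, is to give each thread its
own counter (here via __thread TLS, with hypothetical names), so that, as
with real per-CPU counters, the hot path is uncontended:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define	NTHREADS	4
#define	ITERS		10000000

static __thread uint64_t local_counter;
static uint64_t totals[NTHREADS];

static void *
worker(void *arg)
{
	long id = (long)arg;
	uint64_t i;

	for (i = 0; i < ITERS; i++)
		local_counter++;	/* uncontended, like a pcpu counter */
	totals[id] = local_counter;	/* publish once, at the end */
	return (NULL);
}

int
main(void)
{
	pthread_t tid[NTHREADS];
	uint64_t sum = 0;
	long i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (i = 0; i < NTHREADS; i++) {
		pthread_join(tid[i], NULL);
		sum += totals[i];
	}
	printf("total: %llu\n", (unsigned long long)sum);
	return (0);
}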

Please remind me of your old tests that did show some efficiency
differences.  IIRC, direct increment was unexpectedly slower.  Was
that on a 64-bit system?  I guess it wasn't, since the committed version
just uses a direct increment on amd64.  On i386 with cmpxchg8b, using
cmpxchg8b might be more efficient because it is a 64-bit access.  I
don't see how that can be, since 4 32-bit accesses are needed to set
up the cmpxchg8b.  In fact, 1 of these accesses can be extremely slow
since it has a store-to-load penalty on some arches (I have considerable
experience with store-to-load penalties in FP code and large uncommitted
asms in libm to avoid them.).

Here is the access which is likely to have the penalty:

% static inline void
% counter_64_inc_8b(uint64_t *p, int64_t inc)
% {
% 
% 	__asm __volatile(

%   "movl  %%fs:(%%esi),%%eax\n\t"

The previous store was a cmpxchg8b.  Presumably that was 64 bits.  No
problem for this load, since it is at the same address as the store
and its size mismatch doesn't have the penalty on any CPU that I
know of.

%   "movl  %%fs:4(%%esi),%%edx\n"

Store to load mismatch penalty on at least AthlonXP and Athlon64.  The
load is from the middle of a 64-bit store, and at least these CPUs
don't have hardware to forward it from the write buffer.  Costs 10-20
cycles.  Phenom is documented to have extra hardware to make this case
as fast as the previous case.  I haven't tested Phenom.  According
to FP benchmarks, store-to-load penalties are large on core2 and corei7
too.

% "1:\n\t"
%   "movl  %%eax,%%ebx\n\t"
%   "movl  %%edx,%%ecx\n\t"
%   "addl  (%%edi),%%ebx\n\t"
%   "adcl  4(%%edi),%%ecx\n\t"

These extra memory accesses are unfortunately necessary because there
aren't enough registers and the asm is a bit too simple (e.g., to add
1, more complicated asm could just add $1 with carry here, but the
current asm has to store 1 to a 64-bit temporary memory variable so
that it can be loaded here).  These are all 32-bit accesses so they
don't have penalties.  There is just a lot of memory traffic for them.

%   "cmpxchg8b %%fs:(%%esi)\n\t"

This presumably does a 64-bit load followed by a 64-bit store (when it
succeeds).  The load matches the previous store, so there is no penalty.

%   "jnz   1b"
%   :
%   : "S" ((char *)p - (char *)&__pcpu[0]), "D" (&inc)
%   : "memory", "cc", "eax", "edx", "ebx", "ecx");
% }

The penalty may be unimportant in normal use because loads are normally
separated from stores by long enough to give the write buffers a chance
to flush to the cache.  But loop benchmarks will always see it unless
loop does enough things between the store and the load to give the
large separation.  Note that the penalty affects loads, so its latency
is normally not hidden.

I forgot about this when I ran tests on Athlon64.  Athlon64 was only
about 6 cycles slower than core2, for about 20 cycles per iteration
altogether.  Not much more, but 20 is about the penalty time, so maybe
the loop is ending up testing just the penalty time, with all the other
latencies in parallel with the penalty.  For a quick test of this, I
replaced the load that has the penalty by a load of immediate 0.  This
reduced the time to 14.5 cycles.  So the penalty is at least 5.5 cycles.
(Note that in the benchmark, the counter only goes up to about 2
billion, so the high 32 bits are always 0, so loading immediate
0 gives the same result.)  On core2 (ref10-i386) and corei7 (freefall),
the same change has no effect on the time.  This shows that the penalty
doesn't apply on core2 or corei7, and the FP penalties that I see there
have a different source.  ... Testing shows that they are for loads
of 64-bit values that are mismatched since the value was built up using
2 32-bit stores.  Test program:

% #include <stdint.h>
% 
% uint64_t foo;
% 
% int

% main(void)
% {
%   unsigned i;
% 
% 	for (i = 0; i < 2666813872; i++)	/* sysctl -n machdep.tsc_freq */

%   asm volatile(
% #ifdef NO_PENALTY
%   "movq %%rax,foo; movq foo,%%rax"
%  
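
The archive truncates the program here.  Filled out, the two variants
presumably look something like the following (64-bit build; the else
branch, the -DNO_PENALTY switch and the absolute references to foo are
reconstructions for illustration, not the original):

#include <stdint.h>
#include <stdio.h>

uint64_t foo;

int
main(void)
{
	unsigned i;

	for (i = 0; i < 2666813872u; i++)	/* sysctl -n machdep.tsc_freq */
		asm volatile(
#ifdef NO_PENALTY
		    /* 64-bit store then 64-bit load: forwarded, no penalty. */
		    "movq %%rax,foo; movq foo,%%rax"
#else
		    /* Two 32-bit stores, then a mismatched 64-bit load. */
		    "movl %%eax,foo; movl %%edx,foo+4; movq foo,%%rax"
#endif
		    : : : "rax", "rdx", "memory");
	printf("%ju\n", (uintmax_t)foo);
}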

Re: svn commit: r252032 - head/sys/amd64/include

2013-06-24 Thread Gleb Smirnoff
  Bruce,

  did you run your benchmarks in userland or in kernel? How many
parallel threads were updating the same counter?

  Can you please share your benchmarks?

-- 
Totus tuus, Glebius.


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-24 Thread Gleb Smirnoff
On Sun, Jun 23, 2013 at 10:33:43AM +0300, Konstantin Belousov wrote:
K> On Sat, Jun 22, 2013 at 06:58:15PM +1000, Bruce Evans wrote:
K> > So the i386 version be simply "addl; adcl" to memory.  Each store in
K> > this is atomic at the per-CPU level.  If there is no carry, then the
K> > separate stores are equivalent to adding separate nonnegative values and
K> > the counter value is valid at all times.  If there is carry, then the
K> > separate stores are equivalent to adding a negative value followed by
K> > a larger positive value.  The counter transiently goes backwards, but
K> > no information is lost and the counter is eventually set to the correct
K> > forward-going value.
K> 
K> This is quite an interesting idea, but I still have not decided whether
K> it is acceptable.  The issue is that we could add the carry to the other
K> processor's counter, if the preemption kicks in at the right time between
K> the two instructions.  I have not found any argument why it would be
K> wrong; the races for the fetch operation seem to be the same with either
K> local update method.

This would be wrong since the update isn't locked.  If we are moved to
another CPU between the two instructions, and the second instruction then
updates that other CPU's counter at the same moment as the CPU we came
from updates it, we lose an update.

Yes, the probability of such an event is extremely low, but it is still
possible.

The idea of counter(9) is that although fetching might not be precise,
the internal value of a counter is consistent and doesn't lose even a
single update.  This would later allow counter(9) to be used as a cheap
reference counter.

-- 
Totus tuus, Glebius.


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-23 Thread Bruce Evans

On Sun, 23 Jun 2013, I wrote:


I thought of lots of variations, but couldn't find one that works perfectly.
One idea (that goes with the sign check on the low 32 bits) is to use a
misaligned add to memory to copy the 31st bit as a carry bit to the
high word.  The value of the counter is supposed to be

  [unsigned value of low 32 bits] + [unsigned value of next 24 bits] << 31
  (high 8 bits not used)

at all times, with the 31st bit zero at most times so that the carry
operation is rarely executed.  This format is only slightly confusing for
debugging (it basically has 2 31st bits, with the one in the low 32 bits
rarely used).  This format can be updated using something like:

The above doesn't work if it is preempted -- then it might do the
carry operation more than once.  But it is easy to lock using cmpxchg or
disabling interrupts.  Disabling interrupts requires only small code:
...
Another idea is to use the high bits of the high word to encode state.
They can be set atomically enough using addl/andl/orl/xorl since they
are per-CPU.  I couldn't quite get this to work.  You could increment


Here is a combined version.  It uses the old shift trick instead of
disabling interrupts for locking.  The shift trick is not often used
since it doesn't work for SMP.  It works here since the counters are
per-CPU:

	addl	%1,%%fs:(%0)		# only small 32-bit increments supported
	jns	8f
	sar	$1,%%fs:7(%0)		# hi 8 bits are for lock; init to 0xfe
	jc	8f			# already locked
	addl	$0x80,%%fs:3(%0)	# misaligned memory access
	movb	$0xfe,%%fs:7(%0)	# unlock
8: ;

Since the locked case is rarely executed and the code for it is only
inline for convenience, the shortest code should be preferred and that
is the simpler version that disables interrupts.  Unless the lock byte
can be exploited in the fetching and zeroing code.  I don't see how
it can be, since this type of locking only works for a single CPU.

Bruce


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-23 Thread Bruce Evans

On Sun, 23 Jun 2013, Konstantin Belousov wrote:


On Sat, Jun 22, 2013 at 06:58:15PM +1000, Bruce Evans wrote:

So the i386 version can be simply "addl; adcl" to memory.  Each store in
this is atomic at the per-CPU level.  If there is no carry, then the
separate stores are equivalent to adding separate nonnegative values and
the counter value is valid at all times.  If there is carry, then the
separate stores are equivalent to adding a negative value followed by
a larger positive value.  The counter transiently goes backwards, but
no information is lost and the counter is eventually set to the correct
forward-going value.


This is quite an interesting idea, but I still have not decided whether
it is acceptable.  The issue is that we could add the carry to the other
processor's counter, if the preemption kicks in at the right time between
the two instructions.  I have not found any argument why it would be
wrong; the races for the fetch operation seem to be the same with either
local update method.


I thought about exploiting not-so-strong memory ordering to fix the fetches.
On at least x86, stores are not reordered, so if the storer does a simple
"addl, adcl", then the order is not much different from

load low 32 bits
store low 32 bits
load high 32 bits
store high 32 bits

The loads can be in any order, but must be before the stores to the
same location.  I think the fetching code can depend on this (maybe
only on x86, but the fetching code needs to be MD anyway).  Thus it
can detect most races by simply rereading the counters in almost any
order until they don't change.  Without this, the fetching code can
easily read high and low bits from unrelated stores (when it is
preempted, the race window is large), and the sophisticated cmpxchg8b
code to make each store atomic is almost useless.
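
A minimal sketch of that reread, for a single pcpu slot (illustrative
only; real fetching code would sit in the MD part of subr_counter.c and
walk the per-CPU copies as the existing fetch loop does):

/*
 * On i386 a 64-bit read is really two 32-bit loads, so it can be torn
 * by a concurrent addl/adcl.  Reread until two successive reads agree;
 * a torn read is very unlikely to repeat with the same value.
 */
static uint64_t
fetch_one_stable(volatile uint64_t *p)
{
	uint64_t v, w;

	v = *p;
	for (;;) {
		w = *p;
		if (w == v)
			return (v);
		v = w;
	}
}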

The case that can't be fixed by rereading the counters is when fetching
code runs in between the stores.  If the stores are on a another CPU
that is currently executing them, then we can keep checking that the
counters don't change for the maximum time that any pair of stores
take to complete.  This time is quite long and difficult to determine
(but it can be bounded by hard-disabling interrupts in the storer, and
limited by prefetching).  If the stores are on the same CPU, then we
have no good way to detect that another thread is in the middle of
them, or to switch back to that thread.  So we currently prevent this
case using slow methods.


On the other hand, for debugging, the state of the counters array would
be quite confusing to observe.


I thought of lots of variations, but couldn't find one that works perfectly.
One idea (that goes with the sign check on the low 32 bits) is to use a
misaligned add to memory to copy the 31st bit as a carry bit to the
high word.  The value of the counter is supposed to be

   [unsigned value of low 32 bits] + [unsigned value of next 24 bits] << 31
   (high 8 bits not used)

at all times, with the 31st bit zero at most times so that the carry
operation is rarely executed.  This format is only slightly confusing for
debugging (it basically has 2 31st bits, with the one in the low 32 bits
rarely used).  This format can be updated using something like:

	addl	%1,%%fs:(%0)		# only small 32-bit increments supported
	jns	8f
	addl	$0x80,%%fs:3(%0)	# misaligned memory access
8: ;

No problem if this is not preempted.  The idea of the misaligned memory
access is that the CPU needs to make this atomic enough even though it
crosses a 32-bit boundary.  BTW, is anything guaranteed if the access
also crosses a cache line or page boundary?  The OS would be involved
for page boundaries if there is a page fault.  Better not step on bugs
in that.  What about for misaligned 64-bit or 128-bit stores done by
cmpxchg8b or SSE?  Counters are 64-bit aligned, so I think that at least
CPUs that are basically 64-bit must handle this like I want.

The above doesn't work if it is preempted -- then it might do the
carry operation more than once.  But it is easy to lock using cmpxchg or
disabling interrupts.  Disabling interrupts requires only small code:

	addl	%1,%%fs:(%0)		# only small 32-bit increments supported
	jns	8f
	pushfl
	cli
	testb	$0x80,%%fs:3(%0)
	je	1f			# carry already done by preemption
	addl	$0x80,%%fs:3(%0)	# misaligned memory access
1:
	popfl
8: ;

Another idea is to use the high bits of the high word to encode state.
They can be set atomically enough using addl/andl/orl/xorl since they
are per-CPU.  I couldn't quite get this to work.  You could increment
them to indicate an update in progress.  This only requires 1 extra
add to memory if the adcl is always done (e.g., first add 0x01
to the high word; then addl to the low word; then adcl of (inc >> 32)
- 0x0100 to the high word.  Increments must be limited to no

Re: svn commit: r252032 - head/sys/amd64/include

2013-06-23 Thread Bruce Evans

On Sun, 23 Jun 2013, Konstantin Belousov wrote:


On Sat, Jun 22, 2013 at 01:37:58PM +1000, Bruce Evans wrote:

On Sat, 22 Jun 2013, I wrote:


...
Here are considerably expanded tests, with noninline tests dropped.
Summary of times on Athlon64:

simple increment:                               4-7 cycles (1)
simple increment preceded by feature test:      5-8 cycles (1)
simple 32-bit increment:                        4-7 cycles (2)
correct 32-bit increment (addl to mem):         5.5-7 cycles (3)
inlined critical section:                       8.5 cycles (4)
better inlined critical section:                7 cycles (5)
correct unsigned 32-bit inc of 64-bit counter:  4-7 cycles (6)
"improve" previous to allow immediate operand:  5+ cycles
correct signed 32-bit inc of 64-bit counter:    8.5-9 cycles (7)
correct 64-bit inc of 64-bit counter:           8-9 cycles (8)
-current method (cmpxchg8b):                    18 cycles


corei7 (freefall) has about the same timing as Athlon64, but core2
(ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg.

You only tested 32 bit, right?  Note that core2-class machines have
at least one cycle penalty for decoding any instruction with REX prefix.


Yes, since the 64-bit case works more or less correctly.  I tested the
32-bit binary on 64-bit systems.


(4) The critical section method is quite fast when inlined.
(5) The critical section method is even faster when optimized.  This is
   what should be used if you don't want the complications for the
   daemon.


Oops, I forgot that critical sections are much slower in -current than
in my version.  They probably take 20-40 cycles for the best case, and
can't easily be tested in userland since they disable interrupts in
hardware.  My versions disable interrupts in software.

The critical sections do not disable the interrupts.  Only the thread
local counter is incremented.  Leaving the section could be complicated
though.


Yes, as I noticed later, it was only an old version of FreeBSD that disabled
interrupts.

The critical section method (or disabling interrupts, which is probably
faster) only works for the non-SMP case, but old CPUs that don't have
cmpxchg8b probably don't support SMP.


Further tests confirm that incl and incq are pipelined normally on at
least corei7 and core2.  In the loop test, freefall can do 4 independent
addq's to memory faster than it can do 1 :-).  It can do 6 independent
addq's to memory in the same time that it can do 1.  After that, the
loop overhead prevents getting the complete bandwidth of the memory
system.  However, 6 addq's to the same memory location take a little
more than 6 times longer than 1.  Multiple increments of the same counter
one after the other are probably rare, but the counter API makes it harder
to coalesce them if they occur, and the implementation using independent
asms ensures that the compiler cannot coalesce them.


I think that naive looping on Core i7+ to measure the latencies of the
instructions simply does not work.  The Nehalem hardware added a loop
detector for the instruction queue inside the decoder, which essentially
makes short loops execute from the decoded micro-op form.  We never use
counters in tight loops.


The test results show it working perfectly in the test environment.
Any loop detection just does a better job of running all the loop
instructions in parallel with the instructions being timed.  But even
old CPUs can run all the loop instructions in parallel with high-latency
instructions like add-to-memory.  Also, add-to-memory instructions have
strict ordering requirements on x86.  The order must be load-modify-store.
That's for storing to a single location.  This prevents the loop running
in parallel with itself, so its latency timing works especially simply.

However, the test environment isn't normal.  It is important to understand
that some of the time is latency that doesn't matter in the normal
environment.  But in the complicated versions with cmpxchg8b's or critical
sections, there are lots of extra instructions and ordering requirements
whose latencies can't be hidden.

Bruce


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-23 Thread Konstantin Belousov
On Sat, Jun 22, 2013 at 06:58:15PM +1000, Bruce Evans wrote:
> So the i386 version can be simply "addl; adcl" to memory.  Each store in
> this is atomic at the per-CPU level.  If there is no carry, then the
> separate stores are equivalent to adding separate nonnegative values and
> the counter value is valid at all times.  If there is carry, then the
> separate stores are equivalent to adding a negative value followed by
> a larger positive value.  The counter transiently goes backwards, but
> no information is lost and the counter is eventually set to the correct
> forward-going value.

This is quite an interesting idea, but I still have not decided whether
it is acceptable.  The issue is that we could add the carry to the other
processor's counter, if the preemption kicks in at the right time between
the two instructions.  I have not found any argument why it would be
wrong; the races for the fetch operation seem to be the same with either
local update method.

On the other hand, for debugging, the state of the counters array would
be quite confusing to observe.




Re: svn commit: r252032 - head/sys/amd64/include

2013-06-23 Thread Konstantin Belousov
On Sat, Jun 22, 2013 at 01:37:58PM +1000, Bruce Evans wrote:
> On Sat, 22 Jun 2013, I wrote:
> 
> > ...
> > Here are considerably expanded tests, with noninline tests dropped.
> > Summary of times on Athlon64:
> >
> > simple increment:                               4-7 cycles (1)
> > simple increment preceded by feature test:      5-8 cycles (1)
> > simple 32-bit increment:                        4-7 cycles (2)
> > correct 32-bit increment (addl to mem):         5.5-7 cycles (3)
> > inlined critical section:                       8.5 cycles (4)
> > better inlined critical section:                7 cycles (5)
> > correct unsigned 32-bit inc of 64-bit counter:  4-7 cycles (6)
> > "improve" previous to allow immediate operand:  5+ cycles
> > correct signed 32-bit inc of 64-bit counter:    8.5-9 cycles (7)
> > correct 64-bit inc of 64-bit counter:           8-9 cycles (8)
> > -current method (cmpxchg8b):                    18 cycles
> 
> corei7 (freefall) has about the same timing as Athlon64, but core2
> (ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg.
You only tested 32 bit, right?  Note that core2-class machines have
at least one cycle penalty for decoding any instruction with REX prefix.

> 
> > (4) The critical section method is quite fast when inlined.
> > (5) The critical section method is even faster when optimized.  This is
> >what should be used if you don't want the complications for the
> >daemon.
> 
> Oops, I forgot that critical sections are much slower in -current than
> in my version.  They probably take 20-40 cycles for the best case, and
> can't easily be tested in userland since they disable interrupts in
> hardware.  My versions disable interrupts in software.
The critical sections do not disable the interrupts.  Only the thread
local counter is incremented.  Leaving the section could be complicated
though.

> 
> > ...
> > % % static inline void
> > % alt_counter_u64_add(counter_u64_t c, int64_t inc)
> > % {
> > % #if 1
> > %   /* 8.5 cycles on A64. */
> > %   td->td_critnest++;
> > %   __asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
> > %   td->td_critnest++;
> 
> Oops, one increment should be a decrement.
> 
> > % #elif 1
> > %   /* 7 cycles on A64. */
> > %   uint32_t ocritnest;
> > % 
> > %   ocritnest = td->td_critnest;
> > %   td->td_critnest = ocritnest + 1;
> > %   __asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
> > %   td->td_critnest = ocritnest;
> > % #elif 0
> 
> Even in my version, I have to check for unmasked interrupts when td_critnest
> is reduced to 0.  At least the test for being reduced to 0 can be very fast,
> since the reduced value is loaded early and can be tested early.
> 
> Further tests confirm that incl and incq are pipelined normally on at
> least corei7 and core2.  In the loop test, freefall can do 4 independent
> addq's to memory faster than it can do 1 :-).  It can do 6 independent
> addq's to memory in the same time that it can do 1.  After that, the
> loop overhead prevents getting the complete bandwidth of the memory
> system.  However, 6 addq's to the same memory location take a little
> more than 6 times longer than 1.  Multiple increments of the same counter
> one after the other are probably rare, but the counter API makes it harder
> to coalesce them if they occur, and the implementation using independent
> asms ensures that the compiler cannot coalesce them.

I think that naive looping on Core i7+ to measure the latencies of the
instructions simply does not work.  The Nehalem hardware added a loop
detector for the instruction queue inside the decoder, which essentially
makes short loops execute from the decoded micro-op form.  We never use
counters in tight loops.




Re: svn commit: r252032 - head/sys/amd64/include

2013-06-22 Thread Bruce Evans

On Sat, 22 Jun 2013, I wrote:


On Sat, 22 Jun 2013, I wrote:


...
Here are considerably expanded tests, with noninline tests dropped.
Summary of times on Athlon64:

simple increment:                               4-7 cycles (1)
simple increment preceded by feature test:      5-8 cycles (1)
simple 32-bit increment:                        4-7 cycles (2)
correct 32-bit increment (addl to mem):         5.5-7 cycles (3)
inlined critical section:                       8.5 cycles (4)
better inlined critical section:                7 cycles (5)
correct unsigned 32-bit inc of 64-bit counter:  4-7 cycles (6)
"improve" previous to allow immediate operand:  5+ cycles
correct signed 32-bit inc of 64-bit counter:    8.5-9 cycles (7)
correct 64-bit inc of 64-bit counter:           8-9 cycles (8)
-current method (cmpxchg8b):                    18 cycles


corei7 (freefall) has about the same timing as Athlon64, but core2
(ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg.


(4) The critical section method is quite fast when inlined.
(5) The critical section method is even faster when optimized.  This is
   what should be used if you don't want the complications for the
   daemon.


Oops, I forgot that critical sections are much slower in -current than
in my version.  They probably take 20-40 cycles for the best case, and
can't easily be tested in userland since they disable interrupts in
hardware.  My versions disable interrupts in software.


Re-oops, I was looking at the wrong tree.  -current is similar to my version
here.

Anyway, almost all this is moot.  The counter fetch is inherently
non-atomic on i386, so it is pointless to do the counter update
atomically at the per-CPU level.  Losing the race from non-atomic
counter updates is very unlikely since the race window is about 2 nsec
wide and occurs only on approximately every 4 billionth update.  Losing
the race in the fetch is a little more likely.  The latter might be
handled by a heuristic, and anything for this works for the update
too.  It seems to be good enough to detect counters going backwards
and ignore the new value then.  That takes more space for the record
of the previous values than I like.

Non-atomic fetches from subr_counter.c:

% void
% counter_u64_zero(counter_u64_t c)
% {
%   int i;
% 
% 	for (i = 0; i < mp_ncpus; i++)

%   *(uint64_t *)((char *)c + sizeof(struct pcpu) * i) = 0;
% }

Non-atomic stores are more of a problem.  Races in stores can create
permanently invalid counter values.  Perhaps i386 memory ordering is
strong enough to help.  But I think it doesn't help much.  It would take
a locked cmpxchg8b in the i386 update, and maybe here too, to avoid the
races, and the lock would defeat the point of pcpu counters.

% 
% uint64_t

% counter_u64_fetch(counter_u64_t c)
% {
%   uint64_t r;
%   int i;
% 
% 	r = 0;

%   for (i = 0; i < mp_ncpus; i++)
%   r += *(uint64_t *)((char *)c + sizeof(struct pcpu) * i);
% 
% 	return (r);

% }

This code depends on 64-bit loads being atomic.  They just aren't on
32-bit arches.  The stores for the updates may go out in any order,
and might be stale.  The reads here may be in any order, especially
since this is written in C.  We don't care much about stale values, but
want monotonically increasing values.  Negative counter increments
really shouldn't be supported, since they cause complications
everywhere.  Here they break the heuristic that a negative-going fetched
value means a lost race.
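
For concreteness, that heuristic might be sketched as below; the prev
shadow value (one per CPU per counter) is hypothetical and is exactly
the extra space complained about above:

/*
 * Sketch: never let a fetched per-CPU value go backwards.  A smaller
 * value than last time is assumed to be a torn or stale read.
 */
static uint64_t
fetch_one_monotonic(volatile uint64_t *p, uint64_t *prev)
{
	uint64_t v;

	v = *p;
	if (v < *prev)
		v = *prev;	/* ignore the backwards-going value */
	else
		*prev = v;
	return (v);
}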

So the i386 version can be simply "addl; adcl" to memory.  Each store in
this is atomic at the per-CPU level.  If there is no carry, then the
separate stores are equivalent to adding separate nonnegative values and
the counter value is valid at all times.  If there is carry, then the
separate stores are equivalent to adding a negative value followed by
a larger positive value.  The counter transiently goes backwards, but
no information is lost and the counter is eventually set to the correct
forward-going value.
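
In the style of the existing counter_64_inc_8b(), the unlocked i386
update might look something like the following sketch (it borrows the
per-CPU offset convention used elsewhere in counter.h and is not the
committed implementation):

static inline void
counter_64_inc_unlocked(uint64_t *p, int64_t inc)
{

	/*
	 * addl/adcl straight to the per-CPU slot.  Each 32-bit store is
	 * individually atomic; between the two stores the counter may
	 * transiently appear to go backwards, but no update is lost.
	 */
	__asm __volatile(
	    "addl  %1,%%fs:(%0)\n\t"
	    "adcl  %2,%%fs:4(%0)"
	    :
	    : "r" ((char *)p - (char *)&__pcpu[0]),
	      "r" ((uint32_t)inc), "r" ((uint32_t)(inc >> 32))
	    : "memory", "cc");
}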

Bruce


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-21 Thread Bruce Evans

On Sat, 22 Jun 2013, I wrote:


...
Here are considerably expanded tests, with noninline tests dropped.
Summary of times on Athlon64:

simple increment:                               4-7 cycles (1)
simple increment preceded by feature test:      5-8 cycles (1)
simple 32-bit increment:                        4-7 cycles (2)
correct 32-bit increment (addl to mem):         5.5-7 cycles (3)
inlined critical section:                       8.5 cycles (4)
better inlined critical section:                7 cycles (5)
correct unsigned 32-bit inc of 64-bit counter:  4-7 cycles (6)
"improve" previous to allow immediate operand:  5+ cycles
correct signed 32-bit inc of 64-bit counter:    8.5-9 cycles (7)
correct 64-bit inc of 64-bit counter:           8-9 cycles (8)
-current method (cmpxchg8b):                    18 cycles


corei7 (freefall) has about the same timing as Athlon64, but core2
(ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg.


(4) The critical section method is quite fast when inlined.
(5) The critical section method is even faster when optimized.  This is
   what should be used if you don't want the complications for the
   daemon.


Oops, I forgot that critical sections are much slower in -current than
in my version.  They probably take 20-40 cycles for the best case, and
can't easily be tested in userland since they disable interrupts in
hardware.  My versions disable interrupts in software.


...
% % static inline void
% alt_counter_u64_add(counter_u64_t c, int64_t inc)
% {
% #if 1
%   /* 8.5 cycles on A64. */
%   td->td_critnest++;
%   __asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
%   td->td_critnest++;


Oops, one increment should be a decrement.


% #elif 1
%   /* 7 cycles on A64. */
%   uint32_t ocritnest;
% 
% 	ocritnest = td->td_critnest;

%   td->td_critnest = ocritnest + 1;
%   __asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
%   td->td_critnest = ocritnest;
% #elif 0


Even in my version, I have to check for unmasked interrupts when td_critnest
is reduced to 0.  At least the test for being reduced to 0 can be very fast,
since the reduced value is loaded early and can be tested early.

Further tests confirm that incl and incq are pipelined normally on at
least corei7 and core2.  In the loop test, freefall can do 4 independent
addq's to memory faster than it can do 1 :-).  It can do 6 independent
addq's to memory in the same time that it can do 1.  After that, the
loop overhead prevents getting the complete bandwidth of the memory
system.  However, 6 addq's to the same memory location take a little
more than 6 times longer than 1.  Multiple increments of the same counter
one after the other are probably rare, but the counter API makes it harder
to coalesce them if they occur, and the implementation using independent
asms ensures that the compiler cannot coalesce them.

Bruce


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-21 Thread Bruce Evans

On Fri, 21 Jun 2013, Gleb Smirnoff wrote:


On Fri, Jun 21, 2013 at 09:02:36PM +1000, Bruce Evans wrote:
B> Not if it is a 32-bit increment on 32-bit systems, as it should be.
B>
B> I said to use a daemon to convert small (16 or 32 bit) counters into
B> larger (32 or 64 bit) ones.  It is almost as efficient to call the
B> accumulation code directly:


Please trim quotes more.


[... lots of technical details]



Maybe you are right and an implementation with a "daemon" would be
more efficient for i386.  But a daemon isn't needed at all on 64-bit
platforms, so I doubt it is worth coding it.


Of course it would be more efficient on i386.  Probably it would be
more efficient on amd64 too.  The main counter array would be half
the size, so it takes less cache, fewer write buffers, and smaller instructions.


Frankly speaking, we already do not consider the i386 as a platform to
build high performance solutions. The need for counter(9) in TCP/IP
stack was demonstrated on modern amd64 hardware. Thus, our motivation
was to make:


I need to complain about lost performance in i386 more then :-).  i386
is faster for anything that doesn't need large memory.


 - a lossless and faster implementation for amd64
 - a lossless implementation for !amd64
 - if we have an option to avoid critical section on some arch, good
   let's avoid it.


I already gave an implementation to avoid the critical section.  But using
the critical section seems to be the second best method.  The best method
is using 32-bit counters and a daemon.

Here are considerably expanded tests, with noninline tests dropped.
Summary of times on Athlon64:

simple increment:                               4-7 cycles (1)
simple increment preceded by feature test:      5-8 cycles (1)
simple 32-bit increment:                        4-7 cycles (2)
correct 32-bit increment (addl to mem):         5.5-7 cycles (3)
inlined critical section:                       8.5 cycles (4)
better inlined critical section:                7 cycles (5)
correct unsigned 32-bit inc of 64-bit counter:  4-7 cycles (6)
"improve" previous to allow immediate operand:  5+ cycles
correct signed 32-bit inc of 64-bit counter:    8.5-9 cycles (7)
correct 64-bit inc of 64-bit counter:           8-9 cycles (8)
-current method (cmpxchg8b):                    18 cycles

(1) Unusable due to races.
(2) It's no faster to do a simple increment of a 32-bit counter than a
64-bit one!  This indicates delicate pipelining/memory access
behaviour.  The Athlon64 timing for addl to memory is 4 cycles
latency.  The tests show this being achieved.  But since it is only
latency and an Athlon64 can do something like 1 64-bit store and
1 64-bit load per cycle (or 2 64-bit loads but not 2 64-bit stores
per cycle?), there is plenty of time to do the adcl to memory and
also the loop overhead in not-quite-the-same 4 cycles, so that the
loop throughput is 1 per 4 cycles.  In practical/non-loop use, you
would care even less about the store latency.  Compilers often mess
this up to get 7 cycles instead of 4.  -O2 generally gives worse
results, by unrolling the loop.  Since the rolled loop can execute
in 4 cycles, unrolling it can only pessimize it.
(3) The single 32-bit addl, combined with a daemon, is what should be used.
But compilers generated worse code for this than the similar but more
branchy code in (6).
(4) The critical section method is quite fast when inlined.
(5) The critical section method is even faster when optimized.  This is
what should be used if you don't want the complications for the
daemon.
(6) Call the daemon directly ("daemon" code is now just the current code
in an extern function.  The extern function is a dummy that is never
called in the tests.  Just to simulate the branch prediction penalties).
Some restrictions on the increment.
(7) Same as (6), with fewer restrictions on the increment.  This uses
cmpxchg just like the -current method uses cmpxchg8b, provided a
32-bit increment is sufficient.  Races are detected as the low
bits changing.  We don't care if the high bits change while we
are changing the low bits.
(8) Same as (7), with no restrictions on the increment.  The full version
somehow turned out to have a better fastest case than (7).  This is
without even an optimization for constant increments that would
eliminate the runtime check in most cases.  The extra code for the
runtime check makes things faster :-).

% #include <stdint.h>
% #include <stdio.h>
% 
% static inline void

% counter_64_inc_8b(volatile uint64_t *p, int64_t inc)
% {
% 
% 	__asm __volatile(

%   "movl  %%ds:(%%esi),%%eax\n\t"
%   "movl  %%ds:4(%%esi),%%edx\n"
% "1:\n\t"
%   "movl  %%eax,%%ebx\n\t"
%   "movl  %%edx,%%ecx\n\t"
%   "addl  (%%edi),%%ebx\n\t"
%   "adcl  4(%%edi),%%ecx\n\t"
%   "cmpxchg8b %%ds:(%%esi)\n\t"
%   "jnz   1b"
%   :
%   : "S" (p), "D" (&inc)

Re: svn commit: r252032 - head/sys/amd64/include

2013-06-21 Thread Gleb Smirnoff
  Bruce,

On Fri, Jun 21, 2013 at 09:02:36PM +1000, Bruce Evans wrote:
B> Not if it is a 32-bit increment on 32-bit systems, as it should be.
B> 
B> I said to use a daemon to convert small (16 or 32 bit) counters into
B> larger (32 or 64 bit) ones.  It is almost as efficient to call the
B> accumulation code directly:
B> 
B> 	addl	%1,%fs:(%0)
B> 	jns	1f
B> 	call	counter_accum
B> 1:
B> 
B> This calls the accumulation code when the counter goes above 0x80000000.
B> The accumulation code then uses cmpxchg8b or a critical section to
B> atomically transfer everything in the small counter to the large
B> counter.  Note that the small counter remains valid at all times unless
B> it is incremented too fast.  Code that reports the counters must also
B> use cmpxchg8b or a critical section to read them atomically before
B> summing them.  The transfer window between when the small counter
B> reaches 0x80000000 and when it overflows is very large provided large
B> increments are never made (2 increments of 0x80000000 would break it
B> immediately, but an average increment of 1 takes about a minute to
B> reach 0x80000000 even if called from a loop doing nothing except the
B> increment).  Only 32-bit increments that are not too large are
B> supported.  Negative increments are handled by a heuristic in the
B> accumulator.  E.g., an increment of -1 just works unless it causes the
B> small counter's value to go from 0 to 0xffffffff.  0xffffffff is
B> detected by the sign test.  It must be interpreted as -1 to be added
B> to the large counter, not 0xffffffff.  The full heuristic would be
B> something like:
B> 
B> 	if (small_counter >= 0xf0000000) {
B> 		/* Subtract from large counter. */
B> 		/* Actually do this atomically, of course. */
B> 		large_counter += (int32_t)small_counter;
B> 		small_counter = 0;
B> 		/*
B> 		 * Add here sanity checks that the large counter never goes
B> 		 * negative.  This can only be afforded because this code
B> 		 * is rarely called (and is not inline).
B> 		 */
B> 	} else if (small_counter < 0x90000000) {
B> 		/* Add to large counter. */
B> 		large_counter += small_counter;
B> 		small_counter = 0;
B> 	} else {
B> 		panic("transfer window too small");
B> 	}
B> 
B> > Entering critical section means modifying curthread, which is again
B> > a %gs based load & store. Exiting critical section has the same cost.
B> > Thus, we assume that doing a direct %gs based update on the counter
B> > is cheaper than critical_enter(); counter += foo; critical_exit();
B> 
B> Critical sections actually only modify td->td_critnest, so they do
B> a single %fs-based load and 2 %ds-based modifications (it's %fs for
B> i386 and %gs for amd64).  So it's far from clear that the direct
B> update is faster than the generic i386 code.  Avoiding the generic
B> i386 code also takes an extra load, test and branch.  The generic
B> code reduces to something like the following instructions after inlining
B> the critical section:
B> 
B> 	movl	%gs:CURTHREAD,%ebx
B> 	incl	TD_CRITNEST(%ebx)
B> 	movl	counter_arg,%edi
B> 	movl	inc_lo,%eax
B> 	movl	inc_hi,%edx
B> 	addl	%eax,%fs:(%edi)
B> 	adcl	%edx,%fs:4(%edi)
B> 1:
B> 	decl	TD_CRITNEST(%ebx)
B> 
B> while the non-generic code for the best case when cmpxchg8b is supported is
B> something like:
B> 
B>  testl   $CPUID_CX8,cpu_feature
B>  je  2f
B> 	movl	counter_arg,%esi
B> 	movl	ptr_to_inc,%edi	# might need more insns to init *ptr
B>  // 8 instructions here from the asm string.
B>  jmp 3f  # actually rearrange to jump in other case
B> 2:
B>  // Use critical section.
B> 3:
B> 
B> 
B> So using cmpxchg8b takes at least 4 more instructions than using a
B> critical section.  The incl/decl instructions on td_critnest are rather
B> slow, however.  According to my test, they take 6 cycles each.  (I
B> think they can be reduced to load; modify; store; ... store back
B> original value which should take more like 9 cycles.)  This probably
B> applies to the addl and adcl that do the work too.  12 cycles for them,
B> and just 2 more cycles (pipelined) for cmpxchg8b and all the other
B> setup instructions.
B> 
B> Now, how does the performance aspect work?  Is it by cmpxchg8b causing
B> less contention than direct addl to memory?  If so, then hopefully
B> plain cmpxchg does the same.
B> 
B> The non-cmpxchg version of the i386 code can probably be improved by
B> adding more instructions to it, to avoid doing the adcl to memory
B> in most cases.  Even if you support 64-bit increments, they will rarely
B> be used, so the branches for this will be predicted well:
B> 
B>  // Next 4 instructions as above:
B> 	movl	counter_arg,%edi
B> 	movl	inc_lo,%eax
B> 	movl	inc_hi,%edx
B> 	addl	%eax,%fs:(%edi)

Re: svn commit: r252032 - head/sys/amd64/include

2013-06-21 Thread Bruce Evans

On Fri, 21 Jun 2013, Gleb Smirnoff wrote:


On Fri, Jun 21, 2013 at 09:04:34AM +1000, Bruce Evans wrote:
B> >> The i386 version of the counter asm doesn't support the immediate
B> >> constraint for technical reasons.  64 bit counters are too large and
B> >> slow to use on i386, especially when they are implemented as they are
B> >> without races.
B> >
B> > Actual testing showed that it is only about twice as slow as a direct
B> > increment.  With the enclosed test program (a userland version hacked
B> > on a bit to avoid pcpu), on ref10-i386 the times are:
...
Yes, for a single threaded userland program using "+=" is faster than
all the magic that counter(9) does.

But when multiple threads need to access one counter "+=" fails both
with precision and with performance.


No.  But I forgot about the performance aspect.


Using "+=" upon a per-CPU counter is racy, since += is compiled into
"load", "increment", "store" sequence and if we are not in a critical
section, then this is racy. We might be removed from CPU between load
and store.


Not if it is a 32-bit increment on 32-bit systems, as it should be.

I said to use a daemon to convert small (16 or 32 bit) counters into
larger (32 or 64 bit) ones.  It is almost as efficient to call the
accumulation code directly:

	addl	%1,%fs:(%0)
	jns	1f
	call	counter_accum
1:

This calls the accumulation code when the counter goes above 0x80000000.
The accumulation code then uses cmpxchg8b or a critical section to
atomically transfer everything in the small counter to the large
counter.  Note that the small counter remains valid at all times unless
it is incremented too fast.  Code that reports the counters must also
use cmpxchg8b or a critical section to read them atomically before
summing them.  The transfer window between when the small counter
reaches 0x80000000 and when it overflows is very large provided large
increments are never made (2 increments of 0x80000000 would break it
immediately, but an average increment of 1 takes about a minute to
reach 0x80000000 even if called from a loop doing nothing except the
increment).  Only 32-bit increments that are not too large are
supported.  Negative increments are handled by a heuristic in the
accumulator.  E.g., an increment of -1 just works unless it causes the
small counter's value to go from 0 to 0xffffffff.  0xffffffff is
detected by the sign test.  It must be interpreted as -1 to be added
to the large counter, not 0xffffffff.  The full heuristic would be
something like:

	if (small_counter >= 0xf0000000) {
		/* Subtract from large counter. */
		/* Actually do this atomically, of course. */
		large_counter += (int32_t)small_counter;
		small_counter = 0;
		/*
		 * Add here sanity checks that the large counter never goes
		 * negative.  This can only be afforded because this code
		 * is rarely called (and is not inline).
		 */
	} else if (small_counter < 0x90000000) {
		/* Add to large counter. */
		large_counter += small_counter;
		small_counter = 0;
	} else {
		panic("transfer window too small");
	}
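
Put together, the out-of-line accumulator could be roughly as below; the
name counter_accum(), the argument layout and the use of a critical
section instead of cmpxchg8b are assumptions for illustration:

/*
 * Sketch: fold a per-CPU 32-bit counter into the 64-bit counter.
 * Reached from the call after the jns above, i.e. only once the small
 * counter's sign bit has become set, so it runs rarely.
 */
static void
counter_accum(uint32_t *small, uint64_t *large)
{

	critical_enter();
	if (*small >= 0xf0000000)
		/* A small negative increment wrapped the counter. */
		*large += (int32_t)*small;
	else if (*small < 0x90000000)
		*large += *small;
	else
		panic("transfer window too small");
	*small = 0;
	critical_exit();
}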


Entering critical section means modifying curthread, which is again
a %gs based load & store. Exiting critical section has the same cost.
Thus, we assume that doing a direct %gs based update on the counter
is cheaper than critical_enter(); counter += foo; critical_exit();


Critical sections actually only modify td->td_critnest, so they do
a single %fs-based load and 2 %ds-based modifications (it's %fs for
i386 and %gs for amd64).  So it's far from clear that the direct
update is faster than the generic i386 code.  Avoiding the generic
i386 code also takes an extra load, test and branch.  The generic
code reduces to something like the following instructions after inlining
the critical section:

	movl	%gs:CURTHREAD,%ebx
	incl	TD_CRITNEST(%ebx)
	movl	counter_arg,%edi
	movl	inc_lo,%eax
	movl	inc_hi,%edx
	addl	%eax,%fs:(%edi)
	adcl	%edx,%fs:4(%edi)
1:
	decl	TD_CRITNEST(%ebx)

while the non-generic code for the best case when cmpxchg8b is supported is
something like:

testl   $CPUID_CX8,cpu_feature
je  2f
	movl	counter_arg,%esi
	movl	ptr_to_inc,%edi	# might need more insns to init *ptr
// 8 instructions here from the asm string.
jmp 3f  # actually rearrange to jump in other case
2:
// Use critical section.
3:


So using cmpxchg8b takes at least 4 more instructions than using a
critical section.  The incl/decl instructions on td_critnest are rather
slow, however.  According to my test, they take 6 cycles each.  (I
think they can be reduced to load; modify; store; ... store back
original value which should take more like 9 cycle

Re: svn commit: r252032 - head/sys/amd64/include

2013-06-20 Thread Gleb Smirnoff
  Bruce,

On Fri, Jun 21, 2013 at 09:04:34AM +1000, Bruce Evans wrote:
B> >> The i386 version of the counter asm doesn't support the immediate
B> >> constraint for technical reasons.  64 bit counters are too large and
B> >> slow to use on i386, especially when they are implemented as they are
B> >> without races.
B> >
B> > Actual testing showed that it is only about twice as slow as a direct
B> > increment.  With the enclosed test program (a userland version hacked
B> > on a bit to avoid pcpu), on ref10-i386 the times are:
B> > - loop overhead:                                         1 cycle
B> > - direct unlocked increment of a uint32_t:               6 cycles
B> > - direct unlocked increment of a uint64_t:               7 cycles
B> > - non-inline function unlocked increment of a uint64_t:  7.5 cycles
B> > - counter_u64_add():                                     14 cycles
B> > - non-inline counter_u64_add():                          18 cycles
B> > ...
B> 
B> Actually enclosing the test program:
B> 
B> % #include <stdint.h>
B> % #include <stdio.h>
B> % 
B> % static inline void
B> % counter_64_inc_8b(volatile uint64_t *p, int64_t inc)
B> % {
B> % 
B> %__asm __volatile(
B> %"movl   %%ds:(%%esi),%%eax\n\t"
B> %"movl   %%ds:4(%%esi),%%edx\n"
B> % "1:\n\t"
B> %"movl   %%eax,%%ebx\n\t"
B> %"movl   %%edx,%%ecx\n\t"
B> %"addl   (%%edi),%%ebx\n\t"
B> %"adcl   4(%%edi),%%ecx\n\t"
B> %"cmpxchg8b %%ds:(%%esi)\n\t"
B> %"jnz1b"
B> %:
B> %: "S" (p), "D" (&inc)
B> %: "memory", "cc", "eax", "edx", "ebx", "ecx");
B> % }
B> % 
B> % uint32_t cpu_feature = 1;
B> % 
B> % typedef volatile uint64_t *counter_u64_t;
B> % 
B> % static void
B> % #if 1
B> % inline
B> % #else
B> % __noinline
B> % #endif
B> % counter_u64_add(counter_u64_t c, int64_t inc)
B> % {
B> % 
B> % #if 1
B> %if ((cpu_feature & 1) == 1) {
B> %counter_64_inc_8b(c, inc);
B> %}
B> % #elif 0
B> %if ((cpu_feature & 1) == 1) {
B> %*c += inc;
B> %}
B> % #else
B> %*c += inc;
B> % #endif
B> % }
B> % 
B> % uint64_t mycounter[1];
B> % 
B> % int
B> % main(void)
B> % {
B> %unsigned i;
B> % 
B> %for (i = 0; i < 1861955704; i++)/* sysctl -n machdep.tsc_freq */
B> %counter_u64_add(mycounter, 1);
B> %printf("%ju\n", (uintmax_t)mycounter[0]);
B> % }

Yes, for a single threaded userland program using "+=" is faster than
all the magic that counter(9) does.

But when multiple threads need to access one counter "+=" fails both
with precision and with performance.

Using "+=" upon a per-CPU counter is racy, since += is compiled into
"load", "increment", "store" sequence and if we are not in a critical
section, then this is racy. We might be removed from CPU between load
and store.

Entering critical section means modifying curthread, which is again
a %gs based load & store. Exiting critical section has the same cost.
Thus, we assume that doing a direct %gs based update on the counter
is cheaper than critical_enter(); counter += foo; critical_exit();

-- 
Totus tuus, Glebius.


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-20 Thread Konstantin Belousov
On Fri, Jun 21, 2013 at 12:15:24PM +1000, Lawrence Stewart wrote:
> Hi Kostik,
> 
> On 06/21/13 00:30, Konstantin Belousov wrote:
> > Author: kib
> > Date: Thu Jun 20 14:30:04 2013
> > New Revision: 252032
> > URL: http://svnweb.freebsd.org/changeset/base/252032
> > 
> > Log:
> >   Allow immediate operand.
> >   
> >   Sponsored by: The FreeBSD Foundation
> > 
> > Modified:
> >   head/sys/amd64/include/counter.h
> > 
> > Modified: head/sys/amd64/include/counter.h
> > ==
> > --- head/sys/amd64/include/counter.hThu Jun 20 14:20:03 2013
> > (r252031)
> > +++ head/sys/amd64/include/counter.hThu Jun 20 14:30:04 2013
> > (r252032)
> > @@ -44,7 +44,7 @@ counter_u64_add(counter_u64_t c, int64_t
> >  
> > __asm __volatile("addq\t%1,%%gs:(%0)"
> > :
> > -   : "r" ((char *)c - (char *)&__pcpu[0]), "r" (inc)
> > +   : "r" ((char *)c - (char *)&__pcpu[0]), "ri" (inc)
> > : "memory", "cc");
> >  }
> 
> For mere mortals like myself, a verbose explanation of what this does,
> why it's necessary and what problem(s) it solves (if any) would be most
> helpful :)

It does what was written in the commit message.  The addq instructions
is allowed to take the immediate operand, besides the register, for the
increment value.  For the typical case of incrementing by the constant 1,
before the commit, the emitted code was like
	movl	$1,%ecx
	addq	%rcx,%gs:(%rdx)
now it could be
	addq	$1,%gs:(%rdx)

Mostly aesthetic; it also slightly lowers register pressure.




Re: svn commit: r252032 - head/sys/amd64/include

2013-06-20 Thread Lawrence Stewart
Hi Kostik,

On 06/21/13 00:30, Konstantin Belousov wrote:
> Author: kib
> Date: Thu Jun 20 14:30:04 2013
> New Revision: 252032
> URL: http://svnweb.freebsd.org/changeset/base/252032
> 
> Log:
>   Allow immediate operand.
>   
>   Sponsored by:   The FreeBSD Foundation
> 
> Modified:
>   head/sys/amd64/include/counter.h
> 
> Modified: head/sys/amd64/include/counter.h
> ==
> --- head/sys/amd64/include/counter.h  Thu Jun 20 14:20:03 2013
> (r252031)
> +++ head/sys/amd64/include/counter.h  Thu Jun 20 14:30:04 2013
> (r252032)
> @@ -44,7 +44,7 @@ counter_u64_add(counter_u64_t c, int64_t
>  
>   __asm __volatile("addq\t%1,%%gs:(%0)"
>   :
> - : "r" ((char *)c - (char *)&__pcpu[0]), "r" (inc)
> + : "r" ((char *)c - (char *)&__pcpu[0]), "ri" (inc)
>   : "memory", "cc");
>  }

For mere mortals like myself, a verbose explanation of what this does,
why it's necessary and what problem(s) it solves (if any) would be most
helpful :)

Cheers,
Lawrence


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-20 Thread Bruce Evans

On Fri, 21 Jun 2013, Bruce Evans wrote:


On Fri, 21 Jun 2013, I wrote:


On Thu, 20 Jun 2013, Konstantin Belousov wrote:

...
@@ -44,7 +44,7 @@ counter_u64_add(counter_u64_t c, int64_t
...

The i386 version of the counter asm doesn't support the immediate
constraint for technical reasons.  64 bit counters are too large and
slow to use on i386, especially when they are implemented as they are
without races.


Actual testing showed that it is only about twice as slow as a direct
increment.  With the enclosed test program (a userland version hacked
on a bit to avoid pcpu), on ref10-i386 the times are:
- loop overhead:                                         1 cycle
- direct unlocked increment of a uint32_t:               6 cycles
- direct unlocked increment of a uint64_t:               7 cycles
- non-inline function unlocked increment of a uint64_t:  7.5 cycles
- counter_u64_add():                                     14 cycles
- non-inline counter_u64_add():                          18 cycles
...


Actually enclosing the test program:

% #include <stdint.h>
% #include <stdio.h>
% 
% static inline void

% counter_64_inc_8b(volatile uint64_t *p, int64_t inc)
% {
% 
% 	__asm __volatile(

%   "movl  %%ds:(%%esi),%%eax\n\t"
%   "movl  %%ds:4(%%esi),%%edx\n"
% "1:\n\t"
%   "movl  %%eax,%%ebx\n\t"
%   "movl  %%edx,%%ecx\n\t"
%   "addl  (%%edi),%%ebx\n\t"
%   "adcl  4(%%edi),%%ecx\n\t"
%   "cmpxchg8b %%ds:(%%esi)\n\t"
%   "jnz   1b"
%   :
%   : "S" (p), "D" (&inc)
%   : "memory", "cc", "eax", "edx", "ebx", "ecx");
% }
% 
% uint32_t cpu_feature = 1;
% 
% typedef volatile uint64_t *counter_u64_t;
% 
% static void

% #if 1
% inline
% #else
% __noinline
% #endif
% counter_u64_add(counter_u64_t c, int64_t inc)
% {
% 
% #if 1

%   if ((cpu_feature & 1) == 1) {
%   counter_64_inc_8b(c, inc);
%   }
% #elif 0
%   if ((cpu_feature & 1) == 1) {
%   *c += inc;
%   }
% #else
%   *c += inc;
% #endif
% }
% 
% uint64_t mycounter[1];
% 
% int

% main(void)
% {
%   unsigned i;
% 
% 	for (i = 0; i < 1861955704; i++)	/* sysctl -n machdep.tsc_freq */

%   counter_u64_add(mycounter, 1);
%   printf("%ju\n", (uintmax_t)mycounter[0]);
% }

Bruce


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-20 Thread Bruce Evans

On Fri, 21 Jun 2013, I wrote:


On Thu, 20 Jun 2013, Konstantin Belousov wrote:

...
@@ -44,7 +44,7 @@ counter_u64_add(counter_u64_t c, int64_t
...

The i386 version of the counter asm doesn't support the immediate
constraint for technical reasons.  64 bit counters are too large and
slow to use on i386, especially when they are implemented as they are
without races.


Actual testing showed that it is only about twice as slow as a direct
increment.  With the enclosed test program (a userland version hacked
on a bit to avoid pcpu), on ref10-i386 the times are:
- loop overhead:                                         1 cycle
- direct unlocked increment of a uint32_t:               6 cycles
- direct unlocked increment of a uint64_t:               7 cycles
- non-inline function unlocked increment of a uint64_t:  7.5 cycles
- counter_u64_add():                                     14 cycles
- non-inline counter_u64_add():                          18 cycles

Add many more when critical_enter()/exit() is needed.

I thought that a direct increment of a uint32_t took only 3 cycles.  This
is the documented time for i486.  4 cycles latency is documented for
AthlonXP/64.  The carry check for incrementing a uint64_t is pipelined
on most modern i386, so it adds very little to this.

Nevertheless, the correct implementation of counters, once you have the
complexity of a counter API and can't just do counter++, is to use small
counters and run a daemon to accumulate them in larger counters before
they overflow.  pcpu accesses should allow simple counter++ accesses to
work for the smaller counters (except their address is in pcpu space).
But I don't see how sysctl accesses can work without lots of context
switches to reach strictly per-CPU context.  The current accumulation
of pcpu counters in places like vcnt() doesn't do that -- it accesses
pcpu counters for other CPUs, so has races.  The races are more serious
for accumulating counters into larger ones.  Then the smaller ones need
to be cleared atomically with copying them.  The accumulation daemon(s)
can run per-CPU more easily than sysctls, since the daemons don't need
to run as frequently as sysctls might.
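
Such a daemon could be structured roughly as below; the kthread body,
the per-CPU drain step and the sleep interval are all assumptions,
sketched with standard kernel primitives:

/*
 * Sketch of an accumulation daemon: bind to each CPU in turn, drain
 * that CPU's small counters into the 64-bit totals, and sleep well
 * before any 32-bit counter can overflow.  counter_drain_cpu() is a
 * hypothetical per-CPU drain routine.
 */
static void
counter_accum_daemon(void *arg __unused)
{
	int cpu;

	for (;;) {
		CPU_FOREACH(cpu) {
			thread_lock(curthread);
			sched_bind(curthread, cpu);
			thread_unlock(curthread);

			counter_drain_cpu(cpu);

			thread_lock(curthread);
			sched_unbind(curthread);
			thread_unlock(curthread);
		}
		pause("cntacc", hz);
	}
}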

Bruce


Re: svn commit: r252032 - head/sys/amd64/include

2013-06-20 Thread Bruce Evans

On Thu, 20 Jun 2013, Konstantin Belousov wrote:


Log:
 Allow immediate operand.
..
Modified: head/sys/amd64/include/counter.h
==
--- head/sys/amd64/include/counter.hThu Jun 20 14:20:03 2013
(r252031)
+++ head/sys/amd64/include/counter.hThu Jun 20 14:30:04 2013
(r252032)
@@ -44,7 +44,7 @@ counter_u64_add(counter_u64_t c, int64_t

__asm __volatile("addq\t%1,%%gs:(%0)"
:
-   : "r" ((char *)c - (char *)&__pcpu[0]), "r" (inc)
+   : "r" ((char *)c - (char *)&__pcpu[0]), "ri" (inc)
: "memory", "cc");
}


I don't like the quality of these asms.  pcpu.h on both amd64 and i386
uses a hackish output operand and no memory clobber, and no cc clobber.
The output operand is hackish because the variable is in a different
address space.  But since the variable is in a different address space,
the memory clobber is very unlikely to protect anything -- it could
only protect accesses to the variable without using %gs, and it is
practically impossible to have such accesses in scope, since we are
using %gs because it is the only way that we have in scope to access
the variable.

The i386 version is much larger and uglier.  It has mounds of style bugs.
See {amd64,i386}/include/atomic.h for normal style of multi-line asms.
I don't like that either.  I like a style with backslash-newlines, hard
tabs and soft newlines ('...\n\$') used in i386/isa/npx.c.  The
backslash-newlines and hard tabs are less ugly and more readable than
double quotes and soft tabs ('"...\n\t$"').  Escapes in string constants
are unfortunately necessary since gcc broke its extension of allowing
hard newlines in string constants.

The old good way, using the gcc extension, allowed copying extern asm:
directly into inline asm (code for a dummy loop):

asm("
	movl	$100,%%ecx
1:
	decl	%%ecx
	jne	1b
");

Without the extension, this has to be uglified and requires a lot of editing
to convert.  Using backslash-newline and the same indentation for the source
and generated asm file, but with trailing tabs to line up:

asm("  \n\
	movl	$100,%%ecx	\n\
1:				\n\
	decl	%%ecx		\n\
	jne	1b		\n\
");/* clobbers omitted */

The style in machine/atomic.h adds lots of quotes and breaks the formatting
of the generated asm file by omitting the newlines, and has to add another
ugliness to recover from that -- now it needs semicolons instead of newlines:

asm(
"	movl	$100,%%ecx; "
"1:"  /* should use newline/semi */
"	decl	%%ecx ; "
"  jne 1b  "
);  /* might be on previous line */

This requires much more editing.  Finally, the style in
i386/include/counter.h changes the semicolons to soft newlines to
improve the output and removes most of the formatting in the source but
adds soft tabs to give the indentation in the output in another way:

asm(
"movl  $100,%%ecx\n\t"
"1:\n\t"
"decl  %%ecx\n\t"
"jne   1b\n\t"
);

The i386 version of the counter asm doesn't support the immediate
constraint for technical reasons.  64 bit counters are too large and
slow to use on i386, especially when they are implemented as they are
without races.

Bruce