Re: svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys include sys

2012-06-24 Thread Bruce Evans

On Sat, 23 Jun 2012, Alexander Motin wrote:


On 06/23/12 18:26, Bruce Evans wrote:

On Sat, 23 Jun 2012, Konstantin Belousov wrote:

On Sat, Jun 23, 2012 at 03:17:57PM +0200, Marius Strobl wrote:

So apart from introducing code to constantly synchronize the
TICK counters, using the timecounters on the host busses also
seems to be the only viable solution for userland. The latter
should be doable but is long-winded as besides duplicating
portions of the corresponding device drivers in userland, it
probably also means to get some additional infrastructure
like being able to memory map registers for devices on the
nexus(4) level in place ...


There is little point in optimizations to avoid syscalls for hardware.
On x86, a syscall takes 100-400 nsec extra, so if the hardware takes
500-2000 nsec then reducing the total time by 100-400 nsec is not
very useful.


Just out of curiosity I've run my own binuptime() micro-benchmarks:
- on Core i5-650:
 TSC         11ns
 HPET       433ns
 ACPI-fast  515ns
 i8254     3736ns


The TSC is surprisingly fast and the others are depressingly slow,
although about the fastest I've seen for bus-based timecounters.

On Athlon64, rdtsc() takes 6.5 cycles, but I thought all P-state
invariant TSCs took > 40 cycles.  rdtsc() takes 65 cycles on FreeBSD
x86 cluster machines (core2 Xeon), except on freefall (P4(?) Xeon).

I hardly believe 11ns.  That's 44 cycles at 4GHz.  IIRC, the Athlon64
at 2.2GHz took 29nsec for binuptime() last time I measured it (long
ago, when it still had the statistics counter pessimization).


- on dual-socket Xeon E5645:
 TSC         15ns
 HPET       580ns
 ACPI-fast 1118ns
 i8254     3911ns

I think it could be useful to have that small benchmark in base kernel.


I think kib put one in src/tools for userland.  I mostly use a userland
one.  Except for the TSC, the overhead for the kernel parts can be
estimated accurately from userland, since it is so large.

This is more normal slowness for ACPI-[!]fast.  freefall still uses
ACPI-fast and it takes a minimum of 1396 and an average of 1729nsec
from userland (load average 1.3).  Other x86 cluster machines now use
TSC-[s]low, and it takes a minimum of 481 and an average of 533nsec
(now the swing from 481 to 533 is given by its gratuitous impreciseness
and not by system load).

BTW, the i8254 timecounter can be made about 3/2 times faster if anyone
cared, by reading only the low 8 bits of the timer.  This would require
running clock interrupts at >= 4kHz so that the top 8 bits are rarely
needed (great for a tickless kernel :-), or maybe by using a fuzzier
timer to determine when the top bits are needed.  At ~2500ns, it would
be only slightly slower than the slowest ACPI-fast, and faster than
ACPI-safe.
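The bookkeeping for such a low-8-bits read might look like the following (a hypothetical sketch: real i8254 access needs port I/O and locking, so the raw byte is passed in here; the i8254 counts down, and the scheme only works if reads happen at least once per 256-count rollover, hence the >= 4kHz interrupt requirement):

```c
#include <stdint.h>

/* Hypothetical sketch: extend an 8-bit down-counter (like the low byte
 * of the i8254) into a wide software up-count.  Correct only if called
 * at least once per 256-count rollover, e.g. from a >= 4 kHz clock
 * interrupt or a read path that runs at least that often. */
struct short_tc {
	uint8_t last;		/* last raw reading (counts down) */
	uint32_t count;		/* accumulated up-count */
};

static uint32_t
short_tc_read(struct short_tc *tc, uint8_t raw)
{
	/* The 8-bit subtraction wraps correctly as long as fewer than
	 * 256 counts elapsed since the previous call. */
	uint8_t elapsed = (uint8_t)(tc->last - raw);

	tc->count += elapsed;
	tc->last = raw;
	return (tc->count);
}
```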

OTOH, I have measured i8254 timer reads taking 138000ns (on UP with
interrupts disabled) on a system where they normally take only 4000ns.
Apparently the ISA bus waits for other bus activity (DMA?) for that
long.  Does this happen for other buses?  Extra bridges for ISA can't
help.


...
The new timeout code to support tickless kernels looks like it will give
large pessimizations unless the timecounter is fast.  Instead of using
the tick counter (1 atomic increment on every clock tick) and some
getbinuptime() calls in places like select(), it uses the hardware
timecounter via binuptime() in most places (since without a tick counter
and without clock interrupts updating the timehands periodically, it takes
a hardware timecounter read to determine the time).  So callout_reset()
might start taking thousands of nsec per call, depending on how slow
the timecounter is.  The fix is probably to use a fuzzy time for long
timeouts and to discourage use of short timeouts and/or to turn them
into long or fuzzy timeouts so that they are not very useful.


The new timeout code is still in active development and optimization was not
the first priority yet. My idea was to use the much faster getbinuptime() for
periods above, let's say, 100ms.


You would need to run non-tickless with a clock interrupt frequency
of >= 10Hz to keep getbinuptime() working.  Seems like a bad thing to
aim for.  Better not use bintimes at all.  I would try using
pseudo-ticks, (where the tick counter is advanced on every
not-very-periodic clock interrupt and at some other times when you
know that clock interrupts have been stopped, and maybe at other
interesting places (all interrupts and all syscalls?)).  Only call
binuptime() every few thousand pseudo-ticks to prevent long-term drift.
Timeouts would become longer and fuzzier than now, but that is a feature
(it inhibits using them for busy-waiting).  You know when you scheduled
clock interrupts and can advance the tick counter to represent the
interval between clock interrupts fairly accurately (say to within 10%).
The fuzziness comes mainly from not scheduling clock interrupts very
often, so that for example when something asks for a sleep of 1 tick
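The pseudo-tick scheme sketched above might look like this (hypothetical names; in a real kernel the advance would run in the clock-interrupt path and the resync would come from an occasional binuptime() call):

```c
#include <stdint.h>

/* Pseudo-ticks: instead of one atomic increment per periodic clock
 * interrupt, advance the counter by however many nominal ticks the
 * interval since the last (non-periodic) interrupt represents.  The
 * span is known from the programmed event-timer deadline, so it is
 * accurate to within ~10%. */
static uint64_t pseudo_ticks;

void
pseudo_tick_advance(uint32_t span_ticks)
{
	pseudo_ticks += span_ticks;
}

/* Every few thousand pseudo-ticks, correct long-term drift against an
 * authoritative clock read (binuptime() in the kernel). */
void
pseudo_tick_resync(uint64_t real_ticks)
{
	pseudo_ticks = real_ticks;
}

uint64_t
pseudo_tick_count(void)
{
	return (pseudo_ticks);
}
```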

Re: svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys include sys

2012-06-23 Thread Marius Strobl
On Fri, Jun 22, 2012 at 10:48:17AM +0300, Konstantin Belousov wrote:
 On Fri, Jun 22, 2012 at 09:34:56AM +0200, Marius Strobl wrote:
  On Fri, Jun 22, 2012 at 07:13:31AM +, Konstantin Belousov wrote:
   Author: kib
   Date: Fri Jun 22 07:13:30 2012
   New Revision: 237434
   URL: http://svn.freebsd.org/changeset/base/237434
   
   Log:
 Use struct vdso_timehands data to implement fast gettimeofday(2) and
 clock_gettime(2) functions if supported. The speedup seen in
 microbenchmarks is in range 4x-7x depending on the hardware.
 
 Only amd64 and i386 architectures are supported. Libc uses rdtsc and
 kernel data to calculate current time, if enabled by kernel.
  
  I don't know much about x86 CPUs but is my understanding correct
  that TSCs are not synchronized in any way across CPUs, i.e.
  reading it on different CPUs may result in time going backwards
  etc., which is okay for this application though?
 
Generally speaking, TSC state among different CPUs after boot is not
synchronized, you are right.
 
The kernel has a somewhat doubtful test which verifies whether the after-boot
state of the TSC looks good. If the test fails, TSC is not enabled by
 default as timecounter, and then usermode follows kernel policy and
 falls back to slow syscall. So we err on the safe side.
 I tested this on Core i7 2xxx, where the test (usually) passes.

Okay, so for x86 the TSCs are not used as timecounters by either
the kernel or userland in the SMP case if they don't appear to
be synchronized, correct?

 
While you are there, do you have comments about the sparc64 TICK counter?
On SMP, the counter of the BSP is read via an IPI. Is that unavoidable?

The TICK counters are per-core and not synchronized by the hardware.
We synchronize APs with the BSP on bring-up but they drift over time
and the initial synchronization might not be perfect in the first
place. At least in the past, drifting TICK counters caused all sorts
of issues and strange behavior in FreeBSD when used as timecounter
in the SMP case. If my understanding of the above is right, this
still rules them out, as is, as timecounters for userland.
Linux has some complex code (based on equivalent code originating in
their ia64 port) for constantly synchronizing the TICK counters.
In order to avoid that complexity and overhead, what I do in
FreeBSD in the SMP case is to (ab)use counters (either intended
for that purpose or bus cycle counters probably intended for
debugging the hardware during development) available in the
various host-to-foo bridges so it doesn't matter which CPU they
are read by. This works just fine except for pre-PCI-Express
based USIIIi machines, where the bus cycle counters are broken.
That's where the TICK counter is always read from the BSP
using an IPI in the SMP case. The latter is done as sched_bind(9)
isn't possible with td_critnest > 1 according to information
from jhb@ and mav@.
So apart from introducing code to constantly synchronize the
TICK counters, using the timecounters on the host busses also
seems to be the only viable solution for userland. The latter
should be doable but is long-winded as besides duplicating
portions of the corresponding device drivers in userland, it
probably also means to get some additional infrastructure
like being able to memory map registers for devices on the
nexus(4) level in place ...

Marius

___
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org


Re: svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys include sys

2012-06-23 Thread Konstantin Belousov
On Sat, Jun 23, 2012 at 03:17:57PM +0200, Marius Strobl wrote:
 On Fri, Jun 22, 2012 at 10:48:17AM +0300, Konstantin Belousov wrote:
  On Fri, Jun 22, 2012 at 09:34:56AM +0200, Marius Strobl wrote:
   On Fri, Jun 22, 2012 at 07:13:31AM +, Konstantin Belousov wrote:
Author: kib
Date: Fri Jun 22 07:13:30 2012
New Revision: 237434
URL: http://svn.freebsd.org/changeset/base/237434

Log:
  Use struct vdso_timehands data to implement fast gettimeofday(2) and
  clock_gettime(2) functions if supported. The speedup seen in
  microbenchmarks is in range 4x-7x depending on the hardware.
  
  Only amd64 and i386 architectures are supported. Libc uses rdtsc and
  kernel data to calculate current time, if enabled by kernel.
   
   I don't know much about x86 CPUs but is my understanding correct
   that TSCs are not synchronized in any way across CPUs, i.e.
   reading it on different CPUs may result in time going backwards
   etc., which is okay for this application though?
  
  Generally speaking, TSC state among different CPUs after boot is not
  synchronized, you are right.
  
  The kernel has a somewhat doubtful test which verifies whether the after-boot
  state of the TSC looks good. If the test fails, TSC is not enabled by
  default as timecounter, and then usermode follows kernel policy and
  falls back to slow syscall. So we err on the safe side.
  I tested this on Core i7 2xxx, where the test (usually) passes.
 
 Okay, so for x86 the TSCs are not used as timecounters by either
 the kernel or userland in the SMP case if they don't appear to
 be synchronized, correct?
Correct as for now.  But this is a bug and not a feature. The TSCs shall
be synchronized, or skew tables calculated, instead of refusing to use them.

 
  
  While you are there, do you have comments about the sparc64 TICK counter?
  On SMP, the counter of the BSP is read via an IPI. Is that unavoidable?
 
 The TICK counters are per-core and not synchronized by the hardware.
 We synchronize APs with the BSP on bring-up but they drift over time
 and the initial synchronization might not be perfect in the first
 place. At least in the past, drifting TICK counters caused all sorts
 of issues and strange behavior in FreeBSD when used as timecounter
 in the SMP case. If my understanding of the above is right, this
 still rules them out, as is, as timecounters for userland.
 Linux has some complex code (based on equivalent code originating in
 their ia64 port) for constantly synchronizing the TICK counters.
 In order to avoid that complexity and overhead, what I do in
 FreeBSD in the SMP case is to (ab)use counters (either intended
 for that purpose or bus cycle counters probably intended for
 debugging the hardware during development) available in the
 various host-to-foo bridges so it doesn't matter which CPU they
 are read by. This works just fine except for pre-PCI-Express
 based USIIIi machines, where the bus cycle counters are broken.
 That's where the TICK counter is always read from the BSP
 using an IPI in the SMP case. The latter is done as sched_bind(9)
 isn't possible with td_critnest > 1 according to information
 from jhb@ and mav@.
 So apart from introducing code to constantly synchronize the
 TICK counters, using the timecounters on the host busses also
 seems to be the only viable solution for userland. The latter
 should be doable but is long-winded as besides duplicating
 portions of the corresponding device drivers in userland, it
 probably also means to get some additional infrastructure
 like being able to memory map registers for devices on the
 nexus(4) level in place ...

Understand. I do plan eventually to map HPET counters page into usermode
on x86.

Also, as I noted above, some code to synchronize per-package counters
would be useful for x86, so it might be developed with multi-arch
usage in mind.




Re: svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys include sys

2012-06-23 Thread Bruce Evans

On Sat, 23 Jun 2012, Konstantin Belousov wrote:


On Sat, Jun 23, 2012 at 03:17:57PM +0200, Marius Strobl wrote:

On Fri, Jun 22, 2012 at 10:48:17AM +0300, Konstantin Belousov wrote:

On Fri, Jun 22, 2012 at 09:34:56AM +0200, Marius Strobl wrote:

On Fri, Jun 22, 2012 at 07:13:31AM +, Konstantin Belousov wrote:

Author: kib
Date: Fri Jun 22 07:13:30 2012
New Revision: 237434
URL: http://svn.freebsd.org/changeset/base/237434

Log:
  Use struct vdso_timehands data to implement fast gettimeofday(2) and
  clock_gettime(2) functions if supported. The speedup seen in
  microbenchmarks is in range 4x-7x depending on the hardware.

  Only amd64 and i386 architectures are supported. Libc uses rdtsc and
  kernel data to calculate current time, if enabled by kernel.


I don't know much about x86 CPUs but is my understanding correct
that TSCs are not synchronized in any way across CPUs, i.e.
reading it on different CPUs may result in time going backwards
etc., which is okay for this application though?


Generally speaking, TSC state among different CPUs after boot is not
synchronized, you are right.

The kernel has a somewhat doubtful test which verifies whether the after-boot
state of the TSC looks good. If the test fails, TSC is not enabled by
default as timecounter, and then usermode follows kernel policy and
falls back to slow syscall. So we err on the safe side.
I tested this on Core i7 2xxx, where the test (usually) passes.


Okay, so for x86 the TSCs are not used as timecounters by either
the kernel or userland in the SMP case if they don't appear to
be synchronized, correct?

Correct as for now.  But this is a bug and not a feature. The TSCs shall
be synchronized, or skew tables calculated, instead of refusing to use them.


While you are there, do you have comments about the sparc64 TICK counter?
On SMP, the counter of the BSP is read via an IPI. Is that unavoidable?


The TICK counters are per-core and not synchronized by the hardware.
We synchronize APs with the BSP on bring-up but they drift over time
and the initial synchronization might not be perfect in the first
place. At least in the past, drifting TICK counters caused all sorts
of issues and strange behavior in FreeBSD when used as timecounter
in the SMP case. If my understanding of the above is right, this
still rules them out, as is, as timecounters for userland.
Linux has some complex code (based on equivalent code originating in
their ia64 port) for constantly synchronizing the TICK counters.
In order to avoid that complexity and overhead, what I do in
FreeBSD in the SMP case is to (ab)use counters (either intended


Attempted synchronization of TSCs is left out for the same reason on x86,
except that some half-baked synchronization for a home-made time function
in dtrace (dtrace_gethrtime() on amd64 and i386) crept in.


for that purpose or bus cycle counters probably intended for
debugging the hardware during development) available in the
various host-to-foo bridges so it doesn't matter which CPU they
are read by. This works just fine except for pre-PCI-Express
based USIIIi machines, where the bus cycle counters are broken.
That's where the TICK counter is always read from the BSP
using an IPI in the SMP case. The latter is done as sched_bind(9)
isn't possible with td_critnest > 1 according to information
from jhb@ and mav@.


How can it work fine?  Buses are too slow.  On x86, ACPI-fast takes
700-1900 nsec on machines that I've tested (mostly pre-PCIe ones).
HPET seems to be only slightly faster (maybe 500 nsec).


So apart from introducing code to constantly synchronize the
TICK counters, using the timecounters on the host busses also
seems to be the only viable solution for userland. The latter
should be doable but is long-winded as besides duplicating
portions of the corresponding device drivers in userland, it
probably also means to get some additional infrastructure
like being able to memory map registers for devices on the
nexus(4) level in place ...


There is little point in optimizations to avoid syscalls for hardware.
On x86, a syscall takes 100-400 nsec extra, so if the hardware takes
500-2000 nsec then reducing the total time by 100-400 nsec is not
very useful.


Understand. I do plan eventually to map HPET counters page into usermode
on x86.


This should be left out too.


Also, as I noted above, some code to synchronize per-package counters
would be useful for x86, so it might be developed with multi-arch
usage in mind.


It's only worth synchronizing fast timecounter hardware so that it can be
used in more cases.  It probably needs to be non-bus based to be fast.
That means the TSC on x86.

The new timeout code to support tickless kernels looks like it will give
large pessimizations unless the timecounter is fast.  Instead of using
the tick counter (1 atomic increment on every clock tick) and some
getbinuptime() calls in places like select(), it uses the hardware
timecounter via binuptime() in most places (since without a tick counter
and without clock 

Re: svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys include sys

2012-06-23 Thread Alexander Motin

On 06/23/12 18:26, Bruce Evans wrote:

On Sat, 23 Jun 2012, Konstantin Belousov wrote:

On Sat, Jun 23, 2012 at 03:17:57PM +0200, Marius Strobl wrote:

So apart from introducing code to constantly synchronize the
TICK counters, using the timecounters on the host busses also
seems to be the only viable solution for userland. The latter
should be doable but is long-winded as besides duplicating
portions of the corresponding device drivers in userland, it
probably also means to get some additional infrastructure
like being able to memory map registers for devices on the
nexus(4) level in place ...


There is little point in optimizations to avoid syscalls for hardware.
On x86, a syscall takes 100-400 nsec extra, so if the hardware takes
500-2000 nsec then reducing the total time by 100-400 nsec is not
very useful.


Just out of curiosity I've run my own binuptime() micro-benchmarks:
 - on Core i5-650:
  TSC 11ns
  HPET   433ns
  ACPI-fast  515ns
  i8254 3736ns

 - on dual-socket Xeon E5645:
  TSC 15ns
  HPET   580ns
  ACPI-fast 1118ns
  i8254 3911ns

I think it could be useful to have that small benchmark in base kernel.


Understand. I do plan eventually to map HPET counters page into usermode
on x86.


This should be left out too.


Also, as I noted above, some code to synchronize per-package counters
would be useful for x86, so it might be developed with multi-arch
usage in mind.


It's only worth synchronizing fast timecounter hardware so that it can be
used in more cases.  It probably needs to be non-bus based to be fast.
That means the TSC on x86.

The new timeout code to support tickless kernels looks like it will give
large pessimizations unless the timecounter is fast.  Instead of using
the tick counter (1 atomic increment on every clock tick) and some
getbinuptime() calls in places like select(), it uses the hardware
timecounter via binuptime() in most places (since without a tick counter
and without clock interrupts updating the timehands periodically, it takes
a hardware timecounter read to determine the time).  So callout_reset()
might start taking thousands of nsec per call, depending on how slow
the timecounter is.  The fix is probably to use a fuzzy time for long
timeouts and to discourage use of short timeouts and/or to turn them
into long or fuzzy timeouts so that they are not very useful.


The new timeout code is still in active development and optimization was 
not the first priority yet. My idea was to use the much faster 
getbinuptime() for periods above, let's say, 100ms. The legacy 
ticks-oriented callout_reset() functions are by default not supposed to 
provide sub-tick resolution and with some assumptions could use 
getbinuptime(). For the new interfaces it depends on the caller how it 
obtains the present time.


I understand that an integer tick counter is as fast as anything can 
ever be. But sorry, a 32-bit counter doesn't fit the present goals. To 
have more we need some artificial atomicity -- exactly what 
getbinuptime() implements. What I would like to see there is the removal 
of tc_tick, so that tc_windup() is called on every hardclock tick. With 
the new tick-independent callout interfaces we probably won't need to 
increase HZ very high any more, while this simplification would make the 
precision of ticks and getbinuptime() equal, addressing some of your 
valid arguments against the latter.


--
Alexander Motin


svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys include sys

2012-06-22 Thread Konstantin Belousov
Author: kib
Date: Fri Jun 22 07:13:30 2012
New Revision: 237434
URL: http://svn.freebsd.org/changeset/base/237434

Log:
  Use struct vdso_timehands data to implement fast gettimeofday(2) and
  clock_gettime(2) functions if supported. The speedup seen in
  microbenchmarks is in range 4x-7x depending on the hardware.
  
  Only amd64 and i386 architectures are supported. Libc uses rdtsc and
  kernel data to calculate current time, if enabled by kernel.
  
  Hopefully, this code is going to migrate into vdso in some future.
  
  Discussed with:   bde
  Reviewed by:  jhb
  Tested by:flo
  MFC after:1 month

Added:
  head/lib/libc/amd64/sys/__vdso_gettc.c   (contents, props changed)
  head/lib/libc/i386/sys/__vdso_gettc.c   (contents, props changed)
  head/lib/libc/sys/__vdso_gettimeofday.c   (contents, props changed)
  head/lib/libc/sys/clock_gettime.c   (contents, props changed)
  head/lib/libc/sys/gettimeofday.c   (contents, props changed)
Modified:
  head/lib/libc/amd64/sys/Makefile.inc
  head/lib/libc/gen/aux.c
  head/lib/libc/i386/sys/Makefile.inc
  head/lib/libc/include/libc_private.h
  head/lib/libc/sys/Makefile.inc

Modified: head/lib/libc/amd64/sys/Makefile.inc
==
--- head/lib/libc/amd64/sys/Makefile.incFri Jun 22 07:06:40 2012
(r237433)
+++ head/lib/libc/amd64/sys/Makefile.incFri Jun 22 07:13:30 2012
(r237434)
@@ -1,7 +1,8 @@
 #  from: Makefile.inc,v 1.1 1993/09/03 19:04:23 jtc Exp
 # $FreeBSD$
 
-SRCS+= amd64_get_fsbase.c amd64_get_gsbase.c amd64_set_fsbase.c amd64_set_gsbase.c
+SRCS+= amd64_get_fsbase.c amd64_get_gsbase.c amd64_set_fsbase.c \
+   amd64_set_gsbase.c __vdso_gettc.c
 
 MDASM= vfork.S brk.S cerror.S exect.S getcontext.S pipe.S ptrace.S \
reboot.S sbrk.S setlogin.S sigreturn.S

Added: head/lib/libc/amd64/sys/__vdso_gettc.c
==
--- /dev/null   00:00:00 1970   (empty, because file is newly added)
+++ head/lib/libc/amd64/sys/__vdso_gettc.c  Fri Jun 22 07:13:30 2012
(r237434)
@@ -0,0 +1,49 @@
+/*-
+ * Copyright (c) 2012 Konstantin Belousov k...@freebsd.org
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/types.h>
+#include <sys/time.h>
+#include <sys/vdso.h>
+#include <machine/cpufunc.h>
+
+static u_int
+__vdso_gettc_low(const struct vdso_timehands *th)
+{
+	uint32_t rv;
+
+	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
+	    : "=a" (rv) : "c" (th->th_x86_shift) : "edx");
+	return (rv);
+}
+
+u_int
+__vdso_gettc(const struct vdso_timehands *th)
+{
+
+	return (th->th_x86_shift > 0 ? __vdso_gettc_low(th) : rdtsc32());
+}

Modified: head/lib/libc/gen/aux.c
==
--- head/lib/libc/gen/aux.c Fri Jun 22 07:06:40 2012(r237433)
+++ head/lib/libc/gen/aux.c Fri Jun 22 07:13:30 2012(r237434)
@@ -66,6 +66,7 @@ __init_elf_aux_vector(void)
 static pthread_once_t aux_once = PTHREAD_ONCE_INIT;
 static int pagesize, osreldate, canary_len, ncpus, pagesizes_len;
 static char *canary, *pagesizes;
+static void *timekeep;
 
 static void
 init_aux(void)
@@ -101,6 +102,10 @@ init_aux(void)
case AT_NCPUS:
ncpus = aux->a_un.a_val;
break;
+
+   case AT_TIMEKEEP:
+   timekeep = aux->a_un.a_ptr;
+   break;
}
}
 }
@@ -163,6 +168,16 @@ _elf_aux_info(int aux, void *buf, int bu
} else
res = EINVAL;
break;

Re: svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys include sys

2012-06-22 Thread Marius Strobl
On Fri, Jun 22, 2012 at 07:13:31AM +, Konstantin Belousov wrote:
 Author: kib
 Date: Fri Jun 22 07:13:30 2012
 New Revision: 237434
 URL: http://svn.freebsd.org/changeset/base/237434
 
 Log:
   Use struct vdso_timehands data to implement fast gettimeofday(2) and
   clock_gettime(2) functions if supported. The speedup seen in
   microbenchmarks is in range 4x-7x depending on the hardware.
   
   Only amd64 and i386 architectures are supported. Libc uses rdtsc and
   kernel data to calculate current time, if enabled by kernel.

I don't know much about x86 CPUs but is my understanding correct
that TSCs are not synchronized in any way across CPUs, i.e.
reading it on different CPUs may result in time going backwards
etc., which is okay for this application though?

Marius



Re: svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys include sys

2012-06-22 Thread Konstantin Belousov
On Fri, Jun 22, 2012 at 09:34:56AM +0200, Marius Strobl wrote:
 On Fri, Jun 22, 2012 at 07:13:31AM +, Konstantin Belousov wrote:
  Author: kib
  Date: Fri Jun 22 07:13:30 2012
  New Revision: 237434
  URL: http://svn.freebsd.org/changeset/base/237434
  
  Log:
Use struct vdso_timehands data to implement fast gettimeofday(2) and
clock_gettime(2) functions if supported. The speedup seen in
microbenchmarks is in range 4x-7x depending on the hardware.

Only amd64 and i386 architectures are supported. Libc uses rdtsc and
kernel data to calculate current time, if enabled by kernel.
 
 I don't know much about x86 CPUs but is my understanding correct
 that TSCs are not synchronized in any way across CPUs, i.e.
 reading it on different CPUs may result in time going backwards
 etc., which is okay for this application though?

Generally speaking, TSC state among different CPUs after boot is not
synchronized, you are right.

The kernel has a somewhat doubtful test which verifies whether the after-boot
state of the TSC looks good. If the test fails, TSC is not enabled by
default as timecounter, and then usermode follows kernel policy and
falls back to slow syscall. So we err on the safe side.
I tested this on Core i7 2xxx, where the test (usually) passes.

The test we currently have fails for me at least on single-package
Nehalems, where the counter should be located in the uncore part. This
indicates some brokenness in the code, but I have not investigated the
cause.

The code can be developed which adjusts the TSC MSRs to be in sync. Or, the
rdtscp instruction can be used, which allows handling counter skew in
usermode in a race-free manner.
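For illustration, a race-free userland read via rdtscp might look like this (a sketch, assuming amd64 and that the OS has loaded a CPU identifier into the IA32_TSC_AUX MSR; the returned id tells the caller which per-CPU skew correction to apply):

```c
#include <stdint.h>

/* rdtscp reads the TSC and IA32_TSC_AUX in one instruction, so the
 * counter value and the CPU it came from cannot be torn apart by a
 * migration between two separate instructions. */
static inline uint64_t
rdtscp_read(uint32_t *cpu_id)
{
	uint32_t lo, hi, aux;

	__asm__ __volatile__("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
	*cpu_id = aux;
	return (((uint64_t)hi << 32) | lo);
}
```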

While you are there, do you have comments about the sparc64 TICK counter?
On SMP, the counter of the BSP is read via an IPI. Is that unavoidable?




Re: svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys include sys

2012-06-22 Thread David Chisnall
On 22 Jun 2012, at 08:34, Marius Strobl wrote:

 I don't know much about x86 CPUs but is my understanding correct
 that TSCs are not synchronized in any way across CPUs, i.e.
 reading it on different CPUs may result in time going backwards
 etc., which is okay for this application though?

As long as the initial value is set on every context switch, it only matters 
that the TSC is monotonic and increments at an approximately constant rate.  It 
is also possible to set the TSC value, but that's less useful in this context.

The one thing to be careful about is the fact that certain power saving states 
will affect the speed at which the TSC increments, and so it is important to 
update the ticks-per-second value whenever a core goes into a low power state.
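The bookkeeping David describes might be sketched as follows (hypothetical names; note the 64-bit multiplication can overflow for large deltas, which is one reason the committed code scales with a shift instead):

```c
#include <stdint.h>

/* Hypothetical sketch of tracking time across TSC rate changes: on
 * every frequency transition, fold the cycles elapsed at the old rate
 * into a base time, then continue counting at the new rate. */
struct tsc_clock {
	uint64_t base_ns;	/* time accumulated so far */
	uint64_t base_tsc;	/* TSC value at last update */
	uint64_t hz;		/* current TSC frequency */
};

static uint64_t
tsc_to_ns(const struct tsc_clock *c, uint64_t tsc)
{
	/* Caution: (tsc - base_tsc) * 1e9 overflows for deltas over a
	 * few seconds; real code would use shift-based scaling. */
	return (c->base_ns + (tsc - c->base_tsc) * 1000000000ULL / c->hz);
}

/* Call when the core changes P-state and the TSC rate changes. */
static void
tsc_set_freq(struct tsc_clock *c, uint64_t now_tsc, uint64_t new_hz)
{
	c->base_ns = tsc_to_ns(c, now_tsc);
	c->base_tsc = now_tsc;
	c->hz = new_hz;
}
```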

This is more or less the same approach used by Xen, so most of the issues have 
been ironed out: Oracle complained to CPU vendors about a few corner cases and, 
because Oracle customers tend to buy a lot of expensive Xeon and Opteron chips, 
they were fixed quite promptly.  

David



Re: svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys include sys

2012-06-22 Thread John Baldwin
On Friday, June 22, 2012 3:34:56 am Marius Strobl wrote:
 On Fri, Jun 22, 2012 at 07:13:31AM +, Konstantin Belousov wrote:
  Author: kib
  Date: Fri Jun 22 07:13:30 2012
  New Revision: 237434
  URL: http://svn.freebsd.org/changeset/base/237434
  
  Log:
Use struct vdso_timehands data to implement fast gettimeofday(2) and
clock_gettime(2) functions if supported. The speedup seen in
microbenchmarks is in range 4x-7x depending on the hardware.

Only amd64 and i386 architectures are supported. Libc uses rdtsc and
kernel data to calculate current time, if enabled by kernel.
 
 I don't know much about x86 CPUs but is my understanding correct
 that TSCs are not synchronized in any way across CPUs, i.e.
 reading it on different CPUs may result in time going backwards
 etc., which is okay for this application though?

Hmm, in practice I have found that on modern x86 CPUs (Penryn and later) the 
TSC is in fact synchronized across packages at work.  At least, when I measure 
skew across packages it appears to be consistent with the time it would take a 
write to propagate from one to the other.

-- 
John Baldwin