Re: Micro-benchmark for various time syscalls...

Sean Chittenden Mon, 02 Jun 2008 12:06:00 -0700

I wrote a small micro-benchmark utility[1] to test various timesyscalls and the results were a bit surprising to me. The resultswere from a UP machine and I believe that the difference betweengettimeofday(2) and clock_gettime(CLOCK_REALTIME_FAST) would'vebeen bigger on an SMP system and performance would've degradedfurther with each additional core.
I wouldn't expect SMP to make much difference between CLOCK_REALTIMEand
CLOCK_REALTIME_FAST.  The only difference is that the former calls
nanotime() where the latter calls getnanotime(). nanotime() alwaysdoesmore, but it doesn't have any extra SMP overheads in most cases (inrarecases like i386 using the i8254 timecounter, it needs to lockaccesses to
the timecounter hardware).  gettimeofday() always does more than
CLOCK_REALTIME, but again no more for SMP.

You may be right, I can only speculate. Going off of phk@'srhetorical questions regarding gettimeofday(2) working across cores/threads, I assumed there would be some kind of synchronization.


http://lists.freebsd.org/mailman/htdig/freebsd-current/2005-October/057280.html

clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function formost authors (CLOCK_REALTIME_FAST is supposed to be precise to +/-10ms of CLOCK_REALTIME's value[2]). In fact, I'd assume thatCLOCK_REALTIME_FAST is just as accurate as Linux's gettimeofday(2)(a statement I can't back up, but believe is likely to be correct)and therefore there isn't much harm (if any) in seeingclock_gettime(2) + CLOCK_REALTIME_FAST receive more widespread usevs. gettimeofday(2). FYI. -sc
The existence of most of CLOCK_* is a bug. I wouldn't useCLOCK_REALTIME_FAST
for anything (if only because it doesn't exist in most kernels that I
run.

I think that's debatable, actually. I modified my little micro-benchmark program to test the realtime values returned from eachexecution and found that CLOCK_REALTIME_FAST likely updates itselfsufficiently frequently for most applications (not all, but most). Mytest ensures that time doesn't go backwards and tally's the number oftimes that the values are identical. It'd be nice ofCLOCK_REALTIME_FAST incremented by a small and reasonable fudge factorevery time it's invoked that way the values aren't identical.

On my machine, I can make 100K gettimeofday(2) calls compared to 3MCLOCK_REALTIME_FAST calls, which is a significantly large delta whenyou're aiming for software that's handling around ~40-50Kpps and wantto include time information periodically (see above comment about afudge factor being included after every call *grin* ).


http://sean.chittenden.org/pubfiles/freebsd/bench_clock_realtime.c

% ./bench_clock_realtime 9079882 | sort -rnk1
clock realtime micro-benchmark.  9079882 syscall iterations.
Avg. us/call    Elapsed         Name
9.317078        84.597968       gettimeofday(2)
8.960372        81.359120       time(3)
8.776467        79.689287       clock_gettime(2/CLOCK_REALTIME)
0.332357        3.017763        clock_gettime(2/CLOCK_REALTIME_FAST)
0.311705        2.830246        clock_gettime(2/CLOCK_SECOND)
Value from time(3): 1212427374
Last value from gettimeofday(2): 1212427293.590511      Equal: 0

Last value from clock_gettime(2/CLOCK_SECOND): 1212427460.000000000Equal: 9079878Last value from clock_gettime(2/CLOCK_REALTIME_FAST):1212427457.656410126 Equal: 9078198Last value from clock_gettime(2/CLOCK_REALTIME): 1212427454.639076390Equal: 0


% irb
>> tot = 9079882
=> 9079882
>> eq = 9078198
=> 9078198
>> tot - eq
=> 1684
>> time = 3.017763
=> 3.017763
>> (tot - eq) / time
=> 558.029242190324
>> tot / time
=> 3008812.15655437  # number of CLOCK_REALTIME_FAST calls per second
>> tot / 84.597968
=> 107329.788346689  # number of gettimeofday(2) calls per second

I switched from using gettimeofday() to CLOCK_REALTIME many years

ago when syscalls started taking less than 1 usec and stilloccasionally

have problems from this running old kernels, because old i386 kernels
don't support CLOCK_REALTIME and old amd64 kernels have a broken
CLOCK_REALTIME in 32-bit mode).

Entirely possible that's why things are more expensive on my testmachine.


% sysctl hw.model
hw.model: AMD Athlon(tm) 64 Processor 3500+
% uname -a

FreeBSD dev2.office.chittenden.org 7.0-RELEASE FreeBSD 7.0-RELEASE #0:Sun Feb 24 10:35:36 UTC 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC amd64

PS Is there a reason that time(3) can't be implemented in terms ofclock_gettime(CLOCK_SECOND)? 10ms seems precise enough compared totime_t's whole second resolution.
I might use CLOCK_SECOND (unlike CLOCK_REALTIME_FAST), since the low
accuracy timers provided by the get*time() family are accurate enough
to give the time in seconds.  Unfortunately, they are still broken --
they are all incoherent relative to nanotime() and some are incoherent
relative to each other. CLOCK_SECOND can lag the time in secondsgiven
by up to tc_tick/HZ seconds.  This is because CLOCK_SECOND returns the
time in seconds at the last tc_windup(), so it misses seeing rollovers
of the second in the interval between the rollover and the next
tc_windup(), while nanotime() doesn't miss seeing these rollovers so
it gives incoherent times, with nanotime()/CLOCK_REALTIME beingcorrect
and time_second/CLOCK_SECOND broken.

Interesting. Incoherent, but accurate enough? We're talking about a<10ms window of incoherency, right?

% ./bench_time 9079882 | sort -rnk1
Timing micro-benchmark.  9079882 syscall iterations.
Avg. us/call    Elapsed     Name
9.322484    84.647053       gettimeofday(2)
8.955324    81.313291       time(3)
8.648315    78.525684       clock_gettime(2/CLOCK_REALTIME)
8.598495    78.073325       clock_gettime(2/CLOCK_MONOTONIC)
0.674194    6.121600        clock_gettime(2/CLOCK_PROF)
0.648083    5.884515        clock_gettime(2/CLOCK_VIRTUAL)
0.330556    3.001412        clock_gettime(2/CLOCK_REALTIME_FAST)
0.306514    2.783111        clock_gettime(2/CLOCK_SECOND)
0.262788    2.386085        clock_gettime(2/CLOCK_MONOTONIC_FAST)


These are very slow.  Are they on a 486? :-)  I get about 262 ns for
CLOCK_REALTIME using the TSC timecounter on all ~2GHz UP systems.
The syscall overhead is about 200 nsec (170 nsec for a simpler syscall

and maybe 30 nsec extra for copyin/out for clock_gettime()) andreading

the TSC timecounter adds another 60 nsec, including a whole 6 nsec for

the hardware part of the read (perhaps more like 30 nsec than 60 forthe

whoe read).  The TSC doesn't work on all machines (never for SMP), but
this will hopefully change.  (Phenom is supposed to have TSCs that are
coherent across CPUs, and rdtsc has slowed down from 12 cycles to 40+

to implement this :-(. Core2 already has a 40+ cycles rdtsc, butAFAIKit doesn't have coherent TSCs.) Other timecounters are much slowerthan

the TSC, but I haven't seen one take 8000 nsec since 486 days.


*shrug*  elapsed / number of calls.  Not doing anything fancy here.

Some of my benchmark results:


Can I run this same test/see how this was written?

This system has a fairly fast ACPI and i8254 timecounters.  1500-800
nsec is more typical for ACPI-fast, and 4000-5000 is more typical
for i8254.  ACPI-fast should be named ACPI-not-very-slow.  ACPI-safe
is very slow, perhaps slower than i8254.  i8254 could be made about
twice as fast if anyone cared.


Hrm.

% sysctl -a | grep -i acpi_timer
machdep.acpi_timer_freq: 3579545
dev.acpi_timer.0.%desc: 24-bit timer at 3.579545MHz
dev.acpi_timer.0.%driver: acpi_timer
dev.acpi_timer.0.%location: unknown
dev.acpi_timer.0.%pnpinfo: unknown
dev.acpi_timer.0.%parent: acpi0
% sysctl -a | grep -i tsc

kern.timecounter.choice: TSC(800) ACPI-safe(850) i8254(0)dummy(-1000000)

kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 2749242907
kern.timecounter.tc.TSC.frequency: 2222000000
kern.timecounter.tc.TSC.quality: 800
kern.timecounter.smp_tsc: 0
machdep.tsc_freq: 2222000000

Other implementation bugs (all in clock_getres()):
- all of the clock ids that use getnanotime() claim a resolution of 1

nsec, but that us bogus. The actual resolution is more liketc_tick/HZ.

 The extra resolution in a struct timespec is only used to return
 garbage related to the incoherency of the clocks.  (If it could be
 arranged that tc_windup() always ran on a tc_tick/HZ boundary, then
 the clocks would be coherent and the times would always be a multiple
 of tc_tick/HZ, with no garbage in low bits.)
- CLOCK_VIRTUAL and CLOCK_PROF claim a resolution of 1/hz, but that is
 bogus.  The actual resolution is more like 1/stathz, or perhaps 1
 microsecond.  hz is irrelevant here since statclock ticks are used.
 statclock ticks only have a resolution of 1/stathz, but if 1 nsec is
 correct for CLOCK_REALTIME_FAST, then 1 usec is correct here since
 caclru() calculates the time to a resolution of 1 usec; it is just
 very inaccurate at that resolution.

"Resolution" is a poor term for the functionality needed here. Ithinka hint about the accuracy is more important. In simpleimplementationsusing interrupts and ticks, the accuracy would be about the the sameas

the resolution, but FreeBSD is more complicated.

Is there any reason that the garbage resolution can't be zero'ed outto indicate confidence of the kernel in the precision of theinformation? -sc


--
Sean Chittenden
[EMAIL PROTECTED]
http://sean.chittenden.org/

_______________________________________________
freebsd-performance@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-performance
To unsubscribe, send any mail to "[EMAIL PROTECTED]"

Re: Micro-benchmark for various time syscalls...

Reply via email to