On Sun, 1 Jun 2008, Sean Chittenden wrote:
> I wrote a small micro-benchmark utility[1] to test various time syscalls and the results were a bit surprising to me. The results were from a UP machine and I believe that the difference between gettimeofday(2) and clock_gettime(CLOCK_REALTIME_FAST) would've been bigger on an SMP system and performance would've degraded further with each additional core.
I wouldn't expect SMP to make much difference between CLOCK_REALTIME and CLOCK_REALTIME_FAST. The only difference is that the former calls nanotime() where the latter calls getnanotime(). nanotime() always does more, but it doesn't have any extra SMP overheads in most cases (in rare cases like i386 using the i8254 timecounter, it needs to lock accesses to the timecounter hardware). gettimeofday() always does more than CLOCK_REALTIME, but again no more for SMP.
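The difference between the two clock ids can be observed from userland. The following is a minimal sketch, not part of the original benchmark: `realtime_fast_lag_ns()` and `FAST_CLOCK_ID` are hypothetical names, and CLOCK_REALTIME_FAST is assumed to be visible in <time.h> (it is a FreeBSD extension; elsewhere the sketch falls back to CLOCK_REALTIME so that it still compiles).

```c
#define _DEFAULT_SOURCE 1	/* glibc/musl: keep clock_gettime() visible under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* CLOCK_REALTIME_FAST is a FreeBSD extension.  Where it is missing, this
 * sketch falls back to CLOCK_REALTIME (an assumption made only so that the
 * code builds everywhere; the lag is then just the cost of one call). */
#ifdef CLOCK_REALTIME_FAST
#define FAST_CLOCK_ID CLOCK_REALTIME_FAST
#else
#define FAST_CLOCK_ID CLOCK_REALTIME
#endif

/* Convert a timespec to nanoseconds since the Epoch. */
int64_t
ts_to_ns(const struct timespec *ts)
{
	return ((int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec);
}

/*
 * Hypothetical helper: read the fast clock, then the precise clock, and
 * store (precise - fast) in nanoseconds in *lag_ns.
 * Returns 0 on success, -1 if either clock_gettime() call fails.
 */
int
realtime_fast_lag_ns(int64_t *lag_ns)
{
	struct timespec fast, precise;

	assert(lag_ns != NULL);
	if (clock_gettime(FAST_CLOCK_ID, &fast) != 0)
		return (-1);
	if (clock_gettime(CLOCK_REALTIME, &precise) != 0)
		return (-1);
	*lag_ns = ts_to_ns(&precise) - ts_to_ns(&fast);
	return (0);
}
```

Calling `realtime_fast_lag_ns()` repeatedly and recording the largest value gives a rough idea of how far the getnanotime()-based clock trails the nanotime()-based one on a given machine.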
> clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for most authors (CLOCK_REALTIME_FAST is supposed to be precise to +/- 10ms of CLOCK_REALTIME's value[2]). In fact, I'd assume that CLOCK_REALTIME_FAST is just as accurate as Linux's gettimeofday(2) (a statement I can't back up, but believe is likely to be correct) and therefore there isn't much harm (if any) in seeing clock_gettime(2) + CLOCK_REALTIME_FAST receive more widespread use vs. gettimeofday(2). FYI. -sc
The existence of most of CLOCK_* is a bug. I wouldn't use CLOCK_REALTIME_FAST for anything (if only because it doesn't exist in most kernels that I run). I switched from using gettimeofday() to CLOCK_REALTIME many years ago when syscalls started taking less than 1 usec, and I still occasionally have problems from this when running old kernels, because old i386 kernels don't support CLOCK_REALTIME and old amd64 kernels have a broken CLOCK_REALTIME in 32-bit mode.
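One way to cope with kernels that lack CLOCK_REALTIME is a wrapper that falls back to gettimeofday(). This is a minimal sketch with a hypothetical name, `portable_realtime()`. It assumes that an unsupported clock id shows up as an error return from clock_gettime() rather than as a signal or a wrong value, so it would not help with the broken 32-bit mode case mentioned above.

```c
#define _DEFAULT_SOURCE 1	/* glibc/musl: keep clock_gettime() visible under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/*
 * Hypothetical wrapper: fill *ts with the current wall-clock time.
 * Tries clock_gettime(CLOCK_REALTIME) first and falls back to
 * gettimeofday() if the kernel rejects the call.
 * Returns 0 on success, -1 if both calls fail.
 */
int
portable_realtime(struct timespec *ts)
{
	struct timeval tv;

	assert(ts != NULL);
	if (clock_gettime(CLOCK_REALTIME, ts) == 0)
		return (0);
	if (gettimeofday(&tv, NULL) != 0)
		return (-1);
	/* Widen microseconds to nanoseconds; the low digits are zero. */
	ts->tv_sec = tv.tv_sec;
	ts->tv_nsec = (long)tv.tv_usec * 1000;
	return (0);
}
```

The fallback path costs an extra failed syscall on every call. A real program would probably remember the first failure and skip straight to gettimeofday() afterwards.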
> PS Is there a reason that time(3) can't be implemented in terms of clock_gettime(CLOCK_SECOND)? 10ms seems precise enough compared to time_t's whole second resolution.
I might use CLOCK_SECOND (unlike CLOCK_REALTIME_FAST), since the low accuracy timers provided by the get*time() family are accurate enough to give the time in seconds. Unfortunately, they are still broken -- they are all incoherent relative to nanotime() and some are incoherent relative to each other. CLOCK_SECOND can lag the time in seconds given by nanotime() by up to tc_tick/HZ seconds. This is because CLOCK_SECOND returns the time in seconds at the last tc_windup(), so it misses seeing rollovers of the second in the interval between the rollover and the next tc_windup(). nanotime() doesn't miss these rollovers, so the two give incoherent times, with nanotime()/CLOCK_REALTIME being correct and time_second/CLOCK_SECOND broken.

vfs_timestamp() already defaults to using time_second, so it gives times incoherent with time() since the latter still uses gettimeofday(). Some file system test programs see this incoherency and I run them with vfs.timestamp.precision=3 (nanotime()) to avoid it. File systems were micro-optimized to use time_second (now not so micro-optimized: they use vfs_timestamp(), which defaults to using time_second), but micro-pessimizing them to use nanotime() makes no significant difference. This is because most file system timestamp updates are cached (delayed until the next sync or disk write), and in cases where the updates are written to disk the time to read the clock is in the noise relative to the time for the disk write.
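The rollover lag described above can be looked for from userland. The following is a minimal sketch, not taken from any of the test programs mentioned: `coarse_seconds()` and `count_lagging_seconds()` are hypothetical names, and CLOCK_SECOND is assumed to be visible in <time.h> (a FreeBSD extension; elsewhere the sketch uses time(3) as the coarse source).

```c
#define _DEFAULT_SOURCE 1	/* glibc/musl: keep clock_gettime() visible under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper: fetch the current time in whole seconds from the
 * coarse source.  Returns 0 on success, -1 on failure.
 */
int
coarse_seconds(time_t *sec)
{
#ifdef CLOCK_SECOND
	struct timespec ts;

	if (clock_gettime(CLOCK_SECOND, &ts) != 0)
		return (-1);
	*sec = ts.tv_sec;
#else
	/* Assumption: no CLOCK_SECOND here, so use time(3) instead. */
	*sec = time(NULL);
	if (*sec == (time_t)-1)
		return (-1);
#endif
	return (0);
}

/*
 * Read the precise clock and then the coarse clock, `iterations` times.
 * The coarse read happens later, so a coarse value smaller than the
 * precise tv_sec means the coarse clock has not yet seen a rollover of
 * the second.  Returns the number of such reads, or -1 on error.
 */
long
count_lagging_seconds(long iterations)
{
	struct timespec precise;
	time_t coarse;
	long i, lagging = 0;

	assert(iterations > 0);
	for (i = 0; i < iterations; i++) {
		if (clock_gettime(CLOCK_REALTIME, &precise) != 0 ||
		    coarse_seconds(&coarse) != 0)
			return (-1);
		if (coarse < precise.tv_sec)
			lagging++;
	}
	return (lagging);
}
```

Only reads that fall between a rollover of the second and the next tc_windup() can be counted, so the loop has to run across at least one rollover before a nonzero result is possible.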
> % ./bench_time 9079882 | sort -rnk1
> Timing micro-benchmark.  9079882 syscall iterations.
> Avg. us/call    Elapsed         Name
> 9.322484        84.647053       gettimeofday(2)
> 8.955324        81.313291       time(3)
> 8.648315        78.525684       clock_gettime(2/CLOCK_REALTIME)
> 8.598495        78.073325       clock_gettime(2/CLOCK_MONOTONIC)
> 0.674194        6.121600        clock_gettime(2/CLOCK_PROF)
> 0.648083        5.884515        clock_gettime(2/CLOCK_VIRTUAL)
> 0.330556        3.001412        clock_gettime(2/CLOCK_REALTIME_FAST)
> 0.306514        2.783111        clock_gettime(2/CLOCK_SECOND)
> 0.262788        2.386085        clock_gettime(2/CLOCK_MONOTONIC_FAST)
These are very slow. Are they on a 486? :-) I get about 262 ns for CLOCK_REALTIME using the TSC timecounter on all ~2GHz UP systems. The syscall overhead is about 200 nsec (170 nsec for a simpler syscall and maybe 30 nsec extra for copyin/out for clock_gettime()) and reading the TSC timecounter adds another 60 nsec, including a whole 6 nsec for the hardware part of the read (perhaps more like 30 nsec than 60 for the whole read). The TSC doesn't work on all machines (never for SMP), but this will hopefully change. (Phenom is supposed to have TSCs that are coherent across CPUs, and rdtsc has slowed down from 12 cycles to 40+ to implement this :-(. Core2 already has a 40+ cycles rdtsc, but AFAIK it doesn't have coherent TSCs.) Other timecounters are much slower than the TSC, but I haven't seen one take 8000 nsec since 486 days.

Some of my benchmark results:

2.205GHz A64 in 32-bit mode, VIA motherboard:
%%%
2008/01/05 (TSC) bde-current, -O2 -mcpu=athlon-xp
min 240, max 77658, mean 242.171787, std 65.655259
2007/11/23 (TSC) bde-current
min 247, max 11890, mean 247.857786, std 62.559317
2007/05/19 (TSC) plain -current-noacpi
min 262, max 286965, mean 263.941187, std 41.801400
2007/05/19 (TSC) plain -current-acpi
min 261, max 68926, mean 279.848650, std 40.477440
2007/05/19 (ACPI-fast timecounter) plain -current-acpi
min 558, max 285494, mean 827.597038, std 78.322301
2007/05/19 (i8254) plain -current-acpi
min 3352, max 288288, mean 4182.774148, std 257.977752
%%%

These times are in nsec and are for CLOCK_REALTIME. This system has fairly fast ACPI and i8254 timecounters. 1500-800 nsec is more typical for ACPI-fast, and 4000-5000 is more typical for i8254. ACPI-fast should be named ACPI-not-very-slow. ACPI-safe is very slow, perhaps slower than i8254. i8254 could be made about twice as fast if anyone cared.
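Figures in the min/max/mean form shown above can be collected with a loop of the following shape. This is a minimal sketch and not the program that produced those results: `struct call_stats` and `bench_clock_realtime()` are hypothetical names. The sketch assumes that the gap between two consecutive CLOCK_REALTIME readings approximates the cost of one call, loop overhead included. It reports the variance rather than std, to avoid linking against libm.

```c
#define _DEFAULT_SOURCE 1	/* glibc/musl: keep clock_gettime() visible under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical result record; all times are in nanoseconds. */
struct call_stats {
	int64_t	min_ns;
	int64_t	max_ns;
	double	mean_ns;
	double	variance;	/* ns^2; std is its square root */
};

/*
 * Call clock_gettime(CLOCK_REALTIME) n times and summarize the gaps
 * between consecutive readings.  Returns 0 on success, -1 on failure.
 */
int
bench_clock_realtime(long n, struct call_stats *st)
{
	struct timespec prev, cur;
	double delta, mean = 0.0, m2 = 0.0;
	int64_t d;
	long i;

	assert(n > 0 && st != NULL);
	if (clock_gettime(CLOCK_REALTIME, &prev) != 0)
		return (-1);
	for (i = 1; i <= n; i++) {
		if (clock_gettime(CLOCK_REALTIME, &cur) != 0)
			return (-1);
		d = (int64_t)(cur.tv_sec - prev.tv_sec) * 1000000000LL +
		    (cur.tv_nsec - prev.tv_nsec);
		if (i == 1 || d < st->min_ns)
			st->min_ns = d;
		if (i == 1 || d > st->max_ns)
			st->max_ns = d;
		/* Welford's online update of the mean and of the sum of
		 * squared deviations from it. */
		delta = (double)d - mean;
		mean += delta / (double)i;
		m2 += delta * ((double)d - mean);
		prev = cur;
	}
	st->mean_ns = mean;
	st->variance = m2 / (double)n;
	return (0);
}
```

With the numbers quoted above, a result near 170 + 30 + 60 = 260 nsec would match the estimate for a ~2GHz machine using the TSC timecounter. Large max values are expected, since any interrupt or context switch between two reads is charged to that sample.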
133MHz P1:
%%%
1996/07/12:          min 3, max 472, mean 3.320346, std 0.694846
1998/02/21 pre-phk:  min 3, max 595, mean 3.443382, std 0.767383
1998/02/21 post-phk: min 4, max 99, mean 4.614527, std 0.710407
1999/12/04:          min 4, max 120, mean 4.630231, std 0.777733
2000/09/29:          min 5, max 203, mean 5.376130, std 1.912127
2001/05/19:          min 6, max 1715, mean 6.783378, std 2.015211
2001/09/02:          min 5, max 482, mean 5.474384, std 2.683939
%%%

These times are for gettimeofday(). Note that these are now in usec. The timecounter is always the TSC (post-phk) or uses the TSC more directly (pre-phk). These times serve mainly to document time bloat due to timecounters and SMPng. The P1 has limited caching and suffers more from longer code paths than new CPUs.

66MHz 486DX2:
%%%
1995/11/03: min 13, max 171, mean 14.286634, std 1.836667
2000/11/15: min 20, max 542, mean 21.843003, std 8.003137
%%%

Here the timecounter is always the i8254. These times serve mainly as a reminder of how slow old machines were. The i8254 timecounter hardware didn't take any longer back then (it was probably faster, since old machines didn't have PCI bridges, and they had tunable ISA wait states which I tuned), but a simple syscall took 7.2 usec and gettimeofday() took much longer. The bloat between 1995 and 2000 was relatively similar to that on the P1 system.

Other implementation bugs (all in clock_getres()):
- all of the clock ids that use getnanotime() claim a resolution of 1 nsec, but that is bogus. The actual resolution is more like tc_tick/HZ. The extra resolution in a struct timespec is only used to return garbage related to the incoherency of the clocks. (If it could be arranged that tc_windup() always ran on a tc_tick/HZ boundary, then the clocks would be coherent and the times would always be a multiple of tc_tick/HZ, with no garbage in low bits.)
- CLOCK_VIRTUAL and CLOCK_PROF claim a resolution of 1/hz, but that is bogus. The actual resolution is more like 1/stathz, or perhaps 1 microsecond.
hz is irrelevant here since statclock ticks are used. statclock ticks only have a resolution of 1/stathz, but if 1 nsec is correct for CLOCK_REALTIME_FAST, then 1 usec is correct here since calcru() calculates the time to a resolution of 1 usec; it is just very inaccurate at that resolution.

"Resolution" is a poor term for the functionality needed here. I think a hint about the accuracy is more important. In simple implementations using interrupts and ticks, the accuracy would be about the same as the resolution, but FreeBSD is more complicated.

Bruce
_______________________________________________
freebsd-performance@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-performance
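The clock_getres() complaints in the message above can be checked by comparing what a clock claims with how it actually steps. The following is a minimal sketch with hypothetical helper names (`claimed_resolution_ns()`, `observed_step_ns()`). It assumes that the smallest nonzero difference between consecutive readings is a fair proxy for the real granularity. That holds for tick-driven clocks, while for a precise clock the smallest step mostly reflects the cost of the call itself.

```c
#define _DEFAULT_SOURCE 1	/* glibc/musl: keep clock_gettime() visible under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Convert a timespec to nanoseconds. */
static long long
ts_ns(const struct timespec *ts)
{
	return ((long long)ts->tv_sec * 1000000000LL + ts->tv_nsec);
}

/*
 * Store the resolution that clock_getres() claims for `id`, in
 * nanoseconds.  Returns 0 on success, -1 on failure.
 */
int
claimed_resolution_ns(clockid_t id, long long *res_ns)
{
	struct timespec res;

	assert(res_ns != NULL);
	if (clock_getres(id, &res) != 0)
		return (-1);
	*res_ns = ts_ns(&res);
	return (0);
}

/*
 * Read clock `id` `reads` times and store the smallest nonzero forward
 * step seen between consecutive readings, in nanoseconds.  Stores 0 if
 * the clock never advanced.  Returns 0 on success, -1 on failure.
 */
int
observed_step_ns(clockid_t id, long reads, long long *step_ns)
{
	struct timespec prev, cur;
	long long d, best = 0;
	long i;

	assert(reads > 0 && step_ns != NULL);
	if (clock_gettime(id, &prev) != 0)
		return (-1);
	for (i = 0; i < reads; i++) {
		if (clock_gettime(id, &cur) != 0)
			return (-1);
		d = ts_ns(&cur) - ts_ns(&prev);
		if (d > 0 && (best == 0 || d < best))
			best = d;
		prev = cur;
	}
	*step_ns = best;
	return (0);
}
```

On FreeBSD, passing CLOCK_REALTIME_FAST to both helpers would be expected to show a claimed 1 nsec next to an observed step closer to tc_tick/HZ, if the description above still applies to the kernel being tested. The number of reads has to be large enough to span at least one update of the clock.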