On Wed, 25 Jul 2012, Jung-uk Kim wrote:
On 2012-07-25 14:05:37 -0400, Konstantin Belousov wrote:
Since we have gettimeofday() in userland, the above Linux thread
is more relevant now, I guess.
Indeed. A syscall puts squillions of instructions between two reads,
maybe even a serializing instruction.
For some unrelated reasons, we already have an lfence;rdtsc sequence in
userland. Well, it is not exactly that sequence, since there are some
instructions in between, but the main point is that two consecutive
invocations of gettimeofday(2) (*) or clock_gettime(2) are separated by
an lfence on Intel CPUs, which guarantees that the counter cannot step
backward.
In fact, there is always a fully documented serializing instruction in
the syscall path, except perhaps in the FreeBSD-1 compat code on i386, at
least on Athlon64. i386 syscalls use int 0x80 (except in the FreeBSD-1
compat code, which uses lcalls), and the iret needed to return from it
is serializing, at least on Athlon64. amd64 syscalls use
syscall/sysret. sysret isn't serializing (like far returns), at least
on Athlon64, but in FreeBSD the syscall implementation executes at least
2 swapgs instructions (one on entry and one just before the sysret), and
swapgs is serializing, at least on Athlon64.
(*) gettimeofday() is no longer a syscall.
As I said, using the recommended mfence;rdtsc sequence on AMD CPUs would
require some work, but let's handle the kernel and userspace issues
separately.
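For illustration, a minimal sketch of the two fenced TSC reads being discussed, assuming x86/amd64 and GCC or Clang inline assembly. The helper names rdtsc_after_lfence() and rdtsc_after_mfence() are hypothetical, not FreeBSD functions. Intel documents lfence;rdtsc as the way to keep rdtsc from executing before earlier instructions have completed; mfence;rdtsc is the sequence recommended for AMD.

```c
#include <stdint.h>

/*
 * Hypothetical helper: lfence, then rdtsc, in one asm statement so the
 * compiler cannot separate them.  rdtsc returns the counter in EDX:EAX.
 */
static inline uint64_t
rdtsc_after_lfence(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("lfence\n\trdtsc" : "=a" (lo), "=d" (hi) : : "memory");
	return (((uint64_t)hi << 32) | lo);
}

/* Hypothetical helper: the same read with mfence, the AMD-recommended fence. */
static inline uint64_t
rdtsc_after_mfence(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("mfence\n\trdtsc" : "=a" (lo), "=d" (hi) : : "memory");
	return (((uint64_t)hi << 32) | lo);
}
```

A kernel or libc would pick one of the two at runtime based on the CPU vendor; that selection is left out of this sketch.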
Benchmarks for various methods on an AthlonXP: I started with a program
that loops making a few million clock_gettime() calls:
unchanged program: 1.15 seconds
add lfence: 1.16 seconds
add mfence: 1.15 seconds (yes, faster than lfence)
add atomic_cmpset: 1.20 seconds
add cpuid: 1.25 seconds
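A rough reconstruction of that kind of benchmark, under stated assumptions: the loop calls clock_gettime(CLOCK_MONOTONIC) n times and executes one extra instruction before each call. The clock ID, the placement of the barrier, and the names bench(), do_barrier() and enum barrier are guesses for illustration, not the actual program used above. It requires x86/amd64 and GCC or Clang.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* The extra instruction executed before each clock_gettime() call. */
enum barrier { B_NONE, B_LFENCE, B_MFENCE, B_CMPSET, B_CPUID };

static volatile int cmpset_word;	/* target of the locked cmpxchg */

static inline void
do_barrier(enum barrier b)
{
	switch (b) {
	case B_NONE:
		break;
	case B_LFENCE:
		__asm__ __volatile__("lfence" : : : "memory");
		break;
	case B_MFENCE:
		__asm__ __volatile__("mfence" : : : "memory");
		break;
	case B_CMPSET:
		/* A lock cmpxchg, similar to what atomic_cmpset_int() emits on x86. */
		(void)__sync_bool_compare_and_swap(&cmpset_word, 0, 0);
		break;
	case B_CPUID:
		/* cpuid leaf 0; its outputs are discarded via the clobber list. */
		__asm__ __volatile__("xorl %%eax, %%eax\n\tcpuid"
		    : : : "eax", "ebx", "ecx", "edx", "memory");
		break;
	}
}

/* Seconds taken by n clock_gettime() calls, each preceded by barrier b. */
static double
bench(enum barrier b, long n)
{
	struct timespec start, end, ts;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < n; i++) {
		do_barrier(b);
		clock_gettime(CLOCK_MONOTONIC, &ts);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	return ((double)(end.tv_sec - start.tv_sec) +
	    (double)(end.tv_nsec - start.tv_nsec) / 1e9);
}
```

Comparing, say, bench(B_NONE, 5000000) with bench(B_CPUID, 5000000) shows the relative cost of each barrier; absolute times depend on the CPU and on whether clock_gettime() is a real syscall or a userland implementation.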
Also, I really could not figure out what the patch from the thread you
referenced was trying to fix.
The patch was supposed to reduce the number of barriers, i.e., it was a
vsyscall optimization. Please note that I brought it up at the time not
because it fixed any problem but because we completely lack the
necessary serialization.
Was it really committed to Linux?
Yes, it was committed in a simpler form:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=057e6a8c660e95c3f4e7162e00e2fee1fc90c50d
This function was moved around from time to time and now it sits here:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob_plain;f=arch/x86/vdso/vclock_gettime.c
It still carries one barrier before rdtsc. Please see the comments.
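Roughly, the vread_tsc() pattern in that file (as of 2012) is: issue one CPU-dependent fence, read the TSC, and if the result is below the cycle_last value the kernel last published, return cycle_last instead so the clock cannot step backward. A minimal sketch of that pattern, not the Linux code itself: lfence stands in for the vendor-dependent fence, and read_tsc_clamped() with its cycle_last parameter are hypothetical names.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: fenced TSC read, clamped so it never returns a
 * value smaller than cycle_last (the last cycle count the kernel
 * published, passed in here as a plain parameter).
 */
static inline uint64_t
read_tsc_clamped(uint64_t cycle_last)
{
	uint32_t lo, hi;
	uint64_t now;

	/* One barrier before rdtsc, so it cannot run ahead of earlier loads. */
	__asm__ __volatile__("lfence\n\trdtsc" : "=a" (lo), "=d" (hi) : : "memory");
	now = ((uint64_t)hi << 32) | lo;

	/* If this CPU's TSC lags the published value, do not step back. */
	return (now >= cycle_last ? now : cycle_last);
}
```

The clamp only hides small backsteps relative to the published value; it does not make TSCs on different CPUs agree with each other.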
For safety, you probably need to use the slowest (cpuid) method. Linux
seems to be just using fences that are observed to work.
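A sketch of the cpuid method, assuming x86/amd64 and GCC or Clang: cpuid is documented as serializing by both Intel and AMD, so all earlier instructions complete before the rdtsc that follows it. The helper name rdtsc_after_cpuid() is hypothetical. cpuid is also much slower than a fence, especially under virtualization, where it typically traps to the hypervisor.

```c
#include <stdint.h>

/*
 * Hypothetical helper: cpuid (leaf 0) then rdtsc in a single asm
 * statement.  cpuid overwrites EAX, EBX, ECX and EDX; EAX and EDX are
 * then overwritten again by rdtsc, so only EBX and ECX are clobbered.
 */
static inline uint64_t
rdtsc_after_cpuid(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__(
	    "xorl %%eax, %%eax\n\t"
	    "cpuid\n\t"
	    "rdtsc"
	    : "=a" (lo), "=d" (hi)
	    :
	    : "ebx", "ecx", "memory");
	return (((uint64_t)hi << 32) | lo);
}
```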
Original Athlon64 manuals say this about rdtsc: "... not serializing...
even when bound by serializing instructions, the system environment at
the time the instruction is executed can cause additional cycles
[before it reaches EDX:EAX]".
With multiple CPUs, the hardware would have to be smarter and might need
more or different serialization instructions so that these additional
cycles don't break monotonicity across all CPUs.
I see an actual problem in that we allow timecounters to go backward,
and a solution that exactly follows the wording of both the Intel and AMD
documentation. This is a good step forward IMHO.
I agree with you here. Correctness outweighs performance, IMHO.
Use an i8254 then :-).
Bruce
_______________________________________________
svn-src-head@freebsd.org mailing list