On Wed, 25 Jul 2012, Andriy Gapon wrote:

on 25/07/2012 13:21 Konstantin Belousov said the following:
...
diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
index 085c339..229b351 100644
--- a/sys/x86/x86/tsc.c
+++ b/sys/x86/x86/tsc.c
@@ -594,6 +594,7 @@ static u_int
 tsc_get_timecount(struct timecounter *tc __unused)
 {

+       rmb();
        return (rdtsc32());
 }

This makes sense to me.  We probably want correctness over performance here.
[BTW, I originally thought that the change was here; brain malfunction]

And I liked the original change because it wasn't here :-).
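
For concreteness, here is roughly what the patched read amounts to as a
self-contained userland sketch (assuming GCC/Clang inline asm on amd64,
where rmb() is an lfence; this is not the kernel code itself):

#include <stdint.h>

static inline uint64_t
fenced_rdtsc(void)
{
	uint32_t lo, hi;

	/* Keep earlier loads from completing after the TSC is sampled. */
	__asm __volatile("lfence" ::: "memory");
	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return (((uint64_t)hi << 32) | lo);
}

Without the fence, an out-of-order core may execute rdtsc before earlier
loads have completed, which is one way successive timecounter reads can
appear to go backwards.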

@@ -602,8 +603,9 @@ tsc_get_timecount_low(struct timecounter *tc)
 {
        uint32_t rv;

+       rmb();
        __asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
-       : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
+           : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
        return (rv);
 }


It would be correct here too, but I am not sure if it would make any
difference given that some lower bits are discarded anyway.  Probably
depends on the exact CPU.

It is needed to pessimize this too. :-)
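
In case it helps to see what the shrd does in plain C, the low read is
equivalent to this sketch (the shift count comes from tc_priv in the
real code; the userland setting and names are mine):

#include <stdint.h>

static inline uint32_t
rdtsc_low(unsigned int shift)
{
	uint32_t lo, hi;

	__asm __volatile("lfence" ::: "memory");
	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	/* Drop 'shift' low bits, exactly as the shrd does. */
	return ((uint32_t)((((uint64_t)hi << 32) | lo) >> shift));
}

Any reordering smaller than 2^shift cycles is invisible after the
shift, which is the point above about the fence perhaps not mattering
here.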

As I have complained before, the loss of resolution from the shift is
easy to see by reading the time from userland, even with syscall overhead
taking 10-20 times longer than the read.  On core2 with TSC-low, a clock-
checking utility gives:

% min 481, max 12031, mean 530.589452, std 51.633626
% 1th: 550 (1296487 observations)
% 2th: 481 (448425 observations)
% 3th: 482 (142650 observations)
% 4th: 549 (61945 observations)
% 5th: 551 (47619 observations)

The numbers are differences in nanoseconds measured by clock_gettime().
The jump from 481 to 549 is 68.  From this I can tell that either the
clock frequency is 1.86 GHz and the shift divides the TSC by 128, or
the clock frequency is 3.72 GHz and it divides the TSC by 256.
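
The arithmetic, for anyone checking: the visible step is 2^shift / f,
and both candidate pairs give the same 68 ns:

#include <stdio.h>

int
main(void)
{
	/* step (ns) = 2^shift / f * 1e9 */
	printf("%.1f\n", 128 / 1.86e9 * 1e9);	/* 68.8 */
	printf("%.1f\n", 256 / 3.72e9 * 1e9);	/* 68.8 */
	return (0);
}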

On AthlonXP with TSC:

% min 273, max 29075, mean 274.412811, std 80.425963
% 1th: 273 (853962 observations)
% 2th: 274 (745606 observations)
% 3th: 275 (400212 observations)
% 4th: 276 (20 observations)
% 5th: 280 (10 observations)

Now the numbers cluster about the mean.  Although syscalls take much longer
than the loss of resolution with TSC-low, and even the core2 TSC takes
almost as long to read as the loss, it is still possible to see things
happening at the limits of the resolution (~0.5 nsec).
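
The utility is essentially this loop (a minimal sketch from memory, not
the exact program, and it only prints the min/max/mean line, not the
histogram):

#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec t0, t1;
	long d, min = 999999999, max = 0;
	double sum = 0.0;
	int i, n = 1000000;

	for (i = 0; i < n; i++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		d = (t1.tv_sec - t0.tv_sec) * 1000000000L +
		    (t1.tv_nsec - t0.tv_nsec);
		if (d < min)
			min = d;
		if (d > max)
			max = d;
		sum += d;
	}
	printf("min %ld, max %ld, mean %f\n", min, max, sum / n);
	return (0);
}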

And, oh hmm, I read the AMD Software Optimization Guide for AMD Family
10h Processors, and they suggest using cpuid (with a note that it may be
intercepted in virtualized environments) or _mfence_ in the discussed
role (Appendix F of the document).
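
For reference, the two variants the guide suggests come out like this
(my sketch, not code from the guide):

#include <stdint.h>

static inline uint64_t
rdtsc_cpuid(void)
{
	uint32_t lo, hi;

	/* cpuid serializes fully, but may be intercepted by a hypervisor. */
	__asm __volatile("xorl %%eax, %%eax; cpuid"
	    ::: "eax", "ebx", "ecx", "edx", "memory");
	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return (((uint64_t)hi << 32) | lo);
}

static inline uint64_t
rdtsc_mfence(void)
{
	uint32_t lo, hi;

	/* mfence orders prior loads and stores before the TSC read. */
	__asm __volatile("mfence" ::: "memory");
	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return (((uint64_t)hi << 32) | lo);
}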
Googling for 'rdtsc mfence lfence' yields some interesting results.

The second hit was for the shrd pessimization/loss of resolution and
a memory access hack in lkml in 2011.  I now seem to remember jkim
mentioning the memory access hack.  rmb() on i386 has a related memory
access hack, but now with a lock prefix that defeats the point of the
2011 hack (it wanted to save 5 nsec by removing fences).  rmb() on
amd64 uses lfence.
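
For anyone not looking at the headers, the two flavours are roughly
(from memory; see <machine/atomic.h> for the real macros):

/* i386: the locked no-op store, i.e. the memory access hack */
#define	rmb()	__asm __volatile("lock; addl $0,(%%esp)" : : : "memory")

/* amd64: a real load fence */
#define	rmb()	__asm __volatile("lfence" : : : "memory")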

Some of the other hits are a bit old.  The 8th one was by me in the
thread about kib@ implementing gettimeofday() in userland.

Bruce