On Sat, 22 Dec 2007, Minskey Guo wrote:


On 2007-12-22, at 9:17 PM, 陶捷 TaoJie wrote:

Hi Bart,

I noticed this email just now :(
Thank you for your advice.

Are there any barrier instructions on x86/x64 that could force rdtsc to behave synchronously?

iret, xchg, cpuid, sfence, lock, etc. But cpuid overwrites eax (among other registers), and sfence is not available on every Pentium (PIII ???).

We had this discussion with AMD a while ago. If I remember correctly (Bart may well step in here), the only thing that's guaranteed in all situations and fully vendor/chip-rev independent is CPUID, which is sort of a barrier sledgehammer. It takes thousands of cycles.

Wondering: what _exactly_ are you planning to do? Instruction-based sampling can be done via the CPU performance monitoring counters; the old "sample time, do something, sample time again" approach is sort of superseded by those. High-level access in Solaris would be via the cpc(7d) driver.

FrankH.




My concern is
---------------
rdtsc
[barrier]
AA
BB
CC
....
XX
[barrier]
rdtsc
----------------
(2nd rdtsc - 1st rdtsc) should be the time cost of these inner instructions/functions.
And it should be equal to or greater than the actual cost.

Are there any barrier instructions that force the 1st rdtsc to execute before AA and the 2nd rdtsc to execute after XX?
Would a run of consecutive nops do, or some other instruction?

sfence
rdtsc
xxxxx

sfence
rdtsc


Maybe cpuid is usable if eax can be clobbered, or you can save eax somewhere before the cpuid.
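A minimal sketch of the cpuid-then-rdtsc idea, assuming a GCC-compatible compiler and its extended asm syntax (whether Sun cc accepts it is not confirmed in this thread). The helper names tsc_after_cpuid and bracket_overhead are made up for illustration. Declaring eax/ebx/ecx/edx as outputs or clobbers lets the compiler do the "save it somewhere" part itself. Note that cpuid only keeps earlier instructions from running past the rdtsc; it does not stop later instructions from starting early.

```c
#include <stdint.h>

/* Hypothetical helper: execute cpuid (serializing), then read the TSC. */
static inline uint64_t tsc_after_cpuid(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__(
        "cpuid\n\t"   /* leaf 0; overwrites eax, ebx, ecx, edx */
        "rdtsc"       /* then edx:eax = time-stamp counter     */
        : "=a"(lo), "=d"(hi)
        : "0"(0u)                    /* eax = 0 on entry to cpuid */
        : "ebx", "ecx", "memory");   /* tell the compiler what is clobbered */
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;   /* no TSC on this target; placeholder so the sketch builds */
#endif
}

/* Ticks consumed by the bracketing itself (two back-to-back reads).
 * Subtracting this from a measurement is an assumption of this sketch,
 * not something prescribed in the thread. */
static uint64_t bracket_overhead(void)
{
    uint64_t t1 = tsc_after_cpuid();
    uint64_t t2 = tsc_after_cpuid();
    return t2 - t1;
}
```

The measured region would sit between two tsc_after_cpuid() calls; since cpuid is expensive, the overhead figure matters most for short regions.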

-minskey




Btw, you mentioned you had the experience of performance measuring.
Are there any recommended articles about performance measurement on the x86/x64 platform?
Are there any recommended articles about measuring instruction cost?
For example, some books say nop costs 1 cycle on a Pentium and 3 cycles on a 386. How are such precise costs obtained?

Thank you :)

Another question:
On an SMP or multi-core (or, say, CMT) platform, each processor/core has its own tsc register on chip, doesn't it? Then how can gethrtime() guarantee system-wide time? I mean, if a program runs on CPU1 for a while and then on CPU2, would the difference between two gethrtime() calls be the precise elapsed time? Does gethrtime() read ticks from the CPU's tsc register or from a system-wide timer (e.g. the 8253 chip on x86)?

I'm not familiar with timers... sorry for these stupid questions :-(


Kind Regards,
TJ


2007/10/30, Bart Smaalders <[EMAIL PROTECTED]>:
陶捷 TaoJie wrote:
Dear all:

My platform is:
Intel Pentium 4 CPU
OpenSolaris B74, built by myself
Sun Studio 11

In my program, I use asm("rdtsc") to measure the time cost between two rdtsc.
for example:
int some_func(...)
{
    long long time1, time2;
    int i = 3198, j = 324;

    asm volatile("rdtsc" : "=A" (time1));

    ....
    i = i + j * i / j;

    asm volatile("rdtsc" : "=A" (time2));

    return i;
}

int main(...)
{
    ....
    some_func();
    ....
}

When I compile this program using "cc example.c" and disassemble a.out
with dis, the program logic is ok. The output is
some_func()
    main+0x36:              0f 31              rdtsc
    main+0x38:              89 45 f4           movl   %eax,-0xc(%ebp)
    main+0x3b:              89 55 f8           movl   %edx,-0x8(%ebp)
    main+0x3e:              8b 45 e8           movl   -0x18(%ebp),%eax
    main+0x41:              03 45 e4           addl   -0x1c(%ebp),%eax
    main+0x44:              89 45 e8           movl   %eax,-0x18(%ebp)
    main+0x47:              8b 45 e8           movl   -0x18(%ebp),%eax
    main+0x4a:              0f af 45 e4        imull  -0x1c(%ebp),%eax
    main+0x4e:              89 45 e8           movl   %eax,-0x18(%ebp)
    main+0x51:              8b 45 e8           movl   -0x18(%ebp),%eax
    main+0x54:              99                 cltd
    main+0x55:              f7 7d e4           idivl  -0x1c(%ebp)
    main+0x58:              8b d0              movl   %eax,%edx
    main+0x5a:              89 55 e8           movl   %edx,-0x18(%ebp)
    main+0x5d:              0f 31              rdtsc
    main+0x5f:              89 45 ec           movl   %eax,-0x14(%ebp)
    main+0x62:              89 55 f0           movl   %edx,-0x10(%ebp)

When I compile this program using "cc -xO5", the dis output is
some_func()
    main+0x7:               0f 31              rdtsc
    main+0x9:               89 45 e8           movl   %eax,-0x18(%ebp)
    main+0xc:               89 55 ec           movl   %edx,-0x14(%ebp)
    main+0xf:               0f 31              rdtsc
    main+0x11:              89 45 f0           movl   %eax,-0x10(%ebp)
    main+0x14:              89 55 f4           movl   %edx,-0xc(%ebp)
    main+0x17:              8b 5d f0           movl   -0x10(%ebp),%ebx
    main+0x1a:              8b 45 f4           movl   -0xc(%ebp),%eax
    main+0x1d:              8b 4d e8           movl   -0x18(%ebp),%ecx
    main+0x20:              8b 55 ec           movl   -0x14(%ebp),%edx
    main+0x23:              2b d9              subl   %ecx,%ebx
    main+0x25:              1b c2              sbbl   %edx,%eax
    main+0x27:              89 5d e0           movl   %ebx,-0x20(%ebp)
    main+0x2a:              89 45 e4           movl   %eax,-0x1c(%ebp)

Now the program logic is wrong! Sun cc thinks the rdtscs are unrelated
to the other parts of some_func, so it moves the second asm("rdtsc")
ahead of the computation!
In this case, I can't measure the time cost.

Then how can I stop Sun cc from optimizing across these two asm
statements while still using -xO5 for the whole program?
I mean the second rdtsc should be placed strictly after the statement
i = i + j * i / j. (I know the x86 cpu will execute the instructions
out of order and the result may not be very precise, but it would
still work.)
Any good ideas?
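One way this is commonly handled with GCC-compatible compilers is sketched below; it is an assumption that the same trick applies to Sun cc, which the thread does not confirm. The idea is to give the compiler a data dependency it cannot see through: the inputs pass through an empty volatile asm after the first rdtsc, and the result passes through another before the second. The helpers read_tsc, opaque and some_func_timed are hypothetical names. The sketch also uses "=a"/"=d" instead of "=A", because on x86-64 GCC does not treat "A" as the edx:eax pair for a 64-bit value.

```c
#include <stdint.h>

/* Read the time-stamp counter. The "memory" clobber asks the compiler
 * not to move memory accesses across this statement. */
static inline uint64_t read_tsc(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;   /* no TSC on this target; placeholder so the sketch builds */
#endif
}

/* Empty asm that "touches" v: the compiler must have v computed at this
 * point and may not assume anything about its value afterwards. */
static inline int opaque(int v)
{
    __asm__ __volatile__("" : "+r"(v) : : "memory");
    return v;
}

/* Hypothetical reworking of some_func(): returns the computed value and
 * stores the tick delta in *ticks. */
static int some_func_timed(int i, int j, uint64_t *ticks)
{
    uint64_t t1 = read_tsc();
    i = opaque(i);          /* inputs become unknown after the 1st rdtsc */
    j = opaque(j);
    i = i + j * i / j;      /* can no longer be folded or hoisted        */
    i = opaque(i);          /* result must exist before the 2nd rdtsc    */
    uint64_t t2 = read_tsc();
    *ticks = t2 - t1;
    return i;
}
```

This only constrains the compiler. The CPU can still execute the instructions out of order, as noted above.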

TIA

Regards,
TJ
_______________________________________________
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org

You're going to be very frustrated with this approach because:

1) rdtsc is not a synchronizing instruction; the cpu may perform
  the load earlier than you think it does.
2) you'll need to bind your program to a cpu, as the tsc counters on
  different cpus do not start out the same at boot.

My suggestion is to repeat the activity a sufficient number of times
such that you can afford to use gethrtime() to measure the time
interval.  This is the approach we took w/ libmicro (see performance
community) and has worked reasonably well.
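The repeat-and-divide approach could be sketched roughly as below. This is not libmicro code; now_ns, work and avg_ns_per_call are hypothetical names, and outside Solaris the sketch substitutes clock_gettime(CLOCK_MONOTONIC) for gethrtime() so that it still builds.

```c
#ifndef __sun
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() off Solaris */
#endif
#include <stdint.h>
#include <time.h>
#ifdef __sun
#include <sys/time.h>             /* gethrtime() */
#endif

/* Nanosecond timestamp: gethrtime() on Solaris, a stand-in elsewhere. */
static int64_t now_ns(void)
{
#ifdef __sun
    return (int64_t)gethrtime();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
#endif
}

/* The activity being measured (the expression from the original post). */
static int work(int i, int j)
{
    return i + j * i / j;
}

/* Run work() many times and return the average cost per call in ns.
 * The volatile variables keep the optimizer from folding the loop away. */
static double avg_ns_per_call(long iterations)
{
    volatile int i = 3198, j = 324;
    volatile int sink = 0;
    int64_t start = now_ns();
    for (long n = 0; n < iterations; n++)
        sink = work(i, j);
    int64_t end = now_ns();
    (void)sink;
    return (double)(end - start) / (double)iterations;
}
```

With enough iterations the timer's own cost and resolution become small relative to the total, which is the point of the suggestion. The average still includes loop overhead and the volatile loads and stores.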

- Bart


--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]         http://blogs.sun.com/barts

_______________________________________________
perf-discuss mailing list
[EMAIL PROTECTED]
