On Sat, 22 Dec 2007, 陶捷 TaoJie wrote:

2007/12/22, Frank Hofmann <[EMAIL PROTECTED]>:



On Sat, 22 Dec 2007, Minskey Guo wrote:


On 2007-12-22, at 9:17 PM, 陶捷 TaoJie wrote:

Hi Bart,

I noticed this email just now :(
Thank you for your advice.

Are there any barrier instructions on x86/x64 that could force rdtsc to
behave synchronously?

iret, xchg, cpuid, sfence, lock, etc. But cpuid changes eax etc., and
sfence is not available on all Pentiums (PIII ???)

We had this discussion with AMD a while ago; if I remember correctly, but


Do you remember the topic of that discussion?

Was related to:

http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/i86_subr.s?r1=5322&r2=5084
http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/i86pc/os/mlsetup.c?r1=5338&r2=5084




Bart may well step in here, is that the only thing that's guaranteed in
all situations and fully vendor/chip-rev independent is CPUID. Which is
sort of a barrier sledgehammer. Takes thousands of cycles.

Because it takes thousands of cycles?

Yes.


Since it takes thousands of cycles, it will affect the test results a bit.
But it seems like a good generic solution.

Btw, on the P4 and later Intel platforms, which instruction is the best
barrier?
And which is it on the AMD Opteron and later AMD platforms?

The only one that allows serialization even for instructions that do not access memory is cpuid. Everything else (mfence and its variants, iret) has corner cases where it may not serialize [on all cpu varieties]. The source above should give you some more details.
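To illustrate the cpuid-as-barrier idea, here is a minimal sketch, assuming a compiler with GCC-style extended asm on x86/x64. The names serialized_ticks() and timer_overhead() are made up for this example and are not taken from the Solaris source linked above. cpuid is issued right before rdtsc so that earlier instructions have completed before the counter is read; it does not stop later instructions from starting early. timer_overhead() takes the smallest back-to-back delta as a rough estimate of what the barrier itself costs.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read a tick counter with a serializing barrier in front of the read. */
static inline uint64_t serialized_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t a = 0, b, c = 0, d, lo, hi;

    /* cpuid (leaf 0) overwrites eax, ebx, ecx and edx, so all four are
     * declared as operands and the compiler saves whatever it needs. */
    __asm__ __volatile__("cpuid"
                         : "+a"(a), "=b"(b), "+c"(c), "=d"(d)
                         :
                         : "memory");
    (void)a; (void)b; (void)c; (void)d;

    /* rdtsc returns the 64-bit counter split across edx:eax. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
#else
    /* Assumption: on non-x86 hosts there is no cpuid/rdtsc, so fall back
     * to a monotonic clock in nanoseconds just to keep the sketch building. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
}

/* Smallest delta between two back-to-back reads over `runs` attempts:
 * an estimate of the cost of the measurement itself. */
static uint64_t timer_overhead(int runs)
{
    uint64_t best = UINT64_MAX;

    assert(runs > 0);
    for (int k = 0; k < runs; k++) {
        uint64_t t0 = serialized_ticks();
        uint64_t t1 = serialized_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Subtracting timer_overhead() from a measured delta gives a rough correction only. The cost of cpuid varies by CPU and is typically much larger when running under a hypervisor.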




Wondering - what _exactly_ are you planning to do ? Instruction-based
sampling can be done via CPU performance monitoring counters, the old
"sample time, do something, sample time again" is sort-of superseded by
those. High-level access in Solaris would be via the cpc(7d) driver.

OK, I'll try to find some articles about performance monitoring counters and
the cpc driver in Solaris to read.

Start with the source and with this one:

http://docs.sun.com/app/docs/doc/816-5172/6mbb7btcs?a=view


I have a program whose detailed behavior I want to analyze.
In short, I want to know the time cost of any sub-flow within the whole
program flow.
Suppose the program flow is a long vertical line like
    main
    func1
    func2
    func3
    "some key instructions in func3"  (record it as "#1")
    func4
    func3
    func2
    func1
    "some key instructions in func1"  (record it as "#2")
    "exit in main"

Again, see source above. Or, rather, try using CPC and/or DTrace's timestamps. If it's your own source code, userland SDT probes might give you the necessary sampling hooks as well.



I'm interested in:
How much time does func4 take?
How much time does #1 take?
How much time does #2 take?
How much time does the control transfer from func2 to func3 (a function
call) take?
During func4, the program may be interrupted by some event. If so, how long
does the interruption last, and how long does the program wait to regain the
CPU?
If not, that's all right.

Are there any good suggestions for this problem?

If the time involved in all these samples is _not_ microscopic, then DTrace's sampling might well tell you. If it is microscopic, though, then CPC (or even your own use of CPU performance monitoring facilities, to avoid a kernel driver overhead) might become necessary.

Correctly sampling "micro-events" is a hard task, and I'm not aware of a generic "good" suggestion.

Have a great weekend,
FrankH.


Kind Regards,
TJ



FrankH.




My concern is
---------------
rdtsc
[barrier]
AA
BB
CC
....
XX
[barrier]
rdtsc
----------------
(2nd rdtsc - 1st rdtsc) should be the time cost of these inner
instructions/functions, and it should be equal to or greater than the
actual cost.

Are there any barrier instructions that force the 1st rdtsc to execute
before AA and the 2nd rdtsc to execute after XX?
Would a run of consecutive nops work, or some other instruction?

sfence
rdtsc
xxxxx

sfence
rdtsc


Maybe cpuid is usable if eax can be clobbered, or you can save it
somewhere before the cpuid.

-minskey




Btw, you mentioned you have experience with performance measurement.
Are there any recommended articles about performance measurement on the
x86/x64 platform?
Are there any recommended articles about measuring instruction cost?
For example, some books say nop costs 1 cycle on the Pentium and 3 cycles
on the 386. How does one get these precise costs?

Thank you :)

Another question:
On an SMP or multi-core (or, say, CMT) platform, each processor/core has
its own tsc register on chip, doesn't it?
Then how can gethrtime() guarantee to provide a system-wide time? I mean,
if a program runs on CPU1 for a while and then runs on CPU2, would
gethrtime() - gethrtime() be the precise time cost? Does gethrtime() read
ticks from the CPU's tsc register, or from a system-wide timer (e.g. the
8253 chip on x86)?

I'm not familiar with timers... sorry for these stupid questions :-(


Kind Regards,
TJ


2007/10/30, Bart Smaalders <[EMAIL PROTECTED]>:
陶捷 TaoJie wrote:
Dear all:

My platform is:
Intel Pentium 4 CPU
OpenSolaris B74, built by myself
Sun Studio 11

In my program, I use asm("rdtsc") to measure the time cost between two
rdtsc instructions.
for example:
int some_func(...)
{
    long long time1, time2;
    int i = 3198, j = 324;

    asm volatile("rdtsc" : "=A" (time1));

    ....
    i = i + j * i / j;

    asm volatile("rdtsc" : "=A" (time2));

    return i;
}

int main(...)
{
    ....
    some_func();
    ....
}

When I compile this program using "cc example.c" and disassemble a.out
with dis, the program logic is OK. The output is
some_func()
    main+0x36:              0f 31              rdtsc
    main+0x38:              89 45 f4           movl   %eax,-0xc(%ebp)
    main+0x3b:              89 55 f8           movl   %edx,-0x8(%ebp)
    main+0x3e:              8b 45 e8           movl   -0x18(%ebp),%eax
    main+0x41:              03 45 e4           addl   -0x1c(%ebp),%eax
    main+0x44:              89 45 e8           movl   %eax,-0x18(%ebp)
    main+0x47:              8b 45 e8           movl   -0x18(%ebp),%eax
    main+0x4a:              0f af 45 e4        imull  -0x1c(%ebp),%eax
    main+0x4e:              89 45 e8           movl   %eax,-0x18(%ebp)
    main+0x51:              8b 45 e8           movl   -0x18(%ebp),%eax
    main+0x54:              99                 cltd
    main+0x55:              f7 7d e4           idivl  -0x1c(%ebp)
    main+0x58:              8b d0              movl   %eax,%edx
    main+0x5a:              89 55 e8           movl   %edx,-0x18(%ebp)
    main+0x5d:              0f 31              rdtsc
    main+0x5f:              89 45 ec           movl   %eax,-0x14(%ebp)
    main+0x62:              89 55 f0           movl   %edx,-0x10(%ebp)

When I compile this program using "cc -xO5", the dis output is
some_func()
    main+0x7:               0f 31              rdtsc
    main+0x9:               89 45 e8           movl   %eax,-0x18(%ebp)
    main+0xc:               89 55 ec           movl   %edx,-0x14(%ebp)
    main+0xf:               0f 31              rdtsc
    main+0x11:              89 45 f0           movl   %eax,-0x10(%ebp)
    main+0x14:              89 55 f4           movl   %edx,-0xc(%ebp)
    main+0x17:              8b 5d f0           movl   -0x10(%ebp),%ebx
    main+0x1a:              8b 45 f4           movl   -0xc(%ebp),%eax
    main+0x1d:              8b 4d e8           movl   -0x18(%ebp),%ecx
    main+0x20:              8b 55 ec           movl   -0x14(%ebp),%edx
    main+0x23:              2b d9              subl   %ecx,%ebx
    main+0x25:              1b c2              sbbl   %edx,%eax
    main+0x27:              89 5d e0           movl   %ebx,-0x20(%ebp)
    main+0x2a:              89 45 e4           movl   %eax,-0x1c(%ebp)

Now the program logic is wrong! Sun cc assumes the rdtsc statements are
unrelated to the rest of some_func, so it moves the second asm("rdtsc")
up, right after the first one!
In this case, I can't measure the time cost.

How can I stop Sun cc from optimizing across these two asm statements
while still compiling the whole program with -xO5?
I mean the second rdtsc should come strictly after the statement
i = i + j * i / j. (I know the x86 CPU executes instructions out of order,
so the result may not be very precise, but it would still work.)
Any good ideas?
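One way to pin the computation between the two readings, as a sketch only: it assumes a compiler that implements GCC-style extended asm (GCC, Clang). Whether Sun Studio cc at -xO5 treats these constraints the same way is an assumption that is not verified here. read_ticks() and timed_calc() are names invented for the example. The empty asm statements emit no instructions; they only create data dependencies, so the arithmetic has to start after the first reading and finish before the second. This constrains the compiler only, not the CPU's out-of-order execution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

static inline uint64_t read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;

    /* "=a"/"=d" rather than "=A": "=A" names the edx:eax pair only when
     * compiling 32-bit code, so the split form works for x86 and x64. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
#else
    /* Assumption: non-x86 fallback so the sketch still builds. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
}

/* Computes i + j * i / j, stores it in *out, returns the elapsed ticks. */
static uint64_t timed_calc(int i, int j, int *out)
{
    uint64_t t0, t1;

    assert(out != NULL && j != 0);

    t0 = read_ticks();
    /* Pretend to modify i and j after the first reading: the compiler can
     * no longer fold the expression or compute it ahead of this point. */
    __asm__ __volatile__("" : "+r"(i), "+r"(j));

    i = i + j * i / j;

    /* Pretend to consume i: its value must exist before this statement,
     * and this volatile asm stays ahead of the second reading. */
    __asm__ __volatile__("" : "+r"(i) : : "memory");
    t1 = read_ticks();

    *out = i;
    return t1 - t0;
}
```

Calling timed_calc(3198, 324, &result) mirrors the values in the example above. A single pass like this is still dominated by measurement noise, so the delta is only meaningful as a rough upper bound.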

TIA

Regards,
TJ
_______________________________________________
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org

You're going to be very frustrated with this approach because:

1) rdtsc is not a synchronizing instruction; the cpu may perform
  the load earlier than you think it does.
2) you'll need to bind your program to a cpu as tsc counters are not
  the same at boot.

My suggestion is to repeat the activity a sufficient number of times
such that you can afford to use gethrtime() to measure the time
interval.  This is the approach we took w/ libmicro (see performance
community) and has worked reasonably well.

- Bart


--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]         http://blogs.sun.com/barts

_______________________________________________
perf-discuss mailing list
[EMAIL PROTECTED]


