Environment: VC6 SP5, Win98SE
Many programmers need to optimize some algorithms for speed
from time to time. You cannot always afford to buy expensive code
timing tools. This is mostly the case when learning to program.
The poor man's approach for timing a piece of code is to bracket the
code with GetTickCount() and subtract the resulting values. This works
on most systems, but has some limitations:
GetTickCount() comes with some overhead, since it involves a function
call, and has a limited resolution of about 0.055s.
(continued)
I have been looking for some possibility to time
certain
pieces of code without too much overhead on my Intel PIII, Win98
machine. Here is my approach:
Beginning with Pentium, all IA-Family processors have
a Real Time-Stamp Counter. This counter can be read at any privilege
level, when TSD flag in CR4 is clear. The command for reading the
counter is RDTSC. This command will load EDX:EAX with a 64 bit
value representing the current time stamp (the time stamp actually is
the number of cycles elapsed since reset).
RDTSC is not a serializing instruction, i.e. "it does
not necessarily wait until
all previous instructions have been executed before reading the
counter. Similarly, subsequent
instructions may begin execution before the read operation is
performed." (Intel Architecture
Software Developers Manual Volume 2: Instruction Set Reference, p.
3-604). To avoid inaccuracies, RDTSC has to be bracketed by a
serializing instruction. CPUID is such an instruction.
Although it may seem complicated to use RDTSC to time
code, it is not. I've written following four macros for this purpose:
#ifdef _DEBUG
#define PERF_DECLARE \
__int64 MSRB, MSRE; \
void *mrsb = &MSRB; \
void *mrse = &MSRE; \
char perfmtrbuf[100];
#define PERF_START \
{_asm mov eax, 0 \
_asm cpuid \
_asm rdtsc \
_asm mov ebx, mrsb \
_asm mov dword ptr [ebx], eax \
_asm mov dword ptr [ebx+4], edx \
_asm mov eax, 0 \
_asm cpuid}
#define PERF_STOP \
{_asm mov eax, 0 \
_asm cpuid \
_asm rdtsc \
_asm mov ebx, mrse \
_asm mov dword ptr [ebx], eax \
_asm mov dword ptr [ebx+4], edx \
_asm mov eax, 0 \
_asm cpuid}
#define PERF_REPORT \
{_ui64toa(MSRE-MSRB, perfmtrbuf, 10); \
TRACE("Cycles needed: %s\n", perfmtrbuf);}
#else
#define PERF_DECLARE
#define PERF_START
#define PERF_STOP
#define PERF_REPORT
#endif
The function _ui64toa() is used, so stdlib.h
has to be included prior defining these macros.
Since I was using MFC, I used TRACE to output the results. Any other
method for displaying and/or storing perfmtrbuf
is good.
Using the macros is very simple:
void SomeFunction(void){
PERF_DECLARE;
PERF_START;
PERF_STOP;
PERF_REPORT;
}
PERF_START and PERF_STOP/PERF_REPORT can bracket as many
pieces of code one wants. One cannot pair them overlapped or nested,
though.
If nesting is a must (although I cannot imagine why it could be), the
macros can be changed to accept parameters (the variable names to use).
1. Accuracy
Actually, the timing's accuracy is
less-than-expected,
at a first glance. Two consecitive timings can show up a difference of
many cycles. Executing following function 60 times shows the behaviour
of this timing method:
void Test()
{
PERF_DECLARE;
int i, j, k;
PERF_START;
for(i= 1;i<0xff; i++){
for(j= 1;j<0xff; j++){
k = i+j;
}
}
PERF_STOP;
PERF_REPORT;
}
The mean execution cycle number is 463885.38, with a
standard deviation of 4.84%. Following chart shows this:
Obviously, there is a lower threshold at 457313
cycles. The code mostly executes in this number of cycles, except some
peaks
when it needs more
than that. How come? First, the interrupts are not off during code
execution. Second, the code ran under Windows, which could preempt the
process' execution. It does not apply for this piece of code, but the
initial state of the processor's cache also can eat some cycles.
In conclusion, the code is being timed in it's
"natural
environment", not in a vacuum. If you need to know the bare execution
time of a specific piece of code,
repeated timing is a must. For short pieces of code, it is possible to
get the exact number of cycles needed at minimun to execute.
2. Compatibility / Portability
The code clearly needs an Intel Pentium processor at
minimum. At the time writing this, I have no information about
the behavior of other processors. The compiler used must be able to
handle the __int64 data type (not ANSI). The _asm
is a Microsoft specific keyword; other compiler may define it
different. The TRACE macro i have used
is actually a question of context, and can be easily replaced with some
equvalent. The macros actually do not need Windows at all. The only
condition, except a Pentium+ processor is to have the TSD flag in CR4
cleared by the OS.
3. Pitfalls
The RDTSC instruction will cause a #GP fault, if the
TSD
flag in CR4 is set. At the time writing this I have no information
about other operating systems regarding this point.
Both RDTSC and CPUID will generate an #UD exception
on processors "older" than Pentium (486, 386 and so on).
If not paired properly, PERF_START and PERF_STOP will
return bogus results, but no compile/runtime error.
Timing code that accesses any external device except
the main memory, will probably produce totally inconsistent results.
This is true for any simmilar timing method, by the way.