On Mar 08, 2007, at 02:24:04, Eric Dumazet wrote:
Kyle Moffett a écrit :
Prefetching is also fairly critical on a Power4 or G5 PowerPC
system as they have a long memory latency; an L2-cache miss can
cost 200+ cycles. On such systems the "dcbt" prefetch instruction
brings in a single 128-byte cacheline and has no serializing
effects whatsoever, making it ideal for use in a linked-list-
traversal inner loop.
OK, 200 cycles...
But what is the cost of the conditional branch you added in prefetch
(x) ?
if (!x) return;
(correctly predicted or not, but do powerPC have a BTB ?)
Well, PowerPC "dcbt" does prefetch() correctly, it doesn't ever raise
exceptions, doesn't have any side effects, takes only enough CPU to
decode the address, and is ignored if it would have to do anything
other than load the cacheline from RAM. Prefetch streams are halted
when they reach the end of a page boundary (no trapping to the MMU)
and if the TLB entry isn't present then they would asynchronously
abort. This is a suitable prefetch implementation for G5:
#define prefetch(x) __asm__ __volatile__ ("dcbt 0, %0"::"r"(x));
On GCC4+ the "__builtin_prefetch" built-in functions is probably
mildly better tuned for most architectures:
#define PREFETCH_RD 0
#define PREFETCH_WR 1
#define PREFETCH_LOCALITY_NONE 0
#define PREFETCH_LOCALITY_LOW 1
#define PREFETCH_LOCALITY_MED 2
#define PREFETCH_LOCALITY_HIGH 3
#define prefetch(x, args...) __builtin_prefetch(x ,##args)
The pertinent GCC docs:
This function is used to minimize cache-miss latency by moving data
into a cache before it is accessed. You can insert calls to
__builtin_prefetch into code for which you know addresses of data
in memory that is likely to be accessed soon. If the target
supports them, data prefetch instructions will be generated. If the
prefetch is done early enough before the access then the data will
be in the cache by the time it is accessed.
The value of addr is the address of the memory to prefetch. There
are two optional arguments, rw and locality. The value of rw is a
compile-time constant one or zero; one means that the prefetch is
preparing for a write to the memory address and zero, the default,
means that the prefetch is preparing for a read. The value locality
must be a compile-time constant integer between zero and three. A
value of zero means that the data has no temporal locality, so it
need not be left in the cache after the access. A value of three
means that the data has a high degree of temporal locality and
should be left in all levels of cache possible. Values of one and
two mean, respectively, a low or moderate degree of temporal
locality. The default is three.
for (i = 0; i < n; i++) {
a[i] = a[i] + b[i];
__builtin_prefetch (&a[i+j], 1, 1);
__builtin_prefetch (&b[i+j], 0, 1);
/* ... */
}
Data prefetch does not generate faults if addr is invalid, but the
address expression itself must be valid. For example, a prefetch of
p->next will not fault if p->next is not a valid address, but
evaluation will fault if p is not a valid address.
If the target does not support data prefetch, the address
expression is evaluated if it includes side effects but no other
code is generated and GCC does not issue a warning.
Cheers,
Kyle Moffett
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/