I just updated my CVS copy and read the documentation Kevin wrote
about his hash strategy, and I see similar thoughts about the
cache lines :-)

His design is an excellent implementation of
the L2 and L3 hashtables. I think, however, that an additional L1
hashtable might be a bonus, by defeating
the LRU replacement policy of
the processor's L1 data cache.

Imagine that the guest is executing some tight loops which
repeatedly sweep through a working set of pages that exceeds
the capacity of the processor's L1 data cache (e.g. a 16 kbyte
memcpy).

The translated code for such a loop contains
a handful of branch instructions which are executed so often
that the cache lines containing their hashtable lookup entries are
never replaced by the LRU replacement policy of the
processor's L1 data cache.

All other entries in our lookup table will not be used for a while
and will be forced out of the processor's L1 data cache.

Creating a small L1 lookup hashtable (significantly smaller
than the L2 lookup hashtable) has the effect that slots which
contain lookup values that are NOT used in the tight loops above,
but which share an L1 cache line with slots that ARE used, are no
longer candidates for eviction by the processor's L1 cache
replacement strategy.
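
Something like the sketch below is what I have in mind. The names
(l1_lookup, lookup_translated, l2_hashtable_lookup) are made up and
just stand in for whatever the real lookup path ends up being; the
slot count and the placeholder hash are assumptions, not a proposal
for the actual code:

/* Rough sketch, with made-up names: a tiny "L1" lookup table that is
 * consulted before the larger L2 hashtable.  It is small enough that
 * the whole table fits in a few lines of the processor's L1 data cache. */

#define L1_LOOKUP_SLOTS 16              /* see the sizing argument below */

typedef struct {
    unsigned long guest_addr;           /* guest address of the block  */
    void         *translated_code;      /* host address of translation */
} L1LookupEntry;

static L1LookupEntry l1_lookup[L1_LOOKUP_SLOTS];

void *l2_hashtable_lookup(unsigned long guest_addr);  /* existing L2/L3 path */

static unsigned l1_hash(unsigned long addr)
{
    return (unsigned)(addr & (L1_LOOKUP_SLOTS - 1));  /* placeholder, see below */
}

void *lookup_translated(unsigned long guest_addr)
{
    L1LookupEntry *e = &l1_lookup[l1_hash(guest_addr)];

    if (e->translated_code && e->guest_addr == guest_addr)
        return e->translated_code;      /* hit in the small L1 table */

    /* Miss: fall back to the L2 hashtable and refill the L1 slot. */
    void *code = l2_hashtable_lookup(guest_addr);
    if (code) {
        e->guest_addr      = guest_addr;
        e->translated_code = code;
    }
    return code;
}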

When the tight loops finish executing, these entries will have survived
the LRU replacements in the processor's L1 cache. A 2-deep
hashtable is probably not
optimal for this behaviour; 1-deep is better because more slots
profit from their neighbours' lookups. The assumption is that the
calculation of our L1 hash key distributes the accesses more or less
uniformly over our L1 hashtable. A simple "and" function does not
do this; we will have to swap the upper and lower (4?) bits of the
addresses. When e.g. 4 addresses are frequently used and L1 cache
lines are 32 bytes (four slots per line), then in total 16 addresses
survive the L1 processor data cache storm, so perhaps the L1 hashtable
should have about 16 slots. The end result is that the 16 most recently
used lookups are in the processor's L1 cache most of the time.
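
Just to make the key calculation concrete, this is one way I could
imagine doing the bit mixing. Whether it should literally be a swap of
the upper and lower 4 bits or an XOR fold like below is open, and the
shift amounts are guesses:

#define L1_LOOKUP_SLOTS 16
#define L1_INDEX_BITS    4              /* log2(L1_LOOKUP_SLOTS) */

/* Instead of a plain "and" (which keeps only the low bits, so addresses
 * that differ only in higher bits all land in the same slot), fold some
 * higher-order bits of the address into the 4-bit index as well. */
static unsigned l1_hash(unsigned long addr)
{
    return (unsigned)((addr ^ (addr >> L1_INDEX_BITS) ^ (addr >> 12))
                      & (L1_LOOKUP_SLOTS - 1));
}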

I'm not really sure that shaping our own "host" L1 processor cache
working set in this way really has a positive effect on
performance. Perhaps (probably?) the time spent executing the
tight loops makes any performance gain from maintaining
our 16 lookup entries insignificant. It's just a thought.

Tom
