Kevin Lawton wrote:

> Fixed is good.  It could be tied directly to the size N of the meta
> info array.  If possible, it might be nice to be able to scale N and
> the hash together to gather some real data later.

The range of N is important!  Will it be 5 or 5,000,000?  There are some
bit-twiddling hash algorithms that work well for 1K entries, but generate too
much overhead by the time you reach 10K, and are useless at 100K.
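To give a feel for the small end of that range: the classic fast candidates
are one-multiply bit mixers, along these lines (the table size here is just
an example I made up):

  #include <stdint.h>

  /* A Fibonacci (multiplicative) hash into a power-of-two table:
     one multiply, one shift.  Fine around 1K entries; at larger N
     the collision handling wrapped around it starts to dominate. */
  #define TABLE_BITS 10    /* 1024 slots, an arbitrary example */

  static inline uint32_t fib_hash(uint32_t key)
  {
      return (key * 2654435769u) >> (32 - TABLE_BITS);  /* 2^32/phi */
  }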

Got some example numbers?  (Whoops - I see some below.)


> Fixed to N as above.  Perhaps, if you can make it scale based on
> powers of 2 or whatever works out to be simple.  But performance
> is the key.  Flexibility has to be a non-primary goal.

I suspected as much, but I do need some real numbers to make much headway.  I
know of no "general" (unbounded) algorithms that are also fast.

But I do know of several that can be tuned nicely over rather narrow ranges.
For example:

If the cost of creating an entry in the array does not matter, then a perfect
hash can be created on the fly.  This is painfully slow (at
initialization/creation time), but once generated it is blazingly fast (at
run/access time).  Dynamic perfect hash generators are a black art, and the
few theoretically optimal perfect hash generators I know of suck massive CPU
time (IIRC, it may be an NP-Hard problem).

But if start/setup/initialization time is "free", and run time is everything,
then you will likely want to compute a perfect hash.  And recompute the
entire thing whenever the number of active (in use) elements in the array
changes, or when/if array entries are ever reused.
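A minimal sketch of that idea, with every name and size invented for
illustration:  at setup time, brute-force search for a multiplier that
happens to map the current key set into the table with no collisions; at
run time, a lookup is one multiply, one shift, and one compare.

  #include <stdint.h>
  #include <string.h>

  #define PH_BITS 10                  /* example: 1024-slot table */
  #define PH_SIZE (1u << PH_BITS)

  static uint32_t ph_mult;            /* found at setup time */
  static uint32_t ph_table[PH_SIZE];  /* 0 marks an empty slot, so
                                         key 0 is assumed unused */

  static inline uint32_t ph_slot(uint32_t key)
  {
      return (key * ph_mult) >> (32 - PH_BITS);
  }

  /* Rerun whenever the active key set changes.  Painfully slow,
     but all of the pain lives at setup time. */
  int ph_rebuild(const uint32_t *keys, int n)
  {
      uint32_t tries;
      for (ph_mult = 1, tries = 0; tries < 0x80000000u;
           tries++, ph_mult += 2) {          /* all odd multipliers */
          int i, ok = 1;
          memset(ph_table, 0, sizeof ph_table);
          for (i = 0; i < n && ok; i++) {
              uint32_t s = ph_slot(keys[i]);
              if (ph_table[s]) ok = 0;       /* collision: next mult */
              else ph_table[s] = keys[i];
          }
          if (ok) return 0;                  /* perfect for this set */
      }
      return -1;                             /* nothing worked */
  }

  int ph_lookup(uint32_t key)                /* 1 = present */
  {
      return ph_table[ph_slot(key)] == key;
  }

A naive search like that can take ages on larger key sets, of course; the
real generators (see the links below) are far smarter about it.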


> Let me explain how this hash fits in so you can get the big picture.
> Most of the dynamically generated code will be just the original
> instructions.  Though, when we see computed or out-of-region static
> branches, a small sequence which calls a handler function will be
> generated.
>
> So the handler function will be called by many generated sequences.
> In the original C code, you could imagine that a loop might be
> calling a C function many times/second.  And we would be redirecting
> that call via our handler routine.  So, it's important that our
> handler routine be _very_ fast.

I am completely ignorant of the Plex86/Bochs design and implementation
("dynamically generated code"?), though I am extremely familiar with embedded
and real-time applications down to the hardware and instruction level, and
I do speak hash.

For this kind of speed, especially for repeated hits, we may want to design
the lookup function to reside completely within L1 or L2 cache.  A small
"slow" function in cache may trounce the performance of a "perfect" function
that won't fit in cache.  We could see 100:1 speed differences that depend
more on code and table size than on any other measure of optimality.

Is there any Plex86 code that will be more important than this hash
function?  Would it be worth considering locking the hash function in the
cache?  (If you can do that on x86 - can you?  I haven't worked with x86 in
years - Pentiums are lousy embedded or real-time processors.)

Such considerations will have a profound effect on the design of the
algorithm.


> Sure, I was planning on storing the page address in the meta info.  The
> size of the meta array would be determined by the number of guest pages
> for which we would have any amount of translated code.  Some pages may
> have only small portions translated, others higher density.  Maybe
> if we had 256K of actual translated code, that might translate into
> some 128 or 256 meta entries - just pulling these numbers out of thin air.

Egad, I'm not even following your terminology here:  What are "guest" pages?
What other types of pages are involved?  And "translated" code?

Perhaps the best way to approach this (for now) would be to abstract the
problem as far away from Plex86 specifics as possible, and express it in
product-neutral terms.  Otherwise, I fear my help may be delayed by the time
needed for me to come up to speed on Plex86 itself.  And the time for that
isn't anywhere within radar range right now.

Can you give me a prototype for the function?
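Even a rough guess would help.  Something with this shape, say (every name
here is my invention, not Plex86's):

  #include <stdint.h>

  /* Hypothetical: just the shape of the thing I'm asking for. */
  typedef struct meta_entry meta_entry_t;  /* your meta info record */

  /* Map a guest page address (upper 20 bits significant) to its
     meta entry, or NULL if the page has no translated code. */
  meta_entry_t *meta_lookup(uint32_t guest_page_addr);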

Can you point me to all the data structures the function must know about in
order for it to do its job?

Can you list the functional constraints?  For example, if I make a
hair-on-fire hash, may I use 1 MB of scratchpad RAM?

I think we need to think about drafting a few formal requirements (functional
characteristics) and some specifications (hard limits, such as:  code/data
space, setup/lookup time, frequency of access, language/compiler, CPU
families [Pentium only?  No '486?], etc.).

With such information, I can start an immediate search, and probably even
find free code for a few candidate implementations (not tailored to Plex86,
but suitable for "filling the gap" and serving as a basis for refinement).


> You mean mapped into the page tables?

There's that terminology thing again.  For now, let's assume I'm
(Ple)x86-ignorant, and hash-wise.


> Uniform 4k pages.  I'm looking to map based on only the page address
> (only the upper 20 bits).  For the offset, I also need to then hash to
> find the address of the translated instruction, but have an idea for that
> already.

Uh, now I'm confused.  This hash is one of (at least) two algorithms involved
in this lookup process?  Would you paint me a picture (with pretty colors) of
the entire transaction, end-to-end?  I suspect I may be able to uncover a few
tricks that may help you get from start to finish with a single algorithm, or
possibly with a small set of algorithms that are mutually optimized.
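If I'm reading you right, the whole transaction is something like the
sketch below, with this hash as stage 1 and your offset idea as stage 2.
Every name in it is hypothetical, and both stages are stubbed out:

  #include <stdint.h>
  #include <stddef.h>

  typedef struct {
      uint32_t guest_page;  /* upper 20 bits of the guest address */
      /* ... per-page offset -> translated-address map here ... */
  } meta_entry_t;

  static meta_entry_t *page_lookup(uint32_t page)   /* stage 1 */
  {
      (void)page;
      return NULL;          /* the hash we are discussing */
  }

  static void *offset_lookup(meta_entry_t *m, uint32_t offset)
  {
      (void)m; (void)offset;                        /* stage 2 */
      return NULL;          /* your offset algorithm */
  }

  void *find_translated(uint32_t guest_addr)
  {
      uint32_t page   = guest_addr >> 12;     /* 4K pages: top 20 bits */
      uint32_t offset = guest_addr & 0xFFFu;  /* low 12 bits */
      meta_entry_t *m = page_lookup(page);
      if (m == NULL)
          return NULL;                        /* page not translated */
      return offset_lookup(m, offset);
  }

Is that the right picture, or is there more to the story?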


I don't want to consume chunks of your time having you play tutor to me, when
you may be able to find a much more direct route to your goal.  Perhaps if
you do a quick and dirty implementation, I can then come in and try to
improve it to meet your needs.

Is this hash algorithm a show-stopper?  If so, then you most certainly want
me to avoid learning more than the absolute minimum about Plex86, and instead
provide me with enough abstracted detail to provide a "plug in" solution.
Can you implement an interface, a high-level API, which I can then try to
flesh out for you?

If so, then we can put in some instrumented (but slow) code to get it up and
running ASAP, and have it gather real-world performance/use data which I can
then use to tailor a better solution.


Then again, it may be faster for an experienced Plex86 team member to learn
about hashes, than it would be for me to apply my hash knowledge to Plex86.
I doubt it, but it is certainly very possible.  Here are some links that
should give you an idea of some of the practice, thinking and theory
surrounding perfect hashing:

http://www.amk.ca/python/code/perfect-hash.html
A "perfect" hash generator implemented in Python, intended to compute static
hashes.  Notice that the discussion immediately drops into graph theory.  All
mapping problems tend to wind up mired in graph theory.

http://burtleburtle.net/bob/hash/perfect.html
Here's another one, written in C.  It too is intended to generate "one time
only" hashes, and not be used to dynamically optimize a hash.

http://citeseer.nj.nec.com/czech92optimal.html
Want some big-head theory?  Here's the place.  A theoretically optimal
algorithm for generating perfect hashes.  Gnarly stuff.  The references are
excellent resources.  However, the implementation leaves much to be desired.

http://www.gnu.org/software/gperf/gperf.html
GNU GPERF generates perfect hashes.  I haven't looked at this code, but many
other people have.  Like most hash generators, it is optimized for hashing
strings, so it may not be of direct use to Plex86.  But it has the right
license, and I bet one of the authors would be a good resource.

http://acm.fi.uva.es/problemset/v1/188.html
Wanna see an algorithm an Internet content scanner might use?

http://penguinfiles.com/software/listings/Development/Tools/1535/
http://www.ibiblio.org/pub/Linux/devel/lang/c/mph-1.2.tar.gz
This package is more code I haven't looked at, but it appears to have some
desirable properties.

http://uw7doc.sco.com/SDK_c++/_A_perfect_hash.html
Some C++.


Now, I am *not* saying a perfect hash is the way to go.  Certainly,
nothing can beat a perfect hash "theoretically".  But a "practically"
faster lookup is absolutely possible when we consider hardware and
platform-related optimizations, such as eliminating multiplications and
intermediary tables, and getting the entire algorithm to fit in cache.

Plus, I'm wondering how close to random those 20 bits will be.  If we can
assume a "working set" of hashes, a window into the entire total hash space,
then some other fast and powerful tools can be applied.  Locality in space
and time may be used to obtain near-ideal performance from less than ideal
algorithms, the penalty being similar to a "cache miss".
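One cheap trick along those lines is a tiny direct-mapped cache in front
of whatever the full lookup turns out to be (a software TLB, in effect).
A sketch, with every size pulled from thin air:

  #include <stdint.h>
  #include <stddef.h>

  #define FRONT_BITS 6                   /* 64 entries: a handful */
  #define FRONT_SIZE (1u << FRONT_BITS)  /* of cache lines */

  static struct { uint32_t page; void *meta; } front[FRONT_SIZE];

  static void *slow_lookup(uint32_t page)  /* the full hash goes */
  {                                        /* here; stubbed out */
      (void)page;
      return NULL;
  }

  void *fast_lookup(uint32_t page)
  {
      uint32_t i = page & (FRONT_SIZE - 1);   /* low bits index it */
      if (front[i].page == page && front[i].meta != NULL)
          return front[i].meta;               /* hit: load + compare */
      front[i].meta = slow_lookup(page);      /* miss: refill */
      front[i].page = page;
      return front[i].meta;
  }

When the working set behaves, most lookups never touch the big structure
at all; when it doesn't, the penalty is one wasted compare.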


-BobC


