Re: [Python-Dev] More compact dictionaries with faster iteration

Dag Sverre Seljebotn Wed, 12 Dec 2012 12:51:28 -0800

On 12/12/2012 01:15 AM, Nick Coghlan wrote:

On Wed, Dec 12, 2012 at 5:37 AM, Dino Viehland <di...@microsoft.com
<mailto:di...@microsoft.com>> wrote:


    OTOH changing certain dictionaries in IronPython (such as keyword
    args) to be
    ordered would certainly be possible.  Personally I just wouldn't
    want to see it
    be the default as that seems like unnecessary overhead when the
    specialized
    class exists.


Which reminds me, I was going to note that one of the main gains with
ordered keyword arguments, is their use in the construction of
string-keyed objects where you want to be able to control the order of
iteration (e.g. for serialisation or display purposes). Currently you
have to go the path of something like namedtuple where you define the
order of iteration in one operation, and set the values in another.

So here's a brand new argument against ordered dicts: The existence ofperfect hashing schemes. They fundamentally conflict with ordered dicts.

I played with using them for vtable dispatches in Cython this summer,and they can perform really, really well for branch-predicted lookups inhot loops, because you always/nearly always eliminate linear probing andso there's no branch misses or extra comparisons. (The overhead of aperfect hash table lookup over a traditional vtable lookup was only acouple of cycles in my highly artificial fully branch-predictedmicro-benchmark.)

There's some overhead in setup; IIRC, ~20 microseconds for 64 elements,2 GHz CPU, though that was a first prototype implementation and bothalgorithmic improvements and tuning should be possible.

So it's not useful for everything, but perhaps for things like moduledictionaries and classes an "optionally perfect" dict can make sense.

Note: I'm NOT suggesting the use of perfect hashing, just making sureit's existence is mentioned and that one is aware that if always-ordereddicts become the language standard it precludes this option far off inthe future.

(Something like a sort() method could still work and make the dict"unperfect"; one could also have a pack() method that made the dictperfect again.).


That concludes the on-topic parts of my post.

--
Dag Sverre Seljebotn

APPENDIX

Going off-topic for those who are interested, here's the longwinded andugly details. My code [1] is based on the paper [2] (psuedo-code inAppendix A), but I adapted it a bit to be useful for tens/hundreds ofelements rather than billions.


The ingredients:

1) You need the hash to be 32 bits (or 64) of good entropy (md5 ormurmurhash or similar). (Yes, that's a tall order for CPython, I'm justdescribing the scheme.) (If the hash collides on all bits you *will*collide, so some fallback is still necesarry, just unlikely.)


 2) To lookup, the idea is (psuedo-code!)

typedef struct {
    int m_f m_g, r, k;
    int16_t d[k]; /* "small" int, like current proposal */
} table_header_t;

And then one computes index of an element with hash "h" using the function

((h >> tab->r) & tab->m_f) ^ tab->d[h & tab->m_g]

rather than the usual "h % n". While more arithmetic, arithmetic ischeap and branch misses are not.

3) To set up/repack a table one needs to find the parameters. Thegeneral idea is:

a) Partition the hashes into k bins by using "h & m_g". There will becollisions, but the number of bins with many collisions will be verysmall; most bins will have 2 or 1 or 0 elements.


 b) Starting with the largest bin, distribute the elements according to

the hash function. If a bin collides with the existing contents, tryanother value for d[binindex] until it doesn't.

The r parameter let's you try again 32 (or 64) times to find a solution.In my testcases there was ~0.1% chance of not finding a solution (thatis, exhausting possible choices of r) with 64-bit hashes with 4 or 8elements and no empty table elements. For any other number of elements,or with some empty elements, the chance of failure was much lower.)

[1] It's not exactly a great demo, but it contains the algorithm. Ifthere's much interest I should clean it up and make a proper benchmarkdemo out of it:


https://github.com/dagss/pyextensibletype/blob/perfecthash/include/perfecthash.h


[2] Pagh (1999)

http://www.brics.dk/RS/99/13/BRICS-RS-99-13.ps.gz




_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] More compact dictionaries with faster iteration

Reply via email to