Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
bob> If you have the same number of entries as buckets, and you have a
bob> good hash function, then if you have n buckets your longest chain
bob> should have length around ln(n).  The average length of a nonempty
bob> bucket would be somewhere around 1 1/2.

Yes, and it achieves that nice short chain length by consuming gobs of
memory.  A dictionary with 10**7 keys is going to chew up lots of memory.
There's nothing particularly magical about dictionaries in this respect.
They are good examples of a classic time-space tradeoff.

Skip
--
http://mail.python.org/mailman/listinfo/python-list
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
Roy> If you're getting long hash chains, you're either using a bad hash
Roy> function, or your table isn't big enough.

Sure.  The original poster said something about 10 million keys I think.
Unless the key distribution is amazingly fortuitous and the number of
unique values is small, the dictionary is going to have a huge memory
footprint.  On my Mac laptop this code:

    >>> d = {}
    >>> for n in xrange(10**7):
    ...   d[n] = hex(n)
    ...

yields a process with 647MB of VM.  If I trim that to 1000 unique values:

    >>> d = {}
    >>> for n in xrange(10**7):
    ...   d[n] = hex(n % 1000)
    ...

I still wind up with 647MB VM.  The dictionary and its keys are what
consume all the memory, and those keys are as simple as you can get.

Skip
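[Editor's note: Skip's 647MB figure is from a 2006-era Python 2 build, where
xrange was the lazy range type.  A rough way to reproduce the measurement on a
modern Python 3 interpreter, without watching process VM, is to sum
sys.getsizeof over the dict and its contents.  The helper below is a sketch,
not from the thread, and gives only a lower bound since getsizeof does not
follow references; a smaller key count keeps the demo quick.]

```python
import sys

def rough_dict_size(d):
    """Lower-bound estimate: the dict's own table plus each key and value.

    sys.getsizeof does not recurse, and repeated value objects are counted
    once per reference here -- treat the result as a rough gauge only.
    """
    total = sys.getsizeof(d)
    for k, v in d.items():
        total += sys.getsizeof(k) + sys.getsizeof(v)
    return total

# Unique values vs. 1000 shared values -- the keys dominate either way,
# which is Skip's point: sharing values barely moves the footprint.
d = {n: hex(n) for n in range(10**5)}
shared = {n: hex(n % 1000) for n in range(10**5)}

print(rough_dict_size(d))
print(rough_dict_size(shared))
```

The two totals come out in the same ballpark because the int keys and the
dict's own hash table account for most of the bytes.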
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
If you have the same number of entries as buckets, and you have a good
hash function, then if you have n buckets your longest chain should have
length around ln(n).  The average length of a nonempty bucket would be
somewhere around 1 1/2.
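[Editor's note: the figures above are easy to sanity-check with a
balls-in-bins simulation -- the sketch below is illustrative, not from the
thread.  The "1 1/2" average matches the expected 1/(1 - e**-1) ≈ 1.58 for a
nonempty bucket; the longest chain is in the right ballpark, though the
classical result puts its expected length closer to ln(n)/ln(ln(n)) than to
ln(n) when n keys hash uniformly into n buckets.]

```python
import math
import random
from collections import Counter

n = 10**5
rng = random.Random(0)

# Throw n keys uniformly into n buckets (an idealized hash function).
buckets = Counter(rng.randrange(n) for _ in range(n))

longest = max(buckets.values())
avg_nonempty = n / len(buckets)   # average chain length over nonempty buckets

print("longest chain:", longest)
print("average nonempty chain:", round(avg_nonempty, 2))
print("ln(n):", round(math.log(n), 2))
print("ln(n)/ln(ln(n)):", round(math.log(n) / math.log(math.log(n)), 2))
```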
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
In article <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]> wrote:
>
>Graham> Looking up a key in a dictionary is done in constant-time,
>Graham> i.e. it doesn't matter how large the dictionary is.
>
>Doesn't that depend on how many keys hash to the same value?  For small
>dictionaries keeping the max keys that hash to the same value small isn't a
>huge problem.  For large dictionaries (millions of keys) might you have some
>long chains?  Or in an effort to reduce the chain length, wind up using so
>much virtual memory that you wind up wrecking performance by swapping?

If you're getting long hash chains, you're either using a bad hash
function, or your table isn't big enough.
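[Editor's note: Roy's "bad hash function" case is easy to provoke in CPython,
where a dict resolves collisions with open addressing rather than linked
chains -- colliding keys share a probe sequence, which has the same effect on
lookup cost.  The sketch below is hypothetical (BadKey/GoodKey are not from
the thread): give a key type a degenerate __hash__ and every lookup walks the
whole collision run.]

```python
import timeit

class BadKey:
    """Every instance hashes to the same slot, forcing maximal collisions."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42            # degenerate hash: all keys collide
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.n == other.n

class GoodKey(BadKey):
    def __hash__(self):
        return hash(self.n)  # distinct hashes spread keys across the table

def probe(key_type, n=500, trials=200):
    """Time repeated misses against a dict of n keys of the given type."""
    d = {key_type(i): i for i in range(n)}
    missing = key_type(n + 1)
    return timeit.timeit(lambda: missing in d, number=trials)

bad, good = probe(BadKey), probe(GoodKey)
print(bad > good)   # a collided miss must compare against every key
```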
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
On 5/16/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Graham> Looking up a key in a dictionary is done in constant-time,
> Graham> i.e. it doesn't matter how large the dictionary is.
>
> Doesn't that depend on how many keys hash to the same value?  For small
> dictionaries keeping the max keys that hash to the same value small isn't
> a huge problem.

Hi Skip.

On reflection, I guess I ought to have asked the OP what he meant by
"large". :-)  Probably asked for details on his use-case as well.

Sure, collisions are a reality.  The introductory comments in dictobject.c
in the Python source [1] give good coverage of how these issues are handled
in CPython.  And they make for great bedtime reading! :-)

Regardless, I can't imagine that using multiple dictionaries would provide
a sufficient speed-up for the vast majority of large-dictionary use-cases.
There are still swapping issues, plus the user-code overhead of managing
the multiple dicts, plus the multiple lookups.  If the only improvement is
in collision reduction (which is questionable in the general case, since an
unfortunate set of keys could result in numerous collisions even in a
smaller dictionary), then is the extra overhead really worth it?

Still, it's just my opinion.  If someone wants to post code and prove me
wrong, I'll eat my hat and take my lickings. ;-)

Best,
Graham

[1] http://svn.python.org/view/python/trunk/Objects/dictobject.c
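[Editor's note: for concreteness, the scheme named in the subject line might
look like the sketch below.  IntervalDict is hypothetical, not posted in the
thread; it illustrates Graham's objection that every operation pays an extra
bucket computation on top of an ordinary dict lookup.]

```python
class IntervalDict:
    """Split an integer key range across several dicts, one per interval.

    Hypothetical sketch of the subject-line idea: each access still costs a
    division to pick the interval *plus* a normal dict lookup, so there is
    strictly more work per operation than a single flat dict.
    """

    def __init__(self, key_range, pieces):
        self.width = max(1, key_range // pieces)
        self.pieces = [dict() for _ in range(pieces)]

    def _bucket(self, key):
        # Keys past the nominal range fall into the last interval.
        return self.pieces[min(key // self.width, len(self.pieces) - 1)]

    def __setitem__(self, key, value):
        self._bucket(key)[key] = value

    def __getitem__(self, key):
        return self._bucket(key)[key]

    def __contains__(self, key):
        return key in self._bucket(key)

d = IntervalDict(key_range=10**6, pieces=16)
for n in range(0, 10**6, 1000):
    d[n] = hex(n)
print(d[5000], 5000 in d, 5001 in d)
```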
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
Graham> Looking up a key in a dictionary is done in constant-time,
Graham> i.e. it doesn't matter how large the dictionary is.

Doesn't that depend on how many keys hash to the same value?  For small
dictionaries keeping the max keys that hash to the same value small isn't a
huge problem.  For large dictionaries (millions of keys) might you have some
long chains?  Or in an effort to reduce the chain length, wind up using so
much virtual memory that you wind up wrecking performance by swapping?

Skip
Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?
Casey Hawthorne wrote:
> For Large Dictionaries Could One Use Separate Dictionaries Where Each
> Dictionary Covers an Interval of the Input Range?

One Could, But Why? :-)

You wouldn't see any performance improvements.  Looking up a key in a
dictionary is done in constant-time, i.e. it doesn't matter how large the
dictionary is.

Graham
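[Editor's note: "constant-time" here means average-case amortized O(1), as
the rest of the thread goes on to qualify.  A quick, machine-dependent check
with timeit -- illustrative only, since caching effects add noise -- shows a
lookup costing about the same in a dict 1000x larger.]

```python
import timeit

def lookup_time(size, trials=10**4):
    """Time `trials` successful lookups in a dict with `size` entries."""
    d = {n: n for n in range(size)}
    probe = size // 2
    return timeit.timeit(lambda: d[probe], number=trials)

small = lookup_time(10**3)
large = lookup_time(10**6)

# Despite a 1000x size difference, the two times are of the same order.
print(small, large)
```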