On 11/05/2013 05:34 AM, Dmitry Olshansky wrote:
05-Nov-2013 02:20, Charles Hixson writes:
On 11/03/2013 01:46 AM, Dmitry Olshansky wrote:
03-Nov-2013 02:37, Charles Hixson writes:
I'm contemplating an associative array that will eventually grow to be
an estimated 64KB in size, assuming it's about half full. It would then
be holding around 90,000 entries.  Is this reasonable, or should I go
with a sorted array, and do binary searches?  I estimate that binary
searches would average around 15 accesses/hit, but that's still a lot
faster than disk access. The array would be nearly full, so there would
be less wasted space, but, of course, insertions would become quite
expensive. (Deletions aren't expected.) In either case the keys would
be fixed length character arrays, and the values would also be simple
structs without reference types (so the garbage collector could pretty
much ignore things).

90k elements is a small AA. It would work just fine with any sane hash
function (e.g. the built-in one).

At around 10M+ elements it may be worth considering space-saving
techniques. That said, on x64 100M+ is perfectly OK; on 32-bit I'd
advise against it, but only because of GC.

FWIW I'm expecting this array to be around the entire time the program
is running, but it would probably be rolled out most of the time, as
frequently accessed items would be pulled into a much smaller structure,
and if I go with the array rather than the hash table I may only do
updates as a batch. (Since it's linear with the size of the array, but
essentially ignores the number of items being added.)

My temptation is to go with the hash table...but I'm not sure of the
underlying mechanics.

Think of it as a sparse array that has "holes" and takes ~2-4
times the size of a normal array. That's not only because of the holes
but because each slot is a bit bigger.
In return lookups/inserts/deletions are cheap O(1) with high probability.

Thanks, I'd been assuming that it would just be around 128 bytes larger
(i.e., two pointers) and around a 50% fill rate.  Sounds like a bit
worse than I was estimating...not hugely worse though.  OTOH, that's an
argument in favor of using a small cache that is a hash table, and a
large array as a backing store, where infrequently accessed items could
be kept.  Probably if the cache could hold around 2,000 entries it would
have over a 90% hit rate (though that would need to be tested).

What's your requirements for this data-structure anyway?

And the
array would also provide a sorted access method for traversing it.  I'm
not sure that's important, but it could be.  It could also be saved to
an externally editable file (i.e., using a standard text editor) which
also might be important.

Oh my, that's a lot of complexity for no gain. Truth be told I tend to always optimize for speed (=less power consumption) these days. What you describe is sort of collision pages (like in some DB) but deployed in such a small scenario I see ZERO gain in any dimension.
You know code is also occupying some RAM ;).

This will, naturally, slow down insertions. (I don't expect any
deletions, as they would render historical data unintelligible.)

Perhaps I'll batch the insertions, as the insertion is linear WRT the
size of the array, but is essentially independent of the number of
insertions.  (I.e., no matter how many insertions, it's a one pass job.)


Even more code. And space for batching, and for keeping track of whether an item is batched but not yet in the collection.

I'm not sure about the batch adds. I certainly won't do it at first, while the data is small. The thing is, I want to keep as much as feasible rolled out most of the time. That means that while it's active in "RAM", that RAM is actually pages on the disk of virtual memory. With a large hash table, that's not likely to be controllable. What I'm doing here is optimizing for "speed and...", where the "and" includes active memory use (as opposed to rolled-out virtual memory).

Part of what makes a hash table efficient is that it's always in RAM. If you need to roll it in to access it, you lose most of the efficiency. A sorted array is a pretty straightforward data structure. It's reasonably quick to access, though not as fast as a hash table; the real killer is insertion (or deletion). But it doesn't lose anywhere near as much by being rolled out (for one thing, each page holds about twice as much data, so you don't need as many reads). And it's easy to save to external disk. (You can size it easily, so that when starting the program up again the array is initialized at the correct size.)

I had actually for a long time considered just using a database, but I can't control how the database caches things in RAM. I want the cache to have a long term usage memory as well as recent use, and doing that with a database is just too complex. Yes, this is sort of like a database, but except for some databases that store everything in RAM, it's not the same.

OTOH, if this were the entire program, you would be right that just using a hash table was a better idea. But this is actually almost a sideshow, rather as GUIs are sideshows to what a program is actually doing. This is what the program needs in order to interface intelligibly with people. As such, it's not a heavily used part, and it's best if it doesn't consume excess active RAM. (For that matter, I *know* my computer is too small for what I'm trying to do, so using extra RAM is undesirable. I may eventually decide to have the thing I'm currently calling "the array" live out on disk, but it's my understanding that this isn't much of a gain over having it just rolled out by virtual memory.)

I've also seriously considered using a sort of database (I'm thinking of KyotoCabinet), but doing it this way yields fewer external dependencies, and the hash cache is probably a better choice. (Again, with KyotoCabinet I don't have as much control over what it decides to keep in the cache.) I expect the total memory needs of the program will edge toward the terabyte range, so most of it is going to need to live on disk no matter what I do, but I don't know in detail just what's going to be involved there. I've roughed out a data file based around fixed-length records addressable by id#, which can be transformed into a file offset, but how to decide what should be removed from RAM isn't something I've figured out yet.

You could think of the basic data structure of the program as a 40-dimensional cyclic graph with one-way links, but that's not quite right: the links carry more information than is typical of graph structures (though logically that's no change... you could model them as nodes along the links that act as filters). OTOH, the file structure that supports this is quite simple: just a direct-access file with fixed-length blocks.

As for "to be added" being separate, that's not hard either. That could be either a hash table or an array, though a hash table would probably be simpler. And it's just another place I'd need to check when looking to see whether I've already recognized the code. Still, that's something that should really be saved until later, as it can be added on at any time. (Which probably means I'll never get around to it.)

--
Charles Hixson
