We'd love to see such a patch; we've been hampered by this problem for years.
We use apr_hash_t to look up several in-memory lexicons, each with over
half a million words. The slot array starts at 2^n - 1 slots with n = 4
(INITIAL_MAX = 15) and expands from there, which wastes too much memory
because we have to keep all the hash lexicons in persistent pools. The
easiest remedy we've been using is to add a new "API" function like
this:

APR_DECLARE(apr_hash_t *) apr_hash_make_ex(apr_pool_t *pool,
                                           unsigned int init_size)
{
    apr_hash_t *ht;
    ht = apr_palloc(pool, sizeof(apr_hash_t));
    ht->pool = pool;
    ht->free = NULL;
    ht->count = 0;
    ht->max = init_size;  /* caller-chosen size instead of INITIAL_MAX */
    ht->array = alloc_array(ht, ht->max);
    ht->hash_func = apr_hashfunc_default;
    return ht;
}

which resembles the other two APR functions:

        APR_DECLARE(apr_table_t *) apr_table_make(apr_pool_t *p, int nelts);
        APR_DECLARE(apr_array_header_t *) apr_array_make(apr_pool_t *p,
                                                         int nelts,
                                                         int elt_size);
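
For illustration, a call site might look like this (just a sketch:
lexicon_pool, word and entry are placeholders, and the initial size of
2^20 - 1 is simply a guess sized for our half-million-word lexicons):

        /* size the slot array up front so expand_array() never runs */
        apr_hash_t *lexicon = apr_hash_make_ex(lexicon_pool,
                                               (1 << 20) - 1);
        apr_hash_set(lexicon, word, APR_HASH_KEY_STRING, entry);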

Also, we'd very much like to explore whether apr_hash_t could be
extended to use the "move-to-front" technique presented in the
following papers (a sketch of the idea follows the references):

        "In-memory hash tables for accumulating text vocabularies", J. Zobel, 
S. Heinz, and H.E. Williams, Information Processing Letters, 80(6):271-277, 
December 2001. (http://www.cs.rmit.edu.au/~jz/fulltext/ipl01.pdf)
 
        "Cache-conscious collision resolution in string hash tables", N. 
Askitis and J. Zobel, Proceedings of the SPIRE String Processing and 
Information Retrieval Symposium, November 2005, pp. 91-102. LNCS 3772.  
(http://www.cs.rmit.edu.au/~jz/fulltext/spire05.pdf)
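
The core idea, as we understand it: on a successful lookup, move the
matching node to the front of its collision chain, so frequent words
(Zipf-distributed in text) are found after one or two probes. A minimal
sketch on a simplified node type (not APR's internal layout; the
interaction with find_entry()'s insert path is left out):

        #include "apr.h"      /* apr_ssize_t */
        #include <string.h>   /* memcmp */

        typedef struct entry entry;
        struct entry {
            unsigned int  hash;
            const void   *key;
            apr_ssize_t   klen;
            void         *val;
            entry        *next;
        };

        /* Search the chain rooted at *bucket; on a hit, splice the
         * node out and relink it at the head of the chain. */
        static void *find_mtf(entry **bucket, unsigned int hash,
                              const void *key, apr_ssize_t klen)
        {
            entry **hep, *he;
            for (hep = bucket; (he = *hep) != NULL; hep = &he->next) {
                if (he->hash == hash && he->klen == klen
                    && memcmp(he->key, key, klen) == 0) {
                    if (hep != bucket) {      /* not already the head */
                        *hep = he->next;      /* unlink from the chain */
                        he->next = *bucket;   /* push onto the head */
                        *bucket = he;
                    }
                    return he->val;
                }
            }
            return NULL;
        }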

Cheers,
Bing Swen



-----Original Message-----
From: Neil Conway [mailto:[email protected]] 
Sent: Tuesday, September 15, 2009 3:55 AM
To: [email protected]
Subject: apr_hash: Memory consumption

I notice that the implementation of expand_array() in
tables/apr_hash.c allocates a new bucket array without making any
attempt to release the memory allocated for the previous bucket array.
That wastes memory: since the array roughly doubles on each expansion,
the discarded arrays sum to about the size of the final one, i.e.
O(n) extra memory for a table that grows to n buckets.

Is there a rationale for the current behavior? If not, perhaps the
easiest fix is to allocate the bucket array in a subpool, and have
expand_array() create a new subpool for the new array and destroy the
old one. If a patch to do this would be accepted, let me know and I'll
submit one.
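
If it helps, here's a rough sketch of what that could look like
(assuming apr_hash_t gains a hypothetical array_pool member that owns
the bucket array; error handling omitted):

        static void expand_array(apr_hash_t *ht)
        {
            apr_pool_t *new_pool;      /* subpool owning the new array */
            apr_hash_index_t *hi;
            apr_hash_entry_t **new_array;
            unsigned int new_max = ht->max * 2 + 1;

            apr_pool_create(&new_pool, ht->pool);
            new_array = apr_pcalloc(new_pool,
                                    sizeof(*new_array) * (new_max + 1));

            /* rehash every entry into the new, larger array */
            for (hi = apr_hash_first(NULL, ht); hi;
                 hi = apr_hash_next(hi)) {
                unsigned int i = hi->this->hash & new_max;
                hi->this->next = new_array[i];
                new_array[i] = hi->this;
            }

            apr_pool_destroy(ht->array_pool);  /* frees the old array */
            ht->array = new_array;
            ht->array_pool = new_pool;         /* hypothetical field */
            ht->max = new_max;
        }

apr_hash_make() would create the initial subpool the same way.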

Neil
