On 9.7.2014 16:07, Robert Haas wrote:
> On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
>> Thinking about this a bit more, do we really need to build the
>> hash table on the first pass? Why not do this:
>> 
>> (1) batching - read the tuples, stuff them into a simple list -
>> don't build the hash table yet
>> 
>> (2) building the hash table - we have all the tuples in a simple
>> list, batching is done - we know exact row count, can size the
>> table properly - build the table
> 
> We could do this, and in fact we could save quite a bit of memory if 
> we allocated say 1MB chunks and packed the tuples in tightly instead 
> of palloc-ing each one separately.  But I worry that rescanning the 
> data to build the hash table would slow things down too much.
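For the sake of argument, the dense packing could look roughly like this
(just a sketch with made-up names, not actual backend code - the real
thing would presumably go through palloc and the memory context
machinery, and would have to deal with oversized tuples and allocation
failures):

#include <stdlib.h>
#include <string.h>
#include <stddef.h>

#define CHUNK_SIZE (1024 * 1024)        /* 1MB per chunk */

typedef struct TupleChunk
{
    struct TupleChunk *next;            /* previously filled chunks */
    size_t      used;                   /* bytes used in data[] */
    char        data[];                 /* packed tuple storage */
} TupleChunk;

/*
 * Copy one tuple into the current chunk, starting a new chunk when the
 * current one is full.  Returns the tuple's new address.
 */
static void *
chunk_store_tuple(TupleChunk **chunks, const void *tuple, size_t len)
{
    TupleChunk *cur = *chunks;
    void       *dst;

    if (cur == NULL || cur->used + len > CHUNK_SIZE)
    {
        cur = malloc(offsetof(TupleChunk, data) + CHUNK_SIZE);
        cur->next = *chunks;
        cur->used = 0;
        *chunks = cur;
    }

    dst = cur->data + cur->used;
    memcpy(dst, tuple, len);
    cur->used += len;
    return dst;
}

That gets rid of the per-tuple allocation overhead, and freeing the
batch means freeing a handful of chunks instead of walking every tuple.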

OK. I think we should shoot for no negative impact on well-estimated
plans, and the rescans might violate that. I need to give this some more
thought, but my current plan is this:

(1) size the buckets for NTUP_PER_BUCKET=1 (and use whatever number
    of batches this requires)

(2) build the batches as in the current patch

(3) if the hash table is too small, resize it

That's not really different from the current patch.
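In (pseudo-)C, the resize step might look something like this - again
just a sketch with made-up names, plain malloc/calloc instead of palloc,
and no error handling:

#include <stdlib.h>
#include <stdint.h>

typedef struct HashTuple
{
    struct HashTuple *next;             /* next tuple in the same bucket */
    uint32_t    hashvalue;              /* cached hash of the join key */
    /* tuple data follows */
} HashTuple;

typedef struct HashTable
{
    int         nbuckets;               /* current bucket count (power of 2) */
    long        ntuples;                /* tuples actually loaded */
    HashTuple **buckets;                /* bucket heads */
} HashTable;

static int
next_power_of_two(long n)
{
    int         p = 1;

    while (p < n)
        p *= 2;
    return p;
}

/*
 * After the batch is fully loaded we know the real tuple count; if the
 * initial nbuckets would give us more than ~1 tuple per bucket, grow
 * the bucket array and re-link all tuples into the new buckets.
 */
static void
resize_if_needed(HashTable *ht)
{
    int         new_nbuckets = next_power_of_two(ht->ntuples);
    HashTuple **new_buckets;
    int         i;

    if (new_nbuckets <= ht->nbuckets)
        return;                         /* the initial sizing was fine */

    new_buckets = calloc(new_nbuckets, sizeof(HashTuple *));

    /* walk the old buckets, re-link every tuple into the new ones */
    for (i = 0; i < ht->nbuckets; i++)
    {
        HashTuple  *tup = ht->buckets[i];

        while (tup != NULL)
        {
            HashTuple  *next = tup->next;
            int         b = tup->hashvalue & (new_nbuckets - 1);

            tup->next = new_buckets[b];
            new_buckets[b] = tup;
            tup = next;
        }
    }

    free(ht->buckets);
    ht->buckets = new_buckets;
    ht->nbuckets = new_nbuckets;
}

The point is that for a well-estimated plan the early return makes this
a no-op, and even when it does fire it's a single pass that only touches
the bucket pointers (the hash values are already cached), not a rescan
of the data.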

>> Also, maybe we could use a regular linear hash table [1], instead
>> of using the current implementation with NTUP_PER_BUCKET=1.
>> (Although, that'd be absolutely awful with duplicates.)
> 
> Linear probing is pretty awful unless your load factor is << 1.
> You'd probably want NTUP_PER_BUCKET=0.25, or something like that,
> which would eat up a lot of memory.

My experience is that a load factor of 0.5-0.75 is quite OK, and
NTUP_PER_BUCKET=1 gives you ~0.75 load on average, so it's not that
different. Anyway, the inability to handle duplicates reasonably is
probably a show-stopper.
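
(The ~0.75 figure assumes nbuckets gets rounded up to the next power of
two: the load factor ntuples/nbuckets then always ends up somewhere
between 0.5 and 1.0, depending on where the row count falls between two
powers of two - hence roughly 0.75 on average.)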

Tomas

