Re: [HACKERS] Re: custom hash-based COUNT(DISTINCT) aggregate - unexpectedly high memory consumption

Tomas Vondra Sat, 12 Oct 2013 18:02:37 -0700

On 11.10.2013 13:42, Huchev wrote:
> 
> gettimeofday(&start, NULL);
>     for (i = 0; i < VALUES; i++) {
>         state = XXH32_init(result);
>         XXH32_update(state, &i, 4);
>         XXH32_digest(state);
>     }
> gettimeofday(&end, NULL);
> 
> 
> This code is using the "update" variant, which is only useful when dealing
> with very large amount of data which can't fit into a single block of
> memory. This is obviously overkill for a 4-bytes-only test. 3 functions
> calls, a malloc, intermediate data book keeping, etc.
> 
> To hash a single block of data, it's better to use the simpler (and faster)
> variant XXH32() :
> 
> gettimeofday(&start, NULL);
>     for (i = 0; i < VALUES; i++) { XXH32(&i, 4, result); }
> gettimeofday(&end, NULL);
> 
> You'll probably get better results by an order of magnitude. For better
> results, you could even inline it (yes, for such short loop with almost no
> work to do, it makes a very sensible difference).


Not really. Even with this change it's slightly slower than crc32, at
least with the 32-bit integers. With 64-bit integers it's about 2x as
fast. But even then it's like ~1% of the total runtime, so any
improvements here are not really changing anything.

The inlining is not a good idea IMHO, because that'd be very different
from the actual usage (there won't be such tight loop). OTOH I'm not
sure if the compiler does not already inline that as an optimization.

> That being said, it's true that these advanced hash algorithms only
> shine with "big enough" amount of data to hash. Hashing a 4-bytes
> value into a 4-bytes hash is a bit limited exercise. There is no
> "pigeon hole" issue. A simple multiplication by a 32-bits prime would
> fare good enough and result in zero collision.

Agreed. I'll revisit this if/when I'll need to support larger data types
in this aggregate.

Tomas


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: custom hash-based COUNT(DISTINCT) aggregate - unexpectedly high memory consumption

Reply via email to