Re: Hash Joins vs. Bloom Filters / take 2

Patrick Krecker Tue, 06 Mar 2018 16:28:46 -0800

On Thu, Mar 1, 2018 at 4:04 PM, David Steele <[email protected]> wrote:
> On 3/1/18 6:52 PM, Tomas Vondra wrote:
>>
>> On 03/02/2018 12:31 AM, Andres Freund wrote:
>>>
>>>
>>>
>>> On March 1, 2018 3:22:44 PM PST, Tomas Vondra
>>> <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 03/01/2018 11:01 PM, Andres Freund wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
>>>>>>
>>>>>> So I've decided to revive the old patch, rebase it to current
>>>>
>>>> master,
>>>>>>
>>>>>> and see if we can resolve the issues that killed it in 2016.
>>>>>
>>>>>
>>>>> There seems to be some good discussion in the thread. But the patch
>>>>> arrived just before the last commitfest and certainly isn't a trivial
>>>>> cleanup patch. Therefore I think it should be moved to the next CF?
>>>>>
>>>>
>>>> It isn't a massive invasive patch either, though, so I object to moving
>>>> it to 2018-09 right away.
>>>
>>>
>>> Why do we have rules around not submitting large stuff to the last
>>> cf, if we just not follow through? We're neck deep in patches that
>>> are older. And you've already gotten a fair bit of feedback..
>>>
>>
>> It was not my intention to break (or even bend) the CF rules, of course.
>> I haven't considered the patch to be "large stuff", while you do. I see
>> Peter Geoghegan agrees with your conclusion on another thread, so go
>> ahead and move it to 2018-09.
>
>
> After reviewing the thread I also agree that this should be pushed to
> 2018-09, so I have done so.
>
> I'm very excited by this patch, though.  In general I agree with Peter that
> a higher rate of false positives is acceptable to save memory.  I also don't
> see any reason why this can't be tuned with a parameter. Start with a
> conservative default and allow the user to adjust as desired.
>
> --
> -David
> [email protected]
>


Hi All --

I'm curious what has to be done to move this patch along. I looked
through the patched and I noticed that the work for deciding whether to
instantiate the bloom filter in single-batch mode is not complete yet
(or if it's in this change, I can't find it), contrary to this
comment:

+        * When starting in a single-batch mode, we do nothing initially.
+        * If the whole hash table fits into a single batch, we can get
+        * sufficiently accurate ndistinct estimate by simply counting
+        * occupied buckets (thanks to shooting for NTUP_PER_BUCKET=1),
+        * or perhaps we could use something more elaborate (e.g. HLL).
+        * But we only build the bloom filter if the hash table is large
+        * enough to exceed on-CPU caches (e.g. 4MB).

I did some basic benchmarking with two tables and a simple where
clause filtering 90% of rows and it does yield about a 1.2x speedup in
my tests. In the pathological case where two tables are joined on a
pk/fk relationship with no filtering, the penalty seems to be about
1-2%.

Patrick

Re: Hash Joins vs. Bloom Filters / take 2

Reply via email to