On Thu, Mar 1, 2018 at 4:04 PM, David Steele <da...@pgmasters.net> wrote: > On 3/1/18 6:52 PM, Tomas Vondra wrote: >> >> On 03/02/2018 12:31 AM, Andres Freund wrote: >>> >>> >>> >>> On March 1, 2018 3:22:44 PM PST, Tomas Vondra >>> <tomas.von...@2ndquadrant.com> wrote: >>>> >>>> >>>> >>>> On 03/01/2018 11:01 PM, Andres Freund wrote: >>>>> >>>>> Hi, >>>>> >>>>> On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote: >>>>>> >>>>>> So I've decided to revive the old patch, rebase it to current >>>> >>>> master, >>>>>> >>>>>> and see if we can resolve the issues that killed it in 2016. >>>>> >>>>> >>>>> There seems to be some good discussion in the thread. But the patch >>>>> arrived just before the last commitfest and certainly isn't a trivial >>>>> cleanup patch. Therefore I think it should be moved to the next CF? >>>>> >>>> >>>> It isn't a massive invasive patch either, though, so I object to moving >>>> it to 2018-09 right away. >>> >>> >>> Why do we have rules around not submitting large stuff to the last >>> cf, if we just not follow through? We're neck deep in patches that >>> are older. And you've already gotten a fair bit of feedback.. >>> >> >> It was not my intention to break (or even bend) the CF rules, of course. >> I haven't considered the patch to be "large stuff", while you do. I see >> Peter Geoghegan agrees with your conclusion on another thread, so go >> ahead and move it to 2018-09. > > > After reviewing the thread I also agree that this should be pushed to > 2018-09, so I have done so. > > I'm very excited by this patch, though. In general I agree with Peter that > a higher rate of false positives is acceptable to save memory. I also don't > see any reason why this can't be tuned with a parameter. Start with a > conservative default and allow the user to adjust as desired. > > -- > -David > da...@pgmasters.net >
Hi All -- I'm curious what has to be done to move this patch along. I looked through the patched and I noticed that the work for deciding whether to instantiate the bloom filter in single-batch mode is not complete yet (or if it's in this change, I can't find it), contrary to this comment: + * When starting in a single-batch mode, we do nothing initially. + * If the whole hash table fits into a single batch, we can get + * sufficiently accurate ndistinct estimate by simply counting + * occupied buckets (thanks to shooting for NTUP_PER_BUCKET=1), + * or perhaps we could use something more elaborate (e.g. HLL). + * But we only build the bloom filter if the hash table is large + * enough to exceed on-CPU caches (e.g. 4MB). I did some basic benchmarking with two tables and a simple where clause filtering 90% of rows and it does yield about a 1.2x speedup in my tests. In the pathological case where two tables are joined on a pk/fk relationship with no filtering, the penalty seems to be about 1-2%. Patrick