On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>
> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowle...@gmail.com> wrote:
> >>
> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.von...@2ndquadrant.com>
> >> wrote:
> >> > This reminds me of our attempts to add bloom filters to hash joins, which I
> >> > think ran into mostly the same challenge of deciding when the bloom
> >> > filter can be useful and is worth the extra work.
> >>
> >> Speaking of that, it would be interesting to see how a test where you
> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> >> be interesting to know if the planner is able to make a more suitable
> >> choice and also to see how all the work over the years to improve Hash
> >> Joins compares to the bsearch with and without the bloom filter.
> >
> >It would be interesting.
> >
> >It also makes one wonder about optimizing these into hash
> >joins...which I'd thought about over at [1]. I think it'd be a very
> >significant effort though.
> >
>
> I modified the script to also do the join version of the query. I can
> only run it on my laptop at the moment, so the results may be a bit
> different from those I shared before, but it's interesting, I think.
>
> In most cases it's comparable to the binsearch/bloom approach, and in
> some cases it actually beats them quite significantly. It seems to
> depend on how expensive the comparison is - for "int" the comparison is
> very cheap and there's almost no difference. For "text" the comparison
> is much more expensive, and there are significant speedups.
>
> For example, in the test with 100k lookups in an array of 10k elements
> and 10% match probability, the timings are these:
>
> master: 62362 ms
> binsearch: 201 ms
> bloom: 65 ms
> hashjoin: 36 ms
>
> I do think the explanation is fairly simple - the bloom filter
> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> some overhead to build and check the bits. The hash join probably
> eliminates a lot of the remaining comparisons, because the hash table
> is sized to have one tuple per bucket.
>
> Note: I also don't claim the PoC has the most efficient bloom filter
> implementation possible. I'm sure it could be made faster.
>
> Anyway, I'm not sure transforming this to a hash join is worth the
> effort - I agree that seems quite complex. But perhaps this suggests we
> should not be doing binary search and instead just build a simple hash
> table - that seems much simpler, and it'll probably give us about the
> same benefits.
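Very interesting numbers. If I'm reading the bloom results right, most of the win comes from the filter acting as a cheap pre-test in front of the expensive comparisons: a clear bit means the value definitely isn't in the array, so the binary search is skipped entirely, and only a "maybe" falls through to the comparison path. Just to make sure I'm following, here's the shape I have in mind as a self-contained sketch -- to be clear, this is not your PoC code; the names and the two-hash scheme are made up for illustration:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Illustrative sketch only (not the PoC): a bloom filter consulted
 * before the per-element binary search.  A clear bit means "definitely
 * not in the array", so the expensive comparisons are skipped entirely;
 * set bits only mean "maybe", so we fall through to bsearch().
 */
typedef struct
{
    const uint64_t *words;      /* the bit array, nbits bits wide */
    uint64_t    nbits;
} BloomFilter;

static inline bool
bloom_test_bit(const BloomFilter *bf, uint64_t hash)
{
    uint64_t    pos = hash % bf->nbits;

    return (bf->words[pos / 64] >> (pos % 64)) & 1;
}

static int
cmp_int(const void *a, const void *b)
{
    int         x = *(const int *) a;
    int         y = *(const int *) b;

    return (x > y) - (x < y);
}

/*
 * h1/h2 are two independent hashes of "value"; the same two hash
 * functions were applied to the array elements when the filter's bits
 * were set at build time.
 */
static bool
array_contains(const BloomFilter *bf, uint64_t h1, uint64_t h2,
               const int *sorted_elems, size_t nelems, int value)
{
    if (!bloom_test_bit(bf, h1) || !bloom_test_bit(bf, h2))
        return false;           /* definite miss: skip the comparisons */

    /* possible hit (or false positive): fall back to binary search */
    return bsearch(&value, sorted_elems, nelems,
                   sizeof(int), cmp_int) != NULL;
}

That would also match the intuition that the gain is bigger for "text" than for "int", since the comparisons being skipped are the expensive part.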
Building a simple hash table is actually what I originally thought about doing, but I chose binary search since it seemed a lot easier to get off the ground.

If we instead build a hash table, is there anything else we need to be concerned about -- for example, work_mem? I suppose for the binary search we already have to expand the array into memory, so perhaps the concern isn't all that meaningful relative to that...

I was looking earlier at what our standard hash table implementation is, and it was less obvious what would be needed to set that up (which is why binary search seemed like a faster path to a proof of concept). If you happen to have any pointers to similar usages I should look at, please let me know.

James
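P.S. To make the hash table idea a bit more concrete, below is roughly the shape I have in mind, as a standalone C sketch rather than anything resembling actual PostgreSQL code -- the names are made up, it hashes plain C strings instead of datums, and it ignores the work_mem accounting question above. The point is just the lookup strategy: build the table once when the array is first evaluated, after which each probe costs one hash plus (typically) at most one real comparison, instead of the log2(n) comparisons of the binary search.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
    size_t      nbuckets;       /* power of two, at least 2x the element count */
    const char **buckets;       /* NULL = empty slot (open addressing) */
} HashedArray;

static uint64_t
hash_string(const char *s)
{
    /* FNV-1a, purely for illustration */
    uint64_t    h = UINT64_C(0xcbf29ce484222325);

    for (; *s; s++)
        h = (h ^ (unsigned char) *s) * UINT64_C(0x100000001b3);
    return h;
}

/* Build once, when the array constant is first evaluated. */
static HashedArray *
hashed_array_build(const char **elems, size_t nelems)
{
    HashedArray *ha = malloc(sizeof(HashedArray));      /* error handling omitted */
    size_t      i;

    ha->nbuckets = 1;
    while (ha->nbuckets < nelems * 2)   /* keep the load factor <= 0.5 */
        ha->nbuckets <<= 1;
    ha->buckets = calloc(ha->nbuckets, sizeof(const char *));

    for (i = 0; i < nelems; i++)
    {
        size_t      b = hash_string(elems[i]) & (ha->nbuckets - 1);

        while (ha->buckets[b] != NULL)  /* linear probing */
            b = (b + 1) & (ha->nbuckets - 1);
        ha->buckets[b] = elems[i];
    }
    return ha;
}

/* One hash plus, typically, at most one strcmp() per probe. */
static bool
hashed_array_contains(const HashedArray *ha, const char *value)
{
    size_t      b = hash_string(value) & (ha->nbuckets - 1);

    while (ha->buckets[b] != NULL)
    {
        if (strcmp(ha->buckets[b], value) == 0)
            return true;
        b = (b + 1) & (ha->nbuckets - 1);
    }
    return false;
}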