Tomas, have you seen
http://www.postgresql.org/message-id/4b4dd67f.9010...@sigaev.ru ?
I have a very limited internet connection (no graphics), so I may have
missed something.
Oleg

On Wed, Dec 16, 2015 at 4:15 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> Hi,
>
> while working on the Hash Join improvements, I've been repeatedly
> running into the idea of a bloom filter - various papers on hash joins
> mention bloom filters as a way to optimize access to the hash table by
> doing fewer lookups, etc.
>
> Sadly, I've been unable to actually see any real benefit of using a
> bloom filter, which I believe is mostly due to NTUP_PER_BUCKET=1, which
> makes the lookups much more efficient (so the room for bloom filter
> improvements is narrow).
>
> The one case where a bloom filter might still help is when the bloom
> filter fits into the L3 cache (a few MBs) while the hash table (or more
> accurately the buckets) does not. Then there's a chance that the bloom
> filter (which needs to do multiple lookups) might be a win.
>
> But I think there's another case where a bloom filter might be way more
> useful in Hash Join - when we do batching. What we currently do is:
>
>    1) build the batches for the hash table (inner relation)
>
>    2) read the outer relation (usually the larger one), and split it
>       into batches just like the hash table
>
>    3) while doing (2), join the first batch and write the remaining
>       batches to disk (temporary files)
>
>    4) read the batches one by one (for both tables) and do the join
>
> Now, imagine that only some of the rows in the outer table actually
> match a row in the hash table. Currently we write those rows into the
> temporary files anyway, but with a bloom filter built on the whole hash
> table (all the batches at once) we can skip that for some types of
> joins.
>
> For an inner join we can immediately discard the outer row; for a left
> join we can immediately output the row. In both cases we completely
> eliminate the overhead of writing the tuple to the temporary file and
> then reading it again.
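> To make the idea concrete, here is a minimal sketch (written for this
> mail, not taken from the attached patch - the names bloom_create,
> bloom_add and bloom_test are made up, and sizing the filter is left to
> the caller):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>     #include <stdlib.h>
>
>     /*
>      * A bloom filter over the hash values of all inner tuples, across
>      * all batches. For simplicity nbits must be a power of two.
>      */
>     typedef struct BloomFilter
>     {
>         uint8_t    *bits;
>         uint32_t    nbits;      /* number of bits (power of two) */
>         int         nhashes;    /* number of hash functions */
>     } BloomFilter;
>
>     static BloomFilter *
>     bloom_create(uint32_t nbits, int nhashes)
>     {
>         BloomFilter *filter = malloc(sizeof(BloomFilter));
>
>         filter->nbits = nbits;
>         filter->nhashes = nhashes;
>         filter->bits = calloc(nbits / 8, 1);
>
>         return filter;
>     }
>
>     /* derive the i-th bit position from one 32-bit hash (double hashing) */
>     static uint32_t
>     bloom_bit(const BloomFilter *filter, uint32_t hash, int i)
>     {
>         uint32_t h2 = (hash * 0x9e3779b1) | 1;  /* odd, so never zero */
>
>         return (hash + (uint32_t) i * h2) & (filter->nbits - 1);
>     }
>
>     /* called for every inner tuple, no matter which batch it lands in */
>     static void
>     bloom_add(BloomFilter *filter, uint32_t hash)
>     {
>         for (int i = 0; i < filter->nhashes; i++)
>         {
>             uint32_t bit = bloom_bit(filter, hash, i);
>
>             filter->bits[bit / 8] |= (uint8_t) (1u << (bit % 8));
>         }
>     }
>
>     /* called for every outer tuple before writing it to a batch file */
>     static bool
>     bloom_test(const BloomFilter *filter, uint32_t hash)
>     {
>         for (int i = 0; i < filter->nhashes; i++)
>         {
>             uint32_t bit = bloom_bit(filter, hash, i);
>
>             if (!(filter->bits[bit / 8] & (1u << (bit % 8))))
>                 return false;   /* definitely no match in any batch */
>         }
>
>         return true;            /* maybe a match - batch it as usual */
>     }
>
> So while building the hash table we'd call bloom_add() with the hash
> value of every inner tuple, and while splitting the outer relation an
> outer tuple whose hash fails bloom_test() can be discarded (inner join)
> or output immediately as unmatched (left join) instead of being written
> to a temp file.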
> The attached patch is a PoC of this approach - I'm pretty sure it's not
> perfectly correct (e.g. I've only tried it with inner joins), but it's
> good enough for demonstrating the benefits. It's rather incomplete (see
> the end of this e-mail), and I'm mostly soliciting some early feedback
> at this point.
>
> The numbers presented here are for a test case like this:
>
>     CREATE TABLE dim (id INT, dval TEXT);
>     CREATE TABLE fact (id INT, fval TEXT);
>
>     INSERT INTO dim SELECT i, md5(i::text)
>       FROM generate_series(1,10000000) s(i);
>
>     -- repeat 10x
>     INSERT INTO fact SELECT * FROM dim;
>
> and a query like this:
>
>     SELECT COUNT(fval) FROM fact JOIN dim USING (id) WHERE dval < 'a';
>
> with different values in the WHERE condition to select a fraction of
> the inner 'dim' table - this directly affects what portion of the
> 'fact' table has a matching row, and also the size of the hash table
> (and thus the number of batches).
>
> Now, some numbers from a machine with 8GB of RAM (the 'fact' table has
> ~6.5GB of data, so there's actually quite a bit of memory pressure,
> forcing the temp files to disk etc.).
>
> With work_mem=16MB, it looks like this:
>
>     batches   filter   select.    bloom   master   bloom/master
>     -----------------------------------------------------------
>           4        1     6.25%    23871    48631         49.09%
>           8        2    12.50%    25752    56692         45.42%
>           8        3    18.75%    31273    57455         54.43%
>          16        4    25.01%    37430    62325         60.06%
>          16        5    31.25%    39005    61143         63.79%
>          16        6    37.50%    46157    63533         72.65%
>          16        7    43.75%    53500    65483         81.70%
>          32        8    49.99%    53952    65730         82.08%
>          32        9    56.23%    55187    67521         81.73%
>          32        a    62.49%    64454    69448         92.81%
>          32        b    68.73%    66937    71692         93.37%
>          32        c    74.97%    73323    72060        101.75%
>          32        d    81.23%    76703    73513        104.34%
>          32        e    87.48%    81970    74890        109.45%
>          32        f    93.74%    86102    76257        112.91%
>
> Here 'batches' is how many batches the join used, 'filter' is the value
> used in the WHERE condition, and 'select.' is the fraction of the 'dim'
> table matching the condition (and thus also of the 'fact' table).
> 'bloom' and 'master' are timings of the query in milliseconds, and
> 'bloom/master' compares the two runtimes - so for example 49% means the
> hash join with the bloom filter ran ~2x as fast.
>
> Admittedly, work_mem=16MB is quite low, but that's just a way to force
> batching. What really matters is the number of batches and the
> selectivity (how many tuples we can eliminate using the bloom filter).
>
> For work_mem=64MB it looks like this:
>
>     batches   filter   select.    bloom   master   bloom/master
>     -----------------------------------------------------------
>           1        1     6.25%    24846    23854        104.16%
>           2        2    12.50%    24369    45672         53.36%
>           2        3    18.75%    30432    47454         64.13%
>           4        4    25.01%    36175    59741         60.55%
>           4        5    31.25%    43103    62631         68.82%
>           4        6    37.50%    48179    64079         75.19%
>
> So initially it's a bit slower (this case does no batching, but the
> code is a bit silly and still does some checks even though it never
> builds the bloom filter). But once we start batching, it gets 2x as
> fast again, and then slowly degrades as the selectivity increases.
>
> Attached is a spreadsheet with results for various work_mem values, and
> also with a smaller data set (just 30M rows in the fact table), which
> easily fits into memory. Yet it shows similar gains, shaving off ~40%
> in the best case, suggesting that this is not just thanks to reducing
> I/O when the temp files are forced to disk.
>
> As I mentioned, the patch is incomplete in several ways:
>
>    1) It does not properly count the bloom filter (which may be quite
>       big) into work_mem.
>
>    2) It probably does not work for outer joins at this point.
>
>    3) Currently the bloom filter is used whenever we do batching, but
>       it should really be driven by selectivity too - it'd be good to
>       (a) estimate the fraction of 'fact' tuples having a match in the
>       hash table, and skip the bloom filter if that's over ~60% or so,
>       and (b) count the matches at runtime, and disable the bloom
>       filter if we exceed some threshold (a rough sketch of such a
>       check is in the PS below).
>
> But overall, this seems like a nice optimization opportunity for hash
> joins on large data sets, where batching is necessary.
>
> Ideas?
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
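> PS: Here is the rough sketch of the runtime check from point 3(b) -
> the names and thresholds are made up for illustration, not part of the
> attached patch:
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     #define BLOOM_MIN_SAMPLE      10000   /* don't decide on a tiny sample */
>     #define BLOOM_MAX_MATCH_FRAC  0.60    /* the ~60% guess from above */
>
>     typedef struct BloomUsage
>     {
>         uint64_t    lookups;    /* outer tuples tested against the filter */
>         uint64_t    maybes;     /* lookups that could not rule out a match */
>         bool        enabled;    /* flips to false once probing stops paying */
>     } BloomUsage;
>
>     static void
>     bloom_usage_update(BloomUsage *usage, bool maybe_match)
>     {
>         usage->lookups++;
>         if (maybe_match)
>             usage->maybes++;
>
>         /* once the sample is big enough, check if the filter still helps */
>         if (usage->enabled &&
>             usage->lookups >= BLOOM_MIN_SAMPLE &&
>             (double) usage->maybes / usage->lookups > BLOOM_MAX_MATCH_FRAC)
>             usage->enabled = false;     /* too few tuples eliminated */
>     }
>
> The outer-relation loop would feed the result of each bloom_test() into
> bloom_usage_update(), and stop consulting the filter once 'enabled'
> goes false.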