Re: [HACKERS] join removal
I took a first crack at coding up an implementation of relation_is_distinct_for() tonight.

I am not sure if this will help or not, but on the 8.4 code base we implemented two functions:

- getCandidateKeys() - recursively traverses a tree from a given node down to the leaf nodes and determines the candidate keys for the intermediate relation produced by that node
- getJoinCard() - determines the join cardinality of a hash join node (1:1, 1:N, etc.) based on the candidate keys of the two input relations

It worked pretty well for our tests with equi-joins, but I am sure it is missing many cases. I have attached the code we used (cardinalityFuncs.c). Some of the helper functions may also be useful (convertUniqueIndexesToCandidateKeys, getJoinAttrs). -- Ramon Lawrence cardinalityFuncs.c Description: cardinalityFuncs.c -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
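To make the idea concrete, here is a minimal sketch of the getCandidateKeys()/getJoinCard() approach, not the attached cardinalityFuncs.c: candidate keys are modeled as attribute bitmasks, and a join side is "unique" when its equi-join attributes cover one of its candidate keys. All names and types here are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int AttrSet;   /* bitmask of attribute numbers */

typedef enum { CARD_1_1, CARD_1_N, CARD_N_1, CARD_M_N } JoinCard;

/* true if some candidate key of a relation is a subset of its join attrs */
static bool
join_attrs_cover_key(const AttrSet *keys, int nkeys, AttrSet joinattrs)
{
    for (int i = 0; i < nkeys; i++)
        if ((keys[i] & ~joinattrs) == 0)
            return true;
    return false;
}

/* classify an equi-join given each side's candidate keys and join attrs */
static JoinCard
get_join_card(const AttrSet *lkeys, int nl, AttrSet ljoin,
              const AttrSet *rkeys, int nr, AttrSet rjoin)
{
    bool luniq = join_attrs_cover_key(lkeys, nl, ljoin);
    bool runiq = join_attrs_cover_key(rkeys, nr, rjoin);

    if (luniq && runiq)
        return CARD_1_1;
    if (luniq)
        return CARD_1_N;    /* each right tuple matches at most one left tuple */
    if (runiq)
        return CARD_N_1;
    return CARD_M_N;
}
```

A Part-LineItem join on p_partkey = l_partkey would classify as 1:N under this scheme, since p_partkey is a candidate key of Part but l_partkey is not one of LineItem.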
Re: [HACKERS] HashJoin w/option to unique-ify inner rel
Upon further review, it appears that a big part of this problem is that cost_hashjoin() doesn't understand that it needs to cost semi-joins differently from inner or left joins. The bogus logic looks to be right here:

startup_cost += hash_qual_cost.startup;
run_cost += hash_qual_cost.per_tuple * outer_path_rows *
    clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

Of course, when the join type is JOIN_SEMI, we're going to stop looking after we find the first match, so this estimate is really far off.

The 8.3 version of cost_hashjoin() had a line like this:

joininfactor = join_in_selectivity(&path->jpath, root);

and a cost function like this:

run_cost += hash_qual_cost.per_tuple * outer_path_rows *
    clamp_row_est(inner_path_rows * innerbucketsize) * joininfactor * 0.5;

This compensated for IN joins being able to stop scanning a bucket once a match is found. You may consider something similar for a semi-join. Having experimented with a lot of this code recently, I see some potential for improvement in the bucket size estimates, etc., but it is a non-trivial problem.

I tested a similar query on TPCH 100M1Z on version 8.3:

select * from customer where c_custkey in (select o_custkey from orders)

and found that hash aggregate was marginally faster. If you turn off hash aggregation, it selects an IN hash join which is about 5% slower, and the planner's estimate is not too far off. So it would definitely be possible to modify the cost function appropriately.

It's tempting to have Hash cheat and just peek at the node beneath it to see if it's a HashAggregate, in which case it could call a special method to request the whole hash. But it would have to know that the HashAggregate is just a plain unique-ification and not implementing a GROUP BY. If HashAggregate is faster, then the question is whether you can do better by avoiding building the hash structure twice. I haven't considered all the possibilities, but the situation you have used as an example, an IN query, seems workable.
Instead of translating to a hash aggregate/hash/hash join query plan, it may be possible to create a special hash join node that unique-ifies its inner input. The benefits are that the planner knows about it (instead of changing the execution plan), you can be more accurate on costs for the hash join, and you can optimize by building only one hash table. A challenge that must be dealt with is handling the multi-batch case. It appears that hash aggregate does not currently handle this, but I may be mistaken. -- Ramon Lawrence
Re: [HACKERS] a few crazy ideas about hash joins
While investigating some performance problems recently I've had cause to think about the way PostgreSQL uses hash joins. So here are a few thoughts. Some of these have been brought up before.

1. When the hash is not expected to spill to disk, it preserves the pathkeys of the outer side of the join. If the optimizer were allowed to assume that, it could produce significantly more efficient query plans in some cases.

This is definitely possible, but you will have to dynamically modify the execution path if the hash join ends up being more than one batch.

3. Avoid building the exact same hash table twice in the same query. This happens more often than you'd think. For example, a table may have two columns creator_id and last_updater_id which both reference person (id). If you're considering a hash join between paths A and B, you could conceivably check whether what is essentially a duplicate of B has already been hashed somewhere within path A. If so, you can reuse that same hash table at zero startup cost.

4. As previously discussed, avoid hashing for DISTINCT and then hashing the results again for a hash join on the same column with the same operators.

Thoughts on the value and/or complexity of implementation of any of these?

I would be interested in working with you on any of these changes to hash join if you decide to pursue them. I am especially interested in looking at the hash aggregation code and potentially improving its efficiency. We have implemented a multi-way hash join (it can join more than 2 tables at a time) which may help with cases #3 and #4. Performance results look very good, and we are planning on building a patch for this over the summer. -- Ramon Lawrence
Re: [HACKERS] a few crazy ideas about hash joins
I would be especially interested in using a shared memory hash table that *all* backends can use - if the table is mostly read-only, as dimension tables often are in data warehouse applications. That would give zero startup cost and significantly reduced memory use.

I think that's a non-starter due to visibility issues and the handling of inserts and updates. Even just reusing a hash table from one execution in a later execution of the same plan would be tricky, since we would have to expire it if the snapshot changes. If your data set is nearly read-only, materialized views would be a better way to go and would require no hash join changes.

The idea of perfect hash functions for dimension tables is very interesting. If the data set is nearly static, it is possible to compute them once, in a few minutes' time for a million-tuple table, and then re-use them until the data changes. The research has shown it is possible, but I do not know if anyone has actually implemented it in a real DBMS. An implementation could be something to try if there is interest. -- Ramon Lawrence
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
I think you missed the point of the performance questions. It wasn't about avoiding extra simple if-tests in the per-tuple loops; a few of those are certainly not going to add measurable cost given how complex the code is already. (I really don't think you should be duplicating hunks of code to avoid adding such tests.) Rather, the concern was that if we are dedicating a fraction of available work_mem to this purpose, that reduces the overall efficiency of the regular non-IM code path, principally by forcing the creation of more batches than would otherwise be needed. It's not clear that the savings for IM tuples always exceeds this additional cost.

I misunderstood the concern. So, there is no issue with the patch when it is disabled (single-batch case or multi-batch with no skew)? No memory is allocated when the optimization is off, so these cases will not affect the number of batches or cause re-partitioning.

* The IM hashtable is only needed during the first-batch processing; once we've completed the first pass over the outer relation there is no longer any need for it, unless I'm misunderstanding things completely. Therefore it really only competes for space with the regular first batch. However, the damage to nbatches will already have been done; in effect, we can expect that each subsequent batch will probably only use (100 - IM_WORK_MEM_PERCENT)% of work_mem. The patch seems to try to deal with this by keeping IM_WORK_MEM_PERCENT negligibly small, but surely that's mostly equivalent to fighting with one hand tied behind your back. I wonder if it'd be better to dedicate all of work_mem to the MCV hash values during the first pass, rather than allowing them to compete with the first regular batch.

The IM hash table doesn't need to be very large in order to produce a substantial benefit, because there are only going to be ~100 MCVs in the probe table, and each of those may well be unique in the build table.
But no matter what size you choose for it, there's some danger that it will push us over the edge into more batches, and if the skew doesn't turn out to be enough to make up for that, you lose. I'm not sure there's any way to completely eliminate that unpleasant possibility.

Correct - the IM table only competes with the first batch during processing and is removed after the first pass. Also, it tends to be VERY small, as the default of 100 MCVs usually results in 100 tuples being in the IM table, which is normally much less than 2% of work_mem. We get almost all the benefit with 100-1 MCVs with little downside risk. Making the IM table larger (the size of work_mem) is both not possible (there are not that many MCVs) and carries a bigger downside risk if we get it wrong.

* The IM hashtable creates an additional reason why nbatch might increase during the initial scan of the inner relation; in fact, since it's an effect not modeled in the initial choice of nbatch, it's probably going to be a major reason for that to happen. Increasing nbatch on the fly isn't good because it results in extra I/O for tuples that were previously assigned to what is now the wrong batch. Again, the only answer the patch has for this is to try not to use enough of work_mem for it to make a difference. Seems like instead the initial nbatch estimate needs to account for that.

The possibility of the 1-2% IM_WORK_MEM_PERCENT causing a re-batch exists but is very small. The number of batches is calculated in ExecChooseHashTableSize (used during costing in costsize.c) as ceil(inner_rel_bytes/work_mem), rounded up to the next power of 2. Thus, hash join already wastes some of its work_mem allocation due to rounding. For instance, if nbatch is calculated as 3 and then rounded up to 4, only 75% of work_mem is used for each batch. This leaves 25% of work_mem unaccounted for, which may be used by the IM table (and also compensates for build skew).
Clearly, if nbatch is exactly 4, then this unaccounted space is not present, and if the optimizer is exact in its estimates, the extra 1-2% may force a re-partition. A solution may be to re-calculate nbatch, factoring in the extra 1-2%, during ExecHashTableCreate (nodeHash.c), which calls ExecChooseHashTableSize again before execution. The decision is whether to modify ExecChooseHashTableSize itself (which is also used during costing) or to make a modified ExecChooseHashTableSize function that is only used once in ExecHashTableCreate. We have tried to change the original code as little as possible, but it is possible to modify ExecChooseHashTableSize and the hash join cost function to be skew-optimization aware. -- Ramon Lawrence
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
From: Tom Lane

Heikki's got a point here: the planner is aware that hashjoin doesn't like skewed distributions, and it assigns extra cost accordingly if it can determine that the join key is skewed. (See the bucketsize stuff in cost_hashjoin.) If this patch is accepted, we'll want to tweak that code.

Those modifications would make the optimizer more likely to select hash join, even with skewed distributions. For the TPC-H data set that we are using, the optimizer always picks hash join over merge join (single or multi-batch). Since the current patch does not change the cost function, there is no change in the planning cost. It may or may not be useful to modify the cost function, depending on the effect on planning cost.

Still, that has little to do with the current gating issue, which is whether we've convinced ourselves that the patch doesn't cause a performance decrease for cases in which it's unable to help.

Although we have not seen an overhead when the optimization is by-passed, we are looking at some small code changes that would guarantee that no extra statements are executed for the single-batch case. Currently, an "is the optimization on" check is performed on each probe tuple which, although minor, can be avoided. The patch's author, Bryce Cutt, is defending his Master's thesis Friday morning (on this work), so we will provide some updated code right after that. Since these code changes are small, they should not affect people trying to test the performance of the current patch. -- Ramon Lawrence
Re: [HACKERS] Proposed Patch to Improve Performance ofMulti-BatchHash Join for Skewed Data Sets
They're automatically generated by the dbgen utility, a link to which was originally published somewhere in this thread. That tool creates a few text files suitable (with some tweaking) for a COPY command. I've got the original files... the .tbz I just made is 1.8 GB :) Anyone have someplace they'd like me to drop it?

Just a note that the Z7 data set is really a uniform data set, Z0. The generator only accepts skew in the range from Z0 to Z4. The uniform Z0 data set is typically used when benchmarking data warehouses. It turns out the data is not perfectly uniform, as the top 100 suppliers and products represent 2.3% and 1.5% of LineItem, respectively. This is just enough skew that the optimization will sometimes be triggered in the multi-batch case (currently 1% skew is the cutoff).

I have posted a pg_dump of the TPCH 1G Z0 data set at: http://people.ok.ubc.ca/rlawrenc/tpch1g0z.zip (Note that ownership commands are in the dump, and make sure to VACUUM ANALYZE after the load.) I can also post the input text files if that is easier. -- Ramon Lawrence
Re: [HACKERS] Proposed Patch to Improve Performance ofMulti-BatchHash Join for Skewed Data Sets
That seems VERY useful - can you post the other ones (Z1, etc.) so I can download them all?

The Z1 data set is posted at: http://people.ok.ubc.ca/rlawrenc/tpch1g1z.zip

I have not generated Z2, Z3, Z4 for 1G, but I can generate the Z2 and Z3 data sets, and in an hour or two they will be at:

http://people.ok.ubc.ca/rlawrenc/tpch1g2z.zip
http://people.ok.ubc.ca/rlawrenc/tpch1g3z.zip

Note that Z3 and Z4 are not really useful, as the skew is extreme (98% of the probe relation is covered by the top 100 values). Using the Z2/Z3 data sets should be enough to show the huge win if you really do have a skewed data set. BTW, is there any particular form/options of the pg_dump command that I should use to make the dump? -- Ramon Lawrence
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
-Original Message- From: Robert Haas

Sadly, there seem to be a number of cases in the Z7 database where the optimization makes things significantly worse (specifically, queries 2, 3, and 7, but especially query 3). Have you investigated what is going on there? I had thought that we had sufficient safeguards in place to prevent this optimization from kicking in in cases where it doesn't help, but it seems not. There will certainly be real-world databases that are more like Z7 than Z1.

I agree that there should be no noticeable performance difference when the optimization is not used (single-batch case or no skew). I think the patch achieves this. The optimization is not used in those cases, but we will review to see if it is the code that by-passes the optimization that is causing a difference.

The query #3 timing difference is primarily due to a flaw in the experimental setup. For some reason, query #3 got executed before #4 with the optimization on, and after #4 with the optimization off. This skewed the results for all runs (due to buffering issues), but it is especially noticeable for Z7. Note how query #4 is always faster for the optimization-on version even though the optimization is not actually used for those queries (because they were one batch). I expect that if you run query #3 on Z7 in isolation, the results should be basically identical.

I have attached the SQL script that Joshua sent me. The raw data I have posted at: http://people.ok.ubc.ca/rlawrenc/test.output -- Ramon Lawrence test.sql Description: test.sql
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
From: pgsql-hackers-ow...@postgresql.org on behalf of Robert Haas

I think what we need here is some very simple testing to demonstrate that this patch produces a speed-up even when the inner side of the join is a joinrel rather than a baserel. Can you suggest a single query against the skewed TPCH dataset that will result in two or more multi-batch hash joins? If so, it should be a simple matter to run that query with and without the patch and verify that the former is faster than the latter.

This query will have the outer relation be a joinrel rather than a baserel:

select count(*) from supplier, part, lineitem where l_partkey = p_partkey and s_suppkey = l_suppkey;

The approach collects statistics on the outer relation (not the inner relation), so the code had to have the ability to determine a stats tuple on a joinrel in addition to a baserel. Joshua sent us some preliminary data with this query and others and indicated that we could post it. He wanted time to clean it up and re-run some experiments, but the data is generally good and the algorithm performs as expected. I have attached this data to the post. Note that the last set of data (although labelled as Z7) is actually an almost zero-skew database and represents the worst case for the algorithm (for most queries the optimization is not even used). -- Ramon Lawrence JoshuaTolleyData.xls Description: JoshuaTolleyData.xls
Re: [HACKERS] The testing of multi-batch hash joins with skewed data sets patch
The idea I came up with for benchmarking was a little similar to what I remember from the original tests. I have a sales orders table and a products table. My version of the sales orders table contains a customer column. Data for 10 customers is populated into the sales orders table; customer 1 has a totally non-skewed set of orders, where customer 10 has the most skew. I've done this by creating 10,000 products, each with a product code that has been cast into a varchar and padded up to 5 chars in length with '0's. Each customer has the same number of rows in the sales orders table; customer 10 mostly orders products that, when cast as INT, are evenly divisible by 10, where customer 2 mostly orders products that are evenly divisible by 2. You get the idea. Currently I'm unsure of the best way to ensure that the hash join goes into more than one batch, apart from just making the dataset very large. Does anyone have any thoughts about the way I plan to go about benchmarking?

Thank you for testing the patch - it is very much appreciated. If you use the test version of the patch, it will print out statistics that will be helpful. I think your approach should work. I have two comments:

1) You will need to scale the data set larger to go multi-batch. Even a minimum work_mem of 1 MB may be enough to keep the product table in memory unless each tuple is large. For the TPC-H tests, the size of product was 200,000 tuples for the 1 GB tests and 2 million tuples for the 10 GB tests.

2) The current formula may not generate the skew you expect on sales.productcode. To simplify the discussion, I will only consider customer 1 (c1) and customer 10 (c10) and a total of 100,000 sales (50,000 for each customer). If I look at product 10, for instance, it will be ordered 50,000/1,000 = 50 times by c10 and 50,000/10,000 = 5 times by c1, for a total of 55 times. Product 10 represents only 0.055% of all sales.
For all mod-10 products combined, they represent 55% of sales, which is significant BUT requires us to store 10% of the product table in memory (1,000 tuples, all of which need to be in the stats record).

This two-customer test would be interesting. There should be no benefit for customer 1. In fact, you would see the worst case, as you would plan for skew but not get any benefit. For customer 10 you should see a benefit if your stats have 1,000 tuples. The issue is that you cannot scale this test easily. Increasing by a factor of 10 would require stats of 10,000, and increasing by a factor of 100 is not possible.

The Zipfian distribution used in the previous experiments causes the top few values to be exponentially better than the average value. For instance, the top 100 products may represent 10 to 50% of total sales even for 1 million products. In the previous case, the top 100 products represent only 0.0055% of total sales for 1 million products. This level of skew would be ignored by the algorithm, which has a cutoff requiring that at least 1% of the probe relation match the skew values buffered in memory.

To test higher values of skew, you could set up the experiment like this (it may be necessary to scale down by a factor of 10 depending on your hardware):

products - 1 million
sales - 10 million
customers - 5

- Each customer has 2 million orders.
- Customer 1 orders each product equally.
- Customer 2 orders only products divisible by 10^2, equally.
- Customer 5 orders only products divisible by 10^5, equally.

It is customer 5's orders that produce most of the skew, as every 100,000th product will be ordered 200,000 times (customer 5 only orders 10 products). Then there is a huge benefit for customer 5 from keeping these 10 products in memory during the join. The benefit decreases for each customer, all the way down to customer 1, which will see no benefit.
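The arithmetic in the two-customer example can be checked mechanically. This small program assumes the sizes used in the discussion above (10,000 products, 100,000 sales split evenly between the two customers); all names are invented:

```c
#include <assert.h>
#include <math.h>

/*
 * Sanity check of the two-customer example: customer 1 orders all
 * products equally; customer 10 orders only products divisible by 10.
 * Assumed sizes: 10,000 products, 100,000 sales (50,000 per customer).
 */
#define NPRODUCTS      10000.0
#define SALES_PER_CUST 50000.0

/* total sales of one product that is divisible by 10 */
static double
sales_of_mod10_product(void)
{
    double from_c1  = SALES_PER_CUST / NPRODUCTS;           /* orders everything */
    double from_c10 = SALES_PER_CUST / (NPRODUCTS / 10.0);  /* only mod-10 items */

    return from_c1 + from_c10;
}
```

The numbers reproduce the point being made: each "popular" product is only 0.055% of total sales, well under the 1% skew cutoff, even though the mod-10 products together carry 55% of sales.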
-- Ramon Lawrence
Re: [HACKERS] The testing of multi-batch hash joins with skewed data sets patch
-Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane

But really there are two different performance regimes here, one where the hash data is large enough to spill to disk and one where it isn't. Reducing work_mem will cause data to spill into kernel disk cache, but if the total problem fits in RAM then very possibly that data won't ever really go to disk. So I suspect such a test case will act more like the small-data case than the big-data case. You probably actually need more data than RAM to be sure you're testing the big-data case.

Is there a way to limit the kernel disk cache? (We are running SUSE Linux.) We have been testing hybrid hash join performance and have seen that the performance varies considerably less than expected, even for dramatic changes in work_mem and the number of I/Os that appear to be performed. -- Ramon Lawrence
[HACKERS] Help with Join Performance Testing
A hash join modification patch is under review for 8.4 that needs performance testing. We would appreciate help with this testing. A testing version of the patch is attached, in addition to testing instructions and where to retrieve a sample data set. The basic idea of the patch is that it reduces disk operations for large multi-batch hash joins where there is skew in the probe relation. The patch collects statistics on performance benefits when using the optimization. -- Ramon Lawrence and Bryce Cutt

Overview

This document provides an overview of how to test the histojoin patch. The patch performs skew optimization for large, multi-batch hash joins.

Installation

The patch should compile cleanly against CVS head.

Execution

The skew optimization can be turned on by:
set enable_hashjoin_usestatmcvs = on;
and off by:
set enable_hashjoin_usestatmcvs = off;

If a hash join has detectable skew in the larger probe relation, then the skew optimization will output the amount of skew it sees and the number of tuples it will buffer in memory to exploit that skew. When the hash join completes, it will output statistics on the number of tuples actually matched by the in-memory (IM) skew partition and the number of tuples in partition 0. The improvement in join I/Os is also given. Sample (from LI-P TPCH 10G 1Z):

Values: 100 Skew: 0.27 Est. tuples: 59986052.00 Batches: 512 Est. Save: 16114709.99
Total Inner Tuples: 200
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778
Total Output Tuples: 59986052
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98

Data Set

A sample test data set is TPC-H scale factor 1 GB.
A pg_dump can be downloaded from: http://people.ok.ubc.ca/rlawrenc/tpch1g1z.zip The larger 10 GB data sets are available on request. You can also download the generator itself (works only on Windows) at: http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip The only joins with significant skew in the database are Part-LineItem and Supplier-LineItem.

Result Notes

1) The percentage benefit increases with the amount of skew. Relations with no skew are not affected. Relations with minimal skew show no noticeable improvement or negative impact.

2) Since disk I/O in the join is only one part of the query execution time, overall execution times do not improve by the same amount as the reduction in disk I/Os. For CPU-bound queries, the disk I/O improvement may not have a significant effect on the overall time.

3) The relations are quite large. Thus, queries with SELECT * that join several relations are very costly, and the generation of the output tuples dominates the execution time (especially if executing the query through a client such as pgAdmin).

Previous Results

The join LineItem-Part on TPCH 1G 1Z shows about a 26% improvement in I/Os performed during the join and about a 5-10% improvement in overall time. The join LineItem-Supplier is similar. Data sets with higher skew show even better performance. For example, LineItem-Part on TPCH 10G 2Z has 90% of probe relation tuples matching the 100 most common values. The improvement in I/Os is about 90% and in time about 50%.
Some sample test queries:

Query #1a: SELECT * FROM Part, Lineitem WHERE p_partkey = l_partkey;
Query #1b: SELECT count(*) FROM Part, Lineitem WHERE p_partkey = l_partkey;
Query #2a: SELECT * FROM Supplier, Lineitem WHERE s_suppkey = l_suppkey;
Query #2b: SELECT count(*) FROM Supplier, Lineitem WHERE s_suppkey = l_suppkey;
Query #3a: SELECT * FROM Part, Lineitem, Supplier WHERE p_partkey = l_partkey and s_suppkey = l_suppkey;
Query #3b: SELECT count(*) FROM Part, Lineitem, Supplier WHERE p_partkey = l_partkey and s_suppkey = l_suppkey;

histojoin_testing.patch Description: histojoin_testing.patch
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Here is a cleaned-up version. I fixed a number of whitespace issues, improved a few comments, and rearranged one set of nested if-else statements (hopefully without breaking anything in the process). Josh / eggyknap - Can you rerun your performance tests with this version of the patch?

To help with testing, we have constructed a patch specifically for testing. The patch is the same as Robert's version except that it tracks and prints out statistics during the join on how many tuples are affected, and it has the enable_hashjoin_usestatmcvs variable defined so that it is easy to turn skew handling on/off. This is useful because, although the patch reduces the number of I/Os performed, the improvement may not be seen in some queries which are dominated by other cost factors (non-skew joins, CPU time, time to scan input relations, etc.). The sample output looks like this:

LI-P Values: 100 Skew: 0.27 Est. tuples: 59986052.00 Batches: 512 Est. Save: 16114709.99
Total Inner Tuples: 200
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778
Total Output Tuples: 59986052
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98

The other change is that the system calculates the skew and will not use the in-memory skew partition if the skew is less than 1%.

Finally, we have attached some performance results for the TPCH 10G data set (skew factors z=1 and z=2). For the Customer-Orders-Lineitem-Part query that Josh was testing, we see no overall time difference that is significant compared to experimental error (although there is an I/O benefit for the Lineitem-Part join). This query's cost is dominated by the non-skew joins of Customer-Orders and Orders-Lineitem and by output tuple construction.
The joins with skew, Lineitem-Supplier and Lineitem-Part, show significantly improved performance. Note how the statistics show that the percentage I/O savings is directly proportional to the skew. However, the overall query time savings is always less than this as there are other costs such as reading the relations, performing the hash comparisons, building the output tuples, etc. that are unaffected by the optimization. At this point, we await further feedback on what is necessary to get this patch accepted. We would also like to thank Josh and Robert again for their review time. Sincerely, Ramon Lawrence and Bryce Cutt histojoin_testing.patch Description: histojoin_testing.patch TPC-H 10G Skew Factor Z=1 results - LI-P Regular HJ: (time in milliseconds) 990344 1022562 1071250 1003219 1049000 989953 Average: 1021054.667 LI-P Skew-enabled HJ: (time in milliseconds) 883593 960860 934906 1007282 937406 948078 Average: 945354.1667 % Difference: 7.4% LI-P Values: 100 Skew: 0.27 Est. tuples: 59986052.00 Batches: 512 Est. Save: 16114709.99 Total Inner Tuples: 200 IM Inner Tuples: 83 Batch Zero Inner Tuples: 3941 Batch Zero Potential Inner Tuples: 3941 Total Outer Tuples: 59986052 IM Outer Tuples: 16074146 Batch Zero Outer Tuples: 98778 Batch Zero Potential Outer Tuples: 98778 Total Output Tuples: 59986052 IM Output Tuples: 16074146 Batch Zero Output Tuples: 98778 Batch Zero Potential Output Tuples: 98778 Percentage less tuple IOs than HHJ: 25.98 LI-P-S Regular HJ: (time in milliseconds) 1833016 1567515 1504625 Average: 1635052 LI-P-S Skew-enabled HJ: (time in milliseconds) 883593 1280297 1423984 Average: 1195958 % Difference: 27% LI-S Values: 100 Skew: 0.19 Est. tuples: 59986052.00 Batches: 32 Est. 
Save: 11097357.16
Total Inner Tuples: 10  IM Inner Tuples: 78
Batch Zero Inner Tuples: 3123  Batch Zero Potential Inner Tuples: 3125
Total Outer Tuples: 59986052  IM Outer Tuples: 11563695
Batch Zero Outer Tuples: 1577432  Batch Zero Potential Outer Tuples: 1693632
Total Output Tuples: 59986052  IM Output Tuples: 11563695
Batch Zero Output Tuples: 1577432  Batch Zero Potential Output Tuples: 1693632
Percentage less tuple IOs than HHJ: 19.61

(LI-S)-P  Values: 100  Skew: 0.27  Est. tuples: 59986052.00  Batches: 512  Est. Save: 16114709.99
Total Inner Tuples: 200  IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941  Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052  IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778  Batch Zero Potential Outer Tuples: 98778
Total Output Tuples: 59986052  IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778  Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98

TPC-H 10G Skew Factor Z=2 results -

LI-P Regular HJ (times in milliseconds): 505672, 424922, 303250, 361610, 358125  Average: 390715.8
LI-P Skew-enabled HJ (times in milliseconds): 219078, 210078, 212938, 210094, 212500  Average: 212937.6
% Difference: 45.5%
Re: [HACKERS] Potential Join Performance Issue
Has this been completed? TODO item? I'd be more inclined to deal with the issue by trying to establish a safety margin in the estimate of whether the hash will go multi-batch. IOW we should disuse_physical_tlist if the hash is estimated to be close to but still within one batch. I do not know how this issue was resolved. It is an issue that is very important for multi-batch hash joins. The simplest resolution is to disable physical_tlist on the outer relation for hash joins of more than one batch. However, as discussed in the thread, more sophisticated solutions are also viable. -- Ramon Lawrence -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
I thought about this, but upon due reflection I think it's the wrong approach. Raising work_mem is a pretty common tuning step - it's 4MB even on my small OLTP systems, and in a data-warehousing environment, where this optimization will bring the most benefit, it could easily be higher. Furthermore, if someone DOES change the statistics target for that column to 10,000, there's a pretty good chance that they had a reason for doing so (or at the very least it's not for us to assume that they were doing something stupid). I think we need some kind of code to try to tune this based on the actual situation. We might try to size the in-memory hash table to be the largest value that won't increase the total number of batches, but if the number of batches is large then this won't be the right decision. Maybe we should insist on setting aside some minimum percentage of work_mem for the in-memory hash table, and fill it with however many MCVs we think will fit.

I think that setting aside a minimum percentage of work_mem is a reasonable approach. For instance, setting aside 1% of even a 1 MB work_mem would be 10 KB, which is enough to store about 40 MCV tuples of the TPC-H database. Such a small percentage would be very unlikely (but still possible) to change the number of batches used. Then, given the memory allocation and the known tuple size plus overhead, only that number of MCVs are selected for the MCV table, regardless of how many there are. The MCV table size would then increase as work_mem is changed, up to a maximum given by the number of MCVs.

I agree. However, there's no reason at all to assume that the tuples we flush out of the table are any better or worse than the new ones we add back in later. In fact, although it's far from a guarantee, if the order of the tuples in the table is random, then we're more likely to encounter the most common values first. We might as well just keep the ones we had rather than dumping them out and adding in different ones.
Err, except, maybe we can't guarantee correctness that way, in the case of a many-to-many join?

The code, when building the MCV hash table, keeps track of the order of insertion of the best MCVs. It then flushes the MCV partitions in decreasing order of MCV frequency. Thus, by the end of the build partitioning phase, the MCV hash table should only store the most frequent MCV tuples. Even with many-to-many joins, as long as we keep all build tuples that have a given MCV in memory, everything is fine. You would get into problems if you flushed only some of the tuples of a certain MCV, but that will not happen.

-- Ramon Lawrence
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
-----Original Message-----
From: Robert Haas [mailto:robertmh...@gmail.com]

I looked at this some more. I'm a little concerned about the way we're maintaining the in-memory hash table. Since the highest legal statistics target is now 10,000, it's possible that we could have two orders of magnitude more MCVs than what you're expecting. As I read the code, that could lead to construction of an in-memory hash table with 64K slots. On a 32-bit machine, I believe that works out to 16 bytes per partition (12 and 4), which is a 1MB hash table. That's not necessarily problematic, except that I don't think you're considering the size of the hash table itself when evaluating whether you are blowing out work_mem, and the default size of work_mem is 1MB.

I totally agree that 10,000 MCVs changes things. Ideally, these 10,000 MCVs should be kept in memory because they will join with the most tuples. However, the size of the MCV hash table (as you point out) can be bigger than work_mem *by itself*, not even considering the tuples in the table or in the in-memory batch. Supporting that many MCVs would require more modifications to the hash join algorithm. 100 MCVs should be able to fit in memory, though. Since the number of batches is rounded to a power of 2, there are often some hash_table_bytes not used by the in-memory batch that can be used to store the MCV table. The absolute size of the memory used should also be reasonable (depending on the tuple size in bytes). So, basically, we have a decision to make: whether to try to support a larger number of MCVs or cap it at a reasonable number like 100. You can come up with situations where using all 10,000 MCVs is good (for instance, if all MCVs have frequency 1/10,000), but I expect 100 MCVs will capture the majority of cases, as the top 100 MCVs are usually significantly more frequent than later MCVs.
I now also see that the code should be changed to keep track of the MCV bytes separately from hashtable->spaceUsed, as the latter is used to determine when to dynamically increase the number of batches.

I also don't really understand why we're trying to control the size of the hash table by flushing tuples after the fact. Right now, when the in-memory table fills up, we just keep adding tuples to it, which in turn forces us to flush out other tuples to keep the size down. This seems quite inefficient - not only are we doing a lot of unnecessary allocating and freeing, but those flushed slots in the hash table degrade performance (because they don't stop the scan for an empty slot). It seems like we could simplify things considerably by adding tuples to the in-memory hash table only to the point where the next tuple would blow it out. Once we get to that point, we can skip the isAMostCommonValue() test and send any future tuples straight to temp files. (This would also reduce the memory consumption of the in-memory table by a factor of two.)

In the ideal case, we select a number of MCVs to support that we know will always fit in memory. The flushing is used to deal with the case where we are doing a many-to-many join and there may be multiple tuples with a given MCV value in the build relation. The issue with building the MCV table is that the hash operator will not receive tuples in MCV frequency order. It is possible that the MCV table fills up with tuples of less frequent MCVs before a more frequent MCV tuple arrives. In that case, we would like to keep the more frequent MCV and bump one of the less frequent MCVs.

We could potentially improve on this even further if we can estimate in advance how many MCVs we can fit into the in-memory hash table before it gets blown out. If, for example, we have only 1MB of work_mem but there are 10,000 MCVs, getMostCommonValues() might decide to only hash the first 1,000 MCVs.
Even if we still blow out the in-memory hash table, the earlier MCVs are more frequent than the later MCVs, so the ones that actually make it into the table are likely to be more beneficial. I'm not sure exactly how to do this tuning though, since we'd need to approximate the size of the tuples... I guess the query planner makes some effort to estimate that, but I'm not sure how to get at it.

The number of batches (nbatch), inner_rel_bytes, and hash_table_bytes are calculated in ExecChooseHashTableSize in nodeHash.c. The number of bytes not allocated to the in-memory batch is then:

hash_table_bytes - inner_rel_bytes / nbatch

Depending on the power-of-2 rounding of nbatch, this may be almost 0 or quite large. You could change the calculation of nbatch or try to resize the in-memory batch, but that opens up a can of worms. It may be best to assume a small number of MCVs, such as 10 or 100.
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Because there is no nice way in PostgreSQL (that I know of) to derive a histogram after a join (on an intermediate result), usingMostCommonValues is currently only enabled on a join when the outer (probe) side is a table scan (seq scan only, actually). See getMostCommonValues (soon to be called ExecHashJoinGetMostCommonValues) for the logic that determines this.

So my test case of doing a whole bunch of hash joins in a test query isn't really valid. Makes sense. I did another, more haphazard test on a query with fewer joins, and saw noticeable speedups. It's starting to seem to me that the case where this patch provides a benefit is so narrow that I'm not sure it's worth the extra code. Not that anyone asked, but I don't consider myself qualified to render judgement on that point. Code size is, I guess, a maintainability issue, and I'm not terribly experienced maintaining PostgreSQL :) Is it realistic to think that the MCVs of the base relation might still be applicable to the joinrel? It's certainly easy to think of counterexamples, but it might be a good approximation more often than not. It's equivalent to our assumption that distributions of values in columns in the same table are independent. Making that assumption in this case would probably result in occasional dramatic speed improvements similar to the ones we've seen in less complex joins, offset by just-as-occasional dramatic slowdowns of similar magnitude. In other words, it will increase the variance of our results. - Josh

There is almost zero penalty for selecting incorrect MCV tuples to buffer in memory. Since the number of MCVs is approximately 100, the overhead is keeping these 100 tuples in memory, where they *might* not be MCVs. The cost is the little extra memory and the checking of the MCVs, which is very fast. On the other hand, the benefit is potentially tremendous if the MCV is very common in the probe relation.
Every probe tuple that matches an MCV tuple in memory does not have to be written to disk. The potential speedup is directly proportional to the skew: the more skew, the more benefit. An analogy is with a page buffering system, where one goal is to keep frequently used pages in the buffer. Essentially, the goal of this patch is to pin in memory the tuples that the join believes will match the most tuples on the probe side. This reduces I/Os by making more probe relation tuples match during the first read of the probe relation. Regular hash join has no way to guarantee that frequently matched build tuples remain memory-resident.

The particular join with Customer, Orders, LineItem, and Part is a reasonable test case. There may be two explanations for the results. (I am running tests for this query currently.) First, the time to generate the tuples (select *) may be dominating the query time. Second, as mentioned by Bryce, I expect the issue is that only the join with Customer and Orders exploited the patch. Customer has some skew (but not dramatic), so there would be some speedup. However, the join with Part and LineItem *should* show a benefit but may not because of a limitation of the patch implementation (not the idea). The MCV optimization is currently only enabled when the probe side is a sequential scan. This limitation is due to our current inability to determine a stats tuple of the join attribute on the probe side for other operators. (This should be possible - help please?) Even if this stats tuple is on the base relation and may not exactly reflect the distribution of the intermediate relation on the probe side, it could still be very good. Even if it is not, once again, the cost is negligible.

In summary, the patch will improve the performance of any multi-batch hash join with skew. It is useful right now when the probe relation has skew and is accessed using a sequential scan.
It would be useful in even more situations if the code were modified to determine the stats for the join attribute of the probe relation in all cases (even when the probe relation is produced by another operator).

-- Dr. Ramon Lawrence Assistant Professor, Department of Computer Science, University of British Columbia Okanagan E-mail: ramon.lawre...@ubc.ca
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Robert, you do not need to use qgen.exe to generate queries, as you are not running the TPC-H benchmark test. Attached is an example of the 22 sample TPC-H queries according to the benchmark. We have not tested using the TPC-H queries for this particular patch and only use the TPC-H database as a large, skewed data set. The simpler queries we test involve joins of Part-Lineitem or Supplier-Lineitem, such as:

Select * from part, lineitem where p_partkey = l_partkey

or

Select count(*) from part, lineitem where p_partkey = l_partkey

The count(*) version is usually more useful for comparisons, as the generation of output tuples on the client side (say, with pgAdmin) dominates the actual time to complete the query. To isolate query costs, we also test using a simple server-side function. I have also attached the setup description. I would be happy to help in any way I can. Bryce is currently working on an updated patch according to your suggestions.

-- Dr. Ramon Lawrence Assistant Professor, Department of Computer Science, University of British Columbia Okanagan E-mail: ramon.lawre...@ubc.ca

-----Original Message-----
From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas Sent: December 17, 2008 7:54 PM To: Lawrence, Ramon Cc: Tom Lane; pgsql-hackers@postgresql.org; Bryce Cutt Subject: Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Dr. Lawrence: I'm still working on reviewing this patch. I've managed to load the sample TPCH data from tpch1g1z.zip after changing the line endings to UNIX-style and chopping off the trailing vertical bars. (If anyone is interested, I have the results of pg_dump | bzip2 -9 on the resulting database, which I would be happy to upload if someone has server space. It is about 250MB.) But, I'm not sure quite what to do in terms of generating queries.
TPCHSkew contains QGEN.EXE, but that seems to require that you provide template queries as input, and I'm not sure where to get the templates. Any suggestions? Thanks, ...Robert

-- using 10100 as a seed to the RNG

-- QUERY_1
select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
  sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc, count(*) as count_order
from lineitem
where l_shipdate <= date '1998-09-01'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

-- QUERY_2
select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
from part, supplier, partsupp, nation, region
where p_partkey = ps_partkey and s_suppkey = ps_suppkey
  and p_size = 28 and p_type like '%STEEL'
  and s_nationkey = n_nationkey and n_regionkey = r_regionkey
  and r_name = 'MIDDLE EAST'
  and ps_supplycost = (
    select min(ps_supplycost)
    from partsupp, supplier, nation, region
    where p_partkey = ps_partkey and s_suppkey = ps_suppkey
      and s_nationkey = n_nationkey and n_regionkey = r_regionkey
      and r_name = 'MIDDLE EAST' )
order by s_acctbal desc, n_name, s_name, p_partkey
limit 100;

-- QUERY_3
select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue,
  o_orderdate, o_shippriority
from customer, orders, lineitem
where c_mktsegment = 'BUILDING' and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < date '1995-03-31' and l_shipdate > date '1995-03-31'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 10;

-- QUERY_4
select o_orderpriority, count(*) as order_count
from orders
where o_orderdate >= date '1997-10-01' and o_orderdate < date
'1998-02-01' and exists ( select * from lineitem where l_orderkey = o_orderkey
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
-Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED] I'm a tad worried about what happens when the values that are frequently occurring in the outer relation are also frequently occurring in the inner (which hardly seems an improbable case). Don't you stand a severe risk of blowing out the in-memory hash table? It doesn't appear to me that the code has any way to back off once it's decided that a certain set of join key values are to be treated in-memory. Splitting the main join into more batches certainly doesn't help with that. Also, AFAICS the benefit of this patch comes entirely from avoiding dump and reload of tuples bearing the most common values, which means it's a significant waste of cycles when there's only one batch. It'd be better to avoid doing any of the extra work in the single-batch case. One thought that might address that point as well as the difficulty of getting stats in nontrivial cases is to wait until we've overrun memory and are forced to start batching, and at that point determine on-the-fly which are the most common hash values from inspection of the hash table as we dump it out. This would amount to optimizing on the basis of frequency in the *inner* relation not the outer, but offhand I don't see any strong theoretical basis why that wouldn't be just as good. It could lose if the first work_mem worth of inner tuples isn't representative of what follows; but this hardly seems more dangerous than depending on MCV stats that are for the whole outer relation rather than the portion of it being selected. regards, tom lane You are correct with both observations. The patch only has a benefit when there is more than one batch. Also, there is a potential issue with MCV hash table overflows if the number of tuples that match the MCVs in the build relation is very large. Bryce has created a patch (attached) that disables the code for one batch joins. 
This patch also checks for MCV hash table overflows and handles them by flushing from the MCV hash table back to the main hash table. The main hash table will then resolve overflows as usual. Note that this will cause the worst case of a build table with all the same values to be handled the same as the current hash code, i.e., it will attempt to re-partition until it eventually gives up and then allocates the entire partition in memory. There may be a better way to handle this case, but the new patch will remain consistent with the current hash join implementation.

The issue with determining and using the MCV stats is more challenging than it appears. First, knowing the MCVs of the build table will not help us. What we need are the MCVs of the probe table, because by knowing those values we will keep the tuples with those values in the build relation in memory. For example, consider a join between tables Part and LineItem. Assume 1 popular part accounts for 10% of all LineItems. If Part is the build relation and LineItem is the probe relation, then by keeping that 1 part record in memory, we guarantee that we do not need to write out 10% of LineItem. If a selection occurs on LineItem before the join, it may change the distribution of LineItem (the MCVs), but it is probable that they are still a good estimate of the MCVs in the derived LineItem relation. (We did experiments on trying to sample the first few thousand tuples of the probe relation to dynamically determine the MCVs, but generally found this was inaccurate due to non-random samples.) In essence, the goal is to smartly pick the tuples that remain in the in-memory batch before probing begins. Since the number of MCVs is small, incorrectly selecting build tuples to remain in memory has negligible cost. If we assume that LineItem has been filtered so much that it is now smaller than Part and is the build relation, then the MCV approach does not apply.
There is no skew in Part on partkey (since it is the PK), and knowing the MCV partkeys in LineItem does not help us because they each join with only a single tuple in Part. In this case, the MCV approach should not be used because no benefit is possible, and it will not be used because there will be no MCVs for Part.partkey. The bad case with MCV hash table overflow requires a many-to-many join between the two relations, which would not occur with the more typical PK-FK joins.

-- Dr. Ramon Lawrence Assistant Professor, Department of Computer Science, University of British Columbia Okanagan E-mail: [EMAIL PROTECTED]

histojoin_v3.patch Description: histojoin_v3.patch
Re: [HACKERS] WIP: Hash Join-Filter Pruning using Bloom Filters
-----Original Message-----
On Sun, Nov 2, 2008 at 10:49 PM, Jonah H. Harris [EMAIL PROTECTED] wrote: It's effective as-is for a preliminary patch. The GUC code is the least of my worries. Can you provide some figures on the performance impact of the bloom filter?

I have tested the Bloom filter patch. It compiles cleanly against HEAD. As indicated, the performance improvements for hash join are good, especially when the build table is filtered with a selection condition. Performance improvements range from a couple of percent up to 20% for multi-batch joins. Note that the bloom filter will slightly slow queries where the filter has no benefit. I have not looked at the actual implementation of the Bloom filter, but will proceed to do that next. One issue to be considered is how the space used for the bloom filter relates to the work_mem allocated to the join. That is, does the bloom filter consume some of the work_mem space, or is it treated as additional memory allocated to the join?

Experimental Results (in ms)

Query   Time With Filter   Time Without Filter   % Faster
1       2,166              2,648                 18%
2       1,665              1,772                  6%
3       5,308              6,374                 17%
4       63,690             75,715                15%
5       87,864             81,552                -8%
6       12,492             11,696                -7%

Query 1: (provided by patch author) = 190,000 results
CREATE TABLE t1 (id INTEGER PRIMARY KEY, x INTEGER);
CREATE TABLE t2 (id INTEGER PRIMARY KEY, x INTEGER);
INSERT INTO t1 (SELECT ge, ge % 100 FROM generate_series(1, 1000000) ge);
INSERT INTO t2 (SELECT * FROM t1);
VACUUM ANALYZE;
SELECT COUNT(*) FROM t1, t2 WHERE t1.id = t2.id AND t1.x < 30 AND t2.x > 10;

The next five queries are on the TPC-H 1GB data set.
Query 2: (in-memory join with restrictive build filter) = 56,110 results
select count(*) from customer, orders where c_custkey = o_custkey and c_nationkey = 10;

Query 3: (multi-batch join with less restrictive build filter) = 841,767 results
select count(*) from customer, orders where c_custkey = o_custkey and c_nationkey > 10;

Query 4: (large multi-batch join) = 3,215,402 results
select count(*) from orders, lineitem where o_orderkey = l_orderkey and o_totalprice 15;

Query 5: (large multi-batch join with no filter - hard case for Bloom filter) = 6,000,003 results
select count(*) from orders, lineitem where o_orderkey = l_orderkey;

Query 6: (large probe, in-memory build with no filter - hard case for Bloom filter) = 6,000,003 results
select count(*) from supplier, lineitem where s_suppkey = l_suppkey;

All tests were run 4 times and the times were averaged. The initial run was discarded to deal with buffering issues.

-- Dr. Ramon Lawrence University of British Columbia Okanagan
Re: [HACKERS] WIP: Hash Join-Filter Pruning using Bloom Filters
-----Original Message-----
From: Jonah H. Harris [mailto:[EMAIL PROTECTED] I have a new patch which does not create a bloom filter unless it sees that the hash join is going to batch. I'll send it along later tonight. Currently it's additional space not accounted for by work_mem. Additionally, it's a good amount more space than is required. This is fixed in the newer patch as well.

I think that the bloom filter will improve the performance of in-memory joins as well. The basic trade-off in that case is the time to probe multiple entries in a bucket of the hash table (the number of tuples per bucket currently defaults to 10) versus the cost of building and probing the bloom filter. The bloom filter should win in this case as long as there are tuples in the probe relation that cannot find a match in the build relation. My suggestion would be to keep it enabled for all joins. If possible, it would be valuable to estimate what percentage of tuples the bloom filter filters out. A simple estimate would be the percentage of the build table that is involved in the join. For instance, the good test cases had between 40% and 90% of the customer relation filtered out, and a corresponding percentage of the probe relation, lineitem, was filtered out by the bloom filter. The bad case used all of customer, so the bloom filter stopped no probe tuples. It would be useful for testing to track the number and percentage of probe tuples for which the bloom filter prevents a probe. You may further record which of these tuples were in the in-memory batch and in the on-disk batches. These statistics may help you get the bloom filter optimized for all cases.

-- Ramon Lawrence
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
-Original Message- Minor question on this patch. AFAICS there is another patch that seems to be aiming at exactly the same use case. Jonah's Bloom filter patch. Shouldn't we have a dust off to see which one is best? Or at least a discussion to test whether they overlap? Perhaps you already did that and I missed it because I'm not very tuned in on this thread. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support We haven't had that discussion AFAIK, and definitely should. First glance suggests they could coexist peacefully, with proper coaxing. If I understand things properly, Jonah's patch filters tuples early in the join process, and this patch tries to ensure that hash join batches are kept in RAM when they're most likely to be used. So they're orthogonal in purpose, and the patches actually apply *almost* cleanly together. Jonah, any comments? If I continue to have some time to devote, and get through all I think I can do to review this patch, I'll gladly look at Jonah's too, FWIW. - Josh The skew patch and bloom filter patch are orthogonal and can both be applied. The bloom filter patch is a great idea, and it is used in many other database systems. You can use the TPC-H data set to demonstrate that the bloom filter patch will significantly improve performance of multi-batch joins (with or without data skew). Any query that filters a build table before joining on the probe table will show improvements with a bloom filter. For example, select * from customer, orders where customer.c_nationkey = 10 and customer.c_custkey = orders.o_custkey The bloom filter on customer would allow us to avoid probing with orders tuples that cannot possibly find a match due to the selection criteria. This is especially beneficial for multi-batch joins where an orders tuple must be written to disk if its corresponding customer batch is not the in-memory batch. 
I have no experience reviewing patches, but I would be happy to help contribute/review the bloom filter patch as best I can.

-- Dr. Ramon Lawrence Assistant Professor, Department of Computer Science, University of British Columbia Okanagan E-mail: [EMAIL PROTECTED]
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Joshua, thank you for offering to review the patch. The easiest way to test would be to generate your own TPC-H data and load it into a database for testing. I have posted the TPC-H generator at: http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip The generator can produce skewed data sets. It was produced by Microsoft Research. After unzipping, on a Windows machine, you can just run the command:

dbgen -s 1 -z 1

This will produce a TPC-H database of scale 1 GB with a Zipfian skew of z=1. More information on the generator is in the document README-S.DOC. Source is provided for the generator, so you should be able to run it on other operating systems as well. The schema DDL is at: http://people.ok.ubc.ca/rlawrenc/tpch_pg_ddl.txt Note that the load time for the 1G data is 1-2 hours, and for the 10G data about 24 hours. I recommend you do not add the foreign keys until after the data is loaded. The other alternative is to do a pg_dump of our data sets. However, the download size would be quite large, and it would take a couple of days for us to get you the data in that form.

-- Dr. Ramon Lawrence Assistant Professor, Department of Computer Science, University of British Columbia Okanagan E-mail: [EMAIL PROTECTED]

-----Original Message-----
From: Joshua Tolley [mailto:[EMAIL PROTECTED] Sent: November 1, 2008 3:42 PM To: Lawrence, Ramon Cc: pgsql-hackers@postgresql.org; Bryce Cutt Subject: Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon [EMAIL PROTECTED] wrote: We propose a patch that improves hybrid hash join's performance for large multi-batch joins where the probe relation has skew. Project name: Histojoin Patch file: histojoin_v1.patch This patch implements the Histojoin join algorithm as an optional feature added to the standard Hybrid Hash Join (HHJ). A flag is used to enable or disable the Histojoin features. When Histojoin is disabled, HHJ acts as normal.
The Histojoin features allow HHJ to use PostgreSQL's statistics to do skew-aware partitioning. The basic idea is to keep, in a small in-memory hash table, the build-relation tuples whose join values occur frequently in the probe relation. For skewed data sets, this improves the performance of multi-batch HHJ by 10% to 50%. The performance improvements of this patch can be seen in the paper (pages 25-30) at: http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf All generators and materials needed to verify these results can be provided.

This is a patch against the HEAD of the repository. It contains no platform-specific code. It compiles and has been tested on our machines in both Windows (MSVC++) and Linux (GCC).

Currently the Histojoin feature is enabled by default and is used whenever HHJ is used and there are Most Common Value (MCV) statistics available on the probe-side base relation of the join. To disable this feature, simply set the enable_hashjoin_usestatmcvs flag to off in the database configuration file or at run time with the 'set' command.

One potential improvement not included in the patch: MCV statistics are only determined when the probe relation is produced by a scan operator. There is a benefit to using MCVs even when the probe relation is not a base scan, but we were unable to determine how to find the statistics of a base relation after other operators have been applied.

This patch was created by Bryce Cutt as part of his work on his M.Sc. thesis.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of British Columbia Okanagan
E-mail: [EMAIL PROTECTED]

I'm interested in trying to review this patch. Having not done patch review before, I can't exactly promise grand results, but could you provide me with the data to check your results? In the meantime I'll go read the paper.
- Josh / eggyknap

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
From: Tom Lane [mailto:[EMAIL PROTECTED]]
> What alternatives are there for people who do not run Windows?
> regards, tom lane

The TPC-H generator is a standard code base provided at http://www.tpc.org/tpch/. We have been able to compile this code on Linux. However, we were unable to get the Microsoft modifications to it to compile on Linux (although they are supposed to be portable), so we just used the Windows version with wine on our test Debian machine.

I have also posted the text files for the TPC-H 1G 1Z data set at: http://people.ok.ubc.ca/rlawrenc/tpch1g1z.zip Note that you need to trim the extra characters at the end of each line for PostgreSQL to read the files properly.

Since the data takes a while to generate and load, we can also provide a compressed version of the PostgreSQL data directory with the data already loaded.

--
Ramon Lawrence
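A hedged illustration of that line-trimming step, assuming the "extra characters" are a trailing '|' field delimiter and possibly a carriage return (which is what the Windows-built generator typically emits); verify against your own files before bulk-loading:

```c
#include <string.h>

/*
 * Strip any trailing newline/carriage return, then a trailing '|'
 * field delimiter, from one dbgen output line in place.
 * Returns the new length.  This is a sketch, not part of any patch.
 */
size_t trim_dbgen_line(char *line)
{
    size_t len = strlen(line);

    /* Drop line-ending characters (handles both LF and CRLF). */
    while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
        line[--len] = '\0';

    /* Drop the trailing field delimiter so COPY sees the right column count. */
    if (len > 0 && line[len - 1] == '|')
        line[--len] = '\0';

    return len;
}
```

A small filter built around this function can rewrite each .tbl file before running COPY.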
[HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
We propose a patch that improves hybrid hash join's performance for large multi-batch joins where the probe relation has skew.

Project name: Histojoin
Patch file: histojoin_v1.patch

This patch implements the Histojoin join algorithm as an optional feature added to the standard Hybrid Hash Join (HHJ). A flag is used to enable or disable the Histojoin features. When Histojoin is disabled, HHJ acts as normal.

The Histojoin features allow HHJ to use PostgreSQL's statistics to do skew-aware partitioning. The basic idea is to keep, in a small in-memory hash table, the build-relation tuples whose join values occur frequently in the probe relation. For skewed data sets, this improves the performance of multi-batch HHJ by 10% to 50%. The performance improvements of this patch can be seen in the paper (pages 25-30) at: http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf All generators and materials needed to verify these results can be provided.

This is a patch against the HEAD of the repository. It contains no platform-specific code. It compiles and has been tested on our machines in both Windows (MSVC++) and Linux (GCC).

Currently the Histojoin feature is enabled by default and is used whenever HHJ is used and there are Most Common Value (MCV) statistics available on the probe-side base relation of the join. To disable this feature, simply set the enable_hashjoin_usestatmcvs flag to off in the database configuration file or at run time with the 'set' command.

One potential improvement not included in the patch: MCV statistics are only determined when the probe relation is produced by a scan operator. There is a benefit to using MCVs even when the probe relation is not a base scan, but we were unable to determine how to find the statistics of a base relation after other operators have been applied.

This patch was created by Bryce Cutt as part of his work on his M.Sc. thesis.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of British Columbia Okanagan
E-mail: [EMAIL PROTECTED]

histojoin_v1.patch
Description: histojoin_v1.patch
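The skew-aware partitioning described in the posting above can be sketched as follows. This is a minimal illustration with hypothetical names (SkewPartitioner, route_build_tuple), not the patch code; the real implementation hooks into the executor's hash node and draws the MCV list from the planner's statistics:

```c
#include <stdbool.h>

/* Hypothetical, simplified structures -- not the actual Histojoin code. */
#define MAX_MCVS 100

typedef struct SkewPartitioner
{
    int mcv_keys[MAX_MCVS]; /* most common join values of the probe side */
    int nmcvs;              /* how many MCV keys are present */
    int nbatch;             /* number of on-disk batches */
} SkewPartitioner;

/* Return true if the key is one of the probe side's most common values. */
static bool key_is_mcv(const SkewPartitioner *sp, int key)
{
    for (int i = 0; i < sp->nmcvs; i++)
        if (sp->mcv_keys[i] == key)
            return true;
    return false;
}

/*
 * Decide where a build-relation tuple goes: -1 means the small in-memory
 * skew table (its probe-side matches will be frequent, so it should never
 * spill to disk); otherwise return a batch number, as plain hybrid hash
 * join would.
 */
int route_build_tuple(const SkewPartitioner *sp, int join_key)
{
    if (key_is_mcv(sp, join_key))
        return -1;          /* keep resident in memory */
    return (int) ((unsigned int) join_key % (unsigned int) sp->nbatch);
}
```

Because the MCV keys account for a disproportionate share of probe tuples under skew, routing just those few build tuples to memory avoids writing and re-reading the probe tuples that match them.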
Re: [HACKERS] Potential Join Performance Issue
From: Tom Lane [mailto:[EMAIL PROTECTED]]
> I was intending to do it the other way, actually. An extra field in
> HashPath hardly costs anything. The other reason for it is that there
> are other possible uses for knowing whether a hash will be multi-batch.
> (For example, if we were prepared to tell the executor that it *must*
> keep the hash to one batch, we could assume that the sort order of the
> left input is preserved. I haven't looked into the risks/benefits of
> that too much, but it's been in the back of the mind for a long time.)

Having the number of batches in HashPath could be useful for a variety of reasons. For our research, we have added an nbatch variable to both HashPath and HashJoin. Having it in HashJoin is useful because we modified EXPLAIN to output the number of batches. There is a cost to putting an nbatch variable in HashPath: the system may set it potentially hundreds or thousands of times during costing, yet does not (currently) use it until the chosen HashPath is converted to a plan.

> I'd be more inclined to deal with the issue by trying to establish a
> safety margin in the estimate of whether the hash will go multi-batch.
> IOW we should disuse_physical_tlist if the hash is estimated to be
> close to but still within one batch.

Our experiments with large TPC-H 1 GB joins show that it is almost always better not to use physical_tlists if the number of batches is > 1. There is a noticeable (approximately 5-15%) improvement from using physical_tlists for in-memory joins. For two batches, it can go either way, depending on how many attributes are projected out of the outer relation: using physical_tlists may still be better if most of the attributes of the outer relation are kept. For a larger number of batches, the extra I/O cost significantly dominates the physical_tlist optimization, and the performance of multi-batch joins may improve by 50% or more when the optimization is disabled.
It is possible to create a safety margin by having ExecChooseHashTableSize() return the value inner_rel_bytes / hash_table_bytes, which represents the fraction of the available memory that the inner relation is expected to consume, and then making decisions based on that. However, this is only as good as the inner-relation size estimate, and especially for large queries the estimate may be quite inaccurate. A more robust solution could examine the width of the path and the width of the relation, combined with the number of batches, to see whether projecting early would be worth it. It may be best to keep it simple: use number of batches > 1 as the criterion and instead focus on examining issues with inaccurate join size estimates.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of British Columbia Okanagan
E-mail: [EMAIL PROTECTED]
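The safety-margin rule suggested above can be sketched like this. The function names and the 0.75 threshold are illustrative assumptions, not values from PostgreSQL; the real ExecChooseHashTableSize() in nodeHash.c has a different signature:

```c
#include <stdbool.h>

/*
 * Fraction of the available hash memory the inner relation is expected
 * to fill.  A value >= 1.0 means the join is expected to go multi-batch.
 */
double hash_fill_fraction(double inner_rel_bytes, double hash_table_bytes)
{
    return inner_rel_bytes / hash_table_bytes;
}

/*
 * Use the physical tlist (i.e. skip the early projection) only when the
 * hash is confidently expected to stay in one batch.  The 0.75 margin
 * is a guess for illustration, not a tuned value: inner-relation size
 * estimates for large queries can be badly off, so leave headroom.
 */
bool use_physical_tlist_for_outer(double inner_rel_bytes,
                                  double hash_table_bytes)
{
    return hash_fill_fraction(inner_rel_bytes, hash_table_bytes) < 0.75;
}
```

The point of the margin is asymmetry of the costs: wrongly projecting early for an in-memory join loses a few percent, while wrongly keeping wide tuples for a multi-batch join can cost tens of percent in extra I/O.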
[HACKERS] Potential Join Performance Issue
PostgreSQL development community:

Our research group has been using the PostgreSQL code base to test new join algorithms. During testing, we noticed that the planner does not push down projections to the outer relation in a hash join. Although this makes sense for in-memory (1 batch) joins, for joins larger than memory (such as for TPC-H DSS), it causes the system to perform significantly more disk I/O when reading and writing batches of the outer relation.

A simple solution is to add a single line of code to src/backend/optimizer/plan/createplan.c after line 1771:

disuse_physical_tlist(outer_plan, best_path->jpath.outerjoinpath);

This always forces the projection on the outer relation. A more complicated alternative is to add a state variable that lets the planner know how many batches the hash join expects and push down the projection only if that number is greater than one. However, pushing the projection onto the outer relation is almost always the best choice, as it eliminates unneeded attributes for operators above the hash join in the plan and is robust in the case of poor estimates.

We have been testing using TPC-H scale factor 1 GB. A sample query that demonstrates the behavior is:

SELECT c_custkey, c_name, o_orderkey, o_orderdate
FROM Customer, Orders
WHERE c_custkey = o_custkey

Note that EXPLAIN on this query will indicate that the projection is performed on the outer relation even though it is not actually done. We found the difference by modifying our code to track tuples and bytes written to disk, but it can also be detected by watching the size of the temporary files produced during the join.

Sincerely,
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of British Columbia Okanagan
http://people.ok.ubc.ca/rlawrenc/
E-mail: [EMAIL PROTECTED]
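The I/O trade-off described above can be put in back-of-envelope form. This is a sketch under simple assumptions (uniform tuple widths; every outer batch except the first is written once), not planner code:

```c
#include <stdbool.h>

/*
 * Rough bytes written to outer-side batch files: with nbatch batches,
 * all but the first batch's share of the outer relation is spilled.
 */
double outer_spill_bytes(double outer_rows, double tuple_width, int nbatch)
{
    if (nbatch <= 1)
        return 0.0;             /* in-memory join: nothing spilled */
    return outer_rows * tuple_width * (double) (nbatch - 1) / nbatch;
}

/*
 * Projecting early shrinks each spilled outer tuple from wide_width
 * (physical tlist) to narrow_width (only the needed attributes), so it
 * pays off exactly when something is actually spilled.
 */
bool should_project_outer_early(double outer_rows, double wide_width,
                                double narrow_width, int nbatch)
{
    return outer_spill_bytes(outer_rows, wide_width, nbatch) >
           outer_spill_bytes(outer_rows, narrow_width, nbatch);
}
```

For the sample Customer/Orders query, the outer relation carries many unneeded TPC-H columns, so the wide/narrow gap (and hence the wasted spill I/O without the fix) is large.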