Re: [HACKERS] join removal

2009-08-10 Thread Lawrence, Ramon
 I took a first crack at coding up an implementation of
 relation_is_distinct_for() tonight.

I am not sure if this will help or not, but on the 8.4 code base we
implemented two functions:

- getCandidateKeys() - would recursively traverse a tree from a given
node to the leaf nodes and determine the candidate keys for the
intermediate relation produced by that node

- getJoinCard() - determined the join cardinality of a hash join node
(1:1, 1:N, etc.) based on the candidate keys of the two input relations

It worked pretty well for our tests with equi-joins, but I am sure it is
missing many cases.  I have attached the code which we used
(cardinalityFuncs.c).  Some of the helper functions may also be useful
(convertUniqueIndexesToCandidateKeys, getJoinAttrs).

--
Ramon Lawrence



cardinalityFuncs.c
Description: cardinalityFuncs.c



Re: [HACKERS] HashJoin w/option to unique-ify inner rel

2009-04-16 Thread Lawrence, Ramon
 Upon further review, it appears that a big part of this problem is
 that cost_hashjoin() doesn't understand that it needs cost semi-joins
 differently from inner or left joins.  The bogus logic looks to be
 right here:
 startup_cost += hash_qual_cost.startup;
 run_cost += hash_qual_cost.per_tuple *
 outer_path_rows * clamp_row_est(inner_path_rows *
 innerbucketsize) * 0.5;
 
 Of course, when the join type is JOIN_SEMI, we're going to stop
 looking after we find the first match, so this estimate is really far
 off.

The 8.3 version of cost_hashjoin() had a line like this:

joininfactor = join_in_selectivity(&path->jpath, root);

and a cost function like this:

run_cost += hash_qual_cost.per_tuple *
outer_path_rows * clamp_row_est(inner_path_rows *
innerbucketsize) * joininfactor * 0.5;

This compensated for IN joins being able to stop scanning a bucket once
a match is found.  You may consider something similar for a semi-join.
Having experimented with a lot of this code recently, there is some
potential for improvement on the bucket sizes, etc., but it is a
non-trivial problem.
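
A hedged sketch (not PostgreSQL source; every number below is an invented
placeholder) of how a joininfactor-style multiplier would scale the
bucket-scan term quoted above for a semi/IN join that stops at the first
match:

#include <stdio.h>

int main(void)
{
    double hash_qual_per_tuple = 0.0025;  /* assumed cost of one qual evaluation */
    double outer_path_rows     = 1.5e6;   /* assumed probe-side rows */
    double inner_path_rows     = 2.0e5;   /* assumed build-side rows */
    double innerbucketsize     = 0.0001;  /* assumed average bucket fraction */
    double joininfactor        = 0.02;    /* assumed fraction of a bucket scanned
                                             before a semi/IN join can stop */

    double bucket_rows = inner_path_rows * innerbucketsize;
    if (bucket_rows < 1.0)
        bucket_rows = 1.0;                /* rough clamp_row_est() analogue */

    double inner_join_cost = hash_qual_per_tuple * outer_path_rows * bucket_rows * 0.5;
    double semi_join_cost  = inner_join_cost * joininfactor;

    printf("inner-join style run cost: %.2f\n", inner_join_cost);
    printf("semi-join style run cost : %.2f\n", semi_join_cost);
    return 0;
}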

I tested a similar query on TPCH 100M 1Z on version 8.3:

select * from customer where c_custkey in (select o_custkey from orders)

and found that hash aggregate was marginally faster.  If you turn off
aggregation, it selects an IN hash join which is about 5% slower and the
planner is not too far off.  So, it would be definitely possible to
modify the cost function appropriately.

  It's tempting to have Hash cheat and just peek at the node beneath it
  to see if it's a HashAggregate, in which case it could call a special
  method to request the whole hash. But it would have to know that it's
  just a plain uniquify and not implementing a GROUP BY.

If HashAggregate is faster, then the question is can you make it better
by avoiding building the hash structure twice.  I haven't considered all
the possibilities, but the situation you have used as an example, an IN
query, seems workable.  Instead of translating to a hash
aggregate/hash/hash join query plan, it may be possible to create a
special hash join node that does uniquify.

The benefit is that the planner knows about it (instead of changing the
execution plan), you can be more accurate on costs for the hash join,
and you can optimize by using only one hash table construction. A
challenge that must be dealt with is handling the multi-batch case.  It
appears that hash aggregate does not currently handle this, but I may be
mistaken.
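
A minimal, self-contained sketch of the "uniquify while building" idea
(integer keys and a toy open-addressing table, nothing taken from the
PostgreSQL executor): the build side inserts each join key once, so the
separate hash-aggregate pass and its second hash table disappear, and
probing naturally gives semi-join semantics.

#include <stdio.h>
#include <string.h>

#define NSLOTS 1024            /* power of two, assumed large enough here */

static int slots[NSLOTS];
static unsigned char used[NSLOTS];

/* Insert key if absent; return 1 if it was newly added. */
static int insert_unique(int key)
{
    unsigned h = (unsigned) key * 2654435761u & (NSLOTS - 1);
    while (used[h])
    {
        if (slots[h] == key)
            return 0;          /* duplicate inner key: skip it, this is the uniquify step */
        h = (h + 1) & (NSLOTS - 1);
    }
    used[h] = 1;
    slots[h] = key;
    return 1;
}

static int probe(int key)
{
    unsigned h = (unsigned) key * 2654435761u & (NSLOTS - 1);
    while (used[h])
    {
        if (slots[h] == key)
            return 1;
        h = (h + 1) & (NSLOTS - 1);
    }
    return 0;
}

int main(void)
{
    int inner[] = {7, 3, 7, 9, 3, 7};    /* inner join keys, with duplicates */
    int outer[] = {1, 3, 5, 7, 11};      /* outer keys to probe */
    int i, matches = 0;

    memset(used, 0, sizeof(used));
    for (i = 0; i < 6; i++)
        insert_unique(inner[i]);         /* build once, duplicates dropped */
    for (i = 0; i < 5; i++)
        matches += probe(outer[i]);      /* semi-join: at most one match each */

    printf("outer rows with a match: %d\n", matches);   /* prints 2 (keys 3 and 7) */
    return 0;
}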

--
Ramon Lawrence



Re: [HACKERS] a few crazy ideas about hash joins

2009-04-03 Thread Lawrence, Ramon
 While investigating some performance problems recently I've had cause
 to think about the way PostgreSQL uses hash joins.  So here are a few
 thoughts.  Some of these have been brought up before.
 
 1. When the hash is not expected to spill to disk, it preserves the
 pathkeys of the outer side of the join.  If the optimizer were allowed
 to assume that, it could produce significantly more efficient query
 plans in some cases. 

This is definitely possible, but you will have to dynamically modify the
execution path if the hash join ends up to be more than one batch.

 3. Avoid building the exact same hash table twice in the same query.
 This happens more often than you'd think.  For example, a table may have
 two columns creator_id and last_updater_id which both reference person
 (id).  If you're considering a hash join between paths A and B, you
 could conceivably check whether what is essentially a duplicate of B
 has already been hashed somewhere within path A.  If so, you can reuse
 that same hash table at zero startup-cost.
 
 4. As previously discussed, avoid hashing for distinct and then
 hashing the results for a hash join on the same column with the same
 operators.
 
 Thoughts on the value and/or complexity of implementation of any of
these?

I would be interested in working with you on any of these changes to
hash join if you decide to pursue them.   I am especially interested in
looking at the hash aggregation code and potentially improving its
efficiency.

We have implemented a multi-way hash join (can join more than 2 tables
at a time) which may help with cases #3 and #4.  Performance results
look very good, and we are planning on building a patch for this over
the summer.

--
Ramon Lawrence



Re: [HACKERS] a few crazy ideas about hash joins

2009-04-03 Thread Lawrence, Ramon
  I would be especially interested in using a shared memory hash table
  that *all* backends can use - if the table is mostly read-only, as
  dimension tables often are in data warehouse applications. That
would
  give zero startup cost and significantly reduced memory.
 
 I think that's a non-starter due to visibility issues and handling
 inserts and updates. Even just reusing a hash from one execution in a
 later execution of the same plan would be tricky since we would have
 to expire it if the snapshot changes.

If your data set is nearly read-only, materialized views would be a
better way to go and would require no hash join changes.

The idea of perfect hash functions for dimension tables is very
interesting.  If the data set is near static, it is possible to compute
them once in a few minutes time for a million tuple table and then
re-use them until they change.  The research has shown it is possible,
but I do not know if anyone has actually implemented it in a real DBMS.
An implementation could be something to try if there is interest.
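
For illustration only, a toy perfect-hash construction (brute-force seed
search over a tiny static key set; real schemes such as CHD scale to
millions of keys): once a collision-free seed is found for the
dimension-table keys, every lookup touches exactly one slot, with no
chains or bucket scans.

#include <stdio.h>
#include <string.h>

#define NKEYS 8
#define NSLOTS 16              /* load factor 0.5 keeps the seed search short */

static const int keys[NKEYS] = {3, 11, 25, 42, 57, 63, 88, 101};
static int table[NSLOTS];

static unsigned hash(int key, unsigned seed)
{
    unsigned h = ((unsigned) key + seed) * 2654435761u;
    h ^= h >> 16;
    return h % NSLOTS;
}

/* Search for a seed that maps all keys to distinct slots. */
static unsigned find_seed(void)
{
    for (unsigned seed = 1; seed < 1000000; seed++)
    {
        unsigned char slot_used[NSLOTS];
        int i, ok = 1;

        memset(slot_used, 0, sizeof(slot_used));
        for (i = 0; i < NKEYS && ok; i++)
        {
            unsigned h = hash(keys[i], seed);
            if (slot_used[h])
                ok = 0;
            slot_used[h] = 1;
        }
        if (ok)
            return seed;
    }
    return 0;                  /* no perfect seed found in the search range */
}

int main(void)
{
    unsigned seed = find_seed();
    int i;

    if (seed == 0)
    {
        printf("no perfect seed found\n");
        return 1;
    }
    memset(table, -1, sizeof(table));
    for (i = 0; i < NKEYS; i++)
        table[hash(keys[i], seed)] = keys[i];

    /* Each lookup touches exactly one slot and compares one stored key. */
    printf("seed=%u  42 present: %d  43 present: %d\n",
           seed, table[hash(42, seed)] == 42, table[hash(43, seed)] == 43);
    return 0;
}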

--
Ramon Lawrence



Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-03-06 Thread Lawrence, Ramon
  I think you missed the point of the performance questions.  It wasn't
  about avoiding extra simple if-tests in the per-tuple loops; a few of
  those are certainly not going to add measurable cost given how complex
  the code is already.  (I really don't think you should be duplicating
  hunks of code to avoid adding such tests.)  Rather, the concern was that
  if we are dedicating a fraction of available work_mem to this purpose,
  that reduces the overall efficiency of the regular non-IM code path,
  principally by forcing the creation of more batches than would otherwise
  be needed.  It's not clear whether the savings for IM tuples always
  exceeds this additional cost.

I misunderstood the concern.  So, there is no issue with the patch when it is 
disabled (single batch case or multi-batch with no skew)?  There is no memory 
allocated when the optimization is off, so these cases will not affect the 
number of batches or re-partitioning.  

  * The IM hashtable is only needed during the first-batch processing;
  once we've completed the first pass over the outer relation there is
  no longer any need for it, unless I'm misunderstanding things
  completely.  Therefore it really only competes for space with the
  regular first batch.  However the damage to nbatches will already have
  been done; in effect, we can expect that each subsequent batch will
  probably only use (100 - IM_WORK_MEM_PERCENT)% of work_mem.  The patch
  seems to try to deal with this by keeping IM_WORK_MEM_PERCENT negligibly
  small, but surely that's mostly equivalent to fighting with one hand
  tied behind your back.   I wonder if it'd be better to dedicate all of
  work_mem to the MCV hash values during the first pass, rather than
  allowing them to compete with the first regular batch.
 
 The IM hash table doesn't need to be very large in order to produce a
 substantial benefit, because there are only going to be ~100 MCVs in
 the probe table and each of those may well be unique in the build
 table.  But no matter what size you choose for it, there's some danger
 that it will push us over the edge into more batches, and if the skew
 doesn't turn out to be enough to make up for that, you lose.  I'm not
 sure there's any way to completely eliminate that unpleasant
 possibility.

Correct - The IM table only competes with the first-batch during processing and 
is removed after the first pass.  Also, it tends to be VERY small as the 
default of 100 MCVs usually results in 100 tuples being in the IM table which 
is normally much less than 2% of work_mem.  We get almost all the benefit with 
100-10000 MCVs with little downside risk.  Making the IM table larger (size of 
work_mem) is both not possible (not that many MCVs) and has a bigger downside 
risk if we get it wrong.
 
  * The IM hashtable creates an additional reason why nbatch might
  increase during the initial scan of the inner relation; in fact, since
  it's an effect not modeled in the initial choice of nbatch, it's
  probably going to be a major reason for that to happen.  Increasing
  nbatch on the fly isn't good because it results in extra I/O for tuples
  that were previously assigned to what is now the wrong batch.  Again,
  the only answer the patch has for this is to try not to use enough
  of work_mem for it to make a difference.  Seems like instead the initial
  nbatch estimate needs to account for that.

The possibility of the 1-2% IM_WORK_MEM_PERCENT causing a re-batch exists but 
is very small.  The number of batches is calculated in ExecChooseHashTableSize 
(nodeHash.c) as ceil(inner_rel_bytes/work_mem) rounded up to the next power of 
2.  Thus, hash join already wastes some of its work_mem allocation due to 
rounding.  For instance, if nbatch is calculated as 3 then rounded up to 4, 
only 75% of work_mem is used for each batch.  This leaves 25% of work_mem 
unaccounted for which may be used by the IM table (and also to compensate for 
build skew). Clearly, if nbatch is exactly 4, then this unaccounted space is 
not present and if the optimizer is exact in its estimates, the extra 1-2% may 
force a re-partition.
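
A small standalone sketch of that arithmetic (the 1 MB work_mem and the 3x
relation size are assumptions chosen to reproduce the 3-rounded-to-4 example
above):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double work_mem_bytes  = 1024.0 * 1024.0;      /* assumed 1 MB work_mem */
    double inner_rel_bytes = 3.0 * work_mem_bytes; /* assumed inner relation size */

    int nbatch_raw = (int) ceil(inner_rel_bytes / work_mem_bytes);   /* 3 here */
    int nbatch = 1;
    while (nbatch < nbatch_raw)
        nbatch <<= 1;                              /* round up to a power of 2 */

    double per_batch = inner_rel_bytes / nbatch;
    double unused_fraction = 1.0 - per_batch / work_mem_bytes;

    printf("nbatch = %d, per-batch bytes = %.0f, unused work_mem = %.0f%%\n",
           nbatch, per_batch, unused_fraction * 100.0);
    return 0;
}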

A solution may be to re-calculate nbatch factoring in the extra 1-2% during 
ExecHashTableCreate (nodeHash.c), which calls ExecChooseHashTableSize again 
before execution.  The decision is whether to modify ExecChooseHashTableSize 
itself (which is used during costing) or to make a modified 
ExecChooseHashTableSize function that is only used once in ExecHashTableCreate.

We have tried to change the original code as little as possible, but it is 
possible to modify ExecChooseHashTableSize and the hash join cost function to 
be skew optimization aware.

--
Ramon Lawrence



Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-26 Thread Lawrence, Ramon
 From: Tom Lane
 Heikki's got a point here: the planner is aware that hashjoin doesn't
 like skewed distributions, and it assigns extra cost accordingly if it
 can determine that the join key is skewed.  (See the bucketsize stuff
 in cost_hashjoin.)  If this patch is accepted we'll want to tweak that
 code.

Those modifications would make the optimizer more likely to select hash
join, even with skewed distributions.  For the TPC-H data set that we
are using the optimizer always picks hash join over merge join (single
or multi-batch).  Since the current patch does not change the cost
function, there is no change in the planning cost.  It may or may not be
useful to modify the cost function depending on the effect on planning
cost.

 Still, that has little to do with the current gating issue, which is
 whether we've convinced ourselves that the patch doesn't cause a
 performance decrease for cases in which it's unable to help.

Although we have not seen an overhead when the optimization is
by-passed, we are looking at some small code changes that would
guarantee that no extra statements are executed for the single batch
case.  Currently, an if-optimization-on check is performed on each probe
tuple which, although minor, could be avoided.

The patch's author, Bryce Cutt, is defending his Master's thesis Friday
morning (on this work), so we will provide some updated code right after
that.  Since these code changes are small, they should not affect people
trying to test the performance of the current patch.

--
Ramon Lawrence



Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-26 Thread Lawrence, Ramon
 They're automatically generated by the dbgen utility, a link to which
 was originally published somewhere in this thread. That tool creates a
 few text files suitable (with some tweaking) for a COPY command. I've
 got the original files... the .tbz I just made is 1.8 GB :) Anyone have
 someplace they'd like me to drop it?

Just a note that the Z7 data set is really a uniform data set Z0.  The
generator only accepts skew in the range from Z0 to Z4.  The uniform,
Z0, data set is typically used when benchmarking data warehouses.

It turns out the data is not perfectly uniform as the top 100 suppliers
and products represent 2.3% and 1.5% of LineItem.  This is just enough
skew that the optimization will sometimes be triggered in the
multi-batch case (currently 1% skew is the cutoff).
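
A sketch of the kind of cutoff test implied here; the frequencies stand in
for pg_stats.most_common_freqs on the probe-side join column and are
invented to sum to roughly the 2.3% mentioned above:

#include <stdio.h>

int main(void)
{
    /* assumed MCV frequencies for the top values of the probe join column */
    double mcv_freqs[] = {0.004, 0.003, 0.003, 0.002, 0.002,
                          0.002, 0.002, 0.002, 0.0015, 0.0015};
    int    nmcvs = 10;
    double cutoff = 0.01;      /* at least 1% of probe tuples must hit the MCVs */

    double skew = 0.0;
    for (int i = 0; i < nmcvs; i++)
        skew += mcv_freqs[i];

    printf("estimated MCV coverage = %.4f -> skew optimization %s\n",
           skew, skew >= cutoff ? "enabled" : "skipped");
    return 0;
}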

I have posted a pg_dump of the TPCH 1G Z0 data set at:

http://people.ok.ubc.ca/rlawrenc/tpch1g0z.zip

(Note that ownership commands are in the dump and make sure to vacuum
analyze after the load.)  I can also post the input text files if that
is easier.

--
Ramon Lawrence



Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-26 Thread Lawrence, Ramon
 That seems VERY useful - can you post the other ones (Z1, etc.) so I
 can download them all?

The Z1 data set is posted at:

http://people.ok.ubc.ca/rlawrenc/tpch1g1z.zip

I have not generated Z2, Z3, Z4 for 1G, but I can generate the Z2 and Z3
data sets, and in an hour or two they will be at:

http://people.ok.ubc.ca/rlawrenc/tpch1g2z.zip
http://people.ok.ubc.ca/rlawrenc/tpch1g3z.zip

Note that Z3 and Z4 are not really useful as the skew is extreme (98% of
the probe relation covered by top 100 values).  Using the Z2/Z3 data set
should be enough to show the huge win if you do *really* have a skewed
data set.

BTW, is there any particular form/options of the pg_dump command that I
should use to make the dump?

--
Ramon Lawrence



Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-24 Thread Lawrence, Ramon
 -Original Message-
 From: Robert Haas
 Sadly, there seem to be a number of cases in the Z7 database where the
 optimization makes things significantly worse (specifically, queries
 2, 3, and 7, but especially query 3).  Have you investigated what is
 going on there?  I had thought that we had sufficient safeguards in
 place to prevent this optimization from kicking in in cases where it
 doesn't help, but it seems not.  There will certainly be real-world
 databases that are more like Z7 than Z1.

I agree that there should be no noticeable performance difference when
the optimization is not used (single batch case or no skew).  I think
the patch achieves this.  The optimization is not used in those cases,
but we will review to see if it is the code that by-passes the
optimization that is causing a difference.

The query #3 timing difference is primarily due to a flaw in the
experimental setup.  For some reason, query #3 got executed before #4
with the optimization on, and executed after #4 with the optimization
off.  This skewed the results for all runs (due to buffering issues),
but is especially noticeable for Z7.  Note how query #4 is always faster
for the optimization-on version even though the optimization is not
actually used for those queries (because they were one batch).  I expect
that if you run query #3 on Z7 in isolation then the results should be
basically identical. 

I have attached the SQL script that Joshua sent me.  The raw data I have
posted at: http://people.ok.ubc.ca/rlawrenc/test.output


--
Ramon Lawrence



test.sql
Description: test.sql



Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-19 Thread Lawrence, Ramon


From: pgsql-hackers-ow...@postgresql.org on behalf of Robert Haas
I think what we need here is some very simple testing to demonstrate
that this patch demonstrates a speed-up even when the inner side of
the join is a joinrel rather than a baserel.  Can you suggest a single
query against the skewed TPCH dataset that will result in two or more
multi-batch hash joins?  If so, it should be a simple matter to run
that query with and without the patch and verify that the former is
faster than the latter.

This query will have the outer relation be a joinrel rather than a baserel:
 
select count(*) from supplier, part, lineitem where l_partkey = p_partkey and 
s_suppkey = l_suppkey;
 
The approach collects statistics on the outer relation (not the inner relation) 
so the code had to have the ability to determine a stats tuple on a joinrel in 
addition to a baserel.
 
Joshua sent us some preliminary data with this query and others and indicated 
that we could post it.  He wanted time to clean it up and re-run some 
experiments, but the data is generally good and the algorithm performs as 
expected.  I have attached this data to the post.  Note that the last set of 
data (although labelled as Z7) is actually an almost zero skew database and 
represents the worst-case for the algorithm (for most queries the optimization 
is not even used).
 
--
Ramon Lawrence


JoshuaTolleyData.xls
Description: JoshuaTolleyData.xls



Re: [HACKERS] The testing of multi-batch hash joins with skewed data sets patch

2009-02-10 Thread Lawrence, Ramon
 The idea I came up with for benchmarking was a little similar to what I
 remember from the original tests. I have a sales orders table and a
 products table. My version of the sales orders table contains a customer
 column. Data for 10 customers is populated into the sales orders table,
 customer 1 has a totally non-skewed set of orders, where customer 10 has
 the most skew. I've done this by creating 10000 products each with a
 product code that has been cast into a varchar and padded up to 5 chars
 in length with '0's. Each customer has the same number of rows in the
 sales orders table, customer 10 mostly orders products that when cast as
 INT are evenly divisible by 10, where customer 2 mostly orders products
 that are evenly divisible by 2. You get the idea.
 Currently I'm unsure the best way to ensure that the hash join goes into
 more than one batch apart from just making the dataset very large.
 
 Does anyone have any thoughts about the way I plan to go about
 benchmarking?

Thank you for testing the patch - it is very much appreciated.  If you
use the test version of the patch, it will print out statistics that
will be helpful.

I think your approach should work.  I have two comments:

1) You will need to scale the data set larger to go multi-batch.  Even a
minimum work_mem of 1 MB may be enough to keep the product table in
memory unless each tuple is large.  For the TPC-H tests, the size of
product was 200,000 for 1 GB tests and 2 million tuples for 10 GB tests.

2) The current formula may not generate the skew you expect on
sales.productcode.  To simplify the discussion, I will only consider
customer 1 (c1) and customer 10 (c10) and a total of 100,000 sales
(50,000 for each customer).  

If I look at product 10 for instance, it will be ordered 50,000/1,000 =
50 times by c10 and 50,000/10,000 = 5 times by c1 for a total of 55
times. Product 10 represents only 0.055% of all sales.  For all mod 10
products combined, they represent 55% of sales, which is significant BUT
requires us to store 10% of product in memory (1000 tuples all of which
need to be in the stats record).

This two customer test would be interesting.  There should be no benefit
for customer 1. In fact, you would see the worst case as you would plan
for skew but not get any benefit.  For customer 10 you should see a
benefit if your stats have 1000 tuples.  The issue is that you cannot
scale this test easily.  Increasing by a factor of 10 would require
stats of 10,000, and increasing by a factor of 100 is not possible.

The Zipfian distribution used in the previous experiments causes the top
few values to be exponentially better than the average value.  For
instance, the top 100 products may represent 10 to 50% of total sales
even for 1 million products.  In the previous case, the top 100 products
represent only 0.0055% of total sales for 1 million products.  This
level of skew would be ignored by the algorithm which has a cutoff value
that at least 1% of the probe relation must match with the skew values
buffered in memory.

To test higher values of skew, you could set up the experiment like this
(may scale down by a factor of 10 depending on your hardware):

products - 1 million
sales - 10 million
customers - 5
   - Each customer has 2 million orders.
   - Customer 1 orders each product equally.
   - Customer 2 orders each product mod 10^2 equally.
   - Customer 5 orders each product mod 10^5 equally.

It is customer 5's orders that result in most of the skew as every
100,000th product will be ordered 200,000 times (customer 5 only orders
10 products).  Then, there is a huge benefit for customer 5 for keeping
these 10 products in memory during the join.  The benefit decreases for
each customer all the way down to customer 1 which will see no benefit.
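
A throwaway generator sketch of that scheme, scaled down so it runs
instantly (the product and order counts are assumptions, not the sizes
proposed above); it only counts how concentrated each customer's orders
are on the hottest product keys:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int nproducts       = 100000;   /* scaled-down product count */
    const int orders_per_cust = 200000;   /* scaled-down orders per customer */
    int divisors[6] = {1, 1, 10, 100, 1000, 10000};   /* customer k buys products
                                                         divisible by divisors[k] */

    srand(42);
    for (int cust = 1; cust <= 5; cust++)
    {
        int divisor  = divisors[cust];
        int eligible = nproducts / divisor;        /* products this customer buys */
        long hits_top10 = 0;                       /* orders landing on the 10 hottest keys */

        for (int i = 0; i < orders_per_cust; i++)
        {
            int product = (rand() % eligible + 1) * divisor;
            if (product % (nproducts / 10) == 0)   /* one of the 10 most skewed keys */
                hits_top10++;
        }
        printf("customer %d: %ld of %d orders hit the top-10 product keys\n",
               cust, hits_top10, orders_per_cust);
    }
    return 0;
}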

--
Ramon Lawrence



Re: [HACKERS] The testing of multi-batch hash joins with skewed data sets patch

2009-02-10 Thread Lawrence, Ramon
 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-
 ow...@postgresql.org] On Behalf Of Tom Lane
 But really there are two different performance regimes here, one where
 the hash data is large enough to spill to disk and one where it isn't.
 Reducing work_mem will cause data to spill into kernel disk cache, but
 if the total problem fits in RAM then very possibly that data won't ever
 really go to disk.  So I suspect such a test case will act more like the
 small-data case than the big-data case.  You probably actually need more
 data than RAM to be sure you're testing the big-data case.

Is there a way to limit the kernel disk cache?  (We are running SUSE
Linux.)

We have been testing hybrid hash join performance and have seen that the
performance varies considerably less than expected even for dramatic
changes in work_mem and the I/Os that appear to be performed.  

--
Ramon Lawrence



[HACKERS] Help with Join Performance Testing

2009-01-21 Thread Lawrence, Ramon
A hash join modification patch is under review for 8.4 that needs
performance testing.   We would appreciate help with this testing. 

 

A testing version of the patch is attached in addition to testing
instructions and where to retrieve a sample data set.   The basic idea
of the patch is that it reduces disk operations for large multi-batch
hash joins where there is skew in the probe relation.  The patch
collects statistics on performance benefits when using the optimization.

 

--

Ramon Lawrence and Bryce Cutt

Overview


This document provides an overview of how to test the histojoin patch.  The 
patch performs skew optimization for large, multi-batch hash joins.

Installation

The patch should compile cleanly against CVS head.


Execution
---------
The skew optimization can be turned on by:

set enable_hashjoin_usestatmcvs = on;

and off by:

set enable_hashjoin_usestatmcvs = off;

If a hash join has detectable skew in the larger probe relation, then the skew 
optimization will output the amount of skew it sees and the number of tuples it 
will buffer in memory to exploit that skew.  When the hash join completes, it 
will output statistics on the number of tuples actually matched by the 
in-memory (IM) skew partition and the number of tuples in partition 0. The 
improvement in join I/Os is also given.

Sample (from LI-P TPCH 10G 1Z):

Values: 100 Skew: 0.27  Est. tuples: 59986052.00 Batches: 512  Est. Save: 
16114709.99
Total Inner Tuples: 2000000
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778 
Total Output Tuples: 59986052 
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98


Data Set

A sample test data set is TPC-H scale factor 1 GB.  A pg_dump can be downloaded 
from:

http://people.ok.ubc.ca/rlawrenc/tpch1g1z.zip

The larger 10 GB data sets are available on request.  You can also download the 
generator itself (works only on Windows) at:

http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip

The only joins with significant skew in the database are Part-LineItem and 
Supplier-LineItem.


Result Notes


1) The percentage benefit increases with the amount of skew.  Relations with no 
skew are not affected.  Relations with minimal skew show no noticeable 
improvement or negative impact.

2) Since disk I/Os in the join is only one part of the query execution time, 
overall execution times do not improve the same amount as the reduction in disk 
I/Os.  For CPU-bound queries, the disk I/O improvement may not have a 
significant effect on the overall time.

3) The relations are quite large.  Thus, queries with SELECT * that join 
several relations are very costly and the generation of the tuples dominates 
the execution time (especially if executing the query through a client such as 
pgAdmin).


Previous Results


The join with LineItem-Part on TPCH 1G 1Z shows about a 26% improvement in I/Os 
performed during the join and about 5-10% improvement in overall time.  The 
join with LineItem-Supplier is similar.  Data sets with higher skew show even 
better performance.  For example, Lineitem-Part on TPCH 10G 2Z has 90% of probe 
relation tuples matching 100 most common values.  The improvement in I/Os is 
about 90% and time about 50%.

Some sample test queries:

Query #1a:
SELECT * FROM Part, Lineitem WHERE p_partkey = l_partkey;

Query #1b:
SELECT count(*) FROM Part, Lineitem WHERE p_partkey = l_partkey;

Query #2a:
SELECT * FROM Supplier, Lineitem WHERE s_suppkey = l_suppkey;

Query #2b:
SELECT count(*) FROM Supplier, Lineitem WHERE s_suppkey = l_suppkey;

Query #3a:
SELECT * FROM Part, Lineitem, Supplier WHERE p_partkey = l_partkey and 
s_suppkey = l_suppkey;

Query #3b:
SELECT count(*) FROM Part, Lineitem, Supplier WHERE p_partkey = l_partkey and 
s_suppkey = l_suppkey;



histojoin_testing.patch
Description: histojoin_testing.patch



Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-01-14 Thread Lawrence, Ramon

 Here is a cleaned-up version.  I fixed a number of whitespace issues,
 improved a few comments, and rearranged one set of nested if-else
 statements (hopefully without breaking anything in the process).
 
 Josh / eggyknap -
 
 Can you rerun your performance tests with this version of the patch?

To help with testing, we have constructed a patch specifically for
testing.  The patch is the same as Robert's version except that it
tracks and prints out statistics during the join on how many tuples are
affected and has the enable_hashjoin_usestatmcvs variable defined so
that it is easy to turn on/off skew handling.  This is useful as
although the patch reduces the number of I/Os performed, this
improvement may not be seen in some queries which are dominated by other
cost factors (non-skew joins, CPU time, time to scan input relations,
etc.).

The sample output looks like this:

LI-P
Values: 100 Skew: 0.27  Est. tuples: 59986052.00 Batches: 512  Est.
Save: 16114709.99
Total Inner Tuples: 2000000
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778
Total Output Tuples: 59986052
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98

The other change is that the system calculates the skew and will not use
the in-memory skew partition if the skew is less than 1%.

Finally, we have attached some performance results for the TPCH 10G data
set (skew factors z=1 and z=2).  For the Customer-Orders-Lineitem-Part
query that Josh was testing, we see no overall time difference that is
significant compared to experimental error (although there is I/O
benefit for the Lineitem-Part join).  This query cost is dominated by
the non-skew joins of Customer-Orders and Orders-Lineitem and output
tuple construction.

The joins with skew, Lineitem-Supplier and Lineitem-Part, show
significantly improved performance.  Note how the statistics show that
the percentage I/O savings is directly proportional to the skew.
However, the overall query time savings is always less than this as
there are other costs such as reading the relations, performing the hash
comparisons, building the output tuples, etc. that are unaffected by the
optimization.

At this point, we await further feedback on what is necessary to get
this patch accepted.  We would also like to thank Josh and Robert again
for their review time.

Sincerely,

Ramon Lawrence and Bryce Cutt


histojoin_testing.patch
Description: histojoin_testing.patch
TPC-H 10G Skew Factor Z=1 results
-

LI-P Regular HJ: (time in milliseconds)
990344
1022562
1071250
1003219
1049000
989953
Average: 1021054.667


LI-P Skew-enabled HJ: (time in milliseconds)
883593
960860
934906
1007282
937406
948078
Average: 945354.1667

% Difference: 7.4%


LI-P
Values: 100 Skew: 0.27  Est. tuples: 59986052.00 Batches: 512  Est. Save: 
16114709.99
Total Inner Tuples: 2000000
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778
Total Output Tuples: 59986052
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98



LI-P-S Regular HJ: (time in milliseconds)
1833016
1567515
1504625
Average: 1635052

LI-P-S Skew-enabled HJ: (time in milliseconds)
883593
1280297
1423984
Average: 1195958

% Difference: 27%

LI-S
Values: 100 Skew: 0.19  Est. tuples: 59986052.00 Batches: 32  Est. Save: 
11097357.16
Total Inner Tuples: 100000
IM Inner Tuples: 78
Batch Zero Inner Tuples: 3123
Batch Zero Potential Inner Tuples: 3125
Total Outer Tuples: 59986052
IM Outer Tuples: 11563695
Batch Zero Outer Tuples: 1577432
Batch Zero Potential Outer Tuples: 1693632
Total Output Tuples: 59986052
IM Output Tuples: 11563695
Batch Zero Output Tuples: 1577432
Batch Zero Potential Output Tuples: 1693632
Percentage less tuple IOs than HHJ: 19.61

(LI-S)-P
Values: 100 Skew: 0.27  Est. tuples: 59986052.00 Batches: 512  Est. Save: 
16114709.99
Total Inner Tuples: 2000000
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778
Total Output Tuples: 59986052
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98


TPC-H 10G Skew Factor Z=2 results
-

LI-P Regular HJ: (time in milliseconds)
505672
424922
303250
361610
358125
Average: 390715.8

LI-P Skew-enabled HJ: (time in milliseconds)
219078
210078
212938
210094
212500
Average: 212937.6

% difference: 45.5%


Re: [HACKERS] Potential Join Performance Issue

2009-01-07 Thread Lawrence, Ramon
 Has this been completed?  TODO item?

   I'd be more inclined to deal with the issue by trying to establish a
   safety margin in the estimate of whether the hash will go multi-batch.
   IOW we should disuse_physical_tlist if the hash is estimated to be
   close to but still within one batch.

I do not know how this issue was resolved.  It is an issue that is very
important for multi-batch hash joins.  The simplest resolution is to
disable physical_tlist on the outer relation for hash joins of more than
one batch.  However, as discussed in the thread, more sophisticated
solutions are also viable.

--
Ramon Lawrence



Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-28 Thread Lawrence, Ramon
 I thought about this, but upon due reflection I think it's the wrong
 approach.  Raising work_mem is a pretty common tuning step - it's 4MB
 even on my small OLTP systems, and in a data-warehousing environment
 where this optimization will bring the most benefit, it could easily
 be higher.  Furthermore, if someone DOES change the statistics target
 for that column to 10,000, there's a pretty good chance that they had
 a reason for doing so (or at the very least it's not for us to assume
 that they were doing something stupid). I think we need some kind of
 code to try to tune this based on the actual situation.
 
 We might try to size the in-memory hash table to be the largest value
 that won't increase the total number of batches, but if the number of
 batches is large then this won't be the right decision.  Maybe we
 should insist on setting aside some minimum percentage of work_mem for
 the in-memory hash table, and fill it with however many MCVs we think
 will fit.

I think that setting aside a minimum percentage of work_mem may be a
reasonable approach.  For instance, setting aside 1% at even 1 MB
work_mem would be 10 KB which is enough to store about 40 MCV tuples of
the TPC-H database.  Such a small percentage would be very unlikely (but
still possible) to change the number of batches used.  Then, given the
memory allocation and the known tuple size + overhead, only that number
of MCVs are selected for the MCV table regardless how many there are.
The MCV table size would then increase as work_mem is changed up to a
maximum given by the number of MCVs.
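
A sketch of that sizing rule with assumed tuple and overhead sizes (the
200-byte tuple and 56-byte per-entry overhead are placeholders, not
measurements from the patch); with 1% of a 1 MB work_mem it lands on
roughly the 40 tuples estimated above:

#include <stdio.h>

int main(void)
{
    long   work_mem_bytes = 1024 * 1024;  /* 1 MB, the minimum discussed */
    double im_percent     = 0.01;         /* 1% set aside for the MCV table */
    long   tuple_bytes    = 200;          /* assumed TPC-H tuple size */
    long   overhead_bytes = 56;           /* assumed per-entry bookkeeping */
    int    stats_mcvs     = 100;          /* MCVs available from the statistics */

    long budget = (long) (work_mem_bytes * im_percent);
    int  fit    = (int) (budget / (tuple_bytes + overhead_bytes));
    int  use    = fit < stats_mcvs ? fit : stats_mcvs;

    printf("IM budget = %ld bytes -> room for %d tuples -> use %d MCVs\n",
           budget, fit, use);
    return 0;
}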

 I agree.  However, there's no reason at all to assume that the tuples
 we flush out of the table are any better or worse than the new ones we
 add back in later.  In fact, although it's far from a guarantee, if
 the order of the tuples in the table is random, then we're more likely
 to encounter the most common values first.  We might as well just keep
 the ones we had rather than dumping them out and adding in different
 ones.  Err, except, maybe we can't guarantee correctness that way, in
 the case of a many-to-many join?

The code when building the MCV hash table keeps track of the order of
insertion of the best MCVs.  When it has to flush, it flushes the MCV
partitions of the least frequent MCVs first.  Thus, by the end of the build
partitioning phase the MCV hash table should only store the most
frequent MCV tuples.  Even with many-to-many joins as long as we keep
all build tuples that have a given MCV in memory, then everything is
fine.  You would get into problems if you only flushed some of the
tuples of a certain MCV but that will not happen.
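
A sketch of an eviction policy consistent with that goal (partition sizes
and the budget are made up): MCV partitions are ranked by estimated
frequency and, on overflow, the least frequent are flushed back to the
regular batches first so the hottest MCVs stay memory-resident.

#include <stdio.h>

typedef struct
{
    int  mcv_rank;     /* 0 = most frequent MCV */
    long bytes;        /* memory used by this partition's build tuples */
    int  in_memory;
} McvPartition;

int main(void)
{
    McvPartition part[5] = {
        {0, 4000, 1}, {1, 2500, 1}, {2, 2000, 1}, {3, 1500, 1}, {4, 1200, 1}
    };
    long used = 4000 + 2500 + 2000 + 1500 + 1200;   /* 11200 bytes in use */
    long budget = 9000;                              /* assumed IM budget */

    /* Evict from the least frequent end until we are back under budget. */
    for (int i = 4; i >= 0 && used > budget; i--)
    {
        used -= part[i].bytes;
        part[i].in_memory = 0;
        printf("flushed MCV partition rank %d (%ld bytes), now using %ld\n",
               part[i].mcv_rank, part[i].bytes, used);
    }
    return 0;
}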

--
Ramon Lawrence



Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-27 Thread Lawrence, Ramon
 -Original Message-
 From: Robert Haas [mailto:robertmh...@gmail.com]
 I looked at this some more.  I'm a little concerned about the way
 we're maintaining the in-memory hash table.  Since the highest legal
 statistics target is now 10,000, it's possible that we could have two
 orders of magnitude more MCVs than what you're expecting.  As I read
 the code, that could lead to construction of an in-memory hash table
 with 64K slots.  On a 32-bit machine, I believe that works out to 16
 bytes per partition (12 and 4), which is a 1MB hash table.  That's not
 necessarily problematic, except that I don't think you're considering
 the size of the hash table itself when evaluating whether you are
 blowing out work_mem, and the default size of work_mem is 1MB.

I totally agree that 10,000 MCVs changes things.  Ideally, these 10,000
MCVs should be kept in memory because they will join with the most
tuples.  However, the size of the MCV hash table (as you point out) can
be bigger than work_mem *by itself* not even considering the tuples in
the table or in the in-memory batch.  Supporting that many MCVs would
require more modifications to the hash join algorithm.

100 MCVs should be able to fit in memory though.  Since the number of
batches is rounded to a power of 2, there is often some hash_table_bytes
that are not used by the in-memory batch that can be used to store the
MCV table.  The absolute size of the memory used should also be
reasonable (depending on the tuple size in bytes).

So, basically, we have a decision to make whether to try support a
larger number of MCVs or cap it at a reasonable number like a 100.  You
can come up with situations where using all 10,000 MCVs is good (for
instance if all MCVs have frequency 1/10000), but I expect 100 MCVs will
capture the majority of the cases as usually the top 100 MCVs are
significantly more frequent than later MCVs.

I now also see that the code should be changed to keep track of the MCV
bytes separately from hashtable->spaceUsed as this is used to determine
when to dynamically increase the number of batches.

 I also don't really understand why we're trying to control the size of
 the hash table by flushing tuples after the fact.  Right now, when the
 in-memory table fills up, we just keep adding tuples to it, which in
 turn forces us to flush out other tuples to keep the size down.  This
 seems quite inefficient - not only are we doing a lot of unnecessary
 allocating and freeing, but those flushed slots in the hash table
 degrade performance (because they don't stop the scan for an empty
 slot).  It seems like we could simplify things considerably by adding
 tuples to the in-memory hash table only to the point where the next
 tuple would blow it out.  Once we get to that point, we can skip the
 isAMostCommonValue() test and send any future tuples straight to temp
 files.  (This would also reduce the memory consumption of the
 in-memory table by a factor of two.)

In the ideal case, we select a number of MCVs to support that we know
will always fit in memory.  The flushing is used to deal with the case
where we are doing a many-to-many join and there may be multiple tuples
with the given MCV value in the build relation.

The issue with building the MCV table is that the hash operator will not
be receiving tuples in MCV frequency order.  It is possible that the MCV
table is filled up with tuples of less frequent MCVs when a more
frequent MCV tuple arrives.  In that case, we would like to keep the
more frequent MCV and bump one of the less frequent MCVs.  

 We could potentially improve on this even further if we can estimate
 in advance how many MCVs we can fit into the in-memory hash table
 before it gets blown out.  If, for example, we have only 1MB of
 work_mem but there 10,000 MCVs, getMostCommonValues() might decide to
 only hash the first 1,000 MCVs.  Even if we still blow out the
 in-memory hash table, the earlier MCVs are more frequent than the
 later MCVs, so the ones that actually make it into the table are
 likely to be more beneficial.  I'm not sure exactly how to do this
 tuning though, since we'd need to approximate the size of the
 tuples... I guess the query planner makes some effort to estimate that
 but I'm not sure how to get at it.

The number of batches (nbatch), inner_rel_bytes, and hash_table_bytes
are calculated in ExecChooseHashTableSize in nodeHash.c.

The number of bytes free not allocated to the in-memory batch is then:

hash_table_bytes - inner_rel_bytes/nbatch

Depending on the power of 2 rounding of nbatch, this may be almost 0 or
quite large.  You could change the calculation of nbatch or try to
resize the in-memory batch, but that opens up a can of worms.  It may be
best to assume a small number of MCVs, say 10 or 100.
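
A sketch of that slack computation under assumed plan widths, showing the
two extremes (nbatch rounded up from 3 versus exactly 4):

#include <stdio.h>

static void show(double inner_rel_bytes, double hash_table_bytes, double tuple_width)
{
    int nbatch = 1;
    while ((double) nbatch * hash_table_bytes < inner_rel_bytes)
        nbatch <<= 1;                                   /* power-of-2 batch count */

    double slack = hash_table_bytes - inner_rel_bytes / nbatch;
    printf("inner=%.1f MB nbatch=%d slack=%.0f bytes (~%.0f MCV tuples)\n",
           inner_rel_bytes / 1048576.0, nbatch, slack, slack / tuple_width);
}

int main(void)
{
    double work_mem = 1048576.0;             /* assumed 1 MB hash_table_bytes */

    show(3.0 * work_mem, work_mem, 250.0);   /* rounds 3 -> 4: lots of slack */
    show(4.0 * work_mem, work_mem, 250.0);   /* exactly 4: essentially none  */
    return 0;
}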

 
  However, the join with Part and LineItem *should* show a benefit but may
  not because of a limitation of the patch implementation (not the idea).
  The MCV optimization is only enabled currently when the 

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-23 Thread Lawrence, Ramon
   Because there is no nice way in PostgreSQL (that I know of) to derive
   a histogram after a join (on an intermediate result) currently
   usingMostCommonValues is only enabled on a join when the outer (probe)
   side is a table scan (seq scan only actually).  See
   getMostCommonValues (soon to be called
   ExecHashJoinGetMostCommonValues) for the logic that determines this.
 
 So my test case of do a whole bunch of hash joins in a test query
 isn't really valid. Makes sense. I did another, more haphazard test on a
 query with fewer joins, and saw noticeable speedups.
 
  It's starting to seem to me that the case where this patch provides a
  benefit is so narrow that I'm not sure it's worth the extra code.
 
 Not that anyone asked, but I don't consider myself qualified to render
 judgement on that point. Code size is, I guess, a maintainability issue,
 and I'm not terribly experienced maintaining PostgreSQL :)
 
  Is it realistic to think that the MCVs of the base relation might
  still be applicable to the joinrel?  It's certainly easy to think of
  counterexamples, but it might be a good approximation more often than
  not.
 
 It's equivalent to our assumption that distributions of values in
 columns in the same table are independent. Making that assumption in
 this case would probably result in occasional dramatic speed
 improvements similar to the ones we've seen in less complex joins,
 offset by just-as-occasional dramatic slowdowns of similar magnitude. In
 other words, it will increase the variance of our results.
 
 - Josh

There is almost zero penalty for selecting incorrect MCV tuples to
buffer in memory.  Since the number of MCVs is approximately 100, the
overhead is keeping these 100 tuples in memory where they *might* not
be MCVs.  The cost is the little extra memory and the checking of the
MCVs which is very fast.

On the other hand, the benefit is potentially tremendous if the MCV is
very common in the probe relation.  Every probe tuple that matches the
MCV tuple in memory does not have to be written to disk.  The potential
speedup is directly proportional to the skew.  The more skew the more
benefit.

An analogy is with a page buffering system where one goal is to keep
frequently used pages in the buffer.  Essentially the goal of this patch
is to pin in memory the tuples that the join believes will match with
the most tuples on the probe side.  This reduces I/Os by making more
probe relation tuples match during the first read of the probe relation.
Regular hash join has no way to guarantee frequently matched build
tuples remain memory-resident.
 
The particular join with Customer, Orders, LineItem, and Part is a
reasonable test case.  There may be two explanations for the results.
(I am running tests for this query currently.) First, the time to
generate the tuples (select *) may be dominating the query time.
Second, as mentioned by Bryce, I expect the issue is that only the join
with Customer and Orders exploited the patch.  Customer has some skew
(but not dramatic) so there would be some speedup.  

However, the join with Part and LineItem *should* show a benefit but may
not because of a limitation of the patch implementation (not the idea).
The MCV optimization is only enabled currently when the probe side is a
sequential scan.  This limitation is due to our current inability to
determine a stats tuple of the join attribute on the probe side for
other operators.  (This should be possible - help please?).  

Even if this stats tuple is on the base relation and may not exactly
reflect the distribution of the intermediate relation on the probe side,
it still could be very good.  Even if it is not, once again the cost is
negligible.

In summary, the patch will improve performance of any multi-batch hash
join with skew.  It is useful right now when the probe relation has skew
and is accessed using a sequential scan.  It would be useful in even
more situations if the code was modified to determine the stats for the
join attribute of the probe relation in all cases (even when the probe
relation is produced by another operator).

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawre...@ubc.ca




Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-17 Thread Lawrence, Ramon
Robert,

You do not need to use qgen.exe to generate queries as you are not
running the TPC-H benchmark test.  Attached is an example of the 22
sample TPC-H queries according to the benchmark.  

We have not tested using the TPC-H queries for this particular patch and
only use the TPC-H database as a large, skewed data set.  The simpler
queries we test involve joins of Part-Lineitem or Supplier-Lineitem such
as:

Select * from part, lineitem where p_partkey = l_partkey  

OR

Select count(*) from part, lineitem where p_partkey = l_partkey  

The count(*) version is usually more useful for comparisons as the
generation of output tuples on the client side (say with pgadmin)
dominates the actual time to complete the query.

To isolate query costs, we also test using a simple server-side
function.  The setup description I have also attached.

I would be happy to help in any way I can.

Bryce is currently working on an updated patch according to your
suggestions.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawre...@ubc.ca


 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-
 ow...@postgresql.org] On Behalf Of Robert Haas
 Sent: December 17, 2008 7:54 PM
 To: Lawrence, Ramon
 Cc: Tom Lane; pgsql-hackers@postgresql.org; Bryce Cutt
 Subject: Re: [HACKERS] Proposed Patch to Improve Performance of Multi-
 Batch Hash Join for Skewed Data Sets
 
 Dr. Lawrence:
 
 I'm still working on reviewing this patch.  I've managed to load the
 sample TPCH data from tpch1g1z.zip after changing the line endings to
 UNIX-style and chopping off the trailing vertical bars.  (If anyone is
 interested, I have the results of pg_dump | bzip2 -9 on the resulting
 database, which I would be happy to upload if someone has server
 space.  It is about 250MB.)
 
 But, I'm not sure quite what to do in terms of generating queries.
 TPCHSkew contains QGEN.EXE, but that seems to require that you provide
 template queries as input, and I'm not sure where to get the
 templates.
 
 Any suggestions?
 
 Thanks,
 
 ...Robert
 
-- using 10100 as a seed to the RNG

-- QUERY_1

select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1998-09-01'
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;


-- QUERY_2

select
s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
from
part,
supplier,
partsupp,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and p_size = 28
and p_type like '%STEEL'
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'MIDDLE EAST'
and ps_supplycost = (
select
min(ps_supplycost)
from
partsupp,
supplier,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'MIDDLE EAST'
)
order by
s_acctbal desc,
n_name,
s_name,
p_partkey
limit 100;


-- QUERY_3

select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date '1995-03-31'
and l_shipdate > date '1995-03-31'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
limit 10;


-- QUERY_4

select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= date '1997-10-01'
and o_orderdate < date '1998-02-01'
and exists (
select
*
from
lineitem
where
l_orderkey = o_orderkey

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-24 Thread Lawrence, Ramon
 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED]
 I'm a tad worried about what happens when the values that are frequently
 occurring in the outer relation are also frequently occurring in the
 inner (which hardly seems an improbable case).  Don't you stand a severe
 risk of blowing out the in-memory hash table?  It doesn't appear to me
 that the code has any way to back off once it's decided that a certain
 set of join key values are to be treated in-memory.  Splitting the main
 join into more batches certainly doesn't help with that.
 
 Also, AFAICS the benefit of this patch comes entirely from avoiding dump
 and reload of tuples bearing the most common values, which means it's a
 significant waste of cycles when there's only one batch.  It'd be better
 to avoid doing any of the extra work in the single-batch case.
 
 One thought that might address that point as well as the difficulty of
 getting stats in nontrivial cases is to wait until we've overrun memory
 and are forced to start batching, and at that point determine on-the-fly
 which are the most common hash values from inspection of the hash table
 as we dump it out.  This would amount to optimizing on the basis of
 frequency in the *inner* relation not the outer, but offhand I don't see
 any strong theoretical basis why that wouldn't be just as good.  It
 could lose if the first work_mem worth of inner tuples isn't
 representative of what follows; but this hardly seems more dangerous
 than depending on MCV stats that are for the whole outer relation rather
 than the portion of it being selected.
 
   regards, tom lane

You are correct with both observations.  The patch only has a benefit
when there is more than one batch.  Also, there is a potential issue
with MCV hash table overflows if the number of tuples that match the
MCVs in the build relation is very large.

Bryce has created a patch (attached) that disables the code for one
batch joins.  This patch also checks for MCV hash table overflows and
handles them by flushing from the MCV hash table back to the main hash
table.  The main hash table will then resolve overflows as usual.  Note
that this will cause the worst case of a build table with all the same
values to be handled the same as the current hash code, i.e., it will
attempt to re-partition until it eventually gives up and then allocates
the entire partition in memory.  There may be a better way to handle
this case, but the new patch will remain consistent with the current
hash join implementation.

The issue with determining and using the MCV stats is more challenging
than it appears.  First, knowing the MCVs of the build table will not
help us.  What we need are the MCVs of the probe table because by
knowing those values we will keep the tuples with those values in the
build relation in memory.  For example, consider a join between tables
Part and LineItem.  Assume 1 popular part accounts for 10% of all
LineItems.  If Part is the build relation and LineItem is the probe
relation, then by keeping that 1 part record in memory, we will
guarantee that we do not need to write out 10% of LineItem.  If a
selection occurs on LineItem before the join, it may change the
distribution of LineItem (the MCVs) but it is probable that they are
still a good estimate of the MCVs in the derived LineItem relation.  (We
did experiments on trying to sample the first few thousand tuples of the
probe relation to dynamically determine the MCVs but generally found
this was inaccurate due to non-random samples.)  In essence, the goal is
to smartly pick the tuples that remain in the in-memory batch before
probing begins.  Since the number of MCVs is small, incorrectly
selecting build tuples to remain in memory has negligible cost.
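
A hedged sketch of the routing decision described here (not the patch code;
the MCV keys and batch count are invented): while partitioning the build
relation, a tuple whose join key matches a probe-side MCV is pinned in the
in-memory skew partition instead of going to its batch file.

#include <stdio.h>

#define NMCV    3
#define NBATCH  4

static const int probe_mcv_keys[NMCV] = {17, 42, 99};   /* assumed probe-side MCVs */

static int is_probe_mcv(int key)
{
    for (int i = 0; i < NMCV; i++)
        if (probe_mcv_keys[i] == key)
            return 1;
    return 0;
}

int main(void)
{
    int build_keys[] = {5, 17, 23, 42, 42, 61, 99, 7};
    int n = 8;

    for (int i = 0; i < n; i++)
    {
        int key = build_keys[i];
        if (is_probe_mcv(key))
            printf("build key %3d -> in-memory skew partition\n", key);
        else
            printf("build key %3d -> batch %d (spilled if not batch 0)\n",
                   key, (unsigned) key % NBATCH);
    }
    return 0;
}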

If we assume that LineItem has been filtered so much that it is now
smaller than Part and is the build relation then the MCV approach does
not apply.  There is no skew in Part on partkey (since it is the PK) and
knowing the MCV partkeys in LineItem does not help us because they each
only join with a single tuple in Part.  In this case, the MCV approach
should not be used because no benefit is possible, and it will not be
used because there will be no MCVs for Part.partkey.

The bad case with MCV hash table overflow requires a many-to-many join
between the two relations which would not occur on the more typical
PK-FK joins.  

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: [EMAIL PROTECTED]


histojoin_v3.patch
Description: histojoin_v3.patch



Re: [HACKERS] WIP: Hash Join-Filter Pruning using Bloom Filters

2008-11-10 Thread Lawrence, Ramon
 -Original Message-
 On Sun, Nov 2, 2008 at 10:49 PM, Jonah H. Harris
 [EMAIL PROTECTED] wrote:
 It's effective as-is for a preliminary patch.  The GUC code is the
 least of my worries.
 
  Can you provide some figures on the performance impact of the bloom
 filter?

I have tested the Bloom filter patch.  It compiles cleanly against HEAD.

As indicated, the performance improvements for hash join are good,
especially when the build table is filtered with a selection condition.
Performance improvements range from a couple of percent up to 20% for
multi-batch joins.  Note that the bloom filter will slightly slow
queries where the filter has no benefit.

I have not looked at the actual implementation of the Bloom filter, but
will proceed to do that next.  One issue to be considered is how the
space used for the bloom filter is related to the work_mem allocated to
the join.  That is, does the bloom filter consume some of the work_mem
space, or is it treated as additional memory allocated to the join?
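
For reference, the kind of structure under discussion is roughly the
following.  This is a minimal, self-contained sketch and not Jonah's
implementation; in particular, the bits array is exactly the memory whose
accounting against work_mem is the open question above.

#include <stdlib.h>

typedef struct BloomFilter
{
    unsigned char *bits;        /* bit array; this is the memory in question */
    size_t         nbits;
} BloomFilter;

static BloomFilter *
bloom_create(size_t nbits)
{
    BloomFilter *bf = malloc(sizeof(BloomFilter));

    bf->nbits = nbits;
    bf->bits = calloc((nbits + 7) / 8, 1);
    return bf;
}

/* Two cheap derived positions from one hash value of the join key. */
static size_t bloom_h1(unsigned int h, size_t n) { return h % n; }
static size_t bloom_h2(unsigned int h, size_t n) { return (h * 2654435761u) % n; }

/* Build phase: record every join key seen in the build relation. */
static void
bloom_add(BloomFilter *bf, unsigned int keyhash)
{
    size_t b1 = bloom_h1(keyhash, bf->nbits);
    size_t b2 = bloom_h2(keyhash, bf->nbits);

    bf->bits[b1 / 8] |= (unsigned char) (1 << (b1 % 8));
    bf->bits[b2 / 8] |= (unsigned char) (1 << (b2 % 8));
}

/* Probe phase: if this returns 0, the probe tuple cannot possibly match,
 * so the bucket probe (or the write to a batch file) can be skipped.
 * A return of 1 may still be a false positive. */
static int
bloom_maybe_contains(const BloomFilter *bf, unsigned int keyhash)
{
    size_t b1 = bloom_h1(keyhash, bf->nbits);
    size_t b2 = bloom_h2(keyhash, bf->nbits);

    return (bf->bits[b1 / 8] & (1 << (b1 % 8))) &&
           (bf->bits[b2 / 8] & (1 << (b2 % 8)));
}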

Experimental Results (in ms)
Query    Time With Filter    Time Without Filter    % Faster
1                   2,166                  2,648         18%
2                   1,665                  1,772          6%
3                   5,308                  6,374         17%
4                  63,690                 75,715         15%
5                  87,864                 81,552         -8%
6                  12,492                 11,696         -7%


Query 1: (provided by patch author) = 190,000 results
CREATE TABLE t1 (id INTEGER PRIMARY KEY, x INTEGER);
CREATE TABLE t2 (id INTEGER PRIMARY KEY, x INTEGER);
INSERT INTO t1 (SELECT ge, ge % 100 FROM generate_series(1, 1000000) ge);
INSERT INTO t2 (SELECT * FROM t1);
VACUUM ANALYZE;
SELECT COUNT(*)
  FROM t1, t2
 WHERE t1.id = t2.id
   AND t1.x < 30
   AND t2.x > 10;
   
The next five queries are on the TPCH 1GB data set.

Query 2: (in-memory join with restrictive build filter) = 56,110 results
select count(*) from customer, orders where c_custkey = o_custkey and
c_nationkey = 10;

Query 3: (multi-batch join with less restrictive build filter) = 841,767
results
select count(*) from customer, orders where c_custkey = o_custkey and
c_nationkey  10;

Query 4: (large multi-batch join) = 3,215,402 results
select count(*) from orders, lineitem where o_orderkey = l_orderkey and
o_totalprice  15;

Query 5: (large multi-batch join with no filter - hard case for BLOOM
filter) = 6,000,003 results
select count(*) from orders, lineitem where o_orderkey = l_orderkey;

Query 6: (large probe, in-memory build with no filter - hard case for
BLOOM filter) = 6,000,003 results
select count(*) from supplier, lineitem where s_suppkey = l_suppkey;

All tests were run 4 times; the initial run was discarded to avoid
buffering effects and the remaining times were averaged.

--
Dr. Ramon Lawrence
University of British Columbia Okanagan



Re: [HACKERS] WIP: Hash Join-Filter Pruning using Bloom Filters

2008-11-10 Thread Lawrence, Ramon
 -Original Message-
 From: Jonah H. Harris [mailto:[EMAIL PROTECTED]
 I have a new patch which does not create a bloom filter unless it sees
 that the hash join is going to batch.  I'll send it along later
 tonight.
 
 Currently it's additional space not accounted for by work_mem.
 Additionally, it's a good amount more space than is required.  This is
 fixed in the newer patch as well.

I think that the bloom filter will also improve the performance of
in-memory joins as well.  The basic trade-off in that case is the time
to probe multiple entries in a bucket in the hash table (which currently
defaults to 10) versus the cost of building/probing the bloom filter.
The bloom filter should win in this case as long as there are tuples in
the probe relation that cannot find a match in the build relation. 

My suggestion would be to keep it enabled for all joins.  If possible,
it would be valuable to try to estimate what percentage of tuples that
the bloom filter filters out.  A simple estimate would be to determine
the percentage of the build table that is involved in the join.  For
instance, the good test cases had between 40-90% of the customer
relation filtered out and a corresponding percentage of the probe
relation, lineitem, was filtered out by the bloom filter.  The bad case
used all of customer, so the bloom filter stopped no probe tuples.

It would be useful for testing to track the number and percentage of
probe tuples for which the bloom filter avoids a probe.  You may further
record whether those tuples would have gone to the in-memory batch or to
an on-disk batch.  These statistics may help you get the bloom filter
optimized for all cases.
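
A sketch of the kind of bookkeeping meant here (hypothetical counters,
reported once the join completes):

#include <stdio.h>

typedef struct BloomStats
{
    long probes_total;          /* probe tuples examined */
    long probes_skipped_mem;    /* filtered out; would have hit the in-memory batch */
    long probes_skipped_disk;   /* filtered out; would have gone to a batch file */
} BloomStats;

static void
bloom_stats_report(const BloomStats *s)
{
    long skipped = s->probes_skipped_mem + s->probes_skipped_disk;

    if (s->probes_total == 0)
        return;
    printf("bloom filter skipped %ld of %ld probe tuples (%.1f%%), "
           "%ld in-memory, %ld on-disk\n",
           skipped, s->probes_total,
           100.0 * (double) skipped / (double) s->probes_total,
           s->probes_skipped_mem, s->probes_skipped_disk);
}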

--
Ramon Lawrence





Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-06 Thread Lawrence, Ramon
 -Original Message-
  Minor question on this patch. AFAICS there is another patch that seems
  to be aiming at exactly the same use case. Jonah's Bloom filter patch.
 
  Shouldn't we have a dust off to see which one is best? Or at least a
  discussion to test whether they overlap? Perhaps you already did that
  and I missed it because I'm not very tuned in on this thread.
 
  --
   Simon Riggs   www.2ndQuadrant.com
   PostgreSQL Training, Services and Support
 
 We haven't had that discussion AFAIK, and definitely should. First
 glance suggests they could coexist peacefully, with proper coaxing. If
 I understand things properly, Jonah's patch filters tuples early in
 the join process, and this patch tries to ensure that hash join
 batches are kept in RAM when they're most likely to be used. So
 they're orthogonal in purpose, and the patches actually apply *almost*
 cleanly together. Jonah, any comments? If I continue to have some time
 to devote, and get through all I think I can do to review this patch,
 I'll gladly look at Jonah's too, FWIW.
 
 - Josh

The skew patch and bloom filter patch are orthogonal and can both be
applied.  The bloom filter patch is a great idea, and it is used in many
other database systems.  You can use the TPC-H data set to demonstrate
that the bloom filter patch will significantly improve performance of
multi-batch joins (with or without data skew).

Any query that filters a build table before joining on the probe table
will show improvements with a bloom filter.  For example, 

select * from customer, orders where customer.c_nationkey = 10 and
customer.c_custkey = orders.o_custkey

The bloom filter on customer would allow us to avoid probing with orders
tuples that cannot possibly find a match due to the selection criteria.
This is especially beneficial for multi-batch joins where an orders
tuple must be written to disk if its corresponding customer batch is not
the in-memory batch.
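
In outline, the probe side of a multi-batch hash join with such a filter
might behave as in the sketch below.  The helper functions are stand-ins
for the real machinery, not the actual patch code; the key point is that
a filtered-out probe tuple is dropped before it can be written to a batch
file.

#include <stdbool.h>

/* Stand-ins for the real hash join machinery. */
static bool bloom_maybe_match(unsigned int keyhash)      { (void) keyhash; return true; }
static bool probe_in_memory_batch(unsigned int keyhash)  { (void) keyhash; return false; }
static void write_probe_tuple_to_batch_file(int batchno) { (void) batchno; }

/*
 * Handle one probe (e.g. orders) tuple.  Tuples rejected by the filter
 * are discarded immediately: they neither probe the in-memory hash table
 * nor get written to an on-disk batch file, which is where the big win
 * for multi-batch joins comes from.
 */
static void
process_probe_tuple(unsigned int keyhash, int curbatch, int nbatch)
{
    int batchno = (int) (keyhash % (unsigned int) nbatch);

    if (!bloom_maybe_match(keyhash))
        return;                          /* cannot join; skip probe and skip disk write */

    if (batchno == curbatch)
        (void) probe_in_memory_batch(keyhash);       /* join against resident batch */
    else
        write_probe_tuple_to_batch_file(batchno);    /* defer to a later batch */
}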

I have no experience reviewing patches, but I would be happy to help
contribute/review the bloom filter patch as best I can.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: [EMAIL PROTECTED]



Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-02 Thread Lawrence, Ramon
Joshua,

Thank you for offering to review the patch.

The easiest way to test would be to generate your own TPC-H data and
load it into a database for testing.  I have posted the TPC-H generator
at:

http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip

The generator can produce skewed data sets.  It was produced by
Microsoft Research.

After unzipping, on a Windows machine, you can just run the command:

dbgen -s 1 -z 1

This will produce a TPC-H database of scale 1 GB with a Zipfian skew of
z=1.  More information on the generator is in the document README-S.DOC.
Source is provided for the generator, so you should be able to run it on
other operating systems as well.

The schema DDL is at:

http://people.ok.ubc.ca/rlawrenc/tpch_pg_ddl.txt

Note that the load time for 1G data is 1-2 hours and for 10G data is
about 24 hours.  I recommend you do not add the foreign keys until after
the data is loaded.

The other alternative is to do a pg_dump of our data sets.  However, the
download size would be quite large, and it will take a couple of days
for us to get you the data in that form.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: [EMAIL PROTECTED]


 -Original Message-
 From: Joshua Tolley [mailto:[EMAIL PROTECTED]
 Sent: November 1, 2008 3:42 PM
 To: Lawrence, Ramon
 Cc: pgsql-hackers@postgresql.org; Bryce Cutt
 Subject: Re: [HACKERS] Proposed Patch to Improve Performance of Multi-
 Batch Hash Join for Skewed Data Sets
 
 On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon [EMAIL PROTECTED] wrote:
  We propose a patch that improves hybrid hash join's performance for
  large multi-batch joins where the probe relation has skew.
 
  Project name: Histojoin
  Patch file: histojoin_v1.patch
 
  This patch implements the Histojoin join algorithm as an optional
  feature added to the standard Hybrid Hash Join (HHJ).  A flag is used
  to enable or disable the Histojoin features.  When Histojoin is
  disabled, HHJ acts as normal.  The Histojoin features allow HHJ to use
  PostgreSQL's statistics to do skew aware partitioning.  The basic idea
  is to keep build relation tuples in a small in-memory hash table that
  have join values that are frequently occurring in the probe relation.
  This improves performance of HHJ when multiple batches are used by 10%
  to 50% for skewed data sets.  The performance improvements of this
  patch can be seen in the paper (pages 25-30) at:
 
  http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf
 
  All generators and materials needed to verify these results can be
  provided.
 
  This is a patch against the HEAD of the repository.
 
  This patch does not contain platform specific code.  It compiles and
  has been tested on our machines in both Windows (MSVC++) and Linux
  (GCC).
 
  Currently the Histojoin feature is enabled by default and is used
  whenever HHJ is used and there are Most Common Value (MCV) statistics
  available on the probe side base relation of the join.  To disable
  this feature simply set the enable_hashjoin_usestatmcvs flag to off in
  the database configuration file or at run time with the 'set' command.
 
  One potential improvement not included in the patch is that Most
  Common Value (MCV) statistics are only determined when the probe
  relation is produced by a scan operator.  There is a benefit to using
  MCVs even when the probe relation is not a base scan, but we were
  unable to determine how to find statistics from a base relation after
  other operators are performed.
 
  This patch was created by Bryce Cutt as part of his work on his M.Sc.
  thesis.
 
  --
  Dr. Ramon Lawrence
  Assistant Professor, Department of Computer Science, University of
  British Columbia Okanagan
  E-mail: [EMAIL PROTECTED]
 
 I'm interested in trying to review this patch. Having not done patch
 review before, I can't exactly promise grand results, but if you could
 provide me with the data to check your results? In the meantime I'll
 go read the paper.
 
 - Josh / eggyknap



Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-02 Thread Lawrence, Ramon
 From: Tom Lane [mailto:[EMAIL PROTECTED]
 What alternatives are there for people who do not run Windows?
 
   regards, tom lane

The TPC-H generator is a standard code base provided at
http://www.tpc.org/tpch/.  We have been able to compile this code on
Linux.

However, we were unable to get the Microsoft modifications to this code
to compile on Linux (although they are supposed to be portable).  So, we
just used the Windows version with wine on our test Debian machine.  

I have also posted the text files for the TPC-H 1G 1Z data set at:

http://people.ok.ubc.ca/rlawrenc/tpch1g1z.zip

Note that you need to trim the extra characters at the end of the lines
for PostgreSQL to read them properly.

Since the data takes a while to generate and load, we can also provide a
compressed version of the PostgreSQL data directory of the databases
with the data already loaded.

--
Ramon Lawrence



[HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-10-20 Thread Lawrence, Ramon
We propose a patch that improves hybrid hash join's performance for
large multi-batch joins where the probe relation has skew.

 

Project name: Histojoin

Patch file: histojoin_v1.patch

 

This patch implements the Histojoin join algorithm as an optional
feature added to the standard Hybrid Hash Join (HHJ).  A flag is used to
enable or disable the Histojoin features.  When Histojoin is disabled,
HHJ acts as normal.  The Histojoin features allow HHJ to use
PostgreSQL's statistics to do skew aware partitioning.  The basic idea
is to keep build relation tuples in a small in-memory hash table that
have join values that are frequently occurring in the probe relation.
This improves performance of HHJ when multiple batches are used by 10%
to 50% for skewed data sets.  The performance improvements of this patch
can be seen in the paper (pages 25-30) at:

 

http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf

 

All generators and materials needed to verify these results can be
provided.

 

This is a patch against the HEAD of the repository.

 

This patch does not contain platform specific code.  It compiles and has
been tested on our machines in both Windows (MSVC++) and Linux (GCC).

 

Currently the Histojoin feature is enabled by default and is used
whenever HHJ is used and there are Most Common Value (MCV) statistics
available on the probe side base relation of the join.  To disable this
feature simply set the enable_hashjoin_usestatmcvs flag to off in the
database configuration file or at run time with the 'set' command.

 

One potential improvement not included in the patch is that Most Common
Value (MCV) statistics are only determined when the probe relation is
produced by a scan operator.  There is a benefit to using MCVs even when
the probe relation is not a base scan, but we were unable to determine
how to find statistics from a base relation after other operators are
performed.

 

This patch was created by Bryce Cutt as part of his work on his M.Sc.
thesis.

 

--

Dr. Ramon Lawrence

Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan

E-mail: [EMAIL PROTECTED]

 



histojoin_v1.patch
Description: histojoin_v1.patch



Re: [HACKERS] Potential Join Performance Issue

2008-09-12 Thread Lawrence, Ramon
 From: Tom Lane [mailto:[EMAIL PROTECTED]
 I was intending to do it the other way, actually.  An extra field in
 HashPath hardly costs anything.  The other reason for it is that there
  are other possible uses for knowing whether a hash will be multi-batch.
 (For example, if we were prepared to tell the executor that it *must*
 keep the hash to one batch, we could assume that the sort order of the
 left input is preserved.  I haven't looked into the risks/benefits of
 that too much, but it's been in the back of the mind for a long time.)

Having the number of batches in HashPath could be useful for a variety
of reasons.  For our research, we have added an nbatch
variable in both HashPath and HashJoin.  Having it in HashJoin is useful
as we modified EXPLAIN to output the number of batches.  There are costs
in putting an nbatch variable in HashPath as the system may set this
variable potentially hundreds/thousands of times during costing and does
not (currently) use it until you convert the chosen HashPath to a plan.

  I'd be more inclined to deal with the issue by trying to establish a
  safety margin in the estimate of whether the hash will go multi-batch.
  IOW we should disuse_physical_tlist if the hash is estimated to be
  close to but still within one batch.

Our experiments with large TPC-H 1GB joins show that it is almost always
better not to use physical_tlists if the number of batches is > 1.
There is a noticeable (approximately 5-15%) improvement when using
physical_tlists for in-memory joins.  For batches of size 2, it
sometimes can go either way depending how many attributes are projected
out of the outer relation.  Using physical_tlists may be better even for
batches of size 2 if most of the attributes of the outer relation are
kept.  For a larger number of batches, the extra I/O cost significantly
dominates over the physical_tlist optimization.  Performance of
multi-batch joins may improve 50% or more by disabling the optimization.

It is possible to create a safety margin by having
ExecChooseHashTableSize() return the value
inner_rel_bytes/hash_table_bytes which represents the fraction of the
memory available that the inner relation is expected to consume.  You
can then make decisions based on that.   However, this is only as good
as the inner relation size estimate and especially for large queries,
the estimate may be quite inaccurate.  A more robust solution could
examine the width of the path and the width of the relation combined
with the number of batches to see if projecting early would be worth it.
It may be best to keep it simple and just use number of batches > 1 as
the criterion and instead focus on examining issues with inaccurate join
size estimates.
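
A minimal sketch of such a safety-margin test follows.  The helper name
and the 0.75 cutoff are purely illustrative assumptions, not measured
values from our experiments.

/*
 * Hypothetical helper: decide whether the planner should force projection
 * (disuse_physical_tlist) on the outer input of a hash join.
 *
 * inner_rel_bytes / hash_table_bytes is the fraction of the available
 * memory the inner relation is expected to use; anything close to or over
 * 1.0 means the join is likely to go multi-batch, in which case carrying
 * unneeded outer columns is paid for again on every batch write/read.
 */
static int
should_project_outer(double inner_rel_bytes, double hash_table_bytes,
                     int expected_batches)
{
    double fill_fraction = inner_rel_bytes / hash_table_bytes;

    if (expected_batches > 1)
        return 1;               /* definitely multi-batch: always project */

    /* Safety margin: the size estimate may be low, so project early even
     * when we *think* we will just squeeze into one batch. */
    return fill_fraction > 0.75;
}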

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: [EMAIL PROTECTED]



[HACKERS] Potential Join Performance Issue

2008-09-09 Thread Lawrence, Ramon
PostgreSQL development community:

 

Our research group has been using the PostgreSQL code base to test new
join algorithms.  During testing, we noticed that the planner is not
pushing down projections to the outer relation in a hash join.  Although
this makes sense for in-memory (1 batch) joins, for joins larger than
memory (such as for TPC-H DSS), this causes the system to perform
significantly more disk I/Os when reading/writing batches of the outer
relation.

 

A simple solution is to add a single line of code to
src\backend\optimizer\plan\createplan.c after line 1771:

 

disuse_physical_tlist(outer_plan, best_path->jpath.outerjoinpath);

 

This will always force the projection on the outer relation.

 

A more complicated modification alternative is to add a state variable
to allow the planner to know how many batches the hash join expects and
only push down the projection if it is greater than one.  However,
pushing the projection on the outer relation is almost always the best
choice as it eliminates unneeded attributes for operators above the hash
join in the plan and will be robust in the case of poor estimates.

 

We have been testing using TPC-H scale factor 1 GB.  A sample query that
demonstrates the behavior is:

 

SELECT c_custkey, c_name, o_orderkey, o_orderdate
FROM Customer, Orders
WHERE c_custkey = o_custkey

 

Note that EXPLAIN on this query will indicate that the projection is
performed on the outer relation even though it is not done.  We found
the difference by modifying our code to track tuples and bytes output to
disk, but it also can be detected by watching the size of the temporary
files produced during the join.

 

Sincerely,

 

Dr. Ramon Lawrence

Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan

http://people.ok.ubc.ca/rlawrenc/

E-mail: [EMAIL PROTECTED]