Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-03-20 Thread Tom Lane
Bryce Cutt pandas...@gmail.com writes:
 Here is the new patch.

Applied with revisions.  I undid some of the optimizations that
cluttered the code in order to save a cycle or two per tuple --- as per
previous discussion, that's not what the performance questions were
about.  Also, I did not like the terminology in-memory/IM; it seemed
confusing since the main hash table is in-memory too.  I revised the
code to consistently refer to the additional hash table as a skew
hashtable and the optimization in general as skew optimization.  Hope
that seems reasonable to you --- we could search-and-replace it to
something else if you'd prefer.

For the moment, I didn't really do anything about teaching the planner
to account for this optimization in its cost estimates.  The initial
estimate of the number of MCVs that will be specially treated seems to
me to be too high (it's only accurate if the inner relation is unique),
but getting a more accurate estimate seems pretty hard, and it's not
clear it's worth the trouble.  Without that, though, you can't tell
what fraction of outer tuples will get the short-circuit treatment.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-03-20 Thread Robert Haas
On Fri, Mar 20, 2009 at 8:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Bryce Cutt pandas...@gmail.com writes:
 Here is the new patch.

 Applied with revisions.  I undid some of the optimizations that
 cluttered the code in order to save a cycle or two per tuple --- as per
 previous discussion, that's not what the performance questions were
 about.  Also, I did not like the terminology in-memory/IM; it seemed
 confusing since the main hash table is in-memory too.  I revised the
 code to consistently refer to the additional hash table as a skew
 hashtable and the optimization in general as skew optimization.  Hope
 that seems reasonable to you --- we could search-and-replace it to
 something else if you'd prefer.

 For the moment, I didn't really do anything about teaching the planner
 to account for this optimization in its cost estimates.  The initial
 estimate of the number of MCVs that will be specially treated seems to
 me to be too high (it's only accurate if the inner relation is unique),
 but getting a more accurate estimate seems pretty hard, and it's not
 clear it's worth the trouble.  Without that, though, you can't tell
 what fraction of outer tuples will get the short-circuit treatment.

If the inner relation isn't fairly close to unique you shouldn't be
using this optimization in the first place.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-03-20 Thread Robert Haas
On Fri, Mar 20, 2009 at 8:45 PM, Bryce Cutt pandas...@gmail.com wrote:
 On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas robertmh...@gmail.com wrote:
 If the inner relation isn't fairly close to unique you shouldn't be
 using this optimization in the first place.
 Not necessarily true.  Since (when the statistics are correct) we know
 each of these inner tuples will match the largest number of outer
 tuples, it is just as much of a win per inner tuple as when they are
 unique.  There is just a chance you will have to give up on the
 optimization part way through if too many inner tuples fall into the
 new skew buckets (formerly IM buckets) and dump the tuples back into
 the main buckets.  The potential win is still pretty high though.

 - Bryce Cutt

Maybe I'm remembering wrong, but I thought the estimating functions
assumed that the inner relation was unique.  So if there turn out to
be 2, 3, 4, or more copies of each value, the chances of blowing out
the skew hash table are almost 100%, I would think...  am I wrong?

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-03-20 Thread Bryce Cutt
Not necessarily true.  Since (when the statistics are correct) we know
each of these inner tuples will match the largest number of outer
tuples, it is just as much of a win per inner tuple as when they are
unique.  There is just a chance you will have to give up on the
optimization part way through if too many inner tuples fall into the
new skew buckets (formerly IM buckets) and dump the tuples back into
the main buckets.  The potential win is still pretty high though.

- Bryce Cutt


On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Mar 20, 2009 at 8:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Bryce Cutt pandas...@gmail.com writes:
 Here is the new patch.

 Applied with revisions.  I undid some of the optimizations that
 cluttered the code in order to save a cycle or two per tuple --- as per
 previous discussion, that's not what the performance questions were
 about.  Also, I did not like the terminology in-memory/IM; it seemed
 confusing since the main hash table is in-memory too.  I revised the
 code to consistently refer to the additional hash table as a skew
 hashtable and the optimization in general as skew optimization.  Hope
 that seems reasonable to you --- we could search-and-replace it to
 something else if you'd prefer.

 For the moment, I didn't really do anything about teaching the planner
 to account for this optimization in its cost estimates.  The initial
 estimate of the number of MCVs that will be specially treated seems to
 me to be too high (it's only accurate if the inner relation is unique),
 but getting a more accurate estimate seems pretty hard, and it's not
 clear it's worth the trouble.  Without that, though, you can't tell
 what fraction of outer tuples will get the short-circuit treatment.

 If the inner relation isn't fairly close to unique you shouldn't be
 using this optimization in the first place.

 ...Robert


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-03-20 Thread Bryce Cutt
The estimation functions assume the inner relation join column is
unique.  But the patch freezes (flushes back to the main hash table) one
skew bucket at a time, in order of least importance, so if 100 inner
tuples can fit in the skew buckets then the skew buckets are only fully
blown out if the best tuple (the single most common value) occurs more
than 100 times in the inner relation.  Up until that point you still
have in memory the tuples that give the most benefit per unit of
memory.  But yes, after that point (more than 100 tuples of that best
MCV) the entire effort was wasted.  The skew buckets are dynamically
flushed just like buckets in a dynamic hash join would be.
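
To make the policy concrete, here is a toy, self-contained C simulation
of the flushing behavior described above.  The bucket count, tuple size,
memory cap, and freeze order are invented for illustration only; they are
not the patch's actual structures or API.

#include <stdio.h>
#include <stdbool.h>

#define NBUCKETS  4     /* one skew bucket per MCV; bucket 0 = most common MCV */
#define CAP       500   /* bytes allowed for the whole skew area (toy number)  */
#define TUPSIZE   100   /* pretend every inner tuple costs 100 bytes           */

int
main(void)
{
	/* which MCV bucket each arriving inner (build) tuple belongs to */
	int		arrivals[] = {0, 1, 2, 3, 1, 0, 2, 3, 0, 1};
	int		n = (int) (sizeof(arrivals) / sizeof(arrivals[0]));
	int		bytes[NBUCKETS] = {0};
	bool	frozen[NBUCKETS] = {false};
	int		nUnfrozen = NBUCKETS;	/* freeze from the least common MCV down */
	int		used = 0;
	int		i;

	for (i = 0; i < n; i++)
	{
		int		b = arrivals[i];

		if (frozen[b])
			continue;			/* its bucket was flushed: tuple goes to the main table */

		bytes[b] += TUPSIZE;	/* keep the tuple in its skew bucket */
		used += TUPSIZE;

		/* over the cap: flush the least valuable unfrozen bucket(s) */
		while (used > CAP && nUnfrozen > 0)
		{
			int		victim = --nUnfrozen;	/* highest index = least common MCV */

			printf("freezing bucket %d, releasing %d bytes\n", victim, bytes[victim]);
			used -= bytes[victim];
			bytes[victim] = 0;
			frozen[victim] = true;
		}
	}
	printf("%d of %d skew buckets survive, using %d bytes\n", nUnfrozen, NBUCKETS, used);
	return 0;
}

In this toy run only the least common buckets get flushed; the whole
effort is wasted only if the single most common value by itself overflows
the cap, which is the point made above.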

- Bryce Cutt

On Fri, Mar 20, 2009 at 5:51 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Mar 20, 2009 at 8:45 PM, Bryce Cutt pandas...@gmail.com wrote:
 On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas robertmh...@gmail.com wrote:
 If the inner relation isn't fairly close to unique you shouldn't be
 using this optimization in the first place.
 Not necessarily true.  Since (when the statistics are correct) we know
 each of these inner tuples will match the largest number of outer
 tuples, it is just as much of a win per inner tuple as when they are
 unique.  There is just a chance you will have to give up on the
 optimization part way through if too many inner tuples fall into the
 new skew buckets (formerly IM buckets) and dump the tuples back into
 the main buckets.  The potential win is still pretty high though.

 - Bryce Cutt

 Maybe I'm remembering wrong, but I thought the estimating functions
 assumed that the inner relation was unique.  So if there turn out to
 be 2, 3, 4, or more copies of each value, the chances of blowing out
 the skew hash table are almost 100%, I would think...  am I wrong?

 ...Robert


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-03-06 Thread Tom Lane
Bryce Cutt pandas...@gmail.com writes:
 Here is the new patch.
 Our experiments show no noticeable performance issue when using the
 patch for cases where the optimization is not used because the number
 of extra statements executed when the optimization is disabled is
 insignificant.

 We have updated the patch to remove a couple of if statements, but
 this is really minor.  The biggest change was to MultiExecHash that
 avoids an if check per tuple by duplicating the hashing loop.

I think you missed the point of the performance questions.  It wasn't
about avoiding extra simple if-tests in the per-tuple loops; a few of
those are certainly not going to add measurable cost given how complex
the code is already.  (I really don't think you should be duplicating
hunks of code to avoid adding such tests.)  Rather, the concern was that
if we are dedicating a fraction of available work_mem to this purpose,
that reduces the overall efficiency of the regular non-IM code path,
principally by forcing the creation of more batches than would otherwise
be needed.  It's not clear whether the savings for IM tuples always
exceeds this additional cost.

After looking over the code a bit, there are two points that
particularly concern me in this connection:

* The IM hashtable is only needed during the first-batch processing;
once we've completed the first pass over the outer relation there is
no longer any need for it, unless I'm misunderstanding things
completely.  Therefore it really only competes for space with the
regular first batch.  However the damage to nbatches will already have
been done; in effect, we can expect that each subsequent batch will
probably only use (100 - IM_WORK_MEM_PERCENT)% of work_mem.  The patch
seems to try to deal with this by keeping IM_WORK_MEM_PERCENT negligibly
small, but surely that's mostly equivalent to fighting with one hand
tied behind your back.  I wonder if it'd be better to dedicate all of
work_mem to the MCV hash values during the first pass, rather than
allowing them to compete with the first regular batch.

* The IM hashtable creates an additional reason why nbatch might
increase during the initial scan of the inner relation; in fact, since
it's an effect not modeled in the initial choice of nbatch, it's
probably going to be a major reason for that to happen.  Increasing
nbatch on the fly isn't good because it results in extra I/O for tuples
that were previously assigned to what is now the wrong batch.  Again,
the only answer the patch has for this is to try not to use enough
of work_mem for it to make a difference.  Seems like instead the initial
nbatch estimate needs to account for that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-03-06 Thread Robert Haas
On Fri, Mar 6, 2009 at 1:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Bryce Cutt pandas...@gmail.com writes:
 Here is the new patch.
 Our experiments show no noticeable performance issue when using the
 patch for cases where the optimization is not used because the number
 of extra statements executed when the optimization is disabled is
 insignificant.

 We have updated the patch to remove a couple of if statements, but
 this is really minor.  The biggest change was to MultiExecHash that
 avoids an if check per tuple by duplicating the hashing loop.

 I think you missed the point of the performance questions.  It wasn't
 about avoiding extra simple if-tests in the per-tuple loops; a few of
 those are certainly not going to add measurable cost given how complex
 the code is already.  (I really don't think you should be duplicating
 hunks of code to avoid adding such tests.)  Rather, the concern was that

Well, at one point we were still trying to verify that (1) the patch
actually had a benefit and (2) blowing out the IM hashtable wasn't too
horribly nasty.  A great deal of improvement has been made in those
areas since this was first reviewed.  But your questions are
completely valid, too.  (I don't think anyone ever expressed a concern
about the simple if-tests, either.)

 if we are dedicating a fraction of available work_mem to this purpose,
 that reduces the overall efficiency of the regular non-IM code path,
 principally by forcing the creation of more batches than would otherwise
 be needed.  It's not clear whether the savings for IM tuples always
 exceeds this additional cost.

 After looking over the code a bit, there are two points that
 particularly concern me in this connection:

 * The IM hashtable is only needed during the first-batch processing;
 once we've completed the first pass over the outer relation there is
 no longer any need for it, unless I'm misunderstanding things
 completely.  Therefore it really only competes for space with the
 regular first batch.  However the damage to nbatches will already have
 been done; in effect, we can expect that each subsequent batch will
 probably only use (100 - IM_WORK_MEM_PERCENT)% of work_mem.  The patch
 seems to try to deal with this by keeping IM_WORK_MEM_PERCENT negligibly
 small, but surely that's mostly equivalent to fighting with one hand
 tied behind your back.   I wonder if it'd be better to dedicate all of
 work_mem to the MCV hash values during the first pass, rather than
 allowing them to compete with the first regular batch.

The IM hash table doesn't need to be very large in order to produce a
substantial benefit, because there are only going to be ~100 MCVs in
the probe table and each of those may well be unique in the build
table.  But no matter what size you choose for it, there's some danger
that it will push us over the edge into more batches, and if the skew
doesn't turn out to be enough to make up for that, you lose.  I'm not
sure there's any way to completely eliminate that unpleasant
possibility.

 * The IM hashtable creates an additional reason why nbatch might
 increase during the initial scan of the inner relation; in fact, since
 it's an effect not modeled in the initial choice of nbatch, it's
 probably going to be a major reason for that to happen.  Increasing
 nbatch on the fly isn't good because it results in extra I/O for tuples
 that were previously assigned to what is now the wrong batch.  Again,
 the only answer the patch has for this is to try not to use enough
 of work_mem for it to make a difference.  Seems like instead the initial
 nbatch estimate needs to account for that.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-03-06 Thread Lawrence, Ramon
  I think you missed the point of the performance questions.  It wasn't
  about avoiding extra simple if-tests in the per-tuple loops; a few of
  those are certainly not going to add measurable cost given how complex
  the code is already.  (I really don't think you should be duplicating
  hunks of code to avoid adding such tests.)  Rather, the concern was that
  if we are dedicating a fraction of available work_mem to this purpose,
  that reduces the overall efficiency of the regular non-IM code path,
  principally by forcing the creation of more batches than would otherwise
  be needed.  It's not clear whether the savings for IM tuples always
  exceeds this additional cost.

I misunderstood the concern.  So, there is no issue with the patch when it is 
disabled (single batch case or multi-batch with no skew)?  There is no memory 
allocated when the optimization is off, so these cases will not affect the 
number of batches or re-partitioning.  

  * The IM hashtable is only needed during the first-batch processing;
  once we've completed the first pass over the outer relation there is
  no longer any need for it, unless I'm misunderstanding things
  completely.  Therefore it really only competes for space with the
  regular first batch.  However the damage to nbatches will already have
  been done; in effect, we can expect that each subsequent batch will
  probably only use (100 - IM_WORK_MEM_PERCENT)% of work_mem.  The patch
  seems to try to deal with this by keeping IM_WORK_MEM_PERCENT negligibly
  small, but surely that's mostly equivalent to fighting with one hand
  tied behind your back.   I wonder if it'd be better to dedicate all of
  work_mem to the MCV hash values during the first pass, rather than
  allowing them to compete with the first regular batch.
 
 The IM hash table doesn't need to be very large in order to produce a
 substantial benefit, because there are only going to be ~100 MCVs in
 the probe table and each of those may well be unique in the build
 table.  But no matter what size you choose for it, there's some danger
 that it will push us over the edge into more batches, and if the skew
 doesn't turn out to be enough to make up for that, you lose.  I'm not
 sure there's any way to completely eliminate that unpleasant
 possibility.

Correct - the IM table only competes with the first batch during processing and 
is removed after the first pass.  Also, it tends to be VERY small, as the 
default of 100 MCVs usually results in 100 tuples being in the IM table, which 
is normally much less than 2% of work_mem.  We get almost all the benefit with 
100-1 MCVs with little downside risk.  Making the IM table larger (the size of 
work_mem) is both impossible (there are not that many MCVs) and carries a bigger 
downside risk if we get it wrong.
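
To put rough numbers on that, here is a small stand-alone calculation.
The 1 MB work_mem and the 100-byte per-tuple cost are invented,
illustrative assumptions, not measurements from the patch:

#include <stdio.h>

int
main(void)
{
	long	work_mem_bytes = 1024L * 1024L;				/* e.g. the 1 MB default  */
	long	im_budget_bytes = work_mem_bytes * 2 / 100;	/* the 2% IM cap          */
	long	n_mcvs = 100;								/* default MCV list size  */
	long	tuple_bytes = 100;							/* assumed per-tuple cost */

	printf("2%% of work_mem       = %ld bytes\n", im_budget_bytes);		/* ~20 KB */
	printf("100 MCV inner tuples = %ld bytes\n", n_mcvs * tuple_bytes);	/* ~10 KB */
	return 0;
}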
 
  * The IM hashtable creates an additional reason why nbatch might
  increase during the initial scan of the inner relation; in fact, since
  it's an effect not modeled in the initial choice of nbatch, it's
  probably going to be a major reason for that to happen.  Increasing
  nbatch on the fly isn't good because it results in extra I/O for tuples
  that were previously assigned to what is now the wrong batch.  Again,
  the only answer the patch has for this is to try not to use enough
  of work_mem for it to make a difference.  Seems like instead the initial
  nbatch estimate needs to account for that.

The possibility of the 1-2% IM_WORK_MEM_PERCENT causing a re-batch exists but 
is very small.  The number of batches is calculated in ExecChooseHashTableSize 
(costsize.c) as ceil(inner_rel_bytes/work_mem) rounded up to the next power of 
2.  Thus, hash join already wastes some of its work_mem allocation due to 
rounding.  For instance, if nbatch is calculated as 3 then rounded up to 4, 
only 75% of work_mem is used for each batch.  This leaves 25% of work_mem 
unaccounted for which may be used by the IM table (and also to compensate for 
build skew). Clearly, if nbatch is exactly 4, then this unaccounted space is 
not present and if the optimizer is exact in its estimates, the extra 1-2% may 
force a re-partition.
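
To make the arithmetic concrete, here is a small stand-alone sketch.  The
1 MB work_mem and 3 MB inner relation are invented numbers, and the real
ExecChooseHashTableSize accounts for much more (tuple overheads, bucket
counts, and so on) than this shows:

#include <stdio.h>
#include <math.h>

/* Round up to the next power of two, as the nbatch estimate does. */
static long
next_power_of_two(long n)
{
	long	p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int
main(void)
{
	double	work_mem_bytes = 1024.0 * 1024.0;			/* 1 MB, illustrative */
	double	inner_rel_bytes = 3.0 * 1024.0 * 1024.0;	/* 3 MB, illustrative */

	long	raw = (long) ceil(inner_rel_bytes / work_mem_bytes);	/* 3 */
	long	nbatch = next_power_of_two(raw);						/* 4 */

	printf("nbatch before rounding = %ld, after rounding = %ld\n", raw, nbatch);
	printf("expected work_mem use per batch = %.0f%%\n",
		   100.0 * inner_rel_bytes / nbatch / work_mem_bytes);		/* prints 75% */
	return 0;
}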

A solution may be to re-calculate nbatch factoring in the extra 1-2% during 
ExecHashTableCreate (nodeHashjoin.c) which calls ExecChooseHashTableSize again 
before execution.  The decision is whether to modify ExecChooseHashTableSize 
itself (which is used during costing) or to make a modified 
ExecChooseHashTableSize function that is only used once in ExecHashTableCreate.

We have tried to change the original code as little as possible, but it is 
possible to modify ExecChooseHashTableSize and the hash join cost function to 
be skew optimization aware.

--
Ramon Lawrence

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-03-02 Thread Bryce Cutt
Here is the new patch.

Our experiments show no noticeable performance issue when using the
patch for cases where the optimization is not used because the number
of extra statements executed when the optimization is disabled is
insignificant.

We have updated the patch to remove a couple of if statements, but
this is really minor.  The biggest change was to MultiExecHash that
avoids an if check per tuple by duplicating the hashing loop.

To demonstrate the differences, here is an analysis of the code
changes and their impact.

Three cases:
1) One batch hash join - Optimization is disabled.  Extra statements
executed are:
 - One if (hashtable->nbatch > 1) in ExecHashJoin (line 356 of nodeHashjoin.c)
 - One if optimization_on in MultiExecHash (line 259 of nodeHash.c)
 - One if optimization_on in MultiExecHash per probe tuple (line 431
of nodeHashjoin.c)
 - One if statement in ExecScanHashBucket per probe tuple (line 1071
of nodeHash.c)

2) Multi-batch hash join with limited skew - Optimization is disabled.
 Extra statements executed are:
 - One if (hashtable->nbatch > 1) in ExecHashJoin (line 356 of nodeHashjoin.c)
 - Executes ExecHashJoinDetectSkew method (at line 357 of
nodeHashjoin.c) that reads stats tuple for probe relation attribute
and determines if skew is above cut-off.  In this case, skew is not
above cutoff and no extra memory is used.
 - One if optimization_on in MultiExecHash (line 259 of nodeHash.c)
 - One if optimization_on in MultiExecHash per probe tuple (line 431
of nodeHashjoin.c)
 - One if statement in ExecScanHashBucket per probe tuple (line 1071
of nodeHash.c)

3) Multi-batch hash join with skew - Optimization is enabled. Extra
statements executed are:
 - One if (hashtable->nbatch > 1) in ExecHashJoin (line 356 of nodeHashjoin.c)
 - Executes ExecHashJoinDetectSkew method (at line 357 of
nodeHashjoin.c) that reads stats tuple for probe relation attribute
and determines there is skew.  Allocates space for XXX which is 2% of
work_mem.
 - One if optimization_on in MultiExecHash (line 259 of nodeHash.c)
 - In MultiExecHash, after each tuple is hashed, it determines whether the
tuple's join attribute value matches one of the MCVs.  If it does, the
tuple is put in the MCV structure.  The cost is the hash and search for
each build tuple.
 - If all IM buckets end up frozen in the build phase (MultiExecHash)
because they grow larger than the memory allowed for IM buckets, then
skew optimization is turned off and the probe phase reverts to Case 2.
 - For each probe tuple, it determines whether the value is an MCV by
performing a hash and a quick table lookup.  If yes, it probes the MCV
bucket; otherwise it runs the regular hash algorithm as usual.
 - One if statement in ExecScanHashBucket per probe tuple (line 1071
of nodeHash.c)
 - Additional cost is determining if a tuple is a common tuple (both
on build and probe side).  This additional cost is dramatically
outweighed by avoiding disk I/Os (even if they never hit the disk due
to caching).

The if statement on line 440 of nodeHashjoin.c (in ExecHashJoin) has
been rearranged so that in the single batch case short circuit
evaluation requires only the first test in the IF to be checked.

The limited skew check mentioned in Case 2 above is a simple check
in the ExecHashJoinDetectSkew function.

- Bryce Cutt



On Thu, Feb 26, 2009 at 12:16 PM, Bryce Cutt pandas...@gmail.com wrote:
 The patch originally modified the cost function but I removed that
 part before we submitted it to be a bit conservative about our
 proposed changes.  I didn't like that for large plans the statistics
 were retrieved and calculated many times when finding the optimal
 query plan.

 The overhead of the algorithm when the skew optimization is not used
 ends up being roughly a function call and an if statement per tuple.
 It would be easy to remove the function call per tuple.  Dr. Lawrence
 has come up with some changes so that when the optimization is turned
 off, the function call does not happen at all and instead of the if
 statement happening per tuple it is run just once per join.  We have
 to test this a bit more but it should further reduce the overhead.

 Hopefully we will have the new patch ready to go this weekend.

 - Bryce Cutt


 On Thu, Feb 26, 2009 at 7:45 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki's got a point here: the planner is aware that hashjoin doesn't
 like skewed distributions, and it assigns extra cost accordingly if it
 can determine that the join key is skewed.  (See the bucketsize stuff
 in cost_hashjoin.)  If this patch is accepted we'll want to tweak that
 code.

 Still, that has little to do with the current gating issue, which is
 whether we've convinced ourselves that the patch doesn't cause a
 performance decrease for cases in which it's unable to help.

                        regards, tom lane




histojoin_v6.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-26 Thread Heikki Linnakangas
I haven't been following this thread closely, so pardon if this has been 
discussed already.


The patch doesn't seem to change the cost estimates in the planner at 
all. Without that, I'd imagine that the planner rarely chooses a 
multi-batch hash join to begin with.


Joshua, in the tests that you've been running, did you have to rig the 
planner with enable_mergejoin=off or similar, to get the queries to use 
hash joins?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-26 Thread Robert Haas
On Thu, Feb 26, 2009 at 4:22 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I haven't been following this thread closely, so pardon if this has been
 discussed already.

 The patch doesn't seem to change the cost estimates in the planner at all.
 Without that, I'd imagine that the planner rarely chooses a multi-batch hash
 join to begin with.

AFAICS, a multi-batch hash join happens when you are joining two big,
unsorted paths.  The planner essentially compares the cost of sorting
the two paths and then merge-joining them versus the cost of a hash
join.  It doesn't seem to be unusual for the hash join to come out the
winner, although admittedly I haven't played with it a ton.  You
certainly could try to model it in the costing algorithm, but I'm not
sure how much benefit you'd get out of it: if you're doing this a lot
you're probably better off creating indices.

 Joshua, in the tests that you've been running, did you have to rig the
 planner with enable_mergejoin=off or similar, to get the queries to use
 hash joins?

I didn't have to fiddle anything, but Josh's tests were more exhaustive.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-26 Thread Joshua Tolley
On Wed, Feb 25, 2009 at 10:24:21PM -0500, Robert Haas wrote:
 I don't think we're really doing this the right way.  EXPLAIN ANALYZE
 has a measurable effect on the results, and we probably ought to stop
 the database and drop the VM caches after each query.  Are the Z1-Z7
 datasets on line someplace?  I might be able to rig up a script here.
 
 ...Robert

They're automatically generated by the dbgen utility, a link to which
was originally published somewhere in this thread. That tool creates a
few text files suitable (with some tweaking) for a COPY command. I've
got the original files... the .tbz I just made is 1.8 GB :) Anyone have
someplace they'd like me to drop it?

- Josh


signature.asc
Description: Digital signature


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-26 Thread Joshua Tolley
On Thu, Feb 26, 2009 at 08:22:52AM -0500, Robert Haas wrote:
 On Thu, Feb 26, 2009 at 4:22 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
  Joshua, in the tests that you've been running, did you have to rig the
  planner with enable_mergejoin=off or similar, to get the queries to use
  hash joins?
 
 I didn't have to fiddle anything, but Josh's tests were more exhaustive.

The planner chose hash joins for the queries I was running, regardless
of whether the patch was applied. I didn't have to mess with any
settings to get hash joins.

- Josh


signature.asc
Description: Digital signature


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-26 Thread Tom Lane
Heikki's got a point here: the planner is aware that hashjoin doesn't
like skewed distributions, and it assigns extra cost accordingly if it
can determine that the join key is skewed.  (See the bucketsize stuff
in cost_hashjoin.)  If this patch is accepted we'll want to tweak that
code.

Still, that has little to do with the current gating issue, which is
whether we've convinced ourselves that the patch doesn't cause a
performance decrease for cases in which it's unable to help.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-26 Thread Lawrence, Ramon
 From: Tom Lane
 Heikki's got a point here: the planner is aware that hashjoin doesn't
 like skewed distributions, and it assigns extra cost accordingly if it
 can determine that the join key is skewed.  (See the bucketsize stuff
 in cost_hashjoin.)  If this patch is accepted we'll want to tweak that
 code.

Those modifications would make the optimizer more likely to select hash
join, even with skewed distributions.  For the TPC-H data set that we
are using the optimizer always picks hash join over merge join (single
or multi-batch).  Since the current patch does not change the cost
function, there is no change in the planning cost.  It may or may not be
useful to modify the cost function depending on the effect on planning
cost.

 Still, that has little to do with the current gating issue, which is
 whether we've convinced ourselves that the patch doesn't cause a
 performance decrease for cases in which it's unable to help.

Although we have not seen any overhead when the optimization is
bypassed, we are looking at some small code changes that would
guarantee that no extra statements are executed for the single-batch
case.  Currently, an if optimization_on check is performed on each probe
tuple which, although minor, should be avoidable.

The patch's author, Bryce Cutt, is defending his Master's thesis Friday
morning (on this work), so we will provide some updated code right after
that.  Since these code changes are small, they should not affect people
trying to test the performance of the current patch.

--
Ramon Lawrence

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-26 Thread Bryce Cutt
The patch originally modified the cost function but I removed that
part before we submitted it to be a bit conservative about our
proposed changes.  I didn't like that for large plans the statistics
were retrieved and calculated many times when finding the optimal
query plan.

The overhead of the algorithm when the skew optimization is not used
ends up being roughly a function call and an if statement per tuple.
It would be easy to remove the function call per tuple.  Dr. Lawrence
has come up with some changes so that when the optimization is turned
off, the function call does not happen at all and instead of the if
statement happening per tuple it is run just once per join.  We have
to test this a bit more but it should further reduce the overhead.

Hopefully we will have the new patch ready to go this weekend.

- Bryce Cutt


On Thu, Feb 26, 2009 at 7:45 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki's got a point here: the planner is aware that hashjoin doesn't
 like skewed distributions, and it assigns extra cost accordingly if it
 can determine that the join key is skewed.  (See the bucketsize stuff
 in cost_hashjoin.)  If this patch is accepted we'll want to tweak that
 code.

 Still, that has little to do with the current gating issue, which is
 whether we've convinced ourselves that the patch doesn't cause a
 performance decrease for cases in which it's unable to help.

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-25 Thread Robert Haas
On Wed, Feb 25, 2009 at 12:38 AM, Lawrence, Ramon ramon.lawre...@ubc.ca wrote:
 -Original Message-
 From: Robert Haas
 Sadly, there seem to be a number of cases in the Z7 database where the
 optimization makes things significantly worse (specifically, queries
 2, 3, and 7, but especially query 3).  Have you investigated what is
 going on there?  I had thought that we had sufficient safeguards in
 place to prevent this optimization from kicking in in cases where it
 doesn't help, but it seems not.  There will certainly be real-world
 databases that are more like Z7 than Z1.

 I agree that there should be no noticeable performance difference when
 the optimization is not used (single batch case or no skew).  I think
 the patch achieves this.  The optimization is not used in those cases,
 but we will review to see if it is the code that by-passes the
 optimization that is causing a difference.

Yeah we need to understand what's going on there.

 The query #3 timing difference is primarily due to a flaw in the
 experimental setup.  For some reason, query #3 got executed before #4
 with the optimization on, and executed after #4 with the optimization
 off.  This skewed the results for all runs (due to buffering issues),
 but is especially noticeable for Z7.  Note how query #4 is always faster
 for the optimization on version even though the optimization is not
 actually used for those queries (because they were one batch).  I expect
 that if you run query #3 on Z7 in isolation then the results should be
 basically identical.

 I have attached the SQL script that Joshua sent me.  The raw data I have
 posted at: http://people.ok.ubc.ca/rlawrenc/test.output

I don't think we're really doing this the right way.  EXPLAIN ANALYZE
has a measurable effect on the results, and we probably ought to stop
the database and drop the VM caches after each query.  Are the Z1-Z7
datasets on line someplace?  I might be able to rig up a script here.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-24 Thread Robert Haas
 Joshua sent us some preliminary data with this query and others and indicated 
 that we could post it.  He wanted time to clean it up
 and re-run some experiments, but the data is generally good and the algorithm 
 performs as expected.  I have attached this data to the
 post.  Note that the last set of data (although labelled as Z7) is actually 
 an almost zero skew database and represents the worst-case
 for the algorithm (for most queries the optimization is not even used).

Sadly, there seem to be a number of cases in the Z7 database where the
optimization makes things significantly worse (specifically, queries
2, 3, and 7, but especially query 3).  Have you investigated what is
going on there?  I had thought that we had sufficient safeguards in
place to prevent this optimization from kicking in in cases where it
doesn't help, but it seems not.  There will certainly be real-world
databases that are more like Z7 than Z1.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-24 Thread Lawrence, Ramon
 -Original Message-
 From: Robert Haas
 Sadly, there seem to be a number of cases in the Z7 database where the
 optimization makes things significantly worse (specifically, queries
 2, 3, and 7, but especially query 3).  Have you investigated what is
 going on there?  I had thought that we had sufficient safeguards in
 place to prevent this optimization from kicking in in cases where it
 doesn't help, but it seems not.  There will certainly be real-world
 databases that are more like Z7 than Z1.

I agree that there should be no noticeable performance difference when
the optimization is not used (single batch case or no skew).  I think
the patch achieves this.  The optimization is not used in those cases,
but we will review to see if it is the code that by-passes the
optimization that is causing a difference.

The query #3 timing difference is primarily due to a flaw in the
experimental setup.  For some reason, query #3 got executed before #4
with the optimization on, and executed after #4 with the optimization
off.  This skewed the results for all runs (due to buffering issues),
but is especially noticeable for Z7.  Note how query #4 is always faster
for the optimization on version even though the optimization is not
actually used for those queries (because they were one batch).  I expect
that if you run query #3 on Z7 in isolation then the results should be
basically identical. 

I have attached the SQL script that Joshua sent me.  The raw data I have
posted at: http://people.ok.ubc.ca/rlawrenc/test.output


--
Ramon Lawrence



test.sql
Description: test.sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-19 Thread Lawrence, Ramon


From: pgsql-hackers-ow...@postgresql.org on behalf of Robert Haas
I think what we need here is some very simple testing to demonstrate
that this patch produces a speed-up even when the inner side of
the join is a joinrel rather than a baserel.  Can you suggest a single
query against the skewed TPCH dataset that will result in two or more
multi-batch hash joins?  If so, it should be a simple matter to run
that query with and without the patch and verify that the former is
faster than the latter.

This query will have the outer relation be a joinrel rather than a baserel:
 
select count(*) from supplier, part, lineitem where l_partkey = p_partkey and 
s_suppkey = l_suppkey;
 
The approach collects statistics on the outer relation (not the inner relation) 
so the code had to have the ability to determine a stats tuple on a joinrel in 
addition to a baserel.
 
Joshua sent us some preliminary data with this query and others and indicated 
that we could post it.  He wanted time to clean it up and re-run some 
experiments, but the data is generally good and the algorithm performs as 
expected.  I have attached this data to the post.  Note that the last set of 
data (although labelled as Z7) is actually an almost zero skew database and 
represents the worst-case for the algorithm (for most queries the optimization 
is not even used).
 
--
Ramon Lawrence


JoshuaTolleyData.xls
Description: JoshuaTolleyData.xls

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-19 Thread Joshua Tolley
On Wed, Feb 18, 2009 at 11:20:03PM -0500, Robert Haas wrote:
 On Wed, Jan 7, 2009 at 9:14 AM, Joshua Tolley eggyk...@gmail.com wrote:
  On Tue, Jan 06, 2009 at 11:49:57PM -0500, Robert Haas wrote:
  Josh / eggyknap -
 
  Can you rerun your performance tests with this version of the patch?
 
  ...Robert
 
  Will do, as soon as I can.
 
 Josh,
 
 Have you been able to do anything further with this?
 
 I'm attaching a rebased version of this patch with a few further
 whitespace cleanups.
 
 ...Robert

I keep trying to do testing, but not getting too far, though I did
return some test results to the original authors for their review. I'll
try to get a more formal response put together (my new daughter will be
24 hours old in a little bit, though, so it might be a while!)

- Josh 


signature.asc
Description: Digital signature


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-19 Thread David Fetter
On Thu, Feb 19, 2009 at 01:50:55PM -0700, Josh Tolley wrote:
 (my new daughter will be 24 hours old in a little bit, though, so it
 might be a while!)

Pics!

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-18 Thread Robert Haas
On Wed, Jan 7, 2009 at 9:14 AM, Joshua Tolley eggyk...@gmail.com wrote:
 On Tue, Jan 06, 2009 at 11:49:57PM -0500, Robert Haas wrote:
 Josh / eggyknap -

 Can you rerun your performance tests with this version of the patch?

 ...Robert

 Will do, as soon as I can.

Josh,

Have you been able to do anything further with this?

I'm attaching a rebased version of this patch with a few further
whitespace cleanups.

...Robert
*** a/src/backend/executor/nodeHash.c
--- b/src/backend/executor/nodeHash.c
***
*** 53,58  ExecHash(HashState *node)
--- 53,222 
  	return NULL;
  }
  
+ /*
+  * 
+  *		ExecHashGetIMBucket
+  *
+  *  	Returns the index of the in-memory bucket for this
+  *		hashvalue, or IM_INVALID_BUCKET if the hashvalue is not
+  *		associated with any unfrozen bucket (or if skew
+  *		optimization is not being used).
+  *
+  *		It is possible for a tuple whose join attribute value is
+  *		not a MCV to hash to an in-memory bucket due to the limited
+  * 		number of hash values but it is unlikely and everything
+  *		continues to work even if it does happen. We would
+  *		accidentally cache some less optimal tuples in memory
+  *		but the join result would still be accurate.
+  *
+  *		hashtable->imBucket is an open addressing hashtable of
+  *		in-memory buckets (HashJoinIMBucket).
+  * 
+  */
+ int
+ ExecHashGetIMBucket(HashJoinTable hashtable, uint32 hashvalue)
+ {
+ 	int bucket;
+ 
+ 	if (!hashtable->enableSkewOptimization)
+ 		return IM_INVALID_BUCKET;
+ 	
+ 	/* Modulo the hashvalue (using bitmask) to find the IM bucket. */
+ 	bucket = hashvalue & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * While we have not hit a hole in the hashtable and have not hit the
+ 	 * actual bucket we have collided in the hashtable so try the next
+ 	 * bucket location.
+ 	 */
+ 	while (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue != hashvalue)
+ 		bucket = (bucket + 1) & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * If the bucket exists and has been correctly determined return
+ 	 * the bucket index.
+ 	 */
+ 	if (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue == hashvalue)
+ 		return bucket;
+ 
+ 	/*
+ 	 * Must have run into an empty location or a frozen bucket which means the
+ 	 * tuple with this hashvalue is not to be handled as if it matches with an
+ 	 * in-memory bucket.
+ 	 */
+ 	return IM_INVALID_BUCKET;
+ }
+ 
+ /*
+  * 
+  *		ExecHashFreezeNextIMBucket
+  *
+  *		Freeze the tuples of the next in-memory bucket by pushing
+  *		them into the main hashtable.  Buckets are frozen in order
+  *		so that the best tuples are kept in memory the longest.
+  * 
+  */
+ static bool
+ ExecHashFreezeNextIMBucket(HashJoinTable hashtable)
+ {
+ 	int bucketToFreeze;
+ 	int bucketno;
+ 	int batchno;
+ 	uint32 hashvalue;
+ 	HashJoinTuple hashTuple;
+ 	HashJoinTuple nextHashTuple;
+ 	HashJoinIMBucket *bucket;
+ 	MinimalTuple mintuple;
+ 
+ 	/* Calculate the imBucket index of the bucket to freeze. */
+ 	bucketToFreeze = hashtable->imBucketFreezeOrder
+ 		[hashtable->nUsedIMBuckets - 1 - hashtable->nIMBucketsFrozen];
+ 
+ 	/* Grab a pointer to the actual IM bucket. */
+ 	bucket = hashtable->imBucket[bucketToFreeze];
+ 	hashvalue = bucket->hashvalue;
+ 
+ 	/*
+ 	 * Grab a pointer to the first tuple in the soon to be frozen IM bucket.
+ 	 */
+ 	hashTuple = bucket->tuples;
+ 
+ 	/*
+ 	 * Calculate which bucket and batch the tuples belong to in the main
+ 	 * non-IM hashtable.
+ 	 */
+ 	ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno, &batchno);
+ 
+ 	/* until we have read all tuples from this bucket */
+ 	while (hashTuple != NULL)
+ 	{
+ 		/*
+ 		 * Some of this code is very similar to that of ExecHashTableInsert.
+ 		 * We do not call ExecHashTableInsert directly as
+ 		 * ExecHashTableInsert expects a TupleTableSlot and we already have
+ 		 * HashJoinTuples.
+ 		 */
+ 		mintuple = HJTUPLE_MINTUPLE(hashTuple);
+ 
+ 		/* Decide whether to put the tuple in the hash table or a temp file. */
+ 		if (batchno == hashtable->curbatch)
+ 		{
+ 			/* Put the tuple in hash table. */
+ 			nextHashTuple = hashTuple->next;
+ 			hashTuple->next = hashtable->buckets[bucketno];
+ 			hashtable->buckets[bucketno] = hashTuple;
+ 			hashTuple = nextHashTuple;
+ 			hashtable->spaceUsedIM -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 		}
+ 		else
+ 		{
+ 			/* Put the tuples into a temp file for later batches. */
+ 			Assert(batchno > hashtable->curbatch);
+ 			ExecHashJoinSaveTuple(mintuple, hashvalue,
+   &hashtable->innerBatchFile[batchno]);
+ 			/*
+ 			 * Some memory has been freed up. This must be done before we
+ 			 * pfree the hashTuple or we lose access to the tuple size.
+ 			 */
+ 			hashtable->spaceUsed -= HJTUPLE_OVERHEAD + 

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-02-18 Thread Robert Haas
 At this point, we await further feedback on what is necessary to get
 this patch accepted.  We would also like to thank Josh and Robert again
 for their review time.

I think what we need here is some very simple testing to demonstrate
that this patch produces a speed-up even when the inner side of
the join is a joinrel rather than a baserel.  Can you suggest a single
query against the skewed TPCH dataset that will result in two or more
multi-batch hash joins?  If so, it should be a simple matter to run
that query with and without the patch and verify that the former is
faster than the latter.

Thanks,

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-01-14 Thread Lawrence, Ramon

 Here is a cleaned-up version.  I fixed a number of whitespace issues,
 improved a few comments, and rearranged one set of nested if-else
 statements (hopefully without breaking anything in the process).
 
 Josh / eggyknap -
 
 Can you rerun your performance tests with this version of the patch?

To help with testing, we have constructed a patch specifically for that
purpose.  The patch is the same as Robert's version except that it
tracks and prints out statistics during the join on how many tuples are
affected, and it has the enable_hashjoin_usestatmcvs variable defined so
that it is easy to turn skew handling on and off.  This is useful
because, although the patch reduces the number of I/Os performed, the
improvement may not be seen in some queries that are dominated by other
cost factors (non-skew joins, CPU time, time to scan input relations,
etc.).

The sample output looks like this:

LI-P
Values: 100 Skew: 0.27  Est. tuples: 59986052.00 Batches: 512  Est.
Save: 16114709.99
Total Inner Tuples: 200
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778
Total Output Tuples: 59986052
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98

The other change is that the system calculates the skew and will not use
the in-memory skew partition if the skew is less than 1%.

Finally, we have attached some performance results for the TPCH 10G data
set (skew factors z=1 and z=2).  For the Customer-Orders-Lineitem-Part
query that Josh was testing, we see no overall time difference that is
significant compared to experimental error (although there is I/O
benefit for the Lineitem-Part join).  This query cost is dominated by
the non-skew joins of Customer-Orders and Orders-Lineitem and output
tuple construction.

The joins with skew, Lineitem-Supplier and Lineitem-Part, show
significantly improved performance.  Note how the statistics show that
the percentage I/O savings is directly proportional to the skew.
However, the overall query time savings is always less than this as
there are other costs such as reading the relations, performing the hash
comparisons, building the output tuples, etc. that are unaffected by the
optimization.

At this point, we await further feedback on what is necessary to get
this patch accepted.  We would also like to thank Josh and Robert again
for their review time.

Sincerely,

Ramon Lawrence and Bryce Cutt


histojoin_testing.patch
Description: histojoin_testing.patch
TPC-H 10G Skew Factor Z=1 results
-

LI-P Regular HJ: (time in milliseconds)
990344
1022562
1071250
1003219
1049000
989953
Average: 1021054.667


LI-P Skew-enabled HJ: (time in milliseconds)
883593
960860
934906
1007282
937406
948078
Average: 945354.1667

% Difference: 7.4%


LI-P
Values: 100 Skew: 0.27  Est. tuples: 59986052.00 Batches: 512  Est. Save: 
16114709.99
Total Inner Tuples: 200
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778
Total Output Tuples: 59986052
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98



LI-P-S Regular HJ: (time in milliseconds)
1833016
1567515
1504625
Average: 1635052

LI-P-S Skew-enabled HJ: (time in milliseconds)
883593
1280297
1423984
Average: 1195958

% Difference: 27%

LI-S
Values: 100 Skew: 0.19  Est. tuples: 59986052.00 Batches: 32  Est. Save: 
11097357.16
Total Inner Tuples: 10
IM Inner Tuples: 78
Batch Zero Inner Tuples: 3123
Batch Zero Potential Inner Tuples: 3125
Total Outer Tuples: 59986052
IM Outer Tuples: 11563695
Batch Zero Outer Tuples: 1577432
Batch Zero Potential Outer Tuples: 1693632
Total Output Tuples: 59986052
IM Output Tuples: 11563695
Batch Zero Output Tuples: 1577432
Batch Zero Potential Output Tuples: 1693632
Percentage less tuple IOs than HHJ: 19.61

(LI-S)-P
Values: 100 Skew: 0.27  Est. tuples: 59986052.00 Batches: 512  Est. Save: 
16114709.99
Total Inner Tuples: 200
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778
Total Output Tuples: 59986052
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98


TPC-H 10G Skew Factor Z=2 results
-

LI-P Regular HJ: (time in milliseconds)
505672
424922
303250
361610
358125
Average: 390715.8

LI-P Skew-enabled HJ: (time in milliseconds)
219078
210078
212938
210094
212500
Average: 212937.6

% difference: 45.5%


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-01-07 Thread Joshua Tolley
On Tue, Jan 06, 2009 at 11:49:57PM -0500, Robert Haas wrote:
 Josh / eggyknap -
 
 Can you rerun your performance tests with this version of the patch?
 
 ...Robert

Will do, as soon as I can. 


signature.asc
Description: Digital signature


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2009-01-06 Thread Bryce Cutt
The latest version of the patch is attached.  The revision considerably
cleans up the code, especially variable naming consistency.  We have
adopted the use of IM (in-memory) in variable names for the hash table
structures as suggested.

Two other implementations changes:

1) The overhead of the hash table has been reduced by allocating an
array of pointers instead of an array of structs and only allocating the
structs as they are needed to store MCVs.  IM buckets are now frozen
by first removing all tuples and then deleting the struct from memory.
This allows more memory to be freed, as well as the removal of the
frozen field in the IM bucket struct, which now makes that struct only
8 bytes on a 32-bit machine.  If for some reason all IM buckets are
frozen, all IM struct overhead is removed from memory to further reduce
the memory footprint.  (A toy sketch of this layout appears after item
2 below.)

2) This patch supports using a set percentage of work_mem (currently 2%)
to store the build tuples that join frequently with probe relation
tuples.  The code only allocates MCVs up to the maximum amount and will
flush from the in-memory hash table if the memory is ever exceeded.  The
code also ensures that the overall join memory used (the MCV hash table
and batch 0 in memory) does not exceed spaceAllocated as usual.  If this 2%
of memory is not used by the MCV hash table then it can be used by batch 0.
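
As a rough illustration of the layout described in point 1 above, here is
a stand-alone sketch; the type names and fields are invented stand-ins,
not the patch's actual definitions:

#include <stdio.h>
#include <stdint.h>

struct ToyHashTuple;					/* stand-in for the tuple-list type */

/*
 * Toy stand-in for the slimmed-down IM bucket: just the MCV's hash value
 * and the head of its tuple list.  With the frozen flag gone this is
 * 8 bytes on a typical 32-bit build; the bucket array itself holds
 * pointers, so a slot whose struct has not been allocated (or has been
 * freed after a freeze) costs only one pointer.
 */
typedef struct ToyIMBucket
{
	uint32_t	hashvalue;
	struct ToyHashTuple *tuples;
} ToyIMBucket;

int
main(void)
{
	printf("cost of an empty or frozen slot: %zu bytes (one pointer)\n",
		   sizeof(ToyIMBucket *));
	printf("cost of a live bucket struct:    %zu bytes on this platform\n",
		   sizeof(ToyIMBucket));
	return 0;
}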

These changes mostly relate to style, although some of the cleanup
has made the code slightly faster.

We would really appreciate help on finalizing this patch, especially in
regard to style issues.  Thank you for all the help.

- Dr. Ramon Lawrence and Bryce Cutt


On Sun, Jan 4, 2009 at 6:48 PM, Robert Haas robertmh...@gmail.com wrote:
 1) Isn't ExecHashFreezeNextMCVPartition actually a most common TUPLE
 partition, rather than a most common VALUE partition (and similarly for
 ExecHashGetMCVPartition)?

 A partition stores all tuples that correspond to that MCV value.  It is
 usually one for foreign key joins but may be more than one.  (Plus, it
 may store other tuples that have the same hash value for the join
 attribute as the MCV value.)

 I guess my point is - check that your variable/function/structure
 member naming is consistent between different parts of the code.  The
 ExecHashGetMCVPartition function accesses structure members called
 nMostCommonTuplePartitionHashBuckets, nMostCommonTuplePartition, and
 mostCommonTuplePartition.  It seems inconsistent that the function
 name uses MCVPartition and
 the structure members use mostCommonTuplePartition - aren't we talking
 about the same thing in both cases?

 And, more to the point, the terminology just seems wrong to me, the
 more I think about it.  I mean, ExecHashGetMCVPartition is not
 finding a partition of the MCVs.  It's finding a partition of an
 in-memory hash table which we plan to populate with MCVs.   That's why
 I'm wondering if we should make it ExecHashGetIMPartition,
 nIMPartitionHashBuckets, etc.

 2) Have you done any performance testing on the impact of this change?

 Yes, the ability to use MCVs for more than sequential scans
 significantly improves performance in multi-join cases.  Allocating only
 1% of memory will not affect any performance results, as all our testing
 was done with an MCV count of 10 or 100, which is significantly below a
 1% allocation of work_mem.  If anything,
 performance would be improved when using more MCVs.

 That is a very good thing.

 Finally, any help you can provide on style concerns to make this easier
 to commit would be appreciated.  We will put all the effort required
 over the next few days to get this into 8.4.

 If I have time, I might be willing to make a style run over the next
 version of the patch after you post it to the list, and just correct
 anything I see and repost.  This might be faster than sending comments
 back and forth, if you are OK with it.  I have a day job so this would
 probably need to be Tuesday or Wednesday night.  My main advice is
 read the diff before you post it.  Sometimes things will just pop
 out at you that are less obvious when you are head-down in the code.

 Random stuff I notice in v4 patch: make sure all lines fit in 80
 columns (except for long error messages if any), missing space before
 closing comment delimiter in ExecHashGetMCVPartition, extraneous blank
 line added to nodeHash.c just before the comment that says and remove
 from hash table, comment in ExecHashJoinGetMostCommonValues just
 after the get_attstatsslot call is formatted strangely, still extra
 curly braces around the calls to
 ExecScanHashMostCommonValuePartition/ExecScanHashBucket.

 ...Robert



histojoin_v5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-01-06 Thread Robert Haas
 We would really appreciate help on finalizing this patch, especially in
 regard to style issues.  Thank you for all the help.

Here is a cleaned-up version.  I fixed a number of whitespace issues,
improved a few comments, and rearranged one set of nested if-else
statements (hopefully without breaking anything in the process).

Josh / eggyknap -

Can you rerun your performance tests with this version of the patch?

...Robert
*** a/src/backend/executor/nodeHash.c
--- b/src/backend/executor/nodeHash.c
***
*** 53,58  ExecHash(HashState *node)
--- 53,222 
  	return NULL;
  }
  
+ /*
+  * 
+  *		ExecHashGetIMBucket
+  *
+  *  	Returns the index of the in-memory bucket for this
+  *		hashvalue, or IM_INVALID_BUCKET if the hashvalue is not
+  *		associated with any unfrozen bucket (or if skew
+  *		optimization is not being used).
+  *  
+  *		It is possible for a tuple whose join attribute value is
+  *		not a MCV to hash to an in-memory bucket due to the limited
+  * 		number of hash values but it is unlikely and everything
+  *		continues to work even if it does happen. We would
+  *		accidentally cache some less optimal tuples in memory
+  *		but the join result would still be accurate.
+  *
+  *		hashtable->imBucket is an open addressing hashtable of
+  *		in-memory buckets (HashJoinIMBucket).
+  * 
+  */
+ int 
+ ExecHashGetIMBucket(HashJoinTable hashtable, uint32 hashvalue)
+ {
+ 	int bucket;
+ 
+ 	if (!hashtable->enableSkewOptimization)
+ 		return IM_INVALID_BUCKET;
+ 	
+ 	/* Modulo the hashvalue (using bitmask) to find the IM bucket. */
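+ 	/* (This bitmask form of modulo assumes nIMBuckets is a power of two.) */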
+ 	bucket = hashvalue & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * While we have not hit a hole in the hashtable and have not found the
+ 	 * actual bucket, we have collided with another entry, so try the next
+ 	 * bucket location.
+ 	 */
+ 	while (hashtable->imBucket[bucket] != NULL
+ 		   && hashtable->imBucket[bucket]->hashvalue != hashvalue)
+ 		bucket = (bucket + 1) & (hashtable->nIMBuckets - 1);
+ 
+ 	/*
+ 	 * If the bucket exists and has been correctly determined return
+ 	 * the bucket index.
+ 	 */
+ 	if (hashtable->imBucket[bucket] != NULL
+ 		&& hashtable->imBucket[bucket]->hashvalue == hashvalue)
+ 		return bucket;
+ 
+ 	/*
+ 	 * Must have run into an empty location or a frozen bucket which means the
+ 	 * tuple with this hashvalue is not to be handled as if it matches with an
+ 	 * in-memory bucket.
+ 	 */
+ 	return IM_INVALID_BUCKET;
+ }
+ 
+ /*
+  * 
+  *		ExecHashFreezeNextIMBucket
+  *
+  *		Freeze the tuples of the next in-memory bucket by pushing
+  *		them into the main hashtable.  Buckets are frozen in order
+  *		so that the best tuples are kept in memory the longest.
+  * 
+  */
+ static bool 
+ ExecHashFreezeNextIMBucket(HashJoinTable hashtable)
+ {
+ 	int bucketToFreeze;
+ 	int bucketno;
+ 	int batchno;
+ 	uint32 hashvalue;
+ 	HashJoinTuple hashTuple;
+ 	HashJoinTuple nextHashTuple;
+ 	HashJoinIMBucket *bucket;
+ 	MinimalTuple mintuple;
+ 
+ 	/* Calculate the imBucket index of the bucket to freeze. */
+ 	bucketToFreeze = hashtable->imBucketFreezeOrder
+ 		[hashtable->nUsedIMBuckets - 1 - hashtable->nIMBucketsFrozen];
+ 
+ 	/* Grab a pointer to the actual IM bucket. */
+ 	bucket = hashtable->imBucket[bucketToFreeze];
+ 	hashvalue = bucket->hashvalue;
+ 
+ 	/*
+ 	 * Grab a pointer to the first tuple in the soon to be frozen IM bucket.
+ 	 */
+ 	hashTuple = bucket->tuples;
+ 
+ 	/*
+ 	 * Calculate which bucket and batch the tuples belong to in the main
+ 	 * non-IM hashtable.
+ 	 */
+ 	ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno, &batchno);
+ 
+ 	/* until we have read all tuples from this bucket */
+ 	while (hashTuple != NULL)
+ 	{
+ 		/*
+ 		 * Some of this code is very similar to that of ExecHashTableInsert.
+ 		 * We do not call ExecHashTableInsert directly as
+ 		 * ExecHashTableInsert expects a TupleTableSlot and we already have
+ 		 * HashJoinTuples.
+ 		 */
+ 		mintuple = HJTUPLE_MINTUPLE(hashTuple);
+ 
+ 		/* Decide whether to put the tuple in the hash table or a temp file. */
+ 		if (batchno == hashtable->curbatch)
+ 		{
+ 			/* Put the tuple in hash table. */
+ 			nextHashTuple = hashTuple->next;
+ 			hashTuple->next = hashtable->buckets[bucketno];
+ 			hashtable->buckets[bucketno] = hashTuple;
+ 			hashTuple = nextHashTuple;
+ 			hashtable->spaceUsedIM -= HJTUPLE_OVERHEAD + mintuple->t_len;
+ 		}
+ 		else
+ 		{
+ 			/* Put the tuples into a temp file for later batches. */
+ 			Assert(batchno > hashtable->curbatch);
+ 			ExecHashJoinSaveTuple(mintuple, hashvalue,
+ 								  &hashtable->innerBatchFile[batchno]);
+ 			/*
+ 			 * Some memory has been freed up. This must be done before we
+ 			 * pfree the hashTuple or we lose access to the tuple size.
+ 			 */
+ 			hashtable->spaceUsed -= 

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-01-01 Thread Robert Haas
On Tue, Dec 30, 2008 at 12:29 AM, Bryce Cutt pandas...@gmail.com wrote:
 Here is the next patch version.

Thanks for posting this update.  This is definitely getting better,
but I still see some style issues.  We can work on fixing those once
the rest of the details have been finalized.

However, one question in this area - isn't
ExecHashFreezeNextMCVPartition actually a most common TUPLE partition,
rather than a most common VALUE partition (and similarly for
ExecHashGetMCVPartition)?  I'm not quite sure what to do about this as
the names are already quite long - is there some better name for the
functions and structure members than MostCommonTuplePartition?  Maybe
we could call it the in-memory partition and abbreviate it IMPartition
throughout.  I think that might make things more clear.

 The code can now find the MCVs in more cases.  Even if the probe
 side is an operator other than a seq scan (such as another hashjoin)
 the code can now find the stats tuple for the underlying relation.

You're using varnoold in a way that directly contradicts the comment
in primnodes.h (essentially, that it's not used for anything other
than debugging).  I don't think this is a bad thing, but you have to
patch the comment.

Have you done any performance testing on the impact of this change?

 The new idea of limiting the number of MCVs to a percentage of memory
 has not been added yet.

That's a pretty important change, I think, though it would be nice to
have one of the committers chime in here.  For those who may not have
been following the thread closely, the current implementation's memory
usage can go quite a bit higher than work_mem - the in-memory open
hash table can be up to 1MB or so (if statistics_target = 10K) plus it
can contain up to work_mem of tuples plus each batch can contain
another work_mem of tuples.  The proposal is to carve out 1-3% of
work_mem for the in-memory hash table and leave the rest for the
batches, thus hopefully not affecting the # of batches very much.  If
it doesn't look like the whole MCV list will fit, we'll take a shot at
guessing what length prefix of it will.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-30 Thread Bryce Cutt
Here is the next patch version.

The naming and style concerns have been addressed.  The patch now only
touches 5 files.  4 of those files are hashjoin specific and 1 is to
add a couple lines to a hashjoin specific struct in another file.

The code can now find the MCVs in more cases.  Even if the probe
side is an operator other than a seq scan (such as another hashjoin)
the code can now find the stats tuple for the underlying relation.

The new idea of limiting the number of MCVs to a percentage of memory
has not been added yet.

- Bryce Cutt


On Mon, Dec 29, 2008 at 8:55 PM, Robert Haas robertmh...@gmail.com wrote:
 I think that setting aside a minimum percentage of work_mem may be a
 reasonable approach.  For instance, setting aside 1% at even 1 MB
 work_mem would be 10 KB which is enough to store about 40 MCV tuples of
 the TPC-H database.  Such a small percentage would be very unlikely (but
 still possible) to change the number of batches used.  Then, given the
 memory allocation and the known tuple size + overhead, only that number
 of MCVs are selected for the MCV table regardless how many there are.
 The MCV table size would then increase as work_mem is changed up to a
 maximum given by the number of MCVs.

 Sounds fine.  Maybe 2-3% would be better.

 The code when building the MCV hash table keeps track of the order of
 insertion of the best MCVs.  It then flushes the MCV partitions in
 decreasing order of frequency of MCVs.  Thus, by the end of the build
 partitioning phase the MCV hash table should only store the most
 frequent MCV tuples.  Even with many-to-many joins as long as we keep
 all build tuples that have a given MCV in memory, then everything is
 fine.  You would get into problems if you only flushed some of the
 tuples of a certain MCV but that will not happen.

 OK, I'll read it again - I must not have understood.

 It would be good to post an updated patch soon, even if not everything
 has been addressed.

 ...Robert



histojoin_v4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-29 Thread Robert Haas
 I think that setting aside a minimum percentage of work_mem may be a
 reasonable approach.  For instance, setting aside 1% at even 1 MB
 work_mem would be 10 KB which is enough to store about 40 MCV tuples of
 the TPC-H database.  Such a small percentage would be very unlikely (but
 still possible) to change the number of batches used.  Then, given the
 memory allocation and the known tuple size + overhead, only that number
 of MCVs are selected for the MCV table regardless how many there are.
 The MCV table size would then increase as work_mem is changed up to a
 maximum given by the number of MCVs.

Sounds fine.  Maybe 2-3% would be better.

 The code when building the MCV hash table keeps track of the order of
 insertion of the best MCVs.  It then flushes the MCV partitions in
 decreasing order of frequency of MCVs.  Thus, by the end of the build
 partitioning phase the MCV hash table should only store the most
 frequent MCV tuples.  Even with many-to-many joins as long as we keep
 all build tuples that have a given MCV in memory, then everything is
 fine.  You would get into problems if you only flushed some of the
 tuples of a certain MCV but that will not happen.

OK, I'll read it again - I must not have understood.

It would be good to post an updated patch soon, even if not everything
has been addressed.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-28 Thread Robert Haas
 I totally agree that 10,000 MCVs changes things.  Ideally, these 10,000
 MCVs should be kept in memory because they will join with the most
 tuples.  However, the size of the MCV hash table (as you point out) can
 be bigger than work_mem *by itself* not even considering the tuples in
 the table or in the in-memory batch.

 So, basically, we have a decision to make whether to try support a
 larger number of MCVs or cap it at a reasonable number like a 100.  You
 can come up with situations where using all 10,000 MCVs is good (for
 instance if all MCVs have frequency 1/1), but I expect 100 MCVs will
 capture the majority of the cases as usually the top 100 MCVs are
 significantly more frequent than later MCVs.

I thought about this, but upon due reflection I think it's the wrong
approach.  Raising work_mem is a pretty common tuning step - it's 4MB
even on my small OLTP systems, and in a data-warehousing environment
where this optimization will bring the most benefit, it could easily
be higher.  Furthermore, if someone DOES change the statistics target
for that column to 10,000, there's a pretty good chance that they had
a reason for doing so (or at the very least it's not for us to assume
that they were doing something stupid). I think we need some kind of
code to try to tune this based on the actual situation.

We might try to size the in-memory hash table to be the largest value
that won't increase the total number of batches, but if the number of
batches is large then this won't be the right decision.  Maybe we
should insist on setting aside some minimum percentage of work_mem for
the in-memory hash table, and fill it with however many MCVs we think
will fit.

 The issue with building the MCV table is that the hash operator will not
 be receiving tuples in MCV frequency order.  It is possible that the MCV
 table is filled up with tuples of less frequent MCVs when a more
 frequent MCV tuple arrives.  In that case, we would like to keep the
 more frequent MCV and bump one of the less frequent MCVs.

I agree.  However, there's no reason at all to assume that the tuples
we flush out of the table are any better or worse than the new ones we
add back in later.  In fact, although it's far from a guarantee, if
the order of the tuples in the table is random, then we're more likely
to encounter the most common values first.  We might as well just keep
the ones we had rather than dumping them out and adding in different
ones.  Err, except, maybe we can't guarantee correctness that way, in
the case of a many-to-many join?

I don't think there's any way to get around the possibility of a
hash-table overflow completely.  Besides many-to-many joins, there's
also the possibility of hash collisions.  The code assumes that
anything that hashes to the same 32-bit value as an MCV is in fact an
MCV, which is obviously false, but doesn't seem worth worrying about
since the chances of a collision are very small and the equality test
might be expensive.  But clearly we want to minimize overflows as much
as we can.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-28 Thread Lawrence, Ramon
 I thought about this, but upon due reflection I think it's the wrong
 approach.  Raising work_mem is a pretty common tuning step - it's 4MB
 even on my small OLTP systems, and in a data-warehousing environment
 where this optimization will bring the most benefit, it could easily
 be higher.  Furthermore, if someone DOES change the statistics target
 for that column to 10,000, there's a pretty good chance that they had
 a reason for doing so (or at the very least it's not for us to assume
 that they were doing something stupid). I think we need some kind of
 code to try to tune this based on the actual situation.
 
 We might try to size the in-memory hash table to be the largest value
 that won't increase the total number of batches, but if the number of
 batches is large then this won't be the right decision.  Maybe we
 should insist on setting aside some minimum percentage of work_mem for
 the in-memory hash table, and fill it with however many MCVs we think
 will fit.

I think that setting aside a minimum percentage of work_mem may be a
reasonable approach.  For instance, setting aside 1% at even 1 MB
work_mem would be 10 KB which is enough to store about 40 MCV tuples of
the TPC-H database.  Such a small percentage would be very unlikely (but
still possible) to change the number of batches used.  Then, given the
memory allocation and the known tuple size + overhead, only that number
of MCVs are selected for the MCV table regardless how many there are.
The MCV table size would then increase as work_mem is changed up to a
maximum given by the number of MCVs.

 I agree.  However, there's no reason at all to assume that the tuples
 we flush out of the table are any better or worse than the new ones we
 add back in later.  In fact, although it's far from a guarantee, if
 the order of the tuples in the table is random, then we're more likely
 to encounter the most common values first.  We might as well just keep
 the ones we had rather than dumping them out and adding in different
 ones.  Err, except, maybe we can't guarantee correctness that way, in
 the case of a many-to-many join?

The code when building the MCV hash table keeps track of the order of
insertion of the best MCVs.  It then flushes the MCV partitions in
decreasing order of frequency of MCVs.  Thus, by the end of the build
partitioning phase the MCV hash table should only store the most
frequent MCV tuples.  Even with many-to-many joins as long as we keep
all build tuples that have a given MCV in memory, then everything is
fine.  You would get into problems if you only flushed some of the
tuples of a certain MCV but that will not happen.
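
(A toy illustration of that bookkeeping, with made-up names -- not the
patch code.  Buckets are remembered in the order their MCVs were inserted,
most frequent first, and flushing works backward from the tail of that
order so the most frequent MCVs stay resident the longest.)

#include <stdio.h>

int
main(void)
{
	int freezeOrder[] = {7, 3, 0, 5, 2}; /* bucket numbers, best MCV first */
	int nUsed = 5;                       /* buckets actually populated */
	int nFrozen = 0;

	/* Memory pressure: flush two partitions, least frequent first. */
	while (nFrozen < 2)
	{
		int victim = freezeOrder[nUsed - 1 - nFrozen];

		printf("flushing bucket %d\n", victim); /* spill its tuples to disk */
		nFrozen++;
	}
	return 0;
}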

--
Ramon Lawrence

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-27 Thread Lawrence, Ramon
 -Original Message-
 From: Robert Haas [mailto:robertmh...@gmail.com]
 I looked at this some more.  I'm a little concerned about the way
 we're maintaining the in-memory hash table.  Since the highest legal
 statistics target is now 10,000, it's possible that we could have two
 orders of magnitude more MCVs than what you're expecting.  As I read
 the code, that could lead to construction of an in-memory hash table
 with 64K slots.  On a 32-bit machine, I believe that works out to 16
 bytes per partition (12 and 4), which is a 1MB hash table.  That's not
 necessarily problematic, except that I don't think you're considering
 the size of the hash table itself when evaluating whether you are
 blowing out work_mem, and the default size of work_mem is 1MB.

I totally agree that 10,000 MCVs changes things.  Ideally, these 10,000
MCVs should be kept in memory because they will join with the most
tuples.  However, the size of the MCV hash table (as you point out) can
be bigger than work_mem *by itself* not even considering the tuples in
the table or in the in-memory batch.  Supporting that many MCVs would
require more modifications to the hash join algorithm.

100 MCVs should be able to fit in memory though.  Since the number of
batches is rounded to a power of 2, there is often some hash_table_bytes
that are not used by the in-memory batch that can be used to store the
MCV table.  The absolute size of the memory used should also be
reasonable (depending on the tuple size in bytes).

So, basically, we have a decision to make whether to try support a
larger number of MCVs or cap it at a reasonable number like a 100.  You
can come up with situations where using all 10,000 MCVs is good (for
instance if all MCVs have frequency 1/1), but I expect 100 MCVs will
capture the majority of the cases as usually the top 100 MCVs are
significantly more frequent than later MCVs.

I now also see that the code should be changed to keep track of the MCV
bytes separately from hashtable->spaceUsed as this is used to determine
when to dynamically increase the number of batches.

 I also don't really understand why we're trying to control the size of
 the hash table by flushing tuples after the fact.  Right now, when the
 in-memory table fills up, we just keep adding tuples to it, which in
 turn forces us to flush out other tuples to keep the size down.  This
 seems quite inefficient - not only are we doing a lot of unnecessary
 allocating and freeing, but those flushed slots in the hash table
 degrade performance (because they don't stop the scan for an empty
 slot).  It seems like we could simplify things considerably by adding
 tuples to the in-memory hash table only to the point where the next
 tuple would blow it out.  Once we get to that point, we can skip the
 isAMostCommonValue() test and send any future tuples straight to temp
 files.  (This would also reduce the memory consumption of the
 in-memory table by a factor of two.)

In the ideal case, we select a number of MCVs to support that we know
will always fit in memory.  The flushing is used to deal with the case
where we are doing a many-to-many join and there may be multiple tuples
with the given MCV value in the build relation.

The issue with building the MCV table is that the hash operator will not
be receiving tuples in MCV frequency order.  It is possible that the MCV
table is filled up with tuples of less frequent MCVs when a more
frequent MCV tuple arrives.  In that case, we would like to keep the
more frequent MCV and bump one of the less frequent MCVs.  

 We could potentially improve on this even further if we can estimate
 in advance how many MCVs we can fit into the in-memory hash table
 before it gets blown out.  If, for example, we have only 1MB of
 work_mem but there are 10,000 MCVs, getMostCommonValues() might decide to
 only hash the first 1,000 MCVs.  Even if we still blow out the
 in-memory hash table, the earlier MCVs are more frequent than the
 later MCVs, so the ones that actually make it into the table are
 likely to be more beneficial.  I'm not sure exactly how to do this
 tuning though, since we'd need to approximate the size of the
 tuples... I guess the query planner makes some effort to estimate that
 but I'm not sure how to get at it.

The number of batches (nbatch), inner_rel_bytes, and hash_table_bytes
are calculated in ExecChooseHashTableSize in nodeHash.c.

The number of bytes free not allocated to the in-memory batch is then:

hash_table_bytes - inner_rel_bytes/nbatch

Depending on the power of 2 rounding of nbatch, this may be almost 0 or
quite large.  You could change the calculation of nbatch or try to
resize the in-memory batch, but that opens up a can of worms.  It may be
best to assume a small number of MCVs, such as 10 or 100.
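
(To make the possible range concrete, here is a worked example with
made-up numbers; the real calculation in ExecChooseHashTableSize is more
involved, but the leftover-space arithmetic is the same.)

#include <stdio.h>

static long
leftover(long hash_table_bytes, long inner_rel_bytes)
{
	long nbatch = 1;

	/* Number of batches, rounded up to a power of two. */
	while (nbatch * hash_table_bytes < inner_rel_bytes)
		nbatch <<= 1;

	return hash_table_bytes - inner_rel_bytes / nbatch;
}

int
main(void)
{
	long MB = 1024L * 1024L;

	/* 9 MB inner relation, 1 MB work_mem: nbatch = 16, ~448 KB left over. */
	printf("%ld bytes free\n", leftover(1 * MB, 9 * MB));

	/* 16 MB inner relation, 1 MB work_mem: nbatch = 16, nothing left over. */
	printf("%ld bytes free\n", leftover(1 * MB, 16 * MB));
	return 0;
}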

 
  However, the join with Part and LineItem *should* show a benefit but may
  not because of a limitation of the patch implementation (not the idea).
  The MCV optimization is only enabled currently when the 

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-25 Thread Robert Haas
 There is almost zero penalty for selecting incorrect MCV tuples to
 buffer in memory.  Since the number of MCVs is approximately 100, the
 overhead is keeping these 100 tuples in memory where they *might* not
 be MCVs.  The cost is the little extra memory and the checking of the
 MCVs which is very fast.

I looked at this some more.  I'm a little concerned about the way
we're maintaining the in-memory hash table.  Since the highest legal
statistics target is now 10,000, it's possible that we could have two
orders of magnitude more MCVs than what you're expecting.  As I read
the code, that could lead to construction of an in-memory hash table
with 64K slots.  On a 32-bit machine, I believe that works out to 16
bytes per partition (12 and 4), which is a 1MB hash table.  That's not
necessarily problematic, except that I don't think you're considering
the size of the hash table itself when evaluating whether you are
blowing out work_mem, and the default size of work_mem is 1MB.

I also don't really understand why we're trying to control the size of
the hash table by flushing tuples after the fact.  Right now, when the
in-memory table fills up, we just keep adding tuples to it, which in
turn forces us to flush out other tuples to keep the size down.  This
seems quite inefficient - not only are we doing a lot of unnecessary
allocating and freeing, but those flushed slots in the hash table
degrade performance (because they don't stop the scan for an empty
slot).  It seems like we could simplify things considerably by adding
tuples to the in-memory hash table only to the point where the next
tuple would blow it out.  Once we get to that point, we can skip the
isAMostCommonValue() test and send any future tuples straight to temp
files.  (This would also reduce the memory consumption of the
in-memory table by a factor of two.)

We could potentially improve on this even further if we can estimate
in advance how many MCVs we can fit into the in-memory hash table
before it gets blown out.  If, for example, we have only 1MB of
work_mem but there are 10,000 MCVs, getMostCommonValues() might decide to
only hash the first 1,000 MCVs.  Even if we still blow out the
in-memory hash table, the earlier MCVs are more frequent than the
later MCVs, so the ones that actually make it into the table are
likely to be more beneficial.  I'm not sure exactly how to do this
tuning though, since we'd need to approximate the size of the
tuples... I guess the query planner makes some effort to estimate that
but I'm not sure how to get at it.
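
(Roughly what I have in mind, as a sketch with made-up names -- the open
question is where the per-tuple size estimate comes from:)

#include <stdio.h>

static int
mcvs_that_fit(int num_mcvs, long budget_bytes, long est_tuple_bytes)
{
	long	fit = budget_bytes / est_tuple_bytes;

	/* Hash only the most frequent prefix of the MCV list that fits. */
	return (fit < num_mcvs) ? (int) fit : num_mcvs;
}

int
main(void)
{
	/* 256 KB budget out of 1 MB work_mem, ~256-byte tuples, 10,000 MCVs:
	 * keep roughly the first 1,000 MCVs, as in the example above. */
	printf("MCVs hashed: %d\n",
		   mcvs_that_fit(10000, 256L * 1024L, 256));
	return 0;
}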

 However, the join with Part and LineItem *should* show a benefit but may
 not because of a limitation of the patch implementation (not the idea).
 The MCV optimization is only enabled currently when the probe side is a
 sequential scan.  This limitation is due to our current inability to
 determine a stats tuple of the join attribute on the probe side for
 other operators.  (This should be possible - help please?).

Not sure how to get at this either, but I'll take a look and see if I
can figure it out.

Merry Christmas,

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-23 Thread Lawrence, Ramon
   Because there is no nice way in PostgreSQL (that I know of) to derive
   a histogram after a join (on an intermediate result) currently
   usingMostCommonValues is only enabled on a join when the outer (probe)
   side is a table scan (seq scan only actually).  See
   getMostCommonValues (soon to be called
   ExecHashJoinGetMostCommonValues) for the logic that determines this.

 So my test case of do a whole bunch of hash joins in a test query
 isn't really valid. Makes sense. I did another, more haphazard test on a
 query with fewer joins, and saw noticeable speedups.

  It's starting to seem to me that the case where this patch provides a
  benefit is so narrow that I'm not sure it's worth the extra code.

 Not that anyone asked, but I don't consider myself qualified to render
 judgement on that point. Code size is, I guess, a maintainability issue,
 and I'm not terribly experienced maintaining PostgreSQL :)

  Is it realistic to think that the MCVs of the base relation might
  still be applicable to the joinrel?  It's certainly easy to think of
  counterexamples, but it might be a good approximation more often than
  not.

 It's equivalent to our assumption that distributions of values in
 columns in the same table are independent. Making that assumption in
 this case would probably result in occasional dramatic speed
 improvements similar to the ones we've seen in less complex joins,
 offset by just-as-occasional dramatic slowdowns of similar magnitude. In
 other words, it will increase the variance of our results.
 
 - Josh

There is almost zero penalty for selecting incorrect MCV tuples to
buffer in memory.  Since the number of MCVs is approximately 100, the
overhead is keeping these 100 tuples in memory where they *might* not
be MCVs.  The cost is the little extra memory and the checking of the
MCVs which is very fast.

On the other hand, the benefit is potentially tremendous if the MCV is
very common in the probe relation.  Every probe tuple that matches the
MCV tuple in memory does not have to be written to disk.  The potential
speedup is directly proportional to the skew.  The more skew the more
benefit.

An analogy is with a page buffering system where one goal is to keep
frequently used pages in the buffer.  Essentially the goal of this patch
is to pin in memory the tuples that the join believes will match with
the most tuples on the probe side.  This reduces I/Os by making more
probe relation tuples match during the first read of the probe relation.
Regular hash join has no way to guarantee frequently matched build
tuples remain memory-resident.
 
The particular join with Customer, Orders, LineItem, and Part is a
reasonable test case.  There may be two explanations for the results.
(I am running tests for this query currently.) First, the time to
generate the tuples (select *) may be dominating the query time.
Second, as mentioned by Bryce, I expect the issue is that only the join
with Customer and Orders exploited the patch.  Customer has some skew
(but not dramatic) so there would be some speedup.  

However, the join with Part and LineItem *should* show a benefit but may
not because of a limitation of the patch implementation (not the idea).
The MCV optimization is only enabled currently when the probe side is a
sequential scan.  This limitation is due to our current inability to
determine a stats tuple of the join attribute on the probe side for
other operators.  (This should be possible - help please?).  

Even if this stats tuple is on the base relation and may not exactly
reflect the distribution of the intermediate relation on the probe side,
it still could be very good.  Even if it is not, once again the cost is
negligible.

In summary, the patch will improve performance of any multi-batch hash
join with skew.  It is useful right now when the probe relation has skew
and is accessed using a sequential scan.  It would be useful in even
more situations if the code was modified to determine the stats for the
join attribute of the probe relation in all cases (even when the probe
relation is produced by another operator).

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon.lawre...@ubc.ca


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers