Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-03-20 Thread Tom Lane
Bryce Cutt pandas...@gmail.com writes: Here is the new patch. Applied with revisions. I undid some of the optimizations that cluttered the code in order to save a cycle or two per tuple --- as per previous discussion, that's not what the performance questions were about. Also, I did not like

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-03-20 Thread Robert Haas
On Fri, Mar 20, 2009 at 8:14 PM, Tom Lane t...@sss.pgh.pa.us wrote: Bryce Cutt pandas...@gmail.com writes: Here is the new patch. Applied with revisions.  I undid some of the optimizations that cluttered the code in order to save a cycle or two per tuple --- as per previous discussion,

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-03-20 Thread Robert Haas
On Fri, Mar 20, 2009 at 8:45 PM, Bryce Cutt pandas...@gmail.com wrote: On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas robertmh...@gmail.com wrote: If the inner relation isn't fairly close to unique you shouldn't be using this optimization in the first place. Not necessarily true.  Seeing as

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-03-20 Thread Robert Haas
On Fri, Mar 20, 2009 at 8:45 PM, Bryce Cutt pandas...@gmail.com wrote: On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas robertmh...@gmail.com wrote: If the inner relation isn't fairly close to unique you shouldn't be using this optimization in the first place. Not necessarily true.  Seeing as

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-03-20 Thread Bryce Cutt
Not necessarily true. Seeing as (when the statistics are correct) we know each of these inner tuples will match with the largest amount of outer tuples it is just as much of a win per inner tuple as when they are unique. There is just a chance you will have to give up on the optimization part

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-03-20 Thread Bryce Cutt
The estimation functions assume the inner relation join column is unique. But it freezes (flushes back to the main hash table) one skew bucket at a time in order of least importance so if 100 inner tuples can fit in the skew buckets then the skew buckets are only fully blown out if the best tuple

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-03-06 Thread Tom Lane
Bryce Cutt pandas...@gmail.com writes: Here is the new patch. Our experiments show no noticeable performance issue when using the patch for cases where the optimization is not used because the number of extra statements executed when the optimization is disabled is insignificant. We have

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-03-06 Thread Robert Haas
On Fri, Mar 6, 2009 at 1:57 PM, Tom Lane t...@sss.pgh.pa.us wrote: Bryce Cutt pandas...@gmail.com writes: Here is the new patch. Our experiments show no noticeable performance issue when using the patch for cases where the optimization is not used because the number of extra statements

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-03-06 Thread Lawrence, Ramon
I think you missed the point of the performance questions.  It wasn't about avoiding extra simple if-tests in the per-tuple loops; a few of those are certainly not going to add measurable cost given how complex the code is already.  (I really don't think you should be duplicating hunks

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-03-02 Thread Bryce Cutt
Here is the new patch. Our experiments show no noticeable performance issue when using the patch for cases where the optimization is not used because the number of extra statements executed when the optimization is disabled is insignificant. We have updated the patch to remove a couple of if

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-26 Thread Heikki Linnakangas
I haven't been following this thread closely, so pardon if this has been discussed already. The patch doesn't seem to change the cost estimates in the planner at all. Without that, I'd imagine that the planner rarely chooses a multi-batch hash join to begin with. Joshua, in the tests that

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-26 Thread Robert Haas
On Thu, Feb 26, 2009 at 4:22 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I haven't been following this thread closely, so pardon if this has been discussed already. The patch doesn't seem to change the cost estimates in the planner at all. Without that, I'd imagine that

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-26 Thread Joshua Tolley
On Wed, Feb 25, 2009 at 10:24:21PM -0500, Robert Haas wrote: I don't think we're really doing this the right way. EXPLAIN ANALYZE has a measurable effect on the results, and we probably ought to stop the database and drop the VM caches after each query. Are the Z1-Z7 datasets on line

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-26 Thread Joshua Tolley
On Thu, Feb 26, 2009 at 08:22:52AM -0500, Robert Haas wrote: On Thu, Feb 26, 2009 at 4:22 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Joshua, in the tests that you've been running, did you have to rig the planner with enable_mergjoin=off or similar, to get the queries

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-26 Thread Tom Lane
Heikki's got a point here: the planner is aware that hashjoin doesn't like skewed distributions, and it assigns extra cost accordingly if it can determine that the join key is skewed. (See the bucketsize stuff in cost_hashjoin.) If this patch is accepted we'll want to tweak that code. Still,

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-26 Thread Lawrence, Ramon
From: Tom Lane Heikki's got a point here: the planner is aware that hashjoin doesn't like skewed distributions, and it assigns extra cost accordingly if it can determine that the join key is skewed. (See the bucketsize stuff in cost_hashjoin.) If this patch is accepted we'll want to tweak

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-26 Thread Bryce Cutt
The patch originally modified the cost function but I removed that part before we submitted it to be a bit conservative about our proposed changes. I didn't like that for large plans the statistics were retrieved and calculated many times when finding the optimal query plan. The overhead of the

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-25 Thread Robert Haas
On Wed, Feb 25, 2009 at 12:38 AM, Lawrence, Ramon ramon.lawre...@ubc.ca wrote: -Original Message- From: Robert Haas Sadly, there seem to be a number of cases in the Z7 database where the optimization makes things significantly worse (specifically, queries 2, 3, and 7, but especially

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-24 Thread Robert Haas
Joshua sent us some preliminary data with this query and others and indicated that we could post it.  He wanted time to clean it up and re-run some experiments, but the data is generally good and the algorithm performs as expected.  I have attached this data to the post.  Note that the last

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-24 Thread Lawrence, Ramon
-Original Message- From: Robert Haas Sadly, there seem to be a number of cases in the Z7 database where the optimization makes things significantly worse (specifically, queries 2, 3, and 7, but especially query 3). Have you investigated what is going on there? I had thought that we

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-19 Thread Lawrence, Ramon
From: pgsql-hackers-ow...@postgresql.org on behalf of Robert Haas I think what we need here is some very simple testing to demonstrate that this patch demonstrates a speed-up even when the inner side of the join is a joinrel rather than a baserel. Can you suggest

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-19 Thread Joshua Tolley
On Wed, Feb 18, 2009 at 11:20:03PM -0500, Robert Haas wrote: On Wed, Jan 7, 2009 at 9:14 AM, Joshua Tolley eggyk...@gmail.com wrote: On Tue, Jan 06, 2009 at 11:49:57PM -0500, Robert Haas wrote: Josh / eggyknap - Can you rerun your performance tests with this version of the patch?

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-19 Thread David Fetter
On Thu, Feb 19, 2009 at 01:50:55PM -0700, Josh Tolley wrote: (my new daughter will be 24 hours old in a little bit, though, so it might be a while!) Pics! Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype:

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-18 Thread Robert Haas
On Wed, Jan 7, 2009 at 9:14 AM, Joshua Tolley eggyk...@gmail.com wrote: On Tue, Jan 06, 2009 at 11:49:57PM -0500, Robert Haas wrote: Josh / eggyknap - Can you rerun your performance tests with this version of the patch? ...Robert Will do, as soon as I can. Josh, Have you been able to do

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-02-18 Thread Robert Haas
At this point, we await further feedback on what is necessary to get this patch accepted. We would also like to thank Josh and Robert again for their review time. I think what we need here is some very simple testing to demonstrate that this patch demonstrates a speed-up even when the inner

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-01-14 Thread Lawrence, Ramon
Here is a cleaned-up version. I fixed a number of whitespace issues, improved a few comments, and rearranged one set of nested if-else statements (hopefully without breaking anything in the process). Josh / eggyknap - Can you rerun your performance tests with this version of the patch?

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-01-07 Thread Joshua Tolley
On Tue, Jan 06, 2009 at 11:49:57PM -0500, Robert Haas wrote: Josh / eggyknap - Can you rerun your performance tests with this version of the patch? ...Robert Will do, as soon as I can. signature.asc Description: Digital signature

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-01-06 Thread Bryce Cutt
The latest version of the patch is attached. The revision considerably cleans up the code, especially variable naming consistency. We have adopted the use of IM (in-memory) in variable names for the hash table structures as suggested. Two other implementations changes: 1) The overhead of the

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-01-06 Thread Robert Haas
We would really appreciate help on finalizing this patch, especially in regard to style issues. Thank you for all the help. Here is a cleaned-up version. I fixed a number of whitespace issues, improved a few comments, and rearranged one set of nested if-else statements (hopefully without

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2009-01-01 Thread Robert Haas
On Tue, Dec 30, 2008 at 12:29 AM, Bryce Cutt pandas...@gmail.com wrote: Here is the next patch version. Thanks for posting this update. This is definitely getting better, but I still see some style issues. We can work on fixing those once the rest of the details have been finalized. However,

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-30 Thread Bryce Cutt
Here is the next patch version. The naming and style concerns have been addressed. The patch now only touches 5 files. 4 of those files are hashjoin specific and 1 is to add a couple lines to a hashjoin specific struct in another file. The code can now find the the MCVs in more cases. Even if

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-29 Thread Robert Haas
I think that setting aside a minimum percentage of work_mem may be a reasonable approach. For instance, setting aside 1% at even 1 MB work_mem would be 10 KB which is enough to store about 40 MCV tuples of the TPC-H database. Such a small percentage would be very unlikely (but still

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-28 Thread Robert Haas
I totally agree that 10,000 MCVs changes things. Ideally, these 10,000 MCVs should be kept in memory because they will join with the most tuples. However, the size of the MCV hash table (as you point out) can be bigger than work_mem *by itself* not even considering the tuples in the table

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-28 Thread Lawrence, Ramon
I thought about this, but upon due reflection I think it's the wrong approach. Raising work_mem is a pretty common tuning step - it's 4MB even on my small OLTP systems, and in a data-warehousing environment where this optimization will bring the most benefit, it could easily be higher.

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-27 Thread Lawrence, Ramon
-Original Message- From: Robert Haas [mailto:robertmh...@gmail.com] I looked at this some more. I'm a little concerned about the way we're maintaining the in-memory hash table. Since the highest legal statistics target is now 10,000, it's possible that we could have two orders of

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-25 Thread Robert Haas
There is almost zero penalty for selecting incorrect MCV tuples to buffer in memory. Since the number of MCVs is approximately 100, the overhead is keeping these 100 tuples in memory where they *might* not be MCVs. The cost is the little extra memory and the checking of the MCVs which is

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

2008-12-23 Thread Lawrence, Ramon
Because there is no nice way in PostgreSQL (that I know of) to derive a histogram after a join (on an intermediate result) currently usingMostCommonValues is only enabled on a join when the outer (probe) side is a table scan (seq scan only actually). See getMostCommonValues (soon