Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-23 Thread Joshua Tolley
On Tue, Dec 23, 2008 at 10:14:29AM -0500, Robert Haas wrote: > > It's equivalent to our assumption that distributions of values in > > columns in the same table are independent. Making that assumption in > > this case would probably result in occasional dramatic speed > > improvements similar to th

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-23 Thread Robert Haas
> It's equivalent to our assumption that distributions of values in > columns in the same table are independent. Making that assumption in > this case would probably result in occasional dramatic speed > improvements similar to the ones we've seen in less complex joins, > offset by just-as-occasion

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-23 Thread Joshua Tolley
On Tue, Dec 23, 2008 at 09:22:27AM -0500, Robert Haas wrote: > On Tue, Dec 23, 2008 at 2:21 AM, Bryce Cutt wrote: > > Because there is no nice way in PostgreSQL (that I know of) to derive > > a histogram after a join (on an intermediate result) currently > > usingMostCommonValues is only enabled o

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-23 Thread Robert Haas
On Tue, Dec 23, 2008 at 2:21 AM, Bryce Cutt wrote: > Because there is no nice way in PostgreSQL (that I know of) to derive > a histogram after a join (on an intermediate result) currently > usingMostCommonValues is only enabled on a join when the outer (probe) > side is a table scan (seq scan only

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-23 Thread Bryce Cutt
Because there is no nice way in PostgreSQL (that I know of) to derive a histogram after a join (on an intermediate result) currently usingMostCommonValues is only enabled on a join when the outer (probe) side is a table scan (seq scan only actually). See getMostCommonValues (soon to be called Exec

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-22 Thread Joshua Tolley
On Sun, Dec 21, 2008 at 10:25:59PM -0500, Robert Haas wrote: > [Some performance testing.] I (finally!) have a chance to post my performance testing results... my apologies for the really long delay. Unfortunately I'm not seeing wonderful speedups with the particular queries I did in this case.

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-21 Thread Robert Haas
[Some performance testing.] I ran this query 10x with this patch applied, and then 10x again with enable_hashjoin_usestatmcvs set to false to disable the optimization: select sum(1) from (select * from part, lineitem where p_partkey = l_partkey) x; With the optimization enabled, the query took b

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-20 Thread Bryce Cutt
Robert, I thoroughly appreciate the constructive criticism. The compile errors are due to my development process being convoluted. I will endeavor to not waste your time in the future with errors caused by my development process. I have updated the code to follow the conventions and suggestions

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-17 Thread Lawrence, Ramon
ql.org [mailto:pgsql-hackers- > ow...@postgresql.org] On Behalf Of Robert Haas > Sent: December 17, 2008 7:54 PM > To: Lawrence, Ramon > Cc: Tom Lane; pgsql-hackers@postgresql.org; Bryce Cutt > Subject: Re: [HACKERS] Proposed Patch to Improve Performance of Multi- > Batch Hash Join for

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-17 Thread Robert Haas
Dr. Lawrence: I'm still working on reviewing this patch. I've managed to load the sample TPCH data from tpch1g1z.zip after changing the line endings to UNIX-style and chopping off the trailing vertical bars. (If anyone is interested, I have the results of pg_dump | bzip2 -9 on the resulting data

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-12-15 Thread Robert Haas
I have to admit that I haven't fully grokked what this patch is about just yet, so what follows is mostly a coding style review at this point. It would help a lot if you could add some comments to the new functions that are being added to explain the purpose of each at a very high level. There's

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-24 Thread Lawrence, Ramon
> -Original Message- > From: Tom Lane [mailto:[EMAIL PROTECTED] > I'm a tad worried about what happens when the values that are frequently > occurring in the outer relation are also frequently occurring in the > inner (which hardly seems an improbable case). Don't you stand a severe > risk

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-20 Thread Tom Lane
"Lawrence, Ramon" <[EMAIL PROTECTED]> writes: > We propose a patch that improves hybrid hash join's performance for > large multi-batch joins where the probe relation has skew. > ... > The basic idea > is to keep build relation tuples in a small in-memory hash table that > have join values that are

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-10 Thread Joshua Tolley
On Wed, Nov 05, 2008 at 04:06:11PM -0800, Bryce Cutt wrote: > The error is caused by me Asserting against the wrong variable. I > never noticed this as I apparently did not have assertions turned on > on my development machine. That is fixed now and with the new patch > version I have attached al

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-06 Thread Joshua Tolley
On Thu, Nov 6, 2008 at 5:31 PM, Lawrence, Ramon <[EMAIL PROTECTED]> wrote: >> -Original Message- >> > Minor question on this patch. AFAICS there is another patch that > seems >> > to be aiming at exactly the same use case. Jonah's Bloom filter > patch. >> > >> > Shouldn't we have a dust off

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-06 Thread Lawrence, Ramon
> -Original Message- > > Minor question on this patch. AFAICS there is another patch that seems > > to be aiming at exactly the same use case. Jonah's Bloom filter patch. > > > > Shouldn't we have a dust off to see which one is best? Or at least a > > discussion to test whether they overlap

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-06 Thread Joshua Tolley
On Thu, Nov 6, 2008 at 3:52 PM, Simon Riggs <[EMAIL PROTECTED]> wrote: > > On Thu, 2008-11-06 at 15:33 -0700, Joshua Tolley wrote: > >> Stay tuned. > > Minor question on this patch. AFAICS there is another patch that seems > to be aiming at exactly the same use case. Jonah's Bloom filter patch. > >

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-06 Thread Simon Riggs
On Thu, 2008-11-06 at 15:33 -0700, Joshua Tolley wrote: > Stay tuned. Minor question on this patch. AFAICS there is another patch that seems to be aiming at exactly the same use case. Jonah's Bloom filter patch. Shouldn't we have a dust off to see which one is best? Or at least a discussion to

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-06 Thread Joshua Tolley
On Wed, Nov 5, 2008 at 5:06 PM, Bryce Cutt <[EMAIL PROTECTED]> wrote: > The error is caused by me Asserting against the wrong variable. I > never noticed this as I apparently did not have assertions turned on > on my development machine. That is fixed now and with the new patch > version I have a

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-05 Thread Joshua Tolley
On Wed, Nov 05, 2008 at 04:06:11PM -0800, Bryce Cutt wrote: > The error is caused by me Asserting against the wrong variable. I > never noticed this as I apparently did not have assertions turned on > on my development machine. That is fixed now and with the new patch > version I have attached al

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-05 Thread Bryce Cutt
The error is caused by me Asserting against the wrong variable. I never noticed this as I apparently did not have assertions turned on on my development machine. That is fixed now and with the new patch version I have attached all assertions are passing with your query and my test queries. I add

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-05 Thread Joshua Tolley
On Wed, Nov 5, 2008 at 8:20 AM, Tom Lane wrote: > Joshua Tolley writes: >> On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote: >>> We propose a patch that improves hybrid hash join's performance for large >>> multi-batch joins where the

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-05 Thread Tom Lane
Joshua Tolley <[EMAIL PROTECTED]> writes: > On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote: >> We propose a patch that improves hybrid hash join's performance for large >> multi-batch joins where the probe relation has skew. > I also recommend modifying docs/src/sgml/config.sgml t

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-05 Thread Joshua Tolley
On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote: >We propose a patch that improves hybrid hash join's performance for large >multi-batch joins where the probe relation has skew. I also recommend modifying docs/src/sgml/config.sgml to include the enable_hashjoin_usestatmcvs

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-05 Thread Joshua Tolley
On Mon, Oct 20, 2008 at 03:42:49PM -0700, Lawrence, Ramon wrote: >We propose a patch that improves hybrid hash join's performance for large >multi-batch joins where the probe relation has skew. I'm running into problems with this patch. It applies cleanly, and the technique you provided fo

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-02 Thread Lawrence, Ramon
> From: Tom Lane [mailto:[EMAIL PROTECTED] > What alternatives are there for people who do not run Windows? > > regards, tom lane The TPC-H generator is a standard code base provided at http://www.tpc.org/tpch/. We have been able to compile this code on Linux. However, we

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-02 Thread Tom Lane
"Lawrence, Ramon" <[EMAIL PROTECTED]> writes: > The easiest way to test would be to generate your own TPC-H data and > load it into a database for testing. I have posted the TPC-H generator > at: > http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip > The generator can produce skewed data sets. It was

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-02 Thread Joshua Tolley
On Sun, Nov 2, 2008 at 4:48 PM, Lawrence, Ramon <[EMAIL PROTECTED]> wrote: > Joshua, > > Thank you for offering to review the patch. > > The easiest way to test would be to generate your own TPC-H data and > load it into a database for testing. I have posted the TPC-H generator > at: > > http://pe

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-02 Thread Lawrence, Ramon
Okanagan E-mail: [EMAIL PROTECTED] > -Original Message- > From: Joshua Tolley [mailto:[EMAIL PROTECTED] > Sent: November 1, 2008 3:42 PM > To: Lawrence, Ramon > Cc: pgsql-hackers@postgresql.org; Bryce Cutt > Subject: Re: [HACKERS] Proposed Patch to Improve Performance of M

Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-11-01 Thread Joshua Tolley
On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon <[EMAIL PROTECTED]> wrote: > We propose a patch that improves hybrid hash join's performance for large > multi-batch joins where the probe relation has skew. > > Project name: Histojoin > Patch file: histojoin_v1.patch > > This patch implements the H

[HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

2008-10-20 Thread Lawrence, Ramon
We propose a patch that improves hybrid hash join's performance for large multi-batch joins where the probe relation has skew. Project name: Histojoin Patch file: histojoin_v1.patch This patch implements the Histojoin join algorithm as an optional feature added to the standard Hybrid Hash