Joshua,

Thank you for offering to review the patch.
The easiest way to test would be to generate your own TPC-H data and load it into a database for testing. I have posted the TPC-H generator at:

http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip

The generator can produce skewed data sets. It was produced by Microsoft Research.

After unzipping, on a Windows machine, you can just run the command:

dbgen -s 1 -z 1

This will produce a TPC-H database of scale 1 GB with a Zipfian skew of z=1. More information on the generator is in the document README-S.DOC. Source is provided for the generator, so you should be able to run it on other operating systems as well.

The schema DDL is at:

http://people.ok.ubc.ca/rlawrenc/tpch_pg_ddl.txt

Note that the load time for 1 GB of data is 1-2 hours and for 10 GB of data is about 24 hours. I recommend you do not add the foreign keys until after the data is loaded.

The other alternative is to run pg_dump on our data sets. However, the download size would be quite large, and it will take a couple of days for us to get you the data in that form.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of British Columbia Okanagan
E-mail: [EMAIL PROTECTED]

> -----Original Message-----
> From: Joshua Tolley [mailto:[EMAIL PROTECTED]
> Sent: November 1, 2008 3:42 PM
> To: Lawrence, Ramon
> Cc: pgsql-hackers@postgresql.org; Bryce Cutt
> Subject: Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
>
> On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon <[EMAIL PROTECTED]> wrote:
> > We propose a patch that improves hybrid hash join's performance for large
> > multi-batch joins where the probe relation has skew.
> >
> > Project name: Histojoin
> > Patch file: histojoin_v1.patch
> >
> > This patch implements the Histojoin join algorithm as an optional feature
> > added to the standard Hybrid Hash Join (HHJ). A flag is used to enable or
> > disable the Histojoin features. When Histojoin is disabled, HHJ acts as
> > normal.
> > The Histojoin features allow HHJ to use PostgreSQL's statistics to
> > do skew aware partitioning. The basic idea is to keep build relation tuples
> > in a small in-memory hash table that have join values that are frequently
> > occurring in the probe relation. This improves performance of HHJ when
> > multiple batches are used by 10% to 50% for skewed data sets. The
> > performance improvements of this patch can be seen in the paper (pages
> > 25-30) at:
> >
> > http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf
> >
> > All generators and materials needed to verify these results can be provided.
> >
> > This is a patch against the HEAD of the repository.
> >
> > This patch does not contain platform specific code. It compiles and has
> > been tested on our machines in both Windows (MSVC++) and Linux (GCC).
> >
> > Currently the Histojoin feature is enabled by default and is used whenever
> > HHJ is used and there are Most Common Value (MCV) statistics available on
> > the probe side base relation of the join. To disable this feature simply
> > set the enable_hashjoin_usestatmcvs flag to off in the database
> > configuration file or at run time with the 'set' command.
> >
> > One potential improvement not included in the patch is that Most Common
> > Value (MCV) statistics are only determined when the probe relation is
> > produced by a scan operator. There is a benefit to using MCVs even when the
> > probe relation is not a base scan, but we were unable to determine how to
> > find statistics from a base relation after other operators are performed.
> >
> > This patch was created by Bryce Cutt as part of his work on his M.Sc.
> > thesis.
> >
> > --
> > Dr. Ramon Lawrence
> > Assistant Professor, Department of Computer Science, University of British
> > Columbia Okanagan
> > E-mail: [EMAIL PROTECTED]
>
> I'm interested in trying to review this patch.
> Having not done patch review before, I can't exactly promise grand results,
> but if you could provide me with the data to check your results? In the
> meantime I'll go read the paper.
>
> - Josh / eggyknap
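
For anyone following the thread, the basic idea in the quoted proposal (keep build tuples whose join values are frequent in the probe relation in a small in-memory hash table) can be illustrated with a rough sketch. This is not the patch's code. The function name, the (key, payload) row layout, and the use of Python lists to stand in for on-disk batch files are all assumptions made for illustration only.

```python
from collections import defaultdict


def skew_aware_hash_join(build_rows, probe_rows, mcv_keys, num_batches=4):
    """Illustrative only: join (key, payload) rows, keeping build rows whose
    key is a most-common probe value (MCV) in a separate in-memory table."""
    mcvs = set(mcv_keys)
    hot_table = defaultdict(list)  # small in-memory table for frequent keys
    # Lists/dicts below stand in for the batch files a real join would spill.
    build_batches = [defaultdict(list) for _ in range(num_batches)]
    probe_batches = [[] for _ in range(num_batches)]

    # Build phase: MCV keys go to the hot table, everything else to a batch.
    for key, payload in build_rows:
        if key in mcvs:
            hot_table[key].append(payload)
        else:
            build_batches[hash(key) % num_batches][key].append(payload)

    results = []

    # First probe pass: tuples with MCV keys are joined immediately and are
    # never written to a batch; the rest are partitioned for later passes.
    for key, payload in probe_rows:
        if key in mcvs:
            results.extend((key, b, payload) for b in hot_table.get(key, ()))
        else:
            probe_batches[hash(key) % num_batches].append((key, payload))

    # Remaining passes: join each probe batch against its build batch.
    for build_batch, probe_batch in zip(build_batches, probe_batches):
        for key, payload in probe_batch:
            results.extend((key, b, payload) for b in build_batch.get(key, ()))

    return results
```

With skewed probe data, most probe tuples carry one of the few MCV keys, so in this sketch they are matched in the first pass and skip the partitioning step entirely. That is the effect the proposal attributes its multi-batch improvement to.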