Joshua,

Thank you for offering to review the patch.

The easiest way to test is to generate your own TPC-H data and load it
into a database.  I have posted the TPC-H generator at:

http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip

The generator can produce skewed data sets.  It was produced by
Microsoft Research.

After unzipping, on a Windows machine, you can just run the command:

dbgen -s 1 -z 1

This will produce a TPC-H database at scale factor 1 (about 1 GB) with a
Zipfian skew of z=1.  More information on the generator is in the
document README-S.DOC.
Source is provided for the generator, so you should be able to run it on
other operating systems as well.
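
For intuition about the -z parameter, here is a minimal Python sketch of
a Zipfian distribution.  It assumes the common form in which the value
of rank k is chosen with probability proportional to 1/k^z; the function
name zipf_weights is only illustrative, and README-S.DOC remains the
authoritative description of how the generator actually applies the
skew.

```python
def zipf_weights(n, z):
    """Normalized Zipfian probabilities for ranks 1..n with exponent z.

    Assumption: P(rank k) is proportional to 1 / k**z.  This illustrates
    the general distribution, not the generator's actual code.
    """
    raw = [1.0 / (rank ** z) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


# z = 0 gives uniform data; z = 1 makes rank 1 twice as likely as rank 2.
uniform = zipf_weights(5, 0)
skewed = zipf_weights(5, 1)
print(uniform)
print(skewed)
```

With z=1, a handful of values account for a large share of the rows,
which is the kind of skew the patch is meant to exploit.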

The schema DDL is at:

http://people.ok.ubc.ca/rlawrenc/tpch_pg_ddl.txt

Note that the load time is 1-2 hours for 1 GB of data and about 24 hours
for 10 GB.  I recommend that you do not add the foreign keys until after
the data is loaded.

The other alternative is to run pg_dump on our data sets.  However, the
download would be quite large, and it would take a couple of days for us
to get you the data in that form.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: [EMAIL PROTECTED]


> -----Original Message-----
> From: Joshua Tolley [mailto:[EMAIL PROTECTED]
> Sent: November 1, 2008 3:42 PM
> To: Lawrence, Ramon
> Cc: pgsql-hackers@postgresql.org; Bryce Cutt
> Subject: Re: [HACKERS] Proposed Patch to Improve Performance of
> Multi-Batch Hash Join for Skewed Data Sets
> 
> On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon <[EMAIL PROTECTED]>
> wrote:
> > We propose a patch that improves hybrid hash join's performance for
> > large multi-batch joins where the probe relation has skew.
> >
> > Project name: Histojoin
> > Patch file: histojoin_v1.patch
> >
> > This patch implements the Histojoin join algorithm as an optional
> > feature added to the standard Hybrid Hash Join (HHJ).  A flag is used
> > to enable or disable the Histojoin features.  When Histojoin is
> > disabled, HHJ acts as normal.  The Histojoin features allow HHJ to use
> > PostgreSQL's statistics to do skew-aware partitioning.  The basic idea
> > is to keep build relation tuples whose join values occur frequently in
> > the probe relation in a small in-memory hash table.  This improves the
> > performance of HHJ by 10% to 50% for skewed data sets when multiple
> > batches are used.  The performance improvements of this patch can be
> > seen in the paper (pages 25-30) at:
> >
> > http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf
> >
> > All generators and materials needed to verify these results can be
> > provided.
> >
> > This is a patch against the HEAD of the repository.
> >
> > This patch does not contain platform-specific code.  It compiles and
> > has been tested on our machines in both Windows (MSVC++) and Linux
> > (GCC).
> >
> > Currently the Histojoin feature is enabled by default and is used
> > whenever HHJ is used and there are Most Common Value (MCV) statistics
> > available on the probe side base relation of the join.  To disable
> > this feature simply set the enable_hashjoin_usestatmcvs flag to off in
> > the database configuration file or at run time with the 'set' command.
> >
> > One potential improvement is not included in the patch: Most Common
> > Value (MCV) statistics are currently only determined when the probe
> > relation is produced by a scan operator.  There is a benefit to using
> > MCVs even when the probe relation is not a base scan, but we were
> > unable to determine how to find statistics from a base relation after
> > other operators are performed.
> >
> > This patch was created by Bryce Cutt as part of his work on his M.Sc.
> > thesis.
> >
> > --
> > Dr. Ramon Lawrence
> > Assistant Professor, Department of Computer Science, University of
> > British Columbia Okanagan
> > E-mail: [EMAIL PROTECTED]
> 
> I'm interested in trying to review this patch. Having not done patch
> review before, I can't exactly promise grand results, but could you
> provide me with the data to check your results? In the meantime I'll
> go read the paper.
> 
> - Josh / eggyknap

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
