> I took a first crack at coding up an implementation of
> relation_is_distinct_for() tonight.
I am not sure if this will help or not, but on the 8.4 code base we
implemented two functions:
- getCandidateKeys() - would recursively traverse a tree from a given
node to the leaf nodes and determi
> Upon further review, it appears that a big part of this problem is
> that cost_hashjoin() doesn't understand that it needs to cost semi-joins
> differently from inner or left joins. The bogus logic looks to be
> right here:
> startup_cost += hash_qual_cost.startup;
> run_cost += hash_qual_c
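The distinction being raised above can be sketched outside of PostgreSQL. The following is an illustrative model only, not planner code (the function and parameter names are invented): an inner join must test every tuple in the matched hash bucket, while a semi-join can stop scanning a bucket at the first match, so matched outer tuples pay, on average, only part of the bucket-scan cost.

```python
# Illustrative sketch of why semi-join probe cost differs from an
# inner-join probe. Names and the simple cost model are assumptions,
# not PostgreSQL's actual cost_hashjoin() formulas.

def inner_join_qual_cost(outer_tuples, avg_bucket_size, per_tuple_cost):
    # An inner join must evaluate the hash quals against every tuple
    # in the bucket each outer tuple lands in.
    return outer_tuples * avg_bucket_size * per_tuple_cost

def semi_join_qual_cost(outer_tuples, avg_bucket_size, per_tuple_cost,
                        match_fraction):
    # A semi-join can stop at the first match, so a matched outer tuple
    # scans (on average) half its bucket; an unmatched one still scans
    # the whole bucket before giving up.
    matched = outer_tuples * match_fraction
    unmatched = outer_tuples - matched
    return (matched * (avg_bucket_size / 2.0)
            + unmatched * avg_bucket_size) * per_tuple_cost
```

With everything matching, the semi-join model charges half the inner-join cost; with nothing matching, the two coincide, which is why a single inner-join formula overcharges semi-joins.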
> > I would be especially interested in using a shared memory hash table
> > that *all* backends can use - if the table is mostly read-only, as
> > dimension tables often are in data warehouse applications. That would
> > give zero startup cost and significantly reduced memory.
>
> I think that's
> While investigating some performance problems recently I've had cause
> to think about the way PostgreSQL uses hash joins. So here are a few
> thoughts. Some of these have been brought up before.
>
> 1. When the hash is not expected to spill to disk, it preserves the
> pathkeys of the outer si
> > I think you missed the point of the performance questions. It wasn't
> > about avoiding extra simple if-tests in the per-tuple loops; a few of
> > those are certainly not going to add measurable cost given how complex
> > the code is already. (I really don't think you should be duplicating
>
> That seems VERY useful - can you post the other ones (Z1, etc.) so I
> can download them all?
The Z1 data set is posted at:
http://people.ok.ubc.ca/rlawrenc/tpch1g1z.zip
I have not generated Z2, Z3, Z4 for 1G, but I can generate the Z2 and Z3
data sets, and in an hour or two they will be at:
h
> They're automatically generated by the dbgen utility, a link to which
> was originally published somewhere in this thread. That tool creates a
> few text files suitable (with some tweaking) for a COPY command. I've
> got the original files... the .tbz I just made is 1.8 GB :) Anyone have
> somepl
> From: Tom Lane
> Heikki's got a point here: the planner is aware that hashjoin doesn't
> like skewed distributions, and it assigns extra cost accordingly if it
> can determine that the join key is skewed. (See the "bucketsize" stuff
> in cost_hashjoin.) If this patch is accepted we'll want to t
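As a rough illustration of the "bucketsize" idea mentioned above (a simplified sketch, not the actual cost_hashjoin() logic; the function name and inputs are invented): with a uniformly distributed key each probe examines about 1/num_buckets of the inner tuples, while skew drives the expected fraction up toward the frequency of the most common value.

```python
def expected_bucket_fraction(num_buckets, mcv_freqs):
    # With a perfectly uniform join key, each probe examines roughly
    # 1/num_buckets of the inner relation. Skew concentrates inner
    # tuples into a few buckets; a crude pessimistic estimate is the
    # largest single-value frequency among the most common values.
    uniform = 1.0 / num_buckets
    return max([uniform] + list(mcv_freqs))
```

A planner using such an estimate would charge more for hashing a relation where one value accounts for, say, 20% of the rows, since probes hitting that bucket scan far more tuples than the uniform model predicts.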
> -Original Message-
> From: Robert Haas
> Sadly, there seem to be a number of cases in the Z7 database where the
> optimization makes things significantly worse (specifically, queries
> 2, 3, and 7, but especially query 3). Have you investigated what is
> going on there? I had thought th
From: pgsql-hackers-ow...@postgresql.org on behalf of Robert Haas
I think what we need here is some very simple testing to demonstrate
that this patch provides a speed-up even when the inner side of
the join is a joinrel rather than a baserel. Can you suggest
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-
> ow...@postgresql.org] On Behalf Of Tom Lane
> But really there are two different performance regimes here, one where
> the hash data is large enough to spill to disk and one where it isn't.
> Reducing w
> The idea I came up with for benchmarking was a little similar to what I
> remember from the original tests. I have a sales orders table and a
> products table. My version of the sales orders table contains a customer
> column. Data for 10 customers is populated into the sales orders table, cu
A hash join modification patch is under review for 8.4 that needs
performance testing. We would appreciate help with this testing.
A testing version of the patch is attached in addition to testing
instructions and where to retrieve a sample data set. The basic idea
of the patch is that it
> Here is a cleaned-up version. I fixed a number of whitespace issues,
> improved a few comments, and rearranged one set of nested if-else
> statements (hopefully without breaking anything in the process).
>
> Josh / eggyknap -
>
> Can you rerun your performance tests with this version of the p
> Has this been completed? TODO item?
> > > I'd be more inclined to deal with the issue by trying to establish a
> > > "safety margin" in the estimate of whether the hash will go multi-batch.
> > > IOW we should disuse_physical_tlist if the hash is estimated to be
> > > close to but still withi
> I thought about this, but upon due reflection I think it's the wrong
> approach. Raising work_mem is a pretty common tuning step - it's 4MB
> even on my small OLTP systems, and in a data-warehousing environment
> where this optimization will bring the most benefit, it could easily
> be higher.
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> I looked at this some more. I'm a little concerned about the way
> we're maintaining the in-memory hash table. Since the highest legal
> statistics target is now 10,000, it's possible that we could have two
> orders
> > > Because there is no nice way in PostgreSQL (that I know of) to derive
> > > a histogram after a join (on an intermediate result), currently
> > > usingMostCommonValues is only enabled on a join when the outer (probe)
> > > side is a table scan (seq scan only, actually). See
> > > getMostCommon
> From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: December 17, 2008 7:54 PM
> To: Lawrence, Ramon
> Cc: Tom Lane; pgsql-hackers@postgresql.org; Bryce Cutt
> Subject: Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for
> -Original Message-
> From: Tom Lane [mailto:[EMAIL PROTECTED]
> I'm a tad worried about what happens when the values that are frequently
> occurring in the outer relation are also frequently occurring in the
> inner (which hardly seems an improbable case). Don't you stand a severe
> risk
> -Original Message-
> From: Jonah H. Harris [mailto:[EMAIL PROTECTED]
> I have a new patch which does not create a bloom filter unless it sees
> that the hash join is going to batch. I'll send it along later
> tonight.
>
> Currently it's additional space not accounted for by work_mem.
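For readers unfamiliar with the technique, a minimal Bloom filter sketch follows. This is illustrative only, not the patch's code (the class and its sizing are assumptions): in the scheme described above, the filter would be built from the inner (build) side's join keys once the join starts batching, so outer tuples that cannot possibly match can be discarded before being written to batch files.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for illustration; real implementations use
    cheaper hash functions and tuned bit/hash counts."""

    def __init__(self, nbits, nhashes):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray((nbits + 7) // 8)

    def _positions(self, key):
        # Derive nhashes independent bit positions from the key.
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False positives are possible; false negatives are not.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

The memory question raised above is why this matters for work_mem accounting: the bit array is extra space held for the duration of the join, on top of the hash table itself.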
>
> -Original Message-
> On Sun, Nov 2, 2008 at 10:49 PM, Jonah H. Harris
> <[EMAIL PROTECTED]> wrote:
> It's effective as-is for a preliminary patch. The GUC code is the
> least of my worries.
>
> > Can you provide some figures on the performance impact of the bloom
> > filter?
I have teste
> -Original Message-
> > Minor question on this patch. AFAICS there is another patch that seems
> > to be aiming at exactly the same use case. Jonah's Bloom filter patch.
> >
> > Shouldn't we have a dust off to see which one is best? Or at least a
> > discussion to test whether they overlap
> From: Tom Lane [mailto:[EMAIL PROTECTED]
> What alternatives are there for people who do not run Windows?
>
> regards, tom lane
The TPC-H generator is a standard code base provided at
http://www.tpc.org/tpch/. We have been able to compile this code on Linux.
However, we
> -Original Message-
> From: Joshua Tolley [mailto:[EMAIL PROTECTED]
> Sent: November 1, 2008 3:42 PM
> To: Lawrence, Ramon
> Cc: pgsql-hackers@postgresql.org; Bryce Cutt
> Subject: Re: [HACKERS] Proposed Patch to Improve Performance of M
We propose a patch that improves hybrid hash join's performance for
large multi-batch joins where the probe relation has skew.
Project name: Histojoin
Patch file: histojoin_v1.patch
This patch implements the Histojoin join algorithm as an optional
feature added to the standard Hybrid Hash
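The core idea can be sketched as follows. This is an illustrative Python model, not the patch itself (partition_inner and its arguments are invented names): inner (build) tuples whose join key is among the probe side's most common values are kept in an in-memory skew table, so the frequent probe tuples join immediately instead of being written to and re-read from batch files.

```python
def partition_inner(inner_tuples, key, probe_mcvs, num_batches):
    """Split the build relation: tuples whose join key is a most
    common value (MCV) of the probe side stay in an in-memory skew
    table; everything else is assigned to a batch as usual."""
    skew_table = {}
    batches = [[] for _ in range(num_batches)]
    mcv_set = set(probe_mcvs)
    for t in inner_tuples:
        k = key(t)
        if k in mcv_set:
            skew_table.setdefault(k, []).append(t)
        else:
            batches[hash(k) % num_batches].append(t)
    return skew_table, batches
```

Because MCV frequencies come from the probe side's statistics, the small skew table absorbs a disproportionately large share of probe traffic when the probe relation is skewed, which is exactly the multi-batch case the patch targets.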
> From: Tom Lane [mailto:[EMAIL PROTECTED]
> I was intending to do it the other way, actually. An extra field in
> HashPath hardly costs anything. The other reason for it is that there
> are other possible uses for knowing whether a hash will be multi-batch.
> (For example, if we were prepared to
PostgreSQL development community:
Our research group has been using the PostgreSQL code base to test new
join algorithms. During testing, we noticed that the planner is not
pushing down projections to the outer relation in a hash join. Although
this makes sense for in-memory (1 batch) joins,
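A back-of-the-envelope model shows why the projected width of the outer relation matters once the join goes multi-batch (an illustrative sketch; the function name and the simple write-plus-read accounting are assumptions): every spilled outer tuple is written to a temp file and read back, so unprojected columns scale that I/O directly.

```python
def outer_spill_bytes(outer_tuples, tuple_width, num_batches):
    # In a multi-batch hash join, roughly all but the first batch's
    # share of outer tuples is written to (and later re-read from)
    # temp files; a single-batch join spills nothing.
    spilled = outer_tuples * (num_batches - 1) / num_batches
    return 2 * spilled * tuple_width  # one write + one read per tuple
```

Halving the projected tuple width halves the estimated spill traffic, which is why pushing the projection down to the outer side pays off only in the multi-batch case.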