Bryce Cutt pandas...@gmail.com writes:
Here is the new patch.
Applied with revisions. I undid some of the optimizations that
cluttered the code in order to save a cycle or two per tuple --- as per
previous discussion, that's not what the performance questions were
about. Also, I did not like
On Fri, Mar 20, 2009 at 8:45 PM, Bryce Cutt pandas...@gmail.com wrote:
On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas robertmh...@gmail.com wrote:
If the inner relation isn't fairly close to unique you shouldn't be
using this optimization in the first place.
Not necessarily true. Seeing as (when the statistics are correct) we
know each of these inner tuples will match the largest number of
outer tuples, it is just as much of a win per inner tuple as when they
are unique. There is just a chance you will have to give up on the
optimization part
The estimation functions assume the inner relation join column is
unique. But it freezes (flushes back to the main hash table) one skew
bucket at a time in order of least importance so if 100 inner tuples
can fit in the skew buckets then the skew buckets are only fully blown
out if the best tuple
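The freeze order Bryce describes (flush the least-important skew bucket back to the main hash table first) can be modeled in a few lines. This is a toy Python sketch with hypothetical names; the patch's actual C structures in nodeHash.c differ:

```python
# Toy model of skew buckets keyed by MCV value, flushed back to the main
# hash table least-important-first once the buffer is over capacity.
# All names here are illustrative, not taken from the patch.

class SkewBuffer:
    def __init__(self, capacity):
        self.capacity = capacity   # max tuples held across all skew buckets
        self.buckets = {}          # mcv value -> list of buffered inner tuples
        self.rank = {}             # mcv value -> importance (lower = more frequent)
        self.count = 0

    def add(self, mcv, rank, tup, main_table):
        self.buckets.setdefault(mcv, [])
        self.rank[mcv] = rank
        self.buckets[mcv].append(tup)
        self.count += 1
        while self.count > self.capacity and self.buckets:
            self.freeze_least_important(main_table)

    def freeze_least_important(self, main_table):
        # pick the bucket for the *least* frequent buffered MCV and flush it
        victim = max(self.buckets, key=lambda v: self.rank[v])
        tuples = self.buckets.pop(victim)
        self.count -= len(tuples)
        main_table.setdefault(victim, []).extend(tuples)
```

With capacity for 2 tuples, adding a third tuple for the most-common value evicts the less important bucket, so the best MCV's tuples stay buffered in memory.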
Bryce Cutt pandas...@gmail.com writes:
Here is the new patch.
Our experiments show no noticeable performance issue when using the
patch for cases where the optimization is not used, because the number
of extra statements executed when the optimization is disabled is
insignificant.
We have updated the patch to remove a couple of if
On Fri, Mar 6, 2009 at 1:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:
I think you missed the point of the performance questions. It wasn't
about avoiding extra simple if-tests in the per-tuple loops; a few of
those are certainly not going to add measurable cost given how complex
the code is already. (I really don't think you should be duplicating
hunks
I haven't been following this thread closely, so pardon if this has been
discussed already.
The patch doesn't seem to change the cost estimates in the planner at
all. Without that, I'd imagine that the planner rarely chooses a
multi-batch hash join to begin with.
Joshua, in the tests that
On Wed, Feb 25, 2009 at 10:24:21PM -0500, Robert Haas wrote:
I don't think we're really doing this the right way. EXPLAIN ANALYZE
has a measurable effect on the results, and we probably ought to stop
the database and drop the VM caches after each query. Are the Z1-Z7
datasets on line
On Thu, Feb 26, 2009 at 08:22:52AM -0500, Robert Haas wrote:
On Thu, Feb 26, 2009 at 4:22 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
Joshua, in the tests that you've been running, did you have to rig the
planner with enable_mergejoin=off or similar, to get the queries
Heikki's got a point here: the planner is aware that hashjoin doesn't
like skewed distributions, and it assigns extra cost accordingly if it
can determine that the join key is skewed. (See the bucketsize stuff
in cost_hashjoin.) If this patch is accepted we'll want to tweak that
code.
Still,
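The skew sensitivity Tom alludes to can be illustrated with a toy sum-of-squared-frequencies model of expected bucket cost. This is an illustration of the idea only, not PostgreSQL's actual bucketsize formula in cost_hashjoin:

```python
# Illustrative only: with a skewed join-key distribution, the expected
# fraction of the inner relation examined per probe grows with the sum
# of squared value frequencies, which is why the planner penalizes
# hash joins on skewed keys.

def expected_bucket_fraction(freqs):
    """freqs: relative frequencies of join-key values, summing to ~1."""
    return sum(f * f for f in freqs)

uniform = [0.01] * 100             # 100 equally common values
skewed = [0.5] + [0.005] * 100     # one value holds half the rows

print(expected_bucket_fraction(uniform))   # small: probes touch few tuples
print(expected_bucket_fraction(skewed))    # ~25x larger for the skewed case
```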
From: Tom Lane
The patch originally modified the cost function, but I removed that
part before we submitted it, to be a bit conservative about our
proposed changes. I didn't like that, for large plans, the statistics
were retrieved and recalculated many times while searching for the
optimal query plan.
The overhead of the
On Wed, Feb 25, 2009 at 12:38 AM, Lawrence, Ramon ramon.lawre...@ubc.ca wrote:
-Original Message-
From: Robert Haas
Sadly, there seem to be a number of cases in the Z7 database where the
optimization makes things significantly worse (specifically, queries
2, 3, and 7, but especially query 3). Have you investigated what is
going on there? I had thought that we
Joshua sent us some preliminary data with this query and others and indicated
that we could post it. He wanted time to clean it up
and re-run some experiments, but the data is generally good and the algorithm
performs as expected. I have attached this data to the
post. Note that the last
From: pgsql-hackers-ow...@postgresql.org on behalf of Robert Haas
I think what we need here is some very simple testing to demonstrate
that this patch produces a speed-up even when the inner side of
the join is a joinrel rather than a baserel. Can you suggest
On Wed, Feb 18, 2009 at 11:20:03PM -0500, Robert Haas wrote:
On Wed, Jan 7, 2009 at 9:14 AM, Joshua Tolley eggyk...@gmail.com wrote:
On Tue, Jan 06, 2009 at 11:49:57PM -0500, Robert Haas wrote:
Josh / eggyknap -
Can you rerun your performance tests with this version of the patch?
On Thu, Feb 19, 2009 at 01:50:55PM -0700, Josh Tolley wrote:
(my new daughter will be 24 hours old in a little bit, though, so it
might be a while!)
Pics!
Cheers,
David.
Will do, as soon as I can.
Josh,
Have you been able to do
At this point, we await further feedback on what is necessary to get
this patch accepted. We would also like to thank Josh and Robert again
for their review time.
Here is a cleaned-up version. I fixed a number of whitespace issues,
improved a few comments, and rearranged one set of nested if-else
statements (hopefully without breaking anything in the process).
Josh / eggyknap -
Can you rerun your performance tests with this version of the patch?
The latest version of the patch is attached. The revision considerably
cleans up the code, especially variable naming consistency. We have
adopted the use of IM (in-memory) in variable names for the hash table
structures as suggested.
Two other implementation changes:
1) The overhead of the
We would really appreciate help on finalizing this patch, especially in
regard to style issues. Thank you for all the help.
On Tue, Dec 30, 2008 at 12:29 AM, Bryce Cutt pandas...@gmail.com wrote:
Here is the next patch version.
Thanks for posting this update. This is definitely getting better,
but I still see some style issues. We can work on fixing those once
the rest of the details have been finalized.
However,
Here is the next patch version.
The naming and style concerns have been addressed. The patch now only
touches 5 files. 4 of those files are hashjoin-specific and 1 adds a
couple of lines to a hashjoin-specific struct in another file.
The code can now find the MCVs in more cases. Even if
I think that setting aside a minimum percentage of work_mem may be a
reasonable approach. For instance, setting aside 1%, even at 1 MB of
work_mem, would give 10 KB, which is enough to store about 40 MCV tuples
of the TPC-H database. Such a small percentage would be very unlikely (but
still
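That arithmetic checks out directly. A quick sketch (the ~256 bytes/tuple figure is implied by the 10 KB / ~40 tuples numbers above, not measured here):

```python
# Back-of-the-envelope check of the 1%-of-work_mem figure: how many MCV
# tuples fit in the reserved skew-buffer slice. The 256 bytes/tuple is
# an assumption derived from the thread's 10 KB -> ~40 tuples example.

def skew_buffer_tuples(work_mem_kb, fraction=0.01, tuple_bytes=256):
    """Tuples that fit in the reserved fraction of work_mem."""
    reserved_bytes = work_mem_kb * 1024 * fraction
    return int(reserved_bytes // tuple_bytes)

print(skew_buffer_tuples(1024))   # 1 MB work_mem -> 40 tuples
print(skew_buffer_tuples(4096))   # 4 MB work_mem -> 163 tuples
```

At more typical data-warehouse settings of work_mem, the reserved 1% grows proportionally, which is the point made later in the thread about raising work_mem being a common tuning step.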
I totally agree that 10,000 MCVs changes things. Ideally, these 10,000
MCVs should be kept in memory because they will join with the most
tuples. However, the size of the MCV hash table (as you point out) can
be bigger than work_mem *by itself* not even considering the tuples in
the table
I thought about this, but upon due reflection I think it's the wrong
approach. Raising work_mem is a pretty common tuning step - it's 4MB
even on my small OLTP systems, and in a data-warehousing environment
where this optimization will bring the most benefit, it could easily
be higher.
-Original Message-
From: Robert Haas [mailto:robertmh...@gmail.com]
I looked at this some more. I'm a little concerned about the way
we're maintaining the in-memory hash table. Since the highest legal
statistics target is now 10,000, it's possible that we could have two
orders of
There is almost zero penalty for selecting incorrect MCV tuples to
buffer in memory. Since the number of MCVs is approximately 100, the
overhead is keeping these 100 tuples in memory where they *might* not
be MCVs. The cost is the little extra memory and the checking of the
MCVs which is
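That per-tuple check amounts to one extra in-memory hash probe on the outer side before falling back to the regular table. A minimal sketch with illustrative names (not the patch's C code):

```python
# Sketch of the probe-side logic described above: each outer tuple first
# checks the small in-memory MCV (skew) table, then falls back to the
# regular, possibly batched, hash table. Names are illustrative.

def probe(outer_tuples, skew_table, main_table):
    results = []
    for key, row in outer_tuples:
        matches = skew_table.get(key)      # the cheap extra check per tuple
        if matches is None:
            matches = main_table.get(key, [])
        results.extend((row, m) for m in matches)
    return results

skew = {"common": ["inner1", "inner2"]}    # buffered MCV inner tuples
main = {"rare": ["inner3"]}
print(probe([("common", "o1"), ("rare", "o2"), ("miss", "o3")], skew, main))
```

Even if a buffered tuple turns out not to be a true MCV, the only cost is this lookup plus the memory it occupies, which is the "almost zero penalty" point above.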
Because there is no nice way in PostgreSQL (that I know of) to derive
a histogram after a join (on an intermediate result), currently
usingMostCommonValues is only enabled on a join when the outer (probe)
side is a table scan (seq scan only, actually). See getMostCommonValues (soon