Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-09 Thread David Rowley
On 8 April 2017 at 14:23, Tom Lane  wrote:
> David Rowley  writes:
> [ unique_joins_2017-04-07b.patch ]
>
> It turned out that this patch wasn't as close to committable as I'd
> thought, but after a full day of whacking at it, I got to a place
> where I thought it was OK.  So, pushed.

Many thanks for taking the time to do this, and committing too!

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-07 Thread Tom Lane
David Rowley  writes:
[ unique_joins_2017-04-07b.patch ]

It turned out that this patch wasn't as close to committable as I'd
thought, but after a full day of whacking at it, I got to a place
where I thought it was OK.  So, pushed.

[ and that's a wrap for v10 feature freeze, I think ]

regards, tom lane




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-06 Thread David Rowley
On 7 April 2017 at 13:41, Tom Lane  wrote:
> David Rowley  writes:
>> On 7 April 2017 at 11:47, Tom Lane  wrote:
>>> What I'm on about is that you can't do the early advance to the
>>> next outer tuple unless you're sure that all the quals that were
>>> relevant to the uniqueness proof have been checked for the current
>>> inner tuple.  That affects all three join types not only merge.
>
>> Well, look at the join code and you'll see this only happens after the
>> joinqual is evaluated. I didn't make a special effort here. I just
>> borrowed the location that JOIN_SEMI was already using.
>
> Right, and that's exactly the point: some of the conditions you're
> depending on might have ended up in the otherqual not the joinqual.
>
> We'd discussed rearranging the executor logic enough to deal with
> such situations and agreed that it seemed too messy; but that means
> that the optimization needs to take care not to use otherqual
> (ie pushed-down) conditions in the uniqueness proofs.
>
>> Oh yeah. I get it, but that's why we ignore !can_join clauses
>
> can_join seems to me to be not particularly relevant ... there's
> nothing that prevents that from getting set for pushed-down clauses.
>
> It's possible that the case I'm worried about is unreachable in
> practice because all the conditions that could be of interest would
> be strict and therefore would have forced join strength reduction.
> But I'm not comfortable with assuming that.

Okay, well how about we protect against that by not using such quals
as unique proofs? We'd just need to ignore anything that's
outerjoin_delayed?

If we're struggling to think of a case that this will affect, then we
shouldn't be too worried about any missed optimisations.

A patch which does this is attached.
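
For reference, a minimal example of the sort of clause I believe ends up
marked outerjoin_delayed (re-using the t1 table from the earlier example;
the WHERE condition is non-strict and references the nullable side, so it
can neither be pushed below the join nor trigger join strength reduction):

explain select *
from t1 t_outer left join t1 on t_outer.f1 = t1.f1
where coalesce(t1.f2, 0) = 0;

Under the rule above, any clause flagged that way is simply never used as
a uniqueness proof.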

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2017-04-07b.patch
Description: Binary data



Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-06 Thread Tom Lane
David Rowley  writes:
> On 7 April 2017 at 11:47, Tom Lane  wrote:
>> What I'm on about is that you can't do the early advance to the
>> next outer tuple unless you're sure that all the quals that were
>> relevant to the uniqueness proof have been checked for the current
>> inner tuple.  That affects all three join types not only merge.

> Well, look at the join code and you'll see this only happens after the
> joinqual is evaluated. I didn't make a special effort here. I just
> borrowed the location that JOIN_SEMI was already using.

Right, and that's exactly the point: some of the conditions you're
depending on might have ended up in the otherqual not the joinqual.

We'd discussed rearranging the executor logic enough to deal with
such situations and agreed that it seemed too messy; but that means
that the optimization needs to take care not to use otherqual
(ie pushed-down) conditions in the uniqueness proofs.

> Oh yeah. I get it, but that's why we ignore !can_join clauses

can_join seems to me to be not particularly relevant ... there's
nothing that prevents that from getting set for pushed-down clauses.

It's possible that the case I'm worried about is unreachable in
practice because all the conditions that could be of interest would
be strict and therefore would have forced join strength reduction.
But I'm not comfortable with assuming that.

regards, tom lane




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-06 Thread David Rowley
On 7 April 2017 at 11:47, Tom Lane  wrote:
> David Rowley  writes:
>> On 7 April 2017 at 07:26, Tom Lane  wrote:
>>> I'm looking through this, and I'm failing to see where it deals with
>>> the problem we discussed last time, namely that you can't apply the
>>> optimization unless all clauses that were used in the uniqueness
>>> proof are included in the join's merge/hash conditions + joinquals.
>
>> The code in question is:
>> mergestate->mj_SkipMarkRestore = !mergestate->js.joinqual &&
>> mergestate->js.first_inner_tuple_only;
>
> Uh, AFAICS that only protects the skip-mark-and-restore logic.
> What I'm on about is that you can't do the early advance to the
> next outer tuple unless you're sure that all the quals that were
> relevant to the uniqueness proof have been checked for the current
> inner tuple.  That affects all three join types not only merge.

Well, look at the join code and you'll see this only happens after the
joinqual is evaluated. I didn't make a special effort here. I just
borrowed the location that JOIN_SEMI was already using.

For example, from hash join:

if (joinqual == NULL || ExecQual(joinqual, econtext))
{
    node->hj_MatchedOuter = true;
    HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));

    /* In an antijoin, we never return a matched tuple */
    if (node->js.jointype == JOIN_ANTI)
    {
        node->hj_JoinState = HJ_NEED_NEW_OUTER;
        continue;
    }

    /*
     * Skip to the next outer tuple if we only need to join to
     * the first inner matching tuple.
     */
    if (node->js.first_inner_tuple_only)
        node->hj_JoinState = HJ_NEED_NEW_OUTER;
Note the first line and the final two lines.

Here's the one from nested loop:

if (ExecQual(joinqual, econtext))
{
    node->nl_MatchedOuter = true;

    /* In an antijoin, we never return a matched tuple */
    if (node->js.jointype == JOIN_ANTI)
    {
        node->nl_NeedNewOuter = true;
        continue;       /* return to top of loop */
    }

    /*
     * Skip to the next outer tuple if we only need to join to the
     * first inner matching tuple.
     */
    if (node->js.first_inner_tuple_only)
        node->nl_NeedNewOuter = true;

Again, note the first line and final 2 lines.

> The case that would be relevant to this is, eg,
>
> create table t1 (f1 int, f2 int, primary key(f1,f2));
>
> select * from t_outer left join t1 on (t_outer.f1 = t1.f1)
> where t_outer.f2 = t2.f2;

hmm, that query is not valid, unless you have created some table named
t_outer. I don't know how you've defined that. So I guess you must
have meant:

postgres=# explain verbose select * from t1 t_outer left join t1 on
(t_outer.f1 = t1.f1) where t_outer.f2 = t1.f2;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Hash Join  (cost=66.50..133.57 rows=128 width=16)
   Output: t_outer.f1, t_outer.f2, t1.f1, t1.f2
   Inner Unique: Yes
   Hash Cond: ((t_outer.f1 = t1.f1) AND (t_outer.f2 = t1.f2))
   ->  Seq Scan on public.t1 t_outer  (cost=0.00..32.60 rows=2260 width=8)
         Output: t_outer.f1, t_outer.f2
   ->  Hash  (cost=32.60..32.60 rows=2260 width=8)
         Output: t1.f1, t1.f2
         ->  Seq Scan on public.t1  (cost=0.00..32.60 rows=2260 width=8)
               Output: t1.f1, t1.f2
(10 rows)

Which did become an INNER JOIN due to the strict WHERE qual.

If you'd done:

postgres=# explain verbose select * from t1 t_outer left join t1 on
(t_outer.f1 = t1.f1) where t_outer.f2 = t1.f2 or t1.f1 is null;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Merge Left Join  (cost=0.31..608.67 rows=255 width=16)
   Output: t_outer.f1, t_outer.f2, t1.f1, t1.f2
   Inner Unique: No
   Merge Cond: (t_outer.f1 = t1.f1)
   Filter: ((t_outer.f2 = t1.f2) OR (t1.f1 IS NULL))
   ->  Index Only Scan using t1_pkey on public.t1 t_outer  (cost=0.16..78.06 rows=2260 width=8)
         Output: t_outer.f1, t_outer.f2
   ->  Materialize  (cost=0.16..83.71 rows=2260 width=8)
         Output: t1.f1, t1.f2
         ->  Index Only Scan using t1_pkey on public.t1  (cost=0.16..78.06 rows=2260 width=8)
               Output: t1.f1, t1.f2
(11 rows)

You'll notice that it says "Inner Unique: No".

> Your existing patch would think t1 is unique-inner, but the qual pushed
> down from WHERE would not be a joinqual so the wrong thing would happen
> at runtime.
>
> (Hm ... actually, this example wouldn't fail as written because
> the WHERE qual is probably strict, so the left join would get
> reduced to an inner join and then pushed-down-ness no longer
> matters.  But hopefully you get my drift.)

Oh yeah. I get it, but that's why we ignore !can_join clauses

/* Ignore if it's not a mergejoinable clause */
if (!restrictinfo->can_join ||
    restrictinfo->mergeopfamilies == NIL)
    continue;           /* not mergejoinable */

no?

>> I don't really think the List idea would be nearly as efficient.
>
> No, what I'm saying is that each RelOptInfo would contain a single List of
> Relids of proven-unique-for outer rels (and another one for the negative
> cache).  No array, no more searching than you have now, just removal of an
> uncertainly-safe array fetch.

Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-06 Thread Tom Lane
David Rowley  writes:
> On 7 April 2017 at 07:26, Tom Lane  wrote:
>> I'm looking through this, and I'm failing to see where it deals with
>> the problem we discussed last time, namely that you can't apply the
>> optimization unless all clauses that were used in the uniqueness
>> proof are included in the join's merge/hash conditions + joinquals.

> The code in question is:
> mergestate->mj_SkipMarkRestore = !mergestate->js.joinqual &&
> mergestate->js.first_inner_tuple_only;

Uh, AFAICS that only protects the skip-mark-and-restore logic.
What I'm on about is that you can't do the early advance to the
next outer tuple unless you're sure that all the quals that were
relevant to the uniqueness proof have been checked for the current
inner tuple.  That affects all three join types not only merge.

The case that would be relevant to this is, eg,

create table t1 (f1 int, f2 int, primary key(f1,f2));

select * from t_outer left join t1 on (t_outer.f1 = t1.f1)
where t_outer.f2 = t2.f2;

Your existing patch would think t1 is unique-inner, but the qual pushed
down from WHERE would not be a joinqual so the wrong thing would happen
at runtime.

(Hm ... actually, this example wouldn't fail as written because
the WHERE qual is probably strict, so the left join would get
reduced to an inner join and then pushed-down-ness no longer
matters.  But hopefully you get my drift.)

> I don't really think the List idea would be nearly as efficient.

No, what I'm saying is that each RelOptInfo would contain a single List of
Relids of proven-unique-for outer rels (and another one for the negative
cache).  No array, no more searching than you have now, just removal of an
uncertainly-safe array fetch.

regards, tom lane




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-06 Thread David Rowley
On 7 April 2017 at 07:26, Tom Lane  wrote:
> I'm looking through this, and I'm failing to see where it deals with
> the problem we discussed last time, namely that you can't apply the
> optimization unless all clauses that were used in the uniqueness
> proof are included in the join's merge/hash conditions + joinquals.

Many thanks for looking at this again.

The test in question is in nodeMergeJoin.c. I believe the join is
still unique no matter where the clauses are evaluated. It should be
up to the executor code to make use of the knowledge how it sees fit.
The planner should not make assumptions on how the executor will make
use of this knowledge. I've carefully crafted a comment in
nodeMergejoin.c which explains all of this, and also about the
limitations and where things might be improved later.

The code in question is:

mergestate->mj_SkipMarkRestore = !mergestate->js.joinqual &&
mergestate->js.first_inner_tuple_only;


> I don't especially like the centralized unique_rels cache structure.
> It's not terribly clear what it's for, and you're making uncomfortably
> large assumptions about never indexing off the end of the array, despite
> not having any range checks for the subscripts.  Wouldn't it be better to
> add simple List fields into RelOptInfo, representing the outer rels this
> rel has been proven unique or not-unique for?  That would dodge the array
> size question and would be more readily extensible to someday applying
> this to join rels .

hmm, perhaps bounds checking could be done, but it's no worse than
planner_rt_fetch().

I don't really think the List idea would be nearly as efficient. The
array provides a direct lookup for the List of proof relations. A List
of Lists would require a linear lookup just to find the correct List,
then the existing linear lookup to find the proofs. The cache is
designed to be fast. Slowing it down seems like a bad idea. Perhaps
throwing it away would be better, since it's not required and was only
added as an optimisation.

The non_unique_rels will most often have a NULL bitmap set due to the
incremental join search by the standard planner. So access to this as
an array should be very fast, as we'll quickly realise there are no
proofs to be found.

> I also think some more thought is needed about handling JOIN_UNIQUE_OUTER
> and JOIN_UNIQUE_INNER cases.  In the first place, the patch cavalierly
> violates the statement in joinpath.c that those jointype values never
> propagate outside that module.  In the second place, shouldn't
> JOIN_UNIQUE_INNER automatically result in the path getting marked
> inner_unique?  I suspect the logic in add_paths_to_joinrel ought to
> look something like
>
> if (jointype == JOIN_UNIQUE_INNER)
>     extra.inner_unique = true;
> else
>     extra.inner_unique = innerrel_is_unique(root, outerrel, innerrel,
>                                             (jointype == JOIN_UNIQUE_OUTER ? JOIN_INNER : jointype),
>                                             restrictlist);

hmm, innerrel_is_unique() is without prejudice as to the jointypes it
supports, so much so that the argument is completely ignored. It was
probably left over from some previous incarnation of the patch.
If we treat JOIN_UNIQUE_INNER specially, then we'd better be sure that
it's made unique on the RHS join quals. It looks like
create_unique_path() uses sjinfo->semi_rhs_exprs to uniquify the
relation, and compute_semijoin_info() seems to take all of the join
conditions there or nothing at all, so I think it's safe to
automatically mark JOIN_UNIQUE_INNERs this way.
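
To spell the case out (the table names here are just invented for the
example), a semijoin such as

select * from t_outer where f1 in (select f1 from t_inner);

can be implemented either directly as a semi join, or by unique-ifying
the inner side first and then performing a plain join, roughly the
equivalent of:

select t_outer.*
from t_outer
join (select distinct f1 from t_inner) u using (f1);

In that second (JOIN_UNIQUE_INNER) shape the inner side is unique by
construction, which is why automatically marking it inner_unique should
be safe.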

Updated patch is attached.




-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2017-04-07.patch
Description: Binary data



Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-06 Thread Tom Lane
David Rowley  writes:
> On 2 April 2017 at 21:21, David Rowley  wrote:
>> I've attached an updated patch which updates the regression test output of
>> a recent commit to include the "Unique Inner" in the expected results.

> The patch must've fallen off. Attempt number 2 at attaching.

I'm looking through this, and I'm failing to see where it deals with
the problem we discussed last time, namely that you can't apply the
optimization unless all clauses that were used in the uniqueness
proof are included in the join's merge/hash conditions + joinquals.

It might be sufficient to ignore is_pushed_down conditions (at an
outer join only) while trying to make the proofs; but if it's doing
that, I don't see where.

I don't especially like the centralized unique_rels cache structure.
It's not terribly clear what it's for, and you're making uncomfortably
large assumptions about never indexing off the end of the array, despite
not having any range checks for the subscripts.  Wouldn't it be better to
add simple List fields into RelOptInfo, representing the outer rels this
rel has been proven unique or not-unique for?  That would dodge the array
size question and would be more readily extensible to someday applying
this to join rels.

I also think some more thought is needed about handling JOIN_UNIQUE_OUTER
and JOIN_UNIQUE_INNER cases.  In the first place, the patch cavalierly
violates the statement in joinpath.c that those jointype values never
propagate outside that module.  In the second place, shouldn't
JOIN_UNIQUE_INNER automatically result in the path getting marked
inner_unique?  I suspect the logic in add_paths_to_joinrel ought to
look something like

if (jointype == JOIN_UNIQUE_INNER)
    extra.inner_unique = true;
else
    extra.inner_unique = innerrel_is_unique(root, outerrel, innerrel,
                                            (jointype == JOIN_UNIQUE_OUTER ? JOIN_INNER : jointype),
                                            restrictlist);


regards, tom lane




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-02 Thread Tom Lane
David Rowley  writes:
> Tom, I'm wondering if you think you'll get time to look at this before the
> feature freeze?

Yeah, I intend to.  Thanks for updating the patch.

regards, tom lane




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-02 Thread David Rowley
On 2 April 2017 at 21:21, David Rowley  wrote:

> I've attached an updated patch which updates the regression test output of
> a recent commit to include the "Unique Inner" in the expected results.
>

The patch must've fallen off. Attempt number 2 at attaching.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2017-04-02.patch
Description: Binary data



Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-02 Thread Robert Haas
On Sun, Apr 2, 2017 at 5:21 AM, David Rowley
 wrote:
> I've attached an updated patch which updates the regression test output of a
> recent commit to include the "Unique Inner" in the expected results.

Was this email supposed to have a patch attached?

> Tom, I'm wondering if you think you'll get time to look at this before the
> feature freeze?

/me crosses fingers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-02 Thread David Rowley
On 27 March 2017 at 15:51, David Rowley 
wrote:

> On 27 March 2017 at 09:28, David Rowley 
> wrote:
>
> > Patch is attached which fixes up the conflict with the expression
> > evaluation performance patch.
>
> Seems I forgot to commit locally before creating the patch... Here's
> the actual patch I meant to attach earlier.


I've attached an updated patch which updates the regression test output of
a recent commit to include the "Unique Inner" in the expected results.

I've also performed a pgindent run on the patch.

Tom, I'm wondering if you think you'll get time to look at this before the
feature freeze?

David

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-03-26 Thread David Rowley
On 27 March 2017 at 09:28, David Rowley  wrote:

> Patch is attached which fixes up the conflict with the expression
> evaluation performance patch.

Seems I forgot to commit locally before creating the patch... Here's
the actual patch I meant to attach earlier.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2017-03-27a.patch
Description: Binary data



Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-03-26 Thread David Rowley
On 14 March 2017 at 16:37, David Rowley  wrote:
> On 14 March 2017 at 11:35, David Rowley 
> wrote:
>>
>> On 14 March 2017 at 07:50, Tom Lane  wrote:
>>>
>>> [ getting back to this patch finally... ]
>>>
>>> David Rowley  writes:
>>> > I've attached a patch which implements this, though only for
>>> > MergeJoin, else I'd imagine we'd also need to ensure all proofs used
>>> > for testing the uniqueness were also hash-able too. I added some XXX
>>> > comments in analyzejoins.c around the mergeopfamilies == NIL tests to
>>> > mention that Merge Join depends on all the unique proof quals having
>>> > mergeopfamilies.  This also assumes we'll never use some subset of
>>> > mergejoin-able quals for a merge join, which could be an interesting
>>> > area in the future, as we might have some btree index on a subset of
>>> > those columns to provide pre-sorted input. In short, it's a great
>>> > optimisation, but I'm a little scared we might break it one day.
>>>
>>> Umm ... it's broken already isn't it?  See the comment for
>>> generate_mergejoin_paths:
>>>
>> Thanks for looking at this again.
>>
>> Yeah confirmed. It's broken. I guess I just need to remember in the Path
>> if we got all the join quals, although I've not looked in detail what the
>> best fix is. I imagine it'll require storing something else in the JoinPath.
>
>
> OK, so I've spent some more time on this and I've come up with a solution.
>
> Basically the solution is to not skip mark and restore when joinquals
> contains any items.  This is a requirement for SEMI joins, but overly
> cautious for unique joins. However, I think it'll apply in most cases when
> Merge Join will be a win, and that's when there's a btree index on both
> sides of the join which covers all columns in the join condition.  I
> carefully commented this part of the code to explain what can be done to
> have it apply in more cases.
>
> This caused me to go and change the following code too:
>
> @@ -2676,6 +2688,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
>   * it off does not entitle us to deliver an invalid plan.
>   */
>   else if (innersortkeys == NIL &&
> + !((extra->inner_unique || path->jpath.jointype == JOIN_SEMI) &&
> + list_length(path->jpath.joinrestrictinfo) ==
> + list_length(path->path_mergeclauses)) &&
>   !ExecSupportsMarkRestore(inner_path))
>   path->materialize_inner = true;
>
> I've been staring at this for a while and I think it's correct. If it's
> wrong then we'd get "ERROR: unrecognized node type: " from ExecMarkPos().
>
> Here we must make sure and never skip materializing the inner side, if we're
> not going to skip mark and restore in ExecInitMergeJoin().
>
> I've attached a patch which fixes the issue.
>
> Also added a regression test to try to make sure it stays fixed. There's a
> few other mostly cosmetic fixes in there too.

Patch is attached which fixes up the conflict with the expression
evaluation performance patch.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2017-03-27.patch
Description: Binary data



Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-03-13 Thread David Rowley
On 14 March 2017 at 11:35, David Rowley 
wrote:

> On 14 March 2017 at 07:50, Tom Lane  wrote:
>
>> [ getting back to this patch finally... ]
>>
>> David Rowley  writes:
>> > I've attached a patch which implements this, though only for
>> > MergeJoin, else I'd imagine we'd also need to ensure all proofs used
>> > for testing the uniqueness were also hash-able too. I added some XXX
>> > comments in analyzejoins.c around the mergeopfamilies == NIL tests to
>> > mention that Merge Join depends on all the unique proof quals having
>> > mergeopfamilies.  This also assumes we'll never use some subset of
>> > mergejoin-able quals for a merge join, which could be an interesting
>> > area in the future, as we might have some btree index on a subset of
>> > those columns to provide pre-sorted input. In short, it's a great
>> > optimisation, but I'm a little scared we might break it one day.
>>
>> Umm ... it's broken already isn't it?  See the comment for
>> generate_mergejoin_paths:
>>
>> Thanks for looking at this again.
>
> Yeah confirmed. It's broken. I guess I just need to remember in the Path
> if we got all the join quals, although I've not looked in detail what the
> best fix is. I imagine it'll require storing something else in the JoinPath.
>

OK, so I've spent some more time on this and I've come up with a solution.

Basically the solution is to not skip mark and restore when joinquals
contains any items.  This is a requirement for SEMI joins, but overly
cautious for unique joins. However, I think it'll apply in most cases when
Merge Join will be a win, and that's when there's a btree index on both
sides of the join which covers all columns in the join condition.  I
carefully commented this part of the code to explain what can be done to
have it apply in more cases.

This caused me to go and change the following code too:

@@ -2676,6 +2688,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
  * it off does not entitle us to deliver an invalid plan.
  */
  else if (innersortkeys == NIL &&
+ !((extra->inner_unique || path->jpath.jointype == JOIN_SEMI) &&
+ list_length(path->jpath.joinrestrictinfo) ==
+ list_length(path->path_mergeclauses)) &&
  !ExecSupportsMarkRestore(inner_path))
  path->materialize_inner = true;

I've been staring at this for a while and I think it's correct. If it's
wrong then we'd get "ERROR: unrecognized node type: " from ExecMarkPos().

Here we must make sure and never skip materializing the inner side, if
we're not going to skip mark and restore in ExecInitMergeJoin().

I've attached a patch which fixes the issue.

Also added a regression test to try to make sure it stays fixed. There's a
few other mostly cosmetic fixes in there too.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2017-03-14a.patch
Description: Binary data



Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-03-13 Thread David Rowley
On 14 March 2017 at 07:50, Tom Lane  wrote:

> [ getting back to this patch finally... ]
>
> David Rowley  writes:
> > I've attached a patch which implements this, though only for
> > MergeJoin, else I'd imagine we'd also need to ensure all proofs used
> > for testing the uniqueness were also hash-able too. I added some XXX
> > comments in analyzejoins.c around the mergeopfamilies == NIL tests to
> > mention that Merge Join depends on all the unique proof quals having
> > mergeopfamilies.  This also assumes we'll never use some subset of
> > mergejoin-able quals for a merge join, which could be an interesting
> > area in the future, as we might have some btree index on a subset of
> > those columns to provide pre-sorted input. In short, it's a great
> > optimisation, but I'm a little scared we might break it one day.
>
> Umm ... it's broken already isn't it?  See the comment for
> generate_mergejoin_paths:
>
>  * We generate mergejoins if mergejoin clauses are available.  We have
>  * two ways to generate the inner path for a mergejoin: sort the cheapest
>  * inner path, or use an inner path that is already suitably ordered for
> the
>  * merge.  If we have several mergeclauses, it could be that there is no
> inner
>  * path (or only a very expensive one) for the full list of mergeclauses,
> but
>  * better paths exist if we truncate the mergeclause list (thereby
> discarding
>  * some sort key requirements).  So, we consider truncations of the
>  * mergeclause list as well as the full list.  (Ideally we'd consider all
>  * subsets of the mergeclause list, but that seems way too expensive.)
>
> There's another, more subtle, way in which it could fail in
> sort_inner_and_outer():
>
>  * Each possible ordering of the available mergejoin clauses will generate
>  * a differently-sorted result path at essentially the same cost.  We have
>  * no basis for choosing one over another at this level of joining, but
>  * some sort orders may be more useful than others for higher-level
>  * mergejoins, so it's worth considering multiple orderings.
>  *
>  * Actually, it's not quite true that every mergeclause ordering will
>  * generate a different path order, because some of the clauses may be
>  * partially redundant (refer to the same EquivalenceClasses).  Therefore,
>  * what we do is convert the mergeclause list to a list of canonical
>  * pathkeys, and then consider different orderings of the pathkeys.
>
> I'm fairly sure that it's possible to end up with fewer pathkeys than
> there are mergeclauses in this code path.  Now, you might be all right
> anyway given that the mergeclauses must refer to the same ECs in such a
> case --- maybe they're fully redundant and we can take testing the
> included clause as proving the omitted one(s) too.  I'm not certain
> right now what I meant by "partially redundant" in this comment.
> But in any case, it's moot for the present purpose because
> generate_mergejoin_paths certainly breaks your assumption.
>

Thanks for looking at this again.

Yeah confirmed. It's broken. I guess I just need to remember in the Path if
we got all the join quals, although I've not looked in detail what the best
fix is. I imagine it'll require storing something else in the JoinPath.

Here's my test case:

select 'create tablespace slow_disk LOCATION ''' ||
current_setting('data_directory') || ''' WITH (random_page_cost=1000);';
\gexec
create table ab (a int not null, b int not null);
create unique index ab_pkey on ab (a,b) tablespace slow_disk;
alter table ab add constraint ab_pkey primary key using index ab_pkey;

create index ab_a_idx on ab (a);

insert into ab values(1,1);
insert into ab values(1,2);

set enable_nestloop to 0;
set enable_hashjoin to 0;
set enable_sort to 0;

explain verbose select * from ab ab1 inner join ab ab2 on ab1.a = ab2.a and
ab1.b = ab2.b;
                                     QUERY PLAN
--------------------------------------------------------------------------------------
 Merge Join  (cost=0.26..24.35 rows=1 width=16)
   Output: ab1.a, ab1.b, ab2.a, ab2.b
   Inner Unique: Yes
   Merge Cond: (ab1.a = ab2.a)
   Join Filter: (ab1.b = ab2.b)
   ->  Index Scan using ab_a_idx on public.ab ab1  (cost=0.13..12.16 rows=2 width=8)
         Output: ab1.a, ab1.b
   ->  Index Scan using ab_a_idx on public.ab ab2  (cost=0.13..12.16 rows=2 width=8)
         Output: ab2.a, ab2.b
(9 rows)

select * from ab ab1 inner join ab ab2 on ab1.a = ab2.a and ab1.b = ab2.b;
-- wrong results
 a | b | a | b
---+---+---+---
 1 | 2 | 1 | 2
(1 row)

drop index ab_a_idx;

-- same query again. This time it'll use the PK index on the slow_disk
tablespace
select * from ab ab1 inner join ab ab2 on ab1.a = ab2.a and ab1.b = ab2.b;
 a | b | a | b
---+---+---+---
 1 | 1 | 1 | 1
 1 | 2 | 1 | 2
(2 rows)

I had to fix up some conflicts before testing this again on a recent
master, so I've attached my rebased version. No fix yet for the above issue.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-03-13 Thread Tom Lane
[ getting back to this patch finally... ]

David Rowley  writes:
> I've attached a patch which implements this, though only for
> MergeJoin, else I'd imagine we'd also need to ensure all proofs used
> for testing the uniqueness were also hash-able too. I added some XXX
> comments in analyzejoins.c around the mergeopfamilies == NIL tests to
> mention that Merge Join depends on all the unique proof quals having
> mergeopfamilies.  This also assumes we'll never use some subset of
> mergejoin-able quals for a merge join, which could be an interesting
> area in the future, as we might have some btree index on a subset of
> those columns to provide pre-sorted input. In short, it's a great
> optimisation, but I'm a little scared we might break it one day.

Umm ... it's broken already isn't it?  See the comment for
generate_mergejoin_paths:

 * We generate mergejoins if mergejoin clauses are available.  We have
 * two ways to generate the inner path for a mergejoin: sort the cheapest
 * inner path, or use an inner path that is already suitably ordered for the
 * merge.  If we have several mergeclauses, it could be that there is no inner
 * path (or only a very expensive one) for the full list of mergeclauses, but
 * better paths exist if we truncate the mergeclause list (thereby discarding
 * some sort key requirements).  So, we consider truncations of the
 * mergeclause list as well as the full list.  (Ideally we'd consider all
 * subsets of the mergeclause list, but that seems way too expensive.)

There's another, more subtle, way in which it could fail in
sort_inner_and_outer():

 * Each possible ordering of the available mergejoin clauses will generate
 * a differently-sorted result path at essentially the same cost.  We have
 * no basis for choosing one over another at this level of joining, but
 * some sort orders may be more useful than others for higher-level
 * mergejoins, so it's worth considering multiple orderings.
 *
 * Actually, it's not quite true that every mergeclause ordering will
 * generate a different path order, because some of the clauses may be
 * partially redundant (refer to the same EquivalenceClasses).  Therefore,
 * what we do is convert the mergeclause list to a list of canonical
 * pathkeys, and then consider different orderings of the pathkeys.

I'm fairly sure that it's possible to end up with fewer pathkeys than
there are mergeclauses in this code path.  Now, you might be all right
anyway given that the mergeclauses must refer to the same ECs in such a
case --- maybe they're fully redundant and we can take testing the
included clause as proving the omitted one(s) too.  I'm not certain
right now what I meant by "partially redundant" in this comment.
But in any case, it's moot for the present purpose because
generate_mergejoin_paths certainly breaks your assumption.

regards, tom lane




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-02-01 Thread Michael Paquier
On Tue, Jan 31, 2017 at 9:13 AM, David Rowley
 wrote:
> On 31 January 2017 at 13:10, David Rowley  
> wrote:
>> I've attached a patch which implements this.
>
> Please disregard the previous patch. (I forgot to git commit before git diff
> to make the patch)
>
> I've attached the correct patch.

Moved to CF 2017-03. (You are the last one.)
-- 
Michael




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-02-01 Thread David Rowley
On 31 January 2017 at 10:43, Tom Lane  wrote:
> David Rowley  writes:
>> I don't think that's possible. The whole point of the current join
>> removal code retrying joins which it already tried to remove, after a
>> successful removal, is exactly that it is possible for a join to become
>> provably unique on the removal of another join.
>
> Not seeing that ... example please?

I had a quick look at this again as it had been a while since I noticed that.

The sample case was:

create temp table uniquetbl (f1 text unique);
explain (costs off)
select t1.* from
  uniquetbl as t1
  left join (select *, '***'::text as d1 from uniquetbl) t2
  on t1.f1 = t2.f1
  left join uniquetbl t3
  on t2.d1 = t3.f1;

However, what it actually fails on depends on whether you check for unused
columns or uniqueness first, as initially the subquery fails both of
the tests.

I was under the impression it was failing the unique test, as that's
what I was doing first in my patch.

If you test uniqueness first it'll fail on:

/*
 * If such a clause actually references the inner rel then join
 * removal has to be disallowed.  We have to check this despite
 * the previous attr_needed checks because of the possibility of
 * pushed-down clauses referencing the rel.
 */
if (bms_is_member(innerrelid, restrictinfo->clause_relids))
    return false;

but if you test for unused columns first, it'll fail on:

if (bms_is_subset(phinfo->ph_eval_at, innerrel->relids))
    return false;       /* there isn't any other place to eval PHV */

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-30 Thread David Rowley
On 31 January 2017 at 13:10, David Rowley  wrote:
> I've attached a patch which implements this.

Please disregard the previous patch. (I forgot to git commit before git diff
to make the patch)

I've attached the correct patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2017-01-31a.patch
Description: Binary data



Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-30 Thread David Rowley
On 28 January 2017 at 05:44, Tom Lane  wrote:
> I wrote:
>> David Rowley  writes:
>>> hmm. I'm having trouble understanding why this is a problem for Unique
>>> joins, but not for join removal?
>
>> Ah, you know what, that's just mistaken.  I was thinking that we
>> short-circuited the join on the strength of the hash (or merge) quals
>> only, but actually we check all the joinquals first.  As long as the
>> uniqueness proof uses only joinquals and not conditions that will end up
>> as otherquals, it's fine.
>
> Actually, after thinking about that some more, it seems to me that there
> is a performance (not correctness) issue here: suppose that we have
> something like
>
> select ... from t1 left join t2 on t1.x = t2.x and t1.y < t2.y
>
> If there's a unique index on t2.x, we'll be able to mark the join
> inner-unique.  However, short-circuiting would only occur after
> finding a row that passes both joinquals.  If the y condition is
> true for only a few rows, this would pretty nearly disable the
> optimization.  Ideally we would short-circuit after testing the x
> condition only, but there's no provision for that.

I've attached a patch which implements this, though only for
MergeJoin, else I'd imagine we'd also need to ensure all proofs used
for testing the uniqueness were also hash-able too. I added some XXX
comments in analyzejoins.c around the mergeopfamilies == NIL tests to
mention that Merge Join depends on all the unique proof quals having
mergeopfamilies.  This also assumes we'll never use some subset of
mergejoin-able quals for a merge join, which could be an interesting
area in the future, as we might have some btree index on a subset of
those columns to provide pre-sorted input. In short, it's a great
optimisation, but I'm a little scared we might break it one day.

Implementing this meant removing the code that set match_first_row_only
for JOIN_SEMI, as this optimisation only applies to unique_inner and
not to JOIN_SEMI alone. This also means we should be checking for unique
properties on JOIN_SEMI now too, which I've enabled.

Also, I've removed all jointype checks from innerrel_is_unique(). I
took a bit of time to think about JOIN_UNIQUE_OUTER and
JOIN_UNIQUE_INNER. I think these are fine too, in fact one of the
regression test plans moved away from using a Semi Join due to proving
that the inner side was unique. That could perhaps allow a better
plan, since the join order can be swapped. I'd really like someone
else to have a think about that too, just to make sure I've not
blundered that.

I've put the join removal code back to the way it was before. As I
mentioned, we can't use the caches here since we're also using
additional quals for proofs in this case.

I wasn't sure if I should add some regression tests which exercise
MergeJoin a bit to test the new optimisation. Any thoughts on that?
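
Something roughly along these lines is the sort of thing I have in mind
(names invented here, and the expected plan shape is only what I'd
anticipate the patch to produce, not verified output):

create table uj (a int primary key, b int);
insert into uj values (1, 1), (2, 2), (3, 3);
set enable_nestloop = off;
set enable_hashjoin = off;
-- should produce a Merge Join with the inner side detected as unique
explain (costs off)
select * from uj uj1 inner join uj uj2 on uj1.a = uj2.a;
reset enable_nestloop;
reset enable_hashjoin;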

David

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2017-01-31.patch
Description: Binary data



Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-30 Thread Tom Lane
David Rowley  writes:
> On 31 January 2017 at 04:56, Tom Lane  wrote:
>> I'm not following.  If the join removal code had reached the stage of
>> making a uniqueness check, and that check had succeeded, the join would be
>> gone and there would be no repeat check later.  If it didn't reach the
>> stage of making a uniqueness check, then again there's no duplication.

> I had forgotten the unique check was performed last. In that case the
> check for unused columns is duplicated needlessly each time.

I think we do need to repeat that each time, as columns that were formerly
used in a join condition to a now-dropped relation might thereby have
become unused.

> But let's
> drop it, as putting the code back is not making things any worse.

Agreed, if there is something to be won there, we can address it
separately.

> I don't think that's possible. The whole point that the current join
> removal code retries to remove joins which it already tried to remove,
> after a successful removal is exactly because it is possible for a
> join to become provability unique on the removal of another join.

Not seeing that ... example please?

> If you remove that retry code, a regression test fails.

Probably because of the point about unused columns...

regards, tom lane




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-30 Thread David Rowley
On 31 January 2017 at 04:56, Tom Lane  wrote:
> David Rowley  writes:
>> I can make this change, but before I do I just want to point out that I
>> don't think what you've said here is entirely accurate.
>
>> Let's assume unique joins are very commonplace, and join removals are
>> not so common. If a query has 5 left joins, only one of which can
>> be removed, then the new code will most likely perform 5 unique join
>> checks, whereas the old code would perform 9, as those unique checks
>> are performed again once the 1 relation is removed for the remaining
>> 4.
>
> I'm not following.  If the join removal code had reached the stage of
> making a uniqueness check, and that check had succeeded, the join would be
> gone and there would be no repeat check later.  If it didn't reach the
> stage of making a uniqueness check, then again there's no duplication.

I had forgotten the unique check was performed last. In that case the
check for unused columns is duplicated needlessly each time. But let's
drop it, as putting the code back is not making things any worse.

> There will be some advantage in making a negative cache entry if join
> removal performs a uniqueness check that fails, but I don't really see
> why that's hard.  It does not seem like removal of a relation could
> cause another rel to become unique that wasn't before, so keeping
> negative cache entries across join removals ought to be safe.

I don't think that's possible. The whole point of the current join
removal code retrying joins which it already tried to remove, after a
successful removal, is exactly that it is possible for a join to become
provably unique on the removal of another join. If you remove that
retry code, a regression test fails. I believe this is because there
initially would have been more than one RHS rel, and the bitmap
singleton check would have failed. After all, a unique index is on a
single relation, so proofs don't exist when >1 rel is on the RHS.

In any case, it's not possible to use the cache for join removals, as
we use the otherquals for uniqueness tests there, but we can't do that
for unique joins: I'm adding optimizations to the executor which rely
on the uniqueness being proven from the join condition alone, so that
we can skip to the next outer tuple once the join condition matches,
even if the other quals do not. If I changed how join removals work in
that regard it would disallow join removals where they were previously
possible. So that's a no-go area.





-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-30 Thread Tom Lane
David Rowley  writes:
> I can make this change, but before I do I just want to point out that I
> don't think what you've said here is entirely accurate.

> Let's assume unique joins are very commonplace, and join removals are
> not so common. If a query has 5 left joins, only one of which can
> be removed, then the new code will most likely perform 5 unique join
> checks, whereas the old code would perform 9, as those unique checks
> are performed again once the 1 relation is removed for the remaining
> 4.

I'm not following.  If the join removal code had reached the stage of
making a uniqueness check, and that check had succeeded, the join would be
gone and there would be no repeat check later.  If it didn't reach the
stage of making a uniqueness check, then again there's no duplication.
There will be some advantage in making a negative cache entry if join
removal performs a uniqueness check that fails, but I don't really see
why that's hard.  It does not seem like removal of a relation could
cause another rel to become unique that wasn't before, so keeping
negative cache entries across join removals ought to be safe.

regards, tom lane




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-29 Thread David Rowley
On 28 January 2017 at 05:04, Tom Lane  wrote:
> David Rowley  writes:
>> I agree that special handling of one join type is not so pretty.
>> However, LEFT JOINs still remain a bit special as they're the only
>> ones we currently perform join removal on, and the patch modifies that
>> code to make use of the new flag for those. This can improve planner
>> performance of join removal when a join is removed successfully, as
>> the previous code had to recheck uniqueness of each remaining LEFT
>> JOIN again, whereas the new code only checks uniqueness of ones not
>> previously marked as unique. This too likely could be done with the
>> cache, although I'm a bit concerned with populating the cache, then
>> performing a bunch of LEFT JOIN removals and leaving relids in the
>> cache which no longer exist. Perhaps it's OK. I've just not found
>> proofs in my head yet that it is.
>
> TBH, I do not like that tie-in at all.  I don't believe that it improves
> any performance, because if analyzejoins.c detects that the join is
> unique, it will remove the join; therefore there is nothing to cache.
> (This statement depends on the uniqueness test being the last removability
> test, but it is.)  And running mark_unique_joins() multiple times is ugly
> and adds cycles whenever it cannot prove a join unique, because it'll keep
> trying to do so.  So I'm pretty inclined to drop the connection to
> analyzejoins.c altogether, along with mark_unique_joins(), and just use
> the generic positive/negative cache mechanism you added for all join types.

I can make this change, but before I do I just want to point out that I
don't think what you've said here is entirely accurate.

Let's assume unique joins are very commonplace, and join removals are
not so common. If a query has 5 left joins, only one of which can
be removed, then the new code will most likely perform 5 unique join
checks, whereas the old code would perform 9, as those unique checks
are performed again once the 1 relation is removed for the remaining
4.

However, I'll go make the change, as something needs to be fixed in that
area anyway: LEFT JOINs use the additional quals, whereas other join
types don't, which is broken.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-27 Thread Robert Haas
On Fri, Jan 27, 2017 at 2:00 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Fri, Jan 27, 2017 at 1:39 PM, Tom Lane  wrote:
>>> Um ... what's that got to do with the point at hand?
>
>> So I assumed from that that the issue was that you'd have to wait for
>> the first time the irrelevant-joinqual got satisfied before the
>> optimization kicked in.
>
> No, the problem is that that needs to happen for *each* outer row,
> and there's only one chance for it to happen.  Given the previous
> example,
>
> select ... from t1 left join t2 on t1.x = t2.x and t1.y < t2.y
>
> once we've found an x match for a given outer row, there aren't going to
> be any more and we should move on to the next outer row.  But as the patch
> stands, we only recognize that if t1.y < t2.y happens to be true for that
> particular row pair.  Otherwise we'll keep searching and we'll never find
> another match for that outer row.  So if the y condition is, say, 50%
> selective then the optimization only wins for 50% of the outer rows
> (that have an x partner in the first place).
>
> Now certainly that's better than a sharp stick in the eye, and
> maybe we should just press forward anyway.  But it feels like
> this is leaving a lot more on the table than I originally thought.
> Especially for the inner-join case, where *all* the WHERE conditions
> get a chance to break the optimization this way.

OK, now I understand why you were concerned. Given the size of some of
the speedups David's reported on this thread, I'd be tempted to press
forward even if no solution to this part of the problem presents
itself, but I also agree with you that it's leaving quite a bit on the
table if we can't do better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-27 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jan 27, 2017 at 1:39 PM, Tom Lane  wrote:
>> Um ... what's that got to do with the point at hand?

> So I assumed from that that the issue was that you'd have to wait for
> the first time the irrelevant-joinqual got satisfied before the
> optimization kicked in.

No, the problem is that that needs to happen for *each* outer row,
and there's only one chance for it to happen.  Given the previous
example,

select ... from t1 left join t2 on t1.x = t2.x and t1.y < t2.y

once we've found an x match for a given outer row, there aren't going to
be any more and we should move on to the next outer row.  But as the patch
stands, we only recognize that if t1.y < t2.y happens to be true for that
particular row pair.  Otherwise we'll keep searching and we'll never find
another match for that outer row.  So if the y condition is, say, 50%
selective then the optimization only wins for 50% of the outer rows
(that have an x partner in the first place).

Now certainly that's better than a sharp stick in the eye, and
maybe we should just press forward anyway.  But it feels like
this is leaving a lot more on the table than I originally thought.
Especially for the inner-join case, where *all* the WHERE conditions
get a chance to break the optimization this way.

regards, tom lane




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-27 Thread Robert Haas
On Fri, Jan 27, 2017 at 1:39 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Fri, Jan 27, 2017 at 11:44 AM, Tom Lane  wrote:
>>> I'm afraid though that we may have to do something about the
>>> irrelevant-joinquals issue in order for this to be of much real-world
>>> use for inner joins.
>
>> Maybe, but it's certainly not the case that all inner joins are highly
>> selective.  There are plenty of inner joins in real-world applications
>> where the join product is 10% or 20% or 50% of the size of the larger
>> input.
>
> Um ... what's that got to do with the point at hand?

I thought it was directly relevant, but maybe I'm confused.  Further
up in that email, you wrote: "If there's a unique index on t2.x, we'll
be able to mark the join inner-unique.  However, short-circuiting
would only occur after finding a row that passes both joinquals.  If
the y condition is true for only a few rows, this would pretty nearly
disable the optimization.  Ideally we would short-circuit after
testing the x condition only, but there's no provision for that."

So I assumed from that that the issue was that you'd have to wait for
the first time the irrelevant-joinqual got satisfied before the
optimization kicked in.  But, if the join is emitting lots of rows,
that'll happen pretty quickly.  I mean, if the join emits even as many
20 rows, the time after the first one is, all things being equal, 95%
of the runtime of the join.  There could certainly be bad cases where
it takes a long time to produce the first row, but I wouldn't say
that's a particularly common thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-27 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jan 27, 2017 at 11:44 AM, Tom Lane  wrote:
>> I'm afraid though that we may have to do something about the
>> irrelevant-joinquals issue in order for this to be of much real-world
>> use for inner joins.

> Maybe, but it's certainly not the case that all inner joins are highly
> selective.  There are plenty of inner joins in real-world applications
> where the join product is 10% or 20% or 50% of the size of the larger
> input.

Um ... what's that got to do with the point at hand?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-27 Thread Robert Haas
On Fri, Jan 27, 2017 at 11:44 AM, Tom Lane  wrote:
> I'm afraid though that we may have to do something about the
> irrelevant-joinquals issue in order for this to be of much real-world
> use for inner joins.

Maybe, but it's certainly not the case that all inner joins are highly
selective.  There are plenty of inner joins in real-world applications
where the join product is 10% or 20% or 50% of the size of the larger
input.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-27 Thread Tom Lane
I wrote:
> David Rowley  writes:
>> hmm. I'm having trouble understanding why this is a problem for Unique
>> joins, but not for join removal?

> Ah, you know what, that's just mistaken.  I was thinking that we
> short-circuited the join on the strength of the hash (or merge) quals
> only, but actually we check all the joinquals first.  As long as the
> uniqueness proof uses only joinquals and not conditions that will end up
> as otherquals, it's fine.

Actually, after thinking about that some more, it seems to me that there
is a performance (not correctness) issue here: suppose that we have
something like

select ... from t1 left join t2 on t1.x = t2.x and t1.y < t2.y

If there's a unique index on t2.x, we'll be able to mark the join
inner-unique.  However, short-circuiting would only occur after
finding a row that passes both joinquals.  If the y condition is
true for only a few rows, this would pretty nearly disable the
optimization.  Ideally we would short-circuit after testing the x
condition only, but there's no provision for that.

This might not be a huge problem for outer joins.  My sense of typical
SQL style is that the joinquals (ON conditions) are likely to be
exactly what you need to prove inner uniqueness, while random other
conditions will be pushed-down from WHERE and hence will be otherquals.
But I'm afraid it is quite a big deal for inner joins, where we dump
all available conditions into the joinquals.  We might need to rethink
that choice.
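
For instance (an illustrative query, not one quoted earlier in the thread):

select * from t1 join t2 on t1.x = t2.x where t1.y < t2.y;

Here t1.y < t2.y comes from WHERE, but because this is an inner join we
currently fold it into the joinquals alongside t1.x = t2.x, so the
short-circuit again fires only once a row pair passes both conditions.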

At least for merge and hash joins, it's tempting to think about a
short-circuit test being made after testing just the merge/hash quals.
But we'd have to prove uniqueness using only the merge/hash quals,
so the planning cost might be unacceptably high --- particularly for
merge joins which often don't use all available mergeable quals.
In the end I think we probably want to keep the short-circuit in the
same place where it is for SEMI/ANTI cases (which have to have it
exactly there for semantic correctness).

I'm afraid though that we may have to do something about the
irrelevant-joinquals issue in order for this to be of much real-world
use for inner joins.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-27 Thread Tom Lane
David Rowley  writes:
> On 27 January 2017 at 12:39, Tom Lane  wrote:
>> 2. In these same cases (unique/semi/anti joins), it is possible to avoid
>> mark/restore overhead in a mergejoin, because we can tweak the executor
>> logic to not require backing up the inner side.

> I've made modifications in the attached to add this optimization, and
> it's quite a significant improvement.

Cool ... validates my gut feeling that that was worth incorporating.

>> ... IOW, would it hurt to drop the SpecialJoinInfo
>> tie-in and just rely on the generic cache?

> I agree that special handling of one join type is not so pretty.
> However, LEFT JOINs still remain a bit special as they're the only
> ones we currently perform join removal on, and the patch modifies that
> code to make use of the new flag for those. This can improve planner
> performance of join removal when a join is removed successfully, as
> the previous code had to recheck uniqueness of each remaining LEFT
> JOIN again, whereas the new code only checks uniqueness of ones not
> previously marked as unique. This too likely could be done with the
> cache, although I'm a bit concerned with populating the cache, then
> performing a bunch of LEFT JOIN removals and leaving relids in the
> cache which no longer exist. Perhaps it's OK. I've just not found
> proofs in my head yet that it is.

TBH, I do not like that tie-in at all.  I don't believe that it improves
any performance, because if analyzejoins.c detects that the join is
unique, it will remove the join; therefore there is nothing to cache.
(This statement depends on the uniqueness test being the last removability
test, but it is.)  And running mark_unique_joins() multiple times is ugly
and adds cycles whenever it cannot prove a join unique, because it'll keep
trying to do so.  So I'm pretty inclined to drop the connection to
analyzejoins.c altogether, along with mark_unique_joins(), and just use
the generic positive/negative cache mechanism you added for all join types.

It's possible that it'd make sense for analyzejoins.c to add a negative
cache entry about any join that it tries and fails to prove unique.
Your point about cache maintenance is valid, but a simple answer there
would just be to flush the cache whenever we remove a rel.  (Although
I'm dubious that we need to: how could a removable rel be part of the
min outer rels for any surviving rel?)

>> * Because of where we apply the short-circuit logic in the executor,
>> it's only safe to consider the inner rel as unique if it is provably
>> unique using only the join clauses that drive the primary join mechanism
>> (ie, the "joinquals" not the "otherquals").  We already do ignore quals
>> that are pushed-down to an outer join, so that's good, but there's an
>> oversight: we will use a qual that is mergejoinable even if it's not
>> hashjoinable. That means we could get the wrong answers in a hash join.

> hmm. I'm having trouble understanding why this is a problem for Unique
> joins, but not for join removal?

Ah, you know what, that's just mistaken.  I was thinking that we
short-circuited the join on the strength of the hash (or merge) quals
only, but actually we check all the joinquals first.  As long as the
uniqueness proof uses only joinquals and not conditions that will end up
as otherquals, it's fine.

> However, you mentioning this caused me to notice that this is only true
> in the patch for left joins, and the other join types are not
> consistent with that.

Hm, perhaps, but the example you show isn't proving that, because it's
choosing to put a1 on the inside in the innerjoin case, and a1 certainly
isn't unique for this query.  We can't see whether a2 was detected as
unique.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-27 Thread David Rowley
On 27 January 2017 at 12:39, Tom Lane  wrote:
> 2. In these same cases (unique/semi/anti joins), it is possible to avoid
> mark/restore overhead in a mergejoin, because we can tweak the executor
> logic to not require backing up the inner side.  This goes further than
> just tweaking the executor logic, though, because if we know we don't
> need mark/restore then that actually makes some plan shapes legal that
> weren't before: we don't need to interpose a Material node to protect
> join inputs that can't mark/restore.

I've made modifications in the attached to add this optimization, and
it's quite a significant improvement.

Setup:

create table t1 (id text primary key);
insert into t1 select x from generate_series(1,100) x(x);
create table t2 (id text primary key);
insert into t2 select x from generate_series(1,100) x(x);
vacuum freeze;

Query:
select count(*) from t1 inner join t2 on t1.id=t2.id;

Unpatched

Time: 369.618 ms
Time: 358.208 ms
Time: 357.094 ms
Time: 355.642 ms
Time: 358.193 ms
Time: 371.272 ms
Time: 354.386 ms
Time: 364.277 ms
Time: 346.091 ms
Time: 358.757 ms

Patched

Time: 273.154 ms
Time: 258.520 ms
Time: 269.456 ms
Time: 252.861 ms
Time: 271.015 ms
Time: 252.567 ms
Time: 271.132 ms
Time: 267.505 ms
Time: 265.295 ms
Time: 257.068 ms

About 26.5% improvement for this case.
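
To see which plan shape is being exercised, the same setup can be
inspected with a plain EXPLAIN (shown here only as a usage hint; the
output isn't reproduced):

explain (costs off)
select count(*) from t1 inner join t2 on t1.id = t2.id;

i.e. check which join method is chosen and whether a Materialize node
still appears on the inner side of a merge join.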

> * The patch applies point #1 to only INNER and LEFT join modes, but
> I don't really see why the idea wouldn't work for RIGHT and FULL modes,
> ie the optimization seems potentially interesting for all executable join
> types.  Once you've got a match, you can immediately go to the next outer
> tuple instead of continuing to scan inner.  (Am I missing something?)

No, I believe I started development with LEFT JOINs, moved to INNER,
then didn't progress to think of RIGHT or FULL. I've lifted this
restriction in the patch. It seems no other work is required to have
that just work.

> * Particularly in view of the preceding point, I'm not that happy with
> the way that management/caching of the "is it unique" knowledge is
> done completely differently for INNER and LEFT joins.  I wonder if
> there's actually a good argument for that or is it mostly a development
> sequence artifact.  IOW, would it hurt to drop the SpecialJoinInfo
> tie-in and just rely on the generic cache?

I agree that special handling of one join type is not so pretty.
However, LEFT JOINs still remain a bit special as they're the only
ones we currently perform join removal on, and the patch modifies that
code to make use of the new flag for those. This can improve planner
performance of join removal when a join is removed successfully, as
the previous code had to recheck uniqueness of each remaining LEFT
JOIN again, whereas the new code only checks uniqueness of ones not
previously marked as unique. This too likely could be done with the
cache, although I'm a bit concerned with populating the cache, then
performing a bunch of LEFT JOIN removals and leaving relids in the
cache which no longer exist. Perhaps it's OK. I've just not found
proofs in my head yet that it is.

> * Because of where we apply the short-circuit logic in the executor,
> it's only safe to consider the inner rel as unique if it is provably
> unique using only the join clauses that drive the primary join mechanism
> (ie, the "joinquals" not the "otherquals").  We already do ignore quals
> that are pushed-down to an outer join, so that's good, but there's an
> oversight: we will use a qual that is mergejoinable even if it's not
> hashjoinable. That means we could get the wrong answers in a hash join.
> I think probably the appropriate fix for the moment is just to consider
> only clauses that are both mergeable and hashable while trying to prove
> uniqueness.  We do have some equality operators that support only one
> or the other, but they're corner cases, and I'm dubious that it's worth
> having to make separate proofs for merge and hash joins in order to
> cater to those cases.

hmm. I'm having trouble understanding why this is a problem for Unique
joins, but not for join removal?

However, you mentioning this caused me to notice that this is only true
in the patch for left joins, and the other join types are not
consistent with that.

create table a1 (a int, b int, primary key(a,b));
create table a2 (a int, b int, primary key(a,b));

explain (verbose, costs off) select * from a1 left join a2 on a1.a =
a2.a and a2.b=1 ;
                   QUERY PLAN
------------------------------------------------
 Merge Left Join
   Output: a1.a, a1.b, a2.a, a2.b
   Inner Unique: Yes
   Merge Cond: (a1.a = a2.a)
   ->  Index Only Scan using a1_pkey on public.a1
         Output: a1.a, a1.b
   ->  Sort
         Output: a2.a, a2.b
         Sort Key: a2.a
         ->  Bitmap Heap Scan on public.a2
               Output: a2.a, a2.b
               Recheck Cond: (a2.b = 1)
               ->  Bitmap Index Scan on a2_pkey
                     Index Cond: (a2.b = 1)



Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-27 Thread Antonin Houska
I thought about the patch from the perspective of "grouped relations"
(especially [1]). When looking for the appropriate context within the thread,
I picked this message.

David Rowley  wrote:

> On 12 March 2016 at 11:43, Tom Lane  wrote:
> >
> > It seems like the major intellectual complexity here is to figure out
> > how to detect inner-side-unique at reasonable cost.  I see that for
> > LEFT joins you're caching that in the SpecialJoinInfos, which is probably
> > fine.  But for INNER joins it looks like you're just doing it over again
> > for every candidate join, and that seems mighty expensive.

> ... I'll look into that.
> 
> The other thing I thought of was to add a dedicated list for unique
> indexes in RelOptInfo; this would also allow
> rel_supports_distinctness() to do something a bit smarter than just
> return false if there are no indexes. That might not buy us much though,
> but at least relations tend to have very few unique indexes, even
> when they have lots of indexes.

I'm thinking of a concept of "unique keys", similar to path keys that the
planner already uses. Besides the current evaluation of uniqueness of the
inner side of a join, the planner would (kind of) union the unique keys of the
joined rels, i.e. compute a list of expressions which generate a unique row
throughout the new join result. (The requirement is that each key must be
usable in a join expression, as opposed to a filter.)

To figure out whether at most one inner row exists per outer row, each unique
key of the inner relation which references the outer relation needs to match
a unique key of the outer relation (but it's probably wrong if multiple
unique keys of the inner rel reference the same key of the outer rel).

Like a path key, the unique key would also point to an equivalence class. Thus
mere equality of the EC pointers could perhaps be used to evaluate the match
of the inner and outer keys.

Given that rel_is_distinct_for() currently does not accept joins, this change
would make the patch more generic. (BTW, with this approach, unique_rels and
non_unique_rels caches would have to be stored per-relation (RelOptInfo), as
opposed to PlannerInfo.)

The reason I'd like the unique keys is that - from the "grouped relation"
point of view - the relation uniqueness also needs to be checked against the
GROUP BY clause. Thus the "unique keys" concept seems to me like a useful
abstraction.
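
A small SQL illustration of the idea (the tables here are invented for
the example, they don't come from the thread):

create table o (o_id int primary key, payload int);
create table i (i_id int primary key, o_id int unique);

select o.o_id, i.i_id
from o join i on i.o_id = o.o_id;

-- i carries two unique keys, {i_id} and {o_id}; the one referencing the
-- outer rel ({o_id}) matches o's unique key, so at most one i row joins
-- to each o row, and the join result is itself unique on o.o_id (and on
-- i.i_id).  A subsequent GROUP BY o.o_id would then be grouping on a
-- unique key of the join result.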

Does this proposal seem to have a serious flaw?

[1]
https://www.postgresql.org/message-id/CAKJS1f_h1CLff92B%3D%2BbdrMK2Nf3EfGWaJu2WbzQUYcSBUi02ag%40mail.gmail.com

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-26 Thread Tom Lane
To re-familiarize myself with this patch, I've been re-reading the thread,
which has gotten quite long.  It seemed like it would be a good idea to
stop and try to summarize what the patch ought to accomplish, because
there's been some drift over the more than 2 years the patch has been
in the works.

So: ISTM there are two core ideas at this point.

1. We should recognize when the inner side of a join is unique (that is,
there is provably no more than one inner tuple joining to any given outer
tuple), and make use of that knowledge to short-circuit execution in the
same way we already do for JOIN_SEMI and JOIN_ANTI cases.  That is, once
we find a match we can immediately move on to the next outer tuple rather
than continuing to scan for inner-side matches.
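
(As a concrete illustration of point #1, with a schema invented here:
given

create table t1 (x int);
create table t2 (x int primary key);

the executor already stops scanning the inner side after the first match
for the semijoin form

select * from t1 where exists (select 1 from t2 where t2.x = t1.x);

and the point is that

select t1.* from t1 join t2 on t2.x = t1.x;

can short-circuit identically, since t2.x is unique and a second match is
impossible.)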

2. In these same cases (unique/semi/anti joins), it is possible to avoid
mark/restore overhead in a mergejoin, because we can tweak the executor
logic to not require backing up the inner side.  This goes further than
just tweaking the executor logic, though, because if we know we don't
need mark/restore then that actually makes some plan shapes legal that
weren't before: we don't need to interpose a Material node to protect
join inputs that can't mark/restore.

Maybe I missed something, but it doesn't look like the current patch
(unique_joins_2017-01-27_no_outer_unique.patch) has anything concerning
point #2 at all.  It might make sense to address that idea as a follow-on
patch, but I think it can be a quite significant win and we shouldn't
just lose track of it.

Anyway, having laid out that scope of work, I have some concerns:

* The patch applies point #1 to only INNER and LEFT join modes, but
I don't really see why the idea wouldn't work for RIGHT and FULL modes,
ie the optimization seems potentially interesting for all executable join
types.  Once you've got a match, you can immediately go to the next outer
tuple instead of continuing to scan inner.  (Am I missing something?)

* Particularly in view of the preceding point, I'm not that happy with
the way that management/caching of the "is it unique" knowledge is
done completely differently for INNER and LEFT joins.  I wonder if
there's actually a good argument for that or is it mostly a development
sequence artifact.  IOW, would it hurt to drop the SpecialJoinInfo
tie-in and just rely on the generic cache?

* Because of where we apply the short-circuit logic in the executor,
it's only safe to consider the inner rel as unique if it is provably
unique using only the join clauses that drive the primary join mechanism
(ie, the "joinquals" not the "otherquals").  We already do ignore quals
that are pushed-down to an outer join, so that's good, but there's an
oversight: we will use a qual that is mergejoinable even if it's not
hashjoinable. That means we could get the wrong answers in a hash join.
I think probably the appropriate fix for the moment is just to consider
only clauses that are both mergeable and hashable while trying to prove
uniqueness.  We do have some equality operators that support only one
or the other, but they're corner cases, and I'm dubious that it's worth
having to make separate proofs for merge and hash joins in order to
cater to those cases.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-26 Thread David Rowley
On 27 January 2017 at 08:34, Tom Lane  wrote:
> David Rowley  writes:
>> I've attached a version without outer unique.
>
> I looked through this a bit, and the first thing I noticed was it doesn't
> touch costsize.c at all.  That seems pretty wrong; it's little help to
> have a performance improvement if the planner won't pick the right plan
> type.  There were costsize.c changes in there back in April; what happened
> to them?

hmm. I must've taken them out when I changed everything around to use
the join types. Nothing special was needed when changing JOIN_INNER to
JOIN_SEMI.

I must've forgotten to put them back again... I'll do that now.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-26 Thread Tom Lane
David Rowley  writes:
> I've attached a version without outer unique.

I looked through this a bit, and the first thing I noticed was it doesn't
touch costsize.c at all.  That seems pretty wrong; it's little help to
have a performance improvement if the planner won't pick the right plan
type.  There were costsize.c changes in there back in April; what happened
to them?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-26 Thread David Rowley
On 27 January 2017 at 00:37, David Rowley  wrote:
> The attached has my Merge Join changes, to show what I think can be
> done to make use of unique outer. Let me know what you think, but I
> get the idea that we're both leaning towards ripping the outer unique
> stuff out, so I'll go do that now...

I've attached a version without outer unique.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2017-01-27_no_outer_unique.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-26 Thread David Rowley
On 26 January 2017 at 04:56, Antonin Houska  wrote:
> I suspect that "inner" and "outer" relation / tuple are sometimes confused in
> comments:
>
>
> * analyzejoins.c:70
>
>  "searches for subsequent matching outer tuples."
>
>
> * analyzejoins.c:972
>
> /*
>  * innerrel_is_unique
 *    Check for proofs which prove that 'innerrel' can, at most, match a
 *    single tuple in 'outerrel' based on the join condition in
 *    'restrictlist'.
>  */
>
>
> * relation.h:1831
>
bool		inner_unique;	/* inner side of join matches no more than one
							 * outer side tuple */

Thanks for looking over the patch. I believe I've fixed all those now.
I'm not too surprised I got some wrong; after all, I did mess up the
subject of this email too, which caused quite a bit of confusion!


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-26 Thread David Rowley
Thank for looking at this again.

On 25 January 2017 at 06:27, Tom Lane  wrote:
> David Rowley  writes:
>> However, having said that, I'm not sure why we'd need outer_unique
>> available so we'd know that we could skip mark/restore. I think
>> inner_unique is enough for this purpose. Take the comment from
>> nodeMergejoin.c:
>
>> * outer: (0 ^1 1 2 5 5 5 6 6 7) current tuple: 1
>>  * inner: (1 ^3 5 5 5 5 6) current tuple: 3
>>  ...
>> *
>>  * Consider the above relations and suppose that the executor has
>>  * just joined the first outer "5" with the last inner "5". The
>>  * next step is of course to join the second outer "5" with all
>>  * the inner "5's". This requires repositioning the inner "cursor"
>>  * to point at the first inner "5". This is done by "marking" the
>>  * first inner 5 so we can restore the "cursor" to it before joining
>>  * with the second outer 5. The access method interface provides
>>  * routines to mark and restore to a tuple.
>
>> If only one inner "5" can exist (unique_inner), then isn't the inner
>> side already in the correct place, and no restore is required?
>
> Hmm ... let me see whether I have my head wrapped around this correctly.
>
> What I had in mind is a different optimization.  When the inner is not
> unique, we can't know we're done joining the first outer "5" until we
> reach the inner "6".  At that point, what we normally do is advance the
> outer by one row and back up the inner to the mark point (the first inner
> "5").  But the only reason to back up is that the new outer row might join
> to the same inner rows the previous one did.  If we know the outer side is
> unique, then those inner rows can't join to the new outer row so no need
> to back up.  So this requires no real change to the mergejoin algorithm,
> we just skip the mark and restore steps.

I'd not thought of that possibility, although with a unique outer, I
don't think it'll ever save a restore, as the comparison to the new
outer tuple will always fail to match the marked inner tuple.
Nevertheless we can save that useless comparison, so I've written
code to that effect in the attached. Perhaps detecting unique outer is
not worthwhile for just this?

>
> I think what you're saying is that, if we know the inner side is unique,
> we can also skip mark/restore overhead; but it would work a bit
> differently.  After joining two rows with equal keys, instead of advancing
> the inner as per standard algorithm, we'd need to advance the outer side.
> (This is OK because we know the next inner can't join to this same outer.

Yeah, this is the same optimisation that applies to the other join types
too, which happens to be the same point at which semi joins must give up.

> But we don't know if the next outer can join to this inner.)  We advance
> inner only when current inner < current outer, so we're done with that
> inner and need never rewind.  So this is a more fundamental algorithm
> change but it gets the same performance benefit.

Yes, I think if inner is unique and outer is unique then after joining
in EXEC_MJ_JOINTUPLES, we can have a new state
EXEC_MJ_NEXTINNERANDOUTER, as the next outer won't match this inner,
so we might as well skip to the next on both, but quite possibly such
a join is just not common enough to have to worry about that. It might
not give us much better performance anyway.

> So the question is, if we can skip mark/restore overhead when we know that
> either input is unique, is it necessary for the planner to account for
> both ways explicitly?  Given the general symmetry of mergejoins, it might
> be okay for the planner to preferentially generate plans with the unique
> input on the inside, and not worry about optimizing in this way when it's
> on the outside.

I'm inclined to agree.

> Now, JOIN_SEMI and JOIN_ANTI cases are *not* symmetric, since we don't
> implement reverse-semi or reverse-anti join modes.  But I don't know that
> there's anything to be won by noticing that the outer side is unique in
> those cases.  I think probably we can apply the same technique of
> advancing outer not inner, and never rewinding, in those join modes
> whether or not either input is known unique.
>
> Another asymmetry is that if one input is likely to be empty, it's better
> to put that one on the outside, because if it is empty we don't have to
> fetch anything from the other input (thus saving its startup cost).  But
> the planner doesn't account for that anyway because it never believes
> non-dummy relations are truly empty; so even if it's getting this right
> today, it's purely by chance.
>
> I can't think of any other asymmetries offhand.
>
> In short, I think you are right that it might be enough to account for
> inner uniqueness only, and not worry about it for the outer side, even
> for mergejoin.  This means my previous objection is wrong and we don't
> really have to allow for a future extension of that kind while choosing
> 

Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-25 Thread Antonin Houska
David Rowley  wrote:
> On 19 January 2017 at 11:06, David Rowley  
> wrote:
> > Old patch no longer applies, so I've attached a rebased patch. This
> > also re-adds a comment line which I mistakenly removed.
> 
> (meanwhile Andres commits 69f4b9c)
> 
> I should've waited a bit longer.
> 
> Here's another that fixes the new conflicts.

I suspect that "inner" and "outer" relation / tuple are sometimes confused in
comments:


* analyzejoins.c:70

 "searches for subsequent matching outer tuples."


* analyzejoins.c:972

/*
 * innerrel_is_unique
 *    Check for proofs which prove that 'innerrel' can, at most, match a
 *    single tuple in 'outerrel' based on the join condition in
 *    'restrictlist'.
 */


* relation.h:1831

bool		inner_unique;	/* inner side of join matches no more than one
							 * outer side tuple */


-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-24 Thread Tom Lane
[ getting back to this at long last ]

David Rowley  writes:
> However, having said that, I'm not sure why we'd need outer_unique
> available so we'd know that we could skip mark/restore. I think
> inner_unique is enough for this purpose. Take the comment from
> nodeMergejoin.c:

> * outer: (0 ^1 1 2 5 5 5 6 6 7) current tuple: 1
>  * inner: (1 ^3 5 5 5 5 6) current tuple: 3
>  ...
> *
>  * Consider the above relations and suppose that the executor has
>  * just joined the first outer "5" with the last inner "5". The
>  * next step is of course to join the second outer "5" with all
>  * the inner "5's". This requires repositioning the inner "cursor"
>  * to point at the first inner "5". This is done by "marking" the
>  * first inner 5 so we can restore the "cursor" to it before joining
>  * with the second outer 5. The access method interface provides
>  * routines to mark and restore to a tuple.

> If only one inner "5" can exist (unique_inner), then isn't the inner
> side already in the correct place, and no restore is required?

Hmm ... let me see whether I have my head wrapped around this correctly.

What I had in mind is a different optimization.  When the inner is not
unique, we can't know we're done joining the first outer "5" until we
reach the inner "6".  At that point, what we normally do is advance the
outer by one row and back up the inner to the mark point (the first inner
"5").  But the only reason to back up is that the new outer row might join
to the same inner rows the previous one did.  If we know the outer side is
unique, then those inner rows can't join to the new outer row so no need
to back up.  So this requires no real change to the mergejoin algorithm,
we just skip the mark and restore steps.

I think what you're saying is that, if we know the inner side is unique,
we can also skip mark/restore overhead; but it would work a bit
differently.  After joining two rows with equal keys, instead of advancing
the inner as per standard algorithm, we'd need to advance the outer side.
(This is OK because we know the next inner can't join to this same outer.
But we don't know if the next outer can join to this inner.)  We advance
inner only when current inner < current outer, so we're done with that
inner and need never rewind.  So this is a more fundamental algorithm
change but it gets the same performance benefit.

So the question is, if we can skip mark/restore overhead when we know that
either input is unique, is it necessary for the planner to account for
both ways explicitly?  Given the general symmetry of mergejoins, it might
be okay for the planner to preferentially generate plans with the unique
input on the inside, and not worry about optimizing in this way when it's
on the outside.

Now, JOIN_SEMI and JOIN_ANTI cases are *not* symmetric, since we don't
implement reverse-semi or reverse-anti join modes.  But I don't know that
there's anything to be won by noticing that the outer side is unique in
those cases.  I think probably we can apply the same technique of
advancing outer not inner, and never rewinding, in those join modes
whether or not either input is known unique.

Another asymmetry is that if one input is likely to be empty, it's better
to put that one on the outside, because if it is empty we don't have to
fetch anything from the other input (thus saving its startup cost).  But
the planner doesn't account for that anyway because it never believes
non-dummy relations are truly empty; so even if it's getting this right
today, it's purely by chance.

I can't think of any other asymmetries offhand.

In short, I think you are right that it might be enough to account for
inner uniqueness only, and not worry about it for the outer side, even
for mergejoin.  This means my previous objection is wrong and we don't
really have to allow for a future extension of that kind while choosing
the notation the planner uses.

So ... would you rather go back to the previous notation (extra
JoinTypes), or do you like the separate boolean better anyway?

Sorry for jerking you back and forth like this, but sometimes the
correct path isn't apparent from the start.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-18 Thread David Rowley
On 19 January 2017 at 11:06, David Rowley  wrote:
> Old patch no longer applies, so I've attached a rebased patch. This
> also re-adds a comment line which I mistakenly removed.

(meanwhile Andres commits 69f4b9c)

I should've waited a bit longer.

Here's another that fixes the new conflicts.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f9fb276..239f78b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1312,6 +1312,26 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (es->verbose)
 		show_plan_tlist(planstate, ancestors, es);
 
+	/* unique join */
+	if (es->verbose || es->format != EXPLAIN_FORMAT_TEXT)
+	{
+		switch (nodeTag(plan))
+		{
+			case T_NestLoop:
+			case T_MergeJoin:
+			case T_HashJoin:
+			{
+const char *value = ((Join *) plan)->inner_unique ? "Yes" : "No";
+ExplainPropertyText("Inner Unique", value, es);
+value =  ((Join *) plan)->outer_unique ? "Yes" : "No";
+ExplainPropertyText("Outer Unique", value, es);
+break;
+			}
+			default:
+break;
+		}
+	}
+
 	/* quals, sort keys, etc */
 	switch (nodeTag(plan))
 	{
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index b41e4e2..5b75b64 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -306,10 +306,10 @@ ExecHashJoin(HashJoinState *node)
 	}
 
 	/*
-	 * In a semijoin, we'll consider returning the first
-	 * match, but after that we're done with this outer tuple.
+	 * Skip to the next outer tuple if we only need one inner
+	 * tuple to match.
 	 */
-	if (node->js.jointype == JOIN_SEMI)
+	if (node->js.match_first_inner_tuple_only)
 		node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
 	if (otherqual == NIL ||
@@ -453,6 +453,14 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->js.ps.state = estate;
 
 	/*
+	 * When the planner was able to determine that the inner side of the join
+	 * will at most contain a single tuple for each outer tuple, then we can
+	 * optimize the join by skipping to the next outer tuple after we find the
+	 * first matching inner tuple.
+	 */
+	hjstate->js.match_first_inner_tuple_only = node->join.inner_unique;
+
+	/*
 	 * Miscellaneous initialization
 	 *
 	 * create expression context for node
@@ -498,8 +506,11 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	/* set up null tuples for outer joins, if needed */
 	switch (node->join.jointype)
 	{
-		case JOIN_INNER:
 		case JOIN_SEMI:
+			/* for semi joins we match to the first inner tuple only */
+			hjstate->js.match_first_inner_tuple_only = true;
+			/* fall through */
+		case JOIN_INNER:
 			break;
 		case JOIN_LEFT:
 		case JOIN_ANTI:
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 2fd1856..378706f 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -840,10 +840,10 @@ ExecMergeJoin(MergeJoinState *node)
 	}
 
 	/*
-	 * In a semijoin, we'll consider returning the first
-	 * match, but after that we're done with this outer tuple.
+	 * Skip to the next outer tuple if we only need one inner
+	 * tuple to match.
 	 */
-	if (node->js.jointype == JOIN_SEMI)
+	if (node->js.match_first_inner_tuple_only)
 		node->mj_JoinState = EXEC_MJ_NEXTOUTER;
 
 	qualResult = (otherqual == NIL ||
@@ -1487,6 +1487,15 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	mergestate->js.ps.state = estate;
 
 	/*
+	 * When the planner was able to determine that the inner or outer side of
+	 * the join will at most contain a single tuple for the opposing side of
+	 * the join, then we can optimize the merge join by skipping to the next
+	 * tuple since we know there are no more to match.
+	 */
+	mergestate->js.match_first_inner_tuple_only = node->join.inner_unique;
+	mergestate->js.match_first_outer_tuple_only = node->join.outer_unique;
+
+	/*
 	 * Miscellaneous initialization
 	 *
 	 * create expression context for node
@@ -1537,7 +1546,8 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	 * only if eflags doesn't specify REWIND.
 	 */
 	if (IsA(innerPlan(node), Material) &&
-		(eflags & EXEC_FLAG_REWIND) == 0)
+		(eflags & EXEC_FLAG_REWIND) == 0 &&
+		!node->join.outer_unique)
 		mergestate->mj_ExtraMarks = true;
 	else
 		mergestate->mj_ExtraMarks = false;
@@ -1553,8 +1563,11 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 
 	switch (node->join.jointype)
 	{
-		case JOIN_INNER:
 		case JOIN_SEMI:
+			/* for semi joins we match to the first inner tuple only */
+			mergestate->js.match_first_inner_tuple_only = true;
+			/* fall through */
+		case JOIN_INNER:
 			mergestate->mj_FillOuter = false;
 			

Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-18 Thread David Rowley
On 3 December 2016 at 10:26, Tom Lane  wrote:
> Robert Haas  writes:
>> On Dec 2, 2016, at 7:47 AM, Haribabu Kommi  wrote:
>>> Patch still applies fine to HEAD.
>>> Moved to next CF with "ready for committer" status.
>
>> Tom, are you picking this up?
>
> Yeah, I apologize for not having gotten to it in this commitfest, but
> it's definitely something I will look at.

Old patch no longer applies, so I've attached a rebased patch. This
also re-adds a comment line which I mistakenly removed.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2017-01-19.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-12-02 Thread Tom Lane
Robert Haas  writes:
> On Dec 2, 2016, at 7:47 AM, Haribabu Kommi  wrote:
>> Patch still applies fine to HEAD.
>> Moved to next CF with "ready for committer" status.

> Tom, are you picking this up?

Yeah, I apologize for not having gotten to it in this commitfest, but
it's definitely something I will look at.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-12-02 Thread Robert Haas
On Dec 2, 2016, at 7:47 AM, Haribabu Kommi  wrote:
>> On Wed, Nov 2, 2016 at 1:21 PM, David Rowley  
>> wrote:
>> On 31 October 2016 at 18:37, David Rowley  
>> wrote:
>> > I've rebased the changes I made to address this back in April to current 
>> > master.
>> 
>> Please note that I went ahead and marked this as "Ready for
>> committer". It was previously marked as such in a previous commitfest.
>> The changes made since the last version were based on feedback from Tom.
>> 
>> If anyone thinks this is not correct then please mark as "Ready for review".
> 
> Patch still applies fine to HEAD.
> Moved to next CF with "ready for committer" status.

Tom, are you picking this up?

...Robert

Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-12-02 Thread Haribabu Kommi
On Wed, Nov 2, 2016 at 1:21 PM, David Rowley 
wrote:

> On 31 October 2016 at 18:37, David Rowley 
> wrote:
> > I've rebased the changes I made to address this back in April to current
> master.
>
> Please note that I went ahead and marked this as "Ready for
> committer". It was previously marked as such in a previous commitfest.
> The changes made since the last version were based on feedback from Tom.
>
> If anyone thinks this is not correct then please mark as "Ready for
> review".
>

Patch still applies fine to HEAD.
Moved to next CF with "ready for committer" status.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-11-01 Thread David Rowley
On 31 October 2016 at 18:37, David Rowley  wrote:
> I've rebased the changes I made to address this back in April to current 
> master.

Please note that I went ahead and marked this as "Ready for
committer". It was previously marked as such in a previous commitfest.
The changes made since the last version were based on feedback from Tom.

If anyone thinks this is not correct then please mark as "Ready for review".

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-10-30 Thread David Rowley
On 8 April 2016 at 06:49, Tom Lane  wrote:
> David Rowley  writes:
> Just had a thought about this, which should have crystallized a long
> time ago perhaps.  Where I'd originally imagined you were going with
> this idea is to do what the thread title actually says, and check for
> joins in which the *outer* side is unique.  I can't see that that's
> of any value for nestloop or hash joins, but for merge joins, knowing
> that the outer side is unique could be extremely valuable because
> we could skip doing mark/restore backups on the inner side, hugely
> reducing the cost when the inner side has many duplicates.  Now I'm
> not asking for the initially-committed version of the patch to do that,
> but I think we need to design it to be readily extensible to do so.

I've rebased the changes I made to address this back in April to current master.

The changes get rid of the JOIN_INNER to JOIN_SEMI conversion
and revert back to the original "inner_unique" marking of the joins.
In addition to this I've also added "outer_unique", which is only made
use of in merge join to control whether the outer side needs to enable
mark and restore or not.

However, having said that, I'm not sure why we'd need outer_unique
available so we'd know that we could skip mark/restore. I think
inner_unique is enough for this purpose. Take the comment from
nodeMergejoin.c:

* outer: (0 ^1 1 2 5 5 5 6 6 7) current tuple: 1
 * inner: (1 ^3 5 5 5 5 6) current tuple: 3
 ...
*
 * Consider the above relations and suppose that the executor has
 * just joined the first outer "5" with the last inner "5". The
 * next step is of course to join the second outer "5" with all
 * the inner "5's". This requires repositioning the inner "cursor"
 * to point at the first inner "5". This is done by "marking" the
 * first inner 5 so we can restore the "cursor" to it before joining
 * with the second outer 5. The access method interface provides
 * routines to mark and restore to a tuple.

If only one inner "5" can exist (unique_inner), then isn't the inner
side already in the correct place, and no restore is required?


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2016-10-31.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-09 Thread David Rowley
On 8 April 2016 at 11:59, Tom Lane  wrote:
> I did some performance testing on the attached somewhat-cleaned-up patch,
> and convinced myself that the planning time penalty is fairly minimal:
> on the order of a couple percent in simple one-join queries, and less
> than that in very large queries.  Oddly, it seems that the result caching
> done in get_optimal_jointype() is only barely worth the trouble in typical
> cases; though if you get a query large enough to require GEQO, it's a win
> because successive GEQO attempts can re-use cache entries made by earlier
> attempts.  (This may indicate something wrong with my testing procedure?
> Seems like diking out the cache should have made more difference.)

It'll really depend on whether you're testing a positive case or a
negative one: a GEQO join search is the only time the non-unique cache
is filled, so if you're testing the negative case that's the only time
you'll get any non-unique cache benefits.
On the other hand, if you were testing with a positive case then I
guess the unique index check is cheaper than we thought. Maybe adding
a couple of other unique indexes before defining the one which proves
the join unique (so that they've got a lower OID) would be enough to
start showing the cache benefits of the unique rel cache.
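
Something along these lines, purely as a sketch of that test setup (the
table and index names are made up):

create table t (a int, b int, c int);
create unique index t_b_uidx on t (b);  -- created before the useful one, lower OID
create unique index t_c_uidx on t (c);  -- likewise
create unique index t_a_uidx on t (a);  -- the index that actually proves the join unique

create table s (a int primary key);

select count(*) from s join t on t.a = s.a;

The extra indexes presumably have to be examined and rejected before
t_a_uidx is found each time, which should make the saving from the
unique rel cache easier to see.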

> I did find by measurement that the negative-cache-entry code produces
> exactly zero hits unless you're in GEQO mode, which is not really
> surprising given the order in which the join search occurs.  So in the
> attached patch I made the code not bother with making negative cache
> entries unless using GEQO, to hopefully save a few nanoseconds.

Yeah, I also noticed this in my testing, and it makes sense given how
the standard join search works. Thanks for making that improvement,
I've not yet looked at the code, but it sounds like a good idea.

> I rebased over f338dd758, did a little bit of code cleanup and fixed some
> bugs in the uniqueness detection logic, but have not reviewed the rest of
> the patch since it's likely all gonna change if we reconsider the JoinType
> representation.
>
> Anyway, I think it would be reasonable to give this patch a few more
> days in view of David's being away through the weekend.  But the RMT
> has final say on that.

Thanks for making the updates and committing the refactor of
analyzejoins.c. The RMT have ruled no extension will be given, so I'll
get this into shape for 9.7 CF1.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-09 Thread David Rowley
On 8 April 2016 at 02:46, Tom Lane  wrote:
> I'm also a bit suspicious of the fact that some of the plans in
> aggregates.out changed from merge to hash joins; with basically
> no stats at hand in those tests, that seems dubious.  A quick look
> at what the patch touched in costsize.c suggests that this might
> be because you've effectively allowed cost_hashjoin to give a cost
> discount for inner unique, but provided no similar intelligence
> in cost_mergejoin.

(catching up)

The only possible reason that the merge join plans have become a hash
join is that hash join is costed more cheaply (same as SEMI JOIN) when
the inner side is unique. The reason I didn't cost merge join
differently is that there seems to be no special costing done there
for SEMI joins already. I might be wrong, but I didn't feel like it
was up to this patch to introduce that; perhaps a patch should
go in before this one to do that.

I understand this is 9.7 stuff now, but I feel like I should tie up
the loose ends before they get forgotten.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-09 Thread Robert Haas
On Thu, Apr 7, 2016 at 7:59 PM, Tom Lane  wrote:
> Anyway, I think it would be reasonable to give this patch a few more
> days in view of David's being away through the weekend.  But the RMT
> has final say on that.

The RMT has considered this request (sorry for the delay) and thought
it had merit, but ultimately voted not to give this patch an
extension.

Robert Haas
PostgreSQL 9.6 Release Management Team


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-07 Thread Tom Lane
I wrote:
> Alvaro Herrera  writes:
>> FWIW the feature freeze rules state that it is allowed for a committer
>> to request an extension to the feature freeze date for individual
>> patches:
>> https://www.postgresql.org/message-id/CA%2BTgmoY56w5FOzeEo%2Bi48qehL%2BBsVTwy-Q1M0xjUhUCwgGW7-Q%40mail.gmail.com
>> It seems to me that the restrictions laid out there are well met for
>> this patch, if you only need a couple of additional days for this patch
>> to get in.

> Hmm ... the changes I'm thinking about here are certainly pretty
> mechanical, if tedious.  The main question mark I have hanging over
> this patch is whether the planning-time penalty is too large --- but
> that's something that can be tested with the patch as it stands.
> Let me go investigate that a bit before requesting an extension.

I did some performance testing on the attached somewhat-cleaned-up patch,
and convinced myself that the planning time penalty is fairly minimal:
on the order of a couple percent in simple one-join queries, and less
than that in very large queries.  Oddly, it seems that the result caching
done in get_optimal_jointype() is only barely worth the trouble in typical
cases; though if you get a query large enough to require GEQO, it's a win
because successive GEQO attempts can re-use cache entries made by earlier
attempts.  (This may indicate something wrong with my testing procedure?
Seems like diking out the cache should have made more difference.)

I did find by measurement that the negative-cache-entry code produces
exactly zero hits unless you're in GEQO mode, which is not really
surprising given the order in which the join search occurs.  So in the
attached patch I made the code not bother with making negative cache
entries unless using GEQO, to hopefully save a few nanoseconds.

I rebased over f338dd758, did a little bit of code cleanup and fixed some
bugs in the uniqueness detection logic, but have not reviewed the rest of
the patch since it's likely all gonna change if we reconsider the JoinType
representation.

Anyway, I think it would be reasonable to give this patch a few more
days in view of David's being away through the weekend.  But the RMT
has final say on that.

regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 50f1261..f9df294 100644
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*** EXPLAIN (VERBOSE, COSTS false)
*** 386,392 
  ---
   Limit
 Output: t1.c1, t2."C 1"
!->  Merge Join
   Output: t1.c1, t2."C 1"
   Merge Cond: (t1.c1 = t2."C 1")
   ->  Foreign Scan on public.ft2 t1
--- 386,392 
  ---
   Limit
 Output: t1.c1, t2."C 1"
!->  Merge Unique Inner Join
   Output: t1.c1, t2."C 1"
   Merge Cond: (t1.c1 = t2."C 1")
   ->  Foreign Scan on public.ft2 t1
*** EXPLAIN (VERBOSE, COSTS false)
*** 419,425 
  ---
   Limit
 Output: t1.c1, t2."C 1"
!->  Merge Left Join
   Output: t1.c1, t2."C 1"
   Merge Cond: (t1.c1 = t2."C 1")
   ->  Foreign Scan on public.ft2 t1
--- 419,425 
  ---
   Limit
 Output: t1.c1, t2."C 1"
!->  Merge Unique Left Join
   Output: t1.c1, t2."C 1"
   Merge Cond: (t1.c1 = t2."C 1")
   ->  Foreign Scan on public.ft2 t1
*** explain (verbose, costs off) select * fr
*** 2520,2526 
where f.f3 = l.f3 COLLATE "POSIX" and l.f1 = 'foo';
   QUERY PLAN  
  -
!  Hash Join
 Output: f.f1, f.f2, f.f3, l.f1, l.f2, l.f3
 Hash Cond: ((f.f3)::text = (l.f3)::text)
 ->  Foreign Scan on public.ft3 f
--- 2520,2526 
where f.f3 = l.f3 COLLATE "POSIX" and l.f1 = 'foo';
   QUERY PLAN  
  -
!  Hash Unique Inner Join
 Output: f.f1, f.f2, f.f3, l.f1, l.f2, l.f3
 Hash Cond: ((f.f3)::text = (l.f3)::text)
 ->  Foreign Scan on public.ft3 f
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index ee0220a..59084c6 100644
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
*** foreign_join_ok(PlannerInfo *root, RelOp
*** 3923,3932 
  	/*
  	 * We support pushing down INNER, LEFT, RIGHT and FULL OUTER joins.
  	 * Constructing queries representing SEMI and ANTI joins is hard, hence
! 	 * not 

Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-07 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> I don't know if you have time to look at this now --- my clock says it's
>> already Friday morning in New Zealand.

> FWIW the feature freeze rules state that it is allowed for a committer
> to request an extension to the feature freeze date for individual
> patches:
> https://www.postgresql.org/message-id/CA%2BTgmoY56w5FOzeEo%2Bi48qehL%2BBsVTwy-Q1M0xjUhUCwgGW7-Q%40mail.gmail.com
> It seems to me that the restrictions laid out there are well met for
> this patch, if you only need a couple of additional days for this patch
> to get in.

Hmm ... the changes I'm thinking about here are certainly pretty
mechanical, if tedious.  The main question mark I have hanging over
this patch is whether the planning-time penalty is too large --- but
that's something that can be tested with the patch as it stands.
Let me go investigate that a bit before requesting an extension.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-07 Thread Alvaro Herrera
Tom Lane wrote:

> Anyway, while refactoring the make_join_rel/add_paths_to_joinrel division
> of labor wouldn't be such a big deal in itself, I don't want to commit a
> change to JoinType only to undo it later; that would be too much churn.
> So I think we need to resolve this question before we can move forward.
> I don't know if you have time to look at this now --- my clock says it's
> already Friday morning in New Zealand.

FWIW the feature freeze rules state that it is allowed for a committer
to request an extension to the feature freeze date for individual
patches:
https://www.postgresql.org/message-id/CA%2BTgmoY56w5FOzeEo%2Bi48qehL%2BBsVTwy-Q1M0xjUhUCwgGW7-Q%40mail.gmail.com
It seems to me that the restrictions laid out there are well met for
this patch, if you only need a couple of additional days for this patch
to get in.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-07 Thread Tom Lane
David Rowley  writes:
> [ unique_joins_2016-04-07.patch ]

Just had a thought about this, which should have crystallized a long
time ago perhaps.  Where I'd originally imagined you were going with
this idea is to do what the thread title actually says, and check for
joins in which the *outer* side is unique.  I can't see that that's
of any value for nestloop or hash joins, but for merge joins, knowing
that the outer side is unique could be extremely valuable because
we could skip doing mark/restore backups on the inner side, hugely
reducing the cost when the inner side has many duplicates.  Now I'm
not asking for the initially-committed version of the patch to do that,
but I think we need to design it to be readily extensible to do so.
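
Roughly, the case in question looks like this (a sketch with hypothetical
tables; o.id is unique on the merge key while i has many duplicates per
key, so the inner side would never need to be rewound to an earlier mark):

CREATE TABLE o (id int PRIMARY KEY);
CREATE TABLE i (o_id int);   -- many rows per o_id

SET enable_hashjoin = off;   -- just to coax a merge join for the example
SET enable_nestloop = off;
EXPLAIN SELECT * FROM o JOIN i ON i.o_id = o.id;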

The problem with this is that it blows the current factorization around
add_paths_to_joinrel out of the water.  What we'd want is for the caller
(make_join_rel) to determine uniqueness on both sides, and pass that info
down to each of its two calls of add_paths_to_joinrel; otherwise we have
to do double the work because each run of add_paths_to_joinrel will have
to make those same two determinations.

This probably also means that encoding the uniqueness into JoinType is
a lost cause.  Expanding JOIN_INNER into four variants depending on
whether either or both sides are known unique, and ditto for JOIN_LEFT,
doesn't seem attractive at all.  I suspect we want to go back to your
original design with a separate bool flag (well, two bools now, but
anyway separate from JoinType).  Or maybe the variant JoinTypes still
are better, since they'd fit into switch() tests more naturally, but
it's a lot more debatable as to whether that notation is a win.

I apologize for leading you down the wrong path on the notational aspect,
but sometimes the right answer isn't clear till you've tried all the
possibilities.

Anyway, while refactoring the make_join_rel/add_paths_to_joinrel division
of labor wouldn't be such a big deal in itself, I don't want to commit a
change to JoinType only to undo it later; that would be too much churn.
So I think we need to resolve this question before we can move forward.
I don't know if you have time to look at this now --- my clock says it's
already Friday morning in New Zealand.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-07 Thread Tom Lane
David Rowley  writes:
> I've attached an updated patch which introduces JOIN_INNER_UNIQUE and
> JOIN_LEFT_UNIQUE. So unique inner joins no longer borrow JOIN_SEMI.

OK.

> In EXPLAIN, I named these new join types "Unique Inner" and "Unique
> Left".

Hm.  I'm back to being unhappy about the amount of churn introduced
into the regression test outputs by this patch.  I wonder whether we
could get away with only mentioning the "unique" aspect in VERBOSE
mode.

I'm also a bit suspicious of the fact that some of the plans in
aggregates.out changed from merge to hash joins; with basically
no stats at hand in those tests, that seems dubious.  A quick look
at what the patch touched in costsize.c suggests that this might
be because you've effectively allowed cost_hashjoin to give a cost
discount for inner unique, but provided no similar intelligence
in cost_mergejoin.

> I know we're close to the feature freeze, so I just want to say that
> I'll be AFK starting 5:30am Friday New Zealand time (13:30 on the 8th
> New York time), until Sunday ~4pm.

Understood.  I don't know if we'll get this in or not, but I'll work
on it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-06 Thread David Rowley
On 7 April 2016 at 08:01, David Rowley  wrote:
> On 7 April 2016 at 04:05, Tom Lane  wrote:
>> Starting to look at this again.  I wonder, now that you have the generic
>> caching mechanism for remembering whether join inner sides have been
>> proven unique, is it still worth having the is_unique_join field in
>> SpecialJoinInfo?  It seems like that's creating a separate code path
>> for special joins vs. inner joins that may not be buying us much.
>> It does potentially save lookups in the unique_rels cache, if you already
>> have the SpecialJoinInfo at hand, but I'm not sure what that's worth.
>
> I quite like that field where it is, as it should make
> remove_useless_joins() a bit more efficient, as after a LEFT JOIN is
> removed, the previous code would go off and try to make sure all the
> joins are unique again, but now we cache that, and save it from having
> to bother doing that again, on joins already marked as unique.
>
> Certainly changing that would mean one less special case in
> joinpath.c, as the JOIN_LEFT case can be handled the same as the other
> cases, although it looks like probably, if I do change that, then I'd
> probably move is_innerrel_unique_for() into analyzejoins.c, and put
> the special case for JOIN_LEFT in that function, so that it calls
> specialjoin_is_unique_join(), then cache the sjinfo->min_righthand in
> the unique_rels cache if the result comes back positive, and in the
> non_unique_rels cache if negative... But it seems a bit crazy to go to
> the trouble of all that caching, when we can just throw the result in
> a struct field in the case of Special Joins.  Maybe we could just hide
> both the new joinpath.c functions in analyzejoins.c and call it quits.
> It's not as if there's no special cases for JOIN_LEFT in that file.
>
>> Also, as I'm looking around at the planner some more, I'm beginning to get
>> uncomfortable with the idea of using JOIN_SEMI this way.  It's fine so far
>> as the executor is concerned, no doubt, but there's a lot of planner
>> expectations about the use of JOIN_SEMI that we'd be violating.  One is
>> that there should be a SpecialJoinInfo for any SEMI join.  Another is that
>> JOIN_SEMI can be implemented by unique-ifying the inner side and then
>> doing a regular inner join; that's not a code path we wish to trigger in
>> these situations.  The patch might avoid tripping over these hazards as it
>> stands, but it seems fragile, and third-party FDWs could easily contain
>> code that'll be broken.  So I'm starting to feel that we'd better invent
>> two new JoinTypes after all, to make sure we can distinguish plain-join-
>> with-inner-side-known-unique from a real SEMI join when we need to.
>>
>> What's your thoughts on these matters?
>
> I was a bit uncomfortable with JOIN_INNER becoming JOIN_SEMI before,
> but that was for EXPLAIN reasons. So I do think it's better to have
> JOIN_INNER_UNIQUE and JOIN_LEFT_UNIQUE instead. I can go make that
> change, but unsure on how EXPLAIN will display these now. I'd need to
> pull out my tests if we don't have anything to show in EXPLAIN, and
> I'd really rather have tests for this.

I've attached an updated patch which introduces JOIN_INNER_UNIQUE and
JOIN_LEFT_UNIQUE. So unique inner joins no longer borrow JOIN_SEMI.

I also made some changes around the is_unique_join flag in
SpecialJoinInfo. I've changed this to become optimal_jointype, which
is initially set to jointype and updated by what used to be called
mark_unique_joins(), now called optimize_outerjoin_types(). The LEFT
JOIN removal code now only bothers with special joins with
optimal_jointype as JOIN_LEFT_UNIQUE. This still saves any double work
analyzing the unique properties of LEFT JOINs. I've moved the two new
functions I put in joinpath.c into analyzejoins.c

In EXPLAIN, I named these new join types "Unique Inner" and "Unique
Left". Going by a comment in explain.c, there's some historic reason
that we display "Hash Join" rather than "Hash Inner Join", and "Nested
Loop", rather than "Nested Loop Join" or "Nested Loop Inner Join". I
wasn't quite sure if Unique Inner Joins should use a similar shorthand
form, so I didn't touch that code. We'll get "Nested Loop Unique Inner
Join" now instead of "Nested Loop". I wasn't able to think of some
equivalent shorthand form that made sense.

I know we're close to the feature freeze, so I just want to say that
I'll be AFK starting 5:30am Friday New Zealand time (13:30 on the 8th
New York time), until Sunday ~4pm. I really hope that if this needs
any tweaking it will be minor. I'll check my email before I leave on
Friday in the hope that if there's anything, then I can quickly fix it
up before I go, but time will be short.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2016-04-07.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-06 Thread David Rowley
On 7 April 2016 at 08:01, David Rowley  wrote:
> On 7 April 2016 at 04:05, Tom Lane  wrote:
>> Starting to look at this again.  I wonder, now that you have the generic
>> caching mechanism for remembering whether join inner sides have been
>> proven unique, is it still worth having the is_unique_join field in
>> SpecialJoinInfo?  It seems like that's creating a separate code path
>> for special joins vs. inner joins that may not be buying us much.
>> It does potentially save lookups in the unique_rels cache, if you already
>> have the SpecialJoinInfo at hand, but I'm not sure what that's worth.
>
> I quite like that field where it is, as it should make
> remove_useless_joins() a bit more efficient, as after a LEFT JOIN is
> removed, the previous code would go off and try to make sure all the
> joins are unique again, but now we cache that, and save it from having
> to bother doing that again, on joins already marked as unique.
>
> Certainly changing that would mean one less special case in
> joinpath.c, as the JOIN_LEFT case can be handled the same as the other
> cases, although it looks like probably, if I do change that, then I'd
> probably move is_innerrel_unique_for() into analyzejoins.c, and put
> the special case for JOIN_LEFT in that function, so that it calls
> specialjoin_is_unique_join(), then cache the sjinfo->min_righthand in
> the unique_rels cache if the result comes back positive, and in the
> non_unique_rels cache if negative... But it seems a bit crazy to go to
> the trouble of all that caching, when we can just throw the result in
> a struct field in the case of Special Joins.  Maybe we could just hide
> both the new joinpath.c functions in analyzejoins.c and call it quits.
> It's not as if there's no special cases for JOIN_LEFT in that file.

We could also get rid of the SpecialJoinInfo.is_unique_join and just
store this as optimal_jointype, where this would be initialised to
jointype in make_outerjoininfo(), and then set in mark_unique_joins().
This would simplify the test in get_optimal_jointype(), perhaps to just:
if (IS_OUTER_JOIN(jointype)) return sjinfo->optimal_jointype;

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-06 Thread David Rowley
On 7 April 2016 at 04:05, Tom Lane  wrote:
> David Rowley  writes:
>> In the last patch I failed to notice that there's an alternative
>> expected results file for one of the regression tests.
>> The attached patch includes the fix to update that file to match the
>> new expected EXPLAIN output.
>
> Starting to look at this again.  I wonder, now that you have the generic
> caching mechanism for remembering whether join inner sides have been
> proven unique, is it still worth having the is_unique_join field in
> SpecialJoinInfo?  It seems like that's creating a separate code path
> for special joins vs. inner joins that may not be buying us much.
> It does potentially save lookups in the unique_rels cache, if you already
> have the SpecialJoinInfo at hand, but I'm not sure what that's worth.

I quite like that field where it is, as it should make
remove_useless_joins() a bit more efficient, as after a LEFT JOIN is
removed, the previous code would go off and try to make sure all the
joins are unique again, but now we cache that, and save it from having
to bother doing that again, on joins already marked as unique.

Certainly changing that would mean one less special case in
joinpath.c, as the JOIN_LEFT case can be handled the same as the other
cases, although it looks like probably, if I do change that, then I'd
probably move is_innerrel_unique_for() into analyzejoins.c, and put
the special case for JOIN_LEFT in that function, so that it calls
specialjoin_is_unique_join(), then cache the sjinfo->min_righthand in
the unique_rels cache if the result comes back positive, and in the
non_unique_rels cache if negative... But it seems a bit crazy to go to
the trouble of all that caching, when we can just throw the result in
a struct field in the case of Special Joins.  Maybe we could just hide
both the new joinpath.c functions in analyzejoins.c and call it quits.
It's not as if there's no special cases for JOIN_LEFT in that file.

> Also, as I'm looking around at the planner some more, I'm beginning to get
> uncomfortable with the idea of using JOIN_SEMI this way.  It's fine so far
> as the executor is concerned, no doubt, but there's a lot of planner
> expectations about the use of JOIN_SEMI that we'd be violating.  One is
> that there should be a SpecialJoinInfo for any SEMI join.  Another is that
> JOIN_SEMI can be implemented by unique-ifying the inner side and then
> doing a regular inner join; that's not a code path we wish to trigger in
> these situations.  The patch might avoid tripping over these hazards as it
> stands, but it seems fragile, and third-party FDWs could easily contain
> code that'll be broken.  So I'm starting to feel that we'd better invent
> two new JoinTypes after all, to make sure we can distinguish plain-join-
> with-inner-side-known-unique from a real SEMI join when we need to.
>
> What's your thoughts on these matters?

I was a bit uncomfortable with JOIN_INNER becoming JOIN_SEMI before,
but that was for EXPLAIN reasons. So I do think it's better to have
JOIN_INNER_UNIQUE and JOIN_LEFT_UNIQUE instead. I can go make that
change, but unsure on how EXPLAIN will display these now. I'd need to
pull out my tests if we don't have anything to show in EXPLAIN, and
I'd really rather have tests for this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-06 Thread Tom Lane
David Rowley  writes:
> In the last patch I failed to notice that there's an alternative
> expected results file for one of the regression tests.
> The attached patch includes the fix to update that file to match the
> new expected EXPLAIN output.

Starting to look at this again.  I wonder, now that you have the generic
caching mechanism for remembering whether join inner sides have been
proven unique, is it still worth having the is_unique_join field in
SpecialJoinInfo?  It seems like that's creating a separate code path
for special joins vs. inner joins that may not be buying us much.
It does potentially save lookups in the unique_rels cache, if you already
have the SpecialJoinInfo at hand, but I'm not sure what that's worth.

Also, as I'm looking around at the planner some more, I'm beginning to get
uncomfortable with the idea of using JOIN_SEMI this way.  It's fine so far
as the executor is concerned, no doubt, but there's a lot of planner
expectations about the use of JOIN_SEMI that we'd be violating.  One is
that there should be a SpecialJoinInfo for any SEMI join.  Another is that
JOIN_SEMI can be implemented by unique-ifying the inner side and then
doing a regular inner join; that's not a code path we wish to trigger in
these situations.  The patch might avoid tripping over these hazards as it
stands, but it seems fragile, and third-party FDWs could easily contain
code that'll be broken.  So I'm starting to feel that we'd better invent
two new JoinTypes after all, to make sure we can distinguish plain-join-
with-inner-side-known-unique from a real SEMI join when we need to.

What's your thoughts on these matters?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-02 Thread David Rowley
On 2 April 2016 at 23:26, David Rowley  wrote:
> I worked on this today to try and get it into shape.

In the last patch I failed to notice that there's an alternative
expected results file for one of the regression tests.

The attached patch includes the fix to update that file to match the
new expected EXPLAIN output.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_57837dc_2016-04-03.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-02 Thread David Rowley
On 2 April 2016 at 05:52, Tom Lane  wrote:
>
> David Rowley  writes:
> > On 12 March 2016 at 11:43, Tom Lane  wrote:
> >> It seems like the major intellectual complexity here is to figure out
> >> how to detect inner-side-unique at reasonable cost.  I see that for
> >> LEFT joins you're caching that in the SpecialJoinInfos, which is probably
> >> fine.  But for INNER joins it looks like you're just doing it over again
> >> for every candidate join, and that seems mighty expensive.
>
> > I have a rough idea for this, but I need to think of it a bit more to
> > make sure it's bulletproof.
>
> Where are we on this?  I had left it alone for awhile because I supposed
> that you were going to post a new patch, but it's been a couple weeks...

Apologies for the delay. I was fully booked.

I worked on this today to try and get it into shape.

Changes:

* Added unique_rels and non_unique_rels cache to PlannerInfo. These
are both arrays of Lists, indexed by relid; the relation for each
relid is either proved to be unique by, or proved not to be unique by,
each listed set of relations. Each List item is a Bitmapset containing
the relids of the relations which proved the uniqueness, or proved no
possibility of uniqueness for the non_unique_rels case.  The
non_unique_rels cache seems far less useful than the unique one, as
with the standard join search we start at level one, comparing
singleton relations, and work our way up, so it's only towards the end
of the search that we'll have more rels on each side of the join
search. Many of the unique cases are found early on in the search, but
the non-unique cases must be rechecked as more relations are added to
the search, as that introduces a new possibility of uniqueness. In
fact, the only cache hit of the non-unique case in the regression
tests is a test that's using the GEQO, so I'm not quite sure if the
standard join search will ever get a cache hit here. Nevertheless I
kept the cache, as it's likely going to be most useful to have it when
the GEQO is running the show, as that's when we're going to see the
largest number of relations to join.

There's also a small quirk which could lead to false negatives for
uniqueness and which might still need ironing out: I don't yet attempt
to get the minimum set of relations which proved the uniqueness. This
perhaps does not matter much for the standard join search, but will
likely matter more for GEQO cases. Fixing this will require changes to
relation_has_unique_index_for() to get it to record and return the
relids of the clauses which matched the index.

* Renamed JOIN_LEFT_UNIQUE to JOIN_SEMI_LEFT. It seems better to
maintain the SEMI part. I thought about renaming JOIN_SEMI to
JOIN_SEMI_INNER, but thought I'd better not touch that here.

* Made a pass to update comments in the areas where I've added
handling for JOIN_SEMI_LEFT.

* I also updated comments in the regression tests to remove reference
to unique joins. "Join conversion" seems to be a better term now.

There were also a couple of places where I wasn't quite sure how to
handle JOIN_SEMI_LEFT. I've marked these with an XXX comment. Perhaps
how to handle them is more clear to you.

I was also not quite sure of the best place to allocate memory for these
caches. I've ended up doing that in setup_simple_rel_arrays(), which
is likely the wrong spot. I originally did this in
standard_join_search(), after join_rel_level is allocated memory, but
I soon realised that it can't happen here as the GEQO requires the
caches too, and naturally it does not come through
standard_join_search().

I was not quite sure if I should update any docs to mention that Inner
Joins can become Semi Joins in some cases.

I was also a bit unsure if I should move the two new functions I added
to joinpath.c into analyzejoins.c.

Thanks for taking an interest in this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_6f34aa1_2016-04-02.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-01 Thread Tom Lane
David Rowley  writes:
> On 12 March 2016 at 11:43, Tom Lane  wrote:
>> It seems like the major intellectual complexity here is to figure out
>> how to detect inner-side-unique at reasonable cost.  I see that for
>> LEFT joins you're caching that in the SpecialJoinInfos, which is probably
>> fine.  But for INNER joins it looks like you're just doing it over again
>> for every candidate join, and that seems mighty expensive.

> I have a rough idea for this, but I need to think of it a bit more to
> make sure it's bulletproof.

Where are we on this?  I had left it alone for awhile because I supposed
that you were going to post a new patch, but it's been a couple weeks...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-13 Thread David Rowley
On 12 March 2016 at 11:43, Tom Lane  wrote:
> I wrote:
>> I wondered why, instead of inventing an extra semantics-modifying flag,
>> we couldn't just change the jointype to *be* JOIN_SEMI when we've
>> discovered that the inner side is unique.
>
> BTW, to clarify: I'm not imagining that we'd make this change in the
> query jointree, as for example prepjointree.c might do.  That would appear
> to add join order constraints, which we don't want.  But I'm thinking that
> at the instant where we form a join Path, we could change the Path's
> jointype to be JOIN_SEMI or JOIN_SEMI_OUTER if we're able to prove the
> inner side unique, rather than annotating the Path with a separate flag.
> Then that representation is what propagates forward.

I've attached a patch which implements this, although I did call the
new join type JOIN_LEFT_UNIQUE rather than JOIN_SEMI_OUTER.
I'm not all that confident that I've not added handling for
JOIN_LEFT_UNIQUE in places where it's not possible to get that join
type; I did leave out a good number of places where I know it's not
possible. I'm also not entirely certain that I've got all
cases correct, but the regression tests pass. The patch is more
intended for assisting discussion than as a ready-to-commit patch.

> It seems like the major intellectual complexity here is to figure out
> how to detect inner-side-unique at reasonable cost.  I see that for
> LEFT joins you're caching that in the SpecialJoinInfos, which is probably
> fine.  But for INNER joins it looks like you're just doing it over again
> for every candidate join, and that seems mighty expensive.

I have a rough idea for this, but I need to think of it a bit more to
make sure it's bulletproof.
I also just noticed (since it's been a while since I wrote the patch)
that in add_paths_to_joinrel() the innerrel can naturally be a
joinrel too, and we can fail to find uniqueness in that joinrel. I
think it should be possible to analyse the join rel too and search for
a base rel which supports the distinctness, and then ensure none of
the other rels which make up the join rel can cause tuple duplication
of that rel. As it stands, though, this only causes missed
optimisation opportunities rather than incorrect results.
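
A hypothetical illustration of the shape of that missed case (the tables
here are not from the patch or its tests):

SELECT *
FROM t1
JOIN (t2 JOIN t3 ON t3.id = t2.id) ON t2.id = t1.id;

Here the inner side of the outer-most join is itself a joinrel; t2.id
being a primary key could prove uniqueness, but only if joining t3
cannot duplicate t2's rows (say, because t3.id is unique too).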

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_82d2a07_2016-03-14.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-12 Thread David G. Johnston
On Saturday, March 12, 2016, David G. Johnston 
wrote:

> On Saturday, March 12, 2016, Tom Lane  wrote:
>
>> "David G. Johnston"  writes:
>> > Don't the semantics of a SEMI JOIN also state that the output columns
>> only
>> > come from the outer relation? i.e., the inner relation doesn't
>> contribute
>> > either rows or columns to the final result?  Or is that simply
>> > an implementation artifact of the fact that the only current way to
>> perform
>> > a semi-join explicitly is via exists/in?
>>
>> I think it's an artifact.  What nodes.h actually says about it is you get
>> the values of one randomly-selected matching inner row, which seems like
>> a fine definition for the purposes we plan to put it to.
>>
>>
> But is it a definition that actually materializes anywhere presently?
>
> I'm not sure what we consider an authoritative source but relational
> algebra does define the results of semi and anti joins as only containing
> rows from main relation.
>
> https://en.m.wikipedia.org/wiki/Relational_algebra
>

Pondering it more, calling these optimizations "semi" joins further
distances us from the meaning of "semi" as used in relational algebra.  The
half that semi refers to IS that only one half of the tables is returned.
That you only get a single row of output regardless of multiple potential
matches is simply a consequence of this and general set theory.

In short "semi" communicates a semantic meaning as to the intended output
of the query irrespective of the data upon which it is executed.  We are now
hijacking the term and calling something "semi" if by some chance the data
the query is operating against happens to be accommodating to some
particular optimization.

This seems wrong on definitional and cleanliness grounds.

So while I'm still liking the idea of introducing specializations of outer
and inner joins I think calling them "semi" joins adds a definitional
inconsistency we are better off avoiding.

This came about because calling something "outer semi join" struck me as
odd.

Something like "outer only join" and "inner only join" comes to mind.
Consider the parallel between this and "index only scan".  Learning that
"only" means "join the outer row to the (at most for outer) one and only
row in the inner relation" doesn't seem too much of a challenge.

David J.


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-12 Thread David G. Johnston
On Saturday, March 12, 2016, Tom Lane  wrote:

> "David G. Johnston" > writes:
> > Don't the semantics of a SEMI JOIN also state that the output columns
> only
> > come from the outer relation? i.e., the inner relation doesn't contribute
> > either rows or columns to the final result?  Or is that simply
> > an implementation artifact of the fact that the only current way to
> perform
> > a semi-join explicitly is via exists/in?
>
> I think it's an artifact.  What nodes.h actually says about it is you get
> the values of one randomly-selected matching inner row, which seems like
> a fine definition for the purposes we plan to put it to.
>
>
But is it a definition that actually materializes anywhere presently?

I'm not sure what we consider an authoritative source but relational
algebra does define the results of semi and anti joins as only containing
rows from main relation.

https://en.m.wikipedia.org/wiki/Relational_algebra

Given that this is largely internals (aside from the plan explanations
themselves) I guess we can punt for now but calling an inner or outer join
a semijoin in this case relies on a non-standard definition of semijoin -
namely that it is an optimized variation of the other joins instead of a
join type in its own right.  This is complicated further in that we
do implement a true semijoin (using exists) while we allow for an anti join
to be non-standard if expressed using "left join ... is null" instead of
via "not exists".
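
For reference, the formulations being contrasted look roughly like this
(hypothetical tables a and b, with b.a_id referencing a.id):

-- a true semi join as the planner produces it today, written with EXISTS
SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.a_id = a.id);

-- an anti join written the standard way ...
SELECT * FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.a_id = a.id);

-- ... and the non-standard "left join ... is null" spelling, which gives
-- the same result when the NULL test is on the join column
SELECT a.* FROM a LEFT JOIN b ON b.a_id = a.id WHERE b.a_id IS NULL;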

Calling these optimizations outer/inner+semi/anti preserves the ability to
distinguish these versions from the standard definitions.  I do like the
idea of it being exposed and encapsulated as a distinct join type instead
of being an attribute.

David J.


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-12 Thread Tom Lane
Robert Haas  writes:
> The new join pushdown code in postgres_fdw does not grok SEMI and ANTI
> joins because there is no straightforward way of reducing those back
> to SQL.  They can originate in multiple ways and not all of those can
> be represented easily.  I think it would be nice to do something to
> fix this.  For example, if a LEFT join WHERE outer_column IS NULL
> turns into an ANTI join, it would be nice if that were marked in some
> way so that postgres_fdw could conveniently emit it in the original
> form.

> Maybe the people who have been working on that patch just haven't been
> creative enough in thinking about how to solve this problem, but it
> makes me greet the idea of more join types that don't map directly
> back to SQL with somewhat mixed feelings.

I can't summon a whole lot of sympathy for that objection.  These cases
won't arise with postgres_fdw as it stands because we'd never be able to
prove uniqueness on a foreign table.  When and if someone tries to improve
that, we can think about how the whole thing ought to map to FDWs.

Having said that, your point does add a bit of weight to David's
suggestion of inventing two new join-type codes rather than overloading
JOIN_SEMI.  I'd still be a bit inclined to display JOIN_INNER_UNIQUE or
whatever we call it as a "Semi" join in EXPLAIN, though, just to minimize
the amount of newness there.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-12 Thread Robert Haas
On Fri, Mar 11, 2016 at 4:32 PM, Tom Lane  wrote:
> So I started re-reading this thread in preparation for looking at the
> patch, and this bit in your initial message jumped out at me:
>
>> In all of our join algorithms in the executor, if the join type is SEMI,
>> we skip to the next outer row once we find a matching inner row. This is
>> because we don't want to allow duplicate rows in the inner side to
>> duplicate outer rows in the result set. Obviously this is required per SQL
>> spec. I believe we can also skip to the next outer row in this case when
>> we've managed to prove that no other row can possibly exist that matches
>> the current outer row, due to a unique index or group by/distinct clause
>> (for subqueries).
>
> I wondered why, instead of inventing an extra semantics-modifying flag,
> we couldn't just change the jointype to *be* JOIN_SEMI when we've
> discovered that the inner side is unique.
>
> Now of course this only works if the join type was INNER to start with.
> If it was a LEFT join, you'd need an "outer semi join" jointype which
> we haven't got at the moment.  But I wonder whether inventing that
> jointype wouldn't let us arrive at a less messy handling of things in
> the executor and EXPLAIN.  I'm not very enamored of plastering this
> "match_first_tuple_only" flag on every join, in part because it doesn't
> appear to have sensible semantics for other jointypes such as JOIN_RIGHT.
> And I'd really be happier to see the information reflected by join type
> than a new line in EXPLAIN, also.

The new join pushdown code in postgres_fdw does not grok SEMI and ANTI
joins because there is no straightforward way of reducing those back
to SQL.  They can originate in multiple ways and not all of those can
be represented easily.  I think it would be nice to do something to
fix this.  For example, if a LEFT join WHERE outer_column IS NULL
turns into an ANTI join, it would be nice if that were marked in some
way so that postgres_fdw could conveniently emit it in the original
form.

Maybe the people who have been working on that patch just haven't been
creative enough in thinking about how to solve this problem, but it
makes me greet the idea of more join types that don't map directly
back to SQL with somewhat mixed feelings.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-12 Thread Tom Lane
David Rowley  writes:
> On 12 March 2016 at 11:43, Tom Lane  wrote:
>>> I wondered why, instead of inventing an extra semantics-modifying flag,
>>> we couldn't just change the jointype to *be* JOIN_SEMI when we've
>>> discovered that the inner side is unique.

> The thing that might matter is that, this;

> explain (costs off) select * from t1 inner join t2 on t1.id=t2.id
>  QUERY PLAN
> --
>  Hash Join
>Hash Cond: (t1.id = t2.id)
>->  Seq Scan on t1
>->  Hash
>          ->  Seq Scan on t2

> could become;

>   QUERY PLAN
> --
>  Hash Semi Join
>Hash Cond: (t1.id = t2.id)
>->  Seq Scan on t1
>->  Hash
>          ->  Seq Scan on t2

> Wouldn't that cause quite a bit of confusion?

Well, no more than was introduced when we invented semi joins at all.

> Now, we could get around that by
> adding JOIN_SEMI_INNER I guess, and just displaying that as a normal
> inner join, yet it'll behave exactly like JOIN_SEMI!

I'm not that thrilled with having EXPLAIN hide real differences in the
plan from you; if I was, I'd have just lobbied to drop the "unique inner"
annotation from EXPLAIN output altogether.

(I think at one point we'd discussed displaying this in EXPLAIN output
as a different join type, and I'd been against it at the time.  What
changed my thinking was realizing that it could be mapped on to the
existing jointype "semi join".  We still need one new concept,
"outer semi join" or whatever we decide to call it, but it's less of
a stretch than I'd supposed originally.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-12 Thread Tom Lane
"David G. Johnston"  writes:
> Don't the semantics of a SEMI JOIN also state that the output columns only
> come from the outer relation? i.e., the inner relation doesn't contribute
> either rows or columns to the final result?  Or is that simply
> an implementation artifact of the fact that the only current way to perform
> a semi-join explicitly is via exists/in?

I think it's an artifact.  What nodes.h actually says about it is you get
the values of one randomly-selected matching inner row, which seems like
a fine definition for the purposes we plan to put it to.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-12 Thread David G. Johnston
On Saturday, March 12, 2016, David Rowley 
wrote:

> On 12 March 2016 at 11:43, Tom Lane  wrote:
> > I wrote:
> >> I wondered why, instead of inventing an extra semantics-modifying flag,
> >> we couldn't just change the jointype to *be* JOIN_SEMI when we've
> >> discovered that the inner side is unique.
> >
> > BTW, to clarify: I'm not imagining that we'd make this change in the
> > query jointree, as for example prepjointree.c might do.  That would
> appear
> > to add join order constraints, which we don't want.  But I'm thinking
> that
> > at the instant where we form a join Path, we could change the Path's
> > jointype to be JOIN_SEMI or JOIN_SEMI_OUTER if we're able to prove the
> > inner side unique, rather than annotating the Path with a separate flag.
> > Then that representation is what propagates forward.
>
> Thanks for looking at this.
>
> Yes that might work, since we'd just be changing the jointype in the
> JoinPath, if that path was discarded if favour of, say the commutative
> variant, which was not "unique inner", then it shouldn't matter, as
> the join type for that path would be the original one.
>
> The thing that might matter is that, this;
>
>
> explain (costs off) select * from t1 inner join t2 on t1.id=t2.id
>  QUERY PLAN
> --
>  Hash Join
>Hash Cond: (t1.id = t2.id)
>->  Seq Scan on t1
>->  Hash
>          ->  Seq Scan on t2
>
> could become;
>
>   QUERY PLAN
> --
>  Hash Semi Join
>Hash Cond: (t1.id = t2.id)
>->  Seq Scan on t1
>->  Hash
>          ->  Seq Scan on t2
>
> Wouldn't that cause quite a bit of confusion? People browsing EXPLAIN
> output might be a bit confused at the lack of EXISTS/IN clause in a
> query which is showing a Semi Join. Now, we could get around that by
> adding JOIN_SEMI_INNER I guess, and just displaying that as a normal
> inner join, yet it'll behave exactly like JOIN_SEMI!
>
>
Don't the semantics of a SEMI JOIN also state that the output columns only
come from the outer relation? i.e., the inner relation doesn't contribute
either rows or columns to the final result?  Or is that simply
an implementation artifact of the fact that the only current way to perform
a semi-join explicitly is via exists/in?

David J.


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-12 Thread David Rowley
On 12 March 2016 at 11:43, Tom Lane  wrote:
> I wrote:
>> I wondered why, instead of inventing an extra semantics-modifying flag,
>> we couldn't just change the jointype to *be* JOIN_SEMI when we've
>> discovered that the inner side is unique.
>
> BTW, to clarify: I'm not imagining that we'd make this change in the
> query jointree, as for example prepjointree.c might do.  That would appear
> to add join order constraints, which we don't want.  But I'm thinking that
> at the instant where we form a join Path, we could change the Path's
> jointype to be JOIN_SEMI or JOIN_SEMI_OUTER if we're able to prove the
> inner side unique, rather than annotating the Path with a separate flag.
> Then that representation is what propagates forward.

Thanks for looking at this.

Yes that might work, since we'd just be changing the jointype in the
JoinPath; if that path was discarded in favour of, say, the commutative
variant, which was not "unique inner", then it shouldn't matter, as
the join type for that path would be the original one.

The thing that might matter is that this:


explain (costs off) select * from t1 inner join t2 on t1.id=t2.id
         QUERY PLAN
-----------------------------
 Hash Join
   Hash Cond: (t1.id = t2.id)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2

could become:

         QUERY PLAN
-----------------------------
 Hash Semi Join
   Hash Cond: (t1.id = t2.id)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2

Wouldn't that cause quite a bit of confusion? People browsing EXPLAIN
output might be a bit confused at the lack of EXISTS/IN clause in a
query which is showing a Semi Join. Now, we could get around that by
adding JOIN_SEMI_INNER I guess, and just displaying that as a normal
inner join, yet it'll behave exactly like JOIN_SEMI!

And if we do that, then the join node code changes from:

if (node->js.match_first_tuple_only)
  node->nl_NeedNewOuter = true;

to:

if (node->js.jointype == JOIN_SEMI ||
    node->js.jointype == JOIN_SEMI_INNER ||
    node->js.jointype == JOIN_SEMI_LEFT)
  node->nl_NeedNewOuter = true;

which is kinda horrid and adds a few cycles and code without a real
need for it. If we kept the match_first_tuple_only and set it in the
node startup similar to how it is now, or changed JoinType to be some
bitmask that had a bit for each piece of a Venn diagram, and an
extra bit for the unique inner... but I'm not so sure I like that idea...

To me it seems neater just to keep the match_first_tuple_only and just
update the planner stuff to use the new jointypes.

Thoughts?

>
> It seems like the major intellectual complexity here is to figure out
> how to detect inner-side-unique at reasonable cost.  I see that for
> LEFT joins you're caching that in the SpecialJoinInfos, which is probably
> fine.  But for INNER joins it looks like you're just doing it over again
> for every candidate join, and that seems mighty expensive.

hmm yeah, perhaps something can be done to cache that, I'm just not
all that sure how much reuse the cache will get, since it
generates the Paths for nestloop/merge/hash join methods all at once,
once the unique_inner has been determined for that inner rel, and the
restrict info for whatever the outer rel(s) are. I didn't expect that
to happen twice, so I'm not all that sure if a cache will get
reused...

... thinks more ...

Perhaps if rel x was found to be unique with the join conditions
of rel y, then we can be sure that x is unique against y and z, as that
could only possibly add more to the join condition, never less. Is
this what you meant?

... I'll look into that.
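
A sketch of that idea, with hypothetical tables where t2.id is a primary
key: proving t2 unique for the join against t1 alone only needs the clause
t1.id = t2.id, and joining t3 as well can only add further clauses (here
t3.ref = t2.id), never remove one, so t2 still matches at most one row per
outer row.

SELECT *
FROM t1
JOIN t2 ON t2.id = t1.id
JOIN t3 ON t3.ref = t2.id;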

The other thing I thought of was to add a dedicated list for unique
indexes in RelOptInfo; this would also allow
rel_supports_distinctness() to do something a bit smarter than just
return false if there are no indexes. That might not buy us much,
though at least relations tend to have very few unique indexes, even
when they have lots of indexes.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-11 Thread Tom Lane
I wrote:
> I wondered why, instead of inventing an extra semantics-modifying flag,
> we couldn't just change the jointype to *be* JOIN_SEMI when we've
> discovered that the inner side is unique.

BTW, to clarify: I'm not imagining that we'd make this change in the
query jointree, as for example prepjointree.c might do.  That would appear
to add join order constraints, which we don't want.  But I'm thinking that
at the instant where we form a join Path, we could change the Path's
jointype to be JOIN_SEMI or JOIN_SEMI_OUTER if we're able to prove the
inner side unique, rather than annotating the Path with a separate flag.
Then that representation is what propagates forward.

It seems like the major intellectual complexity here is to figure out
how to detect inner-side-unique at reasonable cost.  I see that for
LEFT joins you're caching that in the SpecialJoinInfos, which is probably
fine.  But for INNER joins it looks like you're just doing it over again
for every candidate join, and that seems mighty expensive.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-11 Thread Tom Lane
So I started re-reading this thread in preparation for looking at the
patch, and this bit in your initial message jumped out at me:

> In all of our join algorithms in the executor, if the join type is SEMI,
> we skip to the next outer row once we find a matching inner row. This is
> because we don't want to allow duplicate rows in the inner side to
> duplicate outer rows in the result set. Obviously this is required per SQL
> spec. I believe we can also skip to the next outer row in this case when
> we've managed to prove that no other row can possibly exist that matches
> the current outer row, due to a unique index or group by/distinct clause
> (for subqueries).

I wondered why, instead of inventing an extra semantics-modifying flag,
we couldn't just change the jointype to *be* JOIN_SEMI when we've
discovered that the inner side is unique.

Now of course this only works if the join type was INNER to start with.
If it was a LEFT join, you'd need an "outer semi join" jointype which
we haven't got at the moment.  But I wonder whether inventing that
jointype wouldn't let us arrive at a less messy handling of things in
the executor and EXPLAIN.  I'm not very enamored of plastering this
"match_first_tuple_only" flag on every join, in part because it doesn't
appear to have sensible semantics for other jointypes such as JOIN_RIGHT.
And I'd really be happier to see the information reflected by join type
than a new line in EXPLAIN, also.
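
For concreteness, a sketch of the two situations (hypothetical tables;
t2.id is unique, so each t1 row matches at most one t2 row):

CREATE TABLE t1 (id int);
CREATE TABLE t2 (id int PRIMARY KEY, payload text);

-- inner join with a provably unique inner side: acts like a semi join
-- that also returns t2's columns
SELECT * FROM t1 JOIN t2 ON t2.id = t1.id;

-- left join with a provably unique inner side: the "outer semi join" case
SELECT * FROM t1 LEFT JOIN t2 ON t2.id = t1.id;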

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-08 Thread Tom Lane
David Rowley  writes:
> [1] http://www.postgresql.org/message-id/8907.1440383...@sss.pgh.pa.us

Oh, okay, I had looked at the many changes in the regression outputs and
jumped to the conclusion that you were printing the info all the time.
Looking closer I see it's only coming out in VERBOSE mode.  That's
probably fine.  Never mind.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-08 Thread David Rowley
On 9 March 2016 at 13:19, Tom Lane  wrote:
> I do think that the verbosity this adds to the EXPLAIN output is not
> desirable at all, at least not in text mode.  Another line for every
> darn join is a pretty high price.

For me it seems like a good idea to give some sort of indication in
EXPLAIN about this, although I'm not wedded to what I have currently
as being the best and neatest way to do this.
I was however under the impression that you thought this was
reasonable [1], so I didn't go seeking out any other way. Did you have
another idea, or are you just proposing to remove it for text EXPLAIN?
Perhaps, if you are, then the tests can still exist with a non-text
output.

Also in [1], I disagree with your proposal for being inconsistent with
what's shown with VERBOSE depending on the EXPLAIN output format. If
we were going to do that then I'd rather just switch on verbose for
non-text, but that might make some people unhappy, so I'd rather we
didn't do that either. So how about I remove it from the TEXT output
and just include it in non-text formats when the VERBOSE flag is set?
then change the tests to use... something other than text... YAML
seems most compact.

[1] http://www.postgresql.org/message-id/8907.1440383...@sss.pgh.pa.us

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-08 Thread Tom Lane
David Rowley  writes:
> I also notice that some regression tests, some of which I think
> Tom updated in the upper planner changes, have now changed back again
> due to the slightly reduced costs on hash and nested loop joins where
> the inner side is unique.

?? I don't see anything in this patch that touches the same queries
that changed plans in my previous patch.

I do think that the verbosity this adds to the EXPLAIN output is not
desirable at all, at least not in text mode.  Another line for every
darn join is a pretty high price.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-08 Thread David Rowley
On 23 January 2016 at 05:36, Tomas Vondra  wrote:
> Hi,
>
> On 12/17/2015 02:17 PM, David Rowley wrote:
>>
>> On 17 December 2015 at 19:11, Simon Riggs  wrote:
>>
>> On 17 December 2015 at 00:17, Tomas Vondra  wrote:
>>
>> I'd go with match_first_tuple_only.
>>
>>
>> +1
>>
>> unique_inner is a state that has been detected,
>> match_first_tuple_only is the action we take as a result.
>>
>>
>> Ok great. I've made it so in the attached. This means the comment in the
>> join code where we perform the skip can be a bit less verbose and all
>> the details can go in where we're actually setting the
>> match_first_tuple_only to true.
>
>
> OK. I've looked at the patch again today, and it seems broken by 45be99f8 as
> the partial paths were not passing the unique_inner to the create_*_path()
> functions. The attached patch should fix that.
>
> Otherwise I think the patch is ready for committer - I think this is a
> valuable optimization and I see nothing wrong with the code.

I've attached an updated patch which fixes the conflicts
with the recent upper planner changes.
I also notice that some regression tests, some of which I think
Tom updated in the upper planner changes, have now changed back again
due to the slightly reduced costs on hash and nested loop joins where
the inner side is unique. I checked the costs of one of these by
disabling hash join and noticed that the final total cost is the same,
so it's not too surprising that they keep switching plans with these
planner changes going in. I verified that these remain as is when I
comment out the cost changing code in this patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_dbcecda_2016-03-09.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-05 Thread David Rowley
On 5 March 2016 at 10:43, Alvaro Herrera  wrote:
> I wonder why do we have two identical copies of clause_sides_match_join ...

Yeah, I noticed the same a while back, and posted about it. Here was
the response:
http://www.postgresql.org/message-id/26820.1405522...@sss.pgh.pa.us


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-03-04 Thread Alvaro Herrera
I wonder why do we have two identical copies of clause_sides_match_join ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-02-18 Thread David Rowley
On 23/01/2016 12:42 am, "David Rowley"  wrote:
>
> On 23 January 2016 at 05:36, Tomas Vondra  wrote:
> > Otherwise I think the patch is ready for committer - I think this is a
> > valuable optimization and I see nothing wrong with the code.
> >
> >
> > Any objections to marking it accordingly?

I've just set this patch back to ready for committer. It was previously
marked as such but lost that status when it was moved to the March commitfest.


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-01-22 Thread David Rowley
On 23 January 2016 at 05:36, Tomas Vondra  wrote:
> OK. I've looked at the patch again today, and it seems broken by 45be99f8 as
> the partial paths were not passing the unique_inner to the create_*_path()
> functions. The attached patch should fix that.
>

Thanks for looking at this again, and thanks for fixing that problem.

I've attached an updated patch which incorporates your change, and
also another 2 small tweaks that I made after making another pass over
this myself.

> Otherwise I think the patch is ready for committer - I think this is a
> valuable optimization and I see nothing wrong with the code.
>
>
> Any objections to marking it accordingly?

None from me. I think it's had plenty of review time over the last year.

The only other thing I can possibly think of is to apply most of the
analyzejoins.c changes as a refactor patch first. If the committer wants
that, I can separate these out, but I'll hold off for a response
before doing that.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_ba5b9cb_2016-01-23.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-01-22 Thread Tomas Vondra

Hi,

On 12/17/2015 02:17 PM, David Rowley wrote:

On 17 December 2015 at 19:11, Simon Riggs  wrote:

On 17 December 2015 at 00:17, Tomas Vondra  wrote:

I'd go with match_first_tuple_only.


+1

unique_inner is a state that has been detected,
match_first_tuple_only is the action we take as a result.


Ok great. I've made it so in the attached. This means the comment in the
join code where we perform the skip can be a bit less verbose and all
the details can go in where we're actually setting the
match_first_tuple_only to true.


OK. I've looked at the patch again today, and it seems broken by
45be99f8 as the partial paths were not passing the unique_inner to the 
create_*_path() functions. The attached patch should fix that.


Otherwise I think the patch is ready for committer - I think this is a 
valuable optimization and I see nothing wrong with the code.



Any objections to marking it accordingly?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index be410d2..1ec2ddc 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -411,7 +411,7 @@ try_partial_nestloop_path(PlannerInfo *root,
 	 * Before creating a path, get a quick lower bound on what it is likely
 	 * to cost.  Bail out right away if it looks terrible.
 	 */
-	initial_cost_nestloop(root, &workspace, jointype,
+	initial_cost_nestloop(root, &workspace, jointype, extra->unique_inner,
 						  outer_path, inner_path,
 						  extra->sjinfo, &extra->semifactors);
 	if (!add_partial_path_precheck(joinrel, workspace.total_cost, pathkeys))
@@ -422,6 +422,7 @@ try_partial_nestloop_path(PlannerInfo *root,
 			 create_nestloop_path(root,
 								  joinrel,
 								  jointype,
+								  extra->unique_inner,
 								  &workspace,
 								  extra->sjinfo,
 								  &extra->semifactors,
@@ -622,6 +623,7 @@ try_partial_hashjoin_path(PlannerInfo *root,
 			 create_hashjoin_path(root,
 								  joinrel,
 								  jointype,
+								  extra->unique_inner,
 								  &workspace,
 								  extra->sjinfo,
 								  &extra->semifactors,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-12-23 Thread Michael Paquier
On Thu, Dec 17, 2015 at 10:17 PM, David Rowley
 wrote:
> On 17 December 2015 at 19:11, Simon Riggs  wrote:
>>
>> On 17 December 2015 at 00:17, Tomas Vondra 
>> wrote:
>>>
>>> I'd go with match_first_tuple_only.
>>
>>
>> +1
>>
>> unique_inner is a state that has been detected, match_first_tuple_only is
>> the action we take as a result.
>>
>
> Ok great. I've made it so in the attached. This means the comment in the
> join code where we perform the skip can be a bit less verbose and all the
> details can go in where we're actually setting the match_first_tuple_only to
> true.

Patch moved to next CF because of a lack of reviews on the new patch,
and because the last patch has been posted not so long ago.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-12-17 Thread David Rowley
On 17 December 2015 at 19:11, Simon Riggs  wrote:

> On 17 December 2015 at 00:17, Tomas Vondra 
> wrote:
>
>> I'd go with match_first_tuple_only.
>
>
> +1
>
> unique_inner is a state that has been detected, match_first_tuple_only is
> the action we take as a result.
>
>
Ok great. I've made it so in the attached. This means the comment in the
join code where we perform the skip can be a bit less verbose and all the
details can go in where we're actually setting the match_first_tuple_only
to true.
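
For anyone skimming the thread, the skip site now reads roughly like this
(a sketch of the intent, not the exact hunk from the attached patch):

    /*
     * Sketch only: with the flag named for the action taken, the same
     * test covers both semijoins and joins whose inner side the planner
     * has proven unique -- one match is all we need for this outer tuple.
     */
    if (node->js.match_first_tuple_only)
        node->hj_JoinState = HJ_NEED_NEW_OUTER;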

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_18-12-2015_843fb71.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-12-16 Thread David Rowley
On 17 December 2015 at 05:02, Tomas Vondra 
wrote:

> 0) I know the patch does not tweak costing - any plans in this
>    direction? Would it be possible to simply use the costing used by
>    semijoin?
>
>
Many thanks for looking at this.

The patch does tweak the costings so that unique joins are costed in the
same way as semi joins. It's a very small change, so easy to miss.

For example, see the change in initial_cost_nestloop()

-   if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
+   if (jointype == JOIN_SEMI ||
+   jointype == JOIN_ANTI ||
+   unique_inner)
{

Also see the changes in final_cost_nestloop() and final_cost_hashjoin()
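
To make the effect concrete, the shared SEMI/ANTI branch in
final_cost_nestloop() only charges a fraction of the inner scan per outer
tuple. Paraphrasing from memory (a sketch, not the exact patch hunk, with
the surrounding variable handling omitted; semifactors is the
SemiAntiJoinFactors pointer passed down to the costing function):

    /* Paraphrased sketch of the costing branch that unique_inner now
     * shares: assume the inner scan stops at the first match for each
     * outer tuple. */
    outer_matched_rows = rint(outer_path_rows * semifactors->outer_match_frac);
    inner_scan_frac = 2.0 / (semifactors->match_count + 1.0);
    ntuples = outer_matched_rows * inner_path_rows * inner_scan_frac;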


1) nodeHashjoin.c (and other join nodes)
>
> I've noticed we have this in the ExecHashJoin() method:
>
>  /*
>   * When the inner side is unique or we're performing a
>   * semijoin, we'll consider returning the first match, but
>   * after that we're done with this outer tuple.
>   */
>  if (node->js.unique_inner)
>  node->hj_JoinState = HJ_NEED_NEW_OUTER;
>
> That seems a bit awkward because the comment speaks about unique joins
> *OR* semijoins, but the check only references unique_inner. That of course
> works because we do this in ExecInitHashJoin():
>
> switch (node->join.jointype)
> {
> case JOIN_SEMI:
> hjstate->js.unique_inner = true;
> /* fall through */
>
> Either some semijoins may not be unique - in that case the naming is
> misleading at best, and we either need to find a better name or simply
> check two conditions like this:
>
>  if (node->js.unique_inner || node->join.type == JOIN_SEMI)
> node->hj_JoinState = HJ_NEED_NEW_OUTER;
>
> Or all semijoins are unique joins, and in that case the comment may need
> rephrasing. But more importantly, it begs the question why we're detecting
> this in the executor and not in the planner? Because if we detect it in
> executor, we can't use this bit in costing, for example.
>
>
The reason not to detect that in the planner is simply that unique_join is
meant to mean that the relation on the inner side of the join will at most
contain a single Tuple which matches each outer relation tuple, based on
the join condition. I've added no detection for this in semi joins, and I'd
rather not go blindly setting the flag to true in the planner as it simply
may not be true for the semi join. At the moment that might not matter as
we're only using the unique_join flag as an optimization in the join nodes,
but I'd prefer not to do this as its likely we'll want to do more with this
flag later, and I'd rather we keep the meaning well defined. You might
argue that this is also true for semi joins, but down the road somewhere we
may want to perform some checks on the inner relation before the join takes
place, and in that case the Tuples of the relation might not have the same
properties we claim they do.

But you're right that reusing the flag in the join nodes is not ideal, and
the comment is not that great either. I'd really rather go down the path of
either renaming the variable, or explaining this better in the comment. It
seems unnecessary to check both for each tuple being joined. I'd imagine
that might add a little overhead to joins which are not unique.

How about changing the comment to:

/*
* In the event that the inner side has been detected to be
* unique, as an optimization we can skip searching for any
* subsequent matching inner tuples, as none will exist.
* For semijoins unique_inner will always be true, although
* in this case we don't match another inner tuple as this
* is the required semi join behavior.
*/

Alternatively or additionally we can rename the variable in the executor
state, although I've not thought of a name yet that I don't find overly
verbose: unique_inner_or_semi_join,  match_first_tuple_only.


2) analyzejoins.c
>
> I see that this code in join_is_removable()
>
> innerrel = find_base_rel(root, innerrelid);
>
> if (innerrel->reloptkind != RELOPT_BASEREL)
> return false;
>
> was rewritten like this:
>
> innerrel = find_base_rel(root, innerrelid);
>
> Assert(innerrel->reloptkind == RELOPT_BASEREL);
>
> That suggests that previously the function expected cases where reloptkind
> may not be RELOPT_BASEREL, but now it'll error out in such cases. I
> haven't noticed any changes in the surrounding code that'd guarantee this
> won't happen, but I also haven't been able to come up with an example
> triggering the assert (haven't been trying too hard). How do we know the
> assert() makes sense?
>
>
I'd have changed this as this should be covered by the
if (!sjinfo->is_unique_join || ...) check a few lines up. mark_unique_joins() only sets
is_unique_join to true if specialjoin_is_unique_join() returns true, and
that function tests reloptkind to ensure it's set to RELOPT_BASEREL and
returns false if the relation is not. Perhaps what is missing 

Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-12-16 Thread Simon Riggs
On 17 December 2015 at 00:17, Tomas Vondra 
wrote:


> 1) nodeHashjoin.c (and other join nodes)
>>
>> I've noticed we have this in the ExecHashJoin() method:
>>
>>   /*
>>* When the inner side is unique or we're performing a
>>* semijoin, we'll consider returning the first match, but
>>* after that we're done with this outer tuple.
>>*/
>>   if (node->js.unique_inner)
>>   node->hj_JoinState = HJ_NEED_NEW_OUTER;
>>
>> That seems a bit awkward because the comment speaks about unique
>> joins *OR* semijoins, but the check only references unique_inner.
>> That of course works because we do this in ExecInitHashJoin():
>>
>>  switch (node->join.jointype)
>>  {
>>  case JOIN_SEMI:
>>  hjstate->js.unique_inner = true;
>>  /* fall through */
>>
>> Either some semijoins may not be unique - in that case the naming is
>> misleading at best, and we either need to find a better name or
>> simply check two conditions like this:
>>
>>   if (node->js.unique_inner || node->join.type == JOIN_SEMI)
>>  node->hj_JoinState = HJ_NEED_NEW_OUTER;
>>
>> Or all semijoins are unique joins, and in that case the comment may
>> need rephrasing. But more importantly, it begs the question why
>> we're detecting this in the executor and not in the planner? Because
>> if we detect it in executor, we can't use this bit in costing, for
>> example.
>>
>>
>> The reason not to detect that in the planner is simply that unique_join
>> is meant to mean that the relation on the inner side of the join will at
>> most contain a single Tuple which matches each outer relation tuple,
>> based on the join condition. I've added no detection for this in semi
>> joins, and I'd rather not go blindly setting the flag to true in the
>> planner as it simply may not be true for the semi join. At the moment
>> that might not matter as we're only using the unique_join flag as an
>> optimization in the join nodes, but I'd prefer not to do this as its
>> likely we'll want to do more with this flag later, and I'd rather we
>> keep the meaning well defined. You might argue that this is also true
>> for semi joins, but if down the road somewhere we want to perform some
>> checks on the inner relation before the join takes place, and in that
>> case the Tuples of the relation might not have the same properties we
>> claim they do.
>>
>> But you're right that reusing the flag in the join nodes is not ideal,
>> and the comment is not that great either. I'd really rather go down the
>> path of either renaming the variable, or explaining this better in the
>> comment. It seems unnecessary to check both for each tuple being joined.
>> I'd imagine that might add a little overhead to joins which are not
>> unique.
>>
>
> I'd be very surprised it that had any measurable impact.
>
> How about changing the comment to:
>>
>> /*
>> * In the event that the inner side has been detected to be
>> * unique, as an optimization we can skip searching for any
>> * subsequent matching inner tuples, as none will exist.
>> * For semijoins unique_inner will always be true, although
>> * in this case we don't match another inner tuple as this
>> * is the required semi join behavior.
>> */
>>
>> Alternatively or additionally we can rename the variable in the executor
>> state, although I've not thought of a name yet that I don't find overly
>> verbose: unique_inner_or_semi_join,  match_first_tuple_only.
>>
>
> I'd go with match_first_tuple_only.


+1

unique_inner is a state that has been detected, match_first_tuple_only is
the action we take as a result.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-12-16 Thread Tomas Vondra

Hi,

On 12/16/2015 11:40 PM, David Rowley wrote:

On 17 December 2015 at 05:02, Tomas Vondra > wrote:

0) I know the patch does not tweak costing - any plans in this

direction? Would it be possible to simply use the costing used by
semijoin?


Many thanks for looking at this.

The patch does tweak the costings so that unique joins are costed in the
same way as semi joins. It's a very small change, so easy to miss.


Thanks. I missed that bit somehow.



For example, see the change in initial_cost_nestloop()

-   if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
+   if (jointype == JOIN_SEMI ||
+   jointype == JOIN_ANTI ||
+   unique_inner)
 {

Also see the changes in final_cost_nestloop() and final_cost_hashjoin()


1) nodeHashjoin.c (and other join nodes)

I've noticed we have this in the ExecHashJoin() method:

  /*
   * When the inner side is unique or we're performing a
   * semijoin, we'll consider returning the first match, but
   * after that we're done with this outer tuple.
   */
  if (node->js.unique_inner)
  node->hj_JoinState = HJ_NEED_NEW_OUTER;

That seems a bit awkward because the comment speaks about unique
joins *OR* semijoins, but the check only references unique_inner.
That of course works because we do this in ExecInitHashJoin():

 switch (node->join.jointype)
 {
 case JOIN_SEMI:
 hjstate->js.unique_inner = true;
 /* fall through */

Either some semijoins may not be unique - in that case the naming is
misleading at best, and we either need to find a better name or
simply check two conditions like this:

  if (node->js.unique_inner || node->join.type == JOIN_SEMI)
 node->hj_JoinState = HJ_NEED_NEW_OUTER;

Or all semijoins are unique joins, and in that case the comment may
need rephrasing. But more importantly, it begs the question why
we're detecting this in the executor and not in the planner? Because
if we detect it in executor, we can't use this bit in costing, for
example.


The reason not to detect that in the planner is simply that unique_join
is meant to mean that the relation on the inner side of the join will at
most contain a single Tuple which matches each outer relation tuple,
based on the join condition. I've added no detection for this in semi
joins, and I'd rather not go blindly setting the flag to true in the
planner as it simply may not be true for the semi join. At the moment
that might not matter as we're only using the unique_join flag as an
optimization in the join nodes, but I'd prefer not to do this as its
likely we'll want to do more with this flag later, and I'd rather we
keep the meaning well defined. You might argue that this is also true
for semi joins, but if down the road somewhere we want to perform some
checks on the inner relation before the join takes place, and in that
case the Tuples of the relation might not have the same properties we
claim they do.

But you're right that reusing the flag in the join nodes is not ideal,
and the comment is not that great either. I'd really rather go down the
path of either renaming the variable, or explaining this better in the
comment. It seems unnecessary to check both for each tuple being joined.
I'd imagine that might add a little overhead to joins which are not unique.


I'd be very surprised it that had any measurable impact.


How about changing the comment to:

/*
* In the event that the inner side has been detected to be
* unique, as an optimization we can skip searching for any
* subsequent matching inner tuples, as none will exist.
* For semijoins unique_inner will always be true, although
* in this case we don't match another inner tuple as this
* is the required semi join behavior.
*/

Alternatively or additionally we can rename the variable in the executor
state, although I've not thought of a name yet that I don't find overly
verbose: unique_inner_or_semi_join,  match_first_tuple_only.


I'd go with match_first_tuple_only.




2) analyzejoins.c

I see that this code in join_is_removable()

 innerrel = find_base_rel(root, innerrelid);

 if (innerrel->reloptkind != RELOPT_BASEREL)
 return false;

was rewritten like this:

 innerrel = find_base_rel(root, innerrelid);

 Assert(innerrel->reloptkind == RELOPT_BASEREL);

That suggests that previously the function expected cases where
reloptkind may not be RELOPT_BASEREL, but now it'll error out in
such cases. I haven't noticed any changes in the surrounding code
that'd guarantee this won't happen, but I also haven't been able to
come up with an example triggering the assert (haven't been trying
too hard). How do we know the assert() makes sense?


I'd have 

Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-12-16 Thread Tomas Vondra

Hi,

On 12/16/2015 01:27 AM, David Rowley wrote:


I've attached a rebased patch against current master as there were
some conflicts from the recent changes to LATERAL join.


Thanks. I've looked at the rebased patch and have a few minor comments.

0) I know the patch does not tweak costing - any plans in this
   direction? Would it be possible to simply use the costing used by
   semijoin?

1) nodeHashjoin.c (and other join nodes)

I've noticed we have this in the ExecHashJoin() method:

 /*
  * When the inner side is unique or we're performing a
  * semijoin, we'll consider returning the first match, but
  * after that we're done with this outer tuple.
  */
 if (node->js.unique_inner)
 node->hj_JoinState = HJ_NEED_NEW_OUTER;

That seems a bit awkward because the comment speaks about unique joins 
*OR* semijoins, but the check only references unique_inner. That of 
course works because we do this in ExecInitHashJoin():


switch (node->join.jointype)
{
case JOIN_SEMI:
hjstate->js.unique_inner = true;
/* fall through */

Either some semijoins may not be unique - in that case the naming is 
misleading at best, and we either need to find a better name or simply 
check two conditions like this:


 if (node->js.unique_inner || node->join.type == JOIN_SEMI)
node->hj_JoinState = HJ_NEED_NEW_OUTER;

Or all semijoins are unique joins, and in that case the comment may need 
rephrasing. But more importantly, it begs the question why we're 
detecting this in the executor and not in the planner? Because if we 
detect it in executor, we can't use this bit in costing, for example.


FWIW the misleading naming was actually mentioned in the thread by 
Horiguchi-san, but I don't see any response to that (might have missed 
it though).


2) analyzejoins.c

I see that this code in join_is_removable()

innerrel = find_base_rel(root, innerrelid);

if (innerrel->reloptkind != RELOPT_BASEREL)
return false;

was rewritten like this:

innerrel = find_base_rel(root, innerrelid);

Assert(innerrel->reloptkind == RELOPT_BASEREL);

That suggests that previously the function expected cases where 
reloptkind may not be RELOPT_BASEREL, but now it'll error out in such 
cases. I haven't noticed any changes in the surrounding code that'd 
guarantee this won't happen, but I also haven't been able to come up 
with an example triggering the assert (haven't been trying too hard). 
How do we know the assert() makes sense?



3) joinpath.c

- either "were" or "been" seems redundant

/* left joins were already been analyzed for uniqueness in 
mark_unique_joins() */



4) analyzejoins.c

comment format broken

/*
 * mark_unique_joins
Analyze joins in order to determine if their inner side is unique based
on the join condition.
 */

5) analyzejoins.c

missing line before the comment

}
/*
 * rel_supports_distinctness
 *  Returns true if rel has some properties which can prove the 
relation
 *  to be unique over some set of columns.
 *

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-12-15 Thread David Rowley
On 25 August 2015 at 17:25, David Rowley 
wrote:

> On 24 August 2015 at 14:29, Tom Lane  wrote:
>
>> David Rowley  writes:
>> > I have to admit I don't much like it either, originally I had this as an
>> > extra property that was only seen in EXPLAIN VERBOSE.
>>
>> Seems like a reasonable design from here.
>
>
> The attached patch has the format in this way.
>

I've attached a rebased patch against current master as there were some
conflicts from the recent changes to LATERAL join.

On reviewing the patch again I was reminded that the bulk of the changes in
the patch are in analyzejoins.c. These are mostly just a refactor of the
current code to make it more reusable. The diff looks a bit more scary than
it actually is.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_16-12-2015_7a290ab1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-10-14 Thread Robert Haas
On Wed, Oct 14, 2015 at 1:03 AM, Pavel Stehule  wrote:
> it is great

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-10-13 Thread Pavel Stehule
2015-10-13 23:28 GMT+02:00 David Rowley :

> On 4 September 2015 at 04:50, Robert Haas  wrote:
>
>>
>> Also: very nice performance results.
>>
>>
> Thanks.
>
> On following a thread in [General] [1] it occurred to me that this patch
> can give a massive improvement on Merge joins where the mark and restore
> causes an index scan to have to skip over many filtered rows again and
> again.
>
> I mocked up some tables and some data from the scenario on the [General]
> thread:
>
> create table customers (id bigint, group_id bigint not null);
> insert into customers select x.x,x.x%27724+1 from
> generate_series(1,473733) x(x);
> alter table customers add constraint customer_pkey primary key (id);
> create table balances (id bigint, balance int not null, tracking_number
> int not null, customer_id bigint not null);
> insert into balances select x.x, 100, 12345, x.x % 45 + 1 from
> generate_Series(1,16876) x(x);
> create index balance_customer_id_index on balances (customer_id);
> create index balances_customer_tracking_number_index on balances
> (customer_id,tracking_number);
> analyze;
>
> Unpatched I get:
>
> test=# explain analyze SELECT ac.* FROM balances ac join customers o ON
> o.id = ac.customer_id WHERE o.group_id = 45;
>
>  QUERY PLAN
>
> 
>  Merge Join  (cost=164.87..1868.70 rows=1 width=24) (actual
> time=6.110..1491.408 rows=375 loops=1)
>Merge Cond: (ac.customer_id = o.id)
>->  Index Scan using balance_customer_id_index on balances ac
>  (cost=0.29..881.24 rows=16876 width=24) (actual time=0.009..5.206
> rows=16876 loops=1)
>->  Index Scan using customer_pkey on customers o  (cost=0.42..16062.75
> rows=17 width=8) (actual time=0.014..1484.382 rows=376 loops=1)
>  Filter: (group_id = 45)
>  Rows Removed by Filter: 10396168
>  Planning time: 0.207 ms
>  Execution time: 1491.469 ms
> (8 rows)
>
> Patched:
>
> test=# explain analyze SELECT ac.* FROM balances ac join customers o ON
> o.id = ac.customer_id WHERE o.group_id = 45;
>
>  QUERY PLAN
>
> 
>  Merge Join  (cost=164.87..1868.70 rows=1 width=24) (actual
> time=6.037..11.528 rows=375 loops=1)
>Merge Cond: (ac.customer_id = o.id)
>->  Index Scan using balance_customer_id_index on balances ac
>  (cost=0.29..881.24 rows=16876 width=24) (actual time=0.009..4.978
> rows=16876 loops=1)
>->  Index Scan using customer_pkey on customers o  (cost=0.42..16062.75
> rows=17 width=8) (actual time=0.015..5.141 rows=2 loops=1)
>  Filter: (group_id = 45)
>  Rows Removed by Filter: 27766
>  Planning time: 0.204 ms
>  Execution time: 11.575 ms
> (8 rows)
>
> Now it could well be that the merge join costs need a bit more work to
> avoid a merge join in this case, but as it stands as of today, this is your
> performance gain.
>
> Regards
>

it is great

Pavel


>
> David Rowley
>
> [1]
> http://www.postgresql.org/message-id/caczyddiaxeam2rkmhbmuhwvcm4txh+5e3hqgggyzfzbn-pn...@mail.gmail.com
>
> --
>  David Rowley   http://www.2ndQuadrant.com/
> 
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-10-13 Thread David Rowley
On 4 September 2015 at 04:50, Robert Haas  wrote:

>
> Also: very nice performance results.
>
>
Thanks.

On following a thread in [General] [1] it occurred to me that this patch
can give a massive improvement on Merge joins where the mark and restore
causes an index scan to have to skip over many filtered rows again and
again.
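
The reason is that with a proven-unique inner side, nodeMergejoin.c can move
straight to the next outer tuple after a match instead of advancing the inner
scan to look for a second match, which is what makes the index scan wade
through the filtered rows again for every duplicate outer value. Roughly (a
sketch of the idea, using the flag name discussed earlier in the thread, not
the exact patch hunk):

    /* Sketch (EXEC_MJ_JOINTUPLES): once the single possible match has
     * been joined there is no point scanning the inner side for further
     * matches, so skip the grind over already-filtered inner rows. */
    if (node->js.match_first_tuple_only)
        node->mj_JoinState = EXEC_MJ_NEXTOUTER;
    else
        node->mj_JoinState = EXEC_MJ_NEXTINNER;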

I mocked up some tables and some data from the scenario on the [General]
thread:

create table customers (id bigint, group_id bigint not null);
insert into customers select x.x,x.x%27724+1 from generate_series(1,473733)
x(x);
alter table customers add constraint customer_pkey primary key (id);
create table balances (id bigint, balance int not null, tracking_number int
not null, customer_id bigint not null);
insert into balances select x.x, 100, 12345, x.x % 45 + 1 from
generate_Series(1,16876) x(x);
create index balance_customer_id_index on balances (customer_id);
create index balances_customer_tracking_number_index on balances
(customer_id,tracking_number);
analyze;

Unpatched I get:

test=# explain analyze SELECT ac.* FROM balances ac join customers o ON o.id
= ac.customer_id WHERE o.group_id = 45;

 QUERY PLAN

 Merge Join  (cost=164.87..1868.70 rows=1 width=24) (actual
time=6.110..1491.408 rows=375 loops=1)
   Merge Cond: (ac.customer_id = o.id)
   ->  Index Scan using balance_customer_id_index on balances ac
 (cost=0.29..881.24 rows=16876 width=24) (actual time=0.009..5.206
rows=16876 loops=1)
   ->  Index Scan using customer_pkey on customers o  (cost=0.42..16062.75
rows=17 width=8) (actual time=0.014..1484.382 rows=376 loops=1)
 Filter: (group_id = 45)
 Rows Removed by Filter: 10396168
 Planning time: 0.207 ms
 Execution time: 1491.469 ms
(8 rows)

Patched:

test=# explain analyze SELECT ac.* FROM balances ac join customers o ON o.id
= ac.customer_id WHERE o.group_id = 45;

 QUERY PLAN

 Merge Join  (cost=164.87..1868.70 rows=1 width=24) (actual
time=6.037..11.528 rows=375 loops=1)
   Merge Cond: (ac.customer_id = o.id)
   ->  Index Scan using balance_customer_id_index on balances ac
 (cost=0.29..881.24 rows=16876 width=24) (actual time=0.009..4.978
rows=16876 loops=1)
   ->  Index Scan using customer_pkey on customers o  (cost=0.42..16062.75
rows=17 width=8) (actual time=0.015..5.141 rows=2 loops=1)
 Filter: (group_id = 45)
 Rows Removed by Filter: 27766
 Planning time: 0.204 ms
 Execution time: 11.575 ms
(8 rows)

Now it could well be that the merge join costs need a bit more work to
avoid a merge join in this case, but as it stands as of today, this is your
performance gain.

Regards

David Rowley

[1]
http://www.postgresql.org/message-id/caczyddiaxeam2rkmhbmuhwvcm4txh+5e3hqgggyzfzbn-pn...@mail.gmail.com

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-09-03 Thread Robert Haas
On Tue, Aug 25, 2015 at 1:25 AM, David Rowley
 wrote:
> If that's the case then why do we not enable verbose for all of the non-text
> outputs?
> It seems strange to start making exceptions on a case-by-case basis.

+1.  FORMAT and VERBOSE are separate options, and each one should
control what the name suggests, not the other thing.

Also: very nice performance results.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-08-26 Thread Michael Paquier
On Tue, Aug 25, 2015 at 2:25 PM, David Rowley
david.row...@2ndquadrant.com wrote:
 On 24 August 2015 at 14:29, Tom Lane t...@sss.pgh.pa.us wrote:

 David Rowley david.row...@2ndquadrant.com writes:
  I have to admit I don't much like it either, originally I had this as an
  extra property that was only seen in EXPLAIN VERBOSE.

 Seems like a reasonable design from here.


 The attached patch has the format in this way.


 (Note that for non-text output,
 I'd say the field should come out unconditionally.  We only care about

 abbreviating in text mode.)


 If that's the case then why do we not enable verbose for all of the non-text
 outputs?
 It seems strange to start making exceptions on a case-by-case basis.

Moved to CF 2015-09.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-08-24 Thread David Rowley
On 24 August 2015 at 14:29, Tom Lane t...@sss.pgh.pa.us wrote:

 David Rowley david.row...@2ndquadrant.com writes:
  I have to admit I don't much like it either, originally I had this as an
  extra property that was only seen in EXPLAIN VERBOSE.

 Seems like a reasonable design from here.


The attached patch has the format in this way.


 (Note that for non-text output,
 I'd say the field should come out unconditionally.  We only care about

abbreviating in text mode.)


If that's the case then why do we not enable verbose for all of the
non-text outputs?
It seems strange to start making exceptions on a case-by-case basis.

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2015-08-25_feb3068.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-08-23 Thread Tomas Vondra

Hi,

I did some initial performance evaluation of this patch today, and I see 
a clear improvement on larger joins. The scenario I've chosen for the 
experiments is a simple fact-dimension join, i.e. a small table 
referenced by a large table. So effectively something like this:


CREATE TABLE dim  (id INT PRIMARY KEY, ...);
CREATE TABLE fact (dim_d INT REFERENCES dim(id), ...);

except that I haven't used the foreign key constraint. In all the 
experiments I've used a fact table 10x the size of the dimension, but I 
believe what really matters most is the size of the dimension (and how 
the hash table fits into the L2/L3 cache).


The query tested is very simple:

select count(1) from (
select * from f join d on (f.fact_id = d.dim_id)
) foo;

The outer aggregation is intentional - the join produces many rows and 
formatting them as strings would completely eliminate any gains from the 
patch (even with \o /dev/null or such).


The following numbers come from current master, running on E5-2630 v3 
(2.40GHz), 64GB of RAM and this configuration:


 checkpoint_timeout = 15min
 effective_cache_size = 48GB
 maintenance_work_mem = 1GB
 max_wal_size = 4GB
 min_wal_size = 1GB
 random_page_cost = 1.5
 shared_buffers = 4GB
 work_mem = 1GB

all the other values are set to default.

I did 10 runs for each combination of sizes - the numbers seem quite 
consistent and repeatable. I also looked at the median values.



dim 100k rows, fact 1M rows
---

         master    patched
       --------   --------
    1   286.184    265.489
    2   284.827    264.961
    3   281.040    264.768
    4   280.926    267.720
    5   280.984    261.348
    6   280.878    261.463
    7   280.875    261.338
    8   281.042    261.265
    9   281.003    261.236
   10   280.939    261.185
       --------   --------
  med   280.994    261.406 (-7%)


dim 1M rows, fact 10M rows
--

          master    patched
       ---------  ---------
    1   4316.235   3690.373
    2   4399.539   3738.097
    3   4360.551   3655.602
    4   4359.763   3626.142
    5   4361.821   3621.941
    6   4359.205   3654.835
    7   4371.438   3631.212
    8   4361.857   3626.237
    9   4357.317   3676.651
   10   4359.561   3641.830
       ---------  ---------
  med   4360.157   3648.333 (-17%)


dim 10M rows, fact 100M rows


           master     patched
       ----------  ----------
    1   46246.467   39561.597
    2   45982.937   40083.352
    3   45818.118   39674.661
    4   45716.281   39616.585
    5   45651.117   40463.966
    6   45979.036   41395.390
    7   46045.400   41358.047
    8   45978.698   41253.946
    9   45801.343   41156.440
   10   45720.722   41374.688
       ----------  ----------
  med   45898.408   40810.203 (-10%)


So the gains seem quite solid - it's not something that would make the 
query an order of magnitude faster, but it's well above the noise.


Of course, in practice the queries will be more complicated, making the 
improvement less significant, but I don't think that's a reason not to 
apply it.


Two minor comments on the patch:

1) the 'subquery' variable in specialjoin_is_unique_join is unused

2) in the explain output, there should probably be a space before the
   '(inner unique)' text, so

 Hash Join (inner unique) ...

   instead of

 Hash Join(inner unique)

but that's just nitpicking at this point. Otherwise the patch seems 
quite solid to me.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


unijoins-test.sql
Description: application/sql


unijoins-queries.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-08-23 Thread David Rowley
On 24 August 2015 at 12:19, Tom Lane t...@sss.pgh.pa.us wrote:

 David Rowley david.row...@2ndquadrant.com writes:
  On 24 August 2015 at 07:31, Tomas Vondra tomas.von...@2ndquadrant.com
  wrote:
  2) in the explain output, there should probably be a space before the
  '(inner unique)' text, so
 
  Hash Join (inner unique) ...
 
  instead of
 
  Hash Join(inner unique)
 
  but that's just nitpicking at this point. Otherwise the patch seems
 quite
  solid to me.

  The attached fixes these two issues.

 Please, no.  Randomly sticking additional bits of information into Join
 Type is a good way to render EXPLAIN output permanently unintelligible,
 not only to humans but to whatever programs are still trying to read the
 text output format (I think explain.depesz.com still does).  It is also
 not a great idea to put more semantic distance between the text and
 non-text output representations.

 I am not exactly convinced that this behavior needs to be visible in
 EXPLAIN output at all, but if it does, it should be a separate field.


I have to admit I don't much like it either, originally I had this as an
extra property that was only seen in EXPLAIN VERBOSE.

-  Nested Loop
  Output: a.ctid, wcte.*
+ Unique Join: No

There was a debate somewhere about this and it ended up the way it is now;
I didn't feel the need to argue for the EXPLAIN VERBOSE field as I thought
that a committer would likely change it anyway. However, if I remove all
changes to explain.c, then it makes it very difficult to write regression
tests which ensure the new code is doing what it's meant to.

What do you think of the extra EXPLAIN VERBOSE field?
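
Something along these lines somewhere in ExplainNode(), say (purely
illustrative; the property label and the Join field name are whatever we
end up settling on):

    /* Hypothetical sketch: emit the flag only under EXPLAIN VERBOSE */
    if (es->verbose)
        ExplainPropertyText("Unique Join",
                            ((Join *) plan)->unique_inner ? "Yes" : "No", es);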

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-08-23 Thread David Rowley
On 24 August 2015 at 07:31, Tomas Vondra tomas.von...@2ndquadrant.com
wrote:


 dim 100k rows, fact 1M rows
 ---

  master patched
 --- ---

..


  med280.994 261.406 (-7%)


 dim 1M rows, fact 10M rows
 --

  master patched


..


  med   4360.1573648.333 (-17%)


 dim 10M rows, fact 100M rows
 

  master patched


..

  med  45898.408   40810.203 (-10%)


 So the gains seem quite solid - it's not something that would make the
 query an order of magnitude faster, but it's well above the noise.

 Of course, in practice the queries will be more complicated, making the
 improvement less significant, but I don't think that's a reason not to
 apply it.


Many thanks for doing that performance testing.


 Two minor comments on the patch:

 1) the 'subquery' variable in specialjoin_is_unique_join is unused

 2) in the explain output, there should probably be a space before the
'(inner unique)' text, so

  Hash Join (inner unique) ...

instead of

  Hash Join(inner unique)

 but that's just nitpicking at this point. Otherwise the patch seems quite
 solid to me.


The attached fixes these two issues.

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2015-08-24.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

