Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
> I've pushed the executor part of this, but mostly not the planner part,
> because I didn't think the latter was anywhere near ready for prime time:
> the regression test changes it was causing were entirely bogus.

Hi Tom,

Thanks for the commit and the explanation. I'll try to address your comments
if I continue working on the planner part.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] PoC: full merge join on comparison clause
Hi, I am attaching the updated patch, rebased to 820c03. On 09.10.2017 13:47, Ashutosh Bapat wrote: Hi Alexander, Commit c7a9fa399 has added another test on mergeopfamilies. I think your patch will need to take care of that test. On Wed, Oct 4, 2017 at 6:38 PM, Alexander Kuzmenkov wrote: As discussed earlier, I changed the way we work with mergeopfamilies. I use the "is_equality" flag to indicate whether the clause is an equality one, and fill mergeopfamilies for both equality and inequality operators. The updated patch is attached (rebased to 20b6552242). -- Alexander Kuzmenkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Alexander Kuzmenkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c index 925b4cf553..73e6a4ca74 100644 --- a/src/backend/executor/nodeMergejoin.c +++ b/src/backend/executor/nodeMergejoin.c @@ -172,31 +172,32 @@ typedef enum * to the opfamily and collation, with nulls at the indicated end of the range. * This allows us to obtain the needed comparison function from the opfamily. */ -static MergeJoinClause +static void MJExamineQuals(List *mergeclauses, Oid *mergefamilies, Oid *mergecollations, int *mergestrategies, bool *mergenullsfirst, - PlanState *parent) + MergeJoinState *parent) { - MergeJoinClause clauses; int nClauses = list_length(mergeclauses); int iClause; ListCell *cl; - clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData)); + parent->mj_Clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData)); + parent->mj_UseEqual = (bool *) palloc0(nClauses * sizeof(bool)); + parent->mj_UseLesser = (bool *) palloc0(nClauses * sizeof(bool)); iClause = 0; foreach(cl, mergeclauses) { OpExpr *qual = (OpExpr *) lfirst(cl); - MergeJoinClause clause = &clauses[iClause]; + MergeJoinClause clause = &parent->mj_Clauses[iClause]; Oid opfamily = mergefamilies[iClause]; Oid collation = mergecollations[iClause]; - StrategyNumber opstrategy = mergestrategies[iClause]; + StrategyNumber sort_op_strategy = mergestrategies[iClause]; bool nulls_first = mergenullsfirst[iClause]; - int op_strategy; + int join_op_strategy; Oid op_lefttype; Oid op_righttype; Oid sortfunc; @@ -207,28 +208,55 @@ MJExamineQuals(List *mergeclauses, /* * Prepare the input expressions for execution. 
*/ - clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), parent); - clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), parent); + clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), (PlanState *) parent); + clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), (PlanState *) parent); /* Set up sort support data */ clause->ssup.ssup_cxt = CurrentMemoryContext; clause->ssup.ssup_collation = collation; - if (opstrategy == BTLessStrategyNumber) + if (sort_op_strategy == BTLessStrategyNumber) clause->ssup.ssup_reverse = false; - else if (opstrategy == BTGreaterStrategyNumber) + else if (sort_op_strategy == BTGreaterStrategyNumber) clause->ssup.ssup_reverse = true; else /* planner screwed up */ - elog(ERROR, "unsupported mergejoin strategy %d", opstrategy); + elog(ERROR, "unsupported mergejoin strategy %d", sort_op_strategy); clause->ssup.ssup_nulls_first = nulls_first; /* Extract the operator's declared left/right datatypes */ get_op_opfamily_properties(qual->opno, opfamily, false, - &op_strategy, + &join_op_strategy, &op_lefttype, &op_righttype); - if (op_strategy != BTEqualStrategyNumber) /* should not happen */ - elog(ERROR, "cannot merge using non-equality operator %u", - qual->opno); + + /* + * Determine whether we accept lesser and/or equal tuples of the inner + * relation. + */ + switch (join_op_strategy) + { + case BTEqualStrategyNumber: +parent->mj_UseEqual[iClause] = true; +break; + + case BTLessEqualStrategyNumber: +parent->mj_UseEqual[iClause] = true; +/* fall through */ + + case BTLessStrategyNumber: +parent->mj_UseLesser[iClause] = true; +break; + + case BTGreaterEqualStrategyNumber: +parent->mj_UseEqual[iClause] = true; +/* fall through */ + + case BTGreaterStrategyNumber: +parent->mj_UseLesser[iClause] = true; +break; + + default: +elog(ERROR, "unsupported join strategy %d", join_op_strategy); + } /* * sortsupport routine must know if abbreviation optimization is @@ -265,8 +293,6 @@ MJExamineQuals(List *mergeclauses, iClause++; } - - return clauses; } /* @@ -378,6 +404,14 @@ MJEvalInnerValues(MergeJoinState *mergestate, TupleT
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
> Analysis: The estimated value of the lossy_pages is way higher than its
> actual value, and the reason is that the total_pages calculated by the
> "Mackert and Lohman formula" is not correct.

I think the problem might be that the total_pages includes cache effects and
rescans. For bitmap entries we should use something like
relation pages * selectivity.

Meanwhile, I ran TPC-H briefly with the v3 patch: scale factor 25, 2 workers,
SSD storage. It shows significant improvement on 4MB work_mem and no change
on 2GB. Here are the results (execution time in seconds, average of 5 runs):

work_mem        4MB                 2GB
Query      master   patch      master   patch
4             179     174         168     167
5             190     168         155     156
6             280      87         227     229
8             197     114         179     172
10            269     227         190     192
14            110     108         106     105

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
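To illustrate the estimate suggested above, here is a minimal, self-contained
sketch in plain C (illustrative function and variable names, not the actual
costsize.c code): the pages fetched by a single bitmap heap scan are
approximated as relation pages times selectivity, clamped to the relation
size, with no cache-effect or rescan terms.

#include <math.h>
#include <stdio.h>

/*
 * Rough page estimate for one bitmap heap scan: no cache effects,
 * no rescans, just rel_pages * selectivity, clamped to sane bounds.
 */
static double
bitmap_pages_fetched(double rel_pages, double selectivity)
{
	double		pages = ceil(rel_pages * selectivity);

	if (pages > rel_pages)
		pages = rel_pages;
	if (pages < 1.0)
		pages = 1.0;
	return pages;
}

int
main(void)
{
	/* a 100000-page relation, 1% of its tuples selected */
	printf("%.0f pages\n", bitmap_pages_fetched(100000.0, 0.01));
	return 0;
}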
Re: [HACKERS] PoC: full merge join on comparison clause
As discussed earlier, I changed the way we work with mergeopfamilies. I use the "is_equality" flag to indicate whether the clause is an equality one, and fill mergeopfamilies for both equality and inequality operators. The updated patch is attached (rebased to 20b6552242). -- Alexander Kuzmenkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c index 925b4cf553..73e6a4ca74 100644 --- a/src/backend/executor/nodeMergejoin.c +++ b/src/backend/executor/nodeMergejoin.c @@ -172,31 +172,32 @@ typedef enum * to the opfamily and collation, with nulls at the indicated end of the range. * This allows us to obtain the needed comparison function from the opfamily. */ -static MergeJoinClause +static void MJExamineQuals(List *mergeclauses, Oid *mergefamilies, Oid *mergecollations, int *mergestrategies, bool *mergenullsfirst, - PlanState *parent) + MergeJoinState *parent) { - MergeJoinClause clauses; int nClauses = list_length(mergeclauses); int iClause; ListCell *cl; - clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData)); + parent->mj_Clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData)); + parent->mj_UseEqual = (bool *) palloc0(nClauses * sizeof(bool)); + parent->mj_UseLesser = (bool *) palloc0(nClauses * sizeof(bool)); iClause = 0; foreach(cl, mergeclauses) { OpExpr *qual = (OpExpr *) lfirst(cl); - MergeJoinClause clause = &clauses[iClause]; + MergeJoinClause clause = &parent->mj_Clauses[iClause]; Oid opfamily = mergefamilies[iClause]; Oid collation = mergecollations[iClause]; - StrategyNumber opstrategy = mergestrategies[iClause]; + StrategyNumber sort_op_strategy = mergestrategies[iClause]; bool nulls_first = mergenullsfirst[iClause]; - int op_strategy; + int join_op_strategy; Oid op_lefttype; Oid op_righttype; Oid sortfunc; @@ -207,28 +208,55 @@ MJExamineQuals(List *mergeclauses, /* * Prepare the input expressions for execution. */ - clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), parent); - clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), parent); + clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), (PlanState *) parent); + clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), (PlanState *) parent); /* Set up sort support data */ clause->ssup.ssup_cxt = CurrentMemoryContext; clause->ssup.ssup_collation = collation; - if (opstrategy == BTLessStrategyNumber) + if (sort_op_strategy == BTLessStrategyNumber) clause->ssup.ssup_reverse = false; - else if (opstrategy == BTGreaterStrategyNumber) + else if (sort_op_strategy == BTGreaterStrategyNumber) clause->ssup.ssup_reverse = true; else /* planner screwed up */ - elog(ERROR, "unsupported mergejoin strategy %d", opstrategy); + elog(ERROR, "unsupported mergejoin strategy %d", sort_op_strategy); clause->ssup.ssup_nulls_first = nulls_first; /* Extract the operator's declared left/right datatypes */ get_op_opfamily_properties(qual->opno, opfamily, false, - &op_strategy, + &join_op_strategy, &op_lefttype, &op_righttype); - if (op_strategy != BTEqualStrategyNumber) /* should not happen */ - elog(ERROR, "cannot merge using non-equality operator %u", - qual->opno); + + /* + * Determine whether we accept lesser and/or equal tuples of the inner + * relation. 
+ */ + switch (join_op_strategy) + { + case BTEqualStrategyNumber: +parent->mj_UseEqual[iClause] = true; +break; + + case BTLessEqualStrategyNumber: +parent->mj_UseEqual[iClause] = true; +/* fall through */ + + case BTLessStrategyNumber: +parent->mj_UseLesser[iClause] = true; +break; + + case BTGreaterEqualStrategyNumber: +parent->mj_UseEqual[iClause] = true; +/* fall through */ + + case BTGreaterStrategyNumber: +parent->mj_UseLesser[iClause] = true; +break; + + default: +elog(ERROR, "unsupported join strategy %d", join_op_strategy); + } /* * sortsupport routine must know if abbreviation optimization is @@ -265,8 +293,6 @@ MJExamineQuals(List *mergeclauses, iClause++; } - - return clauses; } /* @@ -378,6 +404,14 @@ MJEvalInnerValues(MergeJoinState *mergestate, TupleTableSlot *innerslot) return result; } +/* Tuple comparison result */ +typedef enum +{ + MJCR_NextInner = 1, + MJCR_NextOuter = -1, + MJCR_Join = 0 +} MJCompareResult; + /* * MJCompare * @@ -388,10 +422,10 @@ MJEvalInnerValues(MergeJoinState *mergestate, TupleTableSlot *innerslot) * MJEvalOuterValues and MJEvalInnerValues must already have been called * for the current outer and inner tuples, re
Re: [HACKERS] PoC: full merge join on comparison clause
Hi Ashutosh,

Thanks for the review. *Jeff*, I'm copying you because this is relevant to our
discussion about what to do with mergeopfamilies when adding new merge join
types.

> You have renamed RestrictInfo member mergeopfamilies as equivopfamilies. I
> don't think that's a good name; it doesn't convey that these are opfamilies
> containing merge operators. The changes in check_mergejoinable() suggest
> that an operator may act as an equality operator in some operator families
> and a comparison operator in others. That looks odd. Actually an operator
> family contains operators other than equality operators, so you may want to
> retain this member and add a new member to specify whether the clause is an
> equality clause or not.

For mergeopfamilies, I'm not sure what is the best thing to do. I'll try to
explain my understanding of the situation, please correct me if I'm wrong.

Before the patch, mergeopfamilies was used for two things: creating
equivalence classes and performing merge joins.

For equivalence classes: we look at the restriction clauses, and if they have
mergeopfamilies set, it means that these clauses are based on an equality
operator, and the left and right variables must be equal. To record this fact,
we create an equivalence class. The variables might be equal for one equality
operator and not equal for another, so we record the particular operator
families to which our equality operator belongs.

For merge join: we look at the join clauses, and if they have mergeopfamilies
set, it means that these clauses are based on an equality operator, and we can
try performing this particular join as a merge join. These opfamilies are also
used beforehand to create the equivalence classes for the left and right
variables. The equivalence classes are used to match the join clauses to
pathkeys describing the ordering of join inputs.

So, if we want to start doing merge joins for operators other than equality,
we still need to record their opfamilies, but only use them for the second
case and not the first. I chose to put these opfamilies into different
variables, and named the one used for equivalence classes 'equivopfamilies'
and the one used for merge joins 'mergeopfamilies'. The equality operators are
used for both cases, so we put their opfamilies into both of these variables.

I agree this might look confusing. Indeed, we could keep a single variable for
opfamilies, and add separate flags that show how they can be used, be that for
equivalence classes, merge joins, range joins or some combination of them.
This is similar to what Jeff did in his range merge join patch [1]. I will
think more about this and try to produce an updated patch.

> In mergejoinscansel() you have just removed
> Assert(op_strategy == BTEqualStrategyNumber);
> Probably this function is written considering only equality operators. But
> now that we are using this for all other operators, we will need more
> changes to this function. That may be the reason why the INNER join in your
> earlier example doesn't choose the right costing.

I changed mergejoinscansel() slightly to reflect the fact that the inner
relation is scanned from the beginning if we have an inequality merge clause.

> The comment change in final_cost_mergejoin() needs more work. n1, n2, n3 are
> the numbers of rows on the inner side with values 1, 2, 3 resp. So
> n1 + n2 + n3 + ... = size of inner relation is correct. In that context I am
> not able to understand your change
> +	 * If the merge clauses contain inequality, (n1 + n2 + ...) ~=
> +	 * (size of inner relation)^2.

I extended the comment in final_cost_mergejoin(). Not sure if that
approximation makes any sense, but this is the best I could think of.

Style problems are fixed.

Attached please find the new version of the patch that addresses all the
review comments except mergeopfamilies. The current commitfest is ending, but
I'd like to continue working on this patch, so I am moving it to the next one.

[1] https://www.postgresql.org/message-id/flat/CAMp0ubfwAFFW3O_NgKqpRPmm56M4weTEXjprb2gP_NrDaEC4Eg%40mail.gmail.com#camp0ubfwaffw3o_ngkqprpmm56m4wetexjprb2gp_nrdaec...@mail.gmail.com

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 32dc4e6301..2958a9e53d 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -722,19 +722,19 @@ get_useful_ecs_for_relation(PlannerInfo *root, RelOptInfo *rel)
 	{
 		RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(lc);
 
-		/* Consider only mergejoinable clauses */
-		if (restrictinfo->mergeopfamilies == NIL)
+		/* Consider only mergejoinable equality clauses */
+		if (restrictinfo->equivopfamilies == NIL)
 			continue;
 
 		/* Make sure we've got canonical ECs. */
-		update_mergeclause_ecl
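To make the alternative discussed above concrete, here is a small,
self-contained C sketch of keeping a single opfamily list per clause plus
usage flags. The type and field names are hypothetical stand-ins, not the
actual RestrictInfo definition or the attached patch.

#include <stdbool.h>
#include <stddef.h>

typedef struct List List;		/* stand-in for PostgreSQL's List */

typedef struct ClauseMergeInfo
{
	List	   *mergeopfamilies;	/* btree opfamilies containing the operator */
	bool		is_equality;		/* operator is an equality in those families */
	bool		is_comparison;		/* operator is an inequality (<, <=, >, >=) */
} ClauseMergeInfo;

/* Only equality clauses may create equivalence classes. */
static bool
usable_for_equivalence_class(const ClauseMergeInfo *clause)
{
	return clause->mergeopfamilies != NULL && clause->is_equality;
}

/* Both equality and inequality clauses may drive a merge join. */
static bool
usable_for_merge_join(const ClauseMergeInfo *clause)
{
	return clause->mergeopfamilies != NULL &&
		(clause->is_equality || clause->is_comparison);
}

With flags like these, a single list could play the roles of both
'equivopfamilies' and 'mergeopfamilies' from the current patch.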
Re: [HACKERS] type cache for concat functions
The following review has been posted through the commitfest application:

make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Looks good to me.

The new status of this patch is: Ready for Committer
Re: [HACKERS] [POC] Faster processing at Gather node
Thanks Rafia, Amit, now I understand the ideas behind the patch better. I'll
see if I can look at the new one.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] [POC] Faster processing at Gather node
Hi Rafia,

I like the idea of reducing locking overhead by sending tuples in bulk. The
implementation could probably be simpler: you could extend the API of shm_mq
to decouple notifying the receiver from actually putting data into the queue
(i.e., make shm_mq_notify_receiver public and add a variant of shm_mq_sendv
that doesn't send the notification).

From Amit's letter I understand that you have already tried something along
these lines and the performance wasn't good. What was the bottleneck then? If
it's the locking around mq_bytes_read/written, it can be rewritten with
atomics.

I think it would be great to try this approach because it doesn't add much
code, doesn't add any additional copying, and improves shm_mq performance in
general.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
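As a self-contained model of the "rewrite it with atomics" idea above, here is
a single-producer/single-consumer sketch in C11 of tracking bytes read and
written without taking a lock per message. This is illustrative only, not the
actual shm_mq code, which additionally handles wraparound, latches and detach.

#include <stdatomic.h>
#include <stdint.h>

typedef struct
{
	_Atomic uint64_t bytes_written;	/* advanced only by the sender */
	_Atomic uint64_t bytes_read;	/* advanced only by the receiver */
	uint64_t	ring_size;
} ring_counters;

/* Sender side: free space currently available in the ring. */
static uint64_t
ring_free_space(ring_counters *rc)
{
	uint64_t	written = atomic_load_explicit(&rc->bytes_written, memory_order_relaxed);
	uint64_t	read = atomic_load_explicit(&rc->bytes_read, memory_order_acquire);

	return rc->ring_size - (written - read);
}

/* Sender side: publish nbytes after copying them into the ring. */
static void
ring_advance_write(ring_counters *rc, uint64_t nbytes)
{
	atomic_fetch_add_explicit(&rc->bytes_written, nbytes, memory_order_release);
}

/* Receiver side: bytes currently available to read. */
static uint64_t
ring_used_space(ring_counters *rc)
{
	uint64_t	written = atomic_load_explicit(&rc->bytes_written, memory_order_acquire);
	uint64_t	read = atomic_load_explicit(&rc->bytes_read, memory_order_relaxed);

	return written - read;
}

/* Receiver side: consume nbytes, making that space reusable. */
static void
ring_advance_read(ring_counters *rc, uint64_t nbytes)
{
	atomic_fetch_add_explicit(&rc->bytes_read, nbytes, memory_order_release);
}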
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
On 04.09.2017 15:17, Alexey Chernyshov wrote: make check-world fails on contrib/postgres_fdw because of Subquery Scan on ... Probably, query plan was changed. Hi Alexey, Thanks for testing! This is the same problem as the one in 'select_distinct' I mentioned before. I changed the test, the updated patch is attached. -- Alexander Kuzmenkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index c19b3318c7..5f5f65d60c 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -2203,22 +2203,23 @@ SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 -- join with lateral reference EXPLAIN (VERBOSE, COSTS OFF) SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM ft1 t2, ft2 t3 WHERE t2.c1 = t3.c1 AND t2.c2 = t1.c2) q ORDER BY t1."C 1" OFFSET 10 LIMIT 10; - QUERY PLAN - +QUERY PLAN +-- Limit Output: t1."C 1" -> Nested Loop Output: t1."C 1" -> Index Scan using t1_pkey on "S 1"."T 1" t1 Output: t1."C 1", t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8 - -> HashAggregate - Output: t2.c1, t3.c1 - Group Key: t2.c1, t3.c1 - -> Foreign Scan + -> Subquery Scan on q + -> HashAggregate Output: t2.c1, t3.c1 - Relations: (public.ft1 t2) INNER JOIN (public.ft2 t3) - Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1")) AND ((r1.c2 = $1::integer -(13 rows) + Group Key: t2.c1, t3.c1 + -> Foreign Scan + Output: t2.c1, t3.c1 + Relations: (public.ft1 t2) INNER JOIN (public.ft2 t3) + Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1")) AND ((r1.c2 = $1::integer +(14 rows) SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM ft1 t2, ft2 t3 WHERE t2.c1 = t3.c1 AND t2.c2 = t1.c2) q ORDER BY t1."C 1" OFFSET 10 LIMIT 10; C 1 @@ -2672,16 +2673,17 @@ select c2, sum(c1) from ft2 group by c2 having avg(c1) < 500 and sum(c1) < 49800 -- Unshippable HAVING clause will be evaluated locally, and other qual in HAVING clause is pushed down explain (verbose, costs off) select count(*) from (select c5, count(c1) from ft1 group by c5, sqrt(c2) having (avg(c1) / avg(c1)) * random() <= 1 and avg(c1) < 500) x; - QUERY PLAN -- + QUERY PLAN +--- Aggregate Output: count(*) - -> Foreign Scan - Output: ft1.c5, NULL::bigint, (sqrt((ft1.c2)::double precision)) - Filter: (avg(ft1.c1)) / (avg(ft1.c1::double precision * random()) <= '1'::double precision) - Relations: Aggregate on (public.ft1) - Remote SQL: SELECT c5, NULL::bigint, sqrt(c2), avg("C 1") FROM "S 1"."T 1" GROUP BY c5, (sqrt(c2)) HAVING ((avg("C 1") < 500::numeric)) -(7 rows) + -> Subquery Scan on x + -> Foreign Scan + Output: ft1.c5, NULL::bigint, (sqrt((ft1.c2)::double precision)) + Filter: (avg(ft1.c1)) / (
Re: [HACKERS] Range Merge Join v1
Hi Jeff,

Fast range joins are very nice to have, thank you for working on this. I've
been doing a somewhat related thing recently, trying to teach the merge join
executor to perform full joins on comparison clauses [1]. I have some
comments after reading your patch:

* At the moment, "mergejoinable clause" and "equality clause" mean the same
  thing to the planner, and those clauses are used both to create equivalence
  classes and to perform merge joins. If we start performing merge joins for
  different kinds of clauses, such as comparison or range intersection, it
  makes sense to separate these two meanings. I tried this in my patch and it
  didn't require many changes. I use RestrictInfo.equivopfamilies to build
  equivalence classes, and RestrictInfo.mergeopfamilies for mergejoinable
  clauses.

* Semantically, MJCompare() is a function that determines whether you should
  emit a join result or advance one of the pointers. This meaning is not
  explicit in the code and is not well encapsulated: the function returns an
  int and a 'compareNoMatch' flag, and the calling code combines them in
  various ways to derive the final result. This meaning can be made explicit
  by making MJCompare return enum values {Join, NextInner, NextOuter}, and
  putting inside it all the logic that decides whether we join or advance.
  ExecMergeJoin looks intimidating already, and I think this change would
  help readability. Again, you can find an illustration in my patch.

I hope you find these points helpful.

[1] https://www.postgresql.org/message-id/flat/1d23ad41-a9d9-1d1e-1d79-83b002d6f776%40postgrespro.ru#1d23ad41-a9d9-1d1e-1d79-83b002d6f...@postgrespro.ru

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] PoC: full merge join on comparison clause
Here is a new version of the patch, rebased to 749c7c41 and with some cosmetic changes. -- Alexander Kuzmenkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index d77c2a70e4..19bc90aa32 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -722,19 +722,19 @@ get_useful_ecs_for_relation(PlannerInfo *root, RelOptInfo *rel) { RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(lc); - /* Consider only mergejoinable clauses */ - if (restrictinfo->mergeopfamilies == NIL) + /* Consider only mergejoinable equality clauses */ + if (restrictinfo->equivopfamilies == NIL) continue; /* Make sure we've got canonical ECs. */ - update_mergeclause_eclasses(root, restrictinfo); + update_equivclause_eclasses(root, restrictinfo); /* - * restrictinfo->mergeopfamilies != NIL is sufficient to guarantee + * restrictinfo->equivopfamilies != NIL is sufficient to guarantee * that left_ec and right_ec will be initialized, per comments in * distribute_qual_to_rels. * - * We want to identify which side of this merge-joinable clause + * We want to identify which side of this merge-joinable equality clause * contains columns from the relation produced by this RelOptInfo. We * test for overlap, not containment, because there could be extra * relations on either side. For example, suppose we've got something diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c index 925b4cf553..8eb5c8fd1d 100644 --- a/src/backend/executor/nodeMergejoin.c +++ b/src/backend/executor/nodeMergejoin.c @@ -172,31 +172,32 @@ typedef enum * to the opfamily and collation, with nulls at the indicated end of the range. * This allows us to obtain the needed comparison function from the opfamily. */ -static MergeJoinClause +static void MJExamineQuals(List *mergeclauses, Oid *mergefamilies, Oid *mergecollations, int *mergestrategies, bool *mergenullsfirst, - PlanState *parent) + MergeJoinState *parent) { - MergeJoinClause clauses; int nClauses = list_length(mergeclauses); int iClause; ListCell *cl; - clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData)); + parent->mj_Clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData)); + parent->mj_UseEqual = (bool *) palloc0(nClauses * sizeof(bool)); + parent->mj_UseLesser = (bool *) palloc0(nClauses * sizeof(bool)); iClause = 0; foreach(cl, mergeclauses) { OpExpr *qual = (OpExpr *) lfirst(cl); - MergeJoinClause clause = &clauses[iClause]; + MergeJoinClause clause = &parent->mj_Clauses[iClause]; Oid opfamily = mergefamilies[iClause]; Oid collation = mergecollations[iClause]; - StrategyNumber opstrategy = mergestrategies[iClause]; + StrategyNumber sort_op_strategy = mergestrategies[iClause]; bool nulls_first = mergenullsfirst[iClause]; - int op_strategy; + int join_op_strategy; Oid op_lefttype; Oid op_righttype; Oid sortfunc; @@ -207,28 +208,50 @@ MJExamineQuals(List *mergeclauses, /* * Prepare the input expressions for execution. 
*/ - clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), parent); - clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), parent); + clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), (PlanState *) parent); + clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), (PlanState *) parent); /* Set up sort support data */ clause->ssup.ssup_cxt = CurrentMemoryContext; clause->ssup.ssup_collation = collation; - if (opstrategy == BTLessStrategyNumber) + if (sort_op_strategy == BTLessStrategyNumber) clause->ssup.ssup_reverse = false; - else if (opstrategy == BTGreaterStrategyNumber) + else if (sort_op_strategy == BTGreaterStrategyNumber) clause->ssup.ssup_reverse = true; else /* planner screwed up */ - elog(ERROR, "unsupported mergejoin strategy %d", opstrategy); + elog(ERROR, "unsupported mergejoin strategy %d", sort_op_strategy); clause->ssup.ssup_nulls_first = nulls_first; /* Extract the operator's declared left/right datatypes */ get_op_opfamily_properties(qual->opno, opfamily, false, - &op_strategy, + &join_op_strategy, &op_lefttype, &op_righttype); - if (op_strategy != BTEqualStrategyNumber) /* should not happen */ - elog(ERROR, "cannot merge using non-equality operator %u", - qual->opno); + + /* + * Determine whether we accept lesser and/or equal tuples + * of the inner relation. + */ + switch (join_op_strategy) + { + case BTEqualStrategyNumber: +parent->mj_UseEqual[iClause] = true; +break; +
Re: [HACKERS] type cache for concat functions
Hi Pavel,

I tried applying your patch; it applies and compiles fine, check and
check-world pass.

I ran a simple performance test: select concat(generate_series(1,10), ...
[x5 total]) vs. select generate_series(1,10)::text || ... . Operator || runs
in 60 ms, while unpatched concat takes 90 ms and patched concat takes 55 ms.

About the code:

* There seems to be an extra tab here:
      FmgrInfo	*typcache;
  It's not aligned with the previous declaration with tab size 4.

* You could allocate the cache and store it into the context inside
  rebuildConcatCache. It would return the cache pointer, and the calling code
  would look cleaner -- just one line. This is a matter of taste though.

* The nearby functions use snake_case, so it should be rebuild_concat_cache
  too.

Overall the patch looks good to me.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
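To illustrate the calling convention suggested above, here is a tiny,
self-contained C sketch (hypothetical names and simplified types, not the
actual patch): the rebuild function allocates the cache, stores it into the
caller-provided slot and returns it, so the caller needs a single line.

#include <stdlib.h>

typedef struct ConcatCache
{
	int			nargs;
	void	  **arg_out_funcs;	/* stand-in for the per-argument FmgrInfo array */
} ConcatCache;

/* Build the cache, remember it in the caller's slot, and return it. */
static ConcatCache *
rebuild_concat_cache(ConcatCache **slot, int nargs)
{
	ConcatCache *cache = calloc(1, sizeof(ConcatCache));

	cache->nargs = nargs;
	cache->arg_out_funcs = calloc(nargs, sizeof(void *));
	*slot = cache;
	return cache;
}

/* The caller becomes one line: reuse the cache if present, else rebuild it. */
static ConcatCache *
get_concat_cache(ConcatCache **slot, int nargs)
{
	return *slot != NULL ? *slot : rebuild_concat_cache(slot, nargs);
}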
Re: [HACKERS] Proposal for CSN based snapshots
> What problem exactly you are seeing in the clog, is it the contention around
> CLOGControlLock or generally accessing CLOG is slower. If former, then we
> already have a patch [1] to address it.

It's the contention around CLogControlLock. Thank you for the pointer, next
time I'll try it with the group clog update patch.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Proposal for CSN based snapshots
> That's not actually a problem as I am reusing an older v4 from Heikki now
> for the future, but I wanted to let you know about that first.

Thank you, I'll look into that.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Proposal for CSN based snapshots
On 21.06.2017 04:48, Michael Paquier wrote:
> There has not been much activity on this thread for some time, and I
> mentioned my intentions to some developers at the last PGCon. But I am
> planning to study more the work that has been done here, with as envisaged
> goal to present a patch for the first CF of PG11. Lots of fun ahead.

Hi Michael,

Glad to see you working on this! I've been studying this topic too. Attached
you can find a recently rebased version of Heikki's v4 patch. I also fixed a
bug that appeared on the report-receipts isolation test: XidIsConcurrent
should report that a transaction is concurrent to ours when its csn equals
our snapshotcsn.

There is another bug where multiple tuples with the same primary key appear
after a pgbench read-write test. Querying pgbench_branches with index scans
disabled, I see several tuples with the same 'bid' value. Usually they do not
touch the index, so there are no errors, but sometimes a HOT update is not
possible and it errors out with a violation of the unique constraint, which
is how I noticed it. I can't reproduce it on my machine and have to use a
72-core one.

For now I can conclude that the oldestActiveXid is not always updated
correctly. In TransactionIdAsyncCommitTree, just before the transaction sets
its clog status, its xid can be less than oldestActiveXid. Other transactions
see it as aborted for some time before it writes to clog (see
TransactionIdGetCommitSeqNo). They can update a tuple it deleted, and that
leads to duplication. Unfortunately, I didn't have time yet to investigate
this further.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

csn-5.patch.gz
Description: application/gzip
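For clarity, the XidIsConcurrent fix described above boils down to an
inclusive comparison; a minimal sketch with stand-in types, not the actual
patch code:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t CommitSeqNo;

/*
 * A transaction that committed with a CSN greater than or equal to our
 * snapshot's CSN is concurrent with us and must not be visible.
 */
static bool
XidIsConcurrent(CommitSeqNo xid_commit_csn, CommitSeqNo snapshot_csn)
{
	return xid_commit_csn >= snapshot_csn;
}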
Re: [HACKERS] PoC: full merge join on comparison clause
On 16.05.2017 18:57, Robert Haas wrote:
> Interesting. I suggest adding this to the next CommitFest.

Thank you, added: https://commitfest.postgresql.org/14/1141/

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
[HACKERS] PoC: full merge join on comparison clause
ches: 971 Planning time: 1.414 ms Execution time: 1324.985 ms (16 rows) test1=# explain analyze select * from t4 full join t1 on t4.a < t1.a; QUERY PLAN --- Merge Full Join (cost=809.66..34230.49 rows=333 width=8) (actual time=8.351..914.590 rows=5007546 loops=1) Merge Cond: (t1.a > t4.a) -> Index Only Scan using idx_t1_a on t1 (cost=0.28..35.27 rows=1000 width=4) (actual time=0.035..0.368 rows=1001 loops=1) Heap Fetches: 97 -> Sort (cost=809.39..834.39 rows=1 width=4) (actual time=8.309..347.851 rows=5007546 loops=1) Sort Key: t4.a Sort Method: quicksort Memory: 931kB -> Seq Scan on t4 (cost=0.00..145.00 rows=1 width=4) (actual time=0.020..2.563 rows=1 loops=1) Planning time: 1.083 ms Execution time: 1044.869 ms (10 rows) === Merge vs nested loop === test1=# explain analyze select * from t5 join t1 on t5.a <= t1.a; QUERY PLAN - Nested Loop (cost=0.28..944713.00 rows= width=8) (actual time=0.055..8718.840 rows=50014145 loops=1) -> Seq Scan on t5 (cost=0.00..1443.00 rows=10 width=4) (actual time=0.019..6.428 rows=10 loops=1) -> Index Only Scan using idx_t1_a on t1 (cost=0.28..6.10 rows=333 width=4) (actual time=0.003..0.050 rows=500 loops=10) Index Cond: (a >= t5.a) Heap Fetches: 9147995 Planning time: 2.209 ms Execution time: 9942.176 ms (7 rows) test1=# set enable_mergejoin TO on; SET test1=# explain analyze select * from t5 join t1 on t5.a <= t1.a; QUERY PLAN --- Merge Join (cost=9748.54..343618.88 rows= width=8) (actual time=35.491..9281.482 rows=50014145 loops=1) Merge Cond: (t1.a >= t5.a) -> Index Only Scan using idx_t1_a on t1 (cost=0.28..35.27 rows=1000 width=4) (actual time=0.027..0.769 rows=1001 loops=1) Heap Fetches: 97 -> Sort (cost=9747.82..9997.82 rows=10 width=4) (actual time=35.458..3906.652 rows=50014145 loops=1) Sort Key: t5.a Sort Method: quicksort Memory: 8541kB -> Seq Scan on t5 (cost=0.00..1443.00 rows=10 width=4) (actual time=0.013..8.570 rows=10 loops=1) Planning time: 2.368 ms Execution time: 10530.356 ms (10 rows) -- Alexander Kuzmenkov Postgres Professional:http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c index 62784af..db9de5b 100644 --- a/src/backend/executor/nodeMergejoin.c +++ b/src/backend/executor/nodeMergejoin.c @@ -171,31 +171,32 @@ typedef enum * to the opfamily and collation, with nulls at the indicated end of the range. * This allows us to obtain the needed comparison function from the opfamily. 
*/ -static MergeJoinClause +static void MJExamineQuals(List *mergeclauses, Oid *mergefamilies, Oid *mergecollations, int *mergestrategies, bool *mergenullsfirst, - PlanState *parent) + MergeJoinState *parent) { - MergeJoinClause clauses; int nClauses = list_length(mergeclauses); int iClause; ListCell *cl; - clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData)); + parent->mj_Clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData)); + parent->mj_UseEqual = (bool *) palloc0(nClauses * sizeof(bool)); + parent->mj_UseLesser = (bool *) palloc0(nClauses * sizeof(bool)); iClause = 0; foreach(cl, mergeclauses) { OpExpr *qual = (OpExpr *) lfirst(cl); - MergeJoinClause clause = &clauses[iClause]; + MergeJoinClause clause = &parent->mj_Clauses[iClause]; Oid opfamily = mergefamilies[iClause]; Oid collation = mergecollations[iClause]; - StrategyNumber opstrategy = mergestrategies[iClause]; + StrategyNumber sort_op_strategy = mergestrategies[iClause]; bool nulls_first = mergenullsfirst[iClause]; - int op_strategy; + int join_op_strategy; Oid op_lefttype; Oid op_righttype; Oid sortfunc; @@ -206,28 +207,46 @@ MJExamineQuals(List *mergeclauses, /* * Prepare the input expressions for execution. */ - clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), parent); - clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), parent); + clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), (PlanState *) parent); + clause->rexpr = ExecInitExpr((Exp
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
xt)) Rows Removed by Index Recheck: 90284 Heap Blocks: exact=13221 lossy=22217 -> Bitmap Index Scan on idx_pglist_fts (cost=0.00..529.02 rows=54503 width=0) (actual time=78.366..78.366 rows=222813 loops=1) Index Cond: (fts @@ to_tsquery('tom & lane'::text)) Planning time: 0.722 ms Execution time: 1256.028 ms (13 rows) test1=# set enable_bitmapcount to on; SET test1=# explain analyze select count(*) from pglist where fts @@ to_tsquery( 'tom & lane' ); QUERY PLAN Bitmap Count on pglist (cost=542.65..1087.68 rows=54503 width=8) (actual time=2745.740..2745.742 rows=1 loops=1) Recheck Cond: (fts @@ to_tsquery('tom & lane'::text)) Rows Removed by Index Recheck: 270853 Heap Fetches: 66138 Heap Blocks: exact=39854 lossy=66138 -> Bitmap Index Scan on idx_pglist_fts (cost=0.00..529.02 rows=54503 width=0) (actual time=85.572..85.572 rows=222813 loops=1) Index Cond: (fts @@ to_tsquery('tom & lane'::text)) Planning time: 0.701 ms Execution time: 2745.800 ms (9 rows) -- Alexander Kuzmenkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
On 12.04.2017 17:24, Tom Lane wrote:
> TBH, I'm not sure you need to do any of that work. Have you got evidence
> that the planner will fail to choose the right plan regardless? I'm
> particularly unconvinced that choose_bitmap_and is a critical problem,
> because once you're into having to AND multiple indexes, you're talking
> about an expensive query anyhow.

The most expensive part would probably be accessing the heap in the bitmap
heap scan. It may be worth trying to choose an index path that checks all the
restrictions and therefore allows us to skip this heap access. This path
might not be the cheapest one, though; the standard AND selection procedure
would have discarded it based on cost. I've seen this happen on the
regression database. Somehow I can't seem to reproduce it on my earlier
full-text search example.

An example:

regression=# explain select count(*) from tenk1 where hundred < 90 and thousand < 31;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Bitmap Count on tenk1  (cost=182.69..185.56 rows=1 width=8)
   Recheck Cond: ((thousand < 31) AND (hundred < 90))
   ->  BitmapAnd  (cost=182.69..182.69 rows=287 width=0)
         ->  Bitmap Index Scan on tenk1_thous_tenthous  (cost=0.00..6.68 rows=319 width=0)
               Index Cond: (thousand < 31)
         ->  Bitmap Index Scan on tenk1_hundred  (cost=0.00..175.62 rows=8978 width=0)
               Index Cond: (hundred < 90)
(7 rows)

regression=# set enable_bitmapcount to off;
SET
regression=# explain select count(*) from tenk1 where hundred < 90 and thousand < 31;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Aggregate  (cost=375.34..375.35 rows=1 width=8)
   ->  Bitmap Heap Scan on tenk1  (cost=6.75..374.62 rows=287 width=0)
         Recheck Cond: (thousand < 31)
         Filter: (hundred < 90)
         ->  Bitmap Index Scan on tenk1_thous_tenthous  (cost=0.00..6.68 rows=319 width=0)
               Index Cond: (thousand < 31)
(6 rows)

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
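The selection rule described above can be sketched as a simple covering check
(self-contained C, with a hypothetical representation of clauses as integer
ids, not the actual choose_bitmap_and code): a path combination qualifies for
the count(*) optimization only if its index conditions cover every
restriction clause, since any clause left over forces a heap recheck.

#include <stdbool.h>

/*
 * Return true if every restriction clause is enforced by some index
 * condition of the candidate bitmap path combination.
 */
static bool
covers_all_restrictions(const int *restriction_ids, int nrestrictions,
						const int *index_cond_ids, int nindexconds)
{
	for (int i = 0; i < nrestrictions; i++)
	{
		bool		covered = false;

		for (int j = 0; j < nindexconds; j++)
		{
			if (restriction_ids[i] == index_cond_ids[j])
			{
				covered = true;
				break;
			}
		}

		if (!covered)
			return false;		/* this clause would need a heap recheck */
	}

	return true;
}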
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
On 12.04.2017 15:04, Tom Lane wrote:
> Andrew Gierth writes:
>> "Alexander" == Alexander Kuzmenkov writes:
>>  Alexander> Structurally, the patch consists of two major parts: a
>>  Alexander> specialized executor node
>>
>> Why? It strikes me that the significant fact here is not that we're doing
>> count(*), but that we don't need any columns from the bitmap heap scan
>> result. Rather than creating a whole new node, can't the existing bitmap
>> heapscan be taught to skip fetching the actual table page in cases where
>> it's all-visible, not lossy, and no columns are needed?
>
> +1 ... while I hadn't actually looked at the code, it seemed to me that
> anything like the optimization-as-described would be impossibly klugy from
> the planner's standpoint. Your formulation sounds lots nicer.
>
> Detecting that no columns are needed in the executor might be a bit tricky
> because of the planner's habit of inserting a "physical tlist" to avoid a
> projection step. (See also nearby discussion about custom scan planning.)
> But we could fix that. I think one rule that would make sense is to just
> disable the physical-tlist substitution if the relation's targetlist is
> empty --- it wouldn't be buying much in such a case anyhow. Then the runtime
> tlist for the scan node would also be empty, and away you go.

When making an early prototype of this optimization, I did what you are
describing with the bitmap heap scan executor node. In an internal review, it
was said that the bitmap heap scan node is already complicated enough and
shouldn't have more logic added to it, so I rewrote it as a separate node. To
me, your approach looks good too, so if the community is generally in favor
of this, I could rewrite the executor as such.

With the planner, the changes are more complex. Two things must be done
there. First, when the tlist is empty, we must use a different cost function
for the bitmap heap scan, because the heap access pattern is different.
Second, choose_bitmap_and() must use a different algorithm for choosing the
right combination of paths. The standard algorithm chooses the combination
based on cost. For count(*) purposes, the decisive factor is that the path
has to check all the restrictions, or else we'll need heap access to recheck
some of them, which defeats the purpose of having this optimization.

The planner code that builds and costs the index path is fairly complex, and
integrating this additional behavior into it didn't look good to me. Instead,
I created a specialized path node and isolated the logic that handles it. The
resulting implementation adds several functions, but it is mostly
self-contained and has a single entry point in grouping_planner(). It handles
the specific case of bitmap count plans and doesn't complicate the existing
code any further.

The planner part is to some extent independent of whether we use a separate
executor node or not. If we choose not to, the same count(*) optimization
code I proposed could create plain bitmap heap scan paths.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
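To make the executor behaviour described above concrete, here is a condensed,
self-contained C sketch of the counting loop (simplified stand-in types and a
hypothetical per-page recheck callback, not the actual nodeBitmapCount.c
code): exact, all-visible bitmap pages are counted straight from the bitmap,
and everything else falls back to fetching the heap page.

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
	int			ntuples;		/* exact tuple count, or -1 if the page is lossy */
	bool		all_visible;	/* according to the visibility map */
} BitmapPage;

/* Hypothetical callback: fetch the heap page, recheck quals, count tuples. */
typedef int64_t (*count_heap_page_fn) (const BitmapPage *page);

static int64_t
count_bitmap_result(const BitmapPage *pages, int npages,
					count_heap_page_fn count_heap_page)
{
	int64_t		count = 0;

	for (int i = 0; i < npages; i++)
	{
		const BitmapPage *page = &pages[i];

		if (page->ntuples >= 0 && page->all_visible)
			count += page->ntuples;			/* counted from the bitmap alone */
		else
			count += count_heap_page(page);	/* lossy or not all-visible */
	}

	return count;
}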
[HACKERS] index-only count(*) for indexes supporting bitmap scans
Hi,

I would like to propose a patch that speeds up queries of the form
'select count(*) ... where ...', where the restriction clauses can be
satisfied by some indexes.

At the moment, such queries use index-only scans followed by aggregation.
Index-only scans are only possible for indexes that are capable of returning
indexed tuples, that is, support the 'amgettuple' access method. They are not
applicable to indexes such as GIN and RUM. However, it is possible to improve
count(*) performance for indexes that support bitmap scans. Having performed
a bitmap index scan or a combination of such, the bits in the bitmap can be
counted to obtain the final result. Of course, the bitmap pages that are
lossy or not visible to all existing transactions will still require heap
access.

One kind of application that can benefit from this change is full-text search
with pagination. To show a search results page, the application has to know
the results that go to the current page, and the total number of results.
Getting one page is fast when the desired sorting order can be provided by an
index. For example, results can be sorted by date with a separate btree
index, or by relevance with a RUM index. However, getting the total number of
results is more difficult. With text search indexes, it requires a bitmap
heap scan, which can be rather slow due to the obligatory heap access. A
well-known hack for this is using the approximate data from 'explain'
results. The proposed change allows the user to obtain the precise number of
results in an efficient and idiomatic manner.

The performance of this approach was tested on an archive of the pgsql-hackers
mailing list. The detailed results for two sample queries can be found in the
attached file 'benchmark.txt'. The first test demonstrates full-text search
with a RUM index, ordering the results by rank. For a query with low
selectivity, getting the top results is much faster than counting them all
with a bitmap heap scan. With a bitmap count execution plan, the results can
be counted much faster. A similar test is done with a GIN index, where the
results are restricted and ordered by date using another btree index. Again,
it shows a significant speedup for the count(*) query with a bitmap count
scan compared to a bitmap heap scan. These results demonstrate that the
bitmap count plan can indeed be useful for full-text search scenarios.

Structurally, the patch consists of two major parts: a specialized executor
node and the generation of corresponding paths and plans. The optimizer
behavior here is similar to that of the min/max aggregate optimization. The
main entry point is preprocess_count_aggs(); it is called by
grouping_planner(). It searches for a count(*) expression in the target list
and, if possible, generates a special path for it (struct BitmapCountPath).
This path is then added to the UPPERREL_GROUP_AGG upperrel, and competes with
other paths at the later stages of planning. The executor node
(nodeBitmapCount.c) is similar to the bitmap heap scan node, with the main
difference being that it does not access the heap for tuples that are known
to satisfy the restriction and to be visible to all transactions.

This patch has some important limitations:

* It only supports a targetlist consisting of a single expression that can be
  projected from count(*).

* count(expr) is not supported. We could support it for cases where the
  "expr is not null" restriction can be satisfied with an index.

* The current implementation does not support parallel execution. It could be
  implemented during the PostgreSQL 11 release cycle.

* For some indexes, the bitmap index scan will always require rechecking all
  the tuples. Bitmap count plans should not be used in such cases. For now,
  this check is not implemented.

I would be glad to hear your comments on this patch.

Regards,
Alexander Kuzmenkov

=== The data

test1=# \d pglist
                        Table "public.pglist"
   Column   |            Type             | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+---------
 id         | integer                     |           |          |
 sent       | timestamp without time zone |           |          |
 subject    | text                        |           |          |
 author     | text                        |           |          |
 body_plain | text                        |           |          |
 fts        | tsvector                    |           |          |
Indexes:
    "idx_pglist_rum_fts" rum (fts)
    "idx_pglist_fts" gin (fts)
    "idx_pglist_sent" btree (sent)

test1=# select min(sent), max(sent), count(*) from pglist;
 min | max | count
-----+-----+-------