Re: Synchronizing slots from primary to standby

2022-02-23 Thread James Coleman
On Fri, Feb 18, 2022 at 5:23 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-02-11 15:28:19 +0100, Peter Eisentraut wrote:
> > On 05.02.22 20:59, Andres Freund wrote:
> > > On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote:
> > > >  From ec00dc6ab8bafefc00e9b1c78ac9348b643b8a87 Mon Sep 17 00:00:00 2001
> > > > From: Peter Eisentraut
> > > > Date: Mon, 3 Jan 2022 14:43:36 +0100
> > > > Subject: [PATCH v3] Synchronize logical replication slots from primary 
> > > > to
> > > >   standby
> > > I've just skimmed the patch and the related threads. As far as I can tell 
> > > this
> > > cannot be safely used without the conflict handling in [1], is that 
> > > correct?
> >
> > This or similar questions have been asked a few times about this or similar
> > patches, but they always come with some doubt.
>
> I'm certain it's a problem - the only reason I couched it was that there could
> have been something clever in the patch preventing problems that I missed
> because I just skimmed it.
>
>
> > If we think so, it would be
> > useful perhaps if we could come up with test cases that would demonstrate
> > why that other patch/feature is necessary.  (I'm not questioning it
> > personally, I'm just throwing out ideas here.)
>
> The patch as-is just breaks one of the fundamental guarantees necessary for
> logical decoding, that no rows versions can be removed that are still required
> for logical decoding (signalled via catalog_xmin). So there needs to be an
> explicit mechanism upholding that guarantee, but there is not right now from
> what I can see.

I've been working on adding test coverage to prove this out, but I've
encountered the problem reported in [1].

My assumption (Andres, please correct me if I'm wrong) is that we
should see issues with the following steps (given the primary,
physical replica, and logical subscriber already created in the test):

1. Ensure both logical subscriber and physical replica are caught up
2. Disable logical subscription
3. Make a catalog change on the primary (currently renaming the
primary key column)
4. Vacuum pg_class
5. Ensure physical replication is caught up
6. Stop primary and promote the replica
7. Write to the changed table
8. Update subscription to point to promoted replica
9. Re-enable logical subscription

I'm attaching my test as an additional patch in the series for
reference. Currently I have steps 3 and 4 commented out to show that
the issues in [1] occur without any attempt to trigger the catalog
xmin problem.
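
For concreteness, steps 3 and 4 amount to something like the following
(table and column names here are placeholders; the real commands are in
the attached test patch):

alter table test_tbl rename column id to id_renamed;  -- catalog change on the primary
vacuum pg_class;  -- try to remove the now-dead catalog row versions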

This error seems pretty significant in terms of indicating a
fundamental lack of test coverage (the primary stated benefit of the
patch is physical failover), and it is currently a blocker to testing
more deeply.

Thanks,
James Coleman

1: 
https://www.postgresql.org/message-id/TYCPR01MB684949EA7AA904EE938548C79F3A9%40TYCPR01MB6849.jpnprd01.prod.outlook.com


v4-0002-Add-failover-test.patch
Description: Binary data


v4-0001-Synchronize-logical-replication-slots-from-primar.patch
Description: Binary data


Consider parallel for lateral subqueries with limit

2020-11-30 Thread James Coleman
I've been investigating parallelizing certain correlated subqueries,
and during that work stumbled across the fact that
set_rel_consider_parallel disallows parallel query on what seems like
a fairly simple case.

Consider this query:

select t.unique1
from tenk1 t
join lateral (select t.unique1 from tenk1 offset 0) l on true;

Current set_rel_consider_parallel sets consider_parallel=false on the
subquery rel because it has a limit/offset. That restriction makes a
lot of sense when we have a subquery whose results conceptually need
to be "shared" (or at least be the same) across multiple workers
(indeed the relevant comment in that function notes that cases where
we could prove a unique ordering would also qualify, but punts on
implementing that due to complexity). But if the subquery is LATERAL,
then no such conceptual restriction applies.
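
For contrast, consider a hypothetical non-LATERAL version (this query is
purely illustrative, not from the patch's tests):

select t.unique1
from tenk1 t
join (select unique1 from tenk1 limit 10) l on l.unique1 = t.unique1;

Here the limited subquery has no ORDER BY, so workers evaluating it
independently could legitimately pick different rows; its result
conceptually has to be computed once and shared. A LATERAL subquery is
conceptually re-evaluated for each outer row anyway, so that concern
doesn't apply.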

If we change the code slightly to allow considering parallel query
even in the face of LIMIT/OFFSET for LATERAL subqueries, then our
query above changes from the following plan:

 Nested Loop
   Output: t.unique1
   ->  Gather
         Output: t.unique1
         Workers Planned: 2
         ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
               Output: t.unique1
   ->  Gather
         Output: NULL::integer
         Workers Planned: 2
         ->  Parallel Index Only Scan using tenk1_hundred on public.tenk1
               Output: NULL::integer

to this plan:

 Gather
   Output: t.unique1
   Workers Planned: 2
   ->  Nested Loop
         Output: t.unique1
         ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
               Output: t.unique1
         ->  Index Only Scan using tenk1_hundred on public.tenk1
               Output: NULL::integer

The code change itself is quite simple (1 line). As far as I can tell
we don't need to expressly check parallel safety of the limit/offset
expressions; that appears to happen elsewhere (and that makes sense
since the RTE_RELATION case doesn't check those clauses either).
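
As an example of the kind of case I mean (a made-up query just for
illustration; random() is parallel restricted):

select t.unique1
from tenk1 t
join lateral (select t.unique1 from tenk1 offset (random() * 10)::int) l on true;

The offset expression here shouldn't be evaluated in a worker, but that
appears to be handled by the existing parallel-safety checks on the
subquery's expressions rather than by this function.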

If I'm missing something about the safety of this (or any other
issue), I'd appreciate the feedback.

James
From 0aff5f1b5e35e37a311c01e9f53caf6e088e8d43 Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Mon, 30 Nov 2020 11:36:35 -0500
Subject: [PATCH v1] Allow parallel LATERAL subqueries with LIMIT/OFFSET

The code that determined whether or not a rel should be considered for
parallel query excluded subqueries with LIMIT/OFFSET. That's correct in
the general case: as the comment notes that'd mean we have to guarantee
ordering (and claims it's not worth checking that) for results to be
consistent across workers. However there's a simpler case that hasn't
been considered: LATERAL subqueries with LIMIT/OFFSET don't fall under
the same reasoning since they're executed (when not converted to a JOIN)
per tuple anyway, so consistency of results across workers isn't a
factor.
---
 src/backend/optimizer/path/allpaths.c |  4 +++-
 src/test/regress/expected/select_parallel.out | 15 +++
 src/test/regress/sql/select_parallel.sql  |  6 ++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 84a69b064a..3c9313b5a9 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -686,11 +686,13 @@ set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
 			 * inconsistent results at the top-level.  (In some cases, where
 			 * the result is ordered, we could relax this restriction.  But it
 			 * doesn't currently seem worth expending extra effort to do so.)
+			 * LATERAL is an exception: LIMIT/OFFSET is safe to execute within
+			 * workers since the sub-select is executed per tuple
 			 */
 			{
 Query	   *subquery = castNode(Query, rte->subquery);
 
-if (limit_needed(subquery))
+if (!rte->lateral && limit_needed(subquery))
 	return;
 			}
 			break;
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 9b0c418db7..9ba40ca2c5 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -1042,6 +1042,21 @@ explain (costs off)
Filter: (stringu1 ~~ '%'::text)
 (11 rows)
 
+-- ...unless it's LATERAL
+savepoint settings;
+set parallel_tuple_cost=0;
+explain (costs off) select t.unique1 from tenk1 t
+join lateral (select t.unique1 from tenk1 offset 0) l on true;
+ QUERY PLAN  
+-
+ Gather
+   Workers Planned: 4
+   ->  Nested Loop
+ ->  Parallel Index Only Scan using tenk1_unique1 on tenk1 t
+ ->  Index Only Scan using tenk1_hundred on tenk1
+(5 rows)
+
+rollback to savepoint settings;
 -- to increase the parallel query test coverage
 SAVEPOINT settings;
 SET LOCAL force_parallel_mode = 1;
diff 

Re: enable_incremental_sort changes query behavior

2020-11-30 Thread James Coleman
On Tue, Nov 3, 2020 at 4:39 PM Tomas Vondra
 wrote:
> I've pushed the 0001 part, i.e. the main fix. Not sure about the other
> parts (comments and moving the code back to postgres_fdw) yet.

I noticed the CF entry [1] got moved to the next CF; I'm thinking this
entry should be marked as committed since the fix for the initial bug
reported on this thread has been pushed. We have the parallel safety
issue outstanding, but there's a separate thread and patch for that,
so I'll make a new CF entry for that.

I can mark it as committed, but I'm not sure how to "undo" (or if
that's desirable) the move to the next CF.

Thoughts?

James

1: https://commitfest.postgresql.org/30/2754/




Re: [DOC] Document concurrent index builds waiting on each other

2020-11-30 Thread James Coleman
On Mon, Nov 30, 2020 at 4:53 PM Alvaro Herrera  wrote:
>
> On 2020-Sep-30, Michael Paquier wrote:
>
> > +  
> > +   CREATE INDEX (including the 
> > CONCURRENTLY
> > +   option) commands are included when VACUUM calculates 
> > what
> > +   dead tuples are safe to remove even on tables other than the one being 
> > indexed.
> > +  
> > FWIW, this is true as well for REINDEX CONCURRENTLY because both use
> > the same code paths for index builds and validation, with basically
> > the same waiting phases.  But is CREATE INDEX the correct place for
> > that?  Wouldn't it be better to tell about such things on the VACUUM
> > doc?
>
> Yeah, I think it might be more sensible to document this in
> maintenance.sgml, as part of the paragraph that discusses removing
> tuples "to save space".  But making it inline with the rest of the flow,
> it seems to distract from higher-level considerations, so I suggest to
> make it a footnote instead.

I have mixed feelings about wholesale moving it; users aren't likely
to read the vacuum doc when considering how running CIC might impact
their system, though I do understand why it otherwise fits there. Even
if the primary details are in the vacuum, I tend to think a reference
note (or link to the vacuum docs) in the create index docs would be
useful. The principle here is that 1.) vacuum is automatic/part of the
background of the system, not just something people trigger manually,
and 2.) we ought to document things where the user action triggering
the behavior is documented.

> I'm not sure on the wording to use; what about this?

The wording seems fine to me.

This is a replacement for what was 0002 earlier? And 0001 from earlier
still seems to be a useful standalone patch?

James




Re: Why does create_gather_merge_plan need make_sort?

2020-11-30 Thread James Coleman
On Sun, Nov 29, 2020 at 4:10 PM Tomas Vondra
 wrote:
>
>
>
> On 11/29/20 3:44 PM, James Coleman wrote:
> > On Mon, Nov 23, 2020 at 8:19 AM James Coleman  wrote:
> >>
> >> ..
> >>
> >> So I'm +1 on turning it into an ERROR log instead, since it seems to
> >> me that encountering this case would almost certainly represent a bug
> >> in path generation.
> >
> > Here's a patch to do that.
> >
>
> Thanks. Isn't the comment incomplete? Should it mention just parallel
> safety or also volatility?

Volatility makes it parallel unsafe, and I'm not sure I want to get
into listing every possible issue here, but in the interest of making
it more obviously representative of the kinds of issues we might
encounter, I've tweaked it to mention volatility also, as well as
refer to the issues as "examples" of such concerns.

James
From b4b08b70d8961ea29587412e9a2ef4dd39111ff0 Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Sun, 29 Nov 2020 09:38:59 -0500
Subject: [PATCH v2] Error if gather merge paths aren't sufficiently sorted

---
 src/backend/optimizer/plan/createplan.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 40abe6f9f6..5ecf9f4065 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1793,13 +1793,15 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 		 &gm_plan->nullsFirst);
 
 
-	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
+	/*
+	 * All gather merge paths should have already guaranteed the necessary sort
+	 * order either by adding an explicit sort node or by using presorted input.
+	 * We can't simply add a sort here on additional pathkeys, because we can't
+	 * guarantee the sort would be safe. For example, expressions may be
+	 * volatile or otherwise parallel unsafe.
+	 */
 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
-		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
-	 gm_plan->sortColIdx,
-	 gm_plan->sortOperators,
-	 gm_plan->collations,
-	 gm_plan->nullsFirst);
+		elog(ERROR, "gather merge input not sufficiently sorted");
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
-- 
2.17.1



Re: Fix generate_useful_gather_paths for parallel unsafe pathkeys

2020-11-30 Thread James Coleman
On Sun, Nov 29, 2020 at 4:20 PM Tomas Vondra
 wrote:
>
> On 11/29/20 3:26 PM, James Coleman wrote:
> > On Mon, Nov 23, 2020 at 8:35 AM James Coleman  wrote:
> >>
> >> On Sun, Nov 22, 2020 at 4:59 PM Tomas Vondra
> >>  wrote:
> >>> ...
> >>> Thanks for the patches, I'll take a closer look next week. The 0002
> >>> patch is clearly something we need to backpatch, not sure about 0001 as
> >>> it essentially enables new behavior in some cases (Sort on unsorted
> >>> paths below Gather Merge).
> >>
> >> Thanks for taking a look.
> >>
> >> I had the same question re: backporting. On the one hand it will
> >> change behavior (this is always a bit of a gray area since in one
> >> sense bugs change behavior), but IMO it's not a new feature, because
> >> the code clearly intended to have that feature in the first place (it
> >> creates gather merges on top of a full sort). So I'd lean towards
> >> considering it a bug fix, but I'm also not going to make that a hill
> >> to die on.
> >>
> >> One semi-related question: do you think we should add a comment to
> >> prepare_sort_from_pathkeys explaining that modifying it may mean
> >> find_em_expr_for_rel needs to be updated also?
> >
> > Here's an updated patch series.
> >
> > 0001 and 0002 as before, but with some minor cleanup.
> >
>
> 0001 - seems fine
>
> 0002 - Should we rename the "parallel" parameter to something more
> descriptive, like "require_parallel_safe"?

Done.

> > 0003 disallows SRFs in these sort expressions (per discussion over at [1]).
> >
>
> OK, but I'd move the define from tlist.c to tlist.h, not optimizer.h.

IS_SRF_CALL doesn't really seem to be particularly specific
conceptually to target lists (and of course now we're using it in
equivclass.c), so I'd tried to put it somewhere more general. Is
moving it to tlist.h mostly to minimize changes? Or do you think it
fits more naturally with the tlist code (I might be missing
something)?

> > 0004 removes the search through the target for matching volatile
> > expressions (per discussion at [2]).
> >
>
> OK
>
> > 0005 adds the comment I mentioned above clarifying these two functions
> > are linked.
> >
>
> Makes sense. I wonder if we need to be more specific about how
> consistent those two places need to be. Do they need to match 1:1, or
> how do people in the future decide which restrictions need to be in both
> places?

At this point I'm not sure I'd have a good way to describe that: one
is a clear superset of the other (and we already talk in the comments
about the other being a subset), but it's really going to be about
"what do we want to allow to be sorted proactively"...hmm, that
thought made me take a swing at it; let me know what you think.

James
From 3cc9d13869e25b546bf113d57e5015b5ab2c2abf Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Wed, 25 Nov 2020 15:46:00 -0500
Subject: [PATCH v3 3/5] Disallow SRFs in proactive sort

So they can be evaluated at the proper time as determined by
make_sort_input_target.
---
 src/backend/optimizer/path/equivclass.c|  8 
 src/backend/optimizer/util/tlist.c |  5 -
 src/include/optimizer/optimizer.h  |  5 +
 src/test/regress/expected/incremental_sort.out | 12 
 src/test/regress/sql/incremental_sort.sql  |  2 ++
 5 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 75cb1750a8..8c3ef98d73 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -835,6 +835,14 @@ find_em_expr_usable_for_sorting_rel(PlannerInfo *root, EquivalenceClass *ec, Rel
 
 		if (require_parallel_safe && !is_parallel_safe(root, (Node *) em_expr))
 			continue;
+
+		/*
+		 * Disallow SRFs so that all of them can be evaluated at the correct
+		 * time as determined by make_sort_input_target.
+		 */
+		if (IS_SRF_CALL((Node *) em_expr))
+			continue;
+
 		/*
 		 * As long as the expression isn't volatile then
 		 * prepare_sort_from_pathkeys is able to generate a new target entry,
diff --git a/src/backend/optimizer/util/tlist.c b/src/backend/optimizer/util/tlist.c
index 02a3c6b165..01cea102ea 100644
--- a/src/backend/optimizer/util/tlist.c
+++ b/src/backend/optimizer/util/tlist.c
@@ -21,11 +21,6 @@
 #include "optimizer/tlist.h"
 
 
-/* Test if an expression node represents a SRF call.  Beware multiple eval! */
-#define IS_SRF_CALL(node) \
-	((IsA(node, FuncExpr) && ((FuncExpr *) (node))->funcretset) || \
-	 (IsA(node, O

Re: Consider parallel for lateral subqueries with limit

2020-12-01 Thread James Coleman
On Mon, Nov 30, 2020 at 7:00 PM James Coleman  wrote:
>
> I've been investigating parallelizing certain correlated subqueries,
> and during that work stumbled across the fact that
> set_rel_consider_parallel disallows parallel query on what seems like
> a fairly simple case.
>
> Consider this query:
>
> select t.unique1
> from tenk1 t
> join lateral (select t.unique1 from tenk1 offset 0) l on true;
>
> Current set_rel_consider_parallel sets consider_parallel=false on the
> subquery rel because it has a limit/offset. That restriction makes a
> lot of sense when we have a subquery whose results conceptually need
> to be "shared" (or at least be the same) across multiple workers
> (indeed the relevant comment in that function notes that cases where
> we could prove a unique ordering would also qualify, but punts on
> implementing that due to complexity). But if the subquery is LATERAL,
> then no such conceptual restriction applies.
>
> If we change the code slightly to allow considering parallel query
> even in the face of LIMIT/OFFSET for LATERAL subqueries, then our
> query above changes from the following plan:
>
>  Nested Loop
>Output: t.unique1
>->  Gather
>  Output: t.unique1
>  Workers Planned: 2
>  ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
>Output: t.unique1
>->  Gather
>  Output: NULL::integer
>  Workers Planned: 2
>  ->  Parallel Index Only Scan using tenk1_hundred on public.tenk1
>Output: NULL::integer
>
> to this plan:
>
>  Gather
>Output: t.unique1
>Workers Planned: 2
>->  Nested Loop
>  Output: t.unique1
>  ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
>Output: t.unique1
>  ->  Index Only Scan using tenk1_hundred on public.tenk1
>Output: NULL::integer
>
> The code change itself is quite simple (1 line). As far as I can tell
> we don't need to expressly check parallel safety of the limit/offset
> expressions; that appears to happen elsewhere (and that makes sense
> since the RTE_RELATION case doesn't check those clauses either).
>
> If I'm missing something about the safety of this (or any other
> issue), I'd appreciate the feedback.

Note that near the end of grouping planner we have a similar check:

    if (final_rel->consider_parallel && root->query_level > 1 &&
        !limit_needed(parse))

guarding copying the partial paths from the current rel to the final
rel. I haven't managed to come up with a test case that exposes that
though since simple examples like the one above get converted into a
JOIN, so we're not in grouping_planner for a subquery. Making the
subquery above correlated results in us getting to that point, but
isn't currently marked as parallel safe for other reasons (because it
has params), so that's not a useful test. I'm not sure if there are
cases where we can't convert to a join but also don't involve params;
haven't thought about it a lot though.
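
For the record, the correlated variant I have in mind is shaped roughly
like this (a sketch rather than the exact query I tried):

select t.unique1
from tenk1 t
join lateral (
  select t2.unique1
  from tenk1 t2
  where t2.hundred = t.hundred
  offset 0
) l on true;

As noted above, that gets us into grouping_planner for the subquery, but
the lateral reference shows up as a Param, so the subquery isn't
considered parallel safe for that reason anyway.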

James




Re: [DOC] Document concurrent index builds waiting on each other

2020-12-01 Thread James Coleman
On Tue, Dec 1, 2020 at 6:51 PM Alvaro Herrera  wrote:
>
> On 2020-Nov-30, James Coleman wrote:
>
> > On Mon, Nov 30, 2020 at 4:53 PM Alvaro Herrera  
> > wrote:
> > >
> > > On 2020-Sep-30, Michael Paquier wrote:
>
> > > Yeah, I think it might be more sensible to document this in
> > > maintenance.sgml, as part of the paragraph that discusses removing
> > > tuples "to save space".  But making it inline with the rest of the flow,
> > > it seems to distract from higher-level considerations, so I suggest to
> > > make it a footnote instead.
> >
> > I have mixed feelings about wholesale moving it; users aren't likely
> > to read the vacuum doc when considering how running CIC might impact
> > their system, though I do understand why it otherwise fits there.
>
> Makes sense.  ISTM that if we want to have a cautionary blurb in the CIC docs,
> it should go in REINDEX CONCURRENTLY as well.

Agreed. Or, alternatively, a blurb something like "Please note how CIC
interacts with VACUUM ...", and then the primary language in
maintenance.sgml. That would have the benefit of maintaining the core
language in only one place.

> > > I'm not sure on the wording to use; what about this?
> >
> > The wording seems fine to me.
>
> Great, thanks.
>
> > This is a replacement for what was 0002 earlier? And 0001 from earlier
> > still seems to be a useful standalone patch?
>
> 0001 is the one that I got pushed yesterday, I think -- correct?
> src/tools/git_changelog says:
>
> Author: Alvaro Herrera 
> Branch: master [58ebe967f] 2020-11-30 18:24:55 -0300
> Branch: REL_13_STABLE [3fe0e7c3f] 2020-11-30 18:24:55 -0300
> Branch: REL_12_STABLE [b2603f16a] 2020-11-30 18:24:55 -0300
> Branch: REL_11_STABLE [ed9c9b033] 2020-11-30 18:24:55 -0300
> Branch: REL_10_STABLE [d3bd36a63] 2020-11-30 18:24:55 -0300
> Branch: REL9_6_STABLE [b3d33bf59] 2020-11-30 18:24:55 -0300
> Branch: REL9_5_STABLE [968a537b4] 2020-11-30 18:24:55 -0300
>
> Document concurrent indexes waiting on each other
>
> Because regular CREATE INDEX commands are independent, and there's no
> logical data dependency, it's not immediately obvious that transactions
> held by concurrent index builds on one table will block the second phase
> of concurrent index creation on an unrelated table, so document this
> caveat.
>
> Backpatch this all the way back.  In branch master, mention that only
> some indexes are involved.
>
> Author: James Coleman 
> Reviewed-by: David Johnston 
> Discussion: 
> https://postgr.es/m/CAAaqYe994=purn8cjz4ueo_s-ffrr_3ogeryhtdghab2wg_...@mail.gmail.com

Ah, yes, somehow I'd missed that that had been pushed.

James




Re: Change definitions of bitmap flags to bit-shifting style

2020-12-06 Thread James Coleman
On Sun, Dec 6, 2020 at 1:25 AM Michael Paquier  wrote:

> On Sat, Dec 05, 2020 at 10:31:09PM -0300, Alvaro Herrera wrote:
> > The hexadecimal representation is more natural to me than bit-shifting,
> > so I would prefer to use that style too.  But maybe I'm trained to it
> > because of looking at t_infomask symbols constantly.
>
> If we are going to change all that, hexa style sounds good to me too.
> Would it be worth an addition to the docs, say in [1] to tell that
> this is a preferred style?
>
> [1]: https://www.postgresql.org/docs/devel/source-conventions.html?
> --
> Michael



In my view the bit shifting approach makes it more obvious a single bit is
being set, but on the other hand the hex approach makes it easier to
compare in debugging.

I’m not really sure which to prefer, though I think I would have leaned
slightly towards the former.

James



Re: Consider parallel for lateral subqueries with limit

2020-12-07 Thread James Coleman
Brian's email didn't keep the relevant headers, and so didn't show up
as a reply, so I've pasted it here and am replying in this original
thread. You can find the original at [1].

On Sun, Dec 6, 2020 at 7:34 PM Brian Davis  wrote:
>
> > Note that near the end of grouping planner we have a similar check:
> >
> > if (final_rel->consider_parallel && root->query_level > 1 &&
> > !limit_needed(parse))
> >
> > guarding copying the partial paths from the current rel to the final
> > rel. I haven't managed to come up with a test case that exposes that
>
> Played around with this a bit, here's a non-correlated subquery that gets us 
> to that if statement
>
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (bar int);
>
> INSERT INTO foo (bar)
> SELECT
>   g
> FROM
>   generate_series(1, 1) AS g;
>
>
> SELECT
>   (
> SELECT
>   bar
> FROM
>   foo
> LIMIT 1
>   ) AS y
> FROM
>   foo;

Thanks. That triggers the parallel "if" case for limit -- did any
additional GUCs need to be modified to get that result? I assume a
parallel plan isn't chosen regardless (even if you remove the
limit_needed check)?

> I also was thinking about the LATERAL part.
>
> I couldn't think of any reason why the uncorrelated subquery's results would 
> need to be shared and therefore the same, when we'll be "looping" over each 
> row of the source table, running the subquery anew for each, conceptually.
>
> But then I tried this...
>
> test=# CREATE TABLE foo (bar int);
> CREATE TABLE
> test=#
> test=# INSERT INTO foo (bar)
> test-# SELECT
> test-#   g
> test-# FROM
> test-#   generate_series(1, 10) AS g;
> INSERT 0 10
> test=#
> test=#
> test=# SELECT
> test-#   foo.bar,
> test-#   lat.bar
> test-# FROM
> test-#   foo JOIN LATERAL (
> test(# SELECT
> test(#   bar
> test(# FROM
> test(#   foo AS foo2
> test(# ORDER BY
> test(#   random()
> test(# LIMIT 1
> test(#   ) AS lat ON true;
>  bar | bar
> -+-
>1 |   7
>2 |   7
>3 |   7
>4 |   7
>5 |   7
>6 |   7
>7 |   7
>8 |   7
>9 |   7
>   10 |   7
> (10 rows)
>
>
> As you can see, random() is only called once.  If postgres were supposed to 
> be running the subquery for each source row, conceptually, it would be a 
> mistake to cache the results of a volatile function like random().

This was genuinely surprising to me. I think one could argue that this
is just an optimization (after all -- if there is no correlation, then
running it once is conceptually/safely the same as running it multiple
times), but that doesn't seem to hold water with the volatile function
in play.

Of course, given the volatile function we'd never parallelize this
anyway. But we still have to consider the case where the result is
otherwise ordered differently between workers (just by virtue of disk
order, for example).

I've tried the above query using tenk1 from the regress tests to get a
bit more data, and, along with modifying several GUCs, can force
parallel plans. However in no case can I get it to execute that
uncorrelated lateral in multiple workers. That makes me suspicious
that there's another check in play here ensuring the lateral subquery
is executed for each group, and that in the uncorrelated case really
that rule still holds -- it's just a single group.
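
For reference, by "modifying several GUCs" I mean settings along these
lines (the exact values are just whatever it takes to coax parallel
plans on a small table):

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;
set min_parallel_index_scan_size = 0;
set max_parallel_workers_per_gather = 4;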

> The docs say: "When a FROM item contains LATERAL cross-references, evaluation 
> proceeds as follows: for each row of the FROM item providing the 
> cross-referenced column(s), or set of rows of multiple FROM items providing 
> the columns, the LATERAL item is evaluated using that row or row set's values 
> of the columns. The resulting row(s) are joined as usual with the rows they 
> were computed from. This is repeated for each row or set of rows from the 
> column source table(s)."
>
> They don't say what happens with LATERAL when there aren't cross-references 
> though.  As we expect, adding one does show random() being called once for 
> each source row.

If my theory above is correct then it's implicit that the row set is
the whole previous FROM group.

> test=# SELECT
> test-#   foo.bar,
> test-#   lat.bar
> test-# FROM
> test-#   foo JOIN LATERAL (
> test(# SELECT
> test(#   bar
> test(# FROM
> test(#   foo AS foo2
> test(# WHERE
> test(#   foo2.bar < foo.bar + 10
> test(# ORDER BY
> test(#   random()
> test(# LIMIT 1
> test(#   ) AS lat ON true;
>  bar | bar
> -+-
>1 |   5
>2 |   8
>3 |   3
>4 |   4
>5 |   5
>6 |   5
>7 |   1
>8 |   3
>9 |   7
>   10 |   3
> (10 rows)
>
> It seems like to keep the same behavior that exists today, results of LATERAL 
> subqueries would need to be the same if they aren't correlated, and so you 
> couldn't run them in parallel with a limit if the order wasn't guaranteed.  
> But I'll be the first to admit that it's easy enough for me to miss a key 
> piece of logic on something like this, so I could be way off base too.

If it weren't for the

Re: Insert Documentation - Returning Clause and Order

2020-12-12 Thread James Coleman
On Friday, December 11, 2020, David G. Johnston
 wrote:
>
> On Fri, Dec 11, 2020 at 6:24 AM Ashutosh Bapat  
> wrote:
>>
>> On Thu, Dec 10, 2020 at 7:49 PM David G. Johnston
>>  wrote:
>>
>> > Yeah, the ongoing work on parallel inserts would seem to be an issue.  We 
>> > should probably document that though.  And maybe as part of parallel 
>> > inserts patch provide a user-specifiable way to ask for such a guarantee 
>> > if needed.  ‘Insert returning ordered”
>>
>> I am curious about the usecase which needs that guarantee? Don't you
>> have a column on which you can ORDER BY so that it returns the same
>> order as INSERT?
>
>
> This comes up periodically in the context of auto-generated keys being 
> returned - specifically on the JDBC project list (maybe elsewhere...).  If 
> one adds 15 VALUES entries to an insert and then sends them in bulk to the 
> server it would be helpful if the generated keys could be matched up 
> one-to-one with the keyless objects in the client.  Basically "pipelining" 
> the client and server.

That’s a great use case. It’s not so much about ordering, per se, but
about identity.

Certainly almost every ORM, and maybe even other forms of application
code, need to be able to associate the serial column value returned
with what it inserted. I'd expect something like that (whether by
ordering explicitly or by providing some kind of mapping between
indexes in the statement data and the inserted/returned row values).
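
As a purely illustrative sketch (table name invented), the guarantee a
batching client would like to rely on is:

create table items (id serial primary key, name text);

insert into items (name)
values ('a'), ('b'), ('c')
returning id;

-- i.e. that the three returned ids correspond, in order, to 'a', 'b',
-- and 'c', so the client can match generated keys back to its objects.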

James




Re: Insert Documentation - Returning Clause and Order

2020-12-12 Thread James Coleman
On Sat, Dec 12, 2020 at 10:11 AM David G. Johnston
 wrote:
>
> On Sat, Dec 12, 2020 at 7:02 AM James Coleman  wrote:
>>
>>
>> Certainly almost every ORM, and maybe even other forms of application
>> code, need to be able to associate the serial column value returned
>> with what it inserted.
>
>
> Yet most ORM would perform single inserts at a time, not in bulk, making such 
> a feature irrelevant to them.

I think that's a pretty hasty generalization. It's the majority of use
cases in an ORM, sure, but plenty of ORMs (and libraries or
applications using them) support inserting batches where performance
requires it. Rails/ActiveRecord is actually integrating that feature
into core (though many Ruby libraries already add that support, as
does, for example, the application I spend the majority of time
working on).

James




Re: [DOC] Document concurrent index builds waiting on each other

2020-12-18 Thread James Coleman
On Tue, Dec 1, 2020 at 8:05 PM James Coleman  wrote:
>
> On Tue, Dec 1, 2020 at 6:51 PM Alvaro Herrera  wrote:
> >
> > On 2020-Nov-30, James Coleman wrote:
> >
> > > On Mon, Nov 30, 2020 at 4:53 PM Alvaro Herrera  
> > > wrote:
> > > >
> > > > On 2020-Sep-30, Michael Paquier wrote:
> >
> > > > Yeah, I think it might be more sensible to document this in
> > > > maintenance.sgml, as part of the paragraph that discusses removing
> > > > tuples "to save space".  But making it inline with the rest of the flow,
> > > > it seems to distract from higher-level considerations, so I suggest to
> > > > make it a footnote instead.
> > >
> > > I have mixed feelings about wholesale moving it; users aren't likely
> > > to read the vacuum doc when considering how running CIC might impact
> > > their system, though I do understand why it otherwise fits there.
> >
> > Makes sense.  ISTM that if we want to have a cautionary blurb in the CIC docs,
> > it should go in REINDEX CONCURRENTLY as well.
>
> Agreed. Or, alternatively, a blurb something like "Please note how CIC
> interacts with VACUUM ...", and then the primary language in
> maintenance.sgml. That would have the benefit of maintaining the core
> language in only one place.

Any thoughts on this?

James




Re: Dependency isn't created between extension and schema

2020-12-21 Thread James Coleman
On Mon, Dec 21, 2020 at 2:59 AM Michael Paquier  wrote:
>
> On Mon, Dec 21, 2020 at 04:02:29PM +0900, Masahiko Sawada wrote:
> > Is it a bug? Since the created schema obviously depends on the
> > extension when we created the schema specified in the schema option, I
> > think we might want to create the dependency so that DROP EXTENSION
> > drops the schema as well. I’ve attached the draft patch so that CREATE
> > EXTENSION creates the dependency if it newly creates the schema.
>
> FWIW, I recall that the "soft" behavior that exists now is wanted, as
> it is more flexible for DROP EXTENSION: what you are suggesting here
> has the disadvantage to make DROP EXTENSION fail if any non-extension
> object has been created on this schema, so this could be disruptive
> when it comes to some upgrade scenarios.

That's potentially an issue even for a schema created explicitly by
the extension's install script, since anyone can create an object
within that schema at any time.

It seems that the only consistent behavior choice would be to mark the
dependency when Postgres is creating the schema automatically but not
when the schema already exists.
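
Concretely (extension and schema names invented for illustration), the
distinction would be:

-- Schema that already existed: no dependency recorded, so DROP EXTENSION
-- would leave it alone.
create schema existing_schema;
create extension some_ext with schema existing_schema;

-- Schema created internally by CREATE EXTENSION (the case discussed
-- above): under the proposed patch a dependency would be recorded, so
-- DROP EXTENSION would drop new_schema as well.
create extension other_ext with schema new_schema;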

>   schema_name
>
> 
>  The name of the schema in which to install the extension's
>  objects, given that the extension allows its contents to be
>  relocated.  The named schema must already exist.
> While on it..  The docs are incorrect here.  As you say,
> CreateExtensionInternal() may internally create a schema.

Alternatively the behavior could be updated to match the docs, since
that seems like reasonable intent.

James




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-05 Thread James Coleman
On Fri, Mar 5, 2021 at 12:36 PM Fujii Masao  wrote:
>
>
>
> On 2021/03/05 22:45, David Steele wrote:
> > Hi Fujii,
> >
> > On 9/8/20 1:17 PM, James Coleman wrote:
> >> On Tue, Aug 18, 2020 at 12:25 PM Fujii Masao
> >>  wrote:
> >>> Thanks for updating the patch! But it failed to be applied to the master 
> >>> branch
> >>> cleanly because of the recent commit 0038f94387. So could you update the 
> >>> patch
> >>> again?
> >>
> >> Updated patch attached.
> >
> > Any thoughts on the updated patch?
>
> Thanks for the ping!
>
> With the patch, if hot_standby is enabled, the message
> "the database system is starting up" is output while the server is
> in PM_RECOVERY state until it reaches the consistent recovery point.
> On the other hand, if hot_standby is not enabled, the message
> "the database system is up, but hot_standby is off" is output even
> while the server is in that same situation. That is, opposite
> messages can be output for the same situation based on the setting
> of hot_standby. One message is "system is starting up", the other
> is "system is up". Isn't this rather confusing?

Do you have any thoughts on what you'd like to see the message be? I
could change the PM_RECOVERY (without hot standby enabled) to return
CAC_RECOVERY which would give us the message "the database system is
in recovery mode", but that would be a change from what that state
returns now in a way that's unrelated to the goal of the patch.

Thanks,
James




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-09 Thread James Coleman
On Tue, Mar 9, 2021 at 8:47 AM Alvaro Herrera  wrote:
>
> On 2021-Mar-07, Magnus Hagander wrote:
>
> > On Sun, Mar 7, 2021 at 3:39 PM Fujii Masao  
> > wrote:
> >
> > > > Here's an idea:
> > > >
> > > > * hot_standby=on, before reaching consistent state
> > > >FATAL:  database is not accepting connections
> > > >DETAIL:  Consistent state has not yet been reached.
> > > >
> > > > * hot_standby=off, past consistent state
> > > >FATAL:  database is not accepting connections
> > > >DETAIL:  Hot standby mode is disabled.
> > > >
> > > > * hot_standby=off, before reaching consistent state
> > > >FATAL:  database is not accepting connections
> [...]
> > > >DETAIL:  Hot standby mode is disabled.
>
> > > I prefer the former message. Because the latter message meams that
> > > we need to output the different messages based on whether the consistent
> > > state is reached or not, and the followings would be necessary to 
> > > implement
> > > that. This looks a bit overkill to me against the purpose, at least for 
> > > me.
> >
> > Agreed. If hot standby is off, why would the admin care about whether
> > it's consistent yet or not?
>
> Great, so we're agreed on the messages to emit.  James, are you updating
> your patch, considering Fujii's note about the new signal and pmstate
> that need to be added?

Perhaps I'm missing something, but I was under the impression the
"prefer the former message" meant we were not adding a new signal and
pmstate?

James




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-09 Thread James Coleman
On Tue, Mar 9, 2021 at 9:07 AM Alvaro Herrera  wrote:
>
> On 2021-Mar-09, James Coleman wrote:
>
> > On Tue, Mar 9, 2021 at 8:47 AM Alvaro Herrera  
> > wrote:
> > >
> > > On 2021-Mar-07, Magnus Hagander wrote:
> > >
> > > > On Sun, Mar 7, 2021 at 3:39 PM Fujii Masao 
> > > >  wrote:
> > >
> > > Great, so we're agreed on the messages to emit.  James, are you updating
> > > your patch, considering Fujii's note about the new signal and pmstate
> > > that need to be added?
> >
> > Perhaps I'm missing something, but I was under the impression the
> > "prefer the former message" meant we were not adding a new signal and
> > pmstate?
>
> Eh, I read that differently.  I was proposing two options for the DETAIL
> line in that case:
>
> >  DETAIL:  Hot standby mode is disabled.
> >  or maybe
> >  DETAIL:  Consistent state has not yet been reached, and hot standby mode 
> > is disabled.
>
> and both Fujii and Magnus said they prefer the first option over the
> second option.  I don't read any of them as saying that they would like
> to do something else (including not doing anything).
>
> Maybe I misinterpreted them?

Yes, I think they both agreed on the "DETAIL:  Hot standby mode is
disabled." message, but that alternative meant not needing to add any
new signals and pm states, correct?

Thanks,
James




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-09 Thread James Coleman
On Tue, Mar 9, 2021 at 9:17 AM Alvaro Herrera  wrote:
>
> On 2021-Mar-09, James Coleman wrote:
>
> > Yes, I think they both agreed on the "DETAIL:  Hot standby mode is
> > disabled." message, but that alternative meant not needing to add any
> > new signals and pm states, correct?
>
> Ah, I see!  I was thinking that you still needed the state and signal in
> order to print the correct message in hot-standby mode, but that's
> (obviously!) wrong.  So you're right that no signal/state are needed.

Cool. And yes, I'm planning to update the patch soon.

Thanks,
James




Re: enable_incremental_sort changes query behavior

2020-10-30 Thread James Coleman
On Thu, Oct 29, 2020 at 6:06 PM Tomas Vondra
 wrote:
>
> On Tue, Oct 06, 2020 at 09:37:31AM -0400, James Coleman wrote:
> >On Tue, Oct 6, 2020 at 9:28 AM Jaime Casanova
> > wrote:
> >> Can you please create an entry in the commitfest for this one so we
> >> don't lose track of it?
> >
>
> We're not too far from the next minor release, so I've been looking at
> this fix again and I intend to get it committed shortly (on Monday or
> so). Attached is a slightly modified version of the v4 patch that:
>
> (a) Removes comments about projections added to code that is not
> directly related to the fix. I'm not against adding such comments
> separately.

Seems fine. Do you want to commit them at the same time (just a
separate commit)? Or have a separate patch? It seems a bit overkill to
start a new thread just for those.

> (b) Fixes comment in expected output of incremental_sort test.

Thanks.

> (c) Removes else branch from find_em_expr_usable_for_sorting_rel. It's
> not quite needed thanks to the "return" in the "if" branch. IMO this
> makes it more elegant.

No objection.

> I do have two questions about find_em_expr_usable_for_sorting_rel.
>
> (a) Where does the em_is_const check come from? The comment claims we
> should not be trying to sort by equivalence class members, but I can't
> convince myself it's actually true and I don't see it discussed in this
> thread.

That comes from find_ec_member_for_tle which is called by
prepare_sort_from_pathkeys which we note in the comments contains the
set of rules we need to mirror.

> (b) In find_em_expr_for_rel, which was what was used before, the
> condition on relids was this:
>
>  if (bms_is_subset(em->em_relids, rel->relids) &&
>      !bms_is_empty(em->em_relids))
>  {
>      return em->em_expr;
>  }
>
> but here we're using only
>
>  if (!bms_is_subset(em->em_relids, rel->relids))
>      continue;
>
> Isn't this missing the second bms_is_empty condition?

I definitely remember intentionally removing that condition. For one,
it's not present in prepare_sort_from_pathkeys. My memory is a bit
fuzzy on all of the whys, but isn't it the case that if we've already
excluded const expressions, that we must have vars (and therefore
rels) present, and therefore the relids bms must be non-empty?
Probably more importantly, we're going to check that an actual em expr
matches, which means the bms subset check is really just an
optimization to avoid unnecessarily scanning the exprs for equality.
Perhaps it's worth adding a comment saying as such?

By the way, the fact that this is parallel to
prepare_sort_from_pathkeys is the reason the XXX comment is still
there about possibly needing to remove relabel types (since
prepare_sort_from_pathkeys does that). I don't think it's a hard
requirement: the worst case is that by not digging into relabel types
we're being unnecessarily strict.

Both of these I can add if desired.

James




Re: enable_incremental_sort changes query behavior

2020-10-30 Thread James Coleman
On Fri, Oct 30, 2020 at 5:03 PM Tomas Vondra
 wrote:
>
> On Fri, Oct 30, 2020 at 01:26:08PM -0400, James Coleman wrote:
> >On Thu, Oct 29, 2020 at 6:06 PM Tomas Vondra
> > wrote:
> >>
> >> On Tue, Oct 06, 2020 at 09:37:31AM -0400, James Coleman wrote:
> >> >On Tue, Oct 6, 2020 at 9:28 AM Jaime Casanova
> >> > wrote:
> >> >> Can you please create an entry in the commitfest for this one so we
> >> >> don't lose track of it?
> >> >
> >>
> >> We're not too far from the next minor release, so I've been looking at
> >> this fix again and I intend to get it committed shortly (on Monday or
> >> so). Attached is a slightly modified version of the v4 patch that:
> >>
> >> (a) Removes comments about projections added to code that is not
> >> directly related to the fix. I'm not against adding such comments
> >> separately.
> >
> >Seems fine. Do you want to commit them at the same time (just a
> >separate commit)? Or have a separate patch? It seems a bit overkill to
> >start a new thread just for those.
> >
>
> Probably sometime later, as a separate patch. I haven't thought very
> much about those comments, it just seemed unrelated to the fix so I've
> removed that for now. Let's not bother with this until after the minor
> release.

For whatever it's worth, I didn't originally consider them unrelated;
the purpose was to explain why those places were safe relative to the
same kinds of issues under discussion here: whether an expr will be in
the target and when it might need to be added.

EIther way I've split it into a separate patch in this series.

> >> (b) Fixes comment in expected output of incremental_sort test.
> >
> >Thanks.
> >
> >> (c) Removes else branch from find_em_expr_usable_for_sorting_rel. It's
> >> not quite needed thanks to the "return" in the "if" branch. IMO this
> >> makes it more elegant.
> >
> >No objection.
> >
> >> I do have two questions about find_em_expr_usable_for_sorting_rel.
> >>
> >> (a) Where does the em_is_const check come from? The comment claims we
> >> should not be trying to sort by equivalence class members, but I can't
> >> convince myself it's actually true and I don't see it discussed in this
> >> thread.
> >
> >That comes from find_ec_member_for_tle which is called by
> >prepare_sort_from_pathkeys which we note in the comments contains the
> >set of rules we need to mirror.
> >
>
> Thanks for the pointer. I'll take a look.
>
> >> (b) In find_em_expr_for_rel, which was what was used before, the
> >> condition on relids was this:
> >>
> >>  if (bms_is_subset(em->em_relids, rel->relids) &&
> >>  !bms_is_empty(em->em_relids))
> >>  {
> >>  return em->em_expr;
> >>  }
> >>
> >> but here we're using only
> >>
> >>  if (!bms_is_subset(em->em_relids, rel->relids))
> >>  continue;
> >>
> >> Isn't this missing the second bms_is_empty condition?
> >
> >I definitely remember intentionally removing that condition. For one,
> >it's not present in prepare_sort_from_pathkeys. My memory is a bit
> >fuzzy on all of the whys, but isn't it the case that if we've already
> >excluded const expressions, that we must have vars (and therefore
> >rels) present, and therefore the relids bms must be non-empty?
> >Probably more importantly, we're going to check that an actual em expr
> >matches, which means the bms subset check is really just an
> >optimization to avoid unnecessarily scanning the exprs for equality.
> >Perhaps it's worth adding a comment saying as such?
> >
> >By the way, the fact that this is parallel to
> >prepare_sort_from_pathkeys is the reason the XXX comment is still
> >there about possibly needing to remove relabel types (since
> >prepare_sort_from_pathkeys does that). I don't think it's a hard
> >requirement: the worst case is that by not digging into relabel types
> >we're being unnecessarily strict.
> >
> >Both of these I can add if desired.
> >
>
> Yeah, it'd be good to explain the reasoning why it's fine to have the
> conditions like this. Thanks.

I was looking at this some more, and I'm still convinced that this is
correct, but I don't think a comment about it being an optimization

Re: Use of "long" in incremental sort code

2020-11-04 Thread James Coleman
On Tue, Nov 3, 2020 at 4:42 PM Tomas Vondra
 wrote:
>
> On Tue, Nov 03, 2020 at 03:53:53AM +0100, Tomas Vondra wrote:
> >Hi,
> >
> >I took another look at this, and 99% of the patch (the fixes to sort
> >debug messages) seems fine to me.  Attached is the part I plan to get
> >committed, including commit message etc.
> >
>
> I've pushed this part. Thanks for the patch, Haiying Tang.
>
> >
> >The one change I decided to remove is this change in tuplesort_free:
> >
> >-   longspaceUsed;
> >+   int64   spaceUsed;
> >
> >The reason why I think this variable should be 'long' is that we're
> >using it for this:
> >
> >spaceUsed = LogicalTapeSetBlocks(state->tapeset);
> >
> >and LogicalTapeSetBlocks is defined like this:
> >
> >extern long LogicalTapeSetBlocks(LogicalTapeSet *lts);
> >
> >FWIW the "long" is not introduced by incremental sort - it used to be in
> >tuplesort_end, the incremental sort patch just moved it to a different
> >function. It's a bit confusing that tuplesort_updatemax has this:
> >
> >int64   spaceUsed;
> >
> >But I'd argue this is actually wrong, and should be "long" instead. (And
> >this actually comes from the incremental sort patch, by me.)
> >
> >
> >FWIW while looking at what the other places calling LogicalTapeSetBlocks
> >do, and I noticed this:
> >
> >uint64  disk_used = LogicalTapeSetBlocks(...);
> >
> >in the disk-based hashagg patch. So that's a third data type ...
> >
>
> IMHO this should simply switch the current int64 variable to long, as it
> was before. Not sure about about the hashagg uint64 variable.

Is there anything that actually limits tape code to using at most 4GB
on 32-bit systems?

James




Re: enable_incremental_sort changes query behavior

2020-11-17 Thread James Coleman
On Mon, Nov 16, 2020 at 11:23 PM Tomas Vondra
 wrote:
>
> Hmm, I missed that other thread. That indeed seems like a bug in the
> same area already tweaked by ebb7ae839d033d0f2 for similar cases.

I meant to bring it up in this thread before we got the other patch
committed, but I just ended up not having time to look into it.

> The attached patch fixes this simply by adding is_parallel_safe to
> get_useful_pathkeys_for_relation - that does fix the reproducer, but I'm
> not entirely sure that's the right place. Maybe it should be done in
> find_em_expr_usable_for_sorting_rel (which might make a difference if an
> EC clause can contain a mix of parallel safe and unsafe expressions). Or
> maybe we should do it in the caller (which would allow using
> get_useful_pathkeys_for_relation in contexts not requiring parallel
> safety in the future).
>
> Anyway, while this is not an "incremental sort" bug, it seems like the
> root cause is the same as for ebb7ae839d033d0f2 - one of the incremental
> sort patches started considering sorting below gather nodes, not
> realizing these possible consequences.

Yeah. I'd like to investigate a bit if really we should also be adding
it to prepare_sort_from_pathkeys (which
find_em_expr_usable_for_sorting_rel parallels) or similar, since this
seems to be a broader concern that's been previously missed (even if
the bug was otherwise not yet exposed). In particular, as I understand
it, we did do sorts below gather nodes before, just in fewer places,
which meant that "in theory" the code was already broken (or at least
not complete) for parallel queries.

James




Re: enable_incremental_sort changes query behavior

2020-11-20 Thread James Coleman
On Tue, Nov 17, 2020 at 11:20 AM Tomas Vondra
 wrote:
>
>
>
> On 11/17/20 3:03 PM, James Coleman wrote:
> > On Mon, Nov 16, 2020 at 11:23 PM Tomas Vondra
> >  wrote:
> >>
> >> Hmm, I missed that other thread. That indeed seems like a bug in the
> >> same area already tweaked by ebb7ae839d033d0f2 for similar cases.
> >
> > I meant to bring it up in this thread before we got the other patch
> > committed, but I just ended up not having time to look into it.
> >
> >> The attached patch fixes this simply by adding is_parallel_safe to
> >> get_useful_pathkeys_for_relation - that does fix the reproducer, but I'm
> >> not entirely sure that's the right place. Maybe it should be done in
> >> find_em_expr_usable_for_sorting_rel (which might make a difference if an
> >> EC clause can contain a mix of parallel safe and unsafe expressions). Or
> >> maybe we should do it in the caller (which would allow using
> >> get_useful_pathkeys_for_relation in contexts not requiring parallel
> >> safety in the future).
> >>
> >> Anyway, while this is not an "incremental sort" bug, it seems like the
> >> root cause is the same as for ebb7ae839d033d0f2 - one of the incremental
> >> sort patches started considering sorting below gather nodes, not
> >> realizing these possible consequences.
> >
> > Yeah. I'd like to investigate a bit if really we should also be adding
> > it to prepare_sort_from_pathkeys (which
> > find_em_expr_usable_for_sorting_rel parallels) or similar, since this
> > seems to be a broader concern that's been previously missed (even if
> > the bug was otherwise not yet exposed). In particular, as I understand
> > it, we did do sorts below gather nodes before, just in fewer places,
> > which meant that "in theory" the code was already broken (or at least
> > not complete) for parallel queries.
> >
>
> True. It's possible the bug was there before, but the affected plans
> were just very unlikely for some reason, and the new plans introduced by
> incremental sort just made it easier to trigger.

For additional patches on the broader topic (e.g., parallel safety) is
it preferable to keep them here or in the thread Luis started (or even
a new one besides that -- since that's on -bugs not -hackers)?

James




Re: enable_incremental_sort changes query behavior

2020-11-20 Thread James Coleman
On Fri, Nov 20, 2020 at 12:06 PM Robert Haas  wrote:
>
> On Wed, Oct 7, 2020 at 6:22 PM Tomas Vondra
>  wrote:
> > I'm still not entirely sure I understand what's happening, or what the
> > exact rule is. Consider this query:
> >
> >explain (verbose) select distinct i, t, md5(t) from ref_0;
> >
> > which on PG12 (i.e. before incremental sort) is planned like this:
> >
> >   QUERY PLAN
> > --
> >   Unique  (cost=78120.92..83120.92 rows=50 width=65)
> > Output: i, t, (md5(t))
> > ->  Sort  (cost=78120.92..79370.92 rows=50 width=65)
> >   Output: i, t, (md5(t))
> >   Sort Key: ref_0.i, ref_0.t, (md5(ref_0.t))
> >   ->  Seq Scan on public.ref_0  (cost=0.00..10282.00 rows=50 width=65)
> > Output: i, t, md5(t)
> > (7 rows)
> >
> > i.e. the (stable) function is pushed all the way to the scan node. And
> > even if we replace it with a volatile expression it gets pushed down:
> >
> > explain (verbose) select distinct i, t, md5(random()::text || t) from ref_0;
> >
> >   QUERY PLAN
> > --
> >   Unique  (cost=83120.92..88120.92 rows=50 width=65)
> > Output: i, t, (md5(((random())::text || t)))
> > ->  Sort  (cost=83120.92..84370.92 rows=50 width=65)
> >   Output: i, t, (md5(((random())::text || t)))
> >   Sort Key: ref_0.i, ref_0.t, (md5(((random())::text || ref_0.t)))
> >   ->  Seq Scan on public.ref_0  (cost=0.00..15282.00 rows=50 width=65)
> > Output: i, t, md5(((random())::text || t))
> > (7 rows)
> >
> >
> > But perhaps I just don't understand the assumption correctly?
>
> This isn't a counterexample, because there's no join tree here -- or,
> well, there is, but it's trivial, because there's only one relation
> involved. You can't have a non-Var expression computed before you
> finish all the joins, because there are no joins.
>
> What I said was: "target lists for any nodes below the top of the join
> tree were previously always just Var nodes." The topmost join allowed
> non-Var nodes before, but not lower levels.

As I understand what you're saying, the attached (from the repro case
in [1]'s discussion about parallel safety here) is a counterexample.

Specifically we have a plan like:

Merge Right Join
  -> Unique
    -> Gather Merge
      -> Sort
        -> Nested Loop

The pathtarget of the nested loop contains non-var expressions (in
this case a CASE expression).

Am I misunderstanding what you're saying?

I've attached verbose output (and the query).

James
test=# explain (verbose)
select m.id2, m.id, s.description
FROM main m
LEFT JOIN (
  SELECT DISTINCT
m.id,
CASE
WHEN m.id2 = 15 AND (SELECT name FROM secondary x WHERE x.id = s2.id AND x.id2 = 10) = md5(123::text) THEN 'description'
WHEN m.id2 = 15 THEN (SELECT name FROM secondary x WHERE x.id = s2.id AND x.id2 = 5)
END AS description
  FROM main m
  JOIN secondary s2 ON m.id = s2.id
  WHERE m.id2 = 15 and type = 0
) s ON s.id = m.id
WHERE m.id2 IN (15) and type = 0 ;

 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=126496.41..9205100.33 rows=34236 width=44)
   Output: m.id2, m.id, (CASE WHEN ((m_1.id2 = 15) AND ((SubPlan 1) = '202cb962ac59075b964b07152d234b70'::text)) THEN 'description'::text WHEN (m_1.id2 = 15) THEN (SubPlan 2) ELSE NULL::text END)
   Inner Unique: true
   Merge Cond: (m_1.id = m.id)
   ->  Unique  (cost=125495.95..9092259.53 rows=34236 width=40)
         Output: m_1.id, (CASE WHEN ((m_1.id2 = 15) AND ((SubPlan 1) = '202cb962ac59075b964b07152d234b70'::text)) THEN 'description'::text WHEN (m_1.id2 = 15) THEN (SubPlan 2) ELSE NULL::text END)
         ->  Gather Merge  (cost=125495.95..9089667.83 rows=518341 width=40)
               Output: m_1.id, (CASE WHEN ((m_1.id2 = 15) AND ((SubPlan 1) = '202cb962ac59075b964b07152d234b70'::text)) THEN 'description'::text WHEN (m_1.id2 = 15) THEN (SubPlan 2) ELSE NULL::text END)

Re: enable_incremental_sort changes query behavior

2020-11-20 Thread James Coleman
Thanks much for the detailed followup; this is super helpful.

On Fri, Nov 20, 2020 at 2:57 PM Robert Haas  wrote:
>
> On Fri, Nov 20, 2020 at 1:51 PM James Coleman  wrote:
> > > This isn't a counterexample, because there's no join tree here -- or,
> > > well, there is, but it's trivial, because there's only one relation
> > > involved. You can't have a non-Var expression computed before you
> > > finish all the joins, because there are no joins.
> > >
> > > What I said was: "target lists for any nodes below the top of the join
> > > tree were previously always just Var nodes." The topmost join allowed
> > > non-Var nodes before, but not lower levels.
> >
> > As I understand what you're saying, the attached (from the repro case
> > in [1]'s discussion about parallel safety here) is a counterexample.
> >
> > Specifically we have a plan like:
> >
> > Merge Right Join
> >   -> Unique
> > -> Gather Merge
> >   -> Sort
> > -> Nested Loop
> >
> > The pathtarget of the nested loop contains non-var expressions (in
> > this case a CASE expression).
>
> Well, in this case there are two join trees, one for the subquery and
> one for the outer query. The expressions for the subquery are computed
> at the top of the plan tree for that subquery.
>
> I guess I didn't specify before that I meant "at the top of the join
> tree for a subquery level," but I did mean that. On principle, the
> planner can't very realistically postpone evaluating a subquery's
> expressions until the top of the outer query's join tree, because, as
> in your example here, the subquery might contain an ORDER BY or
> DISTINCT clause. And anyway, if you look at the planner code, you'll
> see that every subquery is basically planned separately, unless it
> gets flattened into the parent query, so even if it were possible to
> postpone expression evaluation to outer subquery levels in some case,
> the current code structure definitely would fail to achieve that goal.

Ah, the "for a subquery level" clarification is helpful.

> However, apart from incremental sort, you can postpone evaluating a
> join tree's expressions until you reach the top of the join tree, and
> I think that's what we have always done in the past. Each baserel in
> the join tree gets a target list containing a list of vars which
> corresponds to its column list, and the joinrels get lists of vars
> which correspond to the columns from the inputs are needed at higher
> levels of the plan tree. At the top of the join tree, we stick in the
> expressions, replacing the target list of the top node in the plan
> tree; the expressions have to be computable from the inputs because
> the inputs contain all the columns that we figured out were needed at
> this level at the beginning of planning. But, if there are scans or
> joins below the top level of the plan tree, then do they bubble up to
> higher levels of the plan? Should they? It's pretty complicated.
> Theoretically if I've computed length(somelongvar) then perhaps I can
> just bubble the computed value up the plan tree, rather than the
> original column, which might be cheaper. But that only works if the
> higher levels only need length(somelongvar) and not somelongvar
> itself.

Apologies for some rambling/overlapping thoughts below; I'm trying to
absorb all of this.

A join rel (at the top of a subquery join tree) would have to somehow
know that it needs to pass up both length(somelongvar) and somelongvar
right? Since both might be needed by a higher level?

If I understand correctly you're saying that:
1. Target list entries for join rels contain vars and non-var expressions
2. Target list entries for base rels contain only vars
3. By implication the planner knows how to propagate non-var
expression results up from join rels, but not from base rels.

Point 3 is the one that would surprise me, or at least, I'd be
interested in knowing what makes the two different in that case (is it
an artificial restriction, a result of how things are structured in
code, something else?). I feel like I'm still missing context on why
this is inherently a problem.

Does this mean a simple "SELECT bar * 2 FROM foo" ends up with a join
rel at the top? Or that a projection handles it (maybe this is the
wrong question -- perhaps the project results in a join rel? I don't
know enough here to tell if this ends up being an odd question)? And
if a projection, then would inserting a projection above a base rel
but before the top of the join tree solve the problem while allowing
us to push expression evaluation down?

Perhaps the

Why does create_gather_merge_plan need make_sort?

2020-11-20 Thread James Coleman
While looking at another issue I noticed that create_gather_merge_plan
calls make_sort if the subplan isn't sufficiently sorted. In all of
the cases I've seen where a gather merge path (not plan) is created
the input path is expected to be properly sorted, so I was wondering
if anyone happened to know what case is being handled by the make_sort
call. Removing it doesn't seem to break any tests.

Thanks,
James




Fix generate_useful_gather_paths for parallel unsafe pathkeys

2020-11-20 Thread James Coleman
t do many safety checks
at all like we do in find_em_expr_usable_for_sorting_rel), but I think
that's safe (though I haven't thought about it a ton) because I
believe those are sort keys we're going to send to the foreign server
anyway, as opposed to sorting ourselves (but I haven't verified that
that's the case and that we never create sort nodes there).

James

1: 
https://www.postgresql.org/message-id/622580997.37108180.1604080457319.JavaMail.zimbra%40siscobra.com.br
2: 
https://www.postgresql.org/message-id/CAJGNTeNaxpXgBVcRhJX%2B2vSbq%2BF2kJqGBcvompmpvXb7pq%2BoFA%40mail.gmail.com
3: 
https://www.postgresql.org/message-id/CAAaqYe_vihKjc%2B8LuQa49EHW4%2BKfefb3wHqPYFnCuUqozo%2BLFg%40mail.gmail.com
4: 
https://www.postgresql.org/message-id/CA%2BTgmobX2079GNJWJVFjo5CmwTg%3DoQQOybFQL2CyZxMY59P6BA%40mail.gmail.com
From 11cb504dca05d93e40a97fa4fe13202a59664950 Mon Sep 17 00:00:00 2001
From: James Coleman 
Date: Fri, 20 Nov 2020 20:29:51 -0500
Subject: [PATCH v1 2/2] Enforce parallel safety of pathkeys in
 generate_useful_gather_paths

If, for example, a pathkey expression (e.g., a correlated subquery which
is currently always treated this way) is unsafe for parallel query then
we must not generate a gather merge path with either incremental or full
sort including that pathkey.
---
 src/backend/optimizer/path/allpaths.c | 13 --
 src/backend/optimizer/path/equivclass.c   |  8 +++-
 src/include/optimizer/paths.h |  2 +-
 .../regress/expected/incremental_sort.out | 40 +++
 src/test/regress/sql/incremental_sort.sql | 11 +
 5 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 672a20760c..ac21b08801 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2802,6 +2802,9 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
  * This allows us to do incremental sort on top of an index scan under a gather
  * merge node, i.e. parallelized.
  *
+ * If the parallel argument is true then we'll require that pathkey expressions
+ * be parallel safe.
+ *
  * XXX At the moment this can only ever return a list with a single element,
  * because it looks at query_pathkeys only. So we might return the pathkeys
  * directly, but it seems plausible we'll want to consider other orderings
@@ -2809,7 +2812,7 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
  * merge joins.
  */
 static List *
-get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel, bool parallel)
 {
 	List	   *useful_pathkeys_list = NIL;
 
@@ -2839,8 +2842,12 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 			 * meet criteria of EC membership in the current relation, we
 			 * enable not just an incremental sort on the entirety of
 			 * query_pathkeys but also incremental sort below a JOIN.
+			 *
+			 * By passing true for the final argument
+			 * find_em_expr_usable_for_sorting_rel will ensure the chosen
+			 * expression is parallel safe.
 			 */
-			if (!find_em_expr_usable_for_sorting_rel(pathkey_ec, rel))
+			if (!find_em_expr_usable_for_sorting_rel(root, pathkey_ec, rel, true))
 break;
 
 			npathkeys++;
@@ -2894,7 +2901,7 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 	generate_gather_paths(root, rel, override_rows);
 
 	/* consider incremental sort for interesting orderings */
-	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel, true);
 
 	/* used for explicit (full) sort paths */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index f98fd7b3eb..456ab04f33 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -801,9 +801,12 @@ find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
  * Find an equivalence class member expression that can be safely used by a
  * sort node on top of the provided relation. The rules here must match those
  * applied in prepare_sort_from_pathkeys.
+ *
+ * If parallel is set to true, then we'll also enforce that the chosen
+ * expression is parallel safe.
  */
 Expr *
-find_em_expr_usable_for_sorting_rel(EquivalenceClass *ec, RelOptInfo *rel)
+find_em_expr_usable_for_sorting_rel(PlannerInfo *root, EquivalenceClass *ec, RelOptInfo *rel, bool parallel)
 {
 	ListCell   *lc_em;
 
@@ -833,6 +836,9 @@ find_em_expr_usable_for_sorting_rel(EquivalenceClass *ec, RelOptInfo *rel)
 		if (!bms_is_subset(em->em_relids, rel->relids))
 			continue;
 
+		if (parallel && !is_parallel_safe(root, (Node *) em->em_expr))
+			continue;
+
 		/*
 		 * As l

Re: Why does create_gather_merge_plan need make_sort?

2020-11-23 Thread James Coleman
On Sun, Nov 22, 2020 at 5:07 PM Tomas Vondra
 wrote:
>
> On 11/22/20 10:31 PM, Tom Lane wrote:
> > Tomas Vondra  writes:
> >> On 11/20/20 11:24 PM, James Coleman wrote:
> >>> While looking at another issue I noticed that create_gather_merge_plan
> >>> calls make_sort if the subplan isn't sufficiently sorted. In all of
> >>> the cases I've seen where a gather merge path (not plan) is created
> >>> the input path is expected to be properly sorted, so I was wondering
> >>> if anyone happened to know what case is being handled by the make_sort
> >>> call. Removing it doesn't seem to break any tests.
> >
> >> Yeah, I think you're right this is dead code, essentially. We're only
> >> ever calling create_gather_merge_path() with pathkeys matching the
> >> subpath. And it's like that on REL_12_STABLE too, i.e. before the
> >> incremental sort was introduced.
> >
> > It's probably there by analogy to the other callers of
> > prepare_sort_from_pathkeys, which all do at least a conditional
> > make_sort().  I'd be inclined to leave it there; it's cheap insurance
> > against somebody weakening the existing behavior.
> >
>
> But how do we know it's safe to actually do the sort there, e.g. in
> light of the volatility/parallel-safety issues discussed in other threads?
>
> I agree the check may be useful, but maybe we should just do elog(ERROR)
> instead.

That was my thought. I'm not a big fan of maintaining a "this might be
useful" path particularly when there isn't any case in the code or
tests at all that exercises it. In other words, not only is it not
useful [yet], but also we don't actually have any signal to know that
it works (or keeps working) -- whether through tests or production
use.

So I'm +1 on turning it into an ERROR log instead, since it seems to
me that encountering this case would almost certainly represent a bug
in path generation.

James




Re: Fix generate_useful_gather_paths for parallel unsafe pathkeys

2020-11-23 Thread James Coleman
On Sun, Nov 22, 2020 at 4:59 PM Tomas Vondra
 wrote:
>
> On 11/21/20 2:55 AM, James Coleman wrote:
> > Over on the -bugs list we had a report [1] of a seg fault with
> > incremental sort. The short of the investigation there was that a
> > subplan wasn't being serialized since it wasn't parallel safe, and
> > that attempting to initialize that subplan resulted in the seg fault.
> > Tom pushed a change to raise an ERROR when this occurs so that at
> > least we don't crash the server.
> >
> > There was some small amount of discussion about this on a thread about
> > a somewhat similar issue [2] (where volatile expressions were the
> > issue), but that thread resulted in a committed patch, and beyond that
> > it seems that this issue deserves its own discussion.
> >
> > I've been digging further into the problem, and my first question was
> > "why doesn't this result in a similar error but with a full sort when
> > incremental sort is disabled?" After all, we have code to consider a
> > gather merge + full sort on the cheapest partial path in
> > generate_useful_gather_paths(), and the plan that I get on Luis's test
> > case with incremental sort disabled has an full sort above a gather,
> > which presumably it'd be cheaper to push down into the parallel plan
> > (using gather merge instead of gather).
> >
> > It turns out that there's a separate bug here: I'm guessing that in
> > the original commit of generate_useful_gather_paths() we had a
> > copy/paste thinko in this code:
> >
> > /*
> >  * If the path has no ordering at all, then we can't use either
> >  * incremental sort or rely on implicit sorting with a gather
> >  * merge.
> >  */
> > if (subpath->pathkeys == NIL)
> >  continue;
> >
> > The comment claims that we should skip subpaths that aren't sorted at
> > all since we can't possibly use a gather merge directly or with an
> > incremental sort applied first. It's true that we can't do either of
> > those things unless the subpath is already at least partially sorted.
> > But subsequently we attempt to construct a gather merge path on top of
> > a full sort (for the cheapest path), and there's no reason to limit
> > that to partially sorted paths (indeed in the presence of incremental
> > sort it seems unlikely that that would be the cheapest path).
> >
> > We still don't want to build incremental sort paths if the subpath has
> > no pathkeys, but that will imply presorted_keys = 0, which we already
> > check.
> >
> > So 0001 removes that branch entirely. And, as expected, we now get a
> > gather merge + full sort and the resulting error about the subplan not
> > being initialized.
> >
>
> Yeah, that's clearly a thinko in generate_useful_gather_paths. Thanks
> for spotting it! The 0001 patch seems fine to me.
>
> > With that oddity out of the way, I started looking at the actually
> > reported problem, and nominally issue is that we have a correlated
> > subquery (in the final target list and the sort clause), and
> > correlated subqueries (not sure why exactly in this case; see [3])
> > aren't considered parallel-safe.
> >
> > As to why that's happening:
> > 1. find_em_expr_usable_for_sorting_rel doesn't exclude that
> > expression, and arguably correctly so in the general case since one
> > might (though we don't now) use this same kind of logic for
> > non-parallel plans.
> > 2. generate_useful_pathkeys_for_relation is noting that it'd be useful
> > to sort on that expression and sees it as safe in part due to the (1).
> > 3. We create a sort node that includes that expression in its output.
> > That seems pretty much the same (in terms of safety given generalized
> > input paths/plans, not in terms of what Robert brought up in [4]) as
> > what would happen in the prepare_sort_from_pathkeys call in
> > create_gather_merge_plan.
> >
> > So to fix this problem I've added the option to find_em_expr_for_rel
> > to restrict to parallel-safe expressions in 0002.
> >
>
> OK, that seems like a valid fix. I wonder if we can have EC with members
> mixing parallel-safe and parallel-unsafe expressions? I guess no, but
> it's more a feeling than a clear reasoning and so I wonder what would
> happen in such cases.

An immediately contrived case would be something like "t.x =
unsafe(t.y)". As to whether that can actually result in us wanting to
sort by the safe member before we can execute the unsafe member (e.g.

Re: enable_incremental_sort changes query behavior

2020-11-23 Thread James Coleman
On Mon, Nov 23, 2020 at 2:24 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > But I think that still leaves something missing: after all,
> > prepare_sort_from_pathkeys does know how to insert new target list
> > entries, so something either there (or in the caller/how its called)
> > has to be enforcing an apparently implicit rule about what point in
> > the tree it's safe to do such. Or even just no path generation had
> > ever considered it before (that feels unsatisfactory as an explanation
> > to me, because it feels more implicit than I'd like, but...)
>
> Hi guys,
>
> I hadn't been paying any attention to this thread, but Robert asked
> me to take a look.  A few comments:

Thanks for jumping in (and thanks Robert for asking for Tom to take a
look); I appreciate the input.

> 1. James was wondering, far upthread, why we would do projections
> pre-sort or post-sort.  I think the answers can be found by studying
> planner.c's make_sort_input_target(), which separates out what we want
> to do pre-sort and post-sort.  (There, the "sort" is envisioned as a
> standalone Sort node atop all the scan/join steps; but its decisions
> nonetheless constrain what will be considered for sorting that's
> implemented further down.)  It has a *very* long introductory comment
> laying out all the considerations.

That comment is helpful; I wish I'd discovered it sooner.

Does it imply we need to intentionally avoid SRFs also?

> 2. Robert is correct that under normal circumstances, the targetlists of
> both baserels and join rels contain only Vars.  Any computations we have
> to do are postponed to the final top-level targetlist, whether they are
> volatile or not.  The fact that this policy is applied regardless of
> volatility may explain why James isn't seeing volatility checks where he
> expected them.  The core (and, I think, only) exception to this is that
> an expression that is a sort or group key has to be evaluated earlier.
> But even then, it won't be pushed down as far as the reltargets of any
> scan or join relations; it'll be computed after the final join.

Is that primarily a project policy? Or a limitation of our current
planner (just can't push things down/carry the results back up as much
as we'd like)? Or something else? In particular, it seems plausible
there are cases where pushing down the sort on a non-Var expression to
the base rel could improve plans, so I'm wondering if there's a reason
to intentionally avoid that in the long or short run (or both).

> 3. If we have a volatile sort/group key, that constrains the set of plans
> we can validly generate, because the key expression mustn't be evaluated
> redundantly.  With the addition of parallelism, a non-parallel-safe
> sort/group key expression likewise constrains the set of valid plans,
> even if it's not considered volatile.

This makes a lot of sense (just unfortunate we had to re-discover it).

> 4. In the past, the only way we'd consider non-Var sort keys in any way
> during scan/join planning was if (a) a relation had an expression index
> matching a requested sort key; of course, that's a-fortiori going to be
> a non-volatile expression.  Or (b) we needed to sort in order to provide
> suitable input for a merge join ... but again, volatile expressions
> aren't considered candidates for mergejoin.  So there basically wasn't
> any need to consider sorting by volatiles below the top level sort/group
> processing, and that again contributes to why you don't see explicit
> tests in that area.  I'm not sure offhand whether the parallel-query
> patches added anything to guard against nonvolatile-but-parallel-unsafe
> sort expressions.  If not, there might be live bugs there independently
> of incremental sort; or that might be unreachable because of overall
> limitations on parallelism.

Interesting: so merge joins are an example of us pushing down sorts,
which I assume means (part of) the answer to my question on (2) is
that there's nothing inherently wrong/broken with evaluating
expressions lower down the tree as long as we're careful about what is
safe/unsafe with respect to volatility and parallelism?

> 5. While I've not dug far enough into the code to identify the exact
> issue, the impression I have about this bug and the parallel-unsafe
> subplan bug is that the incremental sort code is generating Paths
> without due regard to point 3.  If it is making paths based on explicit
> sorts for final-stage pathkeys, independently of either expression
> indexes or mergejoin requirements, that could well explain why it's
> got problems that weren't seen before.

Yes, that's correct. Tomas already pushed the fix 

Re: enable_incremental_sort changes query behavior

2020-11-25 Thread James Coleman
On Tue, Nov 24, 2020 at 2:31 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > On Mon, Nov 23, 2020 at 2:24 PM Tom Lane  wrote:
> >> 1. James was wondering, far upthread, why we would do projections
> >> pre-sort or post-sort.  I think the answers can be found by studying
> >> planner.c's make_sort_input_target(), which separates out what we want
> >> to do pre-sort and post-sort.
>
> > Does it imply we need to intentionally avoid SRFs also?
>
> It's sort of a wart that we allow SRFs in ORDER BY at all, but my
> expectation is that make_sort_input_target would prevent lower levels
> of the planner from needing to think about that.  We don't allow SRFs
> in index expressions, nor in WHERE clauses (so they'll never come up
> as mergejoin sort keys).  So really there's no way that scan/join
> processing should need to consider such cases.  Such a sort would
> only ever be implemented via a final Sort atop ProjectSet.

In this case though we're sorting based on "interesting" pathkeys,
which means we don't necessarily have index expressions or
mergejoin sort keys. For example, even with the bugfix patch (from the
parallel fix thread I've linked to previously) applied, I'm able to
generate this (with of course some GUCs "tweaked"):

select unique1 from tenk1 order by unnest('{1}'::int[]);

 Gather Merge
   Workers Planned: 2
   ->  Sort
 Sort Key: (unnest('{1}'::integer[]))
 ->  ProjectSet
   ->  Parallel Index Only Scan using tenk1_unique1 on tenk1
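
For reference, roughly the kind of cost GUC tweaks I mean (the exact
values aren't important, and this is only a sketch of my local
settings):

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;
set min_parallel_index_scan_size = 0;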

If I understand correctly, that's a violation of the spirit of what
the comments above make_sort_input_target() describe. If that's true, then it
should be pretty easy to disallow them from being considered. Given
the existing restrictions on where SRFs can be placed in a SELECT
(e.g., no CASE/COALESCE) and my assumption (without having thoroughly
verified this) that SRFs aren't allowed as arguments to functions or
as arguments to any other expression (I assume only scalars are
allowed), would it be sufficient to check the pathkey expression
(without recursion) to see if it's a FuncExpr that returns a set?
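
As a concrete illustration of that restriction (assuming I'm
remembering the behavior correctly), something like the following is
rejected because the SRF is nested inside another expression, which is
why a shallow, non-recursive check of the pathkey expression seems like
it might suffice:

select case when true then unnest('{1,2}'::int[]) end;
-- errors out: SRFs aren't allowed nested inside CASE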

> >> 2. Robert is correct that under normal circumstances, the targetlists of
> >> both baserels and join rels contain only Vars.
>
> > Is that primarily a project policy? Or a limitation of our current
> > planner (just can't push things down/carry the results back up as much
> > as we'd like)? Or something else? In particular, it seems plausible
> > there are cases where pushing down the sort on a non-Var expression to
> > the base rel could improve plans, so I'm wondering if there's a reason
> > to intentionally avoid that in the long or short run (or both).
>
> I think you've just rediscovered Joe Hellerstein's thesis topic [1].
> We ripped out the remnants of that code ages ago (search very early
> git states for "JMH" if you're interested), because the complexity
> vs. benefit ratio seemed pretty bad.  Maybe there'll be a case for
> putting it back some day, but I'm dubious.  Note that we have the
> ability to push down sorts-on-expressions anyway; that's not constrained
> by what is in the relation targetlists.

I'm doing some grepping now to see what that work entailed, but I'm
not sure that what I'm describing is the same thing. By "pushing down
a non-Var expression to the base rel" I mean what (to my ears) sounds
like what you're saying by "we have the ability to push down
sorts-on-expressions anyway; that's not constrained by what is in the
relation targetlists". The target list entries I'm talking about for
these expressions are the ones inserted by
prepare_sort_from_pathkeys(), which in PG13 happens just a bit more
frequently since, at least in parallel query, we consider sort paths
lower than we did before based on interesting pathkeys (for now just
query_pathkeys) in generate_useful_gather_paths().

> > Interesting: so merge joins are an example of us pushing down sorts,
> > which I assume means (part of) the answer to my question on (2) is
> > that there's nothing inherently wrong/broken with evaluating
> > expressions lower down the tree as long as we're careful about what is
> > safe/unsafe with respect to volatility and parallelism?
>
> Right, I don't see any fundamental problem with that, we just have
> to be careful about these constraints.

Great. That's exactly what I was looking for. To summarize: the
constraints are on volatility, parallel safety, and SRFs.

> > I have wondered if we should strictly require the expression to be in
> > the target list even if no

Re: enable_incremental_sort changes query behavior

2020-11-25 Thread James Coleman
On Wed, Nov 25, 2020 at 2:53 PM James Coleman  wrote:
>
> On Tue, Nov 24, 2020 at 2:31 PM Tom Lane  wrote:
> >
> > James Coleman  writes:
> > > On Mon, Nov 23, 2020 at 2:24 PM Tom Lane  wrote:
> > >> 1. James was wondering, far upthread, why we would do projections
> > >> pre-sort or post-sort.  I think the answers can be found by studying
> > >> planner.c's make_sort_input_target(), which separates out what we want
> > >> to do pre-sort and post-sort.
> >
> > > Does it imply we need to intentionally avoid SRFs also?
> >
> > It's sort of a wart that we allow SRFs in ORDER BY at all, but my
> > expectation is that make_sort_input_target would prevent lower levels
> > of the planner from needing to think about that.  We don't allow SRFs
> > in index expressions, nor in WHERE clauses (so they'll never come up
> > as mergejoin sort keys).  So really there's no way that scan/join
> > processing should need to consider such cases.  Such a sort would
> > only ever be implemented via a final Sort atop ProjectSet.
>
> In this case though we're sorting based on "interesting" pathkeys,
> which means we don't necessarily have index expressions or
> mergejoin sort keys. For example, even with the bugfix patch (from the
> parallel fix thread I've linked to previously) applied, I'm able to
> generate this (with of course some GUCs "tweaked"):
>
> select unique1 from tenk1 order by unnest('{1}'::int[]);
>
>  Gather Merge
>Workers Planned: 2
>->  Sort
>  Sort Key: (unnest('{1}'::integer[]))
>  ->  ProjectSet
>->  Parallel Index Only Scan using tenk1_unique1 on tenk1
>
> If I understand correctly, that's a violation of the spirit of what
> the comments above make_sort_input_target() describe. If that's true, then it
> should be pretty easy to disallow them from being considered. Given
> the existing restrictions on where SRFs can be placed in a SELECT
> (e.g., no CASE/COALESCE) and my assumption (without having thoroughly
> verified this) that SRFs aren't allowed as arguments to functions or
> as arguments to any other expression (I assume only scalars are
> allowed), would it be sufficient to check the pathkey expression
> (without recursion) to see if it's a FuncExpr that returns a set?

Here's the plan with a change to restrict SRFs here:

 Sort
   Sort Key: (unnest('{1,2}'::integer[]))
   ->  Gather
 Workers Planned: 2
 ->  ProjectSet
   ->  Parallel Index Only Scan using tenk1_unique1 on tenk1

I'm a bit surprised the ProjectSet is above the Index Scan rather than
above the Gather, but maybe this is still more correct?

James




Re: Fix generate_useful_gather_paths for parallel unsafe pathkeys

2020-11-29 Thread James Coleman
On Mon, Nov 23, 2020 at 8:35 AM James Coleman  wrote:
>
> On Sun, Nov 22, 2020 at 4:59 PM Tomas Vondra
>  wrote:
> > ...
> > Thanks for the patches, I'll take a closer look next week. The 0002
> > patch is clearly something we need to backpatch, not sure about 0001 as
> > it essentially enables new behavior in some cases (Sort on unsorted
> > paths below Gather Merge).
>
> Thanks for taking a look.
>
> I had the same question re: backporting. On the one hand it will
> change behavior (this is always a bit of a gray area since in one
> sense bugs change behavior), but IMO it's not a new feature, because
> the code clearly intended to have that feature in the first place (it
> creates gather merges on top of a full sort). So I'd lean towards
> considering it a bug fix, but I'm also not going to make that a hill
> to die on.
>
> One semi-related question: do you think we should add a comment to
> prepare_sort_from_pathkeys explaining that modifying it may mean
> find_em_expr_for_rel needs to be updated also?

Here's an updated patch series.

0001 and 0002 as before, but with some minor cleanup.

0003 disallows SRFs in these sort expressions (per discussion over at [1]).

0004 removes the search through the target for matching volatile
expressions (per discussion at [2]).

0005 adds the comment I mentioned above clarifying these two functions
are linked.

James

1: https://www.postgresql.org/message-id/295524.1606246314%40sss.pgh.pa.us
2: https://www.postgresql.org/message-id/124400.1606159456%40sss.pgh.pa.us
From 065c299bdaa5585b89d174ce649bf973dbf1c34d Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Wed, 25 Nov 2020 15:53:49 -0500
Subject: [PATCH v2 4/5] Remove volatile expr target search

While prepare_sort_from_pathkeys has to be concerned about matching up a
volatile expression to the proper tlist entry, we don't need to do that
here since such a sort will have to be postponed anyway.
---
 src/backend/optimizer/path/equivclass.c | 27 +
 1 file changed, 5 insertions(+), 22 deletions(-)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index ac2ed7aff1..f36c57ed81 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -819,7 +819,6 @@ find_em_expr_usable_for_sorting_rel(PlannerInfo *root, EquivalenceClass *ec, Rel
 		EquivalenceMember *em = lfirst(lc_em);
 		Expr	   *em_expr = em->em_expr;
 		PathTarget *target = rel->reltarget;
-		ListCell   *lc_target_expr;
 
 		/*
 		 * We shouldn't be trying to sort by an equivalence class that
@@ -850,30 +849,14 @@ find_em_expr_usable_for_sorting_rel(PlannerInfo *root, EquivalenceClass *ec, Rel
 		 * As long as the expression isn't volatile then
 		 * prepare_sort_from_pathkeys is able to generate a new target entry,
 		 * so there's no need to verify that one already exists.
+		 *
+		 * While prepare_sort_from_pathkeys has to be concerned about matching
+		 * up a volatile expression to the proper tlist entry, it doesn't seem
+		 * valuable here to expend the work trying to find a match in the
+		 * target's exprs since such a sort will have to be postponed anyway.
 		 */
 		if (!ec->ec_has_volatile)
 			return em->em_expr;
-
-		/*
-		 * If, however, it's volatile, we have to verify that the
-		 * equivalence member's expr is already generated in the
-		 * relation's target (we do strip relabels first from both
-		 * expressions, which is cheap and might allow us to match
-		 * more expressions).
-		 */
-		while (em_expr && IsA(em_expr, RelabelType))
-			em_expr = ((RelabelType *) em_expr)->arg;
-
-		foreach(lc_target_expr, target->exprs)
-		{
-			Expr	   *target_expr = lfirst(lc_target_expr);
-
-			while (target_expr && IsA(target_expr, RelabelType))
-target_expr = ((RelabelType *) target_expr)->arg;
-
-			if (equal(target_expr, em_expr))
-return em->em_expr;
-		}
 	}
 
 	/* We didn't find any suitable equivalence class expression */
-- 
2.17.1

From 1dd54123ec95cfc2df66ccf1189bff1df70d15e7 Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Wed, 25 Nov 2020 15:46:00 -0500
Subject: [PATCH v2 3/5] Disallow SRFs in proactive sort

So they can be evaluated at the proper time as determined by
make_sort_input_target.
---
 src/backend/optimizer/path/equivclass.c|  8 
 src/backend/optimizer/util/tlist.c |  5 -
 src/include/optimizer/optimizer.h  |  5 +
 src/test/regress/expected/incremental_sort.out | 12 
 src/test/regress/sql/incremental_sort.sql  |  2 ++
 5 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index e47a16827d..ac2ed7aff1 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/bac

Re: Why does create_gather_merge_plan need make_sort?

2020-11-29 Thread James Coleman
On Mon, Nov 23, 2020 at 8:19 AM James Coleman  wrote:
>
> On Sun, Nov 22, 2020 at 5:07 PM Tomas Vondra
>  wrote:
> >
> > On 11/22/20 10:31 PM, Tom Lane wrote:
> > > Tomas Vondra  writes:
> > >> On 11/20/20 11:24 PM, James Coleman wrote:
> > >>> While looking at another issue I noticed that create_gather_merge_plan
> > >>> calls make_sort if the subplan isn't sufficiently sorted. In all of
> > >>> the cases I've seen where a gather merge path (not plan) is created
> > >>> the input path is expected to be properly sorted, so I was wondering
> > >>> if anyone happened to know what case is being handled by the make_sort
> > >>> call. Removing it doesn't seem to break any tests.
> > >
> > >> Yeah, I think you're right this is dead code, essentially. We're only
> > >> ever calling create_gather_merge_path() with pathkeys matching the
> > >> subpath. And it's like that on REL_12_STABLE too, i.e. before the
> > >> incremental sort was introduced.
> > >
> > > It's probably there by analogy to the other callers of
> > > prepare_sort_from_pathkeys, which all do at least a conditional
> > > make_sort().  I'd be inclined to leave it there; it's cheap insurance
> > > against somebody weakening the existing behavior.
> > >
> >
> > But how do we know it's safe to actually do the sort there, e.g. in
> > light of the volatility/parallel-safety issues discussed in other threads?
> >
> > I agree the check may be useful, but maybe we should just do elog(ERROR)
> > instead.
>
> That was my thought. I'm not a big fan of maintaining a "this might be
> useful" path particularly when there isn't any case in the code or
> tests at all that exercises it. In other words, not only is it not
> useful [yet], but also we don't actually have any signal to know that
> it works (or keeps working) -- whether through tests or production
> use.
>
> So I'm +1 on turning it into an ERROR log instead, since it seems to
> me that encountering this case would almost certainly represent a bug
> in path generation.

Here's a patch to do that.

James
From 798d520a30a44bc00f4f7bdb4a39e443212f0424 Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Sun, 29 Nov 2020 09:38:59 -0500
Subject: [PATCH v1] Error if gather merge paths aren't sufficiently sorted

---
 src/backend/optimizer/plan/createplan.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 40abe6f9f6..dd58bc5688 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1793,13 +1793,14 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 		 &gm_plan->nullsFirst);
 
 
-	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
+	/*
+	 * All gather merge paths should have already guaranteed the necessary sort
+	 * order either by adding an explicit sort node or by using presorted input.
+	 * If not, we have no guarantee that applying a sort at this point would
+	 * be parallel safe.
+	 */
 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
-		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
-	 gm_plan->sortColIdx,
-	 gm_plan->sortOperators,
-	 gm_plan->collations,
-	 gm_plan->nullsFirst);
+		elog(ERROR, "gather merge input not sufficiently sorted");
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
-- 
2.17.1



Re: [DOC] Document concurrent index builds waiting on each other

2021-01-13 Thread James Coleman
On Wed, Jan 13, 2021 at 12:58 AM Michael Paquier  wrote:
>
> On Tue, Jan 12, 2021 at 04:51:39PM -0300, Alvaro Herrera wrote:
> > I looked into this again, and I didn't like what I had added to
> > maintenance.sgml at all.  It seems out of place where I put it; and I
> > couldn't find any great spots.  Going back to your original proposal,
> > what about something like this?  It's just one more para in the "notes"
> > section in CREATE INDEX and REINDEX pages, without any additions to the
> > VACUUM pages.
>
> +1.

I think one more para in the notes is good. But shouldn't we still
clarify the issue is specific to CONCURRENTLY?

Also, the fact that it affects not just the table being indexed seems fairly significant.

How about something like:

---
Like any long-running transaction, REINDEX CONCURRENTLY can
affect which tuples can be removed by concurrent
VACUUM on any table.
---

James




Re: [DOC] Document concurrent index builds waiting on each other

2021-01-13 Thread James Coleman
On Wed, Jan 13, 2021 at 12:33 PM Alvaro Herrera  wrote:
>
> On 2021-Jan-13, James Coleman wrote:
>
> > On Wed, Jan 13, 2021 at 12:58 AM Michael Paquier  
> > wrote:
> > >
> > > On Tue, Jan 12, 2021 at 04:51:39PM -0300, Alvaro Herrera wrote:
> > > > I looked into this again, and I didn't like what I had added to
> > > > maintenance.sgml at all.  It seems out of place where I put it; and I
> > > > couldn't find any great spots.  Going back to your original proposal,
> > > > what about something like this?  It's just one more para in the "notes"
> > > > section in CREATE INDEX and REINDEX pages, without any additions to the
> > > > VACUUM pages.
> > >
> > > +1.
> >
> > I think one more para in the notes is good. But shouldn't we still
> > clarify the issue is specific to CONCURRENTLY?
>
> How is it specific to concurrent builds?  What we're documenting here is
> the behavior of vacuum, and that one is identical in both regular builds
> and concurrent builds (since vacuum has to avoid removing rows from
> under either of them).  The only reason concurrent builds are
> interesting is because they take longer.
>
> What was specific to concurrent builds was the fact that you can't have
> more than one at a time, and that one is what was added in 58ebe967f.

Ah, right. I've mixed those up at least once on this thread already.

> > Also that it's not just the table being indexed seems fairly significant.
>
> This is true.  So I propose
>
> Like any long-running transaction, REINDEX can
> affect which tuples can be removed by concurrent VACUUM
> on any table.

That sounds good to me.

James




Re: [DOC] Document concurrent index builds waiting on each other

2021-01-13 Thread James Coleman
On Wed, Jan 13, 2021 at 4:05 PM Alvaro Herrera  wrote:
>
> On 2021-Jan-13, James Coleman wrote:
>
> > On Wed, Jan 13, 2021 at 12:33 PM Alvaro Herrera  
> > wrote:
>
> > > This is true.  So I propose
> > >
> > > Like any long-running transaction, REINDEX can
> > > affect which tuples can be removed by concurrent 
> > > VACUUM
> > > on any table.
> >
> > That sounds good to me.
>
> Great, pushed with one more wording tweak: "REINDEX on any table can
> affect ... on any other table".  To pg12 and up.

Looks like what got committed is "REINDEX on a table" not "on any",
but I'm not sure that matters too much.

James




Re: [DOC] Document concurrent index builds waiting on each other

2021-01-13 Thread James Coleman
On Wed, Jan 13, 2021 at 4:29 PM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > On 2021-Jan-13, James Coleman wrote:
> >>>> This is true.  So I propose
> >>>> Like any long-running transaction, REINDEX can
> >>>> affect which tuples can be removed by concurrent 
> >>>> VACUUM
> >>>> on any table.
>
> >> Looks like what got committed is "REINDEX on a table" not "on any",
> >> but I'm not sure that matters too much.
>
> > Ouch.  The difference seems slight enough that it doesn't matter; is it
> > ungrammatical?
>
> I'd personally have written "on other tables" or "on another table",
> or left out that clause altogether and just said "concurrent
> VACUUM".  I'm not sure it's ungrammatical exactly,
> but the antecedent of "a table" is a bit unclear; people might
> wonder if it means the table being reindexed.

It does mean the table being reindexed; the last phrase says "any
table" meaning "any other table".

James




Re: [DOC] Document concurrent index builds waiting on each other

2021-01-13 Thread James Coleman
On Wed, Jan 13, 2021 at 5:00 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > On Wed, Jan 13, 2021 at 4:29 PM Tom Lane  wrote:
> >> but the antecedent of "a table" is a bit unclear; people might
> >> wonder if it means the table being reindexed.
>
> > It does mean the table being reindexed; the last phrase says "any
> > table" meaning "any other table".
>
> [ raised eyebrow ]  Surely REINDEX and VACUUM can't run on the same
> table at the same time.

+ Like any long-running transaction, CREATE INDEX on a
+ table can affect which tuples can be removed by concurrent
+ VACUUM on any other table.

The "on a table" is the table on which the REINDEX/CREATE INDEX is
occurring. The "any other table" is where VACUUM might run.

James




Re: Discarding DISCARD ALL

2021-01-20 Thread James Coleman
I hope to do further review of the patch later this week, but I wanted
to at least comment on this piece:

On Wed, Jan 20, 2021 at 2:48 AM Peter Eisentraut
 wrote:
>
> On 2020-12-23 15:33, Simon Riggs wrote:
> > Poolers such as pgbouncer would then be able to connect transaction
> > mode pools by setting transaction_cleanup=on at time of connection,
> > avoiding any need to issue a server_reset_query, removing the DISCARD
> > ALL command from the normal execution path, while still achieving the
> > same thing.
>
> PgBouncer does not send DISCARD ALL in transaction mode.  There is a
> separate setting to do that, but it's not the default, and it's more of
> a workaround for bad client code.  So I don't know if this feature would
> be of much use for PgBouncer.  Other connection poolers might have other
> opinions.

Yes, to have server_reset_query apply in transaction pooling mode you
have to additionally configure pgbouncer with
server_reset_query_always enabled.

I'd mildly take issue with "a workaround for bad client code". Yes,
clients in transaction pooling mode shouldn't issue (for example) `SET
...`, but there's no way I'm aware of in Postgres to prevent
session-specific items like those GUCs from being set by a given user,
so I view it more like a safeguard than a workaround.

In our setup we have server_reset_query_always=1 as such a safeguard,
because it's too easy for application code to update, for example,
statement_timeout to disastrous results. But we also work to make sure
those don't happen (or get cleaned up if they happen to slip in).

An alternative approach that occurred to me while typing this reply: a
setting in Postgres that would disallow setting session level GUCs
(i.e., enforce `SET LOCAL` transaction level usage instead) would
remove a large chunk of our need to set server_reset_query_always=1
(and more interestingly it'd highlight when broken code gets pushed).
But even with that, I see some value in the proposed setting since
there is additional session state beyond GUCs.
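
To make that concrete, the difference I care about is roughly the
following (statement_timeout is just a stand-in for any session GUC):

begin;
set statement_timeout = 0;        -- session scope: survives commit and
                                  -- leaks onto whatever client next
                                  -- gets this pooled connection
commit;

begin;
set local statement_timeout = 0;  -- transaction scope: reverts at
                                  -- commit/rollback, so it's safe
                                  -- under transaction pooling
commit;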

James




Re: Discarding DISCARD ALL

2021-01-20 Thread James Coleman
On Wed, Jan 20, 2021 at 9:58 AM Simon Riggs  wrote:
>
> On Wed, 20 Jan 2021 at 14:21, James Coleman  wrote:
>
> > An alternative approach that occurred to me while typing this reply: a
> > setting in Postgres that would disallow setting session level GUCs
> > (i.e., enforce `SET LOCAL` transaction level usage instead) would
> > remove a large chunk of our need to set server_reset_query_always=1
> > (and more interestingly it'd highlight when broken code gets pushed).
> > But even with that, I see some value in the proposed setting since
> > there is additional session state beyond GUCs.
>
> With transaction_cleanup=on we could force all SETs to be SET LOCAL.
>
> The point is that if we declare ahead of time that the transaction
> will be reset then we can act differently and more easily for various
> circumstances, for SETs, for Temp tables and others.

Right, I agree it's independently useful. My "alternative" is a subset
of that functionality and doesn't cover as many cases.

James




Re: Consider parallel for lateral subqueries with limit

2022-01-04 Thread James Coleman
On Tue, Jan 4, 2022 at 5:31 PM Tom Lane  wrote:
>
> Greg Nancarrow  writes:
> > The patch LGTM.
> > I have set the status to "Ready for Committer".
>
> I don't really see why this patch is even a little bit safe.
> The argument for it seems to be that a lateral subquery will
> necessarily be executed in such a way that each complete iteration
> of the subquery, plus joining to its outer rel, happens within a
> single worker ... but where is the guarantee of that?  Once
> you've marked the rel as parallel-safe, the planner is free to
> consider all sorts of parallel join structures.  I'm afraid this
> would be easily broken as soon as you look at cases with three or
> more rels.  Or maybe even just two.  The reason for the existing
> restriction boils down to this structure being unsafe:
>
> Gather
> NestLoop
> Scan ...
> Limit
> Scan ...
>
> and I don't see how the existence of a lateral reference
> makes it any safer.

Thanks for taking a look. I'm not following how the structure you
posited is inherently unsafe when it's a lateral reference. That
limit/scan (if lateral) has to be being executed per tuple in the
outer scan, right? And if it's a unique execution per tuple, then
consistency across tuples (that are in different workers) can't be a
concern.

Is there a scenario I'm missing where lateral can currently be
executed in that way in that structure (or a different one)?

Thanks,
James Coleman




Re: Consider parallel for lateral subqueries with limit

2022-01-13 Thread James Coleman
On Tue, Jan 4, 2022 at 9:59 PM James Coleman  wrote:
>
> On Tue, Jan 4, 2022 at 5:31 PM Tom Lane  wrote:
> >
> > Greg Nancarrow  writes:
> > > The patch LGTM.
> > > I have set the status to "Ready for Committer".
> >
> > I don't really see why this patch is even a little bit safe.
> > The argument for it seems to be that a lateral subquery will
> > necessarily be executed in such a way that each complete iteration
> > of the subquery, plus joining to its outer rel, happens within a
> > single worker ... but where is the guarantee of that?  Once
> > you've marked the rel as parallel-safe, the planner is free to
> > consider all sorts of parallel join structures.  I'm afraid this
> > would be easily broken as soon as you look at cases with three or
> > more rels.  Or maybe even just two.  The reason for the existing
> > restriction boils down to this structure being unsafe:
> >
> > Gather
> > NestLoop
> > Scan ...
> > Limit
> > Scan ...
> >
> > and I don't see how the existence of a lateral reference
> > makes it any safer.
>
> Thanks for taking a look. I'm not following how the structure you
> posited is inherently unsafe when it's a lateral reference. That
> limit/scan (if lateral) has to be being executed per tuple in the
> outer scan, right? And if it's a unique execution per tuple, then
> consistency across tuples (that are in different workers) can't be a
> concern.
>
> Is there a scenario I'm missing where lateral can currently be
> executed in that way in that structure (or a different one)?

To expand on this:

Suppose lateral is not in play. Then if we have a plan like:

Gather
NestLoop
Scan X
Limit
Scan Y

Because we have the result "X join Limit(Y)"  we need "Limit(Y)" to be
consistent across all of the possible executions of "Limit(Y)" (i.e.,
in each worker it executes in). That means (absent infrastructure for
guaranteeing a unique ordering) we obviously can't parallelize the
inner side of the join as the limit may be applied in different ways
in each worker's execution.

Now suppose lateral is in play. Then (given the same plan) instead of
our result being "X join Limit(Y)" the result is "X join Limit(Y sub
X)", that is, each row in X is joined to a unique invocation of
"Limit(Y)". In this case we are already conceivably getting different
results for each execution of the subquery "Limit(Y)" even if we're
not running those executions across multiple workers. Whether we've
optimized such a subquery into running a single execution per row in X
or have managed to optimize it in such a way that a single execution
of "Limit(Y)" may be shared by multiple rows in X makes no difference
because there are no relational guarantees that that is the case
(conceptually each row in X gets its own results for "Limit(Y)").
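
For concreteness, the shape I have in mind is something like this
(table and column names here are made up):

select x.id, y.val
from x,
  lateral (select val from y where y.x_id = x.id limit 1) y;

Each row of x conceptually gets its own "limit 1" result, so there's no
single cross-row limit whose application could differ between workers.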

I've not been able to come up with a scenario where this doesn't hold
-- even if part of the join or the subquery execution happens in a
different worker. I believe that for there to be a parallel safety
problem here you'd need to have a given subquery execution for a row
in X be executed multiple times. Alternatively I've been trying to
reason about whether there might be a scenario where a 3rd rel is
involved (i.e., the inner rel is itself a join rel), but as far as I
can tell the moment we end up with a join structure such that the
lateral rel is the one with the limit we'd be back to being safe again
for the reasons claimed earlier.

If there's something obvious (or not so obvious) I'm missing, I'd
appreciate a counterexample, but I'm currently unable to falsify my
original claim that the lateral reference is a fundamental difference
that allows us to consider this parallel safe.

Thanks,
James Coleman




Re: Parallelize correlated subqueries that execute within each worker

2022-01-14 Thread James Coleman
On Fri, Dec 3, 2021 at 2:35 AM Michael Paquier  wrote:
>
> On Mon, Nov 15, 2021 at 10:01:37AM -0500, Robert Haas wrote:
> > On Wed, Nov 3, 2021 at 1:34 PM James Coleman  wrote:
> >> As I understand the current code, parallel plans are largely chosen
> >> based not on where it's safe to insert a Gather node but rather by
> >> determining if a given path is parallel safe. Through that lens params
> >> are a bit of an odd man out -- they aren't inherently unsafe in the
> >> way a parallel-unsafe function is, but they can only be used in
> >> parallel plans under certain conditions (whether because of project
> >> policy, performance, or missing infrastructure).
> >
> > Right.
>
> Please note that the CF bot is complaining here, so I have moved this
> patch to the next CF, but changed the status as waiting on author.

I rebased this back in December, but somehow forgot to reply with the
updated patch, so, here it is finally.

Thanks,
James Coleman


v4-0001-Allow-parallel-LATERAL-subqueries-with-LIMIT-OFFS.patch
Description: Binary data


v4-0002-Parallel-query-support-for-basic-correlated-subqu.patch
Description: Binary data


v4-0003-Other-places-to-consider-for-completeness.patch
Description: Binary data


Re: Parallelize correlated subqueries that execute within each worker

2022-01-14 Thread James Coleman
On Mon, Nov 15, 2021 at 10:01 AM Robert Haas  wrote:
>
> On Wed, Nov 3, 2021 at 1:34 PM James Coleman  wrote:
> > As I understand the current code, parallel plans are largely chosen
> > based not on where it's safe to insert a Gather node but rather by
> > determining if a given path is parallel safe. Through that lens params
> > are a bit of an odd man out -- they aren't inherently unsafe in the
> > way a parallel-unsafe function is, but they can only be used in
> > parallel plans under certain conditions (whether because of project
> > policy, performance, or missing infrastructure).
>
> Right.
>
> > Introducing consider_parallel_rechecking_params and
> > parallel_safe_ignoring_params allows us to keep more context on params
> > and make a more nuanced decision at the proper level of the plan. This
> > is what I mean by "rechecked in the using context", though I realize
> > now that both "recheck" and "context" are overloaded terms in the
> > project, so don't describe the concept particularly clearly. When a
> > path relies on params we can only make a final determination about its
> > parallel safety if we know whether or not the current parallel node
> > can provide the param's value. We don't necessarily know that
> > information until we attempt to generate a full parallel node in the
> > plan (I think what you're describing as "inserting a Gather node")
> > since the param may come from another node in the plan. These new
> > values allow us to do that by tracking tentatively parallel-safe
> > subplans (given proper Gather node placement) and delaying the
> > parallel-safety determination until the point at which a param is
> > available (or not).
>
> So I think I agree with you here. But I don't like all of this
> "ignoring_params" stuff and I don't see why it's necessary. Say we
> don't have both parallel_safe and parallel_safe_ignoring_params. Say
> we just have parallel_safe. If the plan will be parallel safe if the
> params are available, we label it parallel safe. If the plan will not
> be parallel safe even if the params are available, we say it's not
> parallel safe. Then, when we get to generate_gather_paths(), we don't
> generate any paths if there are required parameters that are not
> available. What's wrong with that approach?
>
> Maybe it's clearer to say this: I feel like one extra Boolean is
> either too much or too little. I think maybe it's not even needed. But
> if it is needed, then why just a bool instead of, say, a Bitmapset of
> params that are needed, or something?
>
> I'm sort of speaking from intuition here rather than sure knowledge. I
> might be totally wrong.

Apologies for quite the delay responding to this.

I've been chewing on this a bit, and I was about to go re-read the
code and see how easy it'd be to do exactly what you're suggesting in
generate_gather_paths() (and verifying it doesn't need to happen in
other places). However there's one (I think large) gotcha with that
approach (assuming it otherwise makes sense): it means we do
unnecessary work. In the current patch series we only need to recheck
parallel safety if we're in a situation where we might actually
benefit from doing that work (namely when we have a correlated
subquery we might otherwise be able to execute in a parallel plan). If
we don't track that status we'd have to recheck the full parallel
safety of the path for all paths -- even without correlated
subqueries.
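
For concreteness, the kind of query I mean (with made-up tables) is a
correlated subquery whose only param comes from the same worker's outer
tuple, e.g.:

select *
from big t
where t.val = (select max(o.val) from other o where o.id = t.id);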

Alternatively we could merge these fields into a single enum field
that tracked these states. Even better, we could use a bitmap to
signify what items are/aren't parallel safe. I'm not sure if that'd
create even larger churn in the patch, but maybe it's worth it either
way. In theory it'd open up further expansions to this concept later
(though I'm not aware of any such ideas).

If you think such an approach would be an improvement I'd be happy to
take a pass at a revised patch.

Thoughts?

Thanks,
James Coleman




Add last commit LSN to pg_last_committed_xact()

2022-01-14 Thread James Coleman
I'd recently been thinking about monitoring how many bytes a logical
slot is behind and realized that's not really possible to compute
currently.
the current WAL LSN easily enough and the slot exposes the current LSN
positions of the slot. However for logical slots that naive
computation isn't quite right. The logical slot can't flush past the
last commit, so even if there's 100s of megabytes of unflushed WAL on
the slot there may be zero lag (in terms of what's possible to
process).

I've attached a simple patch (sans tests and documentation) to get
feedback early. After poking around this afternoon it seemed to me
that the simplest approach was to hook into the commit timestamps
infrastructure and store the commit's XLogRecPtr in the cache of the
most recent value (but of course don't write it out to disk). That has
the downside of making this feature dependent on "track_commit_timestamps
= on", but that seems reasonable:

1. Getting the xid of the last commit is similarly dependent on commit
timestamps infrastructure.
2. It's a simple place to hook into and avoids new shared data and locking.
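
For example (just a sketch: I'm assuming here the new output column
ends up named "lsn", and it of course requires track_commit_timestamps
to be on), the kind of monitoring query I have in mind looks like:

select slot_name,
       pg_wal_lsn_diff((select lsn from pg_last_committed_xact()),
                       confirmed_flush_lsn) as commit_bytes_behind
from pg_replication_slots
where slot_type = 'logical';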

Thoughts?

Thanks,
James Coleman


v1-0001-Expose-LSN-of-last-commit-via-pg_last_committed_x.patch
Description: Binary data


Re: Nicer error when connecting to standby with hot_standby=off

2021-03-19 Thread James Coleman
On Tue, Mar 9, 2021 at 9:27 AM Fujii Masao  wrote:
>
>
>
> On 2021/03/09 23:19, James Coleman wrote:
> > On Tue, Mar 9, 2021 at 9:17 AM Alvaro Herrera  
> > wrote:
> >>
> >> On 2021-Mar-09, James Coleman wrote:
> >>
> >>> Yes, I think they both agreed on the "DETAIL:  Hot standby mode is
> >>> disabled." message, but that alternative meant not needing to add any
> >>> new signals and pm states, correct?
> >>
> >> Ah, I see!  I was thinking that you still needed the state and signal in
> >> order to print the correct message in hot-standby mode, but that's
> >> (obviously!) wrong.  So you're right that no signal/state are needed.
> >
> > Cool. And yes, I'm planning to update the patch soon.
>
> +1. Thanks!

Here's an updated patch; I think I've gotten what Alvaro suggested.

Thanks,
James


v4-0001-Improve-standby-connection-denied-error-message.patch
Description: Binary data


Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-03-19 Thread James Coleman
On Fri, Sep 11, 2020 at 5:11 PM James Coleman  wrote:
>
> On Tue, Sep 8, 2020 at 4:37 PM Heikki Linnakangas  wrote:
> >
> > On 08/09/2020 22:25, James Coleman wrote:
> > > On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas  
> > > wrote:
> > >>
> > >> You could also apply the optimization for non-Const expressions, as long
> > >> as they're stable (i.e. !contain_volatile_functions(expr)).
> > >
> > > Is that true? Don't we also have to worry about whether or not the
> > > value is stable (i.e., know when a param has changed)? There have been
> > > discussions about being able to cache stable subexpressions, and my
> > > understanding was that we'd need to have that infrastructure (along
> > > with the necessarily invalidation when the param changes) to be able
> > > to use this for non-Const expressions.
> >
> > Yeah, you're right, you'd have to also check for PARAM_EXEC Params. And
> > Vars. I think the conditions are the same as those checked in
> > match_clause_to_partition_key() in partprune.c (it's a long function,
> > search for "if (!IsA(rightop, Const))"). Not sure it's worth the
> > trouble, then. But it would be nice to handle queries like "WHERE column
> > = ANY ($1)"
>
> If I'm understanding properly you're imagining something in the form of:
>
> with x as (select '{1,2,3,4,5,6,7,8,9,10}'::int[])
> select * from t where i = any ((select * from x)::int[]);
>
> I've been playing around with this and currently have these checks:
>
> contain_var_clause((Node *) arrayarg)
> contain_volatile_functions((Node *) arrayarg)
> contain_exec_param((Node *) arrayarg, NIL)
>
> (the last one I had to modify the function to handle empty lists)
>
> If any are true, then have to disable the optimization. But for
> queries in the form above the test contain_exec_param((Node *)
> arrayarg, NIL) evaluates to true, even though we know from looking at
> the query that the array subexpression is stable for the length of the
> query.
>
> Am I misunderstanding what you're going for? Or is there a way to
> confirm the expr, although an exec param, won't change?
>
> Another interesting thing that this would imply is that we'd either have to:
>
> 1. Remove the array length check altogether,
> 2. Always use the hash when have a non-Const, but when a Const only if
> the array length check passes, or
> 3. Make this new expr op more fully featured by teaching it how to use
> either a straight loop through a deconstructed array or use the hash.
>
> That last option feeds into further discussion in the below:
>
> > >> Deconstructing the array Datum into a simple C array on first call would
> > >> be a win even for very small arrays and for AND semantics, even if you
> > >> don't use a hash table.
> > >
> > > Because you wouldn't have to repeatedly detoast it? Or some other
> > > reason I'm not thinking of? My intuition would have been that (aside
> > > from detoasting if necessary) there wouldn't be much real overhead in,
> > > for example, an array storing integers.
> >
> > Dealing with NULLs and different element sizes in the array is pretty
> > complicated. Looping through the array currently looks like this:
> >
> > /* Loop over the array elements */
> > s = (char *) ARR_DATA_PTR(arr);
> > bitmap = ARR_NULLBITMAP(arr);
> > bitmask = 1;
> >
> > for (int i = 0; i < nitems; i++)
> > {
> > Datum   elt;
> > Datum   thisresult;
> >
> > /* Get array element, checking for NULL */
> > if (bitmap && (*bitmap & bitmask) == 0)
> > {
> > fcinfo->args[1].value = (Datum) 0;
> > fcinfo->args[1].isnull = true;
> > }
> > else
> > {
> > elt = fetch_att(s, typbyval, typlen);
> > s = att_addlength_pointer(s, typlen, s);
> > s = (char *) att_align_nominal(s, typalign);
> > fcinfo->args[1].value = elt;
> > fcinfo->args[1].isnull = false;
> > }
> >
> > [do stuff with Datum/isnull]
> >
> > /* advance bitmap pointer if any */
> > if (bitmap)
>

Re: Nicer error when connecting to standby with hot_standby=off

2021-03-22 Thread James Coleman
On Mon, Mar 22, 2021 at 1:49 PM Fujii Masao  wrote:
>
>
>
> On 2021/03/19 23:35, James Coleman wrote:
> > Here's an updated patch; I think I've gotten what Alvaro suggested.
>
> Thanks for updating the patch! But I was thinking that our consensus is
> something like the attached patch. Could you check this patch?

As far as I can tell (I might be missing something) your v5 patch does
the same thing, albeit with different code organization. Reviewing it
did catch, though, that I'd neglected to add the DETAIL line as separate
from the errmsg line.

Is the attached (in the style of my v4, since I'm not following why we
need to move the standby determination logic into a new
CAC_NOCONSISTENT block) what you're thinking? Or is there something
else I'm missing?

Thanks,
James


v6-0001-Improve-standby-connection-denied-error-message.patch
Description: Binary data


Re: Nicer error when connecting to standby with hot_standby=off

2021-03-22 Thread James Coleman
On Mon, Mar 22, 2021 at 2:52 PM Fujii Masao  wrote:
>
>
>
> On 2021/03/23 3:25, James Coleman wrote:
> > On Mon, Mar 22, 2021 at 1:49 PM Fujii Masao  
> > wrote:
> >>
> >>
> >>
> >> On 2021/03/19 23:35, James Coleman wrote:
> >>> Here's an updated patch; I think I've gotten what Alvaro suggested.
> >>
> >> Thanks for updating the patch! But I was thinking that our consensus is
> >> something like the attached patch. Could you check this patch?
> >
> > As far as I can tell (I might be missing something) your v5 patch does
> > the same thing, albeit with different code organization. It did catch
> > though that I'd neglected to add the DETAIL line as separate from the
> > errmsg line.
> >
> > Is the attached (in the style of my v4, since I'm not following why we
> > need to move the standby determination logic into a new
> > CAC_NOCONSISTENT block) what you're thinking? Or is there something
> > else I'm missing?
>
> I just did that to avoid adding more CAC_state. But basically it's
> ok to check hot standby at either canAcceptConnections() or
> ProcessStartupPacket().
>
> case CAC_STARTUP:
> ereport(FATAL,
> (errcode(ERRCODE_CANNOT_CONNECT_NOW),
> -errmsg("the database system is 
> starting up")));
> +errmsg("the database system is not 
> accepting connections"),
> +errdetail("Consistent recovery state 
> has not been yet reached.")));
>
> Do you want to report this message even in crash recovery? Since crash
> recovery is basically not so much related to "consistent recovery state",
> at least for me the original message seems more suitable for crash recovery.
>
> Also if we adopt this message, the server with hot_standby=off reports
> "Consistent recovery state has not been yet reached." in PM_STARTUP,
> but stops reporting this message at PM_RECOVERY even if the consistent
> recovery state has not been reached yet. Instead, it reports "Hot standby
> mode is disabled." at PM_RECOVERY. Isn't this transition of message confusing?

Are you saying we should only change the message for a single case:
the case where we'd otherwise allow connections but EnableHotStandby
is false? I believe that's what the original patch did, but then
Alvaro's proposal included changing additional messages.

James Coleman




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-23 Thread James Coleman
On Tue, Mar 23, 2021 at 1:46 AM Fujii Masao  wrote:
>
>
>
> On 2021/03/23 3:59, James Coleman wrote:
> > Are you saying we should only change the message for a single case:
> > the case where we'd otherwise allow connections but EnableHotStandby
> > is false?
>
> No. Let me clarify my opinion.
>
> At PM_STARTUP, "the database system is starting up" should be logged
> whatever the setting of hot_standby is. This is the same as the original
> behavior. During crash recovery, this message is output. Also at archive
> recovery or standby server, until the startup process sends
> PMSIGNAL_RECOVERY_STARTED, this message is logged.
>
> At PM_RECOVERY, originally "the database system is starting up" was logged
> whatever the setting of hot_standby is. My opinion is the same as our
> consensus, i.e., "the database system is not accepting connections" and
> "Hot standby mode is disabled." are logged if hot_standby is disabled.
> "the database system is not accepting connections" and "Consistent
>   recovery state has not been yet reached." are logged if hot_standby is
>   enabled.
>
> After the consistent recovery state is reached, if hot_standby is disabled,
> the postmaster state is still PM_RECOVERY. So "Hot standby mode is disabled."
> is still logged in this case. This is also different behavior from the 
> original.
> If hot_standby is enabled, read-only connections can be accepted because
> the consistent state is reached. So no message needs to be logged.
>
> Therefore for now what we've not reached the consensus is what message
> should be logged at PM_STARTUP. I'm thinking it's better to log
> "the database system is starting up" in that case because of the reasons
> that I explained upthread.
>
> Thought?

I understand your point now, and I agree, that makes sense.

The attached takes a similar approach to your v5, but I've used
CAC_NOTCONSISTENT instead of CAC_NOCONSISTENT because I think it reads
better (CAC_INCONSISTENT would technically be better English,
but...also it doesn't parallel the code and error message).

Thoughts?

James Coleman


v7-0001-Improve-standby-connection-denied-error-message.patch
Description: Binary data


Re: Nicer error when connecting to standby with hot_standby=off

2021-03-23 Thread James Coleman
On Tue, Mar 23, 2021 at 12:34 PM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > However, for this one
>
> > +   case CAC_NOTCONSISTENT:
> > +   if (EnableHotStandby)
> > +   ereport(FATAL,
> > +   (errcode(ERRCODE_CANNOT_CONNECT_NOW),
> > +errmsg("the database system is not accepting 
> > connections"),
> > +errdetail("Consistent recovery state has not been 
> > yet reached.")));
>
> > Maybe it makes sense to say "... is not accepting connections *yet*".
>
> +1, but I think "... is not yet accepting connections" is slightly
> better style.

All right, see attached v8.

James Coleman


v8-0001-Improve-standby-connection-denied-error-message.patch
Description: Binary data


Re: Nicer error when connecting to standby with hot_standby=off

2021-03-24 Thread James Coleman
On Wed, Mar 24, 2021 at 5:55 AM Fujii Masao  wrote:
>
>
>
> On 2021/03/24 16:59, Alvaro Herrera wrote:
> > On 2021-Mar-24, Fujii Masao wrote:
> >
> >> On 2021/03/24 5:59, Tom Lane wrote:
> >>> Alvaro Herrera  writes:
> >>>> FATAL:  the database system is starting up
> >>>> DETAIL:  WAL is being applied to recover from a system crash.
> >>>> or
> >>>> DETAIL:  The system is applying WAL to recover from a system crash.
> >>>> or
> >>>> DETAIL:  The startup process is applying WAL to recover from a system 
> >>>> crash.
> >>>
> >>> I don't think the postmaster has enough context to know if that's
> >>> actually true.  It just launches the startup process and waits for
> >>> results.  If somebody saw this during a normal (non-crash) startup,
> >>> they'd be justifiably alarmed.
> >>
> >> Yes, so logging "the database system is starting up" seems enough to me.
> >
> > No objection.
>
> Thanks! So I changed the message reported at PM_STARTUP to that one,
> based on v8 patch that James posted upthread. I also ran pgindent for
> the patch. Attached is the updated version of the patch.
>
> Barring any objection, I will commit this.

That looks good to me. Thanks for working on this.

James Coleman




Re: [EXTERNAL] Any objection to documenting pg_sequence_last_value()?

2021-04-06 Thread James Coleman
On Tue, Mar 30, 2021 at 4:37 AM Hanefi Onaldi
 wrote:
>
> Hi All,
>
> I recently used pg_sequence_last_value() when working on a feature in an 
> extension, and it would have been easier for me if there were some 
> documentation for this function.
>
> I'd like to help document this function if there are no objections.
>
> Best,
> Hanefi
>
> -Original Message-
> From: James Coleman 
> Sent: 6 Ağustos 2020 Perşembe 16:14
> To: pgsql-hackers 
> Subject: [EXTERNAL] Any objection to documenting pg_sequence_last_value()?
>
> The function pg_sequence_last_value() was added to underlie the pg_sequences 
> view, and it's the only way I'm aware of from userspace to directly get the 
> last value of a sequence globally (i.e., not within the current session like 
> currval()/lastval()). Obviously you can join to the pg_sequences view, but 
> that's sometimes unnecessarily cumbersome since it doesn't expose the relid 
> of the sequence.
>
> When that function got added it apparently wasn't added to the docs, though 
> I'm not sure if that was intentional or not.
>
> Does anyone have any objections to documenting
> pg_sequence_last_value() in the sequence manipulation functions doc page?
>
> James
>
>

Given there's been no objection, I think it'd be worth submitting a
patch (and I'd be happy to review if you're willing to author one).

James




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-04-06 Thread James Coleman
On Mon, Apr 5, 2021 at 11:58 PM David Rowley  wrote:
>
> On Sat, 20 Mar 2021 at 09:41, James Coleman  wrote:
> > I've attached a cleaned up patch. Last CF it was listed in is
> > https://commitfest.postgresql.org/29/2542/ -- what's the appropriate
> > step to take here given it's an already existing patch, but not yet
> > moved into recent CFs?
>
> I had a look at this patch.  I like the idea of using a simplehash.h
> hash table to hash the constant values so that repeat lookups can be
> performed much more quickly, however, I'm a bit concerned that there
> are quite a few places in the code where we often just execute a
> ScalarArrayOpExpr once and I'm a bit worried that we'll slow down
> expression evaluation of those cases.
>
> The two cases that I have in mind are:
>
> 1. eval_const_expressions() where we use the executor to evaluate the
> ScalarArrayOpExpr to see if the result is Const.
> 2. CHECK constraints with IN clauses and single-row INSERTs.

This is a good point I hadn't considered; now that you mention it, I
think another case would be expression evaluation in pl/pgsql.

> I tried to benchmark both of these but I'm struggling to get stable
> enough performance for #2, even with fsync=off.  Sometimes I'm getting
> results 2.5x slower than other runs.
>
> For benchmarking #1 I'm also not too sure I'm getting stable enough
> results for them to mean anything.
>
> I was running:
>
> create table a (a int);
>
> bench.sql: explain select * from a where a
> in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);
>
> drowley@amd3990x:~$ pgbench -n -T 60 -j 64 -c 64 -f bench.sql -P 10 postgres
> Master (6d41dd045):
> tps = 992586.991045 (without initial connection time)
> tps = 987964.990483 (without initial connection time)
> tps = 994309.670918 (without initial connection time)
>
> Master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch
> tps = 956022.557626 (without initial connection time)
> tps = 963043.352823 (without initial connection time)
> tps = 968582.600100 (without initial connection time)
>
> This puts the patched version about 3% slower. I'm not sure how much
> of that is changes in the binary and noise and how much is the
> needless hashtable build done for eval_const_expressions().
>
> I wondered if we should make it the query planner's job of deciding if
> the ScalarArrayOpExpr should be hashed or not.  I ended up with the
> attached rough-cut patch that introduces HashedScalarArrayOpExpr and
> has the query planner decide if it's going to replace
> ScalarArrayOpExpr with these HashedScalarArrayOpExpr during
> preprocess_expression().  I do think that we might want to consider
> being a bit selective about when we do these replacements.  It seems
> likely that we'd want to do this for EXPRKIND_QUAL and maybe
> EXPRKIND_TARGET, but I imagine that converting ScalarArrayOpExpr to
> HashedScalarArrayOpExpr for EXPRKIND_VALUES would be a waste of time
> since those will just be executed once.

In theory we might want to cost them differently as well, though I'm
slightly hesitant to do so at this point to avoid causing plan changes
(I'm not sure how we would balance that concern with the potential
that the best plan isn't chosen).

> I tried the same above test with the
> v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch plus the
> attached rough-cut patch and got:
>
> master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch
> + v5-0002-Rough-cut-patch-for-HashedScalarArrayOpExpr.patch
> tps = 1167969.983173 (without initial connection time)
> tps = 1199636.793314 (without initial connection time)
> tps = 1190690.939963 (without initial connection time)
>
> I can't really explain why this became faster. I was expecting it just
> to reduce that slowdown of the v4 patch a little. I don't really see
> any reason why it would become faster.  It's almost 20% faster which
> seems like too much to just be fluctuations in code alignment in the
> binary.

I'm not at a place where I can do good perf testing right now (just on
my laptop for the moment), unfortunately, so I can't confirm one way
or the other.

> The attached patch is still missing the required changes to
> llvmjit_expr.c. I think that was also missing from the original patch
> too, however.

Ah, I didn't realize that needed to be changed as well. I'll take a
look at that.

> Also, I added HashedScalarArrayOpExpr to plannodes.h.  All other Expr
> type nodes are in primnodes.h. However, I put HashedScalarArrayOpExpr
> in plannodes.h because the parser does not generate this and it's not
> going to be stored in the catalogue file

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-04-08 Thread James Coleman
On Thu, Apr 8, 2021 at 8:01 AM David Rowley  wrote:
>
> On Thu, 8 Apr 2021 at 22:54, David Rowley  wrote:
> >
> > I think the changes in the patch are fairly isolated and the test
> > coverage is now pretty good.  I'm planning on looking at the patch
> > again now and will consider pushing it for PG14.
>
> I push this with some minor cleanup from the v6 patch I posted earlier.
>
> David

Thank you!

I assume the proper procedure for the CF entry is to move it into the
current CF and then mark it as committed; however, I don't know how (or
don't have permission?) to move it into the current CF. How does one
go about doing that?

Here's the entry: https://commitfest.postgresql.org/29/2542/

Thanks,
James




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-04-08 Thread James Coleman
On Thu, Apr 8, 2021 at 1:04 PM Alvaro Herrera  wrote:
>
> On 2021-Apr-08, James Coleman wrote:
>
> > I assume proper procedure for the CF entry is to move it into the
> > current CF and then mark it as committed, however I don't know how (or
> > don't have permissions?) to move it into the current CF. How does one
> > go about doing that?
> >
> > Here's the entry: https://commitfest.postgresql.org/29/2542/
>
> Done, thanks.
>
> --
> Álvaro Herrera39°49'30"S 73°17'W

Thanks. Is that something I should be able to do myself (or should I
ask someone to grant me privileges in the app to do so)? I'm not
sure what the project policy is on that.

James




Processing btree walks as a batch to parallelize IO

2021-04-09 Thread James Coleman
$SUBJECT is still a very loosely formed idea, so forgive the lack of detail
and the things I've likely missed, but I wanted to get it out there to see
if it sounded at all intriguing to people.

Background: One of the big problems with non-local storage such as AWS EBS
volumes or a SAN is that when a large database (really, the working set,
where the working set includes reads) exceeds the size of the buffer cache
(and page cache), the cost of random page reads hitting the underlying disk
system dominates. This is because networked disks have an order of magnitude
higher latency than a bunch of RAIDed SSDs (even more so with NVMe
storage). In some of our experiments on Aurora I've seen a 10x difference
versus pretty good physical hardware, and I'd assume RDS (since it's
EBS-backed) is similar.

A specific area where this is particularly painful is btree index reads.
Walking the tree to leaf pages isn't naturally prefetchable, and so for
each level you pay the random page cost. Of course higher levels in the
tree will almost certainly end up in the buffer cache (just by virtue of
LRU caching), but for a large index the lower levels likely won't be.

If we squint a bit, insertions look a whole lot like reads as well since we
have to walk the tree to find the leaf insertion page for a new tuple. This
is particularly true for indexes where inserts are roughly randomly
distributed data, like a uuid.
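
As a concrete (hypothetical) illustration of the workload shape I have in
mind -- every row inserted below has to descend four separate btrees, and
with a random key like a uuid the relevant leaf pages are effectively
uncached on a big enough table (gen_random_uuid() is in core as of v13;
the table itself is made up for the example):

create table t (id uuid primary key, a int, b int, c int);
create index on t (a);
create index on t (b);
create index on t (c);
-- each insert walks all four btrees to find its leaf insertion page
insert into t
select gen_random_uuid(), i, i, i
from generate_series(1, 1000000) i;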

The read-for-lookups problem is harder to solve, but the cost as it relates
to table inserts is possibly more tractable. Tables typically have more
than one index to update, so the obvious approach is "let's just
parallelize the index insertions". Of course we know that's difficult given
the multi-process approach Postgres uses for parallelism.

Another approach that at first glance seems like it fits better into
Postgres (I'm not claiming it's easy or a small patch) would be to process
a batch of indexes at once. For example, if the index access methods were
extended to allow being given a list of indexes that need to be walked,
then the btree code could process each level of the walk as a group --
issuing IO fetches for all of the first-level blocks across the trees, then
computing all of the next-level blocks needed and issuing those IO requests
together, and so on.

In some workloads we've been testing I believe such an approach could
plausibly improve table insert (and update) performance by multiple
hundreds of percent.

I don't have any code at the moment to show here, but I wanted to get the
idea out there to see if there were any immediate reactions or other
thoughts on the topic.

Thoughts?

James


Re: Nicer error when connecting to standby with hot_standby=off

2021-04-09 Thread James Coleman
On Wed, Mar 24, 2021 at 9:43 PM Fujii Masao  wrote:
>
>
>
> On 2021/03/24 22:06, James Coleman wrote:
> > That looks good to me. Thanks for working on this.
>
> Thanks! I pushed the patch.
>
> Regards,
>
> --
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION

Thanks for reviewing and committing!

James




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-04-12 Thread James Coleman
On Mon, Apr 12, 2021 at 7:49 PM David Rowley  wrote:
>
> On Tue, 13 Apr 2021 at 11:42, Tom Lane  wrote:
> >
> > David Rowley  writes:
> > > I realised when working on something unrelated last night that we can
> > > also do hash lookups for NOT IN too.
> >
> > ... and still get the behavior right for nulls?
>
> Yeah, it will. There are already some special cases for NULLs in the
> IN version. Those would need to be adapted for NOT IN.

I hadn't thought about using the negator operator directly that way
when I initially wrote the patch.

But also I didn't think a whole lot about the NOT IN case at all --
and there's no mention of such that I see in this thread or the
precursor thread. It's pretty obvious that it wasn't part of my
immediate need, but obviously it'd be nice to have the consistency.

All that to say this: my vote would be to put it into PG15 also.

James




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-04-13 Thread James Coleman
On Mon, Apr 12, 2021 at 10:07 PM James Coleman  wrote:
>
> On Mon, Apr 12, 2021 at 7:49 PM David Rowley  wrote:
> >
> > On Tue, 13 Apr 2021 at 11:42, Tom Lane  wrote:
> > >
> > > David Rowley  writes:
> > > > I realised when working on something unrelated last night that we can
> > > > also do hash lookups for NOT IN too.
> > >
> > > ... and still get the behavior right for nulls?
> >
> > Yeah, it will. There are already some special cases for NULLs in the
> > IN version. Those would need to be adapted for NOT IN.
>
> I hadn't thought about using the negator operator directly that way
> when I initially wrote the patch.
>
> But also I didn't think a whole lot about the NOT IN case at all --
> and there's no mention of such that I see in this thread or the
> precursor thread. It's pretty obvious that it wasn't part of my
> immediate need, but obviously it'd be nice to have the consistency.
>
> All that to say this: my vote would be to put it into PG15 also.

...and here's a draft patch. I can take this to a new thread if you'd
prefer, since the patch here already got committed; on the other hand, this
is pretty strongly linked to this discussion, so I figured it made sense
to post it here.
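
For reference, the query shape this targets (reusing the table from the
benchmark upthread; with the patch, the <> ALL form that NOT IN parses
into can use the same hashed lookup):

explain select * from a
where a not in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);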

James
From 08c37c5b228a0a7e9546a481a789eb1c384fcfc7 Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Tue, 13 Apr 2021 13:36:38 -0400
Subject: [PATCH v1] Add HashedScalarArrayOp support for NOT IN

---
 src/backend/optimizer/prep/prepqual.c |  1 +
 src/backend/optimizer/util/clauses.c  |  4 +-
 src/backend/parser/parse_oper.c   |  1 +
 src/include/nodes/primnodes.h |  1 +
 src/test/regress/expected/expressions.out | 90 +++
 src/test/regress/sql/expressions.sql  | 31 
 6 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/src/backend/optimizer/prep/prepqual.c b/src/backend/optimizer/prep/prepqual.c
index 42c3e4dc04..2c29993b13 100644
--- a/src/backend/optimizer/prep/prepqual.c
+++ b/src/backend/optimizer/prep/prepqual.c
@@ -129,6 +129,7 @@ negate_clause(Node *node)
 	newopexpr->opfuncid = InvalidOid;
 	newopexpr->hashfuncid = InvalidOid;
 	newopexpr->useOr = !saopexpr->useOr;
+	newopexpr->isNegator = true;
 	newopexpr->inputcollid = saopexpr->inputcollid;
 	newopexpr->args = saopexpr->args;
 	newopexpr->location = saopexpr->location;
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index 526997327c..99e688426e 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -2137,8 +2137,8 @@ convert_saop_to_hashed_saop_walker(Node *node, void *context)
 		Oid			lefthashfunc;
 		Oid			righthashfunc;
 
-		if (saop->useOr && arrayarg && IsA(arrayarg, Const) &&
-			!((Const *) arrayarg)->constisnull &&
+		if ((saop->useOr || (!saop->useOr && saop->isNegator)) && arrayarg &&
+			IsA(arrayarg, Const) && !((Const *) arrayarg)->constisnull &&
 			get_op_hash_functions(saop->opno, &lefthashfunc, &righthashfunc) &&
 			lefthashfunc == righthashfunc)
 		{
diff --git a/src/backend/parser/parse_oper.c b/src/backend/parser/parse_oper.c
index c379d72fce..222f15719a 100644
--- a/src/backend/parser/parse_oper.c
+++ b/src/backend/parser/parse_oper.c
@@ -896,6 +896,7 @@ make_scalar_array_op(ParseState *pstate, List *opname,
 	result->opfuncid = opform->oprcode;
 	result->hashfuncid = InvalidOid;
 	result->useOr = useOr;
+	result->isNegator = strcmp("<>", NameStr(opform->oprname)) == 0;
 	/* inputcollid will be set by parse_collate.c */
 	result->args = args;
 	result->location = location;
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 9ae851d847..819856395e 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -592,6 +592,7 @@ typedef struct ScalarArrayOpExpr
 	Oid			opfuncid;		/* PG_PROC OID of comparison function */
 	Oid			hashfuncid;		/* PG_PROC OID of hash func or InvalidOid */
 	bool		useOr;			/* true for ANY, false for ALL */
+	bool		isNegator;		/* true if NOT has been applied to opno */
 	Oid			inputcollid;	/* OID of collation that operator should use */
 	List	   *args;			/* the scalar and array operands */
 	int			location;		/* token location, or -1 if unknown */
diff --git a/src/test/regress/expected/expressions.out b/src/test/regress/expected/expressions.out
index 5944dfd5e1..2e88f1ca19 100644
--- a/src/test/regress/expected/expressions.out
+++ b/src/test/regress/expected/expressions.out
@@ -216,6 +216,61 @@ select return_text_input('a') in ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h

Re: 13dev failed assert: comparetup_index_btree(): ItemPointer values should never be equal

2020-07-28 Thread James Coleman
On Tue, Jul 28, 2020 at 11:10 AM Justin Pryzby  wrote:
>
> An internal server aborted last night while running a maintenance script.  I
> reproduced this easily running the crashing command in a loop, and verified
> this is a live issue on REL_13_STABLE (dc6f2fb43).
> REINDEX INDEX pg_class_tblspc_relfilenode_index
>
> It looks like this crashed once before, and I didn't notice until now:
> Fri Jun 26 22:30:29 CDT 2020: pg_shdescription: 
> pg_toast.pg_toast_2396_index(reindex toast)...
> psql: error: could not connect to server: server closed the connection 
> unexpectedly
>
> #3  0x00afde98 in comparetup_index_btree (a=0x201fa58, b=0x201fa10, 
> state=0x20147b0) at tuplesort.c:4251
> 4251Assert(false);
> (gdb) l
> 4246if (pos1 != pos2)
> 4247return (pos1 < pos2) ? -1 : 1;
> 4248}
> 4249
> 4250/* ItemPointer values should never be equal */
> 4251Assert(false);
> 4252
> 4253return 0;
> 4254}
> 4255
>
> #3  0x00afde98 in comparetup_index_btree (a=0x201fa58, b=0x201fa10, 
> state=0x20147b0) at tuplesort.c:4251
> sortKey = 0x2014d60
> tuple1 = 0x20189d8
> tuple2 = 0x2018898
> keysz = 2
> tupDes = 0x7f48996b3790
> equal_hasnull = false
> nkey = 3
> compare = 0
> datum1 = 67999603
> datum2 = 67999603
> isnull1 = false
> isnull2 = false
> __func__ = "comparetup_index_btree"
>
> (gdb) p *tuple1
> $2 = {t_tid = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 43}, t_info = 16}
> (gdb) p *tuple2
> $3 = {t_tid = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 43}, t_info = 16}
>
> (gdb) p *sortKey
> $5 = {ssup_cxt = 0x40, ssup_collation = 0, ssup_reverse = false, 
> ssup_nulls_first = false, ssup_attno = 0, ssup_extra = 0x0, comparator = 
> 0x7f7f7f7f7f7f7f7f, abbreviate = 127, abbrev_converter = 0x7f7f7f7f7f7f7f7f,
>   abbrev_abort = 0x7f7f7f7f7f7f7f7f, abbrev_full_comparator = 
> 0x7f7f7f7f7f7f7f7f}
>
> (gdb) p *tupDes
> $6 = {natts = 2, tdtypeid = 0, tdtypmod = -1, tdrefcount = 1, constr = 0x0, 
> attrs = 0x7f48996b37a8}
>
> Core was generated by `postgres: postgres sentinel [local] REINDEX
> '.
>
> (gdb) bt
> #0  0x7f489853d1f7 in raise () from /lib64/libc.so.6
> #1  0x7f489853e8e8 in abort () from /lib64/libc.so.6
> #2  0x00aafff7 in ExceptionalCondition (conditionName=0xccd0dc 
> "false", errorType=0xccc327 "FailedAssertion", fileName=0xccc2fd 
> "tuplesort.c", lineNumber=4251) at assert.c:67
> #3  0x00afde98 in comparetup_index_btree (a=0x201fa58, b=0x201fa10, 
> state=0x20147b0) at tuplesort.c:4251
> #4  0x00af1d5e in qsort_tuple (a=0x201fa10, n=18, cmp_tuple=0xafcf21 
> , state=0x20147b0) at qsort_tuple.c:140
> #5  0x00af2104 in qsort_tuple (a=0x201f710, n=50, cmp_tuple=0xafcf21 
> , state=0x20147b0) at qsort_tuple.c:191
> #6  0x00af2104 in qsort_tuple (a=0x201cc38, n=544, cmp_tuple=0xafcf21 
> , state=0x20147b0) at qsort_tuple.c:191
> #7  0x00af8056 in tuplesort_sort_memtuples (state=0x20147b0) at 
> tuplesort.c:3490
> #8  0x00af51a9 in tuplesort_performsort (state=0x20147b0) at 
> tuplesort.c:1985
> #9  0x00529418 in _bt_leafbuild (btspool=0x1f784e0, btspool2=0x0) at 
> nbtsort.c:553
> #10 0x00528f9c in btbuild (heap=0x1fb5758, index=0x7f48996b3460, 
> indexInfo=0x1f77a48) at nbtsort.c:333
> #11 0x005adcb3 in index_build (heapRelation=0x1fb5758, 
> indexRelation=0x7f48996b3460, indexInfo=0x1f77a48, isreindex=true, 
> parallel=true) at index.c:2903
> #12 0x005aec6b in reindex_index (indexId=3455, 
> skip_constraint_checks=false, persistence=112 'p', options=2) at index.c:3539
> #13 0x00692583 in ReindexIndex (indexRelation=0x1f54840, options=0, 
> concurrent=false) at indexcmds.c:2466
> #14 0x00932e36 in standard_ProcessUtility (pstmt=0x1f54960, 
> queryString=0x1f53d90 "REINDEX INDEX pg_class_tblspc_relfilenode_index", 
> context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1f54c40,
> qc=0x7ffde023bf80) at utility.c:929
> #15 0x0093241f in ProcessUtility (pstmt=0x1f54960, 
> queryString=0x1f53d90 "REINDEX INDEX pg_class_tblspc_relfilenode_index", 
> context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1f54c40, 
> qc=0x7ffde023bf80)
> at utility.c:524
> #16 0x00931288 in PortalRunUtility (portal=0x1fb5ac0, 
> pstmt=0x1f54960, isTopLevel=true, setHoldSnapshot=false, dest=0x1f54c40, 
> qc=0x7ffde023bf80) at pquery.c:1157
> #17 0x009314a7 in PortalRunMulti (portal=0x1fb5ac0, isTopLevel=true, 
> setHoldSnapshot=false, dest=0x1f54c40, altdest=0x1f54c40, qc=0x7ffde023bf80) 
> at pquery.c:1303
> #18 0x009309bc in PortalRun (portal=0x1fb5ac0, 
> count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x1f54c40, 
> altdest=0x1f54c40, qc=0x7ffde023bf80) at pquery.c:779
> #19 0x0092ab5b i

Re: HashAgg's batching counter starts at 0, but Hash's starts at 1. (now: incremental sort)

2020-07-30 Thread James Coleman
On Thu, Jul 30, 2020 at 8:22 PM Justin Pryzby  wrote:

> On Wed, Jul 29, 2020 at 09:18:44PM -0700, Peter Geoghegan wrote:
> > On Wed, Jul 29, 2020 at 9:05 PM Justin Pryzby 
> wrote:
> > > So my 2ndary suggestion is to conditionalize based on the method
> rather than
> > > value of space used.
> >
> > What's the advantage of doing it that way?
>
> Because filtering out zero values is exactly what's intended to be avoided
> for
> nontext output.
>
> I think checking whether the method was used should result in the same
> output,
> without the literal check for zero value (which itself sets a bad example).
>
> --- a/src/backend/commands/explain.c
> +++ b/src/backend/commands/explain.c
> @@ -2824,13 +2824,13 @@
> show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
> appendStringInfo(&groupName, "%s Groups", groupLabel);
> ExplainOpenGroup("Incremental Sort Groups",
> groupName.data, true, es);
> ExplainPropertyInteger("Group Count", NULL,
> groupInfo->groupCount, es);
>
> ExplainPropertyList("Sort Methods Used", methodNames, es);
>
> -   if (groupInfo->maxMemorySpaceUsed > 0)
> +   if (groupInfo->sortMethods & SORT_TYPE_QUICKSORT)
> {
> longavgSpace =
> groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
> const char *spaceTypeName;
> StringInfoData memoryName;
>
> spaceTypeName =
> tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
> @@ -2841,13 +2841,13 @@
> show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
> ExplainPropertyInteger("Average Sort Space Used",
> "kB", avgSpace, es);
> ExplainPropertyInteger("Peak Sort Space Used",
> "kB",
>
>  groupInfo->maxMemorySpaceUsed, es);
>
> ExplainCloseGroup("Sort Spaces", memoryName.data,
> true, es);
> }
> -   if (groupInfo->maxDiskSpaceUsed > 0)
> +   if (groupInfo->sortMethods & SORT_TYPE_EXTERNAL_SORT)
> {
> longavgSpace =
> groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
> const char *spaceTypeName;
> StringInfoData diskName;
>
> spaceTypeName =
> tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
>

I very much do not like this approach, and I think it's actually
fundamentally wrong, at least for the memory check. Quicksort is not the
only option that uses memory. For now, there's only one option that spills
to disk (external merge sort), but there's no reason it has to remain that
way. And in the future we might accurately report memory consumed even when
we've eventually spilled to disk, so memory used could potentially be
relevant even if no in-memory sort was ever performed.

So I'm pretty confident checking the space used is the correct way to do
this.

James


Re: Use of "long" in incremental sort code

2020-07-31 Thread James Coleman
On Thu, Jul 30, 2020 at 10:12 PM David Rowley  wrote:
>
> On Fri, 3 Jul 2020 at 07:47, James Coleman  wrote:
> > Patch using int64 attached.
>
> I added this to the open items list for PG13.
>
> David

I'd previously attached a patch [1], and there seemed to be agreement
it was reasonable (lightly so, but I also didn't see any
disagreement); would someone be able to either commit the change or
provide some additional feedback?

Thanks,
James

[1]: 
https://www.postgresql.org/message-id/CAAaqYe_Y5zwCTFCJeso7p34yJgf4khR8EaKeJtGd%3DQPudOad6A%40mail.gmail.com




Re: [DOC] Document concurrent index builds waiting on each other

2020-07-31 Thread James Coleman
On Thu, Jul 16, 2020 at 7:34 PM David Johnston
 wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world:  not tested
> Implements feature:   not tested
> Spec compliant:   not tested
> Documentation:tested, passed
>
> James,
>
> I'm on board with the point of pointing out explicitly the "concurrent index 
> builds on multiple tables at the same time will not return on any one table 
> until all have completed", with back-patching.  I do not believe the new 
> paragraph is necessary though.  I'd suggest trying to weave it into the 
> existing paragraph ending "Even then, however, the index may not be 
> immediately usable for queries: in the worst case, it cannot be used as long 
> as transactions exist that predate the start of the index build."  Adding 
> "Notably, " in front of the existing sentence fragment above and tacking it 
> onto the end probably suffices.

I'm not sure "the index may not be immediately usable for queries" is
really accurate/sufficient: it seems to imply the CREATE INDEX has
returned but for some reason the index isn't yet valid. The issue I'm
trying to describe here is that the CREATE INDEX query itself will not
return until all preceding queries have completed *including*
concurrent index creations on unrelated tables.

> I don't actually don't whether this is true behavior though.  Is it something 
> our tests do, or could, demonstrate?

It'd take tests that exercise parallelism, but it's pretty simple to
demonstrate (but you do have to catch the first index build in a scan
phase, so you either need lots of data or a hack). Here's an example
that uses a bit of a hack to simulate a slow scan phase:

Setup:
create table items(i int);
create table others(i int);
create function slow_expr() returns text as $$ select pg_sleep(15);
select '5'; $$ language sql immutable;
insert into items(i) values (1), (2);
insert into others(i) values (1), (2);

Then the following in order:
1. In session A: create index concurrently on items((i::text || slow_expr()));
2. In session B (at the same time): create index concurrently on others(i);

You'll notice that the 2nd command, which should be practically
instantaneous, waits on the first ~30s scan phase of (1) before it
returns. The same is true if after (2) completes you immediately run
it again -- it waits on the second ~30s scan phase of (1).

That does reveal a bit of complexity, though, that the current patch
doesn't address: this behavior can be phase dependent (and that complexity
gets a lot harder to reason about when there's real live activity,
particularly long-running transactions, in the system as well).

I've attached a new patch series with two items:
1. A simpler (and I believe more correct) doc change for "cic blocks
cic on other tables".
2. A patch to document that all index builds can prevent tuples from
being vacuumed away on other tables.

If it's preferable we could commit the first and discuss the second
separately, but since that limitation was also discussed up-thread, I
decided to include them both here for now.

James


v2-0001-Document-concurrent-indexes-waiting-on-each-other.patch
Description: Binary data


v2-0002-Document-vacuum-on-one-table-depending-on-concurr.patch
Description: Binary data


Re: Comment simplehash/dynahash trade-offs

2020-07-31 Thread James Coleman
On Mon, Jul 20, 2020 at 1:29 AM Thomas Munro  wrote:
>
> On Fri, May 1, 2020 at 1:53 PM James Coleman  wrote:
> > In another thread [1] I'd mused that "there might be some value in a
> > README or comments
> > addition that would be a guide to what the various hash
> > implementations are useful for...so that we have something to make the
> > code base a bit more discoverable."
>
> +1
>
> > I'd solicited feedback from Andres (as the author of the simplehash
> > implementation) and gotten further explanation from Tomas (both cc'd
> > here) and have tried to condense that into the comment changes in this
> > patch series.
> >
> > v1-0001-Summarize-trade-offs-between-simplehash-and-dynah.patch
> > Contains the summaries mentioned above.
>
> + * - It supports partitioning, which is useful for shared memory access using
>
> I wonder if we should say a bit more about the shared memory mode.
> Shared memory dynahash tables are allocated in a fixed size area at
> startup, and are discoverable by name in other other processes that
> need to get access to them, while simplehash assumes that it can get
> memory from a MemoryContext or an allocator with a malloc/free-style
> interface, which isn't very well suited for use in shared memory.
> (I'm sure you can convince it to work in shared memory with some
> work.)

Added.

> > v1-0002-Improve-simplehash-usage-notes.patch
>
> + *For convenience the hash table create functions accept a void pointer
> + *will be stored in the hash table type's member private_data.
>
> *that* will be stored?

Fixed.

> > v1-0003-Show-sample-simplehash-method-signatures.patch
> > I find it hard to read the macro code "templating" particularly for
> > seeing what the available API is and so added sample method signatures
> > in comments to the macro generated method signature defines.
>
> I didn't double-check all the expansions of the macros but +1 for this
> idea, it's very useful.

James


v2-0003-Show-sample-simplehash-method-signatures.patch
Description: Binary data


v2-0002-Improve-simplehash-usage-notes.patch
Description: Binary data


v2-0001-Summarize-trade-offs-between-simplehash-and-dynah.patch
Description: Binary data


Re: Nicer error when connecting to standby with hot_standby=off

2020-07-31 Thread James Coleman
On Wed, Jul 29, 2020 at 11:24 AM Fujii Masao
 wrote:
>
>
>
> On 2020/04/03 22:49, James Coleman wrote:
> > On Thu, Apr 2, 2020 at 5:53 PM David Zhang  wrote:
> >>
> >> The following review has been posted through the commitfest application:
> >> make installcheck-world:  not tested
> >> Implements feature:   tested, passed
> >> Spec compliant:   not tested
> >> Documentation:not tested
> >>
> >> I applied the patch to the latest master branch and run a test below. The 
> >> error messages have been separated. Below is the test steps.
> >>
> >> ### setup primary server
> >> initdb -D /tmp/primary/data
> >> mkdir /tmp/archive_dir
> >> echo "archive_mode='on'" >> /tmp/primary/data/postgresql.conf
> >> echo "archive_command='cp %p /tmp/archive_dir/%f'" >> 
> >> /tmp/primary/data/postgresql.conf
> >> pg_ctl -D /tmp/primary/data -l /tmp/primary-logs start
> >>
> >> ### setup host standby server
> >> pg_basebackup -p 5432 -w -R -D /tmp/hotstandby
> >> echo "primary_conninfo='host=127.0.0.1 port=5432 user=pgdev'" >> 
> >> /tmp/hotstandby/postgresql.conf
> >> echo "restore_command='cp /tmp/archive_dir/%f %p'" >> 
> >> /tmp/hotstandby/postgresql.conf
> >> echo "hot_standby = off" >> /tmp/hotstandby/postgresql.conf
> >> pg_ctl -D /tmp/hotstandby -l /tmp/hotstandby-logs -o "-p 5433" start
> >>
> >> ### keep trying to connect to hot standby server in order to get the error 
> >> messages in different stages.
> >> while true; do echo "`date`"; psql postgres -p 5433 -c "SELECT 
> >> txid_current_snapshot();" sleep 0.2; done
> >>
> >> ### before the patch
> >> psql: error: could not connect to server: FATAL:  the database system is 
> >> starting up
> >> ...
> >>
> >> ### after the patch, got different messages, one message indicates 
> >> hot_standby is off
> >> psql: error: could not connect to server: FATAL:  the database system is 
> >> starting up
> >> ...
> >> psql: error: could not connect to server: FATAL:  the database system is 
> >> up, but hot_standby is off
> >> ...
> >
> > Thanks for the review and testing!
>
> Thanks for the patch! Here is the comment from me.
>
> +   else if (!FatalError && pmState == PM_RECOVERY)
> +   return CAC_STANDBY; /* connection disallowed on 
> non-hot standby */
>
> Even if hot_standby is enabled, pmState seems to indicate PM_RECOVERY
> until recovery has reached a consistent state. No? So, if my understanding
> is right, "FATAL:  the database system is up, but hot_standby is off" can
> be logged even when hot_standby is on. Which sounds very confusing.

That's a good point. I've attached a corrected version.

I still don't have a good idea for how to add a test for this change.
If a test for this is warranted, I'd be interested in any ideas.

James


v2-0001-Improve-standby-connection-denied-error-message.patch
Description: Binary data


Re: Comment simplehash/dynahash trade-offs

2020-08-01 Thread James Coleman
On Fri, Jul 31, 2020 at 8:17 PM Thomas Munro  wrote:
>
> On Sat, Aug 1, 2020 at 7:22 AM James Coleman  wrote:
> > [v2 patch set]
>
> I ran it through pgindent which insisted on adding some newlines, I
> manually replaced some spaces with tabs to match nearby lines, I added
> some asterisks in your example function prototypes where  is
> returned because they seemed to be missing, and I pushed this.
> Thanks!


Thanks for reviewing and committing!

James




Re: pg13dev: explain partial, parallel hashagg, and memory use

2020-08-04 Thread James Coleman
On Tue, Aug 4, 2020 at 9:44 PM David Rowley  wrote:
>
> On Wed, 5 Aug 2020 at 13:21, Justin Pryzby  wrote:
> >
> > I'm testing with a customer's data on pg13dev and got output for which Peak
> > Memory doesn't look right/useful.  I reproduced it on 565f16902.
>
> Likely the sanity of those results depends on whether you think that
> the Memory Usage reported outside of the workers is meant to be the
> sum of all processes or the memory usage for the leader backend.
>
> All that's going on here is that the Parallel Append is using some
> parallel safe paths and giving one to each worker. The 2 workers take
> the first 2 subpaths and the leader takes the third.  The memory usage
> reported helps confirm that's the case.
>
> Can you explain what you'd want to see changed about this?   Or do you
> want to see the non-parallel worker memory be the sum of all workers?
> Sort does not seem to do that, so I'm not sure if we should consider
> hash agg as an exception to that.

I've always found the way we report parallel workers in EXPLAIN quite
confusing. I realize it matches the actual implementation model (the
leader often is also "another worker"), but I think the natural
expectation from a user perspective would be that you'd show as
workers all backends (including the leader) that did work, and then
aggregate into a summary line (where the leader is displayed now).

In the current output there's nothing really to hint to the user that
the model is leader + workers and that the "summary" line is really
the leader. If I were to design this from scratch, I'd want to propose
doing what I said above (summary aggregate line + treat leader as a
worker line, likely with a "leader" tag), but that seems like a big
change to make now. On the other hand, perhaps designating what looks
like a summary line as the "leader" or some such would help clear up
the confusion? Perhaps it could also say "Participating" or
"Non-participating"?

James




Any objection to documenting pg_sequence_last_value()?

2020-08-06 Thread James Coleman
The function pg_sequence_last_value() was added to underlie the
pg_sequences view, and it's the only way I'm aware of from userspace
to directly get the last value of a sequence globally (i.e., not
within the current session like currval()/lastval()). Obviously you
can join to the pg_sequences view, but that's sometimes unnecessarily
cumbersome since it doesn't expose the relid of the sequence.
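
For context, a minimal usage example (as I understand the current
behavior, it takes the sequence's regclass and returns NULL until the
sequence has actually been advanced):

create sequence s;
select pg_sequence_last_value('s'::regclass);  -- NULL, nextval() not yet called
select nextval('s');
select pg_sequence_last_value('s'::regclass);  -- 1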

When that function got added it apparently wasn't added to the docs,
though I'm not sure if that was intentional or not.

Does anyone have any objections to documenting
pg_sequence_last_value() in the sequence manipulation functions doc
page?

James




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-08-10 Thread James Coleman
On Mon, Aug 10, 2020 at 8:37 PM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > To recap: currently, any CREATE INDEX CONCURRENTLY will wait for all
> > other CICs running concurrently to finish, because they can't be
> > distinguished amidst other old snapshots.  We can change things by
> > having CIC set a special flag in PGPROC (like PROC_IN_VACUUM) indicating
> > that it's doing CIC; other CICs will see that flag and will know that
> > they don't need to wait for those processes.  With this, CIC on small
> > tables don't have to wait for CIC on large tables to complete.
>
> Hm.  +1 for improving this, if we can, but ...
>
> It seems clearly unsafe to ignore a CIC that is in active index-building;
> a snapshot held for that purpose is just as real as any other.  It *might*
> be all right to ignore a CIC that is just waiting, but you haven't made
> any argument in the patch comments as to why that's safe either.
> (Moreover, at the points where we're just waiting, I don't think we have
> a snapshot, so another CIC's WaitForOlderSnapshots shouldn't wait for us
> anyway.)

Why is a CIC in active index-building something we need to wait for?
Wouldn't it fall under a similar kind of logic to the other snapshot
types we can explicitly ignore? CIC can't be run in a manual
transaction, so the snapshot it holds won't be used to perform
arbitrary operations (i.e., the reason why a manual ANALYZE can't be
ignored).
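
For reference, that restriction is easy to demonstrate (the table name here
is just an example):

begin;
create index concurrently on items (i);
-- ERROR:  CREATE INDEX CONCURRENTLY cannot run inside a transaction block
rollback;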

> Actually, it doesn't look like you've touched the comments at all.
> WaitForOlderSnapshots' header comment has a long explanation of why
> it's safe to ignore certain processes.  That certainly needs to be
> updated by any patch that's going to change the rules.

Agreed that the comment needs to be updated to discuss the
(im)possibility of arbitrary operations within a snapshot held by CIC.

James




Re: massive FPI_FOR_HINT load after promote

2020-08-11 Thread James Coleman
On Tue, Aug 11, 2020 at 2:55 AM Masahiko Sawada
 wrote:
>
> On Tue, 11 Aug 2020 at 07:56, Alvaro Herrera  wrote:
> >
> > Last week, James reported to us that after promoting a replica, some
> > seqscan was taking a huge amount of time; on investigation he saw that
> > there was a high rate of FPI_FOR_HINT wal messages by the seqscan.
> > Looking closely at the generated traffic, HEAP_XMIN_COMMITTED was being
> > set on some tuples.
> >
> > Now this may seem obvious to some as a drawback of the current system,
> > but I was taken by surprise.  The problem was simply that when a page is
> > examined by a seqscan, we do HeapTupleSatisfiesVisibility of each tuple
> > in isolation; and for each tuple we call SetHintBits().  And only the
> > first time the FPI happens; by the time we get to the second tuple, the
> > page is already dirty, so there's no need to emit an FPI.  But the FPI
> > we sent only had the bit on the first tuple ... so the standby will not
> > have the bit set for any subsequent tuple.  And on promotion, the
> > standby will have to have the bits set for all those tuples, unless you
> > happened to dirty the page again later for other reasons.
> >
> > So if you have some table where tuples gain hint bits in bulk, and
> > rarely modify the pages afterwards, and promote before those pages are
> > frozen, then you may end up with a massive amount of pages that will
> > need hinting after the promote, which can become troublesome.
>
> Did the case you observed not use hot standby? I thought the impact of
> this issue could be somewhat alleviated in hot standby cases since
> read queries on the hot standby can set hint bits.

We do have hot standby enabled, and large queries that do seq scans do
sometimes run against a replica. But there are multiple replicas (and
each one would have to have the bits set), and a given replica that gets
promoted in our topology isn't guaranteed to be one that has seen those
reads.

James




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-08-11 Thread James Coleman
On Tue, Aug 11, 2020 at 11:14 AM Tom Lane  wrote:
>
> James Coleman  writes:
> > Why is a CIC in active index-building something we need to wait for?
> > Wouldn't it fall under a similar kind of logic to the other snapshot
> > types we can explicitly ignore? CIC can't be run in a manual
> > transaction, so the snapshot it holds won't be used to perform
> > arbitrary operations (i.e., the reason why a manual ANALYZE can't be
> > ignored).
>
> Expression indexes that call user-defined functions seem like a
> pretty serious risk factor for that argument.  Those are exactly
> the same expressions that ANALYZE will evaluate, as a result of
> which we judge it unsafe to ignore.  Why would CIC be different?

The comments for WaitForOlderSnapshots() don't discuss that problem at
all; as far as ANALYZE goes they only say:

* Manual ANALYZEs, however, can't be excluded because they
* might be within transactions that are going to do arbitrary operations
* later.

But nonetheless over in the thread Álvaro linked to we'd discussed
these issues already. Andres in [1] and [2] believed that:

> For the snapshot waits we can add a procarray flag
> (alongside PROCARRAY_VACUUM_FLAG) indicating that the backend is
> doing. Which WaitForOlderSnapshots() can then use to ignore those CICs,
> which is safe, because those transactions definitely don't insert into
> relations targeted by CIC. The change to WaitForOlderSnapshots() would
> just be to pass the new flag to GetCurrentVirtualXIDs, I think.

and

> What I was thinking of was a new flag, with a distinct value from
> PROC_IN_VACUUM. It'd currently just be specified in the
> GetCurrentVirtualXIDs() calls in WaitForOlderSnapshots(). That'd avoid
> needing to wait for other CICs on different relations. Since CIC is not
> permitted on system tables, and CIC doesn't do DML on normal tables, it
> seems fairly obviously correct to exclude other CICs.

In [3] I'd brought up that a function index can do arbitrary
operations (including, terribly, a query of another table), and Andres
(in [4]) noted that:

> Well, even if we consider this an actual problem, we could still use
> PROC_IN_CIC for non-expression non-partial indexes (index operator
> themselves better ensure this isn't a problem, or they're ridiculously
> broken already - they can get called during vacuum).

But he went on to describe how this is a problem with all expression
indexes (even if those expressions don't do dangerous things) because
of syscache lookups, and that ideally for expression indexes we'd have
a way to use a different (or more frequently taken) snapshot for the
purpose of computing those expressions. That's a significantly more
involved patch, though.

So from what I understand, everything that I'd claimed in my previous
message still holds true for non-expression/non-partial indexes. Is
there something else I'm missing?

James

1: 
https://www.postgresql.org/message-id/20200325191935.jjhdg2zy5k7ath5v%40alap3.anarazel.de
2: 
https://www.postgresql.org/message-id/20200325195841.gq4hpl25t6pxv3gl%40alap3.anarazel.de
3: 
https://www.postgresql.org/message-id/CAAaqYe_fveT_tvKonVt1imujOURUUMrydGeaBGahqLLy9D-REw%40mail.gmail.com
4: 
https://www.postgresql.org/message-id/20200416221207.wmnzbxvjqqpazeob%40alap3.anarazel.de




Re: Nicer error when connecting to standby with hot_standby=off

2020-09-08 Thread James Coleman
On Tue, Aug 18, 2020 at 12:25 PM Fujii Masao
 wrote:
> Thanks for updating the patch! But it failed to be applied to the master 
> branch
> cleanly because of the recent commit 0038f94387. So could you update the patch
> again?

Updated patch attached.

James


v3-0001-Improve-standby-connection-denied-error-message.patch
Description: Binary data


Re: [DOC] Document concurrent index builds waiting on each other

2020-09-08 Thread James Coleman
On Fri, Jul 31, 2020 at 2:51 PM James Coleman  wrote:
>
> On Thu, Jul 16, 2020 at 7:34 PM David Johnston
>  wrote:
> >
> > The following review has been posted through the commitfest application:
> > make installcheck-world:  not tested
> > Implements feature:   not tested
> > Spec compliant:   not tested
> > Documentation:tested, passed
> >
> > James,
> >
> > I'm on board with the point of pointing out explicitly the "concurrent 
> > index builds on multiple tables at the same time will not return on any one 
> > table until all have completed", with back-patching.  I do not believe the 
> > new paragraph is necessary though.  I'd suggest trying to weave it into the 
> > existing paragraph ending "Even then, however, the index may not be 
> > immediately usable for queries: in the worst case, it cannot be used as 
> > long as transactions exist that predate the start of the index build."  
> > Adding "Notably, " in front of the existing sentence fragment above and 
> > tacking it onto the end probably suffices.
>
> I'm not sure "the index may not be immediately usable for queries" is
> really accurate/sufficient: it seems to imply the CREATE INDEX has
> returned but for some reason the index isn't yet valid. The issue I'm
> trying to describe here is that the CREATE INDEX query itself will not
> return until all preceding queries have completed *including*
> concurrent index creations on unrelated tables.
>
> > I don't actually don't whether this is true behavior though.  Is it 
> > something our tests do, or could, demonstrate?
>
> It'd take tests that exercise parallelism, but it's pretty simple to
> demonstrate (but you do have to catch the first index build in a scan
> phase, so you either need lots of data or a hack). Here's an example
> that uses a bit of a hack to simulate a slow scan phase:
>
> Setup:
> create table items(i int);
> create table others(i int);
> create function slow_expr() returns text as $$ select pg_sleep(15);
> select '5'; $$ language sql immutable;
> insert into items(i) values (1), (2);
> insert into others(i) values (1), (2);
>
> Then the following in order:
> 1. In session A: create index concurrently on items((i::text || slow_expr()));
> 2. In session B (at the same time): create index concurrently on others(i);
>
> You'll notice that the 2nd command, which should be practically
> instantaneous, waits on the first ~30s scan phase of (1) before it
> returns. The same is true if after (2) completes you immediately run
> it again -- it waits on the second ~30s scan phase of (1).
>
> That does reveal a bit of complexity though that that the current
> patch doesn't address, which is that this can be phase dependent (and
> that complexity gets a lot more non-obvious when there's real live
> activity (particularly long-running transactions) in the system as
> well.
>
> I've attached a new patch series with two items:
> 1. A simpler (and I believe more correct) doc changes for "cic blocks
> cic on other tables".
> 2. A patch to document that all index builds can prevent tuples from
> being vacuumed away on other tables.
>
> If it's preferable we could commit the first and discuss the second
> separately, but since that limitation was also discussed up-thread, I
> decided to include them both here for now.

Álvaro's patch confused the current state of this thread, so I'm
reattaching (rebased) v2 as v3.

James


v3-0002-Document-vacuum-on-one-table-depending-on-concurr.patch
Description: Binary data


v3-0001-Document-concurrent-indexes-waiting-on-each-other.patch
Description: Binary data


Re: PROC_IN_ANALYZE stillborn 13 years ago

2020-09-08 Thread James Coleman
On Sat, Aug 29, 2020 at 8:06 PM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > I pushed despite the objection because it seemed that downstream
> > discussion was largely favorable to the change, and there's a different
> > proposal to solve the bloat problem for analyze; and also:
>
> Note that this quasi-related patch has pretty thoroughly hijacked
> the CF entry for James' original docs patch proposal.  The cfbot
> thinks that that's the latest patch in the original thread, and
> unsurprisingly is failing to apply it.
>
> Since the discussion was all over the place, I'm not sure whether
> there's still a live docs patch proposal or not; but if so, somebody
> should repost that patch (and go back to the original thread title).

I replied to the original email thread with reposted patches.

James




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-09-08 Thread James Coleman
On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas  wrote:
>
> On 01/05/2020 05:20, James Coleman wrote:
> > On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
> >  wrote:
> > ...
> >> Any particular reasons to pick dynahash over simplehash? ISTM we're
> >> using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
> >> while there are not many places using dynahash for simple short-lived
> >> hash tables. Of course, that alone is a weak reason to insist on using
> >> simplehash here, but I suppose there were reasons for not using dynahash
> >> and we'll end up facing the same issues here.
> >
> > I've attached a patch series that includes switching to simplehash.
> > Obviously we'd really just collapse all of these patches, but it's
> > perhaps interesting to see the changes required to use each style
> > (binary search, dynahash, simplehash).
> >
> > As before, there are clearly comments and naming things to be
> > addressed, but the implementation should be reasonably clean.
>
> Looks good, aside from the cleanup work that you mentioned. There are a
> few more cases that I think you could easily handle with very little
> extra code:
>
> You could also apply the optimization for non-Const expressions, as long
> as they're stable (i.e. !contain_volatile_functions(expr)).

Is that true? Don't we also have to worry about whether or not the
value is stable (i.e., know when a param has changed)? There have been
discussions about being able to cache stable subexpressions, and my
understanding was that we'd need to have that infrastructure (along
with the necessary invalidation when the param changes) to be able
to use this for non-Const expressions.
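
For example (hypothetical schema), in something like:

select *
from customers c
join lateral (
  select *
  from orders o
  where o.tag = any (c.preferred_tags)
) x on true;

the array expression isn't volatile, but its value changes for every
outer row, so a hash table built on the first call would have to be
invalidated and rebuilt whenever the underlying param changes.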

> Deconstructing the array Datum into a simple C array on first call would
> be a win even for very small arrays and for AND semantics, even if you
> don't use a hash table.

Because you wouldn't have to repeatedly detoast it? Or some other
reason I'm not thinking of? My intuition would have been that (aside
from detoasting if necessary) there wouldn't be much real overhead in,
for example, an array storing integers.

James




Re: Parallelize correlated subqueries that execute within each worker

2021-11-03 Thread James Coleman
On Wed, Sep 8, 2021 at 8:47 AM James Coleman  wrote:

> See updated patch series attached.

Jaime,

I noticed on 3-October you moved this into "waiting on author"; I
don't see anything waiting in this thread, however. Am I missing
something?

I'm planning to change it back to "needs review".

Thanks,
James




Re: Consider parallel for lateral subqueries with limit

2021-11-03 Thread James Coleman
On Fri, Jul 16, 2021 at 3:16 PM James Coleman  wrote:
>
> On Thu, May 27, 2021 at 9:01 PM Greg Nancarrow  wrote:
> >
> > On Tue, Dec 8, 2020 at 10:46 AM James Coleman  wrote:
> > >
> > > While I haven't actually tracked down to guarantee this is handled
> > > elsewhere, a thought experiment -- I think -- shows it must be so.
> > > Here's why: suppose we don't have a limit here, but the query return
> > > order is different in different backends. Then we would have the same
> > > problem you bring up. In that case this code is already setting
> > > consider_parallel=true on the rel. So I don't think we're changing any
> > > behavior here.
> > >
> >
> > AFAICS, the patch seems very reasonable and specifically targets
> > lateral subqueries with limit/offset. It seems like the uncorrelated
> > case is the only real concern.
> > I generally agree that the current patch is probably not changing any
> > behavior in the uncorrelated case (and like yourself, haven't yet
> > found a case for which it breaks), but I'm not sure Brian's concerns
> > can be ruled out entirely.
> >
> > How about a minor update to the patch to make it slightly more
> > restrictive, to exclude the case when there are no lateral
> > cross-references, so we'll be allowing parallelism only when we know
> > the lateral subquery will be evaluated anew for each source row?
> > I was thinking of the following patch modification:
> >
> > BEFORE:
> > -if (limit_needed(subquery))
> > +   if (!rte->lateral && limit_needed(subquery))
> >
> > AFTER:
> > -if (limit_needed(subquery))
> > +   if ((!rte->lateral || bms_is_empty(rel->lateral_relids)) &&
> > + limit_needed(subquery))
> >
> >
> > Thoughts?
>
> Apologies for the delayed response; this seems fine to me. I've
> attached patch v2.

Greg,

Do you believe this is now ready for committer?

Thanks,
James Coleman




Re: Parallelize correlated subqueries that execute within each worker

2021-11-03 Thread James Coleman
Hi Robert, thanks for the detailed reply.

On Wed, Nov 3, 2021 at 10:48 AM Robert Haas  wrote:
>
> As a preliminary comment, it would be quite useful to get Tom Lane's
> opinion on this, since it's not an area I understand especially well,
> and I think he understands it better than anyone.
>
> On Fri, May 7, 2021 at 12:30 PM James Coleman  wrote:
> > The basic idea is that we need to track (both on nodes and relations)
> > not only whether that node or rel is parallel safe but also whether
> > it's parallel safe assuming params are rechecked in the using context.
> > That allows us to delay making a final decision until we have
> > sufficient context to conclude that a given usage of a Param is
> > actually parallel safe or unsafe.
>
> I don't really understand what you mean by "assuming params are
> rechecked in the using context." However, I think that a possibly
> better approach to this whole area would be to try to solve the
> problem by putting limits on where you can insert a Gather node.
> Consider:
>
> Nested Loop
> -> Seq Scan on x
> -> Index Scan on y
>Index Cond: y.q = x.q
>
> If you insert a Gather node atop the Index Scan, necessarily changing
> it to a Parallel Index Scan, then you need to pass values around. For
> every value we get for x.q, we would need to start workers, sending
> them the value of x.q, and they do a parallel index scan working
> together to find all rows where y.q = x.q, and then exit. We repeat
> this for every tuple from x.q. In the absence of infrastructure to
> pass those parameters, we can't put the Gather there. We also don't
> want to, because it would be really slow.
>
> If you insert the Gather node atop the Seq Scan or the Nested Loop, in
> either case necessarily changing the Seq Scan to a Parallel Seq Scan,
> you have no problem. If you put it on top of the Nested Loop, the
> parameter will be set in the workers and used in the workers and
> everything is fine. If you put it on top of the Seq Scan, the
> parameter will be set in the leader -- by the Nested Loop -- and used
> in the leader, and again you have no problem.
>
> So in my view of the world, the parameter just acts as an additional
> constraint on where Gather nodes can be placed. I don't see that there
> are any parameters that are unsafe categorically -- they're just
> unsafe if the place where they are set is on a different side of the
> Gather from the place where they are used. So I don't understand --
> possibly just because I'm dumb -- the idea behind
> consider_parallel_rechecking_params, because that seems to be making a
> sort of overall judgement about the safety or unsafety of the
> parameter on its own merits, rather than thinking about the Gather
> placement.

I had to read through this several times before I understood the point
(not your fault, this is, as you note, a complicated area). I *think*
if I grok it properly you're effectively describing what this patch
results in conceptually (but possibly solving it from a different
direction).

As I understand the current code, parallel plans are largely chosen
based not on where it's safe to insert a Gather node but rather by
determining if a given path is parallel safe. Through that lens params
are a bit of an odd man out -- they aren't inherently unsafe in the
way a parallel-unsafe function is, but they can only be used in
parallel plans under certain conditions (whether because of project
policy, performance, or missing infrastructure).

Under that paradigm the existing consider_parallel and parallel_safe
boolean values imply everything is about whether a plan is inherently
parallel safe. Thus the current code doesn't have the context to handle
the nuance of params (as they are not inherently parallel-safe or unsafe).

Introducing consider_parallel_rechecking_params and
parallel_safe_ignoring_params allows us to keep more context on params
and make a more nuanced decision at the proper level of the plan. This
is what I mean by "rechecked in the using context", though I realize
now that both "recheck" and "context" are overloaded terms in the
project, so don't describe the concept particularly clearly. When a
path relies on params we can only make a final determination about its
parallel safety if we know whether or not the current parallel node
can provide the param's value. We don't necessarily know that
information until we attempt to generate a full parallel node in the
plan (I think what you're describing as "inserting a Gather node")
since the param may come from another node in the plan. These new
values allow us to do that by tracking tentatively parallel-safe
subplans (given proper Gather node placement) and delaying the final
parallel-safety determination until we know whether the params they need
can be supplied below the Gather.
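
As a (hypothetical) illustration of the query shape I have in mind,
consider:

select *
from foo f
where f.total > (
  select sum(amount)
  from bar b
  where b.foo_id = f.id
);

If the Gather sits above the node scanning foo, the param carrying f.id
is set and used entirely below the Gather, so each worker can evaluate
the correlated subplan for its own rows -- which I believe is the same
thing you're describing in terms of restricting Gather placement.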

Re: Misleading comment in tuplesort_set_bound

2019-09-05 Thread James Coleman
Yes, planning on it, just a bit behind right now so will likely be a
few more days at least.

On Thu, Sep 5, 2019 at 4:57 PM Alvaro Herrera from 2ndQuadrant
 wrote:
>
> On 2019-Aug-26, Tom Lane wrote:
>
> > James Coleman  writes:
>
> > I think the comment is fine as-is.  Perhaps the code would be clearer
> > though, if we merged those two asserts into one?
> >
> >   /* Assert we're called before loading any tuples */
> >   Assert(state->status == TSS_INITIAL &&
> >  state->memtupcount == 0);
>
> Makes sense to me.  James, do you want to submit a new patch?
>
> > I'm not totally sure about the usefulness/relevance of the two
> > assertions following these, but they could likely do with their
> > own comment(s), because this one surely isn't covering them.
>
>
> --
> Álvaro Herrera    https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-09-12 Thread James Coleman
> OK, so we have that now.  I suppose this spreadsheet now tells us which
> places are useful and which aren't, at least for the queries that you've
> tested.  Dowe that mean that we want to get the patch to consider adding
> paths only the places that your spreadsheet says are useful?  I'm not
> sure what the next steps are for this patch.

I wanted to note here that I haven't abandoned this patch, but ended
up needing to use my extra time for working on a conference talk. That
talk is today, so I'm hoping to be able to catch up on this again
soon.

James Coleman




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-09-13 Thread James Coleman
On Fri, Sep 13, 2019 at 10:54 AM Tomas Vondra
 wrote:
>
> On Thu, Sep 12, 2019 at 11:54:06AM -0400, James Coleman wrote:
> >> OK, so we have that now.  I suppose this spreadsheet now tells us which
> >> places are useful and which aren't, at least for the queries that you've
> >> tested.  Dowe that mean that we want to get the patch to consider adding
> >> paths only the places that your spreadsheet says are useful?  I'm not
> >> sure what the next steps are for this patch.
> >
> >I wanted to note here that I haven't abandoned this patch, but ended
> >up needing to use my extra time for working on a conference talk. That
> >talk is today, so I'm hoping to be able to catch up on this again
> >soon.
> >
>
> Good! I'm certainly looking forward to a new patch version.
>
> As discussed in the past, this patch is pretty sensitive (large, touches
> planning, ...), so we should try getting most of it in not too late in
> the cycle. For example 2019-11 would be nice.

Completely agree; originally I'd hoped to have it in rough draft
finished form to get serious review in the September CF...but...




pg_rewind docs correction

2019-09-13 Thread James Coleman
The pg_rewind docs assert that the state of the target's data directory
after rewind is equivalent to the source's data directory. But that
isn't true both because the base state is further back in time and
because the target's data directory will include the current state on
the source of any copied blocks.

So I've attached a patch to summarize more correctly as well as
document clearly the state of the cluster after the operation and also
the operation sequencing dangers caused by copying configuration
files from the source.

James Coleman


001_pg_rewind_explanation_doc_fixes.patch
Description: Binary data


Re: [DOC] Document auto vacuum interruption

2019-09-13 Thread James Coleman
On Sat, Aug 31, 2019 at 10:51 PM Amit Kapila  wrote:
>
> On Fri, Jul 26, 2019 at 1:45 AM James Coleman  wrote:
> >
> > We've discussed this internally many times, but today finally decided
> > to write up a doc patch.
> >
>
> Thanks, I think something on the lines of what you have written can
> help some users to understand the behavior in this area and there
> doesn't seem to be any harm in giving such information to the user.
>
> > Autovacuum holds a SHARE UPDATE EXCLUSIVE lock, but other processes
> > can cancel autovacuum if blocked by that lock unless the autovacuum is
> > to prevent wraparound.This can result in very surprising behavior:
> > imagine a system that needs to run ANALYZE manually before batch jobs
> > to ensure reasonable query plans. That ANALYZE will interrupt attempts
> > to run autovacuum, and pretty soon the table is far more bloated than
> > expected, and query plans (ironically) degrade further.
> >
>
> +If a process attempts to acquire a SHARE UPDATE
> EXCLUSIVE
> +lock (the lock type held by autovacuum), lock acquisition will interrupt
> +the autovacuum.
>
> I think it is not only for a process that tries to acquire a lock in
> SHARE UPDATE EXCLUSIVE mode, rather when a process tries to acquire
> any lock mode that conflicts with SHARE UPDATE EXCLUSIVE.  For the
> conflicting lock modes, you can refer docs [1] (See Table 13.2.
> Conflicting Lock Modes).
>
> [1] - https://www.postgresql.org/docs/devel/explicit-locking.html

Updated patch attached. I changed the wording to be about conflicting
locks rather than a single lock type, added a link to the conflicting
locks table, and fixed a few sgml syntax issues in the original.
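
For anyone skimming the thread, the surprising behavior is easy to
trigger with nothing more than (hypothetical table name):

-- assume autovacuum is currently processing the "events" table
analyze events;

ANALYZE takes SHARE UPDATE EXCLUSIVE, which conflicts with the lock the
autovacuum worker already holds, so after deadlock_timeout the
(non-wraparound) autovacuum is canceled rather than the ANALYZE
continuing to wait behind it.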

James Coleman


autovacuum-interruption-v2.patch
Description: Binary data


Re: pg_rewind docs correction

2019-09-14 Thread James Coleman
On Sat, Sep 14, 2019 at 12:20 AM Michael Paquier  wrote:
>
> On Fri, Sep 13, 2019 at 01:47:03PM -0400, James Coleman wrote:
> > So I've attached a patch to summarize more correctly as well as
> > document clearly the state of the cluster after the operation and also
> > the operation sequencing dangers caused by copying configuration
> > files from the source.
>
> +   After a successful rewind, the target data directory is equivalent
> to the
> +   to the state of the data directory at the point at which the
> source and
> +   target diverged plus the current state on the source of any blocks
> changed
> +   on the target after that divergence. While only changed blocks
> from relation
> +   files are copied; all other files are copied in full, including
> configuration
> +   files and WAL segments. The advantage of
> pg_rewind
> +   over taking a new base backup, or tools like
> rsync,
> +   is that pg_rewind does not require
> comparing or
> +   copying unchanged relation blocks in the cluster. As such the
> rewind operation
> +   is significantly faster than other approaches when the database is
> large and
> +   only a small fraction of blocks differ between the clusters.
>
> The point of divergence could be defined as the LSN position where WAL
> has forked on the new timeline, but the block diffs are copied from
> actually the last checkpoint just before WAL has forked.  So this new
> paragraph brings confusion about the actual divergence point.
>
> Regarding the relation files, if the file does not exist on the target
> but does exist on the source, it is also copied fully, so the second
> sentence is wrong here to mention as relation files could also be
> copied fully.

Updated (plus some additional wordsmithing).

James Coleman


001_pg_rewind_explanation_doc_fixes_v2.patch
Description: Binary data


Re: pg_rewind docs correction

2019-09-15 Thread James Coleman
On Sun, Sep 15, 2019 at 10:25 AM Michael Paquier  wrote:
>
> On Sat, Sep 14, 2019 at 07:00:54PM -0400, James Coleman wrote:
> > Updated (plus some additional wordsmithing).
>
> +The rewind operation is not expected to result in a consistent data
> +directory state either internally to the node or with respect to the rest
> +of the cluster. Instead the resulting data directory will only be 
> consistent
> +after WAL replay has completed to at least the LSN at which changed 
> blocks
> +copied from the source were originally written on the source.
>
> That's not necessarily true.  pg_rewind enforces in the control file
> of the target the minimum consistency LSN to be
> pg_current_wal_insert_lsn() when using a live source or the last
> checkpoint LSN for a stopped source, so while that sounds true from
> the point of view of all the blocks copied, the control file may still
> cause a complain that the target recovering has not reached its
> consistent point even if all the blocks are already at a position
> not-so-far from what has been registered in the control file.

I could just say "after WAL replay has completed to a consistent state"?

> +   the point at which the WAL timelines of the source and target diverged 
> plus
> +   the current state on the source of any blocks changed on the target after
> +   that divergence. While only changed blocks from existing relation files 
> are
>
> And here we could mention that all the blocks copied from the source
> are the ones which are found in the WAL records of the target until
> the end of WAL of its timeline.  Still, that's basically what is
> mentioned in the first part of "How It Works", which explains things
> better.  I honestly don't really see that all this paragraph is an
> improvement over the simplicity of the original when it comes to
> understand the global idea of what pg_rewind does.

The problem with the original is that while simple, it's actually
incorrect in that simplicity. Pg_rewind does *not* result in the data
directory on the target matching the data directory on the source.

> +  
> +Because pg_rewind copies configuration files
> +entirely from the source, correcting recovery configuration options 
> before
> +restarting the server is necessary if you intend to re-introduce the 
> target
> +as a replica of the source. If you restart the server after the rewind
> +operation has finished but without configuring recovery, the target will
> +again diverge from the primary.
> +   
>
> No objections regarding that part.  Now it seems to me that we had
> better apply that to the last part of "How it works" instead?  I kind
> of agree that the last paragraph could provide more details regarding
> the risks of overwriting the wanted configuration.  The existing docs
> also mention that pg_rewind only creates a backup_label file to start
> recovery, perhaps we could mention up to which point recovery happens
> in this section?  There is a bit more here than just "apply the WAL".

I'll look to see if there's a better place to put this.

James Coleman




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-09-15 Thread James Coleman
On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
 wrote:
> >> ...
> >>
> >> I think this may be a thinko, as this plan demonstrates - but I'm not
> >> sure about it. I wonder if this might be penalizing some other types of
> >> plans (essentially anything with limit + gather).
> >>
> >> Attached is a WIP patch fixing this by considering both startup and
> >> total cost (by calling compare_path_costs_fuzzily).
> >
> >It seems to me that this is likely a bug, and not just a changed
> >needed for this. Do you think it's better addressed in a separate
> >thread? Or retain it as part of this patch for now (and possibly break
> >it out later)? On the other hand, it's entirely possible that someone
> >more familiar with parallel plan limitations could explain why the
> >above comment holds true. That makes me lean towards asking in a new
> >thread.
> >
>
> Maybe. I think creating a separate thread would be useful, provided we
> manage to demonstrate the issue without an incremental sort.

I did some more thinking about this, and I can't currently come up
with a way to reproduce this issue outside of this patch. It doesn't
seem reasonable to me to assume that there's anything inherent about
this patch that means it's the only way we can end up with a partial
path with a low startup cost we'd want to prefer.

Part of me wants to pull it over to a separate thread just to get
additional feedback, but I'm not sure how useful that is given we
don't currently have an example case outside of this patch.

One thing to note though: the current patch does not also modify
add_partial_path_precheck which also does not take into account
startup cost, so we probably need to update that for completeness's
sake.

James Coleman




Re: [DOC] Document auto vacuum interruption

2019-09-17 Thread James Coleman
On Tue, Sep 17, 2019 at 2:21 AM Amit Kapila  wrote:
>
> On Fri, Sep 13, 2019 at 11:59 PM James Coleman  wrote:
> >
> > On Sat, Aug 31, 2019 at 10:51 PM Amit Kapila  
> > wrote:
> > >
> >
> > Updated patch attached. I changed the wording to be about conflicting
> > locks rather than a single lock type, added a link to the conflicting
> > locks table, and fixed a few sgml syntax issues in the original.
> >
>
> I see error while compiling this patch on HEAD.  See the below error:
> /usr/bin/xmllint --path . --noout --valid postgres.sgml
> postgres.sgml:833: element xref: validity error : IDREF attribute
> linkend references an unknown ID
> "mvcc-locking-tables-table-lock-compatibility"
> make: *** [check] Error 4
>
> The tag id mvcc-locking-tables-table-lock-compatibility is wrong.

My apologies; I'd fixed that on my local copy before sending my last
email, but I must have somehow grabbed the wrong patch file to attach
to the email.

> The
> other problem I see is the wrong wording in one of the literals. I
> have fixed both of these issues and slightly tweaked one of the
> sentence.  See the updated patch attached.  On which version, are you
> preparing this patch?  I see both HEAD and 9.4 has the problems fixed
> by me.
>
> Let me know what you think of attached?  I think we can back-patch
> this patch.  What do you think?  Does anyone else have an opinion on
> this patch especially if we see any problem in back-patching this?

The attached looks great!

I was working on HEAD for the patch, but this concern has been an
issue for quite a long time. We were running into it on 9.6 in
production, for example. And given how frequently it seems like there
are large-scale production issues related to auto vacuum, I think any
amount of back patching we can do to make that footgun less likely
would be a good thing.

James Coleman




Re: pg_rewind docs correction

2019-09-17 Thread James Coleman
On Tue, Sep 17, 2019 at 3:51 AM Michael Paquier  wrote:
>
> On Sun, Sep 15, 2019 at 10:36:04AM -0400, James Coleman wrote:
> > On Sun, Sep 15, 2019 at 10:25 AM Michael Paquier  
> > wrote:
> >> +The rewind operation is not expected to result in a consistent data
> >> +directory state either internally to the node or with respect to the 
> >> rest
> >> +of the cluster. Instead the resulting data directory will only be 
> >> consistent
> >> +after WAL replay has completed to at least the LSN at which changed 
> >> blocks
> >> +copied from the source were originally written on the source.
> >>
> >> That's not necessarily true.  pg_rewind enforces in the control file
> >> of the target the minimum consistency LSN to be
> >> pg_current_wal_insert_lsn() when using a live source or the last
> >> checkpoint LSN for a stopped source, so while that sounds true from
> >> the point of view of all the blocks copied, the control file may still
> >> cause a complain that the target recovering has not reached its
> >> consistent point even if all the blocks are already at a position
> >> not-so-far from what has been registered in the control file.
> >
> > I could just say "after WAL replay has completed to a consistent state"?
>
> I still would not change this paragraph.  The first sentence means
> that we have an equivalency, because that's the case if you think
> about it as we make sure that the target is able to sync with the
> source, and the target gets into a state where it as an on-disk state
> equivalent to the target up to the minimum consistency point defined
> in the control file once the tool has done its work (this last point
> is too precise to be included in a global description to be honest).
> And the second sentence makes clear what are the actual diffs are.
> >> +   the point at which the WAL timelines of the source and target diverged 
> >> plus
> >> +   the current state on the source of any blocks changed on the target 
> >> after
> >> +   that divergence. While only changed blocks from existing relation 
> >> files are
> >>
> >> And here we could mention that all the blocks copied from the source
> >> are the ones which are found in the WAL records of the target until
> >> the end of WAL of its timeline.  Still, that's basically what is
> >> mentioned in the first part of "How It Works", which explains things
> >> better.  I honestly don't really see that all this paragraph is an
> >> improvement over the simplicity of the original when it comes to
> >> understand the global idea of what pg_rewind does.
> >
> > The problem with the original is that while simple, it's actually
> > incorrect in that simplicity. Pg_rewind does *not* result in the data
> > directory on the target matching the data directory on the source.
>
> That's not what I get from the original docs, but I may be too much
> used to it.

I don't agree that that's a valid equivalency. I myself spent a lot of
time trying to understand how this could possibly be true a while
back, and even looked at source code to be certain. I've asked other
people and found the same confusion.

As I read it, the second sentence doesn't actually tell you the
differences; it makes a quick attempt at summarizing *how* the first
sentence is true, but if the first sentence isn't accurate, then it's
hard to read the 2nd one as helping.

If you'd prefer something less detailed at that point in
the docs, then something along the lines of "results in a data
directory state which can then be safely replayed from the source" or
some such.

The docs shouldn't be correct only for someone who already understands
the intricacies. And the end user shouldn't have to read the "how it
works" (which incidentally is kinda hidden at the bottom underneath
the CLI args -- perhaps we could move that?) to extrapolate things in
the primary documentation.

James Coleman




[DOC] Document concurrent index builds waiting on each other

2019-09-18 Thread James Coleman
In my experience the implications of how concurrent index builds manage
transactions are not immediately obvious (even after reading the
documentation), particularly when multiple concurrent index builds are
in flight at the same time.

Specifically, as I understand it, multiple concurrent index builds
running at the same time will all return at the same time as the longest
running one.

I've attached a small patch to call this caveat out specifically in
the documentation. I think the description in the patch is accurate,
but please let me know if there's some intricacies around how the
various stages might change the results.

James Coleman
commit 9e28e704820eebb81ff94c1c3cbfb7db087b2c45
Author: James Coleman 
Date:   Wed Sep 18 13:36:22 2019 -0400

Document concurrent indexes waiting on each other

It's not immediately obvious that because concurrent index building
waits on previously running transactions to complete, running multiple
concurrent index builds at the same time will result in each of them
taking as long to return as the longest takes, so, document this caveat.

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..35f15abb0e 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -616,6 +616,13 @@ Indexes:
 cannot.

 
+   
+Because the second table scan must wait for any transactions having a
+snapshot preceding the start of that scan to finish before completing the
+scan, concurrent index builds on multiple tables at the same time will
+not return on any one table until all have completed.
+   
+

 Concurrent builds for indexes on partitioned tables are currently not
 supported.  However, you may concurrently build the index on each


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-09-27 Thread James Coleman
On Mon, Sep 9, 2019 at 5:55 PM Tomas Vondra
 wrote:
> The "patched" column means all developer GUCs disabled, so it's expected
> to produce the same plan as master (and it is). And then there's one
> column for each developer GUC. If the column is just TRUE it means the
> GUC does not affect any of the synthetic queries. There are 4 of them:
>
> - devel_add_paths_to_grouping_rel_parallel
> - devel_create_partial_grouping_paths
> - devel_gather_grouping_paths
> - devel_standard_join_search
>
> The places controlled by those GUCs are either useless, or the query
> affected by them is not included in the list of queries.

I'd previously found (in my reverse engineering efforts) the query:

select *
from tenk1 t1
join tenk1 t2 on t1.hundred = t2.hundred
join tenk1 t3 on t1.hundred = t3.hundred
order by t1.hundred, t1.twenty
limit 50;

can change plans to use incremental sort when
generate_useful_gather_paths() is added to standard_join_search().
Specifically, we get a merge join between t1 and t3 as the top level
(besides limit) node where the driving side of the join is a gather
merge with incremental sort. This does rely on these gucs set in the
test harness:

set local max_parallel_workers_per_gather=4;
set local min_parallel_table_scan_size=0;
set local parallel_tuple_cost=0;
set local parallel_setup_cost=0;

So I think we can reduce the number of unused gucs to 3.

James




Consider low startup cost in add_partial_path

2019-09-27 Thread James Coleman
Over in the incremental sort patch discussion we found [1] a case
where a higher cost plan ends up being chosen because a low startup
cost partial path is ignored in favor of a lower total cost partial
path, and a limit is applied on top of that which would normally favor
the lower startup cost plan.
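
To illustrate with made-up numbers: the cost of fetching N of R rows is
roughly startup + (N/R) * (total - startup), so under a small limit a
cheap-startup path can win despite a much higher total cost:

select 50000 + (10.0 / 1000000) * (60000 - 50000)  as low_total_cost_path,
       0     + (10.0 / 1000000) * (100000 - 0)     as low_startup_cost_path;
-- roughly 50000.1 vs 1.0: ignoring startup cost discards the better path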

45be99f8cd5d606086e0a458c9c72910ba8a613d originally added
`add_partial_path` with the comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily
come up with a scenario in which it's not, but I also can't come up
with an inherent reason why such a scenario cannot exist.

We could just continue to include this change as part of the
incremental sort patch itself, but it seemed worth it to me to break
it out for some more targeted discussion, and also include Robert as
the initial author of add_partial_path in the hopes that maybe we
could retrieve some almost 4-year-old memories on why this was
inherently true then, and maybe that would shed some light on whether
it's still inherently true.

I've attached a patch (by Tomas Vondra, also cc'd) to consider startup
cost in add_partial_path, but should we apply the patch we'll also
likely need to apply the same kind of change to
add_partial_path_precheck.

James Coleman

[1]: 
https://www.postgresql.org/message-id/20190720132244.3vgg2uynfpxh3me5%40development


consider-startup-cost-in-add-partial-path_v1.patch
Description: Binary data


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-09-27 Thread James Coleman
On Mon, Sep 16, 2019 at 6:32 AM Tomas Vondra
 wrote:
>
> On Sun, Sep 15, 2019 at 09:33:33PM -0400, James Coleman wrote:
> >On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
> > wrote:
> >> >> ...
> >> >>
> >> >> I think this may be a thinko, as this plan demonstrates - but I'm not
> >> >> sure about it. I wonder if this might be penalizing some other types of
> >> >> plans (essentially anything with limit + gather).
> >> >>
> >> >> Attached is a WIP patch fixing this by considering both startup and
> >> >> total cost (by calling compare_path_costs_fuzzily).
> >> >
> >> >It seems to me that this is likely a bug, and not just a changed
> >> >needed for this. Do you think it's better addressed in a separate
> >> >thread? Or retain it as part of this patch for now (and possibly break
> >> >it out later)? On the other hand, it's entirely possible that someone
> >> >more familiar with parallel plan limitations could explain why the
> >> >above comment holds true. That makes me lean towards asking in a new
> >> >thread.
> >> >
> >>
> >> Maybe. I think creating a separate thread would be useful, provided we
> >> manage to demonstrate the issue without an incremental sort.
> >
> >I did some more thinking about this, and I can't currently come up
> >with a way to reproduce this issue outside of this patch. It doesn't
> >seem reasonable to me to assume that there's anything inherent about
> >this patch that means it's the only way we can end up with a partial
> >path with a low startup cost we'd want to prefer.
> >
> >Part of me wants to pull it over to a separate thread just to get
> >additional feedback, but I'm not sure how useful that is given we
> >don't currently have an example case outside of this patch.
> >
>
> Hmm, I see.
>
> While I initially suggested to start a separate thread only if we have
> example not involving an incremental sort, that's probably not a hard
> requirement. I think it's fine to start a thead briefly explaining the
> issue, and pointing to incremental sort thread for actual example.
>
> >
> >One thing to note though: the current patch does not also modify
> >add_partial_path_precheck which also does not take into account
> >startup cost, so we probably need to update that for completeness's
> >sake.
> >
>
> Good point. It does indeed seem to make the same assumption about only
> comparing total cost before calling add_path_precheck.

I've started a new thread to discuss:
https://www.postgresql.org/message-id/CAAaqYe-5HmM4ih6FWp2RNV9rruunfrFrLhqFXF_nrrNCPy1Zhg%40mail.gmail.com

James




Re: Consider low startup cost in add_partial_path

2019-09-28 Thread James Coleman
On Saturday, September 28, 2019, Tomas Vondra 
wrote:

> On Sat, Sep 28, 2019 at 12:16:05AM -0400, Robert Haas wrote:
>
>> On Fri, Sep 27, 2019 at 2:24 PM James Coleman  wrote:
>>
>>> Over in the incremental sort patch discussion we found [1] a case
>>> where a higher cost plan ends up being chosen because a low startup
>>> cost partial path is ignored in favor of a lower total cost partial
>>> path and a limit is a applied on top of that which would normal favor
>>> the lower startup cost plan.
>>>
>>> 45be99f8cd5d606086e0a458c9c72910ba8a613d originally added
>>> `add_partial_path` with the comment:
>>>
>>> > Neither do we need to consider startup costs:
>>> > parallelism is only used for plans that will be run to completion.
>>> > Therefore, this routine is much simpler than add_path: it needs to
>>> > consider only pathkeys and total cost.
>>>
>>> I'm not entirely sure if that is still true or not--I can't easily
>>> come up with a scenario in which it's not, but I also can't come up
>>> with an inherent reason why such a scenario cannot exist.
>>>
>>
>> I think I just didn't think carefully about the Limit case.
>>
>>
> Thanks! In that case I suggest we treat it as a separate patch/fix,
> independent of the incremental sort patch. I don't want to bury it in
> that patch series, it's already pretty large.
>

Now the trick is to figure out a way to demonstrate it in a test :)

Basically we need:
Path A: Can short circuit with LIMIT but has high total cost
Path B: Can’t short circuit with LIMIT but has lower total cost

(Both must be parallel aware of course.)

Maybe ordering in B can be a sort node and A can be an index scan (perhaps
with very high random page cost?) and force choosing a parallel plan?

I’m trying to describe this to jog my thoughts (not in front of my laptop
right now so can’t try it out).

Any other ideas?

James


Re: [DOC] Document concurrent index builds waiting on each other

2019-09-28 Thread James Coleman
On Sat, Sep 28, 2019 at 9:22 PM Alvaro Herrera  wrote:
>
> On 2019-Sep-28, Bruce Momjian wrote:
>
> > The CREATE INDEX docs already say:
> >
> > In a concurrent index build, the index is actually entered into
> > the system catalogs in one transaction, then two table scans occur in
> > two more transactions.  Before each table scan, the index build must
> > wait for existing transactions that have modified the table to 
> > terminate.
> > After the second scan, the index build must wait for any transactions
> > --> that have a snapshot (see ) predating the second
> > --> scan to terminate.  Then finally the index can be marked ready for use,
> >
> > So, having multiple concurrent index scans is just a special case of
> > having to "wait for any transactions that have a snapshot", no?  I am
> > not sure adding a doc mention of other index builds really is helpful.
>
> I always thought that create index concurrently was prevented from
> running concurrently in a table by the ShareUpdateExclusive lock that's
> held during the operation.

You mean multiple CICs on a single table at the same time? Yes, that
(unfortunately) isn't possible, but I'm concerned in the patch with
the fact that CIC on table X blocks CIC on table Y.

James




Re: [DOC] Document concurrent index builds waiting on each other

2019-09-28 Thread James Coleman
On Sat, Sep 28, 2019 at 9:56 PM Bruce Momjian  wrote:
>
> On Sat, Sep 28, 2019 at 09:54:48PM -0400, James Coleman wrote:
> > On Sat, Sep 28, 2019 at 9:22 PM Alvaro Herrera  
> > wrote:
> > >
> > > On 2019-Sep-28, Bruce Momjian wrote:
> > >
> > > > The CREATE INDEX docs already say:
> > > >
> > > > In a concurrent index build, the index is actually entered into
> > > > the system catalogs in one transaction, then two table scans occur 
> > > > in
> > > > two more transactions.  Before each table scan, the index build must
> > > > wait for existing transactions that have modified the table to 
> > > > terminate.
> > > > After the second scan, the index build must wait for any 
> > > > transactions
> > > > --> that have a snapshot (see ) predating the 
> > > > second
> > > > --> scan to terminate.  Then finally the index can be marked ready for 
> > > > use,
> > > >
> > > > So, having multiple concurrent index scans is just a special case of
> > > > having to "wait for any transactions that have a snapshot", no?  I am
> > > > not sure adding a doc mention of other index builds really is helpful.

While that may be technically true, as a co-worker of mine likes to
point out, being "technically correct" is the worst kind of correct.

Here's what I mean:

First, I believe the docs should aim to be as useful as possible even
to those with a more entry-level understanding of PostgreSQL. The fact
that the paragraph you cite links to the entire chapter on
concurrency control in Postgres demonstrates that there's some
not-so-immediate stuff here to consider. For one: is it obvious to all
users that the transaction held by CIC (or even that all transactions)
has an open snapshot?

Second, this is a difference from a regular CREATE INDEX, and we
already call out as caveats differences between CREATE INDEX
CONCURRENTLY and regular CREATE INDEX as I point out below re:
Alvaro's comment.

Third, related to the above point, many DDL commands only block DDL
against the table being operated on. The fact that CIC here is
different is, in my opinion, a fairly surprising break from that
pattern, and as such likely to catch users off guard. I can attest
that this surprised at least one entire database team a while back :)
including many people who've been operating Postgres at a large scale
for a long time.

I believe caveats like this are worth calling out rather than
expecting users to have to understand the implementation details and
work out the implications on their own.

> > > I always thought that create index concurrently was prevented from
> > > running concurrently in a table by the ShareUpdateExclusive lock that's
> > > held during the operation.
> >
> > You mean multiple CICs on a single table at the same time? Yes, that
> > (unfortunately) isn't possible, but I'm concerned in the patch with
> > the fact that CIC on table X blocks CIC on table Y.
>
> I think any open transaction will block CIC, which is my point.

I read Alvaro as referring to the fact that the docs already call out
the following:

> Regular index builds permit other regular index builds on the same table to 
> occur simultaneously, but only one concurrent index build can occur on a 
> table at a time.

James




Re: Consider low startup cost in add_partial_path

2019-10-02 Thread James Coleman
On Sat, Sep 28, 2019 at 7:21 PM James Coleman  wrote:
> Now the trick is to figure out a way to demonstrate it in test :)
>
> Basically we need:
> Path A: Can short circuit with LIMIT but has high total cost
> Path B: Can’t short circuit with LIMIT but has lower total cost
>
> (Both must be parallel aware of course.)

I'm adding one requirement, or clarifying it anyway: the above paths
must be partial paths, and can't just apply at the top level of the
parallel part of the plan. I.e., the lower startup cost has to matter
at a subtree of the parallel portion of the plan.

> Maybe ordering in B can be a sort node and A can be an index scan (perhaps 
> with very high random page cost?) and force choosing a parallel plan?
>
> I’m trying to describe this to jog my thoughts (not in front of my laptop 
> right now so can’t try it out).
>
> Any other ideas?

I've been playing with this a good bit, and I'm struggling to come up
with a test case. Because the issue only manifests in a subtree of the
parallel portion of the plan, a scan on a single relation won't do.
Merge join seems like a good area to look at because it requires
ordering, and that ordering can be either the result of an index scan
(short-circuit-able) or an explicit sort (not short-circuit-able). But
I've been unable to make that result in any different plans with
either 2 or 3 relations joined together, ordered, and a limit applied.
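
(Concretely, I've been trying query shapes roughly like the following,
over a simple table t with an indexed pk column:)

select t1.pk
from t t1
join t t2 on t1.pk = t2.pk
order by t1.pk
limit 10;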

In all cases I've been starting with:

set enable_hashjoin = off;
set enable_nestloop = off;
set max_parallel_workers_per_gather = 4;
set min_parallel_index_scan_size = 0;
set min_parallel_table_scan_size = 0;
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;

I've also tried various combinations of random_page_cost,
cpu_index_tuple_cost, cpu_tuple_cost.

Interestingly I've noticed plans joining two relations that look like:

 Limit
   ->  Merge Join
 Merge Cond: (t1.pk = t2.pk)
 ->  Gather Merge
   Workers Planned: 4
   ->  Parallel Index Scan using t_pkey on t t1
 ->  Gather Merge
   Workers Planned: 4
   ->  Parallel Index Scan using t_pkey on t t2

Where I would have expected a Gather Merge above a parallelized merge
join. Is that reasonable to expect?

If there doesn't seem to be an obvious way to reproduce the issue
currently, but we know we have a reproduction example along with
incremental sort, what is the path forward for this? Is it reasonable
to try to commit it anyway knowing that it's a "correct" change and has
been demonstrated elsewhere?

James




Re: Consider low startup cost in add_partial_path

2019-10-24 Thread James Coleman
On Fri, Oct 4, 2019 at 8:36 AM Robert Haas  wrote:
>
> On Wed, Oct 2, 2019 at 10:22 AM James Coleman  wrote:
> > In all cases I've been starting with:
> >
> > set enable_hashjoin = off;
> > set enable_nestloop = off;
> > set max_parallel_workers_per_gather = 4;
> > set min_parallel_index_scan_size = 0;
> > set min_parallel_table_scan_size = 0;
> > set parallel_setup_cost = 0;
> > set parallel_tuple_cost = 0;
> >
> > I've also tried various combinations of random_page_cost,
> > cpu_index_tuple_cost, cpu_tuple_cost.
> >
> > Interestingly I've noticed plans joining two relations that look like:
> >
> >  Limit
> >->  Merge Join
> >  Merge Cond: (t1.pk = t2.pk)
> >  ->  Gather Merge
> >Workers Planned: 4
> >->  Parallel Index Scan using t_pkey on t t1
> >  ->  Gather Merge
> >Workers Planned: 4
> >->  Parallel Index Scan using t_pkey on t t2
> >
> > Where I would have expected a Gather Merge above a parallelized merge
> > join. Is that reasonable to expect?
>
> Well, you told the planner that parallel_setup_cost = 0, so starting
> workers is free. And you told the planner that parallel_tuple_cost =
> 0, so shipping tuples from the worker to the leader is also free. So
> it is unclear why it should prefer a single Gather Merge over two
> Gather Merges: after all, the Gather Merge is free!
>
> If you use give those things some positive cost, even if it's smaller
> than the default, you'll probably get a saner-looking plan choice.

That makes sense.

Right now trying to come up with a separate test for this feels a bit
like a distraction.

Given there doesn't seem to be an obvious way to reproduce the issue
currently, but we know we have a reproduction example along with
incremental sort, what is the path forward for this? Is it reasonable
to try to commit it anyway knowing that it's a "correct" change and has
been demonstrated elsewhere?

James




Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's

2019-03-02 Thread James Coleman
On Fri, Mar 1, 2019 at 5:28 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > [ saop_is_not_null-v10.patch ]
>
> I went through this again, and this time (after some more rewriting
> of the comments) I satisfied myself that the logic is correct.
> Hence, pushed.

Thanks!

> I also tweaked it to recognize the case where we can prove the
> array, rather than the scalar, to be null.  I'm not sure how useful
> that is in practice, but it seemed weird that we'd exploit that
> only if we can also prove the scalar to be null.

Just for my own understanding: I thought the "if
(arrayconst->constisnull)" took care of the array constant being null?
I don't see a check on the scalar node / lhs. I do see you added a
check for the entire clause being null, but I'm not sure I understand
when that would occur (unless it's only in the recursive case?)

> Take a look at the ScalarArrayOp case in eval_const_expressions.
> Right now it's only smart about the all-inputs-constant case.
> I'm not really convinced it's worth spending cycles on the constant-
> null-array case, but that'd be where to do it if we want to do it
> in a general way.  (I think that what I added to clause_is_strict_for
> is far cheaper, because while it's the same test, it's in a place
> that we won't hit during most queries.)

Thanks for the pointer; I'll take a look if for no other reason than curiosity.

Thanks,
James Coleman


