Re: [HACKERS] Parallel Hash take II
While rebasing the parallel B-tree index build patch on top of the v22 patch set, I found two cosmetic issues:

1) BufFileSetEstimate has been removed, but its declaration is still in buffile.h:

+extern size_t BufFileSetEstimate(int stripes);

2) BufFileSetCreate has been renamed to BufFileSetInit, but the old name is still used in this comment:

 * Attach to a set of named BufFiles that was created with BufFileSetCreate.

Thanks,

On Wed, Oct 25, 2017 at 11:33 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Tue, Oct 24, 2017 at 10:10 PM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
> > Here is an updated patch set that does that ^.
>
> It's a bit hard to understand what's going on with the v21 patch set I
> posted yesterday because EXPLAIN ANALYZE doesn't tell you anything
> interesting.  Also, if you apply the multiplex_gather patch[1] I
> posted recently and set multiplex_gather to off then it doesn't tell
> you anything at all, because the leader has no hash table (I suppose
> that could happen with unpatched master given sufficiently bad
> timing).  Here's a new version with an extra patch that adds some
> basic information about load balancing to EXPLAIN ANALYZE, inspired by
> what commit bf11e7ee did for Sort.
>
> Example output:
>
> enable_parallel_hash = on, multiplex_gather = on:
>
>  -> Parallel Hash (actual rows=100 loops=3)
>       Buckets: 131072  Batches: 16
>       Leader:    Shared Memory Usage: 3552kB  Hashed: 396120  Batches Probed: 7
>       Worker 0:  Shared Memory Usage: 3552kB  Hashed: 276640  Batches Probed: 6
>       Worker 1:  Shared Memory Usage: 3552kB  Hashed: 327240  Batches Probed: 6
>       -> Parallel Seq Scan on simple s (actual rows=33 loops=3)
>
>  -> Parallel Hash (actual rows=1000 loops=8)
>       Buckets: 131072  Batches: 256
>       Leader:    Shared Memory Usage: 2688kB  Hashed: 1347720  Batches Probed: 36
>       Worker 0:  Shared Memory Usage: 2688kB  Hashed: 1131360  Batches Probed: 33
>       Worker 1:  Shared Memory Usage: 2688kB  Hashed: 1123560  Batches Probed: 38
>       Worker 2:  Shared Memory Usage: 2688kB  Hashed: 1231920  Batches Probed: 38
>       Worker 3:  Shared Memory Usage: 2688kB  Hashed: 1272720  Batches Probed: 34
>       Worker 4:  Shared Memory Usage: 2688kB  Hashed: 1234800  Batches Probed: 33
>       Worker 5:  Shared Memory Usage: 2688kB  Hashed: 1294680  Batches Probed: 37
>       Worker 6:  Shared Memory Usage: 2688kB  Hashed: 1363240  Batches Probed: 35
>       -> Parallel Seq Scan on big s2 (actual rows=125 loops=8)
>
> enable_parallel_hash = on, multiplex_gather = off (ie no leader
> participation):
>
>  -> Parallel Hash (actual rows=100 loops=2)
>       Buckets: 131072  Batches: 16
>       Worker 0:  Shared Memory Usage: 3520kB  Hashed: 475920  Batches Probed: 9
>       Worker 1:  Shared Memory Usage: 3520kB  Hashed: 524080  Batches Probed: 8
>       -> Parallel Seq Scan on simple s (actual rows=50 loops=2)
>
> enable_parallel_hash = off, multiplex_gather = on:
>
>  -> Hash (actual rows=100 loops=3)
>       Buckets: 131072  Batches: 16
>       Leader:    Memory Usage: 3227kB
>       Worker 0:  Memory Usage: 3227kB
>       Worker 1:  Memory Usage: 3227kB
>       -> Seq Scan on simple s (actual rows=100 loops=3)
>
> enable_parallel_hash = off, multiplex_gather = off:
>
>  -> Hash (actual rows=100 loops=2)
>       Buckets: 131072  Batches: 16
>       Worker 0:  Memory Usage: 3227kB
>       Worker 1:  Memory Usage: 3227kB
>       -> Seq Scan on simple s (actual rows=100 loops=2)
>
> parallelism disabled (traditional single-line output, unchanged):
>
>  -> Hash (actual rows=100 loops=1)
>       Buckets: 131072  Batches: 16  Memory Usage: 3227kB
>       -> Seq Scan on simple s (actual rows=100 loops=1)
>
> (It actually says "Tuples Hashed", not "Hashed" but I edited the above
> to fit on a standard punchcard.)  Thoughts?
>
> [1] https://www.postgresql.org/message-id/CAEepm%3D2U%2B%2BLp3bNTv2Bv_kkr5NE2pOyHhxU%3DG0YTa4ZhSYhHiw%40mail.gmail.com
>
> --
> Thomas Munro
> http://www.enterprisedb.com

--
Rushabh Lathia
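The per-participant numbers in the EXPLAIN ANALYZE output above suggest a simple collection pattern: each participant writes into its own slot in shared memory, and the leader reads all slots when reporting. The following is a hedged sketch of that shape only, not the actual patch code; all struct and function names here are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PARTICIPANTS 8

/* Hypothetical per-participant counters for a parallel hash build. */
typedef struct HashInstrumentation
{
	long		tuples_hashed;	/* tuples this participant inserted */
	int			batches_probed; /* batches this participant probed */
} HashInstrumentation;

/* Hypothetical shared state; in PostgreSQL this would live in DSM. */
typedef struct SharedHashInfo
{
	int			nparticipants;
	HashInstrumentation slots[MAX_PARTICIPANTS];
} SharedHashInfo;

/* Each participant bumps only its own slot ("id"), so no lock is needed. */
static void
hash_count_inserted(SharedHashInfo *shared, int id, long ntuples)
{
	shared->slots[id].tuples_hashed += ntuples;
}

static void
hash_count_batch_probed(SharedHashInfo *shared, int id)
{
	shared->slots[id].batches_probed++;
}

/* Leader-side reporting, analogous to the per-worker EXPLAIN lines above. */
static void
hash_show_instrumentation(const SharedHashInfo *shared)
{
	for (int i = 0; i < shared->nparticipants; i++)
		printf("Worker %d:  Hashed: %ld  Batches Probed: %d\n",
			   i, shared->slots[i].tuples_hashed,
			   shared->slots[i].batches_probed);
}
```

Because every slot has exactly one writer, the updates race with nothing; the leader only inspects the slots after the participants are done.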
Re: [HACKERS] Parallel Hash take II
The v20 patch set (I was trying the 0008 and 0009 patches) no longer applies cleanly on the latest commit; I am also getting a compilation error caused by the refactoring in the commit below.

commit 0c5803b450e0cc29b3527df3f352e6f18a038cc6
Author: Peter Eisentraut <pete...@gmx.net>
Date:   Sat Sep 23 09:49:22 2017 -0400

    Refactor new file permission handling

On Mon, Sep 25, 2017 at 11:38 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Thu, Sep 21, 2017 at 5:49 AM, Peter Geoghegan <p...@bowt.ie> wrote:
> > Graefe's "Query Evaluation Techniques for Large Databases" has several
> > pages on deadlock avoidance strategies. It was written almost 25 years
> > ago, but still has some good insights IMV (you'll recall that Graefe
> > is the author of the Volcano paper; this reference paper seems like
> > his follow-up). Apparently, deadlock avoidance strategy becomes
> > important for parallel sort with partitioning. You may be able to get
> > some ideas from there. And even if you don't, his handling of the
> > topic is very deliberate and high level, which suggests that ours
> > should be, too.
>
> Very interesting and certainly relevant (the parts I've read so far),
> though we don't have multiple consumers.  Multiplexing one thread so
> that it is both a consumer and a producer is an extra twist though.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Rushabh Lathia
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Sep 20, 2017 at 5:17 AM, Peter Geoghegan <p...@bowt.ie> wrote: > On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > As per the earlier discussion in the thread, I did experiment using > > BufFileSet interface from parallel-hash-v18.patchset. I took the > reference > > of parallel-hash other patches to understand the BufFileSet APIs, and > > incorporate the changes to parallel create index. > > > > In order to achieve the same: > > > > - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and > > 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the > > parallel-hash-v18.patchset. > > - Removed the buffile.c/logtap.c/fd.c changes from the parallel CREATE > > INDEX v10 patch. > > - incorporate the BufFileSet API to the parallel tuple sort for CREATE > > INDEX. > > - Changes into few existing functions as well as added few to support the > > BufFileSet changes. > > I'm glad that somebody is working on this. (Someone closer to the more > general work on shared/parallel BufFile infrastructure than I am.) > > I do have some quick feedback, and I hope to be able to provide that > to both you and Thomas, as needed to see this one through. I'm not > going to get into the tricky details around resource management just > yet. I'll start with some simpler questions, to get a general sense of > the plan here. > > Thanks Peter. > I gather that you're at least aware that your v11 of the patch doesn't > preserve randomAccess support for parallel sorts, because you didn't > include my 0002-* testing GUCs patch, which was specifically designed > to make various randomAccess stuff testable. I also figured this to be > true because I noticed this FIXME among (otherwise unchanged) > tuplesort code: > > Yes, I haven't touched the randomAccess part yet. My initial goal was to incorporate the BufFileSet api's here. 
> > +static void
> > +leader_takeover_tapes(Tuplesortstate *state)
> > +{
> > +    Sharedsort *shared = state->shared;
> > +    int         nLaunched = state->nLaunched;
> > +    int         j;
> > +
> > +    Assert(LEADER(state));
> > +    Assert(nLaunched >= 1);
> > +    Assert(nLaunched == shared->workersFinished);
> > +
> > +    /*
> > +     * Create the tapeset from worker tapes, including a leader-owned tape at
> > +     * the end.  Parallel workers are far more expensive than logical tapes,
> > +     * so the number of tapes allocated here should never be excessive. FIXME
> > +     */
> > +    inittapestate(state, nLaunched + 1);
> > +    state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes,
> > +                                          state->fileset, state->worker);
>
> It's not surprising to me that you do not yet have this part working,
> because much of my design was about changing as little as possible
> above the BufFile interface, in order for tuplesort.c (and logtape.c)
> code like this to "just work" as if it was the serial case.

Right. I just followed your design from your earlier patches.

> It doesn't
> look like you've added the kind of BufFile multiplexing code that I
> expected to see in logtape.c.  This is needed to compensate for the
> code removed from fd.c and buffile.c.  Perhaps it would help me to go
> look at Thomas' latest parallel hash join patch -- did it gain some
> kind of transparent multiplexing ability that you actually (want to)
> use here?

Sorry, I didn't get this part. Are you talking about your patch's changes to OpenTemporaryFileInTablespace(), BufFileUnify() and other changes related to ltsUnify()? If that's the case, I don't think that is required with the BufFileSet. Correct me if I am wrong here.

> Though randomAccess isn't used by CREATE INDEX in general, and so not
> supporting randomAccess within tuplesort.c for parallel callers
> doesn't matter as far as this CREATE INDEX user-visible feature is
> concerned, I still believe that randomAccess is important (IIRC,
> Robert thought so too).
Specifically, it seems like a good idea to > have randomAccess support, both on general principle (why should the > parallel case be different?), and because having it now will probably > enable future enhancements to logtape.c. Enhancements that have it > manage parallel sorts based on partitioning/distribution/bucketing > [1]. I'm pretty sure that partitioning-based parallel sort is going to > become very important in the future, especially for parallel > GroupAggregate. The leader needs to truly own the tapes it reclaims > from workers in order for all of this to work. > > First application f
Re: [HACKERS] GatherMerge misses to push target list
On Mon, Sep 18, 2017 at 2:48 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Mon, Sep 18, 2017 at 12:46 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Thanks Amit for the patch. > > > > I reviewed the code changes as well as performed more testing. Patch > > looks good to me. > > > > In that case, can you please mark the patch [1] as ready for committer in > CF app > Done. > > > Here is the updated patch - where added test-case clean up. > > > > oops, missed dropping the function. > > > Thanks for the review. > > > [1] - https://commitfest.postgresql.org/15/1293/ > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > -- Rushabh Lathia
Re: [HACKERS] GatherMerge misses to push target list
Thanks Amit for the patch. I reviewed the code changes as well as performed more testing. Patch looks good to me. Here is the updated patch - where added test-case clean up. On Thu, Sep 14, 2017 at 10:02 AM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Wed, Sep 13, 2017 at 5:30 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > On Wed, Sep 6, 2017 at 10:04 AM, Amit Kapila <amit.kapil...@gmail.com> > > wrote: > >> > > > > > > This seems like a good optimization. I tried to simulate the test given > > in the mail, initially wasn't able to generate the exact test - as index > > creation is missing in the test shared. > > > > Oops. > > > I also won't consider this as bug, but its definitely good optimization > > for GatherMerge. > > > >> > >> > >> Note - If we agree on the problems and fix, then I can add regression > >> tests to cover above cases in the patch. > > > > > > Sure, once you do that - I will review the patch. > > > > The attached patch contains regression test as well. > > Thanks for looking into it. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > -- Rushabh Lathia diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 7f146d6..b12e57b 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -5028,7 +5028,7 @@ create_ordered_paths(PlannerInfo *root, path = (Path *) create_gather_merge_path(root, ordered_rel, path, - target, root->sort_pathkeys, NULL, + ordered_rel->reltarget, root->sort_pathkeys, NULL, _groups); /* Add projection step if needed */ diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index 26567cb..516a396 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -2354,9 +2354,9 @@ create_projection_path(PlannerInfo *root, * knows that the given path isn't referenced elsewhere and so can be modified * in-place. 
* - * If the input path is a GatherPath, we try to push the new target down to - * its input as well; this is a yet more invasive modification of the input - * path, which create_projection_path() can't do. + * If the input path is a GatherPath or GatherMergePath, we try to push the + * new target down to its input as well; this is a yet more invasive + * modification of the input path, which create_projection_path() can't do. * * Note that we mustn't change the source path's parent link; so when it is * add_path'd to "rel" things will be a bit inconsistent. So far that has @@ -2393,31 +2393,44 @@ apply_projection_to_path(PlannerInfo *root, (target->cost.per_tuple - oldcost.per_tuple) * path->rows; /* - * If the path happens to be a Gather path, we'd like to arrange for the - * subpath to return the required target list so that workers can help - * project. But if there is something that is not parallel-safe in the - * target expressions, then we can't. + * If the path happens to be a Gather or GatherMerge path, we'd like to + * arrange for the subpath to return the required target list so that + * workers can help project. But if there is something that is not + * parallel-safe in the target expressions, then we can't. */ - if (IsA(path, GatherPath) && + if ((IsA(path, GatherPath) || IsA(path, GatherMergePath)) && is_parallel_safe(root, (Node *) target->exprs)) { - GatherPath *gpath = (GatherPath *) path; - /* * We always use create_projection_path here, even if the subpath is * projection-capable, so as to avoid modifying the subpath in place. * It seems unlikely at present that there could be any other * references to the subpath, but better safe than sorry. * - * Note that we don't change the GatherPath's cost estimates; it might + * Note that we don't change the parallel path's cost estimates; it might * be appropriate to do so, to reflect the fact that the bulk of the * target evaluation will happen in workers. 
*/ - gpath->subpath = (Path *) - create_projection_path(root, - gpath->subpath->parent, - gpath->subpath, - target); + if (IsA(path, GatherPath)) + { + GatherPath *gpath = (GatherPath *) path; + + gpath->subpath = (Path *) +create_projection_path(root, + gpath->subpath->parent, + gpath->subpath, + target); + } + else + { + GatherMergePath *gmpath = (GatherMergePath *) path; + + gmpath->subpath = (Path *) +create_projection_path(root, + gmpath->subpath->parent, + gmpath->subpath, + target); + } } else if (path->p
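The hunk above ends up with two nearly identical branches because GatherPath and GatherMergePath are distinct structs, so there is no common field through which the subpath can be reached. As an aside, the duplication could be narrowed by resolving the address of the subpath field first and doing the common work once. Below is a hedged sketch of that shape using simplified stand-in types, not the real planner structs, and with the is_parallel_safe() check omitted.

```c
#include <stdlib.h>

/* Simplified stand-ins for the planner node types (hypothetical). */
typedef enum
{
	T_GatherPath, T_GatherMergePath, T_ProjectionPath, T_OtherPath
} NodeTag;

typedef struct Path { NodeTag type; } Path;
typedef struct GatherPath { Path path; Path *subpath; } GatherPath;
typedef struct GatherMergePath { Path path; Path *subpath; } GatherMergePath;
typedef struct ProjectionPath { Path path; Path *subpath; } ProjectionPath;

/* Stand-in for create_projection_path(): wrap the subpath in a projection. */
static Path *
make_projection(Path *subpath)
{
	ProjectionPath *pp = malloc(sizeof(ProjectionPath));

	pp->path.type = T_ProjectionPath;
	pp->subpath = subpath;
	return &pp->path;
}

/*
 * Resolve the subpath field once, then wrap it a single time; this avoids
 * repeating the projection-building call for each Gather variant.
 */
static void
push_projection_below_gather(Path *path)
{
	Path	  **subpath_field = NULL;

	if (path->type == T_GatherPath)
		subpath_field = &((GatherPath *) path)->subpath;
	else if (path->type == T_GatherMergePath)
		subpath_field = &((GatherMergePath *) path)->subpath;

	if (subpath_field != NULL)
		*subpath_field = make_projection(*subpath_field);
}
```

The patch's two-branch version is equally correct; this only illustrates why the branches exist (no shared struct prefix exposing `subpath`) and one way the repetition could be factored out.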
Re: [HACKERS] GatherMerge misses to push target list
On Wed, Sep 6, 2017 at 10:04 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> During my recent work on costing of parallel paths [1], I noticed that
> we are missing to push target list below GatherMerge in some simple
> cases like below.
>
> Test preparation
> ----------------
> create or replace function simple_func(var1 integer) returns integer
> as $$
> begin
>         return var1 + 10;
> end;
> $$ language plpgsql PARALLEL SAFE;
>
> create table t1(c1 int, c2 char(5));
> insert into t1 values(generate_series(1,50), 'aaa');
> set parallel_setup_cost=0;
> set parallel_tuple_cost=0;
> set min_parallel_table_scan_size=0;
> set max_parallel_workers_per_gather=4;
> set cpu_operator_cost=0;
> set min_parallel_index_scan_size=0;
>
> Case-1
> ------
> postgres=# explain (costs off, verbose) select c1,simple_func(c1) from
> t1 where c1 < 1 order by c1;
>                      QUERY PLAN
> ---------------------------------------------------
>  Gather Merge
>    Output: c1, simple_func(c1)
>    Workers Planned: 4
>    ->  Parallel Index Scan using idx_t1 on public.t1
>          Output: c1
>          Index Cond: (t1.c1 < 1)
> (6 rows)
>
> In the above case, I don't see any reason why we can't push the target
> list expression (simple_func(c1)) down to workers.
>
> Case-2
> ------
> set enable_indexonlyscan=off;
> set enable_indexscan=off;
> postgres=# explain (costs off, verbose) select c1,simple_func(c1) from
> t1 where c1 < 1 order by c1;
>                   QUERY PLAN
> -----------------------------------------------
>  Gather Merge
>    Output: c1, simple_func(c1)
>    Workers Planned: 4
>    ->  Sort
>          Output: c1
>          Sort Key: t1.c1
>          ->  Parallel Bitmap Heap Scan on public.t1
>                Output: c1
>                Recheck Cond: (t1.c1 < 1)
>                ->  Bitmap Index Scan on idx_t1
>                      Index Cond: (t1.c1 < 1)
> (11 rows)
>
> It is different from above case (Case-1) because sort node can't
> project, but I think adding a Result node on top of it can help to
> project and serve our purpose.
>
> The attached patch allows pushing the target list expression in both
> the cases.  I think we should handle GatherMerge case in
> apply_projection_to_path similar to Gather so that target list can be
> pushed.  Another problem was during the creation of ordered paths
> where we don't allow target expressions to be pushed below gather
> merge.
>
> Plans after patch:
>
> Case-1
> ------
> postgres=# explain (costs off, verbose) select c1,simple_func(c1) from
> t1 where c1 < 1 order by c1;
>                        QUERY PLAN
> --------------------------------------------------------
>  Gather Merge
>    Output: c1, (simple_func(c1))
>    Workers Planned: 4
>    ->  Parallel Index Only Scan using idx_t1 on public.t1
>          Output: c1, simple_func(c1)
>          Index Cond: (t1.c1 < 1)
> (6 rows)
>
> Case-2
> ------
> postgres=# explain (costs off, verbose) select c1,simple_func(c1) from
> t1 where c1 < 1 order by c1;
>                      QUERY PLAN
> ----------------------------------------------------
>  Gather Merge
>    Output: c1, (simple_func(c1))
>    Workers Planned: 4
>    ->  Result
>          Output: c1, simple_func(c1)
>          ->  Sort
>                Output: c1
>                Sort Key: t1.c1
>                ->  Parallel Bitmap Heap Scan on public.t1
>                      Output: c1
>                      Recheck Cond: (t1.c1 < 1)
>                      ->  Bitmap Index Scan on idx_t1
>                            Index Cond: (t1.c1 < 1)
> (13 rows)
>
> Note, that simple_func() is pushed down to workers in both the cases.
>
> Thoughts?

This seems like a good optimization. I tried to simulate the test given in the mail; initially I wasn't able to reproduce the exact plans, as the index creation is missing from the shared test.

I also wouldn't consider this a bug, but it's definitely a good optimization for GatherMerge.

> Note - If we agree on the problems and fix, then I can add regression
> tests to cover above cases in the patch.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1Ji_0pgrjFotAyvvfxGikxJQEKcxD863VQ-xYtfQBy0uQ%40mail.gmail.com

Sure, once you do that I will review the patch.

Thanks,

> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Rushabh Lathia
Re: [HACKERS] dropping partitioned tables without CASCADE
Okay, I have marked this as ready for committer. Thanks, On Wed, Sep 6, 2017 at 2:50 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > On Wed, Sep 6, 2017 at 1:06 PM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > > > > 2) Add partition to the foo; > > > > create table foo_p1 partition of foo for values in (1, 2, 3) partition by > > list (b); > > > > postgres=# \d foo > > Table "public.foo" > > Column | Type | Collation | Nullable | Default > > +-+---+--+- > > a | integer | | | > > b | integer | | | > > Partition key: LIST (a) > > Number of partitions: 1 (Use \d+ to list them.) > > > > postgres=# \d+ foo > > Table "public.foo" > > Column | Type | Collation | Nullable | Default | Storage | Stats > target > > | Description > > +-+---+--+-+ > -+--+- > > a | integer | | | | plain | > > | > > b | integer | | | | plain | > > | > > Partition key: LIST (a) > > Partitions: foo_p1 FOR VALUES IN (1, 2, 3) has partitions > > > > Above verbose output for foo says, foo_p1 "has partitions". But if I do > > > > postgres=# \d foo_p1 > >Table "public.foo_p1" > > Column | Type | Collation | Nullable | Default > > +-+---+--+- > > a | integer | | | > > b | integer | | | > > Partition of: foo FOR VALUES IN (1, 2, 3) > > Partition key: LIST (b) > > Number of partitions: 0 > > > > it tell "Number of partitions: 0". > > > > I feel like information is conflicting with each other. AFAIU, idea about > > adding > > "has partitions" was to let know that it's a partitioned table. So can > you > > directly > > add the "is partitioned" in place of "has partitions"? > > > > Did those change in the attached patch and update regression expected > > output. > > > > Looks better. > > > Also run pgindent on the patch. > > > > Thanks for the changes. The patch looks good to me. > > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company > -- Rushabh Lathia
Re: [HACKERS] dropping partitioned tables without CASCADE
I picked up this patch for review and started looking at the implementation details. Consider the test below.

1)

postgres=# create table foo (a int, b int) partition by list (a);
CREATE TABLE

postgres=# \d foo
                Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Partition key: LIST (a)
Number of partitions: 0

postgres=# \d+ foo
                                    Table "public.foo"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
 b      | integer |           |          |         | plain   |              |
Partition key: LIST (a)
Number of partitions: 0

In this case, both the verbose and the normal output give information about the number of partitions.

2) Add a partition to foo:

create table foo_p1 partition of foo for values in (1, 2, 3) partition by list (b);

postgres=# \d foo
                Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Partition key: LIST (a)
*Number of partitions: 1 (Use \d+ to list them.)*

postgres=# \d+ foo
                                    Table "public.foo"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
 b      | integer |           |          |         | plain   |              |
Partition key: LIST (a)
*Partitions: foo_p1 FOR VALUES IN (1, 2, 3) has partitions*

The verbose output for foo above says that foo_p1 "has partitions". But if I do

postgres=# \d foo_p1
               Table "public.foo_p1"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Partition of: foo FOR VALUES IN (1, 2, 3)
Partition key: LIST (b)
*Number of partitions: 0*

it says "Number of partitions: 0".

These two pieces of information conflict with each other. AFAIU, the idea behind adding "has partitions" was to indicate that foo_p1 is itself a partitioned table. So can you print "is partitioned" directly in place of "has partitions"?

I made that change in the attached patch and updated the expected regression output. I also ran pgindent on the patch.
Thanks, On Mon, Sep 4, 2017 at 6:22 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > On Mon, Sep 4, 2017 at 3:48 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> > wrote: > > > > if (tuples > 0) > > { > > - if (tableinfo.relkind != > RELKIND_PARTITIONED_TABLE) > > - printfPQExpBuffer(, > _("Number of child tables: %d (Use \\d+ to list them.)"), tuples); > > - else > > - printfPQExpBuffer(, > _("Number of partitions: %d (Use \\d+ to list them.)"), tuples); > > + printfPQExpBuffer(, _("Number of %s: > %d (Use \\d+ to list them.)"), ct, tuples); > > printTableAddFooter(, buf.data); > > } > > > > Please don't do this, because it breaks translatability. I think the > > original is fine. > > > > We have used this style in the "else" case of if (!verbose). So, I > just copied it. I have removed that change in the attached patch. > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > ... Rushabh Lathia www.EnterpriseDB.com diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index f6049cc..ab9a637 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -2828,7 +2828,7 @@ describeOneTableDetails(const char *schemaname, /* print child tables (with additional info if partitions) */ if (pset.sversion >= 10) printfPQExpBuffer(, - "SELECT c.oid::pg_catalog.regclass, pg_catalog.pg_get_expr(c.relpartbound, c.oid)" + "SELECT c.oid::pg_catalog.regclass, pg_get_expr(c.relpartbound, c.oid), c.relkind&qu
Re: [HACKERS] hash partitioning based on v10Beta2
# explain analyze select * from h where id = 1 or id = 5;; > QUERY PLAN > > > > Append (cost=0.00..96.50 rows=50 width=4) (actual time= > 0.017..0.078 rows=2 loops=1) >-> Seq Scan on h1 (cost=0.00..48.25 rows=25 width=4) ( > actual time=0.015..0.019 rows=1 loops=1) > Filter: ((id = 1) OR (id = 5)) > Rows Removed by Filter: 6 >-> Seq Scan on h3 (cost=0.00..48.25 rows=25 width=4) ( > actual time=0.005..0.010 rows=1 loops=1) > Filter: ((id = 1) OR (id = 5)) > Rows Removed by Filter: 3 > Planning time: 0.396 ms > Execution time: 0.139 ms > (9 rows) > > Can not detach / attach / drop partition table. > > Best regards, > young > > -- > > https://yonj1e.github.io/ > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia
Re: [HACKERS] reload-through-the-top-parent switch the partition table
On Fri, Aug 11, 2017 at 10:50 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Aug 11, 2017 at 5:36 AM, Rushabh Lathia
> <rushabh.lat...@gmail.com> wrote:
> > Please find the attached patch with the changes.
>
> I found the way that you had the logic structured in flagInhTables()
> to be quite hard to follow, so I rewrote it in the attached version.
> This version also has a bunch of minor cosmetic changes.  Barring
> objections, I'll commit it once the tree opens for v11 development.

Thanks Robert. The patch changes look good to me.

regards,
Rushabh Lathia
www.EnterpriseDB.com
Re: [HACKERS] reload-through-the-top-parent switch the partition table
On Thu, Aug 10, 2017 at 8:26 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Aug 10, 2017 at 3:47 AM, Rushabh Lathia
> <rushabh.lat...@gmail.com> wrote:
> >> (1) seems like a pretty arbitrary restriction, so I don't like that
> >> option.  (2) would hurt performance in some use cases.  Do we have an
> >> option (3)?
> >
> > How about protecting option 2) with the load-via-partition-root
> > protection?  Means load the parents information even when dump is not
> > set, only when load-via-partition-root & ispartition.
> >
> > This won't hurt performance for the normal cases.
>
> Yes, that seems like the right approach.
>
> +        Dump data via the top-most partitioned table (rather than partition
> +        table) when dumping data for the partition table.
>
> I think we should phrase this a bit more clearly, something like this:
> When dumping a COPY or INSERT statement for a partitioned table,
> target the root of the partitioning hierarchy which contains it rather
> than the partition itself.  This may be useful when reloading data on
> a server where rows do not always fall into the same partitions as
> they did on the original server.  This could happen, for example, if
> the partitioning column is of type text and the two system have
> different definitions of the collation used to partition the data.

Done.

> +    printf(_("  --load-via-partition-root    load partition table via the root relation\n"));
>
> "relation" seems odd to me here.  root table?

Done.

>     /* Don't bother computing anything for non-target tables, either */
>     if (!tblinfo[i].dobj.dump)
> +    {
> +        /*
> +         * Load the parents information for the partition table when
> +         * the load-via-partition-root option is set. As we need the
> +         * parents information to get the partition root.
> +         */
> +        if (dopt->load_via_partition_root &&
> +            tblinfo[i].ispartition)
> +            findParentsByOid(&tblinfo[i], inhinfo, numInherits);
>         continue;
> +    }
>
> Duplicating the call to findParentsByOid seems less than ideal.
How > about doing something like this: > > if (tblinfo[i].dobj.dump) > { > find_parents = true; > mark_parents = true; > } > else if (dopt->load_via_partition_root && tblinfo[i].ispartition) > find_parents = true; > > if (find_parents) > findParentsByOid([i], inhinfo, numInherits); > > etc. > > Done changes to avoid duplicate call to findParentsByOid(). > The comments for this function also need some work - e.g. the function > header comment deserves some kind of update for these changes. > > Done. > +static TableInfo * > +getRootTableInfo(TableInfo *tbinfo) > +{ > +Assert(tbinfo->ispartition); > + Assert(tbinfo->numParents == 1); > +if (tbinfo->parents[0]->ispartition) > +return getRootTableInfo(tbinfo->parents[0]); > + > +return tbinfo->parents[0]; > +} > > This code should iterate, not recurse, to avoid any risk of blowing > out the stack. > > Done. Please find attach patch with the changes. Thanks, Rushabh Lathia www.EnterpriseDB.com diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index bafa031..de8297a 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -889,6 +889,21 @@ PostgreSQL documentation + --load-via-partition-root + + +When dumping a COPY or INSERT statement for a partitioned table, +target the root of the partitioning hierarchy which contains it rather +than the partition itself. This will be useful when reloading data on +a server where rows do not always fall into the same partitions as +they did on the original server. This could happen, for example, if +the partitioning column is of type text and the two system have +different definitions of the collation used to partition the data. 
+ + + + + --section=sectionname diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml index aa944a2..dc1d3cc 100644 --- a/doc/src/sgml/ref/pg_dumpall.sgml +++ b/doc/src/sgml/ref/pg_dumpall.sgml @@ -431,6 +431,21 @@ PostgreSQL documentation + --load-via-partition-root + + +When dumping a COPY or INSERT statement for a partitioned table, +target the root of the partitioning hierarchy which contains it rather +than the partition itself. This will be useful when reloading
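Robert's point in this thread, that getRootTableInfo() should iterate rather than recurse, can be sketched as follows. This is a hedged illustration using a pared-down stand-in for pg_dump's TableInfo (only the fields the walk needs), not the actual patch code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pared-down stand-in for pg_dump's TableInfo (hypothetical fields only). */
typedef struct TableInfo
{
	bool		ispartition;
	int			numParents;
	struct TableInfo **parents;
} TableInfo;

/*
 * Walk up from a partition to the root of its partitioning hierarchy.
 * Iterating instead of recursing bounds stack usage no matter how deeply
 * the hierarchy is nested; each partition has exactly one parent.
 */
static TableInfo *
getRootTableInfo(TableInfo *tbinfo)
{
	TableInfo  *parent = tbinfo;

	assert(parent->ispartition);
	while (parent->ispartition)
	{
		assert(parent->numParents == 1);
		parent = parent->parents[0];
	}
	return parent;
}
```

In the real patch the function operates on pg_dump's actual TableInfo and uses Assert(); the loop structure, replacing the tail recursion of the earlier version, is the point here.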
Re: [HACKERS] reload-through-the-top-parent switch the partition table
On Wed, Aug 9, 2017 at 1:20 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Aug 8, 2017 at 8:48 AM, Rushabh Lathia <rushabh.lat...@gmail.com>
> wrote:
> > It seems like we set the numParents and parents only for the
> > dumpable objects (flagInhTables()).  The current patch relies on
> > numParents and parents to get the root partition TableInfo, but when
> > --schema is specified it doesn't load that information for objects
> > which are not dumpable.
> >
> > Now one option is:
> >
> > 1) restrict the --load-via-partition-root to be used with
> > the --schema or maybe some other options - where we restrict the
> > objects.
> >
> > Consider this: the partition root is in schema 'a' and the partition
> > table is in schema 'b'.  If someone specifies --schema b with
> > --load-via-partition-root, I think we should not do "INSERT INTO a.tab"
> > to load the data (because the user specified --schema b).
> >
> > 2) fix flagInhTables() to load the numParents and the parents
> > information for the partition table (can be checked using ispartition),
> > irrespective of whether the object is dumpable or not.
>
> (1) seems like a pretty arbitrary restriction, so I don't like that
> option.  (2) would hurt performance in some use cases.  Do we have an
> option (3)?

How about guarding option 2) with a load-via-partition-root check? That is, load the parents information even when dump is not set, but only when load_via_partition_root is set and the table is a partition.

This won't hurt performance in the normal cases.

Attaching the patch.
> > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- Rushabh Lathia diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index bafa031..65f7e98 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -889,6 +889,16 @@ PostgreSQL documentation + --load-via-partition-root + + +Dump data via the top-most partitioned table (rather than partition +table) when dumping data for the partition table. + + + + + --section=sectionname diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml index aa944a2..54b8e7e 100644 --- a/doc/src/sgml/ref/pg_dumpall.sgml +++ b/doc/src/sgml/ref/pg_dumpall.sgml @@ -431,6 +431,16 @@ PostgreSQL documentation + --load-via-partition-root + + +Dump data via the top-most partitioned table (rather than partition +table) when dumping data for the partition table. + + + + + --use-set-session-authorization diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c index 47191be..65c2d7f 100644 --- a/src/bin/pg_dump/common.c +++ b/src/bin/pg_dump/common.c @@ -66,7 +66,7 @@ static int numExtensions; static ExtensionMemberId *extmembers; static int numextmembers; -static void flagInhTables(TableInfo *tbinfo, int numTables, +static void flagInhTables(Archive *fout, TableInfo *tbinfo, int numTables, InhInfo *inhinfo, int numInherits); static void flagInhAttrs(DumpOptions *dopt, TableInfo *tblinfo, int numTables); static DumpableObject **buildIndexArray(void *objArray, int numObjs, @@ -243,7 +243,7 @@ getSchemaData(Archive *fout, int *numTablesPtr) /* Link tables to parents, mark parents of target tables interesting */ if (g_verbose) write_msg(NULL, "finding inheritance relationships\n"); - flagInhTables(tblinfo, numTables, inhinfo, numInherits); + flagInhTables(fout, tblinfo, numTables, inhinfo, numInherits); if (g_verbose) write_msg(NULL, "reading column info for interesting tables\n"); @@ -304,9 +304,10 @@ 
getSchemaData(Archive *fout, int *numTablesPtr)

  * modifies tblinfo
  */
 static void
-flagInhTables(TableInfo *tblinfo, int numTables,
+flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
 			  InhInfo *inhinfo, int numInherits)
 {
+	DumpOptions *dopt = fout->dopt;
 	int			i,
 				j;
 	int			numParents;
@@ -322,7 +323,17 @@ flagInhTables(TableInfo *tblinfo, int numTables,
 		/* Don't bother computing anything for non-target tables, either */
 		if (!tblinfo[i].dobj.dump)
+		{
+			/*
+			 * Load the parents information for the partition table when
+			 * the load-via-partition-root option is set. As we need the
+			 * parents information to get the partition root.
+			 */
+			if (dopt->load_via_partition_root &&
+				tblinfo[i].ispartition)
+				findParentsByOid(&tblinfo[i], inhinfo, numInherits);
 			continue;
+		}

 		/* Find all the immediate parent tables */
 		findParentsByOid(&tblinfo[i], inhinfo, numInherits);

diff --git a/src/bin/pg_dump/pg_backup.
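Condensed to its decision logic, the hunk above records parent links for a table when it is a dump target, or, newly, when it is a non-target partition and --load-via-partition-root is in effect. A minimal stand-alone sketch of that predicate (the function name and plain bool parameters are illustrative; the real code works on TableInfo entries):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical condensation of the flagInhTables() change: decide whether
 * pg_dump needs to record a table's immediate parent links.
 */
static bool
need_parent_links(bool dumpable, bool load_via_partition_root,
                  bool ispartition)
{
    /* Target tables always get their immediate parents recorded. */
    if (dumpable)
        return true;

    /*
     * Non-target tables normally skip this work, except that a partition
     * still needs its parents so the root table can be found later.
     */
    return load_via_partition_root && ispartition;
}
```

This keeps the normal (no --load-via-partition-root) path exactly as cheap as before, which is the point of the option-3 compromise discussed above.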
Re: [HACKERS] reload-through-the-top-parent switch the partition table
Thanks Rajkumar for testing and reporting this.

It seems like we set the numParents and parents only for the dumpable objects (flagInhTables()). The current patch relies on numParents and parents to get the root partition TableInfo, but when --schema is specified it doesn't load that information for objects which are not dumpable.

Now one option is:

1) restrict the --load-via-partition-root to be used with --schema, or maybe some other options, where we restrict the objects.

Consider this: the partition root is in schema 'a' and the partition table is in schema 'b'. If someone specifies --schema b with --load-via-partition-root, I think we should not do "INSERT INTO a.tab" to load the data (because the user specified --schema b).

2) fix flagInhTables() to load the numParents and parents information for a partition table (can be checked using ispartition), irrespective of whether the object is dumpable or not.

May be something like:

@@ -322,7 +322,11 @@ flagInhTables(TableInfo *tblinfo, int numTables,
 		/* Don't bother computing anything for non-target tables, either */
 		if (!tblinfo[i].dobj.dump)
+		{
+			if (tblinfo[i].ispartition)
+				findParentsByOid(&tblinfo[i], inhinfo, numInherits);
 			continue;
+		}

I am still looking into this; meanwhile any inputs are welcome.

On Tue, Aug 8, 2017 at 4:10 PM, Rajkumar Raghuwanshi < rajkumar.raghuwan...@enterprisedb.com> wrote:
> Hi Rushabh,
>
> While testing latest v2 patch, I got a crash when using
> --load-via-partition-root with --schema options. Below are steps to
> reproduce.
> > --create below test data > create schema a; > create schema b; > create schema c; > > create table t1 (a int,b text) partition by list(a); > create table a.t1_p1 partition of t1 FOR VALUES in (1,2,3,4) partition by > list(a); > create table b.t1_p1_p1 partition of a.t1_p1 FOR VALUES in (1,2); > create table c.t1_p1_p2 partition of a.t1_p1 FOR VALUES in (3,4); > create table b.t1_p2 partition of t1 FOR VALUES in (5,6,7,8) partition by > list(a); > create table a.t1_p2_p1 partition of b.t1_p2 FOR VALUES in (5,6); > create table t1_p2_p2 partition of b.t1_p2 FOR VALUES in (7,8); > > insert into t1 values (8,'t1'); > insert into a.t1_p1 values (2,'a.t1_p1'); > insert into b.t1_p1_p1 values (1,'b.t1_p1_p1'); > insert into c.t1_p1_p2 values (3,'c.t1_p1_p2'); > insert into b.t1_p2 values (6,'b.t1_p2'); > insert into a.t1_p2_p1 values (5,'a.t1_p2_p1'); > insert into t1_p2_p2 values (7,'t1_p2_p1'); > insert into t1 values (4 ,'t1'); > > --trying to take pg_dump > [edb@localhost bin]$ ./pg_dump -d postgres --schema=a -f d1.dump -Fp > [edb@localhost bin]$ ./pg_dump -d postgres --load-via-partition-root -f > d2.dump -Fp > [edb@localhost bin]$ ./pg_dump -d postgres --load-via-partition-root > --schema=a -f d3.dump -Fp > pg_dump: pg_dump.c:2063: getRootTableInfo: Assertion `tbinfo->numParents > == 1' failed. > Aborted (core dumped) > > > > Thanks & Regards, > Rajkumar Raghuwanshi > QMG, EnterpriseDB Corporation > > On Fri, Aug 4, 2017 at 3:01 PM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > >> >> Here is an update patch, now renamed the switch to >> --load-via-partition-root >> and also added the documentation for the new switch into pg_dump as well >> as pg_dumpall. >> >> >> On Fri, Aug 4, 2017 at 7:13 AM, Amit Langote < >> langote_amit...@lab.ntt.co.jp> wrote: >> >>> On 2017/08/04 1:08, David G. 
Johnston wrote: >>> > On Thu, Aug 3, 2017 at 8:53 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: >>> > >>> >> Robert Haas <robertmh...@gmail.com> writes: >>> >>> So maybe --load-via-partition-root if nobody likes my previous >>> >>> suggestion of --partition-data-via-root ? >>> >> >>> >> WFM. >>> >> >>> > >>> > +1 >>> >>> +1. >>> >>> Thanks, >>> Amit >>> >>> >> >> >> Thanks, >> Rushabh Lathia >> www.EnterpriseDB.com >> >> >> -- >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-hackers >> >> > -- Rushabh Lathia
Re: [HACKERS] reload-through-the-top-parent switch the partition table
Here is an update patch, now renamed the switch to --load-via-partition-root and also added the documentation for the new switch into pg_dump as well as pg_dumpall. On Fri, Aug 4, 2017 at 7:13 AM, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote: > On 2017/08/04 1:08, David G. Johnston wrote: > > On Thu, Aug 3, 2017 at 8:53 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > > > >> Robert Haas <robertmh...@gmail.com> writes: > >>> So maybe --load-via-partition-root if nobody likes my previous > >>> suggestion of --partition-data-via-root ? > >> > >> WFM. > >> > > > > +1 > > +1. > > Thanks, > Amit > > Thanks, Rushabh Lathia www.EnterpriseDB.com diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index bafa031..65f7e98 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -889,6 +889,16 @@ PostgreSQL documentation + --load-via-partition-root + + +Dump data via the top-most partitioned table (rather than partition +table) when dumping data for the partition table. + + + + + --section=sectionname diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml index aa944a2..54b8e7e 100644 --- a/doc/src/sgml/ref/pg_dumpall.sgml +++ b/doc/src/sgml/ref/pg_dumpall.sgml @@ -431,6 +431,16 @@ PostgreSQL documentation + --load-via-partition-root + + +Dump data via the top-most partitioned table (rather than partition +table) when dumping data for the partition table. 
+
+
+
+
   --use-set-session-authorization

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 144068a..ce3100f 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -157,6 +157,7 @@ typedef struct _dumpOptions
 	int			outputNoTablespaces;
 	int			use_setsessauth;
 	int			enable_row_security;
+	int			load_via_partition_root;

 	/* default, if no "inclusion" switches appear, is to dump everything */
 	bool		include_everything;

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 393b9e2..27e23ca 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -269,6 +269,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 						const char *prefix, Archive *fout);
 static char *get_synchronized_snapshot(Archive *fout);
 static void setupDumpWorker(Archive *AHX);
+static TableInfo *getRootTableInfo(TableInfo *tbinfo);

 int
@@ -345,6 +346,7 @@ main(int argc, char **argv)
 		{"lock-wait-timeout", required_argument, NULL, 2},
 		{"no-tablespaces", no_argument, &dopt.outputNoTablespaces, 1},
 		{"quote-all-identifiers", no_argument, &quote_all_identifiers, 1},
+		{"load-via-partition-root", no_argument, &dopt.load_via_partition_root, 1},
 		{"role", required_argument, NULL, 3},
 		{"section", required_argument, NULL, 5},
 		{"serializable-deferrable", no_argument, &dopt.serializable_deferrable, 1},
@@ -959,6 +961,7 @@ help(const char *progname)
 	printf(_("  --no-tablespaces             do not dump tablespace assignments\n"));
 	printf(_("  --no-unlogged-table-data     do not dump unlogged table data\n"));
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
+	printf(_("  --load-via-partition-root    load partition table via the root relation\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
@@ -1902,8 +1905,32 @@ dumpTableData_insert(Archive *fout, void *dcontext)
 	if
(insertStmt == NULL)
 	{
 		insertStmt = createPQExpBuffer();
+
+		/*
+		 * When load-via-partition-root is set, get the root relation name
+		 * for the partition table, so that we can reload data through
+		 * the root table.
+		 */
+		if (dopt->load_via_partition_root && tbinfo->ispartition)
+		{
+			TableInfo  *parentTbinfo;
+
+			parentTbinfo = getRootTableInfo(tbinfo);
+
+			/*
+			 * When we loading data through the root, we will qualify
+			 * the table name. This is needed because earlier
+			 * search_path will be set for the partition table.
+			 */
+			classname = (char *) fmtQualifiedId(fout->remoteVersion,
+												parentTbinfo->dobj.namespace->dobj.name,
+												parentTbinfo->dobj.name);
+		}
+		else
+			classname = fmtId(tbinfo->dobj.name);
+
 		appendPQExpBuffer(insertStmt, "INSERT INTO %s ",
-						  fmtId(classname));
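The body of getRootTableInfo() is not included in the quoted hunk; presumably it just walks the single-parent chain upward until it reaches a table that is not itself a partition. A self-contained sketch under that assumption (the struct here is a stripped-down stand-in for pg_dump's TableInfo, not the real definition):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for pg_dump's TableInfo, for illustration only. */
typedef struct TableInfo
{
    const char *name;
    int         ispartition;
    int         numParents;
    struct TableInfo **parents;
} TableInfo;

/*
 * Walk up the partition hierarchy to the topmost non-partition table.
 * A partition always has exactly one parent; the assertion failure
 * reported upthread happened because, with --schema, the parent links
 * had never been loaded for non-dumpable tables.
 */
static TableInfo *
getRootTableInfo(TableInfo *tbinfo)
{
    while (tbinfo->ispartition)
    {
        assert(tbinfo->numParents == 1);
        tbinfo = tbinfo->parents[0];
    }
    return tbinfo;
}
```

With the multi-level test case upthread (t1 -> a.t1_p1 -> b.t1_p1_p1), this walk would map the leaf partition back to t1, which is why the emitted INSERTs target the root.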
Re: [HACKERS] reload-through-the-top-parent switch the partition table
On Thu, Aug 3, 2017 at 3:39 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > On Wed, Aug 2, 2017 at 11:47 PM, David G. Johnston > <david.g.johns...@gmail.com> wrote: > > On Wed, Aug 2, 2017 at 10:58 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > >> > >> Robert Haas <robertmh...@gmail.com> writes: > >> > On Wed, Aug 2, 2017 at 1:08 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > >> >> --restore-via-partition-root ? > >> > >> > I worry someone will think that pg_dump is now restoring stuff, but it > >> > isn't. > >> > >> Well, the point is that the commands it emits will cause the eventual > >> restore to go through the root. Anyway, I think trying to avoid using > >> a verb altogether is going to result in a very stilted option name. > >> > >> I notice that the option list already includes some references to > >> "insert", so maybe "--insert-via-partition-root"? Although you could > >> argue that that's confusing when we're using COPY. > > > > > > --use-partitioned-table [partition-name, ...] # if names are omitted it > > defaults to all partitioned tables > > I like this idea since it allows using this feature for selected > tables e.g. hash tables. Otherwise, users will be forced to use this > option even when there is only a single hash partitioned table and > many other list/range partitioned tables. > > +1. > What we are trying to do here is dump the data in a partitioned table > as if it's not partitioned. Combine that with --data-only dumps, and > one could use it to load partitioned data into unpartitioned or > differently partitioned table. So, how about naming the option as > > --unpartition-partitioned-table [partitioned-table-name, ] # if > names are omitted it defaults to all the partitioned tables. > > --unpartition-partitioned-table is very confusing. I liked the previous option which is --use-partitioned-table [partition-name, ...]. The Only problem with --use-partitioned-table is, a user needs to specify the partition-name in the options list. 
Imagine someone having hundreds of partitions: specifying all those names in the pg_dump options is a pain. Rather than that:

--use-partitioned-table [partitioned_name, ...] # if names are omitted it defaults to all the partitioned tables.

Here the user needs to specify the root relation name in the option - and any partition table that has it as ROOT will load its data through the top parent relation.

> That really says what dump is really doing without focusing on how the
> data will be used like restoring/inserting/copying etc.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

--
Rushabh Lathia
Re: [HACKERS] reload-through-the-top-parent switch the partition table
On Wed, Aug 2, 2017 at 3:55 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Aug 1, 2017 at 5:34 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote:
> > My colleague Robert and I had a doubt about the ordering of TABLE
> > and TABLE_DATA. We thought earlier that reload-through-root might
> > not solve the purpose which has been discussed in the above
> > mentioned thread. But later, looking into the code, I realized the sort
> > order for DO_TABLE and DO_TABLE_DATA are different, so we don't need
> > to worry about that issue.
>
> Hmm.  Does that mean that table data restoration will *absolutely
> always* follow all CREATE TABLE commands, or is that something that's
> *normally* true but potentially false if dependency sorting switches
> things around?

Looking at the dbObjectTypePriority comments, it seems that data restoration will *absolutely always* follow all CREATE TABLE commands.

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Thanks,
Rushabh Lathia
www.EnterpriseDB.com
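The mechanism behind that guarantee is that pg_dump's sort uses a per-object-type priority table (dbObjectTypePriority) as the base ordering, with DO_TABLE assigned a strictly lower priority than DO_TABLE_DATA, before dependency-driven topological sorting refines it. A toy illustration of priority-based ordering (the enum members echo pg_dump's names, but these priority values are made up, not pg_dump's actual numbers):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for pg_dump's object types and dbObjectTypePriority. */
typedef enum { DO_NAMESPACE, DO_TABLE, DO_TABLE_DATA, DO_INDEX } DumpObjType;

static const int priority[] = {
    [DO_NAMESPACE]  = 1,
    [DO_TABLE]      = 2,    /* every table sorts before ...          */
    [DO_TABLE_DATA] = 3,    /* ... any table data, regardless of ties */
    [DO_INDEX]      = 4,
};

/* qsort comparator: order objects purely by type priority. */
static int
cmp_by_priority(const void *a, const void *b)
{
    return priority[*(const DumpObjType *) a] -
           priority[*(const DumpObjType *) b];
}
```

Because the comparison is on object type first, no interleaving of CREATE TABLE and data entries can survive the sort; dependencies only reorder objects within what this base ordering already permits.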
[HACKERS] reload-through-the-top-parent switch the partition table
https://www.postgresql.org/message-id/CA%2BTgmoZFn7TJ7QBsFatnuEE%3DGYGdZSNXqr9489n5JBsdy5rFfA%40mail.gmail.com

In the above thread, it's been pointed out as an important consideration to add a reload-through-the-top-parent switch for the partition table. One small step toward making use of the hash function was adding a switch into pg_dump which reloads through the top parent for the partition table.

The attached patch implements the same:

- Added switch reload-through-root (open for suggestions on the switch name).
- Fixed dumpTableData to COPY to the root table with the reload-through-root switch.
- Fixed dumpTableData_insert to generate the proper INSERT statement with the reload-through-root switch.
- Added switch reload-through-root into pg_dumpall.

Testing:

- Performed tests with multi-level partitions + different schema combinations.
- Tested a case with parent and child attributes in different order.

Attaching the pg_dump output for the below test with --reload-through-root for COPY and INSERTs.

Testcase:

create schema a;
create schema b;
create schema c;

create table a.t1 ( a int, b int ) partition by list(a);
create table b.t1_p1 partition of a.t1 FOR VALUES in ( 1, 2, 3) partition by list(b);
create table c.t1_p1_p1 partition of b.t1_p1 FOR VALUES in (10, 20 );

insert into a.t1 values ( 2 , 10 );
insert into a.t1 values ( 1 , 20 );

My colleague Robert and I had a doubt about the ordering of TABLE and TABLE_DATA. We thought earlier that reload-through-root might not solve the purpose which has been discussed in the above mentioned thread. But later, looking into the code, I realized the sort order for DO_TABLE and DO_TABLE_DATA are different, so we don't need to worry about that issue.
TODOs: - Update documentation for pg_dump & pg_dumpall Thanks, Rushabh Lathia www.EnterpriseDB.com -- -- PostgreSQL database dump -- -- Dumped from database version 10beta2 -- Dumped by pg_dump version 10beta2 SET statement_timeout = 0; SET lock_timeout = 0; SET idle_in_transaction_session_timeout = 0; SET client_encoding = 'UTF8'; SET standard_conforming_strings = on; SET check_function_bodies = false; SET client_min_messages = warning; SET row_security = off; -- -- Name: postgres; Type: COMMENT; Schema: -; Owner: rushabh -- COMMENT ON DATABASE postgres IS 'default administrative connection database'; -- -- Name: a; Type: SCHEMA; Schema: -; Owner: rushabh -- CREATE SCHEMA a; ALTER SCHEMA a OWNER TO rushabh; -- -- Name: b; Type: SCHEMA; Schema: -; Owner: rushabh -- CREATE SCHEMA b; ALTER SCHEMA b OWNER TO rushabh; -- -- Name: c; Type: SCHEMA; Schema: -; Owner: rushabh -- CREATE SCHEMA c; ALTER SCHEMA c OWNER TO rushabh; -- -- Name: plpgsql; Type: EXTENSION; Schema: -; Owner: -- CREATE EXTENSION IF NOT EXISTS plpgsql WITH SCHEMA pg_catalog; -- -- Name: EXTENSION plpgsql; Type: COMMENT; Schema: -; Owner: -- COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language'; SET search_path = a, pg_catalog; SET default_tablespace = ''; SET default_with_oids = false; -- -- Name: t1; Type: TABLE; Schema: a; Owner: rushabh -- CREATE TABLE t1 ( a integer, b integer ) PARTITION BY LIST (a); ALTER TABLE t1 OWNER TO rushabh; SET search_path = b, pg_catalog; -- -- Name: t1_p1; Type: TABLE; Schema: b; Owner: rushabh -- CREATE TABLE t1_p1 PARTITION OF a.t1 FOR VALUES IN (1, 2, 3) PARTITION BY LIST (b); ALTER TABLE t1_p1 OWNER TO rushabh; SET search_path = c, pg_catalog; -- -- Name: t1_p1_p1; Type: TABLE; Schema: c; Owner: rushabh -- CREATE TABLE t1_p1_p1 PARTITION OF b.t1_p1 FOR VALUES IN (10, 20); ALTER TABLE t1_p1_p1 OWNER TO rushabh; -- -- Data for Name: t1_p1_p1; Type: TABLE DATA; Schema: c; Owner: rushabh -- INSERT INTO a.t1 VALUES (1, 20); INSERT INTO a.t1 VALUES (2, 10); -- 
-- PostgreSQL database dump complete -- -- -- PostgreSQL database dump -- -- Dumped from database version 10beta2 -- Dumped by pg_dump version 10beta2 SET statement_timeout = 0; SET lock_timeout = 0; SET idle_in_transaction_session_timeout = 0; SET client_encoding = 'UTF8'; SET standard_conforming_strings = on; SET check_function_bodies = false; SET client_min_messages = warning; SET row_security = off; -- -- Name: postgres; Type: COMMENT; Schema: -; Owner: rushabh -- COMMENT ON DATABASE postgres IS 'default administrative connection database'; -- -- Name: a; Type: SCHEMA; Schema: -; Owner: rushabh -- CREATE SCHEMA a; ALTER SCHEMA a OWNER TO rushabh; -- -- Name: b; Type: SCHEMA; Schema: -; Owner: rushabh -- CREATE SCHEMA b; ALTER SCHEMA b OWNER TO rushabh; -- -- Name: c; Type: SCHEMA; Schema: -; Owner: rushabh -- CREATE SCHEMA c; ALTER SCHEMA c OWNER TO rushabh; -- -- Name: plpgsql; Type: EXTENSION; Schema: -; Owner: -- CREATE EXTENSION IF NOT EXISTS plpgsql WITH SCHEMA pg_catalog; -- -- Name: EXTENSION plpgsql; Type: COMMENT; Schema: -; Owner: -- COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language'; SET search_path = a, pg_catalog; SET default_tables
Re: [HACKERS] cache lookup failed error for partition key with custom opclass
On Tue, Jul 25, 2017 at 7:43 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Rushabh Lathia <rushabh.lat...@gmail.com> writes: > > On Mon, Jul 24, 2017 at 7:23 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > >> Some looking around says that this *isn't* the only place that just > >> blithely assumes that it will find an opfamily entry. But I agree > >> that not checking for that isn't up to project standards. > > > I go thorough the get_opfamily_proc() in the system and added the > > check for InvalidOid. > > Think I did that already, please compare your results with > 278cb4341103e967189997985b09981a73e23a34 > Thanks Tom. > > regards, tom lane > -- Rushabh Lathia
Re: [HACKERS] cache lookup failed error for partition key with custom opclass
On Mon, Jul 24, 2017 at 7:23 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Rushabh Lathia <rushabh.lat...@gmail.com> writes:
> > PFA patch, where added elog() to add the error message same as all other
> > places.
>
> Some looking around says that this *isn't* the only place that just
> blithely assumes that it will find an opfamily entry.  But I agree
> that not checking for that isn't up to project standards.

Thanks Tom. I went through the get_opfamily_proc() calls in the system and added the check for InvalidOid.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 5267a01..c9c1a54 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -1640,6 +1640,9 @@ ExecInitExprRec(Expr *node, PlanState *parent, ExprState *state,
 										lefttype,
 										righttype,
 										BTORDER_PROC);
+			if (!OidIsValid(proc))		/* should not happen */
+				elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+					 BTORDER_PROC, lefttype, righttype, opfamily);

 			/* Set up the primary fmgr lookup information */
 			finfo = palloc0(sizeof(FmgrInfo));

diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index d8aceb1..ff758d6 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -1367,6 +1367,9 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 												   op_lefttype,
 												   op_righttype,
 												   BTORDER_PROC);
+				if (!OidIsValid(opfuncid))		/* should not happen */
+					elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+						 BTORDER_PROC, op_lefttype, op_righttype, opfamily);

 				inputcollation = lfirst_oid(collids_cell);
 				collids_cell = lnext(collids_cell);

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 43238dd..6bd93b0 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -945,6 +945,10 @@ RelationBuildPartitionKey(Relation relation)
 									 opclassform->opcintype,
 									 opclassform->opcintype,
 									 BTORDER_PROC);
+		if (!OidIsValid(funcid))	/* should not happen */
+			elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+				 BTORDER_PROC, opclassform->opcintype, opclassform->opcintype,
+				 opclassform->opcfamily);

 		fmgr_info(funcid, &key->partsupfunc[i]);

diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c
index 7ec31eb..bc2f164 100644
--- a/src/backend/utils/cache/typcache.c
+++ b/src/backend/utils/cache/typcache.c
@@ -440,6 +440,10 @@ lookup_type_cache(Oid type_id, int flags)
 										 typentry->btree_opintype,
 										 typentry->btree_opintype,
 										 BTORDER_PROC);
+		if (!OidIsValid(cmp_proc))	/* should not happen */
+			elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+				 BTORDER_PROC, typentry->btree_opintype,
+				 typentry->btree_opintype, typentry->btree_opf);

 		/* As above, make sure array_cmp or record_cmp will succeed */
 		if (cmp_proc == F_BTARRAYCMP &&

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] cache lookup failed error for partition key with custom opclass
Hi,

Consider the following test:

CREATE OR REPLACE FUNCTION dummy_binaryint4(a int4, b int4) RETURNS int4 AS $$ BEGIN RETURN a; END; $$ LANGUAGE 'plpgsql' IMMUTABLE;

CREATE OPERATOR CLASS custom_opclass2 FOR TYPE int2 USING BTREE AS OPERATOR 1 = , FUNCTION 1 dummy_binaryint4(int4, int4);

t=# CREATE TABLE list_tab(a int2, b int) PARTITION BY LIST (a custom_opclass2);
*ERROR: cache lookup failed for function 0*

In the above test, we create an operator class for type int2 but pass it a function of int4 type. During CREATE TABLE ... PARTITION BY, ComputePartitionAttrs() is able to resolve the opclass for the partition key (the partition key type is int2), but when looking up the support function for int2 it is unable to find a proper function and ends up with the "cache lookup failed" error. The error comes from RelationBuildPartitionKey().

I think overall this is expected, but the error could still be better - like all the other places where get_opfamily_proc() is unable to find a valid function OID.

PFA patch, where an elog() is added to emit the same error message as all the other places.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 43238dd..6bd93b0 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -945,6 +945,10 @@ RelationBuildPartitionKey(Relation relation)
 									 opclassform->opcintype,
 									 opclassform->opcintype,
 									 BTORDER_PROC);
+		if (!OidIsValid(funcid))	/* should not happen */
+			elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+				 BTORDER_PROC, opclassform->opcintype, opclassform->opcintype,
+				 opclassform->opcfamily);

 		fmgr_info(funcid, &key->partsupfunc[i]);
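The failure mode is easy to see in miniature: get_opfamily_proc() returns InvalidOid (0) when no matching pg_amproc entry exists, and passing that 0 straight into fmgr_info() is what produces the unhelpful "cache lookup failed for function 0". The guard the patch adds is the usual OidIsValid() check. A self-contained sketch of the pattern (the lookup function below is a stand-in for illustration, not the real catalog lookup):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;
#define InvalidOid  ((Oid) 0)
#define OidIsValid(objectId)  ((bool) ((objectId) != InvalidOid))

/*
 * Stand-in for get_opfamily_proc(): returns the support function's OID,
 * or InvalidOid when the opfamily has no entry for the given types --
 * which is exactly what happens with the int2 opclass + int4 function
 * in the test above.
 */
static Oid
lookup_support_proc(bool entry_exists)
{
    return entry_exists ? (Oid) 16384 : InvalidOid;
}

/*
 * Returns false instead of proceeding with OID 0; the real code would
 * raise elog(ERROR, "missing support function ...") here.
 */
static bool
check_support_proc(bool entry_exists)
{
    Oid funcid = lookup_support_proc(entry_exists);

    if (!OidIsValid(funcid))    /* should not happen */
        return false;
    return true;
}
```

The point of the patch is simply to move the failure from the opaque fmgr lookup to an explicit, self-describing error at the call site.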
[HACKERS] RLS policy not getting honer while pg_dump on declarative partition
While doing some testing I noticed that the RLS policy is not getting honored during pg_dump on a declarative partition. I can understand that while doing SELECT on an individual child table, the policy of the parent is not applied. But is this desirable behaviour? I think for partitions, any policy on the root table should get redirected to the child - thoughts? If the current behaviour is desirable then at least we should document it.

Consider the below test:

\c postgres rushabh
CREATE USER rls_test_user1;

CREATE TABLE tp_sales ( visibility VARCHAR(30), sales_region VARCHAR(30) ) PARTITION BY LIST (sales_region);
create table tp_sales_p_india partition of tp_sales for values in ('INDIA');
create table tp_sales_p_rest partition of tp_sales for values in ('REST');

insert into tp_sales values ( 'hidden', 'INDIA');
insert into tp_sales values ( 'visible', 'INDIA');
insert into tp_sales values ( 'hidden', 'REST');
insert into tp_sales values ( 'visible', 'REST');

GRANT SELECT ON tp_sales to rls_test_user1;
GRANT SELECT ON tp_sales_p_india to rls_test_user1;
GRANT SELECT ON tp_sales_p_rest to rls_test_user1;

ALTER TABLE tp_sales ENABLE ROW LEVEL SECURITY;
CREATE POLICY dump_p1 ON tp_sales FOR ALL USING (visibility = 'visible');

\c - rls_test_user1

-- SELECT honors the policy
SELECT * FROM tp_sales;

When we run pg_dump as user rls_test_user1, we can see the hidden rows in the pg_dump output.

./db/bin/pg_dump -U rls_test_user1 postgres --inserts

Attaching the dump output.
Thanks, Rushabh Lathia www.EnterpriseDB.com -- -- PostgreSQL database dump -- -- Dumped from database version 10beta1 -- Dumped by pg_dump version 10beta1 SET statement_timeout = 0; SET lock_timeout = 0; SET idle_in_transaction_session_timeout = 0; SET client_encoding = 'UTF8'; SET standard_conforming_strings = on; SET check_function_bodies = false; SET client_min_messages = warning; SET row_security = off; -- -- Name: postgres; Type: COMMENT; Schema: -; Owner: rushabh -- COMMENT ON DATABASE postgres IS 'default administrative connection database'; -- -- Name: plpgsql; Type: EXTENSION; Schema: -; Owner: -- CREATE EXTENSION IF NOT EXISTS plpgsql WITH SCHEMA pg_catalog; -- -- Name: EXTENSION plpgsql; Type: COMMENT; Schema: -; Owner: -- COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language'; SET search_path = public, pg_catalog; SET default_tablespace = ''; SET default_with_oids = false; -- -- Name: tp_sales; Type: TABLE; Schema: public; Owner: rushabh -- CREATE TABLE tp_sales ( visibility character varying(30), sales_region character varying(30) ) PARTITION BY LIST (sales_region); ALTER TABLE tp_sales OWNER TO rushabh; -- -- Name: tp_sales_p_india; Type: TABLE; Schema: public; Owner: rushabh -- CREATE TABLE tp_sales_p_india PARTITION OF tp_sales FOR VALUES IN ('INDIA'); ALTER TABLE tp_sales_p_india OWNER TO rushabh; -- -- Name: tp_sales_p_rest; Type: TABLE; Schema: public; Owner: rushabh -- CREATE TABLE tp_sales_p_rest PARTITION OF tp_sales FOR VALUES IN ('REST'); ALTER TABLE tp_sales_p_rest OWNER TO rushabh; -- -- Data for Name: tp_sales_p_india; Type: TABLE DATA; Schema: public; Owner: rushabh -- INSERT INTO tp_sales_p_india VALUES ('hidden', 'INDIA'); INSERT INTO tp_sales_p_india VALUES ('visible', 'INDIA'); -- -- Data for Name: tp_sales_p_rest; Type: TABLE DATA; Schema: public; Owner: rushabh -- INSERT INTO tp_sales_p_rest VALUES ('hidden', 'REST'); INSERT INTO tp_sales_p_rest VALUES ('visible', 'REST'); -- -- Name: tp_sales dump_p1; Type: POLICY; Schema: 
public; Owner: rushabh -- CREATE POLICY dump_p1 ON tp_sales USING (((visibility)::text = 'visible'::text)); -- -- Name: tp_sales; Type: ROW SECURITY; Schema: public; Owner: rushabh -- ALTER TABLE tp_sales ENABLE ROW LEVEL SECURITY; -- -- Name: tp_sales; Type: ACL; Schema: public; Owner: rushabh -- GRANT SELECT ON TABLE tp_sales TO rls_test_user1; -- -- Name: tp_sales_p_india; Type: ACL; Schema: public; Owner: rushabh -- GRANT SELECT ON TABLE tp_sales_p_india TO rls_test_user1; -- -- Name: tp_sales_p_rest; Type: ACL; Schema: public; Owner: rushabh -- GRANT SELECT ON TABLE tp_sales_p_rest TO rls_test_user1; -- -- PostgreSQL database dump complete --
Re: [HACKERS] UPDATE of partition key
On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> > Option 3
> >
> > BR, AR delete triggers on source partition
> > BR, AR insert triggers on destination partition.
> >
> > Rationale :
> > Since the update is converted to delete+insert, just skip the update
> > triggers completely.
>
> +1 to option3
> Generally, BR triggers are used for updating the ROW value and AR
> triggers to VALIDATE the row or to modify some other tables. So it
> seems that we can fire the triggers what is actual operation is
> happening at the partition level.
>
> For source partition, it's only the delete operation (no update
> happened) so we fire delete triggers and for the destination only
> insert operations so fire only inserts triggers. That will keep the
> things simple. And, it will also be in sync with the actual partition
> level delete/insert operations.
>
> We may argue that user might have declared only update triggers and as
> he has executed the update operation he may expect those triggers to
> get fired. But, I think this behaviour can be documented with the
> proper logic that if the user is updating the partition key then he
> must be ready with the Delete/Insert triggers also, he can not rely
> only upon update level triggers.

Right, that is my concern too: a user might have declared only UPDATE triggers, and when executing an UPDATE would expect them to be called - but with option 3 that does not happen.

In terms of consistency, option 1 looks better. It does the same thing that has been implemented for UPSERT, so users may already be familiar with that trigger behaviour. Plus, if we document the behaviour then it reads consistently:

- The original command was UPDATE, so BR update triggers fire.
- Then we find that it is row movement, so BR delete followed by AR delete.
- Then the insert into the new partition, so BR insert followed by AR insert.
But again, I am not quite sure how good it will be to compare the partition behaviour with UPSERT. > Earlier I thought that option1 is better but later I think that this > can complicate the situation as we are firing first BR update then BR > delete and can change the row multiple time and defining such > behaviour can be complicated. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com -- Rushabh Lathia
Re: [HACKERS] Adding support for Default partition in partitioning
On 2017/04/06 0:19, Robert Haas wrote: > On Wed, Apr 5, 2017 at 5:57 AM, Rahila Syed <rahilasye...@gmail.com> wrote: >>> Could you briefly elaborate why you think the lack global index support >>> would be a problem in this regard? >> I think following can happen if we allow rows satisfying the new partition >> to lie around in the >> default partition until background process moves it. >> Consider a scenario where partition key is a primary key and the data in the >> default partition is >> not yet moved into the newly added partition. If now, new data is added into >> the new partition >> which also exists(same key) in default partition there will be data >> duplication. If now >> we scan the partitioned table for that key(from both the default and new >> partition as we >> have not moved the rows) it will fetch the both rows. >> Unless we have global indexes for partitioned tables, there is chance of >> data duplication between >> child table added after default partition and the default partition. > > Yes, I think it would be completely crazy to try to migrate the data > in the background: > > - The migration might never complete because of a UNIQUE or CHECK > constraint on the partition to which rows are being migrated. > > - Even if the migration eventually succeeded, such a design abandons > all hope of making INSERT .. ON CONFLICT DO NOTHING work sensibly > while the migration is in progress, unless the new partition has no > UNIQUE constraints. > > - Partition-wise join and partition-wise aggregate would need to have > special case handling for the case of an unfinished migration, as > would any user code that accesses partitions directly. > > - More generally, I think users expect that when a DDL command > finishes execution, it's done all of the work that there is to do (or > at the very least, that any remaining work has no user-visible > consequences, which would not be the case here). Thanks Robert for this explanation. 
This makes it more clear, why row movement by background is not sensible idea. On Thu, Apr 6, 2017 at 9:38 AM, Keith Fiske <ke...@omniti.com> wrote: > > On Wed, Apr 5, 2017 at 2:51 PM, Keith Fiske <ke...@omniti.com> wrote: > >> >> >> Only issue I see with this, and I'm not sure if it is an issue, is what >> happens to that default constraint clause when 1000s of partitions start >> getting added? From what I gather the default's constraint is built based >> off the cumulative opposite of all other child constraints. I don't >> understand the code well enough to see what it's actually doing, but if >> there are no gaps, is the method used smart enough to aggregate all the >> child constraints to make a simpler constraint that is simply outside the >> current min/max boundaries? If so, for serial/time range partitioning this >> should typically work out fine since there are rarely gaps. This actually >> seems more of an issue for list partitioning where each child is a distinct >> value or range of values that are completely arbitrary. Won't that check >> and re-evaluation of the default's constraint just get worse and worse as >> more children are added? Is there really even a need for the default to >> have an opposite constraint like this? Not sure on how the planner works >> with partitioning now, but wouldn't it be better to first check all >> non-default children for a match the same as it does now without a default >> and, failing that, then route to the default if one is declared? The >> default should accept any data then so I don't see the need for the >> constraint unless it's required for the current implementation. If that's >> the case, could that be changed? >> >> Keith >> > > Actually, thinking on this more, I realized this does again come back to > the lack of a global index. Without the constraint, data could be put > directly into the default that could technically conflict with the > partition scheme elsewhere. 
Perhaps, instead of the constraint, inserts > directly to the default could be prevented on the user level. Writing to > valid children directly certainly has its place, but been thinking about > it, and I can't see any reason why one would ever want to write directly to > the default. It's use case seems to be around being a sort of temporary > storage until that data can be moved to a valid location. Would still need > to allow removal of data, though. > > Not sure if that's even a workable solution. Just trying to think of ways > around the current limitations and still allow this feature. > I like the idea of having a DEFAULT partition for range partitioning. With the way partitioning is designed, a range-partitioned table can have holes between the ranges. I think a DEFAULT for range partitioning is a good idea, mainly when the ranges have holes. When the range is serial then, of course, a DEFAULT partition doesn't make much sense. Regards, Rushabh Lathia www.EnterpriseDB.com
Re: [HACKERS] Adding support for Default partition in partitioning
bal index support. If this restriction is removed, there's still > an > > issue of data duplication after the necessary child table is added. So > > guess it's a matter of deciding which user experience is better for the > > moment? > > I'm not sure about the fate of this particular patch for v10, but until we > implement a solution to move rows and design appropriate interface for the > same, we could error out if moving rows is required at all, like the patch > does. > > +1 I agree about the future plan for row movement; exactly how to do it, I am not quite sure at this stage. I was thinking that CREATE of a new partition is a DDL command, so even if row movement works while holding a lock on the new partition table, that should be fine. I am not quite sure why row movement should happen in a background process. Of course, one idea is that if someone doesn't want the row-movement feature, we might put it under some GUC, or maybe add it as another option to the CREATE partition command. Could you briefly elaborate why you think the lack global index support > would be a problem in this regard? > > I agree that some design is required here to implement a solution > redistribution of rows; not only in the context of supporting the notion > of default partitions, but also to allow the feature to split/merge range > (only?) partitions. I'd like to work on the latter for v11 for which I > would like to post a proposal soon; if anyone would like to collaborate > (ideas, code, review), I look forward to. (sorry for hijacking this > thread.) > > Thanks, > Amit > > > -- Rushabh Lathia
Re: [HACKERS] [COMMITTERS] pgsql: Avoid GatherMerge crash when there are no workers.
Hi, On Mon, Apr 3, 2017 at 10:56 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Sat, Apr 1, 2017 at 7:58 PM, Robert Haas <robertmh...@gmail.com> wrote: > >> On Fri, Mar 31, 2017 at 10:26 PM, Andres Freund <and...@anarazel.de> >> wrote: >> > Hi, >> > >> > On 2017-04-01 01:22:14 +, Robert Haas wrote: >> >> Avoid GatherMerge crash when there are no workers. >> > >> > I think the gather merge code needs a bit more test coverage (sorry to >> > make this a larger theme today). As shown by >> > https://coverage.postgresql.org/src/backend/executor/nodeGat >> herMerge.c.gcov.html >> > we don't actually merge anything (heap_compare_slots is not exercised). >> >> Sounds reasonable. >> > > I am working on this and will submit the patch soon. > > In my local environment I was getting coverage for heap_compare_slots with the existing tests, but it seems that, due to the low number of records fetched, only the leader got involved in fetching tuples in the shared report. I modified the existing Gather Merge test to have the GatherMerge node fetch more rows, to make sure heap_compare_slots gets covered. Also added a test for zero workers. PFA the patch as well as the LCOV report for nodeGatherMerge.c. >> > I btw also wonder if it's good that we have a nearly identical copy of >> > heap_compare_slots and a bunch of the calling code in both >> > nodeMergeAppend.c and nodeGatherMerge.c. On the other hand, it's not >> > heavily envolving code. >> >> Yeah, I don't know. We could alternatively try to move that to some >> common location and merge the two implementations. I'm not sure >> exactly where, though.
>> >> -- >> Robert Haas >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company > -- > Rushabh Lathia -- Rushabh Lathia www.EnterpriseDB.com

[Attachment: LCOV code coverage report for src/backend/executor/nodeGatherMerge.c - Lines: 200 of 208 hit (96.2%), Functions: 12 of 13 hit (92.3%), Date: 2017-04-03]
Re: [HACKERS] [COMMITTERS] pgsql: Avoid GatherMerge crash when there are no workers.
On Sat, Apr 1, 2017 at 7:58 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Fri, Mar 31, 2017 at 10:26 PM, Andres Freund <and...@anarazel.de> > wrote: > > Hi, > > > > On 2017-04-01 01:22:14 +, Robert Haas wrote: > >> Avoid GatherMerge crash when there are no workers. > > > > I think the gather merge code needs a bit more test coverage (sorry to > > make this a larger theme today). As shown by > > https://coverage.postgresql.org/src/backend/executor/ > nodeGatherMerge.c.gcov.html > > we don't actually merge anything (heap_compare_slots is not exercised). > > Sounds reasonable. > I am working on this and will submit the patch soon. > > > I btw also wonder if it's good that we have a nearly identical copy of > > heap_compare_slots and a bunch of the calling code in both > > nodeMergeAppend.c and nodeGatherMerge.c. On the other hand, it's not > > heavily envolving code. > > Yeah, I don't know. We could alternatively try to move that to some > common location and merge the two implementations. I'm not sure > exactly where, though. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company -- Rushabh Lathia
Re: [HACKERS] crashes due to setting max_parallel_workers=0
On Wed, Mar 29, 2017 at 8:38 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > Hi, > > On 03/28/2017 11:07 AM, Rushabh Lathia wrote: > >> ... >> I think we all agree that we should get rid of nreaders from the >> GatherMergeState and need to do some code re-factor. But if I >> understood correctly that Robert's concern was to do that re-factor >> as separate commit. >> >> > Maybe. It depends on how valuable it's to keep Gather and GatherMerge > similar - having nreaders in one and not the other seems a bit weird. But > maybe the refactoring would remove it from both nodes? > > Also, it does not really solve the issue that we're using 'nreaders' or > 'nworkers_launched' to access array with one extra element. > With the latest patches, in gather_merge_init() there is one loop where we initialize the tuple array (which holds the tuples that came from each worker) and the slot (which holds the current tuple). We hold a tuple array for each worker because, when we read tuples from workers, it's a good idea to read several at once for efficiency when possible: this minimizes context-switching overhead. For the leader we don't initialize the slot, because for the leader we directly call ExecProcNode(), which returns the slot; there is also no point in holding a tuple array for the leader. With the latest patches, gather_merge_clear_slots() will only clean the tuple table slot and the tuple array buffer for the workers (nworkers_launched). The only time we loop through "nworkers_launched + 1" is while reading a tuple from each gather merge input. Personally I don't see any problem with that, but I am very much open to suggestions. > > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > -- Rushabh Lathia
Re: [HACKERS] crashes due to setting max_parallel_workers=0
On Tue, Mar 28, 2017 at 10:00 AM, Tomas Vondra <tomas.von...@2ndquadrant.com > wrote: > > > On 03/27/2017 01:40 PM, Rushabh Lathia wrote: > >> >> ... >> I was doing more testing with the patch and I found one more server >> crash with the patch around same area, when we forced the gather >> merge for the scan having zero rows. >> >> create table dept ( deptno numeric, dname varchar(20); >> set parallel_tuple_cost =0; >> set parallel_setup_cost =0; >> set min_parallel_table_scan_size =0; >> set min_parallel_index_scan_size =0; >> set force_parallel_mode=regress; >> explain analyze select * from dept order by deptno; >> >> This is because for leader we don't initialize the slot into gm_slots. So >> in case where launched worker is zero and table having zero rows, we >> end up having NULL slot into gm_slots array. >> >> Currently gather_merge_clear_slots() clear out the tuple table slots for >> each >> gather merge input and returns clear slot. In the patch I modified >> function >> gather_merge_clear_slots() to just clear out the tuple table slots and >> always return NULL when All the queues and heap us exhausted. >> >> > Isn't that just another sign the code might be a bit too confusing? I see > two main issues in the code: > > 1) allocating 'slots' as 'nreaders+1' elements, which seems like a good > way to cause off-by-one errors > > 2) mixing objects with different life spans (slots for readers vs. slot > for the leader) seems like a bad idea too > > I wonder how much we gain by reusing the slot from the leader (I'd be > surprised if it was at all measurable). David posted a patch reworking > this, and significantly simplifying the GatherMerge node. Why not to accept > that? > > > I think we all agree that we should get rid of nreaders from GatherMergeState and do some code refactoring. But if I understood correctly, Robert's concern was to do that refactoring as a separate commit. I picked up David's patch and started reviewing the changes.
I applied that patch on top of my v2 patch (which does the re-factoring of gather_merge_clear_slots). In David's patch, in gather_merge_init(), the loop that allocates the tuple array needs to run only up to nworkers_launched, because we don't hold a tuple array for the leader. I changed that and made some other simple changes based on my v2 patch. I also performed manual testing with the changes. Please find attached the re-factoring patch, plus the v2 patch submitted for the server crash fix. (Attaching both patches here again, for ease of access.) Thanks, -- Rushabh Lathia www.EnterpriseDB.com diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c index 62c399e..2d7eb71 100644 --- a/src/backend/executor/nodeGatherMerge.c +++ b/src/backend/executor/nodeGatherMerge.c @@ -195,9 +195,9 @@ ExecGatherMerge(GatherMergeState *node) /* Set up tuple queue readers to read the results. */ if (pcxt->nworkers_launched > 0) { -node->nreaders = 0; -node->reader = palloc(pcxt->nworkers_launched * - sizeof(TupleQueueReader *)); +node->reader = (TupleQueueReader **) + palloc(pcxt->nworkers_launched * + sizeof(TupleQueueReader *)); Assert(gm->numCols); @@ -205,7 +205,7 @@ ExecGatherMerge(GatherMergeState *node) { shm_mq_set_handle(node->pei->tqueue[i], pcxt->worker[i].bgwhandle); - node->reader[node->nreaders++] = + node->reader[i] = CreateTupleQueueReader(node->pei->tqueue[i], node->tupDesc); } @@ -298,7 +298,7 @@ ExecShutdownGatherMergeWorkers(GatherMergeState *node) { int i; - for (i = 0; i < node->nreaders; ++i) + for (i = 0; i < node->nworkers_launched; ++i) if (node->reader[i]) DestroyTupleQueueReader(node->reader[i]); @@ -344,28 +344,26 @@ ExecReScanGatherMerge(GatherMergeState *node) static void gather_merge_init(GatherMergeState *gm_state) { - int nreaders = gm_state->nreaders; + int nslots = gm_state->nworkers_launched + 1; bool initialize = true; int i; /* * Allocate gm_slots for the number of worker + one more slot for leader.
- * Last slot is always for leader. Leader always calls ExecProcNode() to - * read the tuple which will return the TupleTableSlot. Later it will - * directly get assigned to gm_slot. So just initialize leader gm_slot - * with NULL. For other slots below code will call - * ExecInitExtraTupleSlot() which will do the initialization of worker - * slots. + * The final slot in the array is reserved for the leader process. This + * slot is always populated via ExecProcNode(). This can be set to NULL + * for now. The remaining slo
Re: [HACKERS] crashes due to setting max_parallel_workers=0
On Mon, Mar 27, 2017 at 9:52 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Mon, Mar 27, 2017 at 12:11 PM, David Rowley > <david.row...@2ndquadrant.com> wrote: > > On 28 March 2017 at 04:57, Robert Haas <robertmh...@gmail.com> wrote: > >> On Sat, Mar 25, 2017 at 12:18 PM, Rushabh Lathia > >> <rushabh.lat...@gmail.com> wrote: > >>> About the original issue reported by Tomas, I did more debugging and > >>> found that - problem was gather_merge_clear_slots() was not returning > >>> the clear slot when nreader is zero (means nworkers_launched = 0). > >>> Due to the same scan was continue even all the tuple are exhausted, > >>> and then end up with server crash at gather_merge_getnext(). In the > patch > >>> I also added the Assert into gather_merge_getnext(), about the index > >>> should be less then the nreaders + 1 (leader). > >> > >> Well, you and David Rowley seem to disagree on what the fix is here. > >> His patches posted upthread do A, and yours do B, and from a quick > >> look those things are not just different ways of spelling the same > >> underlying fix, but actually directly conflicting ideas about what the > >> fix should be. Any chance you can review his patches, and maybe he > >> can review yours, and we could try to agree on a consensus position? > >> :-) > > > > When comparing Rushabh's to my minimal patch, both change > > gather_merge_clear_slots() to clear the leader's slot. My fix > > mistakenly changes things so it does ExecInitExtraTupleSlot() on the > > leader's slot, but seems that's not required since > > gather_merge_readnext() sets the leader's slot to the output of > > ExecProcNode(outerPlan). I'd ignore my minimal fix because of that > > mistake. Rushabh's patch sidesteps this by adding a conditional > > pfree() to not free slot that we didn't allocate in the first place. > > > > I do think the code could be improved a bit. I don't like the way > > GatherMergeState's nreaders and nworkers_launched are always the same. 
> > I think this all threw me off a bit and may have been the reason for > > the bug in the first place. > > Yeah, if those fields are always the same, then I think that they > should be merged. That seems undeniably a good idea. Hmm, I agree that it's a good idea, and I will work on that as a separate patch. > Possibly we > should fix the crash bug first, though, and then do that afterwards. > What bugs me a little about Rushabh's fix is that it looks like magic. > You have to know that we're looping over two things and freeing them > up, but there's one more of one thing than the other thing. I think > that at least needs some comments or something. > > So in my second version of the patch I changed gather_merge_clear_slots() to just clear the slots for the workers, plus some other cleanup. It also returns NULL from gather_merge_getnext() when all the queues and the heap are exhausted, where gather_merge_clear_slots() previously returned a cleared slot. This way we make sure that we don't run over freeing the slot for the leader, and gather_merge_getnext() doesn't need to depend on that cleared slot. Thanks, Rushabh Lathia www.EnterpriseDB.com
Re: [HACKERS] crashes due to setting max_parallel_workers=0
On Mon, Mar 27, 2017 at 10:59 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Mon, Mar 27, 2017 at 3:43 AM, Tomas Vondra < > tomas.von...@2ndquadrant.com> wrote: > >> On 03/25/2017 05:18 PM, Rushabh Lathia wrote: >> >>> >>> >>> On Sat, Mar 25, 2017 at 7:01 PM, Peter Eisentraut >>> <peter.eisentr...@2ndquadrant.com >>> <mailto:peter.eisentr...@2ndquadrant.com>> wrote: >>> >>> On 3/25/17 09:01, David Rowley wrote: >>> > On 25 March 2017 at 23:09, Rushabh Lathia < >>> rushabh.lat...@gmail.com <mailto:rushabh.lat...@gmail.com>> wrote: >>> >> Also another point which I think we should fix is, when someone >>> set >>> >> max_parallel_workers = 0, we should also set the >>> >> max_parallel_workers_per_gather >>> >> to zero. So that way it we can avoid generating the gather path >>> with >>> >> max_parallel_worker = 0. >>> > I see that it was actually quite useful that it works the way it >>> does. >>> > If it had worked the same as max_parallel_workers_per_gather, then >>> > likely Tomas would never have found this bug. >>> >>> Another problem is that the GUC system doesn't really support cases >>> where the validity of one setting depends on the current value of >>> another setting. So each individual setting needs to be robust >>> against >>> cases of related settings being nonsensical. >>> >>> >>> Okay. >>> >>> About the original issue reported by Tomas, I did more debugging and >>> found that - problem was gather_merge_clear_slots() was not returning >>> the clear slot when nreader is zero (means nworkers_launched = 0). >>> Due to the same scan was continue even all the tuple are exhausted, >>> and then end up with server crash at gather_merge_getnext(). In the patch >>> I also added the Assert into gather_merge_getnext(), about the index >>> should be less then the nreaders + 1 (leader). >>> >>> PFA simple patch to fix the problem. 
>>> >>> >> I think there are two issues at play, here - the first one is that we >> still produce parallel plans even with max_parallel_workers=0, and the >> second one is the crash in GatherMerge when nworkers=0. >> >> Your patch fixes the latter (thanks for looking into it), which is >> obviously a good thing - getting 0 workers on a busy system is quite >> possible, because all the parallel workers can be already chewing on some >> other query. >> >> > Thanks. > > I was doing more testing with the patch and I found one more server crash in the same area, when we forced a gather merge on a scan having zero rows. create table dept ( deptno numeric, dname varchar(20)); set parallel_tuple_cost =0; set parallel_setup_cost =0; set min_parallel_table_scan_size =0; set min_parallel_index_scan_size =0; set force_parallel_mode=regress; explain analyze select * from dept order by deptno; This is because for the leader we don't initialize the slot in gm_slots. So in the case where the number of launched workers is zero and the table has zero rows, we end up with a NULL slot in the gm_slots array. Currently gather_merge_clear_slots() clears out the tuple table slots for each gather merge input and returns a cleared slot. In the patch I modified gather_merge_clear_slots() to just clear out the tuple table slots and always return NULL when all the queues and the heap are exhausted. > But it seems a bit futile to produce the parallel plan in the first place, >> because with max_parallel_workers=0 we can't possibly get any parallel >> workers ever. I wonder why compute_parallel_worker() only looks at >> max_parallel_workers_per_gather, i.e. why shouldn't it do: >> >>parallel_workers = Min(parallel_workers, max_parallel_workers); >> >> > I agree with you here. Producing the parallel plan when > max_parallel_workers = 0 is wrong.
But rather then your suggested fix, I > think that we should do something like: > > /* > * In no case use more than max_parallel_workers_per_gather or > * max_parallel_workers. > */ > parallel_workers = Min(parallel_workers, Min(max_parallel_workers, > max_parallel_workers_per_gather)); > > > >> Perhaps this was discussed and is actually intentional, though. >> >> > Yes, I am not quite sure about this. > > Regarding handling this at the GUC level - I agree with Peter that that's >> not a good idea. I suppose we could d
Re: [HACKERS] Adding support for Default partition in partitioning
I applied the patch and was trying to perform some testing, but it ends up with a server crash with the test you shared in your starting mail: postgres=# CREATE TABLE list_partitioned ( postgres(# a int postgres(# ) PARTITION BY LIST (a); CREATE TABLE postgres=# postgres=# CREATE TABLE part_default PARTITION OF list_partitioned FOR VALUES IN (DEFAULT); CREATE TABLE postgres=# CREATE TABLE part_1 PARTITION OF list_partitioned FOR VALUES IN (4,5); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. Apart from this, some more explanation is needed in the patch to describe the changes for the DEFAULT partition. For example, I am not quite sure what exactly the latest version of the patch supports: does it support tuple row movement, and will adding a new partition be allowed when the partitioned table has a DEFAULT partition? That is quite difficult to understand from the code changes alone. Another part missing from the patch is test coverage; proper tests should be added that demonstrate what is supported and what is not. Thanks, On Fri, Mar 24, 2017 at 3:25 PM, Rahila Syed <rahilasye...@gmail.com> wrote: > Hello Rushabh, > > Thank you for reviewing. > Have addressed all your comments in the attached patch. The attached patch > currently throws an > error if a new partition is added after default partition. > > >Rather then adding check for default here, I think this should be handle > inside > >get_qual_for_list(). > Have moved the check inside get_qual_for_partbound() as needed to do some > operations > before calling get_qual_for_list() for default partitions. > > Thank you, > Rahila Syed > > On Tue, Mar 21, 2017 at 11:36 AM, Rushabh Lathia <rushabh.lat...@gmail.com > > wrote: > >> I picked this for review and noticed that patch is not getting >> cleanly complied on my environment.
>> >> partition.c: In function ‘RelationBuildPartitionDesc’: >> partition.c:269:6: error: ISO C90 forbids mixed declarations and code >> [-Werror=declaration-after-statement] >> Const*val = lfirst(c); >> ^ >> partition.c: In function ‘generate_partition_qual’: >> partition.c:1590:2: error: ISO C90 forbids mixed declarations and code >> [-Werror=declaration-after-statement] >> PartitionBoundSpec *spec = (PartitionBoundSpec *)bound; >> ^ >> partition.c: In function ‘get_partition_for_tuple’: >> partition.c:1820:5: error: array subscript has type ‘char’ >> [-Werror=char-subscripts] >> result = parent->indexes[partdesc->boundinfo->def_index]; >> ^ >> partition.c:1824:16: error: assignment makes pointer from integer without >> a cast [-Werror] >> *failed_at = RelationGetRelid(parent->reldesc); >> ^ >> cc1: all warnings being treated as errors >> >> Apart from this, I was reading patch here are few more comments: >> >> 1) Variable initializing happening at two place. >> >> @@ -166,6 +170,8 @@ RelationBuildPartitionDesc(Relation rel) >> /* List partitioning specific */ >> PartitionListValue **all_values = NULL; >> boolfound_null = false; >> +boolfound_def = false; >> +intdef_index = -1; >> intnull_index = -1; >> >> /* Range partitioning specific */ >> @@ -239,6 +245,8 @@ RelationBuildPartitionDesc(Relation rel) >> i = 0; >> found_null = false; >> null_index = -1; >> +found_def = false; >> +def_index = -1; >> foreach(cell, boundspecs) >> { >> ListCell *c; >> @@ -249,6 +257,15 @@ RelationBuildPartitionDesc(Relation rel) >> >> >> 2) >> >> @@ -1558,6 +1586,19 @@ generate_partition_qual(Relation rel) >> bound = stringToNode(TextDatumGetCString(boundDatum)); >> ReleaseSysCache(tuple); >> >> +/* Return if it is a default list partition */ >> +PartitionBoundSpec *spec = (PartitionBoundSpec *)bound; >> +ListCell *cell; >> +foreach(cell, spec->listdatums) >> >> More comment on above hunk is needed? 
>> >> Rather then adding check for default here, I think this should be handle >> inside >> get_qual_for_list(). >> >> 3) Code is not aligned with existing >> >> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y >> index 6316688..ebb7db7 100644 >> --- a/src/b
Re: [HACKERS] crashes due to setting max_parallel_workers=0
On Mon, Mar 27, 2017 at 3:43 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > On 03/25/2017 05:18 PM, Rushabh Lathia wrote: > >> >> >> On Sat, Mar 25, 2017 at 7:01 PM, Peter Eisentraut >> <peter.eisentr...@2ndquadrant.com >> <mailto:peter.eisentr...@2ndquadrant.com>> wrote: >> >> On 3/25/17 09:01, David Rowley wrote: >> > On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lat...@gmail.com >> <mailto:rushabh.lat...@gmail.com>> wrote: >> >> Also another point which I think we should fix is, when someone set >> >> max_parallel_workers = 0, we should also set the >> >> max_parallel_workers_per_gather >> >> to zero. So that way it we can avoid generating the gather path >> with >> >> max_parallel_worker = 0. >> > I see that it was actually quite useful that it works the way it >> does. >> > If it had worked the same as max_parallel_workers_per_gather, then >> > likely Tomas would never have found this bug. >> >> Another problem is that the GUC system doesn't really support cases >> where the validity of one setting depends on the current value of >> another setting. So each individual setting needs to be robust >> against >> cases of related settings being nonsensical. >> >> >> Okay. >> >> About the original issue reported by Tomas, I did more debugging and >> found that - problem was gather_merge_clear_slots() was not returning >> the clear slot when nreader is zero (means nworkers_launched = 0). >> Due to the same scan was continue even all the tuple are exhausted, >> and then end up with server crash at gather_merge_getnext(). In the patch >> I also added the Assert into gather_merge_getnext(), about the index >> should be less then the nreaders + 1 (leader). >> >> PFA simple patch to fix the problem. >> >> > I think there are two issues at play, here - the first one is that we > still produce parallel plans even with max_parallel_workers=0, and the > second one is the crash in GatherMerge when nworkers=0. 
> > Your patch fixes the latter (thanks for looking into it), which is > obviously a good thing - getting 0 workers on a busy system is quite > possible, because all the parallel workers can be already chewing on some > other query. > > Thanks. > But it seems a bit futile to produce the parallel plan in the first place, > because with max_parallel_workers=0 we can't possibly get any parallel > workers ever. I wonder why compute_parallel_worker() only looks at > max_parallel_workers_per_gather, i.e. why shouldn't it do: > >parallel_workers = Min(parallel_workers, max_parallel_workers); > > I agree with you here. Producing the parallel plan when max_parallel_workers = 0 is wrong. But rather than your suggested fix, I think that we should do something like: /* * In no case use more than max_parallel_workers_per_gather or * max_parallel_workers. */ parallel_workers = Min(parallel_workers, Min(max_parallel_workers, max_parallel_workers_per_gather)); > Perhaps this was discussed and is actually intentional, though. > > Yes, I am not quite sure about this. Regarding handling this at the GUC level - I agree with Peter that that's > not a good idea. I suppose we could deal with checking the values in the > GUC check/assign hooks, but what we don't have is a way to undo the changes > in all the GUCs. That is, if I do > >SET max_parallel_workers = 0; >SET max_parallel_workers = 16; > > I expect to end up with just max_parallel_workers GUC changed and nothing > else. > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > -- Rushabh Lathia
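The double-Min clamp proposed above is easy to model in isolation. A minimal sketch (the `Min` macro matches PostgreSQL's, but `clamp_parallel_workers` and its parameters are stand-ins for illustration, not the planner's actual variables):

```c
#include <assert.h>

#define Min(x, y) ((x) < (y) ? (x) : (y))

/*
 * Hypothetical stand-in for the clamp discussed in the thread: the planner's
 * worker estimate must never exceed either GUC limit.  With
 * max_parallel_workers = 0 the result is 0, so no parallel plan is useful.
 */
static int
clamp_parallel_workers(int parallel_workers,
                       int max_parallel_workers,
                       int max_parallel_workers_per_gather)
{
    /*
     * In no case use more than max_parallel_workers_per_gather or
     * max_parallel_workers.
     */
    return Min(parallel_workers,
               Min(max_parallel_workers, max_parallel_workers_per_gather));
}
```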
Re: [HACKERS] crashes due to setting max_parallel_workers=0
On Sat, Mar 25, 2017 at 7:01 PM, Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 3/25/17 09:01, David Rowley wrote: > > On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > >> Also another point which I think we should fix is, when someone set > >> max_parallel_workers = 0, we should also set the > >> max_parallel_workers_per_gather > >> to zero. So that way it we can avoid generating the gather path with > >> max_parallel_worker = 0. > > I see that it was actually quite useful that it works the way it does. > > If it had worked the same as max_parallel_workers_per_gather, then > > likely Tomas would never have found this bug. > > Another problem is that the GUC system doesn't really support cases > where the validity of one setting depends on the current value of > another setting. So each individual setting needs to be robust against > cases of related settings being nonsensical. > Okay. About the original issue reported by Tomas, I did more debugging and found that the problem was that gather_merge_clear_slots() was not returning a cleared slot when nreaders is zero (meaning nworkers_launched = 0). Because of that, the scan continued even after all the tuples were exhausted, and ended up with a server crash in gather_merge_getnext(). In the patch I also added an Assert into gather_merge_getnext() that the index should be less than nreaders + 1 (the leader). PFA a simple patch to fix the problem. 
Thanks, Rushabh Lathia www.Enterprisedb.com diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c index 72f30ab..e19bace 100644 --- a/src/backend/executor/nodeGatherMerge.c +++ b/src/backend/executor/nodeGatherMerge.c @@ -431,9 +431,14 @@ gather_merge_clear_slots(GatherMergeState *gm_state) { int i; - for (i = 0; i < gm_state->nreaders; i++) + for (i = 0; i < gm_state->nreaders + 1; i++) { - pfree(gm_state->gm_tuple_buffers[i].tuple); + /* + * Gather merge always allow leader to participate and we don't + * allocate the tuple, so no need to free the tuple for leader. + */ + if (i != gm_state->nreaders) + pfree(gm_state->gm_tuple_buffers[i].tuple); gm_state->gm_slots[i] = ExecClearTuple(gm_state->gm_slots[i]); } @@ -474,6 +479,8 @@ gather_merge_getnext(GatherMergeState *gm_state) */ i = DatumGetInt32(binaryheap_first(gm_state->gm_heap)); + Assert(i < gm_state->nreaders + 1); + if (gather_merge_readnext(gm_state, i, false)) binaryheap_replace_first(gm_state->gm_heap, Int32GetDatum(i)); else -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
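The shape of the fix in the patch above — iterate over nreaders + 1 slots, but skip the pfree for the leader's slot at index nreaders, since no tuple buffer was ever allocated for the leader — can be modeled with a toy structure (this is an illustration only, not the real GatherMergeState):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy model of the GatherMerge slot layout: nreaders worker tuple buffers
 * plus one leader slot at index nreaders that owns no allocated tuple.
 */
typedef struct
{
    int    nreaders;
    char **tuples;              /* nreaders + 1 entries; leader's is NULL */
} ToyGatherMerge;

static void
toy_clear_slots(ToyGatherMerge *gm)
{
    int i;

    /* Loop over all slots, including the leader's (hence nreaders + 1). */
    for (i = 0; i < gm->nreaders + 1; i++)
    {
        /* The leader participates but we never allocated a tuple for it. */
        if (i != gm->nreaders)
            free(gm->tuples[i]);
        gm->tuples[i] = NULL;   /* stands in for ExecClearTuple() */
    }
}
```

Note that with nreaders = 0 (no workers launched), the loop still runs once and clears the leader's slot — which is exactly the case the original code missed.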
Re: [HACKERS] crashes due to setting max_parallel_workers=0
Thanks, Tomas, for reporting the issue, and thanks, David, for working on this. I can see the problem in GatherMerge, especially when nworkers_launched is zero. I will look into this issue and post a fix for it. Another point which I think we should fix: when someone sets max_parallel_workers = 0, we should also set max_parallel_workers_per_gather to zero. That way we can avoid generating a gather path with max_parallel_workers = 0. Thanks, On Sat, Mar 25, 2017 at 2:21 PM, David Rowley <david.row...@2ndquadrant.com> wrote: > On 25 March 2017 at 13:10, Tomas Vondra <tomas.von...@2ndquadrant.com> > wrote: > > while working on a patch I ran into some crashes that seem to be caused > by > > inconsistent handling of max_parallel_workers - queries still seem to be > > planned with parallel plans enabled, but then crash at execution. > > > > The attached script reproduces the issue on a simple query, causing > crashed > > in GatherMerge, but I assume the issue is more general. > > I had a look at this and found a bunch of off by 1 errors in the code. > > I've attached a couple of patches, one is the minimal fix, and one > more complete one. > > In the more complete one I ended up ditching the > GatherMergeState->nreaders altogether. It was always the same as > nworkers_launched anyway, so I saw no point in it. > Here I've attempted to make the code a bit more understandable, to > prevent further confusion about how many elements are in each array. > Probably there's more that can be done here. I see GatherState has > nreaders too, but I've not looked into the detail of if it suffers > from the same weirdness of nreaders always matching nworkers_launched. > > > -- > David Rowley http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia
[HACKERS] [psql] patch to fix ordering in words_after_create array
Hi All, While looking at the code around tab-complete.c, I found that the ordering in the words_after_create array is not correct for DEFAULT PRIVILEGES, which was added by the commit below: commit d7d77f3825122bde55be9e06f6c4851028b99795 Author: Peter Eisentraut <pete...@gmx.net> Date: Thu Mar 16 18:54:28 2017 -0400 psql: Add completion for \help DROP|ALTER PFA a patch to fix this. Thanks, Rushabh Lathia www.EnterpriseDB.com diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index f749406..02a1571 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -999,8 +999,8 @@ static const pgsql_thing_t words_after_create[] = { {"CONFIGURATION", Query_for_list_of_ts_configurations, NULL, THING_NO_SHOW}, {"CONVERSION", "SELECT pg_catalog.quote_ident(conname) FROM pg_catalog.pg_conversion WHERE substring(pg_catalog.quote_ident(conname),1,%d)='%s'"}, {"DATABASE", Query_for_list_of_databases}, - {"DICTIONARY", Query_for_list_of_ts_dictionaries, NULL, THING_NO_SHOW}, {"DEFAULT PRIVILEGES", NULL, NULL, THING_NO_CREATE | THING_NO_DROP}, + {"DICTIONARY", Query_for_list_of_ts_dictionaries, NULL, THING_NO_SHOW}, {"DOMAIN", NULL, _for_list_of_domains}, {"EVENT TRIGGER", NULL, NULL}, {"EXTENSION", Query_for_list_of_extensions}, -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
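Misordered entries like this are easy to catch mechanically, since the array is conventionally kept in ascending strcmp() order. A hypothetical sketch of such a check (the `is_sorted` helper is illustrative, not part of psql):

```c
#include <assert.h>
#include <string.h>

/*
 * Check that an array of keywords is in strictly ascending strcmp() order,
 * as the words_after_create entries are conventionally kept.
 */
static int
is_sorted(const char *const words[], int n)
{
    int i;

    for (i = 1; i < n; i++)
        if (strcmp(words[i - 1], words[i]) >= 0)
            return 0;           /* adjacent pair out of order */
    return 1;
}
```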
Re: [HACKERS] create_unique_path and GEQO
On Wed, Mar 22, 2017 at 3:43 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > Hi, > In create_unique_path() there's comment > /* > * We must ensure path struct and subsidiary data are allocated in main > * planning context; otherwise GEQO memory management causes trouble. > */ > oldcontext = MemoryContextSwitchTo(root->planner_cxt); > > pathnode = makeNode(UniquePath); > > This means that when GEQO resets the memory context, the RelOptInfo > for which this path is created and may be set to cheapest_unique_path > goes away, the unique path lingers on in the planner context. > Shouldn't we instead allocate the path in the same context as the > RelOptInfo similar to mark_dummy_rel()? > > Do you have a test case that can reproduce the issue you explained above? > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Rushabh Lathia
[HACKERS] dblink module printing unnamed connection (with commit acaf7ccb94)
Hi All, The dblink contrib module has started showing the "unnamed" connection name. Consider the test below: postgres=# CREATE ROLE alice NOSUPERUSER NOCREATEDB NOCREATEROLE LOGIN PASSWORD 'wonderland'; CREATE ROLE postgres=# GRANT EXECUTE ON FUNCTION dblink_connect_u(text,text) to alice; GRANT postgres=# \c postgres alice You are now connected to database "postgres" as user "alice". postgres=> SELECT dblink_connect_u('sm_conn_30','dbname=postgres user=alice password=wonderland'); dblink_connect_u -- OK (1 row) postgres=> SELECT * FROM dblink_send_query('sm_conn_30','SELECT pg_stat_reset()') as dgr; dgr - 1 (1 row) postgres=> SELECT * FROM dblink_get_result('sm_conn_30') AS dgr(curr_user boolean); ERROR: permission denied for function pg_stat_reset CONTEXT: Error occurred on dblink connection named "*unnamed*": could not execute query. This started with the commit below: commit acaf7ccb94a3916ea83712671a3563f0eb595558 Author: Peter Eisentraut <pete...@gmx.net> Date: Sun Dec 25 12:00:00 2016 -0500 dblink: Replace some macros by static functions Also remove some unused code and the no longer useful dblink.h file. Reviewed-by: Tsunakawa, Takayuki <tsunakawa.ta...@jp.fujitsu.com> Before this commit, the macro used to assign the conname local variable. I quickly worked on a fix; the attached patch assigns the conname local variable so that the error context shows the correct connection name. 
Regards, Rushabh Lathia www.EnterpriseDB.com diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index c1e9089..e42cec4 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -679,7 +679,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async) PG_TRY(); { char *sql = NULL; - char *conname = NULL; + char *conname = text_to_cstring(PG_GETARG_TEXT_PP(0)); bool fail = true; /* default to backward compatible */ if (!is_async) @@ -687,7 +687,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async) if (PG_NARGS() == 3) { /* text,text,bool */ -dblink_get_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)), , , ); +dblink_get_conn(conname, , , ); sql = text_to_cstring(PG_GETARG_TEXT_PP(1)); fail = PG_GETARG_BOOL(2); } @@ -702,7 +702,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async) } else { - dblink_get_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)), , , ); + dblink_get_conn(conname, , , ); sql = text_to_cstring(PG_GETARG_TEXT_PP(1)); } } @@ -722,13 +722,13 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async) if (PG_NARGS() == 2) { /* text,bool */ -conn = dblink_get_named_conn(text_to_cstring(PG_GETARG_TEXT_PP(0))); +conn = dblink_get_named_conn(conname); fail = PG_GETARG_BOOL(1); } else if (PG_NARGS() == 1) { /* text */ -conn = dblink_get_named_conn(text_to_cstring(PG_GETARG_TEXT_PP(0))); +conn = dblink_get_named_conn(conname); } else /* shouldn't happen */ @@ -1390,7 +1390,8 @@ dblink_exec(PG_FUNCTION_ARGS) if (PG_NARGS() == 3) { /* must be text,text,bool */ - dblink_get_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)), , , ); + conname = text_to_cstring(PG_GETARG_TEXT_PP(0)); + dblink_get_conn(conname, , , ); sql = text_to_cstring(PG_GETARG_TEXT_PP(1)); fail = PG_GETARG_BOOL(2); } @@ -1405,7 +1406,8 @@ dblink_exec(PG_FUNCTION_ARGS) } else { -dblink_get_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)), , , ); +conname = text_to_cstring(PG_GETARG_TEXT_PP(0)); +dblink_get_conn(conname, , , ); sql = 
text_to_cstring(PG_GETARG_TEXT_PP(1)); } } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Possible regression with gather merge.
Hi, postgres=# explain analyze select * from test order by v1, v2 limit 10; QUERY PLAN Limit (cost=19576.71..19577.88 rows=10 width=16) (actual time=408.842..408.853 rows=10 loops=1) -> Gather Merge (cost=19576.71..116805.80 rows=84 width=16) (actual time=408.841..408.850 rows=10 loops=1) Workers Planned: 2 Workers Launched: 2 -> Sort (cost=18576.69..19618.36 rows=416667 width=16) (actual time=393.602..394.080 rows=911 loops=3) Sort Key: v1, v2 Sort Method: external merge Disk: 8128kB -> Parallel Seq Scan on test (cost=0.00..9572.67 rows=416667 width=16) (actual time=0.053..46.238 rows=33 loops=3) Planning time: 0.118 ms Execution time: 414.763 ms (10 rows) postgres=# set enable_gathermerge = off; SET postgres=# explain analyze select * from test order by v1, v2 limit 10; QUERY PLAN - Limit (cost=37015.64..37015.67 rows=10 width=16) (actual time=268.859..268.861 rows=10 loops=1) -> Sort (cost=37015.64..39515.64 rows=100 width=16) (actual time=268.858..268.859 rows=10 loops=1) Sort Key: v1, v2 Sort Method: top-N heapsort Memory: 25kB -> Seq Scan on test (cost=0.00..15406.00 rows=100 width=16) (actual time=0.085..107.841 rows=100 loops=1) Planning time: 0.163 ms Execution time: 268.897 ms (7 rows) Looking at the EXPLAIN ANALYZE output of both plans, it's clear that Gather Merge takes longer because it uses an external merge sort on disk, whereas the other plan performs a top-N heapsort. The normal sort path can treat the LIMIT as a bound, but for Gather Merge that's not possible. In the code, Gather Merge considers the limit for the sort in create_ordered_paths(), which is wrong: there, GM should not consider the limit while costing the sort node. The attached patch fixes the bug. On Wed, Mar 22, 2017 at 12:05 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > Thanks for reporting, I am looking into this. 
> > On Wed, Mar 22, 2017 at 11:51 AM, Mithun Cy <mithun...@enterprisedb.com> > wrote: > >> Adding more rows to table make gather merge execution time very slow >> when compared to non-parallel plan we get after disabling gather >> merge. >> >> create table test as (select id, (random()*1)::int as v1, random() as >> v2 from generate_series(1,1) id); >> >> postgres=# set max_parallel_workers_per_gather = default; >> SET >> postgres=# explain analyze select * from test order by v1, v2 limit 10; >> >> QUERY PLAN >> >> >> >> Limit (cost=1858610.53..1858611.70 rows=10 width=16) (actual >> time=31103.880..31103.885 rows=10 loops=1) >>-> Gather Merge (cost=1858610.53..11581520.05 rows=8406 >> width=16) (actual time=31103.878..31103.882 rows=10 loops=1) >> Workers Planned: 2 >> Workers Launched: 2 >> -> Sort (cost=1857610.50..1961777.26 rows=41666703 >> width=16) (actual time=30560.865..30561.046 rows=911 loops=3) >>Sort Key: v1, v2 >>Sort Method: external merge Disk: 841584kB >>-> Parallel Seq Scan on test (cost=0.00..957208.03 >> rows=41666703 width=16) (actual time=0.050..2330.275 rows= >> loops=3) >> Planning time: 0.292 ms >> Execution time: 31502.896 ms >> (10 rows) >> >> postgres=# set max_parallel_workers_per_gather = 0; >> SET >> postgres=# explain analyze select * from test order by v1, v2 limit 10; >> QUERY >> PLAN >> >> >> Limit (cost=3701507.83..3701507.85 rows=10 width=16) (actual >> time=13231.264..13231.266 rows=10 loops=1) >>-> Sort (cost=3701507.83..3951508.05 rows=10088 width=16) >> (actual time=13231.261..13231.262 rows=10 loops=1) >> Sort Key: v1, v2 >> Sort Method: top-N heapsort Memory: 25kB >> -> Seq Scan on test (cost=0.00..1540541.88 rows=10088 >> width=16) (actual time=0.045..6759.363 rows=1 loops=1) >> Pl
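The memory difference seen above (25kB top-N heapsort vs. an external merge on disk) comes from bounded sorting: when a LIMIT bounds the sort, only the best N rows seen so far need to be retained. A rough stand-alone illustration of the idea (not tuplesort.c's actual implementation, which uses a heap rather than insertion into a small array):

```c
#include <assert.h>

/*
 * Keep only the n smallest values seen so far, in ascending order, in a
 * bounded buffer.  A value worse than the current worst is discarded
 * immediately, which is why a bounded (top-N) sort needs so little memory.
 */
static void
topn_add(int *top, int *count, int n, int value)
{
    int pos;

    if (*count == n && value >= top[n - 1])
        return;                 /* worse than the current worst: discard */
    if (*count < n)
        (*count)++;
    /* Shift larger entries right and insert; the old worst falls off. */
    for (pos = *count - 1; pos > 0 && top[pos - 1] > value; pos--)
        top[pos] = top[pos - 1];
    top[pos] = value;
}
```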
Re: [HACKERS] Possible regression with gather merge.
64 rows=100 width=16) (actual > > time=211.581..211.582 rows=10 loops=1) > > Sort Key: v1, v2 > > Sort Method: top-N heapsort Memory: 25kB > > -> Seq Scan on test (cost=0.00..15406.00 rows=100 > > width=16) (actual time=0.085..107.522 rows=100 loops=1) > > Planning time: 0.093 ms > > Execution time: 211.608 ms > > (7 rows) > > > > > > > > -- > > Thanks and Regards > > Mithun C Y > > EnterpriseDB: http://www.enterprisedb.com > > > > -- > Thanks and Regards > Mithun C Y > EnterpriseDB: http://www.enterprisedb.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Rushabh Lathia
Re: [HACKERS] Adding support for Default partition in partitioning
first >> partition. The only thing a default range partition would do is let >> you do is have a single partition cover all of the ranges that would >> otherwise be unassigned, which might not even be something we want. >> >> I don't know how complete the patch is, but the specification seems >> clear enough. If you have partitions for 1, 3, and 5, you get >> partition constraints of (a = 1), (a = 3), and (a = 5). If you add a >> default partition, you get a constraint of (a != 1 and a != 3 and a != >> 5). If you then add a partition for 7, the default partition's >> constraint becomes (a != 1 and a != 3 and a != 5 and a != 7). The >> partition must be revalidated at that point for conflicting rows, >> which we can either try to move to the new partition, or just error >> out if there are any, depending on what we decide we want to do. I >> don't think any of that requires any special handling for either >> pg_dump or pg_upgrade; it all just falls out of getting the >> partitioning constraints correct and consistently enforcing them, just >> as for any other partition. >> >> -- >> Robert Haas >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company >> > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia
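The constraint-maintenance rule described in the quoted text can be modeled directly: a row belongs in the default partition exactly when it matches none of the explicitly listed values, so each newly added partition tightens the default partition's constraint. A toy sketch of that rule (illustrative only, not the actual partitioning code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Implied default-partition constraint for a list-partitioned table with
 * partitions for listed[0..nlisted-1]: (a != v0 AND a != v1 AND ...),
 * i.e. "matches no explicit partition".
 */
static bool
default_partition_accepts(int a, const int listed[], int nlisted)
{
    int i;

    for (i = 0; i < nlisted; i++)
        if (a == listed[i])
            return false;
    return true;
}
```

Adding a partition for a new value (say 7) shrinks the set the default accepts, which is why the default partition must be revalidated for conflicting rows at that point.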
Re: [HACKERS] wait events for disk I/O
On Sat, Mar 18, 2017 at 5:15 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Fri, Mar 17, 2017 at 10:01 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > I tried to cover all the suggestion in the attached latest patch. > > Committed. I reworded the documentation entries, renamed a few of the > wait events to make things more consistent, put all three lists in > rigorous alphabetical order, and I fixed a couple of places where an > error return from a system call could lead to returning without > clearing the wait event. > > Thanks Robert. > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- Rushabh Lathia
Re: [HACKERS] wait events for disk I/O
Thanks, Robert, for the review. On Thu, Mar 16, 2017 at 8:05 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Thu, Mar 16, 2017 at 8:28 AM, Rahila Syed <rahilasye...@gmail.com> > wrote: > > Thank you for the updated patch. > > > > I have applied and tested it on latest sources and the patch looks good > to > > me. > > The documentation puts the new wait events in a pretty random order. > I think they should be alphabetized, like we do with the IPC events. > Done. > I also suggest we change the naming scheme so that the kind of thing > being operated on is first and this is followed by the operation name. > This will let us keep related entries next to each other after > alphabetizing. So with that principle in mind: > > Yes, the naming scheme above is clearer than the one I chose. - instead of ReadDataBlock etc. I propose DataFileRead, DataFileWrite, > DataFileSync, DataFileExtend, DataFileFlush, DataFilePrefetch, > DataFileTruncate. using file instead of block avoids singular/plural > confusion. > - instead of RelationSync and RelationImmedSync I proposed > DataFileSync and DataFileImmediateSync; these are md.c operations like > the previous set, so why name it differently? > Yes, you are right, DataFileSync and DataFileImmediateSync make more sense. 
> - instead of WriteRewriteDataBlock and SyncRewriteDataBlock and > TruncateLogicalMappingRewrite, which aren't consistent with each other > even though they are related, I propose LogicalRewriteWrite, > LogicalRewriteSync, and LogicalRewriteTruncate, which are also closer > to the names of the functions that contain those wait points > - for ReadBuffile and WriteBuffile seem OK, I propose BufFileRead and > BufFileWrite, again reversing the order and also tweaking the > capitalization > - in keeping with our new policy of referring to xlog as wal in user > visible interfaces, I propose WALRead, WALCopyRead, WALWrite, > WALInitWrite, WALCopyWrite, WALBootstrapWrite, WALInitSync, > WALBootstrapSync, WALSyncMethodAssign > - for control file ops, ControlFileRead, ControlFileWrite, > ControlFileWriteUpdate, ControlFileSync, ControlFileSyncUpdate > - ReadApplyLogicalMapping and friends seem to have to do with the > reorderbuffer code, so maybe ReorderBufferRead etc. > - there seems to be some discrepancy between the documentation and > pgstat_get_wait_io for the snapbuild stuff. maybe SnapBuildWrite, > SnapBuildSync, SnapBuildRead. > - SLRURead, SLRUWrite, etc. > - TimelineHistoryRead, etc. > - the syslogger changes should be dropped, since the syslogger is not > and should not be connected to shared memory > Ok, removed. > - the replslot terminology seems like a case of odd capitalization and > overeager abbreviation. why not ReplicationSlotRead, > ReplicationSlotWrite, etc? similarly RelationMapRead, > RelationMapWrite, etc? > - CopyFileRead, CopyFileWrite > - LockFileCreateRead, etc. > - AddToDataDirLockFileRead is a little long and incomprehensible; > maybe LockFileUpdateRead etc. > How about LockFileAddToDataDirRead? Even though it's a little long, it gives a clear understanding of what's going on. > - DSMWriteZeroBytes, maybe? > > DSMFillZeroWrite? Basically, I want to keep the file IO operation at the end of the event name. 
> Of course the constants should be renamed to match. > I tried to cover all the suggestion in the attached latest patch. Thanks, Rushabh Lathia www.EnterpriseDB.com diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 9eaf43a..acf310e 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -716,6 +716,12 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser point. + + + IO: The server process is waiting for a IO to complete. + wait_event will identify the specific wait point. + + @@ -1272,6 +1278,271 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser RecoveryApplyDelay Waiting to apply WAL at recovery because it is delayed. + + IO + BufFileRead + Waiting during buffer read operation. + + + BufFileWrite + Waiting during buffer write operation. + + + CheckpointLogicalRewriteSync + Waiting to sync logical mapping during a checkpoint for logical rewrite mappings. + + + ControlFileRead + Waiting to read the control file. + + + ControlFileSync + Waiting to sync the control file. + + + ControlFileSyncUpdate + Waiting to sync the control file during update control file. +
Re: [HACKERS] wait events for disk I/O
Thanks, Rahila, for reviewing this patch. On Tue, Mar 14, 2017 at 8:13 PM, Rahila Syed <rahilasye...@gmail.com> wrote: > Hello, > > I applied and tested this patch on latest sources and it works fine. > > Following are some comments, > > >+ /* Wait event for SNRU */ > >+ WAIT_EVENT_READ_SLRU_PAGE, > Typo in the comment. > > Fixed. > >FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, > WAIT_EVENT_FLUSH_DATA_BLOCK); > This call is inside mdwriteback() which can flush more than one block so > should WAIT_EVENT _FLUSH_DATA_BLOCK > be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS? > > Changed to WAIT_EVENT_FLUSH_DATA_BLOCKS. > Should calls to write() in following functions be tracked too? > qtext_store() - This is related to pg_stat_statements > > I am not quite sure about this, as it is for pg_stat_statements. Apart from the place you found, there are many other fwrite() calls in pg_stat_statements; I intentionally haven't added an event here because these are very small stat writes, and it doesn't look like we should add events for those calls. > dsm_impl_mmap() - This is in relation to creating dsm segments. > > Added a new event here. That particular write call is zero-filling the DSM file. > write_auto_conf_file()- This is called when updated configuration > parameters are > written to a temp file. > > write_auto_conf_file() gets called during ALTER SYSTEM. Here the write happens only when someone explicitly runs ALTER SYSTEM. That is an administrator action, so it doesn't seem necessary to add a separate wait event for it. PFA the latest patch with the other fixes. 
> > On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > >> >> >> On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmh...@gmail.com> >> wrote: >> >>> On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapil...@gmail.com> >>> wrote: >>> > On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmh...@gmail.com> >>> wrote: >>> >> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapil...@gmail.com> >>> wrote: >>> >>> Sure, if you think both Writes and Reads at OS level can have some >>> >>> chance of blocking in obscure cases, then we should add a wait event >>> >>> for them. >>> >> >>> >> I think writes have a chance of blocking in cases even in cases that >>> >> are not very obscure at all. >>> > >>> > Point taken for writes, but I think in general we should have some >>> > criteria based on which we can decide whether to have a wait event for >>> > a particular call. It should not happen that we have tons of wait >>> > events and out of which, only a few are helpful in most of the cases >>> > in real-world scenarios. >>> >>> Well, the problem is that if you pick and choose which wait events to >>> add based on what you think will be common, you're actually kind of >>> hosing yourself. Because now when something uncommon happens, suddenly >>> you don't get any wait event data and you can't tell what's happening. >>> I think the number of new wait events added by Rushabh's patch is >>> wholly reasonable. Yeah, some of those are going to be a lot more >>> common than others, but so what? We add wait events so that we can >>> find out what's going on. I don't want to sometimes know when a >>> backend is blocked on an I/O. I want to ALWAYS know. >>> >>> >> Yes, I agree with Robert. Knowing what we want and what we don't >> want is difficult to judge. Something which we might think its not useful >> information, and later of end up into situation where we re-think about >> adding those missing stuff is not good. 
Having more information about >> the system, specially for monitoring purpose is always good. >> >> I am attaching another version of the patch, as I found stupid mistake >> in the earlier version of patch, where I missed to initialize initial >> value to >> WaitEventIO enum. Also earlier version was not getting cleanly apply on >> the current version of sources. >> >> >> >> -- >> Rushabh Lathia >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company >> >> >> >> -- >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-hackers >> >> > Rega
Re: [HACKERS] Gather Merge
On Mon, Mar 13, 2017 at 10:56 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Fri, Mar 10, 2017 at 7:59 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Error coming from create_gather_merge_plan() from below condition: > > > > if (memcmp(sortColIdx, gm_plan->sortColIdx, > >numsortkeys * sizeof(AttrNumber)) != 0) > > elog(ERROR, "GatherMerge child's targetlist doesn't match > > GatherMerge"); > > > > Above condition checks the sort column numbers explicitly, to ensure the > > tlists > > really do match up. This been copied from the create_merge_append_plan(). > > Now > > this make sense as for MergeAppend as its not projection capable plan > (see > > is_projection_capable_plan()). But like Gather, GatherMerge is the > > projection > > capable and its target list can be different from the subplan, so I don't > > think this > > condition make sense for the GatherMerge. > > > > Here is the some one the debugging info, through which I was able to > reach > > to this conclusion: > > > > - targetlist for the GatherMerge and subpath is same during > > create_gather_merge_path(). > > > > - targetlist for the GatherMerge is getting changed into > > create_gather_merge_plan(). 
> > > > postgres=# explain (analyze, verbose) select t2.j from t1 JOIN t2 ON > > t1.k=t2.k where t1.i=1 order by t1.j desc; > > NOTICE: path parthtarget: {PATHTARGET :exprs ({VAR :varno 2 :varattno 2 > > :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 > :varoattno > > 2 :location 34} {VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 > > :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 2 :location 90}) > > :sortgrouprefs 0 1 :cost.startup 0.00 :cost.per_tuple 0.00 :width 8} > > > > NOTICE: subpath parthtarget: {PATHTARGET :exprs ({VAR :varno 1 > :varattno 2 > > :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 > :varoattno > > 2 :location 90} {VAR :varno 2 :varattno 2 :vartype 23 :vartypmod -1 > > :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 2 :location 34}) > > :cost.startup 0.00 :cost.per_tuple 0.00 :width 8} > > > > - Attached memory watch point and found that target list for GatherMerge > is > > getting > > changed into groupping_planner() -> apply_projection_to_path(). > > > > PFA patch to fix this issue. > > I don't think this fix is correct, partly on theoretical grounds and > partly because I managed to make it crash. The problem is that > prepare_sort_for_pathkeys() actually alters the output tlist of Gather > Merge, which is inconsistent with the idea that Gather Merge can do > projection. It's going to produce whatever > prepare_sort_for_pathkeys() says it's going to produce, which may or > may not be what was there before. Using Kuntal's dump file and your > patch: > > set min_parallel_table_scan_size = 0; > set parallel_setup_cost = 0; > set parallel_tuple_cost = 0; > set enable_sort = false; > set enable_bitmapscan = false; > alter table t1 alter column j type text; > select t2.i from t1 join t2 on t1.k=t2.k where t1.i=1 order by t1.j desc; > > Crash. 
Abbreviated stack trace: > > #0 pg_detoast_datum_packed (datum=0xbc) at fmgr.c:2176 > #1 0x00010160e707 in varstrfastcmp_locale (x=188, y=819, > ssup=0x7fe1ea06a568) at varlena.c:1997 > #2 0x0001013efc73 in ApplySortComparator [inlined] () at > /Users/rhaas/pgsql/src/include/utils/sortsupport.h:225 > #3 0x0001013efc73 in heap_compare_slots (a= unavailable, due to optimizations>, b= due to optimizations>, arg=0x7fe1ea04e590) at sortsupport.h:681 > #4 0x0001014057b2 in sift_down (heap=0x7fe1ea079458, > node_off=) at > binaryheap.c:274 > #5 0x00010140573a in binaryheap_build (heap=0x7fe1ea079458) at > binaryheap.c:131 > #6 0x0001013ef771 in gather_merge_getnext [inlined] () at > /Users/rhaas/pgsql/src/backend/executor/nodeGatherMerge.c:421 > #7 0x0001013ef771 in ExecGatherMerge (node=0x7fe1ea04e590) at > nodeGatherMerge.c:240 > > Obviously, this is happening because we're trying to apply a > comparator for text to a value of type integer. I propose the > attached, slightly more involved fix, which rips out the first call to > prepare_sort_from_pathkeys() altogether. > > Thanks, Robert, for the patch and the explanation. I studied the patch and it looks right to me. I performed manual testing: I ran the scripts I created during the gather merge patch work and also ran the TPC-H queries to make sure everything works correctly. I haven't found any regression with those changes. > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
The error comes from the following condition in create_gather_merge_plan():

    if (memcmp(sortColIdx, gm_plan->sortColIdx,
               numsortkeys * sizeof(AttrNumber)) != 0)
        elog(ERROR, "GatherMerge child's targetlist doesn't match GatherMerge");

This condition checks the sort column numbers explicitly to ensure the tlists really do match up; it was copied from create_merge_append_plan(). That makes sense for MergeAppend, which is not a projection-capable plan (see is_projection_capable_plan()). But like Gather, GatherMerge is projection-capable and its target list can differ from the subplan's, so I don't think this condition makes sense for GatherMerge.

Here is some of the debugging info through which I reached this conclusion:

- The targetlists of the GatherMerge and its subpath are the same during create_gather_merge_path().
- The targetlist of the GatherMerge is changed in create_gather_merge_plan().

postgres=# explain (analyze, verbose) select t2.j from t1 JOIN t2 ON t1.k=t2.k where t1.i=1 order by t1.j desc;

NOTICE: path parthtarget: {PATHTARGET :exprs ({VAR :varno 2 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 2 :location 34} {VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 2 :location 90}) :sortgrouprefs 0 1 :cost.startup 0.00 :cost.per_tuple 0.00 :width 8}

NOTICE: subpath parthtarget: {PATHTARGET :exprs ({VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 2 :location 90} {VAR :varno 2 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 2 :location 34}) :cost.startup 0.00 :cost.per_tuple 0.00 :width 8}

- Attached a memory watchpoint and found that the target list for GatherMerge is changed in grouping_planner() -> apply_projection_to_path().

PFA a patch to fix this issue.
On Fri, Mar 10, 2017 at 4:12 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Fri, Mar 10, 2017 at 4:09 PM, Kuntal Ghosh <kuntalghosh.2...@gmail.com> > wrote: > >> On Fri, Mar 10, 2017 at 3:04 PM, Rushabh Lathia >> <rushabh.lat...@gmail.com> wrote: >> > >> > >> > On Fri, Mar 10, 2017 at 2:42 PM, Andreas Joseph Krogh < >> andr...@visena.com> >> > wrote: >> >> >> >> På fredag 10. mars 2017 kl. 10:09:22, skrev Rushabh Lathia >> >> <rushabh.lat...@gmail.com>: >> >> >> >> >> >> >> >> On Fri, Mar 10, 2017 at 2:33 PM, Andreas Joseph Krogh < >> andr...@visena.com> >> >> wrote: >> >>> >> >>> [...] >> >>> The execution-plan seems (unsurprisingly) to depend on >> data-distribution, >> >>> so is there a way I can force a GatherMerge? >> >> >> >> >> >> Not directly. GatherMerge cost is mainly depend on parallel_setup_cost, >> >> parallel_tuple_cost and cpu_operator_cost. May be you can force this >> >> by setting this cost low enough. Or another way to force is by disable >> the >> >> other plans. >> >> >> >> What plan you are getting now? You not seeing the below error ? >> >> >> >> ERROR: GatherMerge child's targetlist doesn't match GatherMerge >> >> >> >> >> >> I'm seeing the same error, it's just that for reproducing it I'd rather >> >> not copy my whole dataset. >> > >> > >> > Can you share me a schema information, I will try to reproduce at my >> side? >> I'm able to reproduce the error. I've attached the dump file and a >> script to reproduce it. >> >> The following query executes successfully. 
>> postgres=# explain select t1.* from t1 JOIN t2 ON t1.k=t2.k where >> t1.i=1 order by t1.j desc; >> QUERY PLAN >> >> --- >> Gather Merge (cost=0.58..243.02 rows=943 width=12) >>Workers Planned: 1 >>-> Nested Loop (cost=0.57..235.94 rows=555 width=12) >> -> Parallel Index Scan Backward using idx_t1_i_j on t1 >> (cost=0.29..14.33 rows=603 width=12) >>Index Cond: (i = 1) >> -> Index Only Scan using idx_t2_k on t2 (cost=0.29..0.34 >> rows=3 width=4) >>Index Cond: (k = t1.k) >> (7 rows) >> >> Whereas, If columns from t2 is projected, it throws the same error. >> postgres=# explain select t2.* from t1 JOIN t2 ON t1.k=t2.k where >> t1.i=1 order by t1.j desc; >> ERROR: GatherMerge child's targetlist doesn't match GatherMerge >> >> > Thanks Kuntal. > > I am able to reproduce the issue with the shared script. I will look into > this now. > > >> >> >> -- >> Thanks & Regards, >> Kuntal Ghosh >> EnterpriseDB: http://www.enterprisedb.com >> > > > > -- > Rushabh Lathia > -- Rushabh Lathia gm_plan_fix.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Gather Merge
On Fri, Mar 10, 2017 at 4:09 PM, Kuntal Ghosh <kuntalghosh.2...@gmail.com> wrote: > On Fri, Mar 10, 2017 at 3:04 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > > > > > On Fri, Mar 10, 2017 at 2:42 PM, Andreas Joseph Krogh < > andr...@visena.com> > > wrote: > >> > >> På fredag 10. mars 2017 kl. 10:09:22, skrev Rushabh Lathia > >> <rushabh.lat...@gmail.com>: > >> > >> > >> > >> On Fri, Mar 10, 2017 at 2:33 PM, Andreas Joseph Krogh < > andr...@visena.com> > >> wrote: > >>> > >>> [...] > >>> The execution-plan seems (unsurprisingly) to depend on > data-distribution, > >>> so is there a way I can force a GatherMerge? > >> > >> > >> Not directly. GatherMerge cost is mainly depend on parallel_setup_cost, > >> parallel_tuple_cost and cpu_operator_cost. May be you can force this > >> by setting this cost low enough. Or another way to force is by disable > the > >> other plans. > >> > >> What plan you are getting now? You not seeing the below error ? > >> > >> ERROR: GatherMerge child's targetlist doesn't match GatherMerge > >> > >> > >> I'm seeing the same error, it's just that for reproducing it I'd rather > >> not copy my whole dataset. > > > > > > Can you share me a schema information, I will try to reproduce at my > side? > I'm able to reproduce the error. I've attached the dump file and a > script to reproduce it. > > The following query executes successfully. > postgres=# explain select t1.* from t1 JOIN t2 ON t1.k=t2.k where > t1.i=1 order by t1.j desc; > QUERY PLAN > > --- > Gather Merge (cost=0.58..243.02 rows=943 width=12) >Workers Planned: 1 >-> Nested Loop (cost=0.57..235.94 rows=555 width=12) > -> Parallel Index Scan Backward using idx_t1_i_j on t1 > (cost=0.29..14.33 rows=603 width=12) >Index Cond: (i = 1) > -> Index Only Scan using idx_t2_k on t2 (cost=0.29..0.34 > rows=3 width=4) >Index Cond: (k = t1.k) > (7 rows) > > Whereas, If columns from t2 is projected, it throws the same error. 
> postgres=# explain select t2.* from t1 JOIN t2 ON t1.k=t2.k where > t1.i=1 order by t1.j desc; > ERROR: GatherMerge child's targetlist doesn't match GatherMerge > > Thanks Kuntal. I am able to reproduce the issue with the shared script. I will look into this now. > > > -- > Thanks & Regards, > Kuntal Ghosh > EnterpriseDB: http://www.enterprisedb.com > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
On Fri, Mar 10, 2017 at 2:42 PM, Andreas Joseph Krogh <andr...@visena.com> wrote:

> På fredag 10. mars 2017 kl. 10:09:22, skrev Rushabh Lathia
> <rushabh.lat...@gmail.com>:
>
> On Fri, Mar 10, 2017 at 2:33 PM, Andreas Joseph Krogh <andr...@visena.com>
> wrote:
>>
>> [...]
>> The execution-plan seems (unsurprisingly) to depend on data-distribution,
>> so is there a way I can force a GatherMerge?
>>
>
> Not directly. GatherMerge cost is mainly depend on parallel_setup_cost,
> parallel_tuple_cost and cpu_operator_cost. May be you can force this
> by setting this cost low enough. Or another way to force is by disable the
> other plans.
>
> What plan you are getting now? You not seeing the below error ?
>
> ERROR: GatherMerge child's targetlist doesn't match GatherMerge
>
> I'm seeing the same error, it's just that for reproducing it I'd rather
> not copy my whole dataset.

Can you share the schema information? I will try to reproduce it on my side.

> --
> *Andreas Joseph Krogh*

--
Rushabh Lathia
Re: [HACKERS] Gather Merge
On Fri, Mar 10, 2017 at 2:33 PM, Andreas Joseph Krogh <andr...@visena.com> wrote: > På fredag 10. mars 2017 kl. 09:53:47, skrev Rushabh Lathia < > rushabh.lat...@gmail.com>: > > > > On Fri, Mar 10, 2017 at 1:44 PM, Andreas Joseph Krogh <andr...@visena.com> > wrote: >> >> På torsdag 09. mars 2017 kl. 18:09:45, skrev Robert Haas < >> robertmh...@gmail.com>: >> >> On Thu, Mar 9, 2017 at 11:25 AM, Rushabh Lathia >> <rushabh.lat...@gmail.com> wrote: >> > I don't see this failure with the patch. Even I forced the gather merge >> > in the above query and that just working fine. >> > >> > Attaching patch, with the discussed changes. >> >> Committed. >> >> >> >> I'm still getting (as of 9c2635e26f6f4e34b3b606c0fc79d0e111953a74): >> ERROR: GatherMerge child's targetlist doesn't match GatherMerge >> >> >> from this query: >> >> >> EXPLAIN ANALYSE SELECTem.entity_idFROM origo_email_delivery del >> JOIN origo_email_message em ON (del.message_id = em.entity_id)WHERE 1 = >> 1 AND del.owner_id = 3 AND ( >> del.from_entity_id = 279519 OR >> del.from_entity_id = 3 AND em.entity_id IN ( >> SELECT ea_owner.message_id >> FROM origo_email_address_owner ea_owner >> WHERE ea_owner.recipient_id = 279519 ) >> ) >> ORDER BY del.received_timestamp DESCLIMIT 101 OFFSET 0; >> >> >> Is this known or shall I provide more info/schema etc? >> > > Please provide the reproducible test if possible. > > > The execution-plan seems (unsurprisingly) to depend on data-distribution, > so is there a way I can force a GatherMerge? > Not directly. GatherMerge cost is mainly depend on parallel_setup_cost, parallel_tuple_cost and cpu_operator_cost. May be you can force this by setting this cost low enough. Or another way to force is by disable the other plans. What plan you are getting now? You not seeing the below error ? ERROR: GatherMerge child's targetlist doesn't match GatherMerge > -- > *Andreas Joseph Krogh* > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
On Fri, Mar 10, 2017 at 1:44 PM, Andreas Joseph Krogh <andr...@visena.com> wrote: > På torsdag 09. mars 2017 kl. 18:09:45, skrev Robert Haas < > robertmh...@gmail.com>: > > On Thu, Mar 9, 2017 at 11:25 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > I don't see this failure with the patch. Even I forced the gather merge > > in the above query and that just working fine. > > > > Attaching patch, with the discussed changes. > > Committed. > > > > I'm still getting (as of 9c2635e26f6f4e34b3b606c0fc79d0e111953a74): > ERROR: GatherMerge child's targetlist doesn't match GatherMerge > > > from this query: > > > EXPLAIN ANALYSE SELECTem.entity_idFROM origo_email_delivery del > JOIN origo_email_message em ON (del.message_id = em.entity_id)WHERE 1 = 1 > AND del.owner_id = 3 AND ( > del.from_entity_id = 279519 OR del.from_entity_id > = 3 AND em.entity_id IN ( > SELECT ea_owner.message_id > FROM origo_email_address_owner ea_owner > WHERE ea_owner.recipient_id = 279519 ) > ) > ORDER BY del.received_timestamp DESCLIMIT 101 OFFSET 0; > > > Is this known or shall I provide more info/schema etc? > Please provide the reproducible test if possible. 
> > If I select del.entity_id, it works: > > > EXPLAIN ANALYSE SELECTdel.entity_id > FROM origo_email_delivery del > JOIN origo_email_message em ON (del.message_id = > em.entity_id) > WHERE 1 = 1 AND del.owner_id = 3 > AND ( > del.from_entity_id = 279519 > OR del.from_entity_id = 3 AND em.entity_id IN ( > SELECT > ea_owner.message_id > FROM origo_email_address_owner ea_owner > WHERE ea_owner.recipient_id = 279519 > ) > ) > > ORDER BY del.received_timestamp DESCLIMIT 101 > OFFSET 0; > > > Plan is: > │ Limit (cost=1259.72..15798.21 rows=101 width=16) (actual > time=152.946..153.269 rows=34 loops=1) > > │ > > │ -> Gather Merge (cost=1259.72..3139414.43 rows=21801 width=16) > (actual time=152.945..153.264 rows=34 loops=1) > >│ > │ Workers Planned: 4 > > > │ > │ Workers Launched: 4 > > >│ > │ -> Nested Loop (cost=259.66..3135817.66 rows=5450 width=16) > (actual time=95.295..137.549 rows=7 loops=5) > │ > │ -> Parallel Index Scan Backward using > origo_email_received_idx on origo_email_delivery del (cost=0.42..312163.56 > rows=10883 width=32) (actual time=0.175..121.434 rows=6540 loops=5) > │ > │ Filter: ((owner_id = 3) AND ((from_entity_id = > 279519) OR (from_entity_id = 3))) > > │ > │ Rows Removed by Filter: 170355 > > > │ > │ -> Index Only Scan using origo_email_message_pkey on > origo_email_message em (cost=259.24..259.45 rows=1 width=8) (actual > time=0.002..0.002 rows=0 loops=32699) │ > │ Index Cond: (entity_id = del.message_id) > > > │ > │ Filter: ((del.from_entity_id = 279519) OR > ((del.from_entity_id = 3) AND (hashed SubPlan 1))) > > │ > │ Rows Removed by Filter: 1 > > >│ > │ Heap Fetches: 0 > > >│ > │ SubPlan 1 > > > │ > │ -> Index Scan using > origo_email_address_owner_recipient_id_idx > on origo_email_address_owner ea_owner (cost=0.43..258.64 rows=69 width=8) > (actual time=0.032..0.294 rows=175 loops=5) │ > │ Index Cond: (recipient_id = 279519) > > > │ > │ Planning time: 1.372 ms > > >│ > │ Execution time: 170.859 ms > > > │ > > > > -- > *Andreas Joseph 
Krogh* > > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
On Thu, Mar 9, 2017 at 9:42 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Thu, Mar 9, 2017 at 8:21 AM, Rushabh Lathia <rushabh.lat...@gmail.com>
> wrote:
> > Thanks Robert for committing this.
> >
> > My colleague Neha Sharma found one regression with the patch. I was about
> > to send this mail and noticed that you committed the patch.
>
> Oops. Bad timing.
>
> > postgres=# explain select aid from pgbench_accounts where aid % 25 = 0
> > group by aid;
> > ERROR: ORDER/GROUP BY expression not found in targetlist
>
> I think your fix for this looks right, although I would write it this way:
>
> -gm_plan->plan.targetlist = subplan->targetlist;
> +gm_plan->plan.targetlist = build_path_tlist(root, &best_path->path);
>
> The second part of your fix looks wrong. I think you want this:
>
> create_gather_merge_path(root,
>  grouped_rel,
>  subpath,
> - NULL,
> + partial_grouping_target,
>  root->group_pathkeys,
>  NULL,
>  &total_groups);
>
> That will match the create_gather_path case.
>

Right, I made that change, performed the test with the fix, and I don't see any regression now.

> This test case is still failing for me even with those fixes:
>
> rhaas=# select aid+1 from pgbench_accounts group by aid+1;
> ERROR: could not find pathkey item to sort
>

I don't see this failure with the patch. I even forced the gather merge in the above query and it works just fine.

Attaching the patch with the discussed changes.

Thanks,
Rushabh Lathia

fix_target_gm_v2.patch
Description: application/download
Re: [HACKERS] Gather Merge
On Thu, Mar 9, 2017 at 6:19 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Wed, Mar 8, 2017 at 11:59 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Here is another version of patch with the suggested changes. > > Committed. > > Thanks Robert for committing this. My colleague Neha Sharma found one regression with the patch. I was about to send this mail and noticed that you committed the patch. Here is the small example: Test setup: 1) ./db/bin/pgbench postgres -i -F 100 -s 20 2) update pgbench_accounts set filler = 'foo' where aid%10 = 0; 3) vacuum analyze pgbench_accounts 4) postgres=# set max_parallel_workers_per_gather = 4; SET postgres=# explain select aid from pgbench_accounts where aid % 25= 0 group by aid; ERROR: ORDER/GROUP BY expression not found in targetlist postgres=# set enable_indexonlyscan = off; SET postgres=# explain select aid from pgbench_accounts where aid % 25= 0 group by aid; QUERY PLAN Group (cost=44708.21..45936.81 rows=10001 width=4) Group Key: aid -> Gather Merge (cost=44708.21..45911.81 rows=1 width=0) Workers Planned: 4 -> Group (cost=43708.15..43720.65 rows=2500 width=4) Group Key: aid -> Sort (cost=43708.15..43714.40 rows=2500 width=4) Sort Key: aid -> Parallel Seq Scan on pgbench_accounts (cost=0.00..43567.06 rows=2500 width=4) Filter: ((aid % 25) = 0) (10 rows) - Index only scan with GM do work - but with ORDER BY clause postgres=# set enable_seqscan = off; SET postgres=# explain select aid from pgbench_accounts where aid % 25= 0 order by aid; QUERY PLAN - Gather Merge (cost=1000.49..113924.61 rows=10001 width=4) Workers Planned: 4 -> Parallel Index Scan using pgbench_accounts_pkey on pgbench_accounts (cost=0.43..111733.33 rows=2500 width=4) Filter: ((aid % 25) = 0) (4 rows) Debugging further I found that problem only when IndexOnlyScan under GM and that to only with grouping. Debugging problem I found that ressortgroupref is not getting set. 
That led me to think it might be because create_gather_merge_plan() builds the tlist with build_path_tlist(). Another problem I found is that create_grouping_paths() passes NULL for the target while calling create_gather_merge_path(). (fix_target_gm.patch)

With those changes the above test runs fine, but it breaks other things, like:

postgres=# explain select distinct(bid) from pgbench_accounts where filler like '%foo%' group by aid;
ERROR: GatherMerge child's targetlist doesn't match GatherMerge

Another thing I noticed is that if we stop adding GatherMerge during generate_gather_paths(), then all the tests work fine. (comment_gm_from_generate_gather.patch)

I will continue looking into this problem. Attaching both patches for reference.

regards,
Rushabh Lathia
www.EnterpriseDB.com

comment_gm_from_generate_gather.patch
Description: application/download

fix_target_gm.patch
Description: application/download
Re: [HACKERS] wait events for disk I/O
Thanks Rajkumar for performing tests on this patch. Yes, I also noticed similar results in my testing. Additionally sometime I also noticed ReadSLRUPage event on my system. I also run the reindex database command and I notices below IO events. SyncImmedRelation, WriteDataBlock WriteBuffile, WriteXLog, ReadDataBlock On Wed, Mar 8, 2017 at 6:41 PM, Rajkumar Raghuwanshi < rajkumar.raghuwan...@enterprisedb.com> wrote: > On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > >> I am attaching another version of the patch, as I found stupid mistake >> in the earlier version of patch, where I missed to initialize initial >> value to >> WaitEventIO enum. Also earlier version was not getting cleanly apply on >> the current version of sources. >> > > I have applied attached patch, set shared_buffers to 128kB and ran > pgbench, I am able to see below distinct IO events. > > --with ./pgbench -i -s 500 postgres > application_namewait_event_typewait_event query > pgbench IO ExtendDataBlock > copy pgbench_account > pgbench IO WriteXLog > copy pgbench_account > pgbench IO WriteDataBlock > copy pgbench_account > pgbench IO ReadDataBlock > vacuum analyze pgben > pgbench IO ReadBuffile >alter table pgbench_ > > --with ./pgbench -T 600 postgres (readwrite) > application_namewait_event_typewait_event query > pgbench IO ReadDataBlock > UPDATE pgbench_accou > pgbench IO WriteDataBlock > UPDATE pgbench_telle > IO SyncDataBlock > > pgbench IO SyncDataBlock > UPDATE pgbench_telle > IO SyncDataBlock > autovacuum: VACUUM A > pgbench IO WriteXLog > END; > pgbench IO ExtendDataBlock > copy pgbench_account > > --with ./pgbench -T 600 -S postgres (select only) > application_namewait_event_typewait_event query > pgbench IO ReadDataBlock > SELECT abalance FROM > > Attached excel with all IO event values. > > Thanks & Regards, > Rajkumar Raghuwanshi > QMG, EnterpriseDB Corporation > > > -- Rushabh Lathia www.EnterpriseDB.com
Re: [HACKERS] Gather Merge
On Thu, Mar 9, 2017 at 8:40 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Mon, Feb 20, 2017 at 1:35 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Thanks Amit for raising this point. I was not at all aware of > mark/restore. > > I tried to come up with the case, but haven't found such case. > > > > For now here is the patch with comment update. > > Sorry for the delay in getting back to this. This seems to need minor > rebasing again. > > Done. > A few other points: > > ExecEndGatherMerge needs to be patched along the lines of > acf555bc53acb589b5a2827e65d655fa8c9adee0. > > Done. > @@ -2290,7 +2376,6 @@ create_limit_plan(PlannerInfo *root, LimitPath > *best_path, int flags) > return plan; > } > > - > /*** > ** > * > * BASE-RELATION SCAN METHODS > > Unnecessary. > Fixed. Here is another version of patch with the suggested changes. Thanks, Rushabh Lathia EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company gather-merge-v9.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] wait events for disk I/O
On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
> > On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmh...@gmail.com>
> > wrote:
> >> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapil...@gmail.com>
> >> wrote:
> >>> Sure, if you think both Writes and Reads at OS level can have some
> >>> chance of blocking in obscure cases, then we should add a wait event
> >>> for them.
> >>
> >> I think writes have a chance of blocking in cases even in cases that
> >> are not very obscure at all.
> >
> > Point taken for writes, but I think in general we should have some
> > criteria based on which we can decide whether to have a wait event for
> > a particular call. It should not happen that we have tons of wait
> > events and out of which, only a few are helpful in most of the cases
> > in real-world scenarios.
>
> Well, the problem is that if you pick and choose which wait events to
> add based on what you think will be common, you're actually kind of
> hosing yourself. Because now when something uncommon happens, suddenly
> you don't get any wait event data and you can't tell what's happening.
> I think the number of new wait events added by Rushabh's patch is
> wholly reasonable. Yeah, some of those are going to be a lot more
> common than others, but so what? We add wait events so that we can
> find out what's going on. I don't want to sometimes know when a
> backend is blocked on an I/O. I want to ALWAYS know.
>

Yes, I agree with Robert. It is difficult to judge what we want and what we don't want to track. It would not be good to decide that some information is not useful and later end up in a situation where we have to rethink adding the missing events. Having more information about the system, especially for monitoring purposes, is always good.
I am attaching another version of the patch, as I found a silly mistake in the earlier version, where I missed initializing the initial value of the WaitEventIO enum. The earlier version also did not apply cleanly on the current sources.

--
Rushabh Lathia
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

wait_event_for_disk_IO_v2.patch
Description: application/download
Re: [HACKERS] CREATE/ALTER ROLE PASSWORD ('value' USING 'method')
On Wed, Mar 8, 2017 at 9:59 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Michael Paquier <michael.paqu...@gmail.com> writes:
> > here is a separate thread dedicated to the following extension for
> > CREATE/ALTER ROLE: PASSWORD ('value' USING 'method').
>
> The parentheses seem weird ... do we really need those?

+1

I had a quick glance at the patch and it looks great. One quick comment:

+| PASSWORD '(' Sconst USING Sconst ')'
+{
+$$ = makeDefElem("methodPassword",
+ (Node *) list_make2(makeString($3),
+ makeString($5)),
+ @1);
+}

methodPassword looks a bit weird; can we change it to passwordMethod or pwdEncryptMethod?

> regards, tom lane
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Rushabh Lathia
Re: [HACKERS] Print correct startup cost for the group aggregate.
Thanks Ashutosh & Robert for the explanation. On Mon, Mar 6, 2017 at 10:02 AM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > On Sat, Mar 4, 2017 at 2:50 PM, Robert Haas <robertmh...@gmail.com> wrote: > > On Thu, Mar 2, 2017 at 6:48 PM, Ashutosh Bapat > > <ashutosh.ba...@enterprisedb.com> wrote: > >> On Thu, Mar 2, 2017 at 6:06 PM, Rushabh Lathia < > rushabh.lat...@gmail.com> wrote: > >>> While reading through the cost_agg() I found that startup cost for the > >>> group aggregate is not correctly assigned. Due to this explain plan is > >>> not printing the correct startup cost. > >>> > >>> Without patch: > >>> > >>> postgres=# explain select aid, sum(abalance) from pgbench_accounts > where > >>> filler like '%foo%' group by aid; > >>> QUERY PLAN > >>> > - > >>> GroupAggregate (cost=81634.33..85102.04 rows=198155 width=12) > >>>Group Key: aid > >>>-> Sort (cost=81634.33..82129.72 rows=198155 width=8) > >>> Sort Key: aid > >>> -> Seq Scan on pgbench_accounts (cost=0.00..61487.89 > rows=198155 > >>> width=8) > >>>Filter: (filler ~~ '%foo%'::text) > >>> (6 rows) > >>> > >>> With patch: > >>> > >>> postgres=# explain select aid, sum(abalance) from pgbench_accounts > where > >>> filler like '%foo%' group by aid; > >>> QUERY PLAN > >>> > - > >>> GroupAggregate (cost=82129.72..85102.04 rows=198155 width=12) > >>>Group Key: aid > >>>-> Sort (cost=81634.33..82129.72 rows=198155 width=8) > >>> Sort Key: aid > >>> -> Seq Scan on pgbench_accounts (cost=0.00..61487.89 > rows=198155 > >>> width=8) > >>>Filter: (filler ~~ '%foo%'::text) > >>> (6 rows) > >>> > >> > >> The reason the reason why startup_cost = input_startup_cost and not > >> input_total_cost for aggregation by sorting is we don't need the whole > >> input before the Group/Agg plan can produce the first row. But I think > >> setting startup_cost = input_startup_cost is also not exactly correct. 
> >> Before the plan can produce one row, it has to transit through all the > >> rows belonging to the group to which the first row belongs. On an > >> average it has to scan (total number of rows)/(number of groups) > >> before producing the first aggregated row. startup_cost will be > >> input_startup_cost + cost to scan (total number of rows)/(number of > >> groups) rows + cost of transiting over those many rows. Total cost = > >> startup_cost + cost of scanning and transiting through the remaining > >> number of input rows. > > > > While that idea has some merit, I think it's inconsistent with current > > practice. cost_seqscan(), for example, doesn't include the cost of > > reading the first page in the startup cost, even though that certainly > > must be done before returning the first row. > > OTOH, while costing for merge join, initial_cost_mergejoin() seems to > consider the cost of scanning outer and inner relations upto the first > matching tuple on both sides in the startup_cost. Different policies > at different places. > > > I think there have been > > previous discussions of switching over to the practice for which you > > are advocating here, but my impression (without researching) is that > > the current practice is more like what Rushabh did. > > > > I am not sure Rushabh's approach is correct. Here's the excerpt from my > mail. > > >> The reason the reason why startup_cost = input_startup_cost and not > >> input_total_cost for aggregation by sorting is we don't need the whole > >> input before the Group/Agg plan can produce the first row. > > > With Rushabh's approach the startup cost of aggregation by sorting > would be way higher that it's right now. Secondly, it would match that > of hash aggregation and thus forcing hash aggregation to be chosen in > almost all the cases. > I understood you reasoning of why startup_cost = input_startup_cost and not input_total_cost for aggregation by sorting. 
But what I didn't understand is why a higher startup cost for aggregation by sorting would force hash aggregation to be chosen. I am not clear about this part.

--
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>

Regards.
Rushabh Lathia
www.EnterpriseDB.com
Re: [HACKERS] wait events for disk I/O
On Sat, Mar 4, 2017 at 7:53 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Mon, Feb 20, 2017 at 4:04 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > > > My colleague Rahila reported compilation issue with > > the patch. Issue was only coming with we do the clean > > build on the branch. > > > > Fixed the same into latest version of patch. > > > > Few assorted comments: > > 1. > + > + WriteBuffile > + Waiting during > buffile read operation. > + > > Operation name and definition are not matching. > > Will fix this. > > 2. > +FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info) > { > #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED) > int returnCode; > @@ -1527,8 +1527,10 @@ FilePrefetch(File file, off_t offset, int amount) > if (returnCode < 0) > return returnCode; > > + pgstat_report_wait_start(wait_event_info); > returnCode = posix_fadvise(VfdCache[file].fd, offset, amount, > POSIX_FADV_WILLNEED); > + pgstat_report_wait_end(); > > AFAIK, this call is non-blocking and will just initiate a read, so do > you think we should record wait event for such a call. > > Yes, prefecth call is a non-blocking and will just initiate a read. But this info about the prefetch into wait events will give more info about the system. 3. > - written = FileWrite(src->vfd, waldata_start, len); > + written = FileWrite(src->vfd, waldata_start, len, > + WAIT_EVENT_WRITE_REWRITE_DATA_BLOCK); > if (written != len) > ereport(ERROR, > (errcode_for_file_access(), > @@ -957,7 +960,7 @@ logical_end_heap_rewrite(RewriteState state) > hash_seq_init(_status, state->rs_logical_mappings); > while ((src = (RewriteMappingFile *) hash_seq_search(_status)) != > NULL) > { > - if (FileSync(src->vfd) != 0) > + if (FileSync(src->vfd, WAIT_EVENT_SYNC_REWRITE_DATA_BLOCK) != 0) > > > Do we want to consider recording wait event for both write and sync? 
> It seems to me OS level writes are relatively cheap and sync calls are
> expensive, so shouldn't we just log for sync calls? I could see the
> similar usage at multiple places in the patch.
>

Yes, I initially thought of adding a wait event only for sync, but then decided to record wait events for both write and sync. I understand that OS-level writes are cheap, but they still have some cost attached to them. I also thought the monitoring tools being developed on top of these wait events would capture more useful data if we collect as much information as we can. Or maybe not; I am open to other opinions and suggestions.

--
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>

--
Rushabh Lathia
[HACKERS] Print correct startup cost for the group aggregate.
Hi, While reading through the cost_agg() I found that startup cost for the group aggregate is not correctly assigned. Due to this explain plan is not printing the correct startup cost. Without patch: postgres=# explain select aid, sum(abalance) from pgbench_accounts where filler like '%foo%' group by aid; QUERY PLAN - GroupAggregate (cost=81634.33..85102.04 rows=198155 width=12) Group Key: aid -> Sort (cost=81634.33..82129.72 rows=198155 width=8) Sort Key: aid -> Seq Scan on pgbench_accounts (cost=0.00..61487.89 rows=198155 width=8) Filter: (filler ~~ '%foo%'::text) (6 rows) With patch: postgres=# explain select aid, sum(abalance) from pgbench_accounts where filler like '%foo%' group by aid; QUERY PLAN - GroupAggregate (cost=82129.72..85102.04 rows=198155 width=12) Group Key: aid -> Sort (cost=81634.33..82129.72 rows=198155 width=8) Sort Key: aid -> Seq Scan on pgbench_accounts (cost=0.00..61487.89 rows=198155 width=8) Filter: (filler ~~ '%foo%'::text) (6 rows) PFA patch to correct the same. Regards, Rushabh Lathia www.EnterpriseDB.com fix_startup_cost_cost_agg.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [POC] hash partitioning
on h (cost=0.00..0.00 rows=1 width=4) > Filter: ((abs(hashint4(i)) % 3) = 2) > -> Seq Scan on h3 (cost=0.00..61.00 rows=13 width=4) > Filter: ((abs(hashint4(i)) % 3) = 2) > (5 rows) > > Best regards, > Yugo Nagata > > -- > Yugo Nagata <nag...@sraoss.co.jp> > > Regards, Rushabh Lathia
Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw
On Wed, Feb 22, 2017 at 12:15 PM, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp > wrote: > On 2017/02/21 19:31, Rushabh Lathia wrote: > >> On Mon, Feb 20, 2017 at 1:41 PM, Etsuro Fujita >> <fujita.ets...@lab.ntt.co.jp <mailto:fujita.ets...@lab.ntt.co.jp>> wrote: >> > > On 2017/02/13 18:24, Rushabh Lathia wrote: >> Here are few comments: >> >> 1) >> >> @@ -211,6 +211,12 @@ typedef struct PgFdwDirectModifyState >> PGresult *result;/* result for query */ >> intnum_tuples;/* # of result tuples */ >> intnext_tuple;/* index of next one to >> return */ >> +RelationresultRel;/* relcache entry for the >> target table */ >> >> >> Why we need resultRel? Can't we directly use dmstate->rel ? >> >> >> The reason why we need that is because in get_returning_data, we >> pass dmstate->rel to make_tuple_from_result_row, which requires that >> dmstate->rel be NULL when the scan tuple is described by >> fdw_scan_tlist. So in that case we set dmstate->rel to NULL and >> have dmstate->resultRel that is the relcache entry for the target >> relation in postgresBeginDirectModify. >> > > Thanks for the explanation. We might do something here by using >> fdw_scan_tlist or changing the assumption of >> make_tuple_from_result_row(), and that way we can avoid two similar >> variable pointer in the PgFdwDirectModifyState. >> > > I agree that the two similar variables are annoying, to some extent, but > ISTM that is not that bad because that makes the handling of dmstate->rel > consistent with that of PgFdwScanState's rel. As you know, > PgFdwDirectModifyState is defined in a similar way as PgFdwScanState, in > which the rel is a relcache entry for the foreign table for a simple > foreign table scan and NULL for a foreign join scan (see comments for the > definition of PgFdwScanState). > > I am okay with currently also, but it adding a note somewhere about this >> would be great. Also let keep this point open for the committer, if >> committer feel this is good then lets go ahead with this. 
>> > > Agreed. > > Thanks. Marked this as Ready for Committer. > Here are few other cosmetic changes: >> >> 1) >> >> + * >> + * 'target_rel' is either zero or the rangetable index of a target >> relation. >> + * In the latter case this construncts FROM clause of UPDATE or USING >> clause >> + * of DELETE by simply ignoring the target relation while deparsing the >> given >> >> Spell correction: - construncts >> >> 2) >> >> +/* >> + * If either input is the target relation, get all the >> joinclauses. >> + * Otherwise extract conditions mentioning the target relation >> from >> + * the joinclauses. >> + */ >> >> >> space between joinclauses needed. >> >> 3) >> >> +/* >> + * If UPDATE/DELETE on a join, create a RETURINING list used in >> the >> + * remote query. >> + */ >> +if (fscan->scan.scanrelid == 0) >> +returningList = make_explicit_returning_list(resultRelation, >> + rel, >> + returningList); >> >> >> Spell correction: RETURINING >> > > Good catch! > > I did above changes in the attached patch. Please have a look once and >> > > I'm fine with that. Thanks for the patch! > > then I feel like this patch is ready for committer. >> > > Thanks for reviewing! > > Best regards, > Etsuro Fujita > > > -- Rushabh Lathia
Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw
On Mon, Feb 20, 2017 at 1:41 PM, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp> wrote: > On 2017/02/13 18:24, Rushabh Lathia wrote: > >> I started reviewing the patch again. Patch applied cleanly on latest >> source >> as well as regression pass through with the patch. I also performed >> few manual testing and haven't found any regression. Patch look >> much cleaner the earlier version, and don't have any major concern as >> such. >> > > Thanks for the review! > > Here are few comments: >> >> 1) >> >> @@ -211,6 +211,12 @@ typedef struct PgFdwDirectModifyState >> PGresult *result;/* result for query */ >> intnum_tuples;/* # of result tuples */ >> intnext_tuple;/* index of next one to return */ >> +RelationresultRel;/* relcache entry for the target table >> */ >> > > Why we need resultRel? Can't we directly use dmstate->rel ? >> > > The reason why we need that is because in get_returning_data, we pass > dmstate->rel to make_tuple_from_result_row, which requires that > dmstate->rel be NULL when the scan tuple is described by fdw_scan_tlist. > So in that case we set dmstate->rel to NULL and have dmstate->resultRel > that is the relcache entry for the target relation in > postgresBeginDirectModify. > > Thanks for the explanation. We might do something here by using fdw_scan_tlist or changing the assumption of make_tuple_from_result_row(), and that way we can avoid two similar variable pointer in the PgFdwDirectModifyState. I am okay with currently also, but it adding a note somewhere about this would be great. Also let keep this point open for the committer, if committer feel this is good then lets go ahead with this. Here are few other cosmetic changes: 1) + * + * 'target_rel' is either zero or the rangetable index of a target relation. 
+ * In the latter case this construncts FROM clause of UPDATE or USING clause
+ * of DELETE by simply ignoring the target relation while deparsing the given

Spell correction: construncts

2)

+/*
+ * If either input is the target relation, get all the joinclauses.
+ * Otherwise extract conditions mentioning the target relation from
+ * the joinclauses.
+ */

A space is needed in "joinclauses".

3)

+/*
+ * If UPDATE/DELETE on a join, create a RETURINING list used in the
+ * remote query.
+ */
+if (fscan->scan.scanrelid == 0)
+    returningList = make_explicit_returning_list(resultRelation,
+                                                 rel,
+                                                 returningList);

Spell correction: RETURINING

I did the above changes in the attached patch. Please have a look; once that is done, I feel this patch is ready for committer.

Thanks,
Rushabh Lathia

postgres-fdw-more-update-pushdown-v3_cosmetic_changes.patch
Description: application/download
Re: [HACKERS] wait events for disk I/O
My colleague Rahila reported a compilation issue with the patch. The issue only appeared when doing a clean build on the branch. Fixed the same in the latest version of the patch.

Thanks,

On Tue, Jan 31, 2017 at 11:09 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Tue, Jan 31, 2017 at 8:54 AM, Michael Paquier < > michael.paqu...@gmail.com> wrote: > >> On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia >> <rushabh.lat...@gmail.com> wrote: >> > Attached is the patch, which extend the existing wait event >> infrastructure >> > to implement the wait events for the disk I/O. Basically >> pg_stat_activity's >> > wait event information to show data about disk I/O as well as IPC >> primitives. >> > >> > Implementation details: >> > >> > - Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO >> > - Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch, >> > FileWriteback, FileSync, and FileTruncate. Set this wait event just >> before >> > performing the file system operation and clear it just after. >> > - Pass down an appropriate wait event from caller of any of those >> > functions. >> > - Also set and clear a wait event around standalone calls to read(), >> > write(), fsync() in other parts of the system. >> > - Added documentation for all newly added wait event. >> >> Looks neat, those are unlikely to overlap with other wait events. >> > > Thanks. > > >> >> > Open issue: >> > - Might missed few standalone calls to read(), write(), etc which need >> > to pass the wait_event_info. >> >> It may be an idea to use LD_PRELOAD with custom versions of read(), >> write() and fsync(), and look at the paths where no flags are set in >> MyProc->wait_event_info, and log information when that happens. >> >> > Yes, may be I will try this. > > >> > Thanks to my colleague Robert Haas for his help in design. >> > Please let me know your thought, and thanks for reading. >> >> Did you consider a wrapper of the type pg_read_event() or >> pg_write_event(), etc.?
>> > > I thought on that, but eventually stick to this approach as it looks > more neat and uniform with other wait event implementation. > > > >> -- >> Michael >> > > > > Thanks, > Rushabh Lathia > www.EnterpriseDB.com > -- Rushabh Lathia wait_event_for_disk_IO_v1.patch Description: application/download
Re: [HACKERS] Gather Merge
Thanks Amit for raising this point. I was not at all aware of mark/restore. I tried to come up with the case, but haven't found such case. For now here is the patch with comment update. Thanks, On Sun, Feb 19, 2017 at 7:30 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Sun, Feb 19, 2017 at 2:22 PM, Robert Haas <robertmh...@gmail.com> > wrote: > > On Sat, Feb 18, 2017 at 6:43 PM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > >> I think there is a value in supporting mark/restore position for any > >> node which produces sorted results, however, if you don't want to > >> support it, then I think we should update above comment in the code to > >> note this fact. Also, you might want to check other places to see if > >> any similar comment updates are required in case you don't want to > >> support mark/restore position for GatherMerge. > > > > I don't think it makes sense to put mark/support restore into Gather > > Merge. Maybe somebody else will come up with a test that shows > > differently, but ISTM that with something like Sort it makes a ton of > > sense to support mark/restore because the Sort node itself can do it > > much more cheaply than would be possible with a separate Materialize > > node. If you added a separate Materialize node, the Sort node would > > be trying to throw away the exact same data that the Materialize node > > was trying to accumulate, which is silly. > > > > I am not sure but there might be some cases where adding Materialize > node on top of Sort node could make sense as we try to cost it as well > and add it if it turns out to be cheap. > > > Here with Gather Merge > > there is no such thing happening; mark/restore support would require > > totally new code - probably we would end up shoving the same code that > > already exists in Materialize into Gather Merge as well. 
> > > > I have tried to evaluate this based on plans reported by Rushabh and > didn't find any case where GatherMerge can be beneficial by supporting > mark/restore in the plans where it is being used (it is not being used > on the right side of merge join). Now, it is quite possible that it > might be beneficial at higher scale factors or may be planner has > ignored such a plan. However, as we are not clear what kind of > benefits we can get via mark/restore support for GatherMerge, it > doesn't make much sense to take the trouble of implementing it. > > > > > A comment update is probably a good idea, though. > > > > Agreed. > > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > -- Rushabh Lathia gather-merge-v8-update-comment.patch Description: application/download
Re: [HACKERS] Gather Merge
On Fri, Feb 17, 2017 at 4:47 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Fri, Feb 17, 2017 at 3:59 PM, Thomas Munro > <thomas.mu...@enterprisedb.com> wrote: > > On Thu, Feb 2, 2017 at 2:32 AM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > >> Please find attached latest patch. > > > > The latest patch still applies (with some fuzz), builds and the > > regression tests pass. > > > > I see that Robert made a number of changes and posted a v6 along with > > some numbers which he described as lacklustre, but then fixed a row > > estimate problem which was discouraging parallel joins (commit > > 0c2070ce). Rushabh posted a v7 and test results which look good. > > > > Are you suggesting that commit 0c2070ce has helped to improve > performance if so, I don't think that has been proved? I guess the > numbers are different either due to different m/c or some other > settings like scale factor or work_mem. >

I don't really think 0c2070ce is the exact reason. I ran the TPC-H tests with the same settings Robert was using, and I haven't noticed any regression in the runs. For the last runs I also uploaded the TPC-H outputs for the individual queries, for reference.

> > As > > far as I can see there are no outstanding issues or unhandled review > > feedback. I've had a fresh read through of the latest version and > > have no further comments myself. > > > > I've set this to ready-for-committer now. If I've misunderstood and > > there are still unresolved issues from that earlier email exchange or > > someone else wants to post a review or objection, then of course > > please feel free to set it back. > > > > BTW There is no regression test supplied. I see that commit 5262f7a4 > > adding parallel index scans put simple explain output in > > "select_parallel" to demonstrate the new kind of plan being created; > > It has added both explain statement test and a test to exercise > parallel index scan code. > > Thanks for the reference.
I added similar tests for GM in the latest uploaded patch.

> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Rushabh Lathia
Re: [HACKERS] Gather Merge
On Fri, Feb 17, 2017 at 3:59 PM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote: > On Thu, Feb 2, 2017 at 2:32 AM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > > Please find attached latest patch. > > The latest patch still applies (with some fuzz), builds and the > regression tests pass. >

Attached is the latest patch, which applies cleanly on the latest source.

> I see that Robert made a number of changes and posted a v6 along with > some numbers which he described as lacklustre, but then fixed a row > estimate problem which was discouraging parallel joins (commit > 0c2070ce). Rushabh posted a v7 and test results which look good. As > far as I can see there are no outstanding issues or unhandled review > feedback. I've had a fresh read through of the latest version and > have no further comments myself. > > I've set this to ready-for-committer now. If I've misunderstood and > there are still unresolved issues from that earlier email exchange or > someone else wants to post a review or objection, then of course > please feel free to set it back. >

Thanks Thomas.

> BTW There is no regression test supplied. I see that commit 5262f7a4 > adding parallel index scans put simple explain output in > "select_parallel" to demonstrate the new kind of plan being created; > perhaps this patch should do the same? I know it wouldn't really test > much of the code but it's at least something. Perhaps you could post > a new version with that? >

Added the regression test to the new version of the patch. PFA the latest patch.

--
Rushabh Lathia
www.EnterpriseDB.com

gather-merge-v8.patch
Description: application/download
Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw
Sorry for the delay in the review.

I started reviewing the patch again. The patch applies cleanly on the latest source, and the regression tests pass with it. I also performed some manual testing and haven't found any regression. The patch looks much cleaner than the earlier version, and I don't have any major concerns as such.

Here are a few comments:

1)

@@ -211,6 +211,12 @@ typedef struct PgFdwDirectModifyState
     PGresult   *result;        /* result for query */
     int         num_tuples;    /* # of result tuples */
     int         next_tuple;    /* index of next one to return */
+    Relation    resultRel;     /* relcache entry for the target table */

Why do we need resultRel? Can't we directly use dmstate->rel?

2)

In the patch the scanrelid condition is sometimes written as fscan->scan.scanrelid == 0, whereas elsewhere it is written as fsplan->scan.scanrelid > 0. In fact, it is used differently within the same function, for example in postgresBeginDirectModify. Can we make this consistent?

3)

+ * If UPDATE/DELETE on a join, create a RETURINING list used in the remote
+ * query.
+ */
+if (fscan->scan.scanrelid == 0)
+    returningList = make_explicit_returning_list(resultRelation, rel,
+                                                 returningList);
+

The above block can be moved inside the if (plan->returningLists) condition above it, like this:

/*
 * Extract the relevant RETURNING list if any.
 */
if (plan->returningLists)
{
    returningList = (List *) list_nth(plan->returningLists, subplan_index);

    /*
     * If UPDATE/DELETE on a join, create a RETURINING list used in the
     * remote query.
     */
    if (fscan->scan.scanrelid == 0)
        returningList = make_explicit_returning_list(resultRelation, rel,
                                                     returningList);
}

I am still doing some more testing with the patch; if I find anything apart from this I will raise it in another mail.

Thanks,

On Wed, Feb 1, 2017 at 11:50 AM, Michael Paquier <michael.paqu...@gmail.com> wrote: > On Wed, Jan 25, 2017 at 7:20 PM, Etsuro Fujita > <fujita.ets...@lab.ntt.co.jp> wrote: > > Attached is the new version of the patch.
I also addressed other > comments > > from you: moved rewriting the fdw_scan_tlist to postgres_fdw.c, > > added/revised comments, and added regression tests for the case where a > > pushed down UPDATE/DELETE on a join has RETURNING. > > > > My apologies for having been late to work on this. > > Moved to CF 2017-03. > -- > Michael > -- Rushabh Lathia
[HACKERS] pg_restore is broken on 9.2 version.
Hi All,

Commit c59a1a89035674c6efacc596d652528cebba37ec does not allow a non-positive number of jobs. On 9.2 the number of jobs gets assigned to opts->number_of_jobs, and if the user does not specify any value with -j, the default value is always 0, which leads to the "invalid number of parallel jobs" error.

if (opts->number_of_jobs <= 0)
{
    fprintf(stderr, _("%s: invalid number of parallel jobs\n"), progname);
    exit(1);
}

Please find attached a patch to initialize the default number of jobs to 1.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

pg_restore_fix.patch
Description: binary/octet-stream
Re: [HACKERS] Gather Merge
Due to the recent commit below, the patch does not apply cleanly on the master branch.

commit d002f16c6ec38f76d1ee97367ba6af3000d441d0
Author: Tom Lane <t...@sss.pgh.pa.us>
Date:   Mon Jan 30 17:15:42 2017 -0500

    Add a regression test script dedicated to exercising system views.

Please find attached the latest patch.

On Wed, Feb 1, 2017 at 5:55 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > I am sorry for the delay, here is the latest re-based patch. > > my colleague Neha Sharma, reported one regression with the patch, where > explain output for the Sort node under GatherMerge was always showing > cost as zero: > > explain analyze select '' AS "xxx" from pgbench_accounts where filler > like '%foo%' order by aid; > QUERY PLAN > > Gather Merge (cost=47169.81..70839.91 rows=197688 width=36) (actual > time=406.297..653.572 rows=20 loops=1) > Workers Planned: 4 > Workers Launched: 4 > -> Sort (*cost=0.00..0.00 rows=0 width=0*) (actual > time=368.945..391.124 rows=4 loops=5) > Sort Key: aid > Sort Method: quicksort Memory: 3423kB > -> Parallel Seq Scan on pgbench_accounts (cost=0.00..42316.60 > rows=49422 width=36) (actual time=296.612..338.873 rows=4 loops=5) > Filter: (filler ~~ '%foo%'::text) > Rows Removed by Filter: 36 > Planning time: 0.184 ms > Execution time: 734.963 ms > > This patch also fix that issue. > > On Wed, Feb 1, 2017 at 11:27 AM, Michael Paquier < > michael.paqu...@gmail.com> wrote: > >> On Mon, Jan 23, 2017 at 6:51 PM, Kuntal Ghosh >> <kuntalghosh.2...@gmail.com> wrote: >> > On Wed, Jan 18, 2017 at 11:31 AM, Rushabh Lathia >> > <rushabh.lat...@gmail.com> wrote: >> >> >> > The patch needs a rebase after the commit 69f4b9c85f168ae006929eec4. >> >> Is an update going to be provided? I have moved this patch to next CF >> with "waiting on author" as status.
>> -- >> Michael >> > > > > -- > Rushabh Lathia > -- Rushabh Lathia gather-merge-v7.patch Description: binary/octet-stream
Re: [HACKERS] Gather Merge
I am sorry for the delay; here is the latest re-based patch.

My colleague Neha Sharma reported one regression with the patch, where the explain output for the Sort node under GatherMerge always showed the cost as zero:

explain analyze select '' AS "xxx" from pgbench_accounts where filler like '%foo%' order by aid;
                                       QUERY PLAN

 Gather Merge  (cost=47169.81..70839.91 rows=197688 width=36) (actual time=406.297..653.572 rows=20 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Sort  (*cost=0.00..0.00 rows=0 width=0*) (actual time=368.945..391.124 rows=4 loops=5)
         Sort Key: aid
         Sort Method: quicksort  Memory: 3423kB
         ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..42316.60 rows=49422 width=36) (actual time=296.612..338.873 rows=4 loops=5)
               Filter: (filler ~~ '%foo%'::text)
               Rows Removed by Filter: 36
 Planning time: 0.184 ms
 Execution time: 734.963 ms

This patch also fixes that issue.

On Wed, Feb 1, 2017 at 11:27 AM, Michael Paquier <michael.paqu...@gmail.com> wrote: > On Mon, Jan 23, 2017 at 6:51 PM, Kuntal Ghosh > <kuntalghosh.2...@gmail.com> wrote: > > On Wed, Jan 18, 2017 at 11:31 AM, Rushabh Lathia > > <rushabh.lat...@gmail.com> wrote: > >> > > The patch needs a rebase after the commit 69f4b9c85f168ae006929eec4. > > Is an update going to be provided? I have moved this patch to next CF > with "waiting on author" as status. > -- > Michael >

--
Rushabh Lathia

gather-merge-v7.patch
Description: binary/octet-stream
Re: [HACKERS] wait events for disk I/O
On Tue, Jan 31, 2017 at 8:54 AM, Michael Paquier <michael.paqu...@gmail.com> wrote: > On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Attached is the patch, which extend the existing wait event > infrastructure > > to implement the wait events for the disk I/O. Basically > pg_stat_activity's > > wait event information to show data about disk I/O as well as IPC > primitives. > > > > Implementation details: > > > > - Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO > > - Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch, > > FileWriteback, FileSync, and FileTruncate. Set this wait event just > before > > performing the file system operation and clear it just after. > > - Pass down an appropriate wait event from caller of any of those > > functions. > > - Also set and clear a wait event around standalone calls to read(), > > write(), fsync() in other parts of the system. > > - Added documentation for all newly added wait event. > > Looks neat, those are unlikely to overlap with other wait events. > Thanks. > > > Open issue: > > - Might missed few standalone calls to read(), write(), etc which need > > to pass the wait_event_info. > > It may be an idea to use LD_PRELOAD with custom versions of read(), > write() and fsync(), and look at the paths where no flags are set in > MyProc->wait_event_info, and log information when that happens. > > Yes, may be I will try this. > > Thanks to my colleague Robert Haas for his help in design. > > Please let me know your thought, and thanks for reading. > > Did you consider a wrapper of the type pg_read_event() or > pg_write_event(), etc.? > I thought on that, but eventually stick to this approach as it looks more neat and uniform with other wait event implementation. > -- > Michael > Thanks, Rushabh Lathia www.EnterpriseDB.com
[HACKERS] wait events for disk I/O
Hi All,

Attached is a patch which extends the existing wait event infrastructure to implement wait events for disk I/O. Basically, pg_stat_activity's wait event information now shows data about disk I/O as well as IPC primitives.

Implementation details:

- Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO.
- Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch, FileWriteback, FileSync, and FileTruncate. Set this wait event just before performing the file system operation and clear it just after.
- Pass down an appropriate wait event from callers of any of those functions.
- Also set and clear a wait event around standalone calls to read(), write(), fsync() in other parts of the system.
- Added documentation for all newly added wait events.

Open issue:
- Might have missed a few standalone calls to read(), write(), etc. which need to pass the wait_event_info.

Thanks to my colleague Robert Haas for his help in the design. Please let me know your thoughts, and thanks for reading.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

wait_event_disk_IO.patch
Description: binary/octet-stream
Re: [HACKERS] Query fails when SRFs are part of FROM clause (Commit id: 69f4b9c85f)
On Sat, Jan 28, 2017 at 3:43 AM, David G. Johnston < david.g.johns...@gmail.com> wrote: > On Fri, Jan 27, 2017 at 5:28 AM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > >> Consider the below test; >> >> CREATE TABLE tab ( a int primary key); >> >> SELECT * >> FROM pg_constraint pc, >> CAST(CASE WHEN pc.contype IN ('f','u','p') THEN generate_series(1, >> array_upper(pc.conkey, 1)) ELSE NULL END AS int) AS position; >> >> Above query is failing with "set-valued function called in context that >> cannot >> accept a set". But if I remove the CASE from the query then it working >> just good. >> >> Like: >> >> SELECT * >> FROM pg_constraint pc, >> CAST(generate_series(1, array_upper(pc.conkey, 1)) AS int) AS position; >> >> This started failing with 69f4b9c85f168ae006929eec44fc44d569e846b9. It >> seems >> check_srf_call_placement() sets the hasTargetSRFs flag and but when the >> SRFs >> at the rtable ofcourse this flag doesn't get set. It seems like missing >> something >> their, but I might be completely wrong as not quire aware of this area. >> >> > I'm a bit surprised that your query actually works...and without delving > into source code its hard to explain why it should/shouldn't or whether the > recent SRF work was intended to impact it. > > In any case the more idiomatic way of writing your query these days (since > 9.4 came out) is: > > SELECT * > FROM pg_constraint pc > LEFT JOIN LATERAL generate_series(1, case when contype in ('f','p','u') > then array_upper(pc.conkey, 1) else 0 end) gs ON true; > > generate_series is smart enough to return an empty set (instead of > erroring out) when provided with (1,0) as arguments. >

Thanks for providing the work-around query; I also understood your point. At the same time, the reason for raising this issue is that this was working before commit 69f4b9c85f168ae006929eec44fc44d569e846b9 and now it throws an error. So the question is whether that is intended, or whether the query started failing because of a bug introduced with the commit.
The issue is still reproducible when the query is rewritten with LEFT JOIN LATERAL and I keep using the CASE statement:

SELECT *
FROM pg_constraint pc
LEFT JOIN LATERAL CAST((CASE WHEN pc.contype IN ('f','u','p') THEN
generate_series(1, array_upper(pc.conkey, 1)) ELSE NULL END) AS int) gs ON true;
ERROR:  set-valued function called in context that cannot accept a set

> David J.

--
Rushabh Lathia
www.EnterpriseDB.com
[HACKERS] Query fails when SRFs are part of FROM clause (Commit id: 69f4b9c85f)
Consider the below test:

CREATE TABLE tab ( a int primary key);

SELECT *
FROM pg_constraint pc,
CAST(CASE WHEN pc.contype IN ('f','u','p') THEN generate_series(1, array_upper(pc.conkey, 1)) ELSE NULL END AS int) AS position;

The above query fails with "set-valued function called in context that cannot accept a set", but if I remove the CASE from the query then it works just fine:

SELECT *
FROM pg_constraint pc,
CAST(generate_series(1, array_upper(pc.conkey, 1)) AS int) AS position;

This started failing with 69f4b9c85f168ae006929eec44fc44d569e846b9. It seems check_srf_call_placement() sets the hasTargetSRFs flag, but when the SRF is in the rtable this flag of course doesn't get set. It seems like something is missing there, but I might be completely wrong as I am not quite aware of this area.

regards,
Rushabh Lathia
www.EnterpriseDB.com
Re: [HACKERS] Gather Merge
On Tue, Jan 17, 2017 at 6:44 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Tue, Jan 17, 2017 at 5:19 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > > > > > > > Here is the benchmark number which I got with the latest (v6) patch: > > > > - max_worker_processes = DEFAULT (8) > > - max_parallel_workers_per_gather = 4 > > - Cold cache environment is ensured. With every query execution - server > is > > stopped and also OS caches were dropped. > > - The reported values of execution time (in ms) is median of 3 > executions. > > - power2 machine with 512GB of RAM > > - With default postgres.conf > > > > You haven't mentioned scale factor used in these tests. > Oops sorry. Those results are for scale factor 10. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
On Tue, Jan 17, 2017 at 5:19 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Fri, Jan 13, 2017 at 10:52 AM, Rushabh Lathia <rushabh.lat...@gmail.com > > wrote: > >> >> >> On Thu, Jan 12, 2017 at 8:50 AM, Robert Haas <robertmh...@gmail.com> >> wrote: >> >>> On Sun, Dec 4, 2016 at 7:36 PM, Haribabu Kommi <kommi.harib...@gmail.com> >>> wrote: >>> > On Thu, Nov 24, 2016 at 11:12 PM, Rushabh Lathia < >>> rushabh.lat...@gmail.com> >>> > wrote: >>> >> PFA latest patch with fix as well as few cosmetic changes. >>> > >>> > Moved to next CF with "needs review" status. >>> >>> I spent quite a bit of time on this patch over the last couple of >>> days. I was hoping to commit it, but I think it's not quite ready for >>> that yet and I hit a few other issues along the way. Meanwhile, >>> here's an updated version with the following changes: >>> >>> * Adjusted cost_gather_merge because we don't need to worry about less >>> than 1 worker. >>> * Don't charge double maintenance cost of the heap per 34ca0905. This >>> was pointed out previous and Rushabh said it was fixed, but it wasn't >>> fixed in v5. >>> * cost_gather_merge claimed to charge a slightly higher IPC cost >>> because we have to block, but didn't. Fix it so it does. >>> * Move several hunks to more appropriate places in the file, near >>> related code or in a more logical position relative to surrounding >>> code. >>> * Fixed copyright dates for the new files. One said 2015, one said 2016. >>> * Removed unnecessary code from create_gather_merge_plan that tried to >>> handle an empty list of pathkeys (shouldn't happen). >>> * Make create_gather_merge_plan more consistent with >>> create_merge_append_plan. Remove make_gather_merge for the same >>> reason. >>> * Changed generate_gather_paths to generate gather merge paths. 
In >>> the previous coding, only the upper planner nodes ever tried to >>> generate gather merge nodes, but that seems unnecessarily limiting, >>> since it could be useful to generate a gathered path with pathkeys at >>> any point in the tree where we'd generate a gathered path with no >>> pathkeys. >>> * Rewrote generate_ordered_paths() logic to consider only the one >>> potentially-useful path not now covered by the new code in >>> generate_gather_paths(). >>> * Reverted changes in generate_distinct_paths(). I think we should >>> add something here but the existing logic definitely isn't right >>> considering the change to generate_gather_paths(). >>> * Assorted cosmetic cleanup in nodeGatherMerge.c. >>> * Documented the new GUC enable_gathermerge. >>> * Improved comments. Dropped one that seemed unnecessary. >>> * Fixed parts of the patch to be more pgindent-clean. >>> >>> >> Thanks Robert for hacking into this. >> >> >>> Testing this against the TPC-H queries at 10GB with >>> max_parallel_workers_per_gather = 4, seq_page_cost = 0.1, >>> random_page_cost = 0.1, work_mem = 64MB initially produced somewhat >>> demoralizing results. Only Q17, Q4, and Q8 picked Gather Merge, and >>> of those only Q17 got faster. Investigating this led to me realizing >>> that join costing for parallel joins is all messed up: see >>> https://www.postgresql.org/message-id/CA+TgmoYt2pyk2CTyvYCtF >>> ySXN=jsorgh8_mjttlowu5qkjo...@mail.gmail.com >>> >>> With that patch applied, in my testing, Gather Merge got picked for >>> Q3, Q4, Q5, Q6, Q7, Q8, Q10, and Q17, but a lot of those queries get a >>> little slower instead of a little faster. 
Here are the timings -- >>> these are with EXPLAIN ANALYZE, so take them with a grain of salt -- >>> first number is without Gather Merge, second is with Gather Merge: >>> >>> Q3 16943.938 ms -> 18645.957 ms >>> Q4 3155.350 ms -> 4179.431 ms >>> Q5 13611.484 ms -> 13831.946 ms >>> Q6 9264.942 ms -> 8734.899 ms >>> Q7 9759.026 ms -> 10007.307 ms >>> Q8 2473.899 ms -> 2459.225 ms >>> Q10 13814.950 ms -> 12255.618 ms >>> Q17 49552.298 ms -> 46633.632 ms >>> >>> >> This is strange, I will re-run the test again and post the results soon. >> >> > Here is the benchmark number which I got with the latest (v6) patch: > > - max_worker_processes = DEFAULT (8) > - max_parallel_workers_per_gather
Re: [HACKERS] Gather Merge
On Fri, Jan 13, 2017 at 10:52 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Thu, Jan 12, 2017 at 8:50 AM, Robert Haas <robertmh...@gmail.com> > wrote: > >> On Sun, Dec 4, 2016 at 7:36 PM, Haribabu Kommi <kommi.harib...@gmail.com> >> wrote: >> > On Thu, Nov 24, 2016 at 11:12 PM, Rushabh Lathia < >> rushabh.lat...@gmail.com> >> > wrote: >> >> PFA latest patch with fix as well as few cosmetic changes. >> > >> > Moved to next CF with "needs review" status. >> >> I spent quite a bit of time on this patch over the last couple of >> days. I was hoping to commit it, but I think it's not quite ready for >> that yet and I hit a few other issues along the way. Meanwhile, >> here's an updated version with the following changes: >> >> * Adjusted cost_gather_merge because we don't need to worry about less >> than 1 worker. >> * Don't charge double maintenance cost of the heap per 34ca0905. This >> was pointed out previous and Rushabh said it was fixed, but it wasn't >> fixed in v5. >> * cost_gather_merge claimed to charge a slightly higher IPC cost >> because we have to block, but didn't. Fix it so it does. >> * Move several hunks to more appropriate places in the file, near >> related code or in a more logical position relative to surrounding >> code. >> * Fixed copyright dates for the new files. One said 2015, one said 2016. >> * Removed unnecessary code from create_gather_merge_plan that tried to >> handle an empty list of pathkeys (shouldn't happen). >> * Make create_gather_merge_plan more consistent with >> create_merge_append_plan. Remove make_gather_merge for the same >> reason. >> * Changed generate_gather_paths to generate gather merge paths. In >> the previous coding, only the upper planner nodes ever tried to >> generate gather merge nodes, but that seems unnecessarily limiting, >> since it could be useful to generate a gathered path with pathkeys at >> any point in the tree where we'd generate a gathered path with no >> pathkeys. 
>> * Rewrote generate_ordered_paths() logic to consider only the one >> potentially-useful path not now covered by the new code in >> generate_gather_paths(). >> * Reverted changes in generate_distinct_paths(). I think we should >> add something here but the existing logic definitely isn't right >> considering the change to generate_gather_paths(). >> * Assorted cosmetic cleanup in nodeGatherMerge.c. >> * Documented the new GUC enable_gathermerge. >> * Improved comments. Dropped one that seemed unnecessary. >> * Fixed parts of the patch to be more pgindent-clean. >> >> > Thanks Robert for hacking into this. > > >> Testing this against the TPC-H queries at 10GB with >> max_parallel_workers_per_gather = 4, seq_page_cost = 0.1, >> random_page_cost = 0.1, work_mem = 64MB initially produced somewhat >> demoralizing results. Only Q17, Q4, and Q8 picked Gather Merge, and >> of those only Q17 got faster. Investigating this led to me realizing >> that join costing for parallel joins is all messed up: see >> https://www.postgresql.org/message-id/CA+TgmoYt2pyk2CTyvYCtF >> ySXN=jsorgh8_mjttlowu5qkjo...@mail.gmail.com >> >> With that patch applied, in my testing, Gather Merge got picked for >> Q3, Q4, Q5, Q6, Q7, Q8, Q10, and Q17, but a lot of those queries get a >> little slower instead of a little faster. Here are the timings -- >> these are with EXPLAIN ANALYZE, so take them with a grain of salt -- >> first number is without Gather Merge, second is with Gather Merge: >> >> Q3 16943.938 ms -> 18645.957 ms >> Q4 3155.350 ms -> 4179.431 ms >> Q5 13611.484 ms -> 13831.946 ms >> Q6 9264.942 ms -> 8734.899 ms >> Q7 9759.026 ms -> 10007.307 ms >> Q8 2473.899 ms -> 2459.225 ms >> Q10 13814.950 ms -> 12255.618 ms >> Q17 49552.298 ms -> 46633.632 ms >> >> > This is strange, I will re-run the test again and post the results soon. 
> > Here is the benchmark number which I got with the latest (v6) patch: - max_worker_processes = DEFAULT (8) - max_parallel_workers_per_gather = 4 - Cold cache environment is ensured. With every query execution - server is stopped and also OS caches were dropped. - The reported values of execution time (in ms) is median of 3 executions. - power2 machine with 512GB of RAM - With default postgres.conf Timing with v6 patch on REL9_6_STABLE branch (last commit: 8a70d8ae7501141d283e56b31e10c66697c986d5). Query 3: 49888.300 -> 45914.426 Query 4: 8531.939 -> 7790.498 Query 5: 40668.920 -> 38403.658 Q
Re: [HACKERS] Gather Merge
On Thu, Jan 12, 2017 at 8:50 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Sun, Dec 4, 2016 at 7:36 PM, Haribabu Kommi <kommi.harib...@gmail.com> > wrote: > > On Thu, Nov 24, 2016 at 11:12 PM, Rushabh Lathia < > rushabh.lat...@gmail.com> > > wrote: > >> PFA latest patch with fix as well as few cosmetic changes. > > > > Moved to next CF with "needs review" status. > > I spent quite a bit of time on this patch over the last couple of > days. I was hoping to commit it, but I think it's not quite ready for > that yet and I hit a few other issues along the way. Meanwhile, > here's an updated version with the following changes: > > * Adjusted cost_gather_merge because we don't need to worry about less > than 1 worker. > * Don't charge double maintenance cost of the heap per 34ca0905. This > was pointed out previous and Rushabh said it was fixed, but it wasn't > fixed in v5. > * cost_gather_merge claimed to charge a slightly higher IPC cost > because we have to block, but didn't. Fix it so it does. > * Move several hunks to more appropriate places in the file, near > related code or in a more logical position relative to surrounding > code. > * Fixed copyright dates for the new files. One said 2015, one said 2016. > * Removed unnecessary code from create_gather_merge_plan that tried to > handle an empty list of pathkeys (shouldn't happen). > * Make create_gather_merge_plan more consistent with > create_merge_append_plan. Remove make_gather_merge for the same > reason. > * Changed generate_gather_paths to generate gather merge paths. In > the previous coding, only the upper planner nodes ever tried to > generate gather merge nodes, but that seems unnecessarily limiting, > since it could be useful to generate a gathered path with pathkeys at > any point in the tree where we'd generate a gathered path with no > pathkeys. 
> * Rewrote generate_ordered_paths() logic to consider only the one > potentially-useful path not now covered by the new code in > generate_gather_paths(). > * Reverted changes in generate_distinct_paths(). I think we should > add something here but the existing logic definitely isn't right > considering the change to generate_gather_paths(). > * Assorted cosmetic cleanup in nodeGatherMerge.c. > * Documented the new GUC enable_gathermerge. > * Improved comments. Dropped one that seemed unnecessary. > * Fixed parts of the patch to be more pgindent-clean. > > Thanks Robert for hacking into this. > Testing this against the TPC-H queries at 10GB with > max_parallel_workers_per_gather = 4, seq_page_cost = 0.1, > random_page_cost = 0.1, work_mem = 64MB initially produced somewhat > demoralizing results. Only Q17, Q4, and Q8 picked Gather Merge, and > of those only Q17 got faster. Investigating this led to me realizing > that join costing for parallel joins is all messed up: see > https://www.postgresql.org/message-id/CA+TgmoYt2pyk2CTyvYCtFySXN= > jsorgh8_mjttlowu5qkjo...@mail.gmail.com > > With that patch applied, in my testing, Gather Merge got picked for > Q3, Q4, Q5, Q6, Q7, Q8, Q10, and Q17, but a lot of those queries get a > little slower instead of a little faster. Here are the timings -- > these are with EXPLAIN ANALYZE, so take them with a grain of salt -- > first number is without Gather Merge, second is with Gather Merge: > > Q3 16943.938 ms -> 18645.957 ms > Q4 3155.350 ms -> 4179.431 ms > Q5 13611.484 ms -> 13831.946 ms > Q6 9264.942 ms -> 8734.899 ms > Q7 9759.026 ms -> 10007.307 ms > Q8 2473.899 ms -> 2459.225 ms > Q10 13814.950 ms -> 12255.618 ms > Q17 49552.298 ms -> 46633.632 ms > > This is strange, I will re-run the test again and post the results soon. > I haven't really had time to dig into these results yet, so I'm not > sure how "real" these numbers are and how much is run-to-run jitter, > EXPLAIN ANALYZE distortion, or whatever. 
I think this overall concept > is good, because there should be cases where it's substantially > cheaper to preserve the order while gathering tuples from workers than > to re-sort afterwards. But this particular set of results is a bit > lackluster. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
On Wed, Nov 16, 2016 at 3:10 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Mon, Nov 14, 2016 at 3:51 PM, Thomas Munro < > thomas.mu...@enterprisedb.com> wrote: > >> On Sat, Nov 12, 2016 at 1:56 AM, Rushabh Lathia >> <rushabh.lat...@gmail.com> wrote: >> > On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro < >> thomas.mu...@enterprisedb.com> >> > wrote: >> >> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development >> Group >> >> + * Portions Copyright (c) 1994, Regents of the University of >> California >> >> >> >> Shouldn't this say just "(c) 2016, PostgreSQL Global Development >> >> Group"? >> > >> > Fixed. >> >> The year also needs updating to 2016 in nodeGatherMerge.h. >> > > Oops sorry, fixed now. > > >> >> >> + /* Per-tuple heap maintenance cost */ >> >> + run_cost += path->path.rows * comparison_cost * 2.0 * logN; >> >> >> >> Why multiply by two? The comment above this code says "about log2(N) >> >> comparisons to delete the top heap entry and another log2(N) >> >> comparisons to insert its successor". In fact gather_merge_getnext >> >> calls binaryheap_replace_first, which replaces the top element without >> >> any comparisons at all and then performs a sift-down in log2(N) >> >> comparisons to find its new position. There is no per-tuple "delete" >> >> involved. We "replace" the top element with the value it already had, >> >> just to trigger the sift-down, because we know that our comparator >> >> function might have a new opinion of the sort order of this element. >> >> Very clever! The comment and the 2.0 factor in cost_gather_merge seem >> >> to be wrong though -- or am I misreading the code? >> >> >> > See cost_merge_append. >> >> That just got tweaked in commit 34ca0905. >> > > Fixed. > > >> >> > Looking at the plan I realize that this is happening because wrong >> costing >> > for Gather Merge. Here in the plan we can see the row estimated by >> > Gather Merge is wrong. 
This is because earlier patch GM was considering >> > rows = subpath->rows, which is not true as subpath is partial path. So >> > we need to multiple it with number of worker. Attached patch also fixed >> > this issues. I also run the TPC-H benchmark with the patch and results >> > are same as earlier. >> >> In create_grouping_paths: >> + double total_groups = gmpath->rows * >> gmpath->parallel_workers; >> >> This hides a variable of the same name in the enclosing scope. Maybe >> confusing? >> >> In some other places like create_ordered_paths: >> + double total_groups = path->rows * path->parallel_workers; >> >> Though it probably made sense to use this variable name in >> create_grouping_paths, wouldn't total_rows be better here? >> >> > Initially I just copied from the other places. I agree with you that > create_ordered_paths - total_rows make more sense. > > >> It feels weird to be working back to a total row count estimate from >> the partial one by simply multiplying by path->parallel_workers. >> Gather Merge will underestimate the total rows when parallel_workers < >> 4, if using partial row estimates ultimately from cost_seqscan which >> assume some leader contribution. I don't have a better idea though. >> Reversing cost_seqscan's logic certainly doesn't seem right. I don't >> know how to make them agree on the leader's contribution AND give >> principled answers, since there seems to be some kind of cyclic >> dependency in the costing logic (cost_seqscan really needs to be given >> a leader contribution estimate from its superpath which knows whether >> it will allow the leader to pull tuples greedily/fairly or not, but >> that superpath hasn't been created yet; cost_gather_merge needs the >> row count from its subpath). Or maybe I'm just confused. >> >> > Yes, I agree with you. But we can't really do changes into cost_seqscan. 
> Another option I can think of is to just calculate the rows for gather merge > by using the reverse of the formula used in cost_seqscan. So > we can completely remove the rows argument from > create_gather_merge_path() > and then, inside create_gather_merge_path(), calculate the total_rows > using the same formula used in cost_seqscan. This is wor
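Thomas's observation above — that binaryheap_replace_first restores heap order with a single sift-down of about log2(N) comparisons, with no separate delete plus insert — can be illustrated with a minimal sketch. This is a hypothetical stand-alone min-heap, not PostgreSQL's binaryheap.c; the fixed capacity and int payload are simplifications for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal min-heap over ints; PostgreSQL's binaryheap.c is
 * more general but uses the same replace-top-and-sift-down trick. */
typedef struct
{
    int     vals[64];
    size_t  n;
    long    comparisons;    /* counts comparator calls, ~log2(N) per replace */
} MinHeap;

static void
sift_down(MinHeap *h, size_t i)
{
    for (;;)
    {
        size_t  l = 2 * i + 1,
                r = l + 1,
                smallest = i;

        if (l < h->n) { h->comparisons++; if (h->vals[l] < h->vals[smallest]) smallest = l; }
        if (r < h->n) { h->comparisons++; if (h->vals[r] < h->vals[smallest]) smallest = r; }
        if (smallest == i)
            break;
        int tmp = h->vals[i]; h->vals[i] = h->vals[smallest]; h->vals[smallest] = tmp;
        i = smallest;
    }
}

/* Replace the top element and restore the heap with one sift-down:
 * about log2(N) levels of comparisons, no delete + reinsert pair. */
static void
heap_replace_first(MinHeap *h, int newval)
{
    h->vals[0] = newval;
    sift_down(h, 0);
}

static int
heap_first(const MinHeap *h)
{
    return h->vals[0];
}

static void
heap_build(MinHeap *h, const int *src, size_t n)
{
    h->n = n;
    h->comparisons = 0;
    for (size_t i = 0; i < n; i++)
        h->vals[i] = src[i];
    for (size_t i = n / 2; i-- > 0; )   /* heapify internal nodes bottom-up */
        sift_down(h, i);
}
```

In Gather Merge terms, heap_replace_first models pulling the next tuple from the worker that supplied the previous minimum: the slot is overwritten in place and one sift-down re-establishes the order, which is why a 2.0 factor in the per-tuple maintenance cost looks too high.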
Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw
On Tue, Nov 22, 2016 at 6:24 PM, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp> wrote: > Hi Rushabh, > > On 2016/11/22 19:05, Rushabh Lathia wrote: > >> I started reviewing the patch and here are few initial review points and >> questions for you. >> > > Thanks for the review! > > 1) >> -static void deparseExplicitTargetList(List *tlist, List >> **retrieved_attrs, >> +static void deparseExplicitTargetList(bool is_returning, >> + List *tlist, >> + List **retrieved_attrs, >>deparse_expr_cxt *context); >> >> Any particular reason of inserting new argument as 1st argunment? In >> general >> we add new flags at the end or before the out param for the existing >> function. >> > > No, I don't. I think the *context should be the last argument as in other > functions in deparse.c. How about inserting that before the out param > **retrieved_attrs, like this? > > static void > deparseExplicitTargetList(List *tlist, > bool is_returning, > List **retrieved_attrs, > deparse_expr_cxt *context); > > Yes, adding it before retrieved_attrs would be good. > 2) >> -static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root, >> -RelOptInfo *joinrel, bool use_alias, List >> **params_list); >> +static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root, >> RelOptInfo *foreignrel, >> + bool use_alias, Index target_rel, List >> **params_list); >> > > Going beyond 80 character length >> > > Will fix. > > 3) >> static void >> -deparseExplicitTargetList(List *tlist, List **retrieved_attrs, >> +deparseExplicitTargetList(bool is_returning, >> + List *tlist, >> + List **retrieved_attrs, >>deparse_expr_cxt *context) >> >> This looks bit wired to me - as comment for the >> deparseExplicitTargetList() >> function name and comments says different then what its trying to do when >> is_returning flag is passed. Either function comment need to be change or >> may be separate function fo deparse returning list for join relation? 
>> >> Is it possible to teach deparseReturningList() to deparse the returning >> list for the joinrel? >> > > I don't think it's possible to do that because deparseReturningList uses > deparseTargetList internally, which only allows us to emit a target list > for a baserel. For example, deparseReturningList can't handle a returning > list that contains join outputs like this: > > postgres=# explain verbose delete from ft1 using ft2 where ft1.a = ft2.a > returning ft1.*, ft2.*; >QUERY PLAN > > > Delete on public.ft1 (cost=100.00..102.06 rows=1 width=46) >Output: ft1.a, ft1.b, ft2.a, ft2.b >-> Foreign Delete (cost=100.00..102.06 rows=1 width=46) > Remote SQL: DELETE FROM public.t1 r1 USING public.t2 r2 WHERE > ((r1.a = r2.a)) RETURNING r1.a, r1.b, r2.a, r2.b > (4 rows) > > I'd like to change the comment for deparseExplicitTargetList. > > 4) >> >> +if (foreignrel->reloptkind == RELOPT_JOINREL) >> +{ >> +List *rlist = NIL; >> + >> +/* Create a RETURNING list */ >> +rlist = make_explicit_returning_list(rtindex, rel, >> + returningList, >> + fdw_scan_tlist); >> + >> +deparseExplicitReturningList(rlist, retrieved_attrs, ); >> +} >> +else >> +deparseReturningList(buf, root, rtindex, rel, false, >> + returningList, retrieved_attrs); >> >> The code will be more centralized if we add the logic of extracting >> returninglist >> for JOINREL inside deparseReturningList() only, isn't it? >> > > You are right. Will do. > > 5) make_explicit_returning_list() pull the var list from the >> returningList and >> build the targetentry for the returning list and at the end rewrites the >> fdw_scan_tlist. >> >> AFAIK, in case of DML - which is going to pushdown to the remote server >> ideally fdw_scan_tlist should be same as returning list, as final output >> for the query is query will be RETUNING list only. isn't that
Re: [HACKERS] Declarative partitioning - another take
way down to > the comparison function. > > Also, we no longer have list_partition_for_tuple() and > range_partition_for_tuple(). Instead, in get_partition_for_tuple() > itself, there is a bsearch followed by list and range partitioning > specific steps based on the returned offset. > > > - I don't see why you need the bound->lower stuff any more. If > > rangeinfo.bounds[offset] is a lower bound for a partition, then > > rangeinfo.bounds[offset+1] is either (a) the upper bound for that > > partition and the partition is followed by a "gap" or (b) both the > > upper bound for that partition and the lower bound for the next > > partition. With the inclusive/exclusive bound stuff gone, every range > > bound has the same sense: if the probed value is <= the bound then > > we're supposed to be a lower-numbered partition, but if > then we're > > supposed to be in this partition or a higher-numbered one. > > OK, I've managed to get rid of lower. At least it is no longer kept in > the new relcache struct PartitionBoundInfo. It is still kept in > PartitionRangeBound which is used to hold individual range bounds when > sorting them (during relcache build). Comparisons invoked during the > aforementioned sorting step still need to distinguish between lower and > upper bounds (such that '1)' < '[1'). > > Tuple-routing no longer needs to look at lower. In that case, what you > described above applies. > > As a result, one change became necessary: to how we flag individual range > bound datum as infinite or not. Previously, it was a regular Boolean > value (either infinite or not) and to distinguish +infinity from > -infinity, we looked at whether the bound is lower or upper (the lower > flag). Now, instead, the variable holding the status of individual range > bound datum is set to a ternary value: RANGE_DATUM_FINITE (0), > RANGE_DATUM_NEG_INF (1), and RANGE_DATUM_POS_INF (2), which still fits in > a bool. 
Upon encountering an infinite range bound datum, whether it's > negative or positive infinity determines the comparison result. Consider the > following example: > > partition p1 from (1, unbounded) to (1, 1); > partition p2 from (1, 1) to (1, 10); > partition p3 from (1, 10) to (1, unbounded); > partition p4 from (2, unbounded) to (2, 1); > ... so on > > In this case, we need to be able to conclude, say, (1, -inf) < (1, 15) < > (1, +inf), so that tuple (1, 15) is assigned to the proper partition. > > Does this last thing sound reasonable? > > Thanks, > Amit > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia
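The ternary range-bound encoding described above can be sketched as a comparator of the kind a bsearch over sorted bounds would invoke. The enum names follow the mail; the comparator itself, its signature, and the use of plain ints in place of Datums are assumptions for illustration, not the patch's actual code.

```c
#include <assert.h>

/* Ternary status for each range bound datum, as described above. */
typedef enum
{
    RANGE_DATUM_FINITE = 0,
    RANGE_DATUM_NEG_INF = 1,
    RANGE_DATUM_POS_INF = 2
} RangeDatumContent;

/*
 * Hypothetical comparator: compare a tuple's column values against one
 * multi-column range bound.  Returns <0 if the tuple sorts before the
 * bound, >0 if after, 0 if equal on all columns.  An infinite bound
 * datum decides the result immediately, which is what makes
 * (1, -inf) < (1, 15) < (1, +inf) come out right.
 */
static int
compare_tuple_to_bound(const int *tuple,
                       const int *bound_vals,
                       const RangeDatumContent *bound_content,
                       int natts)
{
    for (int i = 0; i < natts; i++)
    {
        if (bound_content[i] == RANGE_DATUM_NEG_INF)
            return 1;           /* any finite value sorts after -inf */
        if (bound_content[i] == RANGE_DATUM_POS_INF)
            return -1;          /* any finite value sorts before +inf */
        if (tuple[i] != bound_vals[i])
            return tuple[i] < bound_vals[i] ? -1 : 1;
    }
    return 0;
}
```

For the example partitions above, tuple (1, 15) compares greater than p3's lower bound (1, 10) and less than its upper bound (1, +inf), so it lands in p3.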
Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw
I started reviewing the patch and here are a few initial review points and questions for you.

1)
-static void deparseExplicitTargetList(List *tlist, List **retrieved_attrs,
+static void deparseExplicitTargetList(bool is_returning,
+ List *tlist,
+ List **retrieved_attrs,
 deparse_expr_cxt *context);

Any particular reason for inserting the new argument as the 1st argument? In general we add new flags at the end, or before the out param of the existing function.

2)
-static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root,
-RelOptInfo *joinrel, bool use_alias, List **params_list);
+static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root, RelOptInfo *foreignrel,
+ bool use_alias, Index target_rel, List **params_list);

This goes beyond the 80-character line length.

3)
static void
-deparseExplicitTargetList(List *tlist, List **retrieved_attrs,
+deparseExplicitTargetList(bool is_returning,
+ List *tlist,
+ List **retrieved_attrs,
 deparse_expr_cxt *context)

This looks a bit weird to me - the name and comment of deparseExplicitTargetList() say something different from what it is trying to do when the is_returning flag is passed. Either the function comment needs to change, or maybe there should be a separate function to deparse the returning list for a join relation? Is it possible to teach deparseReturningList() to deparse the returning list for the joinrel?

4)
+if (foreignrel->reloptkind == RELOPT_JOINREL)
+{
+List *rlist = NIL;
+
+/* Create a RETURNING list */
+rlist = make_explicit_returning_list(rtindex, rel,
+ returningList,
+ fdw_scan_tlist);
+
+deparseExplicitReturningList(rlist, retrieved_attrs, );
+}
+else
+deparseReturningList(buf, root, rtindex, rel, false,
+ returningList, retrieved_attrs);

The code would be more centralized if we added the logic of extracting the returning list for a JOINREL inside deparseReturningList() itself, wouldn't it?
5) make_explicit_returning_list() pulls the var list from the returningList, builds the target entries for the returning list, and at the end rewrites the fdw_scan_tlist.

AFAIK, in the case of DML that is going to be pushed down to the remote server, the fdw_scan_tlist should ideally be the same as the returning list, since the final output of the query will be the RETURNING list only. Isn't that true? If yes, then why can't we directly replace the fdw_scan_tlist with the returning list, rather than using make_explicit_returning_list()? Also, I think rewriting the fdw_scan_tlist should happen in postgres_fdw.c rather than in deparse.c.

6) Test coverage for the returning list is missing.

On Fri, Nov 18, 2016 at 8:56 AM, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp> wrote: > On 2016/11/16 16:38, Etsuro Fujita wrote: > >> On 2016/11/16 13:10, Ashutosh Bapat wrote: >> >>> I don't see any reason why DML/UPDATE pushdown should depend upon >>> subquery deparsing or least PHV patch. Combined together they can help >>> in more cases, but without those patches, we will be able to push-down >>> more stuff. Probably, we should just restrict push-down only for the >>> cases when above patches are not needed. That makes reviews easy. Once >>> those patches get committed, we may add more functionality depending >>> upon the status of this patch. Does that make sense? >>> >> > OK, I'll extract from the patch the minimal part that wouldn't depend on >> the two patches. >> > > Here is a patch for that. Todo items are: (1) add more comments and (2) > add more regression tests. I'll do that in the next version. > > Best regards, > Etsuro Fujita > -- Rushabh Lathia
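Review point 4 above suggests folding the joinrel case into deparseReturningList() itself rather than branching at the call site. A toy sketch of that centralized dispatch follows; all names, the simplified signature, and the pre-formatted column lists are hypothetical stand-ins — the real postgres_fdw function deparses from parse trees, not strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for the planner's relation kinds. */
typedef enum
{
    RELOPT_BASEREL,
    RELOPT_JOINREL
} RelOptKind;

/*
 * Toy model of the suggested refactoring: one entry point that deparses
 * a RETURNING clause for both base relations and join relations.  For a
 * join, the RETURNING list may reference columns from both sides (e.g.
 * "r1.a, r1.b, r2.a, r2.b" in the DELETE ... USING example quoted in
 * this thread), so the join-specific list is chosen here rather than at
 * every call site.
 */
static void
deparse_returning_list(char *buf, size_t buflen, RelOptKind kind,
                       const char *base_cols, const char *join_cols)
{
    const char *cols = (kind == RELOPT_JOINREL) ? join_cols : base_cols;

    snprintf(buf, buflen, " RETURNING %s", cols);
}
```

The design point is simply that callers no longer need an `if (reloptkind == RELOPT_JOINREL)` branch; the distinction lives in one place.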
Re: [HACKERS] Gather Merge
On Mon, Nov 14, 2016 at 3:51 PM, Thomas Munro <thomas.mu...@enterprisedb.com > wrote: > On Sat, Nov 12, 2016 at 1:56 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro < > thomas.mu...@enterprisedb.com> > > wrote: > >> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development > Group > >> + * Portions Copyright (c) 1994, Regents of the University of California > >> > >> Shouldn't this say just "(c) 2016, PostgreSQL Global Development > >> Group"? > > > > Fixed. > > The year also needs updating to 2016 in nodeGatherMerge.h. > Oops sorry, fixed now. > > >> + /* Per-tuple heap maintenance cost */ > >> + run_cost += path->path.rows * comparison_cost * 2.0 * logN; > >> > >> Why multiply by two? The comment above this code says "about log2(N) > >> comparisons to delete the top heap entry and another log2(N) > >> comparisons to insert its successor". In fact gather_merge_getnext > >> calls binaryheap_replace_first, which replaces the top element without > >> any comparisons at all and then performs a sift-down in log2(N) > >> comparisons to find its new position. There is no per-tuple "delete" > >> involved. We "replace" the top element with the value it already had, > >> just to trigger the sift-down, because we know that our comparator > >> function might have a new opinion of the sort order of this element. > >> Very clever! The comment and the 2.0 factor in cost_gather_merge seem > >> to be wrong though -- or am I misreading the code? > >> > > See cost_merge_append. > > That just got tweaked in commit 34ca0905. > Fixed. > > > Looking at the plan I realize that this is happening because wrong > costing > > for Gather Merge. Here in the plan we can see the row estimated by > > Gather Merge is wrong. This is because earlier patch GM was considering > > rows = subpath->rows, which is not true as subpath is partial path. So > > we need to multiple it with number of worker. 
Attached patch also fixed > > this issues. I also run the TPC-H benchmark with the patch and results > > are same as earlier. > > In create_grouping_paths: > + double total_groups = gmpath->rows * > gmpath->parallel_workers; > > This hides a variable of the same name in the enclosing scope. Maybe > confusing? > > In some other places like create_ordered_paths: > + double total_groups = path->rows * path->parallel_workers; > > Though it probably made sense to use this variable name in > create_grouping_paths, wouldn't total_rows be better here? > > Initially I just copied from the other places. I agree with you that in create_ordered_paths, total_rows makes more sense. > It feels weird to be working back to a total row count estimate from > the partial one by simply multiplying by path->parallel_workers. > Gather Merge will underestimate the total rows when parallel_workers < > 4, if using partial row estimates ultimately from cost_seqscan which > assume some leader contribution. I don't have a better idea though. > Reversing cost_seqscan's logic certainly doesn't seem right. I don't > know how to make them agree on the leader's contribution AND give > principled answers, since there seems to be some kind of cyclic > dependency in the costing logic (cost_seqscan really needs to be given > a leader contribution estimate from its superpath which knows whether > it will allow the leader to pull tuples greedily/fairly or not, but > that superpath hasn't been created yet; cost_gather_merge needs the > row count from its subpath). Or maybe I'm just confused. > > Yes, I agree with you. But we can't really change cost_seqscan. Another option I can think of is to just calculate the rows for gather merge by using the reverse of the formula used in cost_seqscan. So we can completely remove the rows argument from create_gather_merge_path() and then, inside create_gather_merge_path(), calculate the total_rows using the same formula used in cost_seqscan.
This is working fine, but I am not quite sure about the approach, so I attached that part of the changes as a separate patch. Any suggestions? -- Rushabh Lathia www.EnterpriseDB.com gather_merge_v4_minor_changes.patch Description: application/download gm_v4_plus_rows_estimate.patch Description: application/download
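On the question above of working back from a partial row estimate, a hedged sketch of "reversing cost_seqscan's formula": cost_seqscan in that era divided the total row count by a parallel divisor — the worker count plus an estimated leader contribution of (1 - 0.3 × workers), clamped at zero — so the reverse multiplies the partial estimate by that same divisor rather than by parallel_workers alone. The function names here are illustrative, not the patch's.

```c
#include <assert.h>
#include <math.h>

/*
 * Sketch of the parallel divisor used by cost_seqscan (circa 9.6): the
 * number of workers plus the leader's estimated contribution, where the
 * leader is assumed to spend 30% of its time servicing each worker and
 * so contributes nothing once there are 4 or more workers.
 */
static double
parallel_divisor(int parallel_workers)
{
    double divisor = parallel_workers;
    double leader_contribution = 1.0 - 0.3 * parallel_workers;

    if (leader_contribution > 0)
        divisor += leader_contribution;
    return divisor;
}

/*
 * Recover a total row count from a per-process partial estimate by
 * multiplying by the same divisor cost_seqscan divided by, instead of
 * multiplying by parallel_workers alone.
 */
static double
total_rows_from_partial(double partial_rows, int parallel_workers)
{
    return partial_rows * parallel_divisor(parallel_workers);
}
```

This also explains the remark earlier in the thread that multiplying by parallel_workers underestimates the total when parallel_workers < 4: in that range the divisor exceeds the worker count because the leader still contributes.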
Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw
Thanks Fujita-san for working on this. I've signed up to review this patch. Your latest patch does not apply cleanly on the master branch:

patching file contrib/postgres_fdw/deparse.c
6 out of 17 hunks FAILED -- saving rejects to file contrib/postgres_fdw/deparse.c.rej
patching file contrib/postgres_fdw/expected/postgres_fdw.out
2 out of 2 hunks FAILED -- saving rejects to file contrib/postgres_fdw/expected/postgres_fdw.out.rej
patching file contrib/postgres_fdw/postgres_fdw.c
2 out of 22 hunks FAILED -- saving rejects to file contrib/postgres_fdw/postgres_fdw.c.rej
patching file contrib/postgres_fdw/postgres_fdw.h
1 out of 3 hunks FAILED -- saving rejects to file contrib/postgres_fdw/postgres_fdw.h.rej

Please share a patch that applies cleanly on the master branch.

On Fri, Nov 11, 2016 at 5:00 PM, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp> wrote: > On 2016/09/08 19:55, Etsuro Fujita wrote: > >> On 2016/09/07 13:21, Ashutosh Bapat wrote: >> >>> * with the patch: >>> postgres=# explain verbose delete from ft1 using ft2 where ft1.a = >>> ft2.a; >>> QUERY PLAN >>> >>> >>> - >>> >>> Delete on public.ft1 (cost=100.00..102.04 rows=1 width=38) >>>-> Foreign Delete (cost=100.00..102.04 rows=1 width=38) >>> Remote SQL: DELETE FROM public.t1 r1 USING (SELECT ROW(a, >>> b), a FROM public.t2) ss1(c1, c2) WHERE ((r1.a = ss1.c2)) >>> (3 rows) >>> >> >> The underlying scan on t2 requires ROW(a,b) for locking the row for >>> update/share. But clearly it's not required if the full query is being >>> pushed down. >>> >> > Is there a way we can detect that ROW(a,b) is useless >>> column (not used anywhere in the other parts of the query like >>> RETURNING, DELETE clause etc.) and eliminate it? >>> >> > I don't have a clear solution for that yet, but I'll try to remove that >> in the next version. >> > > Similarly for a, it's >>> part of the targetlist of the underlying scan so that the WHERE clause >>> can be applied on it.
But it's not needed if we are pushing down the >>> query. If we eliminate the targetlist of the query, we could construct a >>> remote query without having subquery in it, making it more readable. >>> >> > Will try to do so also. >> > > I addressed this by improving the deparse logic so that a remote query for > performing an UPDATE/DELETE on a join directly on the remote can be created > as proposed if possible. Attached is an updated version of the patch, > which is created on top of the patch set [1]. The patch is still WIP (ie, > needs more comments and regression tests, at least), but any comments would > be gratefully appreciated. > > Best regards, > Etsuro Fujita > > [1] https://www.postgresql.org/message-id/11eafd10-d3f8-ac8a-b64 > 2-b0e65037c76b%40lab.ntt.co.jp > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
Oops forgot to attach latest patch in the earlier mail. On Fri, Nov 11, 2016 at 6:26 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro < > thomas.mu...@enterprisedb.com> wrote: > >> On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia >> <rushabh.lat...@gmail.com> wrote: >> > Please find attached latest patch which fix the review point as well as >> > additional clean-up. >> >> I've signed up to review this patch and I'm planning to do some >> testing. Here's some initial feedback after a quick read-through: >> >> > Thanks Thomas. > > >> + if (gather_merge_readnext(gm_state, i, initialize ? false : true)) >> >> Clunky ternary operator... how about "!initialize". >> >> > Fixed. > > >> +/* >> + * Function clear out a slot in the tuple table for each gather merge >> + * slots and returns the clear clear slot. >> + */ >> >> Maybe better like this? "_Clear_ out a slot in the tuple table for >> each gather merge _slot_ and _return_ the _cleared_ slot." >> >> > Fixed. > > >> + /* Free tuple array as we no more need it */ >> >> "... as we don't need it any more" >> >> > Fixed > > >> +/* >> + * Read the next tuple for gather merge. >> + * >> + * Function fetch the sorted tuple out of the heap. >> + */ >> >> "_Fetch_ the sorted tuple out of the heap." >> >> > Fixed > > >> + * Otherwise, pull the next tuple from whichever participate we >> + * returned from last time, and reinsert the index into the heap, >> + * because it might now compare differently against the existing >> >> s/participate/participant/ >> >> > Fixed. > > >> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group >> + * Portions Copyright (c) 1994, Regents of the University of California >> >> Shouldn't this say just "(c) 2016, PostgreSQL Global Development >> Group"? > > > Fixed. > > >> Are we supposed to be blaming the University of California >> for new files? >> >> > Not quite sure about this, so keeping this as it is. 
> > >> +#include "executor/tqueue.h" >> +#include "miscadmin.h" >> +#include "utils/memutils.h" >> +#include "utils/rel.h" >> +#include "lib/binaryheap.h" >> >> Not correctly sorted. >> >> > Copied from nodeGather.c. But Fixed here. > > >> + /* >> + * store the tuple descriptor into gather merge state, so we can use it >> + * later while initilizing the gather merge slots. >> + */ >> >> s/initilizing/initializing/ >> >> > Fixed. > > >> +/* >> + * ExecEndGatherMerge >> + * >> + * frees any storage allocated through C routines. >> + * >> >> The convention in Postgres code seems to be to use a form like "Free >> any storage ..." in function documentation. Not sure if that's an >> imperative, an infinitive, or if the word "we" is omitted since >> English is so fuzzy like that, but it's inconsistent with other >> documentation to use "frees" here. Oh, I see that exact wording is in >> several other files. I guess I'll just leave this as a complaint >> about all those files then :-) >> >> > Sure. > > >> + * Pull atleast single tuple from each worker + leader and set up the >> heap. >> >> s/atleast single/at least a single/ >> >> > Fixed. > > >> + * Read the tuple for given reader into nowait mode, and form the tuple >> array. >> >> s/ into / in / >> >> > Fixed. > > >> + * Function attempt to read tuple for the given reader and store it into >> reader >> >> s/Function attempt /Attempt / >> >> > Fixed. > > >> + * Function returns true if found tuple for the reader, otherwise returns >> >> s/Function returns /Return / >> >> > Fixed. > > >> + * First try to read tuple for each worker (including leader) into nowait >> + * mode, so that we initialize read from each worker as well as leader. >> >> I wonder if it would be good to standardise on the terminology we use >> when we mean workers AND the leader. In my Parallel Shared Hash work, >> I've been saying '
Re: [HACKERS] Gather Merge
On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote: > On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Please find attached latest patch which fix the review point as well as > > additional clean-up. > > I've signed up to review this patch and I'm planning to do some > testing. Here's some initial feedback after a quick read-through: > > Thanks Thomas. > + if (gather_merge_readnext(gm_state, i, initialize ? false : true)) > > Clunky ternary operator... how about "!initialize". > > Fixed. > +/* > + * Function clear out a slot in the tuple table for each gather merge > + * slots and returns the clear clear slot. > + */ > > Maybe better like this? "_Clear_ out a slot in the tuple table for > each gather merge _slot_ and _return_ the _cleared_ slot." > > Fixed. > + /* Free tuple array as we no more need it */ > > "... as we don't need it any more" > > Fixed > +/* > + * Read the next tuple for gather merge. > + * > + * Function fetch the sorted tuple out of the heap. > + */ > > "_Fetch_ the sorted tuple out of the heap." > > Fixed > + * Otherwise, pull the next tuple from whichever participate we > + * returned from last time, and reinsert the index into the heap, > + * because it might now compare differently against the existing > > s/participate/participant/ > > Fixed. > + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group > + * Portions Copyright (c) 1994, Regents of the University of California > > Shouldn't this say just "(c) 2016, PostgreSQL Global Development > Group"? Fixed. > Are we supposed to be blaming the University of California > for new files? > > Not quite sure about this, so keeping this as it is. > +#include "executor/tqueue.h" > +#include "miscadmin.h" > +#include "utils/memutils.h" > +#include "utils/rel.h" > +#include "lib/binaryheap.h" > > Not correctly sorted. > > Copied from nodeGather.c. But Fixed here. 
> + /* > + * store the tuple descriptor into gather merge state, so we can use it > + * later while initilizing the gather merge slots. > + */ > > s/initilizing/initializing/ > > Fixed. > +/* > + * ExecEndGatherMerge > + * > + * frees any storage allocated through C routines. > + * > > The convention in Postgres code seems to be to use a form like "Free > any storage ..." in function documentation. Not sure if that's an > imperative, an infinitive, or if the word "we" is omitted since > English is so fuzzy like that, but it's inconsistent with other > documentation to use "frees" here. Oh, I see that exact wording is in > several other files. I guess I'll just leave this as a complaint > about all those files then :-) > > Sure. > + * Pull atleast single tuple from each worker + leader and set up the > heap. > > s/atleast single/at least a single/ > > Fixed. > + * Read the tuple for given reader into nowait mode, and form the tuple > array. > > s/ into / in / > > Fixed. > + * Function attempt to read tuple for the given reader and store it into > reader > > s/Function attempt /Attempt / > > Fixed. > + * Function returns true if found tuple for the reader, otherwise returns > > s/Function returns /Return / > > Fixed. > + * First try to read tuple for each worker (including leader) into nowait > + * mode, so that we initialize read from each worker as well as leader. > > I wonder if it would be good to standardise on the terminology we use > when we mean workers AND the leader. In my Parallel Shared Hash work, > I've been saying 'participants' if I mean = workers + leader. What do > you think? > > I am not quite sure about 'participants'. In my opinion, when we explicitly say workers + leader it's clearer. I am open to changing it if the committer thinks otherwise. > + * After this if all active worker unable to produce the tuple, then > + * re-read and this time read the tuple into wait mode. 
For the worker, > + * which was able to produced single tuple in the earlier loop and still > + * active, just try fill the tuple array if more tuples available. > + */ > > How about this? "After this, if all active workers are unable to > produce a tuple, then re-read and this time us wait mode. For workers > that were able to produce a tuple in the earlier loop and are still > active, just try to fill the tuple arra
Re: [HACKERS] Gather Merge
Please find attached the latest patch, which fixes the review points as well as some additional clean-up. - Get rid of funnel_slot as it's not needed for the Gather Merge - Renamed fill_tuple_array to form_tuple_array - Fix possible infinite loop in gather_merge_init (reported by Amit Kapila) - Fix tqueue.c memory leak by calling TupleQueueReaderNext() with a per-tuple context. - Code cleanup in ExecGatherMerge. - Added new function gather_merge_clear_slots(), which clears out all gather merge slots and also frees the tuple array at end of execution. I did the performance testing again with the latest patch and I haven't observed any regression. Some of the TPC-H queries show additional benefit with the latest patch, but it's just under 5%. Do let me know if I missed anything. On Mon, Oct 24, 2016 at 11:55 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Thu, Oct 20, 2016 at 1:12 PM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > >> On Tue, Oct 18, 2016 at 5:29 PM, Rushabh Lathia >> <rushabh.lat...@gmail.com> wrote: >> > On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit.kapil...@gmail.com> >> > wrote: >> >> >> >> There is lot of common code between ExecGatherMerge and ExecGather. >> >> Do you think it makes sense to have a common function to avoid the >> >> duplicity? >> >> >> >> I see there are small discrepancies in both the codes like I don't see >> >> the use of single_copy flag, as it is present in gather node. >> >> >> > >> > Yes, even I thought to centrilize some things of ExecGather and >> > ExecGatherMerge, >> > but its really not something that is fixed. And I thought it might >> change >> > particularly >> > for the Gather Merge. And as explained by Robert single_copy doesn't >> make >> > sense >> > for the Gather Merge. I will still look into this to see if something >> can be >> > make >> > centralize. >> > >> >> Okay, I haven't thought about it, but do let me know if you couldn't >> find any way to merge the code. >> >> > Sure, I will look into this. 
> > >> >> >> >> 3. >> >> +gather_merge_readnext(GatherMergeState * gm_state, int reader, bool >> >> force) >> >> { >> >> .. >> >> + tup = gm_readnext_tuple(gm_state, reader, force, NULL); >> >> + >> >> + /* >> >> + >> >> * try to read more tuple into nowait mode and store it into the tuple >> >> + * array. >> >> + >> >> */ >> >> + if (HeapTupleIsValid(tup)) >> >> + fill_tuple_array(gm_state, reader); >> >> >> >> How the above read tuple is stored in array? In anycase the above >> >> interface seems slightly awkward to me. Basically, I think what you >> >> are trying to do here is after reading first tuple in waitmode, you >> >> are trying to fill the array by reading more tuples. So, can't we >> >> push reading of this fist tuple into that function and name it as >> >> form_tuple_array(). >> >> >> > >> > Yes, you are right. >> > >> >> You have not answered my first question. I will try to ask again, how >> the tuple read by below code is stored in the array: >> >> > Tuple directly get stored into related TupleTableSlot. > In gather_merge_readnext() > at the end of function it build the TupleTableSlot for the given tuple. So > tuple > read by above code is directly stored into TupleTableSlot. > > >> + tup = gm_readnext_tuple(gm_state, reader, force, NULL); >> >> > First its trying to read tuple into wait-mode, and once >> > it >> > find tuple then its try to fill the tuple array (which basically try to >> read >> > tuple >> > into nowait-mode). Reason I keep it separate is because in case of >> > initializing >> > the gather merge, if we unable to read tuple from all the worker - while >> > trying >> > re-read, we again try to fill the tuple array for the worker who already >> > produced >> > atleast a single tuple (see gather_merge_init() for more details). >> > >> >> Whenever any worker produced one tuple, you already try to fill the >> array in gather_merge_readnext(), so does the above code help much? 
>> >> > Also I >> > thought >> > filling tuple array (which basically read tuple into nowait mode) and >> > reading tuple >> > (into wait-mode) are two separate t
Re: [HACKERS] Gather Merge
On Thu, Oct 20, 2016 at 1:12 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Tue, Oct 18, 2016 at 5:29 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit.kapil...@gmail.com> > > wrote: > >> > >> There is lot of common code between ExecGatherMerge and ExecGather. > >> Do you think it makes sense to have a common function to avoid the > >> duplicity? > >> > >> I see there are small discrepancies in both the codes like I don't see > >> the use of single_copy flag, as it is present in gather node. > >> > > > > Yes, even I thought to centrilize some things of ExecGather and > > ExecGatherMerge, > > but its really not something that is fixed. And I thought it might change > > particularly > > for the Gather Merge. And as explained by Robert single_copy doesn't make > > sense > > for the Gather Merge. I will still look into this to see if something > can be > > make > > centralize. > > > > Okay, I haven't thought about it, but do let me know if you couldn't > find any way to merge the code. > > Sure, I will look into this. > >> > >> 3. > >> +gather_merge_readnext(GatherMergeState * gm_state, int reader, bool > >> force) > >> { > >> .. > >> + tup = gm_readnext_tuple(gm_state, reader, force, NULL); > >> + > >> + /* > >> + > >> * try to read more tuple into nowait mode and store it into the tuple > >> + * array. > >> + > >> */ > >> + if (HeapTupleIsValid(tup)) > >> + fill_tuple_array(gm_state, reader); > >> > >> How the above read tuple is stored in array? In anycase the above > >> interface seems slightly awkward to me. Basically, I think what you > >> are trying to do here is after reading first tuple in waitmode, you > >> are trying to fill the array by reading more tuples. So, can't we > >> push reading of this fist tuple into that function and name it as > >> form_tuple_array(). > >> > > > > Yes, you are right. > > > > You have not answered my first question. 
I will try to ask again, how > the tuple read by below code is stored in the array: > > The tuple gets stored directly into the related TupleTableSlot. At the end of gather_merge_readnext() the function builds the TupleTableSlot for the given tuple, so the tuple read by the above code is stored directly into that TupleTableSlot. >> + tup = gm_readnext_tuple(gm_state, reader, force, NULL); > > > First its trying to read tuple into wait-mode, and once > > it > > find tuple then its try to fill the tuple array (which basically try to > read > > tuple > > into nowait-mode). Reason I keep it separate is because in case of > > initializing > > the gather merge, if we unable to read tuple from all the worker - while > > trying > > re-read, we again try to fill the tuple array for the worker who already > > produced > > atleast a single tuple (see gather_merge_init() for more details). > > > > Whenever any worker produced one tuple, you already try to fill the > array in gather_merge_readnext(), so does the above code help much? > > > Also I > > thought > > filling tuple array (which basically read tuple into nowait mode) and > > reading tuple > > (into wait-mode) are two separate task - and if its into separate > function > > that code > > look more clear. > > To me that looks slightly confusing. > > > If you have any suggestion for the function name > > (fill_tuple_array) > > then I am open to change that. > > > > form_tuple_array (form_tuple is used at many places in code, so it > should look okay). OK, I will rename it in the next patch. > I think you might want to consider forming array > even for leader, although it might not be as beneficial as for > workers, OTOH, the code will get simplified if we do that way. > Yes, I did that earlier - and, as you guessed, it wasn't beneficial, so to avoid the extra memory allocation for the tuple array I am not forming the array for the leader. Today, I observed another issue in code: > > +gather_merge_init(GatherMergeState *gm_state) > { > .. 
> +reread: > + for (i = 0; i < nreaders + 1; i++) > + { > + if (TupIsNull(gm_state->gm_slots[i]) || > + gm_state->gm_slots[i]->tts_isempty) > + { > + if (gather_merge_readnext(gm_state, i, initialize ? false : true)) > + { > + binaryheap_add_unordered(gm_state->gm_heap, > + Int32GetDatum(i)); > + } > + } > + else > + fill_tuple_array
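The heap-driven merge that the quoted gather_merge_init code sets up can be modeled with Python's heapq. This is a sketch of the general technique only — the patch uses PostgreSQL's binaryheap with a comparator over the node's sort keys — assuming each participant delivers its tuples already locally sorted: seed the heap with one tuple per participant, then repeatedly pop the smallest and reinsert that participant's next tuple.

```python
# Model of the Gather Merge loop (assumption: simplified stand-in for the
# C binaryheap logic; tuples are plain comparable values here).
import heapq

def gather_merge(streams):
    """streams: one iterator of locally sorted tuples per participant."""
    its = [iter(s) for s in streams]
    heap = []
    # Seed: at least one tuple from every participant that has any.
    for i, it in enumerate(its):
        t = next(it, None)
        if t is not None:
            heapq.heappush(heap, (t, i))
    while heap:
        t, i = heapq.heappop(heap)
        yield t
        # Reinsert this participant's next tuple, because it might now
        # compare differently against the remaining heap entries.
        nxt = next(its[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))

print(list(gather_merge([[1, 4, 7], [2, 5], [3, 6, 8]])))
# -> [1, 2, 3, 4, 5, 6, 7, 8]
```

The output is globally sorted without any participant having to see another participant's tuples, which is exactly what lets Gather Merge skip a top-level Sort.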
Re: [HACKERS] Gather Merge
On Thu, Oct 20, 2016 at 12:22 AM, Peter Geoghegan <p...@heroku.com> wrote: > On Tue, Oct 4, 2016 at 11:05 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Query 4: With GM 7901.480 -> Without GM 9064.776 > > Query 5: With GM 53452.126 -> Without GM 55059.511 > > Query 9: With GM 52613.132 -> Without GM 98206.793 > > Query 15: With GM 68051.058 -> Without GM 68918.378 > > Query 17: With GM 129236.075 -> Without GM 160451.094 > > Query 20: With GM 259144.232 -> Without GM 306256.322 > > Query 21: With GM 153483.497 -> Without GM 168169.916 > > > > Here from the results we can see that query 9, 17 and 20 are the one > which > > show good performance benefit with the Gather Merge. > > Were all other TPC-H queries unaffected? IOW, did they have the same > plan as before with your patch applied? Did you see any regressions? > > Yes, all other TPC-H queries were unaffected by the patch. At the initial stage of patch development I noticed regressions, but then realized that it was because I was not allowing the leader to participate in the GM. Later on I fixed that, and after that I didn't notice any regressions. I assume that this patch has each worker use work_mem for its own > sort, as with hash joins today. One concern with that model when > testing is that you could end up with a bunch of internal sorts for > cases with a GM node, where you get one big external sort for cases > without one. Did you take that into consideration? > > Yes, but isn't that good? Please correct me if I am missing anything. > -- > Peter Geoghegan > -- Rushabh Lathia www.EnterpriseDB.com
Re: [HACKERS] Gather Merge
Thanks, Amit, for reviewing this patch. On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Wed, Oct 5, 2016 at 11:35 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Hi hackers, > > > > Attached is the patch to implement Gather Merge. > > > > Couple of review comments: > > 1. > ExecGatherMerge() > { > .. > + /* No workers? Then never mind. */ > + if (!got_any_worker > || > + node->nreaders < 2) > + { > + > ExecShutdownGatherMergeWorkers(node); > + node->nreaders = 0; > + > } > > Are you planning to restrict the use of gather merge based on number > of workers, if there is a valid reason, then I think comments should > be updated for same? > > Thanks for catching this. This is left over from the earlier design patch. With the current design we don't have any limitation on the number of workers. I did performance testing with max_parallel_workers_per_gather set to 1 and didn't notice any performance degradation. So I removed this limitation in the attached patch. 2. > +ExecGatherMerge(GatherMergeState * node){ > .. > + if (!node->initialized) > + { > + EState *estate = node->ps.state; > + > GatherMerge *gm = (GatherMerge *) node->ps.plan; > + > + /* > + * Sometimes we > might have to run without parallelism; but if parallel > + * mode is active then we can try to > fire up some workers. > + */ > + if (gm->num_workers > 0 && IsInParallelMode()) > + > { > + ParallelContext *pcxt; > + bool got_any_worker = > false; > + > + /* Initialize the workers required to execute Gather node. */ > + > if (!node->pei) > + node->pei = ExecInitParallelPlan(node- > >ps.lefttree, > + > estate, > + > gm->num_workers); > .. > } > > There is lot of common code between ExecGatherMerge and ExecGather. > Do you think it makes sense to have a common function to avoid the > duplicity? > > I see there are small discrepancies in both the codes like I don't see > the use of single_copy flag, as it is present in gather node. 
> > Yes, even I thought to centralize some things of ExecGather and ExecGatherMerge, but it's really not something that is fixed yet. And I thought it might change, particularly for the Gather Merge. And, as explained by Robert, single_copy doesn't make sense for the Gather Merge. I will still look into this to see if something can be made common. > 3. > +gather_merge_readnext(GatherMergeState * gm_state, int reader, bool > force) > { > .. > + tup = gm_readnext_tuple(gm_state, reader, force, NULL); > + > + /* > + > * try to read more tuple into nowait mode and store it into the tuple > + * array. > + > */ > + if (HeapTupleIsValid(tup)) > + fill_tuple_array(gm_state, reader); > > How the above read tuple is stored in array? In anycase the above > interface seems slightly awkward to me. Basically, I think what you > are trying to do here is after reading first tuple in waitmode, you > are trying to fill the array by reading more tuples. So, can't we > push reading of this fist tuple into that function and name it as > form_tuple_array(). > > Yes, you are right. First it tries to read a tuple in wait mode, and once it finds a tuple it tries to fill the tuple array (which basically tries to read tuples in nowait mode). The reason I keep it separate is that, when initializing the gather merge, if we are unable to read a tuple from every worker, then while re-reading we again try to fill the tuple array for each worker that has already produced at least a single tuple (see gather_merge_init() for more details). Also, I thought filling the tuple array (which basically reads tuples in nowait mode) and reading a tuple (in wait mode) are two separate tasks, and with them in separate functions the code looks clearer. If you have any suggestion for the function name (fill_tuple_array) then I am open to changing it. > 4. > +create_gather_merge_path(..) > { > .. > + 0 /* FIXME: pathnode->limit_tuples*/); > > What exactly you want to fix in above code? > > Fixed. > 5. 
> +/* Tuple array size */ > +#define MAX_TUPLE_STORE 10 > > Have you tried with other values of MAX_TUPLE_STORE? If yes, then > what are the results? I think it is better to add a comment why array > size is best for performance. > > Actually, I was thinking about that, but I didn't want to add it there because it's just a performance number on my machine. Anyway, I added the comments. > > 6. > +/* INTERFACE ROUTINES > + * ExecInitGatherMerge - initialize the MergeAppend > node > + * ExecGatherMerge - retrieve the next tuple from the node > + * > ExecEndGatherMerge - shut down the MergeAppend nod
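As a back-of-the-envelope illustration of why the MAX_TUPLE_STORE tuple array discussed above helps (the numbers here are illustrative only, not measurements from the thread): batching up to MAX_TUPLE_STORE tuples per visit to a worker's shared queue divides the number of queue interactions by roughly the array size.

```python
# Toy cost model (assumption: illustrative arithmetic only; real queue
# costs also include synchronization, which is what batching amortizes).

MAX_TUPLE_STORE = 10  # the array size used in the patch under discussion

def queue_visits(ntuples, batch_size):
    """How many times the leader must touch a worker's shared queue to
    drain ntuples, reading up to batch_size tuples per visit."""
    return -(-ntuples // batch_size)  # ceiling division

print(queue_visits(1_000_000, 1))                # one visit per tuple
print(queue_visits(1_000_000, MAX_TUPLE_STORE))  # ~10x fewer visits
```

This is why the reviewers' question about trying other array sizes matters: the win grows with the array size but with diminishing returns, and the best value is workload- and machine-dependent.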
Re: [HACKERS] Showing parallel status in \df+
On Thu, Sep 29, 2016 at 12:07 AM, Pavel Stehule <pavel.steh...@gmail.com> wrote: > Hi > > 2016-09-28 18:57 GMT+02:00 Tom Lane <t...@sss.pgh.pa.us>: > >> Pavel Stehule <pavel.steh...@gmail.com> writes: >> > 2016-09-28 16:03 GMT+02:00 Tom Lane <t...@sss.pgh.pa.us>: >> >> I propose to push my current patch (ie, move PL function >> >> source code to \df+ footers), and we can use it in HEAD for awhile >> >> and see what we think. We can alway improve or revert it later. >> >> > I had some objection to format of source code - it should be full source >> > code, not just header and body. >> >> That would be redundant with stuff that's in the main part of the \df >> display. I really don't need to see the argument types twice, for >> instance. >> > > I am sorry, I disagree. Proposed form is hard readable. Is not possible to > simply copy/paste. > > I cannot to imagine any use case for proposed format. > > I just did testing on Tom's patch - which shows PL source code as a footer (show-pl-source-code-as-a-footer.patch). I am sorry, but I agree with Pavel: it is hard to read - and it's not adding any simplification over what we have now. Pavel Stehule <pavel.steh...@gmail.com> writes: > We are in cycle because prosrc field is used for two independent features - > and then it can be hard to find a agreement. > I thought pretty much everyone was on board with the idea of keeping > prosrc in \df+ for internal/C-language functions (and then probably > renaming the column, since it isn't actually source code in that case). >The argument is over what to do for PL functions, which is only one use > case not two Thinking more, I am fine with keeping prosrc in \df+ for internal/C-language functions (with a changed column name), and then \sf will be used to get the source code for PL and SQL functions. Regards > > Pavel > > >> >> regards, tom lane >> > > -- Rushabh Lathia www.EnterpriseDB.com
Re: [HACKERS] Showing parallel status in \df+
On Mon, Sep 26, 2016 at 3:06 PM, Stephen Frost <sfr...@snowman.net> wrote: > I feel like we're getting wrapped around the axle as it regards who is > perceived to be voting for what. Thanks Stephen Frost for listing down all the concerns from the people on the different approaches. On Tue, Sep 27, 2016 at 7:56 PM, Stephen Frost <sfr...@snowman.net> wrote: > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > I think the debate is more about whether moving the source display > > functionality over to \sf is a better solution than rearranging \df+ > > output. (If we had consensus to do that, I'd be happy to go code it, > > but I'm not going to invest the effort when it seems like we don't.) > > Right, that's the main question. > > > If we'd had \sf all along, I think it's likely that we would never > > have put source-code display into \df. But of course we didn't, > > Indeed. > > > and what would have been best in a green field is not necessarily > > what's best or achievable given existing reality. Both Robert and > > Peter have put forward the argument that people are used to finding > > this info in \df+ output, and I think that deserves a whole lot of > > weight. The \sf solution might be cleaner, but it's not so much > > better that it can justify forcing people to relearn their habits. > > > > So I think that rearranging \df+ output is really what we ought to > > be doing here. > > Alright, given that Robert's made it clear what his preference is and > you're in agreement with that, I'll remove my objection to moving down > that path. I agree that it's better than the current situation. If we > do end up improving \sf (which seems like a good idea, in general), then > we may wish to consider a display option to control if the source is > included in \df+ or not, but that doesn't need to bar this patch from > going in. 
> > The earlier comments on the thread hadn't been as clear with regard to > who held what opinions regarding the options and I'm glad that we were > able to reach a point where it was much clearer that there was strong > support for keeping the source in \df+. > > > I'm not necessarily wedded to any of the precise details of what I did > > in my patch --- for instance, maybe function bodies ought to be indented > > one tab stop? But we've not gotten to the merits of such points, for > > lack of agreement about whether this is the basic approach to take. > > As for this, I wouldn't indent or change the source at all. For > starters, indentation actually matters for some PLs, and I can certainly > see people wanting to be able to copy/paste from the output, now that > it'll be possible to reasonably do from the \df+ output. > Yes, it seems like "source code" as part of the \df+ output (irrespective of language) is still very much useful for people - it makes sense not to change it at all. I also agree with Stephen's view that once we do end up improving \sf, we may reconsider removing source code from the \df+ output. For now we should stick with the goal of a thread that started out being about showing parallel status in \df+ output. > Thanks! > > Stephen > -- Rushabh Lathia www.EnterpriseDB.com
Re: [HACKERS] PoC: Make it possible to disallow WHERE-less UPDATE and DELETE
On Mon, Sep 26, 2016 at 5:48 AM, Thomas Munro <thomas.mu...@enterprisedb.com > wrote: > On Mon, Sep 19, 2016 at 4:02 PM, David Fetter <da...@fetter.org> wrote: > > > > [training_wheels_004.patch] > > openjade:filelist.sgml:144:16:E: character "_" invalid: only parameter > literal, "CDATA", "ENDTAG", "MD", "MS", "PI", "PUBLIC", "SDATA", > "STARTTAG", "SYSTEM" and parameter separators allowed > openjade:contrib.sgml:138:2:W: cannot generate system identifier for > general entity "require" > > The documentation doesn't build here, I think because require_where is > not an acceptable entity name. It works for me if I change the > underscore to a minus in various places. That fixes these errors: > > + > + Here is an example showing how to set up a database cluster with > + require_where. > + > +$ psql -U postgres > +# SHOW shared_preload_libraries; /* Make sure not to clobber > something by accident */ > + > +If you found something, > +# ALTER SYSTEM SET > shared_preload_libraries='the,stuff,you,found,require_where'; > + > +Otherwise, > +# ALTER SYSTEM SET shared_preload_libraries='require_where'; > + > +Then restart PostgreSQL > + > + > > Could use a full stop (period) on the end of that sentence. Also it > shouldn't be inside the "screen" tags. Maybe "If you found > something," and "Otherwise," shouldn't be either, or should somehow be > marked up so as not to appear to be text from the session. > > postgres=# delete from foo; > ERROR: DELETE requires a WHERE clause > HINT: To delete all rows, use "WHERE true" or similar. > > Maybe one of those messages could use some indication of where this is > coming from, for surprised users encountering this non-standard > behaviour for the first time? > > +1. I think the hint message should be clearer about where it's coming from. Maybe it can specify the GUC name, or suggest that if you really want to use DELETE without a WHERE clause you should turn off this GUC - or something along those lines. 
> FWIW I saw something similar enforced globally by the DBA team at a > large company with many database users. I think experienced users > probably initially felt mollycoddled when they first encountered the > error but I'm sure that some were secretly glad of its existence from > time to time... I think it's a useful feature for users who want it, > and a nice little demonstration of how extensible Postgres is. > > -- > Thomas Munro > http://www.enterprisedb.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Rushabh Lathia
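The hint improvement suggested above — naming the GUC so surprised users know where the error comes from — can be illustrated with a toy model. The real require_where is a C extension hooked into PostgreSQL's parse analysis; this Python sketch only shows the shape of the error and hint text, and both the setting name "require_where.delete" and the regex-based SQL matching are illustrative assumptions, not the actual module's behavior.

```python
# Toy model of a require_where-style check (assumption: hypothetical
# "require_where.delete" setting; naive regex matching instead of a
# real parser).
import re

def check_delete(sql, require_where_delete=True):
    """Raise with an error + hint naming the GUC when a DELETE has no
    WHERE clause and the setting is on."""
    if not require_where_delete:
        return
    if re.match(r"(?is)^\s*delete\b", sql) and not re.search(r"(?is)\bwhere\b", sql):
        raise ValueError(
            "DELETE requires a WHERE clause (enforced by require_where)\n"
            'HINT: To delete all rows, use "WHERE true" or similar, '
            "or turn off require_where.delete.")

check_delete("DELETE FROM foo WHERE id = 1")                 # passes
check_delete("DELETE FROM foo", require_where_delete=False)  # GUC off, passes
```

The key point is in the message text: both the ERROR and the HINT mention the extension and its setting, which addresses the "where is this coming from?" complaint in the thread.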
Re: [HACKERS] Showing parallel status in \df+
On Thu, Sep 22, 2016 at 10:04 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Rushabh Lathia <rushabh.lat...@gmail.com> writes: > > I agree with the argument in this thread, having "Source code" as part > > of \df+ is bit annoying, specifically when output involve some really > > big PL language functions. Having is separate does make \df+ output more > > readable. So I would vote for \df++ rather then adding the source code > > as part of footer for \df+. > > If it's unreadable in \df+, how would \df++ make that any better? > > Even though the source code as part of \df+ is a bit annoying (specifically for PL functions), I noticed the argument in this thread that it's useful information for some. So \df++ is just an alternate option for those who want the source code. > regards, tom lane > -- Rushabh Lathia
Re: [HACKERS] Confusing docs about GetForeignUpperPaths in fdwhandler.sgml
On Fri, Sep 2, 2016 at 5:00 PM, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp> wrote: > Hi Robert, > > Thanks for the comments! > > On 2016/09/02 11:55, Robert Haas wrote: > >> On Mon, Aug 1, 2016 at 5:44 PM, Etsuro Fujita >> <fujita.ets...@lab.ntt.co.jp> wrote: >> >>> I noticed that the following note about direct modification via >>> GetForeignUpperPaths in fdwhandler.sgml is a bit confusing. We have >>> another >>> approach using PlanDirectModify, so that should be reflected in the note >>> as >>> well. Please find attached a patch. >>> >>> PlanForeignModify and the other callbacks described in >>> are designed around the >>> assumption >>> that the foreign relation will be scanned in the usual way and then >>> individual row updates will be driven by a local >>> ModifyTable >>> plan node. This approach is necessary for the general case where an >>> update requires reading local tables as well as foreign tables. >>> However, if the operation could be executed entirely by the foreign >>> server, the FDW could generate a path representing that and insert >>> it >>> into the UPPERREL_FINAL upper relation, where it would >>> compete against the ModifyTable approach. >>> >> > I suppose this is factually correct, but I don't think it's very >> illuminating. I think that if we're going to document both >> approaches, there should be some discussion of the pros and cons of >> PlanDirectModify vs. PlanForeignModify. >> > > PlanDirectModify vs. GetForeignUpperPaths for an UPPERREL_FINAL upper > relation? > > Of course either should be >> better than an iterative ModifyTable, but how should the FDW author >> decide between the two of them? >> > > That would apply to row locking. 
We have two approaches for that too:
> GetForeignRowMarkType and GetForeignUpperPaths, which is documented in the
> same paragraph following the above documentation:
>
> This approach
> could also be used to implement remote SELECT FOR UPDATE,
> rather than using the row locking callbacks described in
> . Keep in mind that a path
>
> The point of the patch is just to let the FDW author know that there is
> another approach for implementing direct modification (ie,
> PlanDirectModify) just as for implementing row locking.
>

Considering that the primary object of this patch is just to let the FDW author know that there is another approach for implementing direct modification, I like the idea of modifying the document.

> I agree that the documentation about how the FDW author should decide
> between the two would be helpful, but I'd like to leave that for future
> work.
>

I performed a basic test with the patch: a) the patch applies cleanly on the latest source, and b) the documentation builds cleanly. Marking this as ready for committer.

>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Rushabh Lathia
Re: [HACKERS] Showing parallel status in \df+
I agree with the argument in this thread: having "Source code" as part of \df+ is a bit annoying, specifically when the output involves some really big PL language functions. Having it separate does make the \df+ output more readable. So I would vote for \df++ rather than adding the source code as part of the footer for \df+.

Personally, I didn't like the idea of keeping "source code" for C/internal functions as part of \df+ while moving the others out of it. If we really want to move "source code" out of \df+, then it should be consistent - irrespective of language. So maybe remove "source code" completely from \df+ and add \df++ support for the "source code".

On Wed, Sep 7, 2016 at 12:14 AM, Pavel Stehule <pavel.steh...@gmail.com> wrote:
> Hi
>
> 2016-09-06 0:05 GMT+02:00 Tom Lane <t...@sss.pgh.pa.us>:
>
>> I wrote:
>> > Pavel Stehule <pavel.steh...@gmail.com> writes:
>> >> Using footer for this purpose is little bit strange. What about
>> >> following design?
>> >> 1. move out source code of PL functions from \df+
>> >> 2. allow not unique filter in \sf and allow to display multiple
>> >> functions
>>
>> > Wasn't that proposed and rejected upthread?
>>
>> So ... why did you put this patch in "Waiting on Author" state? AFAIK,
>> we had dropped the idea of relying on \sf for this, mainly because
>> Peter complained about \df+ no longer providing source code. I follow
>> his point: if you're used to using \df+ to see source code, you probably
>> can figure it out quickly if that command shows the source in a different
>> place than before. But if it doesn't show it at all, using \sf instead
>> might not occur to you right away.
>>
>
> I see only one situation, when I want to see more then one source code -
> checking overloaded functions. I prefer to see complete source code - in
> \sf format. But I don't remember, when I did it last time. So I can live
> without it well.
>
> I am thinking, there is strong agreement about reduction \dt+ result.
I am > not sure about usability of showing source code in footer. It is not too > much readable - and the fact, so function's body is displayed not as CREATE > statements, does the result less readable. > > Now I am thinking so using footer for this purpose is not too great idea - > maybe we can live better without it (without source code of PL in \dt+ > result, I would to see only C function source there). If you like using > footer, then the format should be changed to be more consistent, readable? > I am not sure, how it can be enhanced. > > Regards > > Pavel > > >> >> regards, tom lane >> > > -- Rushabh Lathia
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
On Fri, Sep 2, 2016 at 1:12 PM, Rahila Syed <rahilasye...@gmail.com> wrote:
> Hello,
>
> Thank you for comments.
>
> >Above test not throwing psql error, as you used strcmp(newval,"ON"). You
> >should use pg_strcasecmp.
> Corrected in the attached.
>
> >Above error coming because in below code block change, you don't have
> >check for command (you should check opt0 for AUTOCOMMIT command)
> Corrected in the attached.
>
> >postgres=# BEGIN;
> >BEGIN
> >postgres=# create table foo ( a int );
> >CREATE TABLE
> >postgres=# \set AUTOCOMMIT ON
> >Don't you think, in above case also we should throw a psql error?
> IMO, in this case BEGIN is explicitly specified by user, so I think it is
> understood that a commit is required for changes to be effective.
> Hence I did not consider this case.
>

The documentation doesn't say these changes are only for an implicit BEGIN:

+
+
+ Autocommit cannot be set on inside a transaction, the ongoing
+ transaction has to be ended by entering COMMIT or
+ ROLLBACK before setting autocommit on.
+
+

In my opinion, it's good to be consistent, whether the BEGIN is implicit or explicitly specified.

> >postgres=# \set AUTOCOMMIT off
> >postgres=# create table foo ( a int );
> >CREATE TABLE
> >postgres=# \set XYZ ON
> >\set: Cannot set XYZ to ON inside a transaction, either COMMIT or
> >ROLLBACK and retry
>
> >May be you would like to move the new code block inside SetVariable(). So
> >that don't need to worry about the validity of the variable names.
>
> I think validating variable names wont be required if we throw error only
> if command is \set AUTOCOMMIT.
> Validation can happen later as in the existing code.
>

It might happen that SetVariable() gets called from other places in the future, so it's good to have all the set-variable related checks inside SetVariable(). Also, you can modify the condition in such a way that we don't end up calling PQtransactionStatus() even when it's not required.
Example (using pg_strcasecmp for the value comparison, as suggested above):

if (!strcmp(opt0, "AUTOCOMMIT") && !pg_strcasecmp(newval, "ON") &&
    !pset.autocommit && PQtransactionStatus(pset.db) == PQTRANS_INTRANS)

> >Basically if I understand correctly, if we are within transaction and
> >someone tries the set the AUTOCOMMIT ON, it should throw a psql error.
> >Correct me if I am missing anything?
>
> Yes the psql_error is thrown when AUTOCOMMIT is turned on inside a
> transaction. But only when there is an implicit BEGIN as in following case,
>
> postgres=# \set AUTOCOMMIT OFF
> postgres=# create table test(i int);
> CREATE TABLE
> postgres=# \set AUTOCOMMIT ON
> \set: Cannot set AUTOCOMMIT to ON inside a transaction, either COMMIT or
> ROLLBACK and retry
> postgres=#
>
> Thank you,
> Rahila Syed
>

Regards,
Rushabh Lathia
www.EnterpriseDB.com
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
I don't think the patch is doing the correct thing:

1)
postgres=# \set AUTOCOMMIT off
postgres=# create table foo (a int );
CREATE TABLE
postgres=# \set AUTOCOMMIT on

The above test is not throwing a psql error, because you used strcmp(newval,"ON"). You should use pg_strcasecmp.

2)
postgres=# \set AUTOCOMMIT OFF
postgres=# create table foo ( a int );
CREATE TABLE
postgres=# \set VERBOSITY ON
\set: Cannot set VERBOSITY to ON inside a transaction, either COMMIT or ROLLBACK and retry

The above error is coming because in the code block change below, you don't have a check for the command (you should check opt0 for the AUTOCOMMIT command):

-if (!SetVariable(pset.vars, opt0, newval))
+if (transaction_status == PQTRANS_INTRANS && !pset.autocommit &&
+    !strcmp(newval,"ON"))
+{
+    psql_error("\\%s: Cannot set %s to %s inside a transaction, either COMMIT or ROLLBACK and retry\n",
+               cmd, opt0, newval);
+    success = false;
+}

3)
postgres=# BEGIN;
BEGIN
postgres=# create table foo ( a int );
CREATE TABLE
postgres=# \set AUTOCOMMIT ON

Don't you think in the above case we should also throw a psql error?

4)
postgres=# \set AUTOCOMMIT off
postgres=# create table foo ( a int );
CREATE TABLE
postgres=# \set XYZ ON
\set: Cannot set XYZ to ON inside a transaction, either COMMIT or ROLLBACK and retry

Maybe you would like to move the new code block inside SetVariable(), so that we don't need to worry about the validity of the variable names.

Basically, if I understand correctly, if we are within a transaction and someone tries to set AUTOCOMMIT ON, it should throw a psql error. Correct me if I am missing anything?

On Thu, Sep 1, 2016 at 4:23 PM, Rahila Syed <rahilasye...@gmail.com> wrote:
> Ok. Please find attached a patch which introduces psql error when
> autocommit is turned on inside a transaction. It also adds relevant
> documentation in psql-ref.sgml. Following is the output.
>
> bash-4.2$ psql -d postgres
> psql (10devel)
> Type "help" for help.
> > postgres=# \set AUTOCOMMIT OFF > postgres=# create table test(i int); > CREATE TABLE > postgres=# \set AUTOCOMMIT ON > \set: Cannot set AUTOCOMMIT to ON inside a transaction, either COMMIT or > ROLLBACK and retry > postgres=# > > > Thank you, > Rahila Syed > > On Wed, Aug 17, 2016 at 6:31 PM, Robert Haas <robertmh...@gmail.com> > wrote: > >> On Tue, Aug 16, 2016 at 5:25 PM, Rahila Syed <rahilasye...@gmail.com> >> wrote: >> >>I think I like the option of having psql issue an error. On the >> >>server side, the transaction would still be open, but the user would >> >>receive a psql error message and the autocommit setting would not be >> >>changed. So the user could type COMMIT or ROLLBACK manually and then >> >>retry changing the value of the setting. >> > >> > Throwing psql error comes out to be most accepted outcome on this >> thread. I >> > agree it is safer than guessing user intention. >> > >> > Although according to the default behaviour of psql, error will abort >> the >> > current transaction and roll back all the previous commands. >> >> A server error would do that, but a psql errror won't. >> >> -- >> Robert Haas >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company >> > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia
Re: [HACKERS] [parallel query] random server crash while running tpc-h query on power2
On Mon, Aug 15, 2016 at 6:02 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Aug 13, 2016 at 4:36 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> > AFAICS, your patch seems to be the right fix for this issue, unless we
> > need the instrumentation information during execution (other than for
> > explain) for some purpose.
>
> Hmm, I disagree. It should be the job of
> ExecParallelRetrieveInstrumentation to allocate its data in the
> correct context, not the responsibility of nodeGather.c to work around
> the fact that it doesn't. The worker instrumentation should be
> allocated in the same context as the regular instrumentation
> information, which I assume is probably the per-query context.
>

I agree, this makes sense. Here is the patch to allocate the worker instrumentation in the same context as the regular instrumentation, which is the per-query context. PFA patch.

--
Rushabh Lathia
www.EnterpriseDB.com

diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 380d743..5aa6f02 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -500,6 +500,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 	int			n;
 	int			ibytes;
 	int			plan_node_id = planstate->plan->plan_node_id;
+	MemoryContext oldcontext;
 
 	/* Find the instumentation for this node. */
 	for (i = 0; i < instrumentation->num_plan_nodes; ++i)
@@ -514,10 +515,19 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 	for (n = 0; n < instrumentation->num_workers; ++n)
 		InstrAggNode(planstate->instrument, &instrument[n]);
 
-	/* Also store the per-worker detail. */
+	/*
+	 * Also store the per-worker detail.
+	 *
+	 * Worker instrumentation should be allocated in the same context as
+	 * the regular instrumentation information, which is the per-query
+	 * context. Switch into per-query memory context.
+	 */
+	oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
 	ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
 	planstate->worker_instrument =
 		palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+	MemoryContextSwitchTo(oldcontext);
+
 	planstate->worker_instrument->num_workers = instrumentation->num_workers;
 	memcpy(&planstate->worker_instrument->instrument, instrument, ibytes);
[HACKERS] [parallel query] random server crash while running tpc-h query on power2
Hi All,

Recently while running tpc-h queries on the postgresql master branch, I noticed random server crashes. Most of the time the crash comes while running tpch query number 3 (but it's very random).

Call stack of the server crash:

(gdb) bt
#0 0x102aa9ac in ExplainNode (planstate=0x1001a7471d8, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1516
#1 0x102aaeb8 in ExplainNode (planstate=0x1001a746b50, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#2 0x102aaeb8 in ExplainNode (planstate=0x1001a7468b8, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#3 0x102aaeb8 in ExplainNode (planstate=0x1001a745e48, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#4 0x102aaeb8 in ExplainNode (planstate=0x1001a745bb0, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#5 0x102aaeb8 in ExplainNode (planstate=0x1001a745870, ancestors=0x1001a745348, relationship=0x0, plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#6 0x102a81d8 in ExplainPrintPlan (es=0x1001a65ad48, queryDesc=0x1001a7442a0) at explain.c:599
#7 0x102a7e20 in ExplainOnePlan (plannedstmt=0x1001a744208, into=0x0, es=0x1001a65ad48, queryString=0x1001a6a8c08 "explain (analyze,buffers,verbose) select\n\tl_orderkey,\n\tsum(l_extendedprice * (1 - l_discount)) as revenue,\n\to_orderdate,\n\to_shippriority\nfrom\n\tcustomer,\n\torders,\n\tlineitem\nwhere\n\tc_mktsegment = 'BUILD"..., params=0x0, planduration=0x3fffe17c6488) at explain.c:515
#8 0x102a7968 in ExplainOneQuery (query=0x1001a65ab98, into=0x0, es=0x1001a65ad48, queryString=0x1001a6a8c08 "explain (analyze,buffers,verbose) select\n\tl_orderkey,\n\tsum(l_extendedprice * (1 - l_discount)) as revenue,\n\to_orderdate,\n\to_shippriority\nfrom\n\tcustomer,\n\torders,\n\tlineitem\nwhere\n\tc_mktsegment = 'BUILD"..., params=0x0) at explain.c:357
#9 0x102a74b8 in ExplainQuery (stmt=0x1001a6fe468, queryString=0x1001a6a8c08 "explain (analyze,buffers,verbose) select\n\tl_orderkey,\n\tsum(l_extendedprice * (1 - l_discount)) as revenue,\n\to_orderdate,\n\to_shippriority\nfrom\n\tcustomer,\n\torders,\n\tlineitem\nwhere\n\tc_mktsegment = 'BUILD"..., params=0x0, dest=0x1001a65acb0) at explain.c:245

(gdb) p *w
$2 = {num_workers = 2139062143, instrument = 0x1001af8a8d0}
(gdb) p planstate
$3 = (PlanState *) 0x1001a7471d8
(gdb) p *planstate
$4 = {type = T_HashJoinState, plan = 0x1001a740730, state = 0x1001a745758, instrument = 0x1001a7514a8, worker_instrument = 0x1001af8a8c8, targetlist = 0x1001a747448, qual = 0x0, lefttree = 0x1001a74a0e8, righttree = 0x1001a74f2f8, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0, ps_ResultTupleSlot = 0x1001a750c10, ps_ExprContext = 0x1001a7472f0, ps_ProjInfo = 0x1001a751220, ps_TupFromTlist = 0 '\000'}
(gdb) p *planstate->worker_instrument
$5 = {num_workers = 2139062143, instrument = 0x1001af8a8d0}
(gdb) p n
$6 = 13055
(gdb) p n
$7 = 13055

Here it is clear that worker_instrument is either corrupted or uninitialized, which is why it ends up in a server crash. With a bit more debugging and a look at the git history, I found that the issue started with commit af33039317ddc4a0e38a02e2255c2bf453115fd2.

gather_readnext() calls ExecShutdownGatherWorkers() when nreaders == 0. ExecShutdownGatherWorkers() calls ExecParallelFinish(), which collects the instrumentation before marking ParallelExecutorInfo as finished. ExecParallelRetrieveInstrumentation() does the allocation of planstate->worker_instrument. With commit af33039317 we now call gather_readnext() with the per-tuple context current, so when nreaders == 0, ExecShutdownGatherWorkers() ends up allocating planstate->worker_instrument in the per-tuple context - which is wrong.
Now the fix can be:

1) Avoid calling ExecShutdownGatherWorkers() from gather_readnext() and let ExecEndGather() do that. But with this change, gather_readnext() and gather_getnext() depend on the planstate->reader structure to continue reading tuples. So either we change those conditions to depend on planstate->nreaders, or we just pfree(planstate->reader) in gather_readnext() instead of calling ExecShutdownGatherWorkers().

2) Teach ExecParallelRetrieveInstrumentation() to do the allocation of planstate->worker_instrument in the proper memory context.

Attaching a patch which fixes the issue with approach 1).

Regards.
Rushabh Lathia
www.EnterpriseDB.com

diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 438d1b2..bc56f0e 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -348,7
Re: [HACKERS] ERROR: ORDER/GROUP BY expression not found in targetlist
Thanks Tom. I performed testing with the latest commit and the tests are running fine.

On Tue, Jun 28, 2016 at 8:14 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Rushabh Lathia <rushabh.lat...@gmail.com> writes:
> > SELECT setval('s', max(100)) from tab;
> > ERROR: ORDER/GROUP BY expression not found in targetlist
>
> Fixed, thanks for the report!
>
> regards, tom lane
>

--
Rushabh Lathia