Re: [HACKERS] Parallel Hash take II
While rebasing the parallel B-tree index build patch on top of the v22 patch set, I found two cosmetic issues:

1) BufFileSetEstimate has been removed, but its declaration is still in buffile.h:

+extern size_t BufFileSetEstimate(int stripes);

2) BufFileSetCreate has been renamed to BufFileSetInit, but the old name is still used in this comment:

 * Attach to a set of named BufFiles that was created with BufFileSetCreate.

Thanks,

On Wed, Oct 25, 2017 at 11:33 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Tue, Oct 24, 2017 at 10:10 PM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
> > Here is an updated patch set that does that ^.
>
> It's a bit hard to understand what's going on with the v21 patch set I
> posted yesterday because EXPLAIN ANALYZE doesn't tell you anything
> interesting.  Also, if you apply the multiplex_gather patch[1] I
> posted recently and set multiplex_gather to off then it doesn't tell
> you anything at all, because the leader has no hash table (I suppose
> that could happen with unpatched master given sufficiently bad
> timing).  Here's a new version with an extra patch that adds some
> basic information about load balancing to EXPLAIN ANALYZE, inspired by
> what commit bf11e7ee did for Sort.
>
> Example output:
>
> enable_parallel_hash = on, multiplex_gather = on:
>
>  -> Parallel Hash (actual rows=100 loops=3)
>       Buckets: 131072  Batches: 16
>       Leader:    Shared Memory Usage: 3552kB  Hashed: 396120  Batches Probed: 7
>       Worker 0:  Shared Memory Usage: 3552kB  Hashed: 276640  Batches Probed: 6
>       Worker 1:  Shared Memory Usage: 3552kB  Hashed: 327240  Batches Probed: 6
>       -> Parallel Seq Scan on simple s (actual rows=33 loops=3)
>
>  -> Parallel Hash (actual rows=1000 loops=8)
>       Buckets: 131072  Batches: 256
>       Leader:    Shared Memory Usage: 2688kB  Hashed: 1347720  Batches Probed: 36
>       Worker 0:  Shared Memory Usage: 2688kB  Hashed: 1131360  Batches Probed: 33
>       Worker 1:  Shared Memory Usage: 2688kB  Hashed: 1123560  Batches Probed: 38
>       Worker 2:  Shared Memory Usage: 2688kB  Hashed: 1231920  Batches Probed: 38
>       Worker 3:  Shared Memory Usage: 2688kB  Hashed: 1272720  Batches Probed: 34
>       Worker 4:  Shared Memory Usage: 2688kB  Hashed: 1234800  Batches Probed: 33
>       Worker 5:  Shared Memory Usage: 2688kB  Hashed: 1294680  Batches Probed: 37
>       Worker 6:  Shared Memory Usage: 2688kB  Hashed: 1363240  Batches Probed: 35
>       -> Parallel Seq Scan on big s2 (actual rows=125 loops=8)
>
> enable_parallel_hash = on, multiplex_gather = off (ie no leader
> participation):
>
>  -> Parallel Hash (actual rows=100 loops=2)
>       Buckets: 131072  Batches: 16
>       Worker 0:  Shared Memory Usage: 3520kB  Hashed: 475920  Batches Probed: 9
>       Worker 1:  Shared Memory Usage: 3520kB  Hashed: 524080  Batches Probed: 8
>       -> Parallel Seq Scan on simple s (actual rows=50 loops=2)
>
> enable_parallel_hash = off, multiplex_gather = on:
>
>  -> Hash (actual rows=100 loops=3)
>       Buckets: 131072  Batches: 16
>       Leader:    Memory Usage: 3227kB
>       Worker 0:  Memory Usage: 3227kB
>       Worker 1:  Memory Usage: 3227kB
>       -> Seq Scan on simple s (actual rows=100 loops=3)
>
> enable_parallel_hash = off, multiplex_gather = off:
>
>  -> Hash (actual rows=100 loops=2)
>       Buckets: 131072  Batches: 16
>       Worker 0:  Memory Usage: 3227kB
>       Worker 1:  Memory Usage: 3227kB
>       -> Seq Scan on simple s (actual rows=100 loops=2)
>
> parallelism disabled (traditional single-line output, unchanged):
>
>  -> Hash (actual rows=100 loops=1)
>       Buckets: 131072  Batches: 16  Memory Usage: 3227kB
>       -> Seq Scan on simple s (actual rows=100 loops=1)
>
> (It actually says "Tuples Hashed", not "Hashed" but I edited the above
> to fit on a standard punchcard.)  Thoughts?
>
> [1] https://www.postgresql.org/message-id/CAEepm%3D2U%2B%2BLp3bNTv2Bv_kkr5NE2pOyHhxU%3DG0YTa4ZhSYhHiw%40mail.gmail.com
>
> --
> Thomas Munro
> http://www.enterprisedb.com

--
Rushabh Lathia
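The per-participant numbers in the EXPLAIN ANALYZE output above suggest a simple collection pattern: each participant writes into its own slot in shared memory, and the leader reads all slots when reporting. The following is a hedged sketch of that shape only, not the actual patch code; all struct and function names here are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PARTICIPANTS 8

/* Hypothetical per-participant counters for a parallel hash build. */
typedef struct HashInstrumentation
{
	long		tuples_hashed;	/* tuples this participant inserted */
	int			batches_probed; /* batches this participant probed */
} HashInstrumentation;

/* Hypothetical shared state; in PostgreSQL this would live in DSM. */
typedef struct SharedHashInfo
{
	int			nparticipants;
	HashInstrumentation slots[MAX_PARTICIPANTS];
} SharedHashInfo;

/* Each participant bumps only its own slot ("id"), so no lock is needed. */
static void
hash_count_inserted(SharedHashInfo *shared, int id, long ntuples)
{
	shared->slots[id].tuples_hashed += ntuples;
}

static void
hash_count_batch_probed(SharedHashInfo *shared, int id)
{
	shared->slots[id].batches_probed++;
}

/* Leader-side reporting, analogous to the per-worker EXPLAIN lines above. */
static void
hash_show_instrumentation(const SharedHashInfo *shared)
{
	for (int i = 0; i < shared->nparticipants; i++)
		printf("Worker %d:  Hashed: %ld  Batches Probed: %d\n",
			   i, shared->slots[i].tuples_hashed,
			   shared->slots[i].batches_probed);
}
```

Because every slot has exactly one writer, the updates race with nothing; the leader only inspects the slots after the participants are done.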
Re: [HACKERS] Parallel Hash take II
The v20 patch set (I was trying the 0008 and 0009 patches) no longer applies cleanly on the latest commit; I am also getting a compilation error caused by the refactoring in the commit below.

commit 0c5803b450e0cc29b3527df3f352e6f18a038cc6
Author: Peter Eisentraut <pete...@gmx.net>
Date:   Sat Sep 23 09:49:22 2017 -0400

    Refactor new file permission handling

On Mon, Sep 25, 2017 at 11:38 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Thu, Sep 21, 2017 at 5:49 AM, Peter Geoghegan <p...@bowt.ie> wrote:
> > Graefe's "Query Evaluation Techniques for Large Databases" has several
> > pages on deadlock avoidance strategies. It was written almost 25 years
> > ago, but still has some good insights IMV (you'll recall that Graefe
> > is the author of the Volcano paper; this reference paper seems like
> > his follow-up). Apparently, deadlock avoidance strategy becomes
> > important for parallel sort with partitioning. You may be able to get
> > some ideas from there. And even if you don't, his handling of the
> > topic is very deliberate and high level, which suggests that ours
> > should be, too.
>
> Very interesting and certainly relevant (the parts I've read so far),
> though we don't have multiple consumers.  Multiplexing one thread so
> that it is both a consumer and a producer is an extra twist though.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Rushabh Lathia
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Sep 20, 2017 at 5:17 AM, Peter Geoghegan <p...@bowt.ie> wrote: > On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > As per the earlier discussion in the thread, I did experiment using > > BufFileSet interface from parallel-hash-v18.patchset. I took the > reference > > of parallel-hash other patches to understand the BufFileSet APIs, and > > incorporate the changes to parallel create index. > > > > In order to achieve the same: > > > > - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and > > 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the > > parallel-hash-v18.patchset. > > - Removed the buffile.c/logtap.c/fd.c changes from the parallel CREATE > > INDEX v10 patch. > > - incorporate the BufFileSet API to the parallel tuple sort for CREATE > > INDEX. > > - Changes into few existing functions as well as added few to support the > > BufFileSet changes. > > I'm glad that somebody is working on this. (Someone closer to the more > general work on shared/parallel BufFile infrastructure than I am.) > > I do have some quick feedback, and I hope to be able to provide that > to both you and Thomas, as needed to see this one through. I'm not > going to get into the tricky details around resource management just > yet. I'll start with some simpler questions, to get a general sense of > the plan here. > > Thanks Peter. > I gather that you're at least aware that your v11 of the patch doesn't > preserve randomAccess support for parallel sorts, because you didn't > include my 0002-* testing GUCs patch, which was specifically designed > to make various randomAccess stuff testable. I also figured this to be > true because I noticed this FIXME among (otherwise unchanged) > tuplesort code: > > Yes, I haven't touched the randomAccess part yet. My initial goal was to incorporate the BufFileSet api's here. 
> > +static void
> > +leader_takeover_tapes(Tuplesortstate *state)
> > +{
> > +    Sharedsort *shared = state->shared;
> > +    int         nLaunched = state->nLaunched;
> > +    int         j;
> > +
> > +    Assert(LEADER(state));
> > +    Assert(nLaunched >= 1);
> > +    Assert(nLaunched == shared->workersFinished);
> > +
> > +    /*
> > +     * Create the tapeset from worker tapes, including a leader-owned tape at
> > +     * the end.  Parallel workers are far more expensive than logical tapes,
> > +     * so the number of tapes allocated here should never be excessive. FIXME
> > +     */
> > +    inittapestate(state, nLaunched + 1);
> > +    state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes,
> > +                                          state->fileset, state->worker);
>
> It's not surprising to me that you do not yet have this part working,
> because much of my design was about changing as little as possible
> above the BufFile interface, in order for tuplesort.c (and logtape.c)
> code like this to "just work" as if it was the serial case.

Right. I just followed your design from your earlier patches.

> It doesn't
> look like you've added the kind of BufFile multiplexing code that I
> expected to see in logtape.c.  This is needed to compensate for the
> code removed from fd.c and buffile.c.  Perhaps it would help me to go
> look at Thomas' latest parallel hash join patch -- did it gain some
> kind of transparent multiplexing ability that you actually (want to)
> use here?

Sorry, I didn't get this part. Are you talking about your patch's changes to OpenTemporaryFileInTablespace(), BufFileUnify() and other changes related to ltsUnify()? If that's the case, I don't think that is required with the BufFileSet. Correct me if I am wrong here.

> Though randomAccess isn't used by CREATE INDEX in general, and so not
> supporting randomAccess within tuplesort.c for parallel callers
> doesn't matter as far as this CREATE INDEX user-visible feature is
> concerned, I still believe that randomAccess is important (IIRC,
> Robert thought so too).
Specifically, it seems like a good idea to > have randomAccess support, both on general principle (why should the > parallel case be different?), and because having it now will probably > enable future enhancements to logtape.c. Enhancements that have it > manage parallel sorts based on partitioning/distribution/bucketing > [1]. I'm pretty sure that partitioning-based parallel sort is going to > become very important in the future, especially for parallel > GroupAggregate. The leader needs to truly own the tapes it reclaims > from workers in order for all of this to work. > > First application f
Re: [HACKERS] GatherMerge misses to push target list
On Mon, Sep 18, 2017 at 2:48 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Mon, Sep 18, 2017 at 12:46 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Thanks Amit for the patch. > > > > I reviewed the code changes as well as performed more testing. Patch > > looks good to me. > > > > In that case, can you please mark the patch [1] as ready for committer in > CF app > Done. > > > Here is the updated patch - where added test-case clean up. > > > > oops, missed dropping the function. > > > Thanks for the review. > > > [1] - https://commitfest.postgresql.org/15/1293/ > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > -- Rushabh Lathia
Re: [HACKERS] GatherMerge misses to push target list
Thanks Amit for the patch. I reviewed the code changes as well as performed more testing. Patch looks good to me. Here is the updated patch - where added test-case clean up. On Thu, Sep 14, 2017 at 10:02 AM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Wed, Sep 13, 2017 at 5:30 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > On Wed, Sep 6, 2017 at 10:04 AM, Amit Kapila <amit.kapil...@gmail.com> > > wrote: > >> > > > > > > This seems like a good optimization. I tried to simulate the test given > > in the mail, initially wasn't able to generate the exact test - as index > > creation is missing in the test shared. > > > > Oops. > > > I also won't consider this as bug, but its definitely good optimization > > for GatherMerge. > > > >> > >> > >> Note - If we agree on the problems and fix, then I can add regression > >> tests to cover above cases in the patch. > > > > > > Sure, once you do that - I will review the patch. > > > > The attached patch contains regression test as well. > > Thanks for looking into it. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > -- Rushabh Lathia diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 7f146d6..b12e57b 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -5028,7 +5028,7 @@ create_ordered_paths(PlannerInfo *root, path = (Path *) create_gather_merge_path(root, ordered_rel, path, - target, root->sort_pathkeys, NULL, + ordered_rel->reltarget, root->sort_pathkeys, NULL, _groups); /* Add projection step if needed */ diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index 26567cb..516a396 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -2354,9 +2354,9 @@ create_projection_path(PlannerInfo *root, * knows that the given path isn't referenced elsewhere and so can be modified * in-place. 
* - * If the input path is a GatherPath, we try to push the new target down to - * its input as well; this is a yet more invasive modification of the input - * path, which create_projection_path() can't do. + * If the input path is a GatherPath or GatherMergePath, we try to push the + * new target down to its input as well; this is a yet more invasive + * modification of the input path, which create_projection_path() can't do. * * Note that we mustn't change the source path's parent link; so when it is * add_path'd to "rel" things will be a bit inconsistent. So far that has @@ -2393,31 +2393,44 @@ apply_projection_to_path(PlannerInfo *root, (target->cost.per_tuple - oldcost.per_tuple) * path->rows; /* - * If the path happens to be a Gather path, we'd like to arrange for the - * subpath to return the required target list so that workers can help - * project. But if there is something that is not parallel-safe in the - * target expressions, then we can't. + * If the path happens to be a Gather or GatherMerge path, we'd like to + * arrange for the subpath to return the required target list so that + * workers can help project. But if there is something that is not + * parallel-safe in the target expressions, then we can't. */ - if (IsA(path, GatherPath) && + if ((IsA(path, GatherPath) || IsA(path, GatherMergePath)) && is_parallel_safe(root, (Node *) target->exprs)) { - GatherPath *gpath = (GatherPath *) path; - /* * We always use create_projection_path here, even if the subpath is * projection-capable, so as to avoid modifying the subpath in place. * It seems unlikely at present that there could be any other * references to the subpath, but better safe than sorry. * - * Note that we don't change the GatherPath's cost estimates; it might + * Note that we don't change the parallel path's cost estimates; it might * be appropriate to do so, to reflect the fact that the bulk of the * target evaluation will happen in workers. 
*/ - gpath->subpath = (Path *) - create_projection_path(root, - gpath->subpath->parent, - gpath->subpath, - target); + if (IsA(path, GatherPath)) + { + GatherPath *gpath = (GatherPath *) path; + + gpath->subpath = (Path *) +create_projection_path(root, + gpath->subpath->parent, + gpath->subpath, + target); + } + else + { + GatherMergePath *gmpath = (GatherMergePath *) path; + + gmpath->subpath = (Path *) +create_projection_path(root, + gmpath->subpath->parent, + gmpath->subpath, + target); + } } else if (path->p
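The hunk above ends up with two nearly identical branches because GatherPath and GatherMergePath are distinct structs, so there is no common field through which the subpath can be reached. As an aside, the duplication could be narrowed by resolving the address of the subpath field first and doing the common work once. Below is a hedged sketch of that shape using simplified stand-in types, not the real planner structs, and with the is_parallel_safe() check omitted.

```c
#include <stdlib.h>

/* Simplified stand-ins for the planner node types (hypothetical). */
typedef enum
{
	T_GatherPath, T_GatherMergePath, T_ProjectionPath, T_OtherPath
} NodeTag;

typedef struct Path { NodeTag type; } Path;
typedef struct GatherPath { Path path; Path *subpath; } GatherPath;
typedef struct GatherMergePath { Path path; Path *subpath; } GatherMergePath;
typedef struct ProjectionPath { Path path; Path *subpath; } ProjectionPath;

/* Stand-in for create_projection_path(): wrap the subpath in a projection. */
static Path *
make_projection(Path *subpath)
{
	ProjectionPath *pp = malloc(sizeof(ProjectionPath));

	pp->path.type = T_ProjectionPath;
	pp->subpath = subpath;
	return &pp->path;
}

/*
 * Resolve the subpath field once, then wrap it a single time; this avoids
 * repeating the projection-building call for each Gather variant.
 */
static void
push_projection_below_gather(Path *path)
{
	Path	  **subpath_field = NULL;

	if (path->type == T_GatherPath)
		subpath_field = &((GatherPath *) path)->subpath;
	else if (path->type == T_GatherMergePath)
		subpath_field = &((GatherMergePath *) path)->subpath;

	if (subpath_field != NULL)
		*subpath_field = make_projection(*subpath_field);
}
```

The patch's two-branch version is equally correct; this only illustrates why the branches exist (no shared struct prefix exposing `subpath`) and one way the repetition could be factored out.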
Re: [HACKERS] GatherMerge misses to push target list
On Wed, Sep 6, 2017 at 10:04 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> During my recent work on costing of parallel paths [1], I noticed that
> we are missing to push target list below GatherMerge in some simple
> cases like below.
>
> Test preparation
> ----------------
> create or replace function simple_func(var1 integer) returns integer
> as $$
> begin
>         return var1 + 10;
> end;
> $$ language plpgsql PARALLEL SAFE;
>
> create table t1(c1 int, c2 char(5));
> insert into t1 values(generate_series(1,50), 'aaa');
> set parallel_setup_cost=0;
> set parallel_tuple_cost=0;
> set min_parallel_table_scan_size=0;
> set max_parallel_workers_per_gather=4;
> set cpu_operator_cost=0;
> set min_parallel_index_scan_size=0;
>
> Case-1
> ------
> postgres=# explain (costs off, verbose) select c1,simple_func(c1) from
> t1 where c1 < 1 order by c1;
>                      QUERY PLAN
> ---------------------------------------------------
>  Gather Merge
>    Output: c1, simple_func(c1)
>    Workers Planned: 4
>    ->  Parallel Index Scan using idx_t1 on public.t1
>          Output: c1
>          Index Cond: (t1.c1 < 1)
> (6 rows)
>
> In the above case, I don't see any reason why we can't push the target
> list expression (simple_func(c1)) down to workers.
>
> Case-2
> ------
> set enable_indexonlyscan=off;
> set enable_indexscan=off;
> postgres=# explain (costs off, verbose) select c1,simple_func(c1) from
> t1 where c1 < 1 order by c1;
>                   QUERY PLAN
> -----------------------------------------------
>  Gather Merge
>    Output: c1, simple_func(c1)
>    Workers Planned: 4
>    ->  Sort
>          Output: c1
>          Sort Key: t1.c1
>          ->  Parallel Bitmap Heap Scan on public.t1
>                Output: c1
>                Recheck Cond: (t1.c1 < 1)
>                ->  Bitmap Index Scan on idx_t1
>                      Index Cond: (t1.c1 < 1)
> (11 rows)
>
> It is different from above case (Case-1) because sort node can't
> project, but I think adding a Result node on top of it can help to
> project and serve our purpose.
>
> The attached patch allows pushing the target list expression in both
> the cases.  I think we should handle GatherMerge case in
> apply_projection_to_path similar to Gather so that target list can be
> pushed.  Another problem was during the creation of ordered paths
> where we don't allow target expressions to be pushed below gather
> merge.
>
> Plans after patch:
>
> Case-1
> ------
> postgres=# explain (costs off, verbose) select c1,simple_func(c1) from
> t1 where c1 < 1 order by c1;
>                        QUERY PLAN
> --------------------------------------------------------
>  Gather Merge
>    Output: c1, (simple_func(c1))
>    Workers Planned: 4
>    ->  Parallel Index Only Scan using idx_t1 on public.t1
>          Output: c1, simple_func(c1)
>          Index Cond: (t1.c1 < 1)
> (6 rows)
>
> Case-2
> ------
> postgres=# explain (costs off, verbose) select c1,simple_func(c1) from
> t1 where c1 < 1 order by c1;
>                      QUERY PLAN
> ----------------------------------------------------
>  Gather Merge
>    Output: c1, (simple_func(c1))
>    Workers Planned: 4
>    ->  Result
>          Output: c1, simple_func(c1)
>          ->  Sort
>                Output: c1
>                Sort Key: t1.c1
>                ->  Parallel Bitmap Heap Scan on public.t1
>                      Output: c1
>                      Recheck Cond: (t1.c1 < 1)
>                      ->  Bitmap Index Scan on idx_t1
>                            Index Cond: (t1.c1 < 1)
> (13 rows)
>
> Note, that simple_func() is pushed down to workers in both the cases.
>
> Thoughts?

This seems like a good optimization. I tried to simulate the test given in the mail; initially I wasn't able to reproduce the exact plans, as the index creation is missing from the shared test.

I also wouldn't consider this a bug, but it's definitely a good optimization for GatherMerge.

> Note - If we agree on the problems and fix, then I can add regression
> tests to cover above cases in the patch.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1Ji_0pgrjFotAyvvfxGikxJQEKcxD863VQ-xYtfQBy0uQ%40mail.gmail.com

Sure, once you do that I will review the patch.

Thanks,

> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Rushabh Lathia
Re: [HACKERS] dropping partitioned tables without CASCADE
Okay, I have marked this as ready for committer. Thanks, On Wed, Sep 6, 2017 at 2:50 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > On Wed, Sep 6, 2017 at 1:06 PM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > > > > 2) Add partition to the foo; > > > > create table foo_p1 partition of foo for values in (1, 2, 3) partition by > > list (b); > > > > postgres=# \d foo > > Table "public.foo" > > Column | Type | Collation | Nullable | Default > > +-+---+--+- > > a | integer | | | > > b | integer | | | > > Partition key: LIST (a) > > Number of partitions: 1 (Use \d+ to list them.) > > > > postgres=# \d+ foo > > Table "public.foo" > > Column | Type | Collation | Nullable | Default | Storage | Stats > target > > | Description > > +-+---+--+-+ > -+--+- > > a | integer | | | | plain | > > | > > b | integer | | | | plain | > > | > > Partition key: LIST (a) > > Partitions: foo_p1 FOR VALUES IN (1, 2, 3) has partitions > > > > Above verbose output for foo says, foo_p1 "has partitions". But if I do > > > > postgres=# \d foo_p1 > >Table "public.foo_p1" > > Column | Type | Collation | Nullable | Default > > +-+---+--+- > > a | integer | | | > > b | integer | | | > > Partition of: foo FOR VALUES IN (1, 2, 3) > > Partition key: LIST (b) > > Number of partitions: 0 > > > > it tell "Number of partitions: 0". > > > > I feel like information is conflicting with each other. AFAIU, idea about > > adding > > "has partitions" was to let know that it's a partitioned table. So can > you > > directly > > add the "is partitioned" in place of "has partitions"? > > > > Did those change in the attached patch and update regression expected > > output. > > > > Looks better. > > > Also run pgindent on the patch. > > > > Thanks for the changes. The patch looks good to me. > > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company > -- Rushabh Lathia
Re: [HACKERS] dropping partitioned tables without CASCADE
I picked up this patch for review and started looking at the implementation details. Consider the test below.

1)

postgres=# create table foo (a int, b int) partition by list (a);
CREATE TABLE

postgres=# \d foo
                Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Partition key: LIST (a)
Number of partitions: 0

postgres=# \d+ foo
                                    Table "public.foo"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
 b      | integer |           |          |         | plain   |              |
Partition key: LIST (a)
Number of partitions: 0

In this case, both the verbose and the normal output give information about the number of partitions.

2) Add a partition to foo:

create table foo_p1 partition of foo for values in (1, 2, 3) partition by list (b);

postgres=# \d foo
                Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Partition key: LIST (a)
*Number of partitions: 1 (Use \d+ to list them.)*

postgres=# \d+ foo
                                    Table "public.foo"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
 b      | integer |           |          |         | plain   |              |
Partition key: LIST (a)
*Partitions: foo_p1 FOR VALUES IN (1, 2, 3) has partitions*

The verbose output for foo above says that foo_p1 "has partitions". But if I do

postgres=# \d foo_p1
               Table "public.foo_p1"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Partition of: foo FOR VALUES IN (1, 2, 3)
Partition key: LIST (b)
*Number of partitions: 0*

it says "Number of partitions: 0".

These two pieces of information conflict with each other. AFAIU, the idea behind adding "has partitions" was to indicate that foo_p1 is itself a partitioned table. So can you print "is partitioned" directly in place of "has partitions"?

I made that change in the attached patch and updated the expected regression output. I also ran pgindent on the patch.
Thanks, On Mon, Sep 4, 2017 at 6:22 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > On Mon, Sep 4, 2017 at 3:48 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> > wrote: > > > > if (tuples > 0) > > { > > - if (tableinfo.relkind != > RELKIND_PARTITIONED_TABLE) > > - printfPQExpBuffer(, > _("Number of child tables: %d (Use \\d+ to list them.)"), tuples); > > - else > > - printfPQExpBuffer(, > _("Number of partitions: %d (Use \\d+ to list them.)"), tuples); > > + printfPQExpBuffer(, _("Number of %s: > %d (Use \\d+ to list them.)"), ct, tuples); > > printTableAddFooter(, buf.data); > > } > > > > Please don't do this, because it breaks translatability. I think the > > original is fine. > > > > We have used this style in the "else" case of if (!verbose). So, I > just copied it. I have removed that change in the attached patch. > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > ... Rushabh Lathia www.EnterpriseDB.com diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index f6049cc..ab9a637 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -2828,7 +2828,7 @@ describeOneTableDetails(const char *schemaname, /* print child tables (with additional info if partitions) */ if (pset.sversion >= 10) printfPQExpBuffer(, - "SELECT c.oid::pg_catalog.regclass, pg_catalog.pg_get_expr(c.relpartbound, c.oid)" + "SELECT c.oid::pg_catalog.regclass, pg_get_expr(c.relpartbound, c.oid), c.relkind&qu
Re: [HACKERS] hash partitioning based on v10Beta2
# explain analyze select * from h where id = 1 or id = 5;; > QUERY PLAN > > > > Append (cost=0.00..96.50 rows=50 width=4) (actual time= > 0.017..0.078 rows=2 loops=1) >-> Seq Scan on h1 (cost=0.00..48.25 rows=25 width=4) ( > actual time=0.015..0.019 rows=1 loops=1) > Filter: ((id = 1) OR (id = 5)) > Rows Removed by Filter: 6 >-> Seq Scan on h3 (cost=0.00..48.25 rows=25 width=4) ( > actual time=0.005..0.010 rows=1 loops=1) > Filter: ((id = 1) OR (id = 5)) > Rows Removed by Filter: 3 > Planning time: 0.396 ms > Execution time: 0.139 ms > (9 rows) > > Can not detach / attach / drop partition table. > > Best regards, > young > > -- > > https://yonj1e.github.io/ > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia
Re: [HACKERS] reload-through-the-top-parent switch the partition table
On Fri, Aug 11, 2017 at 10:50 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Aug 11, 2017 at 5:36 AM, Rushabh Lathia
> <rushabh.lat...@gmail.com> wrote:
> > Please find the attached patch with the changes.
>
> I found the way that you had the logic structured in flagInhTables()
> to be quite hard to follow, so I rewrote it in the attached version.
> This version also has a bunch of minor cosmetic changes.  Barring
> objections, I'll commit it once the tree opens for v11 development.

Thanks Robert. The patch changes look good to me.

regards,
Rushabh Lathia
www.EnterpriseDB.com
Re: [HACKERS] reload-through-the-top-parent switch the partition table
On Thu, Aug 10, 2017 at 8:26 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Aug 10, 2017 at 3:47 AM, Rushabh Lathia
> <rushabh.lat...@gmail.com> wrote:
> >> (1) seems like a pretty arbitrary restriction, so I don't like that
> >> option.  (2) would hurt performance in some use cases.  Do we have an
> >> option (3)?
> >
> > How about protecting option 2) with the load-via-partition-root
> > protection?  Means load the parents information even when dump is not
> > set, only when load-via-partition-root & ispartition.
> >
> > This won't hurt performance for the normal cases.
>
> Yes, that seems like the right approach.
>
> +        Dump data via the top-most partitioned table (rather than partition
> +        table) when dumping data for the partition table.
>
> I think we should phrase this a bit more clearly, something like this:
> When dumping a COPY or INSERT statement for a partitioned table,
> target the root of the partitioning hierarchy which contains it rather
> than the partition itself.  This may be useful when reloading data on
> a server where rows do not always fall into the same partitions as
> they did on the original server.  This could happen, for example, if
> the partitioning column is of type text and the two system have
> different definitions of the collation used to partition the data.

Done.

> +    printf(_("  --load-via-partition-root    load partition table via the root relation\n"));
>
> "relation" seems odd to me here.  root table?

Done.

>     /* Don't bother computing anything for non-target tables, either */
>     if (!tblinfo[i].dobj.dump)
> +    {
> +        /*
> +         * Load the parents information for the partition table when
> +         * the load-via-partition-root option is set. As we need the
> +         * parents information to get the partition root.
> +         */
> +        if (dopt->load_via_partition_root &&
> +            tblinfo[i].ispartition)
> +            findParentsByOid(&tblinfo[i], inhinfo, numInherits);
>         continue;
> +    }
>
> Duplicating the call to findParentsByOid seems less than ideal.
How > about doing something like this: > > if (tblinfo[i].dobj.dump) > { > find_parents = true; > mark_parents = true; > } > else if (dopt->load_via_partition_root && tblinfo[i].ispartition) > find_parents = true; > > if (find_parents) > findParentsByOid([i], inhinfo, numInherits); > > etc. > > Done changes to avoid duplicate call to findParentsByOid(). > The comments for this function also need some work - e.g. the function > header comment deserves some kind of update for these changes. > > Done. > +static TableInfo * > +getRootTableInfo(TableInfo *tbinfo) > +{ > +Assert(tbinfo->ispartition); > + Assert(tbinfo->numParents == 1); > +if (tbinfo->parents[0]->ispartition) > +return getRootTableInfo(tbinfo->parents[0]); > + > +return tbinfo->parents[0]; > +} > > This code should iterate, not recurse, to avoid any risk of blowing > out the stack. > > Done. Please find attach patch with the changes. Thanks, Rushabh Lathia www.EnterpriseDB.com diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index bafa031..de8297a 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -889,6 +889,21 @@ PostgreSQL documentation + --load-via-partition-root + + +When dumping a COPY or INSERT statement for a partitioned table, +target the root of the partitioning hierarchy which contains it rather +than the partition itself. This will be useful when reloading data on +a server where rows do not always fall into the same partitions as +they did on the original server. This could happen, for example, if +the partitioning column is of type text and the two system have +different definitions of the collation used to partition the data. 
+ + + + + --section=sectionname diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml index aa944a2..dc1d3cc 100644 --- a/doc/src/sgml/ref/pg_dumpall.sgml +++ b/doc/src/sgml/ref/pg_dumpall.sgml @@ -431,6 +431,21 @@ PostgreSQL documentation + --load-via-partition-root + + +When dumping a COPY or INSERT statement for a partitioned table, +target the root of the partitioning hierarchy which contains it rather +than the partition itself. This will be useful when reloading
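Robert's point in this thread, that getRootTableInfo() should iterate rather than recurse, can be sketched as follows. This is a hedged illustration using a pared-down stand-in for pg_dump's TableInfo (only the fields the walk needs), not the actual patch code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pared-down stand-in for pg_dump's TableInfo (hypothetical fields only). */
typedef struct TableInfo
{
	bool		ispartition;
	int			numParents;
	struct TableInfo **parents;
} TableInfo;

/*
 * Walk up from a partition to the root of its partitioning hierarchy.
 * Iterating instead of recursing bounds stack usage no matter how deeply
 * the hierarchy is nested; each partition has exactly one parent.
 */
static TableInfo *
getRootTableInfo(TableInfo *tbinfo)
{
	TableInfo  *parent = tbinfo;

	assert(parent->ispartition);
	while (parent->ispartition)
	{
		assert(parent->numParents == 1);
		parent = parent->parents[0];
	}
	return parent;
}
```

In the real patch the function operates on pg_dump's actual TableInfo and uses Assert(); the loop structure, replacing the tail recursion of the earlier version, is the point here.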
Re: [HACKERS] reload-through-the-top-parent switch the partition table
On Wed, Aug 9, 2017 at 1:20 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Aug 8, 2017 at 8:48 AM, Rushabh Lathia <rushabh.lat...@gmail.com>
> wrote:
> > It seems like we set the numParents and parents only for the
> > dumpable objects (flagInhTables()).  The current patch relies on
> > numParents and parents to get the root partition TableInfo, but when
> > --schema is specified it doesn't load that information for objects
> > which are not dumpable.
> >
> > Now one option is:
> >
> > 1) restrict the --load-via-partition-root to be used with
> > the --schema or maybe some other options - where we restrict the
> > objects.
> >
> > Consider this: the partition root is in schema 'a' and the partition
> > table is in schema 'b'.  If someone specifies --schema b with
> > --load-via-partition-root, I think we should not do "INSERT INTO a.tab"
> > to load the data (because the user specified --schema b).
> >
> > 2) fix flagInhTables() to load the numParents and the parents
> > information for the partition table (can be checked using ispartition),
> > irrespective of whether the object is dumpable or not.
>
> (1) seems like a pretty arbitrary restriction, so I don't like that
> option.  (2) would hurt performance in some use cases.  Do we have an
> option (3)?

How about guarding option 2) with a load-via-partition-root check? That is, load the parents information even when dump is not set, but only when load_via_partition_root is set and the table is a partition.

This won't hurt performance in the normal cases.

Attaching the patch.
> > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- Rushabh Lathia diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index bafa031..65f7e98 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -889,6 +889,16 @@ PostgreSQL documentation + --load-via-partition-root + + +Dump data via the top-most partitioned table (rather than partition +table) when dumping data for the partition table. + + + + + --section=sectionname diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml index aa944a2..54b8e7e 100644 --- a/doc/src/sgml/ref/pg_dumpall.sgml +++ b/doc/src/sgml/ref/pg_dumpall.sgml @@ -431,6 +431,16 @@ PostgreSQL documentation + --load-via-partition-root + + +Dump data via the top-most partitioned table (rather than partition +table) when dumping data for the partition table. + + + + + --use-set-session-authorization diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c index 47191be..65c2d7f 100644 --- a/src/bin/pg_dump/common.c +++ b/src/bin/pg_dump/common.c @@ -66,7 +66,7 @@ static int numExtensions; static ExtensionMemberId *extmembers; static int numextmembers; -static void flagInhTables(TableInfo *tbinfo, int numTables, +static void flagInhTables(Archive *fout, TableInfo *tbinfo, int numTables, InhInfo *inhinfo, int numInherits); static void flagInhAttrs(DumpOptions *dopt, TableInfo *tblinfo, int numTables); static DumpableObject **buildIndexArray(void *objArray, int numObjs, @@ -243,7 +243,7 @@ getSchemaData(Archive *fout, int *numTablesPtr) /* Link tables to parents, mark parents of target tables interesting */ if (g_verbose) write_msg(NULL, "finding inheritance relationships\n"); - flagInhTables(tblinfo, numTables, inhinfo, numInherits); + flagInhTables(fout, tblinfo, numTables, inhinfo, numInherits); if (g_verbose) write_msg(NULL, "reading column info for interesting tables\n"); @@ -304,9 +304,10 @@ 
getSchemaData(Archive *fout, int *numTablesPtr)

  * modifies tblinfo
  */
 static void
-flagInhTables(TableInfo *tblinfo, int numTables,
+flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
 			  InhInfo *inhinfo, int numInherits)
 {
+	DumpOptions *dopt = fout->dopt;
 	int			i,
 				j;
 	int			numParents;
@@ -322,7 +323,17 @@ flagInhTables(TableInfo *tblinfo, int numTables,
 		/* Don't bother computing anything for non-target tables, either */
 		if (!tblinfo[i].dobj.dump)
+		{
+			/*
+			 * Load the parents information for the partition table when
+			 * the load-via-partition-root option is set. As we need the
+			 * parents information to get the partition root.
+			 */
+			if (dopt->load_via_partition_root &&
+				tblinfo[i].ispartition)
+				findParentsByOid(&tblinfo[i], inhinfo, numInherits);
 			continue;
+		}

 		/* Find all the immediate parent tables */
 		findParentsByOid(&tblinfo[i], inhinfo, numInherits);

diff --git a/src/bin/pg_dump/pg_backup.
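Condensed to its decision logic, the hunk above records parent links for a table when it is a dump target, or, newly, when it is a non-target partition and --load-via-partition-root is in effect. A minimal stand-alone sketch of that predicate (the function name and plain bool parameters are illustrative; the real code works on TableInfo entries):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical condensation of the flagInhTables() change: decide whether
 * pg_dump needs to record a table's immediate parent links.
 */
static bool
need_parent_links(bool dumpable, bool load_via_partition_root,
                  bool ispartition)
{
    /* Target tables always get their immediate parents recorded. */
    if (dumpable)
        return true;

    /*
     * Non-target tables normally skip this work, except that a partition
     * still needs its parents so the root table can be found later.
     */
    return load_via_partition_root && ispartition;
}
```

This keeps the normal (no --load-via-partition-root) path exactly as cheap as before, which is the point of the option-3 compromise discussed above.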
Re: [HACKERS] reload-through-the-top-parent switch the partition table
Thanks Rajkumar for testing and reporting this.

It seems like we set the numParents and parents only for the dumpable objects (flagInhTables()). The current patch relies on numParents and parents to get the root partition TableInfo, but when --schema is specified it doesn't load that information for objects which are not dumpable.

Now one option is:

1) restrict the --load-via-partition-root to be used with --schema, or maybe some other options, where we restrict the objects.

Consider this: the partition root is in schema 'a' and the partition table is in schema 'b'. If someone specifies --schema b with --load-via-partition-root, I think we should not do "INSERT INTO a.tab" to load the data (because the user specified --schema b).

2) fix flagInhTables() to load the numParents and parents information for a partition table (can be checked using ispartition), irrespective of whether the object is dumpable or not.

May be something like:

@@ -322,7 +322,11 @@ flagInhTables(TableInfo *tblinfo, int numTables,
 		/* Don't bother computing anything for non-target tables, either */
 		if (!tblinfo[i].dobj.dump)
+		{
+			if (tblinfo[i].ispartition)
+				findParentsByOid(&tblinfo[i], inhinfo, numInherits);
 			continue;
+		}

I am still looking into this; meanwhile any inputs are welcome.

On Tue, Aug 8, 2017 at 4:10 PM, Rajkumar Raghuwanshi < rajkumar.raghuwan...@enterprisedb.com> wrote:
> Hi Rushabh,
>
> While testing latest v2 patch, I got a crash when using
> --load-via-partition-root with --schema options. Below are steps to
> reproduce.
> > --create below test data > create schema a; > create schema b; > create schema c; > > create table t1 (a int,b text) partition by list(a); > create table a.t1_p1 partition of t1 FOR VALUES in (1,2,3,4) partition by > list(a); > create table b.t1_p1_p1 partition of a.t1_p1 FOR VALUES in (1,2); > create table c.t1_p1_p2 partition of a.t1_p1 FOR VALUES in (3,4); > create table b.t1_p2 partition of t1 FOR VALUES in (5,6,7,8) partition by > list(a); > create table a.t1_p2_p1 partition of b.t1_p2 FOR VALUES in (5,6); > create table t1_p2_p2 partition of b.t1_p2 FOR VALUES in (7,8); > > insert into t1 values (8,'t1'); > insert into a.t1_p1 values (2,'a.t1_p1'); > insert into b.t1_p1_p1 values (1,'b.t1_p1_p1'); > insert into c.t1_p1_p2 values (3,'c.t1_p1_p2'); > insert into b.t1_p2 values (6,'b.t1_p2'); > insert into a.t1_p2_p1 values (5,'a.t1_p2_p1'); > insert into t1_p2_p2 values (7,'t1_p2_p1'); > insert into t1 values (4 ,'t1'); > > --trying to take pg_dump > [edb@localhost bin]$ ./pg_dump -d postgres --schema=a -f d1.dump -Fp > [edb@localhost bin]$ ./pg_dump -d postgres --load-via-partition-root -f > d2.dump -Fp > [edb@localhost bin]$ ./pg_dump -d postgres --load-via-partition-root > --schema=a -f d3.dump -Fp > pg_dump: pg_dump.c:2063: getRootTableInfo: Assertion `tbinfo->numParents > == 1' failed. > Aborted (core dumped) > > > > Thanks & Regards, > Rajkumar Raghuwanshi > QMG, EnterpriseDB Corporation > > On Fri, Aug 4, 2017 at 3:01 PM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > >> >> Here is an update patch, now renamed the switch to >> --load-via-partition-root >> and also added the documentation for the new switch into pg_dump as well >> as pg_dumpall. >> >> >> On Fri, Aug 4, 2017 at 7:13 AM, Amit Langote < >> langote_amit...@lab.ntt.co.jp> wrote: >> >>> On 2017/08/04 1:08, David G. 
Johnston wrote: >>> > On Thu, Aug 3, 2017 at 8:53 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: >>> > >>> >> Robert Haas <robertmh...@gmail.com> writes: >>> >>> So maybe --load-via-partition-root if nobody likes my previous >>> >>> suggestion of --partition-data-via-root ? >>> >> >>> >> WFM. >>> >> >>> > >>> > +1 >>> >>> +1. >>> >>> Thanks, >>> Amit >>> >>> >> >> >> Thanks, >> Rushabh Lathia >> www.EnterpriseDB.com >> >> >> -- >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-hackers >> >> > -- Rushabh Lathia
Re: [HACKERS] reload-through-the-top-parent switch the partition table
Here is an update patch, now renamed the switch to --load-via-partition-root and also added the documentation for the new switch into pg_dump as well as pg_dumpall. On Fri, Aug 4, 2017 at 7:13 AM, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote: > On 2017/08/04 1:08, David G. Johnston wrote: > > On Thu, Aug 3, 2017 at 8:53 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > > > >> Robert Haas <robertmh...@gmail.com> writes: > >>> So maybe --load-via-partition-root if nobody likes my previous > >>> suggestion of --partition-data-via-root ? > >> > >> WFM. > >> > > > > +1 > > +1. > > Thanks, > Amit > > Thanks, Rushabh Lathia www.EnterpriseDB.com diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index bafa031..65f7e98 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -889,6 +889,16 @@ PostgreSQL documentation + --load-via-partition-root + + +Dump data via the top-most partitioned table (rather than partition +table) when dumping data for the partition table. + + + + + --section=sectionname diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml index aa944a2..54b8e7e 100644 --- a/doc/src/sgml/ref/pg_dumpall.sgml +++ b/doc/src/sgml/ref/pg_dumpall.sgml @@ -431,6 +431,16 @@ PostgreSQL documentation + --load-via-partition-root + + +Dump data via the top-most partitioned table (rather than partition +table) when dumping data for the partition table. 
+
+
+
+
   --use-set-session-authorization

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 144068a..ce3100f 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -157,6 +157,7 @@ typedef struct _dumpOptions
 	int			outputNoTablespaces;
 	int			use_setsessauth;
 	int			enable_row_security;
+	int			load_via_partition_root;

 	/* default, if no "inclusion" switches appear, is to dump everything */
 	bool		include_everything;

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 393b9e2..27e23ca 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -269,6 +269,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 						const char *prefix, Archive *fout);
 static char *get_synchronized_snapshot(Archive *fout);
 static void setupDumpWorker(Archive *AHX);
+static TableInfo *getRootTableInfo(TableInfo *tbinfo);

 int
@@ -345,6 +346,7 @@ main(int argc, char **argv)
 		{"lock-wait-timeout", required_argument, NULL, 2},
 		{"no-tablespaces", no_argument, &dopt.outputNoTablespaces, 1},
 		{"quote-all-identifiers", no_argument, &quote_all_identifiers, 1},
+		{"load-via-partition-root", no_argument, &dopt.load_via_partition_root, 1},
 		{"role", required_argument, NULL, 3},
 		{"section", required_argument, NULL, 5},
 		{"serializable-deferrable", no_argument, &dopt.serializable_deferrable, 1},
@@ -959,6 +961,7 @@ help(const char *progname)
 	printf(_("  --no-tablespaces             do not dump tablespace assignments\n"));
 	printf(_("  --no-unlogged-table-data     do not dump unlogged table data\n"));
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
+	printf(_("  --load-via-partition-root    load partition table via the root relation\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
@@ -1902,8 +1905,32 @@ dumpTableData_insert(Archive *fout, void *dcontext)
 	if
(insertStmt == NULL)
 	{
 		insertStmt = createPQExpBuffer();
+
+		/*
+		 * When load-via-partition-root is set, get the root relation name
+		 * for the partition table, so that we can reload data through
+		 * the root table.
+		 */
+		if (dopt->load_via_partition_root && tbinfo->ispartition)
+		{
+			TableInfo  *parentTbinfo;
+
+			parentTbinfo = getRootTableInfo(tbinfo);
+
+			/*
+			 * When we loading data through the root, we will qualify
+			 * the table name. This is needed because earlier
+			 * search_path will be set for the partition table.
+			 */
+			classname = (char *) fmtQualifiedId(fout->remoteVersion,
+												parentTbinfo->dobj.namespace->dobj.name,
+												parentTbinfo->dobj.name);
+		}
+		else
+			classname = fmtId(tbinfo->dobj.name);
+
 		appendPQExpBuffer(insertStmt, "INSERT INTO %s ",
-						  fmtId(classname));
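The body of getRootTableInfo() is not included in the quoted hunk; presumably it just walks the single-parent chain upward until it reaches a table that is not itself a partition. A self-contained sketch under that assumption (the struct here is a stripped-down stand-in for pg_dump's TableInfo, not the real definition):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for pg_dump's TableInfo, for illustration only. */
typedef struct TableInfo
{
    const char *name;
    int         ispartition;
    int         numParents;
    struct TableInfo **parents;
} TableInfo;

/*
 * Walk up the partition hierarchy to the topmost non-partition table.
 * A partition always has exactly one parent; the assertion failure
 * reported upthread happened because, with --schema, the parent links
 * had never been loaded for non-dumpable tables.
 */
static TableInfo *
getRootTableInfo(TableInfo *tbinfo)
{
    while (tbinfo->ispartition)
    {
        assert(tbinfo->numParents == 1);
        tbinfo = tbinfo->parents[0];
    }
    return tbinfo;
}
```

With the multi-level test case upthread (t1 -> a.t1_p1 -> b.t1_p1_p1), this walk would map the leaf partition back to t1, which is why the emitted INSERTs target the root.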
Re: [HACKERS] reload-through-the-top-parent switch the partition table
On Thu, Aug 3, 2017 at 3:39 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > On Wed, Aug 2, 2017 at 11:47 PM, David G. Johnston > <david.g.johns...@gmail.com> wrote: > > On Wed, Aug 2, 2017 at 10:58 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > >> > >> Robert Haas <robertmh...@gmail.com> writes: > >> > On Wed, Aug 2, 2017 at 1:08 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > >> >> --restore-via-partition-root ? > >> > >> > I worry someone will think that pg_dump is now restoring stuff, but it > >> > isn't. > >> > >> Well, the point is that the commands it emits will cause the eventual > >> restore to go through the root. Anyway, I think trying to avoid using > >> a verb altogether is going to result in a very stilted option name. > >> > >> I notice that the option list already includes some references to > >> "insert", so maybe "--insert-via-partition-root"? Although you could > >> argue that that's confusing when we're using COPY. > > > > > > --use-partitioned-table [partition-name, ...] # if names are omitted it > > defaults to all partitioned tables > > I like this idea since it allows using this feature for selected > tables e.g. hash tables. Otherwise, users will be forced to use this > option even when there is only a single hash partitioned table and > many other list/range partitioned tables. > > +1. > What we are trying to do here is dump the data in a partitioned table > as if it's not partitioned. Combine that with --data-only dumps, and > one could use it to load partitioned data into unpartitioned or > differently partitioned table. So, how about naming the option as > > --unpartition-partitioned-table [partitioned-table-name, ] # if > names are omitted it defaults to all the partitioned tables. > > --unpartition-partitioned-table is very confusing. I liked the previous option which is --use-partitioned-table [partition-name, ...]. The Only problem with --use-partitioned-table is, a user needs to specify the partition-name in the options list. 
Imagine someone having hundreds of partitions: specifying all those names in the pg_dump options is a pain. Rather than that:

--use-partitioned-table [partitioned_name, ...] # if names are omitted it defaults to all the partitioned tables.

Here the user needs to specify the root relation name in the option - and any partition table that has it as ROOT will load its data through the top parent relation.

> That really says what dump is really doing without focusing on how the
> data will be used like restoring/inserting/copying etc.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

--
Rushabh Lathia
Re: [HACKERS] reload-through-the-top-parent switch the partition table
On Wed, Aug 2, 2017 at 3:55 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Aug 1, 2017 at 5:34 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote:
> > My colleague Robert and I had a doubt about the ordering of TABLE
> > and TABLE_DATA. We thought earlier that reload-through-root might
> > not solve the purpose which has been discussed in the above
> > mentioned thread. But later, looking into the code, I realized the sort
> > order for DO_TABLE and DO_TABLE_DATA are different, so we don't need
> > to worry about that issue.
>
> Hmm.  Does that mean that table data restoration will *absolutely
> always* follow all CREATE TABLE commands, or is that something that's
> *normally* true but potentially false if dependency sorting switches
> things around?

Looking at the dbObjectTypePriority comments, it seems that data restoration will *absolutely always* follow all CREATE TABLE commands.

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Thanks,
Rushabh Lathia
www.EnterpriseDB.com
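The mechanism behind that guarantee is that pg_dump's sort uses a per-object-type priority table (dbObjectTypePriority) as the base ordering, with DO_TABLE assigned a strictly lower priority than DO_TABLE_DATA, before dependency-driven topological sorting refines it. A toy illustration of priority-based ordering (the enum members echo pg_dump's names, but these priority values are made up, not pg_dump's actual numbers):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for pg_dump's object types and dbObjectTypePriority. */
typedef enum { DO_NAMESPACE, DO_TABLE, DO_TABLE_DATA, DO_INDEX } DumpObjType;

static const int priority[] = {
    [DO_NAMESPACE]  = 1,
    [DO_TABLE]      = 2,    /* every table sorts before ...          */
    [DO_TABLE_DATA] = 3,    /* ... any table data, regardless of ties */
    [DO_INDEX]      = 4,
};

/* qsort comparator: order objects purely by type priority. */
static int
cmp_by_priority(const void *a, const void *b)
{
    return priority[*(const DumpObjType *) a] -
           priority[*(const DumpObjType *) b];
}
```

Because the comparison is on object type first, no interleaving of CREATE TABLE and data entries can survive the sort; dependencies only reorder objects within what this base ordering already permits.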
[HACKERS] reload-through-the-top-parent switch the partition table
https://www.postgresql.org/message-id/CA%2BTgmoZFn7TJ7QBsFatnuEE%3DGYGdZSNXqr9489n5JBsdy5rFfA%40mail.gmail.com

In the above thread, it's been pointed out as an important consideration to add a reload-through-the-top-parent switch for the partition table. One small step toward making use of the hash function was adding a switch into pg_dump which reloads through the top parent for the partition table.

The attached patch implements the same:

- Added switch reload-through-root (open for suggestions on the switch name).
- Fixed dumpTableData to COPY to the root table with the reload-through-root switch.
- Fixed dumpTableData_insert to generate the proper INSERT statement with the reload-through-root switch.
- Added switch reload-through-root into pg_dumpall.

Testing:

- Performed tests with multi-level partitions + different schema combinations.
- Tested a case with parent and child attributes in different order.

Attaching the pg_dump output for the below test with --reload-through-root for COPY and INSERTs.

Testcase:

create schema a;
create schema b;
create schema c;

create table a.t1 ( a int, b int ) partition by list(a);
create table b.t1_p1 partition of a.t1 FOR VALUES in ( 1, 2, 3) partition by list(b);
create table c.t1_p1_p1 partition of b.t1_p1 FOR VALUES in (10, 20 );

insert into a.t1 values ( 2 , 10 );
insert into a.t1 values ( 1 , 20 );

My colleague Robert and I had a doubt about the ordering of TABLE and TABLE_DATA. We thought earlier that reload-through-root might not solve the purpose which has been discussed in the above mentioned thread. But later, looking into the code, I realized the sort order for DO_TABLE and DO_TABLE_DATA are different, so we don't need to worry about that issue.
TODOs: - Update documentation for pg_dump & pg_dumpall Thanks, Rushabh Lathia www.EnterpriseDB.com -- -- PostgreSQL database dump -- -- Dumped from database version 10beta2 -- Dumped by pg_dump version 10beta2 SET statement_timeout = 0; SET lock_timeout = 0; SET idle_in_transaction_session_timeout = 0; SET client_encoding = 'UTF8'; SET standard_conforming_strings = on; SET check_function_bodies = false; SET client_min_messages = warning; SET row_security = off; -- -- Name: postgres; Type: COMMENT; Schema: -; Owner: rushabh -- COMMENT ON DATABASE postgres IS 'default administrative connection database'; -- -- Name: a; Type: SCHEMA; Schema: -; Owner: rushabh -- CREATE SCHEMA a; ALTER SCHEMA a OWNER TO rushabh; -- -- Name: b; Type: SCHEMA; Schema: -; Owner: rushabh -- CREATE SCHEMA b; ALTER SCHEMA b OWNER TO rushabh; -- -- Name: c; Type: SCHEMA; Schema: -; Owner: rushabh -- CREATE SCHEMA c; ALTER SCHEMA c OWNER TO rushabh; -- -- Name: plpgsql; Type: EXTENSION; Schema: -; Owner: -- CREATE EXTENSION IF NOT EXISTS plpgsql WITH SCHEMA pg_catalog; -- -- Name: EXTENSION plpgsql; Type: COMMENT; Schema: -; Owner: -- COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language'; SET search_path = a, pg_catalog; SET default_tablespace = ''; SET default_with_oids = false; -- -- Name: t1; Type: TABLE; Schema: a; Owner: rushabh -- CREATE TABLE t1 ( a integer, b integer ) PARTITION BY LIST (a); ALTER TABLE t1 OWNER TO rushabh; SET search_path = b, pg_catalog; -- -- Name: t1_p1; Type: TABLE; Schema: b; Owner: rushabh -- CREATE TABLE t1_p1 PARTITION OF a.t1 FOR VALUES IN (1, 2, 3) PARTITION BY LIST (b); ALTER TABLE t1_p1 OWNER TO rushabh; SET search_path = c, pg_catalog; -- -- Name: t1_p1_p1; Type: TABLE; Schema: c; Owner: rushabh -- CREATE TABLE t1_p1_p1 PARTITION OF b.t1_p1 FOR VALUES IN (10, 20); ALTER TABLE t1_p1_p1 OWNER TO rushabh; -- -- Data for Name: t1_p1_p1; Type: TABLE DATA; Schema: c; Owner: rushabh -- INSERT INTO a.t1 VALUES (1, 20); INSERT INTO a.t1 VALUES (2, 10); -- 
-- PostgreSQL database dump complete -- -- -- PostgreSQL database dump -- -- Dumped from database version 10beta2 -- Dumped by pg_dump version 10beta2 SET statement_timeout = 0; SET lock_timeout = 0; SET idle_in_transaction_session_timeout = 0; SET client_encoding = 'UTF8'; SET standard_conforming_strings = on; SET check_function_bodies = false; SET client_min_messages = warning; SET row_security = off; -- -- Name: postgres; Type: COMMENT; Schema: -; Owner: rushabh -- COMMENT ON DATABASE postgres IS 'default administrative connection database'; -- -- Name: a; Type: SCHEMA; Schema: -; Owner: rushabh -- CREATE SCHEMA a; ALTER SCHEMA a OWNER TO rushabh; -- -- Name: b; Type: SCHEMA; Schema: -; Owner: rushabh -- CREATE SCHEMA b; ALTER SCHEMA b OWNER TO rushabh; -- -- Name: c; Type: SCHEMA; Schema: -; Owner: rushabh -- CREATE SCHEMA c; ALTER SCHEMA c OWNER TO rushabh; -- -- Name: plpgsql; Type: EXTENSION; Schema: -; Owner: -- CREATE EXTENSION IF NOT EXISTS plpgsql WITH SCHEMA pg_catalog; -- -- Name: EXTENSION plpgsql; Type: COMMENT; Schema: -; Owner: -- COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language'; SET search_path = a, pg_catalog; SET default_tables
Re: [HACKERS] cache lookup failed error for partition key with custom opclass
On Tue, Jul 25, 2017 at 7:43 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Rushabh Lathia <rushabh.lat...@gmail.com> writes: > > On Mon, Jul 24, 2017 at 7:23 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > >> Some looking around says that this *isn't* the only place that just > >> blithely assumes that it will find an opfamily entry. But I agree > >> that not checking for that isn't up to project standards. > > > I go thorough the get_opfamily_proc() in the system and added the > > check for InvalidOid. > > Think I did that already, please compare your results with > 278cb4341103e967189997985b09981a73e23a34 > Thanks Tom. > > regards, tom lane > -- Rushabh Lathia
Re: [HACKERS] cache lookup failed error for partition key with custom opclass
On Mon, Jul 24, 2017 at 7:23 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Rushabh Lathia <rushabh.lat...@gmail.com> writes:
> > PFA patch, where added elog() to add the error message same as all other
> > places.
>
> Some looking around says that this *isn't* the only place that just
> blithely assumes that it will find an opfamily entry.  But I agree
> that not checking for that isn't up to project standards.

Thanks Tom. I went through the get_opfamily_proc() calls in the system and added the check for InvalidOid.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 5267a01..c9c1a54 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -1640,6 +1640,9 @@ ExecInitExprRec(Expr *node, PlanState *parent, ExprState *state,
 										lefttype,
 										righttype,
 										BTORDER_PROC);
+			if (!OidIsValid(proc))		/* should not happen */
+				elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+					 BTORDER_PROC, lefttype, righttype, opfamily);

 			/* Set up the primary fmgr lookup information */
 			finfo = palloc0(sizeof(FmgrInfo));

diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index d8aceb1..ff758d6 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -1367,6 +1367,9 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 												   op_lefttype,
 												   op_righttype,
 												   BTORDER_PROC);
+				if (!OidIsValid(opfuncid))		/* should not happen */
+					elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+						 BTORDER_PROC, op_lefttype, op_righttype, opfamily);

 				inputcollation = lfirst_oid(collids_cell);
 				collids_cell = lnext(collids_cell);

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 43238dd..6bd93b0 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -945,6 +945,10 @@ RelationBuildPartitionKey(Relation relation)
 									 opclassform->opcintype,
 									 opclassform->opcintype,
 									 BTORDER_PROC);
+		if (!OidIsValid(funcid))	/* should not happen */
+			elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+				 BTORDER_PROC, opclassform->opcintype, opclassform->opcintype,
+				 opclassform->opcfamily);

 		fmgr_info(funcid, &key->partsupfunc[i]);

diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c
index 7ec31eb..bc2f164 100644
--- a/src/backend/utils/cache/typcache.c
+++ b/src/backend/utils/cache/typcache.c
@@ -440,6 +440,10 @@ lookup_type_cache(Oid type_id, int flags)
 										 typentry->btree_opintype,
 										 typentry->btree_opintype,
 										 BTORDER_PROC);
+		if (!OidIsValid(cmp_proc))	/* should not happen */
+			elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+				 BTORDER_PROC, typentry->btree_opintype,
+				 typentry->btree_opintype, typentry->btree_opf);

 		/* As above, make sure array_cmp or record_cmp will succeed */
 		if (cmp_proc == F_BTARRAYCMP &&

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] cache lookup failed error for partition key with custom opclass
Hi,

Consider the following test:

CREATE OR REPLACE FUNCTION dummy_binaryint4(a int4, b int4) RETURNS int4 AS $$ BEGIN RETURN a; END; $$ LANGUAGE 'plpgsql' IMMUTABLE;

CREATE OPERATOR CLASS custom_opclass2 FOR TYPE int2 USING BTREE AS OPERATOR 1 = , FUNCTION 1 dummy_binaryint4(int4, int4);

t=# CREATE TABLE list_tab(a int2, b int) PARTITION BY LIST (a custom_opclass2);
*ERROR: cache lookup failed for function 0*

In the above test, we create an operator class for type int2 but pass it a function of int4 type. During CREATE TABLE ... PARTITION BY, ComputePartitionAttrs() is able to resolve the opclass for the partition key (the partition key type is int2), but when looking up the support function for int2 it is unable to find a proper function and ends up with the "cache lookup failed" error. The error comes from RelationBuildPartitionKey().

I think overall this is expected, but the error could still be better - like all the other places where get_opfamily_proc() is unable to find a valid function OID.

PFA patch, where an elog() is added to emit the same error message as all the other places.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 43238dd..6bd93b0 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -945,6 +945,10 @@ RelationBuildPartitionKey(Relation relation)
 									 opclassform->opcintype,
 									 opclassform->opcintype,
 									 BTORDER_PROC);
+		if (!OidIsValid(funcid))	/* should not happen */
+			elog(ERROR, "missing support function %d(%u,%u) in opfamily %u",
+				 BTORDER_PROC, opclassform->opcintype, opclassform->opcintype,
+				 opclassform->opcfamily);

 		fmgr_info(funcid, &key->partsupfunc[i]);
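The failure mode is easy to see in miniature: get_opfamily_proc() returns InvalidOid (0) when no matching pg_amproc entry exists, and passing that 0 straight into fmgr_info() is what produces the unhelpful "cache lookup failed for function 0". The guard the patch adds is the usual OidIsValid() check. A self-contained sketch of the pattern (the lookup function below is a stand-in for illustration, not the real catalog lookup):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;
#define InvalidOid  ((Oid) 0)
#define OidIsValid(objectId)  ((bool) ((objectId) != InvalidOid))

/*
 * Stand-in for get_opfamily_proc(): returns the support function's OID,
 * or InvalidOid when the opfamily has no entry for the given types --
 * which is exactly what happens with the int2 opclass + int4 function
 * in the test above.
 */
static Oid
lookup_support_proc(bool entry_exists)
{
    return entry_exists ? (Oid) 16384 : InvalidOid;
}

/*
 * Returns false instead of proceeding with OID 0; the real code would
 * raise elog(ERROR, "missing support function ...") here.
 */
static bool
check_support_proc(bool entry_exists)
{
    Oid funcid = lookup_support_proc(entry_exists);

    if (!OidIsValid(funcid))    /* should not happen */
        return false;
    return true;
}
```

The point of the patch is simply to move the failure from the opaque fmgr lookup to an explicit, self-describing error at the call site.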
[HACKERS] RLS policy not getting honer while pg_dump on declarative partition
While doing some testing I noticed that the RLS policy is not getting honored during pg_dump on a declarative partition. I can understand that while doing SELECT on an individual child table, the policy of the parent is not applied. But is this desirable behaviour? I think for partitions, any policy on the root table should get redirected to the child - thoughts? If the current behaviour is desirable then at least we should document it.

Consider the below test:

\c postgres rushabh
CREATE USER rls_test_user1;

CREATE TABLE tp_sales ( visibility VARCHAR(30), sales_region VARCHAR(30) ) PARTITION BY LIST (sales_region);
create table tp_sales_p_india partition of tp_sales for values in ('INDIA');
create table tp_sales_p_rest partition of tp_sales for values in ('REST');

insert into tp_sales values ( 'hidden', 'INDIA');
insert into tp_sales values ( 'visible', 'INDIA');
insert into tp_sales values ( 'hidden', 'REST');
insert into tp_sales values ( 'visible', 'REST');

GRANT SELECT ON tp_sales to rls_test_user1;
GRANT SELECT ON tp_sales_p_india to rls_test_user1;
GRANT SELECT ON tp_sales_p_rest to rls_test_user1;

ALTER TABLE tp_sales ENABLE ROW LEVEL SECURITY;
CREATE POLICY dump_p1 ON tp_sales FOR ALL USING (visibility = 'visible');

\c - rls_test_user1

-- SELECT honors the policy
SELECT * FROM tp_sales;

When we run pg_dump as user rls_test_user1, we can see the hidden rows in the pg_dump output.

./db/bin/pg_dump -U rls_test_user1 postgres --inserts

Attaching the dump output.
Thanks, Rushabh Lathia www.EnterpriseDB.com -- -- PostgreSQL database dump -- -- Dumped from database version 10beta1 -- Dumped by pg_dump version 10beta1 SET statement_timeout = 0; SET lock_timeout = 0; SET idle_in_transaction_session_timeout = 0; SET client_encoding = 'UTF8'; SET standard_conforming_strings = on; SET check_function_bodies = false; SET client_min_messages = warning; SET row_security = off; -- -- Name: postgres; Type: COMMENT; Schema: -; Owner: rushabh -- COMMENT ON DATABASE postgres IS 'default administrative connection database'; -- -- Name: plpgsql; Type: EXTENSION; Schema: -; Owner: -- CREATE EXTENSION IF NOT EXISTS plpgsql WITH SCHEMA pg_catalog; -- -- Name: EXTENSION plpgsql; Type: COMMENT; Schema: -; Owner: -- COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language'; SET search_path = public, pg_catalog; SET default_tablespace = ''; SET default_with_oids = false; -- -- Name: tp_sales; Type: TABLE; Schema: public; Owner: rushabh -- CREATE TABLE tp_sales ( visibility character varying(30), sales_region character varying(30) ) PARTITION BY LIST (sales_region); ALTER TABLE tp_sales OWNER TO rushabh; -- -- Name: tp_sales_p_india; Type: TABLE; Schema: public; Owner: rushabh -- CREATE TABLE tp_sales_p_india PARTITION OF tp_sales FOR VALUES IN ('INDIA'); ALTER TABLE tp_sales_p_india OWNER TO rushabh; -- -- Name: tp_sales_p_rest; Type: TABLE; Schema: public; Owner: rushabh -- CREATE TABLE tp_sales_p_rest PARTITION OF tp_sales FOR VALUES IN ('REST'); ALTER TABLE tp_sales_p_rest OWNER TO rushabh; -- -- Data for Name: tp_sales_p_india; Type: TABLE DATA; Schema: public; Owner: rushabh -- INSERT INTO tp_sales_p_india VALUES ('hidden', 'INDIA'); INSERT INTO tp_sales_p_india VALUES ('visible', 'INDIA'); -- -- Data for Name: tp_sales_p_rest; Type: TABLE DATA; Schema: public; Owner: rushabh -- INSERT INTO tp_sales_p_rest VALUES ('hidden', 'REST'); INSERT INTO tp_sales_p_rest VALUES ('visible', 'REST'); -- -- Name: tp_sales dump_p1; Type: POLICY; Schema: 
public; Owner: rushabh -- CREATE POLICY dump_p1 ON tp_sales USING (((visibility)::text = 'visible'::text)); -- -- Name: tp_sales; Type: ROW SECURITY; Schema: public; Owner: rushabh -- ALTER TABLE tp_sales ENABLE ROW LEVEL SECURITY; -- -- Name: tp_sales; Type: ACL; Schema: public; Owner: rushabh -- GRANT SELECT ON TABLE tp_sales TO rls_test_user1; -- -- Name: tp_sales_p_india; Type: ACL; Schema: public; Owner: rushabh -- GRANT SELECT ON TABLE tp_sales_p_india TO rls_test_user1; -- -- Name: tp_sales_p_rest; Type: ACL; Schema: public; Owner: rushabh -- GRANT SELECT ON TABLE tp_sales_p_rest TO rls_test_user1; -- -- PostgreSQL database dump complete --
Re: [HACKERS] UPDATE of partition key
On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> > Option 3
> >
> > BR, AR delete triggers on source partition
> > BR, AR insert triggers on destination partition.
> >
> > Rationale :
> > Since the update is converted to delete+insert, just skip the update
> > triggers completely.
>
> +1 to option3
> Generally, BR triggers are used for updating the ROW value and AR
> triggers to VALIDATE the row or to modify some other tables. So it
> seems that we can fire the triggers what is actual operation is
> happening at the partition level.
>
> For source partition, it's only the delete operation (no update
> happened) so we fire delete triggers and for the destination only
> insert operations so fire only inserts triggers. That will keep the
> things simple. And, it will also be in sync with the actual partition
> level delete/insert operations.
>
> We may argue that user might have declared only update triggers and as
> he has executed the update operation he may expect those triggers to
> get fired. But, I think this behaviour can be documented with the
> proper logic that if the user is updating the partition key then he
> must be ready with the Delete/Insert triggers also, he can not rely
> only upon update level triggers.

Right, that is my concern too: a user might have declared only UPDATE triggers, and when executing an UPDATE would expect them to be called - but with option 3 that does not happen.

In terms of consistency, option 1 looks better. It does the same thing that has been implemented for UPSERT, so users may already be familiar with that trigger behaviour. Plus, if we document the behaviour then it reads consistently:

- The original command was UPDATE, so BR update triggers fire.
- Then we find that it is row movement, so BR delete followed by AR delete.
- Then the insert into the new partition, so BR insert followed by AR insert.
But again, I am not quite sure how good it will be to compare the partition behaviour with UPSERT. > Earlier I thought that option1 is better but later I think that this > can complicate the situation as we are firing first BR update then BR > delete and can change the row multiple time and defining such > behaviour can be complicated. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com -- Rushabh Lathia
Re: [HACKERS] Adding support for Default partition in partitioning
On 2017/04/06 0:19, Robert Haas wrote: > On Wed, Apr 5, 2017 at 5:57 AM, Rahila Syed <rahilasye...@gmail.com> wrote: >>> Could you briefly elaborate why you think the lack global index support >>> would be a problem in this regard? >> I think following can happen if we allow rows satisfying the new partition >> to lie around in the >> default partition until background process moves it. >> Consider a scenario where partition key is a primary key and the data in the >> default partition is >> not yet moved into the newly added partition. If now, new data is added into >> the new partition >> which also exists(same key) in default partition there will be data >> duplication. If now >> we scan the partitioned table for that key(from both the default and new >> partition as we >> have not moved the rows) it will fetch the both rows. >> Unless we have global indexes for partitioned tables, there is chance of >> data duplication between >> child table added after default partition and the default partition. > > Yes, I think it would be completely crazy to try to migrate the data > in the background: > > - The migration might never complete because of a UNIQUE or CHECK > constraint on the partition to which rows are being migrated. > > - Even if the migration eventually succeeded, such a design abandons > all hope of making INSERT .. ON CONFLICT DO NOTHING work sensibly > while the migration is in progress, unless the new partition has no > UNIQUE constraints. > > - Partition-wise join and partition-wise aggregate would need to have > special case handling for the case of an unfinished migration, as > would any user code that accesses partitions directly. > > - More generally, I think users expect that when a DDL command > finishes execution, it's done all of the work that there is to do (or > at the very least, that any remaining work has no user-visible > consequences, which would not be the case here). Thanks Robert for this explanation. 
This makes it more clear, why row movement by background is not sensible idea. On Thu, Apr 6, 2017 at 9:38 AM, Keith Fiske <ke...@omniti.com> wrote: > > On Wed, Apr 5, 2017 at 2:51 PM, Keith Fiske <ke...@omniti.com> wrote: > >> >> >> Only issue I see with this, and I'm not sure if it is an issue, is what >> happens to that default constraint clause when 1000s of partitions start >> getting added? From what I gather the default's constraint is built based >> off the cumulative opposite of all other child constraints. I don't >> understand the code well enough to see what it's actually doing, but if >> there are no gaps, is the method used smart enough to aggregate all the >> child constraints to make a simpler constraint that is simply outside the >> current min/max boundaries? If so, for serial/time range partitioning this >> should typically work out fine since there are rarely gaps. This actually >> seems more of an issue for list partitioning where each child is a distinct >> value or range of values that are completely arbitrary. Won't that check >> and re-evaluation of the default's constraint just get worse and worse as >> more children are added? Is there really even a need for the default to >> have an opposite constraint like this? Not sure on how the planner works >> with partitioning now, but wouldn't it be better to first check all >> non-default children for a match the same as it does now without a default >> and, failing that, then route to the default if one is declared? The >> default should accept any data then so I don't see the need for the >> constraint unless it's required for the current implementation. If that's >> the case, could that be changed? >> >> Keith >> > > Actually, thinking on this more, I realized this does again come back to > the lack of a global index. Without the constraint, data could be put > directly into the default that could technically conflict with the > partition scheme elsewhere. 
Perhaps, instead of the constraint, inserts > directly to the default could be prevented on the user level. Writing to > valid children directly certainly has its place, but been thinking about > it, and I can't see any reason why one would ever want to write directly to > the default. It's use case seems to be around being a sort of temporary > storage until that data can be moved to a valid location. Would still need > to allow removal of data, though. > > Not sure if that's even a workable solution. Just trying to think of ways > around the current limitations and still allow this feature. > I like the idea of having a DEFAULT partition for range partitioning. With the way partitioning is designed, a range-partitioned table can have holes between the ranges. I think a DEFAULT for range partitioning is a good idea, mainly when the ranges have holes. When the range is serial then, of course, a DEFAULT partition doesn't make much sense. Regards, Rushabh Lathia www.EnterpriseDB.com
Re: [HACKERS] Adding support for Default partition in partitioning
bal index support. If this restriction is removed, there's still > an > > issue of data duplication after the necessary child table is added. So > > guess it's a matter of deciding which user experience is better for the > > moment? > > I'm not sure about the fate of this particular patch for v10, but until we > implement a solution to move rows and design appropriate interface for the > same, we could error out if moving rows is required at all, like the patch > does. > > +1 I agree about the future plan for row movement; exactly how to do it, I am not quite sure at this stage. I was thinking that CREATE of a new partition is a DDL command, so even if row movement works while holding a lock on the new partition table, that should be fine. I am not quite sure why row movement should happen in a background process. Of course, one idea is that if someone doesn't want the row-movement feature, we might put it under some GUC, or maybe add it as another option to the CREATE partition command. Could you briefly elaborate why you think the lack global index support > would be a problem in this regard? > > I agree that some design is required here to implement a solution > redistribution of rows; not only in the context of supporting the notion > of default partitions, but also to allow the feature to split/merge range > (only?) partitions. I'd like to work on the latter for v11 for which I > would like to post a proposal soon; if anyone would like to collaborate > (ideas, code, review), I look forward to. (sorry for hijacking this > thread.) > > Thanks, > Amit > > > -- Rushabh Lathia
Re: [HACKERS] [COMMITTERS] pgsql: Avoid GatherMerge crash when there are no workers.
Hi, On Mon, Apr 3, 2017 at 10:56 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Sat, Apr 1, 2017 at 7:58 PM, Robert Haas <robertmh...@gmail.com> wrote: > >> On Fri, Mar 31, 2017 at 10:26 PM, Andres Freund <and...@anarazel.de> >> wrote: >> > Hi, >> > >> > On 2017-04-01 01:22:14 +, Robert Haas wrote: >> >> Avoid GatherMerge crash when there are no workers. >> > >> > I think the gather merge code needs a bit more test coverage (sorry to >> > make this a larger theme today). As shown by >> > https://coverage.postgresql.org/src/backend/executor/nodeGat >> herMerge.c.gcov.html >> > we don't actually merge anything (heap_compare_slots is not exercised). >> >> Sounds reasonable. >> > > I am working on this and will submit the patch soon. > > In my local environment I was getting coverage for heap_compare_slots with the existing tests, but it seems that, due to the low number of records fetched, only the leader got involved in fetching tuples in the shared report. I modified the existing Gather Merge test to have the GatherMerge node fetch more rows, to make sure heap_compare_slots gets covered. Also added a test for zero workers. PFA the patch as well as the LCOV report for nodeGatherMerge.c. >> > I btw also wonder if it's good that we have a nearly identical copy of >> > heap_compare_slots and a bunch of the calling code in both >> > nodeMergeAppend.c and nodeGatherMerge.c. On the other hand, it's not >> > heavily envolving code. >> >> Yeah, I don't know. We could alternatively try to move that to some >> common location and merge the two implementations. I'm not sure >> exactly where, though.
>> >> -- >> Robert Haas >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company > -- > Rushabh Lathia -- Rushabh Lathia www.EnterpriseDB.com

[Attachment: LCOV code coverage report for src/backend/executor/nodeGatherMerge.c - Lines: 200 of 208 hit (96.2%), Functions: 12 of 13 hit (92.3%), Date: 2017-04-03]
Re: [HACKERS] [COMMITTERS] pgsql: Avoid GatherMerge crash when there are no workers.
On Sat, Apr 1, 2017 at 7:58 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Fri, Mar 31, 2017 at 10:26 PM, Andres Freund <and...@anarazel.de> > wrote: > > Hi, > > > > On 2017-04-01 01:22:14 +, Robert Haas wrote: > >> Avoid GatherMerge crash when there are no workers. > > > > I think the gather merge code needs a bit more test coverage (sorry to > > make this a larger theme today). As shown by > > https://coverage.postgresql.org/src/backend/executor/ > nodeGatherMerge.c.gcov.html > > we don't actually merge anything (heap_compare_slots is not exercised). > > Sounds reasonable. > I am working on this and will submit the patch soon. > > > I btw also wonder if it's good that we have a nearly identical copy of > > heap_compare_slots and a bunch of the calling code in both > > nodeMergeAppend.c and nodeGatherMerge.c. On the other hand, it's not > > heavily envolving code. > > Yeah, I don't know. We could alternatively try to move that to some > common location and merge the two implementations. I'm not sure > exactly where, though. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company -- Rushabh Lathia
Re: [HACKERS] crashes due to setting max_parallel_workers=0
On Wed, Mar 29, 2017 at 8:38 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > Hi, > > On 03/28/2017 11:07 AM, Rushabh Lathia wrote: > >> ... >> I think we all agree that we should get rid of nreaders from the >> GatherMergeState and need to do some code re-factor. But if I >> understood correctly that Robert's concern was to do that re-factor >> as separate commit. >> >> > Maybe. It depends on how valuable it's to keep Gather and GatherMerge > similar - having nreaders in one and not the other seems a bit weird. But > maybe the refactoring would remove it from both nodes? > > Also, it does not really solve the issue that we're using 'nreaders' or > 'nworkers_launched' to access array with one extra element. > With the latest patches, in gather_merge_init() there is one loop where we initialize the tuple array (which holds the tuples that came from each worker) and the slot (which holds the current tuple). We hold a tuple array for each worker because, when we read tuples from workers, it's a good idea to read several at once for efficiency when possible: this minimizes context-switching overhead. For the leader we don't initialize the slot, because for the leader we directly call ExecProcNode(), which returns the slot; there is also no point in holding a tuple array for the leader. With the latest patches, gather_merge_clear_slots() will only clean the tuple table slot and the tuple array buffer for the workers (nworkers_launched). The only time we loop through "nworkers_launched + 1" is while reading a tuple from each gather merge input. Personally I don't see any problem with that, but I am very much open to suggestions. > > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > -- Rushabh Lathia
Re: [HACKERS] crashes due to setting max_parallel_workers=0
On Tue, Mar 28, 2017 at 10:00 AM, Tomas Vondra <tomas.von...@2ndquadrant.com > wrote: > > > On 03/27/2017 01:40 PM, Rushabh Lathia wrote: > >> >> ... >> I was doing more testing with the patch and I found one more server >> crash with the patch around same area, when we forced the gather >> merge for the scan having zero rows. >> >> create table dept ( deptno numeric, dname varchar(20); >> set parallel_tuple_cost =0; >> set parallel_setup_cost =0; >> set min_parallel_table_scan_size =0; >> set min_parallel_index_scan_size =0; >> set force_parallel_mode=regress; >> explain analyze select * from dept order by deptno; >> >> This is because for leader we don't initialize the slot into gm_slots. So >> in case where launched worker is zero and table having zero rows, we >> end up having NULL slot into gm_slots array. >> >> Currently gather_merge_clear_slots() clear out the tuple table slots for >> each >> gather merge input and returns clear slot. In the patch I modified >> function >> gather_merge_clear_slots() to just clear out the tuple table slots and >> always return NULL when All the queues and heap us exhausted. >> >> > Isn't that just another sign the code might be a bit too confusing? I see > two main issues in the code: > > 1) allocating 'slots' as 'nreaders+1' elements, which seems like a good > way to cause off-by-one errors > > 2) mixing objects with different life spans (slots for readers vs. slot > for the leader) seems like a bad idea too > > I wonder how much we gain by reusing the slot from the leader (I'd be > surprised if it was at all measurable). David posted a patch reworking > this, and significantly simplifying the GatherMerge node. Why not to accept > that? > > > I think we all agree that we should get rid of nreaders from GatherMergeState and do some code refactoring. But if I understood correctly, Robert's concern was to do that refactoring as a separate commit. I picked up David's patch and started reviewing the changes.
I applied that patch on top of my v2 patch (which does the re-factoring of gather_merge_clear_slots). In David's patch, in gather_merge_init(), the loop that allocates the tuple array needs to run only up to nworkers_launched, because we don't hold a tuple array for the leader. I changed that and made some other simple changes based on my v2 patch. I also performed manual testing with the changes. Please find attached the re-factoring patch, plus the v2 patch submitted for the server crash fix. (Attaching both patches here again, for ease of access.) Thanks, -- Rushabh Lathia www.EnterpriseDB.com diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c index 62c399e..2d7eb71 100644 --- a/src/backend/executor/nodeGatherMerge.c +++ b/src/backend/executor/nodeGatherMerge.c @@ -195,9 +195,9 @@ ExecGatherMerge(GatherMergeState *node) /* Set up tuple queue readers to read the results. */ if (pcxt->nworkers_launched > 0) { -node->nreaders = 0; -node->reader = palloc(pcxt->nworkers_launched * - sizeof(TupleQueueReader *)); +node->reader = (TupleQueueReader **) + palloc(pcxt->nworkers_launched * + sizeof(TupleQueueReader *)); Assert(gm->numCols); @@ -205,7 +205,7 @@ ExecGatherMerge(GatherMergeState *node) { shm_mq_set_handle(node->pei->tqueue[i], pcxt->worker[i].bgwhandle); - node->reader[node->nreaders++] = + node->reader[i] = CreateTupleQueueReader(node->pei->tqueue[i], node->tupDesc); } @@ -298,7 +298,7 @@ ExecShutdownGatherMergeWorkers(GatherMergeState *node) { int i; - for (i = 0; i < node->nreaders; ++i) + for (i = 0; i < node->nworkers_launched; ++i) if (node->reader[i]) DestroyTupleQueueReader(node->reader[i]); @@ -344,28 +344,26 @@ ExecReScanGatherMerge(GatherMergeState *node) static void gather_merge_init(GatherMergeState *gm_state) { - int nreaders = gm_state->nreaders; + int nslots = gm_state->nworkers_launched + 1; bool initialize = true; int i; /* * Allocate gm_slots for the number of worker + one more slot for leader.
- * Last slot is always for leader. Leader always calls ExecProcNode() to - * read the tuple which will return the TupleTableSlot. Later it will - * directly get assigned to gm_slot. So just initialize leader gm_slot - * with NULL. For other slots below code will call - * ExecInitExtraTupleSlot() which will do the initialization of worker - * slots. + * The final slot in the array is reserved for the leader process. This + * slot is always populated via ExecProcNode(). This can be set to NULL + * for now. The remaining slo
Re: [HACKERS] crashes due to setting max_parallel_workers=0
On Mon, Mar 27, 2017 at 9:52 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Mon, Mar 27, 2017 at 12:11 PM, David Rowley > <david.row...@2ndquadrant.com> wrote: > > On 28 March 2017 at 04:57, Robert Haas <robertmh...@gmail.com> wrote: > >> On Sat, Mar 25, 2017 at 12:18 PM, Rushabh Lathia > >> <rushabh.lat...@gmail.com> wrote: > >>> About the original issue reported by Tomas, I did more debugging and > >>> found that - problem was gather_merge_clear_slots() was not returning > >>> the clear slot when nreader is zero (means nworkers_launched = 0). > >>> Due to the same scan was continue even all the tuple are exhausted, > >>> and then end up with server crash at gather_merge_getnext(). In the > patch > >>> I also added the Assert into gather_merge_getnext(), about the index > >>> should be less then the nreaders + 1 (leader). > >> > >> Well, you and David Rowley seem to disagree on what the fix is here. > >> His patches posted upthread do A, and yours do B, and from a quick > >> look those things are not just different ways of spelling the same > >> underlying fix, but actually directly conflicting ideas about what the > >> fix should be. Any chance you can review his patches, and maybe he > >> can review yours, and we could try to agree on a consensus position? > >> :-) > > > > When comparing Rushabh's to my minimal patch, both change > > gather_merge_clear_slots() to clear the leader's slot. My fix > > mistakenly changes things so it does ExecInitExtraTupleSlot() on the > > leader's slot, but seems that's not required since > > gather_merge_readnext() sets the leader's slot to the output of > > ExecProcNode(outerPlan). I'd ignore my minimal fix because of that > > mistake. Rushabh's patch sidesteps this by adding a conditional > > pfree() to not free slot that we didn't allocate in the first place. > > > > I do think the code could be improved a bit. I don't like the way > > GatherMergeState's nreaders and nworkers_launched are always the same. 
> > I think this all threw me off a bit and may have been the reason for > > the bug in the first place. > > Yeah, if those fields are always the same, then I think that they > should be merged. That seems undeniably a good idea. Hmm, I agree that it's a good idea, and I will work on that as a separate patch. > Possibly we > should fix the crash bug first, though, and then do that afterwards. > What bugs me a little about Rushabh's fix is that it looks like magic. > You have to know that we're looping over two things and freeing them > up, but there's one more of one thing than the other thing. I think > that at least needs some comments or something. > > So in my second version of the patch I changed gather_merge_clear_slots() to just clear the slots for the workers, plus some other cleanup. It also returns NULL from gather_merge_getnext() when all the queues and the heap are exhausted, where gather_merge_clear_slots() previously returned a cleared slot. This way we make sure that we don't run over freeing the slot for the leader, and gather_merge_getnext() doesn't need to depend on that cleared slot. Thanks, Rushabh Lathia www.EnterpriseDB.com
Re: [HACKERS] crashes due to setting max_parallel_workers=0
On Mon, Mar 27, 2017 at 10:59 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Mon, Mar 27, 2017 at 3:43 AM, Tomas Vondra < > tomas.von...@2ndquadrant.com> wrote: > >> On 03/25/2017 05:18 PM, Rushabh Lathia wrote: >> >>> >>> >>> On Sat, Mar 25, 2017 at 7:01 PM, Peter Eisentraut >>> <peter.eisentr...@2ndquadrant.com >>> <mailto:peter.eisentr...@2ndquadrant.com>> wrote: >>> >>> On 3/25/17 09:01, David Rowley wrote: >>> > On 25 March 2017 at 23:09, Rushabh Lathia < >>> rushabh.lat...@gmail.com <mailto:rushabh.lat...@gmail.com>> wrote: >>> >> Also another point which I think we should fix is, when someone >>> set >>> >> max_parallel_workers = 0, we should also set the >>> >> max_parallel_workers_per_gather >>> >> to zero. So that way it we can avoid generating the gather path >>> with >>> >> max_parallel_worker = 0. >>> > I see that it was actually quite useful that it works the way it >>> does. >>> > If it had worked the same as max_parallel_workers_per_gather, then >>> > likely Tomas would never have found this bug. >>> >>> Another problem is that the GUC system doesn't really support cases >>> where the validity of one setting depends on the current value of >>> another setting. So each individual setting needs to be robust >>> against >>> cases of related settings being nonsensical. >>> >>> >>> Okay. >>> >>> About the original issue reported by Tomas, I did more debugging and >>> found that - problem was gather_merge_clear_slots() was not returning >>> the clear slot when nreader is zero (means nworkers_launched = 0). >>> Due to the same scan was continue even all the tuple are exhausted, >>> and then end up with server crash at gather_merge_getnext(). In the patch >>> I also added the Assert into gather_merge_getnext(), about the index >>> should be less then the nreaders + 1 (leader). >>> >>> PFA simple patch to fix the problem. 
>>> >>> >> I think there are two issues at play, here - the first one is that we >> still produce parallel plans even with max_parallel_workers=0, and the >> second one is the crash in GatherMerge when nworkers=0. >> >> Your patch fixes the latter (thanks for looking into it), which is >> obviously a good thing - getting 0 workers on a busy system is quite >> possible, because all the parallel workers can be already chewing on some >> other query. >> >> > Thanks. > > I was doing more testing with the patch and I found one more server crash in the same area, when we forced a gather merge on a scan having zero rows. create table dept ( deptno numeric, dname varchar(20)); set parallel_tuple_cost =0; set parallel_setup_cost =0; set min_parallel_table_scan_size =0; set min_parallel_index_scan_size =0; set force_parallel_mode=regress; explain analyze select * from dept order by deptno; This is because for the leader we don't initialize the slot in gm_slots. So in the case where the number of launched workers is zero and the table has zero rows, we end up with a NULL slot in the gm_slots array. Currently gather_merge_clear_slots() clears out the tuple table slots for each gather merge input and returns a cleared slot. In the patch I modified gather_merge_clear_slots() to just clear out the tuple table slots and always return NULL when all the queues and the heap are exhausted. > But it seems a bit futile to produce the parallel plan in the first place, >> because with max_parallel_workers=0 we can't possibly get any parallel >> workers ever. I wonder why compute_parallel_worker() only looks at >> max_parallel_workers_per_gather, i.e. why shouldn't it do: >> >>parallel_workers = Min(parallel_workers, max_parallel_workers); >> >> > I agree with you here. Producing the parallel plan when > max_parallel_workers = 0 is wrong.
But rather then your suggested fix, I > think that we should do something like: > > /* > * In no case use more than max_parallel_workers_per_gather or > * max_parallel_workers. > */ > parallel_workers = Min(parallel_workers, Min(max_parallel_workers, > max_parallel_workers_per_gather)); > > > >> Perhaps this was discussed and is actually intentional, though. >> >> > Yes, I am not quite sure about this. > > Regarding handling this at the GUC level - I agree with Peter that that's >> not a good idea. I suppose we could d
Re: [HACKERS] Adding support for Default partition in partitioning
I applied the patch and was trying to perform some testing, but it ends up with a server crash with the test you shared in your starting mail: postgres=# CREATE TABLE list_partitioned ( postgres(# a int postgres(# ) PARTITION BY LIST (a); CREATE TABLE postgres=# postgres=# CREATE TABLE part_default PARTITION OF list_partitioned FOR VALUES IN (DEFAULT); CREATE TABLE postgres=# CREATE TABLE part_1 PARTITION OF list_partitioned FOR VALUES IN (4,5); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. Apart from this, some more explanation is needed in the patch to describe the changes for the DEFAULT partition. For example, I am not quite sure what exactly the latest version of the patch supports: does it support tuple row movement, and will adding a new partition be allowed when the partitioned table has a DEFAULT partition? That is quite difficult to understand from the code changes alone. Another part missing from the patch is test coverage; proper tests should be added that demonstrate what is supported and what is not. Thanks, On Fri, Mar 24, 2017 at 3:25 PM, Rahila Syed <rahilasye...@gmail.com> wrote: > Hello Rushabh, > > Thank you for reviewing. > Have addressed all your comments in the attached patch. The attached patch > currently throws an > error if a new partition is added after default partition. > > >Rather then adding check for default here, I think this should be handle > inside > >get_qual_for_list(). > Have moved the check inside get_qual_for_partbound() as needed to do some > operations > before calling get_qual_for_list() for default partitions. > > Thank you, > Rahila Syed > > On Tue, Mar 21, 2017 at 11:36 AM, Rushabh Lathia <rushabh.lat...@gmail.com > > wrote: > >> I picked this for review and noticed that patch is not getting >> cleanly complied on my environment.
>> >> partition.c: In function ‘RelationBuildPartitionDesc’: >> partition.c:269:6: error: ISO C90 forbids mixed declarations and code >> [-Werror=declaration-after-statement] >> Const*val = lfirst(c); >> ^ >> partition.c: In function ‘generate_partition_qual’: >> partition.c:1590:2: error: ISO C90 forbids mixed declarations and code >> [-Werror=declaration-after-statement] >> PartitionBoundSpec *spec = (PartitionBoundSpec *)bound; >> ^ >> partition.c: In function ‘get_partition_for_tuple’: >> partition.c:1820:5: error: array subscript has type ‘char’ >> [-Werror=char-subscripts] >> result = parent->indexes[partdesc->boundinfo->def_index]; >> ^ >> partition.c:1824:16: error: assignment makes pointer from integer without >> a cast [-Werror] >> *failed_at = RelationGetRelid(parent->reldesc); >> ^ >> cc1: all warnings being treated as errors >> >> Apart from this, I was reading patch here are few more comments: >> >> 1) Variable initializing happening at two place. >> >> @@ -166,6 +170,8 @@ RelationBuildPartitionDesc(Relation rel) >> /* List partitioning specific */ >> PartitionListValue **all_values = NULL; >> boolfound_null = false; >> +boolfound_def = false; >> +intdef_index = -1; >> intnull_index = -1; >> >> /* Range partitioning specific */ >> @@ -239,6 +245,8 @@ RelationBuildPartitionDesc(Relation rel) >> i = 0; >> found_null = false; >> null_index = -1; >> +found_def = false; >> +def_index = -1; >> foreach(cell, boundspecs) >> { >> ListCell *c; >> @@ -249,6 +257,15 @@ RelationBuildPartitionDesc(Relation rel) >> >> >> 2) >> >> @@ -1558,6 +1586,19 @@ generate_partition_qual(Relation rel) >> bound = stringToNode(TextDatumGetCString(boundDatum)); >> ReleaseSysCache(tuple); >> >> +/* Return if it is a default list partition */ >> +PartitionBoundSpec *spec = (PartitionBoundSpec *)bound; >> +ListCell *cell; >> +foreach(cell, spec->listdatums) >> >> More comment on above hunk is needed? 
>> >> Rather then adding check for default here, I think this should be handle >> inside >> get_qual_for_list(). >> >> 3) Code is not aligned with existing >> >> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y >> index 6316688..ebb7db7 100644 >> --- a/src/b
Re: [HACKERS] crashes due to setting max_parallel_workers=0
On Mon, Mar 27, 2017 at 3:43 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > On 03/25/2017 05:18 PM, Rushabh Lathia wrote: > >> >> >> On Sat, Mar 25, 2017 at 7:01 PM, Peter Eisentraut >> <peter.eisentr...@2ndquadrant.com >> <mailto:peter.eisentr...@2ndquadrant.com>> wrote: >> >> On 3/25/17 09:01, David Rowley wrote: >> > On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lat...@gmail.com >> <mailto:rushabh.lat...@gmail.com>> wrote: >> >> Also another point which I think we should fix is, when someone set >> >> max_parallel_workers = 0, we should also set the >> >> max_parallel_workers_per_gather >> >> to zero. So that way it we can avoid generating the gather path >> with >> >> max_parallel_worker = 0. >> > I see that it was actually quite useful that it works the way it >> does. >> > If it had worked the same as max_parallel_workers_per_gather, then >> > likely Tomas would never have found this bug. >> >> Another problem is that the GUC system doesn't really support cases >> where the validity of one setting depends on the current value of >> another setting. So each individual setting needs to be robust >> against >> cases of related settings being nonsensical. >> >> >> Okay. >> >> About the original issue reported by Tomas, I did more debugging and >> found that - problem was gather_merge_clear_slots() was not returning >> the clear slot when nreader is zero (means nworkers_launched = 0). >> Due to the same scan was continue even all the tuple are exhausted, >> and then end up with server crash at gather_merge_getnext(). In the patch >> I also added the Assert into gather_merge_getnext(), about the index >> should be less then the nreaders + 1 (leader). >> >> PFA simple patch to fix the problem. >> >> > I think there are two issues at play, here - the first one is that we > still produce parallel plans even with max_parallel_workers=0, and the > second one is the crash in GatherMerge when nworkers=0. 
> > Your patch fixes the latter (thanks for looking into it), which is > obviously a good thing - getting 0 workers on a busy system is quite > possible, because all the parallel workers can be already chewing on some > other query. > > Thanks. > But it seems a bit futile to produce the parallel plan in the first place, > because with max_parallel_workers=0 we can't possibly get any parallel > workers ever. I wonder why compute_parallel_worker() only looks at > max_parallel_workers_per_gather, i.e. why shouldn't it do: > >parallel_workers = Min(parallel_workers, max_parallel_workers); > > I agree with you here. Producing the parallel plan when max_parallel_workers = 0 is wrong. But rather than your suggested fix, I think that we should do something like: /* * In no case use more than max_parallel_workers_per_gather or * max_parallel_workers. */ parallel_workers = Min(parallel_workers, Min(max_parallel_workers, max_parallel_workers_per_gather)); > Perhaps this was discussed and is actually intentional, though. > > Yes, I am not quite sure about this. Regarding handling this at the GUC level - I agree with Peter that that's > not a good idea. I suppose we could deal with checking the values in the > GUC check/assign hooks, but what we don't have is a way to undo the changes > in all the GUCs. That is, if I do > >SET max_parallel_workers = 0; >SET max_parallel_workers = 16; > > I expect to end up with just max_parallel_workers GUC changed and nothing > else. > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > -- Rushabh Lathia
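The double-Min clamp proposed above is easy to model in isolation. A minimal sketch (the `Min` macro matches PostgreSQL's, but `clamp_parallel_workers` and its parameters are stand-ins for illustration, not the planner's actual variables):

```c
#include <assert.h>

#define Min(x, y) ((x) < (y) ? (x) : (y))

/*
 * Hypothetical stand-in for the clamp discussed in the thread: the planner's
 * worker estimate must never exceed either GUC limit.  With
 * max_parallel_workers = 0 the result is 0, so no parallel plan is useful.
 */
static int
clamp_parallel_workers(int parallel_workers,
                       int max_parallel_workers,
                       int max_parallel_workers_per_gather)
{
    /*
     * In no case use more than max_parallel_workers_per_gather or
     * max_parallel_workers.
     */
    return Min(parallel_workers,
               Min(max_parallel_workers, max_parallel_workers_per_gather));
}
```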
Re: [HACKERS] crashes due to setting max_parallel_workers=0
On Sat, Mar 25, 2017 at 7:01 PM, Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 3/25/17 09:01, David Rowley wrote: > > On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > >> Also another point which I think we should fix is, when someone set > >> max_parallel_workers = 0, we should also set the > >> max_parallel_workers_per_gather > >> to zero. So that way it we can avoid generating the gather path with > >> max_parallel_worker = 0. > > I see that it was actually quite useful that it works the way it does. > > If it had worked the same as max_parallel_workers_per_gather, then > > likely Tomas would never have found this bug. > > Another problem is that the GUC system doesn't really support cases > where the validity of one setting depends on the current value of > another setting. So each individual setting needs to be robust against > cases of related settings being nonsensical. > Okay. About the original issue reported by Tomas, I did more debugging and found that the problem was that gather_merge_clear_slots() was not returning a cleared slot when nreaders is zero (meaning nworkers_launched = 0). Because of that, the scan continued even after all the tuples were exhausted, and ended up with a server crash in gather_merge_getnext(). In the patch I also added an Assert into gather_merge_getnext() that the index should be less than nreaders + 1 (the leader). PFA a simple patch to fix the problem. 
Thanks, Rushabh Lathia www.Enterprisedb.com diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c index 72f30ab..e19bace 100644 --- a/src/backend/executor/nodeGatherMerge.c +++ b/src/backend/executor/nodeGatherMerge.c @@ -431,9 +431,14 @@ gather_merge_clear_slots(GatherMergeState *gm_state) { int i; - for (i = 0; i < gm_state->nreaders; i++) + for (i = 0; i < gm_state->nreaders + 1; i++) { - pfree(gm_state->gm_tuple_buffers[i].tuple); + /* + * Gather merge always allow leader to participate and we don't + * allocate the tuple, so no need to free the tuple for leader. + */ + if (i != gm_state->nreaders) + pfree(gm_state->gm_tuple_buffers[i].tuple); gm_state->gm_slots[i] = ExecClearTuple(gm_state->gm_slots[i]); } @@ -474,6 +479,8 @@ gather_merge_getnext(GatherMergeState *gm_state) */ i = DatumGetInt32(binaryheap_first(gm_state->gm_heap)); + Assert(i < gm_state->nreaders + 1); + if (gather_merge_readnext(gm_state, i, false)) binaryheap_replace_first(gm_state->gm_heap, Int32GetDatum(i)); else -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
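The shape of the fix in the patch above — iterate over nreaders + 1 slots, but skip the pfree for the leader's slot at index nreaders, since no tuple buffer was ever allocated for the leader — can be modeled with a toy structure (this is an illustration only, not the real GatherMergeState):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy model of the GatherMerge slot layout: nreaders worker tuple buffers
 * plus one leader slot at index nreaders that owns no allocated tuple.
 */
typedef struct
{
    int    nreaders;
    char **tuples;              /* nreaders + 1 entries; leader's is NULL */
} ToyGatherMerge;

static void
toy_clear_slots(ToyGatherMerge *gm)
{
    int i;

    /* Loop over all slots, including the leader's (hence nreaders + 1). */
    for (i = 0; i < gm->nreaders + 1; i++)
    {
        /* The leader participates but we never allocated a tuple for it. */
        if (i != gm->nreaders)
            free(gm->tuples[i]);
        gm->tuples[i] = NULL;   /* stands in for ExecClearTuple() */
    }
}
```

Note that with nreaders = 0 (no workers launched), the loop still runs once and clears the leader's slot — which is exactly the case the original code missed.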
Re: [HACKERS] crashes due to setting max_parallel_workers=0
Thanks, Tomas, for reporting the issue, and thanks, David, for working on this. I can see the problem in GatherMerge, especially when nworkers_launched is zero. I will look into this issue and post a fix for it. Another point which I think we should fix: when someone sets max_parallel_workers = 0, we should also set max_parallel_workers_per_gather to zero. That way we can avoid generating a gather path with max_parallel_workers = 0. Thanks, On Sat, Mar 25, 2017 at 2:21 PM, David Rowley <david.row...@2ndquadrant.com> wrote: > On 25 March 2017 at 13:10, Tomas Vondra <tomas.von...@2ndquadrant.com> > wrote: > > while working on a patch I ran into some crashes that seem to be caused > by > > inconsistent handling of max_parallel_workers - queries still seem to be > > planned with parallel plans enabled, but then crash at execution. > > > > The attached script reproduces the issue on a simple query, causing > crashed > > in GatherMerge, but I assume the issue is more general. > > I had a look at this and found a bunch of off by 1 errors in the code. > > I've attached a couple of patches, one is the minimal fix, and one > more complete one. > > In the more complete one I ended up ditching the > GatherMergeState->nreaders altogether. It was always the same as > nworkers_launched anyway, so I saw no point in it. > Here I've attempted to make the code a bit more understandable, to > prevent further confusion about how many elements are in each array. > Probably there's more that can be done here. I see GatherState has > nreaders too, but I've not looked into the detail of if it suffers > from the same weirdness of nreaders always matching nworkers_launched. > > > -- > David Rowley http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia
[HACKERS] [psql] patch to fix ordering in words_after_create array
Hi All, While looking at the code around tab-complete.c, I found that the ordering in the words_after_create array is not correct for DEFAULT PRIVILEGES, which was added by the commit below: commit d7d77f3825122bde55be9e06f6c4851028b99795 Author: Peter Eisentraut <pete...@gmx.net> Date: Thu Mar 16 18:54:28 2017 -0400 psql: Add completion for \help DROP|ALTER PFA a patch to fix this. Thanks, Rushabh Lathia www.EnterpriseDB.com diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index f749406..02a1571 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -999,8 +999,8 @@ static const pgsql_thing_t words_after_create[] = { {"CONFIGURATION", Query_for_list_of_ts_configurations, NULL, THING_NO_SHOW}, {"CONVERSION", "SELECT pg_catalog.quote_ident(conname) FROM pg_catalog.pg_conversion WHERE substring(pg_catalog.quote_ident(conname),1,%d)='%s'"}, {"DATABASE", Query_for_list_of_databases}, - {"DICTIONARY", Query_for_list_of_ts_dictionaries, NULL, THING_NO_SHOW}, {"DEFAULT PRIVILEGES", NULL, NULL, THING_NO_CREATE | THING_NO_DROP}, + {"DICTIONARY", Query_for_list_of_ts_dictionaries, NULL, THING_NO_SHOW}, {"DOMAIN", NULL, _for_list_of_domains}, {"EVENT TRIGGER", NULL, NULL}, {"EXTENSION", Query_for_list_of_extensions}, -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
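Misordered entries like this are easy to catch mechanically, since the array is conventionally kept in ascending strcmp() order. A hypothetical sketch of such a check (the `is_sorted` helper is illustrative, not part of psql):

```c
#include <assert.h>
#include <string.h>

/*
 * Check that an array of keywords is in strictly ascending strcmp() order,
 * as the words_after_create entries are conventionally kept.
 */
static int
is_sorted(const char *const words[], int n)
{
    int i;

    for (i = 1; i < n; i++)
        if (strcmp(words[i - 1], words[i]) >= 0)
            return 0;           /* adjacent pair out of order */
    return 1;
}
```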
Re: [HACKERS] create_unique_path and GEQO
On Wed, Mar 22, 2017 at 3:43 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > Hi, > In create_unique_path() there's comment > /* > * We must ensure path struct and subsidiary data are allocated in main > * planning context; otherwise GEQO memory management causes trouble. > */ > oldcontext = MemoryContextSwitchTo(root->planner_cxt); > > pathnode = makeNode(UniquePath); > > This means that when GEQO resets the memory context, the RelOptInfo > for which this path is created and may be set to cheapest_unique_path > goes away, the unique path lingers on in the planner context. > Shouldn't we instead allocate the path in the same context as the > RelOptInfo similar to mark_dummy_rel()? > > Do you have a test case that can reproduce the issue you explained above? > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Rushabh Lathia
[HACKERS] dblink module printing unnamed connection (with commit acaf7ccb94)
Hi All, The dblink contrib module has started showing the "unnamed" connection name. Consider the test below: postgres=# CREATE ROLE alice NOSUPERUSER NOCREATEDB NOCREATEROLE LOGIN PASSWORD 'wonderland'; CREATE ROLE postgres=# GRANT EXECUTE ON FUNCTION dblink_connect_u(text,text) to alice; GRANT postgres=# \c postgres alice You are now connected to database "postgres" as user "alice". postgres=> SELECT dblink_connect_u('sm_conn_30','dbname=postgres user=alice password=wonderland'); dblink_connect_u -- OK (1 row) postgres=> SELECT * FROM dblink_send_query('sm_conn_30','SELECT pg_stat_reset()') as dgr; dgr - 1 (1 row) postgres=> SELECT * FROM dblink_get_result('sm_conn_30') AS dgr(curr_user boolean); ERROR: permission denied for function pg_stat_reset CONTEXT: Error occurred on dblink connection named "*unnamed*": could not execute query. This started with the commit below: commit acaf7ccb94a3916ea83712671a3563f0eb595558 Author: Peter Eisentraut <pete...@gmx.net> Date: Sun Dec 25 12:00:00 2016 -0500 dblink: Replace some macros by static functions Also remove some unused code and the no longer useful dblink.h file. Reviewed-by: Tsunakawa, Takayuki <tsunakawa.ta...@jp.fujitsu.com> Before this commit, the macro used to assign the conname local variable. I quickly worked on a fix; the attached patch assigns the conname local variable so that the error context shows the correct connection name. 
Regards, Rushabh Lathia www.EnterpriseDB.com diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index c1e9089..e42cec4 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -679,7 +679,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async) PG_TRY(); { char *sql = NULL; - char *conname = NULL; + char *conname = text_to_cstring(PG_GETARG_TEXT_PP(0)); bool fail = true; /* default to backward compatible */ if (!is_async) @@ -687,7 +687,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async) if (PG_NARGS() == 3) { /* text,text,bool */ -dblink_get_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)), , , ); +dblink_get_conn(conname, , , ); sql = text_to_cstring(PG_GETARG_TEXT_PP(1)); fail = PG_GETARG_BOOL(2); } @@ -702,7 +702,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async) } else { - dblink_get_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)), , , ); + dblink_get_conn(conname, , , ); sql = text_to_cstring(PG_GETARG_TEXT_PP(1)); } } @@ -722,13 +722,13 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async) if (PG_NARGS() == 2) { /* text,bool */ -conn = dblink_get_named_conn(text_to_cstring(PG_GETARG_TEXT_PP(0))); +conn = dblink_get_named_conn(conname); fail = PG_GETARG_BOOL(1); } else if (PG_NARGS() == 1) { /* text */ -conn = dblink_get_named_conn(text_to_cstring(PG_GETARG_TEXT_PP(0))); +conn = dblink_get_named_conn(conname); } else /* shouldn't happen */ @@ -1390,7 +1390,8 @@ dblink_exec(PG_FUNCTION_ARGS) if (PG_NARGS() == 3) { /* must be text,text,bool */ - dblink_get_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)), , , ); + conname = text_to_cstring(PG_GETARG_TEXT_PP(0)); + dblink_get_conn(conname, , , ); sql = text_to_cstring(PG_GETARG_TEXT_PP(1)); fail = PG_GETARG_BOOL(2); } @@ -1405,7 +1406,8 @@ dblink_exec(PG_FUNCTION_ARGS) } else { -dblink_get_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)), , , ); +conname = text_to_cstring(PG_GETARG_TEXT_PP(0)); +dblink_get_conn(conname, , , ); sql = 
text_to_cstring(PG_GETARG_TEXT_PP(1)); } } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Possible regression with gather merge.
Hi, postgres=# explain analyze select * from test order by v1, v2 limit 10; QUERY PLAN Limit (cost=19576.71..19577.88 rows=10 width=16) (actual time=408.842..408.853 rows=10 loops=1) -> Gather Merge (cost=19576.71..116805.80 rows=84 width=16) (actual time=408.841..408.850 rows=10 loops=1) Workers Planned: 2 Workers Launched: 2 -> Sort (cost=18576.69..19618.36 rows=416667 width=16) (actual time=393.602..394.080 rows=911 loops=3) Sort Key: v1, v2 Sort Method: external merge Disk: 8128kB -> Parallel Seq Scan on test (cost=0.00..9572.67 rows=416667 width=16) (actual time=0.053..46.238 rows=33 loops=3) Planning time: 0.118 ms Execution time: 414.763 ms (10 rows) postgres=# set enable_gathermerge = off; SET postgres=# explain analyze select * from test order by v1, v2 limit 10; QUERY PLAN - Limit (cost=37015.64..37015.67 rows=10 width=16) (actual time=268.859..268.861 rows=10 loops=1) -> Sort (cost=37015.64..39515.64 rows=100 width=16) (actual time=268.858..268.859 rows=10 loops=1) Sort Key: v1, v2 Sort Method: top-N heapsort Memory: 25kB -> Seq Scan on test (cost=0.00..15406.00 rows=100 width=16) (actual time=0.085..107.841 rows=100 loops=1) Planning time: 0.163 ms Execution time: 268.897 ms (7 rows) Looking at the EXPLAIN ANALYZE output of both plans, it's clear that Gather Merge takes longer because it uses an external merge sort on disk, whereas the other plan performs a top-N heapsort. The normal sort path can treat the LIMIT as a bound, but for Gather Merge that's not possible. In the code, Gather Merge considers the limit for the sort in create_ordered_paths(), which is wrong: there, GM should not consider the limit while costing the sort node. The attached patch fixes the bug. On Wed, Mar 22, 2017 at 12:05 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > Thanks for reporting, I am looking into this. 
> > On Wed, Mar 22, 2017 at 11:51 AM, Mithun Cy <mithun...@enterprisedb.com> > wrote: > >> Adding more rows to table make gather merge execution time very slow >> when compared to non-parallel plan we get after disabling gather >> merge. >> >> create table test as (select id, (random()*1)::int as v1, random() as >> v2 from generate_series(1,1) id); >> >> postgres=# set max_parallel_workers_per_gather = default; >> SET >> postgres=# explain analyze select * from test order by v1, v2 limit 10; >> >> QUERY PLAN >> >> >> >> Limit (cost=1858610.53..1858611.70 rows=10 width=16) (actual >> time=31103.880..31103.885 rows=10 loops=1) >>-> Gather Merge (cost=1858610.53..11581520.05 rows=8406 >> width=16) (actual time=31103.878..31103.882 rows=10 loops=1) >> Workers Planned: 2 >> Workers Launched: 2 >> -> Sort (cost=1857610.50..1961777.26 rows=41666703 >> width=16) (actual time=30560.865..30561.046 rows=911 loops=3) >>Sort Key: v1, v2 >>Sort Method: external merge Disk: 841584kB >>-> Parallel Seq Scan on test (cost=0.00..957208.03 >> rows=41666703 width=16) (actual time=0.050..2330.275 rows= >> loops=3) >> Planning time: 0.292 ms >> Execution time: 31502.896 ms >> (10 rows) >> >> postgres=# set max_parallel_workers_per_gather = 0; >> SET >> postgres=# explain analyze select * from test order by v1, v2 limit 10; >> QUERY >> PLAN >> >> >> Limit (cost=3701507.83..3701507.85 rows=10 width=16) (actual >> time=13231.264..13231.266 rows=10 loops=1) >>-> Sort (cost=3701507.83..3951508.05 rows=10088 width=16) >> (actual time=13231.261..13231.262 rows=10 loops=1) >> Sort Key: v1, v2 >> Sort Method: top-N heapsort Memory: 25kB >> -> Seq Scan on test (cost=0.00..1540541.88 rows=10088 >> width=16) (actual time=0.045..6759.363 rows=1 loops=1) >> Pl
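The memory difference seen above (25kB top-N heapsort vs. an external merge on disk) comes from bounded sorting: when a LIMIT bounds the sort, only the best N rows seen so far need to be retained. A rough stand-alone illustration of the idea (not tuplesort.c's actual implementation, which uses a heap rather than insertion into a small array):

```c
#include <assert.h>

/*
 * Keep only the n smallest values seen so far, in ascending order, in a
 * bounded buffer.  A value worse than the current worst is discarded
 * immediately, which is why a bounded (top-N) sort needs so little memory.
 */
static void
topn_add(int *top, int *count, int n, int value)
{
    int pos;

    if (*count == n && value >= top[n - 1])
        return;                 /* worse than the current worst: discard */
    if (*count < n)
        (*count)++;
    /* Shift larger entries right and insert; the old worst falls off. */
    for (pos = *count - 1; pos > 0 && top[pos - 1] > value; pos--)
        top[pos] = top[pos - 1];
    top[pos] = value;
}
```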
Re: [HACKERS] Possible regression with gather merge.
64 rows=100 width=16) (actual > > time=211.581..211.582 rows=10 loops=1) > > Sort Key: v1, v2 > > Sort Method: top-N heapsort Memory: 25kB > > -> Seq Scan on test (cost=0.00..15406.00 rows=100 > > width=16) (actual time=0.085..107.522 rows=100 loops=1) > > Planning time: 0.093 ms > > Execution time: 211.608 ms > > (7 rows) > > > > > > > > -- > > Thanks and Regards > > Mithun C Y > > EnterpriseDB: http://www.enterprisedb.com > > > > -- > Thanks and Regards > Mithun C Y > EnterpriseDB: http://www.enterprisedb.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Rushabh Lathia
Re: [HACKERS] Adding support for Default partition in partitioning
first >> partition. The only thing a default range partition would do is let >> you do is have a single partition cover all of the ranges that would >> otherwise be unassigned, which might not even be something we want. >> >> I don't know how complete the patch is, but the specification seems >> clear enough. If you have partitions for 1, 3, and 5, you get >> partition constraints of (a = 1), (a = 3), and (a = 5). If you add a >> default partition, you get a constraint of (a != 1 and a != 3 and a != >> 5). If you then add a partition for 7, the default partition's >> constraint becomes (a != 1 and a != 3 and a != 5 and a != 7). The >> partition must be revalidated at that point for conflicting rows, >> which we can either try to move to the new partition, or just error >> out if there are any, depending on what we decide we want to do. I >> don't think any of that requires any special handling for either >> pg_dump or pg_upgrade; it all just falls out of getting the >> partitioning constraints correct and consistently enforcing them, just >> as for any other partition. >> >> -- >> Robert Haas >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company >> > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia
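The constraint-maintenance rule described in the quoted text can be modeled directly: a row belongs in the default partition exactly when it matches none of the explicitly listed values, so each newly added partition tightens the default partition's constraint. A toy sketch of that rule (illustrative only, not the actual partitioning code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Implied default-partition constraint for a list-partitioned table with
 * partitions for listed[0..nlisted-1]: (a != v0 AND a != v1 AND ...),
 * i.e. "matches no explicit partition".
 */
static bool
default_partition_accepts(int a, const int listed[], int nlisted)
{
    int i;

    for (i = 0; i < nlisted; i++)
        if (a == listed[i])
            return false;
    return true;
}
```

Adding a partition for a new value (say 7) shrinks the set the default accepts, which is why the default partition must be revalidated for conflicting rows at that point.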
Re: [HACKERS] wait events for disk I/O
On Sat, Mar 18, 2017 at 5:15 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Fri, Mar 17, 2017 at 10:01 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > I tried to cover all the suggestion in the attached latest patch. > > Committed. I reworded the documentation entries, renamed a few of the > wait events to make things more consistent, put all three lists in > rigorous alphabetical order, and I fixed a couple of places where an > error return from a system call could lead to returning without > clearing the wait event. > > Thanks Robert. > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- Rushabh Lathia
Re: [HACKERS] wait events for disk I/O
Thanks, Robert, for the review. On Thu, Mar 16, 2017 at 8:05 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Thu, Mar 16, 2017 at 8:28 AM, Rahila Syed <rahilasye...@gmail.com> > wrote: > > Thank you for the updated patch. > > > > I have applied and tested it on latest sources and the patch looks good > to > > me. > > The documentation puts the new wait events in a pretty random order. > I think they should be alphabetized, like we do with the IPC events. > Done. > I also suggest we change the naming scheme so that the kind of thing > being operated on is first and this is followed by the operation name. > This will let us keep related entries next to each other after > alphabetizing. So with that principle in mind: > > Yes, the naming scheme above is clearer than the one I chose. - instead of ReadDataBlock etc. I propose DataFileRead, DataFileWrite, > DataFileSync, DataFileExtend, DataFileFlush, DataFilePrefetch, > DataFileTruncate. using file instead of block avoids singular/plural > confusion. > - instead of RelationSync and RelationImmedSync I proposed > DataFileSync and DataFileImmediateSync; these are md.c operations like > the previous set, so why name it differently? > Yes, you are right, DataFileSync and DataFileImmediateSync make more sense. 
> - instead of WriteRewriteDataBlock and SyncRewriteDataBlock and > TruncateLogicalMappingRewrite, which aren't consistent with each other > even though they are related, I propose LogicalRewriteWrite, > LogicalRewriteSync, and LogicalRewriteTruncate, which are also closer > to the names of the functions that contain those wait points > - for ReadBuffile and WriteBuffile seem OK, I propose BufFileRead and > BufFileWrite, again reversing the order and also tweaking the > capitalization > - in keeping with our new policy of referring to xlog as wal in user > visible interfaces, I propose WALRead, WALCopyRead, WALWrite, > WALInitWrite, WALCopyWrite, WALBootstrapWrite, WALInitSync, > WALBootstrapSync, WALSyncMethodAssign > - for control file ops, ControlFileRead, ControlFileWrite, > ControlFileWriteUpdate, ControlFileSync, ControlFileSyncUpdate > - ReadApplyLogicalMapping and friends seem to have to do with the > reorderbuffer code, so maybe ReorderBufferRead etc. > - there seems to be some discrepancy between the documentation and > pgstat_get_wait_io for the snapbuild stuff. maybe SnapBuildWrite, > SnapBuildSync, SnapBuildRead. > - SLRURead, SLRUWrite, etc. > - TimelineHistoryRead, etc. > - the syslogger changes should be dropped, since the syslogger is not > and should not be connected to shared memory > Ok, removed. > - the replslot terminology seems like a case of odd capitalization and > overeager abbreviation. why not ReplicationSlotRead, > ReplicationSlotWrite, etc? similarly RelationMapRead, > RelationMapWrite, etc? > - CopyFileRead, CopyFileWrite > - LockFileCreateRead, etc. > - AddToDataDirLockFileRead is a little long and incomprehensible; > maybe LockFileUpdateRead etc. > How about LockFileAddToDataDirRead? Even though it's a little long, it gives a clear understanding of what's going on. > - DSMWriteZeroBytes, maybe? > > DSMFillZeroWrite? Basically, I want to keep the file IO operation at the end of the event name. 
> Of course the constants should be renamed to match. > I tried to cover all the suggestion in the attached latest patch. Thanks, Rushabh Lathia www.EnterpriseDB.com diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 9eaf43a..acf310e 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -716,6 +716,12 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser point. + + + IO: The server process is waiting for a IO to complete. + wait_event will identify the specific wait point. + + @@ -1272,6 +1278,271 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser RecoveryApplyDelay Waiting to apply WAL at recovery because it is delayed. + + IO + BufFileRead + Waiting during buffer read operation. + + + BufFileWrite + Waiting during buffer write operation. + + + CheckpointLogicalRewriteSync + Waiting to sync logical mapping during a checkpoint for logical rewrite mappings. + + + ControlFileRead + Waiting to read the control file. + + + ControlFileSync + Waiting to sync the control file. + + + ControlFileSyncUpdate + Waiting to sync the control file during update control file. +
Re: [HACKERS] wait events for disk I/O
Thanks, Rahila, for reviewing this patch. On Tue, Mar 14, 2017 at 8:13 PM, Rahila Syed <rahilasye...@gmail.com> wrote: > Hello, > > I applied and tested this patch on latest sources and it works fine. > > Following are some comments, > > >+ /* Wait event for SNRU */ > >+ WAIT_EVENT_READ_SLRU_PAGE, > Typo in the comment. > > Fixed. > >FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, > WAIT_EVENT_FLUSH_DATA_BLOCK); > This call is inside mdwriteback() which can flush more than one block so > should WAIT_EVENT _FLUSH_DATA_BLOCK > be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS? > > Changed to WAIT_EVENT_FLUSH_DATA_BLOCKS. > Should calls to write() in following functions be tracked too? > qtext_store() - This is related to pg_stat_statements > > I am not quite sure about this, as it is for pg_stat_statements. Apart from the place you found, there are many other fwrite() calls in pg_stat_statements; I intentionally haven't added an event here because these are very small stat writes, and it doesn't look like we should add events for those calls. > dsm_impl_mmap() - This is in relation to creating dsm segments. > > Added a new event here. That particular write call is zero-filling the DSM file. > write_auto_conf_file()- This is called when updated configuration > parameters are > written to a temp file. > > write_auto_conf_file() gets called during ALTER SYSTEM. Here the write happens only when someone explicitly runs ALTER SYSTEM. That is an administrator action, so it doesn't seem necessary to add a separate wait event for it. PFA the latest patch with the other fixes. 
> > On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > >> >> >> On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmh...@gmail.com> >> wrote: >> >>> On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapil...@gmail.com> >>> wrote: >>> > On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmh...@gmail.com> >>> wrote: >>> >> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapil...@gmail.com> >>> wrote: >>> >>> Sure, if you think both Writes and Reads at OS level can have some >>> >>> chance of blocking in obscure cases, then we should add a wait event >>> >>> for them. >>> >> >>> >> I think writes have a chance of blocking in cases even in cases that >>> >> are not very obscure at all. >>> > >>> > Point taken for writes, but I think in general we should have some >>> > criteria based on which we can decide whether to have a wait event for >>> > a particular call. It should not happen that we have tons of wait >>> > events and out of which, only a few are helpful in most of the cases >>> > in real-world scenarios. >>> >>> Well, the problem is that if you pick and choose which wait events to >>> add based on what you think will be common, you're actually kind of >>> hosing yourself. Because now when something uncommon happens, suddenly >>> you don't get any wait event data and you can't tell what's happening. >>> I think the number of new wait events added by Rushabh's patch is >>> wholly reasonable. Yeah, some of those are going to be a lot more >>> common than others, but so what? We add wait events so that we can >>> find out what's going on. I don't want to sometimes know when a >>> backend is blocked on an I/O. I want to ALWAYS know. >>> >>> >> Yes, I agree with Robert. Knowing what we want and what we don't >> want is difficult to judge. Something which we might think its not useful >> information, and later of end up into situation where we re-think about >> adding those missing stuff is not good. 
Having more information about >> the system, specially for monitoring purpose is always good. >> >> I am attaching another version of the patch, as I found stupid mistake >> in the earlier version of patch, where I missed to initialize initial >> value to >> WaitEventIO enum. Also earlier version was not getting cleanly apply on >> the current version of sources. >> >> >> >> -- >> Rushabh Lathia >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company >> >> >> >> -- >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-hackers >> >> > Rega
Re: [HACKERS] Gather Merge
On Mon, Mar 13, 2017 at 10:56 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Fri, Mar 10, 2017 at 7:59 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Error coming from create_gather_merge_plan() from below condition: > > > > if (memcmp(sortColIdx, gm_plan->sortColIdx, > >numsortkeys * sizeof(AttrNumber)) != 0) > > elog(ERROR, "GatherMerge child's targetlist doesn't match > > GatherMerge"); > > > > Above condition checks the sort column numbers explicitly, to ensure the > > tlists > > really do match up. This been copied from the create_merge_append_plan(). > > Now > > this make sense as for MergeAppend as its not projection capable plan > (see > > is_projection_capable_plan()). But like Gather, GatherMerge is the > > projection > > capable and its target list can be different from the subplan, so I don't > > think this > > condition make sense for the GatherMerge. > > > > Here is the some one the debugging info, through which I was able to > reach > > to this conclusion: > > > > - targetlist for the GatherMerge and subpath is same during > > create_gather_merge_path(). > > > > - targetlist for the GatherMerge is getting changed into > > create_gather_merge_plan(). 
> > > > postgres=# explain (analyze, verbose) select t2.j from t1 JOIN t2 ON > > t1.k=t2.k where t1.i=1 order by t1.j desc; > > NOTICE: path parthtarget: {PATHTARGET :exprs ({VAR :varno 2 :varattno 2 > > :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 > :varoattno > > 2 :location 34} {VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 > > :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 2 :location 90}) > > :sortgrouprefs 0 1 :cost.startup 0.00 :cost.per_tuple 0.00 :width 8} > > > > NOTICE: subpath parthtarget: {PATHTARGET :exprs ({VAR :varno 1 > :varattno 2 > > :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 > :varoattno > > 2 :location 90} {VAR :varno 2 :varattno 2 :vartype 23 :vartypmod -1 > > :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 2 :location 34}) > > :cost.startup 0.00 :cost.per_tuple 0.00 :width 8} > > > > - Attached memory watch point and found that target list for GatherMerge > is > > getting > > changed into groupping_planner() -> apply_projection_to_path(). > > > > PFA patch to fix this issue. > > I don't think this fix is correct, partly on theoretical grounds and > partly because I managed to make it crash. The problem is that > prepare_sort_for_pathkeys() actually alters the output tlist of Gather > Merge, which is inconsistent with the idea that Gather Merge can do > projection. It's going to produce whatever > prepare_sort_for_pathkeys() says it's going to produce, which may or > may not be what was there before. Using Kuntal's dump file and your > patch: > > set min_parallel_table_scan_size = 0; > set parallel_setup_cost = 0; > set parallel_tuple_cost = 0; > set enable_sort = false; > set enable_bitmapscan = false; > alter table t1 alter column j type text; > select t2.i from t1 join t2 on t1.k=t2.k where t1.i=1 order by t1.j desc; > > Crash. 
Abbreviated stack trace: > > #0 pg_detoast_datum_packed (datum=0xbc) at fmgr.c:2176 > #1 0x00010160e707 in varstrfastcmp_locale (x=188, y=819, > ssup=0x7fe1ea06a568) at varlena.c:1997 > #2 0x0001013efc73 in ApplySortComparator [inlined] () at > /Users/rhaas/pgsql/src/include/utils/sortsupport.h:225 > #3 0x0001013efc73 in heap_compare_slots (a= unavailable, due to optimizations>, b= due to optimizations>, arg=0x7fe1ea04e590) at sortsupport.h:681 > #4 0x0001014057b2 in sift_down (heap=0x7fe1ea079458, > node_off=) at > binaryheap.c:274 > #5 0x00010140573a in binaryheap_build (heap=0x7fe1ea079458) at > binaryheap.c:131 > #6 0x0001013ef771 in gather_merge_getnext [inlined] () at > /Users/rhaas/pgsql/src/backend/executor/nodeGatherMerge.c:421 > #7 0x0001013ef771 in ExecGatherMerge (node=0x7fe1ea04e590) at > nodeGatherMerge.c:240 > > Obviously, this is happening because we're trying to apply a > comparator for text to a value of type integer. I propose the > attached, slightly more involved fix, which rips out the first call to > prepare_sort_from_pathkeys() altogether. > > Thanks, Robert, for the patch and the explanation. I studied the patch and it looks right to me. I performed manual testing: I ran the scripts I created during the gather merge patch work and also ran the TPC-H queries to make sure everything works correctly. I haven't found any regression with those changes. > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
The error comes from the following condition in create_gather_merge_plan():

    if (memcmp(sortColIdx, gm_plan->sortColIdx,
               numsortkeys * sizeof(AttrNumber)) != 0)
        elog(ERROR, "GatherMerge child's targetlist doesn't match GatherMerge");

This condition checks the sort column numbers explicitly to ensure the tlists really do match up; it was copied from create_merge_append_plan(). That makes sense for MergeAppend, which is not a projection-capable plan (see is_projection_capable_plan()). But like Gather, GatherMerge is projection-capable and its target list can differ from the subplan's, so I don't think this condition makes sense for GatherMerge.

Here is some of the debugging info through which I reached this conclusion:

- The targetlists of the GatherMerge and its subpath are the same during create_gather_merge_path().
- The targetlist of the GatherMerge is changed in create_gather_merge_plan().

postgres=# explain (analyze, verbose) select t2.j from t1 JOIN t2 ON t1.k=t2.k where t1.i=1 order by t1.j desc;

NOTICE: path parthtarget: {PATHTARGET :exprs ({VAR :varno 2 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 2 :location 34} {VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 2 :location 90}) :sortgrouprefs 0 1 :cost.startup 0.00 :cost.per_tuple 0.00 :width 8}

NOTICE: subpath parthtarget: {PATHTARGET :exprs ({VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 2 :location 90} {VAR :varno 2 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 2 :location 34}) :cost.startup 0.00 :cost.per_tuple 0.00 :width 8}

- Attached a memory watchpoint and found that the target list for GatherMerge is changed in grouping_planner() -> apply_projection_to_path().

PFA a patch to fix this issue.
On Fri, Mar 10, 2017 at 4:12 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Fri, Mar 10, 2017 at 4:09 PM, Kuntal Ghosh <kuntalghosh.2...@gmail.com> > wrote: > >> On Fri, Mar 10, 2017 at 3:04 PM, Rushabh Lathia >> <rushabh.lat...@gmail.com> wrote: >> > >> > >> > On Fri, Mar 10, 2017 at 2:42 PM, Andreas Joseph Krogh < >> andr...@visena.com> >> > wrote: >> >> >> >> På fredag 10. mars 2017 kl. 10:09:22, skrev Rushabh Lathia >> >> <rushabh.lat...@gmail.com>: >> >> >> >> >> >> >> >> On Fri, Mar 10, 2017 at 2:33 PM, Andreas Joseph Krogh < >> andr...@visena.com> >> >> wrote: >> >>> >> >>> [...] >> >>> The execution-plan seems (unsurprisingly) to depend on >> data-distribution, >> >>> so is there a way I can force a GatherMerge? >> >> >> >> >> >> Not directly. GatherMerge cost is mainly depend on parallel_setup_cost, >> >> parallel_tuple_cost and cpu_operator_cost. May be you can force this >> >> by setting this cost low enough. Or another way to force is by disable >> the >> >> other plans. >> >> >> >> What plan you are getting now? You not seeing the below error ? >> >> >> >> ERROR: GatherMerge child's targetlist doesn't match GatherMerge >> >> >> >> >> >> I'm seeing the same error, it's just that for reproducing it I'd rather >> >> not copy my whole dataset. >> > >> > >> > Can you share me a schema information, I will try to reproduce at my >> side? >> I'm able to reproduce the error. I've attached the dump file and a >> script to reproduce it. >> >> The following query executes successfully. 
>> postgres=# explain select t1.* from t1 JOIN t2 ON t1.k=t2.k where >> t1.i=1 order by t1.j desc; >> QUERY PLAN >> >> --- >> Gather Merge (cost=0.58..243.02 rows=943 width=12) >>Workers Planned: 1 >>-> Nested Loop (cost=0.57..235.94 rows=555 width=12) >> -> Parallel Index Scan Backward using idx_t1_i_j on t1 >> (cost=0.29..14.33 rows=603 width=12) >>Index Cond: (i = 1) >> -> Index Only Scan using idx_t2_k on t2 (cost=0.29..0.34 >> rows=3 width=4) >>Index Cond: (k = t1.k) >> (7 rows) >> >> Whereas, If columns from t2 is projected, it throws the same error. >> postgres=# explain select t2.* from t1 JOIN t2 ON t1.k=t2.k where >> t1.i=1 order by t1.j desc; >> ERROR: GatherMerge child's targetlist doesn't match GatherMerge >> >> > Thanks Kuntal. > > I am able to reproduce the issue with the shared script. I will look into > this now. > > >> >> >> -- >> Thanks & Regards, >> Kuntal Ghosh >> EnterpriseDB: http://www.enterprisedb.com >> > > > > -- > Rushabh Lathia > -- Rushabh Lathia gm_plan_fix.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Gather Merge
On Fri, Mar 10, 2017 at 4:09 PM, Kuntal Ghosh <kuntalghosh.2...@gmail.com> wrote: > On Fri, Mar 10, 2017 at 3:04 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > > > > > On Fri, Mar 10, 2017 at 2:42 PM, Andreas Joseph Krogh < > andr...@visena.com> > > wrote: > >> > >> På fredag 10. mars 2017 kl. 10:09:22, skrev Rushabh Lathia > >> <rushabh.lat...@gmail.com>: > >> > >> > >> > >> On Fri, Mar 10, 2017 at 2:33 PM, Andreas Joseph Krogh < > andr...@visena.com> > >> wrote: > >>> > >>> [...] > >>> The execution-plan seems (unsurprisingly) to depend on > data-distribution, > >>> so is there a way I can force a GatherMerge? > >> > >> > >> Not directly. GatherMerge cost is mainly depend on parallel_setup_cost, > >> parallel_tuple_cost and cpu_operator_cost. May be you can force this > >> by setting this cost low enough. Or another way to force is by disable > the > >> other plans. > >> > >> What plan you are getting now? You not seeing the below error ? > >> > >> ERROR: GatherMerge child's targetlist doesn't match GatherMerge > >> > >> > >> I'm seeing the same error, it's just that for reproducing it I'd rather > >> not copy my whole dataset. > > > > > > Can you share me a schema information, I will try to reproduce at my > side? > I'm able to reproduce the error. I've attached the dump file and a > script to reproduce it. > > The following query executes successfully. > postgres=# explain select t1.* from t1 JOIN t2 ON t1.k=t2.k where > t1.i=1 order by t1.j desc; > QUERY PLAN > > --- > Gather Merge (cost=0.58..243.02 rows=943 width=12) >Workers Planned: 1 >-> Nested Loop (cost=0.57..235.94 rows=555 width=12) > -> Parallel Index Scan Backward using idx_t1_i_j on t1 > (cost=0.29..14.33 rows=603 width=12) >Index Cond: (i = 1) > -> Index Only Scan using idx_t2_k on t2 (cost=0.29..0.34 > rows=3 width=4) >Index Cond: (k = t1.k) > (7 rows) > > Whereas, If columns from t2 is projected, it throws the same error. 
> postgres=# explain select t2.* from t1 JOIN t2 ON t1.k=t2.k where > t1.i=1 order by t1.j desc; > ERROR: GatherMerge child's targetlist doesn't match GatherMerge > > Thanks Kuntal. I am able to reproduce the issue with the shared script. I will look into this now. > > > -- > Thanks & Regards, > Kuntal Ghosh > EnterpriseDB: http://www.enterprisedb.com > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
On Fri, Mar 10, 2017 at 2:42 PM, Andreas Joseph Krogh <andr...@visena.com> wrote:

> På fredag 10. mars 2017 kl. 10:09:22, skrev Rushabh Lathia
> <rushabh.lat...@gmail.com>:
>
> On Fri, Mar 10, 2017 at 2:33 PM, Andreas Joseph Krogh <andr...@visena.com>
> wrote:
>>
>> [...]
>> The execution-plan seems (unsurprisingly) to depend on data-distribution,
>> so is there a way I can force a GatherMerge?
>>
>
> Not directly. GatherMerge cost is mainly depend on parallel_setup_cost,
> parallel_tuple_cost and cpu_operator_cost. May be you can force this
> by setting this cost low enough. Or another way to force is by disable the
> other plans.
>
> What plan you are getting now? You not seeing the below error ?
>
> ERROR: GatherMerge child's targetlist doesn't match GatherMerge
>
> I'm seeing the same error, it's just that for reproducing it I'd rather
> not copy my whole dataset.

Can you share the schema information? I will try to reproduce it on my side.

> --
> *Andreas Joseph Krogh*

--
Rushabh Lathia
Re: [HACKERS] Gather Merge
On Fri, Mar 10, 2017 at 2:33 PM, Andreas Joseph Krogh <andr...@visena.com> wrote: > På fredag 10. mars 2017 kl. 09:53:47, skrev Rushabh Lathia < > rushabh.lat...@gmail.com>: > > > > On Fri, Mar 10, 2017 at 1:44 PM, Andreas Joseph Krogh <andr...@visena.com> > wrote: >> >> På torsdag 09. mars 2017 kl. 18:09:45, skrev Robert Haas < >> robertmh...@gmail.com>: >> >> On Thu, Mar 9, 2017 at 11:25 AM, Rushabh Lathia >> <rushabh.lat...@gmail.com> wrote: >> > I don't see this failure with the patch. Even I forced the gather merge >> > in the above query and that just working fine. >> > >> > Attaching patch, with the discussed changes. >> >> Committed. >> >> >> >> I'm still getting (as of 9c2635e26f6f4e34b3b606c0fc79d0e111953a74): >> ERROR: GatherMerge child's targetlist doesn't match GatherMerge >> >> >> from this query: >> >> >> EXPLAIN ANALYSE SELECTem.entity_idFROM origo_email_delivery del >> JOIN origo_email_message em ON (del.message_id = em.entity_id)WHERE 1 = >> 1 AND del.owner_id = 3 AND ( >> del.from_entity_id = 279519 OR >> del.from_entity_id = 3 AND em.entity_id IN ( >> SELECT ea_owner.message_id >> FROM origo_email_address_owner ea_owner >> WHERE ea_owner.recipient_id = 279519 ) >> ) >> ORDER BY del.received_timestamp DESCLIMIT 101 OFFSET 0; >> >> >> Is this known or shall I provide more info/schema etc? >> > > Please provide the reproducible test if possible. > > > The execution-plan seems (unsurprisingly) to depend on data-distribution, > so is there a way I can force a GatherMerge? > Not directly. GatherMerge cost is mainly depend on parallel_setup_cost, parallel_tuple_cost and cpu_operator_cost. May be you can force this by setting this cost low enough. Or another way to force is by disable the other plans. What plan you are getting now? You not seeing the below error ? ERROR: GatherMerge child's targetlist doesn't match GatherMerge > -- > *Andreas Joseph Krogh* > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
On Fri, Mar 10, 2017 at 1:44 PM, Andreas Joseph Krogh <andr...@visena.com> wrote: > På torsdag 09. mars 2017 kl. 18:09:45, skrev Robert Haas < > robertmh...@gmail.com>: > > On Thu, Mar 9, 2017 at 11:25 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > I don't see this failure with the patch. Even I forced the gather merge > > in the above query and that just working fine. > > > > Attaching patch, with the discussed changes. > > Committed. > > > > I'm still getting (as of 9c2635e26f6f4e34b3b606c0fc79d0e111953a74): > ERROR: GatherMerge child's targetlist doesn't match GatherMerge > > > from this query: > > > EXPLAIN ANALYSE SELECTem.entity_idFROM origo_email_delivery del > JOIN origo_email_message em ON (del.message_id = em.entity_id)WHERE 1 = 1 > AND del.owner_id = 3 AND ( > del.from_entity_id = 279519 OR del.from_entity_id > = 3 AND em.entity_id IN ( > SELECT ea_owner.message_id > FROM origo_email_address_owner ea_owner > WHERE ea_owner.recipient_id = 279519 ) > ) > ORDER BY del.received_timestamp DESCLIMIT 101 OFFSET 0; > > > Is this known or shall I provide more info/schema etc? > Please provide the reproducible test if possible. 
> > If I select del.entity_id, it works: > > > EXPLAIN ANALYSE SELECTdel.entity_id > FROM origo_email_delivery del > JOIN origo_email_message em ON (del.message_id = > em.entity_id) > WHERE 1 = 1 AND del.owner_id = 3 > AND ( > del.from_entity_id = 279519 > OR del.from_entity_id = 3 AND em.entity_id IN ( > SELECT > ea_owner.message_id > FROM origo_email_address_owner ea_owner > WHERE ea_owner.recipient_id = 279519 > ) > ) > > ORDER BY del.received_timestamp DESCLIMIT 101 > OFFSET 0; > > > Plan is: > │ Limit (cost=1259.72..15798.21 rows=101 width=16) (actual > time=152.946..153.269 rows=34 loops=1) > > │ > > │ -> Gather Merge (cost=1259.72..3139414.43 rows=21801 width=16) > (actual time=152.945..153.264 rows=34 loops=1) > >│ > │ Workers Planned: 4 > > > │ > │ Workers Launched: 4 > > >│ > │ -> Nested Loop (cost=259.66..3135817.66 rows=5450 width=16) > (actual time=95.295..137.549 rows=7 loops=5) > │ > │ -> Parallel Index Scan Backward using > origo_email_received_idx on origo_email_delivery del (cost=0.42..312163.56 > rows=10883 width=32) (actual time=0.175..121.434 rows=6540 loops=5) > │ > │ Filter: ((owner_id = 3) AND ((from_entity_id = > 279519) OR (from_entity_id = 3))) > > │ > │ Rows Removed by Filter: 170355 > > > │ > │ -> Index Only Scan using origo_email_message_pkey on > origo_email_message em (cost=259.24..259.45 rows=1 width=8) (actual > time=0.002..0.002 rows=0 loops=32699) │ > │ Index Cond: (entity_id = del.message_id) > > > │ > │ Filter: ((del.from_entity_id = 279519) OR > ((del.from_entity_id = 3) AND (hashed SubPlan 1))) > > │ > │ Rows Removed by Filter: 1 > > >│ > │ Heap Fetches: 0 > > >│ > │ SubPlan 1 > > > │ > │ -> Index Scan using > origo_email_address_owner_recipient_id_idx > on origo_email_address_owner ea_owner (cost=0.43..258.64 rows=69 width=8) > (actual time=0.032..0.294 rows=175 loops=5) │ > │ Index Cond: (recipient_id = 279519) > > > │ > │ Planning time: 1.372 ms > > >│ > │ Execution time: 170.859 ms > > > │ > > > > -- > *Andreas Joseph 
Krogh* > > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
On Thu, Mar 9, 2017 at 9:42 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Thu, Mar 9, 2017 at 8:21 AM, Rushabh Lathia <rushabh.lat...@gmail.com>
> wrote:
> > Thanks Robert for committing this.
> >
> > My colleague Neha Sharma found one regression with the patch. I was about
> > to send this mail and noticed that you committed the patch.
>
> Oops. Bad timing.
>
> > postgres=# explain select aid from pgbench_accounts where aid % 25 = 0
> > group by aid;
> > ERROR: ORDER/GROUP BY expression not found in targetlist
>
> I think your fix for this looks right, although I would write it this way:
>
> -gm_plan->plan.targetlist = subplan->targetlist;
> +gm_plan->plan.targetlist = build_path_tlist(root, &best_path->path);
>
> The second part of your fix looks wrong. I think you want this:
>
> create_gather_merge_path(root,
>  grouped_rel,
>  subpath,
> - NULL,
> + partial_grouping_target,
>  root->group_pathkeys,
>  NULL,
>  &total_groups);
>
> That will match the create_gather_path case.
>

Right, I made that change, performed the test with the fix, and I don't see any regression now.

> This test case is still failing for me even with those fixes:
>
> rhaas=# select aid+1 from pgbench_accounts group by aid+1;
> ERROR: could not find pathkey item to sort
>

I don't see this failure with the patch. I even forced the gather merge in the above query and it works just fine.

Attaching the patch with the discussed changes.

Thanks,
Rushabh Lathia

fix_target_gm_v2.patch
Description: application/download
Re: [HACKERS] Gather Merge
On Thu, Mar 9, 2017 at 6:19 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Wed, Mar 8, 2017 at 11:59 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Here is another version of patch with the suggested changes. > > Committed. > > Thanks Robert for committing this. My colleague Neha Sharma found one regression with the patch. I was about to send this mail and noticed that you committed the patch. Here is the small example: Test setup: 1) ./db/bin/pgbench postgres -i -F 100 -s 20 2) update pgbench_accounts set filler = 'foo' where aid%10 = 0; 3) vacuum analyze pgbench_accounts 4) postgres=# set max_parallel_workers_per_gather = 4; SET postgres=# explain select aid from pgbench_accounts where aid % 25= 0 group by aid; ERROR: ORDER/GROUP BY expression not found in targetlist postgres=# set enable_indexonlyscan = off; SET postgres=# explain select aid from pgbench_accounts where aid % 25= 0 group by aid; QUERY PLAN Group (cost=44708.21..45936.81 rows=10001 width=4) Group Key: aid -> Gather Merge (cost=44708.21..45911.81 rows=1 width=0) Workers Planned: 4 -> Group (cost=43708.15..43720.65 rows=2500 width=4) Group Key: aid -> Sort (cost=43708.15..43714.40 rows=2500 width=4) Sort Key: aid -> Parallel Seq Scan on pgbench_accounts (cost=0.00..43567.06 rows=2500 width=4) Filter: ((aid % 25) = 0) (10 rows) - Index only scan with GM do work - but with ORDER BY clause postgres=# set enable_seqscan = off; SET postgres=# explain select aid from pgbench_accounts where aid % 25= 0 order by aid; QUERY PLAN - Gather Merge (cost=1000.49..113924.61 rows=10001 width=4) Workers Planned: 4 -> Parallel Index Scan using pgbench_accounts_pkey on pgbench_accounts (cost=0.43..111733.33 rows=2500 width=4) Filter: ((aid % 25) = 0) (4 rows) Debugging further I found that problem only when IndexOnlyScan under GM and that to only with grouping. Debugging problem I found that ressortgroupref is not getting set. 
That led me to think it might be because create_gather_merge_plan() builds the tlist with build_path_tlist(). Another problem I found is that create_grouping_paths() passes NULL for the target while calling create_gather_merge_path(). (fix_target_gm.patch)

With those changes the above test runs fine, but it breaks other things, like:

postgres=# explain select distinct(bid) from pgbench_accounts where filler like '%foo%' group by aid;
ERROR: GatherMerge child's targetlist doesn't match GatherMerge

Another thing I noticed is that if we stop adding GatherMerge during generate_gather_paths(), then all the tests work fine. (comment_gm_from_generate_gather.patch)

I will continue looking into this problem. Attaching both patches for reference.

regards,
Rushabh Lathia
www.EnterpriseDB.com

comment_gm_from_generate_gather.patch
Description: application/download

fix_target_gm.patch
Description: application/download
Re: [HACKERS] wait events for disk I/O
Thanks Rajkumar for performing tests on this patch. Yes, I also noticed similar results in my testing. Additionally sometime I also noticed ReadSLRUPage event on my system. I also run the reindex database command and I notices below IO events. SyncImmedRelation, WriteDataBlock WriteBuffile, WriteXLog, ReadDataBlock On Wed, Mar 8, 2017 at 6:41 PM, Rajkumar Raghuwanshi < rajkumar.raghuwan...@enterprisedb.com> wrote: > On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > >> I am attaching another version of the patch, as I found stupid mistake >> in the earlier version of patch, where I missed to initialize initial >> value to >> WaitEventIO enum. Also earlier version was not getting cleanly apply on >> the current version of sources. >> > > I have applied attached patch, set shared_buffers to 128kB and ran > pgbench, I am able to see below distinct IO events. > > --with ./pgbench -i -s 500 postgres > application_namewait_event_typewait_event query > pgbench IO ExtendDataBlock > copy pgbench_account > pgbench IO WriteXLog > copy pgbench_account > pgbench IO WriteDataBlock > copy pgbench_account > pgbench IO ReadDataBlock > vacuum analyze pgben > pgbench IO ReadBuffile >alter table pgbench_ > > --with ./pgbench -T 600 postgres (readwrite) > application_namewait_event_typewait_event query > pgbench IO ReadDataBlock > UPDATE pgbench_accou > pgbench IO WriteDataBlock > UPDATE pgbench_telle > IO SyncDataBlock > > pgbench IO SyncDataBlock > UPDATE pgbench_telle > IO SyncDataBlock > autovacuum: VACUUM A > pgbench IO WriteXLog > END; > pgbench IO ExtendDataBlock > copy pgbench_account > > --with ./pgbench -T 600 -S postgres (select only) > application_namewait_event_typewait_event query > pgbench IO ReadDataBlock > SELECT abalance FROM > > Attached excel with all IO event values. > > Thanks & Regards, > Rajkumar Raghuwanshi > QMG, EnterpriseDB Corporation > > > -- Rushabh Lathia www.EnterpriseDB.com
Re: [HACKERS] Gather Merge
On Thu, Mar 9, 2017 at 8:40 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Mon, Feb 20, 2017 at 1:35 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Thanks Amit for raising this point. I was not at all aware of > mark/restore. > > I tried to come up with the case, but haven't found such case. > > > > For now here is the patch with comment update. > > Sorry for the delay in getting back to this. This seems to need minor > rebasing again. > > Done. > A few other points: > > ExecEndGatherMerge needs to be patched along the lines of > acf555bc53acb589b5a2827e65d655fa8c9adee0. > > Done. > @@ -2290,7 +2376,6 @@ create_limit_plan(PlannerInfo *root, LimitPath > *best_path, int flags) > return plan; > } > > - > /*** > ** > * > * BASE-RELATION SCAN METHODS > > Unnecessary. > Fixed. Here is another version of patch with the suggested changes. Thanks, Rushabh Lathia EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company gather-merge-v9.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] wait events for disk I/O
On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
> > On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmh...@gmail.com>
> > wrote:
> >> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapil...@gmail.com>
> >> wrote:
> >>> Sure, if you think both Writes and Reads at OS level can have some
> >>> chance of blocking in obscure cases, then we should add a wait event
> >>> for them.
> >>
> >> I think writes have a chance of blocking in cases even in cases that
> >> are not very obscure at all.
> >
> > Point taken for writes, but I think in general we should have some
> > criteria based on which we can decide whether to have a wait event for
> > a particular call. It should not happen that we have tons of wait
> > events and out of which, only a few are helpful in most of the cases
> > in real-world scenarios.
>
> Well, the problem is that if you pick and choose which wait events to
> add based on what you think will be common, you're actually kind of
> hosing yourself. Because now when something uncommon happens, suddenly
> you don't get any wait event data and you can't tell what's happening.
> I think the number of new wait events added by Rushabh's patch is
> wholly reasonable. Yeah, some of those are going to be a lot more
> common than others, but so what? We add wait events so that we can
> find out what's going on. I don't want to sometimes know when a
> backend is blocked on an I/O. I want to ALWAYS know.
>

Yes, I agree with Robert. It is difficult to judge what we want and what we don't want to track. It would not be good to decide that some information is not useful and later end up in a situation where we have to rethink adding the missing events. Having more information about the system, especially for monitoring purposes, is always good.
I am attaching another version of the patch, as I found a silly mistake in the earlier version, where I missed initializing the initial value of the WaitEventIO enum. The earlier version also did not apply cleanly on the current sources.

--
Rushabh Lathia
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

wait_event_for_disk_IO_v2.patch
Description: application/download
Re: [HACKERS] CREATE/ALTER ROLE PASSWORD ('value' USING 'method')
On Wed, Mar 8, 2017 at 9:59 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Michael Paquier <michael.paqu...@gmail.com> writes:
> > here is a separate thread dedicated to the following extension for
> > CREATE/ALTER ROLE: PASSWORD ('value' USING 'method').
>
> The parentheses seem weird ... do we really need those?

+1

I had a quick glance at the patch and it looks great. One quick comment:

+| PASSWORD '(' Sconst USING Sconst ')'
+{
+$$ = makeDefElem("methodPassword",
+ (Node *) list_make2(makeString($3),
+ makeString($5)),
+ @1);
+}

methodPassword looks a bit weird; can we change it to passwordMethod or pwdEncryptMethod?

> regards, tom lane
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Rushabh Lathia
Re: [HACKERS] Print correct startup cost for the group aggregate.
Thanks Ashutosh & Robert for the explanation. On Mon, Mar 6, 2017 at 10:02 AM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > On Sat, Mar 4, 2017 at 2:50 PM, Robert Haas <robertmh...@gmail.com> wrote: > > On Thu, Mar 2, 2017 at 6:48 PM, Ashutosh Bapat > > <ashutosh.ba...@enterprisedb.com> wrote: > >> On Thu, Mar 2, 2017 at 6:06 PM, Rushabh Lathia < > rushabh.lat...@gmail.com> wrote: > >>> While reading through the cost_agg() I found that startup cost for the > >>> group aggregate is not correctly assigned. Due to this explain plan is > >>> not printing the correct startup cost. > >>> > >>> Without patch: > >>> > >>> postgres=# explain select aid, sum(abalance) from pgbench_accounts > where > >>> filler like '%foo%' group by aid; > >>> QUERY PLAN > >>> > - > >>> GroupAggregate (cost=81634.33..85102.04 rows=198155 width=12) > >>>Group Key: aid > >>>-> Sort (cost=81634.33..82129.72 rows=198155 width=8) > >>> Sort Key: aid > >>> -> Seq Scan on pgbench_accounts (cost=0.00..61487.89 > rows=198155 > >>> width=8) > >>>Filter: (filler ~~ '%foo%'::text) > >>> (6 rows) > >>> > >>> With patch: > >>> > >>> postgres=# explain select aid, sum(abalance) from pgbench_accounts > where > >>> filler like '%foo%' group by aid; > >>> QUERY PLAN > >>> > - > >>> GroupAggregate (cost=82129.72..85102.04 rows=198155 width=12) > >>>Group Key: aid > >>>-> Sort (cost=81634.33..82129.72 rows=198155 width=8) > >>> Sort Key: aid > >>> -> Seq Scan on pgbench_accounts (cost=0.00..61487.89 > rows=198155 > >>> width=8) > >>>Filter: (filler ~~ '%foo%'::text) > >>> (6 rows) > >>> > >> > >> The reason the reason why startup_cost = input_startup_cost and not > >> input_total_cost for aggregation by sorting is we don't need the whole > >> input before the Group/Agg plan can produce the first row. But I think > >> setting startup_cost = input_startup_cost is also not exactly correct. 
> >> Before the plan can produce one row, it has to transit through all the > >> rows belonging to the group to which the first row belongs. On an > >> average it has to scan (total number of rows)/(number of groups) > >> before producing the first aggregated row. startup_cost will be > >> input_startup_cost + cost to scan (total number of rows)/(number of > >> groups) rows + cost of transiting over those many rows. Total cost = > >> startup_cost + cost of scanning and transiting through the remaining > >> number of input rows. > > > > While that idea has some merit, I think it's inconsistent with current > > practice. cost_seqscan(), for example, doesn't include the cost of > > reading the first page in the startup cost, even though that certainly > > must be done before returning the first row. > > OTOH, while costing for merge join, initial_cost_mergejoin() seems to > consider the cost of scanning outer and inner relations upto the first > matching tuple on both sides in the startup_cost. Different policies > at different places. > > > I think there have been > > previous discussions of switching over to the practice for which you > > are advocating here, but my impression (without researching) is that > > the current practice is more like what Rushabh did. > > > > I am not sure Rushabh's approach is correct. Here's the excerpt from my > mail. > > >> The reason the reason why startup_cost = input_startup_cost and not > >> input_total_cost for aggregation by sorting is we don't need the whole > >> input before the Group/Agg plan can produce the first row. > > > With Rushabh's approach the startup cost of aggregation by sorting > would be way higher that it's right now. Secondly, it would match that > of hash aggregation and thus forcing hash aggregation to be chosen in > almost all the cases. > I understood you reasoning of why startup_cost = input_startup_cost and not input_total_cost for aggregation by sorting. 
But what I didn't understand is why a higher startup cost for aggregation by sorting would force hash aggregation to be chosen. I am not clear about this part.

--
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>

Regards.
Rushabh Lathia
www.EnterpriseDB.com
Re: [HACKERS] wait events for disk I/O
On Sat, Mar 4, 2017 at 7:53 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Mon, Feb 20, 2017 at 4:04 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > > > My colleague Rahila reported compilation issue with > > the patch. Issue was only coming with we do the clean > > build on the branch. > > > > Fixed the same into latest version of patch. > > > > Few assorted comments: > > 1. > + > + WriteBuffile > + Waiting during > buffile read operation. > + > > Operation name and definition are not matching. > > Will fix this. > > 2. > +FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info) > { > #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED) > int returnCode; > @@ -1527,8 +1527,10 @@ FilePrefetch(File file, off_t offset, int amount) > if (returnCode < 0) > return returnCode; > > + pgstat_report_wait_start(wait_event_info); > returnCode = posix_fadvise(VfdCache[file].fd, offset, amount, > POSIX_FADV_WILLNEED); > + pgstat_report_wait_end(); > > AFAIK, this call is non-blocking and will just initiate a read, so do > you think we should record wait event for such a call. > > Yes, prefecth call is a non-blocking and will just initiate a read. But this info about the prefetch into wait events will give more info about the system. 3. > - written = FileWrite(src->vfd, waldata_start, len); > + written = FileWrite(src->vfd, waldata_start, len, > + WAIT_EVENT_WRITE_REWRITE_DATA_BLOCK); > if (written != len) > ereport(ERROR, > (errcode_for_file_access(), > @@ -957,7 +960,7 @@ logical_end_heap_rewrite(RewriteState state) > hash_seq_init(_status, state->rs_logical_mappings); > while ((src = (RewriteMappingFile *) hash_seq_search(_status)) != > NULL) > { > - if (FileSync(src->vfd) != 0) > + if (FileSync(src->vfd, WAIT_EVENT_SYNC_REWRITE_DATA_BLOCK) != 0) > > > Do we want to consider recording wait event for both write and sync? 
> It seems to me OS level writes are relatively cheap and sync calls are
> expensive, so shouldn't we just log for sync calls? I could see the
> similar usage at multiple places in the patch.
>

Yes, I initially thought of adding a wait event only for sync, but then decided to record wait events for both write and sync. I understand that OS-level writes are cheap, but they still have some cost attached to them. I also thought the monitoring tools being developed on top of these wait events would capture more useful data if we collect as much information as we can. Or maybe not; I am open to other opinions and suggestions.

--
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>

--
Rushabh Lathia
[HACKERS] Print correct startup cost for the group aggregate.
Hi, While reading through the cost_agg() I found that startup cost for the group aggregate is not correctly assigned. Due to this explain plan is not printing the correct startup cost. Without patch: postgres=# explain select aid, sum(abalance) from pgbench_accounts where filler like '%foo%' group by aid; QUERY PLAN - GroupAggregate (cost=81634.33..85102.04 rows=198155 width=12) Group Key: aid -> Sort (cost=81634.33..82129.72 rows=198155 width=8) Sort Key: aid -> Seq Scan on pgbench_accounts (cost=0.00..61487.89 rows=198155 width=8) Filter: (filler ~~ '%foo%'::text) (6 rows) With patch: postgres=# explain select aid, sum(abalance) from pgbench_accounts where filler like '%foo%' group by aid; QUERY PLAN - GroupAggregate (cost=82129.72..85102.04 rows=198155 width=12) Group Key: aid -> Sort (cost=81634.33..82129.72 rows=198155 width=8) Sort Key: aid -> Seq Scan on pgbench_accounts (cost=0.00..61487.89 rows=198155 width=8) Filter: (filler ~~ '%foo%'::text) (6 rows) PFA patch to correct the same. Regards, Rushabh Lathia www.EnterpriseDB.com fix_startup_cost_cost_agg.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [POC] hash partitioning
on h (cost=0.00..0.00 rows=1 width=4) > Filter: ((abs(hashint4(i)) % 3) = 2) > -> Seq Scan on h3 (cost=0.00..61.00 rows=13 width=4) > Filter: ((abs(hashint4(i)) % 3) = 2) > (5 rows) > > Best regards, > Yugo Nagata > > -- > Yugo Nagata <nag...@sraoss.co.jp> > > Regards, Rushabh Lathia
Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw
On Wed, Feb 22, 2017 at 12:15 PM, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp > wrote: > On 2017/02/21 19:31, Rushabh Lathia wrote: > >> On Mon, Feb 20, 2017 at 1:41 PM, Etsuro Fujita >> <fujita.ets...@lab.ntt.co.jp <mailto:fujita.ets...@lab.ntt.co.jp>> wrote: >> > > On 2017/02/13 18:24, Rushabh Lathia wrote: >> Here are few comments: >> >> 1) >> >> @@ -211,6 +211,12 @@ typedef struct PgFdwDirectModifyState >> PGresult *result;/* result for query */ >> intnum_tuples;/* # of result tuples */ >> intnext_tuple;/* index of next one to >> return */ >> +RelationresultRel;/* relcache entry for the >> target table */ >> >> >> Why we need resultRel? Can't we directly use dmstate->rel ? >> >> >> The reason why we need that is because in get_returning_data, we >> pass dmstate->rel to make_tuple_from_result_row, which requires that >> dmstate->rel be NULL when the scan tuple is described by >> fdw_scan_tlist. So in that case we set dmstate->rel to NULL and >> have dmstate->resultRel that is the relcache entry for the target >> relation in postgresBeginDirectModify. >> > > Thanks for the explanation. We might do something here by using >> fdw_scan_tlist or changing the assumption of >> make_tuple_from_result_row(), and that way we can avoid two similar >> variable pointer in the PgFdwDirectModifyState. >> > > I agree that the two similar variables are annoying, to some extent, but > ISTM that is not that bad because that makes the handling of dmstate->rel > consistent with that of PgFdwScanState's rel. As you know, > PgFdwDirectModifyState is defined in a similar way as PgFdwScanState, in > which the rel is a relcache entry for the foreign table for a simple > foreign table scan and NULL for a foreign join scan (see comments for the > definition of PgFdwScanState). > > I am okay with currently also, but it adding a note somewhere about this >> would be great. Also let keep this point open for the committer, if >> committer feel this is good then lets go ahead with this. 
>> > > Agreed. > > Thanks. Marked this as Ready for Committer. > Here are few other cosmetic changes: >> >> 1) >> >> + * >> + * 'target_rel' is either zero or the rangetable index of a target >> relation. >> + * In the latter case this construncts FROM clause of UPDATE or USING >> clause >> + * of DELETE by simply ignoring the target relation while deparsing the >> given >> >> Spell correction: - construncts >> >> 2) >> >> +/* >> + * If either input is the target relation, get all the >> joinclauses. >> + * Otherwise extract conditions mentioning the target relation >> from >> + * the joinclauses. >> + */ >> >> >> space between joinclauses needed. >> >> 3) >> >> +/* >> + * If UPDATE/DELETE on a join, create a RETURINING list used in >> the >> + * remote query. >> + */ >> +if (fscan->scan.scanrelid == 0) >> +returningList = make_explicit_returning_list(resultRelation, >> + rel, >> + returningList); >> >> >> Spell correction: RETURINING >> > > Good catch! > > I did above changes in the attached patch. Please have a look once and >> > > I'm fine with that. Thanks for the patch! > > then I feel like this patch is ready for committer. >> > > Thanks for reviewing! > > Best regards, > Etsuro Fujita > > > -- Rushabh Lathia
Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw
On Mon, Feb 20, 2017 at 1:41 PM, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp> wrote: > On 2017/02/13 18:24, Rushabh Lathia wrote: > >> I started reviewing the patch again. Patch applied cleanly on latest >> source >> as well as regression pass through with the patch. I also performed >> few manual testing and haven't found any regression. Patch look >> much cleaner the earlier version, and don't have any major concern as >> such. >> > > Thanks for the review! > > Here are few comments: >> >> 1) >> >> @@ -211,6 +211,12 @@ typedef struct PgFdwDirectModifyState >> PGresult *result;/* result for query */ >> intnum_tuples;/* # of result tuples */ >> intnext_tuple;/* index of next one to return */ >> +RelationresultRel;/* relcache entry for the target table >> */ >> > > Why we need resultRel? Can't we directly use dmstate->rel ? >> > > The reason why we need that is because in get_returning_data, we pass > dmstate->rel to make_tuple_from_result_row, which requires that > dmstate->rel be NULL when the scan tuple is described by fdw_scan_tlist. > So in that case we set dmstate->rel to NULL and have dmstate->resultRel > that is the relcache entry for the target relation in > postgresBeginDirectModify. > > Thanks for the explanation. We might do something here by using fdw_scan_tlist or changing the assumption of make_tuple_from_result_row(), and that way we can avoid two similar variable pointer in the PgFdwDirectModifyState. I am okay with currently also, but it adding a note somewhere about this would be great. Also let keep this point open for the committer, if committer feel this is good then lets go ahead with this. Here are few other cosmetic changes: 1) + * + * 'target_rel' is either zero or the rangetable index of a target relation. 
+ * In the latter case this construncts FROM clause of UPDATE or USING clause
+ * of DELETE by simply ignoring the target relation while deparsing the given

Spell correction: construncts

2)

+/*
+ * If either input is the target relation, get all the joinclauses.
+ * Otherwise extract conditions mentioning the target relation from
+ * the joinclauses.
+ */

A space is needed in "joinclauses".

3)

+/*
+ * If UPDATE/DELETE on a join, create a RETURINING list used in the
+ * remote query.
+ */
+if (fscan->scan.scanrelid == 0)
+    returningList = make_explicit_returning_list(resultRelation,
+                                                 rel,
+                                                 returningList);

Spell correction: RETURINING

I did the above changes in the attached patch. Please have a look; once that is done, I feel this patch is ready for committer.

Thanks,
Rushabh Lathia

postgres-fdw-more-update-pushdown-v3_cosmetic_changes.patch
Description: application/download
Re: [HACKERS] wait events for disk I/O
My colleague Rahila reported a compilation issue with the patch. The issue only appeared when doing a clean build on the branch. Fixed the same in the latest version of the patch.

Thanks,

On Tue, Jan 31, 2017 at 11:09 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Tue, Jan 31, 2017 at 8:54 AM, Michael Paquier < > michael.paqu...@gmail.com> wrote: > >> On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia >> <rushabh.lat...@gmail.com> wrote: >> > Attached is the patch, which extend the existing wait event >> infrastructure >> > to implement the wait events for the disk I/O. Basically >> pg_stat_activity's >> > wait event information to show data about disk I/O as well as IPC >> primitives. >> > >> > Implementation details: >> > >> > - Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO >> > - Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch, >> > FileWriteback, FileSync, and FileTruncate. Set this wait event just >> before >> > performing the file system operation and clear it just after. >> > - Pass down an appropriate wait event from caller of any of those >> > functions. >> > - Also set and clear a wait event around standalone calls to read(), >> > write(), fsync() in other parts of the system. >> > - Added documentation for all newly added wait event. >> >> Looks neat, those are unlikely to overlap with other wait events. >> > > Thanks. > > >> >> > Open issue: >> > - Might missed few standalone calls to read(), write(), etc which need >> > to pass the wait_event_info. >> >> It may be an idea to use LD_PRELOAD with custom versions of read(), >> write() and fsync(), and look at the paths where no flags are set in >> MyProc->wait_event_info, and log information when that happens. >> >> > Yes, may be I will try this. > > >> > Thanks to my colleague Robert Haas for his help in design. >> > Please let me know your thought, and thanks for reading. >> >> Did you consider a wrapper of the type pg_read_event() or >> pg_write_event(), etc.?
>> > > I thought on that, but eventually stick to this approach as it looks > more neat and uniform with other wait event implementation. > > > >> -- >> Michael >> > > > > Thanks, > Rushabh Lathia > www.EnterpriseDB.com > -- Rushabh Lathia wait_event_for_disk_IO_v1.patch Description: application/download
Re: [HACKERS] Gather Merge
Thanks Amit for raising this point. I was not at all aware of mark/restore. I tried to come up with the case, but haven't found such case. For now here is the patch with comment update. Thanks, On Sun, Feb 19, 2017 at 7:30 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Sun, Feb 19, 2017 at 2:22 PM, Robert Haas <robertmh...@gmail.com> > wrote: > > On Sat, Feb 18, 2017 at 6:43 PM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > >> I think there is a value in supporting mark/restore position for any > >> node which produces sorted results, however, if you don't want to > >> support it, then I think we should update above comment in the code to > >> note this fact. Also, you might want to check other places to see if > >> any similar comment updates are required in case you don't want to > >> support mark/restore position for GatherMerge. > > > > I don't think it makes sense to put mark/support restore into Gather > > Merge. Maybe somebody else will come up with a test that shows > > differently, but ISTM that with something like Sort it makes a ton of > > sense to support mark/restore because the Sort node itself can do it > > much more cheaply than would be possible with a separate Materialize > > node. If you added a separate Materialize node, the Sort node would > > be trying to throw away the exact same data that the Materialize node > > was trying to accumulate, which is silly. > > > > I am not sure but there might be some cases where adding Materialize > node on top of Sort node could make sense as we try to cost it as well > and add it if it turns out to be cheap. > > > Here with Gather Merge > > there is no such thing happening; mark/restore support would require > > totally new code - probably we would end up shoving the same code that > > already exists in Materialize into Gather Merge as well. 
> > > > I have tried to evaluate this based on plans reported by Rushabh and > didn't find any case where GatherMerge can be beneficial by supporting > mark/restore in the plans where it is being used (it is not being used > on the right side of merge join). Now, it is quite possible that it > might be beneficial at higher scale factors or may be planner has > ignored such a plan. However, as we are not clear what kind of > benefits we can get via mark/restore support for GatherMerge, it > doesn't make much sense to take the trouble of implementing it. > > > > > A comment update is probably a good idea, though. > > > > Agreed. > > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > -- Rushabh Lathia gather-merge-v8-update-comment.patch Description: application/download
Re: [HACKERS] Gather Merge
On Fri, Feb 17, 2017 at 4:47 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Fri, Feb 17, 2017 at 3:59 PM, Thomas Munro > <thomas.mu...@enterprisedb.com> wrote: > > On Thu, Feb 2, 2017 at 2:32 AM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > >> Please find attached latest patch. > > > > The latest patch still applies (with some fuzz), builds and the > > regression tests pass. > > > > I see that Robert made a number of changes and posted a v6 along with > > some numbers which he described as lacklustre, but then fixed a row > > estimate problem which was discouraging parallel joins (commit > > 0c2070ce). Rushabh posted a v7 and test results which look good. > > > > Are you suggesting that commit 0c2070ce has helped to improve > performance if so, I don't think that has been proved? I guess the > numbers are different either due to different m/c or some other > settings like scale factor or work_mem. >

I don't really think 0c2070ce is the exact reason. I ran the TPC-H tests with the same settings Robert was using, and I haven't noticed any regression in the runs. For the last runs I also uploaded the TPC-H outputs for the individual queries, for reference.

> > As > > far as I can see there are no outstanding issues or unhandled review > > feedback. I've had a fresh read through of the latest version and > > have no further comments myself. > > > > I've set this to ready-for-committer now. If I've misunderstood and > > there are still unresolved issues from that earlier email exchange or > > someone else wants to post a review or objection, then of course > > please feel free to set it back. > > > > BTW There is no regression test supplied. I see that commit 5262f7a4 > > adding parallel index scans put simple explain output in > > "select_parallel" to demonstrate the new kind of plan being created; > > It has added both explain statement test and a test to exercise > parallel index scan code. > > Thanks for the reference.
I added similar tests for GM in the latest uploaded patch.

> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Rushabh Lathia
Re: [HACKERS] Gather Merge
On Fri, Feb 17, 2017 at 3:59 PM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote: > On Thu, Feb 2, 2017 at 2:32 AM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > > Please find attached latest patch. > > The latest patch still applies (with some fuzz), builds and the > regression tests pass. >

Attached is the latest patch, which applies cleanly on the latest source.

> I see that Robert made a number of changes and posted a v6 along with > some numbers which he described as lacklustre, but then fixed a row > estimate problem which was discouraging parallel joins (commit > 0c2070ce). Rushabh posted a v7 and test results which look good. As > far as I can see there are no outstanding issues or unhandled review > feedback. I've had a fresh read through of the latest version and > have no further comments myself. > > I've set this to ready-for-committer now. If I've misunderstood and > there are still unresolved issues from that earlier email exchange or > someone else wants to post a review or objection, then of course > please feel free to set it back. >

Thanks Thomas.

> BTW There is no regression test supplied. I see that commit 5262f7a4 > adding parallel index scans put simple explain output in > "select_parallel" to demonstrate the new kind of plan being created; > perhaps this patch should do the same? I know it wouldn't really test > much of the code but it's at least something. Perhaps you could post > a new version with that? >

Added the regression test to the new version of the patch. PFA the latest patch.

--
Rushabh Lathia
www.EnterpriseDB.com

gather-merge-v8.patch
Description: application/download
Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw
Sorry for the delay in the review.

I started reviewing the patch again. The patch applies cleanly on the latest source, and the regression tests pass with it. I also performed some manual testing and haven't found any regression. The patch looks much cleaner than the earlier version, and I don't have any major concerns as such.

Here are a few comments:

1)

@@ -211,6 +211,12 @@ typedef struct PgFdwDirectModifyState
     PGresult   *result;        /* result for query */
     int         num_tuples;    /* # of result tuples */
     int         next_tuple;    /* index of next one to return */
+    Relation    resultRel;     /* relcache entry for the target table */

Why do we need resultRel? Can't we directly use dmstate->rel?

2)

In the patch the scanrelid condition is sometimes written as fscan->scan.scanrelid == 0, whereas elsewhere it is written as fsplan->scan.scanrelid > 0. In fact, it is used differently within the same function, for example in postgresBeginDirectModify. Can we make this consistent?

3)

+ * If UPDATE/DELETE on a join, create a RETURINING list used in the remote
+ * query.
+ */
+if (fscan->scan.scanrelid == 0)
+    returningList = make_explicit_returning_list(resultRelation, rel,
+                                                 returningList);
+

The above block can be moved inside the if (plan->returningLists) condition above it, like this:

/*
 * Extract the relevant RETURNING list if any.
 */
if (plan->returningLists)
{
    returningList = (List *) list_nth(plan->returningLists, subplan_index);

    /*
     * If UPDATE/DELETE on a join, create a RETURINING list used in the
     * remote query.
     */
    if (fscan->scan.scanrelid == 0)
        returningList = make_explicit_returning_list(resultRelation, rel,
                                                     returningList);
}

I am still doing some more testing with the patch; if I find anything apart from this I will raise it in another mail.

Thanks,

On Wed, Feb 1, 2017 at 11:50 AM, Michael Paquier <michael.paqu...@gmail.com> wrote: > On Wed, Jan 25, 2017 at 7:20 PM, Etsuro Fujita > <fujita.ets...@lab.ntt.co.jp> wrote: > > Attached is the new version of the patch.
I also addressed other > comments > > from you: moved rewriting the fdw_scan_tlist to postgres_fdw.c, > > added/revised comments, and added regression tests for the case where a > > pushed down UPDATE/DELETE on a join has RETURNING. > > > > My apologies for having been late to work on this. > > Moved to CF 2017-03. > -- > Michael > -- Rushabh Lathia
[HACKERS] pg_restore is broken on 9.2 version.
Hi All,

Commit c59a1a89035674c6efacc596d652528cebba37ec does not allow a non-positive number of jobs. On 9.2 the number of jobs gets assigned to opts->number_of_jobs, and if the user does not specify any value with -j, the default value is always 0, which leads to the "invalid number of parallel jobs" error.

if (opts->number_of_jobs <= 0)
{
    fprintf(stderr, _("%s: invalid number of parallel jobs\n"), progname);
    exit(1);
}

Please find attached a patch to initialize the default number of jobs to 1.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

pg_restore_fix.patch
Description: binary/octet-stream
Re: [HACKERS] Gather Merge
Due to the recent commit below, the patch does not apply cleanly on the master branch.

commit d002f16c6ec38f76d1ee97367ba6af3000d441d0
Author: Tom Lane <t...@sss.pgh.pa.us>
Date:   Mon Jan 30 17:15:42 2017 -0500

    Add a regression test script dedicated to exercising system views.

Please find attached the latest patch.

On Wed, Feb 1, 2017 at 5:55 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > I am sorry for the delay, here is the latest re-based patch. > > my colleague Neha Sharma, reported one regression with the patch, where > explain output for the Sort node under GatherMerge was always showing > cost as zero: > > explain analyze select '' AS "xxx" from pgbench_accounts where filler > like '%foo%' order by aid; > QUERY PLAN > > Gather Merge (cost=47169.81..70839.91 rows=197688 width=36) (actual > time=406.297..653.572 rows=20 loops=1) > Workers Planned: 4 > Workers Launched: 4 > -> Sort (*cost=0.00..0.00 rows=0 width=0*) (actual > time=368.945..391.124 rows=4 loops=5) > Sort Key: aid > Sort Method: quicksort Memory: 3423kB > -> Parallel Seq Scan on pgbench_accounts (cost=0.00..42316.60 > rows=49422 width=36) (actual time=296.612..338.873 rows=4 loops=5) > Filter: (filler ~~ '%foo%'::text) > Rows Removed by Filter: 36 > Planning time: 0.184 ms > Execution time: 734.963 ms > > This patch also fix that issue. > > On Wed, Feb 1, 2017 at 11:27 AM, Michael Paquier < > michael.paqu...@gmail.com> wrote: > >> On Mon, Jan 23, 2017 at 6:51 PM, Kuntal Ghosh >> <kuntalghosh.2...@gmail.com> wrote: >> > On Wed, Jan 18, 2017 at 11:31 AM, Rushabh Lathia >> > <rushabh.lat...@gmail.com> wrote: >> >> >> > The patch needs a rebase after the commit 69f4b9c85f168ae006929eec4. >> >> Is an update going to be provided? I have moved this patch to next CF >> with "waiting on author" as status.
>> -- >> Michael >> > > > > -- > Rushabh Lathia > -- Rushabh Lathia gather-merge-v7.patch Description: binary/octet-stream
Re: [HACKERS] Gather Merge
I am sorry for the delay; here is the latest re-based patch.

My colleague Neha Sharma reported one regression with the patch, where the explain output for the Sort node under GatherMerge always showed the cost as zero:

explain analyze select '' AS "xxx" from pgbench_accounts where filler like '%foo%' order by aid;
                                       QUERY PLAN

 Gather Merge  (cost=47169.81..70839.91 rows=197688 width=36) (actual time=406.297..653.572 rows=20 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Sort  (*cost=0.00..0.00 rows=0 width=0*) (actual time=368.945..391.124 rows=4 loops=5)
         Sort Key: aid
         Sort Method: quicksort  Memory: 3423kB
         ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..42316.60 rows=49422 width=36) (actual time=296.612..338.873 rows=4 loops=5)
               Filter: (filler ~~ '%foo%'::text)
               Rows Removed by Filter: 36
 Planning time: 0.184 ms
 Execution time: 734.963 ms

This patch also fixes that issue.

On Wed, Feb 1, 2017 at 11:27 AM, Michael Paquier <michael.paqu...@gmail.com> wrote: > On Mon, Jan 23, 2017 at 6:51 PM, Kuntal Ghosh > <kuntalghosh.2...@gmail.com> wrote: > > On Wed, Jan 18, 2017 at 11:31 AM, Rushabh Lathia > > <rushabh.lat...@gmail.com> wrote: > >> > > The patch needs a rebase after the commit 69f4b9c85f168ae006929eec4. > > Is an update going to be provided? I have moved this patch to next CF > with "waiting on author" as status. > -- > Michael >

--
Rushabh Lathia

gather-merge-v7.patch
Description: binary/octet-stream
Re: [HACKERS] wait events for disk I/O
On Tue, Jan 31, 2017 at 8:54 AM, Michael Paquier <michael.paqu...@gmail.com> wrote: > On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Attached is the patch, which extend the existing wait event > infrastructure > > to implement the wait events for the disk I/O. Basically > pg_stat_activity's > > wait event information to show data about disk I/O as well as IPC > primitives. > > > > Implementation details: > > > > - Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO > > - Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch, > > FileWriteback, FileSync, and FileTruncate. Set this wait event just > before > > performing the file system operation and clear it just after. > > - Pass down an appropriate wait event from caller of any of those > > functions. > > - Also set and clear a wait event around standalone calls to read(), > > write(), fsync() in other parts of the system. > > - Added documentation for all newly added wait event. > > Looks neat, those are unlikely to overlap with other wait events. > Thanks. > > > Open issue: > > - Might missed few standalone calls to read(), write(), etc which need > > to pass the wait_event_info. > > It may be an idea to use LD_PRELOAD with custom versions of read(), > write() and fsync(), and look at the paths where no flags are set in > MyProc->wait_event_info, and log information when that happens. > > Yes, may be I will try this. > > Thanks to my colleague Robert Haas for his help in design. > > Please let me know your thought, and thanks for reading. > > Did you consider a wrapper of the type pg_read_event() or > pg_write_event(), etc.? > I thought on that, but eventually stick to this approach as it looks more neat and uniform with other wait event implementation. > -- > Michael > Thanks, Rushabh Lathia www.EnterpriseDB.com
[HACKERS] wait events for disk I/O
Hi All,

Attached is a patch which extends the existing wait event infrastructure to implement wait events for disk I/O. Basically, pg_stat_activity's wait event information now shows data about disk I/O as well as IPC primitives.

Implementation details:

- Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO.
- Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch, FileWriteback, FileSync, and FileTruncate. Set this wait event just before performing the file system operation and clear it just after.
- Pass down an appropriate wait event from callers of any of those functions.
- Also set and clear a wait event around standalone calls to read(), write(), fsync() in other parts of the system.
- Added documentation for all newly added wait events.

Open issue:
- Might have missed a few standalone calls to read(), write(), etc. which need to pass the wait_event_info.

Thanks to my colleague Robert Haas for his help in the design. Please let me know your thoughts, and thanks for reading.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

wait_event_disk_IO.patch
Description: binary/octet-stream
Re: [HACKERS] Query fails when SRFs are part of FROM clause (Commit id: 69f4b9c85f)
On Sat, Jan 28, 2017 at 3:43 AM, David G. Johnston < david.g.johns...@gmail.com> wrote: > On Fri, Jan 27, 2017 at 5:28 AM, Rushabh Lathia <rushabh.lat...@gmail.com> > wrote: > >> Consider the below test; >> >> CREATE TABLE tab ( a int primary key); >> >> SELECT * >> FROM pg_constraint pc, >> CAST(CASE WHEN pc.contype IN ('f','u','p') THEN generate_series(1, >> array_upper(pc.conkey, 1)) ELSE NULL END AS int) AS position; >> >> Above query is failing with "set-valued function called in context that >> cannot >> accept a set". But if I remove the CASE from the query then it working >> just good. >> >> Like: >> >> SELECT * >> FROM pg_constraint pc, >> CAST(generate_series(1, array_upper(pc.conkey, 1)) AS int) AS position; >> >> This started failing with 69f4b9c85f168ae006929eec44fc44d569e846b9. It >> seems >> check_srf_call_placement() sets the hasTargetSRFs flag and but when the >> SRFs >> at the rtable ofcourse this flag doesn't get set. It seems like missing >> something >> their, but I might be completely wrong as not quire aware of this area. >> >> > I'm a bit surprised that your query actually works...and without delving > into source code its hard to explain why it should/shouldn't or whether the > recent SRF work was intended to impact it. > > In any case the more idiomatic way of writing your query these days (since > 9.4 came out) is: > > SELECT * > FROM pg_constraint pc > LEFT JOIN LATERAL generate_series(1, case when contype in ('f','p','u') > then array_upper(pc.conkey, 1) else 0 end) gs ON true; > > generate_series is smart enough to return an empty set (instead of > erroring out) when provided with (1,0) as arguments. >

Thanks for providing the work-around query; I also understood your point. At the same time, the reason for raising this issue is that this was working before commit 69f4b9c85f168ae006929eec44fc44d569e846b9 and now it throws an error. So the question is whether that is intended, or whether the query started failing because of a bug introduced with the commit.
The issue is still reproducible when the query is rewritten with LEFT JOIN LATERAL and I keep using the CASE statement:

SELECT *
FROM pg_constraint pc
LEFT JOIN LATERAL CAST((CASE WHEN pc.contype IN ('f','u','p') THEN
generate_series(1, array_upper(pc.conkey, 1)) ELSE NULL END) AS int) gs ON true;
ERROR:  set-valued function called in context that cannot accept a set

> David J.

--
Rushabh Lathia
www.EnterpriseDB.com
[HACKERS] Query fails when SRFs are part of FROM clause (Commit id: 69f4b9c85f)
Consider the below test:

CREATE TABLE tab ( a int primary key);

SELECT *
FROM pg_constraint pc,
CAST(CASE WHEN pc.contype IN ('f','u','p') THEN generate_series(1, array_upper(pc.conkey, 1)) ELSE NULL END AS int) AS position;

The above query fails with "set-valued function called in context that cannot accept a set", but if I remove the CASE from the query then it works just fine:

SELECT *
FROM pg_constraint pc,
CAST(generate_series(1, array_upper(pc.conkey, 1)) AS int) AS position;

This started failing with 69f4b9c85f168ae006929eec44fc44d569e846b9. It seems check_srf_call_placement() sets the hasTargetSRFs flag, but when the SRF is in the rtable this flag of course doesn't get set. It seems like something is missing there, but I might be completely wrong as I am not quite aware of this area.

regards,
Rushabh Lathia
www.EnterpriseDB.com
Re: [HACKERS] Gather Merge
On Tue, Jan 17, 2017 at 6:44 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Tue, Jan 17, 2017 at 5:19 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > > > > > > > Here is the benchmark number which I got with the latest (v6) patch: > > > > - max_worker_processes = DEFAULT (8) > > - max_parallel_workers_per_gather = 4 > > - Cold cache environment is ensured. With every query execution - server > is > > stopped and also OS caches were dropped. > > - The reported values of execution time (in ms) is median of 3 > executions. > > - power2 machine with 512GB of RAM > > - With default postgres.conf > > > > You haven't mentioned scale factor used in these tests. > Oops sorry. Those results are for scale factor 10. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
On Tue, Jan 17, 2017 at 5:19 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Fri, Jan 13, 2017 at 10:52 AM, Rushabh Lathia <rushabh.lat...@gmail.com > > wrote: > >> >> >> On Thu, Jan 12, 2017 at 8:50 AM, Robert Haas <robertmh...@gmail.com> >> wrote: >> >>> On Sun, Dec 4, 2016 at 7:36 PM, Haribabu Kommi <kommi.harib...@gmail.com> >>> wrote: >>> > On Thu, Nov 24, 2016 at 11:12 PM, Rushabh Lathia < >>> rushabh.lat...@gmail.com> >>> > wrote: >>> >> PFA latest patch with fix as well as few cosmetic changes. >>> > >>> > Moved to next CF with "needs review" status. >>> >>> I spent quite a bit of time on this patch over the last couple of >>> days. I was hoping to commit it, but I think it's not quite ready for >>> that yet and I hit a few other issues along the way. Meanwhile, >>> here's an updated version with the following changes: >>> >>> * Adjusted cost_gather_merge because we don't need to worry about less >>> than 1 worker. >>> * Don't charge double maintenance cost of the heap per 34ca0905. This >>> was pointed out previous and Rushabh said it was fixed, but it wasn't >>> fixed in v5. >>> * cost_gather_merge claimed to charge a slightly higher IPC cost >>> because we have to block, but didn't. Fix it so it does. >>> * Move several hunks to more appropriate places in the file, near >>> related code or in a more logical position relative to surrounding >>> code. >>> * Fixed copyright dates for the new files. One said 2015, one said 2016. >>> * Removed unnecessary code from create_gather_merge_plan that tried to >>> handle an empty list of pathkeys (shouldn't happen). >>> * Make create_gather_merge_plan more consistent with >>> create_merge_append_plan. Remove make_gather_merge for the same >>> reason. >>> * Changed generate_gather_paths to generate gather merge paths. 
In >>> the previous coding, only the upper planner nodes ever tried to >>> generate gather merge nodes, but that seems unnecessarily limiting, >>> since it could be useful to generate a gathered path with pathkeys at >>> any point in the tree where we'd generate a gathered path with no >>> pathkeys. >>> * Rewrote generate_ordered_paths() logic to consider only the one >>> potentially-useful path not now covered by the new code in >>> generate_gather_paths(). >>> * Reverted changes in generate_distinct_paths(). I think we should >>> add something here but the existing logic definitely isn't right >>> considering the change to generate_gather_paths(). >>> * Assorted cosmetic cleanup in nodeGatherMerge.c. >>> * Documented the new GUC enable_gathermerge. >>> * Improved comments. Dropped one that seemed unnecessary. >>> * Fixed parts of the patch to be more pgindent-clean. >>> >>> >> Thanks Robert for hacking into this. >> >> >>> Testing this against the TPC-H queries at 10GB with >>> max_parallel_workers_per_gather = 4, seq_page_cost = 0.1, >>> random_page_cost = 0.1, work_mem = 64MB initially produced somewhat >>> demoralizing results. Only Q17, Q4, and Q8 picked Gather Merge, and >>> of those only Q17 got faster. Investigating this led to me realizing >>> that join costing for parallel joins is all messed up: see >>> https://www.postgresql.org/message-id/CA+TgmoYt2pyk2CTyvYCtF >>> ySXN=jsorgh8_mjttlowu5qkjo...@mail.gmail.com >>> >>> With that patch applied, in my testing, Gather Merge got picked for >>> Q3, Q4, Q5, Q6, Q7, Q8, Q10, and Q17, but a lot of those queries get a >>> little slower instead of a little faster. 
Here are the timings -- >>> these are with EXPLAIN ANALYZE, so take them with a grain of salt -- >>> first number is without Gather Merge, second is with Gather Merge: >>> >>> Q3 16943.938 ms -> 18645.957 ms >>> Q4 3155.350 ms -> 4179.431 ms >>> Q5 13611.484 ms -> 13831.946 ms >>> Q6 9264.942 ms -> 8734.899 ms >>> Q7 9759.026 ms -> 10007.307 ms >>> Q8 2473.899 ms -> 2459.225 ms >>> Q10 13814.950 ms -> 12255.618 ms >>> Q17 49552.298 ms -> 46633.632 ms >>> >>> >> This is strange, I will re-run the test again and post the results soon. >> >> > Here is the benchmark number which I got with the latest (v6) patch: > > - max_worker_processes = DEFAULT (8) > - max_parallel_workers_per_gather
Re: [HACKERS] Gather Merge
On Fri, Jan 13, 2017 at 10:52 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Thu, Jan 12, 2017 at 8:50 AM, Robert Haas <robertmh...@gmail.com> > wrote: > >> On Sun, Dec 4, 2016 at 7:36 PM, Haribabu Kommi <kommi.harib...@gmail.com> >> wrote: >> > On Thu, Nov 24, 2016 at 11:12 PM, Rushabh Lathia < >> rushabh.lat...@gmail.com> >> > wrote: >> >> PFA latest patch with fix as well as few cosmetic changes. >> > >> > Moved to next CF with "needs review" status. >> >> I spent quite a bit of time on this patch over the last couple of >> days. I was hoping to commit it, but I think it's not quite ready for >> that yet and I hit a few other issues along the way. Meanwhile, >> here's an updated version with the following changes: >> >> * Adjusted cost_gather_merge because we don't need to worry about less >> than 1 worker. >> * Don't charge double maintenance cost of the heap per 34ca0905. This >> was pointed out previous and Rushabh said it was fixed, but it wasn't >> fixed in v5. >> * cost_gather_merge claimed to charge a slightly higher IPC cost >> because we have to block, but didn't. Fix it so it does. >> * Move several hunks to more appropriate places in the file, near >> related code or in a more logical position relative to surrounding >> code. >> * Fixed copyright dates for the new files. One said 2015, one said 2016. >> * Removed unnecessary code from create_gather_merge_plan that tried to >> handle an empty list of pathkeys (shouldn't happen). >> * Make create_gather_merge_plan more consistent with >> create_merge_append_plan. Remove make_gather_merge for the same >> reason. >> * Changed generate_gather_paths to generate gather merge paths. In >> the previous coding, only the upper planner nodes ever tried to >> generate gather merge nodes, but that seems unnecessarily limiting, >> since it could be useful to generate a gathered path with pathkeys at >> any point in the tree where we'd generate a gathered path with no >> pathkeys. 
>> * Rewrote generate_ordered_paths() logic to consider only the one >> potentially-useful path not now covered by the new code in >> generate_gather_paths(). >> * Reverted changes in generate_distinct_paths(). I think we should >> add something here but the existing logic definitely isn't right >> considering the change to generate_gather_paths(). >> * Assorted cosmetic cleanup in nodeGatherMerge.c. >> * Documented the new GUC enable_gathermerge. >> * Improved comments. Dropped one that seemed unnecessary. >> * Fixed parts of the patch to be more pgindent-clean. >> >> > Thanks Robert for hacking into this. > > >> Testing this against the TPC-H queries at 10GB with >> max_parallel_workers_per_gather = 4, seq_page_cost = 0.1, >> random_page_cost = 0.1, work_mem = 64MB initially produced somewhat >> demoralizing results. Only Q17, Q4, and Q8 picked Gather Merge, and >> of those only Q17 got faster. Investigating this led to me realizing >> that join costing for parallel joins is all messed up: see >> https://www.postgresql.org/message-id/CA+TgmoYt2pyk2CTyvYCtF >> ySXN=jsorgh8_mjttlowu5qkjo...@mail.gmail.com >> >> With that patch applied, in my testing, Gather Merge got picked for >> Q3, Q4, Q5, Q6, Q7, Q8, Q10, and Q17, but a lot of those queries get a >> little slower instead of a little faster. Here are the timings -- >> these are with EXPLAIN ANALYZE, so take them with a grain of salt -- >> first number is without Gather Merge, second is with Gather Merge: >> >> Q3 16943.938 ms -> 18645.957 ms >> Q4 3155.350 ms -> 4179.431 ms >> Q5 13611.484 ms -> 13831.946 ms >> Q6 9264.942 ms -> 8734.899 ms >> Q7 9759.026 ms -> 10007.307 ms >> Q8 2473.899 ms -> 2459.225 ms >> Q10 13814.950 ms -> 12255.618 ms >> Q17 49552.298 ms -> 46633.632 ms >> >> > This is strange, I will re-run the test again and post the results soon. 
> > Here is the benchmark number which I got with the latest (v6) patch: - max_worker_processes = DEFAULT (8) - max_parallel_workers_per_gather = 4 - Cold cache environment is ensured. With every query execution - server is stopped and also OS caches were dropped. - The reported values of execution time (in ms) is median of 3 executions. - power2 machine with 512GB of RAM - With default postgres.conf Timing with v6 patch on REL9_6_STABLE branch (last commit: 8a70d8ae7501141d283e56b31e10c66697c986d5). Query 3: 49888.300 -> 45914.426 Query 4: 8531.939 -> 7790.498 Query 5: 40668.920 -> 38403.658 Q
Re: [HACKERS] Gather Merge
On Thu, Jan 12, 2017 at 8:50 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Sun, Dec 4, 2016 at 7:36 PM, Haribabu Kommi <kommi.harib...@gmail.com> > wrote: > > On Thu, Nov 24, 2016 at 11:12 PM, Rushabh Lathia < > rushabh.lat...@gmail.com> > > wrote: > >> PFA latest patch with fix as well as few cosmetic changes. > > > > Moved to next CF with "needs review" status. > > I spent quite a bit of time on this patch over the last couple of > days. I was hoping to commit it, but I think it's not quite ready for > that yet and I hit a few other issues along the way. Meanwhile, > here's an updated version with the following changes: > > * Adjusted cost_gather_merge because we don't need to worry about less > than 1 worker. > * Don't charge double maintenance cost of the heap per 34ca0905. This > was pointed out previous and Rushabh said it was fixed, but it wasn't > fixed in v5. > * cost_gather_merge claimed to charge a slightly higher IPC cost > because we have to block, but didn't. Fix it so it does. > * Move several hunks to more appropriate places in the file, near > related code or in a more logical position relative to surrounding > code. > * Fixed copyright dates for the new files. One said 2015, one said 2016. > * Removed unnecessary code from create_gather_merge_plan that tried to > handle an empty list of pathkeys (shouldn't happen). > * Make create_gather_merge_plan more consistent with > create_merge_append_plan. Remove make_gather_merge for the same > reason. > * Changed generate_gather_paths to generate gather merge paths. In > the previous coding, only the upper planner nodes ever tried to > generate gather merge nodes, but that seems unnecessarily limiting, > since it could be useful to generate a gathered path with pathkeys at > any point in the tree where we'd generate a gathered path with no > pathkeys. 
> * Rewrote generate_ordered_paths() logic to consider only the one > potentially-useful path not now covered by the new code in > generate_gather_paths(). > * Reverted changes in generate_distinct_paths(). I think we should > add something here but the existing logic definitely isn't right > considering the change to generate_gather_paths(). > * Assorted cosmetic cleanup in nodeGatherMerge.c. > * Documented the new GUC enable_gathermerge. > * Improved comments. Dropped one that seemed unnecessary. > * Fixed parts of the patch to be more pgindent-clean. > > Thanks Robert for hacking into this. > Testing this against the TPC-H queries at 10GB with > max_parallel_workers_per_gather = 4, seq_page_cost = 0.1, > random_page_cost = 0.1, work_mem = 64MB initially produced somewhat > demoralizing results. Only Q17, Q4, and Q8 picked Gather Merge, and > of those only Q17 got faster. Investigating this led to me realizing > that join costing for parallel joins is all messed up: see > https://www.postgresql.org/message-id/CA+TgmoYt2pyk2CTyvYCtFySXN= > jsorgh8_mjttlowu5qkjo...@mail.gmail.com > > With that patch applied, in my testing, Gather Merge got picked for > Q3, Q4, Q5, Q6, Q7, Q8, Q10, and Q17, but a lot of those queries get a > little slower instead of a little faster. Here are the timings -- > these are with EXPLAIN ANALYZE, so take them with a grain of salt -- > first number is without Gather Merge, second is with Gather Merge: > > Q3 16943.938 ms -> 18645.957 ms > Q4 3155.350 ms -> 4179.431 ms > Q5 13611.484 ms -> 13831.946 ms > Q6 9264.942 ms -> 8734.899 ms > Q7 9759.026 ms -> 10007.307 ms > Q8 2473.899 ms -> 2459.225 ms > Q10 13814.950 ms -> 12255.618 ms > Q17 49552.298 ms -> 46633.632 ms > > This is strange, I will re-run the test again and post the results soon. > I haven't really had time to dig into these results yet, so I'm not > sure how "real" these numbers are and how much is run-to-run jitter, > EXPLAIN ANALYZE distortion, or whatever. 
I think this overall concept > is good, because there should be cases where it's substantially > cheaper to preserve the order while gathering tuples from workers than > to re-sort afterwards. But this particular set of results is a bit > lackluster. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
On Wed, Nov 16, 2016 at 3:10 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Mon, Nov 14, 2016 at 3:51 PM, Thomas Munro < > thomas.mu...@enterprisedb.com> wrote: > >> On Sat, Nov 12, 2016 at 1:56 AM, Rushabh Lathia >> <rushabh.lat...@gmail.com> wrote: >> > On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro < >> thomas.mu...@enterprisedb.com> >> > wrote: >> >> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development >> Group >> >> + * Portions Copyright (c) 1994, Regents of the University of >> California >> >> >> >> Shouldn't this say just "(c) 2016, PostgreSQL Global Development >> >> Group"? >> > >> > Fixed. >> >> The year also needs updating to 2016 in nodeGatherMerge.h. >> > > Oops sorry, fixed now. > > >> >> >> + /* Per-tuple heap maintenance cost */ >> >> + run_cost += path->path.rows * comparison_cost * 2.0 * logN; >> >> >> >> Why multiply by two? The comment above this code says "about log2(N) >> >> comparisons to delete the top heap entry and another log2(N) >> >> comparisons to insert its successor". In fact gather_merge_getnext >> >> calls binaryheap_replace_first, which replaces the top element without >> >> any comparisons at all and then performs a sift-down in log2(N) >> >> comparisons to find its new position. There is no per-tuple "delete" >> >> involved. We "replace" the top element with the value it already had, >> >> just to trigger the sift-down, because we know that our comparator >> >> function might have a new opinion of the sort order of this element. >> >> Very clever! The comment and the 2.0 factor in cost_gather_merge seem >> >> to be wrong though -- or am I misreading the code? >> >> >> > See cost_merge_append. >> >> That just got tweaked in commit 34ca0905. >> > > Fixed. > > >> >> > Looking at the plan I realize that this is happening because wrong >> costing >> > for Gather Merge. Here in the plan we can see the row estimated by >> > Gather Merge is wrong. 
This is because earlier patch GM was considering >> > rows = subpath->rows, which is not true as subpath is partial path. So >> > we need to multiple it with number of worker. Attached patch also fixed >> > this issues. I also run the TPC-H benchmark with the patch and results >> > are same as earlier. >> >> In create_grouping_paths: >> + double total_groups = gmpath->rows * >> gmpath->parallel_workers; >> >> This hides a variable of the same name in the enclosing scope. Maybe >> confusing? >> >> In some other places like create_ordered_paths: >> + double total_groups = path->rows * path->parallel_workers; >> >> Though it probably made sense to use this variable name in >> create_grouping_paths, wouldn't total_rows be better here? >> >> > Initially I just copied from the other places. I agree with you that > create_ordered_paths - total_rows make more sense. > > >> It feels weird to be working back to a total row count estimate from >> the partial one by simply multiplying by path->parallel_workers. >> Gather Merge will underestimate the total rows when parallel_workers < >> 4, if using partial row estimates ultimately from cost_seqscan which >> assume some leader contribution. I don't have a better idea though. >> Reversing cost_seqscan's logic certainly doesn't seem right. I don't >> know how to make them agree on the leader's contribution AND give >> principled answers, since there seems to be some kind of cyclic >> dependency in the costing logic (cost_seqscan really needs to be given >> a leader contribution estimate from its superpath which knows whether >> it will allow the leader to pull tuples greedily/fairly or not, but >> that superpath hasn't been created yet; cost_gather_merge needs the >> row count from its subpath). Or maybe I'm just confused. >> >> > Yes, I agree with you. But we can't really do changes into cost_seqscan. 
> Another option I can think of is to just calculate the rows for gather merge > by using the reverse of the formula used in cost_seqscan. So > we can completely remove the rows argument from > create_gather_merge_path() > and then, inside create_gather_merge_path(), calculate the total_rows > using the same formula used in cost_seqscan. This is wor
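Thomas's observation above — that binaryheap_replace_first restores heap order with a single sift-down of about log2(N) comparisons, with no separate delete plus insert — can be illustrated with a minimal sketch. This is a hypothetical stand-alone min-heap, not PostgreSQL's binaryheap.c; the fixed capacity and int payload are simplifications for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal min-heap over ints; PostgreSQL's binaryheap.c is
 * more general but uses the same replace-top-and-sift-down trick. */
typedef struct
{
    int     vals[64];
    size_t  n;
    long    comparisons;    /* counts comparator calls, ~log2(N) per replace */
} MinHeap;

static void
sift_down(MinHeap *h, size_t i)
{
    for (;;)
    {
        size_t  l = 2 * i + 1,
                r = l + 1,
                smallest = i;

        if (l < h->n) { h->comparisons++; if (h->vals[l] < h->vals[smallest]) smallest = l; }
        if (r < h->n) { h->comparisons++; if (h->vals[r] < h->vals[smallest]) smallest = r; }
        if (smallest == i)
            break;
        int tmp = h->vals[i]; h->vals[i] = h->vals[smallest]; h->vals[smallest] = tmp;
        i = smallest;
    }
}

/* Replace the top element and restore the heap with one sift-down:
 * about log2(N) levels of comparisons, no delete + reinsert pair. */
static void
heap_replace_first(MinHeap *h, int newval)
{
    h->vals[0] = newval;
    sift_down(h, 0);
}

static int
heap_first(const MinHeap *h)
{
    return h->vals[0];
}

static void
heap_build(MinHeap *h, const int *src, size_t n)
{
    h->n = n;
    h->comparisons = 0;
    for (size_t i = 0; i < n; i++)
        h->vals[i] = src[i];
    for (size_t i = n / 2; i-- > 0; )   /* heapify internal nodes bottom-up */
        sift_down(h, i);
}
```

In Gather Merge terms, heap_replace_first models pulling the next tuple from the worker that supplied the previous minimum: the slot is overwritten in place and one sift-down re-establishes the order, which is why a 2.0 factor in the per-tuple maintenance cost looks too high.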
Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw
On Tue, Nov 22, 2016 at 6:24 PM, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp> wrote: > Hi Rushabh, > > On 2016/11/22 19:05, Rushabh Lathia wrote: > >> I started reviewing the patch and here are few initial review points and >> questions for you. >> > > Thanks for the review! > > 1) >> -static void deparseExplicitTargetList(List *tlist, List >> **retrieved_attrs, >> +static void deparseExplicitTargetList(bool is_returning, >> + List *tlist, >> + List **retrieved_attrs, >>deparse_expr_cxt *context); >> >> Any particular reason of inserting new argument as 1st argunment? In >> general >> we add new flags at the end or before the out param for the existing >> function. >> > > No, I don't. I think the *context should be the last argument as in other > functions in deparse.c. How about inserting that before the out param > **retrieved_attrs, like this? > > static void > deparseExplicitTargetList(List *tlist, > bool is_returning, > List **retrieved_attrs, > deparse_expr_cxt *context); > > Yes, adding it before retrieved_attrs would be good. > 2) >> -static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root, >> -RelOptInfo *joinrel, bool use_alias, List >> **params_list); >> +static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root, >> RelOptInfo *foreignrel, >> + bool use_alias, Index target_rel, List >> **params_list); >> > > Going beyond 80 character length >> > > Will fix. > > 3) >> static void >> -deparseExplicitTargetList(List *tlist, List **retrieved_attrs, >> +deparseExplicitTargetList(bool is_returning, >> + List *tlist, >> + List **retrieved_attrs, >>deparse_expr_cxt *context) >> >> This looks bit wired to me - as comment for the >> deparseExplicitTargetList() >> function name and comments says different then what its trying to do when >> is_returning flag is passed. Either function comment need to be change or >> may be separate function fo deparse returning list for join relation? 
>> >> Is it possible to teach deparseReturningList() to deparse the returning >> list for the joinrel? >> > > I don't think it's possible to do that because deparseReturningList uses > deparseTargetList internally, which only allows us to emit a target list > for a baserel. For example, deparseReturningList can't handle a returning > list that contains join outputs like this: > > postgres=# explain verbose delete from ft1 using ft2 where ft1.a = ft2.a > returning ft1.*, ft2.*; >QUERY PLAN > > > Delete on public.ft1 (cost=100.00..102.06 rows=1 width=46) >Output: ft1.a, ft1.b, ft2.a, ft2.b >-> Foreign Delete (cost=100.00..102.06 rows=1 width=46) > Remote SQL: DELETE FROM public.t1 r1 USING public.t2 r2 WHERE > ((r1.a = r2.a)) RETURNING r1.a, r1.b, r2.a, r2.b > (4 rows) > > I'd like to change the comment for deparseExplicitTargetList. > > 4) >> >> +if (foreignrel->reloptkind == RELOPT_JOINREL) >> +{ >> +List *rlist = NIL; >> + >> +/* Create a RETURNING list */ >> +rlist = make_explicit_returning_list(rtindex, rel, >> + returningList, >> + fdw_scan_tlist); >> + >> +deparseExplicitReturningList(rlist, retrieved_attrs, ); >> +} >> +else >> +deparseReturningList(buf, root, rtindex, rel, false, >> + returningList, retrieved_attrs); >> >> The code will be more centralized if we add the logic of extracting >> returninglist >> for JOINREL inside deparseReturningList() only, isn't it? >> > > You are right. Will do. > > 5) make_explicit_returning_list() pull the var list from the >> returningList and >> build the targetentry for the returning list and at the end rewrites the >> fdw_scan_tlist. >> >> AFAIK, in case of DML - which is going to pushdown to the remote server >> ideally fdw_scan_tlist should be same as returning list, as final output >> for the query is query will be RETUNING list only. isn't that
Re: [HACKERS] Declarative partitioning - another take
way down to > the comparison function. > > Also, we no longer have list_partition_for_tuple() and > range_partition_for_tuple(). Instead, in get_partition_for_tuple() > itself, there is a bsearch followed by list and range partitioning > specific steps based on the returned offset. > > > - I don't see why you need the bound->lower stuff any more. If > > rangeinfo.bounds[offset] is a lower bound for a partition, then > > rangeinfo.bounds[offset+1] is either (a) the upper bound for that > > partition and the partition is followed by a "gap" or (b) both the > > upper bound for that partition and the lower bound for the next > > partition. With the inclusive/exclusive bound stuff gone, every range > > bound has the same sense: if the probed value is <= the bound then > > we're supposed to be a lower-numbered partition, but if > then we're > > supposed to be in this partition or a higher-numbered one. > > OK, I've managed to get rid of lower. At least it is no longer kept in > the new relcache struct PartitionBoundInfo. It is still kept in > PartitionRangeBound which is used to hold individual range bounds when > sorting them (during relcache build). Comparisons invoked during the > aforementioned sorting step still need to distinguish between lower and > upper bounds (such that '1)' < '[1'). > > Tuple-routing no longer needs to look at lower. In that case, what you > described above applies. > > As a result, one change became necessary: to how we flag individual range > bound datum as infinite or not. Previously, it was a regular Boolean > value (either infinite or not) and to distinguish +infinity from > -infinity, we looked at whether the bound is lower or upper (the lower > flag). Now, instead, the variable holding the status of individual range > bound datum is set to a ternary value: RANGE_DATUM_FINITE (0), > RANGE_DATUM_NEG_INF (1), and RANGE_DATUM_POS_INF (2), which still fits in > a bool. 
Upon encountering an infinite range bound datum, whether it's > negative or positive infinity determines the comparison result. Consider the > following example: > > partition p1 from (1, unbounded) to (1, 1); > partition p2 from (1, 1) to (1, 10); > partition p3 from (1, 10) to (1, unbounded); > partition p4 from (2, unbounded) to (2, 1); > ... so on > > In this case, we need to be able to conclude, say, (1, -inf) < (1, 15) < > (1, +inf), so that tuple (1, 15) is assigned to the proper partition. > > Does this last thing sound reasonable? > > Thanks, > Amit > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia
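The ternary range-bound encoding described above can be sketched as a comparator of the kind a bsearch over sorted bounds would invoke. The enum names follow the mail; the comparator itself, its signature, and the use of plain ints in place of Datums are assumptions for illustration, not the patch's actual code.

```c
#include <assert.h>

/* Ternary status for each range bound datum, as described above. */
typedef enum
{
    RANGE_DATUM_FINITE = 0,
    RANGE_DATUM_NEG_INF = 1,
    RANGE_DATUM_POS_INF = 2
} RangeDatumContent;

/*
 * Hypothetical comparator: compare a tuple's column values against one
 * multi-column range bound.  Returns <0 if the tuple sorts before the
 * bound, >0 if after, 0 if equal on all columns.  An infinite bound
 * datum decides the result immediately, which is what makes
 * (1, -inf) < (1, 15) < (1, +inf) come out right.
 */
static int
compare_tuple_to_bound(const int *tuple,
                       const int *bound_vals,
                       const RangeDatumContent *bound_content,
                       int natts)
{
    for (int i = 0; i < natts; i++)
    {
        if (bound_content[i] == RANGE_DATUM_NEG_INF)
            return 1;           /* any finite value sorts after -inf */
        if (bound_content[i] == RANGE_DATUM_POS_INF)
            return -1;          /* any finite value sorts before +inf */
        if (tuple[i] != bound_vals[i])
            return tuple[i] < bound_vals[i] ? -1 : 1;
    }
    return 0;
}
```

For the example partitions above, tuple (1, 15) compares greater than p3's lower bound (1, 10) and less than its upper bound (1, +inf), so it lands in p3.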
Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw
I started reviewing the patch and here are a few initial review points and questions for you.

1)
-static void deparseExplicitTargetList(List *tlist, List **retrieved_attrs,
+static void deparseExplicitTargetList(bool is_returning,
+ List *tlist,
+ List **retrieved_attrs,
 deparse_expr_cxt *context);

Any particular reason for inserting the new argument as the 1st argument? In general we add new flags at the end, or before the out param of the existing function.

2)
-static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root,
-RelOptInfo *joinrel, bool use_alias, List **params_list);
+static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root, RelOptInfo *foreignrel,
+ bool use_alias, Index target_rel, List **params_list);

This goes beyond the 80-character line length.

3)
static void
-deparseExplicitTargetList(List *tlist, List **retrieved_attrs,
+deparseExplicitTargetList(bool is_returning,
+ List *tlist,
+ List **retrieved_attrs,
 deparse_expr_cxt *context)

This looks a bit weird to me - the name and comment of deparseExplicitTargetList() say something different from what it is trying to do when the is_returning flag is passed. Either the function comment needs to change, or maybe there should be a separate function to deparse the returning list for a join relation? Is it possible to teach deparseReturningList() to deparse the returning list for the joinrel?

4)
+if (foreignrel->reloptkind == RELOPT_JOINREL)
+{
+List *rlist = NIL;
+
+/* Create a RETURNING list */
+rlist = make_explicit_returning_list(rtindex, rel,
+ returningList,
+ fdw_scan_tlist);
+
+deparseExplicitReturningList(rlist, retrieved_attrs, );
+}
+else
+deparseReturningList(buf, root, rtindex, rel, false,
+ returningList, retrieved_attrs);

The code would be more centralized if we added the logic of extracting the returning list for a JOINREL inside deparseReturningList() itself, wouldn't it?
5) make_explicit_returning_list() pulls the var list from the returningList, builds the target entries for the returning list, and at the end rewrites the fdw_scan_tlist.

AFAIK, in the case of DML that is going to be pushed down to the remote server, the fdw_scan_tlist should ideally be the same as the returning list, since the final output of the query will be the RETURNING list only. Isn't that true? If yes, then why can't we directly replace the fdw_scan_tlist with the returning list, rather than using make_explicit_returning_list()? Also, I think rewriting the fdw_scan_tlist should happen in postgres_fdw.c rather than in deparse.c.

6) Test coverage for the returning list is missing.

On Fri, Nov 18, 2016 at 8:56 AM, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp> wrote: > On 2016/11/16 16:38, Etsuro Fujita wrote: > >> On 2016/11/16 13:10, Ashutosh Bapat wrote: >> >>> I don't see any reason why DML/UPDATE pushdown should depend upon >>> subquery deparsing or least PHV patch. Combined together they can help >>> in more cases, but without those patches, we will be able to push-down >>> more stuff. Probably, we should just restrict push-down only for the >>> cases when above patches are not needed. That makes reviews easy. Once >>> those patches get committed, we may add more functionality depending >>> upon the status of this patch. Does that make sense? >>> >> > OK, I'll extract from the patch the minimal part that wouldn't depend on >> the two patches. >> > > Here is a patch for that. Todo items are: (1) add more comments and (2) > add more regression tests. I'll do that in the next version. > > Best regards, > Etsuro Fujita > -- Rushabh Lathia
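Review point 4 above suggests folding the joinrel case into deparseReturningList() itself rather than branching at the call site. A toy sketch of that centralized dispatch follows; all names, the simplified signature, and the pre-formatted column lists are hypothetical stand-ins — the real postgres_fdw function deparses from parse trees, not strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for the planner's relation kinds. */
typedef enum
{
    RELOPT_BASEREL,
    RELOPT_JOINREL
} RelOptKind;

/*
 * Toy model of the suggested refactoring: one entry point that deparses
 * a RETURNING clause for both base relations and join relations.  For a
 * join, the RETURNING list may reference columns from both sides (e.g.
 * "r1.a, r1.b, r2.a, r2.b" in the DELETE ... USING example quoted in
 * this thread), so the join-specific list is chosen here rather than at
 * every call site.
 */
static void
deparse_returning_list(char *buf, size_t buflen, RelOptKind kind,
                       const char *base_cols, const char *join_cols)
{
    const char *cols = (kind == RELOPT_JOINREL) ? join_cols : base_cols;

    snprintf(buf, buflen, " RETURNING %s", cols);
}
```

The design point is simply that callers no longer need an `if (reloptkind == RELOPT_JOINREL)` branch; the distinction lives in one place.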
Re: [HACKERS] Gather Merge
On Mon, Nov 14, 2016 at 3:51 PM, Thomas Munro <thomas.mu...@enterprisedb.com > wrote: > On Sat, Nov 12, 2016 at 1:56 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro < > thomas.mu...@enterprisedb.com> > > wrote: > >> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development > Group > >> + * Portions Copyright (c) 1994, Regents of the University of California > >> > >> Shouldn't this say just "(c) 2016, PostgreSQL Global Development > >> Group"? > > > > Fixed. > > The year also needs updating to 2016 in nodeGatherMerge.h. > Oops sorry, fixed now. > > >> + /* Per-tuple heap maintenance cost */ > >> + run_cost += path->path.rows * comparison_cost * 2.0 * logN; > >> > >> Why multiply by two? The comment above this code says "about log2(N) > >> comparisons to delete the top heap entry and another log2(N) > >> comparisons to insert its successor". In fact gather_merge_getnext > >> calls binaryheap_replace_first, which replaces the top element without > >> any comparisons at all and then performs a sift-down in log2(N) > >> comparisons to find its new position. There is no per-tuple "delete" > >> involved. We "replace" the top element with the value it already had, > >> just to trigger the sift-down, because we know that our comparator > >> function might have a new opinion of the sort order of this element. > >> Very clever! The comment and the 2.0 factor in cost_gather_merge seem > >> to be wrong though -- or am I misreading the code? > >> > > See cost_merge_append. > > That just got tweaked in commit 34ca0905. > Fixed. > > > Looking at the plan I realize that this is happening because wrong > costing > > for Gather Merge. Here in the plan we can see the row estimated by > > Gather Merge is wrong. This is because earlier patch GM was considering > > rows = subpath->rows, which is not true as subpath is partial path. So > > we need to multiple it with number of worker. 
Attached patch also fixed > > this issues. I also run the TPC-H benchmark with the patch and results > > are same as earlier. > > In create_grouping_paths: > + double total_groups = gmpath->rows * > gmpath->parallel_workers; > > This hides a variable of the same name in the enclosing scope. Maybe > confusing? > > In some other places like create_ordered_paths: > + double total_groups = path->rows * path->parallel_workers; > > Though it probably made sense to use this variable name in > create_grouping_paths, wouldn't total_rows be better here? > > Initially I just copied from the other places. I agree with you that in create_ordered_paths, total_rows makes more sense. > It feels weird to be working back to a total row count estimate from > the partial one by simply multiplying by path->parallel_workers. > Gather Merge will underestimate the total rows when parallel_workers < > 4, if using partial row estimates ultimately from cost_seqscan which > assume some leader contribution. I don't have a better idea though. > Reversing cost_seqscan's logic certainly doesn't seem right. I don't > know how to make them agree on the leader's contribution AND give > principled answers, since there seems to be some kind of cyclic > dependency in the costing logic (cost_seqscan really needs to be given > a leader contribution estimate from its superpath which knows whether > it will allow the leader to pull tuples greedily/fairly or not, but > that superpath hasn't been created yet; cost_gather_merge needs the > row count from its subpath). Or maybe I'm just confused. > > Yes, I agree with you. But we can't really change cost_seqscan. Another option I can think of is to just calculate the rows for gather merge by using the reverse of the formula used in cost_seqscan. So we can completely remove the rows argument from create_gather_merge_path() and then, inside create_gather_merge_path(), calculate the total_rows using the same formula used in cost_seqscan.
This is working fine, but I am not quite sure about the approach, so I attached that part of the changes as a separate patch. Any suggestions? -- Rushabh Lathia www.EnterpriseDB.com gather_merge_v4_minor_changes.patch Description: application/download gm_v4_plus_rows_estimate.patch Description: application/download
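On the question above of working back from a partial row estimate, a hedged sketch of "reversing cost_seqscan's formula": cost_seqscan in that era divided the total row count by a parallel divisor — the worker count plus an estimated leader contribution of (1 - 0.3 × workers), clamped at zero — so the reverse multiplies the partial estimate by that same divisor rather than by parallel_workers alone. The function names here are illustrative, not the patch's.

```c
#include <assert.h>
#include <math.h>

/*
 * Sketch of the parallel divisor used by cost_seqscan (circa 9.6): the
 * number of workers plus the leader's estimated contribution, where the
 * leader is assumed to spend 30% of its time servicing each worker and
 * so contributes nothing once there are 4 or more workers.
 */
static double
parallel_divisor(int parallel_workers)
{
    double divisor = parallel_workers;
    double leader_contribution = 1.0 - 0.3 * parallel_workers;

    if (leader_contribution > 0)
        divisor += leader_contribution;
    return divisor;
}

/*
 * Recover a total row count from a per-process partial estimate by
 * multiplying by the same divisor cost_seqscan divided by, instead of
 * multiplying by parallel_workers alone.
 */
static double
total_rows_from_partial(double partial_rows, int parallel_workers)
{
    return partial_rows * parallel_divisor(parallel_workers);
}
```

This also explains the remark earlier in the thread that multiplying by parallel_workers underestimates the total when parallel_workers < 4: in that range the divisor exceeds the worker count because the leader still contributes.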
Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw
Thanks Fujita-san for working on this. I've signed up to review this patch. Your latest patch does not apply cleanly on the master branch:

patching file contrib/postgres_fdw/deparse.c
6 out of 17 hunks FAILED -- saving rejects to file contrib/postgres_fdw/deparse.c.rej
patching file contrib/postgres_fdw/expected/postgres_fdw.out
2 out of 2 hunks FAILED -- saving rejects to file contrib/postgres_fdw/expected/postgres_fdw.out.rej
patching file contrib/postgres_fdw/postgres_fdw.c
2 out of 22 hunks FAILED -- saving rejects to file contrib/postgres_fdw/postgres_fdw.c.rej
patching file contrib/postgres_fdw/postgres_fdw.h
1 out of 3 hunks FAILED -- saving rejects to file contrib/postgres_fdw/postgres_fdw.h.rej

Please share a patch that applies cleanly on the master branch.

On Fri, Nov 11, 2016 at 5:00 PM, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp> wrote: > On 2016/09/08 19:55, Etsuro Fujita wrote: > >> On 2016/09/07 13:21, Ashutosh Bapat wrote: >> >>> * with the patch: >>> postgres=# explain verbose delete from ft1 using ft2 where ft1.a = >>> ft2.a; >>> QUERY PLAN >>> >>> >>> - >>> >>> Delete on public.ft1 (cost=100.00..102.04 rows=1 width=38) >>>-> Foreign Delete (cost=100.00..102.04 rows=1 width=38) >>> Remote SQL: DELETE FROM public.t1 r1 USING (SELECT ROW(a, >>> b), a FROM public.t2) ss1(c1, c2) WHERE ((r1.a = ss1.c2)) >>> (3 rows) >>> >> >> The underlying scan on t2 requires ROW(a,b) for locking the row for >>> update/share. But clearly it's not required if the full query is being >>> pushed down. >>> >> > Is there a way we can detect that ROW(a,b) is useless >>> column (not used anywhere in the other parts of the query like >>> RETURNING, DELETE clause etc.) and eliminate it? >>> >> > I don't have a clear solution for that yet, but I'll try to remove that >> in the next version. >> > > Similarly for a, it's >>> part of the targetlist of the underlying scan so that the WHERE clause >>> can be applied on it.
But it's not needed if we are pushing down the >>> query. If we eliminate the targetlist of the query, we could construct a >>> remote query without having subquery in it, making it more readable. >>> >> > Will try to do so also. >> > > I addressed this by improving the deparse logic so that a remote query for > performing an UPDATE/DELETE on a join directly on the remote can be created > as proposed if possible. Attached is an updated version of the patch, > which is created on top of the patch set [1]. The patch is still WIP (ie, > needs more comments and regression tests, at least), but any comments would > be gratefully appreciated. > > Best regards, > Etsuro Fujita > > [1] https://www.postgresql.org/message-id/11eafd10-d3f8-ac8a-b64 > 2-b0e65037c76b%40lab.ntt.co.jp > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia
Re: [HACKERS] Gather Merge
Oops forgot to attach latest patch in the earlier mail. On Fri, Nov 11, 2016 at 6:26 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro < > thomas.mu...@enterprisedb.com> wrote: > >> On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia >> <rushabh.lat...@gmail.com> wrote: >> > Please find attached latest patch which fix the review point as well as >> > additional clean-up. >> >> I've signed up to review this patch and I'm planning to do some >> testing. Here's some initial feedback after a quick read-through: >> >> > Thanks Thomas. > > >> + if (gather_merge_readnext(gm_state, i, initialize ? false : true)) >> >> Clunky ternary operator... how about "!initialize". >> >> > Fixed. > > >> +/* >> + * Function clear out a slot in the tuple table for each gather merge >> + * slots and returns the clear clear slot. >> + */ >> >> Maybe better like this? "_Clear_ out a slot in the tuple table for >> each gather merge _slot_ and _return_ the _cleared_ slot." >> >> > Fixed. > > >> + /* Free tuple array as we no more need it */ >> >> "... as we don't need it any more" >> >> > Fixed > > >> +/* >> + * Read the next tuple for gather merge. >> + * >> + * Function fetch the sorted tuple out of the heap. >> + */ >> >> "_Fetch_ the sorted tuple out of the heap." >> >> > Fixed > > >> + * Otherwise, pull the next tuple from whichever participate we >> + * returned from last time, and reinsert the index into the heap, >> + * because it might now compare differently against the existing >> >> s/participate/participant/ >> >> > Fixed. > > >> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group >> + * Portions Copyright (c) 1994, Regents of the University of California >> >> Shouldn't this say just "(c) 2016, PostgreSQL Global Development >> Group"? > > > Fixed. > > >> Are we supposed to be blaming the University of California >> for new files? >> >> > Not quite sure about this, so keeping this as it is. 
> > >> +#include "executor/tqueue.h" >> +#include "miscadmin.h" >> +#include "utils/memutils.h" >> +#include "utils/rel.h" >> +#include "lib/binaryheap.h" >> >> Not correctly sorted. >> >> > Copied from nodeGather.c. But Fixed here. > > >> + /* >> + * store the tuple descriptor into gather merge state, so we can use it >> + * later while initilizing the gather merge slots. >> + */ >> >> s/initilizing/initializing/ >> >> > Fixed. > > >> +/* >> + * ExecEndGatherMerge >> + * >> + * frees any storage allocated through C routines. >> + * >> >> The convention in Postgres code seems to be to use a form like "Free >> any storage ..." in function documentation. Not sure if that's an >> imperative, an infinitive, or if the word "we" is omitted since >> English is so fuzzy like that, but it's inconsistent with other >> documentation to use "frees" here. Oh, I see that exact wording is in >> several other files. I guess I'll just leave this as a complaint >> about all those files then :-) >> >> > Sure. > > >> + * Pull atleast single tuple from each worker + leader and set up the >> heap. >> >> s/atleast single/at least a single/ >> >> > Fixed. > > >> + * Read the tuple for given reader into nowait mode, and form the tuple >> array. >> >> s/ into / in / >> >> > Fixed. > > >> + * Function attempt to read tuple for the given reader and store it into >> reader >> >> s/Function attempt /Attempt / >> >> > Fixed. > > >> + * Function returns true if found tuple for the reader, otherwise returns >> >> s/Function returns /Return / >> >> > Fixed. > > >> + * First try to read tuple for each worker (including leader) into nowait >> + * mode, so that we initialize read from each worker as well as leader. >> >> I wonder if it would be good to standardise on the terminology we use >> when we mean workers AND the leader. In my Parallel Shared Hash work, >> I've been saying '
Re: [HACKERS] Gather Merge
On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote: > On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Please find attached latest patch which fix the review point as well as > > additional clean-up. > > I've signed up to review this patch and I'm planning to do some > testing. Here's some initial feedback after a quick read-through: > > Thanks Thomas. > + if (gather_merge_readnext(gm_state, i, initialize ? false : true)) > > Clunky ternary operator... how about "!initialize". > > Fixed. > +/* > + * Function clear out a slot in the tuple table for each gather merge > + * slots and returns the clear clear slot. > + */ > > Maybe better like this? "_Clear_ out a slot in the tuple table for > each gather merge _slot_ and _return_ the _cleared_ slot." > > Fixed. > + /* Free tuple array as we no more need it */ > > "... as we don't need it any more" > > Fixed > +/* > + * Read the next tuple for gather merge. > + * > + * Function fetch the sorted tuple out of the heap. > + */ > > "_Fetch_ the sorted tuple out of the heap." > > Fixed > + * Otherwise, pull the next tuple from whichever participate we > + * returned from last time, and reinsert the index into the heap, > + * because it might now compare differently against the existing > > s/participate/participant/ > > Fixed. > + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group > + * Portions Copyright (c) 1994, Regents of the University of California > > Shouldn't this say just "(c) 2016, PostgreSQL Global Development > Group"? Fixed. > Are we supposed to be blaming the University of California > for new files? > > Not quite sure about this, so keeping this as it is. > +#include "executor/tqueue.h" > +#include "miscadmin.h" > +#include "utils/memutils.h" > +#include "utils/rel.h" > +#include "lib/binaryheap.h" > > Not correctly sorted. > > Copied from nodeGather.c. But Fixed here. 
> + /* > + * store the tuple descriptor into gather merge state, so we can use it > + * later while initilizing the gather merge slots. > + */ > > s/initilizing/initializing/ > > Fixed. > +/* > + * ExecEndGatherMerge > + * > + * frees any storage allocated through C routines. > + * > > The convention in Postgres code seems to be to use a form like "Free > any storage ..." in function documentation. Not sure if that's an > imperative, an infinitive, or if the word "we" is omitted since > English is so fuzzy like that, but it's inconsistent with other > documentation to use "frees" here. Oh, I see that exact wording is in > several other files. I guess I'll just leave this as a complaint > about all those files then :-) > > Sure. > + * Pull atleast single tuple from each worker + leader and set up the > heap. > > s/atleast single/at least a single/ > > Fixed. > + * Read the tuple for given reader into nowait mode, and form the tuple > array. > > s/ into / in / > > Fixed. > + * Function attempt to read tuple for the given reader and store it into > reader > > s/Function attempt /Attempt / > > Fixed. > + * Function returns true if found tuple for the reader, otherwise returns > > s/Function returns /Return / > > Fixed. > + * First try to read tuple for each worker (including leader) into nowait > + * mode, so that we initialize read from each worker as well as leader. > > I wonder if it would be good to standardise on the terminology we use > when we mean workers AND the leader. In my Parallel Shared Hash work, > I've been saying 'participants' if I mean = workers + leader. What do > you think? > > I am not quite sure about 'participants'. In my opinion, when we explicitly say workers + leader it's clearer. I am open to changing it if the committer thinks otherwise. > + * After this if all active worker unable to produce the tuple, then > + * re-read and this time read the tuple into wait mode. 
For the worker, > + * which was able to produced single tuple in the earlier loop and still > + * active, just try fill the tuple array if more tuples available. > + */ > > How about this? "After this, if all active workers are unable to > produce a tuple, then re-read and this time us wait mode. For workers > that were able to produce a tuple in the earlier loop and are still > active, just try to fill the tuple arra
Re: [HACKERS] Gather Merge
Please find attached the latest patch, which fixes the review points as well as some additional clean-up. - Get rid of funnel_slot as it's not needed for the Gather Merge - Renamed fill_tuple_array to form_tuple_array - Fix possible infinite loop in gather_merge_init (reported by Amit Kapila) - Fix tqueue.c memory leak by calling TupleQueueReaderNext() with a per-tuple context. - Code cleanup in ExecGatherMerge. - Added new function gather_merge_clear_slots(), which clears out all gather merge slots and also frees the tuple array at end of execution. I did the performance testing again with the latest patch and I haven't observed any regression. Some of the TPC-H queries show additional benefit with the latest patch, but it's just under 5%. Do let me know if I missed anything. On Mon, Oct 24, 2016 at 11:55 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Thu, Oct 20, 2016 at 1:12 PM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > >> On Tue, Oct 18, 2016 at 5:29 PM, Rushabh Lathia >> <rushabh.lat...@gmail.com> wrote: >> > On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit.kapil...@gmail.com> >> > wrote: >> >> >> >> There is lot of common code between ExecGatherMerge and ExecGather. >> >> Do you think it makes sense to have a common function to avoid the >> >> duplicity? >> >> >> >> I see there are small discrepancies in both the codes like I don't see >> >> the use of single_copy flag, as it is present in gather node. >> >> >> > >> > Yes, even I thought to centrilize some things of ExecGather and >> > ExecGatherMerge, >> > but its really not something that is fixed. And I thought it might >> change >> > particularly >> > for the Gather Merge. And as explained by Robert single_copy doesn't >> make >> > sense >> > for the Gather Merge. I will still look into this to see if something >> can be >> > make >> > centralize. >> > >> >> Okay, I haven't thought about it, but do let me know if you couldn't >> find any way to merge the code. >> >> > Sure, I will look into this. 
> > >> >> >> >> 3. >> >> +gather_merge_readnext(GatherMergeState * gm_state, int reader, bool >> >> force) >> >> { >> >> .. >> >> + tup = gm_readnext_tuple(gm_state, reader, force, NULL); >> >> + >> >> + /* >> >> + >> >> * try to read more tuple into nowait mode and store it into the tuple >> >> + * array. >> >> + >> >> */ >> >> + if (HeapTupleIsValid(tup)) >> >> + fill_tuple_array(gm_state, reader); >> >> >> >> How the above read tuple is stored in array? In anycase the above >> >> interface seems slightly awkward to me. Basically, I think what you >> >> are trying to do here is after reading first tuple in waitmode, you >> >> are trying to fill the array by reading more tuples. So, can't we >> >> push reading of this fist tuple into that function and name it as >> >> form_tuple_array(). >> >> >> > >> > Yes, you are right. >> > >> >> You have not answered my first question. I will try to ask again, how >> the tuple read by below code is stored in the array: >> >> > Tuple directly get stored into related TupleTableSlot. > In gather_merge_readnext() > at the end of function it build the TupleTableSlot for the given tuple. So > tuple > read by above code is directly stored into TupleTableSlot. > > >> + tup = gm_readnext_tuple(gm_state, reader, force, NULL); >> >> > First its trying to read tuple into wait-mode, and once >> > it >> > find tuple then its try to fill the tuple array (which basically try to >> read >> > tuple >> > into nowait-mode). Reason I keep it separate is because in case of >> > initializing >> > the gather merge, if we unable to read tuple from all the worker - while >> > trying >> > re-read, we again try to fill the tuple array for the worker who already >> > produced >> > atleast a single tuple (see gather_merge_init() for more details). >> > >> >> Whenever any worker produced one tuple, you already try to fill the >> array in gather_merge_readnext(), so does the above code help much? 
>> >> > Also I >> > thought >> > filling tuple array (which basically read tuple into nowait mode) and >> > reading tuple >> > (into wait-mode) are two separate t
Re: [HACKERS] Gather Merge
On Thu, Oct 20, 2016 at 1:12 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Tue, Oct 18, 2016 at 5:29 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit.kapil...@gmail.com> > > wrote: > >> > >> There is lot of common code between ExecGatherMerge and ExecGather. > >> Do you think it makes sense to have a common function to avoid the > >> duplicity? > >> > >> I see there are small discrepancies in both the codes like I don't see > >> the use of single_copy flag, as it is present in gather node. > >> > > > > Yes, even I thought to centrilize some things of ExecGather and > > ExecGatherMerge, > > but its really not something that is fixed. And I thought it might change > > particularly > > for the Gather Merge. And as explained by Robert single_copy doesn't make > > sense > > for the Gather Merge. I will still look into this to see if something > can be > > make > > centralize. > > > > Okay, I haven't thought about it, but do let me know if you couldn't > find any way to merge the code. > > Sure, I will look into this. > >> > >> 3. > >> +gather_merge_readnext(GatherMergeState * gm_state, int reader, bool > >> force) > >> { > >> .. > >> + tup = gm_readnext_tuple(gm_state, reader, force, NULL); > >> + > >> + /* > >> + > >> * try to read more tuple into nowait mode and store it into the tuple > >> + * array. > >> + > >> */ > >> + if (HeapTupleIsValid(tup)) > >> + fill_tuple_array(gm_state, reader); > >> > >> How the above read tuple is stored in array? In anycase the above > >> interface seems slightly awkward to me. Basically, I think what you > >> are trying to do here is after reading first tuple in waitmode, you > >> are trying to fill the array by reading more tuples. So, can't we > >> push reading of this fist tuple into that function and name it as > >> form_tuple_array(). > >> > > > > Yes, you are right. > > > > You have not answered my first question. 
I will try to ask again, how > the tuple read by below code is stored in the array: > > The tuple gets stored directly into the related TupleTableSlot. At the end of gather_merge_readnext() the function builds the TupleTableSlot for the given tuple, so the tuple read by the above code is stored directly into that TupleTableSlot. >> + tup = gm_readnext_tuple(gm_state, reader, force, NULL); > > > First its trying to read tuple into wait-mode, and once > > it > > find tuple then its try to fill the tuple array (which basically try to > read > > tuple > > into nowait-mode). Reason I keep it separate is because in case of > > initializing > > the gather merge, if we unable to read tuple from all the worker - while > > trying > > re-read, we again try to fill the tuple array for the worker who already > > produced > > atleast a single tuple (see gather_merge_init() for more details). > > > > Whenever any worker produced one tuple, you already try to fill the > array in gather_merge_readnext(), so does the above code help much? > > > Also I > > thought > > filling tuple array (which basically read tuple into nowait mode) and > > reading tuple > > (into wait-mode) are two separate task - and if its into separate > function > > that code > > look more clear. > > To me that looks slightly confusing. > > > If you have any suggestion for the function name > > (fill_tuple_array) > > then I am open to change that. > > > > form_tuple_array (form_tuple is used at many places in code, so it > should look okay). OK, I will rename it in the next patch. > I think you might want to consider forming array > even for leader, although it might not be as beneficial as for > workers, OTOH, the code will get simplified if we do that way. > Yes, I did that earlier - and, as you guessed, it wasn't beneficial, so to avoid the extra memory allocation for the tuple array I am not forming the array for the leader. Today, I observed another issue in code: > > +gather_merge_init(GatherMergeState *gm_state) > { > .. 
> +reread: > + for (i = 0; i < nreaders + 1; i++) > + { > + if (TupIsNull(gm_state->gm_slots[i]) || > + gm_state->gm_slots[i]->tts_isempty) > + { > + if (gather_merge_readnext(gm_state, i, initialize ? false : true)) > + { > + binaryheap_add_unordered(gm_state->gm_heap, > + Int32GetDatum(i)); > + } > + } > + else > + fill_tuple_array
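The heap-driven merge that the quoted gather_merge_init code sets up can be modeled with Python's heapq. This is a sketch of the general technique only — the patch uses PostgreSQL's binaryheap with a comparator over the node's sort keys — assuming each participant delivers its tuples already locally sorted: seed the heap with one tuple per participant, then repeatedly pop the smallest and reinsert that participant's next tuple.

```python
# Model of the Gather Merge loop (assumption: simplified stand-in for the
# C binaryheap logic; tuples are plain comparable values here).
import heapq

def gather_merge(streams):
    """streams: one iterator of locally sorted tuples per participant."""
    its = [iter(s) for s in streams]
    heap = []
    # Seed: at least one tuple from every participant that has any.
    for i, it in enumerate(its):
        t = next(it, None)
        if t is not None:
            heapq.heappush(heap, (t, i))
    while heap:
        t, i = heapq.heappop(heap)
        yield t
        # Reinsert this participant's next tuple, because it might now
        # compare differently against the remaining heap entries.
        nxt = next(its[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))

print(list(gather_merge([[1, 4, 7], [2, 5], [3, 6, 8]])))
# -> [1, 2, 3, 4, 5, 6, 7, 8]
```

The output is globally sorted without any participant having to see another participant's tuples, which is exactly what lets Gather Merge skip a top-level Sort.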
Re: [HACKERS] Gather Merge
On Thu, Oct 20, 2016 at 12:22 AM, Peter Geoghegan <p...@heroku.com> wrote: > On Tue, Oct 4, 2016 at 11:05 PM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Query 4: With GM 7901.480 -> Without GM 9064.776 > > Query 5: With GM 53452.126 -> Without GM 55059.511 > > Query 9: With GM 52613.132 -> Without GM 98206.793 > > Query 15: With GM 68051.058 -> Without GM 68918.378 > > Query 17: With GM 129236.075 -> Without GM 160451.094 > > Query 20: With GM 259144.232 -> Without GM 306256.322 > > Query 21: With GM 153483.497 -> Without GM 168169.916 > > > > Here from the results we can see that query 9, 17 and 20 are the one > which > > show good performance benefit with the Gather Merge. > > Were all other TPC-H queries unaffected? IOW, did they have the same > plan as before with your patch applied? Did you see any regressions? > > Yes, all other TPC-H queries were unaffected by the patch. At the initial stage of patch development I noticed regressions, but then realized that it was because I was not allowing the leader to participate in the GM. Later on I fixed that, and after that I didn't notice any regressions. I assume that this patch has each worker use work_mem for its own > sort, as with hash joins today. One concern with that model when > testing is that you could end up with a bunch of internal sorts for > cases with a GM node, where you get one big external sort for cases > without one. Did you take that into consideration? > > Yes, but isn't that good? Please correct me if I am missing anything. > -- > Peter Geoghegan > -- Rushabh Lathia www.EnterpriseDB.com
Re: [HACKERS] Gather Merge
Thanks, Amit, for reviewing this patch. On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Wed, Oct 5, 2016 at 11:35 AM, Rushabh Lathia > <rushabh.lat...@gmail.com> wrote: > > Hi hackers, > > > > Attached is the patch to implement Gather Merge. > > > > Couple of review comments: > > 1. > ExecGatherMerge() > { > .. > + /* No workers? Then never mind. */ > + if (!got_any_worker > || > + node->nreaders < 2) > + { > + > ExecShutdownGatherMergeWorkers(node); > + node->nreaders = 0; > + > } > > Are you planning to restrict the use of gather merge based on number > of workers, if there is a valid reason, then I think comments should > be updated for same? > > Thanks for catching this. This is left over from the earlier design patch. With the current design we don't have any limitation on the number of workers. I did performance testing with max_parallel_workers_per_gather set to 1 and didn't notice any performance degradation. So I removed this limitation in the attached patch. 2. > +ExecGatherMerge(GatherMergeState * node){ > .. > + if (!node->initialized) > + { > + EState *estate = node->ps.state; > + > GatherMerge *gm = (GatherMerge *) node->ps.plan; > + > + /* > + * Sometimes we > might have to run without parallelism; but if parallel > + * mode is active then we can try to > fire up some workers. > + */ > + if (gm->num_workers > 0 && IsInParallelMode()) > + > { > + ParallelContext *pcxt; > + bool got_any_worker = > false; > + > + /* Initialize the workers required to execute Gather node. */ > + > if (!node->pei) > + node->pei = ExecInitParallelPlan(node- > >ps.lefttree, > + > estate, > + > gm->num_workers); > .. > } > > There is lot of common code between ExecGatherMerge and ExecGather. > Do you think it makes sense to have a common function to avoid the > duplicity? > > I see there are small discrepancies in both the codes like I don't see > the use of single_copy flag, as it is present in gather node. 
> > Yes, even I thought to centralize some things of ExecGather and ExecGatherMerge, but it's really not something that is fixed yet. And I thought it might change, particularly for the Gather Merge. And, as explained by Robert, single_copy doesn't make sense for the Gather Merge. I will still look into this to see if something can be made common. > 3. > +gather_merge_readnext(GatherMergeState * gm_state, int reader, bool > force) > { > .. > + tup = gm_readnext_tuple(gm_state, reader, force, NULL); > + > + /* > + > * try to read more tuple into nowait mode and store it into the tuple > + * array. > + > */ > + if (HeapTupleIsValid(tup)) > + fill_tuple_array(gm_state, reader); > > How the above read tuple is stored in array? In anycase the above > interface seems slightly awkward to me. Basically, I think what you > are trying to do here is after reading first tuple in waitmode, you > are trying to fill the array by reading more tuples. So, can't we > push reading of this fist tuple into that function and name it as > form_tuple_array(). > > Yes, you are right. First it tries to read a tuple in wait mode, and once it finds a tuple it tries to fill the tuple array (which basically tries to read tuples in nowait mode). The reason I keep it separate is that, when initializing the gather merge, if we are unable to read a tuple from every worker, then while re-reading we again try to fill the tuple array for each worker that has already produced at least a single tuple (see gather_merge_init() for more details). Also, I thought filling the tuple array (which basically reads tuples in nowait mode) and reading a tuple (in wait mode) are two separate tasks, and with them in separate functions the code looks clearer. If you have any suggestion for the function name (fill_tuple_array) then I am open to changing it. > 4. > +create_gather_merge_path(..) > { > .. > + 0 /* FIXME: pathnode->limit_tuples*/); > > What exactly you want to fix in above code? > > Fixed. > 5. 
> +/* Tuple array size */ > +#define MAX_TUPLE_STORE 10 > > Have you tried with other values of MAX_TUPLE_STORE? If yes, then > what are the results? I think it is better to add a comment why array > size is best for performance. > > Actually, I was thinking about that, but I didn't want to add it there because it's just a performance number on my machine. Anyway, I added the comments. > > 6. > +/* INTERFACE ROUTINES > + * ExecInitGatherMerge - initialize the MergeAppend > node > + * ExecGatherMerge - retrieve the next tuple from the node > + * > ExecEndGatherMerge - shut down the MergeAppend nod
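As a back-of-the-envelope illustration of why the MAX_TUPLE_STORE tuple array discussed above helps (the numbers here are illustrative only, not measurements from the thread): batching up to MAX_TUPLE_STORE tuples per visit to a worker's shared queue divides the number of queue interactions by roughly the array size.

```python
# Toy cost model (assumption: illustrative arithmetic only; real queue
# costs also include synchronization, which is what batching amortizes).

MAX_TUPLE_STORE = 10  # the array size used in the patch under discussion

def queue_visits(ntuples, batch_size):
    """How many times the leader must touch a worker's shared queue to
    drain ntuples, reading up to batch_size tuples per visit."""
    return -(-ntuples // batch_size)  # ceiling division

print(queue_visits(1_000_000, 1))                # one visit per tuple
print(queue_visits(1_000_000, MAX_TUPLE_STORE))  # ~10x fewer visits
```

This is why the reviewers' question about trying other array sizes matters: the win grows with the array size but with diminishing returns, and the best value is workload- and machine-dependent.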
Re: [HACKERS] Showing parallel status in \df+
On Thu, Sep 29, 2016 at 12:07 AM, Pavel Stehule <pavel.steh...@gmail.com> wrote: > Hi > > 2016-09-28 18:57 GMT+02:00 Tom Lane <t...@sss.pgh.pa.us>: > >> Pavel Stehule <pavel.steh...@gmail.com> writes: >> > 2016-09-28 16:03 GMT+02:00 Tom Lane <t...@sss.pgh.pa.us>: >> >> I propose to push my current patch (ie, move PL function >> >> source code to \df+ footers), and we can use it in HEAD for awhile >> >> and see what we think. We can alway improve or revert it later. >> >> > I had some objection to format of source code - it should be full source >> > code, not just header and body. >> >> That would be redundant with stuff that's in the main part of the \df >> display. I really don't need to see the argument types twice, for >> instance. >> > > I am sorry, I disagree. Proposed form is hard readable. Is not possible to > simply copy/paste. > > I cannot to imagine any use case for proposed format. > > I just did testing on Tom's patch - which shows PL source code as a footer (show-pl-source-code-as-a-footer.patch). I am sorry, but I agree with Pavel: it is hard to read - and it's not adding any simplification over what we have now. Pavel Stehule <pavel.steh...@gmail.com> writes: > We are in cycle because prosrc field is used for two independent features - > and then it can be hard to find a agreement. > I thought pretty much everyone was on board with the idea of keeping > prosrc in \df+ for internal/C-language functions (and then probably > renaming the column, since it isn't actually source code in that case). >The argument is over what to do for PL functions, which is only one use > case not two Thinking more, I am fine with keeping prosrc in \df+ for internal/C-language functions (with a changed column name), and then \sf will be used to get the source code for PL and SQL functions. Regards > > Pavel > > >> >> regards, tom lane >> > > -- Rushabh Lathia www.EnterpriseDB.com
Re: [HACKERS] Showing parallel status in \df+
On Mon, Sep 26, 2016 at 3:06 PM, Stephen Frost <sfr...@snowman.net> wrote: > I feel like we're getting wrapped around the axle as it regards who is > perceived to be voting for what. Thanks Stephen Frost for listing down all the concerns from the people on the different approaches. On Tue, Sep 27, 2016 at 7:56 PM, Stephen Frost <sfr...@snowman.net> wrote: > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > I think the debate is more about whether moving the source display > > functionality over to \sf is a better solution than rearranging \df+ > > output. (If we had consensus to do that, I'd be happy to go code it, > > but I'm not going to invest the effort when it seems like we don't.) > > Right, that's the main question. > > > If we'd had \sf all along, I think it's likely that we would never > > have put source-code display into \df. But of course we didn't, > > Indeed. > > > and what would have been best in a green field is not necessarily > > what's best or achievable given existing reality. Both Robert and > > Peter have put forward the argument that people are used to finding > > this info in \df+ output, and I think that deserves a whole lot of > > weight. The \sf solution might be cleaner, but it's not so much > > better that it can justify forcing people to relearn their habits. > > > > So I think that rearranging \df+ output is really what we ought to > > be doing here. > > Alright, given that Robert's made it clear what his preference is and > you're in agreement with that, I'll remove my objection to moving down > that path. I agree that it's better than the current situation. If we > do end up improving \sf (which seems like a good idea, in general), then > we may wish to consider a display option to control if the source is > included in \df+ or not, but that doesn't need to bar this patch from > going in. 
> > The earlier comments on the thread hadn't been as clear with regard to > who held what opinions regarding the options and I'm glad that we were > able to reach a point where it was much clearer that there was strong > support for keeping the source in \df+. > > > I'm not necessarily wedded to any of the precise details of what I did > > in my patch --- for instance, maybe function bodies ought to be indented > > one tab stop? But we've not gotten to the merits of such points, for > > lack of agreement about whether this is the basic approach to take. > > As for this, I wouldn't indent or change the source at all. For > starters, indentation actually matters for some PLs, and I can certainly > see people wanting to be able to copy/paste from the output, now that > it'll be possible to reasonably do from the \df+ output. > Yes, it seems like "source code" as part of the \df+ output (irrespective of language) is still very much useful for people - it makes sense not to change it at all. I also agree with Stephen's view that once we do end up improving \sf, we may reconsider removing source code from the \df+ output. For now we should stick with the goal of a thread that started out being about showing parallel status in \df+ output. > Thanks! > > Stephen > -- Rushabh Lathia www.EnterpriseDB.com
Re: [HACKERS] PoC: Make it possible to disallow WHERE-less UPDATE and DELETE
On Mon, Sep 26, 2016 at 5:48 AM, Thomas Munro <thomas.mu...@enterprisedb.com > wrote: > On Mon, Sep 19, 2016 at 4:02 PM, David Fetter <da...@fetter.org> wrote: > > > > [training_wheels_004.patch] > > openjade:filelist.sgml:144:16:E: character "_" invalid: only parameter > literal, "CDATA", "ENDTAG", "MD", "MS", "PI", "PUBLIC", "SDATA", > "STARTTAG", "SYSTEM" and parameter separators allowed > openjade:contrib.sgml:138:2:W: cannot generate system identifier for > general entity "require" > > The documentation doesn't build here, I think because require_where is > not an acceptable entity name. It works for me if I change the > underscore to a minus in various places. That fixes these errors: > > + > + Here is an example showing how to set up a database cluster with > + require_where. > + > +$ psql -U postgres > +# SHOW shared_preload_libraries; /* Make sure not to clobber > something by accident */ > + > +If you found something, > +# ALTER SYSTEM SET > shared_preload_libraries='the,stuff,you,found,require_where'; > + > +Otherwise, > +# ALTER SYSTEM SET shared_preload_libraries='require_where'; > + > +Then restart PostgreSQL > + > + > > Could use a full stop (period) on the end of that sentence. Also it > shouldn't be inside the "screen" tags. Maybe "If you found > something," and "Otherwise," shouldn't be either, or should somehow be > marked up so as not to appear to be text from the session. > > postgres=# delete from foo; > ERROR: DELETE requires a WHERE clause > HINT: To delete all rows, use "WHERE true" or similar. > > Maybe one of those messages could use some indication of where this is > coming from, for surprised users encountering this non-standard > behaviour for the first time? > > +1. I think the hint message should be clearer about where it's coming from. Maybe it can specify the GUC name, or suggest that if you really want to use DELETE without a WHERE clause you should turn off this GUC - or something along those lines. 
> FWIW I saw something similar enforced globally by the DBA team at a > large company with many database users. I think experienced users > probably initially felt mollycoddled when they first encountered the > error but I'm sure that some were secretly glad of its existence from > time to time... I think it's a useful feature for users who want it, > and a nice little demonstration of how extensible Postgres is. > > -- > Thomas Munro > http://www.enterprisedb.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Rushabh Lathia
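The hint improvement suggested above — naming the GUC so surprised users know where the error comes from — can be illustrated with a toy model. The real require_where is a C extension hooked into PostgreSQL's parse analysis; this Python sketch only shows the shape of the error and hint text, and both the setting name "require_where.delete" and the regex-based SQL matching are illustrative assumptions, not the actual module's behavior.

```python
# Toy model of a require_where-style check (assumption: hypothetical
# "require_where.delete" setting; naive regex matching instead of a
# real parser).
import re

def check_delete(sql, require_where_delete=True):
    """Raise with an error + hint naming the GUC when a DELETE has no
    WHERE clause and the setting is on."""
    if not require_where_delete:
        return
    if re.match(r"(?is)^\s*delete\b", sql) and not re.search(r"(?is)\bwhere\b", sql):
        raise ValueError(
            "DELETE requires a WHERE clause (enforced by require_where)\n"
            'HINT: To delete all rows, use "WHERE true" or similar, '
            "or turn off require_where.delete.")

check_delete("DELETE FROM foo WHERE id = 1")                 # passes
check_delete("DELETE FROM foo", require_where_delete=False)  # GUC off, passes
```

The key point is in the message text: both the ERROR and the HINT mention the extension and its setting, which addresses the "where is this coming from?" complaint in the thread.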
Re: [HACKERS] Showing parallel status in \df+
On Thu, Sep 22, 2016 at 10:04 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Rushabh Lathia <rushabh.lat...@gmail.com> writes: > > I agree with the argument in this thread, having "Source code" as part > > of \df+ is bit annoying, specifically when output involve some really > > big PL language functions. Having is separate does make \df+ output more > > readable. So I would vote for \df++ rather then adding the source code > > as part of footer for \df+. > > If it's unreadable in \df+, how would \df++ make that any better? > > Even though the source code as part of \df+ is a bit annoying (specifically for PL functions), I noticed the argument in this thread that it's useful information for some. So \df++ is just an alternate option for those who want the source code. > regards, tom lane > -- Rushabh Lathia
Re: [HACKERS] Confusing docs about GetForeignUpperPaths in fdwhandler.sgml
On Fri, Sep 2, 2016 at 5:00 PM, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp> wrote: > Hi Robert, > > Thanks for the comments! > > On 2016/09/02 11:55, Robert Haas wrote: > >> On Mon, Aug 1, 2016 at 5:44 PM, Etsuro Fujita >> <fujita.ets...@lab.ntt.co.jp> wrote: >> >>> I noticed that the following note about direct modification via >>> GetForeignUpperPaths in fdwhandler.sgml is a bit confusing. We have >>> another >>> approach using PlanDirectModify, so that should be reflected in the note >>> as >>> well. Please find attached a patch. >>> >>> PlanForeignModify and the other callbacks described in >>> are designed around the >>> assumption >>> that the foreign relation will be scanned in the usual way and then >>> individual row updates will be driven by a local >>> ModifyTable >>> plan node. This approach is necessary for the general case where an >>> update requires reading local tables as well as foreign tables. >>> However, if the operation could be executed entirely by the foreign >>> server, the FDW could generate a path representing that and insert >>> it >>> into the UPPERREL_FINAL upper relation, where it would >>> compete against the ModifyTable approach. >>> >> > I suppose this is factually correct, but I don't think it's very >> illuminating. I think that if we're going to document both >> approaches, there should be some discussion of the pros and cons of >> PlanDirectModify vs. PlanForeignModify. >> > > PlanDirectModify vs. GetForeignUpperPaths for an UPPERREL_FINAL upper > relation? > > Of course either should be >> better than an iterative ModifyTable, but how should the FDW author >> decide between the two of them? >> > > That would apply to row locking. 
We have two approaches for that too:
> GetForeignRowMarkType and GetForeignUpperPaths, which is documented in the
> same paragraph following the above documentation:
>
> This approach
> could also be used to implement remote SELECT FOR UPDATE,
> rather than using the row locking callbacks described in
> . Keep in mind that a path
>
> The point of the patch is just to let the FDW author know that there is
> another approach for implementing direct modification (ie,
> PlanDirectModify) just as for implementing row locking.
>

Considering that the primary object of this patch is just to let the FDW author know that there is another approach for implementing direct modification, I like the idea of modifying the document.

> I agree that the documentation about how the FDW author should decide
> between the two would be helpful, but I'd like to leave that for future
> work.
>

I performed a basic test with the patch: a) the patch applies cleanly on the latest source, and b) the documentation builds cleanly. Marking this as ready for committer.

>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Rushabh Lathia
Re: [HACKERS] Showing parallel status in \df+
I agree with the argument in this thread: having "Source code" as part of \df+ is a bit annoying, specifically when the output involves some really big PL language functions. Having it separate does make the \df+ output more readable. So I would vote for \df++ rather than adding the source code as part of the footer for \df+.

Personally, I didn't like the idea of keeping "source code" for C/internal functions as part of \df+ while moving the others out of it. If we really want to move "source code" out of \df+, then it should be consistent - irrespective of language. So maybe remove "source code" completely from \df+ and add \df++ support for the "source code".

On Wed, Sep 7, 2016 at 12:14 AM, Pavel Stehule <pavel.steh...@gmail.com> wrote:
> Hi
>
> 2016-09-06 0:05 GMT+02:00 Tom Lane <t...@sss.pgh.pa.us>:
>
>> I wrote:
>> > Pavel Stehule <pavel.steh...@gmail.com> writes:
>> >> Using footer for this purpose is little bit strange. What about
>> >> following design?
>> >> 1. move out source code of PL functions from \df+
>> >> 2. allow not unique filter in \sf and allow to display multiple
>> >> functions
>>
>> > Wasn't that proposed and rejected upthread?
>>
>> So ... why did you put this patch in "Waiting on Author" state? AFAIK,
>> we had dropped the idea of relying on \sf for this, mainly because
>> Peter complained about \df+ no longer providing source code. I follow
>> his point: if you're used to using \df+ to see source code, you probably
>> can figure it out quickly if that command shows the source in a different
>> place than before. But if it doesn't show it at all, using \sf instead
>> might not occur to you right away.
>>
>
> I see only one situation, when I want to see more then one source code -
> checking overloaded functions. I prefer to see complete source code - in
> \sf format. But I don't remember, when I did it last time. So I can live
> without it well.
>
> I am thinking, there is strong agreement about reduction \dt+ result.
I am > not sure about usability of showing source code in footer. It is not too > much readable - and the fact, so function's body is displayed not as CREATE > statements, does the result less readable. > > Now I am thinking so using footer for this purpose is not too great idea - > maybe we can live better without it (without source code of PL in \dt+ > result, I would to see only C function source there). If you like using > footer, then the format should be changed to be more consistent, readable? > I am not sure, how it can be enhanced. > > Regards > > Pavel > > >> >> regards, tom lane >> > > -- Rushabh Lathia
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
On Fri, Sep 2, 2016 at 1:12 PM, Rahila Syed <rahilasye...@gmail.com> wrote:
> Hello,
>
> Thank you for comments.
>
> >Above test not throwing psql error, as you used strcmp(newval,"ON"). You
> >should use pg_strcasecmp.
> Corrected in the attached.
>
> >Above error coming because in below code block change, you don't have
> >check for command (you should check opt0 for AUTOCOMMIT command)
> Corrected in the attached.
>
> >postgres=# BEGIN;
> >BEGIN
> >postgres=# create table foo ( a int );
> >CREATE TABLE
> >postgres=# \set AUTOCOMMIT ON
> >Don't you think, in above case also we should throw a psql error?
> IMO, in this case BEGIN is explicitly specified by user, so I think it is
> understood that a commit is required for changes to be effective.
> Hence I did not consider this case.
>

The documentation doesn't say these changes are only for an implicit BEGIN:

+
+
+ Autocommit cannot be set on inside a transaction, the ongoing
+ transaction has to be ended by entering COMMIT or
+ ROLLBACK before setting autocommit on.
+
+

In my opinion, it's good to be consistent, whether the BEGIN is implicit or explicitly specified.

> >postgres=# \set AUTOCOMMIT off
> >postgres=# create table foo ( a int );
> >CREATE TABLE
> >postgres=# \set XYZ ON
> >\set: Cannot set XYZ to ON inside a transaction, either COMMIT or
> >ROLLBACK and retry
>
> >May be you would like to move the new code block inside SetVariable(). So
> >that don't need to worry about the validity of the variable names.
>
> I think validating variable names wont be required if we throw error only
> if command is \set AUTOCOMMIT.
> Validation can happen later as in the existing code.
>

It might happen that SetVariable() gets called from other places in the future, so it's good to have all the set-variable related checks inside SetVariable(). Also, you can modify the condition in such a way that we don't end up calling PQtransactionStatus() even when it's not required.
Example (using pg_strcasecmp for the value comparison, as suggested above):

if (!strcmp(opt0, "AUTOCOMMIT") && !pg_strcasecmp(newval, "ON") &&
    !pset.autocommit && PQtransactionStatus(pset.db) == PQTRANS_INTRANS)

> >Basically if I understand correctly, if we are within transaction and
> >someone tries the set the AUTOCOMMIT ON, it should throw a psql error.
> >Correct me if I am missing anything?
>
> Yes the psql_error is thrown when AUTOCOMMIT is turned on inside a
> transaction. But only when there is an implicit BEGIN as in following case,
>
> postgres=# \set AUTOCOMMIT OFF
> postgres=# create table test(i int);
> CREATE TABLE
> postgres=# \set AUTOCOMMIT ON
> \set: Cannot set AUTOCOMMIT to ON inside a transaction, either COMMIT or
> ROLLBACK and retry
> postgres=#
>
> Thank you,
> Rahila Syed
>

Regards,
Rushabh Lathia
www.EnterpriseDB.com
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
I don't think the patch is doing the correct thing:

1)
postgres=# \set AUTOCOMMIT off
postgres=# create table foo (a int );
CREATE TABLE
postgres=# \set AUTOCOMMIT on

The above test is not throwing a psql error, because you used strcmp(newval,"ON"). You should use pg_strcasecmp.

2)
postgres=# \set AUTOCOMMIT OFF
postgres=# create table foo ( a int );
CREATE TABLE
postgres=# \set VERBOSITY ON
\set: Cannot set VERBOSITY to ON inside a transaction, either COMMIT or ROLLBACK and retry

The above error is coming because in the code block change below, you don't have a check for the command (you should check opt0 for the AUTOCOMMIT command):

-if (!SetVariable(pset.vars, opt0, newval))
+if (transaction_status == PQTRANS_INTRANS && !pset.autocommit &&
+    !strcmp(newval,"ON"))
+{
+    psql_error("\\%s: Cannot set %s to %s inside a transaction, either COMMIT or ROLLBACK and retry\n",
+               cmd, opt0, newval);
+    success = false;
+}

3)
postgres=# BEGIN;
BEGIN
postgres=# create table foo ( a int );
CREATE TABLE
postgres=# \set AUTOCOMMIT ON

Don't you think in the above case we should also throw a psql error?

4)
postgres=# \set AUTOCOMMIT off
postgres=# create table foo ( a int );
CREATE TABLE
postgres=# \set XYZ ON
\set: Cannot set XYZ to ON inside a transaction, either COMMIT or ROLLBACK and retry

Maybe you would like to move the new code block inside SetVariable(), so that we don't need to worry about the validity of the variable names.

Basically, if I understand correctly, if we are within a transaction and someone tries to set AUTOCOMMIT ON, it should throw a psql error. Correct me if I am missing anything?

On Thu, Sep 1, 2016 at 4:23 PM, Rahila Syed <rahilasye...@gmail.com> wrote:
> Ok. Please find attached a patch which introduces psql error when
> autocommit is turned on inside a transaction. It also adds relevant
> documentation in psql-ref.sgml. Following is the output.
>
> bash-4.2$ psql -d postgres
> psql (10devel)
> Type "help" for help.
> > postgres=# \set AUTOCOMMIT OFF > postgres=# create table test(i int); > CREATE TABLE > postgres=# \set AUTOCOMMIT ON > \set: Cannot set AUTOCOMMIT to ON inside a transaction, either COMMIT or > ROLLBACK and retry > postgres=# > > > Thank you, > Rahila Syed > > On Wed, Aug 17, 2016 at 6:31 PM, Robert Haas <robertmh...@gmail.com> > wrote: > >> On Tue, Aug 16, 2016 at 5:25 PM, Rahila Syed <rahilasye...@gmail.com> >> wrote: >> >>I think I like the option of having psql issue an error. On the >> >>server side, the transaction would still be open, but the user would >> >>receive a psql error message and the autocommit setting would not be >> >>changed. So the user could type COMMIT or ROLLBACK manually and then >> >>retry changing the value of the setting. >> > >> > Throwing psql error comes out to be most accepted outcome on this >> thread. I >> > agree it is safer than guessing user intention. >> > >> > Although according to the default behaviour of psql, error will abort >> the >> > current transaction and roll back all the previous commands. >> >> A server error would do that, but a psql errror won't. >> >> -- >> Robert Haas >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company >> > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia
Re: [HACKERS] [parallel query] random server crash while running tpc-h query on power2
On Mon, Aug 15, 2016 at 6:02 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Aug 13, 2016 at 4:36 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> > AFAICS, your patch seems to be the right fix for this issue, unless we
> > need the instrumentation information during execution (other than for
> > explain) for some purpose.
>
> Hmm, I disagree. It should be the job of
> ExecParallelRetrieveInstrumentation to allocate its data in the
> correct context, not the responsibility of nodeGather.c to work around
> the fact that it doesn't. The worker instrumentation should be
> allocated in the same context as the regular instrumentation
> information, which I assume is probably the per-query context.
>

I agree, this makes sense. Here is the patch to allocate the worker instrumentation in the same context as the regular instrumentation, which is the per-query context. PFA patch.

--
Rushabh Lathia
www.EnterpriseDB.com

diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 380d743..5aa6f02 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -500,6 +500,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 	int			n;
 	int			ibytes;
 	int			plan_node_id = planstate->plan->plan_node_id;
+	MemoryContext oldcontext;
 
 	/* Find the instumentation for this node. */
 	for (i = 0; i < instrumentation->num_plan_nodes; ++i)
@@ -514,10 +515,19 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 	for (n = 0; n < instrumentation->num_workers; ++n)
 		InstrAggNode(planstate->instrument, &instrument[n]);
 
-	/* Also store the per-worker detail. */
+	/*
+	 * Also store the per-worker detail.
+	 *
+	 * Worker instrumentation should be allocated in the same context as
+	 * the regular instrumentation information, which is the per-query
+	 * context. Switch into per-query memory context.
+	 */
+	oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
 	ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
 	planstate->worker_instrument =
 		palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+	MemoryContextSwitchTo(oldcontext);
+
 	planstate->worker_instrument->num_workers = instrumentation->num_workers;
 	memcpy(&planstate->worker_instrument->instrument, instrument, ibytes);
[HACKERS] [parallel query] random server crash while running tpc-h query on power2
Hi All,

Recently while running tpc-h queries on the postgresql master branch, I noticed random server crashes. Most of the time the crash comes while running tpch query number 3 (but it's very random).

Call stack of the server crash:

(gdb) bt
#0 0x102aa9ac in ExplainNode (planstate=0x1001a7471d8, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1516
#1 0x102aaeb8 in ExplainNode (planstate=0x1001a746b50, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#2 0x102aaeb8 in ExplainNode (planstate=0x1001a7468b8, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#3 0x102aaeb8 in ExplainNode (planstate=0x1001a745e48, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#4 0x102aaeb8 in ExplainNode (planstate=0x1001a745bb0, ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#5 0x102aaeb8 in ExplainNode (planstate=0x1001a745870, ancestors=0x1001a745348, relationship=0x0, plan_name=0x0, es=0x1001a65ad48) at explain.c:1599
#6 0x102a81d8 in ExplainPrintPlan (es=0x1001a65ad48, queryDesc=0x1001a7442a0) at explain.c:599
#7 0x102a7e20 in ExplainOnePlan (plannedstmt=0x1001a744208, into=0x0, es=0x1001a65ad48, queryString=0x1001a6a8c08 "explain (analyze,buffers,verbose) select\n\tl_orderkey,\n\tsum(l_extendedprice * (1 - l_discount)) as revenue,\n\to_orderdate,\n\to_shippriority\nfrom\n\tcustomer,\n\torders,\n\tlineitem\nwhere\n\tc_mktsegment = 'BUILD"..., params=0x0, planduration=0x3fffe17c6488) at explain.c:515
#8 0x102a7968 in ExplainOneQuery (query=0x1001a65ab98, into=0x0, es=0x1001a65ad48, queryString=0x1001a6a8c08 "explain (analyze,buffers,verbose) select\n\tl_orderkey,\n\tsum(l_extendedprice * (1 - l_discount)) as revenue,\n\to_orderdate,\n\to_shippriority\nfrom\n\tcustomer,\n\torders,\n\tlineitem\nwhere\n\tc_mktsegment = 'BUILD"..., params=0x0) at explain.c:357
#9 0x102a74b8 in ExplainQuery (stmt=0x1001a6fe468, queryString=0x1001a6a8c08 "explain (analyze,buffers,verbose) select\n\tl_orderkey,\n\tsum(l_extendedprice * (1 - l_discount)) as revenue,\n\to_orderdate,\n\to_shippriority\nfrom\n\tcustomer,\n\torders,\n\tlineitem\nwhere\n\tc_mktsegment = 'BUILD"..., params=0x0, dest=0x1001a65acb0) at explain.c:245

(gdb) p *w
$2 = {num_workers = 2139062143, instrument = 0x1001af8a8d0}
(gdb) p planstate
$3 = (PlanState *) 0x1001a7471d8
(gdb) p *planstate
$4 = {type = T_HashJoinState, plan = 0x1001a740730, state = 0x1001a745758, instrument = 0x1001a7514a8, worker_instrument = 0x1001af8a8c8, targetlist = 0x1001a747448, qual = 0x0, lefttree = 0x1001a74a0e8, righttree = 0x1001a74f2f8, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0, ps_ResultTupleSlot = 0x1001a750c10, ps_ExprContext = 0x1001a7472f0, ps_ProjInfo = 0x1001a751220, ps_TupFromTlist = 0 '\000'}
(gdb) p *planstate->worker_instrument
$5 = {num_workers = 2139062143, instrument = 0x1001af8a8d0}
(gdb) p n
$6 = 13055
(gdb) p n
$7 = 13055

Here it is clear that worker_instrument is either corrupted or uninitialized, which is why it ends up in a server crash. With a bit more debugging and a look at the git history, I found that the issue started with commit af33039317ddc4a0e38a02e2255c2bf453115fd2.

gather_readnext() calls ExecShutdownGatherWorkers() when nreaders == 0. ExecShutdownGatherWorkers() calls ExecParallelFinish(), which collects the instrumentation before marking ParallelExecutorInfo as finished. ExecParallelRetrieveInstrumentation() does the allocation of planstate->worker_instrument. With commit af33039317 we now call gather_readnext() with the per-tuple context current, so when nreaders == 0, ExecShutdownGatherWorkers() ends up allocating planstate->worker_instrument in the per-tuple context - which is wrong.
Now the fix can be:

1) Avoid calling ExecShutdownGatherWorkers() from gather_readnext() and let ExecEndGather() do that. But with this change, gather_readnext() and gather_getnext() depend on the planstate->reader structure to continue reading tuples. So either we change those conditions to depend on planstate->nreaders, or we just pfree(planstate->reader) in gather_readnext() instead of calling ExecShutdownGatherWorkers().

2) Teach ExecParallelRetrieveInstrumentation() to do the allocation of planstate->worker_instrument in the proper memory context.

Attaching a patch which fixes the issue with approach 1).

Regards.
Rushabh Lathia
www.EnterpriseDB.com

diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 438d1b2..bc56f0e 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -348,7
Re: [HACKERS] ERROR: ORDER/GROUP BY expression not found in targetlist
Thanks Tom. I performed testing with the latest commit and the tests are running fine.

On Tue, Jun 28, 2016 at 8:14 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Rushabh Lathia <rushabh.lat...@gmail.com> writes:
> > SELECT setval('s', max(100)) from tab;
> > ERROR: ORDER/GROUP BY expression not found in targetlist
>
> Fixed, thanks for the report!
>
> regards, tom lane
>

--
Rushabh Lathia