Re: Allow parallel DISTINCT

2021-08-22 Thread Tom Lane
David Rowley  writes:
> Thanks.  I pushed a fix for that.

Yeah, I saw your commit just after complaining.  Sorry for the noise.

regards, tom lane




Re: Allow parallel DISTINCT

2021-08-22 Thread David Rowley
On Mon, 23 Aug 2021 at 01:58, Tom Lane  wrote:
>
> David Rowley  writes:
> > I've pushed this to master now.
>
> ... and the buildfarm is pushing back, eg

Thanks.  I pushed a fix for that.

David




Re: Allow parallel DISTINCT

2021-08-22 Thread Tom Lane
David Rowley  writes:
> I've pushed this to master now.

... and the buildfarm is pushing back, eg

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly=2021-08-22%2011%3A31%3A45

diff -U3 
/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/expected/select_distinct.out
 
/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/results/select_distinct.out
--- 
/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/expected/select_distinct.out
2021-08-22 11:26:22.0 +
+++ 
/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/results/select_distinct.out
 2021-08-22 11:38:43.0 +
@@ -223,7 +223,7 @@
->  Sort
  Sort Key: four
  ->  Gather
-   Workers Planned: 2
+   Workers Planned: 5
->  HashAggregate
  Group Key: four
  ->  Parallel Seq Scan on tenk1
@@ -270,7 +270,7 @@
->  Sort
  Sort Key: (distinct_func(1))
  ->  Gather
-   Workers Planned: 2
+   Workers Planned: 5
->  Parallel Seq Scan on tenk1
 (6 rows)
 

regards, tom lane




Re: Allow parallel DISTINCT

2021-08-22 Thread David Rowley
On Wed, 18 Aug 2021 at 08:50, Zhihong Yu  wrote:
> The patch is good from my point of view.

Thanks for the review.  I looked over the patch again and the only
thing I adjusted was the order of the RESETs in the regression tests.

I left the " if (distinct_rel->pathlist == NIL)" ERROR case check so
that the ERROR is raised before we call the FDW function and hook
function to add more paths.  Part of me thinks it should probably go
afterwards, but I didn't want to change the behaviour there.   The
other part of me thinks that if you can't do distinct by sorting or
hashing then there's not much hope for the hook to add any paths
either.

I've pushed this to master now.

David




Re: Allow parallel DISTINCT

2021-08-17 Thread Zhihong Yu
On Tue, Aug 17, 2021 at 1:47 PM David Rowley  wrote:

> On Wed, 18 Aug 2021 at 02:42, Zhihong Yu  wrote:
> > Since create_partial_distinct_paths() calls
> create_final_distinct_paths(), I wonder if numDistinctRows can be passed to
> create_final_distinct_paths() so that the latter doesn't need to call
> estimate_num_groups().
>
> That can't be done. The two calls to estimate_num_groups() are passing
> in a different number of input rows.  In
> create_partial_distinct_paths() the number of rows is the number of
> expected input rows from a partial path.  In
> create_final_distinct_paths() when called to complete the final
> distinct step, that's the number of distinct values multiplied by the
> number of workers.
>
> It might be more possible to do something like cache the value of
> distinctExprs, but I just don't feel the need.  If there are partial
> paths in the input_rel then it's most likely that planning time is not
> going to dominate much between planning and execution. Also, if we
> were to calculate the value of distinctExprs in create_distinct_paths
> always, then we might end up calculating it for nothing as
> create_final_distinct_paths() does not always need it. I don't feel
> the need to clutter up the code by doing any lazy calculating of it
> either.
>
> David
>
Hi,
Thanks for your explanation.

The patch is good from my point of view.


Re: Allow parallel DISTINCT

2021-08-17 Thread David Rowley
On Wed, 18 Aug 2021 at 02:42, Zhihong Yu  wrote:
> Since create_partial_distinct_paths() calls create_final_distinct_paths(), I 
> wonder if numDistinctRows can be passed to create_final_distinct_paths() so 
> that the latter doesn't need to call estimate_num_groups().

That can't be done. The two calls to estimate_num_groups() are passing
in a different number of input rows.  In
create_partial_distinct_paths() the number of rows is the number of
expected input rows from a partial path.  In
create_final_distinct_paths() when called to complete the final
distinct step, that's the number of distinct values multiplied by the
number of workers.

It might be more possible to do something like cache the value of
distinctExprs, but I just don't feel the need.  If there are partial
paths in the input_rel then it's most likely that planning time is not
going to dominate much between planning and execution. Also, if we
were to calculate the value of distinctExprs in create_distinct_paths
always, then we might end up calculating it for nothing as
create_final_distinct_paths() does not always need it. I don't feel
the need to clutter up the code by doing any lazy calculating of it
either.

David




Re: Allow parallel DISTINCT

2021-08-17 Thread Zhihong Yu
On Tue, Aug 17, 2021 at 3:59 AM David Rowley  wrote:

> On Tue, 17 Aug 2021 at 20:07, Zhihong Yu  wrote:
> > Can you attach updated patch so that we know more detail about the two
> new functions; create_final_distinct_paths and
> > create_partial_distinct_paths ?
>
> Must've fallen off in transit :)
>
> David
>
Hi,
Since create_partial_distinct_paths() calls create_final_distinct_paths(),
I wonder if numDistinctRows can be passed to create_final_distinct_paths()
so that the latter doesn't need to call estimate_num_groups().

Cheers


Re: Allow parallel DISTINCT

2021-08-17 Thread David Rowley
On Tue, 17 Aug 2021 at 20:07, Zhihong Yu  wrote:
> Can you attach updated patch so that we know more detail about the two new 
> functions; create_final_distinct_paths and
> create_partial_distinct_paths ?

Must've fallen off in transit :)

David


parallel_distinct_v2.patch
Description: Binary data


Re: Allow parallel DISTINCT

2021-08-17 Thread Zhihong Yu
On Mon, Aug 16, 2021 at 10:07 PM David Rowley  wrote:

> On Wed, 11 Aug 2021 at 16:51, David Rowley  wrote:
> > The patch is just some plumbing work to connect all the correct paths
> > up to make it work. It's all fairly trivial.
>
> I looked at this patch again and realise that it could be done a bit
> better. For example, the previous version set the distinct_rel's FDW
> fields twice, once when making the serial paths and once when
> finalizing the partial paths.
>
> I've now added two new functions; create_final_distinct_paths and
> create_partial_distinct_paths.  The responsibility of
> create_distinct_paths has changed. Instead of it creating the
> non-parallel DISTINCT paths, it calls the two new functions and also
> takes charge of calling the create_upper_paths_hook for
> UPPERREL_DISTINCT plus the FDW GetForeignUpperPaths() call.   I think
> this is nicer as I'd previously added a new parameter to
> create_distinct_paths() so I could tell it not to call the hook as I
> didn't want to call that twice on the same relation as it would no
> doubt result in some plugin just creating the same paths again.
>
> I've also changed my mind about the previous choice I'd made not to
> call GetForeignUpperPaths for the UPPERREL_PARTIAL_DISTINCT.  I now
> think that's ok.
>
> I think this is a fairly trivial patch that just does a bit of wiring
> up of paths.  Unless anyone has anything to say about it in the next
> few days, I'll be looking at it again with intensions to push it.
>
> David
>
>
> Hi, David:
Can you attach updated patch so that we know more detail about the two new
functions; create_final_distinct_paths and
create_partial_distinct_paths ?

Thanks


Re: Allow parallel DISTINCT

2021-08-16 Thread David Rowley
On Wed, 11 Aug 2021 at 16:51, David Rowley  wrote:
> The patch is just some plumbing work to connect all the correct paths
> up to make it work. It's all fairly trivial.

I looked at this patch again and realise that it could be done a bit
better. For example, the previous version set the distinct_rel's FDW
fields twice, once when making the serial paths and once when
finalizing the partial paths.

I've now added two new functions; create_final_distinct_paths and
create_partial_distinct_paths.  The responsibility of
create_distinct_paths has changed. Instead of it creating the
non-parallel DISTINCT paths, it calls the two new functions and also
takes charge of calling the create_upper_paths_hook for
UPPERREL_DISTINCT plus the FDW GetForeignUpperPaths() call.   I think
this is nicer as I'd previously added a new parameter to
create_distinct_paths() so I could tell it not to call the hook as I
didn't want to call that twice on the same relation as it would no
doubt result in some plugin just creating the same paths again.

I've also changed my mind about the previous choice I'd made not to
call GetForeignUpperPaths for the UPPERREL_PARTIAL_DISTINCT.  I now
think that's ok.

I think this is a fairly trivial patch that just does a bit of wiring
up of paths.  Unless anyone has anything to say about it in the next
few days, I'll be looking at it again with intensions to push it.

David




Allow parallel DISTINCT

2021-08-10 Thread David Rowley
Back in March 2016, e06a38965 added support for parallel aggregation.
IIRC, because it was fairly late in the release cycle, I dropped
parallel DISTINCT to reduce the scope a little. It's been on my list
of things to fix since then. I just didn't get around to it until
today.

The patch is just some plumbing work to connect all the correct paths
up to make it work. It's all fairly trivial.

I thought about refactoring things a bit more to get rid of the
additional calls to grouping_is_sortable() and grouping_is_hashable(),
but I just don't think it's worth making the code ugly for.  We'll
only call them again if we're considering a parallel plan, in which
case it's most likely not a trivial query.  Those functions are pretty
cheap anyway.

I understand that there's another patch in the September commitfest
that does some stuff with Parallel DISTINCT, but that goes about
things a completely different way by creating multiple queues to
distribute values by hash.  I don't think there's any overlap here.
We'd likely want to still have the planner consider both methods if we
get that patch sometime.

David


parallel_distinct.patch
Description: Binary data