Re: Allow parallel DISTINCT
David Rowley writes: > Thanks. I pushed a fix for that. Yeah, I saw your commit just after complaining. Sorry for the noise. regards, tom lane
Re: Allow parallel DISTINCT
On Mon, 23 Aug 2021 at 01:58, Tom Lane wrote: > > David Rowley writes: > > I've pushed this to master now. > > ... and the buildfarm is pushing back, eg Thanks. I pushed a fix for that. David
Re: Allow parallel DISTINCT
David Rowley writes: > I've pushed this to master now. ... and the buildfarm is pushing back, eg https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly=2021-08-22%2011%3A31%3A45 diff -U3 /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/expected/select_distinct.out /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/results/select_distinct.out --- /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/expected/select_distinct.out 2021-08-22 11:26:22.0 + +++ /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/results/select_distinct.out 2021-08-22 11:38:43.0 + @@ -223,7 +223,7 @@ -> Sort Sort Key: four -> Gather - Workers Planned: 2 + Workers Planned: 5 -> HashAggregate Group Key: four -> Parallel Seq Scan on tenk1 @@ -270,7 +270,7 @@ -> Sort Sort Key: (distinct_func(1)) -> Gather - Workers Planned: 2 + Workers Planned: 5 -> Parallel Seq Scan on tenk1 (6 rows) regards, tom lane
Re: Allow parallel DISTINCT
On Wed, 18 Aug 2021 at 08:50, Zhihong Yu wrote: > The patch is good from my point of view. Thanks for the review. I looked over the patch again and the only thing I adjusted was the order of the RESETs in the regression tests. I left the " if (distinct_rel->pathlist == NIL)" ERROR case check so that the ERROR is raised before we call the FDW function and hook function to add more paths. Part of me thinks it should probably go afterwards, but I didn't want to change the behaviour there. The other part of me thinks that if you can't do distinct by sorting or hashing then there's not much hope for the hook to add any paths either. I've pushed this to master now. David
Re: Allow parallel DISTINCT
On Tue, Aug 17, 2021 at 1:47 PM David Rowley wrote: > On Wed, 18 Aug 2021 at 02:42, Zhihong Yu wrote: > > Since create_partial_distinct_paths() calls > create_final_distinct_paths(), I wonder if numDistinctRows can be passed to > create_final_distinct_paths() so that the latter doesn't need to call > estimate_num_groups(). > > That can't be done. The two calls to estimate_num_groups() are passing > in a different number of input rows. In > create_partial_distinct_paths() the number of rows is the number of > expected input rows from a partial path. In > create_final_distinct_paths() when called to complete the final > distinct step, that's the number of distinct values multiplied by the > number of workers. > > It might be more possible to do something like cache the value of > distinctExprs, but I just don't feel the need. If there are partial > paths in the input_rel then it's most likely that planning time is not > going to dominate much between planning and execution. Also, if we > were to calculate the value of distinctExprs in create_distinct_paths > always, then we might end up calculating it for nothing as > create_final_distinct_paths() does not always need it. I don't feel > the need to clutter up the code by doing any lazy calculating of it > either. > > David > Hi, Thanks for your explanation. The patch is good from my point of view.
Re: Allow parallel DISTINCT
On Wed, 18 Aug 2021 at 02:42, Zhihong Yu wrote: > Since create_partial_distinct_paths() calls create_final_distinct_paths(), I > wonder if numDistinctRows can be passed to create_final_distinct_paths() so > that the latter doesn't need to call estimate_num_groups(). That can't be done. The two calls to estimate_num_groups() are passing in a different number of input rows. In create_partial_distinct_paths() the number of rows is the number of expected input rows from a partial path. In create_final_distinct_paths() when called to complete the final distinct step, that's the number of distinct values multiplied by the number of workers. It might be more possible to do something like cache the value of distinctExprs, but I just don't feel the need. If there are partial paths in the input_rel then it's most likely that planning time is not going to dominate much between planning and execution. Also, if we were to calculate the value of distinctExprs in create_distinct_paths always, then we might end up calculating it for nothing as create_final_distinct_paths() does not always need it. I don't feel the need to clutter up the code by doing any lazy calculating of it either. David
Re: Allow parallel DISTINCT
On Tue, Aug 17, 2021 at 3:59 AM David Rowley wrote: > On Tue, 17 Aug 2021 at 20:07, Zhihong Yu wrote: > > Can you attach updated patch so that we know more detail about the two > new functions; create_final_distinct_paths and > > create_partial_distinct_paths ? > > Must've fallen off in transit :) > > David > Hi, Since create_partial_distinct_paths() calls create_final_distinct_paths(), I wonder if numDistinctRows can be passed to create_final_distinct_paths() so that the latter doesn't need to call estimate_num_groups(). Cheers
Re: Allow parallel DISTINCT
On Tue, 17 Aug 2021 at 20:07, Zhihong Yu wrote: > Can you attach updated patch so that we know more detail about the two new > functions; create_final_distinct_paths and > create_partial_distinct_paths ? Must've fallen off in transit :) David parallel_distinct_v2.patch Description: Binary data
Re: Allow parallel DISTINCT
On Mon, Aug 16, 2021 at 10:07 PM David Rowley wrote: > On Wed, 11 Aug 2021 at 16:51, David Rowley wrote: > > The patch is just some plumbing work to connect all the correct paths > > up to make it work. It's all fairly trivial. > > I looked at this patch again and realise that it could be done a bit > better. For example, the previous version set the distinct_rel's FDW > fields twice, once when making the serial paths and once when > finalizing the partial paths. > > I've now added two new functions; create_final_distinct_paths and > create_partial_distinct_paths. The responsibility of > create_distinct_paths has changed. Instead of it creating the > non-parallel DISTINCT paths, it calls the two new functions and also > takes charge of calling the create_upper_paths_hook for > UPPERREL_DISTINCT plus the FDW GetForeignUpperPaths() call. I think > this is nicer as I'd previously added a new parameter to > create_distinct_paths() so I could tell it not to call the hook as I > didn't want to call that twice on the same relation as it would no > doubt result in some plugin just creating the same paths again. > > I've also changed my mind about the previous choice I'd made not to > call GetForeignUpperPaths for the UPPERREL_PARTIAL_DISTINCT. I now > think that's ok. > > I think this is a fairly trivial patch that just does a bit of wiring > up of paths. Unless anyone has anything to say about it in the next > few days, I'll be looking at it again with intensions to push it. > > David > > > Hi, David: Can you attach updated patch so that we know more detail about the two new functions; create_final_distinct_paths and create_partial_distinct_paths ? Thanks
Re: Allow parallel DISTINCT
On Wed, 11 Aug 2021 at 16:51, David Rowley wrote: > The patch is just some plumbing work to connect all the correct paths > up to make it work. It's all fairly trivial. I looked at this patch again and realise that it could be done a bit better. For example, the previous version set the distinct_rel's FDW fields twice, once when making the serial paths and once when finalizing the partial paths. I've now added two new functions; create_final_distinct_paths and create_partial_distinct_paths. The responsibility of create_distinct_paths has changed. Instead of it creating the non-parallel DISTINCT paths, it calls the two new functions and also takes charge of calling the create_upper_paths_hook for UPPERREL_DISTINCT plus the FDW GetForeignUpperPaths() call. I think this is nicer as I'd previously added a new parameter to create_distinct_paths() so I could tell it not to call the hook as I didn't want to call that twice on the same relation as it would no doubt result in some plugin just creating the same paths again. I've also changed my mind about the previous choice I'd made not to call GetForeignUpperPaths for the UPPERREL_PARTIAL_DISTINCT. I now think that's ok. I think this is a fairly trivial patch that just does a bit of wiring up of paths. Unless anyone has anything to say about it in the next few days, I'll be looking at it again with intensions to push it. David
Allow parallel DISTINCT
Back in March 2016, e06a38965 added support for parallel aggregation. IIRC, because it was fairly late in the release cycle, I dropped parallel DISTINCT to reduce the scope a little. It's been on my list of things to fix since then. I just didn't get around to it until today. The patch is just some plumbing work to connect all the correct paths up to make it work. It's all fairly trivial. I thought about refactoring things a bit more to get rid of the additional calls to grouping_is_sortable() and grouping_is_hashable(), but I just don't think it's worth making the code ugly for. We'll only call them again if we're considering a parallel plan, in which case it's most likely not a trivial query. Those functions are pretty cheap anyway. I understand that there's another patch in the September commitfest that does some stuff with Parallel DISTINCT, but that goes about things a completely different way by creating multiple queues to distribute values by hash. I don't think there's any overlap here. We'd likely want to still have the planner consider both methods if we get that patch sometime. David parallel_distinct.patch Description: Binary data