Re: [HACKERS] Removing Functionally Dependent GROUP BY Columns

Julien Rouhaud Sat, 23 Jan 2016 03:58:31 -0800

On 23/01/2016 10:44, David Rowley wrote:
> On 23 January 2016 at 12:44, Tom Lane <[email protected]> wrote:
>> I spent some time looking through this but decided that it needs some work
>> to be committable.
>>
>> The main thing I didn't like was that identify_key_vars seems to have a
>> rather poorly chosen API: it mixes up identifying a rel's pkey with
>> sorting through the GROUP BY list.  I think it would be better to create
>> a function that, given a relid, hands back the OID of the pkey constraint
>> and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's
>> no pkey or it's deferrable).  The column numbers should be offset by
>> FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be
>> represented correctly --- compare RelationGetIndexAttrBitmap().
> 
> That seems like a much better design to me too. I've made changes and
> attached the updated patch.
> 
>> The reason this seems like a more attractive solution is that the output
>> of the pg_constraint lookup becomes independent of the particular query
>> and thus is potentially cacheable (in the relcache, say).  I do not think
>> we need to cache it right now but I'd like to have the option to do so.
> 
> I've not touched that yet, but good idea.
> 
>> (I wonder BTW whether check_functional_grouping couldn't be refactored
>> to use the output of such a function, too.)
> 
> I've attached a separate patch for that too. It applies after the
> prunegroupby patch.
>


This refactoring makes the code much more better, +1 for me.

I also reviewed it, it looks fine. However, the following removed
comment could be kept for clarity:

-           /* The PK is a subset of grouping_columns, so we win */



>> Some lesser points:
>>
>> * I did not like where you plugged in the call to
>> remove_useless_groupby_columns; there are weird interactions with grouping
>> sets as to whether it will get called or not, and also that whole chunk
>> of code is due for refactoring.  I'd be inclined to call it from
>> subquery_planner instead, maybe near where the HAVING clause preprocessing
>> happens.
> 
> I've moved the call to subquery_planner()
> 
>> * You've got it iterating over every RTE, even those that aren't
>> RTE_RELATION RTEs.  It's pure luck that identify_key_vars doesn't fail
>> outright when passed a bogus relid, and it's certainly wasteful.
> 
> :-( I've fixed that now.
> 
>> * Both of the loops iterating over the groupClause neglect to check
>> varlevelsup, thus leading to assert failures or worse if an outer-level
>> Var is present in the GROUP BY list.  (I'm pretty sure outer Vars can
>> still be present at this point, though I might be wrong.)
> 
> Fixed in the first loop, and the way I've rewritten the code to use
> bms_difference, there's no need to check again in the 2nd loop.
> 
>> * I'd be inclined to see if the relvarlists couldn't be converted into
>> bitmapsets too.  Then the test to see if the pkey is a subset of the
>> grouping columns becomes a simple and cheap bms_is_subset test.  You don't
>> really need surplusvars either, because you can instead test Vars for
>> membership in the pkey bitmapsets that pg_constraint.c will be handing
>> back.
> 
> I've changed to use a bitmapset now, but I've kept surplusvars, having
> this allows a single pass over the group by clause to remove the
> surplus Vars, rather than doing it again and again for each relation.
> I've also found a better way so that array is only allocated if some
> surplus Vars are found. I understand what you mean, and yes, it is
> possible to get rid of it, but I'd need to still add something else to
> mark that this rel's extra Vars are okay to be removed. I could do
> that by adding another bitmapset and marking the relid in that, but I
> quite like how I've changed it now as it saves having to check
> varlevelsup again on the Vars in the GROUP BY clause, as I just use
> bms_difference() on the originally found Vars subtracting off the PK
> vars, and assume all of those are surplus.
> 

I wonder if in remove_useless_groupby_columns(), in the foreach loop you
could change the

+       if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+       {

by something like


+       if (bms_num_members(relattnos) <= bms_num_members(pkattnos))
+           continue;
+
+       if (bms_is_subset(pkattnos, relattnos))
+       {


which may be cheaper.

Otherwise, I couldn't find any issue with this new version of the patch.

>> * What you did to join.sql/join.out seems a bit bizarre.  The existing
>> test case that you modified was meant to test something else, and
>> conflating this behavior with the pre-existing one (and not documenting
>> it) is confusing as can be.  A bit more work on regression test cases
>> seems indicated.
> 
> The history behind that is that at one point during developing the
> patch that test had started failing due to the group by item being
> removed therefore allowing the join removal conditions to be met. On
> testing again with the old test query I see this no longer happens, so
> I've removed the change, although the expected output still differs
> due to the group by item being removed.
> 
>> I'm going to set this back to "Waiting on Author".  I think the basic
>> idea is good but it needs a rewrite.
> 
> Thanks for the review and the good ideas.
> 


-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Removing Functionally Dependent GROUP BY Columns

Reply via email to