On Fri, 2023-11-10 at 17:19 -0800, Andres Freund wrote:
> I guess you are arguing that the user didn't intend to create an
> index here?

No, obviously the user should expect an index when a primary key is
created. But that doesn't mean that it necessarily needs to be ordered
according to the database collation.

Unfortunately, right now the planner doesn't understand that an index
in the "C" locale can satisfy equality searches and constraint
enforcement for "en_US" (or any other deterministic collation). That's
probably the first thing to fix.

Inequalities and ORDER BYs can't benefit from an index with a different
collation, but lots of indexes don't need that.

> Also, wouldn't the intent to use a different collation for the column
> be
> expressed by changing the column's collation?

The column collation expresses the semantics of that column. If the
user has a database collation of "en_US", they should expect ORDER BY
on that column to be according to that locale unless otherwise
specified.

That doesn't imply that indexes must have a matching collation. In fact
we already allow the column and index collations to differ, it just
doesn't work as well as it should.

> 
> OTOH, if we are choosing a groupagg, we might be able to implement
> that using
> an index, which is more likey to exist in the databases collation. 
> Looks like
> we even just look for indexes that are in the database collation.
> 
> Might be worth teaching the planner additional smarts here.

Yeah, we don't need to force anything, we could just create a few paths
with appropriate path key information and cost them.

> 
> - Teach the planner to take collation costs into account for costing

+1. I noticed that GroupAgg+Sort is often in the same ballpark as
HashAgg in runtime when the collation is "C", but HashAgg is way faster
when the collation is something else.

> - Teach the planner to use cheaper collations when ordering for
> reasons other
>   than the user's direct request (e.g. DISTINCT/GROUP BY, merge
> joins).

+1. Where "cheaper" comes from is an interesting question -- is it a
property of the provider or the specific collation? Or do we just call
"C" special?

Regards,
        Jeff Davis



Reply via email to