Re: Why do indexes and sorts use the database collation?

Tomas Vondra Tue, 14 Nov 2023 04:01:36 -0800

On 11/14/23 02:58, Jeff Davis wrote:
> On Mon, 2023-11-13 at 22:36 +0100, Tomas Vondra wrote:
>> Yeah. I don't quite agree with the initial argument that not
>> specifying
>> the collation explicitly in CREATE TABLE or a query means the user
>> does
>> not care about the collation.
> 
> I didn't argue that the user doesn't care about collation -- we need to
> honor the collation semantics of the column. And a column with
> unspecified collation must be the database collation (otherwise what
> would the database collation mean?). But the index's collation is an
> implementation detail that is not necessary to provide the requested
> semantics.
> 
> I'm arguing that pathkeys are often not even useful for providing the
> requested semantics, so why should the user have the pain of poor
> performance and versioning risks for every text index in their system?
> If the user just wants PK/FK constraints, and equality lookups, then an
> index with the "C" collation makes a lot of sense to serve those
> purposes.
> 

Thanks for the clarification. I agree index's collation can be seen as
an implementation detail, as long as it produces the correct results
(with respect to the column's collation). I'm somewhat skeptical about
doing this automatically, because the collations may be equivalent only
for some operations (and we don't know what the user will do).

My concern is we'll decide to alter the index collation, and then the
user will face the consequences. Presumably we'd no generate incorrect
results, but we'd not be able use an index, causing performance issues.

AFAICS this is a trade-off between known benefits (faster equality
searches, which are common for PK columns) vs. unknown downsides
(performance penalty for operations with unknown frequency).

Not sure it's a decision we can make automatically. But it's mostly
achievable manually, if the user specifies COLLATE "C" for the column.
You're right that changes the semantics of the column, but if the user
only does equality searches, that shouldn't be an issue. And if an
ordering is needed after all, it's possible to specify the collation in
the ORDER BY clause.

I realize you propose to do this automatically for everyone, because few
people will realize how much faster can this be. But maybe there's a way
to make this manual approach more convenient? Say, by allowing the PK to
have a different collation (which I don't think is allowed now).

FWIW I wonder what the impact of doing this automatically would be in
practice. I mean, in my experience the number of tables with TEXT (or
types sensitive to collations) primary keys is fairly low, especially
for tables of non-trivial size (where the performance impact might be
measurable).

>> For example, I don't see how we could arbitrarily override the
>> collation
>> for indexes backing primary keys, because how would you know the user
>> will never do a sort on it?
> 
> The column collation and index collation are tracked separately in the
> catalog. The column collation cannot be overridden because it's
> semantically signficant, but there are at least some degrees of freedom
> we have with the index collation.
> 
> I don't think we can completely change the default index collation to
> be "C", but perhaps there could be a database-level option to do so,
> and that would have no effect on semantics at all. If the user notices
> some queries that could benefit from an index with a non-"C" collation,
> they can add/replace an index as they see fit.
> 

True. What about trying to allow a separate collation for the PK
constraint (and the backing index)?

>>  Not that uncommon with natural primary keys,
>> I think (not a great practice, but people do that).
> 
> Natural keys often have an uncorrelated index, and if the index is not
> correlated, it's often not useful ORDER BY.
> 
> When I actually think about schemas and plans I've seen in the wild, I
> struggle to think of many cases that would really benefit from an index
> in a non-"C" collation. The best cases I can think of are where it's
> doing some kind of prefix search. That's not rare, but it's also not so
> common that I'd like to risk index corruption on every index in the
> system by default in case a prefix search is performed.
> 

OK. I personally don't recall any case where I'd see a collation on PK
indexes as a performance issue. Or maybe I just didn't realize it.

But speaking of natural keys, I recall a couple schemas with natural
keys in code/dimension tables, and it's not uncommon to cluster those
slow-moving tables once in a while. I don't know if ORDER BY queries
were very common on those tables, though.

>> Perhaps we could allow the PK index to have a different collation,
>> say
>> by supporting something like this:
>>
>>   ALTER TABLE distributors ADD PRIMARY KEY (dist_id COLLATE "C");
> 
> Yes, I'd like something like that to be supported. We'd have to check
> that, if the collations are different, that both are deterministic.
> 

OK, I think this answers my earlier question. Now that I think about
this, the one confusing thing with this syntax is that it seems to
assign the collation to the constraint, but in reality we want the
constraint to be enforced with the column's collation and the
alternative collation is for the index.

>> And then the planner would just pick the right index, I think.
> 
> Right now the planner doesn't seem to understand that an index in the
> "C" collation works just fine for answering equality queries. That
> should be fixed.
> 
>> If the
>> user cares about ordering, they'll specify ORDER BY with either an
>> explicit or the default collation. If the index collation matches, it
>> may be useful for the ordering.
> 
> Exactly.
> 
>> Of course, if we feel entitled to create the primary key index with a
>> collation of our choosing, that'd make this unpredictable.
> 
> I wouldn't describe it as "unpredictable". We'd have some defined way
> of defaulting the collation of an index which might be affected by a
> database option or something. In any case, it would be visible with \d.
> 

Perhaps "unpredictable" was not the right word. What I meant to express
is that it happens in the background, possibly confusing for the user.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Why do indexes and sorts use the database collation?

Reply via email to