Hi, On 2023-11-10 16:03:16 -0800, Jeff Davis wrote: > An "en_US" user doing: > > CREATE TABLE foo(t TEXT PRIMARY KEY); > > is providing no indication that they want an index tailored to their > locale. Yet we are creating the index with the "en_US" collation and > therefore imposing huge performance costs (something like 2X slower > index build time than the "C" locale), and also huge dependency > versioning risks that could lead to index corruption and/or wrong > results.
I guess you are arguing that the user didn't intend to create an index here? I don't think that is true - users know that pkeys create indexes. If we used C here, users would often need to create a second index on the same column using the actual database collation - I think we'd very commonly end up with complaints that the pkey index doesn't work, because it's been automatically created with a different collation than the column. Also, wouldn't the intent to use a different collation for the column be expressed by changing the column's collation? > Similarly, a user doing: > > SELECT DISTINCT t FROM bar; > > is providing no indication that they care about the collation of "t" > (we are free to choose a HashAgg which makes no ordering guarantee at > all). Yet if we choose Sort+GroupAgg, the Sort will be performed in the > "en_US" locale, which is something like 2X slower than the "C" locale. OTOH, if we are choosing a groupagg, we might be able to implement that using an index, which is more likey to exist in the databases collation. Looks like we even just look for indexes that are in the database collation. Might be worth teaching the planner additional smarts here. > Thoughts? I seriously doubt its a good idea to change which collations primary keys use by default. But I think there's a decent bit of work we could do in the planner, e.g: - Teach the planner to take collation costs into account for costing - right now index scans with "C" cost the same as index scans with more expensive collations. That seems wrong even for equality lookups and would make it hard to make improvements to prefer cheaper collations in other situations. - Teach the planner to use cheaper collations when ordering for reasons other than the user's direct request (e.g. DISTINCT/GROUP BY, merge joins). I think we should also explain in our docs that C can be considerably faster - I couldn't find anything in a quick look. Greetings, Andres Freund