An "en_US" user doing: CREATE TABLE foo(t TEXT PRIMARY KEY);
is providing no indication that they want an index tailored to their locale. Yet we are creating the index with the "en_US" collation and therefore imposing huge performance costs (something like 2X slower index build time than the "C" locale), and also huge dependency versioning risks that could lead to index corruption and/or wrong results. Similarly, a user doing: SELECT DISTINCT t FROM bar; is providing no indication that they care about the collation of "t" (we are free to choose a HashAgg which makes no ordering guarantee at all). Yet if we choose Sort+GroupAgg, the Sort will be performed in the "en_US" locale, which is something like 2X slower than the "C" locale. One of the strongest arguments for using a non-C collation in these cases is the chance to use a non-deterministic collation, like a case- insensitive one. But the database collation is always deterministic, and all deterministic collations have exactly the same definition of equality, so there's no reason not to use "C". Another argument is that, if the column is the database collation and the index is "C", then the index is unusable for text range scans, etc. But there are two ways to solve that problem: 1. Set the column collation to "C"; or 2. Set the index collation to the database collation. Range scans are often most useful when the text is not actually natural language, but instead is some kind of formatted text representing another type of thing, often in ASCII. In that case, the range scan is really some kind of prefix search or partitioning, and the "C" locale is probably the right thing to use, and #1 wins. Granted, there are reasons to want an index to have a particular collation, in which case it makes sense to opt-in to #2. But in the common case, the high performance costs and dependency versioning risks aren't worth it. Thoughts? Regards, Jeff Davis