On Wed, Mar 11, 2026 at 2:20 PM Jeff Davis <[email protected]> wrote:
> Part of the reason for that is that changing collation is so difficult
> that we have very few examples of users moving real workloads from one
> collation to another.

Yes. I think actually one of the big challenges right now is making
sure that when you initdb to do a pg_upgrade, you get the right
settings to make the upgrade work. The extent to which any given
proposal makes that better or worse is definitely a point to consider.

> Thank you. I have one burning question: for these users who care deeply
> about sort order, which scenario best describes their needs?
>
>   (a) they mostly work in a single locale (if so, does it match their
> UNIX environment?); or
>
>   (b) one locale (which one?) is good enough for a variety of locales
> because even if it's not perfect, it's still better than ASCII; or
>
>   (c) they somehow partition their data by locale and use multiple
> locales; or
>
>   (d) they have a variety of indexes on the same column using different
> collations to satisfy queries from users in different locales

I don't have total information, but I think they mostly use a single
locale. If they have extremely specific needs, they are likely to end
up with ICU, else they pick a glibc locale. I have no idea how likely
that glibc locale is to match their environment. I wouldn't bet on it
being the norm, but I wouldn't bet against whatever they have in the
environment being more usable than "C".

> > And, on a personal level, I have a hard time understanding why anyone
> > would be OK with a sort order that puts Álvaro after Zebra instead of
> > between Alvaro and Beatriz, because that seems extremely frustrating.
>
> I tend to agree, and I wish we had a way to handle this at a
> "presentation" layer rather than pushing the whole thing down into
> indexes (storage layer).
>
> In theory, pushing collation down to indexes could offer performance
> advantages, but in practice humans don't read a lot of data, so a post-
> processing step would be efficient in most cases.

It's tough if people have range scans. Not everybody does, but they
also don't know whether or not they will want them when they're making
setup choices. Picking a locale that matches their desired sort order
*in case* they end up using range scans in some queries feels like the
"safe" coice.

> > That's perfectly legitimate, but it's different from my
> > experience. My experience is that when I tell people they can use
> > collate "C" to speed up sorting, they tell me that's a stupid
> > workaround that doesn't give them the answers that they want, which
> > obviously colors my viewpoint on this question in the same way that
> > your experiences color yours.
>
> "C" is especially unappealing because it doesn't even get basic case
> transformations right outside of ASCII.

I completely agree. I dislike it when providers change collation
behavior because I don't really believe the narrative that people from
a particular political unit have a unified view of how sorting should
be done. For example, my native language is English, and I have a view
that Á should go between A and B even though Á is not an English
character. Google says that's not the normal English sort order and
that accented characters should be pushed to the end, but as a native
Englsh speaker I find that idea ridiculous and I can't really imagine
anyone wanting it. Perhaps my imagination is too limited. Where it
gets really subjective is with strings like alvaro, Álvaro, and
.-Alvaro. There's no rule that anyone in the United States learns in
elementary school that answers that question. We can ask users what
they want, but there's not a single right answer that everyone knows,
the way everyone knows a < b. So I feel like it would be pretty
defensible to have the default be something that is either
case-sensitive or not, that either skips nonalphabetic characters or
sorts them in code-point order or handles them in literally any other
basically sane and understandable way that somebody wants to
implement, but something that treats an accented a in the same way it
treats a smiley face emoji -- i.e. like a character that nobody knows
how to sort -- sits really poorly with me. Am I injecting too much of
my personal view into what PostgreSQL's behavior ought to be?
Possibly!

So maybe the people who are saying that defaulting to C is fine and
that making people make an explicit choice if they want something else
are right. but my personal guess is that we will make a bunch of
people unhappy. I'm not actually super-worried about the people who
have very particular ideas about what they want, because there is a
decent chance that they're already asking for exactly that thing. We
might mess them up a little bit, but they'll figure it out. What I'm
most worried about is the population of users -- which I guess to be
large -- who do not have a strong preference but won't be happy with
something as dumb as "C". If even a small fraction of users create a
database using "C" unintentionally and load a terabyte of data into it
before realizing that all their text indexes are sorting "wrong", I
suspect that's not going to be much fun. Said differently, I don't
have an enormous amount of confidence in the environment being a good
source of information about what people want, but my experience thus
far suggests to that "C" probably isn't what they want, which makes me
skeptical of making it the default.

Obviously, I could be wildly incorrect. Maybe people will just be
super-happy about faster sorting and life will be great.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


Reply via email to