On Wed, Mar 11, 2026 at 2:20 PM Jeff Davis <[email protected]> wrote: > Part of the reason for that is that changing collation is so difficult > that we have very few examples of users moving real workloads from one > collation to another.
Yes. I think actually one of the big challenges right now is making sure that when you initdb to do a pg_upgrade, you get the right settings to make the upgrade work. The extent to which any given proposal makes that better or worse is definitely a point to consider. > Thank you. I have one burning question: for these users who care deeply > about sort order, which scenario best describes their needs? > > (a) they mostly work in a single locale (if so, does it match their > UNIX environment?); or > > (b) one locale (which one?) is good enough for a variety of locales > because even if it's not perfect, it's still better than ASCII; or > > (c) they somehow partition their data by locale and use multiple > locales; or > > (d) they have a variety of indexes on the same column using different > collations to satisfy queries from users in different locales I don't have total information, but I think they mostly use a single locale. If they have extremely specific needs, they are likely to end up with ICU, else they pick a glibc locale. I have no idea how likely that glibc locale is to match their environment. I wouldn't bet on it being the norm, but I wouldn't bet against whatever they have in the environment being more usable than "C". > > And, on a personal level, I have a hard time understanding why anyone > > would be OK with a sort order that puts Álvaro after Zebra instead of > > between Alvaro and Beatriz, because that seems extremely frustrating. > > I tend to agree, and I wish we had a way to handle this at a > "presentation" layer rather than pushing the whole thing down into > indexes (storage layer). > > In theory, pushing collation down to indexes could offer performance > advantages, but in practice humans don't read a lot of data, so a post- > processing step would be efficient in most cases. It's tough if people have range scans. Not everybody does, but they also don't know whether or not they will want them when they're making setup choices. Picking a locale that matches their desired sort order *in case* they end up using range scans in some queries feels like the "safe" coice. > > That's perfectly legitimate, but it's different from my > > experience. My experience is that when I tell people they can use > > collate "C" to speed up sorting, they tell me that's a stupid > > workaround that doesn't give them the answers that they want, which > > obviously colors my viewpoint on this question in the same way that > > your experiences color yours. > > "C" is especially unappealing because it doesn't even get basic case > transformations right outside of ASCII. I completely agree. I dislike it when providers change collation behavior because I don't really believe the narrative that people from a particular political unit have a unified view of how sorting should be done. For example, my native language is English, and I have a view that Á should go between A and B even though Á is not an English character. Google says that's not the normal English sort order and that accented characters should be pushed to the end, but as a native Englsh speaker I find that idea ridiculous and I can't really imagine anyone wanting it. Perhaps my imagination is too limited. Where it gets really subjective is with strings like alvaro, Álvaro, and .-Alvaro. There's no rule that anyone in the United States learns in elementary school that answers that question. We can ask users what they want, but there's not a single right answer that everyone knows, the way everyone knows a < b. So I feel like it would be pretty defensible to have the default be something that is either case-sensitive or not, that either skips nonalphabetic characters or sorts them in code-point order or handles them in literally any other basically sane and understandable way that somebody wants to implement, but something that treats an accented a in the same way it treats a smiley face emoji -- i.e. like a character that nobody knows how to sort -- sits really poorly with me. Am I injecting too much of my personal view into what PostgreSQL's behavior ought to be? Possibly! So maybe the people who are saying that defaulting to C is fine and that making people make an explicit choice if they want something else are right. but my personal guess is that we will make a bunch of people unhappy. I'm not actually super-worried about the people who have very particular ideas about what they want, because there is a decent chance that they're already asking for exactly that thing. We might mess them up a little bit, but they'll figure it out. What I'm most worried about is the population of users -- which I guess to be large -- who do not have a strong preference but won't be happy with something as dumb as "C". If even a small fraction of users create a database using "C" unintentionally and load a terabyte of data into it before realizing that all their text indexes are sorting "wrong", I suspect that's not going to be much fun. Said differently, I don't have an enormous amount of confidence in the environment being a good source of information about what people want, but my experience thus far suggests to that "C" probably isn't what they want, which makes me skeptical of making it the default. Obviously, I could be wildly incorrect. Maybe people will just be super-happy about faster sorting and life will be great. -- Robert Haas EDB: http://www.enterprisedb.com
