On Mon, Feb 16, 2026 at 6:07 PM Nico Williams <[email protected]> wrote:
> On Mon, Feb 16, 2026 at 05:35:41PM +1300, Thomas Munro wrote:
> > [...]. UTF-16 is
> > apparently sometimes preferred to save space in other RDBMSs that can
> > do it, but I suppose you could achieve the same size most of the time
> > with a scheme like that. [...]
>
> [Off-topic] I think UTF-16 yielding smaller encodings is a truism. It
> really depends on what language the text is mostly written in, but
> mostly it's a truism that's not true. Anyways, UTF-16 has to go away,
> and the sooner the better.

But when it's true for your language and that's what your database holds, then it's true all the time, and it's not just outliers, we're talking about nearly all of Asia's languages. That's ... a lot of NAND gates being wasted due to arbitrary choices made probably before UTF-8 even existed.

I do agree with you that UTF-16 has turned out to be an odd beast, though, not big enough but also too big. Maybe it's only just right for CJK (or CJ?).

I don't see much chance at all of anyone retro-fitting UTF-16 into PostgreSQL anyway, so I wouldn't worry about that. I could more easily see us figuring out how to drop the requirement for high bits in multi-byte sequence tails so that GB18030 could be used to store two-byte Chinese (while also retaining full access to all of Unicode as it does), and I was basically wondering out loud if Japan might be hiding something like that somewhere and imagining what it might look like.
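
For anyone who wants to put rough numbers on the size argument, here is a minimal Python sketch using only the standard codecs. The sample strings and the helper names (`encoded_sizes`, `has_low_tail_byte`) are made up for illustration and are not anything from PostgreSQL. The second helper just shows why GB18030 runs into the "high bits in tails" rule: some of its multi-byte sequences have a trailing byte below 0x80.

```python
# Illustrative sketch only: the sample strings and helper names are
# assumptions chosen for this example, not taken from PostgreSQL.
ENCODINGS = ("utf-8", "utf-16-le", "gb18030")


def encoded_sizes(text):
    """Return the encoded byte length of text in each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


def has_low_tail_byte(char, encoding="gb18030"):
    """True if any byte after the first in the encoded sequence is < 0x80."""
    encoded = char.encode(encoding)
    return any(b < 0x80 for b in encoded[1:])


SAMPLES = {
    "English": "database",
    "Chinese": "数据库",
    "Japanese": "データベース",
}

for name, text in SAMPLES.items():
    print(name, encoded_sizes(text))

# U+4E02 is one of the GBK-range characters whose second byte falls in
# the ASCII range, which is the case the tail-byte rule excludes.
print(has_low_tail_byte("\u4e02"), has_low_tail_byte("数"))
```

With these particular samples the Chinese string is the same size in UTF-16 and GB18030 and half again as large in UTF-8, while the English string doubles in UTF-16, which is the whole "depends on your language" point in miniature.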
