Re: Questionable description about character sets

Thomas Munro Mon, 16 Feb 2026 18:39:01 -0800

On Mon, Feb 16, 2026 at 6:07 PM Nico Williams <[email protected]> wrote:
> On Mon, Feb 16, 2026 at 05:35:41PM +1300, Thomas Munro wrote:
> >                                              [...].  UTF-16 is
> > apparently sometimes preferred to save space in other RDBMSs that can
> > do it, but I suppose you could achieve the same size most of the time
> > with a scheme like that.  [...]
>
> [Off-topic] I think UTF-16 yielding smaller encodings is a truism.  It
> really depends on what language the text is mostly written in, but
> mostly it's a truism that's not true.  Anyways, UTF-16 has to go away,
> and the sooner the better.


But when it's true for your language and that's what your database
holds, then it's true all the time.  And it's not just outliers: we're
talking about nearly all of Asia's languages.  That's ... a lot of
NAND gates being wasted due to arbitrary choices made probably before
UTF-8 even existed.
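
To put rough numbers on that, here is a minimal sketch that compares
encoded sizes using Python's built-in codecs.  The sample strings and
the encoded_sizes() helper are made up for illustration, and the
little-endian UTF-16 variant is used so that no byte order mark is
counted.  Real data mixes in ASCII (digits, markup, whitespace), so
actual ratios will vary.

```python
# Hypothetical sample strings, chosen only for illustration.
samples = {
    "English": "character set",
    "Japanese": "文字コード",
    "Chinese": "字符编码",
    "Korean": "문자 집합",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM counted."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for language, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{language}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

Characters in the Basic Multilingual Plane above U+07FF take three
bytes in UTF-8 and two in UTF-16, which is where the difference for
CJK text comes from; ASCII goes the other way, one byte versus two.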

I do agree with you that UTF-16 has turned out to be an odd beast,
though, not big enough but also too big.  Maybe it's only just right
for CJK (or CJ?).  I don't see much chance at all of anyone
retrofitting UTF-16 into PostgreSQL anyway, so I wouldn't worry about
that.  I could more easily see us figuring out how to drop the
requirement for high bits in multi-byte sequence tails so that GB18030
could be used to store two-byte Chinese (while also retaining full
access to all of Unicode as it does), and I was basically wondering
out loud if Japan might be hiding something like that somewhere and
imagining what it might look like.
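
For anyone who wants to see the tail-byte issue concretely, here is a
small sketch using Python's built-in gb18030 codec.  It scans the
basic CJK Unified Ideographs range (U+4E00..U+9FA5, picked here just
as a convenient sample) and counts how many two-byte encodings have a
second byte without the high bit set.  The helper name
tail_has_high_bit() is made up for illustration; it is not anything
from PostgreSQL.

```python
def tail_has_high_bit(ch):
    """True if every byte after the first has its high bit set."""
    encoded = ch.encode("gb18030")
    return all(b & 0x80 for b in encoded[1:])


# Sample range: the basic CJK Unified Ideographs block.
ideographs = [chr(cp) for cp in range(0x4E00, 0x9FA6)]

# Keep only the characters that GB18030 encodes in two bytes.
two_byte = [c for c in ideographs if len(c.encode("gb18030")) == 2]

# Of those, find the ones whose tail byte is below 0x80.
low_tail = [c for c in two_byte if not tail_has_high_bit(c)]

print(f"ideographs scanned:      {len(ideographs)}")
print(f"encoded in two bytes:    {len(two_byte)}")
print(f"tail byte without 0x80:  {len(low_tail)}")

# The same codec still reaches the rest of Unicode via four-byte forms.
print(len("\U0001F600".encode("gb18030")), "bytes for U+1F600")
```

Those low tail bytes are the ones that run into the requirement
mentioned above, since they fall in the ASCII range.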
