Hi Thomas,

Thank you again for sharing this exploration, and for including Korean in your experiment table. Rather than comment on the patch itself, let me offer a ground-level report on where Korean encoding reality sits in April 2026, because the picture has shifted enough that I think it is worth entering into the record before this thread accumulates momentum on motivations that may no longer fully hold on this side of the region.

UTF-8 has already won in Korea, largely by inertia rather than active choice. Public web statistics put .kr sites at roughly 96% UTF-8 with a small EUC-KR residual of about 4% [1], noticeably higher than the ~1% Shift-JIS residual on .jp [2], but steadily shrinking. The mechanism is mundane: modern Linux distributions default to UTF-8 locales, PostgreSQL's initdb inherits that, and every new cluster is therefore UTF-8 from birth. The remaining legacy installations are not "haven't migrated yet"; they are "have decided not to migrate," which is a different and much slower population.

A clarification that often trips people up: in Korean practice, "EUC-KR" is the label written down and CP949 is what actually moves on the wire. Microsoft's UHC has been the Windows default for decades, and the MIME label has simply stuck. The historical stack goes KS X 1001 (완성형, 2,350 syllables) → EUC-KR → CP949 (11,172 syllables) → UTF-8. PostgreSQL's strict EUC_KR decoder rejects the bytes CP949 adds, which occasionally causes real incidents when Windows-exchanged files are loaded. For any design choice about "Korean legacy support", this matters: what needs supporting is usually CP949, not EUC-KR proper.

Server encoding and client encoding are also routinely split. A common Korean deployment pattern is a PostgreSQL cluster with UTF-8 as server encoding, while legacy Windows / Delphi / C++ / older Java clients connect with client_encoding set to EUC-KR or CP949 and let PostgreSQL transcode at the wire boundary.
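For readers who have not run into the EUC-KR versus CP949 gap, here is a small illustration. It uses Python's bundled codecs as a stand-in for a strict EUC-KR decoder, so it only shows the shape of the problem and says nothing about PostgreSQL's own code paths. The helper name decodes_as and the sample syllable are my choices for the example.

```python
# Illustration only: Python's codecs stand in for a strict EUC-KR decoder.
# This is not PostgreSQL's implementation.

# A Hangul syllable outside the 2,350 precomposed syllables of KS X 1001.
SAMPLE = "똠"


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the bytes decode cleanly under the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# What a Windows client labelled "EUC-KR" would typically put on the wire.
wire_bytes = SAMPLE.encode("cp949")

if __name__ == "__main__":
    print("bytes on the wire:", wire_bytes.hex())
    print("valid as CP949:   ", decodes_as(wire_bytes, "cp949"))
    print("valid as EUC-KR:  ", decodes_as(wire_bytes, "euc_kr"))
```

The same two bytes are accepted under the CP949 label and should be rejected under the EUC-KR label, which is the failure mode behind the load incidents mentioned above.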
Many systems that look like "EUC-KR systems" from the outside are actually UTF-8 storage with an EUC-KR wire. The storage-layer share of legacy is therefore probably smaller still than the 3.8% web figure would suggest.

On the Korean row of your table landing at -16% under UTF-16: that is structural, not noise. Modern Korean writing mandates word-space separation (unlike Chinese and Japanese), has effectively abandoned hanja since the 1990s, and freely interleaves ASCII acronyms (IT, AI, CEO). As a result Korean carries the highest ASCII share among CJK languages, and UTF-16 pays for each ASCII position (one byte → two) in exactly the range where the Hangul savings are meant to come from. Columns without spaces (names, titles, addresses) could approach -33%, but general prose cannot. Those same short columns are, however, exactly where the compression angle I return to further below captures the equivalent saving without a new data type.

Storage pressure, to the extent modern operators feel it at all, has largely migrated to other layers. Memory and disk have both followed exponential price/volume curves, and the CPU cost of text comparison has disappeared inside other costs (network, storage I/O, planning, JIT) to the point of invisibility in profiler output. For OLTP, the 2-vs-3-byte difference on Korean columns does not feel meaningful on modern hardware. For bulk scans where byte counts still do matter, the industry answer has already been columnar + zstd, which routinely reaches 90%+ compression on natural-language text and flattens the CJK-vs-Latin ratio to irrelevance. Embedded and edge are not PostgreSQL's primary target, and archival sits in zstd territory too. The domains that historically motivated "we must narrow CJK storage" have either moved outside the PostgreSQL shape or been absorbed by general-purpose compression.

Meanwhile the cultural arrow points toward more Unicode, not less.
KakaoTalk (which saturates domestic messaging), Naver comments, Instagram captions, and YouTube normalise emoji in everyday prose, while AI-generated Korean text contributes middle dots, em dashes, and curly quotes at a scale that was not present a few years ago. The share of non-EUC-KR content in everyday Korean prose is, informally, rising steadily. Most emoji cost four UTF-8 bytes or more and are unrepresentable in any legacy encoding at all. A partial-coverage alternative looks increasingly awkward against that trend.

Korean upstream feedback on encoding has also been notably quiet despite a very active de-Oracle migration wave in the late 2010s. I suspect this silence is not apathy but absence of a felt problem: most of the community has simply moved on.

I should be careful here. The "Korean side needs narrower CJK storage" argument was genuinely strong around 2010, and I remember when it motivated serious engineering time. It is much weaker in 2026: UTF-8 has won by default, legacy survivors are confined to wire protocols and specific applications, OLTP does not feel the byte cost, and bulk scan is already handled elsewhere.

I raise this not to dismiss the technical work; the patch shows real craft and the exploration is interesting on its own terms. But if the cover-letter motivation rests partly on "this will help East Asian users, including Korea," I wanted you to have a ground-level report: for Korean users specifically, the pressure may no longer be strong enough to justify the complexity described. The calculus may well differ in Japanese or Chinese markets; that is not for me to say.

One broader question, then, that I wanted to put to you: there are three distinct axes on which utf16 could be pursued: as a server character set, as a data type, or as a compression angle. The character-set direction runs straight into the "continuation byte must not look like ASCII" rule, as you already noted, and is therefore effectively closed on PostgreSQL.
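To make that rule concrete for anyone following the thread, here is a small check of whether a non-ASCII character ever encodes to a byte in the ASCII range. It is a sketch in Python rather than anything from the PostgreSQL tree, and the helper name has_ascii_lookalike is hypothetical.

```python
# Sketch: does any non-ASCII character produce a byte below 0x80?
# A server encoding must never do this, because code that scans for raw
# ASCII bytes (quotes, backslashes, NUL) would misfire on such bytes.


def has_ascii_lookalike(text: str, encoding: str) -> bool:
    """True if a non-ASCII character encodes to at least one byte < 0x80."""
    for ch in text:
        if ord(ch) < 0x80:
            continue  # genuine ASCII characters are allowed to be ASCII bytes
        if any(b < 0x80 for b in ch.encode(encoding)):
            return True
    return False


if __name__ == "__main__":
    sample = "한글 테스트"
    print("utf-8     :", has_ascii_lookalike(sample, "utf-8"))
    print("utf-16-le :", has_ascii_lookalike(sample, "utf-16-le"))
    # U+D55C (한) is stored as the bytes 5C D5 in UTF-16LE, and 0x5C is
    # the ASCII backslash.
    print("한 as utf-16-le:", "한".encode("utf-16-le").hex())
```

UTF-8 keeps every byte of a multi-byte character at 0x80 or above, so the check is false for it by construction. UTF-16 has no such property, and that is the whole of the obstacle on the character-set axis.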
The data-type direction is the current patch, which carries substantial catalogue and operator surface, while the storage wins mostly accrue on wider values, where columnar + zstd is already doing the work. What still seems genuinely unaddressed in practice is the short-value regime: word-sized strings such as names, titles, cities, and tags, which fall below the TOAST compression threshold and therefore never see a compressor at all. Would framing this as "a compression method effective on word-sized values" be a more productive angle than either of the other two? The storage outcome could be similar with much less surface area to maintain.

A fair counter on memory, before I go on: disk pressure has clearly migrated elsewhere, but shared_buffers and work_mem remain finite, and compression primarily addresses the disk side. A data-type approach that goes far enough to shrink the in-memory representation, modifying every string function along the way, tends to become a degraded form of a new character set: doing most of the character-set work without the character-set slot in PostgreSQL's encoding machinery, which as above is closed. None of the three axes therefore cleanly solves the in-memory case; for truly memory-bound CJK workloads the honest answer is probably just more RAM.

One concrete instantiation of that compression angle, if Korean capacity specifically is the example that matters: take CP949 (which is what actually circulates under the EUC-KR label) as a compression base and, for any character CP949 cannot represent, spell it inline as a readable textual escape such as \u2603 or U+2603 rather than a binary marker byte.
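A minimal sketch of that idea, to show how little machinery it needs. It is Python and purely illustrative: the names pack and unpack are hypothetical, the \x, \u and \U escape forms are simply what Python's backslashreplace error handler emits, and literal backslashes are doubled so that the escapes stay unambiguous. None of this is a proposed on-disk format.

```python
import re

# Sketch only: CP949 as the byte-level base, with readable escapes for
# anything CP949 cannot represent. Names and escape syntax are assumptions.

# Matches a doubled backslash or one of the escape forms that
# backslashreplace produces: \xNN, \uNNNN, \UNNNNNNNN.
_ESCAPE = re.compile(r"\\(\\|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})")


def pack(text: str) -> bytes:
    """Encode to ASCII + CP949 bytes, escaping unrepresentable characters."""
    doubled = text.replace("\\", "\\\\")  # protect literal backslashes first
    return doubled.encode("cp949", errors="backslashreplace")


def unpack(data: bytes) -> str:
    """Reverse pack(): decode CP949, then expand escapes left to right."""

    def expand(match: re.Match) -> str:
        body = match.group(1)
        if body == "\\":
            return "\\"  # a doubled backslash is one literal backslash
        return chr(int(body[1:], 16))  # drop the x/u/U prefix, parse hex

    return _ESCAPE.sub(expand, data.decode("cp949"))


if __name__ == "__main__":
    original = "김철수 😀 C:\\temp"
    packed = pack(original)
    print(len(original.encode("utf-8")), "UTF-8 bytes ->", len(packed), "packed bytes")
    print(unpack(packed) == original)
```

The packed form is itself valid CP949 text, so existing tools can still read it; the price is that each escaped character costs six or ten bytes instead of three or four.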
Native Korean text then stays at two bytes per Hangul, emoji and modern Unicode remain fully representable (at a modest cost per occurrence), the in-memory representation stays plain UTF-8, and the on-disk byte stream stays entirely within ASCII + CP949: no new marker byte, no collision with existing code paths that scan for raw ASCII bytes. If the source text itself contains sequences that look like the escape syntax (for instance documentation quoting \u-style literals), a simple doubling rule disambiguates them; such cases are vanishingly rare in Korean business data. This targets exactly the short-value regime above, with far less surface than a new data type.

For tighter byte density, one could go further by devising a dedicated binary-level encoding, or by wiring zstd's external dictionary feature into the column-compression path with a pre-trained per-language dictionary, but either of those paths carries its own implementation and operational costs.

Should you nonetheless decide to press on with utf16 as a data type, I am willing to take the patch through a proper review; I have already applied it on top of master and confirmed that the regression tests pass, so the mechanical footing is in place.

[1] https://w3techs.com/technologies/segmentation/tld-kr-/character_encoding
[2] https://w3techs.com/technologies/segmentation/tld-jp-/character_encoding

Best regards,
Henson
