Looks great! Thanks for letting me know where the tests live. I’ll try to get these tests in the official Unicode test suite, too. Should help future implementors.

Thanks,
Diego

> On Jun 3, 2026, at 9:07 PM, Michael Paquier <[email protected]> wrote:
>
> On Mon, Jun 01, 2026 at 11:38:32AM -0700, Diego Frias wrote:
>> In short, TCount actually counts 1 more than the number of T
>> syllables; this is so s % TCount == 0 implies that s has no T
>> syllable (because the 0th place represents the absence of a T
>> syllable), where s is the s-index of a precomposed Hangul
>> character. Anyway, since PostgreSQL recognizes 0x11A7 as a T
>> syllable, the composition algorithm yields a nonsense character when
>> 0x11A7 is put in the T position.
>
> Oops. Yes, including TBASE in the recomposition is incorrect, finding
> your quote here (TBase is set to one less..):
> https://unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59688
>
> The character gets eaten by the normalization. Pas glop.
>
>> Let me know if this patch needs anything else. I can write a test
>> for this, but it looks like the current testing setup in
>> src/common/norm_test.c only runs the Unicode test suite and isn’t
>> built for writing custom tests. If that is something of interest,
>> though, I’m happy to add that to this patch.
>
> We have a set of tests in src/test/regress/sql/unicode.sql that would
> fit nicely with what you want to address here. For this specific
> problem, this would work:
> SELECT normalize(U&'\AC00\11A7', NFC) = U&'\AC00\11A7';
>
> How about adding more normalization check patterns, while on it? I am
> finishing with the attached, all things combined. Diego. what do you
> think?
> --
> Michael
> <0001-Fix-off-by-one-with-NFC-recomposition-for-Hangul-U-1.patch>
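
For reference, the off-by-one discussed in the quoted message can be sketched roughly as below. This is only an illustration under stated assumptions: the helper name compose_lv_t and the macro spellings are hypothetical and are not taken from the PostgreSQL source or the attached patch. The constants are the Hangul values from chapter 3 of the Unicode core specification. The point is that the T range check must exclude TBase itself, since index 0 stands for "no T syllable".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hangul constants from the Unicode core specification, chapter 3. */
#define SBASE  0xAC00u  /* first precomposed Hangul syllable */
#define TBASE  0x11A7u  /* one less than the first trailing consonant, U+11A8 */
#define TCOUNT 28u      /* 27 trailing consonants plus one "no T" slot */
#define SCOUNT 11172u   /* 19 * 21 * 28 precomposed syllables */

/*
 * Hypothetical helper (not PostgreSQL's actual function): try to compose
 * an LV syllable with a trailing consonant. On success, store the LVT
 * syllable in *result and return true; otherwise return false.
 */
static bool
compose_lv_t(uint32_t start, uint32_t code, uint32_t *result)
{
    /* An LV syllable has s-index divisible by TCOUNT (no T part yet). */
    bool is_lv = start >= SBASE && start < SBASE + SCOUNT &&
                 (start - SBASE) % TCOUNT == 0;

    /*
     * Strictly greater than TBASE: accepting code == TBASE would give a
     * T index of 0 and silently drop the second character.
     */
    if (is_lv && code > TBASE && code < TBASE + TCOUNT)
    {
        *result = start + (code - TBASE);
        return true;
    }
    return false;
}
```

With this check, the pair (U+AC00, U+11A8) composes to U+AC01, while (U+AC00, U+11A7) is left uncomposed, which is the behavior the regression query in the quoted message expects.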
