Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization

Diego Frias Thu, 04 Jun 2026 09:34:31 -0700

Looks great! Thanks for letting me know where the tests live. I’ll
try to get these tests in the official Unicode test suite, too. Should
help future implementors.


Thanks,
Diego

> On Jun 3, 2026, at 9:07 PM, Michael Paquier <[email protected]> wrote:
> 
> On Mon, Jun 01, 2026 at 11:38:32AM -0700, Diego Frias wrote:
>> In short, TCount actually counts 1 more than the number of T
>> syllables; this is so s % TCount == 0 implies that s has no T
>> syllable (because the 0th place represents the absence of a T
>> syllable), where s is the s-index of a precomposed Hangul
>> character. Anyway, since PostgreSQL recognizes 0x11A7 as a T
>> syllable, the composition algorithm yields a nonsense character when
>> 0x11A7 is put in the T position.
> 
> Oops.  Yes, including TBASE in the recomposition is incorrect, finding
> your quote here (TBase is set to one less..):
> https://unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59688
> 
> The character gets eaten by the normalization.  Not good.
> 
>> Let me know if this patch needs anything else. I can write a test
>> for this, but it looks like the current testing setup in
>> src/common/norm_test.c only runs the Unicode test suite and isn’t
>> built for writing custom tests. If that is something of interest,
>> though, I’m happy to add that to this patch.
> 
> We have a set of tests in src/test/regress/sql/unicode.sql that would
> fit nicely with what you want to address here.  For this specific
> problem, this would work:
> SELECT normalize(U&'\AC00\11A7', NFC) = U&'\AC00\11A7';
> 
> How about adding more normalization check patterns, while on it?  I am
> finishing with the attached, all things combined.  Diego, what do you
> think?
> --
> Michael
> <0001-Fix-off-by-one-with-NFC-recomposition-for-Hangul-U-1.patch>
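
For readers following the thread, the off-by-one discussed above can be
sketched in C roughly as follows. This is a minimal illustration built
from the Hangul constants in the Unicode core specification, not the
PostgreSQL source or the attached patch; the function names
hangul_compose_lv_t and hangul_compose_lv_t_inclusive are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hangul composition constants from the Unicode core specification. */
#define SBASE  0xAC00u
#define TBASE  0x11A7u          /* one less than the first trailing jamo */
#define TCOUNT 28u              /* 27 trailing jamo + 1 slot for "no T" */
#define SCOUNT 11172u

/* True if s is a precomposed syllable with no trailing consonant (LV). */
static bool
is_lv_syllable(uint32_t s)
{
    return s >= SBASE && s < SBASE + SCOUNT && (s - SBASE) % TCOUNT == 0;
}

/*
 * Hypothetical helper: compose an LV syllable with a trailing jamo.
 * The lower bound is exclusive, so TBASE itself (U+11A7) is rejected.
 */
static bool
hangul_compose_lv_t(uint32_t lv, uint32_t t, uint32_t *result)
{
    if (is_lv_syllable(lv) && t > TBASE && t < TBASE + TCOUNT)
    {
        *result = lv + (t - TBASE);
        return true;
    }
    return false;
}

/*
 * Same check with an inclusive lower bound, shown only to illustrate the
 * problem: for t == TBASE the T index is 0, so the "composition" returns
 * lv unchanged and the U+11A7 code point is silently consumed.
 */
static bool
hangul_compose_lv_t_inclusive(uint32_t lv, uint32_t t, uint32_t *result)
{
    if (is_lv_syllable(lv) && t >= TBASE && t < TBASE + TCOUNT)
    {
        *result = lv + (t - TBASE);
        return true;
    }
    return false;
}
```

With the exclusive bound, the pair U+AC00 U+11A7 is left alone, which is
what the regression query quoted above expects, while U+AC00 U+11A8
still composes to U+AC01.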
