Tighten pg_uhc_verifychar() to enforce CP949 lead/trail byte ranges

유도건 Thu, 04 Jun 2026 19:20:59 -0700

Hi,

Per CP949 (Windows-949), a two-byte UHC sequence requires the lead
byte to be in 0x81-0xFE and the trail byte to be in 0x41-0x5A,
0x61-0x7A, or 0x81-0xFE.
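
As a minimal sketch of those ranges (not code from the patch; the
helper names uhc_lead_ok() and uhc_trail_ok() are made up for
illustration), the two checks could be written as:

```c
#include <stdbool.h>

/*
 * Hypothetical helpers, for illustration only; these are not
 * PostgreSQL functions and are not taken from the attached patches.
 */

/* Lead byte of a two-byte UHC (CP949) sequence: 0x81-0xFE. */
static bool
uhc_lead_ok(unsigned char c)
{
	return c >= 0x81 && c <= 0xFE;
}

/* Trail byte: 0x41-0x5A, 0x61-0x7A, or 0x81-0xFE. */
static bool
uhc_trail_ok(unsigned char c)
{
	return (c >= 0x41 && c <= 0x5A) ||
		(c >= 0x61 && c <= 0x7A) ||
		(c >= 0x81 && c <= 0xFE);
}
```

With these, 0x80 fails the lead check and 0x40 fails the trail
check, matching the two example pairs discussed in this mail.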

pg_uhc_verifychar() in src/common/wchar.c accepts any lead byte
with the high bit set (0x80-0xFF) and any trail byte other than
NUL, without enforcing those ranges. Out-of-range pairs such as
0x80 0x41 (invalid lead) or 0x81 0x40 (invalid trail) are accepted
by the verifier and rejected only later by the conversion table,
with the message:

ERROR: character with byte sequence 0x80 0x41 in encoding "UHC"
has no equivalent in encoding "UTF8"

This is misleading: those pairs are not merely unmappable, they are
structurally invalid in CP949. It is also inconsistent with
pg_euckr_verifychar() (src/common/wchar.c:1044), which already
enforces lead/trail byte ranges explicitly via IS_EUC_RANGE_VALID().

The following evidence supports tightening the UHC verifier:

- Microsoft CP949 (Windows-949) specifies the two-byte form as
lead 0x81-0xFE, trail 0x41-0x5A | 0x61-0x7A | 0x81-0xFE.
Other byte values are not valid for the two-byte form.

- PostgreSQL's own UHC -> UTF-8 conversion table is already built
on this assumption. The radix tree header in
src/backend/utils/mb/Unicode/uhc_to_utf8.map declares:

0x81, /* b2_1_lower */
0xfe, /* b2_1_upper */
0x41, /* b2_2_lower */
0xfe, /* b2_2_upper */

i.e. the conversion side already restricts the byte ranges and
rejects anything outside them; today the rejection simply happens
in the wrong place (the conversion step rather than the verifier)
and with the wrong message.

- pg_euckr_verifychar() already follows the strict shape: it
validates lead/trail ranges directly rather than relying on
pg_uhc_mblen() + a NUL-only trail check. This patch brings
pg_uhc_verifychar() in line with it.
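
One detail about the header bounds quoted above: b2_2_lower and
b2_2_upper describe a single contiguous trail range, which is
coarser than the three CP949 trail ranges. The sketch below, which
is illustration only and uses made-up names, counts the trail
values that fall inside the header bounds but outside the CP949
ranges. I assume such bytes reach the table lookup and fail there
as unmapped.

```c
#include <stdbool.h>

/* Trail bounds copied from the map header quoted above. */
#define B2_2_LOWER 0x41
#define B2_2_UPPER 0xfe

/* Hypothetical helper: CP949 trail ranges as stated in this mail. */
static bool
cp949_trail_ok(int c)
{
	return (c >= 0x41 && c <= 0x5A) ||
		(c >= 0x61 && c <= 0x7A) ||
		(c >= 0x81 && c <= 0xFE);
}

/* Count trail values inside the header bounds but not valid in CP949. */
static int
count_bounds_only_trail_bytes(void)
{
	int			n = 0;

	for (int c = B2_2_LOWER; c <= B2_2_UPPER; c++)
	{
		if (!cp949_trail_ok(c))
			n++;
	}
	return n;
}
```

The bytes counted are 0x5B-0x60 and 0x7B-0x80, twelve values in
total.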

This is split into two patches to make the change visible:

0001 -- Add a regression test for UHC.

UHC is a client-only encoding, so there has been no dedicated
test for pg_uhc_verifychar(). This adds
src/test/regress/sql/uhc.sql, exercising the verifier through
convert_from() in a UTF8 database. The expected output records
the *current* behavior on master, so this patch applies cleanly
and all tests pass without any code change.

0002 -- Tighten pg_uhc_verifychar() to enforce CP949 byte ranges.

Rewrite pg_uhc_verifychar() to check lead range (0x81-0xFE) and
trail range (0x41-0x5A, 0x61-0x7A, or 0x81-0xFE) directly,
following the style of pg_euckr_verifychar(). The new
trail-range check also subsumes the previous NONUTF8_INVALID
sentinel check (0x8d 0x20), which is removed -- 0x20 is not in
any valid trail range, so 0x8d 0x20 is still rejected.
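
To show the shape of the check without opening the attachment,
here is a simplified standalone sketch. It is not the code in 0002:
the function name is hypothetical, and I am assuming the usual
verifychar convention of returning the character length on success
and -1 on failure.

```c
/*
 * Simplified sketch of a strict UHC verifier; illustration only,
 * not the function from the attached patch.
 *
 * Returns the byte length of the character at s (1 or 2), or -1
 * if the bytes do not form a structurally valid UHC character.
 */
static int
uhc_verifychar_sketch(const unsigned char *s, int len)
{
	unsigned char c1;
	unsigned char c2;

	if (len < 1)
		return -1;

	c1 = s[0];
	if (c1 < 0x80)
		return 1;				/* single-byte ASCII */

	/* Lead byte must be 0x81-0xFE; 0x80 and 0xFF are invalid. */
	if (c1 < 0x81 || c1 > 0xFE)
		return -1;

	/* A lead byte needs a trail byte. */
	if (len < 2)
		return -1;

	/* Trail byte must be 0x41-0x5A, 0x61-0x7A, or 0x81-0xFE. */
	c2 = s[1];
	if ((c2 >= 0x41 && c2 <= 0x5A) ||
		(c2 >= 0x61 && c2 <= 0x7A) ||
		(c2 >= 0x81 && c2 <= 0xFE))
		return 2;

	return -1;
}
```

Because 0x20 is outside all three trail ranges, the pair 0x8d 0x20
falls through to the final return -1 with no special case.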

The diff in expected/uhc.out is exactly eight lines, all of the
form:

-ERROR: character with byte sequence 0xXX 0xYY in encoding
- "UHC" has no equivalent in encoding "UTF8"
+ERROR: invalid byte sequence for encoding "UHC": 0xXX 0xYY

No other test result changes. This makes the user-visible
effect of the fix self-evident:

- the accept/reject outcome for any input is unchanged;
- the error message format changes from "has no equivalent in
encoding UTF8" to "invalid byte sequence for encoding UHC"
for the eight previously misclassified pairs;
- rejection moves from the conversion step to the verifier,
which is the appropriate place for a structural check.

Only client-side paths are affected since UHC is not supported as
a server encoding.

This issue was reported by Henson Choi in [1].

[1]
https://postgr.es/m/CAAAe_zBdGXsALm%3DGkUPtPx9MLcjcM5hBg3HZU%2Bnh8gKXSjXJJw%40mail.gmail.com

v1 patches attached.

Regards,
DoGeon Yoo

v1-0001-Add-regression-test-for-UHC-encoding-baseline-capture.patch
Description: Binary data

v1-0002-Tighten-pg_uhc_verifychar-to-enforce-CP949-lead-trail.patch
Description: Binary data
