Hi,

Per CP949 (Windows-949), a two-byte UHC sequence requires the lead
byte to be in 0x81-0xFE and the trail byte to be in 0x41-0x5A,
0x61-0x7A, or 0x81-0xFE.

pg_uhc_verifychar() in src/common/wchar.c accepts any lead byte
with the high bit set (0x80-0xFF) and any trail byte other than
NUL, without enforcing those ranges.  Out-of-range pairs such as
0x80 0x41 (invalid lead) or 0x81 0x40 (invalid trail) are accepted
by the verifier and rejected only later by the conversion table,
with the message:

  ERROR:  character with byte sequence 0x80 0x41 in encoding "UHC"
          has no equivalent in encoding "UTF8"

This is misleading -- those pairs are not unmappable, they are
structurally invalid in CP949 -- and it is inconsistent with
pg_euckr_verifychar() (src/common/wchar.c:1044), which already
enforces lead/trail byte ranges explicitly via IS_EUC_RANGE_VALID().

The following evidence supports tightening the UHC verifier:

- Microsoft CP949 (Windows-949) specifies the two-byte form as
  lead 0x81-0xFE, trail 0x41-0x5A | 0x61-0x7A | 0x81-0xFE.
  Other byte values are not valid for the two-byte form.

- PostgreSQL's own UHC -> UTF-8 conversion table is already built
  on this assumption.  The radix tree header in
  src/backend/utils/mb/Unicode/uhc_to_utf8.map declares:

      0x81, /* b2_1_lower */
      0xfe, /* b2_1_upper */
      0x41, /* b2_2_lower */
      0xfe, /* b2_2_upper */

  i.e. the conversion side already restricts the byte ranges and
  rejects anything outside them; the verifier lets these pairs
  through, so the rejection happens in the wrong place with the
  wrong message.

- pg_euckr_verifychar() already follows the strict shape: it
  validates lead/trail ranges directly rather than relying on
  pg_uhc_mblen() plus a NUL-only trail check.  Patch 0002 brings
  pg_uhc_verifychar() in line with it.
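
To make the ranges above concrete, here is a minimal standalone
sketch of the CP949 two-byte check as a predicate.  The helper
names (uhc_lead_ok, uhc_trail_ok, uhc_pair_ok) are hypothetical
and are not taken from wchar.c or from the attached patches; only
the byte ranges come from the text above.

```c
#include <stdbool.h>

/* Hypothetical helper: lead byte of a two-byte UHC character. */
static bool
uhc_lead_ok(unsigned char c)
{
    return c >= 0x81 && c <= 0xFE;
}

/* Hypothetical helper: trail byte of a two-byte UHC character. */
static bool
uhc_trail_ok(unsigned char c)
{
    return (c >= 0x41 && c <= 0x5A) ||   /* 'A'..'Z' */
           (c >= 0x61 && c <= 0x7A) ||   /* 'a'..'z' */
           (c >= 0x81 && c <= 0xFE);
}

/* True only when both bytes fall in the ranges quoted above. */
static bool
uhc_pair_ok(unsigned char lead, unsigned char trail)
{
    return uhc_lead_ok(lead) && uhc_trail_ok(trail);
}
```

With this predicate, uhc_pair_ok(0x80, 0x41) and
uhc_pair_ok(0x81, 0x40) are both false, while
uhc_pair_ok(0x81, 0x41) is true.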

This is split into two patches to make the change visible:

0001 -- Add a regression test for UHC.

  UHC is a client-only encoding, so there has been no dedicated
  test for pg_uhc_verifychar().  This adds
  src/test/regress/sql/uhc.sql, exercising the verifier through
  convert_from() in a UTF8 database.  The expected output records
  the *current* behavior on master, so this patch applies cleanly
  and all tests pass without any code change.

0002 -- Tighten pg_uhc_verifychar() to enforce CP949 byte ranges.

  Rewrite pg_uhc_verifychar() to check lead range (0x81-0xFE) and
  trail range (0x41-0x5A, 0x61-0x7A, or 0x81-0xFE) directly,
  following the style of pg_euckr_verifychar().  The new
  trail-range check also subsumes the previous NONUTF8_INVALID
  sentinel check (0x8d 0x20), which is removed -- 0x20 is not in
  any valid trail range, so 0x8d 0x20 is still rejected.
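
  For readers who want the shape without opening the patch, the
  following is a rough sketch of what such a verifier could look
  like.  It is not the code in 0002: uhc_verifychar_sketch is a
  hypothetical name, and the return convention (character length,
  or -1 on failure) and the pass-through of single bytes below
  0x80 are assumptions made for illustration.

```c
/*
 * Illustrative sketch only, not the patched pg_uhc_verifychar().
 * Returns the length in bytes of the character at s (1 or 2),
 * or -1 if the character is invalid or truncated.
 */
static int
uhc_verifychar_sketch(const unsigned char *s, int len)
{
    unsigned char c1;
    unsigned char c2;

    if (len < 1)
        return -1;

    c1 = s[0];

    /* Assumption: bytes without the high bit are single-byte. */
    if (c1 < 0x80)
        return 1;

    /* Lead byte must be in 0x81-0xFE; 0x80 and 0xFF are invalid. */
    if (c1 < 0x81 || c1 > 0xFE)
        return -1;

    /* A valid lead byte needs a trail byte to follow. */
    if (len < 2)
        return -1;

    c2 = s[1];

    /* Trail byte must be in 0x41-0x5A, 0x61-0x7A, or 0x81-0xFE. */
    if (!((c2 >= 0x41 && c2 <= 0x5A) ||
          (c2 >= 0x61 && c2 <= 0x7A) ||
          (c2 >= 0x81 && c2 <= 0xFE)))
        return -1;

    return 2;
}
```

  In this sketch 0x8d 0x20 needs no special case: 0x20 fails the
  trail check, so the pair returns -1 like any other structurally
  invalid sequence.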

  The diff in expected/uhc.out is exactly eight lines, all of the
  form:

      -ERROR:  character with byte sequence 0xXX 0xYY in encoding
      -        "UHC" has no equivalent in encoding "UTF8"
      +ERROR:  invalid byte sequence for encoding "UHC": 0xXX 0xYY

  No other test result changes.  This makes the user-visible
  effect of the fix self-evident:

  - the accept/reject outcome for any input is unchanged;
  - the error message changes from 'has no equivalent in encoding
    "UTF8"' to 'invalid byte sequence for encoding "UHC"' for the
    eight previously misclassified pairs;
  - rejection moves from the conversion step to the verifier,
    which is the appropriate place for a structural check.

Only client-side paths are affected since UHC is not supported as
a server encoding.

This issue was reported by Henson Choi in [1].

[1]
https://postgr.es/m/CAAAe_zBdGXsALm%3DGkUPtPx9MLcjcM5hBg3HZU%2Bnh8gKXSjXJJw%40mail.gmail.com

v1 patches attached.

Regards,
DoGeon Yoo

Attachment: v1-0001-Add-regression-test-for-UHC-encoding-baseline-capture.patch

Attachment: v1-0002-Tighten-pg_uhc_verifychar-to-enforce-CP949-lead-trail.patch