Hi,

Per KS X 2901 (formerly KS C 5861-1992), EUC-KR designates only G0
(ASCII) and G1 (KS X 1001).  G2 and G3 are not designated; the
single-shift codes SS2 (0x8E) and SS3 (0x8F) therefore cannot appear
as lead bytes, and no 3-byte sequence is ever valid in EUC-KR.

PostgreSQL currently has two inconsistencies with this:

1. Table 23.3 in the documentation lists EUC_KR Bytes/Char as "1-3".
2. pg_euckr_mblen(), pg_euckr_dsplen(), and pg_euckr2wchar_with_len()
   delegate to the shared pg_euc_* helpers, which include SS2 (0x8E)
   and SS3 (0x8F) handling written for encodings that designate G2/G3
   (e.g. EUC-JP, EUC-TW).

The following evidence confirms that SS2/SS3 are not part of EUC-KR:

- KS X 2901 defines EUC-KR with the following code set table
  (see attached ksx2901-euc-kr-code-set-table.png):

    Code set  Code value representation          Character set
    0         0XXXXXXX                           KS X 1003 (ASCII)
    1         1XXXXXXX 1XXXXXXX                  KS X 1001
    2         SS2 1XXXXXXX [1XXXXXXX [...]]      undefined
    3         SS3 1XXXXXXX [1XXXXXXX [...]]      undefined

  The standard states: "In particular, since the character sets for
  code set 2 and code set 3 are not defined, they may be defined and
  used in the future if necessary."

- pg_euckr_verifychar() (src/common/wchar.c:1044) already has no SS2/SS3
  branch; it accepts only 0x00-0x7F (G0, ASCII) and 0xA1-0xFE lead bytes
  (G1, KS X 1001).  Any 0x8E or 0x8F byte is rejected.
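
For illustration, the acceptance rule described above can be sketched
roughly as follows.  This is a standalone sketch, not the code in
wchar.c: the function names are hypothetical, and it assumes the
trailing byte of a G1 character must fall in the same 0xA1-0xFE range
as the lead byte.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if c lies in the range used by G1 (KS X 1001) bytes. */
static bool
euckr_in_g1_range(unsigned char c)
{
    return c >= 0xA1 && c <= 0xFE;
}

/*
 * Hypothetical sketch: return the byte length (1 or 2) of the character
 * starting at s, or -1 if the bytes do not form a valid EUC-KR character.
 * SS2 (0x8E) and SS3 (0x8F) fall outside 0xA1-0xFE and are rejected.
 */
static int
euckr_verifychar_sketch(const unsigned char *s, size_t len)
{
    if (len == 0)
        return -1;

    /* G0: a single ASCII byte. */
    if (s[0] < 0x80)
        return 1;

    /* G1: exactly two bytes, both in 0xA1-0xFE (trail range is assumed). */
    if (len < 2)
        return -1;
    if (!euckr_in_g1_range(s[0]) || !euckr_in_g1_range(s[1]))
        return -1;
    return 2;
}
```

With this rule, the byte pair 0x8E 0xA1 and the triple 0x8F 0xA1 0xA1
are both rejected at the lead byte.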

This patch fixes both inconsistencies:

- Replace the three delegating functions with EUC-KR-specific
  implementations that recognise only G0 (1 byte) and G1 (2 bytes).
- Reduce maxmblen from 3 to 2 in pg_wchar_table[PG_EUC_KR].
- Correct Table 23.3 from "1-3" to "1-2".
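
To make the intent of the first item concrete, a length routine that
knows only G0 and G1 could look roughly like the sketch below.  It is
not taken from the attached patch; the function names are hypothetical,
and treating every high-bit byte as a two-byte lead is an assumption
about how non-G1 bytes are handled.

```c
#include <stddef.h>

/*
 * Hypothetical sketch: byte length of the EUC-KR character starting at s.
 * Only G0 (1 byte) and G1 (2 bytes) exist, so the result is never 3.
 */
static int
euckr_mblen_sketch(const unsigned char *s)
{
    /* High bit set: treat as a two-byte G1 character; otherwise ASCII. */
    return (*s & 0x80) ? 2 : 1;
}

/*
 * Hypothetical helper: count the characters in a buffer of len bytes
 * using the length rule above.  A truncated final character is counted
 * once and ends the scan.
 */
static size_t
euckr_count_chars_sketch(const unsigned char *s, size_t len)
{
    size_t      nchars = 0;

    while (len > 0)
    {
        size_t      l = (size_t) euckr_mblen_sketch(s);

        if (l > len)
            l = len;            /* do not read past the end of the buffer */
        s += l;
        len -= l;
        nchars++;
    }
    return nchars;
}
```

In this sketch a 0x8F lead byte yields a length of 2, where a helper
with SS3 handling would report 3.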

Because pg_euckr_verifychar() already rejects 0x8E and 0x8F as lead
bytes, input containing SS2/SS3 sequences never passes validation.  This
patch therefore introduces no behavior change for valid EUC-KR data.

This was discussed in [1].

[1] https://postgr.es/m/CAAAe_zBdGXsALm%3DGkUPtPx9MLcjcM5hBg3HZU%2Bnh8gKXSjXJJw%40mail.gmail.com

--
SungJun Jang

Attachment: v1-0001-Make-EUC-KR-encoding-routines-self-contained.patch