[HACKERS] bugs with certain Asian multibyte charsets

Tatsuo Ishii Sat, 24 Dec 2005 01:31:42 -0800

I have found a long-standing bug in the handling of certain Asian
multibyte charsets (the original report was made by Mr. Ishida).


Some text operations on certain Asian charsets, such as EUC_JP code
set 3 (JIS X 0212), produce wrong results. As far as I know, these
include:

- strpos
- regular expression

It seems LIKE is not affected by this bug.

The bug has been there since 6.4. The reason we did not notice it is
that the affected charsets are rarely used. Other charsets affected by
the bug are EUC_CN code sets 2 and 3 (it seems they are not used at
all) and EUC_TW code sets 2 and 3 (it seems code set 3 is not used).
As far as I know, EUC_KR is not affected (I believe code sets 2 and 3
are not used in EUC_KR).

Here is a description of the bug.

- strpos

In EUC_JP database:

SELECT strpos(hextostr('8faaa18faae1'), hextostr('8faae1'));

returns 1 instead of 2, where hextostr() is a hexadecimal-to-string
conversion function developed by Mr. Ishida. Each three-byte sequence
starting with 8f is a JIS X 0212 letter encoded in EUC_JP (for
example, 8faaa18faae1 consists of 2 EUC_JP letters).
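
To make the expected answer concrete, here is a minimal Python sketch
of a character-level strpos over EUC_JP bytes. It is only an
illustration, not the backend code: split_euc_jp() and char_strpos()
are hypothetical helper names, and the splitter assumes well-formed
input (8f starts a 3-byte letter, 8e or any other high-bit byte starts
a 2-byte letter, everything else is one byte).

```python
SS2 = 0x8E  # single shift 2: code set 2, two bytes in total
SS3 = 0x8F  # single shift 3: code set 3 (JIS X 0212), three bytes in total


def split_euc_jp(data: bytes) -> list:
    """Split well-formed EUC_JP bytes into one bytes object per letter."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead == SS3:
            width = 3
        elif lead == SS2 or lead >= 0x80:
            width = 2
        else:
            width = 1
        chars.append(data[i:i + width])
        i += width
    return chars


def char_strpos(haystack: bytes, needle: bytes) -> int:
    """Return the 1-based letter position of needle, or 0 if not found."""
    h = split_euc_jp(haystack)
    n = split_euc_jp(needle)
    for start in range(len(h) - len(n) + 1):
        if h[start:start + len(n)] == n:
            return start + 1
    return 0


# Same bytes as the SELECT above: two JIS X 0212 letters, search for the second.
position = char_strpos(bytes.fromhex("8faaa18faae1"), bytes.fromhex("8faae1"))
print(position)  # 2, the result strpos should have returned
```

Because the letters are compared with all of their bytes intact,
8faaa1 and 8faae1 stay distinct and the match is found at the second
letter.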

- regexp

SELECT hextostr('8faaa18faaa1') ~ hextostr('8faae1');

returns true instead of false.

Details of the bug:

In backend/utils/mb/wchar.c there are functions to convert multibyte
to wchar. When the conversion is performed, the second or third byte
is masked with 0x3f, which makes, for example, 8faaa1 and 8faae1 look
the same.
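
The collision can be checked with a few lines of Python. This is a
simplified illustration of the masking arithmetic only, not the actual
code in wchar.c: to_wchar() is a hypothetical helper, and the way it
packs the two bytes into one integer is an assumption made for the
example.

```python
def to_wchar(seq: bytes, mask: int) -> int:
    """Pack the two bytes that follow the 8f prefix into one integer.

    The mask is applied to each byte first, so a narrow mask drops the
    high bits that tell some letters apart.
    """
    if len(seq) != 3 or seq[0] != 0x8F:
        raise ValueError("expected a 3-byte sequence starting with 8f")
    return ((seq[1] & mask) << 8) | (seq[2] & mask)


a = bytes.fromhex("8faaa1")
b = bytes.fromhex("8faae1")

# With 0x3f, both a1 and e1 become 21, so the two letters collide.
lossy_a = to_wchar(a, 0x3F)
lossy_b = to_wchar(b, 0x3F)

# Keeping all 8 bits of each byte keeps the two letters distinct.
full_a = to_wchar(a, 0xFF)
full_b = to_wchar(b, 0xFF)

print(hex(lossy_a), hex(lossy_b))  # 0x2a21 0x2a21
print(hex(full_a), hex(full_b))    # 0xaaa1 0xaae1
```

Any two trailing bytes that differ only in bit 6 or bit 7 collapse to
the same value under the 0x3f mask, which is why strpos and the regexp
code treat different letters as equal.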

I'm going to commit fixes for 7.3-stable, 7.4-stable, 8.0-stable,
8.1-stable and current.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
