I have found a long-standing bug in the handling of certain Asian multibyte charsets (the original report was made by Mr. Ishida).

Some text operations on certain Asian charsets, such as EUC_JP code set 3 (JIS X 0212), produce wrong results. As far as I know, these include:

- strpos
- regular expressions

It seems LIKE is not affected by this bug.

The bug has been there since 6.4. The reason we did not notice it is that the affected charsets are rarely used. Other charsets affected by the bug are EUC_CN code sets 2 and 3 (it seems they are not used at all) and EUC_TW code sets 2 and 3 (it seems code set 3 is not used). As far as I know, EUC_KR is not affected (I believe code sets 2 and 3 are not used in EUC_KR).

Here is a description of the bug.

- strpos

  In an EUC_JP database:

    SELECT strpos(hextostr('8faaa18faae1'), hextostr('8faae1'));

  returns 1 instead of 2, where hextostr() is a hexadecimal-to-string conversion function developed by Mr. Ishida. Each three-byte sequence starting with 8f is a JIS X 0212 character encoded in EUC_JP (for example, 8faaa18faae1 consists of 2 EUC_JP characters).

- regexp

    SELECT hextostr('8faaa18faaa1') ~ hextostr('8faae1');

  returns false instead of true.

Details of the bug:

In backend/utils/mb/wchar.c there are functions that convert multibyte strings to wchar. When the conversion was performed, the second or third byte was masked with 0x3f, which discards the top two bits of that byte. As a result, for example, 8faaa1 and 8faae1 look the same after conversion.

I'm going to commit fixes for 7.3-stable, 7.4-stable, 8.0-stable, 8.1-stable and current.
--
Tatsuo Ishii
SRA OSS, Inc. Japan