At Fri, 23 Dec 2005 04:37:19 +0900,
Fumitoshi UKAI wrote:

> reassign 344146 libc6 2.3.5-8.1
> retitle 344146 re_search(3) dumps core
> thanks
>
> It is a bug in libc6, not in grep.
> grep 2.3.1.ds2-4 works fine on libc6 2.3.2.ds1-22 if I rebuilt on sarge.
> It seems some problem in posix/regex_internal.c:build_wcs_upper_buffer().
>
> % LANG=ja_JP.EUC-JP gdb ./a.out
> GNU gdb 6.4-debian
> Copyright 2005 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i486-linux-gnu"...Using host libthread_db library
> "/lib/tls/libthread_db.so.1".
>
> (gdb) run
> Starting program: /tmp/a.out
>
> Program received signal SIGSEGV, Segmentation fault.
> 0xb7f1920f in memcpy () from /lib/tls/libc.so.6
> (gdb) bt
> #0  0xb7f1920f in memcpy () from /lib/tls/libc.so.6
> #1  0xb7f4a07a in build_wcs_upper_buffer () from /lib/tls/libc.so.6
> #2  0xb7f4a335 in re_string_reconstruct () from /lib/tls/libc.so.6
> #3  0xb7f5bde7 in re_search_internal () from /lib/tls/libc.so.6
> #4  0xb7f5ea89 in re_search_stub () from /lib/tls/libc.so.6
> #5  0xb7f5ef63 in re_search () from /lib/tls/libc.so.6
> #6  0x08048618 in main (argc=1, argv=0xbffffaf4) at rtest.c:28
> (gdb)

I investigated this further:

 * The input multibyte sequence is "\x8f\xa9\xc3", which is LATIN SMALL
   LETTER ETH in the EUC-JP encoding.
 * If RE_ICASE is set in re_syntax, re_search tries to convert characters
   to upper case with build_wcs_upper_buffer().
 * When the multibyte sequence "\x8f\xa9\xc3" in EUC-JP is converted to a
   wide character, we get 0x00F0 (LATIN SMALL LETTER ETH; U+00F0).
 * This wide character (LATIN SMALL LETTER ETH; U+00F0) is lower case, so
   it has to be passed through towupper().
 * towupper() on this wide character (LATIN SMALL LETTER ETH; U+00F0)
   returns the wide character 0x00D0 (LATIN CAPITAL LETTER ETH; U+00D0).
 * Converting the wide character 0x00D0 (LATIN CAPITAL LETTER ETH; U+00D0)
   back to a multibyte sequence in EUC-JP fails, so wcrtomb() returns
   (size_t)(-1).
   (There is no valid byte sequence that represents LATIN CAPITAL LETTER
   ETH; U+00D0 in the EUC-JP encoding.)
 * However, build_wcs_upper_buffer() does not handle this case.  It
   assumes that mbrtowc -> towupper -> wcrtomb always succeeds, and only
   handles the case where the lengths of the two multibyte sequences
   differ.

I'm not sure, but I think towupper(3) should not return a wide character
that cannot be represented in the current locale's encoding.

The Single UNIX Specification, Version 2 says:

  If the argument of towupper() represents a lower-case wide-character
  code, and there exists a corresponding upper-case wide-character code
  (as defined by character type information in the program locale
  category LC_CTYPE), the result is the corresponding upper-case
  wide-character code.

  http://www.opengroup.org/onlinepubs/007908799/xsh/towupper.html

In this case,

 * the argument of towupper() represents a lower-case wide-character
   code, 0x00F0 (LATIN SMALL LETTER ETH; U+00F0),
 * but there DOESN'T exist a corresponding upper-case wide-character code
   (as defined by character type information in the program locale
   category LC_CTYPE).  The upper-case counterpart of (LATIN SMALL LETTER
   ETH; U+00F0) would be (LATIN CAPITAL LETTER ETH; U+00D0), but that
   character does not exist in the EUC-JP encoding.
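The missing check described above can be sketched roughly as follows.  This
is only an illustration, not the glibc code and not a proposed patch:
upcase_mb_checked() is a hypothetical helper name, and falling back to the
original bytes when wcrtomb() fails is an assumption about what a caller
would want.  The point is simply that the result of wcrtomb() is tested
against (size_t)(-1) before it is used as a length.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not part of glibc): upper-case the single multibyte
   character at the start of src (at most len bytes) into dst, which must
   have room for MB_LEN_MAX bytes.  Returns the number of bytes written,
   or 0 if src does not start with a complete, valid, non-NUL character. */
size_t upcase_mb_checked(const char *src, size_t len, char *dst)
{
    mbstate_t st;
    wchar_t wc;
    char tmp[MB_LEN_MAX];
    size_t inlen, outlen;

    /* Step 1: multibyte -> wide character. */
    memset(&st, 0, sizeof st);
    inlen = mbrtowc(&wc, src, len, &st);
    if (inlen == (size_t)-1 || inlen == (size_t)-2 || inlen == 0)
        return 0;

    /* Step 2: upper-case it, then wide character -> multibyte. */
    memset(&st, 0, sizeof st);
    outlen = wcrtomb(tmp, (wchar_t)towupper((wint_t)wc), &st);

    /* Step 3: the check this message is about.  If the upper-case form
       has no representation in the current locale, wcrtomb() returns
       (size_t)-1, which must never be used as a copy length.
       Assumed fallback: keep the original bytes unchanged. */
    if (outlen == (size_t)-1) {
        memcpy(dst, src, inlen);
        return inlen;
    }

    memcpy(dst, tmp, outlen);
    return outlen;
}
```

After setlocale(LC_ALL, "") under LANG=ja_JP.EUC-JP, calling this helper on
"\x8f\xa9\xc3" would be expected to take the fallback branch, given the
wcrtomb() failure reported in this message.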
% cat wupper-test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

int
main(int argc, char *argv[])
{
    mbstate_t st;
    unsigned char buf[] = "\x8f\xa9\xc3";
    unsigned char obuf[10];
    wchar_t wc;
    wint_t wcu;
    size_t s;

    memset(&st, 0, sizeof(st));
    setlocale(LC_ALL, "");

    /* multibyte -> wide character */
    s = mbrtowc(&wc, (const char *)buf, sizeof(buf), &st);
    printf("mb:[%02x %02x %02x %02x] => len:%d wc: %04x\n",
           buf[0], buf[1], buf[2], buf[3], (int)s, (unsigned int)wc);

    /* wide character -> multibyte (round trip of the lower-case letter) */
    memset(obuf, 0, sizeof(obuf));
    s = wcrtomb((char *)obuf, wc, &st);
    printf("wc %04x => len:%d mb:[%02x %02x %02x %02x]\n",
           (unsigned int)wc, (int)s, obuf[0], obuf[1], obuf[2], obuf[3]);

    /* lower case -> upper case */
    wcu = towupper(wc);
    printf("wc:%04x => wcu:%04x\n", (unsigned int)wc, (unsigned int)wcu);

    /* upper-case wide character -> multibyte; this is the step that fails */
    memset(obuf, 0, sizeof(obuf));
    s = wcrtomb((char *)obuf, (wchar_t)wcu, &st);
    printf("wc %04x => len:%d mb:[%02x %02x %02x %02x]\n",
           (unsigned int)wcu, (int)s, obuf[0], obuf[1], obuf[2], obuf[3]);
    exit(0);
}
% cc -o wupper-test wupper-test.c
% LANG=ja_JP.EUC-JP ./wupper-test
mb:[8f a9 c3 00] => len:3 wc: 00f0
wc 00f0 => len:3 mb:[8f a9 c3 00]
wc:00f0 => wcu:00d0
wc 00d0 => len:-1 mb:[00 00 00 00]

Regards,
Fumitoshi UKAI