Bug? in bash setlocale implementation
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS: -DPROGRAM='bash' -DCONF_HOSTTYPE='x86_64' -DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='x86_64-pc-linux-gnu' -DCONF_VENDOR='pc' -DLOCALEDIR='/usr/s
uname output: Linux DETH00 3.0.0-15-generic #26-Ubuntu SMP Fri Jan 20 17:23:00 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu

Bash Version: 4.2
Patch Level: 10
Release Status: release

Description:
        Basically, if setting the locale fails, the variable should not be changed.

        Consider:

            export LC_CTYPE=
            bash -c 'LC_CTYPE=ISO-8859-1 eval printf \${LC_CTYPE:-unset}'
            bash: warning: setlocale: LC_CTYPE: cannot change locale (ISO-8859-1): No such file or directory
            ISO-8859-1

            ksh93 -c 'LC_CTYPE=ISO-8859-1 eval printf \${LC_CTYPE:-unset}'
            ISO-8859-1: unknown locale
            unset

            ksh93 -c 'LC_CTYPE=C.UTF-8 eval printf \${LC_CTYPE:-unset}'
            C.UTF-8

        The advantage of the ksh93 behavior is that you can check in the script whether the locale change worked, e.g.

            LC_CTYPE=ISO-8859-1
            [ "${LC_CTYPE:-}" = ISO-8859-1 ] || error exit
Re: Fix u32toutf8 so it encodes values 0xFFFF correctly.
On 02/20/2012 07:42 PM, Chet Ramey wrote:
> On 2/18/12 5:39 AM, John Kearney wrote:
>
>> Bash Version: 4.2
>> Patch Level: 10
>> Release Status: release
>>
>> Description:
>> Current u32toutf8 only encodes values below 0x correctly.
>> wchar_t can be of ambiguous size; better, in my opinion, to use
>> unsigned long, or uint32_t, or something clearer.
>
> Thanks for the patch. It's good to have a complete implementation,
> though as a practical matter you won't see UTF-8 characters longer
> than four bytes. I agree with you about the unsigned 32-bit int type;
> wchar_t is signed, even if it's 32 bits, on several systems I use.

Not only can wchar_t be either signed or unsigned, you also have to
worry about platforms where it is only 16 bits, such as cygwin; on the
other hand, wint_t is always 32 bits, but you still have the issue that
it can be either signed or unsigned.

--
Eric Blake   ebl...@redhat.com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Re: Fix u32toutf8 so it encodes values 0xFFFF correctly.
On 02/21/2012 01:34 PM, Eric Blake wrote:
> On 02/20/2012 07:42 PM, Chet Ramey wrote:
>> On 2/18/12 5:39 AM, John Kearney wrote:
>>
>>> Bash Version: 4.2
>>> Patch Level: 10
>>> Release Status: release
>>>
>>> Description:
>>> Current u32toutf8 only encodes values below 0x correctly.
>>> wchar_t can be of ambiguous size; better, in my opinion, to use
>>> unsigned long, or uint32_t, or something clearer.
>>
>> Thanks for the patch. It's good to have a complete implementation,
>> though as a practical matter you won't see UTF-8 characters longer
>> than four bytes. I agree with you about the unsigned 32-bit int type;
>> wchar_t is signed, even if it's 32 bits, on several systems I use.
>
> Not only can wchar_t be either signed or unsigned, you also have to
> worry about platforms where it is only 16 bits, such as cygwin; on the
> other hand, wint_t is always 32 bits, but you still have the issue that
> it can be either signed or unsigned.

Signed / unsigned isn't really the problem anyway: utf-8 only encodes up
to 0x7fff and utf-16 only encodes up to 0x0010.

In my latest version I've pretty much removed all references to wchar_t in
unicode.c. It was unnecessary.

However, I would be interested in something like utf16_t or uint16_t; I am
currently using unsigned short, which is inelegant but works.
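For readers following the encoding discussion, here is a rough shell sketch of how a code point maps to UTF-8 bytes. It assumes the modern definition in RFC 3629, which limits UTF-8 to U+0000..U+10FFFF and therefore to at most four bytes (matching Chet's practical remark above); the older five- and six-byte forms are not covered. The helper name utf8_hex is made up for this illustration and is not the utf32toutf8 function from the patch.

```shell
#!/bin/sh
# utf8_hex CODEPOINT
# Hypothetical helper: print the UTF-8 bytes of CODEPOINT as hex.
# Assumes the RFC 3629 range (0..0x10FFFF, surrogates excluded).
utf8_hex() {
    cp=$(( $1 ))    # accepts decimal or 0x-prefixed hex
    if [ "$cp" -lt 0 ] || [ "$cp" -gt 1114111 ]; then      # > 0x10FFFF
        return 1
    elif [ "$cp" -ge 55296 ] && [ "$cp" -le 57343 ]; then  # 0xD800..0xDFFF
        return 1
    elif [ "$cp" -le 127 ]; then                           # 0x7F: 1 byte
        printf '%02x\n' "$cp"
    elif [ "$cp" -le 2047 ]; then                          # 0x7FF: 2 bytes
        printf '%02x %02x\n' \
            $(( 0xC0 | (cp >> 6) )) \
            $(( 0x80 | (cp & 0x3F) ))
    elif [ "$cp" -le 65535 ]; then                         # 0xFFFF: 3 bytes
        printf '%02x %02x %02x\n' \
            $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else                                                   # 4 bytes
        printf '%02x %02x %02x %02x\n' \
            $(( 0xF0 | (cp >> 18) )) \
            $(( 0x80 | ((cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    fi
}

utf8_hex 0x20AC    # prints: e2 82 ac
```

Because the sketch does all arithmetic on plain shell integers, the signedness and width questions raised for wchar_t do not arise here; in C the same effect is usually obtained with a fixed-width unsigned type.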
Two minor corrections to the manual of bash
Configuration Information [Automatically generated, do not change]:
Machine: i486
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS: -DPROGRAM='bash' -DCONF_HOSTTYPE='i486' -DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='i486-pc-linux-gnu' -DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL -DHAVE_CONFIG_H -I. -I../bash -I../bash/include -I../bash/lib -g -O2 -Wall
uname output: Linux jeti.jeti.is 2.6.32-41 #1 Sat Feb 11 01:50:30 GMT 2012 i586 GNU/Linux
Machine Type: i486-pc-linux-gnu

Bash Version: 4.1
Patch Level: 5
Release Status: release

Description:
        1) The macro FN is defined too late in the manual of
        bash-builtins(7), which includes bash(1).

        Warnings from groff for the manual of bash-builtins:

            man1/bash.1:7477: warning: macro `FN' not defined

        2) Escape used before +:

            standard input:6199: warning: escape character ignored before `+'

Repeat-By:
        man --warnings=w bash-builtins

        and

        man --warnings=w bash

        with the environment variable MAN_KEEP_STDERR='y'.

Fix:

--- bash.1 2012-02-20 00:35:14.0 +
+++ bash.new.1 2012-02-20 01:50:30.0 +
@@ -7,6 +7,13 @@
 .\"
 .\" Last Change: Tue Dec 29 15:36:16 EST 2009
 .\"
+.\"
+.\" File Name macro. This used to be `.PN', for Path Name,
+.\" but Sun doesn't seem to like that very much.
+.\"
+.de FN
+\fI\|\\$1\|\fP
+..
 .\" bash_builtins, strip all but Built-Ins section
 .if \n(zZ=1 .ig zZ
 .if \n(zY=1 .ig zY
@@ -36,13 +43,6 @@
 .el \\*(]X\h|\\n()Iu+\\n()Ru\c
 .}f
 ..
-.\"
-.\" File Name macro. This used to be `.PN', for Path Name,
-.\" but Sun doesn't seem to like that very much.
-.\"
-.de FN
-\fI\|\\$1\|\fP
-..
 .SH NAME
 bash \- GNU Bourne-Again SHell
 .SH SYNOPSIS
@@ -6196,7 +6196,7 @@
 This section describes what syntax features are available.
 This feature is enabled by default for interactive shells, and can be
 disabled using the
-.B \+H
+.B +H
 option to the
 .B set
 builtin command (see

-- System Information:
Debian Release: 6.0.4
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'proposed-updates'), (500, 'stable')
Architecture: i386 (i586)

Kernel: Linux 2.6.32-41
Locale: LANG=is_IS, LC_CTYPE=is_IS (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/dash

Versions of packages bash depends on:
ii  base-files     6.0squeeze4      Debian base system miscellaneous f
ii  dash           0.5.5.1-7.4      POSIX-compliant shell
ii  debianutils    3.4              Miscellaneous utilities specific t
ii  libc6          2.11.3-2         Embedded GNU C Library: Shared lib
ii  libncurses5    5.7+20100313-5   shared libraries for terminal hand

Versions of packages bash recommends:
pn  bash-completion   none   (no description available)

Versions of packages bash suggests:
pn  bash-doc          none   (no description available)

-- no debconf information

--
Bjarni I. Gislason
Re: excess braces ignored: bug or feature ?
On 2/20/12 2:32 AM, Dan Douglas wrote:

>>> That one really is ignored. No variable named xxx... is actually set.
>>
>> I assume you mean the first one. It doesn't matter whether or not the
>> variable is set as a side effect of the redirection -- it's in a subshell
>> and disappears.
>>
>> Chet
>
> Oh so a subshell is created after all, and that really is a command
> substitution + redirect! I just chalked it up to Bash recycling the way
> redirects were parsed.

Bash always forks for command substitution. It defers parsing the command
between the parens until it's needed, and so doesn't notice that it's only
a redirection until it has already forked. It is able to skip the exec and
dump the file out directly.

> I think I ran across that quirk in trying to determine whether $(<file)
> was a special command substitution or its own kind of expansion but
> couldn't think of a way to test it.

It's a special command substitution (there are others). David has gone to
great lengths to avoid forking in ksh93 wherever possible; I assume this
is one of those places where he's managed to do so.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
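The "it's in a subshell and disappears" point can be seen with a small script. This is only an illustration written for this digest, not something posted in the thread; the variable names are arbitrary, and the BASHPID check is skipped in shells that do not provide that variable.

```shell
#!/bin/sh
# An assignment made inside a command substitution does not survive,
# because the substitution runs in a subshell environment.
x=outer
captured=$(x=inner; printf '%s' "$x")

printf 'inside:  %s\n' "$captured"   # value seen within the substitution
printf 'outside: %s\n' "$x"          # the parent's value is untouched

# Bash-specific extra: BASHPID names the current shell process, so it
# differs inside the substitution when the shell really forked.
if [ -n "${BASHPID:-}" ]; then
    printf 'parent pid %s, substitution pid %s\n' \
        "$BASHPID" "$(printf '%s' "$BASHPID")"
fi
```

The first two lines of output are the same in any POSIX shell whether or not a process was forked, so this shows the semantics only. To see whether a fork actually happens, a tracer is needed; on Linux, for example, strace -f -e trace=process can be used for that.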
Re: excess braces ignored: bug or feature ?
On 2/20/12 4:17 AM, Dan Douglas wrote:
> On Sunday, February 19, 2012 04:25:46 PM Chet Ramey wrote:
>> I assume you mean the first one. It doesn't matter whether or not the
>> variable is set as a side effect of the redirection -- it's in a subshell
>> and disappears.
>>
>> Chet
>
> Forgot to mention though, it's possible in ksh there is no subshell
> created if you consider this:

The best way to tell for sure is with a system call tracer.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
Re: Fix u32toutf8 so it encodes values 0xFFFF correctly.
On 2/21/12 8:43 AM, John Kearney wrote:

> Signed / unsigned isn't really the problem anyway: utf-8 only encodes up
> to 0x7fff and utf-16 only encodes up to 0x0010.
>
> In my latest version I've pretty much removed all references to wchar_t in
> unicode.c. It was unnecessary.

It's useful if the platform defines __STDC_ISO_10646__, wchar_t is 32
bits, and the value is less than 0x7fff.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
Re: Bug? in bash setlocale implementation
On 2/21/12 3:51 AM, John Kearney wrote:

> Bash Version: 4.2
> Patch Level: 10
> Release Status: release
>
> Description:
> Basically if setting the locale fails variable should not be changed.

I disagree. The assignment was performed correctly and as the user
specified. The fact that a side effect of the assignment failed should
not mean that the assignment should be undone.

I got enough bug reports when I added the warning. I'd get at least as
many if I undid a perfectly good assignment statement.

I could see setting $? to a non-zero value if the setlocale() call fails,
but not when the shell is in posix mode.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
Here is a diff of all the changed to the unicode
Here is a diff of all the changes to the unicode handling.

This seems to work OK for me, but still needs further testing.

My major goal was to make the code easier to follow and clearer, but I
also generally fixed and improved it.

Added a warning message:

    ./bash -c 'printf "string 1\\U8fffStromg 2"'
    ./bash: line 0: printf: warning: U+8fff unsupported in destination charset .UTF-8
    string 1U+8fffStromg 2

Added utf32toutf16 and utf32towchar to allow usage of wcstombs both when
wchar_t=2 or 4.

Generally reworked so it is consistent with the function argument
convention, i.e. destination then source.

diff --git a/builtins/printf.def b/builtins/printf.def
index 9eca215..77a8159 100644
--- a/builtins/printf.def
+++ b/builtins/printf.def
@@ -859,15 +859,9 @@ tescape (estart, cp, lenp, sawc)
           *cp = '\\';
           return 0;
         }
-      if (uvalue <= UCHAR_MAX)
-        *cp = uvalue;
-      else
-        {
-          temp = u32cconv (uvalue, cp);
-          cp[temp] = '\0';
-          if (lenp)
-            *lenp = temp;
-        }
+      temp = utf32tomb (cp, uvalue);
+      if (lenp)
+        *lenp = temp;
       break;
 #endif
 
diff --git a/externs.h b/externs.h
index 09244fa..ff3f344 100644
--- a/externs.h
+++ b/externs.h
@@ -460,7 +460,7 @@ extern unsigned int falarm __P((unsigned int, unsigned int));
 extern unsigned int fsleep __P((unsigned int, unsigned int));
 
 /* declarations for functions defined in lib/sh/unicode.c */
-extern int u32cconv __P((unsigned long, char *));
+extern int utf32tomb __P((char *, unsigned long));
 
 /* declarations for functions defined in lib/sh/winsize.c */
 extern void get_new_window_size __P((int, int *, int *));
diff --git a/lib/sh/strtrans.c b/lib/sh/strtrans.c
index 2265782..e410cff 100644
--- a/lib/sh/strtrans.c
+++ b/lib/sh/strtrans.c
@@ -28,6 +28,7 @@
 #include <stdio.h>
 #include <chartypes.h>
 
+#include <stdio.h>
 #include "shell.h"
 
 #ifdef ESC
@@ -140,21 +141,10 @@ ansicstr (string, len, flags, sawc, rlen)
               for (v = 0; ISXDIGIT ((unsigned char)*s) && temp--; s++)
                 v = (v * 16) + HEXVALUE (*s);
               if (temp == ((c == 'u') ? 4 : 8))
-                {
                   *r++ = '\\';  /* c remains unchanged */
-                  break;
-                }
-              else if (v <= UCHAR_MAX)
-                {
-                  c = v;
-                  break;
-                }
               else
-                {
-                  temp = u32cconv (v, r);
-                  r += temp;
-                  continue;
-                }
+                  r += utf32tomb (r, v);
+              break;
 #endif
             case '\\':
               break;
diff --git a/lib/sh/unicode.c b/lib/sh/unicode.c
index d34fa08..5cc96bf 100644
--- a/lib/sh/unicode.c
+++ b/lib/sh/unicode.c
@@ -36,13 +36,7 @@
 
 #include <xmalloc.h>
 
-#ifndef USHORT_MAX
-#  ifdef USHRT_MAX
-#    define USHORT_MAX USHRT_MAX
-#  else
-#    define USHORT_MAX ((unsigned short) ~(unsigned short)0)
-#  endif
-#endif
+#include "bashintl.h"
 
 #if !defined (STREQ)
 #  define STREQ(a, b) ((a)[0] == (b)[0] && strcmp ((a), (b)) == 0)
@@ -54,13 +48,14 @@ extern const char *locale_charset __P((void));
 extern char *get_locale_var __P((char *));
 #endif
 
-static int u32init = 0;
+const char *charset;
 static int utf8locale = 0;
 #if defined (HAVE_ICONV)
 static iconv_t localconv;
 #endif
 
 #ifndef HAVE_LOCALE_CHARSET
+static char charset_buffer[40]={0};
 static char *
 stub_charset ()
 {
@@ -68,168 +63,267 @@ stub_charset ()
 
   locale = get_locale_var ("LC_CTYPE");
   if (locale == 0 || *locale == 0)
-    return "ASCII";
-  s = strrchr (locale, '.');
-  if (s)
     {
-      t = strchr (s, '@');
-      if (t)
-        *t = 0;
-      return ++s;
+      strcpy(charset_buffer, "ASCII");
     }
-  else if (STREQ (locale, "UTF-8"))
-    return "UTF-8";
   else
-    return "ASCII";
+    {
+      s = strrchr (locale, '.');
+      if (s)
+        {
+          t = strchr (s, '@');
+          if (t)
+            *t = 0;
+          strcpy(charset_buffer, s);
+        }
+      else
+        {
+          strcpy(charset_buffer, locale);
+        }
+      /* free(locale) If we can Modify the buffer surely we need to free it?*/
+    }
+  return charset_buffer;
 }
 #endif
 
-/* u32toascii ? */
+
+#if 0
 int
-u32tochar (wc, s)
-     wchar_t wc;
+utf32tobig5 (s, c)
      char *s;
+     unsigned long c;
 {
-  unsigned long x;
   int l;
 
-  x = wc;
-  l = (x <= UCHAR_MAX) ? 1 : ((x <= USHORT_MAX) ? 2 : 4);
-
-  if (x <= UCHAR_MAX)
-    s[0] = x & 0xFF;
-  else if (x <= USHORT_MAX)     /* assume unsigned short = 16 bits */
+  if (c <= 0x7F)
     {
-      s[0] = (x >> 8) & 0xFF;
-      s[1] = x & 0xFF;
+      s[0] = (char)c;
+      l = 1;
+    }
+  else if ((c >= 0x8000) && (c <= 0x))
+    {
+      s[0] = (char)(c>>8);
+      s[1] = (char)(c & 0xFF);
+      l = 2;
     }
   else
     {
-      s[0] = (x >> 24) & 0xFF;
-      s[1] = (x >> 16) & 0xFF;
-      s[2] = (x >> 8) & 0xFF;
-      s[3] = x & 0xFF;
+      /* Error Invalid UTF-8 */
+      l = 0;
     }
   s[l] = '\0';
-  return l;
+  return l;
 }
-
+#endif
 int
-u32toutf8 (wc, s)
-     wchar_t wc;
+utf32toutf8 (s, c)
      char *s;
+     unsigned long c;
 {
   int l;
 
-  l = (wc < 0x0080) ? 1 : ((wc < 0x0800) ? 2 : 3);
-
-  if (wc < 0x0080)
-    s[0] = (unsigned char)wc;
-  else if (wc < 0x0800)
+  if (c <= 0x7F)
     {
-      s[0] = (wc >> 6) | 0xc0;
-      s[1] = (wc & 0x3f) |
Re: Bug? in bash setlocale implementation
On 02/22/2012 01:52 AM, Chet Ramey wrote:
> On 2/21/12 3:51 AM, John Kearney wrote:
>
>> Bash Version: 4.2
>> Patch Level: 10
>> Release Status: release
>>
>> Description:
>> Basically if setting the locale fails variable should not be changed.
>
> I disagree. The assignment was performed correctly and as the user
> specified. The fact that a side effect of the assignment failed should
> not mean that the assignment should be undone.
>
> I got enough bug reports when I added the warning. I'd get at least as
> many if I undid a perfectly good assignment statement.
>
> I could see setting $? to a non-zero value if the setlocale() call fails,
> but not when the shell is in posix mode.
>
> Chet

OK, I guess that makes sense; it's just that the ksh93 behavior also makes
sense. I guess I can just use some command to check that the charset is
present before I assign it.
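One way to do the "check before assigning" step mentioned above is to look the name up in the output of locale -a. The following is a minimal sketch under that assumption; has_locale is a hypothetical helper written for this digest, and the match is an exact whole-line comparison, so the name must be spelled the way the system lists it.

```shell
#!/bin/sh
# has_locale NAME
# Hypothetical helper: succeed if NAME appears as a whole line on stdin.
# Meant to be fed the output of `locale -a`.
has_locale() {
    grep -Fxq -- "$1"
}

want=ISO-8859-1
if command -v locale >/dev/null 2>&1 && locale -a | has_locale "$want"; then
    # The locale is installed, so the assignment should take effect.
    LC_CTYPE=$want
    export LC_CTYPE
else
    printf 'locale %s not available, leaving LC_CTYPE alone\n' "$want" >&2
fi
```

Reading the list from stdin keeps the helper independent of how the list is produced. Note that some systems print a normalized spelling in locale -a (for example en_US.utf8 rather than en_US.UTF-8), so a stricter script may need to normalize both sides before comparing.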
printf %q ~ not escaped?
Bash Version: 4.2
Patch Level: 10
Release Status: release

Description:
        printf %q "~" is not escaped, which means that

            eval echo $(printf %q "~")

        results in your home path, not a ~, unlike

            eval echo $(printf %q "*")

        As far as I can see it's the only character that isn't treated as I
        expected.
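Until printf %q handles this case, one possible workaround is to quote strings for eval with plain single quotes, which suppress tilde expansion as well as globbing. The sketch below is an assumption-laden illustration, not a bash feature: sq_quote is a hypothetical helper, it relies on sed, and trailing newlines in the argument are lost because of the command substitution.

```shell
#!/bin/sh
# sq_quote STRING
# Hypothetical helper: wrap STRING in single quotes, rewriting each
# embedded ' as '\'' so the result can be passed safely through eval.
sq_quote() {
    printf "'%s'\n" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# A leading tilde stays literal after the eval round trip.
eval "printf '%s\n' $(sq_quote '~')"    # prints: ~
```

Single-quoting everything is cruder than the backslash style that printf %q produces, but it does not depend on which characters the shell treats as special at the start of a word.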