Bug? in bash setlocale implementation

2012-02-21 Thread John Kearney

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS:  -DPROGRAM='bash' -DCONF_HOSTTYPE='x86_64'
-DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='x86_64-pc-linux-gnu'
-DCONF_VENDOR='pc' -DLOCALEDIR='/usr/s
uname output: Linux DETH00 3.0.0-15-generic #26-Ubuntu SMP Fri Jan 20
17:23:00 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu

Bash Version: 4.2
Patch Level: 10
Release Status: release

Description:
  Basically, if setting the locale fails, the variable should not be changed.

  Consider:


export LC_CTYPE=

bash -c 'LC_CTYPE=ISO-8859-1 eval printf \${LC_CTYPE:-unset}'
bash: warning: setlocale: LC_CTYPE: cannot change locale (ISO-8859-1):
No such file or directory
ISO-8859-1

ksh93 -c 'LC_CTYPE=ISO-8859-1 eval printf \${LC_CTYPE:-unset}'
ISO-8859-1: unknown locale
unset
ksh93 -c 'LC_CTYPE=C.UTF-8 eval printf \${LC_CTYPE:-unset}'
C.UTF-8

  The advantage is that a script can check whether the locale change
worked, e.g.
  LC_CTYPE=ISO-8859-1
  [ "${LC_CTYPE:-}" = ISO-8859-1 ] || error exit



Re: Fix u32toutf8 so it encodes values > 0xFFFF correctly.

2012-02-21 Thread Eric Blake
On 02/20/2012 07:42 PM, Chet Ramey wrote:
> On 2/18/12 5:39 AM, John Kearney wrote:
>>
>> Bash Version: 4.2
>> Patch Level: 10
>> Release Status: release
>>
>> Description:
>>   Current u32toutf8 only encodes values below 0xFFFF correctly.
>> wchar_t can be of ambiguous size; better, in my opinion, to use
>> unsigned long, or uint32_t, or something clearer.
>
> Thanks for the patch.  It's good to have a complete implementation,
> though as a practical matter you won't see UTF-8 characters longer
> than four bytes.  I agree with you about the unsigned 32-bit int
> type; wchar_t is signed, even if it's 32 bits, on several systems
> I use.

Not only can wchar_t be either signed or unsigned, you also have to
worry about platforms where it is only 16 bits, such as cygwin; on the
other hand, wint_t is always 32 bits, but you still have the issue that
it can be either signed or unsigned.
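
The byte-length question above can be made concrete with a small sketch. This
is not the code from the patch; utf8_bytes is a hypothetical helper written
for illustration, and it assumes the modern limit of U+10FFFF (at most four
bytes), rejecting surrogates and anything larger.

```shell
#!/bin/sh
# utf8_bytes CODEPOINT
# Print the UTF-8 encoding of one code point as space-separated hex bytes.
# Hypothetical illustration only; not part of bash or of the posted patch.
utf8_bytes() (
  c=$(( $1 ))                 # accepts decimal or 0x... hexadecimal input
  max=$(( 0x10FFFF ))         # assumed upper limit (four-byte encodings)
  if [ "$c" -lt 0 ] || [ "$c" -gt "$max" ]; then
    echo "code point out of range" >&2
    exit 1                    # exits the function's subshell only
  elif [ "$c" -ge "$(( 0xD800 ))" ] && [ "$c" -le "$(( 0xDFFF ))" ]; then
    echo "surrogate code points are not encodable" >&2
    exit 1
  elif [ "$c" -le "$(( 0x7F ))" ]; then
    # 1 byte: 0xxxxxxx
    printf '%02X\n' "$c"
  elif [ "$c" -le "$(( 0x7FF ))" ]; then
    # 2 bytes: 110xxxxx 10xxxxxx
    printf '%02X %02X\n' \
      "$(( 0xC0 | (c >> 6) ))" "$(( 0x80 | (c & 0x3F) ))"
  elif [ "$c" -le "$(( 0xFFFF ))" ]; then
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02X %02X %02X\n' \
      "$(( 0xE0 | (c >> 12) ))" "$(( 0x80 | ((c >> 6) & 0x3F) ))" \
      "$(( 0x80 | (c & 0x3F) ))"
  else
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '%02X %02X %02X %02X\n' \
      "$(( 0xF0 | (c >> 18) ))" "$(( 0x80 | ((c >> 12) & 0x3F) ))" \
      "$(( 0x80 | ((c >> 6) & 0x3F) ))" "$(( 0x80 | (c & 0x3F) ))"
  fi
)
```

For example, utf8_bytes 0x20AC prints E2 82 AC, and utf8_bytes 0x10000 (the
first value above 0xFFFF) prints F0 90 80 80. All arithmetic is done on plain
shell integers, so the signedness and width of wchar_t never enter into it.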

-- 
Eric Blake   ebl...@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: Fix u32toutf8 so it encodes values > 0xFFFF correctly.

2012-02-21 Thread John Kearney
On 02/21/2012 01:34 PM, Eric Blake wrote:
> On 02/20/2012 07:42 PM, Chet Ramey wrote:
>> On 2/18/12 5:39 AM, John Kearney wrote:
>>>
>>> Bash Version: 4.2 Patch Level: 10 Release Status: release
>>>
>>> Description: Current u32toutf8 only encodes values below 0xFFFF
>>> correctly. wchar_t can be of ambiguous size; better, in my opinion,
>>> to use unsigned long, or uint32_t, or something clearer.
>>
>> Thanks for the patch.  It's good to have a complete
>> implementation, though as a practical matter you won't see UTF-8
>> characters longer than four bytes.  I agree with you about the
>> unsigned 32-bit int type; wchar_t is signed, even if it's 32
>> bits, on several systems I use.
>
> Not only can wchar_t be either signed or unsigned, you also
> have to worry about platforms where it is only 16 bits, such as
> cygwin; on the other hand, wint_t is always 32 bits, but you still
> have the issue that it can be either signed or unsigned.

Signed/unsigned isn't really the problem anyway: UTF-8 only encodes
up to 0x7fffffff and UTF-16 only encodes up to 0x0010ffff.

In my latest version I've pretty much removed all reference to wchar_t
in unicode.c. It was unnecessary.

However, I would be interested in something like utf16_t or uint16_t;
I am currently using unsigned short, which is inelegant but works.
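
To illustrate why a 16-bit unit type is enough for UTF-16, here is a rough
sketch of the surrogate-pair arithmetic. utf16_units is a hypothetical helper,
not the utf32toutf16 function from the patch, and it assumes the standard
UTF-16 range of U+0000 to U+10FFFF.

```shell
#!/bin/sh
# utf16_units CODEPOINT
# Print the UTF-16 code unit(s) for one code point as 4-digit hex values.
# Hypothetical illustration only; not taken from the posted patch.
utf16_units() (
  c=$(( $1 ))                 # accepts decimal or 0x... hexadecimal input
  if [ "$c" -lt 0 ] || [ "$c" -gt "$(( 0x10FFFF ))" ]; then
    echo "code point out of range for UTF-16" >&2
    exit 1                    # exits the function's subshell only
  elif [ "$c" -ge "$(( 0xD800 ))" ] && [ "$c" -le "$(( 0xDFFF ))" ]; then
    echo "surrogate code points are not encodable" >&2
    exit 1
  elif [ "$c" -lt "$(( 0x10000 ))" ]; then
    # Basic Multilingual Plane: one 16-bit unit, the value itself.
    printf '%04X\n' "$c"
  else
    # Supplementary planes: subtract 0x10000, then split the remaining
    # 20 bits into a high (top 10 bits) and low (bottom 10 bits) surrogate.
    v=$(( c - 0x10000 ))
    printf '%04X %04X\n' \
      "$(( 0xD800 + (v >> 10) ))" "$(( 0xDC00 + (v & 0x3FF) ))"
  fi
)
```

For example, utf16_units 0x1F600 prints D83D DE00. Every unit printed is at
most 0xFFFF, which is why any unsigned type of at least 16 bits can hold it.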



Two minor corrections to the manual of bash

2012-02-21 Thread Bjarni Ingi Gislason
Configuration Information [Automatically generated, do not change]:
Machine: i486
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS:  -DPROGRAM='bash' -DCONF_HOSTTYPE='i486' 
-DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='i486-pc-linux-gnu' 
-DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL 
-DHAVE_CONFIG_H   -I.  -I../bash -I../bash/include -I../bash/lib   -g -O2 -Wall
uname output: Linux jeti.jeti.is 2.6.32-41 #1 Sat Feb 11 01:50:30 GMT 2012 i586 
GNU/Linux
Machine Type: i486-pc-linux-gnu

Bash Version: 4.1
Patch Level: 5
Release Status: release

Description:

 1) The macro FN is defined too late in the manual of
bash-builtins(7), which includes bash(1).

  Warnings from groff for the manual of bash-builtins:

man1/bash.1:7477: warning: macro `FN' not defined

2) Escape used before +:

standard input:6199: warning: escape character ignored before `+'

Repeat-By:

  man --warnings=w bash-builtins and man --warnings=w bash with the
environment variable MAN_KEEP_STDERR='y'.

Fix:

--- bash.1  2012-02-20 00:35:14.0 +
+++ bash.new.1  2012-02-20 01:50:30.0 +
@@ -7,6 +7,13 @@
 .\"
 .\" Last Change: Tue Dec 29 15:36:16 EST 2009
 .\"
+.\"
+.\" File Name macro.  This used to be `.PN', for Path Name,
+.\" but Sun doesn't seem to like that very much.
+.\"
+.de FN
+\fI\|\\$1\|\fP
+..
 .\" bash_builtins, strip all but Built-Ins section
 .if \n(zZ=1 .ig zZ
 .if \n(zY=1 .ig zY
@@ -36,13 +43,6 @@
 .el \\*(]X\h"|\\n()Iu+\\n()Ru"\c
 .}f
 ..
-.\"
-.\" File Name macro.  This used to be `.PN', for Path Name,
-.\" but Sun doesn't seem to like that very much.
-.\"
-.de FN
-\fI\|\\$1\|\fP
-..
 .SH NAME
 bash \- GNU Bourne-Again SHell
 .SH SYNOPSIS
@@ -6196,7 +6196,7 @@
 This section describes what syntax features are available.  This
 feature is enabled by default for interactive shells, and can be
 disabled using the
-.B \+H
+.B +H
 option to the
 .B set
 builtin command (see


-- System Information:
Debian Release: 6.0.4
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'proposed-updates'), (500, 
'stable')
Architecture: i386 (i586)

Kernel: Linux 2.6.32-41
Locale: LANG=is_IS, LC_CTYPE=is_IS (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/dash

Versions of packages bash depends on:
ii  base-files        6.0squeeze4      Debian base system miscellaneous f
ii  dash              0.5.5.1-7.4      POSIX-compliant shell
ii  debianutils       3.4              Miscellaneous utilities specific t
ii  libc6             2.11.3-2         Embedded GNU C Library: Shared lib
ii  libncurses5       5.7+20100313-5   shared libraries for terminal hand

Versions of packages bash recommends:
pn  bash-completion   <none>           (no description available)

Versions of packages bash suggests:
pn  bash-doc          <none>           (no description available)

-- no debconf information

-- 
Bjarni I. Gislason



Re: excess braces ignored: bug or feature ?

2012-02-21 Thread Chet Ramey
On 2/20/12 2:32 AM, Dan Douglas wrote:

>>> That one really is ignored. No variable named xxx... is actually set.
>>
>> I assume you mean the first one.  It doesn't matter whether or not the
>> variable is set as a side effect of the redirection -- it's in a
>> subshell and disappears.
>>
>> Chet
>
> Oh so a subshell is created after all, and that really is a command
> substitution + redirect! I just chalked it up to Bash recycling the way
> redirects were parsed.

Bash always forks for command substitution.  It defers parsing the command
between the parens until it's needed, and so doesn't notice that it's only
a redirection until it has already forked.  It is able to skip the exec and
dump the file out directly.
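
A small sketch of the effect being described, for anyone following along. It
is only an illustration: the variable names and the temporary file are made
up, and the last part relies on the $(< file) form as implemented by bash.

```shell
#!/bin/sh
# Assignments made inside $(...) happen in a subshell environment, so they
# are gone once the substitution returns.
unset marker
captured=$(marker=changed; echo "inside: $marker")
echo "$captured"            # prints: inside: changed
echo "${marker-unset}"      # prints: unset

# Special case discussed in this thread: in bash, $(< file) expands to the
# contents of the file without running an external command such as cat.
tmpfile=$(mktemp) || exit 1     # hypothetical scratch file for the demo
printf 'hello\n' > "$tmpfile"
contents=$(< "$tmpfile")
rm -f "$tmpfile"
echo "$contents"            # prints: hello (when run by bash)
```

Whether the shell actually forks for either substitution cannot be seen from
the script itself; the variable test only shows that the environment is
separate.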

 
> I think I ran across that quirk in trying to determine whether $(<file) was a
> special command substitution or its own kind of expansion but couldn't think
> of a way to test it.

It's a special command substitution (there are others).  David has gone to
great lengths to avoid forking in ksh93 wherever possible; I assume this
is one of those places where he's managed to do so.

Chet

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/



Re: excess braces ignored: bug or feature ?

2012-02-21 Thread Chet Ramey
On 2/20/12 4:17 AM, Dan Douglas wrote:
> On Sunday, February 19, 2012 04:25:46 PM Chet Ramey wrote:
>>
>> I assume you mean the first one.  It doesn't matter whether or not the
>> variable is set as a side effect of the redirection -- it's in a
>> subshell and disappears.
>>
>> Chet
>
> Forgot to mention though, it's possible in ksh there is no subshell created
> if you consider this:

The best way to tell for sure is with a system call tracer.





Re: Fix u32toutf8 so it encodes values > 0xFFFF correctly.

2012-02-21 Thread Chet Ramey
On 2/21/12 8:43 AM, John Kearney wrote:

> signed / unsigned isn't really the problem anyway utf-8 only encodes
> up to 0x7fffffff and utf-16 only encodes up to 0x0010ffff.
>
> In my latest version I've pretty much removed all reference to wchar_t
> in unicode.c. It was unnecessary.

It's useful if the platform defines __STDC_ISO_10646__, wchar_t is 32 bits,
and the value is less than 0x7fff.




Re: Bug? in bash setlocale implementation

2012-02-21 Thread Chet Ramey
On 2/21/12 3:51 AM, John Kearney wrote:

> Bash Version: 4.2
> Patch Level: 10
> Release Status: release
>
> Description:
>   Basically if setting the locale fails variable should not be changed.

I disagree.  The assignment was performed correctly and as the user
specified.  The fact that a side effect of the assignment failed should
not mean that the assignment should be undone.

I got enough bug reports when I added the warning.  I'd get at least as
many if I undid a perfectly good assignment statement.

I could see setting $? to a non-zero value if the setlocale() call fails,
but not when the shell is in posix mode.

Chet




Here is a diff of all the changes to the unicode

2012-02-21 Thread John Kearney


Here is a diff of all the changes to the unicode code.

This seems to work OK for me, but it still needs further testing.

My major goal was to make the code clearer and easier to follow,
but I also generally fixed and improved it.

Added a warning message:
./bash -c 'printf string 1\\U8fffStromg 2'
./bash: line 0: printf: warning: U+8fff unsupported in destination charset .UTF-8
string 1U+8fffStromg 2


Added utf32toutf16 and utf32towchar to allow use of wcstombs both when
sizeof(wchar_t) is 2 and when it is 4.

Generally reworked to be consistent with the usual function argument
convention, i.e. destination first, then source.

diff --git a/builtins/printf.def b/builtins/printf.def
index 9eca215..77a8159 100644
--- a/builtins/printf.def
+++ b/builtins/printf.def
@@ -859,15 +859,9 @@ tescape (estart, cp, lenp, sawc)
 	*cp = '\\';
 	return 0;
 	  }
-	if (uvalue <= UCHAR_MAX)
-	  *cp = uvalue;
-	else
-	  {
-	temp = u32cconv (uvalue, cp);
-	cp[temp] = '\0';
-	if (lenp)
-	  *lenp = temp;
-	  }
+	temp = utf32tomb (cp, uvalue);
+	if (lenp)
+	  *lenp = temp;
 	break;
 #endif
 	
diff --git a/externs.h b/externs.h
index 09244fa..ff3f344 100644
--- a/externs.h
+++ b/externs.h
@@ -460,7 +460,7 @@ extern unsigned int falarm __P((unsigned int, unsigned int));
 extern unsigned int fsleep __P((unsigned int, unsigned int));
 
 /* declarations for functions defined in lib/sh/unicode.c */
-extern int u32cconv __P((unsigned long, char *));
+extern int utf32tomb __P((char *, unsigned long));
 
 /* declarations for functions defined in lib/sh/winsize.c */
 extern void get_new_window_size __P((int, int *, int *));
diff --git a/lib/sh/strtrans.c b/lib/sh/strtrans.c
index 2265782..e410cff 100644
--- a/lib/sh/strtrans.c
+++ b/lib/sh/strtrans.c
@@ -28,6 +28,7 @@
 #include <stdio.h>
 #include <chartypes.h>
 
+#include <stdio.h>
 #include "shell.h"
 
 #ifdef ESC
@@ -140,21 +141,10 @@ ansicstr (string, len, flags, sawc, rlen)
 	  for (v = 0; ISXDIGIT ((unsigned char)*s) && temp--; s++)
 		v = (v * 16) + HEXVALUE (*s);
 	  if (temp == ((c == 'u') ? 4 : 8))
-		{
 		  *r++ = '\\';	/* c remains unchanged */
-		  break;
-		}
-	  else if (v <= UCHAR_MAX)
-		{
-		  c = v;
-		  break;
-		}
 	  else
-		{
-		  temp = u32cconv (v, r);
-		  r += temp;
-		  continue;
-		}
+		  r += utf32tomb (r, v);
+	  break;
 #endif
 	case '\\':
 	  break;
diff --git a/lib/sh/unicode.c b/lib/sh/unicode.c
index d34fa08..5cc96bf 100644
--- a/lib/sh/unicode.c
+++ b/lib/sh/unicode.c
@@ -36,13 +36,7 @@
 
 #include <xmalloc.h>
 
-#ifndef USHORT_MAX
-#  ifdef USHRT_MAX
-#define USHORT_MAX USHRT_MAX
-#  else
-#define USHORT_MAX ((unsigned short) ~(unsigned short)0)
-#  endif
-#endif
+#include bashintl.h
 
 #if !defined (STREQ)
 #  define STREQ(a, b) ((a)[0] == (b)[0] && strcmp ((a), (b)) == 0)
@@ -54,13 +48,14 @@ extern const char *locale_charset __P((void));
 extern char *get_locale_var __P((char *));
 #endif
 
-static int u32init = 0;
+const char *charset;
 static int utf8locale = 0;
 #if defined (HAVE_ICONV)
 static iconv_t localconv;
 #endif
 
 #ifndef HAVE_LOCALE_CHARSET
+static char charset_buffer[40]={0};
 static char *
 stub_charset ()
 {
@@ -68,168 +63,267 @@ stub_charset ()
 
   locale = get_locale_var ("LC_CTYPE");
   if (locale == 0 || *locale == 0)
-return "ASCII";
-  s = strrchr (locale, '.');
-  if (s)
 {
-  t = strchr (s, '@');
-  if (t)
-	*t = 0;
-  return ++s;
+  strcpy(charset_buffer, "ASCII");
 }
-  else if (STREQ (locale, "UTF-8"))
-return "UTF-8";
   else
-return "ASCII";
+{
+  s = strrchr (locale, '.');
+  if (s)
+	{
+	  t = strchr (s, '@');
+	  if (t)
+	*t = 0;
+	  strcpy(charset_buffer, s);
+	}
+  else
+	{
+	  strcpy(charset_buffer, locale);
+	}
+  /* free(locale)  If we can Modify the buffer surely we need to free it?*/
+}
+  return charset_buffer;
 }
 #endif
 
-/* u32toascii ? */
+
+#if 0
 int
-u32tochar (wc, s)
- wchar_t wc;
+utf32tobig5 (s, c)
  char *s;
+ unsigned long c;
 {
-  unsigned long x;
   int l;
 
-  x = wc;
-  l = (x <= UCHAR_MAX) ? 1 : ((x <= USHORT_MAX) ? 2 : 4);
-
-  if (x <= UCHAR_MAX)
-s[0] = x & 0xFF;
-  else if (x <= USHORT_MAX)	/* assume unsigned short = 16 bits */
+  if (c <= 0x7F)
 {
-  s[0] = (x >> 8) & 0xFF;
-  s[1] = x & 0xFF;
+  s[0] = (char)c;
+  l = 1;
+}
+  else if ((c >= 0x8000) && (c <= 0x))
+{
+  s[0] = (char)(c>>8);
+  s[1] = (char)(c & 0xFF);
+  l = 2;
 }
   else
 {
-  s[0] = (x >> 24) & 0xFF;
-  s[1] = (x >> 16) & 0xFF;
-  s[2] = (x >> 8) & 0xFF;
-  s[3] = x & 0xFF;
+  /* Error Invalid UTF-8 */
+  l = 0;
 }
   s[l] = '\0';
-  return l;  
+  return l;
 }
-
+#endif
 int
-u32toutf8 (wc, s)
- wchar_t wc;
+utf32toutf8 (s, c)
  char *s;
+ unsigned long c;
 {
   int l;
 
-  l = (wc < 0x0080) ? 1 : ((wc < 0x0800) ? 2 : 3);
-
-  if (wc < 0x0080)
-s[0] = (unsigned char)wc;
-  else if (wc < 0x0800)
+  if (c <= 0x7F)
 {
-  s[0] = (wc >> 6) | 0xc0;
-  s[1] = (wc & 0x3f) | 

Re: Bug? in bash setlocale implementation

2012-02-21 Thread John Kearney
On 02/22/2012 01:52 AM, Chet Ramey wrote:
> On 2/21/12 3:51 AM, John Kearney wrote:
>>
>> Bash Version: 4.2 Patch Level: 10 Release Status: release
>>
>> Description: Basically if setting the locale fails variable
>> should not be changed.
>
> I disagree.  The assignment was performed correctly and as the
> user specified.  The fact that a side effect of the assignment
> failed should not mean that the assignment should be undone.
>
> I got enough bug reports when I added the warning.  I'd get at
> least as many if I undid a perfectly good assignment statement.
>
> I could see setting $? to a non-zero value if the setlocale() call
> fails, but not when the shell is in posix mode.
>
> Chet

OK, I guess that makes sense; it's just that the ksh93 behavior also makes
sense. I guess I can just use some command to check that the charset is
present before I assign it.
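
One possible shape for such a check, as a minimal sketch. The function names
locale_available and set_ctype are made up for this example, and it assumes
the system provides the POSIX locale utility, whose -a option lists the
installed locales.

```shell
#!/bin/sh
# locale_available NAME
# Succeed if NAME appears verbatim in the output of `locale -a`.
# Caveat: some systems list a codeset spelling that differs from the one
# commonly typed (for example a "utf8" suffix rather than "UTF-8"), so an
# exact match can report a usable locale as missing.
locale_available() {
  locale -a 2>/dev/null | grep -Fxq -- "$1"
}

# set_ctype NAME
# Assign LC_CTYPE only when the locale is listed; otherwise leave the
# variable untouched and return non-zero so the caller can react.
set_ctype() {
  if locale_available "$1"; then
    LC_CTYPE=$1
  else
    printf 'locale %s is not available; LC_CTYPE left unchanged\n' "$1" >&2
    return 1
  fi
}
```

A script could then write something like: set_ctype ISO-8859-1 || exit 1.
This gives the "variable unchanged on failure" behaviour at the script level
without changing how the shell itself treats the assignment.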



printf %q ~ not escaped?

2012-02-21 Thread John Kearney
Bash Version: 4.2
Patch Level: 10
Release Status: release

Description:
printf %q "~" not escaped?

which means that this
eval echo $(printf %q "~")
results in your home path, not a ~,
unlike
eval echo $(printf %q "*")

As far as I can see it's the only character that isn't treated as I
expected.
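
Until the behaviour is settled, one way to get a string through eval
unchanged is to single-quote it by hand. The sketch below is an assumption
about what a workaround could look like, not something bash provides;
shell_quote is a hypothetical helper name.

```shell
#!/bin/sh
# shell_quote STRING
# Print STRING wrapped in single quotes, with each embedded single quote
# rewritten as '\'' so the result can be re-read by eval unchanged.
# Limitation: trailing newlines in STRING are lost, because command
# substitution strips them.
shell_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

eval "echo $(shell_quote '~')"    # prints: ~
eval "echo $(shell_quote '*')"    # prints: *
```

Inside single quotes neither tilde expansion nor globbing applies, so the
result does not depend on which characters printf %q chooses to escape.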