Re: UTF-8 regression in guile 1.9.5
Hello,

Andy Wingo writes:

> On Tue 12 Jan 2010 12:25, l...@gnu.org (Ludovic Courtès) writes:
>
>> Andy Wingo writes:
>>
>>> perhaps we should setlocale in the `guile' binary (but not by default
>>> when used by a library).
>>
>> We could, but it would break programs that have been assuming the ‘C’
>> locale, e.g., when parsing or printing data, etc...
>
> But surely it is the Right Thing;

Perhaps, I don’t know.  Apparently Perl does that, for instance.  What
do others do?  Is there no real reason for POSIX to be this way, other
than backward compatibility?

> is there no way to make a transition to having it there by default?

Not that I can think of.

Thanks,
Ludo’.
Re: UTF-8 regression in guile 1.9.5
Hi,

On Tue 12 Jan 2010 12:25, l...@gnu.org (Ludovic Courtès) writes:

> Andy Wingo writes:
>
>> perhaps we should setlocale in the `guile' binary (but not by default
>> when used by a library).
>
> We could, but it would break programs that have been assuming the ‘C’
> locale, e.g., when parsing or printing data, etc...

But surely it is the Right Thing; is there no way to make a transition
to having it there by default?

A
--
http://wingolog.org/
Re: UTF-8 regression in guile 1.9.5
Hello,

Andy Wingo writes:

> On Mon 11 Jan 2010 14:38, l...@gnu.org (Ludovic Courtès) writes:
>
>> Mike Gran writes:
>>
>>> But as far as the greater question of the side effects of setting locale
>>> early on startup... The parsing of any source code files after locale
>>> is set will be done in that context. I don't think it would do anything
>>> unexpected. The reader and the port routines tend to do their own parsing,
>>> and don't tend to rely on libc locale-specific routines. Even so, it
>>> would take some auditing to prove that there would be no effect.
>>
>> Source files should have the right ‘coding:’ meta anyway. I just
>> changed the compiler to install the current user locale [0], as that’s
>> typically what a standalone program does.
>
> If we're taking this tack, perhaps we should setlocale in the `guile'
> binary (but not by default when used by a library).

We could, but it would break programs that have been assuming the ‘C’
locale, e.g., when parsing or printing data, etc...

Thanks,
Ludo’.
Re: UTF-8 regression in guile 1.9.5
On Mon 11 Jan 2010 14:38, l...@gnu.org (Ludovic Courtès) writes:

> Mike Gran writes:
>
>> But as far as the greater question of the side effects of setting locale
>> early on startup... The parsing of any source code files after locale
>> is set will be done in that context. I don't think it would do anything
>> unexpected. The reader and the port routines tend to do their own parsing,
>> and don't tend to rely on libc locale-specific routines. Even so, it
>> would take some auditing to prove that there would be no effect.
>
> Source files should have the right ‘coding:’ meta anyway. I just
> changed the compiler to install the current user locale [0], as that’s
> typically what a standalone program does.

If we're taking this tack, perhaps we should setlocale in the `guile'
binary (but not by default when used by a library).

Andy
--
http://wingolog.org/
Re: UTF-8 regression in guile 1.9.5
Hi,

Mike Gran writes:

> But as far as the greater question of the side effects of setting locale
> early on startup... The parsing of any source code files after locale
> is set will be done in that context. I don't think it would do anything
> unexpected. The reader and the port routines tend to do their own parsing,
> and don't tend to rely on libc locale-specific routines. Even so, it
> would take some auditing to prove that there would be no effect.

Source files should have the right ‘coding:’ meta anyway.  I just
changed the compiler to install the current user locale [0], as that’s
typically what a standalone program does.  It makes it necessary for
source files to have the right ‘coding:’ since otherwise they could get
read with the current user’s locale encoding, which could be
anything [1].

[0] http://git.savannah.gnu.org/cgit/guile.git/commit/?id=e6251e7bd98fbc64e9dbf489c8afaf426af46919
[1] http://git.savannah.gnu.org/cgit/guile.git/commit/?id=bce5cb56413da437c29628c529cec47649d12eb9

> If you were to set the locale in Guile, [...]

I currently think we shouldn’t do it since (1) Guile can be embedded and
it’s the application’s responsibility to set the locale, and (2) it
would be a departure from previous versions of Guile and from POSIX
behavior.

Thanks,
Ludo’.
Re: UTF-8 regression in guile 1.9.5
> From: Andy Wingo
> Hi,
>
> Reviving an old thread...
>
> > I think I prefer that the coder take the responsibility of calling
> > setlocale, but, I only think that because it is how C works. I'm used
> > to that convention.
>
> I would still prefer ponies and magic, but I realized: if we do a
> setlocale(LC_ALL, "") at the beginning, might that not change e.g. the
> floating point format, or some other locale-related variable, which
> would make Guile modules unreadable, or otherwise semantically different
> or invalid?
>
> I'm asking because I ran into this bug now:
>
>     scheme@(guile-user)> ,pr (resolve-module '(gnome gtk))
>     Throw to key `wrong-type-arg' with args `("procedure-name" "Wrong type
>     argument in position ~A: ~S" (1 #)
>     (#))'.
>     Entering the debugger. Type `bt' for a backtrace or `c' to continue.
>     0 debug> bt
>     In current input:
>     : 13 ERROR: cannot convert to output locale "NONE":
>     ""dynamic-wind""
>
> So I guess we need a special case for NONE there, or something. I really
> don't understand i18n/l10n.

A LOCALE=NONE is the same as setting the locale to any undefined value,
e.g. LOCALE=martian_mars.  There isn't a locale called 'none', so the
system can provide no clues on how I/O, date format, etc., should be
done.

All programs are supposed to start with locale="C", so I guess the NONE
locale is being set explicitly at some point.  On my box, I can't
(setlocale LC_ALL "NONE") since I don't have a NONE locale.

So the fact that the locale is NONE seems like a bug to me.

For Guile string conversion, if locale=NONE has some non-buggy meaning,
I'd probably suggest making NONE the same as UTF-8.  We can do whatever
we want, since the result of converting into NONE is undefined.

But as far as the greater question of the side effects of setting locale
early on startup...  The parsing of any source code files after locale
is set will be done in that context.  I don't think it would do anything
unexpected.  The reader and the port routines tend to do their own
parsing, and don't tend to rely on libc locale-specific routines.  Even
so, it would take some auditing to prove that there would be no effect.

If you were to set the locale in Guile, you would need to add a
condition to catch the case where the current LANG envvar isn't set to a
valid locale, so you can fall back to the "C" locale.

-Mike Gran
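
[Editor's sketch] The fallback Mike describes in his last paragraph can be written at the libc level roughly as follows. This is a minimal sketch, not Guile's code: the function name install_user_locale_or_c is hypothetical, and the only library behaviour it relies on is that setlocale() returns NULL when the requested locale cannot be installed.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not part of Guile): try to install the locale
   named by the environment (LANG, LC_ALL, ...).  If that name is not a
   locale the system knows, setlocale() returns NULL, and we select the
   "C" locale explicitly instead.  Returns the name of the locale that
   is now in effect. */
const char *install_user_locale_or_c(void)
{
    const char *name = setlocale(LC_ALL, "");
    if (name == NULL)
        name = setlocale(LC_ALL, "C");
    return name;
}
```

An embedding application would call this once at startup, before any locale-dependent I/O.  Some C libraries accept locale names that others reject, so the fallback branch may not trigger on every system.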
Re: UTF-8 regression in guile 1.9.5
Hi,

Reviving an old thread...

On Fri 11 Dec 2009 16:05, Mike Gran writes:

>> On Sun 06 Dec 2009 21:43, Linas Vepstas writes:
>>
>> > 2009/12/6 Mike Gran :
>> >>
>> >>> > need to call (setlocale LC_ALL "")
>> >>
>> >> But for Guile to store characters as codepoints, declaring a locale
>> >> is pretty much a requirement now.
>> >
>> > Would it make sense to add (setlocale LC_ALL "") to some default,
>> > e.g. boot-9.scm ?
>
> If we always call setlocale, legacy code that used UTF-8 and other
> non-Latin locales will just work. Legacy code that used strings to
> contain binary data would break.
>
> (Of course, UTF-8 strings only worked on Guile 1.8.x so long
> as you either never looked at substrings or chars, or did
> UTF-8 parsing yourself.)
>
> As it is now, the opposite is true: legacy code with strings
> containing binary data will just work; strings containing non-8-bit
> locale encoded strings will break.
>
> | 1.8.x strings   | setlocale called | Guile 2.0 will
> | contain         | in 1.8  | in 2.0 |
> |-----------------+---------+--------+------------------------------------
> | ASCII           | Y/N     | Y/N    | just work
> | locale-encoded  | Y/N     | Y      | just work
> | strings         |         |        |
> | locale-encoded  | Y/N     | N      | interpret string bytes as Latin-1
> | strings         |         |        |
> | binary data     | Y/N     | Y      | if locale is Latin-1: just work;
> |                 |         |        | if locale is not Latin-1: interpret
> |                 |         |        | string bytes using locale encoding
> | binary data     | Y/N     | N      | just work
>
> I think I prefer that the coder take the responsibility of calling
> setlocale, but, I only think that because it is how C works. I'm used
> to that convention.

I would still prefer ponies and magic, but I realized: if we do a
setlocale(LC_ALL, "") at the beginning, might that not change e.g. the
floating point format, or some other locale-related variable, which
would make Guile modules unreadable, or otherwise semantically different
or invalid?

I'm asking because I ran into this bug now:

    scheme@(guile-user)> ,pr (resolve-module '(gnome gtk))
    Throw to key `wrong-type-arg' with args `("procedure-name" "Wrong type argument in position ~A: ~S" (1 #) (#))'.
    Entering the debugger. Type `bt' for a backtrace or `c' to continue.
    0 debug> bt
    In current input:
    : 13 ERROR: cannot convert to output locale "NONE": ""dynamic-wind""

So I guess we need a special case for NONE there, or something.  I
really don't understand i18n/l10n.

FWIW, it seems that both ruby and python require the user to call
setlocale.

Regards,

Andy
--
http://wingolog.org/
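
[Editor's sketch] On the floating-point question: in C, setlocale(LC_ALL, "") does change the decimal separator that printf and strtod use when the user's locale has one other than '.'.  One common way around it, shown here as a sketch under that assumption, is to adopt the user's locale and then pin LC_NUMERIC back to "C".  Both function names are hypothetical, and whether Guile's own reader is affected at all is exactly what the thread leaves open.

```c
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: take the user's locale for everything except
   number formatting, so printf/strtod keep '.' as the decimal point. */
void install_user_locale_keep_c_numeric(void)
{
    setlocale(LC_ALL, "");       /* may fail; the current locale then stays */
    setlocale(LC_NUMERIC, "C");  /* pin the decimal separator to '.' */
}

/* Format x with one decimal place into buf; returns snprintf's result. */
int format_one_decimal(double x, char *buf, size_t size)
{
    return snprintf(buf, size, "%.1f", x);
}
```

With LC_NUMERIC pinned this way, format_one_decimal(1.5, ...) yields the text 1.5 whatever LANG is set to.  Without the second setlocale call, a locale such as de_DE would yield 1,5 instead.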
Re: UTF-8 regression in guile 1.9.5
Hi,

Mike Gran writes:

> I think I prefer that the coder take the responsibility of calling
> setlocale, but, I only think that because it is how C works. I'm used
> to that convention.

+1.

Thanks,
Ludo’.
Re: UTF-8 regression in guile 1.9.5
2009/12/11 Mike Gran :

> I think I prefer that the coder take the responsibility of calling
> setlocale, but, I only think that because it is how C works. I'm used
> to that convention.
>

OK works for me.

--linas
Re: UTF-8 regression in guile 1.9.5
> From: Andy Wingo
> Hi,
>
> On Sun 06 Dec 2009 21:43, Linas Vepstas writes:
>
> > 2009/12/6 Mike Gran :
> >>
> >>> > need to call (setlocale LC_ALL "")
> >>
> >> But for Guile to store characters as codepoints, declaring a locale
> >> is pretty much a requirement now.
> >
> > Would it make sense to add (setlocale LC_ALL "") to some default,
> > e.g. boot-9.scm ?
>
> Mike I admit I don't follow this completely. Does Linas' suggestion
> make sense? I somehow thought that locales would magically just
> work.

If we always call setlocale, legacy code that used UTF-8 and other
non-Latin locales will just work.  Legacy code that used strings to
contain binary data would break.

(Of course, UTF-8 strings only worked on Guile 1.8.x so long as you
either never looked at substrings or chars, or did UTF-8 parsing
yourself.)

As it is now, the opposite is true: legacy code with strings containing
binary data will just work; strings containing non-8-bit locale encoded
strings will break.

| 1.8.x strings   | setlocale called | Guile 2.0 will
| contain         | in 1.8  | in 2.0 |
|-----------------+---------+--------+------------------------------------
| ASCII           | Y/N     | Y/N    | just work
| locale-encoded  | Y/N     | Y      | just work
| strings         |         |        |
| locale-encoded  | Y/N     | N      | interpret string bytes as Latin-1
| strings         |         |        |
| binary data     | Y/N     | Y      | if locale is Latin-1: just work;
|                 |         |        | if locale is not Latin-1: interpret
|                 |         |        | string bytes using locale encoding
| binary data     | Y/N     | N      | just work

I think I prefer that the coder take the responsibility of calling
setlocale, but, I only think that because it is how C works.  I'm used
to that convention.

Thanks,

Mike
Re: UTF-8 regression in guile 1.9.5
Hi,

On Sun 06 Dec 2009 21:43, Linas Vepstas writes:

> 2009/12/6 Mike Gran :
>>
>>> > need to call (setlocale LC_ALL "")
>>
>> But for Guile to store characters as codepoints, declaring a locale
>> is pretty much a requirement now.
>
> Would it make sense to add (setlocale LC_ALL "") to some default,
> e.g. boot-9.scm ?

Mike I admit I don't follow this completely.  Does Linas' suggestion
make sense?  I somehow thought that locales would magically just work.

Cheers,

Andy
--
http://wingolog.org/
Re: UTF-8 regression in guile 1.9.5
2009/12/6 Mike Gran :
>
>> > need to call (setlocale LC_ALL "")
>
> But for Guile to store characters as codepoints, declaring a locale
> is pretty much a requirement now.

Would it make sense to add (setlocale LC_ALL "") to some default,
e.g. boot-9.scm ?

--linas
Re: UTF-8 regression in guile 1.9.5
> > Hmm. The "ã" is a dead giveaway that you are printing a UTF-8 string
> > that is being interpreted as an ISO-8859-1 string.
> >
> > You've already said that you're in a UTF-8 locale. It could be that you
> > need to call (setlocale LC_ALL "")
>
> That cured it.
>
> > as well as having a setlocale call in your program.
>
> Doesn't seem to be required, after the above.
>
> Thanks!
>
> Why this happened is strange; I'm now investigating. Sorry to
> have bothered you with something that is dohh .. basic.

1.9.x does work fundamentally differently w.r.t. strings.  The reason is
how strings are now stored.  In 1.8.x, a character was a byte.  In 1.9.x
a character is a codepoint.

But for Guile to store characters as codepoints, declaring a locale is
pretty much a requirement now.

-Mike
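
[Editor's sketch] The byte-versus-codepoint distinction can be made concrete with a small counter.  This is only an illustration, not how Guile stores strings: the function name utf8_codepoint_count is hypothetical, and it assumes well-formed UTF-8 input, counting every byte that is not a continuation byte (bit pattern 10xxxxxx).

```c
#include <stddef.h>

/* Hypothetical helper: count the code points in a well-formed,
   NUL-terminated UTF-8 string.  Continuation bytes have the form
   10xxxxxx; every other byte starts a new code point. */
size_t utf8_codepoint_count(const char *s)
{
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if ((*p & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string from Linas' report, whose six characters each take three bytes in UTF-8, a byte-oriented view sees 18 "characters" while this function reports 6.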
Re: UTF-8 regression in guile 1.9.5
2009/12/6 Mike Gran :
>> From: Linas Vepstas
>
>
>> Then, from the guile prompt, I can evaluate the following:
>>
>> (new-node "てみました。")
>>
>> and get the output "The name is てみました。"
>>
>>
>> However, in guile-1.9.5, the above gives me:
>>
>> "The name is ã¦ã¿ã¾ããã"
>
> Hmm. The "ã" is a dead giveaway that you are printing a UTF-8 string
> that is being interpreted as an ISO-8859-1 string.
>
> You've already said that you're in a UTF-8 locale. It could be that you
> need to call (setlocale LC_ALL "")

That cured it.

> as well as having a setlocale call in your program.

Doesn't seem to be required, after the above.

Thanks!

Why this happened is strange; I'm now investigating.  Sorry to
have bothered you with something that is dohh .. basic.

--linas
Re: UTF-8 regression in guile 1.9.5
> From: Linas Vepstas

> Then, from the guile prompt, I can evaluate the following:
>
>    (new-node "てみました。")
>
> and get the output "The name is てみました。"
>
>
> However, in guile-1.9.5, the above gives me:
>
>    "The name is ã¦ã¿ã¾ããã"

Hmm.  The "ã" is a dead giveaway that you are printing a UTF-8 string
that is being interpreted as an ISO-8859-1 string.

You've already said that you're in a UTF-8 locale.  It could be that you
need to call (setlocale LC_ALL "") from the command line before entering
(new-node "てみました。") as well as having a setlocale call in your
program.

Thanks,

Mike Gran
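
[Editor's sketch] The diagnosis above can be reproduced without Guile.  The function below, whose name latin1_misread_to_utf8 is hypothetical, deliberately makes the same mistake: it treats each input byte as one ISO-8859-1 character and re-encodes the result as UTF-8 for display.  It is a sketch of the failure mode, not of Guile's port code.

```c
#include <stddef.h>

/* Hypothetical helper: re-encode `in` as UTF-8 while (wrongly) treating
   every input byte as one ISO-8859-1 character.  Returns the number of
   bytes written, excluding the terminating NUL; returns 0 if the input
   is empty or the output buffer is too small. */
size_t latin1_misread_to_utf8(const char *in, char *out, size_t out_size)
{
    size_t n = 0;
    if (out_size == 0)
        return 0;
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        size_t need = (*p < 0x80) ? 1 : 2;
        if (n + need + 1 > out_size)      /* keep room for the NUL */
            return 0;
        if (*p < 0x80) {
            out[n++] = (char)*p;          /* ASCII passes through */
        } else {
            /* U+0080..U+00FF take two bytes in UTF-8. */
            out[n++] = (char)(0xC0 | (*p >> 6));
            out[n++] = (char)(0x80 | (*p & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

Feeding it the three UTF-8 bytes of "て" (E3 81 A6) produces "ã", then the invisible control character U+0081, then "¦", which is the "ã¦" that opens the garbled output in the report.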