Re: UTF-8 regression in guile 1.9.5

2010-01-12 Thread Ludovic Courtès
Hello,

Andy Wingo  writes:

> On Tue 12 Jan 2010 12:25, l...@gnu.org (Ludovic Courtès) writes:
>
>> Andy Wingo  writes:
>>
>>> perhaps we should setlocale in the `guile' binary (but not by default
>>> when used by a library).
>>
>> We could, but it would break programs that have been assuming the ‘C’
>> locale, e.g., when parsing or printing data, etc...
>
> But surely it is the Right Thing;

Perhaps, I don’t know.  Apparently Perl does that, for instance.  What
do others do?  Is there no real reason for POSIX to be this way, other
than backward compatibility?

> is there no way to make a transition to having it there by default?

Not that I can think of.

Thanks,
Ludo’.




Re: UTF-8 regression in guile 1.9.5

2010-01-12 Thread Andy Wingo
Hi,

On Tue 12 Jan 2010 12:25, l...@gnu.org (Ludovic Courtès) writes:

> Andy Wingo  writes:
>
>> perhaps we should setlocale in the `guile' binary (but not by default
>> when used by a library).
>
> We could, but it would break programs that have been assuming the ‘C’
> locale, e.g., when parsing or printing data, etc...

But surely it is the Right Thing; is there no way to make a transition
to having it there by default?

A
-- 
http://wingolog.org/




Re: UTF-8 regression in guile 1.9.5

2010-01-12 Thread Ludovic Courtès
Hello,

Andy Wingo  writes:

> On Mon 11 Jan 2010 14:38, l...@gnu.org (Ludovic Courtès) writes:
>
>> Mike Gran  writes:
>>
>>> But as far as the greater question of the side effects of setting locale
>>> early on startup...  The parsing of any source code files after locale 
>>> is set will be done in that context.  I don't think it would do anything
>>> unexpected. The reader and the port routines tend to do their own parsing,
>>> and don't tend to rely on libc locale-specific routines.  Even so, it 
>>> would take some auditing to prove that there would be no effect.
>>
>> Source files should have the right ‘coding:’ meta anyway.  I just
>> changed the compiler to install the current user locale [0], as that’s
>> typically what a standalone program does.
>
> If we're taking this tack, perhaps we should setlocale in the `guile'
> binary (but not by default when used by a library).

We could, but it would break programs that have been assuming the ‘C’
locale, e.g., when parsing or printing data, etc...

Thanks,
Ludo’.





Re: UTF-8 regression in guile 1.9.5

2010-01-11 Thread Andy Wingo
On Mon 11 Jan 2010 14:38, l...@gnu.org (Ludovic Courtès) writes:

> Mike Gran  writes:
>
>> But as far as the greater question of the side effects of setting locale
>> early on startup...  The parsing of any source code files after locale 
>> is set will be done in that context.  I don't think it would do anything
>> unexpected. The reader and the port routines tend to do their own parsing,
>> and don't tend to rely on libc locale-specific routines.  Even so, it 
>> would take some auditing to prove that there would be no effect.
>
> Source files should have the right ‘coding:’ meta anyway.  I just
> changed the compiler to install the current user locale [0], as that’s
> typically what a standalone program does.

If we're taking this tack, perhaps we should setlocale in the `guile'
binary (but not by default when used by a library).

Andy
-- 
http://wingolog.org/




Re: UTF-8 regression in guile 1.9.5

2010-01-11 Thread Ludovic Courtès
Hi,

Mike Gran  writes:

> But as far as the greater question of the side effects of setting locale
> early on startup...  The parsing of any source code files after locale 
> is set will be done in that context.  I don't think it would do anything
> unexpected. The reader and the port routines tend to do their own parsing,
> and don't tend to rely on libc locale-specific routines.  Even so, it 
> would take some auditing to prove that there would be no effect.

Source files should have the right ‘coding:’ meta anyway.  I just
changed the compiler to install the current user locale [0], as that’s
typically what a standalone program does.  It makes it necessary for
source files to have the right ‘coding:’ since otherwise they could get
read with the current user’s locale encoding, which could be anything [1].

[0] 
http://git.savannah.gnu.org/cgit/guile.git/commit/?id=e6251e7bd98fbc64e9dbf489c8afaf426af46919
[1] 
http://git.savannah.gnu.org/cgit/guile.git/commit/?id=bce5cb56413da437c29628c529cec47649d12eb9
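
To illustrate why a ‘coding:’ declaration makes a source file independent of
the user's locale, here is a minimal Python sketch.  It is not Guile's
implementation: the helper name `declared_coding`, the regular expression, and
the 500-byte scan window are all assumptions made for this example.

```python
import re

# Hypothetical pattern for an Emacs-style "coding: NAME" declaration.
CODING_RE = re.compile(rb"coding:\s*([A-Za-z0-9_.-]+)")


def declared_coding(source_bytes, default=None):
    """Return the encoding named near the top of the file, or `default`."""
    head = source_bytes[:500]  # assumed scan window, chosen for illustration
    match = CODING_RE.search(head)
    return match.group(1).decode("ascii") if match else default


# A Scheme source file that declares its own encoding.
source = ';;; -*- coding: utf-8 -*-\n(display "てみました。")\n'.encode("utf-8")

# With the declaration, the bytes are decoded the same way in every locale.
encoding = declared_coding(source, default="iso-8859-1")
text = source.decode(encoding)
print(encoding)
print(text)
```

Without the declaration the sketch falls back to `default`, which stands in for
"whatever the current user's locale encoding happens to be".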

> If you were to set the locale in Guile,

[...]

I currently think we shouldn’t do it since (1) Guile can be embedded and
it’s the application’s responsibility to set the locale, and (2) it
would be a departure from previous versions of Guile and from POSIX
behavior.

Thanks,
Ludo’.





Re: UTF-8 regression in guile 1.9.5

2010-01-10 Thread Mike Gran
> From: Andy Wingo 


> Hi,
> 
> Reviving an old thread...
> 

> > I think I prefer that the coder take the responsibility of calling
> > setlocale, but, I only think that because it is how C works.  I'm used
> > to that convention.
> 
> I would still prefer ponies and magic, but I realized: if we do a
> setlocale(LC_ALL, "") at the beginning, might that not change e.g. the
> floating point format, or some other locale-related variable, which
> would make Guile modules unreadable, or otherwise semantically different
> or invalid?
> 
> I'm asking because I ran into this bug now:
> 
> scheme@(guile-user)> ,pr (resolve-module '(gnome gtk))
> Throw to key `wrong-type-arg' with args `("procedure-name" "Wrong type 
> argument in position ~A: ~S" (1 #) 
> (#))'.
> Entering the debugger. Type `bt' for a backtrace or `c' to continue.
> 0 debug> bt
> In current input:
> : 13 ERROR: cannot convert to output locale "NONE": 
> ""dynamic-wind""
> 
> So I guess we need a special case for NONE there, or something. I really
> don't understand i18n/l10n.

A LOCALE=NONE is the same as setting locale to any undefined value, e.g.
LOCALE=martian_mars.  There isn't a locale called 'none', so the system 
can provide no clues on how I/O, date format, etc, should be done.

All programs are supposed to start with locale="C", so I guess the
NONE locale is being set explicitly at some point.  On my box, I can't
(setlocale LC_ALL "NONE") since I don't have a NONE locale.

So the fact that the locale is NONE at all seems like a bug to me.

For Guile string conversion, if locale=NONE has some non-buggy meaning, 
I'd probably suggest making NONE the same as UTF-8.  We can do whatever
we want, since the result of the operation of conversion into NONE
is undefined.

But as far as the greater question of the side effects of setting locale
early on startup...  The parsing of any source code files after locale 
is set will be done in that context.  I don't think it would do anything
unexpected. The reader and the port routines tend to do their own parsing,
and don't tend to rely on libc locale-specific routines.  Even so, it 
would take some auditing to prove that there would be no effect.

If you were to set the locale in Guile, you would need to add
a condition to catch if the current LANG envvar isn't set to a valid
locale so you can fall back to the "C" locale.
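
The fallback described above could look roughly like the following Python
sketch.  Python's `locale.setlocale` is used only as a stand-in for the libc
call; the function names `install_user_locale` and `simulate_bad_environment`
are made up for this example, and nothing here is taken from Guile's source.

```python
import locale


def install_user_locale(setter=locale.setlocale):
    """Try the locale named by the environment; fall back to "C" if invalid."""
    try:
        # An empty name means "use LANG / LC_* from the environment".
        return setter(locale.LC_ALL, "")
    except locale.Error:
        # The environment named a locale the system does not provide.
        return setter(locale.LC_ALL, "C")


def simulate_bad_environment(category, name):
    """Stand-in setter that behaves as if the environment's locale is invalid."""
    if name == "":
        raise locale.Error("unsupported locale setting")
    return name


print(install_user_locale())                          # whatever this system provides
print(install_user_locale(simulate_bad_environment))  # falls back to "C"
```

The `setter` parameter exists only so the failure path can be exercised without
depending on which locales are installed on a given machine.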

-Mike Gran




Re: UTF-8 regression in guile 1.9.5

2010-01-09 Thread Andy Wingo
Hi,

Reviving an old thread...

On Fri 11 Dec 2009 16:05, Mike Gran  writes:

>> On Sun 06 Dec 2009 21:43, Linas Vepstas writes:
>>
>> > 2009/12/6 Mike Gran :
>> >>
>> >>> > need to call (setlocale LC_ALL "")
>> >>
>> >> But for Guile to store characters as codepoints, declaring a locale
>> >> is pretty much a requirement now.
>> >
>> > Would it make sense to add (setlocale LC_ALL "") to some default,
>> > e.g. boot-9.scm  ?
>
> If we always call setlocale, legacy code that used UTF-8 and other
> non-Latin locales will just work.  Legacy code that used strings to
> contain binary data would break.
>
> (Of course, UTF-8 strings only worked on Guile 1.8.x so long
> as you either never looked at substrings or chars, or did
> UTF-8 parsing yourself.)
>
> As it is now, the opposite is true: legacy code with strings
> containing binary data will just work; strings containing non-8-bit
> locale encoded strings will break.
>
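> | 1.8.x strings  | setlocale called  | Guile 2.0 will                   |
> | contain        | in 1.8  | in 2.0  |                                  |
> |----------------+---------+---------+----------------------------------|
> | ASCII          | Y/N     | Y/N     | just work                        |
> |----------------+---------+---------+----------------------------------|
> | locale-encoded | Y/N     | Y       | just work                        |
> | strings        |         |         |                                  |
> |----------------+---------+---------+----------------------------------|
> | locale-encoded | Y/N     | N       | interpret string bytes as        |
> | strings        |         |         | Latin-1                          |
> |----------------+---------+---------+----------------------------------|
> | binary data    | Y/N     | Y       | if locale is Latin-1: just work  |
> |                |         |         | if locale is not Latin-1:        |
> |                |         |         | interpret string bytes using     |
> |                |         |         | locale encoding                  |
> |----------------+---------+---------+----------------------------------|
> | binary data    | Y/N     | N       | just work                        |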
>
> I think I prefer that the coder take the responsibility of calling
> setlocale, but, I only think that because it is how C works.  I'm used
> to that convention.

I would still prefer ponies and magic, but I realized: if we do a
setlocale(LC_ALL, "") at the beginning, might that not change e.g. the
floating point format, or some other locale-related variable, which
would make Guile modules unreadable, or otherwise semantically different
or invalid?
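
As a rough illustration of the concern, the decimal separator is one of the
values that the LC_NUMERIC category controls, and in C it affects routines
such as printf and strtod.  The Python sketch below only queries that value;
the helper name `decimal_point_for` is invented for this example, and
"de_DE.UTF-8" is just an example of a locale that may or may not be installed.

```python
import locale


def decimal_point_for(locale_name):
    """Return the LC_NUMERIC decimal separator, or None if the locale is missing."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        return locale.localeconv()["decimal_point"]
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)  # always restore


print(decimal_point_for("C"))            # the "C" locale uses "."
print(decimal_point_for("de_DE.UTF-8"))  # typically "," where installed, else None

# A parser that does not consult the locale is unaffected either way:
# Python's float() always expects "." as the separator.
print(float("3.14"))
```

Code that does its own number parsing, rather than calling locale-sensitive
libc routines, would not change behavior when the locale changes.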

I'm asking because I ran into this bug now:

scheme@(guile-user)> ,pr (resolve-module '(gnome gtk))
Throw to key `wrong-type-arg' with args `("procedure-name" "Wrong type 
argument in position ~A: ~S" (1 #) 
(#))'.
Entering the debugger. Type `bt' for a backtrace or `c' to continue.
0 debug> bt
In current input:
: 13 ERROR: cannot convert to output locale "NONE": 
""dynamic-wind""

So I guess we need a special case for NONE there, or something. I really
don't understand i18n/l10n.

FWIW, it seems that both ruby and python require the user to call
setlocale.

Regards,

Andy
-- 
http://wingolog.org/




Re: UTF-8 regression in guile 1.9.5

2009-12-11 Thread Ludovic Courtès
Hi,

Mike Gran  writes:

> I think I prefer that the coder take the responsibility of calling
> setlocale, but, I only think that because it is how C works.  I'm used
> to that convention.

+1.

Thanks,
Ludo’.





Re: UTF-8 regression in guile 1.9.5

2009-12-11 Thread Linas Vepstas
2009/12/11 Mike Gran :

> I think I prefer that the coder take the responsibility of calling
> setlocale, but, I only think that because it is how C works.  I'm used
> to that convention.
>

OK, works for me.

--linas




Re: UTF-8 regression in guile 1.9.5

2009-12-11 Thread Mike Gran
> From: Andy Wingo 
> Hi,
>
> On Sun 06 Dec 2009 21:43, Linas Vepstas writes:
>
> > 2009/12/6 Mike Gran :
> >>
> >>> > need to call (setlocale LC_ALL "")
> >>
> >> But for Guile to store characters as codepoints, declaring a locale
> >> is pretty much a requirement now.
> >
> > Would it make sense to add (setlocale LC_ALL "") to some default,
> > e.g. boot-9.scm  ?
>
> Mike, I admit I don't follow this completely. Does Linas' suggestion
> make sense? I somehow thought that locales would magically just
> work.

If we always call setlocale, legacy code that used UTF-8 and other
non-Latin locales will just work.  Legacy code that used strings to
contain binary data would break.

(Of course, UTF-8 strings only worked on Guile 1.8.x so long
as you either never looked at substrings or chars, or did
UTF-8 parsing yourself.)

As it is now, the opposite is true: legacy code with strings
containing binary data will just work; strings containing non-8-bit
locale encoded strings will break.

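| 1.8.x strings  | setlocale called  | Guile 2.0 will                   |
| contain        | in 1.8  | in 2.0  |                                  |
|----------------+---------+---------+----------------------------------|
| ASCII          | Y/N     | Y/N     | just work                        |
|----------------+---------+---------+----------------------------------|
| locale-encoded | Y/N     | Y       | just work                        |
| strings        |         |         |                                  |
|----------------+---------+---------+----------------------------------|
| locale-encoded | Y/N     | N       | interpret string bytes as        |
| strings        |         |         | Latin-1                          |
|----------------+---------+---------+----------------------------------|
| binary data    | Y/N     | Y       | if locale is Latin-1: just work  |
|                |         |         | if locale is not Latin-1:        |
|                |         |         | interpret string bytes using     |
|                |         |         | locale encoding                  |
|----------------+---------+---------+----------------------------------|
| binary data    | Y/N     | N       | just work                        |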
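
The "binary data" rows can be sketched in Python, used here only as a neutral
stand-in (how Guile itself reacts to undecodable bytes may differ).  Latin-1
assigns every byte value 0..255 the code point with the same number, so any
byte string survives a round trip; another encoding either rejects the bytes
or reinterprets them.

```python
# Every possible byte value, standing in for arbitrary binary data.
binary = bytes(range(256))

# Interpreted as Latin-1, each byte becomes the code point of the same value,
# so decoding and re-encoding is lossless.
as_latin1 = binary.decode("iso-8859-1")
round_trip = as_latin1.encode("iso-8859-1")
print(round_trip == binary)

# Interpreted as UTF-8, the same bytes are rejected: a lone 0x80 byte is not
# a valid UTF-8 sequence.
try:
    binary.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```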

I think I prefer that the coder take the responsibility of calling
setlocale, but, I only think that because it is how C works.  I'm used
to that convention.

Thanks,

Mike





Re: UTF-8 regression in guile 1.9.5

2009-12-11 Thread Andy Wingo
Hi,

On Sun 06 Dec 2009 21:43, Linas Vepstas  writes:

> 2009/12/6 Mike Gran :
>>
>>> > need to call (setlocale LC_ALL "")
>>
>> But for Guile to store characters as codepoints, declaring a locale
>> is pretty much a requirement now.
>
> Would it make sense to add (setlocale LC_ALL "") to some default,
> e.g. boot-9.scm  ?

Mike, I admit I don't follow this completely. Does Linas' suggestion make
sense? I somehow thought that locales would magically just work.

Cheers,

Andy
-- 
http://wingolog.org/




Re: UTF-8 regression in guile 1.9.5

2009-12-06 Thread Linas Vepstas
2009/12/6 Mike Gran :
>
>> > need to call (setlocale LC_ALL "")
>
> But for Guile to store characters as codepoints, declaring a locale
> is pretty much a requirement now.

Would it make sense to add (setlocale LC_ALL "") to some default,
e.g. boot-9.scm  ?

--linas




Re: UTF-8 regression in guile 1.9.5

2009-12-06 Thread Mike Gran


> > Hmm.  The "ã" is a dead giveaway that you are printing a UTF-8 string
> > that is being interpreted as an ISO-8859-1 string.
> >
> > You've already said that you're in a UTF-8 locale.  It could be that you
> > need to call (setlocale LC_ALL "")
> 
> That cured it.
> 
> > as well as having a setlocale call in your program.
> 
> Doesn't seem to be required, after the above.
> 
> Thanks!
> 
> Why this happened is strange; I'm now investigating.  Sorry to
> have bothered you with something that is dohh .. basic.

1.9.x does work fundamentally differently w.r.t. strings.
The reason for that is because of how strings are now stored.
In 1.8.x, a character was a byte.  In 1.9.x a character is a 
codepoint.

But for Guile to store characters as codepoints, declaring a locale
is pretty much a requirement now.
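
The difference between "a character is a byte" and "a character is a
codepoint" can be sketched in Python, whose `bytes` and `str` types happen to
model the two views.  This is an analogy only, not a description of Guile's
internal string representation.

```python
text = "てみました。"
raw = text.encode("utf-8")

# Byte view (1.8.x-like): each of these characters occupies three bytes.
byte_count = len(raw)

# Codepoint view (1.9.x-like): six characters.
codepoint_count = len(text)
print(byte_count, codepoint_count)

# Taking the "first character" in the byte view yields only a fragment of a
# multi-byte sequence, which is not valid UTF-8 on its own.
first_byte = raw[:1]
first_char = text[0]
try:
    first_byte.decode("utf-8")
    fragment_is_valid = True
except UnicodeDecodeError:
    fragment_is_valid = False
print(first_char, fragment_is_valid)
```

To build codepoints from incoming bytes, the encoding of those bytes has to be
known, which is where the locale comes in.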

-Mike




Re: UTF-8 regression in guile 1.9.5

2009-12-06 Thread Linas Vepstas
2009/12/6 Mike Gran :
>> From: Linas Vepstas 
>
>
>> Then, from the guile prompt, I can evaluate the following:
>>
>>    (new-node "てみました。")
>>
>> and get the output "The name is てみました。"
>>
>>
>> However, in guile-1.9.5, the above gives me:
>>
>>    "The name is ã¦ã¿ã¾ããã"
>
> Hmm.  The "ã" is a dead giveaway that you are printing a UTF-8 string
> that is being interpreted as an ISO-8859-1 string.
>
> You've already said that you're in a UTF-8 locale.  It could be that you
> need to call (setlocale LC_ALL "")

That cured it.

> as well as having a setlocale call in your program.

Doesn't seem to be required, after the above.

Thanks!

Why this happened is strange; I'm now investigating.  Sorry to
have bothered you with something that is dohh .. basic.

--linas




Re: UTF-8 regression in guile 1.9.5

2009-12-06 Thread Mike Gran
> From: Linas Vepstas 


> Then, from the guile prompt, I can evaluate the following:
> 
>    (new-node "てみました。")
> 
> and get the output "The name is てみました。"
> 
> 
> However, in guile-1.9.5, the above gives me:
> 
>    "The name is ã¦ã¿ã¾ããã"

Hmm.  The "ã" is a dead giveaway that you are printing a UTF-8 string
that is being interpreted as an ISO-8859-1 string.

You've already said that you're in a UTF-8 locale.  It could be that you 
need to call (setlocale LC_ALL "") from the command line before entering
(new-node "てみました。") as well as having a setlocale call in your program.
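
The misreading described above can be reproduced outside Guile.  In this
minimal Python sketch (Python is only a stand-in, and the variable names are
made up), the string is encoded as UTF-8 and the same bytes are then decoded
as ISO-8859-1:

```python
text = "てみました。"

# What a UTF-8 terminal or file actually delivers: three bytes per character,
# each sequence starting with the byte 0xE3.
utf8_bytes = text.encode("utf-8")

# Reading those bytes one per character, as ISO-8859-1 does, turns every 0xE3
# into "ã"; several of the remaining bytes map to invisible control characters.
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread))        # 18 characters instead of 6
print(misread.count("ã"))  # 6, one per original character

# No information is lost: reversing the wrong step recovers the original.
recovered = misread.encode("iso-8859-1").decode("utf-8")
print(recovered == text)
```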

Thanks,

Mike Gran