Bug#292330: use UTF-8 by default

2007-06-15 Thread Thorsten Glaser
Hi,

wouldn't it be possible to at least make the en_US.UTF-8 locale forcibly
generated, so that it can't be deselected by dpkg-reconfigure locales?

If this one isn't installed, many apps break when I ssh from an OS that
uses exclusively UTF-8 to a Debian box.

Thanks!
//mirabile
-- 
I believe no one can invent an algorithm. One just happens to hit upon it
when God enlightens him. Or only God invents algorithms, we merely copy them.
If you don't believe in God, just consider God as Nature if you won't deny
existence.  -- Coywolf Qi Hunt


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]



Bug#292330: use UTF-8 by default

2007-06-16 Thread martin f krafft
also sprach Thorsten Glaser <[EMAIL PROTECTED]> [2007.06.16.0013 +0100]:
> wouldn't it be possible to at least make the en_US.UTF-8 locale forcibly
> generated, so that it can't be deselected by dpkg-reconfigure locales?

Why en_US? Why not en_GB?

> If this one isn't installed, many apps break when I ssh from an OS that
> uses exclusively UTF-8 to a Debian box.

etch uses UTF-8 by default. And you can always dpkg-reconfigure.

-- 
 .''`.   martin f. krafft <[EMAIL PROTECTED]>
: :'  :  proud Debian developer, author, administrator, and user
`. `'`   http://people.debian.org/~madduck - http://debiansystem.info
  `-  Debian - when you have better things to do than fixing systems




Bug#292330: use UTF-8 by default

2007-06-16 Thread Mike Hommey
On Fri, Jun 15, 2007 at 11:13:29PM +0000, Thorsten Glaser <[EMAIL PROTECTED]>
wrote:
> Hi,
> 
> wouldn't it be possible to at least make the en_US.UTF-8 locale forcibly
> generated, so that it can't be deselected by dpkg-reconfigure locales?
> 
> If this one isn't installed, many apps break when I ssh from an OS that
> uses exclusively UTF-8 to a Debian box.

They would break the same way if you'd ssh from an OS exclusively using
anything else (let's say ISO-2022-JP). So should we generate these too?

Mike





Bug#292330: use UTF-8 by default

2007-06-16 Thread Thorsten Glaser
martin f krafft dixit:

>Why en_US? Why not en_GB?

Because how many applications come with en_US data files and how
many speak proper English? I know it's sad, but it happens to be
like this.

>> If this one isn't installed, many apps break when I ssh from an OS that
>> uses exclusively UTF-8 to a Debian box.
>
>etch uses UTF-8 by default. And you can always dpkg-reconfigure.

Funnily I cannot if I'm not root. And I've seen etch boxen where
en_US.UTF-8 was not installed.
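
A non-root user can at least find out whether a given locale is usable on a box by asking setlocale(3) for it and watching for failure. Below is a minimal Python sketch of that check. The helper name `locale_available` is made up for illustration, and the answer depends entirely on which locales the admin has generated on that system.

```python
import locale


def locale_available(name, category=locale.LC_CTYPE):
    """Return True if setlocale() accepts `name` on this system (hypothetical helper)."""
    saved = locale.setlocale(category)  # query the current setting without changing it
    try:
        locale.setlocale(category, name)
        return True
    except locale.Error:
        # Raised when the C library cannot load the requested locale
        return False
    finally:
        locale.setlocale(category, saved)  # restore whatever was active before


for name in ("C", "en_US.UTF-8"):
    print(name, "available" if locale_available(name) else "missing")
```

The "C" locale should always be reported as available; en_US.UTF-8 only if it has been generated.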

//mirabile
-- 
  "Using Lynx is like wearing a really good pair of shades: cuts out
   the glare and harmful UV (ultra-vanity), and you feel so-o-o COOL."
 -- Henry Nelson, March 1999





Bug#292330: use UTF-8 by default

2007-06-16 Thread martin f krafft
also sprach Thorsten Glaser <[EMAIL PROTECTED]> [2007.06.16.1323 +0100]:
> Funnily I cannot if I'm not root. And I've seen etch boxen where
> en_US.UTF-8 was not installed.

Then please bug the admin.

-- 
 .''`.   martin f. krafft <[EMAIL PROTECTED]>
: :'  :  proud Debian developer, author, administrator, and user
`. `'`   http://people.debian.org/~madduck - http://debiansystem.info
  `-  Debian - when you have better things to do than fixing systems




Bug#292330: use UTF-8 by default

2007-06-16 Thread Thorsten Glaser
martin f krafft dixit:

>also sprach Thorsten Glaser <[EMAIL PROTECTED]> [2007.06.16.1323 +0100]:
>> Funnily I cannot if I'm not root. And I've seen etch boxen where
>> en_US.UTF-8 was not installed.
>
>Then please bug the admin.

That's what I did, but the idea is not to have to do that. (Besides,
"C" is installed by default, so we need some kind of "C.UTF-8", whose
role is – for LC_CTYPE – usually fulfilled by en_US.UTF-8.)

//mirabile
-- 
I believe no one can invent an algorithm. One just happens to hit upon it
when God enlightens him. Or only God invents algorithms, we merely copy them.
If you don't believe in God, just consider God as Nature if you won't deny
existence.  -- Coywolf Qi Hunt



Bug#292330: use UTF-8 by default

2007-06-16 Thread martin f krafft
also sprach Thorsten Glaser <[EMAIL PROTECTED]> [2007.06.16.1528 +0100]:
> That's what I did, but the idea is not to have to do that. (Besides,
> "C" is installed by default, so we need some kind of "C.UTF-8", whose
> role is – for LC_CTYPE – usually fulfilled by en_US.UTF-8.)

Please stop CCing debian-project.

Does a C.UTF-8 exist? If yes, then this is a sound proposal,
I think.

-- 
 .''`.   martin f. krafft <[EMAIL PROTECTED]>
: :'  :  proud Debian developer, author, administrator, and user
`. `'`   http://people.debian.org/~madduck - http://debiansystem.info
  `-  Debian - when you have better things to do than fixing systems




Bug#292330: use UTF-8 by default

2007-06-16 Thread Thorsten Glaser
martin f krafft dixit:

>Please stop CCing debian-project.

I don't.

>Does a C.UTF-8 exist? If yes, then this is a sound proposal,
>I think.

If not, one could probably easily create one. It would have to
have all properties of C except for LC_CTYPE, which it would
have to take from en_US.UTF-8.
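
The effect described above (everything from "C", only LC_CTYPE from a UTF-8 locale) can be approximated per process today by setting the categories separately. This is only a sketch: the function name `use_utf8_ctype` is an assumption, and the call simply fails if en_US.UTF-8 has not been generated, which is exactly the situation this bug is about.

```python
import locale


def use_utf8_ctype(ctype_locale="en_US.UTF-8"):
    """Keep every category at "C" except LC_CTYPE (hypothetical helper).

    Returns True if the UTF-8 LC_CTYPE could be activated, False otherwise.
    """
    locale.setlocale(locale.LC_ALL, "C")  # start from the plain C locale
    try:
        locale.setlocale(locale.LC_CTYPE, ctype_locale)
    except locale.Error:
        return False  # locale not installed; LC_CTYPE stays at "C"
    return True


ok = use_utf8_ctype()
print("UTF-8 LC_CTYPE active:", ok)
print("LC_CTYPE   =", locale.setlocale(locale.LC_CTYPE))
print("LC_NUMERIC =", locale.setlocale(locale.LC_NUMERIC))
print("LC_COLLATE =", locale.setlocale(locale.LC_COLLATE))
```

Whether or not the UTF-8 locale is present, LC_NUMERIC and LC_COLLATE remain at "C", so number formatting and sort order keep the known C behaviour.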

//mirabile
-- 
I believe no one can invent an algorithm. One just happens to hit upon it
when God enlightens him. Or only God invents algorithms, we merely copy them.
If you don't believe in God, just consider God as Nature if you won't deny
existence.  -- Coywolf Qi Hunt





Bug#292330: use UTF-8 by default

2007-06-16 Thread Roger Leigh
martin f krafft <[EMAIL PROTECTED]> writes:

> also sprach Thorsten Glaser <[EMAIL PROTECTED]> [2007.06.16.1528 +0100]:
>> That's what I did, but the idea is not to have to do that. (Besides,
>> "C" is installed by default, so we need some kind of "C.UTF-8", whose
>> role is – for LC_CTYPE – usually fulfilled by en_US.UTF-8.)
>
> Please stop CCing debian-project.
>
> Does a C.UTF-8 exist? If yes, then this is a sound proposal,
> I think.

I believe that the "C" locale is supposed to be US_ASCII only.

Rationale:

ISO C99 §7.4 (ctype: "functions that have locale-specific aspects only
when not in the "C" locale...").  Many of the constraints in this section
for the various is* functions would restrict the "C" locale to
US_ASCII.  UTF-8 would break the constraints over e.g. whitespace
characters, digits, lowercase characters etc.

I'm not sure who would want to use C.UTF-8.  If you want a basic
non-Unicode locale with known behaviour on all platforms, the "C"
locale is useful, particularly when doing serialisation where
localisation would break file formats.  If you want Unicode, why would
you not just use the appropriate .UTF-8 locale for your
language/country?


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?   http://gutenprint.sourceforge.net/
   `-GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Bug#292330: use UTF-8 by default

2007-06-16 Thread Thorsten Glaser
Roger Leigh dixit:

>> Does a C.UTF-8 exist? If yes, then this is a sound proposal,
>> I think.
>
>I believe that the "C" locale is supposed to be US_ASCII only.

That applies to “C” but not to a hypothetical “C.UTF-8” locale,
which would have to be set via setlocale(3) anyway, and differ
from “C” only in “LC_CTYPE” category.

//mirabile
-- 
  “Using Lynx is like wearing a really good pair of shades: cuts out
   the glare and harmful UV (ultra-vanity), and you feel so-o-o COOL.”
 -- Henry Nelson, March 1999



Bug#292330: use UTF-8 by default

2007-06-16 Thread Peter Samuelson

[Thorsten Glaser]
> [...] a hypothetical "C.UTF-8" locale, which would have to be set via
> setlocale(3) anyway, and differ from "C" only in "LC_CTYPE" category.

I suggest a strategy of having locale.config (the script that prompts
you to generate locales at install time) automatically select any
locale which matches s/\..*/.UTF-8/ for the locales the user has
selected.
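
The substitution above can be sketched as follows. This is not the actual locale.config logic, only an illustration of the s/\..*/.UTF-8/ rule applied to a list of selected locale names; the function name `utf8_variants` and the sample names are assumptions. Note that, as with the sed expression, a name without a dot is left unchanged.

```python
import re


def utf8_variants(selected):
    """Return `selected` plus the .UTF-8 counterpart of each dotted name."""
    result = list(selected)
    for name in selected:
        # Same rule as s/\..*/.UTF-8/: replace everything from the first dot on
        variant = re.sub(r"\..*", ".UTF-8", name, count=1)
        if variant not in result:
            result.append(variant)
    return result


print(utf8_variants(["de_DE.ISO-8859-1", "en_GB.UTF-8", "fr_FR"]))
```

For the sample input this adds de_DE.UTF-8 and nothing else, since en_GB.UTF-8 is already selected and fr_FR contains no dot.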

Of course, some users may be annoyed by the additional disk space
(what, another megabyte or so in /usr/lib/locale/locale-archive?) and
the additional CPU usage at install/upgrade time.




Bug#292330: use UTF-8 by default

2007-06-18 Thread Joey Hess
Roger Leigh wrote:
> I'm not sure who would want to use C.UTF-8. 

d-i contains and uses a C.UTF-8 locale. You can find the source in the
installation-locale source package.

> non-Unicode locale with known behaviour on all platforms, the "C"
> locale is useful, particularly when doing serialisation where
> localisation would break file formats.  If you want Unicode, why would
> you not just use the appropriate .UTF-8 locale for your
> language/country?

In the case of d-i, we don't want to include more megabytes of locale
definitions in the installer than we include of translations. ;-)

-- 
see shy jo




Bug#292330: use UTF-8 by default

2007-06-18 Thread Pierre Habouzit
On Sat, Jun 16, 2007 at 05:48:00PM +0100, martin f krafft wrote:
> also sprach Thorsten Glaser <[EMAIL PROTECTED]> [2007.06.16.1528 +0100]:
> > That's what I did, but the idea is not to have to do that. (Besides,
> > "C" is installed by default, so we need some kind of "C.UTF-8", whose
> > role is – for LC_CTYPE – usually fulfilled by en_US.UTF-8.)
> 
> Please stop CCing debian-project.
> 
> Does a C.UTF-8 exist? If yes, then this is a sound proposal,
> I think.

  it's not. We could create a neutral.utf-8 locale for sure, but a
C.utf-8 is really bad, because some programs check the locale for 'C'
and when they find that, use hand-optimized functions to replace the
localized libc ones. And thanks to POSIX, even if it looks gross, it's
totally OK to do that.
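
To make the pattern concrete, here is a hypothetical sketch of such a shortcut. It is not taken from any real program; `is_plain_c` and `to_upper` are invented names, and whether a name like "C.UTF-8" would trigger the fast path depends on how a given program writes the comparison (exact match here, but a prefix match is equally conceivable).

```python
def is_plain_c(locale_name):
    """Exact-match test a program might do before taking an ASCII-only shortcut."""
    return locale_name in ("C", "POSIX")


def to_upper(text, locale_name):
    """Uppercase `text`, using a hand-written ASCII routine in the C locale."""
    if is_plain_c(locale_name):
        # Fast path: only a-z are touched, everything else passes through
        return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)
    # Slow path: full Unicode-aware case mapping
    return text.upper()


print(to_upper("łódź", "C"))
print(to_upper("łódź", "en_US.UTF-8"))
print(is_plain_c("C.UTF-8"))
```

With the exact-match test, "C.UTF-8" falls through to the localized path; a program that only looked at the leading "C" would instead apply its ASCII routine to UTF-8 data.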

  C charset is and should be ascii, that's an assumption you should not
break. In fact, using an 8bit locale would often not harm, but a
multi-byte one would be really really bad (as you would end up with e.g.
strings split in the middle of a code point, *brrr* you definitely don't
want that).

  in d-i it's ok to use C.utf-8 as the number of programs running with
that locale is definitely small and manageable. In a Debian system? No,
it's definitely a _bad_ idea.
-- 
·O·  Pierre Habouzit
··O  [EMAIL PROTECTED]
OOO  http://www.madism.org




Bug#292330: use UTF-8 by default

2007-06-18 Thread Pierre Habouzit
On Mon, Jun 18, 2007 at 10:48:04AM +0100, Pierre Habouzit wrote:
> multi-byte one would be really really bad (as you would end up with e.g.
> strings split in the middle of a code point, *brrr* you definitely don't
> want that).

  I wasn't clear it seems, but what I mean is that if a program assumes it's
dealing with ascii, it may end up splitting strings in the middle of a
code point if it's in fact a utf-8 string. That's why a C.utf-8 is a
very bad idea.
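
For comparison, code that knows it is handling UTF-8 can avoid the problem cheaply, because continuation bytes are recognisable by their top two bits (10xxxxxx). The following is a minimal sketch assuming well-formed UTF-8 input; `truncate_utf8` is an illustrative name, not an existing library function.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut `data` to at most `limit` bytes without splitting a code point.

    Assumes `data` is well-formed UTF-8.
    """
    if len(data) <= limit:
        return data
    end = limit
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a sequence,
    # so back up until the dropped part starts with a lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


raw = "złoty".encode("utf-8")        # 7A C5 82 6F 74 79
print(raw[:2])                       # naive cut: ends with a dangling C5
print(truncate_utf8(raw, 2))         # boundary-aware cut
print(truncate_utf8(raw, 3).decode("utf-8"))
```

A program that assumes ASCII performs the naive cut; the boundary-aware version returns fewer bytes but always something that still decodes.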
-- 
·O·  Pierre Habouzit
··O  [EMAIL PROTECTED]
OOO  http://www.madism.org




Bug#292330: use UTF-8 by default

2007-06-18 Thread Mike Hommey
On Mon, Jun 18, 2007 at 10:48:04AM +0100, Pierre Habouzit <[EMAIL PROTECTED]> 
wrote:
> On Sat, Jun 16, 2007 at 05:48:00PM +0100, martin f krafft wrote:
> > also sprach Thorsten Glaser <[EMAIL PROTECTED]> [2007.06.16.1528 +0100]:
> > > That's what I did, but the idea is not to have to do that. (Besides,
> > > "C" is installed by default, so we need some kind of "C.UTF-8", whose
> > > role is – for LC_CTYPE – usually fulfilled by en_US.UTF-8.)
> > 
> > Please stop CCing debian-project.
> > 
> > Does a C.UTF-8 exist? If yes, then this is a sound proposal,
> > I think.
> 
>   it's not. We could create a neutral.utf-8 locale for sure, but a
> C.utf-8 is really bad, because some programs check the locale for 'C'
> and when they find that, use hand-optimized functions to replace the
> localized libc ones. And thanks to POSIX, even if it looks gross, it's
> totally OK to do that.
> 
>   C charset is and should be ascii, that's an assumption you should not
> break. In fact, using an 8bit locale would often not harm, but a
> multi-byte one would be really really bad (as you would end up with e.g.
> strings split in the middle of a code point, *brrr* you definitely don't
> want that).

Note that you won't get strings split in the middle of a code point with
UTF-8.

Anyways, maybe the general problem is that there should be a way to
generate locales at the user level (and store everything in ~/.locale,
for example)

Mike





Bug#292330: use UTF-8 by default

2007-06-18 Thread Thorsten Glaser
Mike Hommey dixit:

>>   it's not. We could create a neutral.utf-8 locale for sure

Sounds like a plan. Maybe something short and uppercase, akin to
"C" and "POSIX", how about "STD.UTF-8"?

>> but a
>> C.utf-8 is really bad, because some programs check the locale for 'C'
>> and when they find that, use hand-optimized functions to replace the
>> localized libc ones.

Ugh. Really? Nah, please spare me the details.

>Note that you won't get strings split in the middle of a code point with
>UTF-8.

This is possible with UTF-8.

Try this one: ł (U+0142) = C5 82
You can split between the C5 and the 82.
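
A short Python check of the byte values above; the helper `decodes` is just an illustrative name. It shows that neither half of the two-byte sequence is valid UTF-8 on its own.

```python
encoded = "\u0142".encode("utf-8")           # ł, U+0142
print(" ".join(f"{b:02X}" for b in encoded))  # the two bytes in hex

first, second = encoded[:1], encoded[1:]      # split between C5 and 82


def decodes(chunk: bytes) -> bool:
    """Return True if `chunk` is valid UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes(encoded), decodes(first), decodes(second))
```

The lead byte C5 alone is an incomplete sequence and 82 alone is a stray continuation byte, so only the unsplit pair decodes.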

>Anyways, maybe the general problem is that there should be a way to
>generate locales at the user level (and store everything in ~/.locale,
>for example)

That'd be a nice additional idea, but it creates additional problems too,
for example quotas, the question of when these get updated, and the fact
that duplication is always bad. That would probably be a glibc issue then.

//mirabile
-- 
I believe no one can invent an algorithm. One just happens to hit upon it
when God enlightens him. Or only God invents algorithms, we merely copy them.
If you don't believe in God, just consider God as Nature if you won't deny
existence.  -- Coywolf Qi Hunt



Bug#292330: use UTF-8 by default

2007-06-18 Thread Mike Hommey
On Mon, Jun 18, 2007 at 01:09:17PM +0000, Thorsten Glaser <[EMAIL PROTECTED]>
wrote:
> Mike Hommey dixit:
> 
> >>   it's not. We could create a neutral.utf-8 locale for sure
> 
> Sounds like a plan. Maybe something short and uppercase, akin to
> "C" and "POSIX", how about "STD.UTF-8"?
> 
> >> but a
> >> C.utf-8 is really bad, because some programs check the locale for 'C'
> >> and when they find that, use hand-optimized functions to replace the
> >> localized libc ones.
> 
> Ugh. Really? Nah, please spare me the details.
> 
> >Note that you won't get strings split in the middle of a code point with
> >UTF-8.
> 
> This is possible with UTF-8.
> 
> Try this one: ł (U+0142) = C5 82
> You can split between the C5 and the 82.

The vast majority of programs won't split a string in the middle of a
word.

Mike





Bug#292330: use UTF-8 by default

2007-06-18 Thread Lars Wirzenius
On ma, 2007-06-18 at 13:37 +0100, Pierre Habouzit wrote:
> On Mon, Jun 18, 2007 at 10:48:04AM +0100, Pierre Habouzit wrote:
> > multi-byte one would be really really bad (as you would end up with e.g.
> > strings split in the middle of a code point, *brrr* you definitely don't
> > want that).
> 
>   I wasn't clear it seems, but what I mean is that if a program assumes it's
> dealing with ascii,

This buggy assumption seems to happen in every locale, not just C.UTF-8,
and in every other case we treat it as a bug. Is there a standard that
says every C.* locale must have the same single byte character set as
the plain C locale? 

(Incidentally, the standard for the C language does not require the
character set in the C locale to be ASCII; EBCDIC, for example, works as
well. See 5.2.1, "Character sets", for the full description. You might
be able to find a copy of the standard by searching for ISO/IEC 9899.)

-- 
Mulla on halu häkätä ja mulla on siihen taito




Bug#292330: use UTF-8 by default

2007-06-18 Thread Pierre Habouzit
On Mon, Jun 18, 2007 at 06:46:40PM +0300, Lars Wirzenius wrote:
> On ma, 2007-06-18 at 13:37 +0100, Pierre Habouzit wrote:
> > On Mon, Jun 18, 2007 at 10:48:04AM +0100, Pierre Habouzit wrote:
> > > multi-byte one would be really really bad (as you would end up with e.g.
> > > strings split in the middle of a code point, *brrr* you definitely don't
> > > want that).
> > 
> >   I wasn't clear it seems, but what I mean is that if a program assumes it's
> > dealing with ascii,
> 
> This buggy assumption seems to happen in every locale, not just C.UTF-8,
> and in every other case we treat it as a bug. Is there a standard that
> says every C.* locale must have the same single byte character set as
> the plain C locale? 
> 
> (Incidentally, the standard for the C language does not require the
> character set in the C locale to be ASCII; EBCDIC, for example, works as
> well. See 5.2.1, "Character sets", for the full description. You might
> be able to find a copy of the standard by searching for ISO/IEC 9899.)
> 

  I stand corrected, I read POSIX base, chapter 7: indeed, it specifies
how the collation is done, but not how the characters are encoded, so a
C.utf-8 does not seem like such a bad idea. Though I'm quite sure we can
find software that assumes that the character set in the C locale is
always ASCII.

  But okay, maybe it's worth fixing those few.
-- 
·O·  Pierre Habouzit
··O  [EMAIL PROTECTED]
OOO  http://www.madism.org

