[gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-20 Thread Kerin Millar

On 20/02/2012 07:47, Fabian Groffen wrote:

On 20-02-2012 03:07:33 +, Kerin Millar wrote:

I know that adding LANG=POSIX doesn't do anything in this case but I
have a feeling that its presence would be instructive to new users. If a
user is asked to configure something which isn't present, it often
generates questions which might otherwise be avoided. I've changed
en_US.UTF-8 to en_US.utf8 there for similar reasons.


I don't understand.  UTF-8 is the codeset, that utf8 is recognised as
the same thing is IMO a GNUism.  glibc understands UTF-8 perfectly
fine these days, so it should preferably be used instead.  (Even the
man-page, utf8(7), suggests that.)



Most users don't read man pages. The rationale was that the user can 
copy-paste exactly what they see from locale -a, which might diminish 
the number of questions asked about it via mainstream support channels, 
as well as simplifying the instructions in the sample comment. It was 
just a thought; no big deal.
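
As a rough illustration of why both spellings end up naming the same locale: the sketch below assumes that glibc compares codeset names after dropping non-alphanumeric characters and folding to lower case, which would explain why locale -a prints en_US.utf8 for a locale generated as en_US.UTF-8. The function name normalize_locale_name is hypothetical and only approximates that behaviour.

```shell
# Hypothetical helper: rewrite the codeset part of a locale name the way
# glibc is assumed to normalise it (alphanumerics only, lower case).
normalize_locale_name() {
    name=$1
    case $name in
        *.*)
            base=${name%%.*}    # language_TERRITORY part, e.g. en_US
            rest=${name#*.}     # codeset plus optional @modifier
            case $rest in
                *@*) codeset=${rest%%@*}; modifier=@${rest#*@} ;;
                *)   codeset=$rest; modifier= ;;
            esac
            codeset=$(printf '%s' "$codeset" |
                LC_ALL=C tr -cd '[:alnum:]' |
                LC_ALL=C tr '[:upper:]' '[:lower:]')
            printf '%s.%s%s\n' "$base" "$codeset" "$modifier"
            ;;
        *)
            # No codeset given (e.g. "C" or "POSIX"): leave the name alone.
            printf '%s\n' "$name"
            ;;
    esac
}

normalize_locale_name en_US.UTF-8              # prints "en_US.utf8"
normalize_locale_name de_DE.ISO-8859-15@euro   # prints "de_DE.iso885915@euro"
```

Comparing the normalised forms is one way to check whether a name typed by hand corresponds to an entry in the locale -a listing.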


--Kerin




Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-19 Thread Amadeusz Żołnowski
Excerpts from Ben's message of 2012-02-19 03:04:19 +0100:
 On 19 February 2012 09:00, James Cloos cl...@jhcloos.com wrote:
  Without LC_COLLATE=C things like [a-z]* get a false-positive
  match on files like Makefile. [...]
 
  The real fix is to have root be C.UTF-8.  Which differs from C only
  in that the charset is utf-8.
 
 In my opinion we should set a default environment with the following
 values:
 
 LANG=en_US.UTF-8
 LC_ALL=
 LC_COLLATE=C

Is it only on my setups, or is the name xy_XY.utf8 rather than
xy_XY.UTF-8?


-- 
Amadeusz Żołnowski




Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-19 Thread Ulrich Mueller
 On Sun, 19 Feb 2012, Ben  wrote:

 In my opinion we should set a default environment with the following
 values:

 LANG=en_US.UTF-8
 LC_ALL=
 LC_COLLATE=C

 This offers the best default options to the majority of users, and
 is easy to customize for those who wish to use another locale.

At least, LC_NUMERIC=C should be added to this, otherwise numbers will
be formatted with commas as thousands separators.

Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial units
and letter paper, which isn't optimal for users outside of the U.S.

Ulrich



Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-19 Thread Ben
On 19 February 2012 23:14, Ulrich Mueller u...@gentoo.org wrote:
 On Sun, 19 Feb 2012, Ben  wrote:

 In my opinion we should set a default environment with the following
 values:

 LANG=en_US.UTF-8
 LC_ALL=
 LC_COLLATE=C

 This offers the best default options to the majority of users, and
 is easy to customize for those who wish to use another locale.

 At least, LC_NUMERIC=C should be added to this, otherwise numbers will
 be formatted with commas as thousands separators.

 Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial units
 and letter paper, which isn't optimal for users outside of the U.S.

 Ulrich


I think those users (and that includes myself) should then set LANG to
something more appropriate to their use case.

Ben



[gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-19 Thread Kerin Millar

On 19/02/2012 15:56, Ben wrote:

On 19 February 2012 23:14, Ulrich Mueller u...@gentoo.org wrote:

On Sun, 19 Feb 2012, Ben  wrote:



In my opinion we should set a default environment with the following
values:



LANG=en_US.UTF-8
LC_ALL=


LC_ALL isn't needed here because, unlike other LC_* settings, it does 
not inherit from LANG and, thus, will be undefined anyway. Although the 
above would not directly cause any harm, I am entirely certain that its 
mere presence would encourage users to explicitly define it where they 
most definitely should not. The misinformation that LC_ALL should be 
defined was propagated by the localization doc for rather a long time 
and it was rather challenging to impress upon its maintainers that 
change was required. Let's not repeat old mistakes.
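
The precedence at issue here can be sketched in a few lines of shell. This is a hypothetical helper, not a real command: resolve_locale takes the values of LC_ALL, one specific LC_* variable and LANG as arguments and prints whichever would apply, assuming the usual order (a non-empty LC_ALL wins, then the specific variable, then LANG) and assuming C as the fallback when nothing is set.

```shell
# Hypothetical helper: resolve_locale LC_ALL_VALUE CATEGORY_VALUE LANG_VALUE
# Prints the locale that would apply to a single category.
resolve_locale() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # a non-empty LC_ALL overrides everything
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # otherwise the specific LC_* variable
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # otherwise LANG
    else
        printf '%s\n' C         # assumed fallback when nothing is set
    fi
}

# LC_ALL empty, LC_COLLATE=C, LANG=en_US.UTF-8
resolve_locale "" "C" "en_US.UTF-8"              # prints "C"
# LC_ALL empty, LC_NUMERIC unset, LANG=en_US.UTF-8
resolve_locale "" "" "en_US.UTF-8"               # prints "en_US.UTF-8"
# LC_ALL set: the per-category choice is silently lost
resolve_locale "de_DE.UTF-8" "C" "en_US.UTF-8"   # prints "de_DE.UTF-8"
```

The last call shows why a globally defined LC_ALL is harmful: it masks every more specific setting, including LC_COLLATE=C.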



LC_COLLATE=C



This offers the best default options to the majority of users, and
is easy to customize for those who wish to use another locale.


At least, LC_NUMERIC=C should be added to this, otherwise numbers will
be formatted with commas as thousands separators.

Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial units
and letter paper, which isn't optimal for users outside of the U.S.

Ulrich



I think those users (and that includes myself) should then set LANG to
something more appropriate to their use case.



I agree; the defaults should not be over-engineered. For proper 
localisation, set LANG appropriately and done. The real issue is that 
locale configuration isn't mentioned in the handbook. It does, however, 
mention locale.gen so we're half-way there.


--Kerin




[gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-19 Thread Kerin Millar

On 19/02/2012 01:00, James Cloos wrote:

KM == Kerin Millar kerfra...@gmail.com writes:


KM  Arch also used to define LC_COLLATE=C by default, probably to
KM  mitigate unpredictable behaviour in some applications, but have
KM  since dropped this additional variable so they must have deemed it
KM  no longer necessary.

Without LC_COLLATE=C things like [a-z]* get a false-positive match
on files like Makefile.


Indeed, character classes are a potential minefield. Incidentally, I 
just tested Ubuntu and Arch with only LANG set to a UTF-8 locale:-


$ echo Makefile | sed -re 's/[a-z]//g' # collation rules ignored
M

$ echo Makefile | grep -Eo '[a-z]*' # collation rules ignored
akefile

In neither case are the collation rules being obeyed. In Gentoo, however:-

$ echo Makefile | sed -re 's/[a-z]//g' # collation rules obeyed

$ echo Makefile | grep -Eo '[a-z]*' # collation rules ignored
akefile

Obeying the collation rules is ostensibly the correct thing to do but, 
until everyone starts using named character classes (which will never 
happen), it's not safe. The thing that worries me here is the 
inconsistency in Gentoo. LC_COLLATE=C is sufficient to work around the 
issue but the above makes me wonder why we still need it.
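
For scripts that cannot rely on the caller's environment, two workarounds are sketched below. Both function names are hypothetical. The first pins the C locale for a single command (LC_ALL is used here only in that command's environment, not as a global setting); the second uses a named character class, which does not depend on collation order.

```shell
# 1. Pin the C locale for one command, so that [a-z] is the plain ASCII range.
strip_lower_c_locale() {
    printf '%s\n' "$1" | LC_ALL=C sed -e 's/[a-z]//g'
}

# 2. Use a named character class instead of a range expression.
strip_lower_named_class() {
    printf '%s\n' "$1" | sed -e 's/[[:lower:]]//g'
}

strip_lower_c_locale Makefile      # prints "M"
strip_lower_named_class Makefile   # prints "M"
```

Either form should give the same answer regardless of how LANG or LC_COLLATE happen to be set on the system running the script.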


--Kerin




Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-19 Thread Ulrich Mueller
 On Sun, 19 Feb 2012, Ben  wrote:

 In my opinion we should set a default environment with the
 following values:
 
 LANG=en_US.UTF-8
 LC_ALL=
 LC_COLLATE=C
 
 This offers the best default options to the majority of users, and
 is easy to customize for those who wish to use another locale.
 
 At least, LC_NUMERIC=C should be added to this, otherwise numbers
 will be formatted with commas as thousands separators.
 
 Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial
 units and letter paper, which isn't optimal for users outside of
 the U.S.

 I think those users (and that includes myself) should then set LANG
 to something more appropriate to their use case.

And why should we set the default to a US locale then? IMHO something
like

  LANG=C
  LC_CTYPE=en_US.utf8

would be much less intrusive if you just want UTF-8, without
influencing other i18n variables.

Ulrich



Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-19 Thread William Hubbs
On Sun, Feb 19, 2012 at 11:56:40PM +0800, Ben wrote:
 On 19 February 2012 23:14, Ulrich Mueller u...@gentoo.org wrote:
  On Sun, 19 Feb 2012, Ben  wrote:
 
  In my opinion we should set a default environment with the following
  values:
 
  LANG=en_US.UTF-8
  LC_ALL=
  LC_COLLATE=C
 
  This offers the best default options to the majority of users, and
  is easy to customize for those who wish to use another locale.
 
  At least, LC_NUMERIC=C should be added to this, otherwise numbers will
  be formatted with commas as thousands separators.
 
  Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial units
  and letter paper, which isn't optimal for users outside of the U.S.
 
  Ulrich
 
 
 I think those users (and that includes myself) should then set LANG to
 something more appropriate to their use case.

According to our localization guide, there is a safe default that forces
UTF-8 characters but doesn't force any language. I have the following
single line in /etc/env.d/02locale:

LC_CTYPE=en_US.UTF-8

What do you think?

William





[gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-19 Thread Kerin Millar

On 20/02/2012 00:11, William Hubbs wrote:

On Sun, Feb 19, 2012 at 11:56:40PM +0800, Ben wrote:

On 19 February 2012 23:14, Ulrich Mueller u...@gentoo.org wrote:

On Sun, 19 Feb 2012, Ben  wrote:



In my opinion we should set a default environment with the following
values:



LANG=en_US.UTF-8
LC_ALL=
LC_COLLATE=C



This offers the best default options to the majority of users, and
is easy to customize for those who wish to use another locale.


At least, LC_NUMERIC=C should be added to this, otherwise numbers will
be formatted with commas as thousands separators.

Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial units
and letter paper, which isn't optimal for users outside of the U.S.

Ulrich



I think those users (and that includes myself) should then set LANG to
something more appropriate to their use case.


According to our localization guide, there is a safe default that forces
UTF-8 characters but doesn't force any language. I have the following
single line in /etc/env.d/02locale:

LC_CTYPE=en_US.UTF-8



That looks good but perhaps it should also define LANG=POSIX, which is 
similar to Ulrich's proposal. Something like:


# To configure for your region, set LANG to an appropriate locale, then comment
# or remove LC_CTYPE. Run locale -a to obtain a list of available locales.
LANG=POSIX
LC_CTYPE=en_US.utf8

I know that adding LANG=POSIX doesn't do anything in this case but I 
have a feeling that its presence would be instructive to new users. If a 
user is asked to configure something which isn't present, it often 
generates questions which might otherwise be avoided. I've changed 
en_US.UTF-8 to en_US.utf8 there for similar reasons.


Not to mention that, if one is curious and searches for posix locale 
via Google, the first link is for the Open Group specification :)


I reckon that this, along with some basic information in the handbook, 
would be a step in the right direction.


--Kerin




[gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-19 Thread Fabian Groffen
On 20-02-2012 03:07:33 +, Kerin Millar wrote:
 I know that adding LANG=POSIX doesn't do anything in this case but I 
 have a feeling that its presence would be instructive to new users. If a 
 user is asked to configure something which isn't present, it often 
 generates questions which might otherwise be avoided. I've changed 
 en_US.UTF-8 to en_US.utf8 there for similar reasons.

I don't understand.  UTF-8 is the codeset, that utf8 is recognised as
the same thing is IMO a GNUism.  glibc understands UTF-8 perfectly
fine these days, so it should preferably be used instead.  (Even the
man-page, utf8(7), suggests that.)


-- 
Fabian Groffen
Gentoo on a different level




Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-18 Thread James Cloos
 KM == Kerin Millar kerfra...@gmail.com writes:

KM Arch also used to define LC_COLLATE=C by default, probably to
KM mitigate unpredictable behaviour in some applications, but have
KM since dropped this additional variable so they must have deemed it
KM no longer necessary.

Without LC_COLLATE=C things like [a-z]* get a false-positive match
on files like Makefile.

I recently noticed a bug on b.g.o where the ebuild has something like
doc/[A-Z]* expecting that it will not match doc/some_lowercase_subdir.

The bug, of course, is that glibc fraudulently defaults the latin, greek
and cyrillic locales to case-insensitive.
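
A minimal sketch of how a script in that position could protect itself, assuming it only wants entries whose names start with an upper-case letter. The function name list_uppercase_entries is hypothetical; it combines a named character class with a C locale confined to a subshell, so the result does not depend on the caller's collation settings.

```shell
# Hypothetical helper: list entries in a directory whose names begin with an
# upper-case letter, independent of the caller's locale.
list_uppercase_entries() {
    dir=$1
    (
        cd "$dir" || exit 1
        # Confine the C locale to this subshell only.
        LC_ALL=C
        export LC_ALL
        for entry in [[:upper:]]*; do
            # An unmatched pattern is left as a literal string, so check it.
            if [ -e "$entry" ]; then
                printf '%s\n' "$entry"
            fi
        done
    )
}

# Example layout: two upper-case files and one lower-case directory.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/some_lowercase_subdir"
: > "$demo_dir/Makefile"
: > "$demo_dir/README"
list_uppercase_entries "$demo_dir"   # prints "Makefile" then "README"
rm -rf "$demo_dir"
```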

The real fix is to have root be C.UTF-8.  Which differs from C only in
that the charset is utf-8.

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6



Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-18 Thread Ben
On 19 February 2012 09:00, James Cloos cl...@jhcloos.com wrote:
 Without LC_COLLATE=C things like [a-z]* get a false-positive match
 on files like Makefile. [...]

 The real fix is to have root be C.UTF-8.  Which differs from C only in
 that the charset is utf-8.

In my opinion we should set a default environment with the following values:

LANG=en_US.UTF-8
LC_ALL=
LC_COLLATE=C

This offers the best default options to the majority of users, and is
easy to customize for those who wish to use another locale.

And yes, LC_ALL needs to be empty, because it would override the other
LC_* values.

This should be combined with some good unicode fonts, such as the
LatCyrGr-16 for console, and dejavu for X.

Cheers,
Ben



[gentoo-dev] Re: LANG=en_GB.UTF-8 by default

2012-02-17 Thread Kerin Millar

On 15/02/2012 12:22, Mr. Aaron W. Swenson wrote:

On Wed, Feb 15, 2012 at 12:58:52PM +0100, Francesco R.(vivo) wrote:

As the subject says, could Gentoo change the policy and set a UTF-8 environment
by default?


Perhaps it should define LANG=en_US.UTF-8 as a reasonable default, 
which would be in line with other notable distros. Arch also used to 
define LC_COLLATE=C by default, probably to mitigate unpredictable 
behaviour in some applications, but have since dropped this additional 
variable so they must have deemed it no longer necessary.


I think that having a default configuration file would also raise 
awareness of the importance of locale configuration and make it less 
likely that users configure their systems inappropriately (defining 
LC_ALL, for instance).



P.S. It would be nice to have a wd_WD.UTF-8, with WD standing for world; a single
country is so 1900.


Different countries/regions have different standards and conventions for 
character classification, case conversion, date/numerical/currency 
formatting etc. There's no basis on which to formally standardise a 
world-wide definition.






However, the stage 3, last time I used it, didn't default to a UTF-8
environment, and it didn't default to using and/or including a capable
UTF-8 font. It is something I think we should look at changing.



Yet unicode is a default flag in the standard profiles. Most console 
fonts have poor coverage. The best one I've found thus far is 
LatCyrGr-16 from fonty-rg, which provides good Latin and Cyrillic 
coverage along with some Greek and esoteric punctuation characters. 
Using this font, I've yet to find any developer's name that doesn't 
render as expected while perusing the contents of the portage tree.


Because it is a 512-character font, one loses bold support unless using a
framebuffer console. Given that the default console fonts aren't 
especially useful, it seems a small price to pay.


--Kerin