Re: I18N issue: case-sensitivity of locale subdirectories

2009-06-08 Thread Marcos Caceres
On Thu, May 7, 2009 at 12:33 PM,  jere.kapy...@nokia.com wrote:
 On 5.5.2009 13.16, ext Marcos Caceres marc...@opera.com wrote:
 On Wed, Apr 29, 2009 at 4:16 PM, Robin Berjon ro...@berjon.com wrote:
 Assume we have two localisation subdirectories:

  locales/en/
  locales/EN/

 What happens? BCP47 (which we reference) is defined to be case-insensitive
 so it doesn't help us much in this respect.

 There are multiple options:

  a) we define a canonical casing and all others are ignored;
  b) we select an order of priority and we only consider one (the first to
 match);
  c) we select an order of priority and we merge them all (in that order,
 with a given precedence rule);
  d) the device on which the user agent is catches fire.

 I think that (a) should be ruled out because as BCP47 tells us, ISO639-1
 recommends lowercase (language codes), ISO3166-1 recommends uppercase
 (country codes), and ISO15924 recommends titlecase (script codes). These are
 different, but likely to be confusing, and I don't think that developers
 should have to worry about that.

 Agreed.

 Because BCP47 is indeed case-insensitive [1], both en and EN (and also
 eN and En) are considered equivalent.

In the spec, the problem is actually simpler (worse!) than that:

 1. authors can create folders that make use of language tags in any
case form (locales/EN-us etc.).
 2. the UA locale is in lowercase form and only attempts to match
folder names in lowercase form.

So, for instance, if the UA locale is en-us, it won't match en-US or
any other case variant.
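
To make the mismatch concrete, here is a minimal Python sketch. It is not
spec text: the function name find_locale_folder and the sample folder list
are made up for illustration. It contrasts an exact match against a match
that lowercases both sides.

```python
def find_locale_folder(ua_locale, folder_names):
    """Return the first folder whose name equals ua_locale, ignoring case.

    Hypothetical helper for illustration only; it is not an algorithm
    taken from the spec.
    """
    wanted = ua_locale.lower()
    for name in folder_names:
        if name.lower() == wanted:
            return name
    return None


# Sample data: the author used the conventional 'en-US' casing.
folders = ["en-US", "fr"]

# Exact comparison, which is what the current spec text amounts to.
exact = "en-us" in folders

# Comparison after lowercasing both sides.
matched = find_locale_folder("en-us", folders)
```

With this sample data, the exact test misses the folder, while the
lowercased comparison finds 'en-US'.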

 While it is probably an oversight
 or error to provide several variants of the same language tag with different
 character case anyway, they need to be considered somehow because they *are*
 equivalent, unless it is made explicit that this is an error in the
 packaging.

This is too harsh, IMO. We need to deal with this in the spec and not
punish authors.

 The path inside the widget's ZIP file is already defined as
 case-insensitive, so it is actually already an error to have two or more
 folders with names that differ only by character case.

Right. It is possible to do this, but it requires some skill (on
Windows) or a case-sensitive OS.

  Even if some
 implementation unzips the content of the widget to a local filesystem, we
 have no control over whether filenames in that filesystem are
 case-insensitive or case-sensitive.

Right. As I already stated, in the spec, we convert the user agent's
locales list to lower case. However, paths are treated as case sensitive,
so 'en-us' will not match 'en-US' at runtime. This is an issue, and I'm
not sure what to do about it.

 I don't have a strong opinion on this, but I do have a preference for a
 rule based on (b): if multiple locale subdirectories have the same
 case-insensitive name, then the one that comes first in ASCII-code order
 (e.g. in order: EN, En, eN, en) is used and the others are ignored.

 This seems reasonable. I will add this.

 I suggest that the widget packaging rules say that any localized folders
 must be unique in terms of a case-insensitive match, otherwise the packaging
 is invalid [2].

Like I said, I think it's a bit harsh to say that the package is
invalid. I think a conformance checker should warn authors to make
localized folders lower case OR we deal with it in the spec.
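
As a rough illustration of the conformance-checker option, the following
Python sketch groups folder names by their lowercased form and reports any
group with more than one member. The name find_case_collisions and the
shape of the result are assumptions of mine, not anything defined in the
spec.

```python
from collections import defaultdict


def find_case_collisions(folder_names):
    """Map each lowercased name to its case variants, if more than one.

    Hypothetical checker helper: a real conformance checker would turn
    each entry of the result into a warning for the author.
    """
    groups = defaultdict(list)
    for name in folder_names:
        groups[name.lower()].append(name)
    # Keep only the names that collide under a case-insensitive match.
    return {key: sorted(names) for key, names in groups.items()
            if len(names) > 1}


collisions = find_case_collisions(["en", "EN", "fr"])
```

An empty result means that every localized folder name is unique under a
case-insensitive match.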

 This also allows us to not talk about ASCII code ordering.

Right.

 Furthermore, there is then no need to merge the contents of such folders.

Exactly.

 For the degenerate (but unfortunately unavoidable) case where someone has
 managed to slip in two or more such folders, define a canonical casing
 (obvious suggestion: lowercase) and use it, then simply ignore any others.

In the zip file, there are no folders per se. There are only zip
relative paths that act as identifiers. It's only upon decompression
that physical folders may be created (or merged). But, like I said,
the problem is more fundamental than what is being described because
the user agent's locales are in lower case form.

 The argument in favour of only using one is that we already have to merge
 multiple directories, and adding one merge operation for what is in all
 probability a user error seems like too much complexity for little value
 (I'm happy to be contradicted by implementers however). Picking ASCII-code
 order is based on the fact that the directory names must be ASCII here (the
 others must be discarded), and picking the first is arbitrary.

 Thoughts?

 I support b. Added some of your text above to the spec.

 I guess none of a)-d) really fit my observations as such. It's more like
 additional packaging rules + shades of a).

 Note that for comparisons with the widget locale value you still need to
 case-fold [3] everything anyway. There is no guarantee that the widget
 locale matches any localized subfolder name as such, because the widget
 locale itself could use capitalization that really carries no meaning, but
 fails to match any localized folder unless you do a 

Re: I18N issue: case-sensitivity of locale subdirectories

2009-05-07 Thread Jere.Kapyaho
On 5.5.2009 13.16, ext Marcos Caceres marc...@opera.com wrote:
 On Wed, Apr 29, 2009 at 4:16 PM, Robin Berjon ro...@berjon.com wrote:
 Assume we have two localisation subdirectories:
 
  locales/en/
  locales/EN/
 
 What happens? BCP47 (which we reference) is defined to be case-insensitive
 so it doesn't help us much in this respect.
 
 There are multiple options:
 
  a) we define a canonical casing and all others are ignored;
  b) we select an order of priority and we only consider one (the first to
 match);
  c) we select an order of priority and we merge them all (in that order,
 with a given precedence rule);
  d) the device on which the user agent is catches fire.
 
 I think that (a) should be ruled out because as BCP47 tells us, ISO639-1
 recommends lowercase (language codes), ISO3166-1 recommends uppercase
 (country codes), and ISO15924 recommends titlecase (script codes). These are
 different, but likely to be confusing, and I don't think that developers
 should have to worry about that.
 
 Agreed.

Because BCP47 is indeed case-insensitive [1], both en and EN (and also
eN and En) are considered equivalent. While it is probably an oversight
or error to provide several variants of the same language tag with different
character case anyway, they need to be considered somehow because they *are*
equivalent, unless it is made explicit that this is an error in the
packaging.

The path inside the widget's ZIP file is already defined as
case-insensitive, so it is actually already an error to have two or more
folders with names that differ only by character case. Even if some
implementation unzips the content of the widget to a local filesystem, we
have no control over whether filenames in that filesystem are
case-insensitive or case-sensitive.

 I don't have a strong opinion on this, but I do have a preference for a
 rule based on (b): if multiple locale subdirectories have the same
 case-insensitive name, then the one that comes first in ASCII-code order
 (e.g. in order: EN, En, eN, en) is used and the others are ignored.
 
 This seems reasonable. I will add this.

I suggest that the widget packaging rules say that any localized folders
must be unique in terms of a case-insensitive match, otherwise the packaging
is invalid [2]. This also allows us to not talk about ASCII code ordering.
Furthermore, there is then no need to merge the contents of such folders.

For the degenerate (but unfortunately unavoidable) case where someone has
managed to slip in two or more such folders, define a canonical casing
(obvious suggestion: lowercase) and use it, then simply ignore any others.

 The argument in favour of only using one is that we already have to merge
 multiple directories, and adding one merge operation for what is in all
 probability a user error seems like too much complexity for little value
 (I'm happy to be contradicted by implementers however). Picking ASCII-code
 order is based on the fact that the directory names must be ASCII here (the
 others must be discarded), and picking the first is arbitrary.
 
 Thoughts?
 
 I support b. Added some of your text above to the spec.

I guess none of a)-d) really fit my observations as such. It's more like
additional packaging rules + shades of a).

Note that for comparisons with the widget locale value you still need to
case-fold [3] everything anyway. There is no guarantee that the widget
locale matches any localized subfolder name as such, because the widget
locale itself could use capitalization that really carries no meaning, but
fails to match any localized folder unless you do a case-insensitive
comparison. In this case the comparison can be also language-insensitive,
because BCP47 language tags consist of US-ASCII characters.
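
A small Python sketch of what a language-insensitive comparison could look
like: it folds only the ASCII letters A-Z and leaves every other character
untouched, so the result does not depend on any locale-specific casing
rules. The names ascii_fold and tags_match are illustrative only.

```python
import string

# Translation table that maps only ASCII 'A'-'Z' to 'a'-'z'.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_fold(tag):
    """Lowercase the ASCII letters of tag; leave all other characters alone."""
    return tag.translate(_ASCII_LOWER)


def tags_match(a, b):
    """Compare two language tags, ignoring ASCII case only."""
    return ascii_fold(a) == ascii_fold(b)


same = tags_match("EN-us", "en-US")
folded = ascii_fold("zh-Hant-TW")
```

Because well-formed tags contain only US-ASCII characters, folding A-Z is
enough, and no full Unicode case folding is needed.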

Hope this helps,
Jere

[1] http://tools.ietf.org/html/bcp47#section-2.1
[2] http://dev.w3.org/2006/waf/widgets/#invalid-widgets
[3] http://www.w3.org/International/wiki/Case_folding

 [0] http://dev.w3.org/cvsweb/~checkout~/2006/waf/widgets/i18n.html?rev=1.29&content-type=text/html;%20charset=utf-8
 
 --
 Robin Berjon - http://berjon.com/
    Feel like hiring me? Go to http://robineko.com/
 

 --
 Marcos Caceres
 http://datadriven.com.au

-- 
Jere Käpyaho (jere.kapy...@nokia.com)
Specialist, Developer Platforms Standardization
Devices R&D, Nokia Corporation




Re: I18N issue: case-sensitivity of locale subdirectories

2009-05-01 Thread Jonas Sicking
On Wed, Apr 29, 2009 at 7:16 AM, Robin Berjon ro...@berjon.com wrote:
 Hi,

 the following issue has cropped up in the I18N model as described in the
 excellent I18N document from Marcos[0].

 Assume we have two localisation subdirectories:

  locales/en/
  locales/EN/

 What happens? BCP47 (which we reference) is defined to be case-insensitive
 so it doesn't help us much in this respect.

 There are multiple options:

  a) we define a canonical casing and all others are ignored;
  b) we select an order of priority and we only consider one (the first to
 match);
  c) we select an order of priority and we merge them all (in that order,
 with a given precedence rule);
  d) the device on which the user agent is catches fire.

 I think that (a) should be ruled out because as BCP47 tells us, ISO639-1
 recommends lowercase (language codes), ISO3166-1 recommends uppercase
 (country codes), and ISO15924 recommends titlecase (script codes). These are
 different, but likely to be confusing, and I don't think that developers
 should have to worry about that.

 I'd like to reject (d) as out of line with our design preferences.

 I don't have a strong opinion on this, but I do have a preference for a
 rule based on (b): if multiple locale subdirectories have the same
 case-insensitive name, then the one that comes first in ASCII-code order
 (e.g. in order: EN, En, eN, en) is used and the others are ignored.

 The argument in favour of only using one is that we already have to merge
 multiple directories, and adding one merge operation for what is in all
 probability a user error seems like too much complexity for little value
 (I'm happy to be contradicted by implementers however). Picking ASCII-code
 order is based on the fact that the directory names must be ASCII here (the
 others must be discarded), and picking the first is arbitrary.

I strongly agree. (c) would add a lot of code that would likely never
be used. That is bad because dead code is always bad, and also because,
in the rare case where it is actually used, it is much more likely
to be buggy.

I don't have an opinion on which of the two directories should take
priority, as long as it's one of them. I'd probably argue for using
the first in ASCII-code order since that seems the simplest to
implement, but I'm open to other suggestions.
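
For what it's worth, a sketch of rule (b) with ASCII-code ordering fits in
a few lines of Python. The function name pick_locale_folders is made up,
and discarding non-ASCII names up front is my reading of Robin's remark
that the others must be discarded.

```python
def pick_locale_folders(folder_names):
    """Keep one folder per case-insensitive name: the first in ASCII order.

    Illustrative sketch of option (b), not spec text.
    """
    # Non-ASCII names are dropped before any comparison takes place.
    ascii_names = [name for name in folder_names if name.isascii()]

    chosen = {}
    # sorted() orders strings by code point, which for ASCII-only names
    # is ASCII-code order: uppercase letters sort before lowercase ones.
    for name in sorted(ascii_names):
        # The first variant seen for a given lowercased key wins.
        chosen.setdefault(name.lower(), name)
    return chosen


picked = pick_locale_folders(["en", "eN", "En", "EN", "fr"])
```

For the four variants in Robin's example, the sort yields EN, En, eN, en,
so EN is the one that is kept.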

/ Jonas



Re: I18N issue: case-sensitivity of locale subdirectories

2009-04-29 Thread Arthur Barstow

On Apr 29, 2009, at 10:16 AM, ext Robin Berjon wrote:


There are multiple options:

  b) we select an order of priority and we only consider one (the
first to match);


...


I don't have a strong opinion on this, but I do have a preference
for a rule based on (b): if multiple locale subdirectories have the
same case-insensitive name, then the one that comes first in ASCII-
code order (e.g. in order: EN, En, eN, en) is used and the others are
ignored.

The argument in favour of only using one is that we already have to
merge multiple directories, and adding one merge operation for what is
in all probability a user error seems like too much complexity for
little value (I'm happy to be contradicted by implementers however).
Picking ASCII-code order is based on the fact that the directory names
must be ASCII here (the others must be discarded), and picking the
first is arbitrary.

Thoughts?


b) seems like a reasonable choice. I wonder if the I18N WG has any
related guidelines or recommendations we should consider.


-Regards, Art Barstow