Re: I18N issue: case-sensitivity of locale subdirectories
On Thu, May 7, 2009 at 12:33 PM, jere.kapy...@nokia.com wrote:
On 5.5.2009 13.16, ext Marcos Caceres marc...@opera.com wrote:
On Wed, Apr 29, 2009 at 4:16 PM, Robin Berjon ro...@berjon.com wrote:

Assume we have two localisation subdirectories:

locales/en/
locales/EN/

What happens? BCP47 (which we reference) is defined to be case-insensitive so it doesn't help us much in this respect. There are multiple options:

a) we define a canonical casing and all others are ignored;
b) we select an order of priority and we only consider one (the first to match);
c) we select an order of priority and we merge them all (in that order, with a given precedence rule);
d) the device on which the user agent is catches fire.

I think that (a) should be ruled out because as BCP47 tells us, ISO639-1 recommends lowercase (language codes), ISO3166-1 recommends uppercase (country codes), and ISO15924 recommends titlecase (script codes). These are different, but likely to be confusing, and I don't think that developers should have to worry about that.

Agreed.

Because BCP47 is indeed case-insensitive [1], both en and EN (and also eN and En) are considered equivalent.

In the spec, the problem is actually simpler (worse!) than that:

1. authors can create folders that make use of language tags in any case form (locales/EN-us etc.).
2. the UA locale is in lowercase form, and the UA only attempts to match folder names in lowercase form.

So, for instance, if the UA locale is en-us, it won't match en-US or any other case variant.

While it is probably an oversight or error to provide several variants of the same language tag with different character case anyway, they need to be considered somehow because they *are* equivalent, unless it is made explicit that this is an error in the packaging.

This is too harsh, IMO. We need to deal with this in the spec and not punish authors.

The path inside the widget's ZIP file is already defined as case-insensitive, so it is actually already an error to have two or more folders with names that differ only by character case.

Right. It is possible to do this, but it requires some skills (in Windows) or a case-sensitive OS.

Even if some implementation unzips the content of the widget to a local filesystem, we have no control over whether filenames in that filesystem are case-insensitive or case-sensitive.

Right. As I already stated, in the spec, we convert the user agent's locales list to lower case. However, paths are treated as case-sensitive, so 'en-us' will not match 'en-US' at runtime. This is an issue, and I'm not sure what to do about that.

I don't have a strong opinion on this, but I do have a preference for a rule based on (b): if multiple locale subdirectories have the same case-insensitive name, then the one that comes first in ASCII-code order (e.g. in order: EN, En, eN, en) is used and the others are ignored.

This seems reasonable. I will add this.

I suggest that the widget packaging rules say that any localized folders must be unique in terms of a case-insensitive match, otherwise the packaging is invalid [2].

Like I said, I think it's a bit harsh to say that the package is invalid. I think a conformance checker should warn authors to make localized folders lower case OR we deal with it in the spec.

This also allows us to not talk about ASCII code ordering.

Right.

Furthermore, there is then no need to merge the contents of such folders.

Exactly.

For the degenerate (but unfortunately unavoidable) case where someone has managed to slip in two or more such folders, define a canonical casing (obvious suggestion: lowercase) and use it, then simply ignore any others.

In the zip file, there are no folders per se. There are only zip relative paths that act as identifiers. It's only upon decompression that physical folders may be created (or merged). But, like I said, the problem is more fundamental than what is being described because the user agent's locales are in lower case form.

The argument in favour of only using one is that we already have to merge multiple directories, and adding one merge operation for what is in all probability a user error seems like too much complexity for little value (I'm happy to be contradicted by implementers however). Picking ASCII-code order is based on the fact that the directory names must be ASCII here (the others must be discarded), and picking the first is arbitrary. Thoughts?

I support b. Added some of your text above to the spec.

I guess none of a)-d) really fit my observations as such. It's more like additional packaging rules + shades of a). Note that for comparisons with the widget locale value you still need to case-fold [3] everything anyway. There is no guarantee that the widget locale matches any localized subfolder name as such, because the widget locale itself could use capitalization that really carries no meaning, but fails to match any localized folder unless you do a
Re: I18N issue: case-sensitivity of locale subdirectories
On 5.5.2009 13.16, ext Marcos Caceres marc...@opera.com wrote:
On Wed, Apr 29, 2009 at 4:16 PM, Robin Berjon ro...@berjon.com wrote:

Assume we have two localisation subdirectories:

locales/en/
locales/EN/

What happens? BCP47 (which we reference) is defined to be case-insensitive so it doesn't help us much in this respect. There are multiple options:

a) we define a canonical casing and all others are ignored;
b) we select an order of priority and we only consider one (the first to match);
c) we select an order of priority and we merge them all (in that order, with a given precedence rule);
d) the device on which the user agent is catches fire.

I think that (a) should be ruled out because as BCP47 tells us, ISO639-1 recommends lowercase (language codes), ISO3166-1 recommends uppercase (country codes), and ISO15924 recommends titlecase (script codes). These are different, but likely to be confusing, and I don't think that developers should have to worry about that.

Agreed.

Because BCP47 is indeed case-insensitive [1], both en and EN (and also eN and En) are considered equivalent. While it is probably an oversight or error to provide several variants of the same language tag with different character case anyway, they need to be considered somehow because they *are* equivalent, unless it is made explicit that this is an error in the packaging.

The path inside the widget's ZIP file is already defined as case-insensitive, so it is actually already an error to have two or more folders with names that differ only by character case. Even if some implementation unzips the content of the widget to a local filesystem, we have no control over whether filenames in that filesystem are case-insensitive or case-sensitive.

I don't have a strong opinion on this, but I do have a preference for a rule based on (b): if multiple locale subdirectories have the same case-insensitive name, then the one that comes first in ASCII-code order (e.g. in order: EN, En, eN, en) is used and the others are ignored.

This seems reasonable. I will add this.

I suggest that the widget packaging rules say that any localized folders must be unique in terms of a case-insensitive match, otherwise the packaging is invalid [2]. This also allows us to not talk about ASCII code ordering. Furthermore, there is then no need to merge the contents of such folders.

For the degenerate (but unfortunately unavoidable) case where someone has managed to slip in two or more such folders, define a canonical casing (obvious suggestion: lowercase) and use it, then simply ignore any others.

The argument in favour of only using one is that we already have to merge multiple directories, and adding one merge operation for what is in all probability a user error seems like too much complexity for little value (I'm happy to be contradicted by implementers however). Picking ASCII-code order is based on the fact that the directory names must be ASCII here (the others must be discarded), and picking the first is arbitrary. Thoughts?

I support b. Added some of your text above to the spec.

I guess none of a)-d) really fit my observations as such. It's more like additional packaging rules + shades of a).

Note that for comparisons with the widget locale value you still need to case-fold [3] everything anyway. There is no guarantee that the widget locale matches any localized subfolder name as such, because the widget locale itself could use capitalization that really carries no meaning, but fails to match any localized folder unless you do a case-insensitive comparison. In this case the comparison can also be language-insensitive, because BCP47 language tags consist of US-ASCII characters.
Hope this helps,
Jere

[1] http://tools.ietf.org/html/bcp47#section-2.1
[2] http://dev.w3.org/2006/waf/widgets/#invalid-widgets
[3] http://www.w3.org/International/wiki/Case_folding

[0] http://dev.w3.org/cvsweb/~checkout~/2006/waf/widgets/i18n.html?rev=1.29&content-type=text/html;%20charset=utf-8

--
Robin Berjon - http://berjon.com/
Feel like hiring me? Go to http://robineko.com/

--
Marcos Caceres
http://datadriven.com.au

--
Jere Käpyaho (jere.kapy...@nokia.com)
Specialist, Developer Platforms Standardization
Devices R&D, Nokia Corporation
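The two checks discussed in the message above (folder names that must be unique under a case-insensitive match, and a case-insensitive comparison against the widget locale) could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the function names `case_variant_groups` and `match_locale_folder` are hypothetical and do not come from the widgets specification, and non-ASCII folder names are simply skipped because the thread says they must be discarded.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def case_variant_groups(folder_names: List[str]) -> Dict[str, List[str]]:
    """Group folder names that are equal under an ASCII case-insensitive match.

    Only groups with more than one member are returned. Whether such a group
    makes the package invalid or merely triggers a warning is the policy
    question debated in the thread; this helper (hypothetical) only finds them.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in folder_names:
        if name.isascii():  # assumption: non-ASCII names are discarded
            groups[name.lower()].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


def match_locale_folder(widget_locale: str,
                        folder_names: List[str]) -> Optional[str]:
    """Return the first folder whose name equals widget_locale, ignoring case.

    BCP47 tags consist of US-ASCII characters, so plain lowercasing is enough
    and no language-specific case mapping is needed.
    """
    wanted = widget_locale.lower()
    for name in folder_names:
        if name.isascii() and name.lower() == wanted:
            return name
    return None
```

With these definitions, `match_locale_folder("en-us", ["fr", "en-US"])` returns `"en-US"`, which is the match that a case-sensitive path comparison would miss, and `case_variant_groups(["en", "EN", "fr"])` reports `en` and `EN` as one conflicting group.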
Re: I18N issue: case-sensitivity of locale subdirectories
On Wed, Apr 29, 2009 at 7:16 AM, Robin Berjon ro...@berjon.com wrote:

Hi, the following issue has cropped up in the I18N model as described in the excellent I18N document from Marcos[0]. Assume we have two localisation subdirectories:

locales/en/
locales/EN/

What happens? BCP47 (which we reference) is defined to be case-insensitive so it doesn't help us much in this respect. There are multiple options:

a) we define a canonical casing and all others are ignored;
b) we select an order of priority and we only consider one (the first to match);
c) we select an order of priority and we merge them all (in that order, with a given precedence rule);
d) the device on which the user agent is catches fire.

I think that (a) should be ruled out because as BCP47 tells us, ISO639-1 recommends lowercase (language codes), ISO3166-1 recommends uppercase (country codes), and ISO15924 recommends titlecase (script codes). These are different, but likely to be confusing, and I don't think that developers should have to worry about that. I'd like to reject (d) as out of line with our design preferences.

I don't have a strong opinion on this, but I do have a preference for a rule based on (b): if multiple locale subdirectories have the same case-insensitive name, then the one that comes first in ASCII-code order (e.g. in order: EN, En, eN, en) is used and the others are ignored.

The argument in favour of only using one is that we already have to merge multiple directories, and adding one merge operation for what is in all probability a user error seems like too much complexity for little value (I'm happy to be contradicted by implementers however). Picking ASCII-code order is based on the fact that the directory names must be ASCII here (the others must be discarded), and picking the first is arbitrary.

I strongly agree. (c) would add a lot of code that would likely never be used, which is bad because dead code is always bad, and also because in the rare case where it is actually used, it's much more likely to be buggy.

I don't have an opinion on which of the two directories should take priority, as long as it's one of them. I'd probably argue for using the first in ASCII-code order since that seems the simplest to implement, but I'm open to other suggestions.

/ Jonas
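The tie-break favoured in this message (keep the folder that comes first in ASCII-code order, ignore the other case variants) is small enough to sketch in a few lines of Python. This is a hedged illustration, not text from the spec: `pick_locale_folders` is a hypothetical name, and the sketch assumes that non-ASCII directory names are discarded, as stated in the quoted proposal.

```python
from typing import Dict, List


def pick_locale_folders(folder_names: List[str]) -> Dict[str, str]:
    """Map each lowercased locale name to the single folder that is used.

    Among folders that differ only by case, the one that sorts first in
    ASCII-code order wins (uppercase letters sort before lowercase ones),
    and the remaining variants are ignored rather than merged.
    """
    chosen: Dict[str, str] = {}
    # Python compares str values by code point, which for ASCII-only
    # names is exactly ASCII-code order.
    for name in sorted(n for n in folder_names if n.isascii()):
        # setdefault keeps the first (lowest-sorting) variant seen.
        chosen.setdefault(name.lower(), name)
    return chosen
```

For the example from the thread, `pick_locale_folders(["en", "eN", "EN", "En"])` yields `{"en": "EN"}`, since the variants sort as EN, En, eN, en.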
Re: I18N issue: case-sensitivity of locale subdirectories
On Apr 29, 2009, at 10:16 AM, ext Robin Berjon wrote:

There are multiple options:

b) we select an order of priority and we only consider one (the first to match);

...

I don't have a strong opinion on this, but I do have a preference for a rule based on (b): if multiple locale subdirectories have the same case-insensitive name, then the one that comes first in ASCII-code order (e.g. in order: EN, En, eN, en) is used and the others are ignored. The argument in favour of only using one is that we already have to merge multiple directories, and adding one merge operation for what is in all probability a user error seems like too much complexity for little value (I'm happy to be contradicted by implementers however). Picking ASCII-code order is based on the fact that the directory names must be ASCII here (the others must be discarded), and picking the first is arbitrary. Thoughts?

b) seems like a reasonable choice. I wonder if the I18N WG has any related guidelines or recommendations we should consider.

-Regards, Art Barstow