On Fri, Mar 2, 2012 at 12:58 AM, Erik Corry <erik.co...@gmail.com> wrote:
> 2012/3/1 Glenn Adams <gl...@skynav.com>:
> >> I'd like to plead for a solution rather like the one Java has, where
> >> strings are sequences of UTF-16 code units and there are specialized
> >> ways to iterate over them. Looking at this entry from the Unicode FAQ:
> >> http://unicode.org/faq/char_combmark.html#7, there are different ways
> >> to describe the length (and iteration) of a string. The BRS proposal
> >> favours #2, but I think for most applications UTF-16-based #1 is just
> >> fine, and for the applications that want to "do it right", #3 is
> >> almost always the correct solution. Solution #3 needs library support
> >> in any case and has no problems with UTF-16.
> >>
> >> The central point here is that there are combining characters
> >> (accents) that you can't just normalize away. Getting them right has
> >> a lot of the same issues as surrogate pairs (you shouldn't normally
> >> chop them up, they count as one 'character', you can't tell how many
> >> of them there are in a string without looking, etc.). If you can
> >> handle combining characters, then the surrogate pair support falls
> >> out pretty much for free.
> >
> > The problem here is that you are mixing apples and oranges. Although
> > it *may* appear that surrogate pairs and grapheme clusters have
> > features in common, they operate at entirely different semantic
> > levels. A solution that attempts to conflate these two levels is
> > going to cause problems at both levels. A distinction should be
> > maintained between the following levels:
> >
> > (1) encoding units (e.g., UTF-16 code units)
> > (2) Unicode scalar values (code points)
> > (3) grapheme clusters
>
> This distinction is not lost on me. I propose that random-access
> indexing and .length in JS should work on level 1,

That's where we are today: indexing and length are based on 16-bit code
units (of a UTF-16 encoding, as in Java).

> and there should be library support for levels 2 and 3.
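A minimal sketch of the level 1 / level 2 distinction. It uses `codePointAt` and string iteration, APIs that were only standardized later (in ES2015), as a stand-in for the library support being proposed here:

```javascript
const s = "a\u{1D306}b"; // 'a', U+1D306 TETRAGRAM FOR CENTRE, 'b'

// Level 1: .length and indexing count 16-bit code units, so the
// astral character contributes two units (a surrogate pair).
console.log(s.length);        // 4
console.log(s.charCodeAt(1)); // 55348 (0xD834, a high surrogate)

// Level 2: iterating by code point yields three scalar values.
const codePoints = [...s];    // ["a", "\u{1D306}", "b"]
console.log(codePoints.length);             // 3
console.log(s.codePointAt(1).toString(16)); // "1d306"
```

Note that levels 1 and 2 agree for any string confined to the Basic Multilingual Plane; they diverge only when surrogate pairs appear.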
> In order of descending usefulness, I think the order is 1, 3, 2.
> Therefore I don't want to cause a lot of backwards-compatibility
> headaches by prioritizing the efficient handling of level 2.

From the perspective of indexing "Unicode characters", level 2 is the
correct place. Level 3 is useful for higher-level, language/locale-sensitive
text processing, but is not particularly interesting at the basic ES string
processing level; we aren't talking about (and IMO should not be talking
about) a level 3 text processing library in this thread.
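To illustrate why level 3 really is a separate, higher-level operation: a combining mark is one grapheme cluster at level 3 but two items at both levels 1 and 2. The sketch below uses `Intl.Segmenter`, a locale-aware segmentation API that arrived in engines much later, purely as an example of the kind of library support level 3 requires:

```javascript
const s = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT ("é")

console.log(s.length);      // 2 -- level 1: code units
console.log([...s].length); // 2 -- level 2: code points

// Level 3 needs segmentation rules (UAX #29), not just decoding:
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...seg.segment(s)].length); // 1 -- one grapheme cluster
```

This is the point about combining characters above: no amount of surrogate-pair handling at level 2 tells you that these two code points form one user-perceived character.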
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss