On May 28, 2014, at 9:16 PM, Bardur Arantsson <[email protected]> wrote:
> Rust:
>
> $ cat hello.rs
> fn main() {
> let l = "hï".len(); // Note the accent
> println!("{:u}", l);
> }
> $ rustc hello.rs
> $ ./hello
> 3
>
> No matter how defective the notion of "length" may be, personally I
> think that people will expect the former, but will be very surprised by
> the latter. There are certainly cases where the JavaScript version is
> wrong, but I conjecture that it "works" for the vast majority of cases
> that people and programs are likely to encounter.
The JavaScript version is quite wrong. Isaac points out that NFC vs NFD can
change the result, although that's really an issue with grapheme clusters vs
codepoints. More interestingly, JavaScript's idea of string length is wrong for
anything outside of the BMP:
$ node
> "𐀀".length
2
This is because it was designed for UCS-2 instead of UTF-16, so .length
actually returns the number of UCS-2 code units in the string.
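Here's a rough sketch of what that looks like from the Rust side, assuming
str::encode_utf16 from the standard library: counting UTF-16 code units
reproduces exactly the number JavaScript reports for the same string.

    fn main() {
        let s = "𐀀"; // U+10000, the first codepoint outside the BMP
        // UTF-16 code units: 2, because this codepoint needs a surrogate pair.
        // This matches JavaScript's .length.
        println!("{}", s.encode_utf16().count());
        // Codepoints (Unicode scalar values): 1.
        println!("{}", s.chars().count());
    }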
Incidentally, that means that JavaScript and Rust do have the same fundamental
definition of length (which is to say, number of code units). They just have a
different code unit. In JavaScript it's confusing because you can learn to use
JavaScript quite well without ever realizing that it's UCS-2 code units (i.e.
that it's not codepoints). In Rust, we're very clear that our strings are utf-8
sequences, so it should surprise nobody when the length turns out to be the
number of utf-8 code units.
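Concretely, here's a small sketch of that distinction (assuming the string is
in the precomposed NFC form of "ï"):

    fn main() {
        let s = "hï";
        // UTF-8 code units (bytes): 3, which is what .len() reports.
        println!("{}", s.len());
        // Codepoints: 2 ('h' and 'ï') in NFC; the NFD form
        // ('i' plus a combining diaeresis) would give 3.
        println!("{}", s.chars().count());
    }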
FWIW, Go uses UTF-8 code units as well, and nobody seems to be confused about
that.
-Kevin
