On May 28, 2014, at 9:16 PM, Bardur Arantsson <[email protected]> wrote:
> Rust:
>
> $ cat hello.rs
> fn main() {
> let l = "hï".len(); // Note the accent
> println!("{:u}", l);
> }
> $ rustc hello.rs
> $ ./hello
> 3
>
> No matter how defective the notion of "length" may be, personally I
> think that people will expect the former, but will be very surprised by
> the latter. There are certainly cases where the JavaScript version is
> wrong, but I conjecture that it "works" for the vast majority of cases
> that people and programs are likely to encounter.
The JavaScript version is quite wrong. Isaac points out that NFC vs NFD can
change the result, although that's really an issue with grapheme clusters vs
codepoints. More interestingly, JavaScript's idea of string length is wrong for
anything outside of the BMP:
$ node
> "𐀀".length
2
This is because it was designed for UCS-2 instead of UTF-16, so .length
actually returns the number of UCS-2 code units in the string.
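Here's a rough sketch of what that looks like from the Rust side, assuming
str::encode_utf16 from the standard library: counting UTF-16 code units
reproduces exactly the number JavaScript reports for the same string.

    fn main() {
        let s = "𐀀"; // U+10000, the first codepoint outside the BMP
        // UTF-16 code units: 2, because this codepoint needs a surrogate pair.
        // This matches JavaScript's .length.
        println!("{}", s.encode_utf16().count());
        // Codepoints (Unicode scalar values): 1.
        println!("{}", s.chars().count());
    }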
Incidentally, that means that JavaScript and Rust do have the same fundamental
definition of length (which is to say, number of code units). They just have a
different code unit. In JavaScript it's confusing because you can learn to use
JavaScript quite well without ever realizing that it's UCS-2 code units (i.e.
that it's not codepoints). In Rust, we're very clear that our strings are utf-8
sequences, so it should surprise nobody when the length turns out to be the
number of utf-8 code units.
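Concretely, here's a small sketch of that distinction (assuming the string is
in the precomposed NFC form of "ï"):

    fn main() {
        let s = "hï";
        // UTF-8 code units (bytes): 3, which is what .len() reports.
        println!("{}", s.len());
        // Codepoints: 2 ('h' and 'ï') in NFC; the NFD form
        // ('i' plus a combining diaeresis) would give 3.
        println!("{}", s.chars().count());
    }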
FWIW, Go uses UTF-8 code units as well, and nobody seems to be confused about
that.
-Kevin
