Re: [rust-dev] How to find Unicode string length in rustlang

Aravinda VK Wed, 28 May 2014 23:38:22 -0700

I think returning the length of the string in bytes is just fine. The
confusion arose because I didn't know that char_len was available in Rust.
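
To make the two notions of length concrete, here is a minimal sketch. It
assumes a later Rust toolchain than the one this thread was written against:
as far as I know char_len() did not survive to stable Rust, and
chars().count() is the usual replacement. The helper names byte_len and
codepoint_count are made up for illustration.

```rust
/// Length in bytes, i.e. UTF-8 code units. This is what `len()` reports.
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Number of Unicode codepoints (scalar values), counted by walking the string.
fn codepoint_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let name = "ಅರವಿಂದ";
    // Every codepoint in the Kannada block needs 3 bytes in UTF-8,
    // so the byte length is three times the codepoint count.
    println!("bytes: {}", byte_len(name));
    println!("codepoints: {}", codepoint_count(name));
}
```

Note that len() is O(1) while chars().count() has to scan the whole string,
which is one reason to keep the byte length as the default.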


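Python 2.7, Perl, and PHP return the length of a string in bytes; Python 3
returns the number of codepoints, and JS returns the number of UTF-16 code
units (which matches the codepoint count only inside the BMP).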
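
The different answers these languages give can be reproduced from a single
Rust program by counting the same string in three units. This is only a
sketch, assuming a stable Rust toolchain where str::encode_utf16() is
available; the function names are hypothetical.

```rust
/// UTF-8 code units (bytes): what Rust's `len()` reports.
fn utf8_units(s: &str) -> usize {
    s.len()
}

/// UTF-16 code units: what JavaScript's `.length` reports.
fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}

/// Unicode codepoints: what Python 3's `len()` reports.
fn codepoints(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // "h\u{ef}" is "hï" with a precomposed ï; "\u{10000}" lies outside the BMP.
    for s in ["h\u{ef}", "\u{10000}"].iter() {
        println!(
            "{:?}: utf8={} utf16={} codepoints={}",
            s,
            utf8_units(s),
            utf16_units(s),
            codepoints(s)
        );
    }
    // "h\u{ef}"   -> utf8=3 utf16=2 codepoints=2
    // "\u{10000}" -> utf8=4 utf16=2 codepoints=1
}
```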

As long as we can iterate through chars without worrying about byte
length versus codepoint length, it should be fine.

    let unicode_str = String::from_str("ಅರವಿಂದ");
    let v: Vec<char> = unicode_str.as_slice().chars().collect();

    for c in v.iter() {
        println!("{}", c);
    }

I wonder whether chars() is available on String itself, so that we can
avoid calling as_slice().chars().
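
For readers on a later toolchain: in stable Rust (an assumption here, since
this thread predates 1.0), String dereferences to str, so chars() can be
called on a String directly and as_slice() is not needed. A minimal sketch,
with chars_of as a made-up helper name:

```rust
/// Collect the codepoints of a string slice into a vector.
/// A `&String` argument is accepted too, because `&String` coerces to `&str`.
fn chars_of(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    let unicode_str = String::from("ಅರವಿಂದ");

    // chars() is reached through String's Deref to str; no as_slice() call.
    for c in unicode_str.chars() {
        println!("{}", c);
    }

    // The same String can be handed to a function that takes &str.
    let v = chars_of(&unicode_str);
    println!("{} codepoints", v.len());
}
```

Iterating over chars() directly also avoids building the intermediate
Vec<char> when all we need is to visit each codepoint once.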

-- 
Regards
Aravinda | ಅರವಿಂದ
http://aravindavk.in


On Thu, May 29, 2014 at 11:17 AM, Kevin Ballard <ke...@sb.org> wrote:

> On May 28, 2014, at 9:16 PM, Bardur Arantsson <s...@scientician.net>
> wrote:
>
> > Rust:
> >
> >  $ cat
> >  fn main() {
> >    let l = "hï".len();     // Note the accent
> >    println!("{:u}", l);
> >  }
> >  $ rustc hello.rs
> >  $ ./hello
> >  3
> >
> > No matter how defective the notion of "length" may be, personally I
> > think that people will expect the former, but will be very surprised by
> > the latter. There are certainly cases where the JavaScript version is
> > wrong, but I conjecture that it "works" for the vast majority of cases
> > that people and programs are likely to encounter.
>
> The JavaScript version is quite wrong. Isaac points out that NFC vs NFD
> can change the result, although that's really an issue with grapheme
> clusters vs codepoints. More interestingly, JavaScript's idea of string
> length is wrong for anything outside of the BMP:
>
> $ node
> > "𐀀".length
> 2
>
> This is because it was designed for UCS-2 instead of UTF-16, so .length
> actually returns the number of UCS-2 code units in the string.
>
> Incidentally, that means that JavaScript and Rust do have the same
> fundamental definition of length (which is to say, number of code units).
> They just have a different code unit. In JavaScript it's confusing because
> you can learn to use JavaScript quite well without ever realizing that it's
> UCS-2 code units (i.e. that it's not codepoints). In Rust, we're very clear
> that our strings are utf-8 sequences, so it should surprise nobody when the
> length turns out to be the number of utf-8 code units.
>
> FWIW, Go uses utf-8 code units as well, and nobody seems to be confused
> about that.
>
> -Kevin
> _______________________________________________
> Rust-dev mailing list
> Rust-dev@mozilla.org
> https://mail.mozilla.org/listinfo/rust-dev
>
>
