On Tue, Apr 24, 2012 at 11:24 PM, Graydon Hoare <gray...@mozilla.com> wrote:

> On 12-04-24 11:30 AM, Matthieu Monrocq wrote:
>
> > However this is on the condition of considering strings as lists of
> > codepoints, and not lists of bytes. Lists of bytes are useful in encoding
> > and decoding operations, but to manipulate Arabic or Korean, they fall
> > short: having users manipulate the strings byte-wise instead of
> > codepoint-wise is a recipe for disaster outside of English and Latin-1
> > representable languages.
>
> Could you elaborate on this a little bit? I'm curious to hear
> impressions -- even if vague or hard to specify -- about the experience
> of working with known-language, non-Latin-1 text. I'm an English-speaker
> and much technical material is English-derived, so usually when I'm
> working with text-processing code, it falls into one of two categories:
>
>  - ASCII-subset by construction (eg. structured-language keywords)
>
>  - Totally unknown language semantics, has to work with everything,
>    can't assume I know anything about the language (eg. "human input")
>
> I am emphatically not saying these are the _only_ two possible
> environments, just the two that I have experience in. So in my
> experience byte-operations in the ASCII range work for the former and using
> a proper language-and-locale-aware unicode library like ICU works for
> the latter. That's where my "usability biases" emerge in the design of str.
>
> In particular I want to know if you would feel that there are common
> operations you expect to be able to do codepoint-at-a-time on the
> datatype "str", that you would not be comfortable doing on the datatype
> "[char]", if you converted str to [char] as a one-time pass in advance
> of performing the operation. That's what I assume people will do if they
> need random (rather than sequential) codepoint access. Sequential access
> we already have iterators for.
>
> But I understand this might not be right; it's a design space with a lot
> of tensions. There are as many different string representations in the
> world as there are opinionated programmers :)


> > I understand that this may seem contradictory to Rust's original
> > direction of utf-8 encoded strings, but having worked with utf-8 strings
> > using C++ `std::string` I can assure you that apart from blindly passing
> > them around, one cannot do much. All modifying operations require the
> > use of Unicode aware libraries... even `substr`.
>
> Naturally so. We're intending to ship a relatively full binding to
> libicu for just this reason. Unicode Text Is Hard To Do By Hand.
>
> (Though, hmm, substr is actually fine on UTF-8, no? You just have to
> land on character boundaries. Which are easy to find; O(1) from any
> given start point -- at most 5 bytes away -- and the guaranteed output
> of any other algorithm that iterates over character boundaries...)
>
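
A minimal sketch of the boundary-snapping idea above, written in present-day
Rust syntax rather than the 2012 syntax used in this thread. The helper names
`snap_to_boundary` and `substr` are hypothetical; `str::is_char_boundary` is
the standard-library check. With UTF-8 as currently specified (sequences of at
most 4 bytes), the loop steps back at most 3 times.

```rust
/// Moves `idx` back to the nearest UTF-8 character boundary at or before it.
/// (Hypothetical helper, not part of the standard library.)
fn snap_to_boundary(s: &str, idx: usize) -> usize {
    // Clamp first so out-of-range offsets land on the end of the string.
    let mut i = idx.min(s.len());
    // `is_char_boundary` is true at 0 and at `s.len()`, so this terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Byte-offset substring that snaps both ends back to character boundaries,
/// so it never slices through the middle of an encoded codepoint.
fn substr(s: &str, start: usize, end: usize) -> &str {
    let a = snap_to_boundary(s, start);
    // Guard against `end < start` after snapping.
    let b = snap_to_boundary(s, end).max(a);
    &s[a..b]
}
```

For example, in "héllo" the "é" occupies bytes 1 and 2, so a request ending at
byte 2 is snapped back to byte 1 instead of panicking.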

Thanks for this answer: I had not considered the ability to do a str ->
[char] -> str with actual Unicode work on the [char] type.

I also did not know about the intent of integrating a subset of libicu.
Indeed with a full library handling [char] correctly, and two simple
facilities to convert back and forth, it would be trivial for the user
to use real Unicode operations (to_lower / to_upper / capitalize are not
fun :x) without too much hassle.
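
The str -> [char] -> str round trip can be sketched as follows. This assumes
present-day Rust, where the thread's `[char]` corresponds roughly to
`Vec<char>`; the function names `to_chars`, `from_chars` and `replace_at` are
made up for illustration.

```rust
/// Decodes a UTF-8 string into a vector of codepoints (a one-time O(n) pass).
fn to_chars(s: &str) -> Vec<char> {
    s.chars().collect()
}

/// Re-encodes a slice of codepoints as a UTF-8 string.
fn from_chars(cs: &[char]) -> String {
    cs.iter().collect()
}

/// Example of random codepoint access: replace the codepoint at `index`.
/// Returns `None` when `index` is past the last codepoint.
fn replace_at(s: &str, index: usize, replacement: char) -> Option<String> {
    let mut cs = to_chars(s);
    // O(1) indexing is possible here because every element is one codepoint.
    let slot = cs.get_mut(index)?;
    *slot = replacement;
    Some(from_chars(&cs))
}
```

The cost is the up-front decoding pass and four bytes per codepoint while the
vector is alive, which is the trade-off discussed above for random (rather
than sequential) access.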

Regarding the use cases I have encountered, they were in a general public
web app:
- wrap-around at a specified length (in number of graphemes, which in the
appropriate canonical form was the number of codepoints in all the
languages we cared for)
- truncation at a specified length (also in number of graphemes)
- sorting lists (the first time we presented a list of countries in Greek,
it was nigh unusable...)

Pretty basic operations: we used ICU for sorting (collation) and conversion
to 32-bit Unicode codepoint values for length operations.

It was all the more funny with Arabic, of course, because of the control
characters for the direction of display which do not have a graphical
representation, but since we counted by hand, we just ignored them.
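
A rough sketch of the length and truncation operations described above, in
present-day Rust. It relies on the same assumption as our app did: the text is
already in a canonical form where one codepoint is one grapheme, which does
not hold in general. The function names are hypothetical, and the set of
directional control characters listed is my understanding of the codepoints
carrying the Unicode Bidi_Control property.

```rust
/// Directional formatting characters that have no visible glyph:
/// ALM, LRM, RLM, the embedding/override controls and the isolate controls.
fn is_bidi_control(c: char) -> bool {
    matches!(
        c,
        '\u{061C}' | '\u{200E}' | '\u{200F}'
            | '\u{202A}'..='\u{202E}'
            | '\u{2066}'..='\u{2069}'
    )
}

/// Number of codepoints, ignoring directional control characters.
fn visible_len(s: &str) -> usize {
    s.chars().filter(|c| !is_bidi_control(*c)).count()
}

/// Longest prefix of `s` holding at most `max` visible codepoints.
/// The cut always falls on a character boundary because the byte
/// positions come from `char_indices`.
fn truncate_visible(s: &str, max: usize) -> &str {
    let mut seen = 0;
    for (byte_pos, c) in s.char_indices() {
        if is_bidi_control(c) {
            continue; // control characters do not count toward the limit
        }
        if seen == max {
            return &s[..byte_pos];
        }
        seen += 1;
    }
    s // fewer than `max` visible codepoints: keep everything
}
```

Sorting is deliberately left out: collation depends on locale-specific rules,
which is why we delegated it to ICU.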


> > Second, I do not think that statically known sizes are so important in
> > the type system. I am a huge fan, and abuser, of the C++ template
> > system, but I will be the first to admit it is really complex and
> > generally poorly understood even among usually savvy C++ users.
> >
> > As I understand, fixed-length vectors were imagined for C-compatibility.
> > Statically allocated buffers have lifetime that exceed that of all other
> > objects in the system, therefore they can perfectly be accessed through
> > slices. Other uses implying C-compatibility should be based on
> > dynamically allocated memory, and the size will be unknown at
> > compilation.
>
> They're useful for a lot of reasons. You can alloca them, which is good
> for small buffers. And a decent number of heap structures also have need
> of small fixed-fanout arrays, caches, lookup tables and the like.
>
> But beyond that they simply _occur_ in the C type system. With annoying
> frequency! We've designed (and intend to maintain) a degree of
> compatibility with C in our structured types: a rust record and a C
> struct containing the same elements ought to be memory-compatible. When
> a C struct has an array in the middle of it, we need to be able to
> represent that somehow. There are a nontrivial number of C structures
> that have that property (or, say, a fixed-sized reserved region). We
> currently address this by having users generate a sequence of fields like:
>
>   pad1: u8; pad2: u8; pad3: u8; pad4: u8;
>
> etc. etc. Not so fun.
>

Ah indeed, to emulate C's layout they seem quite necessary.
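
To illustrate the point, here is a sketch of a record mirroring a C struct
with an array in the middle, written in present-day Rust (`#[repr(C)]` and the
`[u8; 4]` array type are today's syntax, not what existed when this thread was
written). The C declaration and the names `Header` and `header_layout` are
hypothetical.

```rust
/// Hypothetical C declaration being mirrored:
///   struct header { uint32_t tag; uint8_t reserved[4]; uint32_t len; };
#[repr(C)]
#[allow(dead_code)] // fields exist only to reproduce the C layout
struct Header {
    tag: u32,
    // One fixed-length array instead of pad1: u8; pad2: u8; pad3: u8; pad4: u8
    reserved: [u8; 4],
    len: u32,
}

/// Returns (size, alignment) of `Header` in bytes, for comparison with
/// `sizeof` and `_Alignof` on the C side.
fn header_layout() -> (usize, usize) {
    (std::mem::size_of::<Header>(), std::mem::align_of::<Header>())
}
```

With 4-byte `u32` and the usual C layout rules, this comes to 12 bytes with
4-byte alignment, with no padding inserted around the array.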


>
> > In the blog article linked, an issue regarding the variable-size of
> > `rust_vec<T>` is made because it plays havoc with stack-allocation.
> > However, is real stack-allocation necessary here ?
>
> It's not necessary, but if it's not done on the stack, it's done on a
> parallel-to-the-stack LIFO structure (a.k.a. "dynastack"), which we used
> to have, but have removed since we managed to move everything to the
> stack. If we had to re-acquire the dynastack for this purpose, it would
> not be the end of the world, but we'd like to avoid it. It's one more
> moving part.
>

Yes, that is the issue. The more parts there are in a task, the heavier
it gets.

> > I hope my suggestions are reasonable. Do feel free to ignore them if
> > they are not!
>
> Quite reasonable. I hope I've provided useful answers to some.
>
> -Graydon
>

Thank you very much for the detailed answer. Having just discovered Rust a
few weeks ago I am afraid that I lack a lot of background on those
questions.

I definitely hope to get up to speed as the design of Rust (if not the
syntax ;) ) is extremely interesting: typestate, built-in log/note, task &
fail interactions, region pointers => that's a lot of goodness for someone
coming from a C++ background!

-- Matthieu
_______________________________________________
Rust-dev mailing list
Rust-dev@mozilla.org
https://mail.mozilla.org/listinfo/rust-dev
