On 12-04-24 11:30 AM, Matthieu Monrocq wrote:

> However, this is on the condition of considering strings as lists of
> codepoints, and not lists of bytes. Lists of bytes are useful in encoding
> and decoding operations, but to manipulate Arabic or Korean, they fall
> short: having users manipulate the strings byte-wise instead of
> codepoint-wise is a recipe for disaster outside of English and Latin-1
> representable languages.

Could you elaborate on this a little bit? I'm curious to hear
impressions -- even if vague or hard to specify -- about the experience
of working with known-language, non-Latin-1 text. I'm an English-speaker
and much technical material is English-derived, so usually when I'm
working with text-processing code, it falls into one of two categories:

  - ASCII-subset by construction (e.g. structured-language keywords)

  - Totally unknown language semantics, has to work with everything,
    can't assume I know anything about the language (e.g. "human input")

I am emphatically not saying these are the _only_ two possible
environments, just the two that I have experience in. In my
experience, byte operations in the ASCII range work for the former, and
a proper language-and-locale-aware Unicode library like ICU works for
the latter. That's where my "usability biases" emerge in the design of str.

In particular I want to know if you would feel that there are common
operations you expect to be able to do codepoint-at-a-time on the
datatype "str", that you would not be comfortable doing on the datatype
"[char]", if you converted str to [char] as a one-time pass in advance
of performing the operation. That's what I assume people will do if they
need random (rather than sequential) codepoint access. Sequential access
we already have iterators for.
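
As a rough sketch of that one-time conversion, written in present-day
Rust syntax (where "[char]" is spelled Vec<char>); the helper names are
made up for illustration and are not part of any library:

```rust
/// Hypothetical helper: decode a UTF-8 string into a vector of codepoints
/// in a single pass, so that later lookups by codepoint index are O(1).
fn to_codepoints(s: &str) -> Vec<char> {
    s.chars().collect()
}

/// Sequential access needs no conversion: the iterator decodes codepoints
/// one at a time as it walks the underlying bytes.
fn count_codepoints(s: &str) -> usize {
    s.chars().count()
}
```

The conversion costs one pass and up to 4 bytes per codepoint, after
which to_codepoints(s)[i] is a plain array index; code that only walks
the text front to back can stay on the iterator and skip the copy.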

But I understand this might not be right; it's a design space with a lot
of tensions. There are as many different string representations in the
world as there are opinionated programmers :)

> I understand that this may seem contradictory to Rust's original
> direction of utf-8 encoded strings, but having worked with utf-8 strings
> using C++ `std::string` I can assure you that apart from blindly passing
> them around, one cannot do much. All modifying operations require the
> use of Unicode aware libraries... even `substr`.

Naturally so. We're intending to ship a relatively full binding to
libicu for just this reason. Unicode Text Is Hard To Do By Hand.

(Though, hmm, substr is actually fine on UTF-8, no? You just have to
land on character boundaries. Which are easy to find; O(1) from any
given start point -- at most 5 bytes away -- and the guaranteed output
of any other algorithm that iterates over character boundaries...)
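
A minimal sketch of that boundary search in present-day Rust. The
function names are invented here; str::is_char_boundary is a real
standard-library method. Under the 4-byte cap of RFC 3629 the backward
scan takes at most 3 steps; the older 6-byte form of UTF-8 would allow
up to 5:

```rust
/// Move `idx` back to the nearest UTF-8 character boundary at or before it.
/// Offsets 0 and s.len() are always boundaries, so the loop cannot underflow.
fn floor_boundary(s: &str, idx: usize) -> usize {
    let mut i = idx.min(s.len());
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Hypothetical byte-indexed substr: both ends are snapped back to
/// character boundaries before slicing, so the slice can never split a
/// codepoint (and so never panics).
fn substr(s: &str, start: usize, end: usize) -> &str {
    let a = floor_boundary(s, start);
    let b = floor_boundary(s, end).max(a);
    &s[a..b]
}
```

Snapping backwards is only one possible policy; a caller might just as
reasonably round the end offset forwards, or treat a mid-character
offset as an error.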

> Second, I do not think that statically known sizes are so important in
> the type system. I am a huge fan, and abuser, of the C++ template
> system, but I will be the first to admit it is really complex and
> generally poorly understood even among usually savvy C++ users.
> 
> As I understand, fixed-length vectors were imagined for C-compatibility.
> Statically allocated buffers have lifetime that exceed that of all other
> objects in the system, therefore they can perfectly be accessed through
> slices. Other uses implying C-compatibility should be based on
> dynamically allocated memory, and the size will be unknown at compilation.

They're useful for a lot of reasons. You can alloca them, which is good
for small buffers. And a decent number of heap structures also have need
of small fixed-fanout arrays, caches, lookup tables and the like.

But beyond that they simply _occur_ in the C type system. With annoying
frequency! We've designed (and intend to maintain) a degree of
compatibility with C in our structured types: a Rust record and a C
struct containing the same elements ought to be memory-compatible. When
a C struct has an array in the middle of it, we need to be able to
represent that somehow. There are a nontrivial number of C structures
that have that property (or, say, a fixed-sized reserved region). We
currently address this by having users generate a sequence of fields like:

   pad1: u8; pad2: u8; pad3: u8; pad4: u8;

etc. etc. Not so fun.
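
To make the comparison concrete, here is a sketch in present-day Rust
syntax of a made-up C struct with a 4-byte reserved region, written once
with a fixed-length array and once spelled out field by field. The
struct and its field names are hypothetical; #[repr(C)] and the [u8; 4]
array type are real features of current Rust:

```rust
use std::mem::size_of;

/// Mirrors a hypothetical C declaration:
///   struct header { uint16_t tag; uint8_t pad[4]; uint16_t len; };
#[repr(C)]
struct Header {
    tag: u16,
    pad: [u8; 4], // fixed-length array embedded inline, as in C
    len: u16,
}

/// The same layout with the reserved region written as separate fields.
#[repr(C)]
struct HeaderSpelledOut {
    tag: u16,
    pad1: u8,
    pad2: u8,
    pad3: u8,
    pad4: u8,
    len: u16,
}

/// Report the size in bytes of each form, so the two can be compared.
fn layout_sizes() -> (usize, usize) {
    (size_of::<Header>(), size_of::<HeaderSpelledOut>())
}
```

Both forms occupy the same 8 bytes, but only the array form scales to,
say, a 64-byte reserved region without 64 field declarations.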

> In the blog article linked, an issue regarding the variable-size of
> `rust_vec<T>` is made because it plays havoc with stack-allocation.
> However, is real stack-allocation necessary here?

It's not necessary, but if it's not done on the stack, it's done on a
parallel-to-the-stack LIFO structure (a.k.a. "dynastack"), which we used
to have, but have removed since we managed to move everything to the
stack. If we had to re-acquire the dynastack for this purpose, it would
not be the end of the world, but we'd like to avoid it. It's one more
moving part.

> I hope my suggestions are reasonable. Do feel free to ignore them if
> they are not!

Quite reasonable. I hope I've provided useful answers to some.

-Graydon
