On 12-04-24 11:30 AM, Matthieu Monrocq wrote:

> However this is at the condition of considering strings as list of
> codepoints, and not list of bytes. List of bytes are useful in encoding
> and decoding operations, but to manipulate Arabic or Korean, they fall
> short: having users manipulate the strings byte-wise instead of
> codepoint-wise is a recipe to disaster outside of English and Latin-1
> representable languages.

Could you elaborate on this a little bit? I'm curious to hear impressions
-- even if vague or hard to specify -- about the experience of working
with known-language, non-Latin-1 text.

I'm an English-speaker and much technical material is English-derived,
so usually when I'm working with text-processing code, it falls into one
of two categories:

  - ASCII-subset by construction (e.g. structured-language keywords)

  - Totally unknown language semantics, has to work with everything,
    can't assume I know anything about the language (e.g. "human input")

I am emphatically not saying these are the _only_ two possible
environments, just the two that I have experience in. So in my
experience byte operations in the ASCII range work for the former, and
using a proper language-and-locale-aware Unicode library like ICU works
for the latter. That's where my "usability biases" emerge in the design
of str.

In particular I want to know if you would feel that there are common
operations you expect to be able to do codepoint-at-a-time on the
datatype "str", that you would not be comfortable doing on the datatype
"[char]", if you converted str to [char] as a one-time pass in advance
of performing the operation. That's what I assume people will do if they
need random (rather than sequential) codepoint access. Sequential access
we already have iterators for.

But I understand this might not be right; it's a design space with a lot
of tensions. There are as many different string representations in the
world as there are opinionated programmers :)

> I understand that this may seem contradictory to Rust's original
> direction of utf-8 encoded strings, but having worked with utf-8 strings
> using C++ `std::string` I can assure you that apart from blindly passing
> them around, one cannot do much. All modifying operations require the
> use of Unicode aware libraries... even `substr`.

Naturally so. We're intending to ship a relatively full binding to
libicu for just this reason.
Unicode Text Is Hard To Do By Hand.

(Though, hmm, substr is actually fine on UTF-8, no? You just have to
land on character boundaries. Which are easy to find; O(1) from any
given start point -- at most 5 bytes away -- and the guaranteed output
of any other algorithm that iterates over character boundaries...)

> Second, I do not think that statically known sizes are so important in
> the type system. I am a huge fan, and abuser, of the C++ template
> system, but I will be the first to admit it is really complex and
> generally poorly understood even among usually savvy C++ users.
>
> As I understand, fixed-length vectors were imagined for C-compatibility.
> Statically allocated buffers have lifetime that exceed that of all other
> objects in the system, therefore they can perfectly be accessed through
> slices. Other uses implying C-compatibility should be based on
> dynamically allocated memory, and the size will be unknown at compilation.

They're useful for a lot of reasons. You can alloca them, which is good
for small buffers. And a decent number of heap structures also have need
of small fixed-fanout arrays, caches, lookup tables and the like.

But beyond that they simply _occur_ in the C type system. With annoying
frequency! We've designed (and intend to maintain) a degree of
compatibility with C in our structured types: a rust record and a C
struct containing the same elements ought to be memory-compatible. When
a C struct has an array in the middle of it, we need to be able to
represent that somehow. There are a nontrivial number of C structures
that have that property (or, say, a fixed-sized reserved region). We
currently address this by having users generate a sequence of fields
like:

  pad1: u8;
  pad2: u8;
  pad3: u8;
  pad4: u8;

etc. etc. Not so fun.

> In the blog article linked, an issue regarding the variable-size of
> `rust_vec<T>` is made because it plays havoc with stack-allocation.
> However, is real stack-allocation necessary here ?

It's not necessary, but if it's not done on the stack, it's done on a
parallel-to-the-stack LIFO structure (a.k.a. "dynastack"), which we used
to have, but have removed since we managed to move everything to the
stack. If we had to re-acquire the dynastack for this purpose, it would
not be the end of the world, but we'd like to avoid it. It's one more
moving part.

> I hope my suggestions are reasonable. Do feel free to ignore them if
> they are not!

Quite reasonable. I hope I've provided useful answers to some.

-Graydon
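
For illustration only, here is a minimal sketch of the three string operations discussed in this message: sequential codepoint iteration, a one-time conversion from str to [char] for random access, and a byte-indexed substr that lands on character boundaries. It is written in present-day Rust, not the 2012 syntax the thread would have used, and the helper names `floor_boundary` and `substr` are hypothetical, not part of any library. It also assumes today's UTF-8 definition (RFC 3629), where a sequence is at most 4 bytes, so the walk back to a boundary takes at most 3 steps; the older definition allowed longer sequences, which fits the looser bound quoted above.

```rust
/// Move a byte offset back to the nearest UTF-8 character boundary.
/// Continuation bytes look like 10xxxxxx, and the first byte of a valid
/// str is never a continuation byte, so the loop cannot underflow.
/// (Hypothetical helper, named here for illustration.)
fn floor_boundary(s: &str, mut i: usize) -> usize {
    if i >= s.len() {
        return s.len();
    }
    while (s.as_bytes()[i] & 0xC0) == 0x80 {
        i -= 1;
    }
    i
}

/// Byte-indexed substring that snaps both ends back to boundaries.
/// (Hypothetical helper; one possible policy among several.)
fn substr(s: &str, start: usize, end: usize) -> &str {
    let a = floor_boundary(s, start);
    let b = floor_boundary(s, end).max(a);
    &s[a..b]
}

fn main() {
    let s = "한국어 text";

    // Sequential access: iterate codepoints along with their byte offsets.
    for (byte_offset, c) in s.char_indices() {
        println!("{}: {}", byte_offset, c);
    }

    // Random access: pay for one conversion pass, then index freely.
    let chars: Vec<char> = s.chars().collect();
    println!("third codepoint: {}", chars[2]);

    // Byte 4 falls inside the second character; substr snaps back to byte 3.
    println!("{}", substr(s, 4, 9));
}
```

The standard library's `str::is_char_boundary` performs the same per-byte check, so the hand-rolled bit test is here only to show why finding a boundary is a constant-time operation.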