On 01/05/14 07:49 PM, Nathan Myers wrote:
> On 05/01/2014 02:52 PM, Patrick Walton wrote:
>> On 5/1/14 6:53 AM, Malthe Borch wrote:
>>> In Rust, the built-in std::str type "is a sequence of unicode
>>> codepoints encoded as a stream of UTF-8 bytes".
>>> ...
>>> A string would be essentially a rope where each leaf specifies an
>>> encoding, e.g. UTF-8 or ISO8859-1 (ideally expressed as one or two
>>> bytes).
>>
>> This is too complex for a systems language with a simple library.
>
> In defining a library string we always grapple over how it
> should differ from a raw (variable or fixed) array of bytes.
> Ease of appending and of assigning into substrings always
> comes up. In the old days, copies shared storage, but nowadays
> that's considered evil. Indexed random access lookup was once
> thought essential, but with today's variable-sized characters,
> strings have become sequential structures. We might snip out a
> substring and splice another in its place, but we must identify
> those places by stepping iterators to them. We need to put string values
> in partial or total order, but no single ordering is
> compellingly best. Equality depends on context.
>
> The outcome is that the context-independent requirements on
> strings may not differ enough from an array of bytes to justify
> a separate type. We might better give our byte arrays a few
> stringy capabilities. Most users of strings don't need to know
> anything about what's in them, and can operate on the raw byte
> arrays. To use a string as a map key, though, implies choices:
> fold case? canonicalize sequences? We need an object that can
> remember your choices, and that the map can apply to strings
> given to it.
>
> Ideally what we use to express our interpretation of some set
> of strings could be used on any sequence of bytes, not necessarily
> contiguous in memory, not necessarily all in memory at once,
> not necessarily even produced until called for.
>
> The history of programming languages is littered with mistakes
> around string types. There's no reason why Rust must repeat
> them all.
>
> Nathan Myers

There's a string type because it *enforces* the guarantee of containing valid UTF-8, meaning it can always be converted to code points. This also means all of the Unicode algorithms can assume that they're dealing with a valid sequence of code points with no out-of-range values or surrogates, per the specification.
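
To make the guarantee concrete, here is a minimal sketch of where the check happens. It assumes the `std::str::from_utf8` and `str::chars` APIs of current stable Rust, which may differ from what the standard library offered when this thread was written. The `code_points` helper is a hypothetical name used only for illustration.

```rust
use std::str;

/// Hypothetical helper (not part of the standard library): returns the
/// code points of `bytes` if and only if the bytes are valid UTF-8.
fn code_points(bytes: &[u8]) -> Option<Vec<u32>> {
    // Validation happens once, at the boundary between bytes and `str`.
    let s: &str = str::from_utf8(bytes).ok()?;
    // After that, decoding cannot fail: every `char` is a Unicode scalar
    // value, so it is never a surrogate and never above U+10FFFF.
    Some(s.chars().map(|c| c as u32).collect())
}

fn main() {
    // "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
    println!("{:?}", code_points(&[0x68, 0xC3, 0xA9]));

    // 0xED 0xA0 0x80 is what a UTF-8-style encoding of the surrogate
    // U+D800 would look like; it is rejected as invalid.
    println!("{:?}", code_points(&[0xED, 0xA0, 0x80]));

    // 0xFF can never appear in UTF-8.
    println!("{:?}", code_points(&[0xFF]));
}
```

Code that receives a `&str` can skip these checks entirely, whereas code that receives a `&[u8]` would have to repeat them or handle decoding errors at every use.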
_______________________________________________
Rust-dev mailing list
https://mail.mozilla.org/listinfo/rust-dev
