On 01/05/14 07:49 PM, Nathan Myers wrote:
> On 05/01/2014 02:52 PM, Patrick Walton wrote:
>> On 5/1/14 6:53 AM, Malthe Borch wrote:
>>> In Rust, the built-in std::str type "is a sequence of unicode
>>> codepoints encoded as a stream of UTF-8 bytes".
>>> ...
>>> A string would be essentially a rope where each leaf specifies an
>>> encoding, e.g. UTF-8 or ISO8859-1 (ideally expressed as one or two
>>> bytes).
>>
>> This is too complex for a systems language with a simple library.
> 
> In defining a library string we always grapple over how it
> should differ from a raw (variable or fixed) array of bytes.
> Ease of appending and of assigning into substrings always
> comes up. In the old days, copies shared storage, but nowadays
> that's considered evil. Indexed random access lookup was once
> thought essential, but with today's variable-sized characters,
> strings have become sequential structures. We might snip out a
> substring and splice another in its place, but we must identify
> those places by stepping iterators to them. We need to put string values
> in partial or total order, but no single ordering is
> compellingly best.  Equality depends on context.
> 
> The outcome is that the context-independent requirements on
> strings may not differ enough from an array of bytes to justify
> a separate type. We might better give our byte arrays a few
> stringy capabilities. Most users of strings don't need to know
> anything about what's in them, and can operate on the raw byte
> arrays. To use a string as a map key, though, implies choices:
> fold case? canonicalize sequences? We need an object that can
> remember your choices, and that the map can apply to strings
> given to it.
> 
> Ideally what we use to express our interpretation of some set
> of strings could be used on any sequence of bytes, not necessarily
> contiguous in memory, not necessarily all in memory at once,
> not necessarily even produced until called for.
> 
> The history of programming languages is littered with mistakes
> around string types.  There's no reason why Rust must repeat
> them all.
> 
> Nathan Myers

There's a separate string type because it *enforces* the guarantee that
its contents are valid UTF-8, so it can always be decoded into code
points. As a result, every Unicode algorithm can assume it is dealing
with a valid sequence of code points, with no out-of-range values or
surrogates, per the specification.
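
To make that guarantee concrete, here is a minimal sketch. It assumes
the present-day standard library (`std::str::from_utf8` and
`str::chars`), which may not match the API as it stood when this thread
was written, and `decode_code_points` is a hypothetical helper name of
my own, not anything from the library.

```rust
use std::str;

/// Hypothetical helper: returns the code points of `bytes` if they form
/// valid UTF-8, or `None` if validation fails.
pub fn decode_code_points(bytes: &[u8]) -> Option<Vec<u32>> {
    // `str::from_utf8` validates the whole slice up front and borrows it
    // without copying; invalid input is rejected here, once.
    let s = str::from_utf8(bytes).ok()?;
    // Past this point no further checks are needed: every `char` yielded
    // is a Unicode scalar value, so never a surrogate and never above
    // U+10FFFF.
    Some(s.chars().map(|c| c as u32).collect())
}
```

Given the bytes `68 C3 A9` this yields the code points U+0068 and
U+00E9. Given `ED A0 80` (which would encode the surrogate U+D800) or
`F4 90 80 80` (which would encode U+110000, past the end of the code
space) it yields `None`, so code that receives a `&str` never has to
handle those cases itself.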

_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev
