Re: [rust-dev] UTF-8 strings versus "encoded ropes"

Nathan Myers Thu, 01 May 2014 17:53:22 -0700

On 05/01/2014 04:57 PM, Daniel Micay wrote:

On 01/05/14 07:49 PM, Nathan Myers wrote:

In defining a library string we always grapple over how it
should differ from a raw (variable or fixed) array of bytes.
Ease of appending and of assigning into substrings always
comes up. In the old days, copies shared storage, but nowadays
that's considered evil. Indexed random access lookup was once
thought essential, but with today's variable-sized characters,
strings have become sequential structures. We might snip out a
substring and splice another in its place, but we must identify
those places by stepping iterators to them. We need to put
string values in partial or total order, but no single ordering
is compellingly best.  Equality depends on context.


The outcome is that the context-independent requirements on
strings may not differ enough from an array of bytes to justify
a separate type. We might better give our byte arrays a few
stringy capabilities. Most users of strings don't need to know
anything about what's in them, and can operate on the raw byte
arrays. Using a string as a map key, though, implies choices:
fold case? canonicalize sequences? We need an object that can
remember your choices, and that the map can apply to strings
given to it.
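
As a rough illustration of "an object that can remember your choices",
here is a minimal sketch in present-day Rust. The names KeyPolicy and
PolicyMap are hypothetical, not anything from std or from this thread,
and only ASCII case folding is shown; canonicalizing Unicode sequences
would need tables that std does not provide.

```rust
use std::collections::HashMap;

/// Hypothetical policy object: remembers how keys are to be interpreted.
#[derive(Clone, Copy)]
struct KeyPolicy {
    fold_ascii_case: bool,
}

impl KeyPolicy {
    /// Reduce raw bytes to the form the map hashes and compares.
    fn normalize(&self, raw: &[u8]) -> Vec<u8> {
        if self.fold_ascii_case {
            raw.to_ascii_lowercase()
        } else {
            raw.to_vec()
        }
    }
}

/// Hypothetical map that applies its policy to every key given to it.
struct PolicyMap<V> {
    policy: KeyPolicy,
    inner: HashMap<Vec<u8>, V>,
}

impl<V> PolicyMap<V> {
    fn new(policy: KeyPolicy) -> Self {
        PolicyMap { policy, inner: HashMap::new() }
    }

    fn insert(&mut self, key: &[u8], value: V) -> Option<V> {
        self.inner.insert(self.policy.normalize(key), value)
    }

    fn get(&self, key: &[u8]) -> Option<&V> {
        self.inner.get(&self.policy.normalize(key))
    }
}

fn main() {
    let mut headers = PolicyMap::new(KeyPolicy { fold_ascii_case: true });
    headers.insert(b"Content-Type", 1);
    // The caller hands over plain bytes; the map applies the stored choice.
    assert_eq!(headers.get(b"content-type"), Some(&1));
}
```

The keys stay plain byte arrays throughout; only the map knows which
interpretation is in force.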

Ideally what we use to express our interpretation of some set
of strings could be used on any sequence of bytes, not necessarily
contiguous in memory, not necessarily all in memory at once,
not necessarily even produced until called for.
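
One way to picture an interpretation that works on any sequence of
bytes is to write it against iterators rather than against a
contiguous buffer. The sketch below is only an assumption about how
that might look in present-day Rust; eq_ignore_ascii_case is a
hypothetical free function written for this example.

```rust
/// Hypothetical helper: ASCII-case-insensitive equality over any two
/// byte sources, whether contiguous, chunked, or lazily generated.
fn eq_ignore_ascii_case<A, B>(a: A, b: B) -> bool
where
    A: IntoIterator<Item = u8>,
    B: IntoIterator<Item = u8>,
{
    a.into_iter()
        .map(|x| x.to_ascii_lowercase())
        .eq(b.into_iter().map(|y| y.to_ascii_lowercase()))
}

fn main() {
    // Two separate chunks, never copied into one buffer.
    let chunks: [&[u8]; 2] = [b"Hello, ", b"World"];
    let scattered = chunks.iter().flat_map(|c| c.iter().copied());

    // Bytes that are produced only as the comparison asks for them.
    let lazy = "HELLO, WORLD".bytes();

    assert!(eq_ignore_ascii_case(scattered, lazy));
}
```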

The history of programming languages is littered with mistakes
around string types.  There's no reason why Rust must repeat
them all.

Nathan Myers


There's a string type because it *enforces* the guarantee of containing
valid UTF-8, meaning it can always be converted to code points. This
also means all of the Unicode algorithms can assume that they're dealing
with a valid sequence of code points with no out-of-range values or
surrogates, per the specification.
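
For readers who want to see that guarantee concretely, here is a small
sketch using the checked constructors in present-day Rust's std
(String::from_utf8 and std::str::from_utf8); the type names in use on
this list in 2014 were different.

```rust
fn main() {
    // Well-formed UTF-8 is accepted and can then be walked as code points.
    let ok = String::from_utf8(vec![0x68, 0x69, 0xC3, 0xA9]).unwrap(); // "hi" + U+00E9
    assert_eq!(ok.chars().count(), 3);

    // The byte 0xFF never occurs in UTF-8, so construction fails.
    assert!(String::from_utf8(vec![0xFF]).is_err());

    // ED A0 80 would encode the surrogate U+D800, which UTF-8 forbids.
    assert!(std::str::from_utf8(&[0xED, 0xA0, 0x80]).is_err());
}
```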


A UTF-8 string type can certainly earn its keep.  (Probably it should
have "utf8" somewhere in its name.)  Not all byte sequences a program
encounters are, or can or should be converted to, valid UTF-8.  Any
that might not be must still be put in something that users probably
want to call a string. The other issues remain; there are many equally
valid orderings for UTF-8 sequences, so any fixed choice will often be
wrong.

A discriminated string type that may be matched at runtime as a valid
UTF-8 sequence or not depending on what was last put in it would
probably be useful often enough to want it in std.
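
A minimal sketch of what such a discriminated type might look like in
present-day Rust follows. MaybeUtf8 and describe are hypothetical names
made up for this example; nothing like them is being claimed to exist
in std.

```rust
/// Hypothetical discriminated string: either checked UTF-8 or raw bytes.
#[derive(Debug, PartialEq)]
enum MaybeUtf8 {
    Utf8(String),
    Bytes(Vec<u8>),
}

impl MaybeUtf8 {
    /// Classify whatever was last put in, without losing any bytes.
    fn from_bytes(raw: Vec<u8>) -> Self {
        match String::from_utf8(raw) {
            Ok(s) => MaybeUtf8::Utf8(s),
            // into_bytes hands back the original buffer unchanged.
            Err(e) => MaybeUtf8::Bytes(e.into_bytes()),
        }
    }
}

/// Callers match at runtime on which case they actually have.
fn describe(s: &MaybeUtf8) -> String {
    match s {
        MaybeUtf8::Utf8(text) => format!("text, {} code points", text.chars().count()),
        MaybeUtf8::Bytes(raw) => format!("opaque, {} bytes", raw.len()),
    }
}

fn main() {
    let a = MaybeUtf8::from_bytes(b"plain".to_vec());
    let b = MaybeUtf8::from_bytes(vec![0x66, 0xFF, 0x6F]);
    assert_eq!(describe(&a), "text, 5 code points");
    assert_eq!(describe(&b), "opaque, 3 bytes");
}
```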

Nathan Myers
_______________________________________________
Rust-dev mailing list
Rust-dev@mozilla.org
https://mail.mozilla.org/listinfo/rust-dev
