It would be a mistake for a byte-sequence container, stream, or string
type to know anything about particular encodings. An encoding is an
interpretation imposed on a byte sequence. Users of a sequence need to be
able to choose which interpretation to apply without interference from
some previous user's choice, and without needing to make a copy.
As an example, a given string may be seen as raw bytes, as a series of
delimited records, as Unicode code points within some of those records, as
a series of JSON name-value pairs within such a record, and as a decimal
number in a JSON value part. The same interpretations need to work on a
raw byte stream that would not tolerate in-band Rust-specific annotations.
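To make the layering concrete, here is a minimal sketch of several
interpretations applied to one borrowed byte buffer without copying. The
function names (records, as_text, decimal_value) are hypothetical, and a
plain `name:value` record stands in for real JSON parsing.

```rust
use std::str;

/// View a raw byte buffer as newline-delimited records. Each record is a
/// sub-slice borrowed from `buf`, so no bytes are copied.
fn records<'a>(buf: &'a [u8]) -> impl Iterator<Item = &'a [u8]> + 'a {
    buf.split(|&b| b == b'\n').filter(|r| !r.is_empty())
}

/// View one record as UTF-8 text, if the bytes happen to be valid UTF-8.
/// A record that is not valid UTF-8 is still usable as raw bytes.
fn as_text(record: &[u8]) -> Option<&str> {
    str::from_utf8(record).ok()
}

/// View the value part of a `name:value` record as a decimal number.
/// (Assumed record layout, much simpler than JSON.)
fn decimal_value(record: &[u8]) -> Option<u64> {
    let text = as_text(record)?;
    let (_name, value) = text.split_once(':')?;
    value.trim().parse().ok()
}
```

Given the buffer b"width:42\nname:\xff\xfe\nheight:7\n", records yields
three slices; as_text succeeds on the first and third and returns None for
the second, and decimal_value returns Some(42) and Some(7). The buffer
itself carries no annotation about any of these views.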
The UTF-8 view of a string is an interesting special case. Depending on
context, what is considered a "character" may be a code point of at most 4
bytes, or any number of bytes representing a base and combining characters
which might or might not be collapsible to a canonical, single code point,
or a series of such constructs that is to be displayed as a ligature such
as "Qu" or "ffi". (Some languages are best displayed as mostly ligatures.)
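A small sketch of how the counts diverge, using only the standard library.
The helper name is hypothetical. Note that std offers byte and code point
views only; normalization and grapheme segmentation would need an external
library, which is not shown here.

```rust
/// Return (byte count, code point count) for a UTF-8 string.
fn byte_and_code_point_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// "é" as a single precomposed code point, U+00E9.
const PRECOMPOSED: &str = "\u{e9}";

/// "é" as a base letter followed by U+0301 COMBINING ACUTE ACCENT.
const DECOMPOSED: &str = "e\u{301}";

/// The single-code-point "ffi" ligature, U+FB03.
const LIGATURE_FFI: &str = "\u{fb03}";
```

Here byte_and_code_point_counts gives (2, 1) for PRECOMPOSED, (3, 2) for
DECOMPOSED, (3, 1) for LIGATURE_FFI, and (3, 3) for the plain string "ffi".
PRECOMPOSED and DECOMPOSED compare unequal as byte sequences even though a
reader would call both one "character".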
Nathan Myers
On May 1, 2014 6:54:04 AM Malthe Borch <[email protected]> wrote:
In Rust, the built-in std::str type "is a sequence of unicode
codepoints encoded as a stream of UTF-8 bytes".
Meanwhile, building on experience with Python 2 and 3, I think it's
worth considering a more flexible design.
A string would be essentially a rope where each leaf specifies an
encoding, e.g. UTF-8 or ISO8859-1 (ideally expressed as one or two
bytes).
That is, a string may be composed of segments of different encodings.
On the I/O barrier you would then explicitly encode (and flatten) to a
compatible encoding such as UTF-8.
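A rough sketch of that proposal, assuming a flat list of segments in place
of a real rope and only two encodings. All type and method names here are
hypothetical; this is not an existing Rust API.

```rust
/// Encoding tag carried by each leaf (hypothetical; two cases only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Encoding {
    Utf8,
    Latin1, // ISO8859-1
}

/// One leaf: raw bytes plus the encoding to interpret them under.
struct Segment {
    encoding: Encoding,
    bytes: Vec<u8>,
}

/// A flat list of segments stands in for the rope.
struct SegmentedString {
    segments: Vec<Segment>,
}

impl SegmentedString {
    /// Decode every segment and flatten the result into one UTF-8 string,
    /// as would happen at the I/O barrier.
    fn flatten_to_utf8(&self) -> Result<String, std::str::Utf8Error> {
        let mut out = String::new();
        for seg in &self.segments {
            match seg.encoding {
                // UTF-8 leaves are validated, then appended unchanged.
                Encoding::Utf8 => out.push_str(std::str::from_utf8(&seg.bytes)?),
                // ISO8859-1 bytes map one-to-one onto U+0000..=U+00FF.
                Encoding::Latin1 => out.extend(seg.bytes.iter().map(|&b| b as char)),
            }
        }
        Ok(out)
    }
}
```

A Latin1 segment holding the bytes 63 61 66 E9 followed by a Utf8 segment
holding " ok" flattens to "café ok"; a Utf8 segment holding the lone byte
FF makes flatten_to_utf8 return an error.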
Likewise, data may be read as 8-bit raw and then "decoded" at a later
stage. For instance, HTTP request headers are ISO8859-1, but the
entire input stream is 8-bit raw.
Sources:
- https://maltheborch.com/2014/04/pythons-missing-string-type
- http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev