Re: [rust-dev] UTF-8 strings versus "encoded ropes"

Mikhail Zabaluev Thu, 01 May 2014 09:54:29 -0700

Hi,

2014-05-01 16:53 GMT+03:00 Malthe Borch <[email protected]>:


> In Rust, the built-in std::str type "is a sequence of unicode
> codepoints encoded as a stream of UTF-8 bytes".
>
> Meanwhile, building on experience with Python 2 and 3, I think it's
> worth considering a more flexible design.
>
> A string would be essentially a rope where each leaf specifies an
> encoding, e.g. UTF-8 or ISO8859-1 (ideally expressed as one or two
> bytes).
>
> That is, a string may be comprised of segments of different encodings.
> On the I/O barrier you would then explicitly encode (and flatten) to a
> compatible encoding such as UTF-8.
>
> Likewise, data may be read as 8-bit raw and then "decoded" at a later
> stage. For instance, HTTP request headers are ISO8859-1, but the
> entire input stream is 8-bit raw.
>

I don't think that much hidden complexity would be justified in the
built-in string type. Encoded text is typically dealt with in protocol
libraries or at similar "I/O barriers", where it should be passed through a
validating decoder. std::str is guaranteed (as long as no unsafe code
breaks the invariant) to hold a valid UTF-8 byte sequence, which can be
passed without copying to external libraries that accept UTF-8. For data
domains richer than what Unicode text can provide, more complex data
representations would need to be coded explicitly.
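
To make the "validating decoder" point concrete, here is a minimal sketch of
what decoding at the I/O barrier could look like. It assumes the current
standard library API (str::from_utf8); the helper names latin1_to_string and
utf8_checked are hypothetical and used for illustration only.

```rust
use std::str;

/// Hypothetical helper: decode ISO 8859-1 bytes into an owned UTF-8 String.
/// Each ISO 8859-1 byte maps to the Unicode code point with the same value,
/// so this decoder cannot fail.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Hypothetical helper: validate raw bytes as UTF-8 without copying.
/// Returns None if the input is not a valid UTF-8 byte sequence.
fn utf8_checked(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    // "café" as it would arrive in an ISO 8859-1 encoded header value.
    let raw = [0x63, 0x61, 0x66, 0xE9];

    // The raw bytes are not valid UTF-8, so validation rejects them.
    assert!(utf8_checked(&raw).is_none());

    // Decoding explicitly at the barrier yields ordinary UTF-8 text.
    let text = latin1_to_string(&raw);
    assert_eq!(text, "café");
    assert_eq!(text.as_bytes(), b"caf\xC3\xA9");

    // Input that is already valid UTF-8 is borrowed, not copied.
    assert_eq!(utf8_checked(b"abc"), Some("abc"));
}
```

Once the text is past the barrier, the rest of the program only ever sees
valid UTF-8 and never needs to track a per-segment encoding.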

Best regards,
  Mikhail

_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev
