On Thu, May 1, 2014 at 1:06 PM, Malthe Borch <[email protected]> wrote:
> This is not the case in the proposed design.

You're wrong.

> All string operations would behave exactly as if there was only a
> single encoding. The only requirement is that the strings are properly
> declared with an encoding (which may be different).

Nope, that's not how it works in practice (see below). I speak as
someone who has spent blood, sweat, and tears debugging systems that
work exactly like what you're proposing.

> With Ruby and most other languages, a string is just a sequence of
> bytes. It does not know about an encoding

Wrong again, and that hasn't been the case for some 7 years. It was
true of Ruby <= 1.8, but Ruby 1.9 introduced a feature called "M17N"
which works almost exactly like what you describe: each string is
tagged with an encoding, stored in a bitfield alongside the string
object.

> Note that it may not always be possible to encode a string to a
> non-unicode encoding such as ASCII. But this is only a failure mode on
> the I/O barrier where you explicitly need to encode. When no I/O
> barrier and/or protocol is involved, there needs to be no awareness of
> string encodings.

No: when you combine strings with different encodings, you need to
transcode one of the strings. When this happens, the transcoding
process may encounter characters which are valid in one encoding but
not the other, in which case the transcoding will fail, and it will
fail at runtime. This can happen long after a string has crossed the
I/O boundary. The result is errors which pop up at runtime in odd
circumstances. This is nothing short of a fucking nightmare to debug.

--
Tony Arcieri
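[Editor's note: a minimal Ruby (>= 1.9) sketch, not from the original thread, of the two behaviors described above — per-string encoding tags, and mixed-encoding operations failing at runtime far from any I/O boundary. The string literals are illustrative assumptions.]

```ruby
# M17N: every string carries an encoding tag.
s = "héllo"        # source encoding: UTF-8 (the default since Ruby 2.0)
puts s.encoding    # prints "UTF-8"

# Combining strings with incompatible encodings raises at runtime,
# potentially long after both strings crossed the I/O boundary.
latin1 = "h\xE9llo".force_encoding("ISO-8859-1")
begin
  s + latin1
rescue Encoding::CompatibilityError => e
  puts e.message   # incompatible character encodings
end

# Transcoding can also fail when a character has no representation
# in the target encoding (e.g. "é" in US-ASCII):
begin
  s.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  puts e.message
end
```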
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev
