I've been thinking about string encoding support lately, and the
approach I've decided to experiment with is representing a string
encoding as a pair of external Iterators. The encoder is an
Iterator<u8> that consumes an Iterator<char>, and the decoder is an
Iterator<char> that consumes an Iterator<u8>. A pair of conditions
controls error handling, with the default behavior being to substitute
U+FFFD REPLACEMENT CHARACTER.
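For illustration, the decoder half might be shaped roughly like this
(translated into modern Rust syntax, since the original branch predates
Rust 1.0; the struct name, the little-endian byte order, and the exact
recovery choices are my assumptions, not the branch's actual code):

```rust
/// Hypothetical sketch: a UTF-16LE decoder as an iterator adaptor that
/// consumes bytes and yields chars, substituting U+FFFD for invalid or
/// truncated input, per the default behavior described above.
struct Utf16Decoder<I: Iterator<Item = u8>> {
    bytes: I,
}

impl<I: Iterator<Item = u8>> Iterator for Utf16Decoder<I> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // Read one little-endian UTF-16 code unit (two bytes).
        let lo = self.bytes.next()?;
        let hi = match self.bytes.next() {
            Some(b) => b,
            None => return Some('\u{FFFD}'), // truncated code unit
        };
        let unit = u16::from_le_bytes([lo, hi]);
        match unit {
            0xD800..=0xDBFF => {
                // High surrogate: must be followed by a low surrogate.
                // (This sketch also discards the next unit on mismatch.)
                let trail = match (self.bytes.next(), self.bytes.next()) {
                    (Some(a), Some(b)) => u16::from_le_bytes([a, b]),
                    _ => return Some('\u{FFFD}'),
                };
                if (0xDC00..=0xDFFF).contains(&trail) {
                    let c = 0x10000
                        + ((unit as u32 - 0xD800) << 10)
                        + (trail as u32 - 0xDC00);
                    Some(char::from_u32(c).unwrap_or('\u{FFFD}'))
                } else {
                    Some('\u{FFFD}')
                }
            }
            0xDC00..=0xDFFF => Some('\u{FFFD}'), // unpaired low surrogate
            _ => Some(char::from_u32(unit as u32).unwrap_or('\u{FFFD}')),
        }
    }
}
```

Because it is just an Iterator, it composes with `.collect()` and the
rest of the iterator adaptors, which is the point of the design.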

Today I implemented this as a new module std::encoding, with a
proof-of-concept UTF-16 implementation. This is available on my fork
at https://github.com/kballard/rust/tree/encodings (current compiling
commit is 
https://github.com/kballard/rust/commit/54b5b50f0afd8d5dc329d01109c6b760754d56de;
I won't guarantee that the branch will always compile).

Usage is slightly awkward at the moment due to the use of
Iterator<u8> instead of Iterator<&u8> (and likewise for char). To
convert a UTF-16 &[u8] into a ~[char] you can say

    let res : ~[char] =
        encoding::utf16.decode(src.iter().transform(|&x| x)).collect();

I had hoped to provide a convenience method .decodeBytes() so you
could say `encoding::utf16.decodeBytes(src)` but the type system has
defeated my attempts to do this so far. I'm inclined to add an
.iter_clone() method to vectors, which would turn this into the
slightly simpler `encoding::utf16.decode(src.iter_clone()).collect()`.
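For comparison, the convenience method I'm after is easy to sketch in
later Rust, where std's char::decode_utf16 plays the role of the
decoder and copying a byte iterator is no longer awkward (the function
name is hypothetical; this is not the API on my branch):

```rust
/// Hypothetical decode_bytes convenience: decode a UTF-16LE byte slice
/// into chars, replacing errors with U+FFFD. A trailing odd byte is
/// silently ignored by chunks_exact in this sketch.
fn decode_bytes_utf16le(src: &[u8]) -> Vec<char> {
    // Pair bytes into little-endian code units, then decode.
    let units = src.chunks_exact(2).map(|p| u16::from_le_bytes([p[0], p[1]]));
    char::decode_utf16(units)
        .map(|r| r.unwrap_or('\u{FFFD}'))
        .collect()
}
```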

If anyone wants to look over what I have so far, I'd love to get your
feedback/suggestions/complaints.

-Kevin
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev
