Hi,
TL;DR: the actual proposal is at the end of this email.
Rust today has good support for UTF-8, which new content should definitely
use, but many systems still have to deal with legacy content that
uses other character encodings. There are several projects around to
implement more encodings in Rust. The furthest along, in my opinion,
is rust-encoding, notably because it implements the right specification.
rust-encoding: https://github.com/lifthrasiir/rust-encoding
The spec: http://encoding.spec.whatwg.org/
It defines error handling more precisely than the original RFCs, and
better reflects the reality of legacy content on the web.
There was some discussion in the past few days about importing
rust-encoding (or part of it) into Rust’s libstd or libextra. Before
that, I think it is important to define a good API. The spec defines one
for JavaScript, but we should not copy that exactly. rust-encoding’s API
is mostly good, but I think that error handling could be simplified.
In abstract terms, an encoding (such as "UTF-8") is made of a decoder
and an encoder. A decoder converts a stream of bytes into a stream of
text (Unicode scalar values, i.e. code points excluding surrogates),
while an encoder does the reverse. This does not cover other kinds of
stream transformation such as base64, compression, encryption, etc.
Bytes are represented in Rust by u8, text by str/char.
(Side note: Because of constraints imposed by JavaScript and to avoid
costly conversions, Servo will probably use a different data type for
representing text. This encoding API could eventually become generic
over a Text trait, but I think that it should stick to str for now.)
The most convenient way to represent a "stream" is with a vector or
string. This however requires the whole input to be in memory before
decoding/encoding can start, and that to be finished before any of the
output can be used. It should definitely be possible to eg. decode some
content as it arrives from the network, and parse it in a pipeline.
The most fundamental kind of API is one where the user repeatedly
"pushes" chunks of input into a decoder/encoder object (which may
maintain state between chunks) and gets the output so far in return,
then signals the end of the input.
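To make this concrete, incremental decoding with such a push-based API
could look like the sketch below. This is illustrative pseudocode against
the Decoder trait proposed further down, not working code; UTF8 is assumed
to be an Encoding value, and get_next_chunk() is a hypothetical source of
bytes arriving from the network.

let mut decoder = UTF8.new_decoder();
let mut output = ~"";
loop {
    match get_next_chunk() {  // hypothetical: returns Option<~[u8]>
        Some(bytes) => {
            match decoder.feed(bytes, &mut output) {
                Some(error) => return Err(error),
                // Keep going; `output` holds everything decoded so far
                // and can already be handed to a parser in a pipeline.
                None => {}
            }
        }
        None => break,
    }
}
// Signal the end of the input; some encodings emit final output here.
let final_error = decoder.flush(&mut output);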
An iterator adapter where the user "pulls" output from the decoder,
which in turn "pulls" from the input, can be nicer, but it is easy to
build on top of a "push-based" API, while the reverse requires tasks.
Iterator<u8> and Iterator<char> are tempting, but we may need to work on
big chunks at a time for efficiency: Iterator<~[u8]> and Iterator<~str>.
Or could single-byte/char iterators be reliably inlined to achieve
similar efficiency?
Finally, this API also needs to support several kinds of error
handling. For example, a decoder should abort at the first invalid byte
sequence for XML, but insert U+FFFD (the replacement character) for HTML.
I have not decided yet whether to offer just the closed set of error
handling modes defined in the spec, or to make this open-ended with conditions.
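With conditions, both behaviours could be expressed by trapping the
decoding_error condition proposed below. This is an illustrative
pseudocode sketch (the 0.x cond.trap(..).inside(..) style), assuming a
UTF8 Encoding value; none of this exists yet:

// HTML-style: replace every invalid byte sequence with U+FFFD.
let html_result = do decoding_error::cond.trap(|_invalid_bytes| {
    Some(~"\uFFFD")
}).inside {
    UTF8.decode(input)
};

// XML-style: abort on the first invalid byte sequence.
let xml_result = do decoding_error::cond.trap(|_invalid_bytes| {
    None  // decode() then returns Err(DecodeError)
}).inside {
    UTF8.decode(input)
};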
Based on all the above, here is a proposed API. Encoders are omitted,
but they are mostly the same as decoders with [u8] and str swapped.
/// Types implementing this trait are "algorithms"
/// such as UTF8, UTF-16, SingleByteEncoding, etc.
/// Values of these types are "encodings" as defined in the WHATWG spec:
/// UTF-8, UTF-16-LE, Windows-1252, etc.
trait Encoding {
    /// Could become an associated type with a ::new() constructor
    /// when the language supports that.
    fn new_decoder(&self) -> ~Decoder;

    /// Simple, "one shot" API.
    /// Decode a single byte string that is entirely in memory.
    /// May raise the decoding_error condition.
    fn decode(&self, input: &[u8]) -> Result<~str, DecodeError> {
        // Implementation (using a Decoder) left out.
        // This is a default method, but not meant to be overridden.
    }
}
/// Takes the invalid byte sequence.
/// Returns a replacement string, or None to abort with a DecodeError.
condition! {
    pub decoding_error: ~[u8] -> Option<~str>;
}
struct DecodeError {
    input_byte_offset: uint,
    invalid_byte_sequence: ~[u8],
}
/// Each implementation of Encoding has one corresponding implementation
/// of Decoder (and one of Encoder).
///
/// A new Decoder instance should be used for every input.
/// A Decoder instance should be discarded after DecodeError was returned.
trait Decoder {
    /// Call this repeatedly with a chunk of input bytes.
    /// As much as possible of the decoded text is appended to output.
    /// May raise the decoding_error condition.
    fn feed(&mut self, input: &[u8], output: &mut ~str) -> Option<DecodeError>;

    /// Call this to indicate the end of the input.
    /// The Decoder instance should be discarded afterwards.
    /// Some encodings may append some final output at this point.
    /// May raise the decoding_error condition.
    fn flush(&mut self, output: &mut ~str) -> Option<DecodeError>;
}
/// "Pull-based" API.
struct DecoderIterator<I> {
input_iterator: I,
priv state: DecoderIteratorState<I>,
}
impl<I: Iterator<~[u8]>> DecoderIterator<I> {
    fn new(input_iterator: I) -> DecoderIterator<I> {
        // Implementation left out.
    }

    /// Consume the whole input iterator and return a single decoded string.
    /// May raise the decoding_error condition.
    fn concat(&mut self) -> Result<~str, DecodeError> {
        // Implementation left out.
    }
}
impl<I: Iterator<~[u8]>> Iterator<Result<~str, DecodeError>> for DecoderIterator<I> {
    /// Call .next() once on the input iterator and decode the result.
    /// May raise the decoding_error condition.
    /// Returns None after DecodeError was returned once,
    /// even if the input iterator is not exhausted yet.
    fn next(&mut self) -> Option<Result<~str, DecodeError>> {
        // Implementation left out.
    }
}
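For comparison, here is how the two entry points might be used. Again,
this is a pseudocode sketch of the proposal, not working code; UTF8,
network_chunks (an Iterator<~[u8]>), and parser are all hypothetical,
and I am assuming each encoding provides its own DecoderIterator
constructor.

// One-shot decoding of a byte string that is entirely in memory:
let text = UTF8.decode(bytes_from_disk);

// Pull-based decoding of chunks arriving from the network:
let mut decoded_chunks = DecoderIterator::new(network_chunks);
for result in decoded_chunks {
    match result {
        Ok(text_chunk) => parser.feed(text_chunk),  // decode in a pipeline
        Err(error) => return Err(error),  // iterator yields None afterwards
    }
}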
Cheers,
--
Simon Sapin
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev