On 09/10/2013 08:47 AM, Simon Sapin wrote:
Hi,

TL;DR: the actual proposal is at the end of this email.

Thanks for working on this. It's crucial.


Rust today has good support for UTF-8, which new content definitely should use, but many systems still have to deal with legacy content that uses other character encodings. There are several projects around to implement more encodings in Rust. The furthest along, in my opinion, is rust-encoding, notably because it implements the right specification.

rust-encoding: https://github.com/lifthrasiir/rust-encoding

The spec: http://encoding.spec.whatwg.org/
It has more precise definitions of error handling than some original RFCs, and better reflects the reality of legacy content on the web.

There was some discussion in the past few days about importing rust-encoding (or part of it) into Rust’s libstd or libextra. Before that, I think it is important to define a good API. The spec defines one for JavaScript, but we should not copy that exactly. rust-encoding’s API is mostly good, but I think that error handling could be simplified.


In abstract terms, an encoding (such as "UTF-8") is made of a decoder and an encoder. A decoder converts a stream of bytes into a stream of text (Unicode scalar values, i.e. code points excluding surrogates), while an encoder does the reverse. This does not cover other kinds of stream transformation such as base64, compression, encryption, etc.

Bytes are represented in Rust by u8, text by str/char.
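
As a minimal illustration of the decoder/encoder pair described above, here is a sketch for ISO-8859-1 in its ISO sense, where each byte maps to the code point of the same value (the WHATWG spec maps that label to windows-1252, which differs in the 0x80..0x9F range). It is written in present-day Rust, so String and Vec<u8> stand in for ~str and ~[u8]. The names decode_latin1 and encode_latin1 are hypothetical and are not part of rust-encoding or of the proposal.

```rust
/// Decoder: bytes -> text. Every ISO-8859-1 byte is the code point
/// with the same numeric value, so decoding cannot fail.
fn decode_latin1(input: &[u8]) -> String {
    input.iter().map(|&b| b as char).collect()
}

/// Encoder: text -> bytes. Fails with the first character that has
/// no ISO-8859-1 representation (anything above U+00FF).
fn encode_latin1(input: &str) -> Result<Vec<u8>, char> {
    input
        .chars()
        .map(|c| if (c as u32) <= 0xFF { Ok(c as u8) } else { Err(c) })
        .collect()
}
```

Decoding is total here while encoding is not, which is one reason the two directions need separate error types in a real API.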

(Side note: Because of constraints imposed by JavaScript and to avoid costly conversions, Servo will probably use a different data type for representing text. This encoding API could eventually become generic over a Text trait, but I think that it should stick to str for now.)


The most convenient way to represent a "stream" is with a vector or string. This, however, requires the whole input to be in memory before decoding/encoding can start, and requires that to finish before any of the output can be used. It should definitely be possible to, e.g., decode some content as it arrives from the network and parse it in a pipeline.

The most fundamental type of API is one where the user repeatedly "pushes" chunks of input into a decoder/encoder object (that may maintain state between chunks) and gets the output so far in return, then signals the end of the input.
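
A push-based decoder of this kind can be sketched for UTF-8 in present-day Rust. This is only an illustration under stated assumptions: Utf8PushDecoder and this cut-down DecodeError are hypothetical, Result<(), DecodeError> replaces the Option<DecodeError> of the proposal, and the byte offset is left out for brevity. The only library calls are std::str::from_utf8 and the valid_up_to/error_len methods of its error type.

```rust
use std::str;

/// Hypothetical error type: only the offending bytes are kept.
#[derive(Debug, PartialEq)]
struct DecodeError {
    invalid_byte_sequence: Vec<u8>,
}

/// Hypothetical push-based UTF-8 decoder. The state kept between
/// chunks is an incomplete multi-byte sequence, if any.
#[derive(Default)]
struct Utf8PushDecoder {
    pending: Vec<u8>,
}

impl Utf8PushDecoder {
    /// Append as much decoded text as possible to `output`.
    fn feed(&mut self, input: &[u8], output: &mut String) -> Result<(), DecodeError> {
        // Start with the bytes left over from the previous chunk.
        let mut buf = std::mem::take(&mut self.pending);
        buf.extend_from_slice(input);
        match str::from_utf8(&buf) {
            Ok(s) => {
                output.push_str(s);
                Ok(())
            }
            Err(e) => {
                let valid = e.valid_up_to();
                // from_utf8 guarantees that this prefix is valid UTF-8.
                output.push_str(str::from_utf8(&buf[..valid]).unwrap());
                match e.error_len() {
                    // Truncated sequence at the end: wait for more input.
                    None => {
                        self.pending = buf[valid..].to_vec();
                        Ok(())
                    }
                    // Definitely invalid bytes: report them and stop.
                    Some(len) => Err(DecodeError {
                        invalid_byte_sequence: buf[valid..valid + len].to_vec(),
                    }),
                }
            }
        }
    }

    /// Signal the end of the input; leftover bytes are an error.
    fn flush(&mut self) -> Result<(), DecodeError> {
        if self.pending.is_empty() {
            Ok(())
        } else {
            Err(DecodeError {
                invalid_byte_sequence: std::mem::take(&mut self.pending),
            })
        }
    }
}
```

Feeding [0x63, 0x61, 0x66, 0xC3] and then [0xA9] appends "caf" after the first call and "é" after the second, because the lone 0xC3 is held back until its continuation byte arrives.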

An iterator adapter where the user "pulls" output from the decoder, which in turn "pulls" from the input, can be nicer, but it is easy to build on top of a "push-based" API, while the reverse requires tasks.
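
To illustrate how little is needed to layer "pull" on top of "push", here is a sketch in present-day Rust. The PushDecoder trait, the toy Latin1 decoder and the Pull adapter are all hypothetical names, and error handling is left out to keep the shape visible.

```rust
/// Hypothetical push-based interface, loosely modelled on the
/// proposed Decoder trait but without error reporting.
trait PushDecoder {
    fn feed(&mut self, input: &[u8], output: &mut String);
    fn flush(&mut self, output: &mut String);
}

/// Toy stateless decoder: each byte is the code point of the same value.
struct Latin1;

impl PushDecoder for Latin1 {
    fn feed(&mut self, input: &[u8], output: &mut String) {
        output.extend(input.iter().map(|&b| b as char));
    }
    fn flush(&mut self, _output: &mut String) {}
}

/// Pull-based adapter: each call to next() pulls one chunk from the
/// input iterator and pushes it into the decoder.
struct Pull<I, D> {
    input: I,
    decoder: D,
    finished: bool,
}

impl<I, D> Iterator for Pull<I, D>
where
    I: Iterator<Item = Vec<u8>>,
    D: PushDecoder,
{
    type Item = String;

    fn next(&mut self) -> Option<String> {
        if self.finished {
            return None;
        }
        let mut out = String::new();
        match self.input.next() {
            Some(chunk) => self.decoder.feed(&chunk, &mut out),
            None => {
                // End of input: give the decoder a chance to emit
                // final output, then stop for good.
                self.finished = true;
                self.decoder.flush(&mut out);
                if out.is_empty() {
                    return None;
                }
            }
        }
        Some(out)
    }
}
```

The adapter needs no knowledge of the encoding itself; it only forwards chunks and remembers whether flush has already been called.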

Iterator<u8> and Iterator<char> are tempting, but we may need to work on big chunks at a time for efficiency: Iterator<~[u8]> and Iterator<~str>. Or could single-byte/char iterators be reliably inlined to achieve similar efficiency?

Can Iterator<&[u8]> work if the iterator itself contains a fixed-sized or preallocated buffer? For I/O purposes, allocating a bunch of buffers just to write them out to a stream sounds wasteful.
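
On the question of reusing one buffer: in present-day Rust, the items of an Iterator cannot borrow from the iterator itself, so one alternative is to hand each chunk to a callback instead. A minimal sketch, assuming a hypothetical helper named for_each_chunk and a deliberately tiny 4-byte buffer; only std::io::Read is real API here.

```rust
use std::io::{self, Read};

/// Hypothetical helper: read `source` through one fixed-size buffer
/// and pass each filled slice to `sink`. No per-chunk allocation.
fn for_each_chunk<R: Read>(mut source: R, mut sink: impl FnMut(&[u8])) -> io::Result<()> {
    // Tiny on purpose so that the chunking is easy to observe.
    let mut buf = [0u8; 4];
    loop {
        // A read of zero bytes signals the end of the input.
        // Errors, including interruptions, are simply propagated.
        let n = source.read(&mut buf)?;
        if n == 0 {
            return Ok(());
        }
        sink(&buf[..n]);
    }
}
```

A decoder's feed method could be called from inside the callback, so the same buffer is reused for every chunk.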



Finally, this API also needs to support several kinds of error handling. For example, a decoder should abort at the invalid byte sequence for XML, but insert U+FFFD (replacement character) for HTML. I have not decided yet whether to just have the closed set of error handling modes defined in the spec, or to make this open-ended with conditions.
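
The two modes mentioned above can be illustrated for UTF-8 with functions that exist in present-day Rust's standard library: std::str::from_utf8 stops at the first invalid sequence, while String::from_utf8_lossy substitutes U+FFFD. The wrapper names decode_strict and decode_replace are hypothetical, and this sketch says nothing about how other encodings would be handled.

```rust
use std::borrow::Cow;

/// XML-style: abort at the first invalid byte sequence and report
/// the byte offset at which it starts.
fn decode_strict(input: &[u8]) -> Result<&str, usize> {
    std::str::from_utf8(input).map_err(|e| e.valid_up_to())
}

/// HTML-style: replace each invalid sequence with U+FFFD and go on.
fn decode_replace(input: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(input)
}
```

For the input b"a\xFFb", decode_strict returns Err(1) and decode_replace returns "a\u{FFFD}b".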


Based on all the above, here is a proposed API. Encoders are omitted, but they are mostly the same as decoders with [u8] and str swapped.


/// Types implementing this trait are "algorithms"
/// such as UTF8, UTF-16, SingleByteEncoding, etc.
/// Values of these types are "encodings" as defined in the WHATWG spec:
/// UTF-8, UTF-16-LE, Windows-1252, etc.
trait Encoding {
    /// Could become an associated type with a ::new() constructor
    /// when the language supports that.
    fn new_decoder(&self) -> ~Decoder;

    /// Simple, "one shot" API.
    /// Decode a single byte string that is entirely in memory.
    /// May raise the decoding_error condition.
    fn decode(&self, input: &[u8]) -> Result<~str, DecodeError> {
        // Implementation (using a Decoder) left out.
        // This is a default method, but not meant to be overridden.
    }
}

/// Takes the invalid byte sequence.
/// Return a replacement string, or None to abort with a DecodeError.
condition! {
    pub decoding_error : ~[u8] -> Option<~str>;
}

struct DecodeError {
    input_byte_offset: uint,
    invalid_byte_sequence: ~[u8],
}

/// Each implementation of Encoding has one corresponding implementation
/// of Decoder (and one of Encoder).
///
/// A new Decoder instance should be used for every input.
/// A Decoder instance should be discarded after DecodeError was returned.
trait Decoder {
    /// Call this repeatedly with a chunk of input bytes.
    /// As much as possible of the decoded text is appended to output.
    /// May raise the decoding_error condition.
    fn feed(&mut self, input: &[u8], output: &mut ~str) -> Option<DecodeError>;

    /// Call this to indicate the end of the input.
    /// The Decoder instance should be discarded afterwards.
    /// Some encodings may append some final output at this point.
    /// May raise the decoding_error condition.
    fn flush(&mut self, output: &mut ~str) -> Option<DecodeError>;
}

/// "Pull-based" API.
struct DecoderIterator<I> {
    input_iterator: I,
    priv state: DecoderIteratorState<I>,
}


impl<I: Iterator<~[u8]>> DecoderIterator<I> {
    fn new(input_iterator: I) -> DecoderIterator<I> {
        // Implementation left out.
    }

    /// Consume the whole input iterator and return a single decoded string.
    /// May raise the decoding_error condition.
    fn concat(&mut self) -> Result<~str, DecodeError> {

This would probably take self by value if it's consuming the inner iterator?

        // Implementation left out.
    }
}

impl<I: Iterator<~[u8]>> Iterator<Result<~str, DecodeError>> for DecoderIterator<I> {
    /// Call .next() once on the input iterator and decode the result.
    /// May raise the decoding_error condition.
    /// Returns None after DecodeError was returned once,
    /// even if the input iterator is not exhausted yet.
    fn next(&mut self) -> Option<Result<~str, DecodeError>> {
        // Implementation left out.
    }
}



I don't understand this iterator. I'm guessing it calls `concat` on the `DecoderIterator` during each call to `next`, but `concat` consumes the `DecoderIterator`'s inner `Iterator`, so subsequent calls to `concat` won't work.
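
One way to make the consuming nature of `concat` explicit is to take `self` by value, so the compiler rejects any later use of the adapter. A sketch in present-day Rust, with plain UTF-8 (via the real String::from_utf8) standing in for the decoder and a cut-down, hypothetical `DecoderIterator`:

```rust
/// Cut-down stand-in for the proposed struct; the state field is omitted.
struct DecoderIterator<I> {
    input_iterator: I,
}

impl<I: Iterator<Item = Vec<u8>>> DecoderIterator<I> {
    /// Takes `self` by value: the adapter and its inner iterator are
    /// consumed, so they cannot be used again after this call.
    fn concat(self) -> Result<String, std::string::FromUtf8Error> {
        // Join all chunks, then decode once. UTF-8 is an assumption
        // made here only to keep the sketch short.
        let bytes: Vec<u8> = self.input_iterator.flatten().collect();
        String::from_utf8(bytes)
    }
}
```

With this signature, calling `concat` a second time, or calling `next` afterwards, is a compile-time error rather than a run-time surprise.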

This API only deals with Decoding. What about Encoding?

I don't see the utility of the `Encoding` factory type here, especially of instantiating it to get a `Decoder`. As you indicate, its instance methods may want to be static methods.
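
For comparison, the static-method variant could look roughly like this in present-day Rust, where associated types exist. All names are hypothetical, `Default` stands in for the `::new()` constructor mentioned in the proposal, and error handling is left out.

```rust
/// Hypothetical push-style decoder; Default plays the role of ::new().
trait Decoder: Default {
    fn feed(&mut self, input: &[u8], output: &mut String);
}

/// The encoding is a type, not a value: no instance is needed.
trait Encoding {
    type DecoderType: Decoder;

    /// One-shot API as a static method.
    fn decode(input: &[u8]) -> String {
        let mut decoder: Self::DecoderType = Default::default();
        let mut output = String::new();
        decoder.feed(input, &mut output);
        output
    }
}

/// Toy decoder: each byte is the code point of the same value.
#[derive(Default)]
struct Latin1Decoder;

impl Decoder for Latin1Decoder {
    fn feed(&mut self, input: &[u8], output: &mut String) {
        output.extend(input.iter().map(|&b| b as char));
    }
}

struct Latin1;

impl Encoding for Latin1 {
    type DecoderType = Latin1Decoder;
}
```

This shape only fits encodings that are one type each. A type like the proposal's SingleByteEncoding, whose values are distinct encodings, would still need `&self` to reach its per-value data.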

Looks like a fine start to me. Let's do it.


_______________________________________________
Rust-dev mailing list
Rust-dev@mozilla.org
https://mail.mozilla.org/listinfo/rust-dev
