Re: [rust-dev] Proposed API for character encodings

Brian Anderson Wed, 18 Sep 2013 15:32:38 -0700

On 09/10/2013 08:47 AM, Simon Sapin wrote:

Hi,


TL;DR: the actual proposal is at the end of this email.


Thanks for working on this. It's crucial.

Rust today has good support for UTF-8, which new content definitely should use, but many systems still have to deal with legacy content that uses other character encodings. There are several projects around to implement more encodings in Rust. The furthest along in my opinion is rust-encoding, notably because it implements the right specification.
rust-encoding: https://github.com/lifthrasiir/rust-encoding

The spec: http://encoding.spec.whatwg.org/
It has more precise definitions of error handling than some original RFCs, and better reflects the reality of legacy content on the web.
There was some discussion in the past few days about importing rust-encoding (or part of it) into Rust’s libstd or libextra. Before that, I think it is important to define a good API. The spec defines one for JavaScript, but we should not copy that exactly. rust-encoding’s API is mostly good, but I think that error handling could be simplified.
In abstract terms, an encoding (such as "UTF-8") is made of a decoder and an encoder. A decoder converts a stream of bytes into a stream of text (Unicode scalar values, i.e. code points excluding surrogates), while an encoder does the reverse. This does not cover other kinds of stream transformation such as base64, compression, encryption, etc.
Bytes are represented in Rust by u8, text by str/char.
(Side note: Because of constraints imposed by JavaScript and to avoid costly conversions, Servo will probably use a different data type for representing text. This encoding API could eventually become generic over a Text trait, but I think that it should stick to str for now.)
The most convenient way to represent a "stream" is with a vector or string. This however requires the whole input to be in memory before decoding/encoding can start, and that to be finished before any of the output can be used. It should definitely be possible to e.g. decode some content as it arrives from the network, and parse it in a pipeline.
The most fundamental type of API is one where the user repeatedly "pushes" chunks of input into a decoder/encoder object (that may maintain state between chunks) and gets the output so far in return, then signals the end of the input.
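To make the "push" idea concrete, here is a minimal sketch in present-day Rust syntax (not the 2013 syntax used in this thread). The names Utf8PushDecoder and its fields are hypothetical, only UTF-8 is handled, and DecodeError is simplified to carry just the offending bytes. The point it illustrates is the state kept between chunks: an incomplete multi-byte sequence at the end of one chunk is held back until the next chunk arrives.

```rust
use std::str;

/// Simplified error type for this sketch (the proposal also tracks an offset).
#[derive(Debug, PartialEq)]
struct DecodeError {
    invalid_byte_sequence: Vec<u8>,
}

/// Hypothetical push-style UTF-8 decoder; not part of any real library.
struct Utf8PushDecoder {
    /// Bytes of an incomplete sequence carried over from the previous chunk.
    pending: Vec<u8>,
}

impl Utf8PushDecoder {
    fn new() -> Self {
        Utf8PushDecoder { pending: Vec::new() }
    }

    /// Decode one chunk, appending as much text as possible to `output`.
    fn feed(&mut self, input: &[u8], output: &mut String) -> Result<(), DecodeError> {
        // Prepend whatever was left over from the previous chunk.
        let mut buf = std::mem::take(&mut self.pending);
        buf.extend_from_slice(input);
        match str::from_utf8(&buf) {
            Ok(s) => {
                output.push_str(s);
                Ok(())
            }
            Err(e) => {
                let valid = e.valid_up_to();
                // The prefix up to `valid` is guaranteed to be well-formed.
                output.push_str(str::from_utf8(&buf[..valid]).unwrap());
                match e.error_len() {
                    // Input ended mid-sequence: keep the tail for the next chunk.
                    None => {
                        self.pending = buf[valid..].to_vec();
                        Ok(())
                    }
                    // A genuinely invalid sequence: report it and stop.
                    Some(n) => Err(DecodeError {
                        invalid_byte_sequence: buf[valid..valid + n].to_vec(),
                    }),
                }
            }
        }
    }

    /// Signal the end of input; leftover bytes mean a truncated sequence.
    fn flush(self) -> Result<(), DecodeError> {
        if self.pending.is_empty() {
            Ok(())
        } else {
            Err(DecodeError { invalid_byte_sequence: self.pending })
        }
    }
}
```

Feeding the bytes of "h€!" split in the middle of the three-byte euro sign yields "h" after the first call and the rest after the second. Taking self by value in flush is one way to express "discard the decoder afterwards" in the type system.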
An iterator adapter where the user "pulls" output from the decoder, which in turn "pulls" from the input, can be nicer, but it is easy to build on top of a "push-based" API, while the reverse requires tasks.
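As a rough illustration of why pull is easy to layer on push, here is a sketch in present-day Rust. Everything in it is an assumption made for the example: the PushDecoder trait, the toy Latin1Decoder (each byte maps to the code point of the same value, i.e. ISO-8859-1 in the strict sense), and the PullDecoder adapter. Error handling is left out to keep the shape visible.

```rust
/// Hypothetical push-style interface, loosely modeled on the proposed Decoder.
trait PushDecoder {
    fn feed(&mut self, input: &[u8], output: &mut String);
    fn flush(&mut self, output: &mut String);
}

/// Toy stateless decoder: byte N becomes code point U+00NN.
struct Latin1Decoder;

impl PushDecoder for Latin1Decoder {
    fn feed(&mut self, input: &[u8], output: &mut String) {
        output.extend(input.iter().map(|&b| b as char));
    }
    fn flush(&mut self, _output: &mut String) {}
}

/// Pull-based adapter: each call to next() pulls one chunk from `input`
/// and pushes it into `decoder`.
struct PullDecoder<I, D> {
    input: I,
    decoder: D,
    finished: bool,
}

impl<I: Iterator<Item = Vec<u8>>, D: PushDecoder> Iterator for PullDecoder<I, D> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        if self.finished {
            return None;
        }
        let mut out = String::new();
        match self.input.next() {
            Some(chunk) => self.decoder.feed(&chunk, &mut out),
            None => {
                // Input exhausted: flush once, then stop for good.
                self.finished = true;
                self.decoder.flush(&mut out);
                if out.is_empty() {
                    return None;
                }
            }
        }
        Some(out)
    }
}
```

The adapter needs no concurrency at all; it only forwards each pulled chunk to feed. Going the other way (driving a pull-based decoder from pushed chunks) needs somewhere to suspend the decoder between pushes, which is the reason tasks come up above.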
Iterator<u8> and Iterator<char> are tempting, but we may need to work on big chunks at a time for efficiency: Iterator<~[u8]> and Iterator<~str>. Or could single-byte/char iterators be reliably inlined to achieve similar efficiency?

Can Iterator<&[u8]> work if the iterator itself contains a fixed-sized or preallocated buffer? For I/O purposes, allocating a bunch of buffers just to write them out to a stream sounds wasteful.
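For what it's worth, in present-day Rust the standard Iterator trait cannot yield items that borrow from the iterator itself, so a reusable internal buffer is usually exposed through an ordinary method rather than through Iterator. A minimal sketch of that shape, where ChunkReader, next_chunk and the 4-byte buffer size are all hypothetical choices made for the example:

```rust
use std::io::{self, Read};

/// Hypothetical chunk source that reuses one fixed-size buffer.
struct ChunkReader<R> {
    reader: R,
    buf: [u8; 4], // deliberately tiny so the chunking is visible
}

impl<R: Read> ChunkReader<R> {
    fn new(reader: R) -> Self {
        ChunkReader { reader, buf: [0; 4] }
    }

    /// Returns a slice into the internal buffer, valid until the next call.
    /// Ok(None) signals the end of the input.
    fn next_chunk(&mut self) -> io::Result<Option<&[u8]>> {
        let n = self.reader.read(&mut self.buf)?;
        if n == 0 {
            Ok(None)
        } else {
            Ok(Some(&self.buf[..n]))
        }
    }
}
```

No allocation happens per chunk; the borrow checker makes sure a returned slice is no longer in use before the buffer is overwritten by the next call.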

Finally, this API also needs to support several kinds of error handling. For example, a decoder should abort at the invalid byte sequence for XML, but insert U+FFFD (replacement character) for HTML. I’m not decided yet whether to just have the closed set of error handling modes defined in the spec, or make this open-ended with conditions.
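The "closed set" option could look roughly like the sketch below, written in present-day Rust and limited to UTF-8. ErrorMode and decode_utf8 are names invented for this example; the two modes stand in for the XML-style (abort) and HTML-style (replace) behaviours mentioned above, and the error value is simply the byte offset of the first invalid sequence.

```rust
/// Hypothetical closed set of error-handling modes.
enum ErrorMode {
    /// Abort at the first invalid byte sequence (XML-style).
    Strict,
    /// Substitute U+FFFD for each invalid sequence (HTML-style).
    Replace,
}

/// Decode UTF-8 bytes; on failure in Strict mode, return the byte offset
/// of the first invalid sequence.
fn decode_utf8(input: &[u8], mode: ErrorMode) -> Result<String, usize> {
    match mode {
        ErrorMode::Strict => std::str::from_utf8(input)
            .map(|s| s.to_owned())
            .map_err(|e| e.valid_up_to()),
        ErrorMode::Replace => Ok(String::from_utf8_lossy(input).into_owned()),
    }
}
```

An open-ended design would instead let the caller supply the replacement logic (a condition handler in the terms of this thread, or a closure), at the cost of a wider API surface.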

Based on all the above, here is a proposed API. Encoders are omitted, but they are mostly the same as decoders with [u8] and str swapped.



/// Types implementing this trait are "algorithms"
/// such as UTF8, UTF-16, SingleByteEncoding, etc.
/// Values of these types are "encodings" as defined in the WHATWG spec:
/// UTF-8, UTF-16-LE, Windows-1252, etc.
trait Encoding {
    /// Could become an associated type with a ::new() constructor
    /// when the language supports that.
    fn new_decoder(&self) -> ~Decoder;

    /// Simple, "one shot" API.
    /// Decode a single byte string that is entirely in memory.
    /// May raise the decoding_error condition.
    fn decode(&self, input: &[u8]) -> Result<~str, DecodeError> {
        // Implementation (using a Decoder) left out.
        // This is a default method, but not meant to be overridden.
    }
}

/// Takes the invalid byte sequence.
/// Return a replacement string, or None to abort with a DecodeError.
condition! {
    pub decoding_error : ~[u8] -> Option<~str>;
}

struct DecodeError {
    input_byte_offset: uint,
    invalid_byte_sequence: ~[u8],
}

/// Each implementation of Encoding has one corresponding implementation
/// of Decoder (and one of Encoder).
///
/// A new Decoder instance should be used for every input.
/// A Decoder instance should be discarded after DecodeError was returned.
trait Decoder {
    /// Call this repeatedly with a chunk of input bytes.
    /// As much as possible of the decoded text is appended to output.
    /// May raise the decoding_error condition.
    fn feed(input: &[u8], output: &mut ~str) -> Option<DecodeError>;

    /// Call this to indicate the end of the input.
    /// The Decoder instance should be discarded afterwards.
    /// Some encodings may append some final output at this point.
    /// May raise the decoding_error condition.
    fn flush(output: &mut ~str) -> Option<DecodeError>;
}

/// "Pull-based" API.
struct DecoderIterator<I> {
    input_iterator: I,
    priv state: DecoderIteratorState<I>,
}


impl<I: Iterator<~[u8]>> DecoderIterator<I> {
    fn new(input_iterator: I) -> DecoderIterator<I> {
        // Implementation left out.
    }

    /// Consume the whole input iterator and return a single decoded string.
    /// May raise the decoding_error condition.
    fn concat(&mut self) -> Result<~str, DecodeError> {


This would probably take self by value if it's consuming the inner iterator?

        // Implementation left out.
    }
}
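The by-value suggestion can be illustrated in present-day Rust. The sketch below is an assumption-laden stand-in, not the proposed DecoderIterator: DecodedChunks is a hypothetical wrapper over an iterator that already yields decoded chunks or an error, and DecodeError is cut down to a single field.

```rust
#[derive(Debug, PartialEq)]
struct DecodeError {
    input_byte_offset: usize,
}

/// Hypothetical wrapper around an iterator of already-decoded chunks.
struct DecodedChunks<I> {
    inner: I,
}

impl<I: Iterator<Item = Result<String, DecodeError>>> DecodedChunks<I> {
    /// Takes `self` by value: the wrapper and its inner iterator are
    /// consumed, so a second call is rejected at compile time.
    fn concat(self) -> Result<String, DecodeError> {
        // Collecting into Result<String, _> concatenates the chunks and
        // stops at the first error.
        self.inner.collect()
    }
}
```

With this signature, "use after concat" stops being a runtime question about what the exhausted iterator returns and becomes a move error.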

impl<I: Iterator<~[u8]>> Iterator<Result<~str, DecodeError>> for DecoderIterator<I> {

    /// Call .next() once on the input iterator and decode the result.
    /// May raise the decoding_error condition.
    /// Returns None after DecodeError was returned once,
    /// even if the input iterator is not exhausted yet.
    fn next(&mut self) -> Option<Result<~str, DecodeError>> {
        // Implementation left out.
    }
}

I don't understand this iterator. I'm guessing it calls `concat` on the `DecoderIterator` during each call to `next`, but `concat` consumes `DecoderIterator`'s inner `Iterator`, so subsequent calls to `concat` won't work.


This API only deals with Decoding. What about Encoding?

I don't see the utility of the `Encoding` factory type here, especially of instantiating it to get a `Decoder`. As you indicate, its instance methods may want to be static methods.


Looks like a fine start to me. Let's do it.


_______________________________________________
Rust-dev mailing list
Rust-dev@mozilla.org
https://mail.mozilla.org/listinfo/rust-dev
