Hi,
TL;DR: the actual proposal is at the end of this email.
Rust today has good support for UTF-8, which new content should definitely
use, but many systems still have to deal with legacy content that
uses other character encodings. There are several projects around to
implement more encodings in Rust. The furthest along, in my opinion,
is rust-encoding, notably because it implements the right specification.
rust-encoding: https://github.com/lifthrasiir/rust-encoding
The spec: http://encoding.spec.whatwg.org/
It defines error handling more precisely than the original RFCs, and
better reflects the reality of legacy content on the web.
There was some discussion in the past few days about importing
rust-encoding (or part of it) into Rust’s libstd or libextra. Before
that, I think it is important to define a good API. The spec defines one
for JavaScript, but we should not copy that exactly. rust-encoding’s API
is mostly good, but I think that error handling could be simplified.
In abstract terms, an encoding (such as "UTF-8") is made of a decoder
and an encoder. A decoder converts a stream of bytes into a stream of
text (Unicode scalar values, i.e. code points excluding surrogates),
while an encoder does the reverse. This does not cover other kinds of
stream transformation such as base64, compression, encryption, etc.
Bytes are represented in Rust by u8, text by str/char.
(Side note: Because of constraints imposed by JavaScript and to avoid
costly conversions, Servo will probably use a different data type for
representing text. This encoding API could eventually become generic
over a Text trait, but I think that it should stick to str for now.)
The most convenient way to represent a "stream" is with a vector or
string. This however requires the whole input to be in memory before
decoding/encoding can start, and that to be finished before any of the
output can be used. It should definitely be possible to eg. decode some
content as it arrives from the network, and parse it in a pipeline.
The most fundamental kind of API is one where the user repeatedly
"pushes" chunks of input into a decoder/encoder object (which may
maintain state between chunks) and gets the output so far in return,
then signals the end of the input.
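To make this concrete, incremental decoding with such a push-based API
could look like the sketch below. This is illustrative pseudocode against
the Decoder trait proposed further down, not working code; UTF8 is assumed
to be an Encoding value, and get_next_chunk() is a hypothetical source of
bytes arriving from the network.

let mut decoder = UTF8.new_decoder();
let mut output = ~"";
loop {
    match get_next_chunk() {  // hypothetical: returns Option<~[u8]>
        Some(bytes) => {
            match decoder.feed(bytes, &mut output) {
                Some(error) => return Err(error),
                // Keep going; `output` holds everything decoded so far
                // and can already be handed to a parser in a pipeline.
                None => {}
            }
        }
        None => break,
    }
}
// Signal the end of the input; some encodings emit final output here.
let final_error = decoder.flush(&mut output);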
An iterator adapter where the user "pulls" output from the decoder,
which in turn "pulls" from the input, can be nicer, but it is easy to
build on top of a "push-based" API, while the reverse requires tasks.
Iterator<u8> and Iterator<char> are tempting, but we may need to work on
big chunks at a time for efficiency: Iterator<~[u8]> and Iterator<~str>.
Or could single-byte/char iterators be reliably inlined to achieve
similar efficiency?
Finally, this API also needs to support several kinds of error
handling. For example, a decoder should abort at the first invalid byte
sequence for XML, but insert U+FFFD (the replacement character) for HTML.
I have not decided yet whether to offer just the closed set of error
handling modes defined in the spec, or to make this open-ended with conditions.
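With conditions, both behaviours could be expressed by trapping the
decoding_error condition proposed below. This is an illustrative
pseudocode sketch (the 0.x cond.trap(..).inside(..) style), assuming a
UTF8 Encoding value; none of this exists yet:

// HTML-style: replace every invalid byte sequence with U+FFFD.
let html_result = do decoding_error::cond.trap(|_invalid_bytes| {
    Some(~"\uFFFD")
}).inside {
    UTF8.decode(input)
};

// XML-style: abort on the first invalid byte sequence.
let xml_result = do decoding_error::cond.trap(|_invalid_bytes| {
    None  // decode() then returns Err(DecodeError)
}).inside {
    UTF8.decode(input)
};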
Based on all the above, here is a proposed API. Encoders are omitted,
but they are mostly the same as decoders with [u8] and str swapped.
/// Types implementing this trait are "algorithms"
/// such as UTF8, UTF-16, SingleByteEncoding, etc.
/// Values of these types are "encodings" as defined in the WHATWG spec:
/// UTF-8, UTF-16-LE, Windows-1252, etc.
trait Encoding {
    /// Could become an associated type with a ::new() constructor
    /// when the language supports that.
    fn new_decoder(&self) -> ~Decoder;

    /// Simple, "one shot" API.
    /// Decode a single byte string that is entirely in memory.
    /// May raise the decoding_error condition.
    fn decode(&self, input: &[u8]) -> Result<~str, DecodeError> {
        // Implementation (using a Decoder) left out.
        // This is a default method, but not meant to be overridden.
    }
}
/// Takes the invalid byte sequence.
/// Returns a replacement string, or None to abort with a DecodeError.
condition! {
    pub decoding_error: ~[u8] -> Option<~str>;
}
struct DecodeError {
    input_byte_offset: uint,
    invalid_byte_sequence: ~[u8],
}
/// Each implementation of Encoding has one corresponding implementation
/// of Decoder (and one of Encoder).
///
/// A new Decoder instance should be used for every input.
/// A Decoder instance should be discarded after DecodeError was returned.
trait Decoder {
    /// Call this repeatedly with a chunk of input bytes.
    /// As much as possible of the decoded text is appended to output.
    /// May raise the decoding_error condition.
    fn feed(&mut self, input: &[u8], output: &mut ~str) -> Option<DecodeError>;

    /// Call this to indicate the end of the input.
    /// The Decoder instance should be discarded afterwards.
    /// Some encodings may append some final output at this point.
    /// May raise the decoding_error condition.
    fn flush(&mut self, output: &mut ~str) -> Option<DecodeError>;
}
/// "Pull-based" API.
struct DecoderIterator<I> {
input_iterator: I,
priv state: DecoderIteratorState<I>,
}
impl<I: Iterator<~[u8]>> DecoderIterator<I> {
    fn new(input_iterator: I) -> DecoderIterator<I> {
        // Implementation left out.
    }

    /// Consume the whole input iterator and return a single decoded string.
    /// May raise the decoding_error condition.
    fn concat(&mut self) -> Result<~str, DecodeError> {
        // Implementation left out.
    }
}
impl<I: Iterator<~[u8]>> Iterator<Result<~str, DecodeError>> for DecoderIterator<I> {
    /// Call .next() once on the input iterator and decode the result.
    /// May raise the decoding_error condition.
    /// Returns None after DecodeError was returned once,
    /// even if the input iterator is not exhausted yet.
    fn next(&mut self) -> Option<Result<~str, DecodeError>> {
        // Implementation left out.
    }
}
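For comparison, here is how the two entry points might be used. Again,
this is a pseudocode sketch of the proposal, not working code; UTF8,
network_chunks (an Iterator<~[u8]>), and parser are all hypothetical,
and I am assuming each encoding provides its own DecoderIterator
constructor.

// One-shot decoding of a byte string that is entirely in memory:
let text = UTF8.decode(bytes_from_disk);

// Pull-based decoding of chunks arriving from the network:
let mut decoded_chunks = DecoderIterator::new(network_chunks);
for result in decoded_chunks {
    match result {
        Ok(text_chunk) => parser.feed(text_chunk),  // decode in a pipeline
        Err(error) => return Err(error),  // iterator yields None afterwards
    }
}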
Cheers,
--
Simon Sapin
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev