On 11/09/2013 17:19, Marvin Löbel wrote:
On 09/10/2013 05:47 PM, Simon Sapin wrote:
Hi,
TL;DR: the actual proposal is at the end of this email.
Rust today has good support for UTF-8, which new content should definitely
use, but many systems still have to deal with legacy content that uses
other character encodings. There are several projects around
to implement more encodings in Rust. The furthest along, in my
opinion, is rust-encoding, notably because it implements the right
specification.
rust-encoding: https://github.com/lifthrasiir/rust-encoding
The spec: http://encoding.spec.whatwg.org/
It has more precise definitions of error handling than some original
RFCs, and better reflects the reality of legacy content on the web.
There was some discussion in the past few days about importing
rust-encoding (or part of it) into Rust’s libstd or libextra. Before
that, I think it is important to define a good API. The spec defines
one for JavaScript, but we should not copy that exactly.
rust-encoding’s API is mostly good, but I think that error handling
could be simplified.
In abstract terms, an encoding (such as "UTF-8") is made of a decoder
and an encoder. A decoder converts a stream of bytes into a stream of
text (Unicode scalar values, ie. code points excluding surrogates),
while an encoder does the reverse. This does not cover other kinds of
stream transformation such as base64, compression, encryption, etc.
Bytes are represented in Rust by u8, text by str/char.
(Side note: Because of constraints imposed by JavaScript and to avoid
costly conversions, Servo will probably use a different data type for
representing text. This encoding API could eventually become generic
over a Text trait, but I think that it should stick to str for now.)
The most convenient way to represent a "stream" is with a vector or
string. However, this requires the whole input to be in memory before
decoding/encoding can start, and the conversion to be finished before any
of the output can be used. It should definitely be possible to eg. decode
some content as it arrives from the network, and parse it in a pipeline.
The most fundamental kind of API is one where the user repeatedly
"pushes" chunks of input into a decoder/encoder object (which may
maintain state between chunks) and gets the output so far in return,
then signals the end of the input.
An iterator adapter where the user "pulls" output from the decoder,
which in turn "pulls" from the input, can be nicer, but it is easy to
build on top of a "push-based" API, while the reverse requires tasks.
Iterator<u8> and Iterator<char> are tempting, but we may need to work
on big chunks at a time for efficiency: Iterator<~[u8]> and
Iterator<~str>. Or could single-byte/char iterators be reliably
inlined to achieve similar efficiency?
Finally, this API also needs to support several kinds of error
handling. For example, a decoder should abort at the invalid byte
sequence for XML, but insert U+FFFD (replacement character) for HTML.
I have not decided yet whether to just have the closed set of error
handling modes defined in the spec, or to make this open-ended with
conditions.
Based on all the above, here is a proposed API. Encoders are omitted,
but they are mostly the same as decoders with [u8] and str swapped.
/// Types implementing this trait are "algorithms"
/// such as UTF8, UTF-16, SingleByteEncoding, etc.
/// Values of these types are "encodings" as defined in the WHATWG spec:
/// UTF-8, UTF-16-LE, Windows-1252, etc.
trait Encoding {
    /// Could become an associated type with a ::new() constructor
    /// when the language supports that.
    fn new_decoder(&self) -> ~Decoder;

    /// Simple, "one shot" API.
    /// Decode a single byte string that is entirely in memory.
    /// May raise the decoding_error condition.
    fn decode(&self, input: &[u8]) -> Result<~str, DecodeError> {
        // Implementation (using a Decoder) left out.
        // This is a default method, but not meant to be overridden.
    }
}
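For illustration, here is a rough sketch of the "one shot" API in use; the
UTF8 value is a made-up stand-in for whatever concrete type ends up
implementing Encoding:

// Sketch only: `UTF8` is a hypothetical value implementing Encoding.
let bytes = [0x68u8, 0x65, 0x6C, 0x6C, 0x6F];   // "hello"
match UTF8.decode(bytes) {
    Ok(text) => println(text),
    Err(e) => println(fmt!("invalid input at byte offset %u",
                           e.input_byte_offset)),
}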
/// Takes the invalid byte sequence.
/// Return a replacement string, or None to abort with a DecodeError.
condition! {
    pub decoding_error : ~[u8] -> Option<~str>;
}

struct DecodeError {
    input_byte_offset: uint,
    invalid_byte_sequence: ~[u8],
}
/// Each implementation of Encoding has one corresponding implementation
/// of Decoder (and one of Encoder).
///
/// A new Decoder instance should be used for every input.
/// A Decoder instance should be discarded after DecodeError was returned.
trait Decoder {
    /// Call this repeatedly with a chunk of input bytes.
    /// As much as possible of the decoded text is appended to output.
    /// May raise the decoding_error condition.
    fn feed(input: &[u8], output: &mut ~str) -> Option<DecodeError>;

    /// Call this to indicate the end of the input.
    /// The Decoder instance should be discarded afterwards.
    /// Some encodings may append some final output at this point.
    /// May raise the decoding_error condition.
    fn flush(output: &mut ~str) -> Option<DecodeError>;
}
/// "Pull-based" API.
struct DecoderIterator<I> {
input_iterator: I,
priv state: DecoderIteratorState<I>,
}
impl<I: Iterator<~[u8]>> DecoderIterator<I> {
fn new(input_iterator: I) -> DecoderIterator<I> {
// Implementation left out.
}
/// Consume the whole input iterator and return a single decoded
string.
/// May raise the decoding_error condition.
fn concat(&mut self) -> Result<~str, DecodeError> {
// Implementation left out.
}
}
impl<I: Iterator<~[u8]>> Iterator<Result<~str, DecodeError>> for
DecoderIterator<I> {
/// Call .next() once on the input iterator and decode the result.
/// May raise the decoding_error condition.
/// Returns None after DecodeError was returned once,
/// even if the input iterator is not exhausted yet.
fn next(&mut self) -> Option<Result<~str, DecodeError>> {
// Implementation left out.
}
}
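And a sketch of the pull-based adapter in a pipeline; chunks_from_file,
parser and handle_error are placeholders invented for the example:

// Pull decoded chunks and hand each one to some consumer.
let chunks = chunks_from_file("legacy.txt");   // hypothetical Iterator<~[u8]>
let mut decoded = DecoderIterator::new(chunks);
for result in decoded {
    match result {
        Ok(text) => parser.feed_text(text),    // hypothetical incremental parser
        Err(error) => { handle_error(error); break }
    }
}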
Looking at it at first, I found a few things strange:
- new_decoder returns a trait object, which incurs dynamic dispatch costs
I should have added that in Servo’s main use-case, the encoding is not
known at compile time but is based on a string label that typically
comes from a Content-Type HTTP header:
fn get_encoding_from_label(label: &str) -> ~Encoding { /* ... */ }
In this case, we do want dynamic dispatch.
If you know the encoding at compile time and want static dispatch, it is
perfectly fine to use a specific type such as UTF8Decoder directly,
without trait objects.
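A small sketch of the two styles; UTF8Decoder::new() is hypothetical:

// Dynamic dispatch: the encoding is only known at run time (Servo's case).
let encoding: ~Encoding = get_encoding_from_label("windows-1252");
let decoder: ~Decoder = encoding.new_decoder();

// Static dispatch: the encoding is known at compile time,
// so a concrete decoder type can be used directly.
let mut utf8_decoder = UTF8Decoder::new();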
- Encoding uses explicit self despite not having any state
It’s not exactly "state", but as explained in the comments an "encoding"
is a value of a type that implements the Encoding trait, not just the
type itself. For example, rust-encoding has a SingleByteEncoding
representing many different encodings, with each value pointing at
different tables.
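To illustrate the idea (a sketch only, not rust-encoding's actual
definition; SingleByteDecoder and the field names are made up):

/// One type whose *values* are distinct encodings.
struct SingleByteEncoding {
    name: &'static str,
    /// Maps each byte in 0x80..0xFF to a Unicode code point.
    table: &'static [u16],
}

impl Encoding for SingleByteEncoding {
    fn new_decoder(&self) -> ~Decoder {
        // `self` is genuinely needed: each value carries its own table.
        ~SingleByteDecoder { table: self.table } as ~Decoder
    }
}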
- Decoder doesn't use explicit self despite having state
That’s a mistake. .feed() and .flush() should be proper methods with self.
- Decoder gets passed the target ~str on each feed and flush call,
despite it always being the same (should be passed on construction of
the Decoder state and used internally)
The point is that it doesn’t have to be the same string. Each decoded
chunk could be passed to the next step of a pipeline, eg. to an
incremental parser.
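For example (a sketch; decoder, chunks and parser are placeholders, and
error handling is elided for brevity):

let mut buffer = ~"";
for chunk in chunks {
    decoder.feed(chunk, &mut buffer);
    // Hand this chunk's decoded text downstream; the decoder does not
    // care that the next call gets a different (fresh) ~str.
    parser.feed_text(buffer);
    buffer = ~"";
}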
- I find the fact that flush itself can cause a DecodeError, despite
not decoding anything new, strange.
.flush() indicates the end of the input. It can trigger an error eg. if
the last input chunk ends with an incomplete but so far valid-looking
UTF-8 byte sequence.
To anticipate another possible question: .flush() most often doesn’t
write any output, but it can write a termination sequence in some
encodings like ISO-2022-JP.
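Concretely (a sketch, again assuming the &mut self fix and a hypothetical
UTF8 encoding value):

// "é" is 0xC3 0xA9 in UTF-8; feeding only the first byte still looks
// like the start of a valid sequence.
let mut decoder = UTF8.new_decoder();
let mut output = ~"";
decoder.feed([0xC3u8], &mut output);   // no error yet
decoder.flush(&mut output);            // the truncated sequence is reported here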
flush should also maybe be called in
the Decoder destructor.
I don’t think so. The point of .flush() is to deal with any remaining
output or error. If you’re dropping everything, you don’t care about
that.
However, after playing around with it for a while I found that
- If new_decoder returns a generic type, that must be passed in as a
type parameter or be implemented as an associated type, which we don't
have yet.
Yes, as said in the comments, Decoder and Encoder should ideally be
associated types and .new_decoder() would not be needed at all, but we
don’t have that yet.
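For reference, a sketch of what that could look like once the language
supports it (purely hypothetical syntax, for illustration only):

trait Encoding {
    type Decoder;   // each Encoding names its Decoder type...
    type Encoder;   // ...and its Encoder type,
}
// ...and a decoder would be created with a ::new() constructor on the
// associated type instead of encoding.new_decoder().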
- Calling associated functions on Encoding is currently impossible
without a workaround because you can't specify the self type. But even
once we can again, it would look like this: `Encoding::<for
Utf8>::decode(source)`. Having associated function trait lookup on types
would ease that somewhat.
Just use self. It’s needed anyway.
- Additionally, if new_decoder results in type parameters on Encoding,
then you have to rely on inference to avoid having to specify them.
See comments on this below.
- You can't pass the &mut ~str into a new Decoder using the Encoding
trait because we need Higher Kinded Types for the 'self parameter.
Sorry, I don’t understand this part :/
Based on this I have written two working proofs of concept of the
proposed traits.
I left API designs like error handling as is, but made Decoder carry
proper state.
The first variant implements an Encoding as I think it should work:
associated functions for new_decoder and decode:
https://gist.github.com/Kimundi/6523377
The second variant works around the verbosity of specifying the right
implementation by making trait lookup happen with a method call:
https://gist.github.com/Kimundi/6522973
As we discussed on IRC, I believe that this does not cover the case where
you do want dynamic dispatch.
Thank you for your feedback!
Cheers,
--
Simon Sapin
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev