On 29 Sep 2009, at 3:44 pm, John Cowan wrote:

> Alaric Snell-Pym scripsit:
>
>> The behaviour of read-char in terms of read-octet will need careful
>> specifying for funny encodings, mind; some encodings have control
>> characters that shift modes and the like, but aren't part of any
>> character, so the byte on which a character boundary sits is a bit
>> vague. I guess the best approach to that is to say that read-char
>> reads 0 or more non-character octets, if present, then reads enough
>> octets to decode one character; and whatever it buffers, it shares
>> that buffer with read-octet.
>
> That *sounds* good, but it's horribly slow in practice, and
> interpreters (without JITs) will suffer especially badly from it.
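To make sure we're picturing the same arrangement, here's a very rough
sketch of the shape I have in mind (read-octet and decode-step are
hypothetical names, not a proposal; decode-step stands for whatever
state machine the encoding requires, and yields no character for
shift/escape octets that belong to no character):

;; A character reader layered over an octet reader.  decode-step takes
;; the decoder state and one octet and returns two values: a character,
;; or #f if more octets are needed, and the next decoder state.
;; Mode-shifting octets that are part of no character just update the
;; state and yield #f, so read-char quietly consumes them.
(define (make-char-reader read-octet decode-step initial-state)
  (let ((state initial-state))
    (lambda ()                          ; behaves like read-char
      (let loop ()
        (let ((octet (read-octet)))
          (if (eof-object? octet)
              octet
              (let-values (((char next-state)
                            (decode-step state octet)))
                (set! state next-state)
                (or char (loop)))))))))

Since read-char only ever pulls octets through read-octet, whatever
buffering sits behind read-octet is shared automatically.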
Why is it so slow, out of interest?

> Character encoding/decoding needs to be done in big buffers for the
> same reason that actual I/O does.

Why? Just because of the procedure call overhead of read-char? What is
the operation of reading a character doing beyond feeding some bytes
into a state machine until it reaches a state where it has a whole
character to return?

> Making those buffers the same buffer is horribly messy: if the
> internal character format is UTF-16 and the file encoding is ASCII,
> you need twice as big a decoding buffer as the I/O buffer to get any
> decent efficiency at all.

Maybe, but I'd question why you'd need a buffer of decoded characters
at all, rather than decoding them as they're asked for. I'd have
thought there's an obvious optimisation in providing procedures that
read a count of characters, or read until you hit a specified delimiter
character, and produce a string in one go (I've appended a rough sketch
of the sort of thing I mean); but that's just removing the procedure
call overhead of read-char.

>> This will run into issues with any hypothetical character encoding
>> that uses sub-octet character boundaries, but that can be dealt with
>> too, I think: if you do a read-octet when the character reader is in
>> mid-octet, then the spare bits are discarded and you get the next
>> octet.
>
> Character encodings can be weird, but not *that* weird. Bit-level
> compression, when present, is usually expanded/compressed by a layer
> between binary I/O and character I/O.

Yeah, I just like to keep my options open in standards, and testing how
they'd react to outlandish cases is often illuminating!
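And here's the sort of thing I mean by reading up to a delimiter in one
go; just a sketch over whatever read-char the port provides (a real
implementation would presumably run the decoding loop straight over the
octet buffer rather than calling out once per character, but the
interface is the point):

;; Reads characters until the delimiter or EOF and returns them as a
;; string; the delimiter itself is consumed but not included.  From the
;; caller's side this is one procedure call per string, however many
;; characters get decoded inside it.
(define (read-string-until read-char delimiter)
  (let loop ((chars '()))
    (let ((c (read-char)))
      (if (or (eof-object? c) (char=? c delimiter))
          (list->string (reverse chars))
          (loop (cons c chars))))))

For example, (read-string-until my-read-char #\newline) gives you a
read-line, and reading a count of characters is the same loop with a
counter in place of the delimiter test.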
ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/archives/author/alaric/