On 29 Sep 2009, at 3:44 pm, John Cowan wrote:

> Alaric Snell-Pym scripsit:
>
>> The behaviour of read-char in terms of read-octet will need careful
>> specifying for funny encodings, mind; some encodings have control
>> characters that shift modes and the like, but aren't part of any
>> character, so the byte on which a character boundary sits is a bit
>> vague. I guess the best approach to that is to say that read-char
>> reads 0 or more non-character octets, if present, then reads enough
>> octets to decode one character, and anything it's buffered, it shares
>> the buffer with read-octet.
>
> That *sounds* good, but it's horribly slow in practice, and
> interpreters (without JITs) will suffer especially badly from it.

Why is that, out of interest?

>  Character
> encoding/decoding needs to be done in big buffers for the same reason
> that actual I/O does.

Why? Just because of the procedure call overhead of read-char? What is
the operation of reading a character doing beyond feeding some bytes
into a state machine until it reaches a state where it has a
whole character to return?
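
A minimal sketch of that state-machine view, in Python rather than Scheme.
Everything here is an assumption made for illustration: the Port class and
its method names are hypothetical, the encoding is fixed as UTF-8, and
overlong or surrogate sequences are not rejected. A stateful encoding with
shift sequences would need an extra step that skips non-character octets
before decoding.

```python
import io


class Port:
    """Hypothetical input port: read_octet and read_char share one byte buffer."""

    def __init__(self, stream, bufsize=4096):
        self.stream = stream      # any binary file-like object
        self.bufsize = bufsize
        self.buf = b""            # the single buffer shared by both readers
        self.pos = 0              # index of the next unread octet in buf

    def read_octet(self):
        """Return the next octet as an int, or None at end of input."""
        if self.pos >= len(self.buf):
            self.buf = self.stream.read(self.bufsize)
            self.pos = 0
            if not self.buf:
                return None
        octet = self.buf[self.pos]
        self.pos += 1
        return octet

    def read_char(self):
        """Decode one UTF-8 character by pulling octets one at a time."""
        first = self.read_octet()
        if first is None:
            return None
        # The lead octet says how many continuation octets follow.
        if first < 0x80:
            need, code = 0, first
        elif first >> 5 == 0b110:
            need, code = 1, first & 0x1F
        elif first >> 4 == 0b1110:
            need, code = 2, first & 0x0F
        elif first >> 3 == 0b11110:
            need, code = 3, first & 0x07
        else:
            raise ValueError("invalid UTF-8 lead octet")
        for _ in range(need):
            octet = self.read_octet()
            if octet is None or octet >> 6 != 0b10:
                raise ValueError("invalid or truncated UTF-8 sequence")
            code = (code << 6) | (octet & 0x3F)
        return chr(code)
```

Because both procedures advance the same position, a read-octet issued
right after a read-char picks up at the octet following that character.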

>  Making those buffers the same buffer is horribly messy: if the
> internal character format is UTF-16 and the file encoding is ASCII,
> you need twice as big a decoding buffer as the I/O buffer to get any
> decent efficiency at all.

Maybe, but I'd question why you'd need a buffer of decoded characters,
rather than decoding them as they're asked for.

I'd have thought the obvious optimisation is to provide procedures
that read a given count of characters, or read up to a specified
delimiter character, and produce a string in one go; but that only
removes the per-character procedure call overhead of read-char.
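
A rough sketch of what such bulk procedures might look like, again in
Python and under stated assumptions: read_chars and read_until are
hypothetical names, the encoding is UTF-8, and the byte buffer is assumed
to hold only complete, well-formed characters. Each returns the decoded
string together with the new byte position, so an octet-level reader
sharing the buffer can carry on from there.

```python
def read_chars(buf, pos, count):
    """Decode up to `count` UTF-8 characters from buf, starting at byte pos.

    Returns (string, new_pos).
    """
    end = pos
    taken = 0
    while taken < count and end < len(buf):
        end += 1  # step over the lead octet
        # Step over any continuation octets (bit pattern 10xxxxxx).
        while end < len(buf) and buf[end] >> 6 == 0b10:
            end += 1
        taken += 1
    # One decode call for the whole run instead of one per character.
    return buf[pos:end].decode("utf-8"), end


def read_until(buf, pos, delimiter):
    """Decode characters up to, but not including, `delimiter`.

    Returns (string, new_pos), where new_pos is just past the delimiter,
    or at the end of the buffer if the delimiter never appears.
    """
    # UTF-8 is self-synchronising, so searching for the encoded delimiter
    # at the byte level cannot match in the middle of another character.
    needle = delimiter.encode("utf-8")
    hit = buf.find(needle, pos)
    if hit < 0:
        return buf[pos:].decode("utf-8"), len(buf)
    return buf[pos:hit].decode("utf-8"), hit + len(needle)
```

Note that the byte-level delimiter search relies on a property of UTF-8;
other encodings would need to decode as they scan.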

>
>> This will run into issues with any hypothetical character encoding
>> that uses sub-octet character boundaries, but that can be dealt with
>> too, I think: if you do a read-octet when the character reader is in
>> mid-octet, then the spare bits are discarded and you get the next
>> octet.
>
> Character encodings can be weird, but not *that* weird.  Bit-level
> compression, when present, is usually expanded/compressed by a layer
> between binary I/O and character I/O.
>

Yeah, I just like to keep my options open in standards, and testing
how they'd react with outlandish cases is often illuminating!
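
For the outlandish sub-octet case quoted above, the proposed rule (a
read-octet issued mid-octet discards the spare bits and returns the next
whole octet) could be sketched as follows. This is Python with
hypothetical names (BitPort, read_bits), and it assumes bits are consumed
most significant first, which the thread does not specify.

```python
class BitPort:
    """Hypothetical port over a bytes object that can be read bitwise."""

    def __init__(self, data):
        self.data = data
        self.byte_pos = 0  # index of the current octet
        self.bit_pos = 0   # bits already consumed from it, 0..7

    def read_bits(self, n):
        """Return the next n bits as an int, or None if input runs out."""
        value = 0
        for _ in range(n):
            if self.byte_pos >= len(self.data):
                return None
            bit = (self.data[self.byte_pos] >> (7 - self.bit_pos)) & 1
            value = (value << 1) | bit
            self.bit_pos += 1
            if self.bit_pos == 8:
                self.bit_pos = 0
                self.byte_pos += 1
        return value

    def read_octet(self):
        """Return the next whole octet, discarding any spare bits first."""
        if self.bit_pos != 0:
            # Mid-octet: drop the rest of the partially read octet.
            self.bit_pos = 0
            self.byte_pos += 1
        if self.byte_pos >= len(self.data):
            return None
        octet = self.data[self.byte_pos]
        self.byte_pos += 1
        return octet
```

Reading a 5-bit character from 0b10110100 and then calling read_octet
throws away the trailing 100 and returns the following octet.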

ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/archives/author/alaric/




_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
