Re: [r6rs-discuss] Proposed features for small Scheme, part 2 bis: I/O

Alaric Snell-Pym Tue, 22 Sep 2009 08:03:01 -0700

On 22 Sep 2009, at 3:09 pm, John Cowan wrote:

> Alaric Snell-Pym scripsit:
>
>> I'd say that a requirement for being usable for embedded programming
>> would mean it should mandate very little - not all embedded apps will
>> need binary ports. Obviously, a line must be drawn somewhere, but I
>> don't see the WG1 charter as requiring that the line be drawn above
>> binary I/O.
>
> It doesn't.  The draft charter only mandates IEEE Scheme.  But I'm
> arguing that binary ports are useful enough to be optional parts of
> Thing One.  (See http://tinyurl.com/feature-groups for details of
> what's optional in my proposals and what's not.)


An optional standard for binary ports is good. Optional, so it needn't
bloat environments that don't want it. Standard, so code that needs it
can be portable.

But we must be careful not to produce something that is, on the one
hand, too bloated or, on the other hand, not general enough, so that
you have to go off and build your own encoding/decoding layer on top
of binary ports whenever you want to mingle text. And it should not be
gratuitously incompatible with R5RS text ports.

This is an interesting challenge, and I think it's far too soon to be
getting picky about what the charter says, as opposed to deciding
what's best and then seeing if we can interpret the thankfully broadly
worded charter to mean that :-)

What are our actual requirements, here? Let me guess a few use cases:

1) An implementation that has single-byte characters, and wants to
provide text and binary I/O either because its user base requires it,
or just because it ought to be trivial to do so in such an
environment. Here a unified port object offering both read-u8 and
read-char etc. is fine, and u8-position = text-position.

2) An educational/proof-of-concept implementation that needs enough
I/O to implement a REPL, perhaps some library system internals, and
some toy examples, for which R5RS text ports would do just fine.

3) A Unicode implementation that supports a few simple encodings -
ISO-8859-* and UTF-8 and UTF-16, for example (notably, not that ISO
monstrosity with the complex state you mentioned). A character maps to
a varying number of bytes (even in UTF-16, because of surrogates), but
apart from the bytes read so far there's little encoder/decoder state
(endianness learnt from a BOM is all that comes to mind). In this
case, a unified binary+text port is OK, with a small amount of state
to select a codec and an optional endianness, and read-char
implemented as reading bytes until it has a whole codepoint to
return. text-position and u8-position are the same function, but not
all valid u8-positions are valid text-positions: some point into the
middle of a codepoint. That situation never arises if text-position
is only called after performing read-char (and higher-level
operations based upon it).

4) A "full" implementation that can read text in arbitrarily complex
encodings with arbitrarily complex state. Perhaps even ones that don't
byte-align, such as Huffman-encoded forms. To support this kind of
stuff, bloat is probably not such an issue any more. So here we need
to choose between some combination of:

   a) text-position returns an opaque object with loads of decoder
      state.
   b) Switching from read-char to read-u8 causes a translation from
      text parsing state to binary position according to some
      standard rule (eg, the next byte with no bits from the last-read
      character); prefetched bytes are flushed back, etc. Switching
      from read-u8 to read-char causes read buffer state to be
      restarted, but other encoder state (eg, all those code page
      mappings) is kept unless an explicit encoding-specific state
      flush procedure is applied to the port (as it's up to the
      application to decide what's a new string, and what's a
      continuation of the same string after some binary header).
   c) Just throwing our hands up in the air at such matters as
      mingling complex encodings with binary data, or advanced stuff
      such as UTF-8 that's chopped into fixed-size chunks and then
      framed with error-correction information (where the chopping
      might happen in mid-codepoint), and saying "write your own
      continuation-coroutine-based custom port that exposes the UTF-8
      stream and then wrap a layered text port over it".
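
As a small illustration of what the decoder state in option (a) amounts to, even for a simple codec, here is a sketch using Python's standard codecs module (chosen only because it exposes decoder state directly; nothing here is a proposed Scheme API). The saved state is what an opaque text-position object would have to carry alongside the byte offset.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# Feed only the first two bytes of the three-byte encoding of U+20AC.
partial = decoder.decode(b"\xe2\x82")

# The decoder is now holding undecoded bytes; a text position taken at
# this point would be meaningless without them.
state = decoder.getstate()

# A fresh decoder restored from the saved state can finish the character.
resumed = codecs.getincrementaldecoder("utf-8")()
resumed.setstate(state)
rest = resumed.decode(b"\xac")
```

For UTF-8 the state is just a couple of pending bytes; for the heavily stateful encodings of case 4 it would also include things like the currently selected code pages, which is where the "loads of decoder state" comes from.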

The above musings suggest to me that it'd be a shame to disallow
simple mingling of text and binary as per cases 1, 2 and 3 just because
of case 4, but I am leery of too hastily marking some cases as
'uncommon, and people in this case must suffer for the common good'.
Case 4, meanwhile, suggests that what we really want for that kind of
thing is a layered-ports system (where the fixed-size-buffers thing
would be a custom port layer function from binary port to binary port,
which can then have a function from binary port to text port applied).
In such a layered port system, the semantics of access to the
underlying port depend on the layer function, which should document
them.
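
The layering idea can be sketched with Python's io module, again only as an illustration. The frame format (a fixed number of payload bytes followed by a fixed-size trailer, with the final frame allowed to be short) and the names DeframingLayer, frame and open_framed_text are all invented for this example, and the trailer is simply discarded rather than checked. It also assumes the inner port's read(n) returns fewer than n bytes only at end of input.

```python
import io


class DeframingLayer(io.RawIOBase):
    """Binary port -> binary port: strips the trailer from each frame.

    Assumes trailer_size >= 1 and a made-up frame layout of up to
    payload_size payload bytes followed by trailer_size check bytes.
    """

    def __init__(self, inner, payload_size, trailer_size):
        super().__init__()
        self._inner = inner
        self._frame_size = payload_size + trailer_size
        self._trailer_size = trailer_size
        self._pending = b""

    def readable(self):
        return True

    def readinto(self, out):
        if not self._pending:
            raw_frame = self._inner.read(self._frame_size)
            # Drop the trailing check bytes; this sketch does not verify them.
            self._pending = raw_frame[:-self._trailer_size]
        n = min(len(out), len(self._pending))
        out[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n  # 0 signals end of input


def frame(data, payload_size, trailer=b"#"):
    """Hypothetical sender side: chop bytes into chunks, add a trailer to each."""
    return b"".join(data[i:i + payload_size] + trailer
                    for i in range(0, len(data), payload_size))


def open_framed_text(inner, payload_size, trailer_size, encoding):
    """Binary port -> text port, applied on top of the deframing layer."""
    binary = io.BufferedReader(
        DeframingLayer(inner, payload_size, trailer_size))
    return io.TextIOWrapper(binary, encoding=encoding)


text = "na\u00efve \u20acuro"
wire = frame(text.encode("utf-8"), 4)   # U+20AC is split across two frames
port = open_framed_text(io.BytesIO(wire), 4, 1, "utf-8")
decoded = port.read()
```

The text layer never sees the framing, so a codepoint chopped in half by a frame boundary decodes normally. What happens if you read from the inner port behind the layer's back is entirely up to DeframingLayer, which is the point about the layer function documenting its own semantics.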

So, after this long ramble, what can I conclude?

I think it's practical to have a system that lets you mingle binary
and text operations *as long as you're using a supported encoding*.
Eg, some encodings may make it impossible to perform binary operations
on the port once a text operation has occurred.

This is fine for encodings the developer has to select manually; it
would be unpleasant for an encoding they are handed as a "platform
default".

So let's say that ports that do not specify an encoding, or that
explicitly ask the implementation to guess one, may be used only as
text or only as binary. If you want to mingle text and binary
operations, you need to specify an encoding manually (this makes sense
from a practical perspective anyway; such file formats will have some
means of specifying encodings, or just mandate one), and live with the
special rules of that encoding. If you want to do complex things like
splitting encoded text every N bytes, then you either use custom ports
(providing the binary behaviour, and inheriting the implementation's
text decoding behaviour) or use a blob system with blob->string
functionality, and create blobs by stitching your fixed-size frames'
contents together.
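
The rule above could be enforced along these lines. This is a sketch in Python under several assumptions: the names Port, PortModeError and PLATFORM_DEFAULT are invented, "utf-8" merely stands in for whatever the platform default would be, and read_char creates a fresh decoder per character, which only suits encodings that keep no state between characters.

```python
import codecs
import io

PLATFORM_DEFAULT = "utf-8"  # stand-in for an implementation-chosen default


class PortModeError(Exception):
    """Raised when text and binary reads are mixed without an explicit encoding."""


class Port:
    """Hypothetical port: mingling is allowed only with an explicit encoding."""

    def __init__(self, data, encoding=None):
        self._buf = io.BytesIO(data)
        self._explicit = encoding is not None
        self._encoding = encoding or PLATFORM_DEFAULT
        self._mode = None  # fixed by the first operation on a default port

    def _claim(self, mode):
        if self._explicit:
            return  # caller chose the encoding, so it lives with its rules
        if self._mode is None:
            self._mode = mode
        elif self._mode != mode:
            raise PortModeError(
                "port opened without an explicit encoding is %s-only"
                % self._mode)

    def read_u8(self):
        self._claim("binary")
        b = self._buf.read(1)
        return b[0] if b else None  # None stands in for the eof object

    def read_char(self):
        self._claim("text")
        decoder = codecs.getincrementaldecoder(self._encoding)()
        while True:
            b = self._buf.read(1)
            if not b:
                return None
            ch = decoder.decode(b)
            if ch:
                return ch


# Explicit encoding: a binary header byte followed by text is fine.
mixed = Port(b"\x02hi", encoding="utf-8")
header = mixed.read_u8()
letter = mixed.read_char()

# Default encoding: the first operation decides the port's mode.
plain = Port(b"hi")
first = plain.read_char()
```

After the last line, plain.read_u8() raises PortModeError, whereas the same sequence on mixed is accepted because its encoding was chosen explicitly.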

ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/archives/author/alaric/




