On 22 Sep 2009, at 3:09 pm, John Cowan wrote:

> Alaric Snell-Pym scripsit:
>
>> I'd say that a requirement for being usable for embedded programming
>> would mean it should mandate very little - not all embedded apps will
>> need binary ports. Obviously, a line must be drawn somewhere, but I
>> don't see the WG1 charter as requiring that the line be drawn above
>> binary I/O.
>
> It doesn't. The draft charter only mandates IEEE Scheme. But I'm
> arguing that binary ports are useful enough to be optional parts of
> Thing One. (See http://tinyurl.com/feature-groups for details of
> what's optional in my proposals and what's not.)

An optional standard for binary ports is good. Optional, so it needn't
bloat environments that don't want it. Standard, so code that needs it
can be portable.

But we must be careful not to produce something that is (on the one
hand) too bloated, or (on the other hand) not general enough, so that
you have to go off and build your own encoding/decoding layer on top of
binary ports whenever you want to mingle text with binary data. And all
of this while not being gratuitously incompatible with R5RS text ports.

This is an interesting challenge, and I think it's far too soon to be
getting picky about what the charter says, as opposed to deciding
what's best and then seeing if we can interpret the thankfully
widely-worded charter to mean that :-)

What are our actual requirements, here? Let me guess a few use cases:

1) An implementation that has single-byte characters, and wants to
   provide text and binary I/O either because it's a requirement of
   the user base, or just because it ought to be trivial to do so in
   such an environment. For this, a unified port object with both
   read-u8 and read-char etc. is fine, and u8-position =
   text-position.

2) An educational/proof-of-concept implementation that needs enough
   I/O to implement a REPL, perhaps some library system internals, and
   some toy examples, for which R5RS text ports would do just fine.

3) A Unicode implementation that supports a few simple encodings -
   ISO-8859-* and UTF-8 and UTF-16, for example (notably, not that ISO
   monstrosity with the complex state you mentioned). A character maps
   to a varying number of bytes (even in UTF-16, with the surrogates),
   but apart from reading those bytes, there's little encoder/decoder
   state (endianness learnt from a BOM is all that comes to mind). In
   this case a unified binary+text port is OK, with a small amount of
   state to select a codec and an optional endianness, and read-char
   implemented as reading bytes until it's got a whole codepoint to
   return.

   text-position = u8-position as functions, but not all valid
   u8-positions are valid text-positions: they might point into the
   middle of a codepoint, a situation that would never arise if
   text-position is only called after performing read-char (and
   higher-level operations based upon it).

4) A "full" implementation that can read text in arbitrarily complex
   encodings with arbitrarily complex state, perhaps even ones that
   don't byte-align, such as Huffman-encoded forms. To support this
   kind of stuff, bloat is probably not such an issue any more. So
   here we need to choose between some combination of:

   a) text-position returns an opaque object with loads of decoder
      state.

   b) Switching from read-char to read-u8 causes a translation from
      text parsing state to binary position to occur according to some
      standard rule (eg, the next byte with no bits from the last-read
      character), prefetched bytes are flushed back, etc. Switching
      from read-u8 to read-char causes read buffer state to be
      restarted, but other encoder state (eg, all those code page
      mappings) is kept unless an explicit encoding-specific state
      flush procedure is applied to the port (as it's a matter for the
      application as to what's a new string, and what's a continuation
      of the same string after some binary header).

   c) Just throwing our hands up in the air at such matters as
      mingling complex encodings with binary data, or advanced stuff
      such as UTF-8 that's chopped into fixed-size chunks and then
      framed with error-correction information (where the chopping
      might happen in mid-codepoint), and saying "write your own
      continuation-coroutine-based custom port that exposes the UTF-8
      stream and then wrap a layered text port over it".

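To make case 3 concrete, here is a rough sketch of a unified
binary+text port. It is written in Python rather than Scheme purely so
it can be run as-is; the class name MixedPort and its method names are
made up for illustration (they mirror read-u8, read-char, u8-position
and text-position, and are not taken from any existing proposal), and
it assumes UTF-8 is the only codec.

```python
import io


class MixedPort:
    """Hypothetical unified binary+text port over a byte stream (UTF-8 only)."""

    def __init__(self, data: bytes):
        self._stream = io.BytesIO(data)

    def u8_position(self) -> int:
        # Byte offset of the next unread byte.
        return self._stream.tell()

    # In this model text-position is the very same function as u8-position.
    text_position = u8_position

    def read_u8(self):
        b = self._stream.read(1)
        return b[0] if b else None  # None stands in for the eof-object

    def read_char(self):
        first = self.read_u8()
        if first is None:
            return None
        # The lead byte says how many continuation bytes complete the codepoint.
        if first < 0x80:
            extra = 0
        elif first >> 5 == 0b110:
            extra = 1
        elif first >> 4 == 0b1110:
            extra = 2
        elif first >> 3 == 0b11110:
            extra = 3
        else:
            # A continuation byte here means the port sits mid-codepoint.
            raise ValueError("not at a codepoint boundary")
        rest = self._stream.read(extra)
        if len(rest) != extra:
            raise ValueError("truncated codepoint")
        return (bytes([first]) + rest).decode("utf-8")


# One binary header byte, two characters of text, one trailing binary byte.
port = MixedPort(b"\x02" + "h\u00e9".encode("utf-8") + b"\xff")
header = port.read_u8()
chars = [port.read_char() for _ in range(header)]
trailer = port.read_u8()
```

As long as text-position is only consulted after read-char, it always
lands on a codepoint boundary. Calling read_u8 part-way through a
multi-byte character leaves the port at a u8-position that is not a
valid text-position, and in this sketch the next read_char raises an
error rather than guessing.
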
The above musings suggest to me that it'd be a shame to disallow
simple mingling of text and binary as per cases 1, 2 and 3 just
because of case 4, but I am leery of too hastily marking some cases as
'uncommon, and people in this case must suffer for the common good'.
Case 4, meanwhile, suggests that what we really want in order to
support that kind of thing is a layered-ports system (where the
fixed-size-buffers thing would be a custom port layer function from
binary port to binary port, which can then have a function from binary
port to text port applied). In such a layered port system, the
semantics of access to the underlying port depend on the layer
function, which should document them.

So, after this long ramble, what can I conclude?

I think it's practical to have a system that lets you mingle binary
and text operations *as long as you're using a supported encoding*.
Eg, some encodings may make it impossible to perform binary operations
on the port once a text operation has occurred. This is fine for
encodings the developer has to select manually; it would be unpleasant
for an encoding they are handed as a "platform default".

So let's say that ports that do not specify an encoding, or that
explicitly ask the implementation to guess one for them, may only be
used as text or as binary, not both. If you want to mingle text and
binary operations, you need to manually specify an encoding (this
makes sense from a practical perspective anyway; such file formats
will have some means of specifying encodings, or just mandate one),
and live with the special rules of that encoding.

If you want to do complex things like splitting encoded text every N
bytes and the like, then you either use custom ports (providing the
binary behaviour, and getting the implementation's text decoding
behaviour inherited) or use a blob system with blob->string
functionality, and create blobs by stitching your fixed-size frames'
contents together.

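The fixed-size-frames case can be sketched roughly as follows, again
in Python only so that it runs. Everything here is an assumption made
for illustration: the frame layout (one length byte, up to
FRAME_PAYLOAD payload bytes, one additive checksum byte) and the names
frame, deframe and text_layer are invented, and the sketch stitches
the payloads into a blob eagerly, where a real layered port would
presumably work lazily.

```python
import io

FRAME_PAYLOAD = 3  # hypothetical payload size; small so codepoints get split


def frame(payload: bytes) -> bytes:
    """Chop payload into chunks, each framed as length + chunk + checksum."""
    out = bytearray()
    for i in range(0, len(payload), FRAME_PAYLOAD):
        chunk = payload[i:i + FRAME_PAYLOAD]
        out.append(len(chunk))
        out += chunk
        out.append(sum(chunk) % 256)  # toy checksum, not real error correction
    return bytes(out)


def deframe(port: io.BufferedIOBase) -> io.BytesIO:
    """Layer function, binary port -> binary port: expose only the payload."""
    blob = bytearray()
    while True:
        header = port.read(1)
        if not header:
            break
        size = header[0]
        chunk = port.read(size)
        check = port.read(1)
        if len(chunk) != size or len(check) != 1 or check[0] != sum(chunk) % 256:
            raise ValueError("corrupt frame")
        blob += chunk  # stitch the frame contents together
    return io.BytesIO(bytes(blob))


def text_layer(port: io.BufferedIOBase, encoding: str = "utf-8") -> io.TextIOWrapper:
    """Layer function, binary port -> text port, with an explicit encoding."""
    return io.TextIOWrapper(port, encoding=encoding)


wire = frame("na\u00efve caf\u00e9".encode("utf-8"))
text = text_layer(deframe(io.BytesIO(wire))).read()
```

The first frame ends in the middle of a two-byte character, so
decoding frames one at a time would fail; decoding only works once the
binary layer has reassembled the stream, which is why the text layer
sits on top of it and never touches the framing.
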
ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/archives/author/alaric/
_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
