Some rule-of-thumb "axioms" for implementing strings: For I/O and good interaction with VM, string data in memory needs a high locality (bursts of characters contiguous in memory) but: the bursts don't have to be larger than a VM page to obtain the full benefits.
Substring extraction and concatenation should be O(1) operations. Then it would follow that mutation and reference are O(1). You can't do that perfectly, though. In a recursive algorithm, optimizing the leaf steps (base case steps) is critical. The contiguous bursts of characters should all use fixed-width encodings. A key trick is that not all bursts of characters have to use the same fixed-width encoding. The combinatorics are a little hairy but with only three fixed-width encodings (for 8-bit, 16-bit, and 32-bit code values) you can satisfy all of the above, swimmingly. (Notable objection: Stallman tells me thinks the combinatorics here are *too* hairy to be worth the effort.) The combinatorics get a little harrier but with just one additional fixed-width encoding you can handle a character set of unbounded size (such as Unicode when taking composite characters as your primitive notions of characters). Dillenger keeps hinting that that fourth encoding might be all anyone ever needs. I don't buy it but I can't prove I'm right. Boehm is probably the author of the cites everyone should be including. He gave us enough rope to hang ourselves. -t _______________________________________________ r6rs-discuss mailing list [email protected] http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
