I'm sorry I can't respond to every comment here. A few general things.
Some comments have been dismissive of UTF-8 and UTF-16. Some have been dismissive of contiguous-buffer strings. This is surprising to me. As far as I know, *all* widely used, general-purpose string implementations are contiguous-buffer. And most (but not all) Unicode string implementations use UTF-16. Among languages and libraries that are very widely used, the majority is overwhelming: Java, Microsoft's CLR, Python, JavaScript, Qt, Xerces-C, and on and on. (The few counterexamples use UTF-8: glib, expat. And expat can be compiled to use UTF-16.)
This is just an argument from popularity, but I think discouraging this simple, proven internal representation probably wasn't the editors' intent, and it would be an unpleasant surprise for implementors with enough brains to value simplicity and interoperability. ;)
--- Shiro Kawai wrote:
> I think string-find is a bad example, because "simple" and "efficient" are opposed. Efficient means Boyer-Moore or some variant of it, and that's not simple.
Well--yes. Regardless, I think the example is appropriate. R5.92RS doesn't support writing *any* such algorithm efficiently and portably (forget simply). And:
> I assume the primary benefit of O(1) string-ref is that it is probably the simplest and the most portable way to point a position in a string. "Portable" here is that I can safely save it to file and read it by other implementation, or send it over the network. But for internal use, like implementing search operation, or passing its results to substring operation, it is an illusion that O(1) string-ref is enough to implement efficient algorithms. The efficient one differs greatly among implementations (e.g. using Boyer-Moore directly on utf-8 octet sequence), so it's better to have higher-level APIs.
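(For concreteness, here is a rough sketch of the kind of search under discussion: Boyer-Moore-Horspool, a simplified Boyer-Moore variant. It is written in Python, one of the contiguous-buffer implementations mentioned above, purely for illustration. The name horspool_find is made up for this example, and none of this is taken from any Scheme implementation or from the draft report.)

```python
def horspool_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0

    # Bad-character table: for each character of the pattern except the last,
    # the distance from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    pos = 0
    while pos <= n - m:
        # Compare right to left. Every text[...] below is a random access,
        # so the skipping only pays off if indexing is cheap.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Skip ahead based on the text character under the window's last slot;
        # characters absent from the table allow a full-length skip.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

The loop never scans the text sequentially; it jumps to text[pos + m - 1] and works backwards. That is the access pattern that makes the cost of string-ref matter for this family of algorithms.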
Higher-level APIs are a fine approach. The other solution is to standardize the implementation, so that the efficient algorithms don't differ. I want to push this seriously one last time: Unicode strings have been kicked around for a while now, and despite Will's link, real-world implementations do not vary much. I don't think it's premature to standardize. And:
> I agree that r6rs shouldn't be affected just because it can't be implemented easily by some specific implementing languages (otherwise we wouldn't have call/cc).
First of all, the words "just because" don't belong here. The Java thing is an afterthought.
But also--this is the second time someone has compared strings to core features of Scheme, like call/cc. I agree call/cc is too valuable to give up. But I don't see what we're talking about here that's so valuable. Avoiding a specific bug involving surrogate pairs? The freedom for implementors to choose whatever implementation they want (except, apparently, the one proven model that everyone else uses)?
There are interesting areas where Scheme *should* be different from other languages... and there are areas I wish you guys would find uninteresting :) and just borrow an established design from somewhere. In this case, there's only one established design to choose from...
-j
