On Sat, 2009-09-19 at 01:24 -0400, John Cowan wrote:
[....]
> 1) Immutable strings are more purely functional, and allow many
> optimizations, such as being transparently and freely shareable between
> procedures and between threads without concern for uncontrolled mutation.

This is true as well for immutable cons pairs
and immutable vectors.

It is also true that these immutable representations
prevent many optimizations of their own, so there is a
bit of a toss-up between the two.

It is the case that a tiny implementation of small
Scheme - one small in footprint and simple in implementation -
is unlikely to have much in the way of sophisticated
optimizations.

If we're going to talk about optimizations for a
small dialect of Scheme, I think we have to talk about
an implementation class that is somewhere in the middle:
one that might not (except through libraries) offer a
"big Scheme" environment, but that aims to provide a rich
environment with a sophisticated implementation.

Those considerations lead me to the conclusion we
should really have both mutable and immutable strings.
(And mutable and immutable variants on several other
types, as well.)


> 2) Algorithms where you want to modify strings in the middle are rare,

Claims like this always make the hairs on the back of my
neck stand up.  There are two problems with them.
First, there isn't a really good empirical way to establish
such claims.  Second, rarity per se is not the most
important consideration.

Surely if we scan the source for PLT or various Scheme
compilers, SLIB, and similar systems, we'll find that
string mutation is less common than immutable uses of
strings.  Well, I haven't actually *checked*, but I wouldn't
be surprised.  Still, that tells us little about the code
we can't see and won't hear about, and it tells us even
less about what kinds of code people will want to write
in three years.

Rarity is not an especially compelling argument.  More
important is *importance*.  The question is not so much
"how often do I need to reach for string mutation?" as
"how painful is it when I want string mutation and
can't have it?".

Complex numbers are an example of a feature whose use
is (probably) relatively rare.  Yet their absence when
you want them can be quite painful.



> 3) If strings are immutable, it's possible to have both fast O(1)
> access to individual characters or substrings, and fairly space-efficient
> representation of full Unicode strings, by using different representations
> for strings drawn from different character repertoires. [...]

Yes, but this is also true of mutable strings.  You
argue otherwise thusly:

> Unfortunately, mutating even a single character in such a representation
> may require the entire string to be copied, which means that it also
> requires indirection through a separate header that can be redirected
> to point to the newly allocated code unit sequence.  Immutable strings
> can just *be* their sequences, with a few extra bits indicating the
> size of the code units, although this design does prevent easy sharing
> of substrings.


In the absence of mutation, when people want to implement
"a string-like thing whose contents and length can
change over time" they will: (a) use an indirect header;
(b) copy strings or construct new "ropes" on *every*
mutation, not just on those that change the maximum
code-point size in the string.

Specifying only immutable strings will not save programs the
kind of expenses you are talking about.  It will only give 
programs fewer ways in which to deal with them.
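To make that concrete, here is a minimal sketch (portable Scheme; `make-mstring` and friends are names invented purely for illustration) of the indirect-header-plus-copying emulation that programs end up writing when only immutable strings are available:

```scheme
;; A "mutable string" emulated on top of immutable strings:
;; a one-slot box holding the current immutable string.
;; Every update copies the entire contents -- O(n) per "set!".
(define (make-mstring s) (vector s))            ; the indirect header
(define (mstring-ref m i) (string-ref (vector-ref m 0) i))
(define (mstring-set! m i c)
  (let* ((old (vector-ref m 0))
         (new (string-append (substring old 0 i)
                             (string c)
                             (substring old (+ i 1) (string-length old)))))
    (vector-set! m 0 new)))                     ; redirect the header

(define m (make-mstring "hello"))
(mstring-set! m 0 #\H)
(vector-ref m 0)   ; => "Hello"
```

Note that every mstring-set! here pays the full copying cost the proposal attributes to mutable variable-width strings -- except that it is now paid on *every* update, not only those that change the code-unit width.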



> 4) As currently designed, strings are functionally just vectors of
> characters.  In an 8-bit world, using the traditional representation
> of strings carries a 4:1 storage advantage, making it worthwhile
> to distinguish them clearly from general vectors.  But 21-bit Unicode
> characters are a much better fit, if represented as immediate (unboxed)
> values, for general vectors using 32-bit pointers.  Granted that not all
> small Scheme systems will provide full Unicode support, general vectors
> start to look much less expensive than they once were.  In short: if
> you want something that behaves like a vector of characters, simply use
> a general vector that contains characters.


This bit really confuses me, because there is nothing
in your arguments for the immutability of strings that
would not appear to apply equally to vectors.  This is
especially the case if we contemplate "homogeneous vectors"
over types that (like codepoints) may have a variable-width
representation.

Moreover, using a heterogeneous vector representation for
a mutable string has the distinct disadvantage that, in any
natural representation, each character in the vector must
carry a type tag, and no string algorithm can assume that
the string contains only characters.
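A short illustration of the tagging point -- nothing below is hypothetical; this is just standard Scheme behaviour:

```scheme
;; A string guarantees that every element is a character:
(string-ref "abc" 1)            ; => #\b, always a character

;; A general vector used as a "string" offers no such guarantee;
;; each slot carries its own type tag, and algorithms must check:
(define v (vector #\a #\b 42))  ; a perfectly legal vector
(char? (vector-ref v 2))        ; => #f -- a check that real
                                ;    string algorithms never need
```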



> 5) Making strings immutable also permits a design in which all strings
> are Unicode-normalized.  Though this has its own costs (for example,
> appending two strings may create a new string whose length is different
> from the lengths of the two source strings), it would be effectively
> impossible where arbitrary mutation is allowed.

It would not be "effectively impossible" if programs
that wanted constant renormalization used a different set
of mutation functions.
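R6RS already provides explicit normalization procedures (string-normalize-nfc and company, in (rnrs unicode)), so such a "different set of functions" is a thin layer over what exists.  A sketch, with the wrapper name invented for illustration:

```scheme
(import (rnrs base) (rnrs unicode))

;; Opt-in normalization: code that wants every concatenation to
;; yield an NFC-normalized string uses this in place of
;; string-append; code assembling raw network chunks keeps using
;; string-append and gets a codepoint-faithful result.
(define (string-append/nfc . strings)
  (string-normalize-nfc (apply string-append strings)))
```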

An implementation that implicitly, constantly,
renormalizes all strings -- even immutable strings --
would be a quite specialized critter, unsuitable for
many purposes.  For example, consider a program that
reads input from a network connection, yielding a
sequence of (sure, immutable) chunks which it appends
to assemble a complete message.  Implicit normalization
here might be handy in some cases, but in others it would
simply be an annoyance that yields an unfaithful
representation of the message that was transmitted.
Verifying a checksum, for example, would become
problematic.

Therefore, I don't think that strong support for a design
in which all strings are Unicode-normalized is a particularly
important feature.



> As a consequence of removing string-set!, string-fill! (not in IEEE
> Scheme) becomes impossible and string-copy less useful.  I do not propose
> to remove string-copy, however, because it can eliminate space leaks
> that are caused by taking a small shared substring of a large existing
> string: when the larger string should be GC'ed, it is retained as a
> whole because of the shared substring.  Using string-copy judiciously
> can prevent such leaks.

That bit is particularly alarming.

Surely there will be no wise way to write code that
makes such a use of string-copy in a meaningfully
portable way: whether substrings share storage with
their source string at all is implementation-dependent.
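For reference, the idiom being described looks like this; whether it actually saves anything depends entirely on whether the implementation's substring shares storage with its argument, which no report guarantees either way:

```scheme
;; Suppose we keep only a small prefix of a large string.
(define big (make-string 1000000 #\x))
;; In an implementation where substring shares storage, `small`
;; would pin the whole of `big` in memory; string-copy forces a
;; fresh, independent allocation, so `big` can be collected once
;; it is no longer referenced.
(define small (string-copy (substring big 0 10)))
(string-length small)   ; => 10
```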

------------------------------------

I think there are two ways in which immutable strings
are an interesting idea, worth pursuing:

One is as a guide to implementation where the goal 
is to identify a "very small" set of primitives 
out of which small Scheme can be constructed using
libraries.   In that context, a system with only 
immutable strings natively is a plausible and 
interesting direction (but not the only one).

The other thing worth pursuing is the idea of
standardizing a way to create and manipulate
immutable strings using a different set of
primitive operations than the general-purpose
string operators (to assist in static recognition
of when they are being used).
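By way of illustration only -- every name below is invented, and the definitions merely delegate to the ordinary string operations -- such a parallel vocabulary might look like:

```scheme
;; Hypothetical parallel operations for immutable strings.  A
;; compiler that sees a value reached only through istring-ref
;; and istring-length knows statically that it is never mutated.
(define (istring . chars) (list->string chars))  ; stand-in constructor
(define istring-ref string-ref)
(define istring-length string-length)
;; Crucially, no istring-set! exists, by construction.
```

The point is the distinct, statically recognizable names, not the stand-in implementation.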

-t



_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
