On May 2, 2004, at 7:38 AM, Andrew E Switala wrote:

> Jeff Clites <[EMAIL PROTECTED]> 2004-05-01 18:23:02 >>>
>> [Finishing this discussion on p6i, since it began here.]
>>
>> Good point. However, the more general usage seems to have largely
>> fallen out of use (to the extent to which I'd forgotten about it until
>> now). For instance, the Java String class lacks this generality.
>> Additionally, ObjC's NSString and (from what I can tell) Python and
>> Ruby conceive of strings as textual.

> As a VM for multiple languages, Parrot must be more general than any one of those languages, though, yes?

Yes, but not more general than any of them needs it to be.


In contrast, Parrot doesn't allow for the possibility that INTVALs could represent complex numbers (3 + 2i), for instance--if a language wanted those as its numerical primitives, it would have to layer them on top of INTVALs--but in fact no language wants that anyway. That is, Parrot shouldn't try to provide a generality which can be built on top of something simpler, or which no language actually needs.

But so far (in this thread), I've been talking about the general concept--it was leading to recommendations for Parrot, but is turning into just recommendations for Perl6, probably.

I do think, though, that in practical terms the languages that exist today fall into two categories: those which handle international text, and those which don't. Those which do, do so uniformly--they have one or maybe two internal representations for text, and they don't try to model it as "bytes in an interchange format, plus an indication of encoding, plus something else". Languages in the latter category almost always think of text as ASCII and model it as a buffer of bytes; they allow bytes with values > 127, but tend to leave them uninterpreted (under operations like case transformation, for instance).
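To make that contrast concrete (sketched in modern Python, purely because it happens to expose both flavors side by side--this isn't a suggestion about Parrot internals):

    data = "caf\xe9".encode("latin-1")  # b'caf\xe9': a buffer of bytes ('cafe' with e-acute)
    data.upper()                        # b'CAF\xe9': the byte > 127 is left uninterpreted
    text = "caf\xe9"                    # the same four characters, as text
    text.upper()                        # 'CAF\xc9': the accented character is uppercased too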

It's pretty straightforward to programmatically model a string in a manner which closely matches what the more text-sophisticated languages expect. For the other category, you can either (a) enforce simpler semantics at the PMC layer, or (b) gift them with full international text handling. I don't know for a fact that the Ruby community, for instance, wouldn't be thrilled to get full international text handling for free. I don't think we've asked.

If we're going to generalize an implementation, we'd best look around and find out the appropriate direction for generalization, rather than just guessing. I think Parrot's trying to overdo it.

>> The key point is that text and uninterpreted byte sequences are
>> semantically oceans apart. I'd say that as data types, byte sequences
>> are semantically much simpler than hashes (for instance), and
>> strings-as-text are much more complex. It makes little sense to
>> bitwise-not text, or to uppercase bytes.

If your "text" is taken from a size-two character set, it makes perfect sense to complement (bitwise-not) it. Bit strings and text strings are oceans apart like Alaska and Russia.

Yes and no.


In practical terms, if you bitwise-not some UTF-32 code units, you'll end up with values which don't represent any characters at all. (Every character in the Unicode repertoire has a value < 2^21, so after the complement every unit is >= 2^31.) UTF-8 and UTF-16 break in similar ways--complementing the bytes of typical text leaves you with ill-formed sequences--and probably Shift-JIS does too. You can only get away with this sort of thing if you are thinking in terms of encodings in which any byte sequence is interpretable, and that's true of only some encodings (ISO-8859-*, for instance, fall into this category).
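A quick concrete demonstration (sketched in Python only because it's compact; the arithmetic, not the language, is the point):

    import struct
    utf8 = "A\xf1o".encode("utf-8")              # 'Ano' with n-tilde -> b'A\xc3\xb1o'
    flipped = bytes(~b & 0xFF for b in utf8)     # b'\xbe<N\x90'
    # flipped.decode("utf-8") raises UnicodeDecodeError: it's ill-formed UTF-8.
    units = struct.unpack("<3I", "A\xf1o".encode("utf-32-le"))
    [hex(~u & 0xFFFFFFFF) for u in units]        # all >= 2**31: no characters at all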

Bitwise-not-ing is simply not a text operation. Another way to see that is to remember that the assignment of numbers to characters is arbitrary, so doing a mathematical transformation based on those numbers isn't meaningful. Certainly, you can precisely define and implement such a transformation, but it has no meaning as an operation _on_text_.
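That arbitrariness is easy to demonstrate: take the same text in two different encodings and complement each--the results bear no relation to each other, so there is no "complemented text" the operation could be said to produce. (Again a Python sketch, purely as an illustration:)

    t = "\xe9"                                      # the single character e-acute
    bytes(~b & 0xFF for b in t.encode("utf-8"))     # b'<V'   (two bytes)
    bytes(~b & 0xFF for b in t.encode("latin-1"))   # b'\x16' (one byte)
    # Same text in, two unrelated results out: the operation acts on a particular
    # encoded form, not on the text.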

If you take a top-down approach to designing a string API, you won't be tempted to think of anything as "size two"--all the confusion comes from working bottom-up. By analogy, look at objects. Objects represent a certain approach to factoring and organizing computer programs, with a focus on their behavior, and a hiding of their implementation details. People like to invent serialization formats for them, so that they can persist them or send them between processes--to pick an example, think of an XML-based serialization format. Now, nobody today is seriously tempted to think of objects as just blobs of XML plus an interpretation--to base their in-memory representation on XML, and create object APIs which reflect this. That would be thinking about objects the wrong way.

Now, imagine if history had gone differently--if XML had been invented before object-oriented programming. You'd have all of these documents sitting around holding structured data. When objects began to materialize as a concept, undoubtedly they'd have arisen as an approach to in-memory manipulation of XML. It would be difficult to get people thinking about objects top-down--to realize that the concept had nothing to do with the serialization format.

My claim is that this is what has happened with text. People are locked into a view based on the numerous interchange formats for text (encodings). But if you start top-down, you'd never be tempted to think that when I type an "A" (as I just did), what I _mean_ depends on the encoding that my email client ends up choosing to send this message. Quite the contrary--I don't know what encoding (or more generally, format) it's going to choose. All I care about is that it picks one which will allow my text to get to you without losing anything.
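To put that concretely (Python again, purely as an illustration): the same text has any number of byte representations, and none of them _is_ the text.

    s = "r\xe9sum\xe9"       # the text 'resume' with two e-acutes
    s.encode("utf-8")        # b'r\xc3\xa9sum\xc3\xa9'
    s.encode("latin-1")      # b'r\xe9sum\xe9'
    s.encode("utf-16-le")    # b'r\x00\xe9\x00s\x00u\x00m\x00\xe9\x00'
    # Three different byte sequences, one meaning; which one a mail client picks
    # is its business, as long as it can be decoded back to the same text.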

But certainly, people have a strong tendency to think bottom-up. I believe it's a historical accident, but it may be psychologically impossible to get some people to think about this area differently. Still, I'm completely convinced (having worked with text systems which take a top-down approach) that thinking top-down makes things much clearer and much simpler (and actually leads to fewer security-related bugs). In particular, if your text model is locked in encoding land, then you force individual programmers to know all of the details of various encodings in order to work with text. With a top-down approach, they just need to think in terms of text, and to realize that when reading in some bytes off of disk which are supposed to represent text, they need to know which format was used by the process which created the file (but they don't need to know the details of that format).
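In code, the top-down version amounts to naming the format exactly once, at the boundary (a Python sketch, with a made-up file name):

    # The only encoding knowledge lives at the I/O boundary.
    with open("report.txt", encoding="shift_jis") as f:   # hypothetical file
        text = f.read()                # from here on, it's just text
    print(len(text), text.upper())     # text operations need no encoding details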

>> The major problem with using "string" for the more general concept is
>> confusion. People do tend to get really confused here. If you define
>> "string of blahs" to mean "sequence of blahs" (to match the historical
>> usage), that's on its face reasonable. But people jump to the
>> conclusion that a string-as-bytes is re-interpretable as a
>> string-as-text (and vice-versa) via something like a cast--a
>> reinterpretation of the bytes of some in-memory representation.

> It is thus reinterpretable--via (de-)serialization. Take a "text" string, serialize it in memory as UTF-8, say, to get a bit string, and do ands, ors, and nots to your heart's content.

Yes, precisely--you can serialize a string into a bag of bytes, and do whatever you want with that. Similarly, you can serialize an object using some serialization scheme (Data::Dumper? XML? Perl's freeze/thaw? Python's pickle? Parrot's freeze?), and manipulate the bytes of that. But you wouldn't be tempted to want to bitwise-not the raw memory locations which implement an object. You'd never think to do _that_, and if someone suggested it you'd immediately think of two problems: (1) it's not semantically meaningful (the internal representation is supposed to be opaque, and changeable without changing the visible behavior), and (2) if you did that, you'd likely end up with something which no longer represented an object--you'd just get junk.
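The legitimate version of that is easy to write down explicitly (Python as the illustration; think of something like an XOR mask in a framing protocol): serialize, operate on the bytes as bytes, and only decode again once the binary manipulation has been undone.

    payload = "h\xe9llo".encode("utf-8")         # text -> bytes, explicitly
    masked = bytes(b ^ 0x5A for b in payload)    # a binary operation, on binary data
    # 'masked' is just bytes; nobody expects it to be text. Decoding only makes
    # sense after the binary step is reversed:
    assert bytes(b ^ 0x5A for b in masked).decode("utf-8") == "h\xe9llo"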


> If the in-memory representation is already UTF-8, the serialization is nothing more than changing the string's charset+encoding to "binary".

Yes, and in light of the above, I have 2 problems with that: (1) It's basing externally-visible behavior on an internal implementation detail, and (2) people seem to be forgetting that if you do that, you won't be able to "go back"--that is, bitwise-not-ing a "UTF-8 string" won't leave you with something interpretable as UTF-8. I don't think that people would expect to bitwise-not some text, and have the result be non-textual. (But then, I don't believe someone would ever seriously want to bitwise-not some _text_.)


> Compilers for languages like Perl 5, which treat strings as text or bits depending on the operation being performed, can insert the serialization/deserialization ops automatically as needed.

Yes, that's fine in terms of convenience, but it leads people to think of text _as_ bytes. That's going semantically in the wrong direction--and witness the confusion it brings (i.e., this whole thread). If you keep them clearly separate (text operations on text, binary operations on binary data, some operations on both), things are clearer.
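For what it's worth, the coercion such a compiler would insert amounts to something like the following (a hypothetical sketch of the idea, not actual Perl 5 or Parrot internals)--and it's exactly this silent text-to-bytes hop that muddies the picture:

    def _as_bytes(v):
        # What such a compiler quietly inserts before a bitwise op: serialize text.
        return v.encode("utf-8") if isinstance(v, str) else v

    def bitwise_and(a, b):
        a, b = _as_bytes(a), _as_bytes(b)
        return bytes(x & y for x, y in zip(a, b))

    bitwise_and("abc", b"\xdf\xdf\xdf")   # -> b'ABC'; the caller never sees the encode() happen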


JEff


