[Finishing this discussion on p6i, since it began here.]

On Apr 28, 2004, at 5:05 PM, Larry Wall wrote:

On Wed, Apr 28, 2004 at 03:30:07PM -0700, Jeff Clites wrote:
: Outside. Conceptually, JPEG isn't a string any more than an XML
: document is an MP3.

I'm not vehemently opposed to redefining the meaning of "string"
this way, but I would like to point out that the term used to have
a more general meaning.  Witness terms like "bit string".

Good point. However, the more general sense seems to have largely fallen out of use (to the extent that I'd forgotten about it until now). For instance, the Java String class lacks this generality. Additionally, ObjC's NSString and (from what I can tell) Python and Ruby conceive of strings as textual.


[And of course, it would be permissible in terms of English usage to say that a bit string isn't a string, much as a fire house isn't a house, a suspected criminal isn't necessarily a criminal, and melted ice isn't ice.]

: Some languages make this very clear by providing a separate data type
: to hold a "blob of bytes". Java uses a byte[] for this (an array of
: bytes), rather than a String. And Objective-C (via the Foundation
: framework) has an NSData class for this (whereas strings are
: represented via NSString).

Another approach is to say that (in general) strings are sequences
of abstract integers, and byte strings (and their ilk) impose size
constraints, while text strings impose various semantic constraints.
This is more in line with the historical usage of "string".

Yes, though I think that this diverges from current usage (in general programming contexts), and more importantly promotes the confusion that "text" is inherently byte-based (or even, semantically, number-based). The parenthesized point there is that representing text as a sequence of numbers is an implementation detail--it's not inherent in the notion of text. The semantics of text don't imply that it's a semantic constraint layered on top of a sequence of numbers. In the vein of the Perl philosophy of making different things look different, I think it's important to linguistically distinguish between the two. Many programming languages do, and users of those languages suffer less confusion in this area.
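
To make that concrete, a rough Perl 5 sketch using the standard Encode module (the byte values shown are just the encodings named):

    use Encode qw(encode);

    my $text  = "caf\x{E9}";    # one piece of text: c, a, f, e-acute
    my @utf8  = unpack "C*", encode("UTF-8",    $text);  # 99 97 102 195 169
    my @utf16 = unpack "C*", encode("UTF-16BE", $text);  # 0 99 0 97 0 102 0 233
    # Same text, two different sequences of numbers--the numbers come from
    # the encoding you happened to pick, not from the text itself.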


The key point is that text and uninterpreted byte sequences are semantically oceans apart. I'd say that as data types, byte sequences are semantically much simpler than hashes (for instance), and strings-as-text are much more complex. It makes little sense to bitwise-not text, or to uppercase bytes.
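
For instance, roughly sketched in Perl 5 (the uc() result assumes a Unicode-aware perl, 5.8 or later; output-layer details are glossed over):

    use Encode qw(encode decode);

    my $bytes = encode("UTF-8", "caf\x{E9}");    # 0x63 0x61 0x66 0xC3 0xA9
    (my $shouted_bytes = $bytes) =~ tr/a-z/A-Z/; # only the ASCII part changes
    print decode("UTF-8", $shouted_bytes), "\n"; # "CAFé" -- the e-acute was missed
    print uc decode("UTF-8", $bytes), "\n";      # "CAFÉ" -- uppercasing *text* works
    # (And ~, bitwise negation, is only meaningful at the byte level.)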

: (And it implies that you can uppercase a JPEG, for instance).
: Only some encodings let you get away with this--for example, not every
: byte sequence is valid UTF-8, so an arbitrary byte blob likely wouldn't
: decode if you tried to pretend that it was the UTF-8-encoded version of
: something. The major practical downside of doing something like this is
: that it leads to confusion, and propagates the viewpoint that a string
: is just a blob of bytes. And the conceptual downside is that if a
: string is fundamentally intended to represent textual data, then it
: doesn't make much sense to use it to represent something non-textual.


I think of a string as a fundamental data type that can be *used* to
represent text when properly typed.  But strings are more fundamental
than text--you can have a string of tokens, for instance.  Just because
various string types were confused in the past is no reason to settle
on a single string type as "the only true string".  If you can do it,
fine, but you'll have to come up with a substitute name for the more
general concept, or you're going to be fighting the culture continually
from here on out.  I don't like culture wars...

I think the more general concept is "array".


The major problem with using "string" for the more general concept is confusion. People do tend to get really confused here. If you define "string of blahs" to mean "sequence of blahs" (to match the historical usage), that's reasonable on its face. But people jump to the conclusion that a string-as-bytes is re-interpretable as a string-as-text (and vice versa) via something like a cast--a reinterpretation of the bytes of some in-memory representation. With a general sequence, one wouldn't be tempted to think that a string-of-quaternions was necessarily re-interpretable as a string-of-PurchaseOrders. I don't think it's culturally possible to shake this text-is-really-just-bytes view without using distinct terminology.
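
A rough Perl 5 sketch of why the "cast" mental model breaks down ($bytes is just a made-up three-octet blob):

    use Encode qw(decode);

    my $bytes = "\xC3\xA9\xFF";   # some arbitrary octets
    # There's no cast from bytes to text; you have to say which encoding you
    # mean, and the answer differs--or fails outright--depending on the choice:
    my $as_latin1 = decode("ISO-8859-1", $bytes);                      # 3 characters
    my $as_utf8   = eval { decode("UTF-8", $bytes, Encode::FB_CROAK) };
    print "not valid UTF-8\n" unless defined $as_utf8;   # 0xFF can't occur in UTF-8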

I'm not vehemently opposed to jettisoning the word "string" entirely, and instead using "Text" and "Sequence" for the above concepts--that's the usual way to deal with an ambiguous term. But the downside is that it forms a learning barrier for people coming from other languages. I think that "string" meaning text, and "array" meaning general sequence would be the most consistent with current general usage. But my main concern is that we distinguish between different concepts, by using different names.

I believe that bringing clarity to this area is crucial.

Since Perl 5 doesn't give you a way to manipulate a byte sequence as anything other than a string, I think it's an open question whether current Perl users really think in terms of a generalized string, or whether they've just never been given the tools to distinguish. It would be interesting to know how many programmers, faced with the question "what's a string?", would give an answer that isn't text-centric.
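
(A sketch of what I mean--"photo.jpg" is a made-up file, but any binary data illustrates it:)

    open my $fh, "<", "photo.jpg" or die "can't open: $!";
    binmode $fh;
    read $fh, my $blob, 1024;
    # As far as Perl 5 is concerned, $blob is simply a string: length(), uc(),
    # substr(), and regex matches all apply, even though none of them mean
    # anything sensible for JPEG data.
    print length($blob), "\n";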

Jeff


