In light of ongoing discussions of Parrot's string model, I've decided to prepare a document spelling out my general viewpoint on the subject of strings. It's also intended to supply a self-consistent set of terminology, and it will frame my future comments. Some notes:

1) This is explicitly arguing for one viewpoint, not against another. (That is, I've not included any such arguments.)
2) I've written it in a manner which doesn't directly mention Parrot, so that I can use it in other contexts.
3) There are two separate issues with respect to Parrot: the string semantics, and the internal representation/implementation. This addresses only the former. It also doesn't include an API recommendation, because the semantics need to be sorted out first.
4) There are a few topics yet-to-be-covered.
5) I should probably provide a (much shorter) synopsis.


----------------------------------------------------------------------
What's a String?

This article is intended to clarify some concepts related to strings. The motivation is that they are a frequent source of confusion for developers working in various programming languages. This lays out a framework for thinking about strings--a viewpoint, if you will. It reflects a conceptual model that's already embraced by some programming environments, but even those environments often lack a detailed explanation. This document also aims to disambiguate various terms--clarifying some, and providing self-consistent definitions of others which are used inconsistently in different contexts.

So, what is a string?

In the most general sense, a string is a data type in a programming language, meant to model text. It's meant to allow manipulation, viewing, analysis, and searching of the stuff you read, or which computer programs read on your behalf.

A bit more concretely:

    Definition: A string represents a sequence of abstract characters.

So we can't go much further until we say a few words about what a "character" is. But the two important words to remember from that definition are "represents" and "characters". (And, "sequence" is important too, but you won't overlook that one.)

What's a character?

A character, in fuzzy terms, is the smallest unit of meaning in a language.
A character is an abstract concept. Characters are things like letters, numbers, punctuation marks, the Japanese symbol for "middle"--stuff like that. The concept of a character is supposed to capture an intuitive notion--the notion usually understood by the common language speaker. (We are, after all, modelling what's essentially text--the stuff that humans read. That's the whole point.) That said, precisely determining what constitutes a character, and what doesn't, is essentially a judgement call--a decision. Fortunately, existing standards essentially all agree on this, in a general sense. For instance, the lowercase "a" and the uppercase "A" are considered to be different characters, but stylistic variants (the letter "a" in different fonts, for instance) are not (instead, they're just different graphical representations of the same abstract character). These could have been modelled differently--you could consider "a" and "A" just variants of the same character. But, as it turns out, no one does it that way.


And I mention that now to prepare you for some arbitrariness--some conventions that could have gone another way, but which are actually still intuitively meaningful.

Notice that we haven't said anything about how a computer program (or language) might model a character. That's important. So far we've just said *what* a computer program is trying to model, not how it might do it.

And also, just to hit the point home, we're not talking about the "char" datatype in C. To keep things clear, we'll refer to that as a "byte", when it comes up.

What is Unicode?

At this point I'd better explain what "Unicode" means, before I slip and use the word. People like to throw the word around a lot, and it's not always clear which of the following related (but different) things they mean. There are at least 6 different possibilities:

1) The Unicode Consortium, a group of companies and organizations that got together to sort out the handling of international text. ("Sort out" in the sense of making it trivial to produce text documents containing a mixture of different languages, rather than nearly impossible.)

2) The standard produced by that group, currently at version 4.0.

3) The character model which forms the basis of that standard, and which also forms the basis of what I'm talking about here.

4) The database of character properties (or the properties themselves) assembled by the Unicode Consortium. These properties are things like, "this character has an uppercase version over here", "this character represents the decimal digit 3", etc.

5) A family of encodings (a word defined below) used for the interchange of international text between computer programs (by means of files or network transmissions, for instance). Often, "Unicode" is used to refer to a specific one; that usage is inherently ambiguous, but harkens back to the first version of the Unicode standard, which defined only one encoding.

6) International (multi-language) text handling, in general. This usage arises because internationalization-aware applications tend to base their implementations on the Unicode character model, and tend to support the Unicode encodings. But it's important to recognize this usage, because sometimes the term "Unicode" is used in contexts which don't involve any of the above per se, but rather just indicate an awareness and handling of international text issues.

Of these, (3) is really the only one someone could "disagree" with. (That is, the consortium, standard, database, and encodings exist and are precisely defined--there's nothing philosophical there to reject.) The other important point is that these are all separate things--sometimes people will cite the need to support other encodings as an argument against the character model, which doesn't follow logically.

So the string/character model I'm espousing is based largely on the Unicode Character Model, but it would be misleading to say that I'm advocating, "use Unicode". (What would that mean, anyway?) But it's (3) that does the conceptual heavy lifting, and (4) which does the grunt work. Or, (3) is the concept, and (4) is the details.

I'll try to avoid using the word "Unicode" by itself, to avoid confusion, but if I do, then I'm probably referring to the Unicode Standard, or something it specifies.

So let's review. A string is a concrete thing--a data type in a language; it represents a sequence of (abstract) characters; a character is a unit of meaning in a language; and a character has properties, which have been collected into a database. To go a bit further, the fundamental question you can ask a string is, "what's your Nth character?" All else flows from that.
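
To make that concrete, here's a tiny sketch in Python (used here purely as illustration, not as the string API under discussion--Python happens to expose the Unicode property database via its standard unicodedata module, and its strings happen to answer the "Nth character" question directly):

    # Ask a string for its Nth character, then ask that character for some
    # of its properties via the Unicode character database.
    import unicodedata

    s = "a1,中"
    for n, ch in enumerate(s):            # the "what's your Nth character?" question
        print(n, ch,
              unicodedata.name(ch),       # e.g. "LATIN SMALL LETTER A"
              unicodedata.category(ch))   # e.g. "Ll" (letter, lowercase)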

What's a code point?

So if the fundamental question you can ask a string is, "what's your Nth character?", we have to talk a bit about how one would programmatically express the answer--that is, what sort of data type can (or should) be used to represent a character? A character can of course be represented opaquely--in an object-oriented language, you could have a character object. That's quite reasonable. As it turns out, though, people find it convenient to programmatically represent a character by an integer (think "whole number", not a specific data type here). It's convenient for several reasons--it's compact and easy to refer to in speech. And if the fundamental thing you can ask a string is what its Nth character is, then the fundamental things you do with a character are to look up its properties, and to test it for equality against other characters. So if you just go through and give each character a little serial number, then you can find the properties of a character by using its number as an index into a property table (i.e., character 3's properties are at slot number 3 in the table), and you can tell that two characters are different by checking whether they are represented by different numbers.

A number used to represent a character like this is called a "code point". Which number you choose to assign to which character is fundamentally arbitrary--its one and only use is as a unique lookup key. The letter "A" could be assigned the number 1, or 42, or 31337, or 2001, or anything else. It doesn't matter. Now, various standards have numbered various characters in various ways (though many don't actually think in these terms), so to be clear, if you are going to refer to "character number 42" or "the code point for the letter 'A'", then you really need to point out which numbering scheme you are talking about. Fortunately, the Unicode Standard has numbered *all* of them--it's given a number to essentially every character in every digitally-represented language in the world. Since its numbering scheme is comprehensive, and since any numbering scheme is arbitrary, without loss of generality we are going to use the Unicode Standard's numbering scheme for the rest of this discussion, and we'll go ahead and pretend it's the only one. So when we say, "the code point for the letter 'A'", we'll mean the code point that the Unicode Standard assigns to it, which (in case you are wondering by now) is 65. (See, the number itself is not very interesting!)
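
In Python notation (again, just for illustration--Python's ord and chr expose exactly this character-to-code-point correspondence):

    # Characters and code points: an arbitrary, but agreed-upon, one-to-one
    # correspondence between (some) integers and characters.
    print(ord("A"))              # 65 -- the code point assigned to the letter "A"
    print(chr(65))               # "A" -- and back again
    print(ord("A") == ord("a"))  # False -- different characters, different code points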

So, let's review again. For various practical reasons, it's preferable to programmatically represent characters using integers; you have to pick an arbitrary numbering scheme, and somebody's done that, and it's a good one. This numbering scheme defines a one-to-one correspondence between numbers (code points) and characters, and that makes it tempting to pretend that characters *are* numbers. But it's important to keep in the back of your mind an awareness that the numbers merely help you pick out the characters; it's the characters themselves which are important, and characters are *abstract*--they never actually live inside of a computer program. [Note: Of course, some numbers don't represent any character--there are only so many characters. So to be mathematically precise, there's a one-to-one correspondence between a subset of the integers and all characters.]

What's an encoding?

So, what we've covered so far is the type of stuff you need to work with text in-memory. We can use the number 65 to represent the letter "A", and we have a properties database to tell us whether "A" is a letter, a digit, whitespace, or whatever.

But, text being what it is, people tend to want to save it to disk, or transfer it between applications. That is, they want to do IO operations on strings.

Now, as it turns out, IO is always an operation which involves bytes. Even if your IO is hidden under an abstraction layer, at the bottom you're always reading and writing bytes. Strings are not, fundamentally, bytes. So how do you create bytes from some high-level construct like a string? You define a serialization algorithm, that's how. That's an encoding:

    Definition: An encoding is a mapping from sequences of abstract characters to sequences of bytes (and vice versa).

Since we've said that a string is supposed to represent a sequence of abstract characters, we could also just say, "An encoding is a mapping from strings to sequences of bytes (and vice versa)."
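
A minimal sketch of that mapping, in Python notation (its str.encode and bytes.decode implement this kind of string-to-bytes serialization):

    # An encoding maps a string (a sequence of abstract characters) to a
    # sequence of bytes, and back.  The bytes are just a serialized form.
    s = "Aé中"
    data = s.encode("utf-8")        # string -> bytes (serialize)
    print(data)                     # b'A\xc3\xa9\xe4\xb8\xad'
    print(data.decode("utf-8"))     # bytes -> string (deserialize): 'Aé中'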

*Important Note*: This is, in particular, an area where there is much conflicting and inconsistent terminology. The concept that I just defined is the exact same thing that the Unicode Standard refers to as a "Character Map". That would probably be the preferred term, since I don't believe it's been used elsewhere with a differing meaning. But, it's not a commonly used term, so I'm going to use "encoding" instead, with some hesitation, so that someone jumping into the middle of this discussion has a chance of figuring out what I'm talking about. My usage of "encoding" matches the usage of the XML specification, and the Objective-C Foundation framework. Java and MIME use the word "charset" for the same concept, and IANA uses "character set". Others use the term "character set" for something different. But the important thing to remember is just that, for the purposes of this document, I'm using the definition above, and when you are reading elsewhere and encounter any of those terms, you should look for a definition of what is meant there.

The absolutely crucial thing to remember about an encoding is that it defines an interchange format--that's 100% what it's for. Its purpose is to let you communicate textual data between processes (or store it to the filesystem for later retrieval by some process). Encodings have nothing, fundamentally, to do with any sort of in-memory representation of a string.

The other thing to remember about an encoding is that there are lots of them--lots and lots of national and industry standards define lots of different ones. They tend to have names like "ASCII" or "ISO-8859-1" or "Shift-JIS" or "Big5" or "UTF-8".

The reason there are so many of them is that, in the early days of computing, most everyone got it into their heads that they needed to represent textual data using only one byte per character, which meant that an encoding could only handle 256 different characters. This was fine for English (and in fact ASCII only encodes 128 characters), but as soon as you hit Europe you needed more characters--for French and German and Greek and Icelandic and Polish and Turkish and so on. You need more than 256 characters to handle all of that, so different encodings were invented to handle different collections of characters. This doesn't sound so bad, and it's fine as long as everyone just talks to the guy next door, but it quickly turns into a mess if you try to step outside--you can't create a single text document with words from multiple languages (for some combinations), and (potentially worse) when you read in a file created by someone else, you need to know what encoding they used (that is, you need metadata, of the sort supplied by MIME and HTTP headers, but not supplied by most filesystems). This wasn't fun for anyone.

The obvious thing to do, in retrospect, was to come up with a different plan--one which would allow you to handle all languages without a separate song-and-dance for each one. That's in fact what the Unicode Consortium came together to do. And they did it. It was an enormous undertaking.

Okay, back to the point. Not only are there lots of different encodings, but by their nature most only know how to encode strings containing a limited collection of characters. ASCII, for example, doesn't know how to encode Japanese Kanji. But the Unicode Standard defines a collection of encodings (UTF-8, UTF-16, and UTF-32, with big- and little-endian variants of the latter two) which can encode strings containing basically any character.
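
To illustrate that the same string serializes to different bytes under different encodings (Python notation again; the encoding names below are the conventional ones Python accepts):

    # One string, several encodings, several different byte sequences.
    s = "Aé中"
    print(s.encode("utf-8"))           # b'A\xc3\xa9\xe4\xb8\xad'  (1 + 2 + 3 bytes)
    print(s.encode("utf-16-le"))       # b'A\x00\xe9\x00-N'        (2 bytes per character here)
    print(len(s.encode("utf-32-le")))  # 12 -- always 4 bytes per character
    # s.encode("ascii") would fail: ASCII's repertoire has no "é" or "中".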

It's important to take a moment to point out how general the above definition of "encoding" is, and what it doesn't imply:

1) As mentioned, an encoding doesn't have to know how to encode every possible string. And reciprocally, not every sequence of bytes will be decodable by every encoding. (See the sketch after this list.)

2) Although an encoding is really a pair of algorithms (one to serialize, one to de-serialize), that doesn't imply that an encoding has to be strictly invertible. That is, if I encode a string into a sequence of bytes, and then decode that sequence of bytes back into a string, I might not get back the same string I started with. For example, an encoding might, conceptually, strip accents.

3) An encoding doesn't necessarily operate on a character-by-character basis. That is, the bytes to encode "ab" might not be the concatenation of the bytes to encode "a" with the bytes to encode "b". It's even possible, based on the above definition, that an encoding might be able to encode the string "ab", but not the string "ba". (In practice, I don't know if this ever occurs "in the wild".)
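
A sketch of points (1) and (2), in Python notation (Python's codecs raise errors by default; the "ignore" error handler is used below only to simulate a lossy encoding):

    # (1) Not every string is encodable, and not every byte sequence is decodable.
    try:
        "é".encode("ascii")              # ASCII can't encode this character
    except UnicodeEncodeError as e:
        print("cannot encode:", e)

    try:
        b"\xff\xfa".decode("utf-8")      # not well-formed UTF-8
    except UnicodeDecodeError as e:
        print("cannot decode:", e)

    # (2) A lossy encoding step isn't invertible: the accent is discarded,
    #     so the round trip doesn't return the original string.
    print("café".encode("ascii", errors="ignore").decode("ascii"))   # 'caf'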


So the key points here are: (1) an encoding is all about IO, (2) not all encodings guarantee data integrity, and (3) the Unicode encodings _do_ guarantee data integrity. Another important point is that the choice of an encoding is a crucial piece of metadata for bytes undergoing IO, and thus it's the sort of thing which is indicated explicitly in higher-level protocols and formats (HTTP, MIME, and XML, to name a few), and it is almost always indicated *by name*. IANA maintains a registry of these names, and many protocols use the IANA names to specify encodings.
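
For instance (Python notation; its codec registry plays the role of such a by-name lookup, and it accepts several of the conventional names for each encoding):

    import codecs

    # Different conventional names, same underlying encoding.
    print(codecs.lookup("latin-1").name)      # 'iso8859-1'
    print(codecs.lookup("ISO-8859-1").name)   # 'iso8859-1' -- same codec
    print("straße".encode("iso-8859-1"))      # b'stra\xdfe'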


Another important point: The Unicode Character Model (unlike many other standards) thinks of the process of going from strings to bytes as a multi-step process. That's useful from a pedagogical point of view, but in fact no one much cares about the results of the intermediate steps. (That is, Unicode describes the process as: characters --> code points --> code units --> bytes, and I'm saying that the code units aren't very useful to think about.) The reason for this is very simple--encodings are only about data interchange, and data interchange protocols and formats want to specify the encoding with a single name, which picks out the whole strings-to-bytes mapping. They don't much care if two different encodings could be thought of as agreeing on one of those steps, and differing on another. If you're writing a library to convert between sequences of bytes in different encodings, then this can also be useful (to minimize code duplication), but it really has nothing to do with the semantics of a string.

What's a Character Set?

As mentioned above, some people use the term "character set" to mean what I'm calling "encoding".

In my usage, a "character set" is something simpler--it's a set of characters, in the mathematical sense of "set". It's an unordered collection of characters. The letter "A", the comma, and the Japanese character for "middle"--there, that's a character set. The Unicode Standard uses the term "abstract character repertoire" for this notion. That's actually less ambiguous terminology, but quite a mouthful. I'll try to use "repertoire".

A character set, in this sense, isn't a terribly interesting concept, and it's a good term to stay away from, given the ambiguity. But people often say things like, "the ASCII character set". Now, in my usage, the ASCII standard primarily defines an encoding. However, subject to a few caveats mentioned above, an encoding implicitly picks out a character set--specifically, the set of characters that it can encode. So in light of this, I can meaningfully say "the ASCII character set", and talk about "how the Shift-JIS encoding handles characters in the ISO-8859-1 character set". I mention this not to particularly encourage the usage, but because you'll hear it, and because it's a bit less awkward to say than things like, "how the Shift-JIS encoding handles characters encodable in the ISO-8859-1 encoding". And also, it's convenient to say "ASCII characters" (through a small abuse of language) to talk about "ABC...", even in the context of another encoding.

Wow, so we've covered a lot of ground. Let's sum up again: Conceptually, strings model a sequence of abstract characters--that's their job. Encodings (or Character Maps) define interchange formats, and are only important for IO--to let you transmit strings between processes (via the network or via files). Encodings are basically algorithms, but in usage they tend to be identified by conventional names.

What's a Locale?

One of the things that people like to do with strings is to manipulate them. Two prototypical things they want to do are case transformations (uppercasing and lowercasing) and sorting. And they also want to do things like create string representations of numbers and dates, and the inverse--interpret strings as representing numbers and dates.

Now as it turns out, different people want to do these things in different ways--think of "1/31/2004" v. "31/1/2004" v. "2004-1-31".

Consequently, the sorting algorithms and number-formatting algorithms need to know which way you want to do things (and which way I want to do them). Traditionally, that means these types of algorithms take a parameter (explicitly or implicitly) to specify this choice, and traditionally, this parameter is called a "locale".

Now this word can cause a lot of confusion. It doesn't need to. In the most concrete sense, a "locale" just specifies a set of algorithms and settings--things like the format to use for dates, the comparison operation to use for sorting, and the characters to use for the decimal and thousands separators in numbers (think of "1,000.50" v. "1.000,50"). In fact, to emphasize this definition, I'm going to coin a new term: I'm going to call this a "Text Preferences Set", or "TPS" for short.

Now, where the complexity tends to come in also explains where the term "locale" came from. People, as it turns out, don't (often) sit around thinking up new sort orderings, or new formats for displaying numbers and dates. More often, they want to do these things in a way that matches the custom of some language or country or region--they want to sort strings to match somebody's phone book or dictionary. Consequently, they find it convenient to specify a TPS by specifying a language or country, using notations such as "en" for English and "en_US" for US English (v. British English). But from an API perspective, this is just a way to go look up a TPS--once you have one, it doesn't much matter where it came from (looked up by some name, or built up programmatically, piecemeal); you just use it.
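
Here's a sketch of that lookup-and-use pattern with Python's standard locale module (the locale names are system-dependent and must be installed for this to run; they're just the "look up a TPS by name" step):

    import locale

    # Look up a TPS ("locale") by a language/region name, then use its
    # number-formatting preferences.
    locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
    print(locale.format_string("%.2f", 1000.5, grouping=True))   # '1,000.50'

    locale.setlocale(locale.LC_ALL, "de_DE.UTF-8")
    print(locale.format_string("%.2f", 1000.5, grouping=True))   # '1.000,50'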

Sorting, and Tailorings

Now as I hinted at above, people in different countries like to sort strings in different ways. Fair enough. That makes it sound like I need to define a whole bunch of binary string comparison functions--one for each different custom. One for German, one for Swedish, one for English, one for Japanese, etc. As it turns out, there's a more compact way to do this.

The Unicode Standard defines a base collation algorithm (sort algorithm)--you can use it to sort strings composed of any characters at all, but it doesn't necessarily give you an order that matches any particular language's custom (though for many languages it's at least reasonable). This is called the Unicode Collation Algorithm (UCA). Now, this algorithm supports the concept of "tailorings"--tweaks to change the sort behavior for certain characters (or sequences of characters). The idea is that you start with the UCA, and supply one set of tailorings to get the customary English sort order, another set to get the German sort order, etc. There are three really nice things about this approach:

1) Rather than writing a whole new algorithm, you just tweak a base algorithm, and these tweaks are usually data-driven. (That is, they usually take the form of modifications to a weighting table, rather than an algorithmic difference.)

2) Under any set of tailorings, you can sort strings containing any characters. So, you can meaningfully talk about how Japanese words sort under the English tailorings (or English locale or traditional English TPS). It's just that, naturally, the sorting of Japanese words won't differ under English v. German sorting rules (but sorting of some English words might), and English words probably won't sort differently under Japanese v. Taiwanese sorting rules (but Japanese and Chinese words might).

3) In contexts where you don't know what language is relevant, you have a reasonable fallback sort order.

So for sorting, a language (or more generally, a TPS) just provides a bit of tuning on top of a default algorithm, and you can sort with or without taking language into account.

Now, it's important to note that the UCA isn't just sorting in terms of the numerical values of the Unicode code points assigned to characters. But you can sort this way too--it's sometimes called "binary order"--and it can make sense to do this in situations where you don't much care _what_ the sort order is, just that you have something canonical (for instance, so that you can count duplicates in a list). The reason you might want to use binary order in cases like this is that it's faster, and simpler to implement. (Also, you might know that it's an acceptable sort order for your problem domain--maybe you're sorting part numbers, for instance.)
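
A sketch of the difference, in Python notation (sorted() with no key compares code point values, i.e. binary order; locale.strxfrm applies the active locale's collation, which stands in here for a language-tailored collation--the named locale must be installed on the system):

    import locale

    words = ["Apfel", "Zebra", "Äpfel"]

    # Binary order: "Ä" (code point 196) sorts after "Z" (code point 90).
    print(sorted(words))                        # ['Apfel', 'Zebra', 'Äpfel']

    # Language-tailored order: under German rules, "Ä" sorts alongside "A".
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))    # ['Apfel', 'Äpfel', 'Zebra']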


So, another sum-up: A TPS (or "locale") is a set of preferences relating to how you want to carry out certain string operations, and often these preferences take the form of "tailorings" or tweaks on top of a base algorithm.


Another place where this sort of thing comes up is with case mappings--how you uppercase or lowercase strings depends somewhat on language-based conventions.

Another use for uppercasing or lowercasing in English is to perform case-insensitive string comparison, so that "foo" and "FOO" and "Foo" and "FoO" compare as the same. In English, you can either uppercase everything before comparison, or lowercase everything--it doesn't much matter which you do, the result will be the same. In other languages, that's not necessarily the case. Conveniently, the Unicode Standard defines a process called "case folding", which is designed to let you do case-insensitive comparisons. Case folding transforms strings into case-canonicalized representations (conceptually like uppercasing then lowercasing everything), and it does this in a single, non-language-sensitive way. So even though language is relevant to case mappings, you can use case folding to do case-insensitive string comparisons without specifying a language (or TPS).
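
Python's str.casefold happens to implement Unicode case folding, so the idea can be sketched directly:

    # Case-insensitive comparison via case folding -- no language/TPS needed.
    print("foo".casefold() == "FOO".casefold() == "FoO".casefold())   # True

    # German sharp s: lowercasing both sides wouldn't match ("straße" vs.
    # "strasse"), but folding canonicalizes both to the same string.
    print("STRASSE".casefold())                          # 'strasse'
    print("straße".casefold())                           # 'strasse'
    print("STRASSE".casefold() == "straße".casefold())   # True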

So a take-away point here is that the Unicode Standard provides infrastructure in a number of areas for doing operations in a TPS-sensitive or TPS-insensitive manner, and provides a smooth transition between the two.

A note about Language

There's one thing about language-sensitive operations which people tend to overlook, and which is vitally important: for string operations, it's not the language _of_the_string_ which matters, it's the language _of_the_reader_. The canonical example is this: Consider a list of names, some of them German, and some of them Swedish. I, as an English-speaking reader, would want to see these sorted in the _English_ sort order. I don't much care how Germans or Swedes would sort them. My local phone book doesn't take the national origin of a person into account when organizing the listing. It doesn't matter--the point of having it sorted is to allow people in the US to look up a name. This means that there isn't a problem of how to compare "English strings" and "German strings"--the language of the sort operation, not of the individual strings, determines the comparison algorithm used in the sort. That also means that it's a non-problem to have a single string containing words from different languages.

This also means that, in the US, the German word "STRASSE" (on a sign or in the name of a company, for instance) would lowercase as "strasse", even though in Germany it might be preferable to use "straße".


What's a Grapheme Cluster?

More about this later, but briefly: As it turns out, certain languages have funny conventions about grouping sequences of characters into what they see as one "thing". For instance, Spanish traditionally considers the character sequence "ll" to be a separate letter (not two "l"s), and similarly for "rr". Now, they don't really treat them specially from an "information processing" point of view--I don't believe that there are any encodings which treat "ll" as a single character, and I don't think that Spanish typewriters have a separate key for "ll". It's just that they think of it as a separate letter--it shows up when they are reciting the alphabet, and they see the word "llama" as containing four letters (the first letter being the one called "elle", with the word pronounced "yama", not "lama"), and they would sort that word after "luego" (because "ll" comes after "l").

The Unicode Standard calls this a "grapheme cluster", and (as mentioned above) it's relevant to localized sorting and word-length counting. This is a somewhat unfortunate term, since "grapheme" is a linguistics term, and it means something different to linguists. Previous versions of the Unicode Standard used the term "grapheme", but they've migrated to "grapheme cluster" to minimize confusion, though the standard is still somewhat jumbled in this regard.

But the brief take-home idea here is: a "grapheme cluster" ("grapheme" for short, with reservations) is, in this context, a sequence of one or more characters (a range in a string) which natural-language users would see as a unit. This is again a situation where language (or TPS) is relevant, but where the Unicode Standard provides a default, TPS-independent implementation which allows for tailorings. So again there's a smooth transition from language-independent to language-dependent settings.

Also, importantly, a grapheme cluster is a notion built on top of characters (it's a cluster of characters), and choosing a language lets you refine how you break up a string into grapheme clusters, but it's just a refinement--"adding a language into the mix" doesn't pick out a different semantic construct, it just helps you customize your choice of which ranges make up single graphemes.
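
A small sketch of the character v. grapheme cluster distinction, in Python notation (Python strings are sequences of characters/code points, so len() counts characters, not grapheme clusters):

    import unicodedata

    # One user-perceived unit built from two characters: "e" followed by a
    # combining acute accent displays as "é".
    s = "e\u0301"
    print(s, len(s))               # é 2  -- two characters
    for ch in s:
        print(unicodedata.name(ch))
    # LATIN SMALL LETTER E
    # COMBINING ACUTE ACCENT
    # A grapheme-cluster-aware API would report a single cluster here.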


I haven't yet covered a few important topics, such as different character sequences representing equivalent graphemes, canonical and compatibility equivalence, and Unicode normalization forms. I also haven't said anything yet about concrete implementation or API guidelines.


Jeff
