The design document is tacked on below. Note that we *do* have to support, as core languages, languages which don't force Unicode universally (Perl 5, Python, and Ruby), *and* we have to support the writing of stream filters in pure Parrot, so the goal of 100% pure unadulterated Unicode except at the very edge isn't attainable, no matter how nice it may be.

Anyway, here you go, and have at it.




Strings, a design document of sorts


A Preamble
==========

Let's get this on the table--I give. Unicode's officially enshrined
as the top level, Officially Blessed, "We think it's really keen"
standard for parrot.

Language support (computer, not human) realities mean we can't be
completely universal this way, and efficiency concerns mean that
internally we'll want to defer conversions for as long as possible, so
the guts need to be more flexible, but the presented model (presented
via ops to bytecode programs) is generally Unicode.

Requirements
============

* Efficiency - The system must do the absolute minimum amount of work
  to get the job done

* Correctness - The job that's done must actually be right

* Upgradeability - This stuff's all going to change again in five years
  so we really don't want to have to do it over again.

* Flexibility - Since, unfortunately, no one way of looking at
  strings is going to be right for everyone

Realities
=========

* There are a lot of different ways of representing text. Many of
  them annoying, some of them wildly incompatible, none of them
  wrong.

* We don't get to make the call what is right or wrong

* Some of the languages we support don't do Unicode, or do Unicode
  and other things (including perl 5 and Ruby)


Desires
=======

* We want to make it easily possible to do the right thing with string
  data

* We want all the troublesome stuff to be as invisible as possible

* We want to make it look like everyone's got what they want without
  actually doing it when we don't have to


With that list in mind, here's a solution. (It is, in large part, the current solution, only with actual explanation to go with the fairly enigmatic bits)

Definitions
===========

BYTE - 8 bits 'o data

CODE POINT - A 32-bit integer that represents a single thing in a
             character set

ENCODING - How code points are mapped to bytes, and vice versa

CHARACTER SET - Contains meta-information about code points. This
                includes the meaning of individual code points
                (65 is capital A, 776 is a combining diaeresis), a
                set of categorizations of code points (alpha,
                numeric, whitespace, punctuation, and so on), and a
                sorting order.

CHARACTER - One or more code points which make up a single real
            entity. The "oe" (I'm stuck with ASCII here, that should
            really be an o with two dots over it) in Leo's last name
            is, in the Unicode character set, a single character with
            two code points, 111 (lowercase o) and 776 (combining
            diaeresis). Characters can *not* be legitimately
            decomposed into individual code points in most cases.

Conceptually
============

The point of the string

The smallest unit of text that Parrot will process is the string,
something that can be put in an S register. These strings have the
following properties:

*) They have an encoding
*) They have a character set
*) They have a language
*) They have a taint status

The above things are independent of the view of the string presented
to bytecode programs--these are metadata elements that describe the
contents of the string as they actually exist, rather than as they
are presented.

Internally parrot is capable of maintaining strings in several
different basic encodings (8-bit, 16-bit, and 32-bit integer, as well
as UTF-8) and may load other encodings on the fly as needed. Parrot is
also capable of maintaining strings in many different character sets
(ASCII, EBCDIC, Unicode, Latin-n, etc), which are also dynamically
loadable. Finally, Parrot is capable of maintaining strings in many
different languages, which also may be loaded on the fly.

This is done for maximum efficiency, regardless of the view of the
data presented to the bytecode programs. Conversion to a different
format may be done if needed to properly express the semantics of the
program, but will not be done if not needed.

For example, consider the following:

  use Unicode;
  open FOO, "foo.txt", :charset(latin-3);
  open BAR, "bar.txt", :charset(big5);
  $filehandle = 0;
  while (<>) {
    if ($filehandle++) {
      print FOO $_;
    } else {
      print BAR $_;
    }
    $filehandle %= 2;
  }

Relatively simple, the program reads from the input filehandle and
splits the data, line by line, between two output files. The two
output files have different requirements -- FOO gets data in Latin-3,
while BAR gets it in Big5. The "use Unicode;" thing at the top's a
hand-wavey way of asserting that we want full Unicode text semantics.

Even so, there's no actual reason in this program to convert to
Unicode at all. If the input file is either Latin-3 or Big5, half of
the lines read don't have to be converted to anything. If the input
file's a proper subset of both (like, US ASCII) then none of the
lines read in need any conversion at all.

If Parrot forced all input data to be converted to Unicode internally
then this program would potentially have some significant overhead,
depending on the type of the input file. Given the output, the input
is likely either Latin-3 or Big5, either of which needs some
conversion to get turned into Unicode, while Unicode is guaranteed to
need some conversion for proper output to both files.

Synthesized code points
=======================

Parrot provides code points for all characters, even for those
character sets/encodings which don't inherently do so. Most sets that
have variable-length encodings use an escape sequence scheme--the
value of the first byte in a character determines whether the
character is a one or more byte sequence. When parrot turns these into
code points it does it by building up the final value. The first byte
is put in the low 8 bits of the integer. If there's a second byte in
the sequence the current value is shifted left 8 bits and the new byte
is stuffed in the low 8 bits. If there's a third byte in the sequence
everything is shifted left again 8 bits and that third byte is stuffed
in the bottom, and so on.

For example, in Shift-JIS, if the first byte is in the range
0x21-0x7E or 0xA1-0xDF the character is a single byte. If the first
byte is in the range 0x81-0x9F or 0xE0-0xEF the character takes two
bytes, with the first byte determining which table the second byte
indexes into. The roman character A is represented by a single byte
0x41, while the Japanese hiragana KA is represented by the byte
sequence 0x82 0xA9. When parrot turns this into code points, it
becomes two integers, 0x00000041 and 0x000082A9. (Though it could
represent them as 16-bit integers, since no character takes three or
more bytes)
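The byte-packing scheme described above can be sketched directly (an illustration of the algorithm as described, not Parrot's actual code):

```python
def pack_codepoint(seq):
    """Pack a variable-length byte sequence into one integer code point.

    The first byte lands in the low 8 bits; each following byte shifts
    the accumulated value left 8 bits and fills the low byte, exactly
    as described in the text.
    """
    cp = 0
    for byte in seq:
        cp = (cp << 8) | byte
    return cp

def unpack_codepoint(cp):
    """Turn a packed code point back into its original byte sequence."""
    out = []
    while True:
        out.append(cp & 0xFF)
        cp >>= 8
        if cp == 0:
            break
    return list(reversed(out))

# The Shift-JIS examples from the text:
assert pack_codepoint([0x41]) == 0x00000041        # roman 'A'
assert pack_codepoint([0x82, 0xA9]) == 0x000082A9  # hiragana KA
```

The round trip back to bytes is just the shifts run in reverse, which is what makes this representation cheap to serialize.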

While this is somewhat unconventional, it makes the text easy to
process internally as fixed-width integers, is trivally transformed
back into a byte stream, and trivially turned from a byte stream into
integers in the first place. It also has the advantage of making what
was a variable-width encoding (some of which make it difficult or
impossible to tell, if you pick a byte at a random spot in the byte
stream, whether you're in the middle of a character or not) into a
fixed-width encoding. As such it makes a reasonably pleasant way to
manipulate this sort of text.

Conversion Rules
================

There are two types of conversions, from one thing (encoding,
charset, or language) to a thing of a similar type or to a thing of a
different type.

Similar here means a thing where the conversion is lossless or
accepted as good enough to have no semantic loss--for example
converting US ASCII to most character sets, or pretty much any
character set to Unicode.  Different here means a thing where the
conversion is *not* guaranteed lossless--for example converting from
Shift-JIS to US ASCII or from Unicode to Latin-1.

Conversion lossiness is gauged either as a potential loss (where data
*may* be lost) or actual loss (where data, after conversion, *has*
been lost). For example, while Big5 and Shift-JIS aren't
interchangeable in general (so there is potential loss), they both
have US ASCII as a subset, so it's possible that the conversion won't
actually lose any information.

Current interpreter settings determine when an exception or warning
is thrown. Some languages may deem it an error to implicitly shift to
an encoding where data may be lost and throw an error any time that
happens, others may defer the error until actual data loss occurs,
and still others may decide that data loss is fine, since if you were
worried about it in the first place you would've done something about
it.
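The potential/actual distinction maps onto strict-mode error handling in, say, Python's codecs (an analogy, not Parrot code): Unicode to Latin-1 is *potentially* lossy, but an error only occurs when the data actually can't be represented.

```python
# Unicode -> Latin-1 is potentially lossy; whether data is actually
# lost depends on the contents of this particular string.
ok = "caf\u00e9"        # 'é' exists in Latin-1: potential loss, no actual loss
bad = "\u65e5\u672c"    # Japanese text: actual loss for Latin-1

assert ok.encode("latin-1") == b"caf\xe9"   # converts cleanly

try:
    bad.encode("latin-1")                   # strict mode: error on actual loss
    actually_lost = False
except UnicodeEncodeError:
    actually_lost = True
assert actually_lost
```

An interpreter set to complain about potential loss would flag both strings up front; one deferring to actual loss would only flag the second.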

Conversions are not required nor guaranteed to be symmetric. Just
about everything can shift to Unicode, and US ASCII can shift to just
about anything, but the converse is not true.

Since maintaining a full set of conversions is untenable, Parrot
declares that, by definition, all sets can pivot through
Unicode. Unicode pivoting is considered a potential loss of data, so
if the interpreter is set to warn or throw exceptions on potential
loss it will do so, even if the conversion is actually OK. (In which
case someone had better note that somewhere) It's perfectly acceptable
(and, in fact, encouraged) for a set to declare that it can explicitly
pivot to another set, with the actual internal code first going
through Unicode.
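The pivot rule can be sketched with Python's codec machinery (the codec names `iso8859_3` and `big5` are Python's, used here purely for illustration):

```python
def pivot(data: bytes, from_charset: str, to_charset: str) -> bytes:
    """Convert between two character sets by pivoting through Unicode.

    Every set is required, by definition, to convert to and from
    Unicode, so this two-step path always exists even when no direct
    from->to conversion table does.
    """
    text = data.decode(from_charset)   # source set -> Unicode
    return text.encode(to_charset)     # Unicode -> destination set
```

A set that declares a direct Latin-3-to-Big5 conversion may perfectly well implement it as exactly these two steps under the hood.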

Internals
=========

Internally all strings are tagged with an encoding, a charset, a
language, and a taint status. This is the minimum amount of
information that can be reasonably kept for a string without losing
enough information to damage it if the data is passed into a
subroutine which expects a string parameter rather than a full-blown
PMC.

Tainting status is the simplest thing here, maintainable with a single
bit in the flags word for the string. We have to maintain this so that
the sequence:

   set S0, P0
   set P0, S0

doesn't lose the taint status of the data in P0, as well as so this:

  set S0, P0
  some_sub(S0)

passes in a properly tainted string to the some_sub subroutine. We're
encouraging code to use values of the lowest possible type, but we
don't want to be sacrificing safety for it.
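A minimal sketch of keeping taint as one bit in the string's flags word, alongside the other per-string metadata (an illustrative layout, not Parrot's actual struct):

```python
TAINTED = 1 << 0   # a single bit in the flags word

class PString:
    """Toy string header carrying the metadata listed above."""

    def __init__(self, data, encoding, charset, language, flags=0):
        self.data = data            # the raw buffer
        self.encoding = encoding    # how bytes map to code points
        self.charset = charset      # what the code points mean
        self.language = language    # case/comparison overrides
        self.flags = flags          # taint status lives in here

    @property
    def tainted(self):
        return bool(self.flags & TAINTED)

s = PString(b"from the network", "utf-8", "unicode", "en", flags=TAINTED)
assert s.tainted
```

Because the bit travels with the string itself, a `set S0, P0` / `set P0, S0` round trip can preserve it without consulting the PMC.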

Encoding needs to be attached to each string so we have some idea of
how to turn the bytes in the string's buffer into actual code
points. Since we defer transforming the string data until we actually
need to use it, regardless of what logical structure we may think the
string has, we still need to work on the actual structure it has.
This also allows easier processing of data in an encoding different
than whatever parrot may take as 'normal', if it ever does. Each
character set will have a preferred encoding, but people are going to
want to shift encodings around at times. (Especially the various utf-N
encodings)

Character set is attached so we can tell what to do with the code
points that come from the encoding and how to classify them. While we
prefer Unicode, that doesn't mean we're actually *in* unicode
yet. Also, since the possibility exists that we may at least have two
different character sets (either Unicode or binary, even if we declare
there are no others) it's less error-prone to unconditionally use the
set information hanging off the string itself.


The language determines a string's special-cased behaviour--how its case mangling should work, overridden character classifications, and some comparison information. This will often be overridden in main code, but becomes important in library code. Language is a "humor Dan" thing. It won't hurt you, really.

Core functionality
==================

The following functions need to be performed by the core:

*) Transform encodings
*) Transform character sets
*) Get/set byte, code point, and character from a string
*) Get/set substring
*) get length in bytes, code points, and characters
*) Get/Set encoding
*) Get/Set character set
*) Get/set language
*) flatten to and thaw from a binary string
*) Upcase, downcase, and titlecase

These are all unary operations. While binary operations are necessary
for actual use, we'll deal with them after we get basic string
manipulation working.

Opcodes
=======

The following ops are proposed. Note that for many of them there is a
string-native version and a Unicode version--this is noted by a
(u). For Unicode strings these will behave identically, while for
strings that aren't in unicode they perform the operation and
translate to or from unicode as necessary.

getbyte          Ix, Sy, Iz
(u)getcodepoint  Ix, Sy, Iz
(u)getcharacter  Sx, Sy, Iz

Get the byte, codepoint, or character requested. Destination is either
an integer (representing the byte or codepoint) or a string. Sy is the
source string, Iz is the offset in bytes, code points, or characters
from the beginning of the string.
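The three offset units differ, which is the whole point of having three ops. In Python terms (just an analogy for byte vs. code point vs. character addressing):

```python
text = "Zoe\u0308"              # "Zoë" written as e + combining diaeresis
data = text.encode("utf-8")

byte = data[0]                  # getbyte: the offset counts bytes
assert byte == 0x5A             # 'Z'

cp = ord(text[3])               # getcodepoint: the offset counts code points
assert cp == 0x0308             # the combining diaeresis on its own

char = text[2:4]                # getcharacter: 'e' + diaeresis is ONE character
assert char == "e\u0308"
```

Note that `getcharacter` must return a string, not an integer, precisely because a character may span several code points.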


(u)getstring Sw, Sx, Iy, Iz


This is substr, with the destination guaranteed to be in Unicode for
the (u) case.

setbyte          Sx, Iy, Iz
(u)setcodepoint  Sx, Iy, Iz
(u)setcharacter  Sx, Sy, Iz

Sets the byte, code point, or character at offset Z in source string X
to the value in Y. Note that in the Unicode case the source is taken
to be a Unicode code point or character and translated to the type of
the destination string. These opcodes may throw an exception if the
resulting destination string is illegal (for example if the
destination is a Unicode string with an illegal combining character
construction, or in the byte case if the resulting buffer is un-decodable)
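The un-decodable-buffer case looks like this in Python terms (an analogy; Parrot would throw its own exception rather than `UnicodeDecodeError`):

```python
buf = bytearray("abc".encode("utf-8"))
buf[1] = 0xFF                    # a setbyte that corrupts the UTF-8 stream

try:
    bytes(buf).decode("utf-8")   # the buffer no longer decodes
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised                    # byte-level writes can break the encoding
```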

(u)setstring Sw, Sx, Iy, Iz

This is lvalue substr--the characters at offset Y, count Z (NB
*characters*, not code points) are replaced by the string X. In the
Unicode case the string is taken to be Unicode and translated to the
type of the destination string.

encoding Ix, Sy
charset  Ix, Sy
language Ix, Sy

Returns the encoding, character set, or language of Y.

encodingname Sx, Iy
charsetname  Sx, Iy
languagename Sx, Iy

Returns the name of the encoding, character set, or language that
corresponds to the internal value Y. (As returned by the encoding,
charset, and language ops)

findencoding Ix, Sy
findcharset  Ix, Sy
findlanguage Ix, Sy

Find the internal value for the encoding, language, or character set
named Y.

bytelength      Ix, Sy
codepointlength Ix, Sy
characterlength Ix, Sy

Return the length of Y in bytes, code points, or characters. Length is
actual length, and as such may vary for otherwise identical
strings. (This is especially true for strings that change encoding, as
lengths can vary wildly between a UTF-8 and UTF-32 version of the same
unicode string)
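A hedged Python illustration of how the three lengths diverge for one string (grapheme counting is approximated here with `unicodedata.combining`, which is cruder than a real character-boundary algorithm):

```python
import unicodedata

text = "To\u0308tsch"                     # "Tötsch", o + combining diaeresis

byte_len = len(text.encode("utf-8"))      # bytelength, in UTF-8
codepoint_len = len(text)                 # codepointlength
# characterlength: count code points that don't combine with a predecessor
char_len = sum(1 for c in text if not unicodedata.combining(c))

assert (byte_len, codepoint_len, char_len) == (8, 7, 6)

# The same string re-encoded as UTF-32 has a very different byte length.
assert len(text.encode("utf-32-be")) == 28
```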

transcode Sx, Iy
transset  Sx, Iy
translang Sx, Iy

Change the string to have the specified encoding, character set, or
language. Done in place.

transcode Sx, Sy, Iz
transset  Sx, Sy, Iz
translang Sx, Sy, Iz

Generate a new version of Y with the encoding, character set, or
language Z.

tounicode Sx
tounicode Sx, Sy

Change the string to unicode. The one arg version does it in place,
the two arg version generates a new string.

(d)upcase    Sx
(d)upcase    Sx, Sy
(d)downcase  Sx
(d)downcase  Sx, Sy
(d)titlecase Sx
(d)titlecase Sx, Sy

Make the string all uppercase, all lowercase, or titlecase the first
character. The two-arg versions generate a new string, the one-arg
version does it in place. These ops have two variants--the ones with a
leading d (dupcase, ddowncase, dtitlecase) use the current interpreter
default language rule for case mangling and set that as the language
for the generated string, while the non-d versions use the information
in the string itself.
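Why case mangling needs language information, in Python terms (Python's `str.upper` applies the default Unicode rules, roughly what the d-variants would use when the interpreter default is plain Unicode):

```python
# Default Unicode case rules handle German fine...
assert "stra\u00dfe".upper() == "STRASSE"   # ß uppercases to SS; length changes!

# ...but are wrong for Turkish, where 'i' should uppercase to dotted 'İ'.
assert "istanbul".upper() == "ISTANBUL"     # default rule; Turkish wants "İSTANBUL"
```

The ß example also shows why case ops may need to allocate: the result can have a different length than the input.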

decompose Sx, Sy

Take the string in Y and return a version in X which is a flat byte
string with no language, character set, or encoding. (or, rather, the
language none, charset none, and encoding 8-bit binary)

compose Sw, Ix, Iy, Iz

Take the flattened binary string W and mark it as having the encoding
X, character set Y, and language Z. This may throw an exception if the
string doesn't meet the requirements of the language, charset, or
encoding.

compose Sv, Sw, Ix, Iy, Iz

As above, only a new string is generated and the original left alone.
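The decompose/compose round trip resembles Python's bytes/str boundary (an analogy only; Parrot keeps the metadata on the string rather than in the type):

```python
text = "na\u00efve"                      # "naïve"

raw = text.encode("latin-1")             # decompose: flat bytes, metadata gone
assert raw == b"na\xefve"

again = raw.decode("latin-1")            # compose: re-tag as Latin-1
assert again == text

try:
    raw.decode("utf-8")                  # composing the same bytes as UTF-8
    raised = False                       # fails: 0xEF starts an incomplete
except UnicodeDecodeError:               # multi-byte sequence here
    raised = True
assert raised
```

This is the exception case mentioned above: the flat bytes are fine, but the claimed encoding doesn't fit them.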

Exceptions
==========

Here's a list of the exceptions that will be thrown if the string
subsystem comes across things it's not happy about. All of these
exceptions are optional, and may be overridden by interpreter
settings. Additionally, some conversions are deemed less dangerous
than others, and as such there are two different types of conversion
(similar and dissimilar) rather than just one. These exceptions may
also be thrown either because of potential problems (where something
might happen) or actual problems (where something did happen).

* LANG_MISMATCH - thrown whenever a binary operation is done on two
  strings with differing languages when there is otherwise no
  overriding semantic in place.

* CHARSET_MISMATCH - thrown whenever a binary operation is done on
  strings of different character sets.

* LOSSY_CONVERSION - Thrown whenever a conversion would lose
  information. This includes getting a plain string from a PMC which
  has segmented string data in it. (This would be a PMC which has
  some data in Unicode, EBCDIC, and RAD-50, for example, or whose
  contents had different languages attached to different parts of the
  string data)

* DECOMPOSITION_ERROR - Thrown whenever you try and act on part of a
  multi-code point character. This includes doing an ord() on a
  string where the character you're ord'ing is made up of two or more
  code points.

--
                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
