I originally thought this could be a way of storing Unicode text in databases. However, after some thinking, I decided that the idea was completely bogus, so I thought to turn it into a joke for geeks. But it wasn't even amusing, so it went in the "Deleted Items" folder. However, I see that illogical ideas seem quite popular these days in the field of databases, even "despite the logic of the arguments presented against" them, so perhaps someone will like it.

Christopher JS Vance wrote (on May 18, 2001):

> On DEC-10, with a 36-bit word, a byte was anywhere between 1 and 36
> bits. They typically packed 5 ASCII-7 characters into a word with the
> extra bit unused.

So that's what the "packed" keyword in Pascal was for!

I was wondering: could something like this be revived in the age of 64-bit words and Unicode?

A block of 64 bits can fit 9 ASCII 7-bit characters (0.888889 octets per character: more efficient than DEC-10's packed ASCII!), or 3 Unicode 21-bit characters (2.666667 octets per character, which is not so bad for a millionaire character set).

Both options leave one bit free (9*7 = 3*21 = 63), and that 64th bit can be used to distinguish the two options, so that both can coexist in the same text stream. So, let's say that high bit 0 identifies 9*7 blocks, and high bit 1 identifies 3*21 blocks.

E.g., a string like "Good day \U0010300\U0010305\U0010304" can be packed in only two 64-bit blocks, or 16 octets (a big saving compared to the 48 octets needed in UTF-32, the 30 octets needed in UTF-16, or even the 21 octets needed in UTF-8):

    "Good day ":                    9 characters * 7 bits  = 1 block
    "\U0010300\U0010305\U0010304":  3 characters * 21 bits = 1 block

Of course, I have been slightly cheating by choosing a phrase that has exactly 9 7-bit characters followed by 3 21-bit characters. In reality, boundaries between runs of characters in different ranges occur wherever they please.
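Just to check the arithmetic, here is a little Python sketch of the packing (the within-block character order, most-significant-first, is an arbitrary choice of mine; only the "high bit as discriminator" part is in the proposal):

```python
def pack_ascii_block(chars):
    """Pack exactly 9 ASCII characters into one 64-bit block (high bit 0)."""
    assert len(chars) == 9 and all(ord(c) < 0x80 for c in chars)
    block = 0  # high bit stays 0: identifies a 9*7 block
    for c in chars:
        block = (block << 7) | ord(c)  # 9 * 7 = 63 bits
    return block.to_bytes(8, "big")

def pack_unicode_block(chars):
    """Pack up to 3 Unicode characters into one 64-bit block (high bit 1),
    filling unused positions with the padding value 0x1FFFFF."""
    assert 1 <= len(chars) <= 3
    block = 1  # high bit is 1: identifies a 3*21 block
    for i in range(3):
        cp = ord(chars[i]) if i < len(chars) else 0x1FFFFF
        block = (block << 21) | cp  # 1 + 3 * 21 = 64 bits
    return block.to_bytes(8, "big")

s = "Good day " + "\U00010300\U00010305\U00010304"
packed = pack_ascii_block(s[:9]) + pack_unicode_block(s[9:])
print(len(packed))                 # 16 octets in two 64-bit blocks
print(len(s.encode("utf-32-be")))  # 48 octets
print(len(s.encode("utf-16-be")))  # 30 octets
print(len(s.encode("utf-8")))      # 21 octets
```

(Python wants its \U escapes with 8 hex digits, hence the extra zero.)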
As a result, some characters in the ASCII range have to be encoded in 3*21 blocks. E.g., a string like "Good night \U0010300\U0010305\U0010304" is not so lucky:

    "Good nigh":           9 characters * 7 bits = 1 block
    "t \U0010300":         3 characters * 21 bits = 1 block
    "\U0010305\U0010304":  (2 characters + 1 unused position) * 21 bits = 1 block

Notice that one position is unused in the last block. For this reason, a bit combination must be reserved as a padding code. This is not a big problem, because the highest Unicode character is 0x10FFFD, much less than the highest 21-bit number. Code 0x1FFFFF is one nice choice for the filler value.

The basic rules for encoding Unicode with these 64-bit blocks could then be:

1) If there are at least 9 more characters to encode from the current position, and all 9 of them are less than U+0080, pack them in a 9*7 block and move the current position 9 positions forward. Go back to point 1.

2) Else, if there are at least 3 more characters to encode from the current position, pack them in a 3*21 block and move the current position 3 positions forward. Go back to point 1.

3) Else, if there are 1 or 2 more characters to encode from the current position, pack them in a 3*21 block, padding the unused 21-bit positions with 0x1FFFFF. The encoding process is ended.

4) Else the encoding process is ended.

For the joy of those who collect unconventional and/or aborted UTF's, I will name this "UTF-64".

UTF-64 has a single CES (let's say big-endian). The reason is that, if you don't know where the high bit is, there is no way of making sense of those 64-bit blocks.

Of course, if super-intelligent aliens arrive on our planet, bearing a writing system with billions of characters, I will withdraw this proposal and donate the name "UTF-64" to the Unicode Consortium.

_ Marco
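P.S. For collectors who insist on running code, here is a toy Python sketch of rules 1) to 4) (again, the most-significant-first character order within a block is my own arbitrary choice, not part of the proposal):

```python
def utf64_encode(text):
    """Encode a string as UTF-64 blocks, following rules 1) to 4)."""
    out = bytearray()
    i, n = 0, len(text)
    while i < n:
        run = text[i:i + 9]
        if len(run) == 9 and all(ord(c) < 0x80 for c in run):
            # Rule 1: nine ASCII characters -> one 9*7 block (high bit 0).
            block = 0
            for c in run:
                block = (block << 7) | ord(c)
            i += 9
        else:
            # Rules 2 and 3: up to three characters -> one 3*21 block
            # (high bit 1), unused positions padded with 0x1FFFFF.
            block = 1
            for j in range(3):
                cp = ord(text[i + j]) if i + j < n else 0x1FFFFF
                block = (block << 21) | cp
            i += 3
        out += block.to_bytes(8, "big")  # the single CES: big-endian
    return bytes(out)

print(len(utf64_encode("Good day \U00010300\U00010305\U00010304")))    # 16
print(len(utf64_encode("Good night \U00010300\U00010305\U00010304")))  # 24
```

As promised, the lucky string takes two blocks and the unlucky one takes three.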