On Mon, 30 Apr 2012 13:46:20 +0200 Michael Probst <michael.probs...@web.de> wrote:
> On Saturday, 28.04.2012, 13:18 +0100, Richard Wordingham wrote:
> > Is it anywhere stated as policy that numbers written by a string of
> > decimal digits will be encoded with the most significant digit
> > first in storage order?  I couldn't find it stated anywhere.
>
> Isn't this about encoding characters, mapping computer readable
> numbers to human readable characters (which may be digits), but not
> about encoding numbers, just as this is not about encoding words?

The comparison is appropriate.  The Unicode Standard has Standard
Annexes 14 'Unicode Line Breaking Algorithm' and 29 'Unicode Text
Segmentation'.  A lot of ill-will has been generated by forcing most
Indic script users to store their characters in roughly phonetic order.
The best justification is that it facilitates sorting and linguistically
sensitive processes like transliteration and, I hope, spell-checking in
highly inflected languages.  Thai is the principal resister, with
Thai-like scripts sheltering behind it.  (Actually, Thai collation seems
to have been designed for computers - you don't have to know how a word
is pronounced to apply the rules properly.  Interestingly, Thais seem to
apply the rules on the basis of pronunciation and get the results
slightly wrong!)

I have been told that the order should be based on collation rather
than on phonetics, in which case all the vowels in the Myanmar script
are exceptions to the principle, for CVC words are sorted on the basis
of first consonant, final consonant and then vowel.  There are two major
varieties of Lao collation: (first, vowel, final) and (first, final,
vowel).  Last time I looked at the defaults for the Unicode Collation
Algorithm, the first was implemented very imperfectly (vowel ordering in
Lao is very UCA-unfriendly) and the second seemed a great challenge for
a concise tailoring that doesn't come close to listing all the possible
rhymes without tone marks.

The basic file of the Unicode Character Database has three different
fields for numeric value, so someone cares a great deal about the
interpretation of numerical characters.  There is a hierarchy of numeric
types ranging from decimal (digit), to digit, to numeric - see
http://www.unicode.org/Public/UNIDATA/extracted/DerivedNumericType.txt .
Only the first is totally aligned with a general category.

> Arabs store (and read, and understand) the least significant digit of
> a number first, on the right, on paper.
<snip>
>
> "although digits run the other way, making the scripts
> inherently bidirectional"
>
> http://unicode.org/faq/bidi.html#0
>
> I don't think people writing Ivrit or Arabic perceive their writing as
> bidirectional. Just their computers!

It would be much better if the standard came clean and admitted that
bidirectionality is a result of insisting on storing digits in
decreasing order of significance.  Does anyone here know the history of
this encoding order for Arabic digits?  I can guess that occasional data
corruption in swapping the storage order of letters from left-to-right
to right-to-left was tolerable, whereas accidentally reversing the
digits in numbers recorded as text but functioning as numbers could have
been catastrophic.

Actually, some users of the Arabic script may feel that their numbers
are written in a funny order.  Or perhaps not - I've never felt that
calling the least significant bit of a byte bit 0 was bizarre.

> > As positional notation only seems to have been invented and
> > propagated once or twice (Babylonian and Indian inventions),
>
> I don't think the Mayas copied this idea from hi.wikipedia.org ;-)

I stand corrected.

Richard.
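
P.S. The numeric-type hierarchy mentioned above (decimal, digit,
numeric) can be probed from Python's standard unicodedata module.  This
is only a rough sketch: numeric_type() is a helper name made up for
this illustration, and it assumes that the module's decimal(), digit()
and numeric() lookups mirror the three numeric fields of the UCD
closely enough for the purpose.  The last lines show that a run of
Arabic-Indic digits is stored most significant digit first.

```python
import unicodedata


def numeric_type(ch: str) -> str:
    """Classify one character as 'Decimal', 'Digit', 'Numeric' or 'None'.

    Hypothetical helper for illustration; it tries the narrowest
    numeric type first and falls back to the wider ones.
    """
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"
    if unicodedata.digit(ch, None) is not None:
        return "Digit"
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"
    return "None"


# ASCII five, Arabic-Indic three, superscript two, one half, a letter
for ch in ("5", "\u0663", "\u00b2", "\u00bd", "A"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {numeric_type(ch)}")

# Arabic-Indic ONE, TWO, THREE in storage order: the most significant
# digit comes first, so int() reads the string as one hundred and
# twenty-three.
arabic_indic = "\u0661\u0662\u0663"
print(int(arabic_indic))
```

On this reading, SUPERSCRIPT TWO is a digit but not a decimal digit,
and VULGAR FRACTION ONE HALF is numeric only, which matches the
remark that only the decimal type lines up with a general category.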