From: "E. Keown" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; "Jony Rosenne" <[EMAIL PROTECTED]>

> > This could be solved by making Phoenician and Hebrew
> > base characters equivalent
> > at the first level of collation.
>
> Could this be translated and expanded into Basic
> Not-so-Geeky English???---Elaine
Collation is the process of converting strings into binary-comparable "collation keys" (also known as "sort keys"). This is used to match words or to sort them according to a linguistic rule. Unicode defines such a rule in a table of default collation keys (known as DUCET, the "Default Unicode Collation Element Table"), which can be used to sort ALL Unicode characters in a consistent way, but also as a base for tailoring the collation order to specific languages, without needing to recreate the whole collation table for all defined characters.

A collation key can be thought of, as a first approximation, as another code substituted for each character. This works for some languages, but in fact many languages need further refinements to control how elements collate relative to each other. These refinements are organized in levels.

The first level allows sorting A < B < C, or a < b < c, while also grouping together related characters: a ~ A, b ~ B, and c ~ C. This means that "aa" sorts before both "AB" and "Ab", because ALL case differences in ALL characters are ignored at this level, and that "AB" and "Ab" are tied. For strings that fall in the same group, the case distinction comes into effect at a second level, after all characters have been compared at the first level, rather than character by character.

To make this possible, characters are given collation keys whose first item is the relative (numeric) order of the group at the first level, and whose next item is the relative order of the character within that group. So for example:

    'a' => [1; 10],  'A' => [1; 11],
    'b' => [2; 10],  'B' => [2; 11],
    'c' => [3; 10],  'C' => [3; 11].

Sorting "aa", "AB", "Ab" means sorting strings of collation keys, considering each dimension separately in successive passes:

    "aa"  => [1; 10],[1; 10]         => (1, 1); (10, 10)
    "AB"  => [1; 11],[2; 11]         => (1, 2); (11, 11)
    "Ab"  => [1; 11],[2; 10]         => (1, 2); (11, 10)
    "Aba" => [1; 11],[2; 10],[1; 10] => (1, 2, 1); (11, 10, 10)

Above, the second and third strings collate equally at the first level, with equal keys (1, 2), but are distinct at the second level, with keys (11, 11) and (11, 10), so "Ab" sorts before "AB".
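As a rough illustration of the two-pass comparison described above, here is a minimal Python sketch. The names TABLE, levels and sort_two_level are made up for this example, and the weights are the toy values from this message, not real DUCET values.

```python
# Toy table from the example: character -> (first-level, second-level) weight.
# These are illustrative values only, not real DUCET weights.
TABLE = {
    "a": (1, 10), "A": (1, 11),
    "b": (2, 10), "B": (2, 11),
    "c": (3, 10), "C": (3, 11),
}


def levels(text):
    """Return (first-level weights, second-level weights) for a string."""
    elements = [TABLE[ch] for ch in text]
    primaries = tuple(p for p, _ in elements)
    secondaries = tuple(s for _, s in elements)
    return primaries, secondaries


def sort_two_level(words):
    # Python compares tuples item by item, so every first-level weight
    # is compared before any second-level weight is looked at.
    return sorted(words, key=levels)


print(levels("AB"))                          # ((1, 2), (11, 11))
print(sort_two_level(["AB", "Ab", "aa"]))    # ['aa', 'Ab', 'AB']
```

With these toy weights, "AB" and "Ab" share the first-level key (1, 2), and only the second-level key decides their relative order.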
To make things simpler, introduce a special collation key value which is lower than all others (0 in the example below), and use it as a terminator after each level in the resulting collation string. You then get a simpler view of the collation elements as a single vector of numeric values:

    "aa"  => (1, 1, 0, 10, 10, 0)
    "AB"  => (1, 2, 0, 11, 11, 0)
    "Ab"  => (1, 2, 0, 11, 10, 0)
    "Aba" => (1, 2, 1, 0, 11, 10, 10, 0)

This gives binary-comparable vectors of numeric values. The length of the vector depends on the length (in characters or collation elements) of the input string, and on the number of levels considered.

Understand here that these collation keys are coordinates in a 2-dimensional space, instead of just one unique code like a code point. Some characters may still have the same coordinates (if considering only these two dimensions), for example:

    'à' => [1; 10],  'À' => [1; 11]

If you limit the collation level to 2, then there is no way to distinguish 'a' from 'à'. This may be a problem if you want a stable sort, because with only these keys the two would be considered fully equal.

So a Unicode collation can append a final key element that simply consists of the code point value of each character in the source string (independently of the collation elements considered). This is arbitrary (from a linguistic point of view), but it still respects the 2-level collation order by adding a pseudo third level, so that the resulting order is the same whatever the order in which the strings are presented to the sort algorithm.
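The flattened vectors and the code point tie-break can be sketched in Python along the following lines. This is only an illustration under the toy weights used in this message: TABLE, LEVEL_END and flat_key are hypothetical names, and a real implementation would take its weights from the DUCET or a tailoring.

```python
# Toy weights from the example, with the accented letters sharing the
# coordinates of the unaccented ones (illustrative, not real DUCET data).
TABLE = {
    "a": (1, 10), "A": (1, 11),
    "\u00e0": (1, 10), "\u00c0": (1, 11),   # a with grave, A with grave
    "b": (2, 10), "B": (2, 11),
    "c": (3, 10), "C": (3, 11),
}
LEVEL_END = 0  # terminator, lower than every weight in TABLE


def flat_key(text, tiebreak=False):
    """Build one flat vector: level 1, terminator, level 2, terminator."""
    elements = [TABLE[ch] for ch in text]
    key = []
    for level in range(2):
        key.extend(element[level] for element in elements)
        key.append(LEVEL_END)
    if tiebreak:
        # Pseudo third level: the raw code points of the source string.
        key.extend(ord(ch) for ch in text)
    return tuple(key)


print(flat_key("Aba"))   # (1, 2, 1, 0, 11, 10, 10, 0)
print(sorted(["Aba", "AB", "aa", "Ab"], key=flat_key))
# ['aa', 'Ab', 'AB', 'Aba']

# Without the tie-break the two keys are identical; with it they differ.
print(flat_key("a") == flat_key("\u00e0"))                                # True
print(flat_key("a", tiebreak=True) < flat_key("\u00e0", tiebreak=True))   # True
```

Because the terminator is lower than any real weight, a string that is a prefix of another at the first level ("Ab" versus "Aba") sorts first without any special handling.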
These collation rules can be given with some basic syntax, without specifying the exact collation key values (count the number of "<" symbols to determine the collation level):

    a < b < c;  a << A;  b << B;  c << C;  a = à

which are easily combined into a single rule:

    a = à << A < b << B < c << C

Read it arithmetically, with implied grouping as if these were operators with priorities, where the lowest priority is for the primary collation level, indicated by "<", and the highest priority is for the last collation level, set by "=":

    ((a = à) << A) < (b << B) < (c << C)

--

Now, back to your initial question about the geeky terms.

What was said above is that the 22 letters of Phoenician would compare equal at the first collation level with the corresponding 22 base letters of Hebrew, because these 22 Hebrew letters are comparable with them at this level (the 5 final letter forms could be compared at this level too, or at a secondary level, depending on tailored linguistic rules).

So at the first level:

    'HEBREW ALEF' = 'PHOENICIAN ALEF' < 'HEBREW BET' = 'PHOENICIAN BET'

This could be defined in the DUCET as the default collation order (and this would be enough to make Hebrew readers of Phoenician happy). Greek readers of Phoenician could just as well tailor their collation to match ALEF with ALPHA...

It is possible to do that without affecting the relative collation order of ANY Hebrew-only string, by giving the Phoenician letters a secondary or tertiary difference from their Hebrew counterparts rather than a primary difference. A collation performed only at the first level would then group together the same Phoenician words written either in the Phoenician script or in the Hebrew script (provided that no additional Hebrew combining points or final forms are used in the Hebrew transliteration of the Phoenician words).

Hope this helps.
-- Philippe.