RE: numeric ordering
> > 1. Is there another document/algorithm/table that does provide
> >    guidelines for sorting numbers within strings? Something
> >    that deals with different scripts?
>
> ISO/IEC 14651 "International String Ordering" includes
> an informative annex on this topic. In particular, see
> C.2 Handling of numeral substrings in collation. The specific

C.3 in my copy...

> case of sorting multiple-part section numbering is not
> addressed in detail,

...because that is subsumed under C.3.1 (Handling of 'ordinary'
numerals for natural numbers), when also considering FULL STOP to
separate numerals, and not be part of them (which is usually the
case for natural number numerals).

(Teknisk norm nr. 34, Swedish Alphanumeric Sorting, [Swedish]
Statskontoret, 1992, has a somewhat different approach to the same
problem; however, that document is only available in Swedish, does
not go into details on this, and even though it describes a
multi-level ordering it does not fit well with the UTR10/14651
framework...)

	/Kent Karlsson

> but many similar kinds of problems
> are.
>
> --Ken
RE: numeric ordering
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, September 20, 2001 12:10 PM
>
> Why not have as part of your kanji collation order, the Han
> digits one through nine, in that order?

I believe that would be because they are not ordinarily sorted that
way.

> Why are they called CJK UNIFIED IDEOGRAPHs, anyway? Only a
> committee would come up with a name like that.

Beats the stuffing out of your choice of calling them "kanji" in one
half of the sentence and "Han digits" in the next!

In any case, that's probably the only good name, since some of the
characters are unique to each of China, Japan, and Korea; some are
used by all three; some are used in China and Japan, and some in
China and Korea (I believe that there are none which are used in
Japan and Korea but not China, but I could be wrong).

> じゅういっちゃん (Juuitchan)
> Well, I guess what you say is true,
> I could never be the right kind of girl for you,
> I could never be your woman
> - White Town

It is generally considered good practice to put your signature at
the *bottom*.

/|/|ike
Re: numeric ordering
Viranga asked:

> Questions
> -
>
> 1. Is there another document/algorithm/table that does provide
>    guidelines for sorting numbers within strings? Something
>    that deals with different scripts?

ISO/IEC 14651 "International String Ordering" includes
an informative annex on this topic. In particular, see
C.2 Handling of numeral substrings in collation. The specific
case of sorting multiple-part section numbering is not
addressed in detail, but many similar kinds of problems
are.

--Ken
RE: numeric ordering
Why not have as part of your kanji collation order, the Han digits
one through nine, in that order?

Why are they called CJK UNIFIED IDEOGRAPHs, anyway? Only a committee
would come up with a name like that.

じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
- White Town

>These old numbering systems, however, have the additional problem that they
>are not easily distinguished from other text. In most cases, these numbers
>are not spelled with special numeric characters (such as the digits), but
>rather use the normal letters or ideographs used to spell normal text. This
>problem occurs with the old numbering systems of several scripts: Latin,
>Greek, Armenian, Georgian, Hebrew, Arabic, and Chinese.
>
>> If so, how do you sort two different digits which have the
>> same numeric value?
>
>I suggested point (f) in the algorithm above: if all else fails, revert to a
>normal textual compare.
>
>_ Marco
RE: Arabic vs European digit shapes (was RE: numeric ordering)
Roozbeh Pournader wrote:

> > > 2. In practice, are digits from different scripts ever mixed?
> >
> > I don't think this normally happens.
>
> Yes, that happens in Persian contexts. There are texts that
> use both kinds of digits. Arabic-Extended ones for numerical
> values and European ones for references to latin texts.
> [...]

Right, I was naive.

In this case, I guess that a line beginning with "12-34-56" should go
near a line beginning with "۱۲-۳۴-۵۶" (the same numbers written in
Arabic-Extended = Persian digits).

But what will this look like in a bidi context? Probably, the two
section numbers will end up on opposite sides of the line. And, I
wonder, what will the Persian number look like in RTL? My mail client
shows the string above as "56-23-12"!

Would such a collation be friendly for a human reader? Do you have
concrete examples of how Persian book indices are organized in such
cases? Are European and Persian numbers listed separately?

> 1. No, we are talking about typewritten text, I guess. These
> digits are clearly distinguished in such contexts in all the
> fonts I know.

Well, I don't know if the distinction is so clear. Certainly a
typographer or a careful reader would notice the difference, yet I'd
try to avoid these cases in real life. Even in European usage, when
letters and digits are used together to form identifiers or part
numbers, it is customary to exclude the letters "I" and "O" to avoid
confusion with one and zero.

> 2. BTW, Extended-Arabic variants of "five" and "six" are very
> different from European "zero" and "seven", even when
> handwritten.

Well, "five" is still too similar to European "zero", and "six" is
similar to European "nine". But, OK, the possibility of confusion is
much smaller than with the digits used in Arab countries.

_ Marco
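A rough sketch of how the two spellings of "12-34-56" could be brought
"near" each other: map every decimal digit, whatever its script, to its
ASCII equivalent before comparing. This is only an illustration in
Python; the helper name digit_key is made up here, and it assumes that
"digit" means any character with a Unicode decimal value, which nobody
in this thread has specified.

```python
import unicodedata


def digit_key(s):
    """Hypothetical helper: replace each decimal digit (any script)
    with the ASCII digit that has the same numeric value."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in s
    )


# The Persian (Extended Arabic-Indic) digits are written as escapes so
# that bidi reordering in a mail client cannot scramble the literal.
persian = "\u06f1\u06f2-\u06f3\u06f4-\u06f5\u06f6"
european = "12-34-56"

same_key = digit_key(persian) == digit_key(european)
```

Both strings get the same key, so a sort on digit_key puts them next to
each other. Ordering the two among themselves still needs a tie-breaker,
and the bidi display question is untouched by any of this.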
Arabic vs European digit shapes (was RE: numeric ordering)
On Thu, 20 Sep 2001, Marco Cimarosti wrote:

> > 2. In practice, are digits from different scripts ever mixed?
>
> I don't think this normally happens.

Yes, that happens in Persian contexts. There are texts that use both
kinds of digits: Arabic-Extended ones for numerical values and European
ones for references to Latin texts.

For example, in a text of typographic quality, one may use European
digits for referring to Unicode "3.1.1", but Arabic-Extended ones for
the text's own page or section numbers. This is becoming more and more
common when referring to numbered versions of foreign software.

But seeing a Latin "two" immediately adjacent to a Persian "one", or
seeing them in different fields of a section number, for example, no,
I have not seen such a thing.

> E.g., imagine mixing Arabic-Hindi digits with European digits: that would be
> a mess for the reader because Arabic digits "five" and "six" look almost
> identical to European digits "zero" and "seven".

1. No, we are talking about typewritten text, I guess. These digits are
clearly distinguished in such contexts in all the fonts I know.

2. BTW, Extended-Arabic variants of "five" and "six" are very different
from European "zero" and "seven", even when handwritten.

roozbeh
RE: numeric ordering
Viranga Ratnaike wrote:

> [...]
>
> [...]
> 2. numeric formatting: numbers composed of a string of digits or
> other numerics will not necessarily sort in numerical order.

That's right. Unicode is a standard for encoding text, so its
guidelines for sorting also deal only with textual sorting.

This does not mean that a mixed numerical/textual sorting cannot be
implemented with Unicode: it just means that specifying such a thing
is out of the scope of UTR#10.

> [...]
> 1. Is there another document/algorithm/table that does provide
> guidelines for sorting numbers within strings? Something
> that deals with different scripts?

I don't know; you can probably find something on the Internet. It is
not a Unicode-specific problem.

I can try and come up with some common sense ideas about such an
algorithm. I think that the first thing to do should be to split your
string into textual and numerical segments, and compare each segment
on its own.

Say that your string is "1.2.3 Sorting Techniques". You should split
it into six typed segments (types are N=numeric and T=textual):

	1) N: "1"
	2) T: "."
	3) N: "2"
	4) T: "."
	5) N: "3"
	6) T: " Sorting Techniques"

Notice that, in order to do such a segmentation, you must define your
own syntax for numbers. I.e., it is up to you to define whether
"1,234" is number 1234 or number one + "," + number 234.

Then you can sort the text using a compare algorithm like this:

	a) take the 1st segments of both strings;

	b) if the two segments have different types, the N segment comes
	   before (or after) the T segment;

	c) if both segments are N, compare them numerically (the smallest
	   number comes first);

	d) if both segments are T, compare them textually (e.g., apply
	   UTR#10);

	e) if the two segments compare equal, and both strings have at
	   least a next segment, take the next segment and go back to
	   point (b);

	f) if all segments compared equal, forget the segments and compare
	   the whole string textually (e.g., apply UTR#10).

> 2. In practice, are digits from different scripts ever mixed?

I don't think this normally happens. E.g., imagine mixing Arabic-Hindi
digits with European digits: that would be a mess for the reader
because Arabic digits "five" and "six" look almost identical to
European digits "zero" and "seven".

However, it is common to mix European digits with non-digital
numbering systems, such as the Roman numerals. It is common to see
section numbers in books labeled like this: "VII.9.6".

These old numbering systems, however, have the additional problem that
they are not easily distinguished from other text. In most cases,
these numbers are not spelled with special numeric characters (such as
the digits), but rather use the normal letters or ideographs used to
spell normal text. This problem occurs with the old numbering systems
of several scripts: Latin, Greek, Armenian, Georgian, Hebrew, Arabic,
and Chinese.

> If so, how do you sort two different digits which have the
> same numeric value?

I suggested point (f) in the algorithm above: if all else fails,
revert to a normal textual compare.

_ Marco
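For what it is worth, steps (a) to (f) of the algorithm in this message
can be sketched in a few lines of Python. Everything below is an
assumption made for illustration, not part of any standard: the names
segments, text_cmp and compare are invented, a "number" is taken to be
a maximal run of Unicode decimal digits (so "1,234" splits into 1, ","
and 234), N segments are arbitrarily placed before T segments, and
plain code point order stands in for a real UTR#10 collation.

```python
import re
from functools import cmp_to_key

# Assumed number syntax: one or more decimal digits, in any script.
_NUMBER = re.compile(r"(\d+)")


def segments(s):
    """Split s into typed segments: ("N", int) or ("T", str)."""
    out = []
    # With a capturing group, re.split puts the matches at odd indices.
    for i, part in enumerate(_NUMBER.split(s)):
        if part:
            out.append(("N", int(part)) if i % 2 else ("T", part))
    return out


def text_cmp(a, b):
    """Placeholder textual compare (code point order, not UTR#10)."""
    return (a > b) - (a < b)


def compare(x, y):
    # Steps (a) and (e): walk the two segment lists in parallel.
    for (tx, vx), (ty, vy) in zip(segments(x), segments(y)):
        if tx != ty:
            return -1 if tx == "N" else 1  # step (b): N first, by choice
        if tx == "N":
            c = (vx > vy) - (vx < vy)      # step (c): numeric compare
        else:
            c = text_cmp(vx, vy)           # step (d): textual compare
        if c:
            return c
    # Step (f): no segment decided, so compare the whole strings.
    return text_cmp(x, y)


titles = ["1.10 Sorting", "1.2 Sorting", "1.9 Sorting"]
ordered = sorted(titles, key=cmp_to_key(compare))
```

Here ordered comes out with "1.2" first, then "1.9", then "1.10". Since
Python's int() accepts decimal digits from any script, the same number
written in two scripts compares equal in step (c), and only step (f)
separates the two strings.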