RE: numeric ordering

2001-09-21 Thread Karlsson Kent - keka



  1.  Is there another document/algorithm/table that does provide
  guidelines for sorting numbers within strings?  Something
  that deals with different scripts?
 
 ISO/IEC 14651 International String Ordering includes
 an informative annex on this topic. In particular, see
 C.2 Handling of numeral substrings in collation. The specific

C.3 in my copy...

 case of sorting multiple-part section numbering is not
 addressed in detail, 

...because that is subsumed under C.3.1 (Handling of 'ordinary'
numerals for natural numbers), when also considering
FULL STOP to separate numerals, and not be part of them
(which is usually the case for natural number numerals).
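
To illustrate the point about FULL STOP acting as a separator rather than as part of a numeral, a multi-part section number can be compared as a sequence of natural numbers. This is only a minimal Python sketch; the helper name `section_key` is invented for the example and is not taken from ISO/IEC 14651:

```python
def section_key(label):
    """Split a label such as '1.10.2' on FULL STOP and read each part
    as a natural number, so that 1.10 sorts after 1.9."""
    return tuple(int(part) for part in label.split("."))


labels = ["1.10", "1.2", "1.9.3", "1.9"]
print(sorted(labels, key=section_key))
```

The sketch assumes the label consists only of digits and full stops; anything else would need the kind of segmentation discussed elsewhere in this thread.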

(Teknisk norm nr. 34, Swedish Alphanumeric Sorting, [Swedish] Statskontoret,
1992, has a somewhat different approach to the same problem; however, that
document is only available in Swedish, does not go into details on this, and
even though it describes a multi-level ordering it does not fit well with
the UTR10/14651 framework...)

/Kent Karlsson

 but many similar kinds of problems
 are.
 
 --Ken
 
 




RE: numeric ordering

2001-09-20 Thread Marco Cimarosti

Viranga Ratnaike wrote:
 [...]
 UTR10
 [...]
   2. numeric formatting: numbers composed of a string of digits or
  other numerics will not necessarily sort in numerical order. 

That's right.  Unicode is a standard for encoding text, so also its
guidelines for sorting only deal with textual sorting.

This does not mean that a mixed numerical/textual sorting may not be
implemented with Unicode: it just means that specifying such a thing is out
of the scope of UTR#10.

 [...]
   1.  Is there another document/algorithm/table that does provide
   guidelines for sorting numbers within strings?  Something
   that deals with different scripts?

I don't know; you can probably find something on the Internet.  It is not a
Unicode-specific problem.

I can try and come up with some common sense ideas about such an algorithm.
I think that the first thing to do should be to split your string into textual
and numerical segments, and compare each segment on its own.

Say that your string is 1.2.3 Sorting Techniques.  You should split it
into six typed segments (types are N=numeric and T=textual):

1) N: 1
2) T: .
3) N: 2
4) T: .
5) N: 3
6) T:  Sorting Techniques

Notice that, in order to do such a segmentation, you must define your own
syntax for numbers.  I.e., it is up to you to define whether "1,234" is the
number 1234 or the number 1 + "," + the number 234.
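
The segmentation step could be sketched in Python roughly as follows. The number syntax here (a maximal run of decimal digits) is just one possible choice, and the name `split_segments` is made up for this illustration:

```python
import re

# Assumed number syntax: a maximal run of decimal digits. Under this
# choice "1,234" is read as 1, ",", 234 and not as the number 1234.
_DIGIT_RUN = re.compile(r"(\d+)")


def split_segments(s):
    """Return a list of (type, value) pairs; type is 'N' or 'T'."""
    segments = []
    for piece in _DIGIT_RUN.split(s):
        if not piece:
            continue  # re.split yields empty strings at the edges
        if piece.isdecimal():
            segments.append(("N", int(piece)))
        else:
            segments.append(("T", piece))
    return segments


print(split_segments("1.2.3 Sorting Techniques"))
```

A different number syntax (thousands separators, decimal fractions, signs) would only require changing the regular expression and the conversion.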

Then you can sort the text using a compare algorithm like this:

a) take the 1st segments of both strings;

b) if the two segments have different types, the N segment comes before (or
after) the T segment;

c) if both segments are N, compare them numerically (the smallest number
comes first);

d) if both segments are T, compare them textually (e.g., apply UTR#10);

e) if the two segments compare equal, and both strings have at least a next
segment, take the next segment and go back to point (b);

f) if all segments compared equal, forget the segments and compare the whole
string textually (e.g., apply UTR#10).
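
Steps (a) to (f) might look like this in Python. It is a sketch under several assumptions: numbers are plain runs of decimal digits, N segments are placed before T segments in step (b), and plain code point comparison stands in for a real textual collation such as UTR#10. All function names are invented for the example.

```python
import re
from functools import cmp_to_key


def split_segments(s):
    # Assumption: a number is a maximal run of decimal digits.
    return [("N", int(p)) if p.isdecimal() else ("T", p)
            for p in re.split(r"(\d+)", s) if p]


def text_cmp(a, b):
    # Stand-in for a real collation (e.g. UTR#10): code point order.
    return (a > b) - (a < b)


def compare(x, y):
    # (a), (e): walk the two segment lists in parallel
    for (tx, vx), (ty, vy) in zip(split_segments(x), split_segments(y)):
        if tx != ty:
            return -1 if tx == "N" else 1      # (b) here N comes first
        if tx == "N":
            c = (vx > vy) - (vx < vy)          # (c) numeric compare
        else:
            c = text_cmp(vx, vy)               # (d) textual compare
        if c:
            return c
    # (f) all paired segments were equal: compare the whole strings
    return text_cmp(x, y)


titles = ["1.10 Tables", "1.2 Sorting", "1.9 Keys"]
print(sorted(titles, key=cmp_to_key(compare)))
```

When one string runs out of segments before the other, this sketch also falls through to the whole-string comparison of step (f); the steps as listed leave that case open, so this is a choice made for the example.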

   2.  In practice, are digits from different scripts ever mixed?

I don't think this normally happens.

E.g., imagine mixing Arabic-Hindi digits with European digits: that would be
a mess for the reader because Arabic digits five and six look almost
identical to European digits zero and seven.

However, it is common to mix European digits with non-digital numbering
systems, such as the Roman numerals.  It is common to see section numbers in
books labeled like this: VII.9.6.

These old numbering systems, however, have the additional problem that they
are not easily distinguished from other text.  In most cases, these numbers
are not spelled with special numeric characters (such as the digits), but
rather use the normal letters or ideographs used to spell normal text.  This
problem occurs with the old numbering systems of several scripts: Latin,
Greek, Armenian, Georgian, Hebrew, Arabic, and Chinese.

   If so, how do you sort two different digits which have the
   same numeric value?

I suggested point (f) in the algorithm above: if all else fails, revert to a
normal textual compare.

_ Marco




Arabic vs European digit shapes (was RE: numeric ordering)

2001-09-20 Thread Roozbeh Pournader

On Thu, 20 Sep 2001, Marco Cimarosti wrote:

  2.  In practice, are digits from different scripts ever mixed?

 I don't think this normally happens.

Yes, that happens in Persian contexts. There are texts that use both kinds
of digits: Arabic-Extended ones for numerical values and European ones for
references to Latin texts. For example, in a text of typographic quality,
one may use European digits for referring to Unicode 3.1.1, but
Arabic-Extended ones for the text's own page or section numbers. This is
becoming more and more common when referring to numbered versions of foreign
software.

But seeing a Latin two immediately adjacent to a Persian one, or
seeing them in different fields of a section number, for example, no, I
have not seen such a thing.

 E.g., imagine mixing Arabic-Hindi digits with European digits: that would be
 a mess for the reader because Arabic digits five and six look almost
 identical to European digits zero and seven.

1. No, we are talking about typewritten text, I guess. These digits are
clearly distinguished in such contexts in all the fonts I know.

2. BTW, Extended-Arabic variants of five and six are very different
from European zero and seven, even when handwritten.

roozbeh





RE: Arabic vs European digit shapes (was RE: numeric ordering)

2001-09-20 Thread Marco Cimarosti

Roozbeh Pournader wrote
 2.  In practice, are digits from different scripts ever mixed?
 
  I don't think this normally happens.
 
 Yes, that happens in Persian contexts. There are texts that 
 use both kinds of digits: Arabic-Extended ones for numerical
 values and European ones for references to Latin texts.
 [...]

Right, I was naive.  In this case, I guess that a line beginning with
12-34-56 should go near a line beginning with ۱۲-۳۴-۵۶ (the same numbers
written in Arabic-Extended, i.e. Persian, digits).
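
As a small illustration of putting the two lines near each other, digit runs from either script can be read as the same integers. The sketch below relies on Python's `int()` and the `\d` pattern both accepting Unicode decimal digits; the helper names are made up for the example:

```python
import re


def numeric_fields(s):
    """Read every run of decimal digits, in any script, as an int."""
    return [int(run) for run in re.findall(r"\d+", s)]


def sort_key(s):
    # Numeric values first; the raw text only breaks ties, in the
    # spirit of point (f) of the algorithm discussed in this thread.
    return (numeric_fields(s), s)


european = "12-34-56"
persian = "\u06F1\u06F2-\u06F3\u06F4-\u06F5\u06F6"  # the same numbers in Persian digits

print(numeric_fields(european) == numeric_fields(persian))
print(sorted([persian, "12-34-57", european], key=sort_key))
```

With this key the European and Persian spellings of the same number end up adjacent, and which of the two comes first is decided only by the textual tie-break.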

But what will this look like in a bidi context?  Probably, the two section
numbers will go on opposite sides of the line.  And, I wonder, what will the
Persian number look like in RTL?  My mail client shows the string above as
56-23-12!  Would such a collation be friendly for a human reader?

Do you have concrete examples of how Persian book indices are organized in
such cases?  Are European and Persian numbers listed separately?

 1. No, we are talking about typewritten text, I guess. These 
 digits are clearly distinguished in such contexts in all the
 fonts I know.

Well, I don't know if the distinction is so clear.  Certainly a typographer
or a careful reader would notice the difference, yet I'd try to avoid these
cases in real life.  Even in European usage, when letters and digits are
used together to form identifiers or part numbers, it is customary to
exclude letters I and O to avoid confusion with one and zero.

 2. BTW, Extended-Arabic variants of five and six are very 
 different from European zero and seven, even when
 handwritten.

Well, five is still too similar to European zero, and six is similar
to European nine.  But, OK, the possibility of confusion is much smaller
than with the digits used in Arab countries.

_ Marco




RE: numeric ordering

2001-09-20 Thread てんどうりゅうじ
Why not have as part of your kanji collation order, the Han digits one through nine, 
in that order?
Why are they called CJK UNIFIED IDEOGRAPHs, anyway? Only a committee would come up 
with a name like that.



じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


These old numbering systems, however, have the additional problem that they
are not easily distinguished from other text.  In most cases, these numbers
are not spelled with special numeric characters (such as the digits), but
rather use the normal letters or ideographs used to spell normal text.  This
problem occurs with the old numbering systems of several scripts: Latin,
Greek, Armenian, Georgian, Hebrew, Arabic, and Chinese.

  If so, how do you sort two different digits which have the
  same numeric value?

I suggested point (f) in the algorithm above: if all else fails, revert to a
normal textual compare.

_ Marco




Re: numeric ordering

2001-09-20 Thread Kenneth Whistler

Viranga asked:

   Questions
   ---------
 
   1.  Is there another document/algorithm/table that does provide
   guidelines for sorting numbers within strings?  Something
   that deals with different scripts?

ISO/IEC 14651 International String Ordering includes
an informative annex on this topic. In particular, see
C.2 Handling of numeral substrings in collation. The specific
case of sorting multiple-part section numbering is not
addressed in detail, but many similar kinds of problems
are.

--Ken





RE: numeric ordering

2001-09-20 Thread Ayers, Mike

 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
 Sent: Thursday, September 20, 2001 12:10 PM
 
 Why not have as part of your kanji collation order, the Han 
 digits one through nine, in that order?

I believe that would be because they are not ordinarily sorted that
way.
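
For what it is worth, plain code point order does not put the Han digits in numeric order either. A quick Python check, assuming only the standard `unicodedata` module (which reports numeric values for these ideographs):

```python
import unicodedata

han_digits = "一二三四五六七八九"  # one through nine

# Order by code point versus order by numeric value.
by_code_point = "".join(sorted(han_digits))
by_value = "".join(sorted(han_digits, key=unicodedata.numeric))

print(by_code_point)
print(by_value)
print(by_value == han_digits)
```

So a collation that wants one through nine in that order has to say so explicitly, for instance with a key based on the numeric value as above.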

 Why are they called CJK UNIFIED IDEOGRAPHs, anyway? Only a 
 committee would come up with a name like that.

Beats the stuffing out of your choice of calling them "kanji" in one
half of the sentence and "Han digits" in the next!  In any case, that's
probably the only good name, since some of the characters are unique to each
of China, Japan, and Korea; some are used by all three; some are used in
China and Japan, and some in China and Korea (I believe that there are none
which are used in Japan and Korea but not China, but I could be wrong).

 じゅういっちゃん (Juuitchan)
 Well, I guess what you say is true,
 I could never be the right kind of girl for you,
 I could never be your woman
   - White Town

It is generally considered good practice to put your signature at
the *bottom*.


/|/|ike