RE: numeric ordering

2001-09-21 Thread Karlsson Kent - keka



> > 1.  Is there another document/algorithm/table that does provide
> > guidelines for sorting numbers within strings?  Something
> > that deals with different scripts?
> 
> ISO/IEC 14651 "International String Ordering" includes
> an informative annex on this topic. In particular, see
> C.2 Handling of numeral substrings in collation. The specific

C.3 in my copy...

> case of sorting multiple-part section numbering is not
> addressed in detail, 

...because that is subsumed under C.3.1 (Handling of 'ordinary'
numerals for natural numbers), provided that FULL STOP is treated
as a separator between numerals and not as part of them
(which is usually the case for natural number numerals).

(Teknisk norm nr. 34, Swedish Alphanumeric Sorting, [Swedish] Statskontoret,
1992, has a somewhat different approach to the same problem; however, that
document is only available in Swedish, does not go into details on this, and
even though it describes a multi-level ordering it does not fit well with
the UTR10/14651 framework...)

/Kent Karlsson

> but many similar kinds of problems
> are.
> 
> --Ken
> 
> 




RE: numeric ordering

2001-09-20 Thread Ayers, Mike

> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
> Sent: Thursday, September 20, 2001 12:10 PM
> 
> Why not have as part of your kanji collation order, the Han 
> digits one through nine, in that order?

I believe that would be because they are not ordinarily sorted that
way.

> Why are they called CJK UNIFIED IDEOGRAPHs, anyway? Only a 
> committee would come up with a name like that.

Beats the stuffing out of your choice of calling them "kanji" in one
half of the sentence and "Han digits" in the next!  In any case, that's
probably the only good name, since some of the characters are unique to each
of China, Japan, and Korea; some are used by all three; some are used in
China and Japan, and some by China and Korea (I believe that there are none
which are used in Japan and Korea, but not China, but I could be wrong).

> じゅういっちゃん (Juuitchan)
> Well, I guess what you say is true,
> I could never be the right kind of girl for you,
> I could never be your woman
>   - White Town

It is generally considered good practice to put your signature at
the *bottom*.


/|/|ike


Re: numeric ordering

2001-09-20 Thread Kenneth Whistler

Viranga asked:

>   Questions
>   -
> 
>   1.  Is there another document/algorithm/table that does provide
>   guidelines for sorting numbers within strings?  Something
>   that deals with different scripts?

ISO/IEC 14651 "International String Ordering" includes
an informative annex on this topic. In particular, see
C.2 Handling of numeral substrings in collation. The specific
case of sorting multiple-part section numbering is not
addressed in detail, but many similar kinds of problems
are.

--Ken





RE: numeric ordering

2001-09-20 Thread
Why not have as part of your kanji collation order, the Han digits one
through nine, in that order?

Why are they called CJK UNIFIED IDEOGRAPHs, anyway? Only a committee would
come up with a name like that.



じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


>These old numbering systems, however, have the additional problem that they
>are not easily distinguished from other text.  In most cases, these numbers
>are not spelled with special numeric characters (such as the digits), but
>rather use the normal letters or ideographs used to spell normal text.  This
>problem occurs with the old numbering systems of several scripts: Latin,
>Greek, Armenian, Georgian, Hebrew, Arabic, and Chinese.
>
>>  If so, how do you sort two different digits which have the
>>  same numeric value?
>
>I suggested point (f) in the algorithm above: if all else fails, revert to a
>normal textual compare.
>
>_ Marco
>
>


RE: Arabic vs European digit shapes (was RE: numeric ordering)

2001-09-20 Thread Marco Cimarosti

Roozbeh Pournader wrote
> > >   2.  In practice, are digits from different scripts ever mixed?
> >
> > I don't think this normally happens.
> 
> Yes, that happens in Persian contexts. There are texts that 
> use both kinds of digits. Arabic-Extended ones for numerical
> values and European ones for references to Latin texts.
> [...]

Right, I was naive.  In this case, I guess that a line beginning with
"12-34-56" should go near a line beginning with "۱۲-۳۴-۵۶" (the same numbers
written in Arabic-Extended=Persian digits).

But what will this look like in a bidi context?  Probably, the two section
numbers will go on opposite sides of the line.  And, I wonder, what will the
Persian number look like in RTL?  My mail client shows the string above as
"56-23-12"!  Would such a collation be friendly for a human reader?

Do you have concrete examples of how Persian book indices are organized in
such cases?  Are European and Persian numbers listed separately?

> 1. No, we are talking about typewritten text, I guess. These 
> digits are clearly distinguished in such contexts in all the
> fonts I know.

Well, I don't know if the distinction is so clear.  Certainly a typographer
or a careful reader would notice the difference, yet I'd try to avoid these
cases in real life.  Even in European usage, when letters and digits are
used together to form identifiers or part numbers, it is customary to
exclude letters "I" and "O" to avoid confusion with one and zero.

> 2. BTW, Extended-Arabic variants of "five" and "six" are very 
> different from European "zero" and "seven", even when
> handwritten.

Well, "five" is still too similar to European "zero", and "six" is similar
to European "nine".  But, OK, the possibility of confusion is much smaller
than with the digits used in Arab countries.

_ Marco




Arabic vs European digit shapes (was RE: numeric ordering)

2001-09-20 Thread Roozbeh Pournader

On Thu, 20 Sep 2001, Marco Cimarosti wrote:

> > 2.  In practice, are digits from different scripts ever mixed?
>
> I don't think this normally happens.

Yes, that happens in Persian contexts. There are texts that use both kinds
of digits. Arabic-Extended ones for numerical values and European ones for
references to Latin texts. For example, in a text of typographic quality,
one may use European digits for referring to Unicode "3.1.1", but
Arabic-Extended ones for the text's own page or section numbers. This is
becoming more and more common when referring to numbered versions of foreign
software.

But seeing a Latin "two" immediately adjacent to a Persian "one", or
seeing them in different fields of a section number, for example, no, I
have not seen such a thing.

> E.g., imagine mixing Arabic-Hindi digits with European digits: that would be
> a mess for the reader because Arabic digits "five" and "six" look almost
> identical to European digits "zero" and "seven".

1. No, we are talking about typewritten text, I guess. These digits are
clearly distinguished in such contexts in all the fonts I know.

2. BTW, Extended-Arabic variants of "five" and "six" are very different
from European "zero" and "seven", even when handwritten.

roozbeh





RE: numeric ordering

2001-09-20 Thread Marco Cimarosti

Viranga Ratnaike wrote:
> [...]
> 
> [...]
>   2. numeric formatting: numbers composed of a string of digits or
>  other numerics will not necessarily sort in numerical order. 

That's right.  Unicode is a standard for encoding text, so its guidelines
for sorting deal only with textual sorting.

This does not mean that a mixed numerical/textual sorting cannot be
implemented with Unicode: it just means that specifying such a thing is out
of the scope of UTR#10.

> [...]
>   1.  Is there another document/algorithm/table that does provide
>   guidelines for sorting numbers within strings?  Something
>   that deals with different scripts?

I don't know; you can probably find something on the Internet.  It is not a
Unicode-specific problem.

I can try and come up with some common sense ideas about such an algorithm.
I think that the first thing to do should be to split your string into
textual and numerical segments, and compare each segment on its own.

Say that your string is "1.2.3 Sorting Techniques".  You should split it
into six typed segments (types are N=numeric and T=textual):

1) N "1"
2) T "."
3) N "2"
4) T "."
5) N "3"
6) T " Sorting Techniques"

Notice that, in order to do such a segmentation, you must define your own
syntax for numbers.  I.e., it is up to you to define whether "1,234" is
number 1234 or number one + "," + number 234.
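
The segmentation step might be sketched in Python roughly as follows.
This is only an illustration: the function name split_segments and the
thousands_sep parameter are made up here, and the number syntax (runs of
ASCII digits, optionally grouped in threes by a separator) is one possible
choice among many.

```python
import re

def split_segments(text, thousands_sep=None):
    """Split text into typed segments: ("N", int) or ("T", str).

    Assumed number syntax: a run of ASCII digits. If thousands_sep is
    given, groups such as "1,234" are read as a single number instead.
    """
    if thousands_sep:
        sep = re.escape(thousands_sep)
        number = re.compile(r"[0-9]{1,3}(?:%s[0-9]{3})+|[0-9]+" % sep)
    else:
        number = re.compile(r"[0-9]+")

    segments = []
    pos = 0
    for match in number.finditer(text):
        if match.start() > pos:
            # Text between the previous number and this one.
            segments.append(("T", text[pos:match.start()]))
        digits = match.group()
        if thousands_sep:
            digits = digits.replace(thousands_sep, "")
        segments.append(("N", int(digits)))
        pos = match.end()
    if pos < len(text):
        # Trailing text after the last number.
        segments.append(("T", text[pos:]))
    return segments

print(split_segments("1.2.3 Sorting Techniques"))
print(split_segments("1,234"))
print(split_segments("1,234", thousands_sep=","))
```

With no separator configured, "1,234" comes out as number one + "," +
number 234; with thousands_sep="," it comes out as the single number 1234.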

Then you can sort the text using a compare algorithm like this:

a) take the 1st segments of both strings;

b) if the two segments have different types, the N segment comes before (or
after) the T segment;

c) if both segments are N, compare them numerically (the smallest number
comes first);

d) if both segments are T, compare them textually (e.g., apply UTR#10);

e) if the two segments compare equal, and both strings have at least a next
segment, take the next segment and go back to point (b);

f) if all segments compared equal, forget the segments and compare the whole
string textually (e.g., apply UTR#10).
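
Steps (a) to (f) could be sketched in Python along these lines.  Several
things here are assumptions rather than part of the description above:
numbers are runs of ASCII digits, N segments sort before T segments, a
string that runs out of segments first sorts first, and plain code point
order stands in for a real UTR#10 comparison.  The names segments,
text_cmp and mixed_cmp are invented for the example.

```python
import re
from functools import cmp_to_key

_NUMBER = re.compile(r"[0-9]+")  # assumed number syntax: runs of ASCII digits

def segments(text):
    """Split text into ("N", int) and ("T", str) segments."""
    out, pos = [], 0
    for match in _NUMBER.finditer(text):
        if match.start() > pos:
            out.append(("T", text[pos:match.start()]))
        out.append(("N", int(match.group())))
        pos = match.end()
    if pos < len(text):
        out.append(("T", text[pos:]))
    return out

def text_cmp(a, b):
    # Stand-in for a UTR#10 comparison: plain code point order.
    return (a > b) - (a < b)

def mixed_cmp(s1, s2):
    segs1, segs2 = segments(s1), segments(s2)
    for (type1, val1), (type2, val2) in zip(segs1, segs2):  # (a), (e)
        if type1 != type2:
            return -1 if type1 == "N" else 1    # (b) numbers first (a choice)
        if type1 == "N":
            result = (val1 > val2) - (val1 < val2)  # (c) numeric compare
        else:
            result = text_cmp(val1, val2)           # (d) textual compare
        if result:
            return result
    if len(segs1) != len(segs2):
        # Assumption: the string that runs out of segments sorts first.
        return -1 if len(segs1) < len(segs2) else 1
    return text_cmp(s1, s2)                         # (f) whole-string fallback

titles = ["1.10.3", "1.2.3", "1.1.1.10", "1.1.1.9", "1.2", "1", "1.1"]
print(sorted(titles, key=cmp_to_key(mixed_cmp)))
```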

>   2.  In practice, are digits from different scripts ever mixed?

I don't think this normally happens.

E.g., imagine mixing Arabic-Hindi digits with European digits: that would be
a mess for the reader because Arabic digits "five" and "six" look almost
identical to European digits "zero" and "seven".

However, it is common to mix European digits with non-digital numbering
systems, such as the Roman numerals.  It is common to see section numbers in
books labeled like this: "VII.9.6".

These old numbering systems, however, have the additional problem that they
are not easily distinguished from other text.  In most cases, these numbers
are not spelled with special numeric characters (such as the digits), but
rather use the normal letters or ideographs used to spell normal text.  This
problem occurs with the old numbering systems of several scripts: Latin,
Greek, Armenian, Georgian, Hebrew, Arabic, and Chinese.

>   If so, how do you sort two different digits which have the
>   same numeric value?

I suggested point (f) in the algorithm above: if all else fails, revert to a
normal textual compare.
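
For digits from different scripts, one way to sketch that fallback in
Python is a sort key made of the numeric values followed by the raw string.
This relies on Python's int() and the regex class \d accepting any Unicode
decimal digit; the name section_key is made up, and code point order again
stands in for a proper textual compare.

```python
import re

def section_key(label):
    # \d matches decimal digits of any script in a str pattern, and int()
    # converts them, so "12" and its Persian spelling give the same number.
    numbers = tuple(int(run) for run in re.findall(r"\d+", label))
    # The raw label breaks ties between equal numeric values.
    return (numbers, label)

persian_56 = "\u06f1\u06f2-\u06f3\u06f4-\u06f5\u06f6"  # 12-34-56 in Persian digits
persian_57 = "\u06f1\u06f2-\u06f3\u06f4-\u06f5\u06f7"  # 12-34-57 in Persian digits

labels = [persian_57, "12-34-56", persian_56, "12-34-57"]
for label in sorted(labels, key=section_key):
    print(label)
```

Labels with the same numeric value end up next to each other, and the
order between them is decided by the textual compare alone.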

_ Marco




numeric ordering

2001-09-19 Thread Viranga Ratnaike

Hi All,

some questions, but first some background.

This morning, Alan (my technical director) raised the issue of
sorting section numbers.



1
1.1
1.1.1
1.1.1.1
1.1.1.2
1.1.1.9
1.1.1.10
1.1.1.11
1.2
1.2.3
1.10.3

The ALPHANUMERIC sort order at present will whisk the integers and
floats up to the top, but leave 1.1.1 etc. back with the words.
LEXICOGRAPHIC won't get it right when two digits are required.
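
A small Python illustration of the two-digit problem, using a plain string
sort as a stand-in for the LEXICOGRAPHIC order (an assumption about how that
order behaves), next to a hypothetical numeric_key that compares the dotted
parts as integers:

```python
sections = ["1.10.3", "1.2", "1.1.1.10", "1.1.1.9", "1.1.1.2"]

def numeric_key(section):
    # Compare "1.1.1.10" as the tuple (1, 1, 1, 10), not as characters.
    return tuple(int(part) for part in section.split("."))

lexicographic = sorted(sections)
numeric = sorted(sections, key=numeric_key)

print(lexicographic)  # 1.1.1.10 lands before 1.1.1.2, and 1.10.3 before 1.2
print(numeric)        # 1.1.1.2, 1.1.1.9, 1.1.1.10, 1.2, 1.10.3
```

This only works for labels made up purely of dot-separated integers.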

Examples of usage:

Sort all versions by version number
(where version numbers are like CVS)

Sort section by section number.




I looked at UTR#10 and noticed

-


1.2 Non-Goals

The Default Unicode Collation Element Table explicitly does
not provide for the following features:

1. ... 
2. numeric formatting: numbers composed of a string of digits or
   other numerics will not necessarily sort in numerical order. 
3. ...
4. ...



-

Questions
-

1.  Is there another document/algorithm/table that does provide
guidelines for sorting numbers within strings?  Something
that deals with different scripts?

2.  In practice, are digits from different scripts ever mixed?
If so, how do you sort two different digits which have the
same numeric value?

Regards,

Viranga