RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

John Cowan wrote:
> None of which is as weird as Leghorn for Livorno (Italy).

It's as weird as some Italian names for German cities: Aquisgrana for
Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di Baviera) for
München.

_ Marco




RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

Carl W. Brown wrote:
> In Arabic do you include vowels or not?

Yes, and also consonants sometimes...

Traditional Arabic dictionary sorting uses the three-letter root ("radical")
of a word as the primary key.  So, "madrasa" (school) would be under "d"
(because its radical is "d-r-s" = to learn), ignoring the "ma-" prefix.

I doubt, however, that this system is used with automatic sort orders
generated by computers.
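
A minimal sketch of what root-keyed sorting could look like in software, assuming a hand-built lookup table from word to root. The ROOTS table, the romanized spellings, and the use of plain Latin letter order for the roots are all illustrative assumptions, not a real lexicon or a real Arabic collation:

```python
# Hypothetical word -> root table; a real system would need a full lexicon.
ROOTS = {
    "madrasa": "d-r-s",   # school
    "dars": "d-r-s",      # lesson
    "kitab": "k-t-b",     # book
    "maktab": "k-t-b",    # office
}

def root_key(word):
    # Primary key: the root (falling back to the word itself when unknown).
    # Secondary key: the word, so entries under one root stay ordered.
    return (ROOTS.get(word, word), word)

words = ["maktab", "madrasa", "kitab", "dars"]
by_root = sorted(words, key=root_key)
print(by_root)
```

With this key "madrasa" files under "d" next to "dars", not under "m". The hard part is the table itself, which is probably why purely automatic sort orders avoid the scheme.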

_ Marco




RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

Asmus Freytag wrote:
> But if you do this, all compound words starting with "data" 
> and continuing 
> with another word starting with "a" will be sorted incorrectly!
> 
> To achieve this effect, you would have to mark which AAs are 
> A-Rings and which ones are accidental adjacencies. In Danish
> one can use the SHY (soft hyphen) [...]

Real-life sort orders often ignore these subtleties and are often based on a
small set of rules which is applied blindly, regardless of the origin,
meaning, or pronunciation of headwords.

For instance, I have noticed that Dutch telephone directories always sort
the sequence "ij" as if it were "y", regardless of whether it actually occurs
in a Dutch word.  E.g., Beijing Chinese Restaurant would be listed after Mr. Bex.

Similarly, old Italian encyclopedias (e.g. Dizionario Enciclopedico Treccani)
equated "j" to "i" because, in Italian, the former is just a graphic variant
of the latter.  But this also applied to foreign names such as "Jefferson"
(which was listed between "iee-" and "ieg-"), even though, of course,
spelling it "Iefferson" would not be allowed.
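
Both cases amount to a rewrite rule applied blindly before an ordinary comparison. A rough sketch, assuming plain lowercase comparison after the rewrite; the function name, the rule tables, and the "Bezem" entry are made up for illustration:

```python
def blind_key(headword, rules):
    """Lowercase the headword, then apply each (old, new) rewrite blindly."""
    key = headword.lower()
    for old, new in rules:
        key = key.replace(old, new)
    return key

DUTCH_DIRECTORY = [("ij", "y")]   # "ij" files as if it were "y"
ITALIAN_OLD = [("j", "i")]        # "j" files as if it were "i"

dutch = sorted(["Bezem", "Beijing Chinese Restaurant", "Bex"],
               key=lambda w: blind_key(w, DUTCH_DIRECTORY))
print(dutch)
print(blind_key("Jefferson", ITALIAN_OLD))
```

The rule never asks whether "ij" or "j" belongs to a native word, which is exactly the behaviour described above.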

_ Marco




Re: [OT] o-circumflex

2001-09-10 Thread Michael Everson

At 18:04 +0200 2001-09-09, Stefan Persson wrote:

>  > well, the official spelling of the town is Aalborg.
>
>In Sweden it has always been written "Ålborg."

At one stage, in both countries, it was written Álaborg, I suspect, 
as it is in Iceland today.
-- 
Michael Everson




Re: [OT] o-circumflex

2001-09-10 Thread Michael Everson

At 18:10 -0400 2001-09-09, John Cowan wrote:
>Keld Jørn Simonsen scripsit:
>
>>  Yes, foreigners call our cities many strange things:-)
>>  København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague,
>  > and many more.

In Iceland it is Kaupmannahöfn, I believe. In unadorned English that 
would be something like Cheapmenshaven, maybe to weaken as 
Cheapenhaven, in German Kaufenhagen
-- 
Michael Everson




Re: [OT] o-circumflex

2001-09-10 Thread Keld Jørn Simonsen

On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
> Asmus Freytag wrote:
> > But if you do this, all compound words starting with "data" 
> > and continuing 
> > with another word starting with "a" will be sorted incorrectly!
> > 
> > To achieve this effect, you would have to mark which AAs are 
> > A-Rings and which ones are accidental adjacencies. In Danish
> > one can use the SHY (soft hyphen) [...]
> 
> Real-life sort orders often ignore these subtleties and are often based on a
> small set of rules which is applied blindly, regardless of the origin,
> meaning, or pronunciation of headwords.
> 

Real-life sorts, like MS Windows sorting or Linux sorting, actually adhere
to these Danish rules, once you have set up your machine for Danish.

Kind regards
Keld




Re: [OT] o-circumflex

2001-09-10 Thread Michael (michka) Kaplan

From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>

> Real-life sorts, like MS Windows sorting or Linux sorting, actually adhere
> to these Danish rules, once you have set up your machine for Danish.

And this is the *true* answer to the whole mess of attempting *multilingual*
sorts -- once the user chooses the sort they WANT, the system might handle
other language strings in a way that might be obscure to those who know the
other language but the person who expected Danish or whatever will see what
they want.

Since various sorts openly conflict with each other there is no other
general case solution which would be appropriate, anyway?

(can't believe this thread is still going on!)


MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

> On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
> > Asmus Freytag wrote:
> > > But if you do this, all compound words starting with "data" 
> > > and continuing 
> > > with another word starting with "a" will be sorted incorrectly!
> > > 
> > > To achieve this effect, you would have to mark which AAs are 
> > > A-Rings and which ones are accidental adjacencies. In Danish
> > > one can use the SHY (soft hyphen) [...]
> > 
> > Real-life sort orders often ignore these subtleties and are often based on a
> > small set of rules which is applied blindly, regardless of the origin,
> > meaning, or pronunciation of headwords.
> > 
> 
> Real-life sorts, like MS Windows sorting or Linux sorting, actually adhere
> to these Danish rules, once you have set up your machine for Danish.

If I understand what you mean, perhaps my point was not clear.

I know that "aa" sorts like "å", and that it should go after "z".  But there
are also cases when the sequence "aa" is just two a's, adjacent to each
other by pure chance.

One of these cases could be the word "dataarkiv", which I found in a Danish
web page
(http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Now: if your Windows or Linux collation states (correctly!) that "aa"
should go after "z", you may have a list ordered like this:

Order A:
1. data
2. Datben, Dr. Keld
3. Datz, Mr. Marco
4. dataarkiv
5. Datåz, Dr. Asmus

But if "dataarkiv" was written using an invisible separator between the two
a's (e.g. a soft hyphen, or a zero width non joiner), then your list would be
like this:

Order B:
1. data
2. dataarkiv
3. Datben, Dr. Keld
4. Datz, Mr. Marco
5. Datåz, Dr. Asmus

Asmus was arguing that List B would be the correct one (and this is
certainly true on, e.g., a dictionary) but, in order to obtain it, the
source text must be properly encoded with invisible separators inserted
where needed.
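
The two orders can be reproduced with one small sort key. This is only a sketch under stated assumptions: a simplified Danish alphabet (a-z, æ, ø, å), "aa" weighted as "å", case and punctuation ignored, and a soft hyphen (U+00AD) that carries no weight of its own but keeps the two a's apart. The name danish_key is made up here; it is not how Windows or Linux implement their collations.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}
SHY = "\u00ad"  # SOFT HYPHEN

def danish_key(word):
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        if text[i] == SHY:
            # Invisible separator: no weight, but it splits the digraph.
            i += 1
        elif text[i:i + 2] == "aa":
            # Adjacent a's are weighted as a single "å".
            key.append(RANK["å"])
            i += 2
        else:
            # Spaces and punctuation sort before every letter.
            key.append(RANK.get(text[i], -1))
            i += 1
    return key

names = ["Datåz, Dr. Asmus", "Datz, Mr. Marco", "data", "Datben, Dr. Keld"]
order_a = sorted(names + ["dataarkiv"], key=danish_key)
order_b = sorted(names + ["data" + SHY + "arkiv"], key=danish_key)
print(order_a)
print(order_b)
```

The algorithm is identical in both runs; only the encoding of the one word differs.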

What I was saying is that the "automatic" Order A is also often used, and I
brought the example of the Dutch phone directories (where "Beijing" is
sorted as if it was "Beying"), and of the Italian encyclopedia (where
"Jefferson" is sorted as if it was "Iefferson").

Michael (michka) Kaplan wrote:
> And this is the *true* answer to the whole mess of attempting 
> *multilingual* sorts -- once the user chooses the sort they
> WANT, the system might handle other language strings in a
> way that might be obscure to those who know the other
> language but the person who expected Danish or whatever 
> will see what they want.

And this is precisely what I was trying to say, although I was not
necessarily talking about multilingual sort ("dataarkiv" seems a purely
Danish word, although derived from Latin roots).

For some users and some usages, the "incorrect" Order A may be much more
useful than the "correct" Order B.  If the rule says that "ij" goes between
"x" and "z", a Dutchman should find the "Beijing Restaurant" between "bex-"
and "bez-".

If someone wants Order B (as may be the case for the author of a
dictionary), then they should apply Asmus' suggestion in order to drive the
collation algorithm.

_ Marco




Re: [OT] o-circumflex

2001-09-10 Thread Marcin 'Qrczak' Kowalczyk

Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti <[EMAIL PROTECTED]> pisze:

> It's as weird as some Italian names for German cities: Aquisgrana
> for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di
> Baviera) for München.

Interesting that Polish names of these cities are more like Italian
than German: Akwizgran, Augsburg, Moguncja, Monachium.

København is Kopenhaga, again more like other foreign forms than
Danish.

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





Re: [OT] o-circumflex

2001-09-10 Thread Keld Jørn Simonsen

On Mon, Sep 10, 2001 at 03:58:05PM +0200, Marco Cimarosti wrote:
> > On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
> > > Asmus Freytag wrote:
> > > > But if you do this, all compound words starting with "data" 
> > > > and continuing 
> > > > with another word starting with "a" will be sorted incorrectly!
> > > > 
> > > > To achieve this effect, you would have to mark which AAs are 
> > > > A-Rings and which ones are accidental adjacencies. In Danish
> > > > one can use the SHY (soft hyphen) [...]
> > > 
> > > Real-life sort orders often ignore these subtleties and are often based on a
> > > small set of rules which is applied blindly, regardless of the origin,
> > > meaning, or pronunciation of headwords.
> > > 
> > 
> > Real-life sorts, like MS Windows sorting or Linux sorting, actually adhere
> > to these Danish rules, once you have set up your machine for Danish.
> 
> If I understand what you mean, perhaps my point was not clear.

My point was that real-life sorts nowadays are quite sophisticated,
and the major systems have adequate sorting for Danish and other
languages with that kind of complexity.

> I know that "aa" sorts like "å", and that it should go after "z".  But there
> are also cases when the sequence "aa" is just two a's, adjacent to each
> other by pure chance.
> 
> One of these cases could be the word "dataarkiv", which I found in a Danish
> web page
> (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Yes, and ekstraarbejde - extra work. I know.

> Now: if your Windows or Linux collation states (correctly!) that "aa"
> should go after "z", you may have a list ordered like this:
> 
>   Order A:
>   1. data
>   2. Datben, Dr. Keld
>   3. Datz, Mr. Marco
>   4. dataarkiv
>   5. Datåz, Dr. Asmus
> 
> But if "dataarkiv" was written using an invisible separator between the two
> a's (e.g. a soft hyphen, or a zero width non joiner), then your list would be
> like this:
> 
>   Order B:
>   1. data
>   2. dataarkiv
>   3. Datben, Dr. Keld
>   4. Datz, Mr. Marco
>   5. Datåz, Dr. Asmus
> 
> Asmus was arguing that List B would be the correct one (and this is
> certainly true on, e.g., a dictionary) but, in order to obtain it, the
> source text must be properly encoded with invisible separators inserted
> where needed.

Yes, that is also my advice.

> What I was saying is that the "automatic" Order A is also often used, and I
> brought the example of the Dutch phone directories (where "Beijing" is
> sorted as if it was "Beying"), and of the Italian encyclopedia (where
> "Jefferson" is sorted as if it was "Iefferson").

You have to sort it according to the expectations of the user.
A Dutch book would use Dutch rules, an Italian book would use
the Italian order. You cannot mix orderings, such that some words follow
one set of rules, and other words follow other rules. It all needs
to be comprehended by one human, the reader, and there only one ruleset
applies.

> 
> Michael (michka) Kaplan wrote:
> > And this is the *true* answer to the whole mess of attempting 
> > *multilingual* sorts -- once the user chooses the sort they
> > WANT, the system might handle other language strings in a
> > way that might be obscure to those who know the other
> > language but the person who expected Danish or whatever 
> > will see what they want.
> 
> And this is precisely what I was trying to say, although I was not
> necessarily talking about multilingual sort ("dataarkiv" seems a purely
> Danish word, although derived from Latin roots).
> 
> For some users and some usages, the "incorrect" Order A may be much more
> useful than the "correct" Order B.  If the rule says that "ij" goes between
> "x" and "z", a Dutchman should find the "Beijing Restaurant" between "bex-"
> and "bez-".
> 
> If someone wants Order B (as may be the case for the author of a
> dictionary), then they should apply Asmus' suggestion in order to drive the
> collation algorithm.

I think we agree, but what you call "simple set of rules" I call "quite complex".
I also think that the Danish rules are quite simple, as they can be formulated
in, say, 4 lines of Danish prose. But compared to ASCII sorting they are to some
people unbelievably complex, and I think many Danes believe that you cannot get
programs that adhere to them, although the major systems do that out of the box.

Your incorrect and correct examples use the very same sorting algorithm; the only
difference is that the data is coded differently.

But maybe you are driving for a yet more complex sorting, one that can sort
according to multiple rules? Beijing should then not be sorted as Beÿing?
As stated above I think - and other sorting experts too - that sorting
with multiple rules is a conceptual misunderstanding.

Kind regards
Keld




Re: [OT] o-circumflex

2001-09-10 Thread Michael (michka) Kaplan

From: "Mark Davis" <[EMAIL PROTECTED]>

> Michael, that isn't the point. There is a problem even when you stick to one
> language.
>
> That is, there are situations where two letters in a language, e.g. "ch" in
> Slovak, are normally sorted as one. However, in some exceptional
> circumstances those letters should be sorted separately. It could be because
> they come originally from another language, or it could be because they
> happen to arise when two other words are conjoined. There is no algorithmic
> distinction. So without some special character, it would require a
> dictionary look-up to produce the right sort.

I would argue that most users of the language are not expecting this type of
thing, and that when they are looking for a word that this might be the
SECOND place they look, not the first.

There are exceptions, but they do not outnumber the general case, by
any means.

> For example, suppose that "th" were sorted separately in English, after Z.
> Yet people would expect the following order:
>
> cast
> cathouse
> caul
> cathode
>
> because the "t" and "h" are logically separate in "cathouse".

Again, I think most people would look first in the place that does not
assume the exception -- the computer's original limitations have trained
them. The notion of a natural language processing engine that would have all
of the specific differences (with appropriate dictionaries for exceptions to
even the NLP results) is a fascinating notion, but one that no one is even
close to, yet.

We do not even have available UCA tailorings for most of the world's
languages. Though I have high hopes for the future (if not in the UCA then
in other mechanisms).

By that time, many languages may have TWO collations, since users have been
expecting something else for the last few decades?

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






Re: [OT] o-circumflex

2001-09-10 Thread Mark Davis

Michael, that isn't the point. There is a problem even when you stick to one
language.

That is, there are situations where two letters in a language, e.g. "ch" in
Slovak, are normally sorted as one. However, in some exceptional
circumstances those letters should be sorted separately. It could be because
they come originally from another language, or it could be because they
happen to arise when two other words are conjoined. There is no algorithmic
distinction. So without some special character, it would require a
dictionary look-up to produce the right sort.

For example, suppose that "th" were sorted separately in English, after Z.
Yet people would expect the following order:

cast
cathouse
caul
cathode

because the "t" and "h" are logically separate in "cathouse".
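
A sketch of the dictionary look-up alternative for this made-up English rule: "th" is weighted as one unit after "z", unless the word is on an exception list. The alphabet, the NOT_A_DIGRAPH set, and the name th_key are all hypothetical, chosen only to reproduce the four-word example above.

```python
import string

# Hypothetical alphabet: the digraph "th" files as one unit after "z".
UNITS = list(string.ascii_lowercase) + ["th"]
RANK = {unit: i for i, unit in enumerate(UNITS)}

# Hypothetical exception list: words where "t" and "h" merely happen to meet.
NOT_A_DIGRAPH = {"cathouse"}

def th_key(word):
    text = word.lower()
    use_digraph = text not in NOT_A_DIGRAPH   # the dictionary look-up
    key = []
    i = 0
    while i < len(text):
        if use_digraph and text[i:i + 2] == "th":
            key.append(RANK["th"])
            i += 2
        else:
            key.append(RANK.get(text[i], -1))
            i += 1
    return key

expected_order = sorted(["cathode", "caul", "cathouse", "cast"], key=th_key)
print(expected_order)
```

The exception list has to be maintained by hand for every such word, which is the cost that a special character in the text avoids.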

Mark
—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]





Re: [OT] o-circumflex

2001-09-10 Thread John Wilcock

On Mon, 10 Sep 2001 16:42:45 +0200, Keld Jørn Simonsen wrote:
> But maybe you are driving for a yet more complex sorting, one that can sort
> according to multiple rules? Beijing should then not be sorted as Beÿing?

I haven't followed this discussion from the beginning, so apologies if
I'm missing the point, but it seems to me that the Beijing case in
Dutch is no different from the ekstraarbejde case in Danish - a SHY or
ZWNJ is all that is needed to stop Beijing sorting with Bey. 


John.

-- 
-- Over 1500 webcams from ski resorts around the world - http://www.snoweye.com/
-- Translate your technical documents and web pages- http://www.tradoc.fr/




Unicode Conference Opportunity

2001-09-10 Thread Suzanne M. Topping

Unicode conference attendees are invited to join the newly formed
Professional Association for Localization, a non-profit organization for
people in globalization related industries, and localization in
particular.

Those who sign up at the conference will receive 10% off the low annual
membership fee. Keep a look out for the PAL table at the conference, or
ask at the Localization Institute table for more information.

See you there!

Suzanne Topping
Communication Officer
Professional Association for Localization (PAL)

[EMAIL PROTECTED]

BizWonk Inc.
25 N. Washington St.
Rochester, NY 14614-1110
USA

Phone: +1 716.454.4210
Fax: +1 716.454.4213 




CORRECTION RE: Unicode Conference Opportunity

2001-09-10 Thread Suzanne M. Topping


Apologies, but the note below should have read $10 off rather
than 10% off. (Since membership is only $75, this is an even
better deal.)
 
> Unicode conference attendees are invited to join the newly 
> formed Professional Association for Localization, a 
> non-profit organization for people in globalization related 
> industries, and localization in particular.
> 
> Those who sign up at the conference will receive 10% off the 
> low annual membership fee. Keep a look out for the PAL table 
> at the conference, or ask at the Localization Institute table 
> for more information.
> 
> See you there!
> 
> Suzanne Topping
> Communication Officer
> Professional Association for Localization (PAL)
> 
> [EMAIL PROTECTED]
> 
> BizWonk Inc.
> 25 N. Washington St.
> Rochester, NY 14614-1110
> USA
> 
> Phone: +1 716.454.4210
> Fax: +1 716.454.4213 
> 




Alternative sorting for digraphs (Was Re: [OT] o-circumflex)

2001-09-10 Thread Mark Davis

A SHY will mean that the word can break at "Bei-jing". It is not clear to me
at least that that is safe in all cases for all languages with digraphs that
sort separately, although it may be a solution for some.

A ZWNJ will break ligatures and cursive connections. While probably safe in
Danish or Dutch, it is unclear to me that that is safe in all languages
where this situation occurs. There are digraphs in Urdu, for example. While
I don't know their sorting order, if they do sort separately then ZWNJ can't
be used to express the alternative sorting, since it would give the wrong
rendering.

Mark
—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]





RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

John Wilcock wrote:
> I haven't followed this discussion from the beginning, so apologies if
> I'm missing the point, but it seems to me that the Beijing case in
> Dutch is no different from the ekstraarbejde case in Danish - a SHY or
> ZWNJ is all that is needed to stop Beijing sorting with Bey. 

Yes, it is exactly the same thing.

But my point is that a Dutch reader probably *does* expect Beijing to sort
like Bey, not like Bei.  So, in some cases, a "correct" (i.e., expected)
behavior could rather be to *remove* all SHY/ZWNJ's before sorting.
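
As a sketch, such a "directory" key would first strip the invisible separators and only then apply the blind "ij" = "y" rule. The function name and the lowercase comparison are assumptions made for illustration:

```python
SHY = "\u00ad"    # SOFT HYPHEN
ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER

def directory_key(name):
    # Drop the invisible separators first, so that a marked "i<ZWNJ>j"
    # files exactly like an unmarked "ij".
    bare = name.replace(SHY, "").replace(ZWNJ, "")
    return bare.lower().replace("ij", "y")

marked = "Bei" + ZWNJ + "jing"
print(directory_key(marked) == directory_key("Beijing"))
```

A dictionary-style key would do the opposite and let the separator block the rule.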

_ Marco




UTF-8 validation rules

2001-09-10 Thread Carl W. Brown

I am checking out my UTF-8 validation rules to see if they are correct.

Check each character to be a valid UTF-8 initial character.

\x00 to \x7f or \xC2 to \xF4

Allow invalid forms such as \xC0 & \xC1 to decode but consider them invalid.

A first byte of \xE0 with a second byte less than \xA0, or a first byte of
\xF0 with a second byte less than \x90, is also an invalid form.

\xED followed by anything >= \xA0 is an encoded surrogate and not a valid
character.

\xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters.

Anything greater than \xF4\x8F\xBF\xBF is beyond the Unicode range.

All UTF-8 characters must be followed by the proper number of valid
continuation characters, if any.
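
A rough byte-level sketch of such a check, assuming the second-byte ranges from the Unicode Standard's table of well-formed UTF-8 byte sequences (E0 needs A0..BF, ED needs 80..9F, F0 needs 90..BF, F4 needs 80..8F). The function name is made up. It simply rejects C0 and C1 instead of decoding them first, and it does not treat U+FFFE or U+FFFF specially.

```python
def is_valid_utf8(data: bytes) -> bool:
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:                    # single-byte (ASCII)
            i += 1
            continue
        # need = number of continuation bytes; lo..hi = allowed second byte
        if 0xC2 <= lead <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif lead == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF    # below A0 would be overlong
        elif lead == 0xED:
            need, lo, hi = 2, 0x80, 0x9F    # A0 and up would be a surrogate
        elif 0xE1 <= lead <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif lead == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF    # below 90 would be overlong
        elif 0xF1 <= lead <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif lead == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F    # above 8F would exceed U+10FFFF
        else:
            return False                    # C0, C1, F5..FF, or stray 80..BF
        if i + need >= n:
            return False                    # sequence is cut short
        if not lo <= data[i + 1] <= hi:
            return False
        for j in range(i + 2, i + need + 1):
            if not 0x80 <= data[j] <= 0xBF:
                return False
        i += need + 1
    return True

print(is_valid_utf8("København".encode("utf-8")))
```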

Carl






Re: [OT] o-circumflex

2001-09-10 Thread

If they can't agree on the pronunciation for these cities, can they agree on the Hanzi 
for them?
What ARE the Hanzi for these cities, anyway??

じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town




The trouble with text-sorting algorithms

2001-09-10 Thread

The trouble with algorithms for sorting *text* is that often an algorithm that 
purportedly sorts TEXT will really be sorting at least partly by PRONUNCIATION. So is 
it really sorting text?

I bet you could disturb the peace by wanting to know how to sort the Japanese word for 
"Japan" in Japanese. Does it sort before or after the pok


Re: [OT] o-circumflex

2001-09-10 Thread Stefan Persson

- Original Message -
From: "Marco Cimarosti" <[EMAIL PROTECTED]>
To: "'John Wilcock'" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: den 10 september 2001 18:35
Subject: RE: [OT] o-circumflex


> John Wilcock wrote:
> > I haven't followed this discussion from the beginning, so apologies if
> > I'm missing the point, but it seems to me that the Beijing case in
> > Dutch is no different from the ekstraarbejde case in Danish - a SHY or
> > ZWNJ is all that is needed to stop Beijing sorting with Bey.
>
> Yes, it is exactly the same thing.
>
> But my point is that a Dutch reader probably *does* expect Beijing to sort
> like Bey, not like Bei.  So, in some cases, a "correct" (i.e., expected)
> behavior could rather be to *remove* all SHY/ZWNJ's before sorting.

I thought "ij" sorted after "z"?







RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

Stefan Persson wrote:
> I thought "ij" sorted after "z?"

Not in Dutch: as far as I have seen it sorts the same as "y".  In fact, in
the telephone directory many people who had a "y" in their surname were listed
near people who had the same surname spelled with "ij" (e.g. "Meyer" and
"Meijer").

(Anyway, next time they send me to Holland, I'll ask for a downtown hotel.
So, after dinner, I'll go sightseeing rather than spending the whole evening
looking at the collation of the phone directory:-)

_ Marco




Re: [OT] o-circumflex

2001-09-10 Thread Stefan Persson

There is a similar problem with Swedish:

Our alphabet goes:

a
...
u
v & w (no difference made)
x
y
z
å
ä (the Danish/Norwegian "æ" is also sorted as "ä")
ö (the Danish/Norwegian "ø" is also sorted as "ö")

The German character "ü" is pronounced as a Swedish "y," so when any
German name or loan word containing that character occurs in Swedish it
should be sorted as "y." However, if any "ü" occurs in a Dutch loan word it
is considered as a "u" with umlaut and is sorted as "u."

The same goes for "ä" and "ö": If they are the Swedish/Finnish/German
letters "ä" and "ö" they are sorted after "å," if they are the Dutch letters
"a" with umlaut and "o" with umlaut, they're sorted as "a" and "o" in a
Swedish encyclopædia.

In Swedish the Danish/Norwegian letter "æ" is sorted as "ä," while the
Latin/Icelandic letter "æ" is sorted as "ae."
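
A sketch of how these rules might be written down, assuming the source language of each word is already known. The origin tags ("de", "nl", "is"), the table names, and the sample names are hypothetical; nothing in the text itself can supply the origin, which is the real difficulty.

```python
# Swedish order as listed above; "w" gets no slot of its own.
ALPHABET = "abcdefghijklmnopqrstuvxyzåäö"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

# Folding by letter alone: "w" files with "v", Danish/Norwegian "æ" and "ø"
# file with "ä" and "ö".
FOLD = {"w": "v", "æ": "ä", "ø": "ö"}

# Folding that depends on the word's origin (tags are made up here).
FOLD_BY_ORIGIN = {
    "de": {"ü": "y"},
    "nl": {"ü": "u", "ä": "a", "ö": "o"},
    "is": {"æ": "ae"},
}

def swedish_key(word, origin=None):
    table = {**FOLD, **FOLD_BY_ORIGIN.get(origin, {})}
    folded = "".join(table.get(ch, ch) for ch in word.lower())
    return [RANK.get(ch, -1) for ch in folded]

tail = sorted(["ö", "z", "å", "a", "ä"], key=swedish_key)
print(tail)
```

Without the origin argument the same "ü" or "æ" cannot be placed correctly, so the data would have to carry that information somehow.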

Stefan








RE: The trouble with text-sorting algorithms

2001-09-10 Thread Ayers, Mike


> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
> Sent: Monday, September 10, 2001 10:40 AM


> The trouble with algorithms for sorting *text* is that often 
> an algorithm that purportedly sorts TEXT will really be 
> sorting at least partly by PRONUNCIATION. So is it really 
> sorting text?

I do not believe we've seen any examples of words being sorted by
pronunciation here.  However, higher level structures do participate in
sorting, and we have seen examples of this.

> I bet you could disturb the peace by wanting to know how to 
> sort the Japanese word for "Japan" in Japanese. Does it sort 
> before or after the pok

It sorts on "日", of course!

:-p

There are several sort orders of Kanji within dictionaries.  In a
phrase dictionary, the sorting is traditional kana sorting - IIRC,
duplicated consonants sort last.  You are much overdue to buy a learner's
dictionary.  I recommend the Nelson.  Also, you should get a phrase
dictionary - one of the better kept secrets of Japanese study.


/|/|ike




Re: UTF-8 validation rules

2001-09-10 Thread Misha . Wolf


Carl,

You seem to be using the word "character" in some places where
you (probably) mean "byte", eg:

> All UTF-8 characters must be followed by the proper number of valid
> continuation characters, if any.

Misha









RE: UTF-8 validation rules

2001-09-10 Thread Carl W. Brown

Misha,

> You seem to be using the word "character" in some places where
> you (probably) mean "byte", eg:
>

I am getting fuzzy headed these days.  Thanks for pointing it out.  It
should read:

> > I am checking out my UTF-8 validation rules to see if they are correct.
> >
> > Check each character to be a valid UTF-8 initial character.
Check each initial character byte to be a valid UTF-8 initial byte.
> >
> > \x00 to \x7f or \xC2 to \xF4
> >
> > Allow invalid forms such as \xC0 & \xC1 to decode but consider
> them invalid.
> >
> > A first byte of \xE0 or \xF0 with a second byte less than \xA0
> is also an
> > invalid form.
> >
> > \xED followed by anything >= \xA0 is an encoded surrogate and
> not a valid
> > character.
> >
> > \xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters.
> >
> > Anything greater than \xF4\x80\xBF\xBF is beyond the Unicode range.
> >
> > All UTF-8 characters must be followed by the proper number of valid
> > continuation characters, if any.
All UTF-8 initial character bytes must be followed by the proper number of
valid continuation bytes, if any.
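
For illustration, rules of this kind can be written as a per-sequence
byte-range check. What follows is only a minimal sketch in C, not the code
under discussion: it assumes the well-formed ranges from the Unicode
Standard's UTF-8 table, which differ from the limits quoted above in two
places (the second byte after \xF0 must be at least \x90, and the last valid
sequence is \xF4\x8F\xBF\xBF). The name utf8_valid_len is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the length (1-4) of the well-formed UTF-8 sequence starting at s,
 * or 0 if the bytes are ill-formed or truncated. n is the number of bytes
 * available. Hypothetical helper, not taken from the thread. */
size_t utf8_valid_len(const uint8_t *s, size_t n)
{
    uint8_t lo = 0x80, hi = 0xBF;   /* allowed range for the second byte */
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] <= 0x7F) return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;   /* excludes non-shortest forms */
        if (s[0] == 0xED) hi = 0x9F;   /* excludes encoded surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;   /* excludes non-shortest forms */
        if (s[0] == 0xF4) hi = 0x8F;   /* excludes values above U+10FFFF */
    } else {
        return 0;                      /* 0x80-0xC1 and 0xF5-0xFF */
    }

    if (n < len) return 0;             /* truncated sequence */
    if (s[1] < lo || s[1] > hi) return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return len;
}
```

Noncharacters such as \xEF\xBF\xBF pass this check, since it only tests
well-formedness; whether to reject them is a separate decision.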

Carl





Re: UTF-8 validation rules

2001-09-10 Thread Kenneth Whistler

Carl,

> 
> \xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters.

In current parlance (see Unicode 3.1, UAX #27), these are
"noncharacters", and you must account for the fact that
U+1FFFE..U+1FFFF
U+2FFFE..U+2FFFF
...
U+10FFFE..U+10FFFF

all have the same status as noncharacters.

With Unicode 3.2 (in the works), the 32 additional code points
at U+FDD0..U+FDEF go from unallocated status to noncharacters
as well.

UTF-8 (and UTF-16 and UTF-32) convertors must allow the conversion
of noncharacter code points, but may then allow the detection of
their noncharacter status. Noncharacters should not appear in
open interchange of Unicode textual data, but can have internal
usage unspecified by the standard.

Detection of the status of a code point as a noncharacter
(allocated, but unassigned to a character) or as a regular unassigned code
point (not allocated) is conceptually distinct from the
validation of the UTF-8 conversion per se.
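
As a sketch of that separation, the noncharacter test can live in its own
function that runs on code points after conversion. This is an illustrative
example in C with a hypothetical function name; it assumes the full set of
66 noncharacters (U+FDD0..U+FDEF plus the last two code points of each of
the 17 planes).

```c
#include <stdbool.h>
#include <stdint.h>

/* True if cp is a noncharacter: U+FDD0..U+FDEF, or the last two code
 * points of any plane (U+nFFFE and U+nFFFF for n = 0..0x10).
 * Hypothetical helper, independent of any UTF-8 decoding step. */
bool is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF) return false;          /* outside the code space */
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return true;
    return (cp & 0xFFFE) == 0xFFFE;           /* low 16 bits are FFFE or FFFF */
}
```

A converter can then decode every well-formed sequence and leave it to the
caller to apply is_noncharacter where the context requires it.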

--Ken






Re: [OT] o-circumflex

2001-09-10 Thread Thomas Chan

On Mon, 10 Sep 2001, てんどうりゅうじ wrote:

> If they can't agree on the pronunciation for these cities, can they
> agree on the Hanzi for them? What ARE the Hanzi for these cities,
> anyway??

Are you asking for the names of cities in Chinese?  Copenhagen is
ge1ben3ha1gen1 \u54e5\u672c\u54c8\u6839.  The Han characters used to write
the names of cities depends on many factors, including but not
limited to source spelling/pronunciation, language/dialect of the
rendering party, mapping rules used by the renderer, time period, etc.
For example, New York is rendered in Chinese as Mandarin niu3yue4
\u7d10\u7d04, lit. 'button-appointment' (nauyeuk in Cantonese), while in
Japanese it was at one time rendered as \u7d10\u80b2, lit.
'button-rearing'.  Asking for the "hanzi" (from your wording, I don't
think you are just talking about Chinese usage of Han characters) is like
asking for a single Latin script rendering.

(I think you need to get yourself an English<->Chinese dictionary or
something, btw...)


Thomas Chan
[EMAIL PROTECTED]






Re: [OT] o-circumflex

2001-09-10 Thread Keld Jørn Simonsen

Where is this done for Swedish? I have read both the TN and the SIS
standard, and I don't believe these say anything about sorting
ü according to either German or Dutch sounds. Rolf Gavare does not
say anything along these lines either, as far as I can remember.

Kind regards
keld

On Mon, Sep 10, 2001 at 07:09:34PM +0200, Stefan Persson wrote:
> There is a similar problem with Swedish:
> 
> Our alphabet goes:
> 
> a
> ...
> u
> v & w (no difference made)
> x
> y
> z
> å
> ä (the Danish/Norwegian "æ" is also sorted as "ä")
> ö (the Danish/Norwegian "ø" is also sorted as "ö")
> 
> The German character "ü" is pronounced like a Swedish "y," so when any
> German name or loan word containing that character occurs in Swedish it
> should be sorted as "y." However, if any "ü" occurs in a Dutch loan word it
> is considered as an "u" with umlaut and is sorted as "u."
> 
> The same goes for "ä" and "ö": If they are the Swedish/Finnish/German
> letters "ä" and "ö" they are sorted after "å," if they are the Dutch letters
> "a" with umlaut and "o" with umlaut, they're sorted as "a" and "o" in a
> Swedish encyclopædia.
> 
> In Swedish the Danish/Norwegian letter "æ" is sorted as "ä," while the
> Latin/Icelandic letter "æ" is sorted as "ae."
> 
> Stefan




Re: [OT] o-circumflex

2001-09-10 Thread
I hate this sort:
Club Mix 2000
Club Mix 98
Club Mix 99

Those non Y2K compliant fools!
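
One common way around this is a comparison that treats runs of digits as
numbers, so that 98 and 99 sort before 2000. A minimal sketch in C follows;
natural_cmp is a hypothetical name, and the function only handles ASCII
digits and ignores leading zeros.

```c
#include <ctype.h>

/* Compares two strings like strcmp, except that runs of ASCII digits are
 * compared by numeric value. Sketch only: digit runs that differ only in
 * leading zeros compare equal. */
int natural_cmp(const char *a, const char *b)
{
    while (*a && *b) {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            const char *ea, *eb;
            while (*a == '0') a++;          /* skip leading zeros */
            while (*b == '0') b++;
            ea = a;
            eb = b;
            while (isdigit((unsigned char)*ea)) ea++;
            while (isdigit((unsigned char)*eb)) eb++;
            if (ea - a != eb - b)           /* more digits means larger */
                return (ea - a < eb - b) ? -1 : 1;
            for (; a < ea; a++, b++)        /* same length: digit by digit */
                if (*a != *b) return (*a < *b) ? -1 : 1;
        } else {
            if (*a != *b)
                return ((unsigned char)*a < (unsigned char)*b) ? -1 : 1;
            a++;
            b++;
        }
    }
    return (*a != '\0') - (*b != '\0');     /* shorter string sorts first */
}
```

With this comparison, "Club Mix 98" and "Club Mix 99" both sort before
"Club Mix 2000".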


じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


--- Original Message ---
From: Stefan Persson <[EMAIL PROTECTED]>
To: Mark Davis <[EMAIL PROTECTED]>; "Michael (michka) Kaplan" <[EMAIL PROTECTED]>; Keld Jørn Simonsen <[EMAIL PROTECTED]>; [EMAIL PROTECTED]
Cc:
Date: 01/09/10 17:09
Subject: Re: [OT] o-circumflex

>There is a similar problem with Swedish:
>
>Our alphabet goes:
>
>a
>...
>u
>v & w (no difference made)
>x
>y
>z
>å
>ä (the Danish/Norwegian "æ" is also sorted as "ä")
>ö (the Danish/Norwegian "ø" is also sorted as "ö")
>
>The German character "ü" is pronounced like a Swedish "y," so when any
>German name or loan word containing that character occurs in Swedish it
>should be sorted as "y." However, if any "ü" occurs in a Dutch loan word it
>is considered as an "u" with umlaut and is sorted as "u."
>
>The same goes for "ä" and "ö": If they are the Swedish/Finnish/German
>letters "ä" and "ö" they are sorted after "å," if they are the Dutch letters
>"a" with umlaut and "o" with umlaut, they're sorted as "a" and "o" in a
>Swedish encyclopædia.
>
>In Swedish the Danish/Norwegian letter "æ" is sorted as "ä," while the
>Latin/Icelandic letter "æ" is sorted as "ae."
>
>Stefan
>
>- Original Message -
>From: "Mark Davis" <[EMAIL PROTECTED]>
>To: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>; "Keld Jørn Simonsen"
><[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
>Sent: den 10 september 2001 17:27
>Subject: Re: [OT] o-circumflex
>
>
>> Michael, that isn't the point. There is a problem even when you stick to one
>> language.
>>
>> That is, there are situations where two letters in a language, e.g. "ch" in
>> Slovak, are normally sorted as one. However, in some exceptional
>> circumstances those letters should be sorted separately. It could be because
>> they come originally from another language, or it could be because they
>> happen to arise when two other words are conjoined. There is no algorithmic
>> distinction. So without some special character, it would require a
>> dictionary look-up to produce the right sort.
>>
>> For example, suppose that "th" were sorted separately in English, after Z.
>> Yet people would expect the following order:
>>
>> cast
>> cathouse
>> caul
>> cathode
>>
>> because the "t" and "h" are logically separate in "cathouse".
>>
>> Mark
>> [http://www.macchiato.com]
>> - Original Message -
>> From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
>> To: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
>> Sent: Monday, September 10, 2001 5:48 AM
>> Subject: Re: [OT] o-circumflex
>>
>>
>> > From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>
>> >
>> > > Real-life sorts, like MS Windows sorting or Linux sorting, actually adhere
>> > > to these Danish rules, once you have set up your machine for Danish.
>> >
>> > And this is the *true* answer to the whole mess of attempting *multilingual*
>> > sorts -- once the user chooses the sort they WANT, the system might handle
>> > other language strings in a way that might be obscure to those who know the
>> > other language but the person who expected Danish or whatever will see what
>> > they want.
>> >
>> > Since various sorts openly conflict with each other there is no other
>> > general case solution which would be appropriate, anyway?
>> >
>> > (can't believe this thread is still going on!)
>> >
>> >
>> > MichKa
>> >
>> > Michael Kaplan
>> > Trigeminal Software, Inc.
>> > http://www.trigeminal.com/
>> >
>> >
>> >
>> >
>>
>
>
>
>
>


Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Carl W. Brown
| 
| You are quite correct; that is why Unicode supports differing
| collation strengths.  Sometimes you only care about the actual
| letters without diacritics.  But even then letters are locale
| sensitive.  For example the Danish alphabet starts with an A and
| ends with A ring above.  A Dane would look for Alborg near the
| end of a list of towns.

This example doesn't apply to this discussion, since Danes and
Norwegians consider Å to be a separate letter. That is, it is not A
with ring above, but Å, which is not related to A any more than E is
related to F.

What J. M. Sykes writes about the lack of established sort orders
seems right to me. I've done consulting work for Norwegian
encyclopedia publishers, which involved developing their sorting
routines. The orders for the different publishers did differ, and it
is not so surprising given that there are a number of cases to
consider, such as how to sort diacritics, what to consider as
diacritics, how to sort numbers, Roman numerals, ordinals, and
whatnot.

--Lars M.





Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Francesco Zappa Nardelli
| 
| I was in Aalborg fifteen days ago, and I have seen its name written
| both as Ålborg and as Aalborg.  Where does Aalborg appear in a list
| of towns?

At the end.

In both Danish and Norwegian 'aa' and 'å' are considered equivalent.
I am not sure of this, but I think 'å' is a relatively modern
invention, and that it was originally written only as 'aa'. 

--Lars M.





Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Jonathan Rosenne
|
| This is not always the right thing to do. For example, with personal
| names the person involved may decide whether he prefers the old (AA)
| spelling or the new Å. In any case they are equivalent.

This is true, but this is nothing particular to the aa/å distinction.
Many given names have a number of possible spellings, such as Astri /
Astrid, Cathrine / Katrine / Kathrine, Wenche / Venke / Venche, Espen
/ Esben, ...   In fact, given names which can be written both aa and å
are rare. I can only think of Åge offhand, and that is only rarely
written Aage in Norway (and the other way round in Denmark).

AA/Å confusion is much more common in surnames, but there there is no
choice involved.

--Lars M.





RE: UTF-8 validation rules

2001-09-10 Thread Carl W. Brown

Ken,

> -Original Message-
> From: Kenneth Whistler [mailto:[EMAIL PROTECTED]]
> Sent: Monday, September 10, 2001 12:48 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: UTF-8 validation rules
>
>
> Carl,
>
> >
> > \xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters.
>
> In current parlance (see Unicode 3.1, UAX #27), these are
> "noncharacters", and you must account for the fact that
> U+1FFFE..U+1FFFF
> U+2FFFE..U+2FFFF
> U+10FFFE..U+10FFFF
>

Based on http://www.unicode.org/unicode/reports/tr27/ I added the check for
4-byte codes:

/* Parentheses are needed here: == binds more tightly than &. */
if ((ch[1] & 0x0F) == 0x0F) /* U+nFFFE & U+nFFFF are invalid */
{
    if (ch[2] == 0xBF && ch[3] >= 0xBE)
    {
        curr_thread->status = U_ILLEGAL_CHAR_FOUND;
        return ch - source;
    }
}

I also used the handy charts to see that I had made a calculation error.  I
found that the shortest form for 4-byte codes starts at \xF0\x90\x80\x80
instead of \xF0\xA0\x80\x80.
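
Boundaries like this can be cross-checked by encoding the code point in
question and looking at the bytes. The following is a small sketch in C of a
shortest-form encoder; utf8_encode is a hypothetical name and is not part of
the validation code above.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the shortest UTF-8 form of cp into out and returns the number of
 * bytes written, or 0 for surrogates and values above U+10FFFF.
 * Hypothetical helper for checking byte boundaries by hand. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp <= 0x7F) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                         /* beyond U+10FFFF */
}
```

Encoding U+10000 gives F0 90 80 80, and encoding U+1FFFE gives F0 9F BF BE,
which is the byte pattern the noncharacter check above looks for.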

Thanks,

Carl






Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Keld Jørn Simonsen
|
| Yes, foreigners call our cities many strange things:-) København is
| called Köpenhamn, Copenhagen, Kobenhagen, Copenhague, and many more.

* Michael Everson
| 
| In Iceland it is Kaupmannahöfn, I believe. In unadorned English that
| would be something like Cheapmenshaven, maybe to weaken as
| Cheapenhaven, in German Kaufenhagen

Which makes eminent sense, given that København by this logic would
translate as Cheapenhaven. (Your German translation should be
Kaufmannshagen, I guess, to become Kaufenhagen when translated from
København.)

--Lars M.





Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Marco Cimarosti
| 
| One of these cases could be the word "dataarkiv", which I found in a Danish
| web page
| (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Uh, no, you found it in a Norwegian web page. The word is the same in
Danish, though.
 
|   Order B:
|   1. data
|   2. dataarkiv
|   3. Datben, Dr. Keld
|   4. Datz, Mr. Marco
|   5. Datåz, Dr. Asmus
| 
| Asmus was arguing that List B would be the correct one (and this is
| certainly true on, e.g., a dictionary) but, in order to obtain it, the
| source text must be properly encoded with invisible separators inserted
| where needed.

Not necessarily. One solution I've seen automatically generated sort
keys from the headwords, but allowed users to adjust them where
necessary. I think users are likely to favour this solution if given a
choice. 

Of course, it depends on how important it is to get the sorting
right, and what importance the headwords have within the system
whether this solution is feasible or not. In a phone directory I guess
nobody would use it.
 
| And this is precisely what I was trying to say, although I was not
| necessarily talking about multilingual sort ("dataarkiv" seems a purely
| Danish word, although derived from Latin roots).

It's a simple concatenation of the words for 'computing' (data) and
'archive' (arkiv), meaning any electronic archive. 

This kind of construction is very common in Norwegian and Danish,
leading speakers to invent all kinds of strange new words when writing
English[1], and the Swedes to joke that we call bananas 'yellowbends'.
 
--Lars M.

[1] And, conversely, after learning English, to split apart words that
God meant us to write without spaces in them. It really ann oys to
see people write in that incon venient way.





Re: [OT] o-circumflex

2001-09-10 Thread Peter_Constable


On 09/10/2001 07:48:05 AM Michael \(michka\) Kaplan wrote:

>(can't believe this thread is still going on!)

I just wanted to know about how Francophones perceive certain graphemes,
and I got that answer a long time ago.



Peter





RE: UTF-8 validation rules

2001-09-10 Thread Carl W. Brown

Ken,

>
> With Unicode 3.2 (in the works), the 32 additional code points
> at U+FDD0..U+FDEF go from unallocated status to noncharacters
> as well.
>

Interesting.  I have seen some of the proposed characters but nothing on
non-characters.  It seems like an interesting range for non-characters.

Carl





Re: [OT] o-circumflex

2001-09-10 Thread Juliusz Chroboczek

>> It's as weird as some Italian names for German cities: Aquisgrana
>> for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di
>> Baviera) for München.

MK> Interesting that Polish names of these cities are more like Italian
MK> than German: Akwizgran, Augsburg, Moguncja, Monachium.

Because they're adaptations of the mediaeval Latin names.

The same is true of historically important Polish cities, by the way:
Varsovie, Cracovie in French, Varsavia, Cracovia in Italian.  English
uses the German names instead (Warsaw, Cracow).

Juliusz




RE: [OT] o-circumflex

2001-09-10 Thread Otmar Permentier

Marco,

When you're in Holland you may want to check some dictionaries too. You'll notice in 
dictionaries 'ij' is considered to consist of two letters 'i' and 'j', so the word 
'ijs' sorts between 'iets' and 'ik'.
You're right the PTT doesn't make the distinction between 'ij' and 'y', so in the 
phone book 'Meyer' and 'Meijer' are indeed near each other. I suspected they would at 
least first list all Meijers, then all Meyers, but when I just checked they appeared 
to be intermingled. On closer inspection it turned out the Meijers and Meyers are 
further sorted by street name! 
By the way, in crossword puzzles and the like, 'ij' always occupies one box (but isn't 
considered the same as 'y' I believe)
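
The phone-book behaviour can be modelled as a sort key in which every "ij"
is folded to "y" before comparison. This is a rough sketch in C under that
assumption only; phonebook_key is a hypothetical name, the code handles
ASCII input, and it says nothing about how the PTT actually builds its
directories.

```c
#include <ctype.h>

/* Builds a phone-book style sort key: lowercases ASCII letters and folds
 * every "ij" to "y". key must have room for strlen(name) + 1 bytes.
 * Hypothetical helper for illustration. */
void phonebook_key(const char *name, char *key)
{
    while (*name) {
        char c = (char)tolower((unsigned char)name[0]);
        if (c == 'i' && tolower((unsigned char)name[1]) == 'j') {
            *key++ = 'y';       /* "ij" collapses to a single 'y' */
            name += 2;
        } else {
            *key++ = c;
            name++;
        }
    }
    *key = '\0';
}
```

Under this key "Meijer" and "Meyer" become identical, so a secondary key
(such as the street name) decides their order, while a dictionary that
treats 'ij' as two letters would compare the original strings instead.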

Regards,

Otmar Permentier

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Marco Cimarosti
> Sent: maandag 10 september 2001 19:59
> To: 'Stefan Persson'; 'John Wilcock'; [EMAIL PROTECTED]
> Subject: RE: [OT] o-circumflex
> 
> 
> Stefan Persson wrote:
> > I thought "ij" sorted after "z?"
> 
> Not in Dutch: as far as I have seen it sorts the same as "y".  In fact, in
> the telephone directory many people who had an "y" in their surname listed
> near people who had the same surname spelled with "ij" (e.g. "Meyer" and
> "Meijer").
> 
> (Anyway, next time they send me to Holland, I'll ask for a downtown hotel.
> So, after dinner, I'll go sightseeing rather than spending the 
> whole evening
> looking at the collation of the phone directory:-)
> 
> _ Marco
> 
> 





Re: [OT] o-circumflex

2001-09-10 Thread
AAARRRGGHHH

I give up!

I was hoping that there is SOME system that would give these cities UNIQUE names... 
postal codes???


じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


--- Original Message ---
From: Thomas Chan <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc:
Date: 01/09/10 19:59
Subject: Re: [OT] o-circumflex

>On Mon, 10 Sep 2001, てんどうりゅうじ wrote:
>
>> If they can't agree on the pronunciation for these cities, can they
>> agree on the Hanzi for them? What ARE the Hanzi for these cities,
>> anyway??
>
>Are you asking for the names of cities in Chinese?  Copenhagen is
>ge1ben3ha1gen1 \u54e5\u672c\u54c8\u6839.  The Han characters used to write
>the names of cities depends on many factors, including but not
>limited to source spelling/pronunciation, language/dialect of the
>rendering party, mapping rules used by the renderer, time period, etc.
>For example, New York is rendered in Chinese as Mandarin niu3yue4
>\u7d10\u7d04, lit. 'button-appointment' (nauyeuk in Cantonese), while in
>Japanese it was at one time rendered as \u7d10\u80b2, lit.
>'button-rearing'.  Asking for the "hanzi" (from your wording, I don't
>think you are just talking about Chinese usage of Han characters) is like
>asking for a single Latin script rendering.
>
>(I think you need to get yourself an English<->Chinese dictionary or
>something, btw...)
>
>
>Thomas Chan
>[EMAIL PROTECTED]
>
>
>
>


Re: UTF-8 validation rules

2001-09-10 Thread David Hopwood

-BEGIN PGP SIGNED MESSAGE-

"Carl W. Brown" wrote:
> I am checking out my UTF-8 validation rules to see if they are correct.
> 
> Check each character to be a valid UTF-8 initial character.
> 
> \x00 to \x7f or \xC2 to \xF4
> 
> Allow invalid forms such as \xC0 & \xC1 to decode but consider them invalid.

Unicode 3.1 says that these should not be allowed to decode (see the first
and second notes after C12 added by UAX #27).

> A first byte of \xE0 or \xF0 with a second byte less than \xA0 is also an
> invalid form.
> 
> \xED followed by anything >= \xA0 is an encoded surrogate and not a valid
> character.
> 
> \xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters.
> 
> Anything greater than \xF4\x80\xBF\xBF is beyond the Unicode range.

It's arguably simpler to convert to a code point, and then check whether the
code point is valid, than to directly check that the UTF-8 encoding is valid
(see the pseudocode below for precisely what I mean).

Also, if you're converting to, say, UTF-16, then non-character sequences
like \xEF\xBF\xBE and \xEF\xBF\xBF should probably be converted to the
corresponding UTF-16 non-characters (\uFFFE and \uFFFF), rather than being
rejected. (Note: Unicode 3.1 and ISO/IEC 10646-1:2000 differ on this point;
10646 requires them to be rejected.)

Here is some C-like pseudocode for a validating converter from UTF-8 to
UTF-16. It is suitable for cases where a bijective mapping between valid
sequences is needed, provided the ALLOW_IRREGULAR flag is *not* set.

// Set STRICT_ISO10646 for strict ISO/IEC 10646-1:2000 Annex D compliance
//   (reject U+FFFE and U+FFFF).
// Set ALLOW_IRREGULAR to tolerate irregular UTF-8 sequences (that is,
//   where UTF-16 surrogates have been incorrectly treated as separate
//   characters).

int toUTF16(uint8_t * utf8, int utf8len) { // utf8len type must be signed
    uint8_t b0, b1, b2, b3;
    uint32_t codepoint, temp;
    int i;

    for (i = 0; i < utf8len; ) {
        b0 = utf8[i++];
        if ((b0 & 0x80) == 0) {   // 0xxxxxxx
            output b0;

        } else if ((b0 & 0xE0) == 0xC0) { // 110xxxxx 10xxxxxx
            if (i >= utf8len) {
                return TRUNCATED;
            }
            b1 = utf8[i++];
            if ((b1 & 0xC0) != 0x80) {
                return INVALID;
            }
            codepoint = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
            if (codepoint < 0x80) {
                return INVALID; // non-shortest form
            }
            output codepoint;

        } else if ((b0 & 0xF0) == 0xE0) { // 1110xxxx 10xxxxxx 10xxxxxx
            if (i >= utf8len-1) {
                return TRUNCATED;
            }
            b1 = utf8[i++];
            b2 = utf8[i++];
            if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) {
                return INVALID;
            }
            codepoint = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);

            if (ALLOW_IRREGULAR && codepoint >= 0xD800 && codepoint <= 0xDBFF) {
                if (i >= utf8len-2) {
                    return TRUNCATED;
                }
                b0 = utf8[i++];
                b1 = utf8[i++];
                b2 = utf8[i++];
                if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) {
                    return INVALID;
                }
                temp = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
                if (temp < 0xDC00 || temp > 0xDFFF) {
                    return INVALID;
                }
                output codepoint;
                output temp;
            } else if (codepoint < 0x800 // non-shortest form
                       || (codepoint >= 0xD800 && codepoint <= 0xDFFF)
                       || (STRICT_ISO10646 && codepoint >= 0xFFFE)) {
                return INVALID;
            } else {
                output codepoint;
            }

        } else if ((b0 & 0xF8) == 0xF0) { // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            if (i >= utf8len-2) {
                return TRUNCATED;
            }
            b1 = utf8[i++];
            b2 = utf8[i++];
            b3 = utf8[i++];
            if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80 || (b3 & 0xC0) != 0x80) {
                return INVALID;
            }
            codepoint = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) |
                        ((b2 & 0x3F) << 6) | (b3 & 0x3F);
            if (codepoint < 0x10000 // non-shortest form
                || codepoint > 0x10FFFF) {
                return INVALID;
            }
            temp = codepoint - 0x10000;
            output (temp >> 10  ) + 0xD800;
            output (temp & 0x3FF) + 0xDC00;

        } else {
            return INVALID;
        }
    } /* for i */

    return VALID;
}
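
The last branch of the pseudocode, which turns a supplementary code point
into two UTF-16 code units, can be isolated as a small compilable function.
This is a sketch in C with a hypothetical name (to_surrogates); it only
restates the arithmetic of the pseudocode and is not a replacement for it.

```c
#include <stdint.h>

/* Splits a supplementary code point (U+10000..U+10FFFF) into a UTF-16
 * surrogate pair. Returns 0 on success, -1 if cp is outside that range.
 * Hypothetical helper restating the arithmetic of the pseudocode. */
int to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v;

    if (cp < 0x10000 || cp > 0x10FFFF) return -1;
    v = cp - 0x10000;                           /* 20-bit offset */
    *high = (uint16_t)(0xD800 + (v >> 10));     /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));   /* bottom 10 bits */
    return 0;
}
```

For example, U+10000 maps to D800 DC00 and U+10FFFF maps to DBFF DFFF.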

- -- 
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. 

Re: UTF-8 validation rules

2001-09-10 Thread Kenneth Whistler


> Also, if you're converting to, say, UTF-16, then non-character sequences
> like \xEF\xBF\xBE and \xEF\xBF\xBF should probably be converted to the
> corresponding UTF-16 non-characters (\uFFFE and \uFFFF), rather than being
> rejected. (Note: Unicode 3.1 and ISO/IEC 10646-1:2000 differ on this point;
> 10646 requires them to be rejected.)

This discrepancy has been noted by the relevant committees, and is
the subject of ballot comment in the current amendment of 10646.
It should be fixed soon.

--Ken






Re: [OT] o-circumflex

2001-09-10 Thread Kenneth Whistler

Way OT by now...

> AAARRRGGHHH
> 
> I give up!
> 
> I was hoping that there is SOME system that would give these cities UNIQUE names... 
>postal codes???

Ain't reality a bitch?

What you're looking for doesn't exist in the world of natural language
names -- it can only exist in artificially constructed global
geographic databases, where people may have assigned unique keys
to cities. And even there, the geographic experts are going to
argue over the exact meaning of terms. Is "Los Angeles" the
incorporated city presided over by the mayor or does it include
all the other small cities that Los Angeles surrounds and engulfs,
or does it include unincorporated parts of Los Angeles county,
or does it refer to Greater Los Angeles, the metropolitan area,
or is it related to Los Angeles county?

Not such a simple distinction, sometimes. San Francisco is
a city *and* a county, and the mayor of the city is also mayor
of the county. The mayor of New York is mayor of half a
dozen boroughs, the moral equivalent of counties.

Is Stonyford, California (population 150), a "city"? It isn't
incorporated as a city, or even a town, but it is an independent
geographic location that occurs as a "town" on maps. Where do
you draw the line between named localities and cities? Do you
depend on legally incorporated city status? But what if the
laws don't match up between different countries? How am I going
to know that "cities" in Burkina Faso match the same criteria
I use to designate "cities" in the United States or Japan?

Some cities have multiple postal codes, and some postal codes
cover multiple cities. And while postal codes are subject to
international treaty, how countries divide their territories
up and use the codes is still up to them.

--Ken





Re: UTF-8 validation rules

2001-09-10 Thread David Hopwood

-BEGIN PGP SIGNED MESSAGE-

Kenneth Whistler wrote:
> Carl,
> > \xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters.
> 
> In current parlance (see Unicode 3.1, UAX #27), these are
> "noncharacters", and you must account for the fact that
> U+1FFFE..U+1FFFF
> U+2FFFE..U+2FFFF
> ...
> U+10FFFE..U+10FFFF
> 
> all have the same status as noncharacters.
> 
> With Unicode 3.2 (in the works), the 32 additional code points
> at U+FDD0..U+FDEF go from unallocated status to noncharacters
> as well.

Those are non-characters in Unicode 3.1 (see D7b in UAX #27).

Carl W. Brown wrote:
| ... It seems like an interesting range for non-characters.

It's for Arabic presentation forms internal to a rendering implementation,
I assume (although it's not clear why existing private-use characters
couldn't have been used for that).

Kenneth Whistler wrote:
> UTF-8 (and UTF-16 and UTF-32) convertors must allow the conversion
> of noncharacter code points, but may then allow the detection of
> their noncharacter status.

Where does the standard say that conversion of these code points must
be allowed? That would make it impossible to strictly comply with both
Unicode 3.1 and ISO/IEC 10646-1:2000, since the latter says that U+FFFE
and U+FFFF (but not other non-characters) are illegal in UTF-8 and must
be rejected.

As far as I understand, according to Unicode 3.1, non-characters may be
*either* converted or rejected.

- -- 
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip







Re: UTF-8 validation rules

2001-09-10 Thread David Starner

On Mon, Sep 10, 2001 at 12:22:20AM +0100, David Hopwood wrote:
> It's for Arabic presentation forms internal to a rendering implementation,
> I assume (although it's not clear why existing private-use characters
> couldn't have been used for that).

Because if the implementation uses them, then the end user can't. A large
use of the PUA is scripts and characters that will never be encoded, or
corporate logos for internal use. If the implementation uses codepoints that
others use for Shavian (the PUA implementation actually seeing use in the
wild) or the rest of your organization uses for the private symbols, then
you're just out of luck.

I don't think anyone was specific about Arabic presentation forms. From
what I understand, it's more so that an application has a large area set aside
for internal use that won't get mixed up with the PUA. The Object Replacement
character and the Ruby characters would have been subsumed in this if
this came first. It could be used for markup, making a wordprocessor format
that's plain text once you strip these characters.

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and 
laughs at me. In fact, I'd be rather honored." - Joseph_Greg




Re: UTF-8 validation rules

2001-09-10 Thread Kenneth Whistler

David Hopwood said:

> > 
> > With Unicode 3.2 (in the works), the 32 additional code points
> > at U+FDD0..U+FDEF go from unallocated status to noncharacters
> > as well.
> 
> Those are non-characters in Unicode 3.1 (see D7b in UAX #27).

Yes, I stand corrected. They are *already* approved by the UTC
and have been published in Unicode 3.1.

The issue is one of synchronization with the Amendment 1 to 10646-1:2000,
which is still under ballot and which will designate the same code
points in 10646. Most of the content of Amendment 1 (the additional
characters, anyway) will appear as Unicode 3.2, but the architectural
changes, including these new designations of noncharacter code points,
are considered already in Unicode 3.1.

> 
> Carl W. Brown wrote:
> | ... It seems like an interesting range for non-characters.
> 
> It's for Arabic presentation forms internal to a rendering implementation,
> I assume (although it's not clear why existing private-use characters
> couldn't have been used for that).

This is incorrect. The noncharacters in the range U+FDD0..U+FDEF are
not for Arabic presentation forms at all. They are noncharacters.
Internally, they could be used for anything, but they are not
to be externally interchanged, and have no public interpretation.

The choice of FDD0..FDEF as the code points for these noncharacters
was a reasonably arbitrary one, but was attempting to make use of
a contiguous range of 32 code points that couldn't reasonably be
assigned to anything else. And neither the UTC nor WG2 wants to
assign any more Arabic presentation forms!

> 
> Kenneth Whistler wrote:
> > UTF-8 (and UTF-16 and UTF-32) convertors must allow the conversion
> > of noncharacter code points, but may then allow the detection of
> > their noncharacter status.
> 
> Where does the standard say that conversion of these code points must
> be allowed? That would make it impossible to strictly comply with both
> Unicode 3.1 and ISO/IEC 10646-1:2000, since the latter says that U+FFFE
> and U+FFFF (but not other non-characters) are illegal in UTF-8 and must
> be rejected.

See my subsequent note. The text in 10646-1 is being corrected. It is
inconsistent as it stands, since it treats U+FFFE and U+FFFF one way,
and the other noncharacters (U+1FFFE, etc.) another.

> 
> As far as I understand, according to Unicode 3.1, non-characters may be
> *either* converted or rejected.

O.k. Let me put it this way.

*Definitionally*, the encoding forms define the relationships:

UTF-32          UTF-16      UTF-8

0000FFFF  <==>  FFFF  <==>  EF BF BF

(and so on for each of the noncharacter code points)

Note that Table D.3 "Examples in hexadecimal notation" in Annex D (UTF-8)
in 10646-1 even explicitly lists this example! This despite the contrary
text in Note 3 to clause D.4, which claims that the UTF-8 mapping of
U-0000FFFF is undefined. (Which is what needs to be fixed in 10646.)

A convertor for UTF-8 *should* be able to do this conversion correctly.

It is another thing to decide whether an API for UTF-8 conversion will
report a non-character value as an error. That depends on the context.

In my opinion, the most robust implementation is for the convertor to
convert clean through, only reporting errors on *illegal* values
(unpaired surrogates, code points > 0x10FFFF). It should then be up
to some other piece of code to determine whether a code point is
unassigned, a noncharacter, or something else.
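
A decoder written along these lines might look like the following. It is a
minimal sketch in C, with a hypothetical name (utf8_decode): it reports an
error only for truncated, non-shortest, surrogate, or out-of-range
sequences, and passes noncharacters through untouched for some other layer
to classify.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from s (n bytes available) into *cp. Returns the
 * number of bytes consumed, or 0 if the sequence is truncated or illegal.
 * Noncharacters such as U+FFFF are converted, not rejected.
 * Hypothetical helper for illustration. */
size_t utf8_decode(const uint8_t *s, size_t n, uint32_t *cp)
{
    /* Smallest code point allowed for each sequence length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { len = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;          /* stray continuation or invalid lead byte */

    if (n < len) return 0;                      /* truncated */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;    /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[len]) return 0;              /* non-shortest form */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;   /* surrogate */
    if (c > 0x10FFFF) return 0;                 /* beyond the code space */
    *cp = c;
    return len;
}
```

With this split, EF BF BF decodes to U+FFFF without error, and the caller
decides separately whether a noncharacter is acceptable in its context.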

--Ken

> 
> - -- 
> David Hopwood <[EMAIL PROTECTED]>




RE: UTF-8 validation rules

2001-09-10 Thread Carl W. Brown

David,

> 
> It's for Arabic presentation forms internal to a rendering implementation,
> I assume (although it's not clear why existing private-use characters
> couldn't have been used for that).
> 

Now I remember.

Thanks,

Carl