date:20031216

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Philippe Verdy

Doug Ewell [EMAIL PROTECTED] writes:
  Wrong here: I have found occurrences of dotless lowercase i, used
  instead of soft-dotted lowercase i, as base letters for diacritics
  added above it (it was an acute accent...)
 
 Don't do that.

What? It is VALID UNICODE to have texts coded like this. The proposed
change for soft-dotted/dotless letters used with diacritics is still not in
the standard, and it only gives rendering hints so that both base letters
have the same rendering, requiring the use of an explicit dot when the
soft dot must be kept with the diacritic.

  There were two sequences which looked identical when rendered, and
  which were distinct in a case-folded comparison:
 
  (1) LATIN SMALL LETTER I, COMBINING ACUTE ACCENT
  (2) LATIN SMALL LETTER DOTLESS I, COMBINING ACUTE ACCENT
 
  but were no longer distinct when converted to uppercase in a
  locale-neutral environment not using the Turkic rules:
 
  (1') LATIN CAPITAL LETTER I, COMBINING ACUTE ACCENT
  (2') LATIN CAPITAL LETTER I, COMBINING ACUTE ACCENT
 
 OK, so you want the default, locale-neutral case mapping tables to equate
 U+0069 with U+0131, right?

Yes. And I have good reasons for that, coming from the fact that the default
locale-neutral mapping tables already equate their uppercase versions, U+0049
and U+0130, by returning U+0069 for both of them.
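
To make the two sequences concrete, here is a minimal sketch in Python (the language is an assumption; nobody in the thread names one). It uses only the built-in str.casefold() and str.upper(), which apply default, locale-neutral Unicode mappings, and checks whether sequences (1) and (2) stay distinct under case folding and under uppercasing.

```python
# (1) LATIN SMALL LETTER I + COMBINING ACUTE ACCENT
seq1 = "\u0069\u0301"
# (2) LATIN SMALL LETTER DOTLESS I + COMBINING ACUTE ACCENT
seq2 = "\u0131\u0301"

# Default (locale-neutral) case folding keeps the two sequences apart...
folded_equal = seq1.casefold() == seq2.casefold()
# ...but default uppercasing maps both base letters to U+0049.
upper_equal = seq1.upper() == seq2.upper()

print(folded_equal)  # False
print(upper_equal)   # True
```

So a comparison based on case folding and one based on uppercasing disagree about these two strings, which is the mismatch described above.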

 This is close to being a spoofing problem, though.  See TUS 4.0, page
 141.

If you think this is a spoofing problem, then the existing locale-neutral
full case mapping of U+0130 is bogus and should not be U+0069.

  The string (2) may have been produced to avoid displaying the dot
  with some fonts that don't apply the soft-dotted rule when there's
  an additional diacritic above...
 
 Don't do that.  That's misusing the standard.  The font should be fixed
 instead.

For whatever reason, encoded texts exist before correct fonts are used to
render them. So there do exist texts which use dotless lowercase i before
a diacritic above, simply because the author of the text did not want it to
be rendered with a superposed dot. These texts are clearly not Turkic (in
Turkish or Azeri, the dot of the soft-dotted i should have been displayed
with the diacritic above it, and the dotless i should have been used to
avoid it explicitly).

But this is not the only reason: I can give other examples which also have
security and filesystem impacts.

Suppose you have a database of user names or file names allowing
internationalized names coded according to the recommended Unicode principles,
but these names are used in a way that makes it impossible to track the
language in which they were entered (filenames, user names or address fields
in an entry form are such cases).

Now provide a facility that identifies and avoids duplicate
case-equivalents, using full mappings. Because you can't track the language,
you'll need to use the default locale-neutral full case mappings.

Now a Turkish user enters a name or address in an entry form, or creates
files with dotless lowercase i in the name, and later attempts to re-enter its
case equivalent, (dotless) uppercase I. The system will not identify the two as
case equivalents, so it will accept both as if they were distinct.

The Turkish user or the system then attempts to list files or database table
fields matching some regular expression like i* with the case-insensitive
option, to count the number of occurrences of the names containing a
(soft-)dotted i (or I). He will get all files containing one of three codes,
and not the fourth one.



Re: Stability of WG2 (was: Re: [OT] CJK - CJC)

2003-12-16 Thread Michael Everson

At 19:13 -0800 2003-12-15, Doug Ewell wrote:

The North Korean and Chinese national bodies have already made proposals
that violate both the letter and spirit of stability policies.
Yes. And we have rejected them.

I'm glad the U.S. national body will stay involved, but having to rely
on that does sound a bit like having to rely on enlightened statesmen,
doesn't it?
Better than if the whole thing were just left to the employees of 
large companies, Doug. We have good checks and balances.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Stability of WG2

2003-12-16 Thread Peter Kirk

On 15/12/2003 16:57, Doug Ewell wrote:

...

I'm not saying Peter is right, that this WILL happen, just trying to
articulate his point that the possibility in the future is greater than
nil.
 

I didn't say that it WILL happen either, just that it might happen (and, 
later, that some changes might be desirable).

...

It seems clear that the current enlightened WG2 membership is
committed to both the letter and spirit of the current stability policy
(to the dismay of Peter, who would like to see certain changes in names,
combining classes, etc.).  But there is really no way we can predict
whether the eventual successors to Ken, Michael, Rick, Michel, etc. will
share the same commitment.  Remember that most of us once believed in
the stability of ISO 3166 as well.
 

Good point. Remember that the predicted life of Unicode (recently 
predicted by Michael, anyway) is longer than the lifetime of the current 
WG2 members, longer even than the US Constitution (so far), the figure 
of 1000 years was mentioned. Even if this is a millennial reign of peace 
and prosperity, processes of language change will not stop. A list of 
character names from 1000 years ago, even from 400 years ago, would look 
very strange today. Surely long before then the members of the successor 
body to WG2 will realise that the Unicode 4.0 list of character names, 
and probably also a lot of other things in Unicode which are now 
considered stable, require major updates.



--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Stability of WG2

2003-12-16 Thread Peter Kirk

On 15/12/2003 22:00, Christopher John Fynn wrote:

Doug Ewell [EMAIL PROTECTED]

 

The North Korean and Chinese national bodies have already 
made proposals that violate both the letter and spirit of stability 
policies.
   

Fortunately they each have only one vote in WG2.

- Chris

 

But isn't that enough to outvote the US body?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Michael Everson

At 11:03 +0100 2003-12-16, Philippe Verdy wrote:
Doug Ewell [EMAIL PROTECTED] writes:
  Wrong here: I have found occurrences of dotless lowercase i, used
  instead of soft-dotted lowercase i, as base letters for diacritics
  added above it (it was an acute accent...)
 Don't do that.
What? This is VALID UNICODE to have texts coded like this.
In Irish, it is INCORRECT to spell físeán 
'video' with a DOTLESS I + COMBINING ACUTE. It is 
a spelling error, and will fail in 
spell-checking. The correct spelling is either I 
+ COMBINING ACUTE or precomposed I WITH ACUTE.

It is VALID UNICODE to follow LATIN CAPITAL 
LETTER Q with DEVANAGARI VOWEL SIGN E but that 
doesn't mean it's the right way to write anything.

For whatever reason, encoded texts exist before correct fonts are used to
render them. So there do exist texts which use dotless lowercase i before
a diacritic above, simply because the author of the text did not want it to
be rendered with a superposed dot.
Texts which contain spelling errors. Or old IPA 
texts using any number of ad-hoc IPA font 
solutions. Those texts have to be transcoded to 
proper Unicode at some stage. What you suggest is 
Not Recommended.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Stability of WG2

2003-12-16 Thread Michael Everson

At 03:03 -0800 2003-12-16, Peter Kirk wrote:

The North Korean and Chinese national bodies have already made 
proposals that violate both the letter and spirit of stability 
policies.
Fortunately they each have only one vote in WG2.

But isn't that enough to outvote the US body?
Not with Ireland and Japan standing with the US on such an issue. ;-)

We really must get the UK back into SC2 ;-)
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread Stefan Persson

Michael Everson wrote:
In Irish, it is INCORRECT to spell físeán 'video' with a DOTLESS I + 
COMBINING ACUTE. It is a spelling error, and will fail in 
spell-checking. The correct spelling is either I + COMBINING ACUTE or 
precomposed I WITH ACUTE.
Isn't the sequence dotless i + combining acute canonically equivalent 
to dotted i + combining acute?

Stefan

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread Michael Everson

At 13:00 +0100 2003-12-16, Stefan Persson wrote:
Michael Everson wrote:
In Irish, it is INCORRECT to spell físeán 
'video' with a DOTLESS I + COMBINING ACUTE. It 
is a spelling error, and will fail in 
spell-checking. The correct spelling is either 
I + COMBINING ACUTE or precomposed I WITH ACUTE.
Isn't the sequence dotless i + combining acute 
canonically equivalent to dotted i + combining 
acute?
It is not.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Arcane Jill






This occurred to me even before I read Philippe's email.

Since {U+0069} is not canonically equivalent to
{U+0131}{U+0307}, I don't see anything to stop me from registering the
domain name "un{U+0131}{U+0307}code.org", for example. It is in
NFC, after all.
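
A quick way to check both claims, sketched in Python with the standard unicodedata module (the choice of language and the variable names are assumptions, not anything used on this list): the look-alike label is unchanged by NFC, and normalization does not turn it into the genuine label.

```python
import unicodedata

genuine = "unicode.org"
lookalike = "un\u0131\u0307code.org"  # DOTLESS I + COMBINING DOT ABOVE

# Already in NFC: normalizing leaves the string unchanged.
already_nfc = unicodedata.normalize("NFC", lookalike) == lookalike

# No normalization form maps the look-alike onto the genuine label.
collapses = any(
    unicodedata.normalize(form, lookalike) == genuine
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

print(already_nfc)  # True
print(collapses)    # False
```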

Jill



-Original Message-
From:  Philippe Verdy [mailto:[EMAIL PROTECTED]] 
Sent: Tuesday, December 16, 2003 2:21 AM
To: Doug Ewell
Cc: [EMAIL PROTECTED]
Subject: RE: Case mapping of dotless lowercase letters

Doug Ewell wrote:

I detected it after it produced a security bug (a user record was
unexpectedly updated on my database...)

Re: [OT] CJK - CJC (Re: Corea?)

2003-12-16 Thread Jungshik Shin

On Mon, 15 Dec 2003, Doug Ewell wrote:

 Jungshik Shin jshin at mailaps dot org wrote:

   If those 20 assemblymen have time and energy to deal with this
  foolish name change business, they had better push for a bill to

 If those 20 assemblymen really think a name change will boost national
 identity and pride, shouldn't they be trying to persuade English
 speakers to say Taehan Minguk instead?

 No, that's not only even sillier (as we'd all agree) but also
incorrect, because 'Taehan Minguk' does not mean Korea but specifically
means the 'Republic of Korea' that was founded in 1948.  Moreover, North
Koreans would prefer 'Chosun' to 'Hanguk' (using 'Taehan Minguk' is
obviously out of the question for them). Using 'Korea' (the English name) is
a rather convenient way to work around the difference (between the two Koreas).

  Jungshik

Re: Stability of WG2

2003-12-16 Thread Peter Kirk

On 16/12/2003 03:35, Michael Everson wrote:

At 03:03 -0800 2003-12-16, Peter Kirk wrote:

The North Korean and Chinese national bodies have already made 
proposals that violate both the letter and spirit of stability 
policies.


Fortunately they each have only one vote in WG2.

But isn't that enough to outvote the US body?


Not with Ireland and Japan standing with the US on such an issue. ;-)

We really must get the UK back into SC2 ;-)


Even at the risk of finding me evening up the vote? ;-)

Seriously, can you remind us briefly what the situation is, why there is 
no current UK representation?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: [OT] Euro-English (was: Corea? (Re: Swastika to be banned by Microsoft?)

2003-12-16 Thread Jungshik Shin

On Mon, 15 Dec 2003, Philippe Verdy wrote:

 But you may see one day their national airways renamed
 Corean Airlines, or its main standard body renamed CSC...

  There's no national airline in South Korea. Korean Air has been private
for more than two decades and has been competing with Asiana Airlines
in both domestic routes and int'l routes for over a decade. As for the
ROK standard body, it's not KSC.  KS C is just a section in KS (Korean
Standard) for electric and electronic technology.  KS C used to cover
IT as well but in 1997-98, IT was moved to a new section 'X', which is
why KS C 5601 was renamed KS X 1001.

  Jungshik

Re: Stability of WG2

2003-12-16 Thread Michael Everson

At 04:36 -0800 2003-12-16, Peter Kirk wrote:

Seriously, can you remind us briefly what the situation is, why 
there is no current UK representation?
I will answer this off-line.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Stability of WG2 [OT]

2003-12-16 Thread Karl Pentzlin

Am Dienstag, 16. Dezember 2003 um 11:53 schrieb Peter Kirk:
PK A list of
PK character names from 1000 years ago, even from 400 years ago, would look
PK very strange today.
If they had been made by scientists of the Western culture, they would have
been in Latin; such names look in no way strange, as biology and medicine
show. Maybe English in 1000 years will be something like Latin today,
and the LATIN CAPITAL LETTER A will have its name unchanged as well
as the dorsa spinalis in anatomy.

--
Karl Pentzlin
ACS Analysis Consulting  Software GmbH
München, Germany

Re: Stability of WG2

2003-12-16 Thread Michael Everson

At 02:53 -0800 2003-12-16, Peter Kirk wrote:

Good point. Remember that the predicted life of Unicode (recently 
predicted by Michael, anyway) is longer than the lifetime of the 
current WG2 members
My point is that the work we do identifying characters and encoding 
them won't have to be done again. Once Manichaean is encoded, it's 
encoded.

One day, 200 years from now, there may be some Puricode revision 
which will do away with some of the duplicate encodings which we have 
for various legacy and round-trip requirements. But that will not 
invalidate our work today.

Even if this is a millennial reign of peace and prosperity, 
processes of language change will not stop. A list of character 
names from 1000 years ago, even from 400 years ago, would look very 
strange today.
Nothing stops you from publishing a list of character names in proper 
English, in Portuguese, or in some Inglish which may exist a long 
time from now. Currently those strings are required to be 
changeless for stability. So we do not change them, as long as that 
requirement remains, which the vendors say it does.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread jon

 Since {U+0069} is /not/ canonically equivalent to {U+0131}{U+0307}, I 
 don't see anything to stop me from registering the domain name 
 un{U+0131}{U+0307}code.org, for example. It /is/ in NFC, after all.

Do we have Unicode DNS yet? I know there's stuff out there passing UTF-8 
around, but is this formalised yet?

But yes, {U+0131}{U+0307} can look awfully similar to {U+0069}, I think {U+0069}
{U+0307} would as well (and of course there are other opportunities for visual 
confusion unrelated to the U+0069 and U+0131).

--
Jon Hanna   | Toys and books
http://www.hackcraft.net/ | for hospitals:
| http://santa.boards.ie

RE: Stability of WG2

2003-12-16 Thread Arcane Jill






Speaking as a Brit, I would like to know the answer to this one too.
What's the problem with answering online?

And if you're really not going to answer this online, you could
have just emailed Peter privately, instead of telling the whole list
that you're going to keep the answer secret from all of us except
Peter. What a wind up!

Jill


 -Original Message-
 From: Michael Everson [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, December 16, 2003 12:49 PM
 To: [EMAIL PROTECTED]
 Subject: Re: Stability of WG2
 
 
 At 04:36 -0800 2003-12-16, Peter Kirk wrote:
 
 Seriously, can you remind us briefly what the situation is,
why 
 there is no current UK representation?
 
 I will answer this off-line.
 -- 
 Michael Everson * * Everson Typography * * http://www.evertype.com

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread John Cowan

Arcane Jill scripsit:

 Since {U+0069} is /not/ canonically equivalent to {U+0131}{U+0307}, I 
 don't see anything to stop me from registering the domain name 
 un{U+0131}{U+0307}code.org, for example. It /is/ in NFC, after all.

You can (or rather, you will be able to when internationalized domain
names become a reality).  But in fact you have to use case folding
plus NFKC, and there is a list of forbidden characters as well.
See RFCs 3454 and 3491 for the exact rules.
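
As a hedged illustration only: Python's standard library ships an implementation of the RFC 3491 profile as encodings.idna.nameprep, so the rules can be tried without reading the tables by hand. The helper fold_and_nfkc below is a made-up name and is only a rough approximation of "case folding plus NFKC" using current Unicode data; it is not the RFC algorithm and it ignores the forbidden-character list.

```python
import unicodedata
from encodings.idna import nameprep  # stdlib implementation of RFC 3491


def fold_and_nfkc(label: str) -> str:
    """Rough approximation only: case folding followed by NFKC."""
    return unicodedata.normalize("NFKC", label.casefold())


print(nameprep("Unicode"))       # unicode
print(fold_and_nfkc("Unicode"))  # unicode

# The label from the question above. The results are printed rather than
# assumed, so the reader can see what each procedure does with it.
candidate = "un\u0131\u0307code"
print(ascii(nameprep(candidate)))
print(ascii(fold_and_nfkc(candidate)))
```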

-- 
There is no real going back.  Though I John Cowan
may come to the Shire, it will not seem [EMAIL PROTECTED]
the same; for I shall not be the same.  http://www.reutershealth.com
I am wounded with knife, sting, and tooth,  http://www.ccil.org/~cowan
and a long burden.  Where shall I find rest?   --Frodo

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Arcane Jill








 Do we have Unicode DNS yet?

Yup. You can put Chinese letters in domain names now. You do it like
this:
(1) Convert to NFC
(2) Encode in UTF-8
(3) Replace all reserved characters (space, %, etc.) with the three
character string "%hh" (where hh is hex for the substituted character)
(4) Now similarly replace all bytes > 0x7F with the three-character
string "%hh" (where hh is hex for the substituted byte)
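
The four steps can be sketched in Python as follows. This is only an illustration of the escaping described in the list, using the standard urllib.parse.quote function; percent_encode is a made-up name, and nothing here settles whether this form is acceptable in the domain part of an address.

```python
import unicodedata
from urllib.parse import quote


def percent_encode(text: str) -> str:
    nfc = unicodedata.normalize("NFC", text)  # step (1)
    # quote() encodes the string as UTF-8 (step 2) and replaces every byte
    # outside a small set of unreserved ASCII characters with %hh (steps 3-4).
    return quote(nfc, safe="")


print(percent_encode("fi\u0301sea\u0301n"))  # f%C3%ADse%C3%A1n
print(percent_encode("a b%"))                # a%20b%25
```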


 But yes, {U+0131}{U+0307} can look awfully similar to 
 {U+0069}, I think {U+0069}
 {U+0307} would as well (and of course there are other 
 opportunities for visual 
 confusion unrelated to the U+0069 and U+0131).

Yeah, I thought of that. Yuk. The whole issue of spoof detection is an
absolute nightmare. There are some things you can do to help,
though: security-conscious applications could use fonts in which 0
looks different from O, and in which 1 looks different from l;
different scripts could be displayed in different colors; a warning
dialog could be presented to the user if any character is a
compatibility character, and so on. But NONE of these tricks will catch
the distinction between U+0069 and U+0131. Both are letters, both are
in the Latin script, neither is a compatibility character, etc.
Automation can only go so far. Eventually, you're left with only one
choice - to advise the user: "Never click on a hyperlink. Instead,
always type in the URL by hand". Trouble is, such advice is more
trouble than it's worth, and would kill the fluidity of the internet.

Jill

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread jon

Quoting Arcane Jill [EMAIL PROTECTED]:

   Do we have Unicode DNS yet?
 
 Yup. You can put Chinese letters in domain names now. You do it like this:
 (1) Convert to NFC
 (2) Encode in UTF-8
 (3) Replace all reserved characters (space, %, etc.) with the three 
 character string %hh (where hh is hex for the substituted character)
 (4) Now similarly replace all bytes > 0x7F with the three-character 
 string %hh (where hh is hex for the substituted character)

I know that this is done with Internationalised URIs, but does this work in the 
domain portion as well? I thought the DNS rules still prohibited it, although 
the URI rules don't - the inverse to how URIs are case-sensitive but the DNS 
portion isn't treated as such when dereferencing.

Eventually, you're left with only one choice - to advise the user: 
 Never click on a hyperlink. Instead, always type in the URL by hand. 
 Trouble is, such advice is more trouble than it's worth, and would kill 
 the fluidity of the internet.

Or click on whatever hyperlinks you like, but have the hatches battened down 
and don't assume you are where you appear to be.

I like to summarise security advice thusly: if you trust my advice on security 
you're starting with completely the wrong attitude :)

--
Jon Hanna   | Toys and books
http://www.hackcraft.net/ | for hospitals:
| http://santa.boards.ie

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread John Jenkins

On Dec 16, 2003, at 4:27 AM, Michael Everson wrote:

At 11:03 +0100 2003-12-16, Philippe Verdy wrote:
Doug Ewell [EMAIL PROTECTED] writes:
  Wrong here: I have found occurrences of dotless lowercase i, used
  instead of soft-dotted lowercase i, as base letters for diacritics
  added above it (it was an acute accent...)
 Don't do that.
What? This is VALID UNICODE to have texts coded like this.
In Irish, it is INCORRECT to spell físeán 'video' with a DOTLESS I + 
COMBINING ACUTE. It is a spelling error, and will fail in 
spell-checking. The correct spelling is either I + COMBINING ACUTE or 
precomposed I WITH ACUTE.

Michael is, of course, correct.  The problem here is that books on 
Latin typography from the not too distant past, such as those by Robin 
Williams (the other one), recommend using dotless-i + accent for 
precisely the reason that the dot would otherwise collide with the 
accent.  Ms Williams was working in an environment, however, where all 
kinds of hacks were needed for non-international software like Quark to 
do the fancy stuff typographers wanted to do.  A lot of the old 
typography tricks are being made obsolete by Unicode and 
OpenType/AAT/Graphite, and should no longer be adhered to.


John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Philippe Verdy

Michael Everson wrote:
 At 11:03 +0100 2003-12-16, Philippe Verdy wrote:
 Doug Ewell [EMAIL PROTECTED] writes:
Wrong here: I have found occurrences of dotless lowercase i, used
instead of soft-dotted lowercase i, as base letters for diacritics
added above it (it was an acute accent...)
  
   Don't do that.
 
 What? This is VALID UNICODE to have texts coded like this.
 
 In Irish, it is INCORRECT to spell físeán 
 'video' with a DOTLESS I + COMBINING ACUTE. It is 
 a spelling error, and will fail in 
 spell-checking. The correct spelling is either I 
 + COMBINING ACUTE or precomposed I WITH ACUTE.

Spelling was not the issue there. Only Unicode validity.

 For whatever reason, encoded texts exist before correct fonts are used to
 render them. So there do exist texts which use dotless lowercase i 
 before a diacritic above, simply because the author of the text did not 
 want it to be rendered with a superposed dot.
 
 Texts which contain spelling errors. Or old IPA 
 texts using any number of ad-hoc IPA font 
 solutions. Those texts have to be transcoded to 
 proper Unicode at some stage. What you suggest is 
 Not Recommended.

Not recommended but still valid (and actually used in Turkish as well!), and
used on some occasions because of defects in fonts that don't have a
precomposed glyph for letter i with the diacritic but have separate glyphs
for the combining diacritic and for the dotted and dotless letters i, or
that are used with renderers unable to remove the soft dot. The IPA-93 font
is one such font: it allows good typesetting, but needs glyph processing to
select the appropriate base letter.

My main issue, however, is with Turkish names found in environments where
language identification is not possible (for example a simple filename, a
locale-neutral database field, or an international HTML form which requests
user names and uses them as case-insensitive identifiers); lowercase dotless
i does not work appropriately there.

I think it is completely illogical for case-insensitive comparisons to
match together the three letters:
LATIN SMALL LETTER I (dotted)
LATIN CAPITAL LETTER I (dotless)
LATIN CAPITAL LETTER I WITH DOT ABOVE
but not:
LATIN SMALL LETTER DOTLESS I
when using locale-neutral comparisons, given that the normative uppercase
mapping of this fourth letter is the second letter above.
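
For reference, the default mappings of the four letters can be listed with a few lines of Python (a sketch; the language and the helper name codepoints are assumptions). One caveat: Python's string methods apply the full case mappings, under which U+0130 lowercases and folds to the two-character sequence U+0069 U+0307, whereas the simple lowercase mapping in the character database gives U+0069 alone. Which letters end up matching therefore depends on whether a system uses simple or full mappings.

```python
letters = {
    "LATIN SMALL LETTER I": "\u0069",
    "LATIN CAPITAL LETTER I": "\u0049",
    "LATIN CAPITAL LETTER I WITH DOT ABOVE": "\u0130",
    "LATIN SMALL LETTER DOTLESS I": "\u0131",
}


def codepoints(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


for name, ch in letters.items():
    print(
        f"{name}: lower={codepoints(ch.lower())} "
        f"upper={codepoints(ch.upper())} "
        f"fold={codepoints(ch.casefold())}"
    )
```

In this output the dotless i uppercases to U+0049 but folds to itself, so it matches capital I after uppercasing and matches nothing else after case folding.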

I'm sorry that nobody wants to admit that this is a security issue, one
which causes problems when applications expect that a case-insensitive
difference means that converting the string to lowercase, uppercase
or titlecase will preserve this difference.



RE: Stability of WG2

2003-12-16 Thread Winkler, Arnold F




Jill,

Speaking as an Austrian, I don't care why the UK does not 
participate in SC2/WG2.

But I DO appreciate the information, that I am not going to 
see an answer to this question. Please be kind to 
Michael.

Regards
Arnold

  
  
  From: Arcane Jill [mailto:[EMAIL PROTECTED]]
  Sent: Tuesday, December 16, 2003 8:41 AM
  To: [EMAIL PROTECTED]
  Subject: RE: Stability of WG2

  Speaking as a Brit, I would like to know the answer to this one too.
  What's the problem with answering online?

  And if you're really not going to answer this online, you could have just
  emailed Peter privately, instead of telling the whole list that you're going
  to keep the answer secret from all of us except Peter. What a wind up!

  Jill

   -Original Message-
   From: Michael Everson [mailto:[EMAIL PROTECTED]]
   Sent: Tuesday, December 16, 2003 12:49 PM
   To: [EMAIL PROTECTED]
   Subject: Re: Stability of WG2

   At 04:36 -0800 2003-12-16, Peter Kirk wrote:

   Seriously, can you remind us briefly what the situation is, why
   there is no current UK representation?

   I will answer this off-line.
   --
   Michael Everson * * Everson Typography * * http://www.evertype.com

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Philippe Verdy

Stefan Persson writes:
 Isn't the sequence dotless i + combining acute canonically equivalent 
 to dotted i + combining acute?

NO. There's no canonical equivalence between two distinct pairs of characters
if the first letters of the two pairs are not themselves canonically equivalent.
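
Canonical equivalence can be tested mechanically by comparing fully decomposed (NFD) forms. A small Python sketch, where canonically_equivalent is a hypothetical helper name and the language is an assumption:

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


dotted_acute = "\u0069\u0301"   # i + COMBINING ACUTE ACCENT
dotless_acute = "\u0131\u0301"  # dotless i + COMBINING ACUTE ACCENT
precomposed = "\u00ED"          # LATIN SMALL LETTER I WITH ACUTE

print(canonically_equivalent(dotted_acute, precomposed))    # True
print(canonically_equivalent(dotless_acute, dotted_acute))  # False
print(canonically_equivalent(dotless_acute, precomposed))   # False
```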




Re: [OT] CJK - CJC (Re: Corea?)

2003-12-16 Thread Doug Ewell

Jungshik Shin jshin at mailaps dot org wrote:

 If those 20 assemblymen really think a name change will boost
 national identity and pride, shouldn't they be trying to persuade
 English speakers to say Taehan Minguk instead?

  No, that's not only even sillier (as we'd all agree) but also
 is incorrect because 'Taehan Minguk' does not mean Korea but
 specifically mean 'Republic of Korea' that was founded in 1948.
 Moreover, North Koreans would prefer 'Chosun' to 'Hanguk' (Using
 'Taehan Minguk' is obviously out of question to them). Using 'Korea'
 (English name) is a rather convenient way to work around the
 difference (between two Koreas).

Sorry, I was under the impression that this name thing was specifically
a South Korean idea.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

G-Strings

2003-12-16 Thread Arcane Jill





There was talk recently on this list of mapping grapheme clusters to
the PUA (for application internal use only, obviously, not for export
to the real world). I actually did this recently, though my app ended
up in an incomplete state since I got bored and moved onto something
else. The algorithm worked though, so I present it here and place it in
the public domain, licence free, for anyone to use who wants to do so.
Such an encoded string I called a "grapheme string", or "gstring" for
short. Of course, that was before "grapheme" was renamed as "default
grapheme cluster", so the name doesn't work quite as well now.

The range of characters I reserved for my private use actually
consisted of the surrogate codepoints, not the PUA codepoints. I
reasoned that the PUA area might actually be being used for something
(else), but the surrogate codepoints were illegal and therefore
available. Despite the fact that the number of possible graphemes is
infinite, I never actually ran out of codepoints.

Here's the algorithm in pseudo-code:


// The following are static and global
max_word (a 16-bit unsigned integer, initially the lowest codepoint you
reserve (e.g. the start of the PUA) minus one)
map_grapheme_to_word[] (a mapping from grapheme (=array of codepoints)
to 16-bit word, initially empty)
map_word_to_grapheme[] (a mapping from 16-bit word to grapheme,
initially empty)



// Convert Unicode text to internal representation with one 16-bit word
// per grapheme
// -- input (text_unicode) is an array of codepoints (i.e. it
//    has already been decoded from UTF-whatever)
// -- output (text_internal) is an array of 16-bit words, each
//    representing one grapheme. THIS STRING MAY NEVER BE EXPORTED.

text_internal = "";
for (each grapheme in text_unicode) // each grapheme is a substring of
                                    // one or more codepoints
{
 grapheme = convert_to_NFC(grapheme);
 // A single codepoint that fits in one 16-bit word is stored as-is
 if (num_codepoints(grapheme) == 1 && codepoint_of(grapheme) < 0x10000)
 {
  text_internal += codepoint_of(grapheme);
 }
 else
 {
  if (!exists(map_grapheme_to_word[grapheme]))
  {
   if (max_word still in range)
   {
    map_grapheme_to_word[grapheme] = ++max_word;
    map_word_to_grapheme[max_word] = grapheme;
   }
  }
  if (exists(map_grapheme_to_word[grapheme]))
  {
   text_internal += map_grapheme_to_word[grapheme];
  }
  else
  {
   // Whoa!! Ran out of reserved characters! Could add error handling here.
   text_internal += U+FFFD;
  }
 }
}
return text_internal;



// The converse process
text_unicode = "";
for (each word in text_internal)
{
 if (word in correct range) // e.g. PUA but doesn't have to be
 {
  if (exists(map_word_to_grapheme[word]))
  {
   text_unicode += map_word_to_grapheme[word];
  }
  else
  {
   // error - should never happen
   text_unicode += U+FFFD;
  }
 }
 else
 {
  text_unicode += word;
 }
}
return text_unicode;
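
For anyone who would rather start from something runnable, here is a rough Python sketch of the same scheme. Several things in it are assumptions rather than part of the pseudo-code above: the names GStringCodec and simple_clusters are invented, the internal string is represented as a list of integers, the reserved range is the surrogate block, and the cluster splitter is a crude stand-in (a base character followed by any combining marks) rather than a real default grapheme cluster segmenter.

```python
import unicodedata

RESERVED_FIRST = 0xD800  # assumption: reuse the surrogate range, as described above
RESERVED_LAST = 0xDFFF
REPLACEMENT = 0xFFFD


def simple_clusters(text):
    """Crude stand-in for grapheme segmentation: base + combining marks."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.category(ch).startswith("M"):
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster


class GStringCodec:
    def __init__(self):
        self.next_word = RESERVED_FIRST
        self.cluster_to_word = {}
        self.word_to_cluster = {}

    def encode(self, text):
        """Return a list of 16-bit words, one per cluster. Never export it."""
        words = []
        for cluster in simple_clusters(text):
            cluster = unicodedata.normalize("NFC", cluster)
            if len(cluster) == 1 and ord(cluster) < 0x10000:
                words.append(ord(cluster))
                continue
            word = self.cluster_to_word.get(cluster)
            if word is None and self.next_word <= RESERVED_LAST:
                word = self.next_word
                self.next_word += 1
                self.cluster_to_word[cluster] = word
                self.word_to_cluster[word] = cluster
            # Fall back to U+FFFD if the reserved range is exhausted.
            words.append(REPLACEMENT if word is None else word)
        return words

    def decode(self, words):
        out = []
        for word in words:
            if RESERVED_FIRST <= word <= RESERVED_LAST:
                out.append(self.word_to_cluster.get(word, "\uFFFD"))
            else:
                out.append(chr(word))
        return "".join(out)


codec = GStringCodec()
text = "q\u0301i\u0301x"  # q + acute has no precomposed form; i + acute does
words = codec.encode(text)
print([hex(w) for w in words])  # ['0xd800', '0xed', '0x78']
print(codec.decode(words) == unicodedata.normalize("NFC", text))  # True
```

Note that the round trip returns the NFC form of the input, not necessarily the original codepoints, and that this sketch does not guard against input that already contains lone surrogate values.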



Jill

WG2 - anyone from the UK interested?

2003-12-16 Thread Christopher John Fynn

There seems to be at least some interest in re-establishing the UK character
encoding committee which contributed to ISO/IEC JTC1/SC2/WG2 10646.

Anyone in Britain (or British) who might be interested in participating, please
let me know ASAP.

Thanks

- Chris

==
Christopher Fynn
4 Chester Court
84 Salusbury Road
London NW6 6PA




- Original Message - 
From: Elaine Keown [EMAIL PROTECTED]
To: Michael Everson [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 2:56 PM
Subject: Re: Stability of WG2


Elaine Keown
in Austin

 Hi,

  Not with Ireland and Japan standing with the US on
  such an issue. ;-)

  We really must get the UK back into SC2 ;-)

 Is this another joke?--Elaine


RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Michael Everson

At 16:48 +0100 2003-12-16, Philippe Verdy wrote:
Michael Everson wrote:
 At 11:03 +0100 2003-12-16, Philippe Verdy wrote:
 Doug Ewell [EMAIL PROTECTED] writes:
Wrong here: I have found occurrences of dotless lowercase i, used
instead of soft-dotted lowercase i, as base letters for diacritics
added above it (it was an acute accent...)
  
   Don't do that.
 
 What? This is VALID UNICODE to have texts coded like this.
 In Irish, it is INCORRECT to spell físeán
 'video' with a DOTLESS I + COMBINING ACUTE. It is
 a spelling error, and will fail in
 spell-checking. The correct spelling is either I
 + COMBINING ACUTE or precomposed I WITH ACUTE.
Spelling was not the issue there. Only Unicode validity.
Apparently you should look up the word valid.

Any character can follow any other character and 
be valid. Any combining character can be 
applied to any base character, regardless of 
script.

  Texts which contain spelling errors. Or old IPA
 texts using any number of ad-hoc IPA font
 solutions. Those texts have to be transcoded to
 proper Unicode at some stage. What you suggest is
 Not Recommended.
Not recommended but still valid (and actually used in Turkish as well!)
Case folding in Turkish and Azeri is DIFFERENT 
from everywhere else and you have to have a local 
tailoring for it.

used in some occasions because of defects in fonts that don't have a
precomposed glyph for letter i with the diacritic but have a separate glyph
for the combining diacritic and for the dotted and dotless letters i, or
that use renderers unable to remove the soft dot.
What defects there are in FONTS without UNICODE CMAPS is of no concern to us.

The IPA-93 font is such one, which allows good 
typesetting, but which needs glyph processing to 
select the appropriate base letter.
It isn't a Unicode font, and so it doesn't 
matter. Data represented in it has to be 
transcoded to Unicode, and the font has to have 
the right thing in it.

My main issue is, however with Turkish names found in environments where
language identification is not possible (for example a simple filename or a
locale-neutral database field or an international HTML form which requests
user names and use them as case insensitive identifiers); lowercase dotless
i do not work appropriately there.
Oh well.

I think it is completely illogical to match together with case-insensitive
compares, the three letters:
LATIN SMALL LETTER I (dotted)
LATIN CAPITAL LETTER I (dotless)
LATIN CAPITAL LETTER I WITH DOT ABOVE
but not:
LATIN SMALL LETTER DOTLESS I
when using locale-neutral compares, given that the normative uppercase mapping
of this fourth letter is the second letter above.
That is not what happens in locale-neutral comparisons, I believe.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Kent Karlsson


  Since {U+0069} is /not/ canonically equivalent to 
 {U+0131}{U+0307}, I 
  don't see anything to stop me from registering the domain name 
  un{U+0131}{U+0307}code.org, for example. It /is/ in NFC, 
 after all.
 
 You can (or rather, you will be able to when internationalized domain
 names become a reality).  But in fact you have to use case folding

Yes. And as it happens, dotless-i case-*folds* to (soft)dotted-i,
so you cannot register an IDN that after nameprep has a dotless-i
in it, since that name isn't correctly nameprepped.

This does not guard against (soft)dotted-i, dot-above, but for
the registered part of a domain name, registrars are *supposed* to
have some rules for what is allowed, and what is not (for that particular
registrar). E.g. the Swedish domain name registry *currently* allows
only ASCII letters plus åäöé (after nameprep) in domain names they
register, though this may be somewhat augmented in the future
(to cover Sami too at least, maybe more). This kind of solution
was driven mainly by the issue of the traditional chinese vs.
simplified chinese problem, but that approach applies to cases
like dotless i, dot-above too.

 plus NFKC, and there is a list of forbidden characters as well.
 See RFCs 3454 and 3491 for the exact rules.

No letter is forbidden (though several are case-folded to the same
letter), nor is any 'graphic' combining mark.

/kent k

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread Stefan Persson

Kent Karlsson wrote:
 This kind of solution
was driven mainly by the issue of the traditional chinese vs.
simplified chinese problem, but that approach applies to cases
like dotless i, dot-above too.
Do you mean that people were afraid that someone would register e.g.  
.com, while someone else would register .com?

Stefan

Re: Stability of WG2

2003-12-16 Thread Curtis Clark

on 2003-12-16 02:53 Peter Kirk wrote:
Even if this is a millennial reign of peace 
and prosperity, processes of language change will not stop. 
A measure of comparison is the system of biological nomenclature, which 
has maintained stability of names in the face of increasing knowledge of 
organisms over a period of a quarter of a millennium. There are no ISO 
standards for scientific names--the system has succeeded through 
consensus, by biologists agreeing that a stable system is worth the 
trade of quite a bit of individualism (not to mention the periodic and 
sometimes raucous conventions when the rules are modified).

--
Curtis Clark  http://www.csupomona.edu/~jcclark/
Mockingbird Font Works  http://www.mockfont.com/

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread jcowan

Michael Everson scripsit:

[Philippe Verdy scripsisset:]
 I think it is completely illogical to match together with case-insensitive
 compares, the three letters:
  LATIN SMALL LETTER I (dotted) [U+0069]
  LATIN CAPITAL LETTER I (dotless) [U+0049]
  LATIN CAPITAL LETTER I WITH DOT ABOVE [U+0130]
 but not:
  LATIN SMALL LETTER DOTLESS I [U+0131]
 when using locale-neutral compares, given that the normative uppercase mapping
 of this fourth letter is the second letter above.
 
 That is not what happens in locale-neutral comparisons, I believe.

Here's what happens exactly:

source      simple case folding   full case folding       tr/az case folding
dotted i    dotted i              dotted i                dotted i
dotless i   dotless i             dotless i               dotless i
dotted I    dotted I              dotted i + comb. dot    dotted i
dotless I   dotted i              dotted i                dotless i
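
[Editorial sketch, not part of the original message.] The table above can be reproduced roughly in Python. Only the "full" column uses a built-in (`str.casefold()`, which applies full case folding); the `SIMPLE_FOLD` and `TURKIC_FOLD` tables and both helper functions are hypothetical names written by hand for this illustration.

```python
# The four "i" characters discussed in this thread.
LETTERS = {
    "dotted i": "\u0069",
    "dotless i": "\u0131",
    "dotted I": "\u0130",
    "dotless I": "\u0049",
}

# Hand-written illustration tables (assumptions, not a library API):
# only the characters whose folding differs from the identity are listed.
SIMPLE_FOLD = {"\u0049": "\u0069"}
TURKIC_FOLD = {"\u0049": "\u0131", "\u0130": "\u0069"}


def fold_with(table, text):
    """Fold each character through `table`, leaving unlisted ones unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


def code_points(text):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for name, ch in LETTERS.items():
    print(
        f"{name:10}",
        "simple:", code_points(fold_with(SIMPLE_FOLD, ch)),
        "| full:", code_points(ch.casefold()),  # built-in full case folding
        "| tr/az:", code_points(fold_with(TURKIC_FOLD, ch)),
    )
```

Printing code points rather than glyphs avoids depending on how a terminal renders the soft dot.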

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
The competent programmer is fully aware of the strictly limited size of his own
skull; therefore he approaches the programming task in full humility, and among
other things he avoids clever tricks like the plague.  --Edsger Dijkstra

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread Peter Kirk

On 16/12/2003 08:41, Kent Karlsson wrote:

...

Yes. And as it happens, dotless-i case-*folds* to (soft)dotted-i,
so you cannot register an IDN that after nameprep has a dotless-i
in it, since that name isn't correctly nameprepped.
This does not guard against (soft)dotted-i, dot-above, but for
the registered part of a domain name, registrars are *supposed* to
have some rules for what is allowed, and what is not (for that particular
registrar). E.g. the Swedish domain name registry *currently* allows
only ASCII letters plus åäöé (after nameprep) in domain names they
register, though this may be somewhat augmented in the future
(to cover Sami too at least, maybe more). This kind of solution
was driven mainly by the issue of the traditional chinese vs.
simplified chinese problem, but that approach applies to cases
like dotless i, dot-above too.
 

If the Swedish registry allows all the letters used in Swedish and Sami, 
and far eastern registries allow Chinese characters, the Turkish and 
Azerbaijani registries should allow, and be allowed to allow, all the 
letters of the alphabets of their national languages.



--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread Stefan Persson

Peter Kirk wrote:
If the Swedish registry allows all the letters used in Swedish and Sami, 
and far eastern registries allow Chinese characters, the Turkish and 
Azerbaijani registries should allow, and be allowed to allow, all the 
letters of the alphabets of their national languages.
They would in that case allow dotted and dotless i, but would they 
automatically allow dot above?  There's still the uppercase/lowercase 
problem, though---maybe these registries should not allow different 
domain names that differ only in dotless/dotted i?

Stefan

Re: Swastika to be banned by Microsoft?

2003-12-16 Thread Anto'nio Martins-Tuva'lkin

On 2003.12.15, 12:54, Tom Emerson [EMAIL PROTECTED] wrote:

 Apparently that S is the Futhark rune Sigel, encoded at U+16CB.

 Holocaust scholars wanting to encode German documents from the 1930s
 and 1940s would want the double runic S encoded, since this was a
 specific character found on type-writers of the era and saw regular
 use.

 A proposal to encode this was shot down a few years ago, however.

Even if it were encoded it could still have been made canonically (or
otherwise) decomposed to U+16CB U+16CB. Or to U+16CB U+034F U+16CB, to
keep its logoness.

--.
António MARTINS-Tuválkin |  ()|
[EMAIL PROTECTED]||
Rua Alberto Bramão, 8-1º d.to |
PT-1700-132 LISBOA   Não me invejo de quem tem|
+351 934 821 700 carros, parelhas e montes|
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe|
http://pagina.de/bandeiras/  a água em todas as fontes|

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread Peter Kirk

On 16/12/2003 11:49, Stefan Persson wrote:

Peter Kirk wrote:

If the Swedish registry allows all the letters used in Swedish and 
Sami, and far eastern registries allow Chinese characters, the 
Turkish and Azerbaijani registries should allow, and be allowed to 
allow, all the letters of the alphabets of their national languages.


They would in that case allow dotted and dotless i, but would they 
automatically allow dot above? ...
Probably not, although there would be a certain irony if dotless i with 
dot above was allowed but ordinary i was not.

...  There's still the uppercase/lowercase problem, though ...
True. This problem needs to be solved. In the circumstances, and since 
as a general rule IDNs are written lower case, it might be acceptable 
for the lower case mapping of (ordinary dotless) I to be indeterminate, 
so that if I type UNICODE.ORG I might get unicode.org or unıcode.org.

... ---maybe these registries should not allow different domain names 
that differ only in dotless/dotted i?
Indeed they should, just as the Swedish registry allows names that 
differ only in umlauts. These are different letters of the alphabet. 
Otherwise we are imposing foreign alphabetic practices.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread jcowan

Kent Karlsson scripsit:

 Yes. And as it happens, dotless-i case-*folds* to (soft)dotted-i,
 so you cannot register an IDN that after nameprep has a dotless-i
 in it, since that name isn't correctly nameprepped.

What is the source of this claim?  The tables in RFC 3454 (stringprep)
do not mention dotless-i, and neither does RFC 3491.
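
[Editorial sketch, not part of the original message.] One way to check this is Python's standard `stringprep` module, which is generated from the RFC 3454 tables. The example assumes `stringprep.map_table_b2` reflects table B.2 (case folding used with NFKC); the helper `code_points` is a hypothetical name introduced here.

```python
import stringprep


def code_points(text):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Map each of the four letters through RFC 3454 table B.2.
for name, ch in [
    ("dotted i", "\u0069"),
    ("dotless i", "\u0131"),
    ("dotted I", "\u0130"),
    ("dotless I", "\u0049"),
]:
    mapped = stringprep.map_table_b2(ch)
    status = "unchanged" if mapped == ch else "mapped"
    print(f"{name:10} {code_points(ch)} -> {code_points(mapped)} ({status})")
```

If the tables behave as described in the message, dotless i comes back unchanged, so a nameprepped label can still contain it.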

-- 
Knowledge studies others / Wisdom is self-known;  John Cowan
Muscle masters brothers / Self-mastery is bone;   [EMAIL PROTECTED]
Content need never borrow / Ambition wanders blind;   www.ccil.org/~cowan
Vitality cleaves to the marrow / Leaving death behind.--Tao 33 (Bynner)

Speaking of glottophagic hegemony (was Re: [OT] CJK - CJC (Re: Corea?))

2003-12-16 Thread Kenneth Whistler

Wow. Antonio is running it down!

 Etc. All this crackpot misguided political correctness reeks of
 unconscious glottophagic hegemony, cultural parochalism and well-meaning
 gringocentered patronizing -- it's unsettling to sniff (in this and
 other threads) whips of it in a forum such as this.
 ^
 
But your p seems to have glottophagiated the ff in whiffs,
unless the implication is that the Mistress of Cultural
Parochialism also has an odiferous fetish with the leather lash
she's using to scourge the misbehaving perpetrators of foreignisms. :-)

And yes, that's an open invitation to further OT-ify a thread that
has gone bad. This forum definitely needs some thread discipline here.

--Ken

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread Stefan Persson

Peter Kirk wrote:
In the circumstances, and since 
as a general rule IDNs are written lower case, it might be acceptable 
for the lower case mapping of (ordinary dotless) I to be indeterminate, 
so that if I type UNICODE.ORG I might get unicode.org or unıcode.org.

... ---maybe these registries should not allow different domain names 
that differ only in dotless/dotted i?
Indeed they should, just as the Swedish registry allows names that 
differ only in umlauts.
In that case, how would the browser know if UNICODE.ORG means that you 
want to visit unicode.org or unıcode.org, if both domains exist? 
Maybe one could assume Turkish casing for .tr and .az domains, and 
non-Turkish casing for all other domains.

Stefan

Re: Stability of WG2

2003-12-16 Thread jcowan

Peter Kirk scripsit:

 On 16/12/2003 09:41, Curtis Clark wrote:
 
 A measure of comparison is the system of biological nomenclature, ... 
 (not to mention the periodic and sometimes raucous conventions when 
 the rules are modified).
 
 Probably the secret of its success is the existence of such conventions. 

*chuckle*

The first use of conventions above means meetings; the second means
rules.  Result: a non-meeting of the minds.

 If biologists had insisted that names once assigned could not be changed 
 because of advances in knowledge, or even to correct errors, then surely 
 the system would have broken down centuries ago.

In fact, Linnaean names are *not* changed for either of those reasons,
nor for any other reason whatsoever: though we now know that Basilosaurus
is a proto-whale and not any sort of reptile, Basilosaurus it will
remain forever.

The only thing that can happen in Linnaean nomenclature is the recognition
that two names are synonymous.  In that case, there is a question which
shall be the preferred name, and normally it is the first name published,
but exceptions sometimes occur.  Thus when Brontosaurus and Apatosaurus
were found to be synonyms, Apatosaurus was chosen as the preferred
name because it was published first; however, this is not properly
describable as changing the name of Brontosaurus to 'Apatosaurus'.
Brontosaurus is a perfectly good name and may still be used even though
it is dispreferred.

-- 
You are a child of the universe no less John Cowan
than the trees and all other acyclichttp://www.reutershealth.com
graphs; you have a right to be here.http://www.ccil.org/~cowan
  --DeXiderata by Sean McGrath  [EMAIL PROTECTED]

Re: Speaking of glottophagic hegemony (was Re: [OT] CJK - CJC (Re: Corea?))

2003-12-16 Thread jcowan

Kenneth Whistler scripsit:

 But your p seems to have glottophagiated the ff in whiffs,
 unless the implication is that the Mistress of Cultural
 Parochialism also has an odiferous fetish with the leather lash
 she's using to scourge the misbehaving perpetrators of foreignisms. :-)

I think you want odoriferous rather than odiferous, though the latter
term, ynkhorne as it is, may have some small applicability here.

 And yes, that's an open invitation to further OT-ify a thread that
 has gone bad. This forum definitely needs some thread discipline here.

Well, until we see either a whip (in the parliamentary sense) or a flagellifer
here, we can't expect much.

-- 
A rabbi whose congregation doesn't want John Cowan
to drive him out of town isn't a rabbi, http://www.ccil.org/~cowan
and a rabbi who lets them do it [EMAIL PROTECTED]
isn't a man.--Jewish saying http://www.reutershealth.com

Re: Stability of WG2

2003-12-16 Thread Michael Everson

At 16:05 -0500 2003-12-16, [EMAIL PROTECTED] wrote:

Thus when Brontosaurus and Apatosaurus were found to be synonyms, 
Apatosaurus was chosen as the preferred name because it was 
published first; however, this is not properly describable as 
changing the name of Brontosaurus to 'Apatosaurus'. Brontosaurus 
is a perfectly good name and may still be used even though it is 
dispreferred.
Brontosaurus was good enough for me when I was five, and it's good 
enough for me today. Hmpf. Dispreferred me elbow.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Qumran Greek

2003-12-16 Thread Elaine Keown

 Elaine Keown

Hi,

--- Michael Everson [EMAIL PROTECTED] wrote:
 The X looks like a CHI of course.

It is a chi!!!--E. G. Turner
Greek Manuscripts of the Ancient World 1987

says that chi is an editorial mark. 

His book has a plate of a Greek ms showing the chi and
paragraphos near each other, as in the Qumran Isaiah.
Elaine


RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Kent Karlsson


Stefan Persson wrote:
 Kent Karlsson wrote:
   This kind of solution
  was driven mainly by the issue of the traditional chinese vs.
  simplified chinese problem, but that approach applies to cases
  like dotless i, dot-above too.
 
 Do you mean that people were afraid that someone would 
 register e.g. .com, while someone else would register .com?

Assuming that those are SC and TC for the same reading,
yes. Worse, those worrying argued that more than 2^n IDNs,
where n is the number of CJK characters in the intended name
would be needed for each intended name (ignoring that SC
and TC don't usually mix).


Peter Kirk wrote:
 If the Swedish registry allows all the letters used in Swedish and Sami, 
 and far eastern registries allow Chinese characters, the Turkish and 
 Azerbaijani registries should allow, and be allowed to allow, all the 
 letters of the alphabets of their national languages.

Note that ß (sharp s) casefolds to ss, and ſ (long s) casefolds to s. So
straße, straſse, and strasse all map to the same (strasse)
subname.


John Cowan wrote:
  Yes. And as it happens, dotless-i case-*folds* to (soft)dotted-i,
  so you cannot register an IDN that after nameprep has a dotless-i
  in it, since that name isn't correctly nameprepped.
 
 What is the source of this claim?  The tables in RFC 3454 (stringprep)
 do not mention dotless-i, and neither does RFC 3491.

Aha, a change that escaped me. (It used to be folded as described above.)
My apologies.


/kent k

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Philippe Verdy

 Here's what happens exactly:

Note the rules in CaseFolding.txt:

0049;  C; 0069;      # CAPITAL (dotless) I  -> SMALL (soft-dotted) I
0049;  T; 0131;      # CAPITAL (dotless) I  -> SMALL DOTLESS I
0130;  F; 0069 0307; # CAPITAL I WITH DOT   -> SMALL (soft-dotted) I, DOT
0130;  T; 0069;      # CAPITAL I WITH DOT   -> SMALL (soft-dotted) I

But also that the other 'i's are mapped to themselves by default.
There's no explicit Casefolding mapping defined for them so we also have
currently these defaults:

0069;  C; 0069;      # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
0130;  C; 0130;      # CAPITAL I WITH DOT    -> CAPITAL I WITH DOT
0131;  C; 0131;      # SMALL DOTLESS I       -> SMALL DOTLESS I

And we also have the explicitly dotted Turkic lowercase i, whose folding is
defined by the 5th of all rules above (thankfully, there's no canonical
equivalence between 0069 0307 and 0069):

0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

And for the decomposition of the Turkic dotted uppercase I, case folding is
defined by the 1st or 2nd of all rules above (note that 0049 0307 and 0130
should be canonically equivalent, and should produce identical case foldings
with the 3rd or 4th rules above, to preserve canonical equivalence):

0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT -> SMALL (soft-dotted) I, DOT
0049 0307; T; 0131 0307; # CAPITAL (dotless) I, DOT -> SMALL DOTLESS I, DOT



Now let's look at each CaseFolding type, and look at their result:


(1) Mappings for Simple CaseFolding:

(1.1) First class of source strings:
0131;  C; 0131;          # SMALL DOTLESS I            -> SMALL DOTLESS I
(1.2) Second class of source strings:
0049;  C; 0069;          # CAPITAL (dotless) I        -> SMALL (soft-dotted) I
0069;  C; 0069;          # SMALL (soft-dotted) I      -> SMALL (soft-dotted) I
(1.3) Third class of source strings:
0130;  C; 0130;          # CAPITAL I WITH DOT         -> CAPITAL I WITH DOT
(1.4) Fourth class of source strings:
0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT   -> SMALL (soft-dotted) I, DOT
0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

Are these classes stable (neither merging nor splitting) under uppercase/titlecase
or lowercase mapping?

(1.1) 0131;  lower=0131 ; upper/title=0131

(1.2) 0049;  lower=0069 ; upper/title=0049
(1.2) 0069;  lower=0069 ; upper/title=0049

(1.3) 0130;  lower=0130 ; upper/title=0130

(1.4) 0049 0307; lower=0069 0307; upper/title=0049 0307
(1.4) 0069 0307; lower=0069 0307; upper/title=0049 0307

OK, there's no merge, so no problem with Simple CaseFolding, which is stable
under case mappings.


(2) Mappings for Turkic CaseFolding:

(2.1) First class of source strings:
0131;  C; 0131;          # SMALL DOTLESS I            -> SMALL DOTLESS I
0049;  T; 0131;          # CAPITAL (dotless) I        -> SMALL DOTLESS I
(2.2) Second class of source strings:
0069;  C; 0069;          # SMALL (soft-dotted) I      -> SMALL (soft-dotted) I
0130;  T; 0069;          # CAPITAL I WITH DOT         -> SMALL (soft-dotted) I
(2.3) Third class of source strings:
0049 0307; T; 0131 0307; # CAPITAL (dotless) I, DOT   -> SMALL DOTLESS I, DOT
(2.4) Fourth class of source strings:
0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

Are these classes stable (neither merging nor splitting) under the common
uppercase/titlecase or lowercase mappings?

(2.1) 0131;  C; lower=0131 ; upper/title=0131

(2.1) 0049;  C; lower=0069 ; upper/title=0049
(2.2) 0069;  C; lower=0069 ; upper/title=0049

(2.2) 0130;  C; lower=0130 ; upper/title=0130

(2.3) 0049 0307; C; lower=0069 0307; upper/title=0049 0307
(2.4) 0069 0307; C; lower=0069 0307; upper/title=0049 0307

Problem here: uppercase mappings do not follow case folding rules.
We would also need Turkic-specific mappings for upper/title case:

(2.1) 0131;  T; upper/title=0049
(2.1) 0049;  C; upper/title=0049

(2.2) 0069;  T; upper/title=0130
(2.2) 0130;  C; upper/title=0130

(2.3) 0049 0307; T; upper/title=0049 0307 (=0130 ?)

(2.4) 0069 0307; T; upper/title=0130 0307 (=0130 ?)
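
[Editorial sketch, not part of the original message.] The point about Turkic-specific uppercase mappings can be tried out for the four single letters. `TR_FOLD`, `TR_UPPER`, `tr_fold` and `tr_upper` are hypothetical names with hand-written tables following the T rules quoted above; they are not a library API, and sequences with U+0307 are deliberately left out.

```python
# Hand-written Turkic tables for the four single letters (assumptions).
TR_FOLD = {"\u0049": "\u0131", "\u0130": "\u0069"}   # I -> dotless i, dotted I -> i
TR_UPPER = {"\u0131": "\u0049", "\u0069": "\u0130"}  # dotless i -> I, i -> dotted I


def tr_fold(text):
    """Turkic case folding: T rules first, default folding otherwise."""
    return "".join(TR_FOLD.get(ch, ch.casefold()) for ch in text)


def tr_upper(text):
    """Turkic uppercasing: Turkic table first, default mapping otherwise."""
    return "".join(TR_UPPER.get(ch, ch.upper()) for ch in text)


for ch in ("\u0069", "\u0131", "\u0130", "\u0049"):
    folded = tr_fold(ch)
    # A class is stable if uppercasing does not move the letter to another class.
    stable_default = tr_fold(ch.upper()) == folded
    stable_turkic = tr_fold(tr_upper(ch)) == folded
    print(f"U+{ord(ch):04X} default upper stable: {stable_default}, "
          f"Turkic upper stable: {stable_turkic}")
```

With the default uppercase mapping, soft-dotted i crosses into the dotless class; with the Turkic table all four letters stay where they are.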

But we would need then to define canonical equivalence between 0130 and 0049
0307 and 0130 0307 to preserve canonical equivalence... So Turkic
CaseFoldings would be broken, unless we say that Turkish texts should NOT be
encoded with 0307, but only with 0049, 0069, 0130 or 0131. So Turkic
CaseFolding rules should also avoid generating any 0307, whose behavior is
not clear.

If we just remove any 0307 from the Turkic texts, there is absolutely no
problem with Turkic CaseFolding, provided that we also define
Turkic-specific uppercase mappings as done above, and don't use the default

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread Peter Kirk

On 16/12/2003 13:09, Stefan Persson wrote:

...
In that case, how would the browser know if UNICODE.ORG means that 
you want to visit unicode.org or unıcode.org, if both domains 
exist? Maybe one could assume Turkish casing for .tr and .az domains, 
and non-Turkish casing for all other domains.

Stefan

As soon as I had written the above I realised that I had hurried too 
much, but I was going out. Let me clarify:

If it is the client software (browser etc) which resolves the casing, 
then how it resolves it is essentially a local matter which doesn't need 
to be standardised. But my recommendation would be that the mapping 
followed the local language context, i.e. in general the system locale 
except where overridden by language markup in the local context e.g. 
when the URL is embedded in a document. That is, I would map to i, 
unless the locale or markup language is tr or az in which case it would 
map to dotless i. (There are actually a few other language orthographies 
which use Turkic casing.) The alternative of using the Turkic mapping 
for .tr and .az domains is possible but seems less desirable to me.
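
[Editorial sketch, not part of the original message.] A client-side resolver following this recommendation might look roughly like the code below. The function name `lower_label`, the `TURKIC_LANGS` set (limited to tr and az for illustration) and the way the language tag is passed in are all assumptions, not an existing browser or IDNA API.

```python
# Languages assumed to use Turkic casing in this sketch; the message notes
# that a few other orthographies do as well.
TURKIC_LANGS = {"tr", "az"}

# Turkic lowercase exceptions: I -> dotless i, dotted I -> i.
_TURKIC_LOWER = str.maketrans({"\u0049": "\u0131", "\u0130": "\u0069"})


def lower_label(label, lang=None):
    """Lowercase a typed domain label according to the local language context.

    `lang` is a language tag such as "tr" or "tr-TR" taken from the system
    locale or from markup around the URL; None means no Turkic context.
    """
    primary = (lang or "").split("-")[0].lower()
    if primary in TURKIC_LANGS:
        label = label.translate(_TURKIC_LOWER)
    return label.lower()


print(lower_label("UNICODE.ORG"))           # default casing
print(lower_label("UNICODE.ORG", "tr-TR"))  # Turkic casing
```

The two calls yield different names for the same typed input, which is exactly the ambiguity raised in the question above.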

If the casing is resolved by the nameserver, there is no alternative to 
using the Turkic mapping only for .tr and .az domains.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Philippe Verdy

Chris Jacobs [mailto:[EMAIL PROTECTED]
 From: Philippe Verdy [EMAIL PROTECTED]
  Stefan Persson writes:
   Isn't the sequence dotless i + combining acute canonically 
 equivalent
   to dotted i + combining acute?
 
  NO. There's no canonical equivalence between distinct pairs of 
 characters,
  if the first letter of each pair are not also canonically equivalent.
 
 compare ę̈ with ę̈
 
 The first pair has e trema as its first letter, the second pair e ogonek.
 Yet these pairs are canonically equivalent.

True in the way you interpret my sentence, but when I say the first letter
of each pair, I mean the first non decomposable character of each pair. In
your example, both letters are simple e vowels.

Both dotted lowercase i and dotless lowercase i are not decomposable...
unlike dotted uppercase I...

Well Outlook 2000 is unable to represent any e with ogonek and trema of your
example. So, despite they are canonically equivalent, they are rendered
differently:

- ę̈ SMALL LETTER E WITH DIAERESIS, COMBINING OGONEK
  displays SMALL LETTER E WITH DIAERESIS, MISSING SPACING GLYPH FOR
  COMBINING OGONEK
  in an unbreakable sequence of glyphs or editable grapheme clusters (the
  keyboard edit cannot move in the middle, but the mouse selection can break
  before the ogonek.)

- ę̈ SMALL LETTER E WITH OGONEK, COMBINING DIAERESIS
  and ę̈ SMALL LETTER E, COMBINING OGONEK, COMBINING DIAERESIS
  both display E WITH OGONEK, SPACING DIAERESIS
  with a break between glyphs, as if it were two distinct editable grapheme
  clusters.

All of these should rather display E WITH OGONEK, MISSING NON-SPACING GLYPH
FOR COMBINING DIAERESIS
Isn't there a distinct glyph for missing glyphs representing spacing
diacritics, or not even a spacing glyph with a dotted circle? And grapheme
clusters are incorrectly mapped for editing in Outlook.



Re: Stability of WG2

2003-12-16 Thread Peter Kirk

On 16/12/2003 13:05, [EMAIL PROTECTED] wrote:

Peter Kirk scripsit:

 

On 16/12/2003 09:41, Curtis Clark wrote:

   

A measure of comparison is the system of biological nomenclature, ... 
(not to mention the periodic and sometimes raucous conventions when 
the rules are modified).

 

Probably the secret of its success is the existence of such conventions. 
   

*chuckle*

The first use of conventions above means meetings; the second means
rules.  Result: a non-meeting of the minds.
 

Not so! I intended such conventions as an explicit reference to the 
meetings which Curtis described, although I was also aware of the double 
meaning and deliberately didn't cancel it.

If biologists had insisted that names once assigned could not be changed 
because of advances in knowledge, or even to correct errors, then surely 
the system would have broken down centuries ago.
   

In fact, Linnaean names are *not* changed for either of those reasons,
nor for any other reason whatsoever: though we now know that Basilosaurus
is a proto-whale and not any sort of reptile, Basilosaurus it will
remain forever.
The only thing that can happen in Linnaean nomenclature is the recognition
that two names are synonymous.  In that case, there is a question which
shall be the preferred name, and normally it is the first name published,
but exceptions sometimes occur.  Thus when Brontosaurus and Apatosaurus
were found to be synonyms, Apatosaurus was chosen as the preferred
name because it was published first; however, this is not properly
describable as changing the name of Brontosaurus to 'Apatosaurus'.
Brontosaurus is a perfectly good name and may still be used even though
it is dispreferred.
 

I'm no expert on this... but I thought that species could be transferred 
from genus to genus as knowledge advances. And presumably obvious 
spelling mistakes are corrected (contrast FHTORA in U+1D0C5), or are 
you saying that if the first publication had Brontosuarus as a typo 
this error would remain for ever?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Michael Everson

At 00:35 +0100 2003-12-17, Philippe Verdy wrote:

 NO. There's no canonical equivalence between distinct pairs of
 characters, if the first letter of each pair are not also canonically
 equivalent.
 
 compare ę̈ with ę̈

 The first pair has e trema as its first letter, the second pair e ogonek.
 Yet these pairs are canonically equivalent.
True in the way you interpret my sentence, but when I say the first letter
of each pair, I mean the first non decomposable character of each pair. In
your example, both letters are simple e vowels.
e-diaeresis is decomposable to e + combining 
diaeresis. e-ogonek-diaeresis is decomposable to 
e + combining diaeresis + combining ogonek or to 
e + combining ogonek + combining diaeresis. The 
last two are equivalent.

Both dotted lowercase i and dotless lowercase i are not decomposable...
unlike dotted uppercase I...
small letter i and small letter dotless i are as different as t and thorn.
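
[Editorial sketch, not part of the original message.] Both points can be checked with Python's standard `unicodedata` module: the two e sequences normalize to the same string, while i + acute and dotless i + acute do not. The variable names and the `nfd` helper are just labels chosen for this illustration.

```python
import unicodedata


def nfd(text):
    """Canonical decomposition with canonical reordering of combining marks."""
    return unicodedata.normalize("NFD", text)


e_trema_ogonek = "\u00EB\u0328"  # e with diaeresis + combining ogonek
e_ogonek_trema = "\u0119\u0308"  # e with ogonek + combining diaeresis
i_acute = "\u0069\u0301"         # soft-dotted i + combining acute
dotless_acute = "\u0131\u0301"   # dotless i + combining acute

# Canonically equivalent: both decompose to e, ogonek, diaeresis.
print(nfd(e_trema_ogonek) == nfd(e_ogonek_trema))

# Not canonically equivalent: the base letters differ and neither decomposes.
print(nfd(i_acute) == nfd(dotless_acute))

# Dotted capital I, by contrast, does decompose (to I + combining dot above).
print([f"U+{ord(ch):04X}" for ch in nfd("\u0130")])
```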

Well Outlook 2000 is unable to represent any e with ogonek and trema of your
example.
Get a better browser.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

RE: Case mapping of dotless lowercase letters

2003-12-16 Thread Philippe Verdy

Peter Kirk writes:
 If it is the client software (browser etc) which resolves the casing, 
 then how it resolves it is essentially a local matter which doesn't need 
 to be standardised. But my recommendation would be that the mapping 
 followed the local language context, i.e. in general the system locale 
 except where overridden by language markup in the local context e.g. 
 when the URL is embedded in a document. That is, I would map to i, 
 unless the locale or markup language is tr or az in which case it would 
 map to dotless i. (There are actually a few other language orthographies 
 which use Turkic casing.) The alternative of using the Turkic mapping 
 for .tr and .az domains is possible but seems less desirable to me.
 
 If the casing is resolved by the nameserver, there is no alternative to 
 using the Turkic mapping only for .tr and .az domains.

Turkic case mappings are not usable in DNS and not even in IDNA, simply
because all legacy ASCII names must continue to resolve ASCII 'I'
identically with ASCII 'i' and not 'ı' (encoded with Punycode). This is
needed for upwards compatibility.

So even localized browsers will need to forbid mapping 'ı' as if it was
'I', and IDNA names containing 'ı' cannot be fully converted to uppercase,
even with Full case mappings, which will need to keep the lowercase
letter. This will be true also for the '.tr' and '.az' registries, unless these
registries adopt a policy requiring the reservation of domain names in
bundles. If this occurs, it will be the registry which will map domain names
containing 'i'=='I' identically to domain names containing a dotless
lowercase ı. For the case of the dotted uppercase I, separate allocation is
still possible, but it would be too easily spoofable as they can be too
easily entered on Turkic keyboards to spoof the soft-dotted lowercase i.

So I doubt that the .tr and .az registries will ever adopt a distinction between
dotted and undotted i in domain names; rather, they will ensure there is none
by adding bundle reservation policies if they ever implement IDNA. I doubt that
the Turkish and Azeri registries will resolve names in bundles with dotless-i
or dotted-I, as it would require server-side dynamic DNS capabilities, which
would also mean scalability problems. (The .fr registry has already rejected
the idea of resolving names reserved in bundles because of scalability
problems: some bundles may have thousands of equivalents and would be
difficult to support in fast static DNS servers. Only one canonical
name in the bundle will be resolved on DNS servers, the other names being
left reserved, until a standard solution is found to allow such resolution
in clients of these registries, using the bundle equivalence rules defined
by the specific IDNA bundle profile of each registry.)



Re: Case mapping of dotless lowercase letters

2003-12-16 Thread Kenneth Whistler

John Cowan noted:

quote
Here's what happens exactly:

 source      simple case folding   full case folding       tr/az case folding
 dotted i    dotted i              dotted i                dotted i
 dotless i   dotless i             dotless i               dotless i
 dotted I    dotted I              dotted i + comb. dot    dotted i
 dotless I   dotted i              dotted i                dotless i
/quote

Add to that specification of the case *folding* (from
CaseFolding.txt), the default case *mappings* (from
UnicodeData.txt):

 source        default lc mapping   default uc mapping
 dotted i      dotted i             (dotless) I
 dotless i     dotless i            (dotless) I
 dotted I      dotted i             dotted I
 (dotless) I   dotted i             (dotless) I
 
If you are case *folding* you are doing one thing; if you are
case *mapping* you are doing another.

Case *folding* creates equivalence classes for different sequences.

Simple case folding, as defined above, creates the following 
equivalence classes, adding in the sequences involving use of
the combining dot as well.

   A. { i, I }
   B. { dotless i }
   C. { dotted I }
   D. { <i, dot above>, <I, dot above> }
   E. { <dotless i, dot above> }
   F. { <dotted I, dot above> }
   
These 6 classes are distinguished. They do not conflate, although
in class A and in class D, there are two sequences which do fold
together.

Full case folding, as defined above, creates the following
equivalence classes.

   A. { i, I }
   B. { dotless i }
   G. { dotted I, <i, dot above>, <I, dot above> }
   E. { <dotless i, dot above> }
   F. { <dotted I, dot above> }
   
In other words, there are now 5, not 6 equivalence classes, as the
classes C and D from simple case folding have been conflated.

Turkic/Azeri case folding, as defined above, creates the following
equivalence classes.

   H. { i, dotted I }
   I. { dotless i, I }
   J. { <i, dot above>, <dotted I, dot above> }
   K. { <dotless i, dot above>, <I, dot above> }
   
And now there are 4 *different* equivalence classes, which group
together the sequences which make sense for Turkish/Azeri.
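
The Turkic/Azeri classes can be approximated in Python, which has no locale-aware casing in its standard library. The sketch below assumes that the two Turkic-specific foldings from the table quoted above (I to dotless i, dotted I to i) are applied first and that everything else goes through the built-in full case folding. `turkic_fold` is a hypothetical helper name.

```python
# The two Turkic-specific foldings; all other characters use str.casefold().
_TURKIC = str.maketrans({
    "I": "\u0131",      # (dotless) I -> dotless i
    "\u0130": "i",      # dotted I    -> dotted i
})

def turkic_fold(text: str) -> str:
    """Sketch of tr/az case folding: Turkic mappings first, then full folding."""
    return text.translate(_TURKIC).casefold()

# Class H: { i, dotted I }
class_h = {turkic_fold("i"), turkic_fold("\u0130")}

# Class I: { dotless i, I }
class_i = {turkic_fold("\u0131"), turkic_fold("I")}

print(class_h, class_i)
```

Each set collapses to a single folded form, and the two forms differ, which is the grouping that makes sense for Turkish and Azeri text.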

Note that none of the 3 sets of equivalence classes violates
*canonical* equivalence, because none of the 8 sequences involved
is canonically equivalent to any other. In other words, no matter
which of the 3 approaches you take to case folding, in no instance
are you claiming that canonically equivalent sequences are to be
interpreted differently.

Now let's look at what happens with case *mapping*, using the
default mappings of UnicodeData.txt.

Lowercasing first:

   L. { i, I, dotted I } --> i
   B. { dotless i }  --> dotless i
   M. { <i, dot above>, <I, dot above>, <dotted I, dot above> }
         --> <i, dot above>
   E. { <dotless i, dot above> } --> <dotless i, dot above>
   
Uppercasing next:

   N. { i, I, dotless i } --> I
   C. { dotted I }        --> dotted I
   O. { <i, dot above>, <I, dot above>, <dotless i, dot above> }
         --> <I, dot above>
   F. { <dotted I, dot above> } --> <dotted I, dot above>
   
The classes of sequences that get conflated are different here. In
particular, classes L, M, N, O conflate characters that are not
conflated by the formal definition of case folding.

So, in particular, one should *not* expect the results of case
mapping, followed by a binary comparison, to be the same as
a formal case folding comparison. There will be differences.
Any implementation that does not take this into account is still
confused (aren't we all?) in its handling of these letters.
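
The difference is easy to observe in Python, whose `str.upper()` and `str.lower()` apply locale-neutral case mappings and whose `str.casefold()` applies full case folding. This is a small illustration under those assumptions, not a statement about any other implementation.

```python
DOTLESS_I = "\u0131"      # LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Uppercasing conflates i and dotless i (class N above) ...
upper_equal = "i".upper() == DOTLESS_I.upper()

# ... but case folding keeps them apart (classes A and B).
fold_equal = "i".casefold() == DOTLESS_I.casefold()

# Full case folding of dotted I yields i followed by COMBINING DOT ABOVE.
folded_dotted = DOTTED_CAP_I.casefold()

print(upper_equal, fold_equal, [hex(ord(c)) for c in folded_dotted])
```

So a comparison written as `a.upper() == b.upper()` treats i and dotless i as the same letter, while `a.casefold() == b.casefold()` does not. Note also that Python's `lower()` uses the full mapping, so it returns i plus the combining dot for dotted I rather than the single-character default tabulated above.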

Now add to that the problem of which of the elements in the
equivalence classes *look* the same, and you have the potential
for even more confusion. In particular, in simple case folding,
you have the equivalence classes:

   A. { i, I }
   E. { <dotless i, dot above> }
  
Members of class E are *not* equivalent to members of class A.
But of course, <dotless i, dot above> *looks like* i and does
*not* look like I. Add in the others, plus all the potential
differences in how fonts may implement the soft-dotted
property, and this entire area can lead to total confusion.

One moral of the story is: DO NOT USE COMBINING DOTS WITH I's.

If you subtract out all the superfluous combinations cited above
with combining dots (for completeness), then the situation
becomes much simpler and more comprehensible:

Simple case folding. [disallows string length change]

   A. { i, I }
   B. { dotless i }
   C. { dotted I }
   
Full case folding.   [allows string length change]

   A. { i, I }
   B. { dotless i }
   G. { dotted I }  [represented in folded form as <i, dot above>]
   
Turkic/Azeri case folding.

   H. { i, dotted I }
   I. { dotless i, I }

Lowercasing:

   L. { i, I, dotted I } --> i
   B. { dotless i }      --> dotless i
   
Uppercasing:

   N. { i, I, dotless i } --> I
   C. { dotted I }        --> dotted I

Add in Turkic locale-specific special casing.

Lowercasing:

   H. { i, dotted

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread Kenneth Whistler

Correcting myself:

 Note that none of the 3 sets of equivalence classes violates
 *canonical* equivalence, because none of the 8 sequences involved
 is canonically equivalent to any other. In other words, no matter
 which of the 3 approaches you take to case folding, in no instance
 are you claiming that canonically equivalent sequences are to be
 interpreted differently.

Actually, dotted I *is* canonically equivalent to <I, dot above>.
(I overlooked that when compiling the summary.)

Hence the equivalence classes for simple case folding:

   C. { dotted I }
   D. { <i, dot above>, <I, dot above> }
   
*do* violate canonical equivalence. And that is the whole
reason for the separate definition of full case folding,
which defines the equivalence class:

   G. { dotted I, <i, dot above>, <I, dot above> }

which observes canonical equivalence, but which has the
drawback of string length change in case folding.
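
The correction can be checked with Python's `unicodedata` module. The simple-folding table below is a hand-written stand-in covering only the letters discussed in this thread (Python's standard library does not expose simple case folding), so `simple_fold` is a hypothetical helper.

```python
import unicodedata

DOTTED_CAP_I = "\u0130"   # precomposed dotted I
I_PLUS_DOT = "I\u0307"    # the sequence I, COMBINING DOT ABOVE

# Canonical equivalence: both spellings normalize to the same NFD sequence.
canon_equal = (unicodedata.normalize("NFD", DOTTED_CAP_I)
               == unicodedata.normalize("NFD", I_PLUS_DOT))

# Assumed simple folding for this thread's letters only: I -> i; dotted I and
# dotless i fold to themselves; the combining dot is left alone.
_SIMPLE = str.maketrans({"I": "i"})

def simple_fold(text: str) -> str:
    return text.translate(_SIMPLE)

# Simple folding separates the two canonically equivalent spellings (C vs. D) ...
simple_equal = simple_fold(DOTTED_CAP_I) == simple_fold(I_PLUS_DOT)

# ... while full folding puts them in one class (G), changing string length.
full_equal = DOTTED_CAP_I.casefold() == I_PLUS_DOT.casefold()
folded_length = len(DOTTED_CAP_I.casefold())

print(canon_equal, simple_equal, full_equal, folded_length)
```

The one-character input folding to a two-character result is the string length change mentioned above.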

--Ken

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread Chris Jacobs

- Original Message - 
From: Michael Everson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 9:00 PM
Subject: Re: Case mapping of dotless lowercase letters

 At 20:30 +0100 2003-12-16, Chris Jacobs wrote:

NO. There's no canonical equivalence between distinct pairs of
characters, if the first letter of each pair are not also canonically
equivalent.

 compare <ë, combining ogonek> with <ę, combining trema>

 The first pair has e trema as its first letter, the second pair e ogonek.
 Yet these pairs are canonically equivalent.

 The base letter is e

Nope. That would be the base char of their NFD.
The base chars of themselves are ë and ę.

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread John Cowan

Philippe Verdy scripsit:

 If we just remove any 0307 from the Turkic texts, there is absolutely no
 problem with Turkic CaseFolding, provided that we also define
 Turkic-specific uppercase mappings as done above, and don't use the default
 locale-neutral uppercase mappings of the UCD.

There's no reason to expect that there will be any 0307 whatever in
Turkish/Azeri texts: it's not a diacritic those languages use, AFAIK.

-- 
How they ever reached any conclusion at all    [EMAIL PROTECTED]
is starkly unknowable to the human mind.   http://www.reutershealth.com
--Backstage Lensman, Randall Garrett  http://www.ccil.org/~cowan

Re: Stability of WG2

2003-12-16 Thread John Cowan

Peter Kirk scripsit:

 I'm no expert on this... but I thought that species could be transferred 
 from genus to genus as knowledge advances. 

True enough, but the specific epithet remains the same, and the old names
are still "available" (as the jargon has it) though no longer "valid"
(what I was calling "preferred" in my previous post).  Linnaeus himself,
working with two different descriptions of chimps, split them into
Homo troglodytes and Simia satyrus (which latter also included bonobos
and orangutans); when the mistake was cleared up, the specific epithet
troglodytes, being the older, was retained for chimps, whereas bonobos
got satyrus, both now in the new genus Pan; orangs were moved to Pongo
and given the new epithet pygmaeus.  (There's now a move underfoot to
move all of these, plus gorillas, into Homo; I don't give it much chance,
though I think it's a cool idea.)

Nobody would call chimps Homo troglodytes, or orangs Simia satyrus,
today, but those names can't ever be assigned to other species in future.
(If chimps were folded into Homo, they would be H. troglodytes again.)

 And presumably obvious
 spelling mistakes are corrected (contrast FHTORA in U+1D0C5), or are 
 you saying that if the first publication had Brontosuarus as a typo 
 this error would remain for ever?

It depends.  If the article said "I dub this genus 'Brontosuarus', from
the Greek for 'thunder lizard'", then yes, it would be fixed.  But if
there isn't a positive *indication in the text of the original article*
that makes the error evident on its face, then 'Brontosuarus' it would be.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
Big as a house, much bigger than a house, it looked to [Sam], a grey-clad
moving hill.  Fear and wonder, maybe, enlarged him in the hobbit's eyes,
but the Mumak of Harad was indeed a beast of vast bulk, and the like of him
does not walk now in Middle-earth; his kin that live still in latter days are
but memories of his girth and his majesty.  --Of Herbs and Stewed Rabbit

Re: Case mapping of dotless lowercase letters

2003-12-16 Thread John Cowan

Kenneth Whistler scripsit:

 John Cowan noted:
 
 quote
 Here's what happens exactly:
 
  source      simple case folding   full case folding       tr/az case folding
  dotted i    dotted i              dotted i                dotted i
  dotless i   dotless i             dotless i               dotless i
  dotted I    dotted I              dotted i + comb. dot    dotted i
  dotless I   dotted i              dotted i                dotless i
 /quote

[snip]

 One moral of the story is: DO NOT USE COMBINING DOTS WITH I's.

A fine moral, indeed.  Unfortunately, full case folding generates such
things for downstream processes to trip over.  It's too late to fix
the RFCs, alas.

-- 
Where the wombat has walked,    John Cowan [EMAIL PROTECTED]
it will inevitably walk again.  http://www.ccil.org/~cowan
