RE: Case mapping of dotless lowercase letters
Doug Ewell [EMAIL PROTECTED] writes:

Wrong here: I have found occurrences of dotless lowercase i, used instead of soft-dotted lowercase i, as base letters for diacritics added above it (it was an acute accent...)

Don't do that.

What? This is VALID UNICODE to have texts coded like this. The proposed change for soft-dotted/dotless letters used with diacritics is still not in the standard, and it just gives rendering hints so that both base letters should have the same rendering, requiring the use of an explicit dot when the soft dot must be kept with the diacritic.

There were two sequences which looked apparently identical when rendered, and that were distinct after a case-folding compare check:
(1) LATIN SMALL LETTER I, COMBINING ACUTE ACCENT
(2) LATIN SMALL LETTER DOTLESS I, COMBINING ACUTE ACCENT
but were no longer distinct when converted to uppercase in a locale-neutral environment not using the Turkic rules:
(1') LATIN CAPITAL LETTER I, COMBINING ACUTE ACCENT
(2') LATIN CAPITAL LETTER I, COMBINING ACUTE ACCENT

OK, so you want the default, locale-neutral case mapping tables to equate U+0069 with U+0131, right?

Yes. And I have good reasons for that, coming from the fact that the default locale-neutral mapping tables already equate their uppercase versions U+0049 and U+0130, by returning U+0069 for both of them.

This is close to being a spoofing problem, though. See TUS 4.0, page 141.

If you think this is a spoofing problem, then the existing locale-neutral full case mapping of U+0130 is bogus and should not be U+0069.

The string (2) may have been produced to avoid displaying the dot with some fonts that don't apply the soft-dotted rule when there's an additional diacritic above...

Don't do that. That's misusing the standard. The font should be fixed instead.

For whatever reason, encoded texts exist before correct fonts are used to render them.
So there do exist texts which use dotless lowercase i before a diacritic above, simply because the author of the text did not want it to be rendered with a superposed dot. These texts are clearly not Turkic (in Turkish or Azeri, the dot of the soft-dotted i should have been displayed with the diacritic above it, and the dotless i should have been used to avoid it explicitly). But this is not the only reason; I can give other examples which also have security and filesystem impacts. Suppose you have a database of user names or file names allowing internationalized names coded along the recommended Unicode principles. But these names are used in a way that makes it impossible to track the language in which they are entered (filenames, user names, or address fields in an entry form are such cases). Now provide a facility that identifies and avoids duplicate case-equivalents, using full mappings. Because you can't track the language, you'll need to use the default locale-neutral full case mappings. Now a Turkish user enters a name or address in an entry form, or creates files with dotless lowercase i in the name, and later attempts to re-enter its case equivalent, (dotless) uppercase I. The system will not identify the two as case equivalents, so it will accept both as if they were distinct. The Turkish user or the system then attempts to list files or database table fields matching some regular expression like i* with the case-insensitive option, to count the number of occurrences of names containing a (soft-)dotted i (or I). He will get all files containing one of three codes, and not the fourth one.
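The asymmetry described in this message can be checked against one implementation. The sketch below uses Python's built-in `str.upper()` and `str.casefold()`, which apply the default (untailored) Unicode mappings. The helper name `is_duplicate` and the sample strings are illustrative assumptions, not anything taken from the thread, and other libraries may behave differently.

```python
# Sequences (1) and (2) from the message above.
SEQ1 = "\u0069\u0301"  # LATIN SMALL LETTER I + COMBINING ACUTE ACCENT
SEQ2 = "\u0131\u0301"  # LATIN SMALL LETTER DOTLESS I + COMBINING ACUTE ACCENT


def is_duplicate(a: str, b: str) -> bool:
    """Hypothetical duplicate check based on default full case folding."""
    return a.casefold() == b.casefold()


# The two sequences stay distinct under case folding ...
print(is_duplicate(SEQ1, SEQ2))      # False
# ... but collapse once both are uppercased without Turkic tailoring.
print(SEQ1.upper() == SEQ2.upper())  # True

# The Turkish scenario: dotless i and its uppercase form are not
# recognised as case-equivalent by the untailored check.
print(is_duplicate("\u0131", "I"))   # False
```

A system that compares by folding but stores or displays uppercased names would therefore see two records where a reader sees one.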
Re: Stability of WG2 (was: Re: [OT] CJK - CJC)
At 19:13 -0800 2003-12-15, Doug Ewell wrote: The North Korean and Chinese national bodies have already made proposals that violate both the letter and spirit of stability policies. Yes. And we have rejected them. I'm glad the U.S. national body will stay involved, but having to rely on that does sound a bit like having to rely on enlightened statesmen, doesn't it? Better than if the whole thing were just left to the employees of large companies, Doug. We have good checks and balances. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Stability of WG2
On 15/12/2003 16:57, Doug Ewell wrote: ... I'm not saying Peter is right, that this WILL happen, just trying to articulate his point that the possibility in the future is greater than nil. I didn't say that it WILL happen either, just that it might happen (and, later, that some changes might be desirable). ... It seems clear that the current enlightened WG2 membership is committed to both the letter and spirit of the current stability policy (to the dismay of Peter, who would like to see certain changes in names, combining classes, etc.). But there is really no way we can predict whether the eventual successors to Ken, Michael, Rick, Michel, etc. will share the same commitment. Remember that most of us once believed in the stability of ISO 3166 as well. Good point. Remember that the predicted life of Unicode (recently predicted by Michael, anyway) is longer than the lifetime of the current WG2 members, longer even than the US Constitution (so far), the figure of 1000 years was mentioned. Even if this is a millennial reign of peace and prosperity, processes of language change will not stop. A list of character names from 1000 years ago, even from 400 years ago, would look very strange today. Surely long before then the members of the successor body to WG2 will realise that the Unicode 4.0 list of character names, and probably also a lot of other things in Unicode which are now considered stable, require major updates. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Stability of WG2
On 15/12/2003 22:00, Christopher John Fynn wrote: Doug Ewell [EMAIL PROTECTED] The North Korean and Chinese national bodies have already made proposals that violate both the letter and spirit of stability policies. Fortunately they each have only one vote in WG2. - Chris But isn't that enough to outvote the US body? -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Case mapping of dotless lowercase letters
At 11:03 +0100 2003-12-16, Philippe Verdy wrote: Doug Ewell [EMAIL PROTECTED] writes: Wrong here: I have found occurences of dotless lowercase i, used instead of soft-dotted lowercase i, as base letters for diacritics added above it (it was an accute accent...) Don't do that. What? This is VALID UNICODE to have texts coded like this. In Irish, it is INCORRECT to spell físeán 'video' with a DOTLESS I + COMBINING ACUTE. It is a spelling error, and will fail in spell-checking. The correct spelling is either I + COMBINING ACUTE or precomposed I WITH ACUTE. It is VALID UNICODE to follow LATIN CAPITAL LETTER Q with DEVANAGARI VOWEL SIGN E but that doesn't mean it's the right way to write anything. For whatever reason, encoded texts exist before correct fonts are used to render them. So there does exist texts which use dotless lowercase i before a diacritic above, simply because the author of the text did not want it to be rendered with a superposed dot. Texts which contain spelling errors. Or old IPA texts using any number of ad-hoc IPA font solutions. Those texts have to be transcoded to proper Unicode at some stage. What you suggest is Not Recommended. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Stability of WG2
At 03:03 -0800 2003-12-16, Peter Kirk wrote: The North Korean and Chinese national bodies have already made proposals that violate both the letter and spirit of stability policies. Fortunately they each have only one vote in WG2. But isn't that enough to outvote the US body? Not with Ireland and Japan standing with the US on such an issue. ;-) We really must get the UK back into SC2 ;-) -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Case mapping of dotless lowercase letters
Michael Everson wrote: In Irish, it is INCORRECT to spell físeán 'video' with a DOTLESS I + COMBINING ACUTE. It is a spelling error, and will fail in spell-checking. The correct spelling is either I + COMBINING ACUTE or precomposed I WITH ACUTE. Isn't the sequence dotless i + combining acute canonically equivalent to dotted i + combining acute? Stefan
Re: Case mapping of dotless lowercase letters
At 13:00 +0100 2003-12-16, Stefan Persson wrote: Michael Everson wrote: In Irish, it is INCORRECT to spell físeán 'video' with a DOTLESS I + COMBINING ACUTE. It is a spelling error, and will fail in spell-checking. The correct spelling is either I + COMBINING ACUTE or precomposed I WITH ACUTE. Isn't the sequence dotless i + combining acute canonically equivalent to dotted i + combining acute? It is not. -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Case mapping of dotless lowercase letters
This occurred to me even before I read Philippe's email. Since {U+0069} is not canonically equivalent to {U+0131}{U+0307}, I don't see anything to stop me from registering the domain name "un{U+0131}{U+0307}code.org", for example. It is in NFC, after all. Jill

-Original Message-
From: Philippe Verdy [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 16, 2003 2:21 AM
To: Doug Ewell
Cc: [EMAIL PROTECTED]
Subject: RE: Case mapping of dotless lowercase letters

Doug Ewell wrote: I detected it after it produced a security bug (a user record was unexpectedly updated on my database...)
Re: [OT] CJK - CJC (Re: Corea?)
On Mon, 15 Dec 2003, Doug Ewell wrote: Jungshik Shin jshin at mailaps dot org wrote: If those 20 assemblymen have time and energy to deal with this foolish name change business, they had better push for a bill to If those 20 assemblymen really think a name change will boost national identity and pride, shouldn't they be trying to persuade English speakers to say Taehan Minguk instead? No, that's not only even sillier (as we'd all agree) but also is incorrect because 'Taehan Minguk' does not mean Korea but specifically mean 'Republic of Korea' that was founded in 1948. Moreover, North Koreans would prefer 'Chosun' to 'Hanguk' (Using 'Taehan Minguk' is obviously out of question to them). Using 'Korea' (English name) is a rather convenient way to work around the difference (between two Koreas). Jungshik
Re: Stability of WG2
On 16/12/2003 03:35, Michael Everson wrote: At 03:03 -0800 2003-12-16, Peter Kirk wrote: The North Korean and Chinese national bodies have already made proposals that violate both the letter and spirit of stability policies. Fortunately they each have only one vote in WG2. But isn't that enough to outvote the US body? Not with Ireland and Japan standing with the US on such an issue. ;-) We really must get the UK back into SC2 ;-) Even at the risk of finding me evening up the vote? ;-) Seriously, can you remind us briefly what the situation is, why there is no current UK representation? -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: [OT] Euro-English (was: Corea? (Re: Swastika to be banned by Microsoft?)
On Mon, 15 Dec 2003, Philippe Verdy wrote: But you may see one day their national airways renamed Corean Airlines, or its main standard body renamed CSC... There's no national airline in South Korea. Korean Air has been private for more than two decades and has been competing with Asiana Airlines in both domestic routes and int'l routes for over a decade. As for the ROK standard body, it's not KSC. KS C is just a section in KS (Korean Standard) for electric and electronic technology. KS C used to cover IT as well but in 1997-98, IT was moved to a new section 'X', which is why KS C 5601 was renamed KS X 1001. Jungshik
Re: Stability of WG2
At 04:36 -0800 2003-12-16, Peter Kirk wrote: Seriously, can you remind us briefly what the situation is, why there is no current UK representation? I will answer this off-line. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Stability of WG2 [OT]
On Tuesday, 16 December 2003 at 11:53, Peter Kirk wrote: A list of character names from 1000 years ago, even from 400 years ago, would look very strange today. If they were made by scientists of the Western culture, they were Latin; such names look in no way strange, as biology and medicine show. Maybe English in 1000 years will be something like Latin today, and the LATIN CAPITAL LETTER A will have its name unchanged as well as the dorsa spinalis in anatomy. -- Karl Pentzlin ACS Analysis Consulting Software GmbH München, Germany
Re: Stability of WG2
At 02:53 -0800 2003-12-16, Peter Kirk wrote: Good point. Remember that the predicted life of Unicode (recently predicted by Michael, anyway) is longer than the lifetime of the current WG2 members My point is that the work we do identifying characters and encoding them won't have to be done again. Once Manichaean is encoded, it's encoded. One day, 200 years from now, there may be some Puricode revision which will do away with some of the duplicate encodings which we have for various legacy and round-trip requirements. But that will not invalidate our work today. Even if this is a millennial reign of peace and prosperity, processes of language change will not stop. A list of character names from 1000 years ago, even from 400 years ago, would look very strange today. Nothing stops you from publishing a list of character names in proper English, in Portuguese, or on some Inglish which may exist a long time from now. Currently those strings are required to be changeless for stability. So we do not change them, as long as that requirement remains, which the vendors say it is. -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Case mapping of dotless lowercase letters
Since {U+0069} is /not/ canonically equivalent to {U+0131}{U+0307}, I don't see anything to stop me from registering the domain name un{U+0131}{U+0307}code.org, for example. It /is/ in NFC, after all. Do we have Unicode DNS yet? I know there's stuff out there passing UTF-8 around, but is this formalised yet? But yes, {U+0131}{U+0307} can look awfully similar to {U+0069}, I think {U+0069} {U+0307} would as well (and of course there are other opportunities for visual confusion unrelated to the U+0069 and U+0131). -- Jon Hanna | Toys and books http://www.hackcraft.net/ | for hospitals: | http://santa.boards.ie
RE: Stability of WG2
Speaking as a Brit, I would like to know the answer to this one too. What's the problem with answering online? And if you're really not going to answer this online, you could have just emailed Peter privately, instead of telling the whole list that you're going to keep the answer secret from all of us except Peter. What a wind up! Jill

-Original Message-
From: Michael Everson [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 16, 2003 12:49 PM
To: [EMAIL PROTECTED]
Subject: Re: Stability of WG2

At 04:36 -0800 2003-12-16, Peter Kirk wrote: Seriously, can you remind us briefly what the situation is, why there is no current UK representation? I will answer this off-line. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Case mapping of dotless lowercase letters
Arcane Jill scripsit: Since {U+0069} is /not/ canonically equivalent to {U+0131}{U+0307}, I don't see anything to stop me from registering the domain name un{U+0131}{U+0307}code.org, for example. It /is/ in NFC, after all.

You can (or rather, you will be able to when internationalized domain names become a reality). But in fact you have to use case folding plus NFKC, and there is a list of forbidden characters as well. See RFCs 3454 and 3491 for the exact rules.

-- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan
There is no real going back. Though I may come to the Shire, it will not seem the same; for I shall not be the same. I am wounded with knife, sting, and tooth, and a long burden. Where shall I find rest? --Frodo
RE: Case mapping of dotless lowercase letters
Do we have Unicode DNS yet?

Yup. You can put Chinese letters in domain names now. You do it like this:
(1) Convert to NFC
(2) Encode in UTF-8
(3) Replace all reserved characters (space, %, etc.) with the three-character string "%hh" (where hh is hex for the substituted character)
(4) Now similarly replace all bytes > 0x7F with the three-character string "%hh" (where hh is hex for the substituted byte)

But yes, {U+0131}{U+0307} can look awfully similar to {U+0069}, I think {U+0069} {U+0307} would as well (and of course there are other opportunities for visual confusion unrelated to the U+0069 and U+0131).

Yeah, I thought of that. Yuk. The whole issue of spoof detection is an absolute nightmare. There are some things you can do to help, though: security-conscious applications could use fonts in which 0 looks different from O, and in which 1 looks different from l; different scripts could be displayed in different colors; a warning dialog could be presented to the user if any character is a compatibility character, and so on. But NONE of these tricks will catch the distinction between U+0069 and U+0307. Both are letters, both are in the Latin script, neither is a compatibility character, etc. Automation can only go so far. Eventually, you're left with only one choice: to advise the user, "Never click on a hyperlink. Instead, always type in the URL by hand". Trouble is, such advice is more trouble than it's worth, and would kill the fluidity of the internet. Jill
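The four steps listed in this message can be sketched with Python's standard library. This is only an illustration of the escaping as described; the function name `escape_component` is an assumption, and the follow-up message in the thread questions whether this escaping applies to the domain portion of a URI at all.

```python
import unicodedata
from urllib.parse import quote


def escape_component(text: str) -> str:
    """Hypothetical helper: NFC-normalise, then percent-encode as UTF-8."""
    # Step 1: convert to NFC.
    nfc = unicodedata.normalize("NFC", text)
    # Steps 2-4: quote() encodes to UTF-8 and replaces reserved characters
    # and every non-ASCII byte with %hh. safe="" also escapes "/".
    return quote(nfc, safe="")


# Decomposed input (i + combining acute, a + combining acute).
print(escape_component("fi\u0301sea\u0301n"))  # f%C3%ADse%C3%A1n
print(escape_component("a b%"))                # a%20b%25
```

Because normalisation happens first, the decomposed and precomposed spellings of the same word produce the same escaped string.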
RE: Case mapping of dotless lowercase letters
Quoting Arcane Jill [EMAIL PROTECTED]: Do we have Unicode DNS yet? Yup. You can put Chinese letters in domain names now. You do it like this: (1) Convert to NFC (2) Encode in UTF-8 (3) Replace all reserved characters (space, %, etc.) with the three-character string %hh (where hh is hex for the substituted character) (4) Now similarly replace all bytes > 0x7F with the three-character string %hh (where hh is hex for the substituted byte)

I know that this is done with Internationalised URIs, but does this work in the domain portion as well? I thought the DNS rules still prohibited it, although the URI rules don't (the inverse of how URIs are case-sensitive but the DNS portion isn't treated as such when dereferencing).

Eventually, you're left with only one choice - to advise the user: Never click on a hyperlink. Instead, always type in the URL by hand. Trouble is, such advice is more trouble than it's worth, and would kill the fluidity of the internet.

Or click on whatever hyperlinks you like, but have the hatches battened down and don't assume you are where you appear to be. I like to summarise security advice thusly: if you trust my advice on security you're starting with completely the wrong attitude :) -- Jon Hanna | Toys and books http://www.hackcraft.net/ | for hospitals: | http://santa.boards.ie
Re: Case mapping of dotless lowercase letters
On Dec 16, 2003, at 4:27 AM, Michael Everson wrote: At 11:03 +0100 2003-12-16, Philippe Verdy wrote: Doug Ewell [EMAIL PROTECTED] writes: Wrong here: I have found occurences of dotless lowercase i, used instead of soft-dotted lowercase i, as base letters for diacritics added above it (it was an accute accent...) Don't do that. What? This is VALID UNICODE to have texts coded like this. In Irish, it is INCORRECT to spell físeán 'video' with a DOTLESS I + COMBINING ACUTE. It is a spelling error, and will fail in spell-checking. The correct spelling is either I + COMBINING ACUTE or precomposed I WITH ACUTE. Michael is, of course, correct. The problem here is that in books on Latin typography from the not too distant past, such as those by Robin Williams (the other one), recommend using dotless-i + accent for precisely this reason that the dot would otherwise collide with the accent. Ms Williams was working in an environment, however, where all kinds of hacks were needed for non-international software like Quark to do the fancy stuff typographers wanted to do. A lot of the old typography tricks are being obsoleted by Unicode, OpenType/AAT/Graphite, and should no longer be adhered to. John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage..mac.com/jhjenkins/
RE: Case mapping of dotless lowercase letters
Michael Everson wrote: At 11:03 +0100 2003-12-16, Philippe Verdy wrote: Doug Ewell [EMAIL PROTECTED] writes: Wrong here: I have found occurences of dotless lowercase i, used instead of soft-dotted lowercase i, as base letters for diacritics added above it (it was an accute accent...) Don't do that. What? This is VALID UNICODE to have texts coded like this. In Irish, it is INCORRECT to spell físeán 'video' with a DOTLESS I + COMBINING ACUTE. It is a spelling error, and will fail in spell-checking. The correct spelling is either I + COMBINING ACUTE or precomposed I WITH ACUTE. Spelling was not the issue there. Only Unicode validity. For whatever reason, encoded texts exist before correct fonts are used to render them. So there does exist texts which use dotless lowercase i before a diacritic above, simply because the author of the text did not want it to be rendered with a superposed dot. Texts which contain spelling errors. Or old IPA texts using any number of ad-hoc IPA font solutions. Those texts have to be transcoded to proper Unicode at some stage. What you suggest is Not Recommended. Not recommanded but still valid (and actually used in Turkish as well!), and used in some occasions because of defects in fonts that don't have a precomposed glyph for letter i with the diacritic but have a separate glyph for the combining diacritic and for the dotted and dotless letters i, or that use renderers unable to remove the soft dot. The IPA-93 font is such one, which allows good typesetting, but which needs glyph processing to select the appropriate base letter. My main issue is, however with Turkish names found in environments where language identification is not possible (for example a simple filename or a locale-neutral database field or an international HTML form which requests user names and use them as case insensitive identifiers); lowercase dotless i do not work appropriately there. 
I think it is completely illogical for case-insensitive comparisons to match together the three letters:
LATIN SMALL LETTER I (dotted)
LATIN CAPITAL LETTER I (dotless)
LATIN CAPITAL LETTER I WITH DOT ABOVE
but not:
LATIN SMALL LETTER DOTLESS I
when using locale-neutral comparisons, given that the normative uppercase mapping of this fourth letter is the second letter above. I'm sorry that nobody wants to admit it, and that this is a security issue: it causes problems for applications which expect that two strings that differ under case-insensitive comparison will still differ after being converted to lowercase, uppercase, or titlecase.
RE: Stability of WG2
Jill, Speaking as an Austrian, I don't care why the UK does not participate in SC2/WG2. But I DO appreciate the information that I am not going to see an answer to this question. Please be kind to Michael. Regards Arnold

From: Arcane Jill [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 8:41 AM
To: [EMAIL PROTECTED]
Subject: RE: Stability of WG2

Speaking as a Brit, I would like to know the answer to this one too. What's the problem with answering online? And if you're really not going to answer this online, you could have just emailed Peter privately, instead of telling the whole list that you're going to keep the answer secret from all of us except Peter. What a wind up! Jill

-Original Message-
From: Michael Everson [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 16, 2003 12:49 PM
To: [EMAIL PROTECTED]
Subject: Re: Stability of WG2

At 04:36 -0800 2003-12-16, Peter Kirk wrote: Seriously, can you remind us briefly what the situation is, why there is no current UK representation? I will answer this off-line. -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Case mapping of dotless lowercase letters
Stefan Persson writes: Isn't the sequence dotless i + combining acute canonically equivalent to dotted i + combining acute? NO. There's no canonical equivalence between distinct pairs of characters if the first letters of the two pairs are not also canonically equivalent.
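The answer above can be verified mechanically: two strings are canonically equivalent when their NFD (or NFC) normalisations are identical. The following sketch uses Python's `unicodedata` module; the function name `canonically_equivalent` and the constant names are assumptions made for illustration.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings by their canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


DOTTED_ACUTE = "\u0069\u0301"   # i + COMBINING ACUTE ACCENT
DOTLESS_ACUTE = "\u0131\u0301"  # dotless i + COMBINING ACUTE ACCENT
PRECOMPOSED = "\u00ed"          # LATIN SMALL LETTER I WITH ACUTE

print(canonically_equivalent(DOTTED_ACUTE, PRECOMPOSED))    # True
print(canonically_equivalent(DOTLESS_ACUTE, DOTTED_ACUTE))  # False

# Dotless i + COMBINING DOT ABOVE is not equivalent to plain i either,
# and it is already in NFC, so normalisation leaves it untouched.
print(canonically_equivalent("\u0131\u0307", "i"))                     # False
print(unicodedata.normalize("NFC", "\u0131\u0307") == "\u0131\u0307")  # True
```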
Re: [OT] CJK - CJC (Re: Corea?)
Jungshik Shin jshin at mailaps dot org wrote: If those 20 assemblymen really think a name change will boost national identity and pride, shouldn't they be trying to persuade English speakers to say Taehan Minguk instead? No, that's not only even sillier (as we'd all agree) but also is incorrect because 'Taehan Minguk' does not mean Korea but specifically mean 'Republic of Korea' that was founded in 1948. Moreover, North Koreans would prefer 'Chosun' to 'Hanguk' (Using 'Taehan Minguk' is obviously out of question to them). Using 'Korea' (English name) is a rather convenient way to work around the difference (between two Koreas). Sorry, I was under the impression that this name thing was specifically a South Korean idea. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
G-Strings
There was talk recently on this list of mapping grapheme clusters to the PUA (for application internal use only, obviously, not for export to the real world). I actually did this recently, though my app ended up in an incomplete state since I got bored and moved onto something else. The algorithm worked though, so I present it here and place it in the public domain, licence free, for anyone to use who wants to do so. Such an encoded string I called a "grapheme string", or "gstring" for short. Of course, that was before "grapheme" was renamed as "default grapheme cluster", so the name doesn't work quite as well now. The range of characters I reserved for my private use actually consisted of the surrogate codepoints, not the PUA codepoints. I reasoned that the PUA area might actually be being used for something (else), but the surrogate codepoints were illegal and therefore available. Despite the fact that the number of possible graphemes is infinite, I never actually ran out of codepoints. Here's the algorithm in pseudo-code:

    // The following are static and global
    max_word                (a 16-bit unsigned integer, initially the lowest codepoint you reserve
                             (e.g. the start of the PUA) minus one)
    map_grapheme_to_word[]  (a mapping from grapheme (=array of codepoints) to 16-bit word, initially empty)
    map_word_to_grapheme[]  (a mapping from 16-bit word to grapheme, initially empty)

    // Convert unicode text to internal representation with one 16-bit word per grapheme
    // -- input (text_unicode) is an array of codepoints (ie. it has already been decoded from UTF-whatever)
    // -- output (text_internal) is an array of 16-bit words, each representing one grapheme.
    //    THIS STRING MAY NEVER BE EXPORTED.
    text_internal = "";
    for (each grapheme in text_unicode)   // each grapheme is a substring of one or more codepoints
    {
        grapheme = convert_to_NFC(grapheme);
        if (num_codepoints(grapheme) == 1 && codepoint_of(grapheme) < 0x10000)
        {
            text_internal += codepoint_of(grapheme);
        }
        else
        {
            if (!exists(map_grapheme_to_word[grapheme]))
            {
                if (max_word is still below the top of the reserved range)
                {
                    map_grapheme_to_word[grapheme] = ++max_word;
                    map_word_to_grapheme[max_word] = grapheme;
                }
                else
                {
                    // Whoa!! Ran out of reserved characters! Could add error handling here.
                    text_internal += U+FFFD;
                    continue;   // nothing was mapped, so skip the lookup below
                }
            }
            text_internal += map_grapheme_to_word[grapheme];
        }
    }
    return text_internal;

    // The converse process
    text_unicode = "";
    for (each word in text_internal)
    {
        if (word in reserved range)   // e.g. PUA but doesn't have to be
        {
            if (exists(map_word_to_grapheme[word]))
            {
                text_unicode += map_word_to_grapheme[word];
            }
            else
            {
                // error - should never happen
                text_unicode += U+FFFD;
            }
        }
        else
        {
            text_unicode += word;
        }
    }
    return text_unicode;

Jill
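The pseudo-code above can be turned into a small runnable sketch. The version below is one possible reading, not Jill's actual application: it assumes the BMP Private Use Area (U+E000 to U+F8FF) as the reserved range, approximates grapheme segmentation as "a base character plus any following combining marks" (real default grapheme clusters are more involved), and routes any PUA codepoint that occurs in the input through the mapping table so that it cannot collide with an allocated word. All names (`GStringCodec`, `clusters`, and so on) are hypothetical.

```python
import unicodedata

PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # assumed reserved range
REPLACEMENT = 0xFFFD


def clusters(text):
    """Crude segmentation: a base character plus following combining marks."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


class GStringCodec:
    def __init__(self):
        self.max_word = PUA_FIRST - 1
        self.to_word = {}     # cluster -> 16-bit word
        self.to_cluster = {}  # 16-bit word -> cluster

    def encode(self, text):
        """Return a list of 16-bit words, one per cluster."""
        words = []
        for cluster in clusters(text):
            cluster = unicodedata.normalize("NFC", cluster)
            cp = ord(cluster) if len(cluster) == 1 else None
            if cp is not None and cp < 0x10000 and not PUA_FIRST <= cp <= PUA_LAST:
                words.append(cp)  # fits in one word as-is
                continue
            if cluster not in self.to_word:
                if self.max_word >= PUA_LAST:
                    words.append(REPLACEMENT)  # reserved range exhausted
                    continue
                self.max_word += 1
                self.to_word[cluster] = self.max_word
                self.to_cluster[self.max_word] = cluster
            words.append(self.to_word[cluster])
        return words

    def decode(self, words):
        """Inverse of encode(), valid only for words from this codec."""
        out = []
        for word in words:
            if PUA_FIRST <= word <= PUA_LAST:
                out.append(self.to_cluster.get(word, chr(REPLACEMENT)))
            else:
                out.append(chr(word))
        return "".join(out)


codec = GStringCodec()
internal = codec.encode("q\u0307x")  # q + COMBINING DOT ABOVE has no precomposed form
print([hex(w) for w in internal])    # ['0xe000', '0x78']
print(codec.decode(internal) == "q\u0307x")  # True
```

Since each cluster is normalised to NFC before lookup, decoding returns the NFC form of the input rather than the original codepoint sequence.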
WG2 - anyone from the UK interested?
There seems to be at least some interest in re-establishing the UK character encoding committee which contributed to ISO/IEC JTC1/SC2/WG2 10646. Anyone in Britain (or British) who might be interested in participating, please let me know ASAP. Thanks - Chris == Christopher Fynn 4 Chester Court 84 Salusbury Road London NW6 6PA

- Original Message -
From: Elaine Keown [EMAIL PROTECTED]
To: Michael Everson [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 2:56 PM
Subject: Re: Stability of WG2

Elaine Keown in Austin Hi, Not with Ireland and Japan standing with the US on such an issue. ;-) We really must get the UK back into SC2 ;-) Is this another joke?--Elaine
RE: Case mapping of dotless lowercase letters
At 16:48 +0100 2003-12-16, Philippe Verdy wrote: Michael Everson wrote: At 11:03 +0100 2003-12-16, Philippe Verdy wrote: Doug Ewell [EMAIL PROTECTED] writes: Wrong here: I have found occurences of dotless lowercase i, used instead of soft-dotted lowercase i, as base letters for diacritics added above it (it was an accute accent...) Don't do that. What? This is VALID UNICODE to have texts coded like this. In Irish, it is INCORRECT to spell físeán 'video' with a DOTLESS I + COMBINING ACUTE. It is a spelling error, and will fail in spell-checking. The correct spelling is either I + COMBINING ACUTE or precomposed I WITH ACUTE. Spelling was not the issue there. Only Unicode validity. Apparently you should look up the word valid. Any character can follow any other character and be valid. Any combining character can be applied to any base character, regardless of script. Texts which contain spelling errors. Or old IPA texts using any number of ad-hoc IPA font solutions. Those texts have to be transcoded to proper Unicode at some stage. What you suggest is Not Recommended. Not recommanded but still valid (and actually used in Turkish as well!) Case folding in Turkish and Azeri is DIFFERENT from everywhere else and you have to have a local tailoring for it. used in some occasions because of defects in fonts that don't have a precomposed glyph for letter i with the diacritic but have a separate glyph for the combining diacritic and for the dotted and dotless letters i, or that use renderers unable to remove the soft dot. What defects there are in FONTS without UNICODE CMAPS is of no concern to us. The IPA-93 font is such one, which allows good typesetting, but which needs glyph processing to select the appropriate base letter. It isn't a Unicode font, and so it doesn't matter. Data represented in it has to be transcoded to Unicode, and the font has to have the right thing in it. 
My main issue is, however with Turkish names found in environments where language identification is not possible (for example a simple filename or a locale-neutral database field or an international HTML form which requests user names and use them as case insensitive identifiers); lowercase dotless i do not work appropriately there. Oh well. I think it is completely illogical to match together with case-insensitive compares, the three letters: LATIN SMALL LETTER I (dotted) LATIN CAPITAL LETTER I (dotless) LATIN CAPITAL LETTER I WITH DOT ABOVE but not: LATIN SMALL LETTER DOTLESS I when use locale-neutral compares, given that the normative uppercase mapping of this fourth letter is the second letter above. That is not what happens in locale-neutral comparisons, I believe. -- Michael Everson * * Everson Typography * * http://www.evertype.com
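Readers who want to check the disputed claim about the four letters can tabulate what one implementation does. The sketch below prints the default uppercase, lowercase, and case-folded forms produced by Python's `str` methods, which apply the untailored Unicode mappings. The names `LETTERS` and `code_points` are assumptions, and other implementations (for example ones using simple rather than full mappings) may give different results.

```python
LETTERS = {
    "LATIN SMALL LETTER I": "\u0069",
    "LATIN CAPITAL LETTER I": "\u0049",
    "LATIN CAPITAL LETTER I WITH DOT ABOVE": "\u0130",
    "LATIN SMALL LETTER DOTLESS I": "\u0131",
}


def code_points(s: str) -> str:
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


for name, ch in LETTERS.items():
    print(f"{name}: upper={code_points(ch.upper())} "
          f"lower={code_points(ch.lower())} fold={code_points(ch.casefold())}")

# Number of distinct case-folded forms among the four letters.
print(len({ch.casefold() for ch in LETTERS.values()}))  # 3
```

In this implementation only U+0069 and U+0049 share a folded form: U+0130 folds to U+0069 U+0307 and U+0131 folds to itself, even though U+0131 uppercases to U+0049.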
RE: Case mapping of dotless lowercase letters
Since {U+0069} is /not/ canonically equivalent to {U+0131}{U+0307}, I don't see anything to stop me from registering the domain name un{U+0131}{U+0307}code.org, for example. It /is/ in NFC, after all.

You can (or rather, you will be able to when internationalized domain names become a reality). But in fact you have to use case folding

Yes. And as it happens, dotless-i case-*folds* to (soft)dotted-i, so you cannot register an IDN that after nameprep has a dotless-i in it, since that name isn't correctly nameprepped. This does not guard against (soft)dotted-i, dot-above, but for the registered part of a domain name, registrars are *supposed* to have some rules for what is allowed, and what is not (for that particular registrar). E.g. the Swedish domain name registry *currently* allows only ASCII letters plus åäöé (after nameprep) in domain names they register, though this may be somewhat augmented in the future (to cover Sami too at least, maybe more). This kind of solution was driven mainly by the issue of the Traditional Chinese vs. Simplified Chinese problem, but that approach applies to cases like dotless i, dot-above too.

plus NFKC, and there is a list of forbidden characters as well. See RFCs 3454 and 3491 for the exact rules.

No letter is forbidden (though several are case-folded to the same letter), nor is any 'graphic' combining mark. /kent k
Re: Case mapping of dotless lowercase letters
Kent Karlsson wrote: This kind of solution was driven mainly by the issue of the traditional chinese vs. simplified chinese problem, but that approach applies to cases like dotless i, dot-above too. Do you mean that people were afraid that someone would register e.g. .com, while someone else would register .com? Stefan
Re: Stability of WG2
on 2003-12-16 02:53 Peter Kirk wrote: Even if this is a millennial reign of peace and prosperity, processes of language change will not stop. A measure of comparison is the system of biological nomenclature, which has maintained stability of names in the face of increasing knowledge of organisms over a period of a quarter of a millennium. There are no ISO standards for scientific names--the system has succeeded through consensus, by biologists agreeing that a stable system is worth the trade of quite a bit of individualism (not to mention the periodic and sometimes raucous conventions when the rules are modified). -- Curtis Clark http://www.csupomona.edu/~jcclark/ Mockingbird Font Works http://www.mockfont.com/
Re: Case mapping of dotless lowercase letters
Michael Everson scripsit: [Philippe Verdy scripsisset:] I think it is completely illogical to match together with case-insensitive compares, the three letters: LATIN SMALL LETTER I (dotted) [U+0069] LATIN CAPITAL LETTER I (dotless) [U+0049] LATIN CAPITAL LETTER I WITH DOT ABOVE [U+0130] but not: LATIN SMALL LETTER DOTLESS I [U+0131] when using locale-neutral compares, given that the normative uppercase mapping of this fourth letter is the second letter above. That is not what happens in locale-neutral comparisons, I believe.

Here's what happens exactly:

source      simple case folding   full case folding       tr/az case folding
dotted i    dotted i              dotted i                dotted i
dotless i   dotless i             dotless i               dotless i
dotted I    dotted I              dotted i + comb. dot    dotted i
dotless I   dotted i              dotted i                dotless i

-- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague. --Edsger Dijkstra
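[Editorial sketch, not part of the original message.] The "full case folding" and "tr/az case folding" columns of the table above can be reproduced in Python. The sketch assumes that the built-in str.casefold() performs locale-neutral full case folding; turkic_fold is a hypothetical helper written for this note that applies only the two Turkic rules for U+0049 and U+0130 before folding the rest.

```python
# The four Latin "i" letters discussed in this thread.
LETTERS = {
    "dotted i": "\u0069",
    "dotless i": "\u0131",
    "dotted I": "\u0130",
    "dotless I": "\u0049",
}


def full_fold(s: str) -> str:
    """Locale-neutral full case folding, as done by str.casefold()."""
    return s.casefold()


def turkic_fold(s: str) -> str:
    """Hypothetical tr/az folding: apply the two Turkic rules, then fold the rest."""
    s = s.replace("\u0049", "\u0131")  # CAPITAL I -> SMALL DOTLESS I
    s = s.replace("\u0130", "\u0069")  # CAPITAL I WITH DOT -> SMALL I
    return s.casefold()


for name, ch in LETTERS.items():
    print(f"{name:10} full={full_fold(ch)!r:10} tr/az={turkic_fold(ch)!r}")
```

Note that under full folding only "dotted I" changes length (it becomes i followed by U+0307), which is the string-length change the simple folding avoids.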
Re: Case mapping of dotless lowercase letters
On 16/12/2003 08:41, Kent Karlsson wrote: ... Yes. And as it happens, dotless-i case-*folds* to (soft)dotted-i, so you cannot register an IDN that after nameprep has a dotless-i in it, since that name isn't correctly nameprepped. This does not guard against (soft)dotted-i, dot-above, but for the registered part of a domain name, registrars are *supposed* to have some rules for what is allowed, and what is not (for that particular registrar). E.g. the Swedish domain name registry *currently* allows only ASCII letters plus åäöé (after nameprep) in domain names they register, though this may be somewhat augmented in the future (to cover Sami too at least, maybe more). This kind of solution was driven mainly by the issue of the Traditional Chinese vs. Simplified Chinese problem, but that approach applies to cases like dotless i, dot-above too. If the Swedish registry allows all the letters used in Swedish and Sami, and far eastern registries allow Chinese characters, the Turkish and Azerbaijani registries should allow, and be allowed to allow, all the letters of the alphabets of their national languages. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Case mapping of dotless lowercase letters
Peter Kirk wrote: If the Swedish registry allows all the letters used in Swedish and Sami, and far eastern registries allow Chinese characters, the Turkish and Azerbaijani registries should allow, and be allowed to allow, all the letters of the alphabets of their national languages. They would in that case allow dotted and dotless i, but would they automatically allow dot above? There's still the uppercase/lowercase problem, though---maybe these registries should not allow different domain names that differ only in dotless/dotted i? Stefan
Re: Swastika to be banned by Microsoft?
On 2003.12.15, 12:54, Tom Emerson [EMAIL PROTECTED] wrote: Apparently that S is the Futhark rune Sigel, encoded at U+16CB. Holocaust scholars wanting to encode German documents from the 1930s and 1940s would want the double runic S encoded, since this was a specific character found on typewriters of the era and saw regular use. A proposal to encode this was shot down a few years ago, however. Even if it were encoded it could still have been made canonically (or otherwise) decomposed to U+16CB U+16CB. Or to U+16CB U+034F U+16CB, to keep its logoness. --. António MARTINS-Tuválkin | ()| [EMAIL PROTECTED]|| Rua Alberto Bramão, 8-1º d.to | PT-1700-132 LISBOA Não me invejo de quem tem| +351 934 821 700 carros, parelhas e montes| http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe| http://pagina.de/bandeiras/ a água em todas as fontes|
Re: Case mapping of dotless lowercase letters
On 16/12/2003 11:49, Stefan Persson wrote: Peter Kirk wrote: If the Swedish registry allows all the letters used in Swedish and Sami, and far eastern registries allow Chinese characters, the Turkish and Azerbaijani registries should allow, and be allowed to allow, all the letters of the alphabets of their national languages. They would in that case allow dotted and dotless i, but would they automatically allow dot above? ... Probably not, although there would be a certain irony if dotless i with dot above was allowed but ordinary i was not. ... There's still the uppercase/lowercase problem, though ... True. This problem needs to be solved. In the circumstances, and since as a general rule IDNs are written lower case, it might be acceptable for the lower case mapping of (ordinary dotless) I to be indeterminate, so that if I type UNICODE.ORG I might get unicode.org or unıcode.org. ... ---maybe these registries should not allow different domain names that differ only in dotless/dotted i? Indeed they should, just as the Swedish registry allows names that differ only in umlauts. These are different letters of the alphabet. Otherwise we are imposing foreign alphabetic practices. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Case mapping of dotless lowercase letters
Kent Karlsson scripsit: Yes. And as it happens, dotless-i case-*folds* to (soft)dotted-i, so you cannot register an IDN that after nameprep has a dotless-i in it, since that name isn't correctly nameprepped. What is the source of this claim? The tables in RFC 3454 (stringprep) do not mention dotless-i, and neither does RFC 3491. -- Knowledge studies others / Wisdom is self-known; John Cowan Muscle masters brothers / Self-mastery is bone; [EMAIL PROTECTED] Content need never borrow / Ambition wanders blind; www.ccil.org/~cowan Vitality cleaves to the marrow / Leaving death behind.--Tao 33 (Bynner)
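[Editorial sketch, not part of the original message.] John Cowan's observation can be checked against Python's standard stringprep module, which exposes the RFC 3454 tables. The sketch assumes that stringprep.map_table_b2 applies table B.2 (the case-folding map used with NFKC) to a single character; b2_map is only a thin wrapper named for this note.

```python
import stringprep


def b2_map(ch: str) -> str:
    """Map one character through RFC 3454 table B.2 (case folding used with NFKC)."""
    return stringprep.map_table_b2(ch)


# Dotless i is left alone; the two capitals are folded.
for ch in ("\u0131", "\u0049", "\u0130"):
    print(f"U+{ord(ch):04X} -> {[f'U+{ord(c):04X}' for c in b2_map(ch)]}")
```

If the assumption holds, U+0131 maps to itself, so a label containing dotless i survives this mapping step unchanged, while U+0130 becomes U+0069 U+0307.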
Speaking of glottophagic hegemony (was Re: [OT] CJK - CJC (Re: Corea?))
Wow. Antonio is running it down! Etc. All this crackpot misguided political correctness reeks of unconscious glottophagic hegemony, cultural parochalism and well-meaning gringocentered patronizing -- it's unsettling to sniff (in this and other threads) whips of it in a forum such as this. ^ But your p seems to have glottophagiated the ff in whiffs, unless the implication is that the Mistress of Cultural Parochialism also has an odiferous fetish with the leather lash she's using to scourge the misbehaving perpetrators of foreignisms. :-) And yes, that's an open invitation to further OT-ify a thread that has gone bad. This forum definitely needs some thread discipline here. --Ken
Re: Case mapping of dotless lowercase letters
Peter Kirk wrote: In the circumstances, and since as a general rule IDNs are written lower case, it might be acceptable for the lower case mapping of (ordinary dotless) I to be indeterminate, so that if I type UNICODE.ORG I might get unicode.org or unıcode.org. ... ---maybe these registries should not allow different domain names that differ only in dotless/dotted i? Indeed they should, just as the Swedish registry allows names that differ only in umlauts. In that case, how would the browser know if UNICODE.ORG means that you want to visit unicode.org or unıcode.org, if both domains exist? Maybe one could assume Turkish casing for .tr and .az domains, and non-Turkish casing for all other domains. Stefan
Re: Stability of WG2
Peter Kirk scripsit: On 16/12/2003 09:41, Curtis Clark wrote: A measure of comparison is the system of biological nomenclature, ... (not to mention the periodic and sometimes raucous conventions when the rules are modified). Probably the secret of its success is the existence of such conventions. *chuckle* The first use of conventions above means meetings; the second means rules. Result: a non-meeting of the minds. If biologists had insisted that names once assigned could not be changed because of advances in knowledge, or even to correct errors, then surely the system would have broken down centuries ago. In fact, Linnaean names are *not* changed for either of those reasons, nor for any other reason whatsoever: though we now know that Basilosaurus is a proto-whale and not any sort of reptile, Basilosaurus it will remain forever. The only thing that can happen in Linnaean nomenclature is the recognition that two names are synonymous. In that case, there is a question which shall be the preferred name, and normally it is the first name published, but exceptions sometimes occur. Thus when Brontosaurus and Apatosaurus were found to be synonyms, Apatosaurus was chosen as the preferred name because it was published first; however, this is not properly describable as changing the name of Brontosaurus to 'Apatosaurus'. Brontosaurus is a perfectly good name and may still be used even though it is dispreferred. -- You are a child of the universe no less John Cowan than the trees and all other acyclichttp://www.reutershealth.com graphs; you have a right to be here.http://www.ccil.org/~cowan --DeXiderata by Sean McGrath [EMAIL PROTECTED]
Re: Speaking of glottophagic hegemony (was Re: [OT] CJK - CJC (Re: Corea?))
Kenneth Whistler scripsit: But your p seems to have glottophagiated the ff in whiffs, unless the implication is that the Mistress of Cultural Parochialism also has an odiferous fetish with the leather lash she's using to scourge the misbehaving perpetrators of foreignisms. :-) I think you want odoriferous rather than odiferous, though the latter term, ynkhorne as it is, may have some small applicability here. And yes, that's an open invitation to further OT-ify a thread that has gone bad. This forum definitely needs some thread discipline here. Well, until we see either a whip (in the parliamentary sense) or a flagellifer here, we can't expect much. -- A rabbi whose congregation doesn't want John Cowan to drive him out of town isn't a rabbi, http://www.ccil.org/~cowan and a rabbi who lets them do it [EMAIL PROTECTED] isn't a man.--Jewish saying http://www.reutershealth.com
Re: Stability of WG2
At 16:05 -0500 2003-12-16, [EMAIL PROTECTED] wrote: Thus when Brontosaurus and Apatosaurus were found to be synonyms, Apatosaurus was chosen as the preferred name because it was published first; however, this is not properly describable as changing the name of Brontosaurus to 'Apatosaurus'. Brontosaurus is a perfectly good name and may still be used even though it is dispreferred. Brontosaurus was good enough for me when I was five, and it's good enough for me today. Hmpf. Dispreferred me elbow. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Qumran Greek
Elaine Keown Hi, --- Michael Everson [EMAIL PROTECTED] wrote: The X looks like a CHI of course. It is a chi!!!--E. G. Turner Greek Manuscripts of the Ancient World 1987 says that chi is an editorial mark. His book has a plate of a Greek ms showing the chi and paragraphos near each other, as in the Qumran Isaiah. Elaine
RE: Case mapping of dotless lowercase letters
Stefan Persson wrote: Kent Karlsson wrote: This kind of solution was driven mainly by the issue of the Traditional Chinese vs. Simplified Chinese problem, but that approach applies to cases like dotless i, dot-above too. Do you mean that people were afraid that someone would register e.g. .com, while someone else would register .com? Assuming that those are SC and TC for the same reading, yes. Worse, those worrying argued that more than 2^n IDNs, where n is the number of CJK characters in the intended name, would be needed for each intended name (ignoring that SC and TC don't usually mix). Peter Kirk wrote: If the Swedish registry allows all the letters used in Swedish and Sami, and far eastern registries allow Chinese characters, the Turkish and Azerbaijani registries should allow, and be allowed to allow, all the letters of the alphabets of their national languages. Note that ß (sharp s) casefolds to ss, and ſ (long s) casefolds to s. So straße, strase, and strasse also all map to the same (strasse) subname. John Cowan wrote: Yes. And as it happens, dotless-i case-*folds* to (soft)dotted-i, so you cannot register an IDN that after nameprep has a dotless-i in it, since that name isn't correctly nameprepped. What is the source of this claim? The tables in RFC 3454 (stringprep) do not mention dotless-i, and neither does RFC 3491. Aha, a change that escaped me. (It used to be folded as described above.) My apologies. /kent k
RE: Case mapping of dotless lowercase letters
Here's what happens exactly:

Note the rules in CaseFolding.txt:

0049; C; 0069; # CAPITAL (dotless) I -> SMALL (soft-dotted) I
0049; T; 0131; # CAPITAL (dotless) I -> SMALL DOTLESS I
0130; F; 0069 0307; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I, DOT
0130; T; 0069; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I

But also that the other 'i's are mapped to themselves by default. There's no explicit CaseFolding mapping defined for them, so we also currently have these defaults:

0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
0130; C; 0130; # CAPITAL I WITH DOT -> CAPITAL I WITH DOT
0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I

And we also have the explicitly dotted Turkic lowercase i, whose folding is defined by the 5th of all rules above (thanks, there's no canonical equivalence between 0069 0307 and 0069):

0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

And for the decomposition of the Turkic dotted uppercase I, case folding is defined by the 1st or 2nd of all rules above (note that 0049 0307 and 0130 should be canonically equivalent, and should produce identical case foldings with the 3rd or 4th rules above, to preserve canonical equivalence):

0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT -> SMALL (soft-dotted) I, DOT
0049 0307; T; 0131 0307; # CAPITAL (dotless) I, DOT -> SMALL DOTLESS I, DOT

Now let's look at each CaseFolding type, and look at their result:

(1) Mappings for Simple CaseFolding:

(1.1) First class of source strings:
0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I
(1.2) Second class of source strings:
0049; C; 0069; # CAPITAL (dotless) I -> SMALL (soft-dotted) I
0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
(1.3) Third class of source strings:
0130; C; 0130; # CAPITAL I WITH DOT -> CAPITAL I WITH DOT
(1.4) Fourth class of source strings:
0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT -> SMALL (soft-dotted) I, DOT
0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

Do these classes resist (don't merge or split) with uppercase/titlecase or lowercase?

(1.1) 0131; lower=0131 ; upper/title=0131
(1.2) 0049; lower=0069 ; upper/title=0049
(1.2) 0069; lower=0069 ; upper/title=0049
(1.3) 0130; lower=0130 ; upper/title=0130
(1.4) 0049 0307; lower=0069 0307; upper/title=0049 0307
(1.4) 0069 0307; lower=0069 0307; upper/title=0049 0307

OK, there's no merge, so no problem with Simple CaseFolding, which resists case mappings.

(2) Mappings for Turkic CaseFolding:

(2.1) First class of source strings:
0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I
0049; T; 0131; # CAPITAL (dotless) I -> SMALL DOTLESS I
(2.2) Second class of source strings:
0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
0130; T; 0069; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I
(2.3) Third class of source strings:
0049 0307; T; 0131 0307; # CAPITAL (dotless) I, DOT -> SMALL DOTLESS I, DOT
(2.4) Fourth class of source strings:
0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

Do these classes resist (don't merge or split) with common uppercase/titlecase or lowercase mappings?

(2.1) 0131; C; lower=0131 ; upper/title=0131
(2.1) 0049; C; lower=0069 ; upper/title=0049
(2.2) 0069; C; lower=0069 ; upper/title=0049
(2.2) 0130; C; lower=0130 ; upper/title=0130
(2.3) 0049 0307; C; lower=0069 0307; upper/title=0049 0307
(2.4) 0069 0307; C; lower=0069 0307; upper/title=0049 0307

Problem here: uppercase mappings do not follow case folding rules. We would also need Turkic-specific mappings for upper/title case:

(2.1) 0131; T; upper/title=0049
(2.1) 0049; C; upper/title=0049
(2.2) 0069; T; upper/title=0130
(2.2) 0130; C; upper/title=0130
(2.3) 0049 0307; T; upper/title=0049 0307 (=0130 ?)
(2.4) 0069 0307; T; upper/title=0130 0307 (=0130 ?)

But we would then need to define canonical equivalence between 0130 and 0049 0307 and 0130 0307 to preserve canonical equivalence...
So Turkic CaseFoldings would be broken, unless we say that Turkish texts should NOT be encoded with 0307, but only with 0049, 0069, 0130 or 0131. So Turkic CaseFolding rules should also avoid generating any 0307, whose behavior is not clear. If we just remove any 0307 from the Turkic texts, there is absolutely no problem with Turkic CaseFolding, provided that we also define Turkic-specific uppercase mappings as done above, and don't use the default
Re: Case mapping of dotless lowercase letters
On 16/12/2003 13:09, Stefan Persson wrote: ... In that case, how would the browser know if UNICODE.ORG means that you want to visit unicode.org or unıcode.org, if both domains exist? Maybe one could assume Turkish casing for .tr and .az domains, and non-Turkish casing for all other domains. Stefan As soon as I had written the above I realised that I had hurried too much, but I was going out. Let me clarify: If it is the client software (browser etc) which resolves the casing, then how it resolves it is essentially a local matter which doesn't need to be standardised. But my recommendation would be that the mapping followed the local language context, i.e. in general the system locale except where overridden by language markup in the local context e.g. when the URL is embedded in a document. That is, I would map I to i, unless the locale or markup language is tr or az in which case it would map to dotless i. (There are actually a few other language orthographies which use Turkic casing.) The alternative of using the Turkic mapping for .tr and .az domains is possible but seems less desirable to me. If the casing is resolved by the nameserver, there is no alternative to using the Turkic mapping only for .tr and .az domains. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
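[Editorial sketch, not part of the original message.] The client-side recommendation above (lowercase I to i, except in a tr or az language context where it becomes dotless i) might look roughly like this. The function name lower_for_lang and the TURKIC_LANGS set are hypothetical, and the sketch handles only the two capital letters; it is not how any actual browser or resolver behaves.

```python
# Assumption for this sketch: only these two language tags use Turkic casing.
TURKIC_LANGS = {"tr", "az"}


def lower_for_lang(name: str, lang: str) -> str:
    """Lowercase a name, using Turkic rules for I and dotted I when lang is tr/az."""
    if lang in TURKIC_LANGS:
        name = name.replace("\u0130", "i")   # CAPITAL I WITH DOT -> dotted i
        name = name.replace("I", "\u0131")   # CAPITAL I -> dotless i
    return name.lower()


print(lower_for_lang("UNICODE.ORG", "en"))
print(lower_for_lang("UNICODE.ORG", "tr"))
```

The same typed string therefore resolves to two different lowercase names depending on the language context, which is exactly the ambiguity Stefan Persson raises.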
RE: Case mapping of dotless lowercase letters
Chris Jacobs [mailto:[EMAIL PROTECTED] From: Philippe Verdy [EMAIL PROTECTED] Stefan Persson writes: Isn't the sequence dotless i + combining acute canonically equivalent to dotted i + combining acute? NO. There's no canonical equivalence between distinct pairs of characters, if the first letter of each pair are not also canonically equivalent. compare <ë, combining ogonek> with <ę, combining diaeresis> The first pair has e trema as its first letter, the second pair e ogonek. Yet these pairs are canonical equivalent. True in the way you interpret my sentence, but when I say the first letter of each pair, I mean the first non decomposable character of each pair. In your example, both letters are simple e vowels. Both dotted lowercase i and dotless lowercase i are not decomposable... unlike dotted uppercase I... Well Outlook 2000 is unable to represent any e with ogonek and trema of your example. So, although they are canonically equivalent, they are rendered differently: - SMALL LETTER E WITH DIAERESIS, COMBINING OGONEK displays SMALL LETTER E WITH DIAERESIS, MISSING SPACING GLYPH FOR COMBINING OGONEK in an unbreakable sequence of glyphs or editable grapheme clusters (the keyboard edit cannot move in the middle, but the mouse selection can break before the ogonek.) - SMALL LETTER E WITH OGONEK, COMBINING DIAERESIS and SMALL LETTER E, COMBINING OGONEK, COMBINING DIAERESIS both display E WITH OGONEK, SPACING DIAERESIS with a break between glyphs, as if it were two distinct editable grapheme clusters. All these should better display E WITH OGONEK, MISSING NON-SPACING GLYPH FOR COMBINING DIAERESIS Isn't there a distinct glyph for missing glyphs representing spacing diacritics, or not even a spacing glyph with a dotted circle? And grapheme clusters are incorrectly mapped for editing in Outlook.
Re: Stability of WG2
On 16/12/2003 13:05, [EMAIL PROTECTED] wrote: Peter Kirk scripsit: On 16/12/2003 09:41, Curtis Clark wrote: A measure of comparison is the system of biological nomenclature, ... (not to mention the periodic and sometimes raucous conventions when the rules are modified). Probably the secret of its success is the existence of such conventions. *chuckle* The first use of conventions above means meetings; the second means rules. Result: a non-meeting of the minds. Not so! I intended such conventions as an explicit reference to the meetings which Curtis described, although I was also aware of the double meaning and deliberately didn't cancel it. If biologists had insisted that names once assigned could not be changed because of advances in knowledge, or even to correct errors, then surely the system would have broken down centuries ago. In fact, Linnaean names are *not* changed for either of those reasons, nor for any other reason whatsoever: though we now know that Basilosaurus is a proto-whale and not any sort of reptile, Basilosaurus it will remain forever. The only thing that can happen in Linnaean nomenclature is the recognition that two names are synonymous. In that case, there is a question which shall be the preferred name, and normally it is the first name published, but exceptions sometimes occur. Thus when Brontosaurus and Apatosaurus were found to be synonyms, Apatosaurus was chosen as the preferred name because it was published first; however, this is not properly describable as changing the name of Brontosaurus to 'Apatosaurus'. Brontosaurus is a perfectly good name and may still be used even though it is dispreferred. I'm no expert on this... but I thought that species could be transferred from genus to genus as knowledge advances. And presumably obvious spelling mistakes are corrected (contrast FHTORA in U+1D0C5), or are you saying that if the first publication had Brontosuarus as a typo this error would remain for ever? 
-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Case mapping of dotless lowercase letters
At 00:35 +0100 2003-12-17, Philippe Verdy wrote: NO. There's no canonical equivalence between distinct pairs of characters, if the first letter of each pair are not also canonically equivalent. compare <ë, combining ogonek> with <ę, combining diaeresis> The first pair has e trema as its first letter, the second pair e ogonek. Yet these pairs are canonical equivalent. True in the way you interpret my sentence, but when I say the first letter of each pair, I mean the first non decomposable character of each pair. In your example, both letters are simple e vowels. e-diaeresis is decomposable to e + combining diaeresis. e-ogonek-diaeresis is decomposable to e + combining diaeresis + combining ogonek or to e + combining ogonek + combining diaeresis. The last two are equivalent. Both dotted lowercase i and dotless lowercase i are not decomposable... unlike dotted uppercase I... small letter i and small letter dotless i are as different as t and thorn. Well Outlook 2000 is unable to represent any e with ogonek and trema of your example. Get a better browser. -- Michael Everson * * Everson Typography * * http://www.evertype.com
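[Editorial sketch, not part of the original message.] The equivalence argued over here can be tested with Python's unicodedata module by comparing NFD forms: two strings are canonically equivalent when their canonical decompositions, after reordering of combining marks, are identical. The helper name canonically_equivalent is introduced only for this note.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True when both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


E_DIAERESIS_OGONEK = "\u00eb\u0328"   # e-diaeresis followed by combining ogonek
E_OGONEK_DIAERESIS = "\u0119\u0308"   # e-ogonek followed by combining diaeresis
I_ACUTE = "\u0069\u0301"              # dotted i followed by combining acute
DOTLESS_I_ACUTE = "\u0131\u0301"      # dotless i followed by combining acute

print(canonically_equivalent(E_DIAERESIS_OGONEK, E_OGONEK_DIAERESIS))
print(canonically_equivalent(I_ACUTE, DOTLESS_I_ACUTE))
```

The first pair decomposes to the same base letter e with the same two marks in canonical order; the second pair does not, because U+0131 has no decomposition at all.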
RE: Case mapping of dotless lowercase letters
Peter Kirk writes: If it is the client software (browser etc) which resolves the casing, then how it resolves it is essentially a local matter which doesn't need to be standardised. But my recommendation would be that the mapping followed the local language context, i.e. in general the system locale except where overridden by language markup in the local context e.g. when the URL is embedded in a document. That is, I would map to i, unless the locale or markup language is tr or az in which case it would map to dotless i. (There are actually a few other language orthographies which use Turkic casing.) The alternative of using the Turkic mapping for .tr and .az domains is possible but seems less desirable to me. If the casing is resolved by the nameserver, there is no alternative to using the Turkic mapping only for .tr and .az domains. Turkic case mappings are not usable in DNS and not even in IDNA, simply because all legacy ASCII names must continue to resolve ASCII 'I' identically with ASCII 'i' and not 'ı' (encoded with Punycode). This is needed for upwards compatibility. So even localized browsers will need to forbid mapping 'ı' as if it was 'I', and IDNA names containing 'ı' cannot be fully converted to uppercase, even with Full case mappings, which will need to keep the lowercase letter. This will be true also for '.tr' and '.az' registries, unless these registries adopt a policy requiring the reservation of domain names in bundles. If this occurs, it will be the registry which will map domain names containing 'i'=='I' identically to domain names containing either a dotless lowercase i. For the case of the dotted uppercase I, separate allocation is still possible, but it would be too easily spoofable as they can be too easily entered on Turkic keyboards to spoof the soft-dotted lowercase i.
So I doubt that .tr and .az registry will ever adopt a distinction between dotted and undotted i in domain names, but they will ensure that by adding bundle reservation policies if they ever implement IDNA. I doubt that Turkish and Azeri registries will resolve names in bundles with dotless-i or dotted-I, as it would require server-side dynamic DNS capabilities, which would also mean scalability problems (the .fr registry has already rejected the idea of resolving names reserved in bundles because of scalability problems with some bundles which may have thousands of equivalents and would be difficult to support in fast static DNS servers: only one canonical name in the bundle will be resolved on DNS servers, the other names being left reserved, until a standard solution is found to allow such resolution in clients of these registries, using the bundle equivalence rules defined by the specific IDNA bundle profile of each registry).
Re: Case mapping of dotless lowercase letters
John Cowan noted:

<quote>
Here's what happens exactly:

source      simple case folding   full case folding       tr/az case folding
dotted i    dotted i              dotted i                dotted i
dotless i   dotless i             dotless i               dotless i
dotted I    dotted I              dotted i + comb. dot    dotted i
dotless I   dotted i              dotted i                dotless i
</quote>

Add to that specification of the case *folding* (from CaseFolding.txt), the default case *mappings* (from UnicodeData.txt):

source        default lc mapping   default uc mapping
dotted i      dotted i             (dotless) I
dotless i     dotless i            (dotless) I
dotted I      dotted i             dotted I
(dotless) I   dotted i             (dotless) I

If you are case *folding* you are doing one thing; if you are case *mapping* you are doing another. Case *folding* creates equivalence classes for different sequences.

Simple case folding, as defined above, creates the following equivalence classes, adding in the sequences involving use of the combining dot as well.

A. { i, I }
B. { dotless i }
C. { dotted I }
D. { <i, dot above>, <I, dot above> }
E. { <dotless i, dot above> }
F. { <dotted I, dot above> }

These 6 classes are distinguished. They do not conflate, although in class A and in class D, there are two sequences which do fold together.

Full case folding, as defined above, creates the following equivalence classes.

A. { i, I }
B. { dotless i }
G. { dotted I, <i, dot above>, <I, dot above> }
E. { <dotless i, dot above> }
F. { <dotted I, dot above> }

In other words, there are now 5, not 6 equivalence classes, as the classes C and D from simple case folding have been conflated.

Turkic/Azeri case folding, as defined above, creates the following equivalence classes.

H. { i, dotted I }
I. { dotless i, I }
J. { <i, dot above>, <dotted I, dot above> }
K. { <dotless i, dot above>, <I, dot above> }

And now there are 4 *different* equivalence classes, which group together the sequences which make sense for Turkish/Azeri. Note that none of the 3 sets of equivalence classes violates *canonical* equivalence, because none of the 8 sequences involved is canonically equivalent to any other.
In other words, no matter which of the 3 approaches you take to case folding, in no instance are you claiming that canonically equivalent sequences are to be interpreted differently.

Now let's look at what happens with case *mapping*, using the default mappings of UnicodeData.txt.

Lowercasing first:

L. { i, I, dotted I } --> i
B. { dotless i } --> dotless i
M. { <i, dot above>, <I, dot above>, <dotted I, dot above> } --> <i, dot above>
E. { <dotless i, dot above> } --> <dotless i, dot above>

Uppercasing next:

N. { i, I, dotless i } --> I
C. { dotted I } --> dotted I
O. { <i, dot above>, <I, dot above>, <dotless i, dot above> } --> <I, dot above>
F. { <dotted I, dot above> } --> <dotted I, dot above>

The classes of sequences that get conflated are different here. In particular, classes L, M, N, O conflate characters that are not conflated by the formal definition of case folding. So, in particular, one should *not* expect the results of case mapping, followed by a binary comparison, to be the same as a formal case folding comparison. There will be differences. Any implementation that does not take this into account is still confused (aren't we all?) in its handling of these letters.

Now add to that the problem of which of the elements in the equivalence classes *look* the same, and you have the potential for even more confusion. In particular, in simple case folding, you have the equivalence classes:

A. { i, I }
E. { <dotless i, dot above> }

Members of class E are *not* equivalent to members of class A. But of course, <dotless i, dot above> *looks like* i and does *not* look like I. Add in the others, plus all the potential differences in how fonts may implement the soft-dotted property, and this entire area can lead to total confusion.

One moral of the story is: DO NOT USE COMBINING DOTS WITH I's.

If you subtract out all the superfluous combinations cited above with combining dots (for completeness), then the situation becomes much simpler and more comprehensible:

Simple case folding.
[disallows string length change]

  A. { i, I }
  B. { dotless i }
  C. { dotted I }

Full case folding. [allows string length change]

  A. { i, I }
  B. { dotless i }
  G. { dotted I } [represented in folded form as <i, dot above>]

Turkic/Azeri case folding.

  H. { i, dotted I }
  I. { dotless i, I }

Lowercasing:

  L. { i, I, dotted I } --> i
  B. { dotless i } --> dotless i

Uppercasing:

  N. { i, I, dotless i } --> I
  C. { dotted I } --> dotted I

Add in Turkic locale-specific special casing.

Lowercasing:

  H. { i, dotted
Re: Case mapping of dotless lowercase letters
Correcting myself:

Note that none of the 3 sets of equivalence classes violates *canonical* equivalence, because none of the 8 sequences involved is canonically equivalent to any other. In other words, no matter which of the 3 approaches you take to case folding, in no instance are you claiming that canonically equivalent sequences are to be interpreted differently.

Actually, dotted I *is* canonically equivalent to <I, dot above> (I overlooked that when compiling the summary.)

Hence the equivalence classes for simple case folding:

  C. { dotted I }
  D. { <i, dot above>, <I, dot above> }

*do* violate canonical equivalence. And that is the whole reason for the separate definition of full case folding, which defines the equivalence class:

  G. { dotted I, <i, dot above>, <I, dot above> }

which observes canonical equivalence, but which has the drawback of string length change in case folding.

--Ken
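The correction above can be checked with a few lines of Python. This is only a sketch under stated assumptions: `SIMPLE_FOLD` is a hand-written table covering just the letters discussed in this thread (it is not read from CaseFolding.txt), and `canonically_equivalent` and `simple_fold` are hypothetical helper names. Python's built-in `str.casefold()` stands in for full case folding.

```python
import unicodedata

DOT_ABOVE = "\u0307"          # COMBINING DOT ABOVE
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Hand-written simple-folding table for the letters in this thread only
# (an assumption for illustration, not a CaseFolding.txt reader).
# Dotted I and dotless i have no simple folding, so they map to themselves.
SIMPLE_FOLD = {"I": "i"}


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def simple_fold(s: str) -> str:
    """One-to-one folding: never changes the string length."""
    return "".join(SIMPLE_FOLD.get(ch, ch) for ch in s)


a = DOTTED_CAPITAL_I          # dotted I as a single code point
b = "I" + DOT_ABOVE           # the sequence <I, dot above>

print(canonically_equivalent(a, b))       # are the two canonically equivalent?
print(simple_fold(a) == simple_fold(b))   # does simple folding keep them together?
print(a.casefold() == b.casefold())       # does full folding keep them together?
print(len(a), len(a.casefold()))          # string length before and after full folding
```

The point of comparing the last three lines is the trade-off described above: the one-to-one table cannot bring the two spellings together without changing the string length, while full folding can.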
Re: Case mapping of dotless lowercase letters
- Original Message -
From: Michael Everson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 9:00 PM
Subject: Re: Case mapping of dotless lowercase letters

At 20:30 +0100 2003-12-16, Chris Jacobs wrote:

NO. There's no canonical equivalence between distinct pairs of characters, if the first letters of the pairs are not also canonically equivalent.

compare with

The first pair has e trema as its first letter, the second pair e ogonek. Yet these pairs are canonically equivalent.

The base letter is e

Nope. That would be the base char of their NFD. The base chars of themselves are and .
Re: Case mapping of dotless lowercase letters
Philippe Verdy scripsit:

If we just remove any 0307 from the Turkic texts, there is absolutely no problem with Turkic CaseFolding, provided that we also define Turkic-specific uppercase mappings as done above, and don't use the default locale-neutral uppercase mappings of the UCD.

There's no reason to expect that there will be any 0307 whatever in Turkish/Azeri texts: it's not a diacritic those languages use, AFAIK.

--
How they ever reached any conclusion at all is starkly unknowable to the human mind. --Backstage Lensman, Randall Garrett
[EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan
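For readers who want to experiment with the Turkic folding discussed in this exchange, here is a minimal Python sketch. It assumes only the two Turkic-specific folding entries for capital I and dotted capital I, hand-copied into a hypothetical `TURKIC_FOLD` table, and then defers to Python's built-in `str.casefold()` for everything else. It deliberately does nothing about U+0307, on the assumption stated above that Turkish/Azeri text should not contain it.

```python
# Turkic-specific folding entries for the two letters that differ from the
# default folding; hand-copied here as an assumption of this sketch.
TURKIC_FOLD = str.maketrans({
    "I": "\u0131",     # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER DOTLESS I
    "\u0130": "i",     # LATIN CAPITAL LETTER I WITH DOT ABOVE -> LATIN SMALL LETTER I
})


def turkic_fold(s: str) -> str:
    """Fold for caseless matching with Turkish/Azeri rules (hypothetical helper).

    The two Turkic entries are applied first, so the default folding that
    follows never sees a capital I or a dotted capital I.
    """
    return s.translate(TURKIC_FOLD).casefold()


# Dotted pair and dotless pair each fold together, but not with each other.
print(turkic_fold("\u0130ZM\u0130R") == turkic_fold("izmir"))
print(turkic_fold("ISPARTA") == turkic_fold("\u0131sparta"))
print(turkic_fold("I") == turkic_fold("i"))
```

Applying the Turkic entries before the default folding is a design choice of this sketch: it avoids having to reimplement the rest of the folding table.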
Re: Stability of WG2
Peter Kirk scripsit:

I'm no expert on this... but I thought that species could be transferred from genus to genus as knowledge advances.

True enough, but the specific epithet remains the same, and the old names are still available (as the jargon has it) though no longer valid (what I was calling preferred in my previous post).

Linnaeus himself, working with two different descriptions of chimps, split them into Homo troglodytes and Simia satyrus (which latter also included bonobos and orangutans); when the mistake was cleared up, the specific epithet troglodytes, being the older, was retained for chimps, whereas bonobos got satyrus, both now in the new genus Pan; orangs were moved to Pongo and given the new epithet pygmaeus. (There's now a move underfoot to move all of these, plus gorillas, into Homo; I don't give it much chance, though I think it's a cool idea.)

Nobody would call chimps Homo troglodytes, or orangs Simia satyrus, today, but those names can't ever be assigned to other species in future. (If chimps were folded into Homo, they would be H. troglodytes again.)

And presumably obvious spelling mistakes are corrected (contrast FHTORA in U+1D0C5), or are you saying that if the first publication had Brontosuarus as a typo this error would remain for ever?

It depends. If the article said I dub this genus 'Brontosuarus', from the Greek for 'thunder lizard', then yes, it would be fixed. But if there isn't a positive *indication in the text of the original article* that makes the error evident on its face, then 'Brontosuarus' it would be.

--
John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan
Big as a house, much bigger than a house, it looked to [Sam], a grey-clad moving hill. Fear and wonder, maybe, enlarged him in the hobbit's eyes, but the Mumak of Harad was indeed a beast of vast bulk, and the like of him does not walk now in Middle-earth; his kin that live still in latter days are but memories of his girth and his majesty.
--Of Herbs and Stewed Rabbit
Re: Case mapping of dotless lowercase letters
Kenneth Whistler scripsit:

John Cowan noted:

<quote>
Here's what happens exactly:

source      simple case folding   full case folding      tr/az case folding
dotted i    dotted i              dotted i               dotted i
dotless i   dotless i             dotless i              dotless i
dotted I    dotted I              dotted i + comb. dot   dotted i
dotless I   dotted i              dotted i               dotless i
</quote>

[snip]

One moral of the story is: DO NOT USE COMBINING DOTS WITH I's.

A fine moral, indeed. Unfortunately, full case folding generates such things for downstream processes to trip over. It's too late to fix the RFCs, alas.

--
Where the wombat has walked, it will inevitably walk again.
John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan
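The remark about full case folding producing combining dots can be observed directly in Python, whose `str.casefold()` implements full case folding. The sketch below is illustrative only; `code_points`, `same_by_folding`, and `same_by_uppercasing` are hypothetical helper names. It also shows the earlier point in this thread that uppercasing followed by a binary comparison is not the same test as a case-folded comparison.

```python
def code_points(s):
    """Return the code points of s in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in s]


def same_by_folding(a, b):
    """Caseless match using full case folding."""
    return a.casefold() == b.casefold()


def same_by_uppercasing(a, b):
    """Caseless match using default uppercasing plus binary comparison."""
    return a.upper() == b.upper()


DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I

# Full folding of a single dotted I yields a base letter plus a combining mark.
print(code_points(DOTTED_CAPITAL_I.casefold()))

# Dotted i and dotless i: the two comparison strategies can be compared here.
print(same_by_folding("i", DOTLESS_SMALL_I))
print(same_by_uppercasing("i", DOTLESS_SMALL_I))
```

A downstream process that assumes folded text contains no combining marks, or that folding preserves length, would need to account for the first result.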