Re: TR35 (was: Standardize TimeZone ID)
On 07/05/2004 14:53, [EMAIL PROTECTED] wrote: ... So the database aliases one to the other. Aliases are used for timezones that are completely equivalent on the whole timeframe considered (apparently only starting in the early years of last century). The cutoff date is 1970-01-01; if two timezones have been the same ever since then, they are not separately encoded *unless* they are in separate national jurisdictions (because after all it is the nation-state which sets up the rules). This date is the Posix zero point. It is not always the nation-state which sets the rules. For example, in Australia each state sets its own rules; and so there are six different schemes with half hour differences, some daylight saving and some without. It is not only possible but quite likely that new distinctions will be introduced in time zones which have been the same since 1970; e.g. very likely New South Wales and Victoria have been in the same time zone ever since then, but there is a real chance that NSW will abolish daylight saving but Victoria will not. So don't assume too quickly that time zones will not be split. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
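The aliasing rule described in this message (same rules since the 1970 cutoff, same jurisdiction) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Zone` record, the `may_alias` function and the zone names are hypothetical and are not taken from the actual time zone database.

```python
from dataclasses import dataclass

CUTOFF_YEAR = 1970  # the POSIX zero point mentioned in the message


@dataclass(frozen=True)
class Zone:
    """Hypothetical zone record, not the real tz database format."""
    name: str
    jurisdiction: str   # whoever sets the rules (a nation or a state)
    history: tuple      # (year in force, UTC offset in minutes, observes DST)


def rules_since_cutoff(zone):
    """Return the rule in force at the cutoff plus every later change."""
    in_force = None
    later = []
    for year, offset, dst in sorted(zone.history):
        if year <= CUTOFF_YEAR:
            in_force = (offset, dst)
        else:
            later.append((year, offset, dst))
    return in_force, tuple(later)


def may_alias(a, b):
    """Alias only if the jurisdiction and all rules since the cutoff match."""
    return (a.jurisdiction == b.jurisdiction
            and rules_since_cutoff(a) == rules_since_cutoff(b))


east = Zone("Example/East", "XX", ((1950, 600, True),))
south = Zone("Example/South", "XX", ((1950, 600, True),))
print(may_alias(east, south))

# If one region later abolishes daylight saving, the alias has to be split.
south_split = Zone("Example/South", "XX", ((1950, 600, True), (2005, 600, False)))
print(may_alias(east, south_split))
```

The second call illustrates the warning above: an alias that is valid today can stop being valid as soon as one of the two regions changes its rules.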
Re: Phoenician
On 09/05/2004 01:05, Peter Constable wrote: I think one's track record in making judgments on boundary cases is established only after having successfully dealt with boundary cases -- and enough to establish a level of confidence. Of things already in Unicode, what have been boundary cases between unification and de-unification? The unified Latin-but-not-Cyrillic w q (if I've recalled the two letters correctly) and Coptic/Greek characters are the only prior boundary cases I can think of. Peter And these two cases are hardly a good advertisement for the expert's reputation. The Coptic/Greek unification proved to be ill-advised and is being undone. As for the unified W and Q, well, I guess that if the Kurds and others who use these letters in Cyrillic knew how this decision would mean that their alphabet will never be sorted correctly (unless they get round to tailoring their collations), they would make a strongly argued case for disunification. Well, perhaps the expert can feel how much his fingers have been burned by over-unification and so is now pressing for everything to be disunified. And then there is the matter of CJK unification, which I gather is still rather contentious. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: TR35
On 07/05/2004 09:44, Carl W. Brown wrote: ... If I live in Guam I will probably be using an en_US locale. However the US territory does not contain my time zone. Probably the best solution for this problem is to add a category of possessions to the territory information. This allows applications to enumerate available time zones for not only the country itself but also its possessions that might be using the locale. This issue is not limited to a country's possessions. Many expatriates and travelling business people etc want to keep their (laptop) computer's general locale settings as that of their home country (not least because changing this often destabilises data) but need to set it to the time zone in which they are temporarily resident. So time zones should be kept independent of other locale information, especially independent of such things as date and decimal point formats, and preferred languages. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
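A minimal sketch of what keeping the time zone independent of the locale could look like in practice. The `UserSettings` record, the `render` function and the `DATE_PATTERNS` table are hypothetical names made up for this illustration; the only outside fact assumed is that Guam is at UTC+10 with no daylight saving.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical per-locale date patterns; a real system would read these
# from locale data rather than hard-code them.
DATE_PATTERNS = {
    "en_US": "%m/%d/%Y %H:%M",
    "en_GB": "%d/%m/%Y %H:%M",
}

GUAM = timezone(timedelta(hours=10), "ChST")  # assumption: fixed UTC+10


@dataclass
class UserSettings:
    locale_id: str   # controls formatting only
    tz: timezone     # controls the wall-clock time only


def render(instant, settings):
    """Format an instant using the locale's pattern and the chosen zone."""
    local = instant.astimezone(settings.tz)
    return local.strftime(DATE_PATTERNS[settings.locale_id])


instant = datetime(2004, 5, 7, 9, 44, tzinfo=timezone.utc)
print(render(instant, UserSettings("en_US", GUAM)))   # US formats, Guam time
print(render(instant, UserSettings("en_GB", GUAM)))   # UK formats, Guam time
```

Because the two settings are separate fields, a traveller can change `tz` without touching anything that affects how stored dates and numbers are parsed.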
Re: Phoenician
On 07/05/2004 15:59, Michael Everson wrote: At 17:10 -0400 2004-05-07, [EMAIL PROTECTED] wrote: This would only be the *default* rules. Unicode-savvy sort programs can accept tailorings that make the rules different, like the Swedish tailoring that makes a-ring, a-umlaut, and o-umlaut sort after z instead of in their default places with a and o. As I said, they would be the *tailored* rules. Mixing scripts would go against the current practice of ISO/IEC 14651. Well, we are not talking about ISO/IEC 14651 but about Unicode. Is there any really good reason not to mix two scripts, which are according to many people actually variants of one script but which are (if your proposal is accepted) separately encoded for the convenience of some scholars? This sounds to me like the kind of rule which is made to be broken. If all the 22 CSWA scripts are collated together by default, this would significantly reduce the objections to encoding them as separate scripts. We can perhaps consider them as a family of congruent scripts. Of course we might then think that there are other such families, e.g. the different Indic scripts, but how to collate them should depend on Indian etc custom. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
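The difference between default and tailored rules can be illustrated with the Swedish example quoted above. This is a toy sketch, not the Unicode Collation Algorithm: `default_key` simply ignores accents at the first level, and `swedish_key` uses a hand-written alphabet in which a-ring, a-umlaut and o-umlaut follow z. Both function names are made up for this illustration.

```python
import unicodedata

# Swedish alphabet order with the three extra letters after z.
SWEDISH_ALPHABET = unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyzåäö")
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def default_key(word):
    """Toy default ordering: accented letters sort with their base letter."""
    nfd = unicodedata.normalize("NFD", word.casefold())
    base = "".join(ch for ch in nfd if not unicodedata.combining(ch))
    return (base, nfd)  # accents only break ties


def swedish_key(word):
    """Toy Swedish tailoring: a-ring, a-umlaut, o-umlaut sort after z."""
    nfc = unicodedata.normalize("NFC", word.casefold())
    return [SWEDISH_RANK.get(ch, len(SWEDISH_RANK) + ord(ch)) for ch in nfc]


words = ["zebra", "ärta", "apa", "öl", "ost", "år"]
print(sorted(words, key=default_key))
print(sorted(words, key=swedish_key))
```

The same data sorts two ways depending only on the key function, which is the sense in which a tailoring overrides the default table without changing the encoding.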
Everson-bashing (was: Phoenician)
Peter Kirk scripsit: But have the others agreed with his judgments because they are convinced of their correctness? Or is it more that the others have trusted the judgments of the one they consider to be an expert, and have either not dared to stand up to him or have simply been unqualified to do so? This is laughable. It amazes me that all of the existing scripts have apparently been encoded without any properly documented justification apart from one expert's unchallenged judgments. It would be amazing if it were true, but of course it's absolutely false. And these two cases are hardly a good advertisement for the expert's reputation. The Coptic/Greek unification proved to be ill-advised and is being undone. As for the unified W and Q, well, I guess that if the Kurds and others who use these letters in Cyrillic knew how this decision would mean that their alphabet will never be sorted correctly (unless they get round to tailoring their collations), they would make a strongly argued case for disunification. Nobody writes Kurdish in Cyrillic any more: it's a historic use of the script only. In any event, Michael had *nothing* to do with those unifications. He has consistently pressed for disunification (rightly, IMHO). Well, perhaps the expert can feel how much his fingers have been burned by over-unification and so is now pressing for everything to be disunified. Nonsense, and insulting nonsense to boot. Michael has never pressed for either total unification or total disunification, because both positions are absurd, and his position is never absurd. (I may disagree with it from time to time, and I am willing to press him for reasons, but I *always* respect his point of view.) This verbal sniping on a subject (the history of character encoding) you know nothing about is beneath you. Try and do better. And then there is the matter of CJK unification, which I gather is still rather contentious. Only among the invincibly ignorant.
-- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan One time I called in to the central system and started working on a big thick 'sed' and 'awk' heavy duty data bashing script. One of the geologists came by, looked over my shoulder and said 'Oh, that happens to me too. Try hanging up and phoning in again.' --Beverly Erlebacher
Re: Phoenician
Peter Kirk peterkirk at qaya dot org wrote: And these two cases are hardly a good advertisement for the expert's reputation. The Coptic/Greek unification proved to be ill-advised and is being undone. As for the unified W and Q, well, I guess that if the Kurds and others who use these letters in Cyrillic knew how this decision would mean that their alphabet will never be sorted correctly (unless they get round to tailoring their collations), they would make a strongly argued case for disunification. Well, perhaps the expert can feel how much his fingers have been burned by over-unification and so is now pressing for everything to be disunified. I can't believe I am reading this. Far more than anyone else, Michael has *always* supported the disunification of Coptic from Greek and of Kurdish Cyrillic Q and W from their Latin counterparts. They have been two of his signature causes through the years. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: interleaved ordering (was RE: Phoenician)
At 11:45 +0200 2004-05-10, Kent Karlsson wrote: We do actually mix scripts. Hiragana and Katakana are interleaved. Mark And it might make sense to interleave (say) Thai and Lao in the default ordering. No, it wouldn't. Or to interleave, in the default ordering, the Indic scripts covered by ISCII. No, it wouldn't! Any peculiarities could be handled in tailorings. Such interleaving is the peculiarity. It renders an ordered text illegible to interleave Kannada, Sinhala, and Gujarati. Japanese is different; the users all use both scripts all the time. -- Michael Everson * * Everson Typography * * http://www.evertype.com
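For readers unfamiliar with what interleaving Hiragana and Katakana means for a sort, here is a rough sketch. It is not the actual default collation table: it only assumes that the basic Katakana letters U+30A1..U+30F6 sit exactly 0x60 above their Hiragana counterparts, and it ignores the prolonged sound mark, voicing levels and small kana. The function name `interleaved_key` is made up for this illustration.

```python
KATAKANA_START, KATAKANA_END = 0x30A1, 0x30F6
KANA_OFFSET = 0x60  # distance from a Katakana letter to its Hiragana twin


def fold_to_hiragana(text):
    """Map basic Katakana letters onto the corresponding Hiragana letters."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


def interleaved_key(word):
    """Compare by sound first; the script only breaks ties."""
    return (fold_to_hiragana(word), word)


words = ["カメラ", "かめ", "さくら", "サル", "あめ", "アニメ"]

# Plain code point order: all Hiragana words, then all Katakana words.
print(sorted(words))

# Interleaved order: the two syllabaries are filed together.
print(sorted(words, key=interleaved_key))
```

With plain code point order the list falls into two blocks by script; with the folded key the words are ordered by their kana regardless of which syllabary they are written in.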
Re: interleaved ordering (was RE: Phoenician)
From: Michael Everson [EMAIL PROTECTED] Japanese is different; the users all use both scripts all the time. And there are occurrences in Japanese of Katakana suffixes or particles added to Latin or Han words, notably to people's names and trademarks... I've seen many texts where Han and Katakana are mixed in the same word (where it would be inappropriate to insert a word-break between runs of Han and Katakana particles). My first implementation allowed line-breaks after each Han character, but an exception was made after users asked it not to do that after Han and before Katakana (even though a line break is allowed between two Han characters), or after Latin and before Katakana. So a simple approach that allows linebreaks between distinct scripts is deceptive. Am I wrong, or are my users wrong and want it as a presentation preference? Also, what about line breaking in long runs of Hangul grapheme clusters (I mean here the true L+V*T* syllables with their diacritics, not the simplified LV and LVT sub-syllables encoded in Hangul)? It seems that line breaking in Korean obeys semantic constraints more than normative syllables, and I think it is quite logical when you see that such presentation is sometimes preferred by Latin readers too... To make this work appropriately for some long Japanese or Korean sentences, and match users' expectations, I had to support explicit marks where line-breaks should be allowed, using zero-width spaces. This makes things complicated if the text is not modified with them. So I had to consider ideographic (full-width) punctuation too, which is not directly equivalent to its half-width Latin counterpart, as it already includes the space after it (for example the full-width period/dot, comma or colon) even if the glyph looks a bit larger.
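The zero-width-space approach mentioned at the end of this message could look something like the following. This is only a sketch of the general idea, not the implementation described above: `wrap_at_marks` is a hypothetical name, width is counted in characters rather than display cells, and a chunk longer than the width is simply left on its own line.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, used here as an explicit break mark


def wrap_at_marks(text, width):
    """Break lines only where the text carries an explicit ZWSP mark."""
    lines = []
    current = ""
    for chunk in text.split(ZWSP):
        if current and len(current) + len(chunk) > width:
            lines.append(current)
            current = chunk
        else:
            current += chunk
    if current:
        lines.append(current)
    return lines


# The particle stays attached to its noun because no mark precedes it.
marked = "さくらの" + ZWSP + "あたらしい" + ZWSP + "カメラ"
for line in wrap_at_marks(marked, 8):
    print(line)
```

The obvious cost, as the message says, is that text which has not been prepared with such marks gets no break opportunities at all under this scheme.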
Re: Katakana_Or_Hiragana
Tom Emerson scripsit: Perhaps Michael can enlighten us on the rationale for grouping hiragana and katakana together as a single script. They aren't. They are collated together, that's all. -- How they ever reached any conclusion at all is starkly unknowable to the human mind. --Backstage Lensman, Randall Garrett [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan
Re: Katakana_Or_Hiragana
Michael Everson scripsit: Phoenician and Hebrew should not be interfiled, of course, in the default table, though John Cowan seems to think otherwise. 'Seems', monsieur? Nay, 'does'; I know not 'seems'. --Not Quite Hamlet The point is, of course, that if Phoenician is to be used to represent palaeo-Hebrew (as I agree is correct), then it will create an artificial separation to *not* interfile them. Consider a concordance to your Phoenician-script-Tetragrammaton Bibles. Such Tetras should not appear at the beginning, nor yet at the end, but under yod where they belong. This will also be of great value in the other application of collation, viz. searching. Those who use Phoenician primarily contrastively with Greek will want them filed separately, and my proposal will sort Phoenician words after Greek ones. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan http://www.reutershealth.com Charles li reis, nostre emperesdre magnes, Set anz totz pleinz ad ested in Espagnes.
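The interfiling proposal in this message can be sketched as a sort key that folds each Phoenician letter onto the corresponding Hebrew letter before comparing. Two assumptions are labeled loudly here: the code points U+10900..U+10915 are the ones later assigned to the Phoenician block (in Unicode 5.0, after this thread), and the one-to-one pairing of the 22 letters with the non-final Hebrew letters is taken for granted. `interfiled_key` is a hypothetical name, and this is not a real collation tailoring.

```python
PHOENICIAN_BASE = 0x10900  # assumption: block as encoded after this thread

# Hebrew final forms and the regular letters they stand for.
HEBREW_FINAL_TO_BASE = {
    0x05DA: 0x05DB,  # final kaf   -> kaf
    0x05DD: 0x05DE,  # final mem   -> mem
    0x05DF: 0x05E0,  # final nun   -> nun
    0x05E3: 0x05E4,  # final pe    -> pe
    0x05E5: 0x05E6,  # final tsadi -> tsadi
}

# The 22 non-final Hebrew letters, alef through tav.
HEBREW_NONFINAL = [cp for cp in range(0x05D0, 0x05EB)
                   if cp not in HEBREW_FINAL_TO_BASE]

FOLD = {PHOENICIAN_BASE + i: cp for i, cp in enumerate(HEBREW_NONFINAL)}
FOLD.update(HEBREW_FINAL_TO_BASE)


def interfiled_key(word):
    """File Phoenician-script words under the matching Hebrew letters."""
    folded = [FOLD.get(ord(ch), ord(ch)) for ch in word]
    return (folded, word)  # the original spelling only breaks ties


alef_bet = "\u05D0\u05D1"
yom = "\u05D9\u05D5\u05DD"
melekh = "\u05DE\u05DC\u05DA"
tetragrammaton = "\U00010909\U00010904\U00010905\U00010904"  # yod he waw he

entries = [melekh, tetragrammaton, yom, alef_bet]
in_code_point_order = sorted(entries)
interfiled = sorted(entries, key=interfiled_key)

print(in_code_point_order.index(tetragrammaton))  # position without interfiling
print(interfiled.index(tetragrammaton))           # position when filed under yod
```

In plain code point order the Phoenician-script word lands after every Hebrew entry; with the folded key it sits among the words beginning with yod, which is the concordance behaviour argued for above.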
Who's Harry Potter, err..., Potter Stewart? (was Re: Phoenician)
Who's Potter Stewart? (I don't own a TV). Elaine

A former Associate Justice of the U.S. Supreme Court, who memorably declared in a 1964 concurring opinion that he could not define pornography, but he knew it when he saw it (and the movie in question, Les Amants, wasn't it). Jacobellis v. Ohio, 378 U.S. 184 (1964) Read it here: http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=us&vol=378&invol=184 I shall not today attempt further to define the kinds of material to be embraced within that shorthand description [hard-core pornography]; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it, and the motion picture involved in this case is not that. Eminently sensible of him, by the way. And that, folks, is about as OT as we get on this list. :-) --Ken
RE: interleaved ordering (was RE: Phoenician)
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Philippe Verdy Sent: Monday, May 10, 2004 9:09 AM From: Michael Everson [EMAIL PROTECTED] Japanese is different; the users all use both scripts all the time. And there are occurrences in Japanese of Katakana suffixes or particles added to Latin or Han words, notably to people names and trademarks... I've seen many texts where Han and Katakana are mixed in the same word (where it would be inappropriate to insert a word-break between runs of Han and Katakana particles.) You mean hiragana, not katakana, and kanji, not Han, I believe. Katakana are used for transliteration, and are not typically joined to kanji, whereas hiragana are ubiquitously joined to kanji, as Japanese particles do not ordinarily have kanji representation. I have not seen katakana joined to kanji (or romaji), and suspect that such does not occur. My first implementation allowed line-breaks after each Han character, but an exception was made after users request to not do that after Han and before Katakana (despite line break is allowed between two Han characters), or after Latin and Katakana. So a simple approach that allows linebreaks between distinct scripts is deceptive. Am I wrong, or are my users wrong and want it as a presentation preference? I believe, but am not certain, that nonbreaking kanji-to-hiragana is correct, whereas you can break on kanji-to-katakana. But all this leads me to finally ask: what does script mean? It seems clear to me that although the term has been used throughout the Phoenician debate, not everyone is using it the same way. I know that there is a definition of script that is used for encoding purposes, but can I find it written anywhere, or is it more of an ephemeral thing? Thanks, /|/|ike
RE: Katakana_Or_Hiragana
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of [EMAIL PROTECTED] Sent: Monday, May 10, 2004 10:22 AM Tom Emerson scripsit: Perhaps Michael can enlighten us on the rationale for grouping hiragana and katakana together as a single script. They aren't. They are collated together, that's all. I guess it depends on how you look at it. The Japanese refer to kana script, which encompasses both hiragana and katakana, so it could be said that the single scripts hiragana and katakana are encoded, whereas the single script kana is collated. This would not be evasive, either, as this is how they are used. /|/|ike
RE: interleaved ordering (was RE: Phoenician)
Mike Ayers writes: You mean hiragana, not katakana, and kanji, not Han, I believe. Katakana are used for transliteration, and are not typically joined to kanji, whereas hiragana are ubiquitously joined to kanji, as Japanese particles do not ordinarily have kanji representation. I have not seen katakana joined to kanji (or romaji), and suspect that such does not occur. We have observed that katakana is being used more and more in places that you traditionally saw hiragana, especially in advertisements and on the Web. Katakana is also being used as a way of emphasizing words in a text, even those that would normally be written in hiragana. The choice of script is becoming a stylistic issue lately and you are seeing katakana in places you wouldn't expect them. I also haven't seen katakana attached to kanji, though I have seen it attached to romaji in constrained circumstances. It is very rare, however. I have seen hiragana attached to romaji, usually in the context of particles attached to English nouns. You see the same thing (only more so) in Korean, where an eojeol may contain mixed latin script and hankul. This may be all beside the point: people are probably not interested in contemporary script usage in these contexts. -tree -- Tom Emerson Basis Technology Corp. Software Architect http://www.basistech.com Beware the lollipop of mediocrity: lick it once and you suck forever
Re: Phoenician
Elaine Keown Tucson Dear Asmus Freytag: Becker's law: For every expert there's an equal and opposite expert. This saying is especially true within Semitics (I'm sure). But for me personally, my only interest is in database functionality for Semitics. As far as I can tell, that interest runs directly counter to the interests of more font-oriented people. - Elaine
RE: interleaved ordering (was RE: Phoenician)
At 12:12 -0700 2004-05-10, Mike Ayers wrote: But all this leads me to finally ask: what does script mean? It seems clear to me that although the term has been used throughout the Phoenician debate, not everyone is using it the same way. I know that there is a definition of script that is used for encoding purposes, but can I find it written anywhere, or is it more of an ephemeral thing? I am way too jetlagged to go near this one today. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Subject lines that have nothing to do with message content
Personally speaking, I would have expected that a recent message on this list with the subject line Katakana_Or_Hiragana might have something to do with Japanese, Hiragana, Katakana, or at least Han, or perhaps even Asia. But no... It was about Phoenician. It would be really helpful if people could use subject lines that have something to do with the subject of the message. It just can't be that difficult for people to pick a reasonable subject line. And if you're going to go off-topic in a thread, you might consider getting a different subject line -- or at least adding a parenthetical about how you're going to go off the thread... (As usual, this is my personal opinion and doesn't reflect an official policy, etc.) Rick
Re: interleaved ordering (was RE: Phoenician)
Mike Ayers wrote: I have not seen katakana joined to kanji (or romaji), and suspect that such does not occur. There are a few cases, e.g. ソ連 (So-Ren: Soviet Union), but that could also be written as two kanji as 蘇連 (which is however very rare in modern Japanese). I believe, but am not certain, that nonbreaking kanji-to-hiragana is correct, whereas you can break on kanji-to-katakana. In Japanese you can put a line break between *any* characters, except before punctuation or an end quote, or after a start quote. Stefan
Script vs Writing System
At 12:12 -0700 2004-05-10, Mike Ayers wrote: But all this leads me to finally ask: what does script mean? It seems clear to me that although the term has been used throughout the Phoenician debate, not everyone is using it the same way. I know that there is a definition of script that is used for encoding purposes, but can I find it written anywhere, or is it more of an ephemeral thing?

[PA] The glossary has « A collection of symbols used to represent textual information in one or more writing systems. » Chapter 6 also defines Writing Systems summarized by Table 6-1 Typology of Scripts (Writing Systems then Scripts) : A writing system is then defined as « A set of rules for using one or more scripts to write a particular language. Examples include the American English writing System, the British English writing system, the French writing system, and the Japanese writing system. »

Writing System Type     Unicode Script(s)
------------------------------------------
«
Alphabets:              Latin, Greek, Cyrillic, Armenian, Thaana, Georgian, Ogham, Runic, Mongolian, Old Italic, Gothic, Ugaritic, Deseret, Shavian, Osmanya
Abjads:                 Hebrew, Arabic, Syriac
Abugidas:               Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Tagalog, Hanunóo, Buhid, Tagbanwa, Khmer, Limbu, Tai Le
Logosyllabaries:        Han
Simple Syllabaries:     Cherokee, Hiragana, Katakana, Bopomofo, Yi, Linear B, Cypriot
Featural Syllabaries:   Ethiopic, Canadian Aboriginal Syllabics, Hangul
»

Note : « Table 6-1 lists all of the scripts currently encoded in the Unicode Standard, showing the writing system type for each. The list is an approximate guide, rather than a definitive classification, because of the mix of features seen in many scripts. The writing systems for some languages may be quite complex, mixing more than one writing system together in a composite system. Japanese is the best example; it mixes a logosyllabary (Han), two syllabaries (Hiragana and Katakana), and one alphabet (Latin, for romaji). »
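The point that one writing system (Japanese) draws on several scripts of different types can be made concrete with a small sketch. Python's standard `unicodedata` module does not expose the Script property, so this uses a crude heuristic instead: the first word of each character's name. The `SCRIPT_BY_NAME_PREFIX` table covers only a handful of the scripts from Table 6-1 and, like the function names, is made up for this illustration.

```python
import unicodedata

# First word of the character name -> (script, writing system type).
# A heuristic stand-in for the real Script property; deliberately partial.
SCRIPT_BY_NAME_PREFIX = {
    "LATIN": ("Latin", "alphabet"),
    "GREEK": ("Greek", "alphabet"),
    "CYRILLIC": ("Cyrillic", "alphabet"),
    "HEBREW": ("Hebrew", "abjad"),
    "ARABIC": ("Arabic", "abjad"),
    "DEVANAGARI": ("Devanagari", "abugida"),
    "THAI": ("Thai", "abugida"),
    "CJK": ("Han", "logosyllabary"),
    "HIRAGANA": ("Hiragana", "simple syllabary"),
    "KATAKANA": ("Katakana", "simple syllabary"),
    "HANGUL": ("Hangul", "featural syllabary"),
}


def script_of(ch):
    """Return (script, type) for a character, or None if not in the table."""
    name = unicodedata.name(ch, "")
    return SCRIPT_BY_NAME_PREFIX.get(name.split(" ", 1)[0])


def scripts_in(text):
    """Set of scripts a piece of text draws on, ignoring unknown characters."""
    return {info[0] for info in map(script_of, text) if info is not None}


def types_in(text):
    """Set of writing system types a piece of text draws on."""
    return {info[1] for info in map(script_of, text) if info is not None}


sample = "東京のIBMカメラ"
print(sorted(scripts_in(sample)))
print(sorted(types_in(sample)))
```

A single short Japanese phrase touches four scripts and three writing system types, which is the composite case the quoted note describes.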
Re: Japanese line breaks (was: interleaved ordering)
From: Stefan Persson [EMAIL PROTECTED] In Japanese you can put a line break between *any* character, except before punctuation end quote or after start quote. Are you SURE of that? I had many negative comments about undesirable line breaks in the middle of what is perceived as a single word, and where a single Kana moved to the next line was seen as bad, notably when it is a particle. I had similar comments from Korean users with Hangul. OK the traditional writing rules will allow putting breaks everywhere so that characters will line up equally in a grid, that would fill all free space in paper rolls, but today, with mixed use of half-width/full-width, mixed scripts, mixed font sizes or styles, etc... this traditional usage does not seem tolerable as it would be hard to read. Japanese users are now adept at fast-reading techniques, and breaking some words or concepts to the next line does not ease the understanding of text. Users today want better hyphenation of text (bad term because they don't use hyphens to mark it...), and they want style on it. Most commercial Asian websites are very colorful, and use many font sizes and styles, much more often than on European/American websites which look so monotonous for them... We don't perceive the same idea of what is ugly such as patchworked colors. Asian text is generally better shown with carefully chosen layouts so that words will be placed according to their meaning and relation. Web design in Japan is extremely creative. And there's a strong tradition in graphic arts.
Katakana and Kanji (was: Re: interleaved ordering (was RE: Phoenician))
Stefan Persson wrote: Mike Ayers wrote: I have not seen katakana joined to kanji (or romaji), and suspect that such does not occur. There are a few cases, e.g. ソ連 (So-Ren: Soviet Union), but that could also be written as two kanji as 蘇連 (which is however very rare in modern Japanese).

It's actually quite common, depending on how you choose to construe joined. Certainly, mixed katakana/kanji lexical items occur all the time.

Japanese for PGA:   puroogorufukyookai
                    ^^^^^^^^^^^=======
                    katakana   kanji

PGA Championship:   zenbeipuroo
                    ======^^^^^
                    kanji katakana

It's true that katakana aren't normally used the way okurigana are, to write out the grammatically changeable suffixal portion of verb stems written in kanji. But that's rather beside the point when kanji and katakana are rather freely mixed in nominal compounds of all sorts. By the way, the So-Ren example is just an abbreviation of the same kind of pattern I show above:

Japanese for Soviet Union:   sobietorenhoo ==> soren
                             ^^^^^^^======     ^^===
                             katakana kanji

This process is an onrushing, accelerating one. If you look at early 20th century Japanese materials, it is rather uncommon, but if you look at contemporary Japanese writing -- particularly the sort seen in popular culture, which is the leading edge of this kind of change, it is all over the place. Katakana is sweeping in as it carries with it all the English (and other) language material rapidly moving into Japanese, along with all the other popular functions of katakana. Other examples from corporate names:

fujizerokkusu (Fuji Xerox)
====^^^^^^^^^

tookyoogasu (Tokyo Gas)
=======^^^^

nihonai·bii·emu (IBM Japan)
=====^^ ^^^ ^^^

Then there's always that all-purpose fixer-upper:

nenchakuteepu (duct tape, adhesive tape)
========^^^^^

--Ken
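The katakana/kanji marking used in the diagrams above can be reproduced mechanically by splitting a word into runs of the same script. The sketch below classifies characters by code point range only, which is an assumption: U+30A0..U+30FF for katakana, U+3040..U+309F for hiragana and the basic CJK Unified Ideographs block U+4E00..U+9FFF for kanji. The sample words are native-script spellings of items mentioned in this thread, supplied here for illustration; the function names are hypothetical.

```python
from itertools import groupby


def kind(ch):
    """Classify one character by code point range (rough heuristic)."""
    cp = ord(ch)
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def script_runs(word):
    """Split a word into (kind, text) runs of consecutive same-kind characters."""
    return [(k, "".join(group)) for k, group in groupby(word, key=kind)]


def mixes_katakana_and_kanji(word):
    """True if the word contains both a katakana run and a kanji run."""
    kinds = {k for k, _ in script_runs(word)}
    return {"katakana", "kanji"} <= kinds


for word in ["ソ連", "東京ガス", "富士ゼロックス", "粘着テープ"]:
    print(word, script_runs(word), mixes_katakana_and_kanji(word))
```

Run over a contemporary corpus, a check like this would give a rough count of how common such mixed compounds are, which is the empirical question the thread was arguing about.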
RE: Subject lines that have nothing to do with message content
Of course, if ever there was a subject line that permitted the topic to wander howsoever far from where it started, the one on this thread is it. :-) Peter
Re: Thai Fongman and Khmer Phnek Moan
What little I know about the phnek moan makes it seem peculiar that its Line Break class is NS. Is there truly a distinction between how these two characters are used in their respective scripts that makes this difference warranted, Dunno. or is this a possible error in the standard Possibly. that deserves official scrutiny? Certainly, if it is wrong. By the way, this is the kind of thing which *can* be fixed in the standard, if shown to be problematical. This deserves some research by people who know something about how these characters do in fact behave in line-breaking, and then, if a change is in order, a documented proposal explaining the problem and the suggested fix could be submitted to the UTC for consideration. --Ken
RE: Thai Fongman and Khmer Phnek Moan
Insofar as both AL and NS are informative properties, how much does it matter? I cannot find any discussion of the Thai fongman in NECTEC's book on typography. It is described in the names list as a bullet. The Royal Institute's Thai dictionary defines ฟองมัน as name of a type of symbol used in old books to mark the beginning of a section [this word can also mean a paragraph or verse, or blank lines separating them] or the start of a line [either poetry or prose]. So, the description bullet seems reasonable. Other bullets have a breaking class of AL, so that seems appropriate for the Thai fongman. I have no info regarding the Khmer counterpart. Peter Peter Constable Globalization Infrastructure and Font Technologies Microsoft Windows Division
RE: Japanese line breaks (was: interleaved ordering)
Microsoft Office (Win and Mac) applications ensure that the line breaking is correct for East Asian Text. For example, in Microsoft Word, under Options | Asian Typography | First and Last Characters, you will find the following options for Japanese: Cannot Start Line with: !%),.:;?]} Cannot End Line with: $([\{ There are slight variations for Traditional Chinese, Simplified Chinese, Japanese, and Korean --- which are respected by Word as well. Han-yi -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Emerson Sent: Monday, May 10, 2004 7:39 PM To: Philippe Verdy Cc: Unicode Mailing List Subject: Re: Japanese line breaks (was: interleaved ordering) Philippe Verdy writes: From: Stefan Persson [EMAIL PROTECTED] In Japanese you can put a line break between *any* character, except before punctuation end quote or after start quote. Are you SURE of that? I had many negative comments about undesirable line breaks in the middle of what is perceived as a single word, and where a single Kana moved to the next line was seen as bad, notably when it is a particle. I had similar comments from Korean users with Hangul. We've found an amazing amount of variation in where breaks occur on text live on the web... breaks show up everywhere and anywhere, to the point where our Japanese morphological analyzer has to ignore whitespace (horizontal and vertical) in many situations.(*) There is a JIS standard for line breaking, though I don't have a copy of it here at home right now. I can look up the official rules tomorrow if people are interested. -tree (*) The worst case we've seen was the use of katakana and hiragana in ASCII art, Picasso's Guernica to be exact. Gave our analyzer a real fit for a while. -- Tom Emerson Basis Technology Corp. Software Architect http://www.basistech.com Beware the lollipop of mediocrity: lick it once and you suck forever
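The two character lists quoted from Word amount to a simple rule: a break between two characters is allowed unless the next character may not start a line or the previous one may not end a line. A minimal sketch of that rule follows. It uses only the ASCII characters as they appear in this message (the real lists presumably also contain full-width forms), counts width in characters, and forces a break when no permitted position exists; `can_break` and `wrap` are hypothetical names, not Word's behaviour.

```python
# Character lists exactly as quoted in the message above (ASCII forms only).
NO_LINE_START = set("!%),.:;?]}")
NO_LINE_END = set("$([\\{")


def can_break(text, i):
    """True if a line may end after text[i-1] and the next start at text[i]."""
    if i <= 0 or i >= len(text):
        return False
    return text[i] not in NO_LINE_START and text[i - 1] not in NO_LINE_END


def wrap(text, width):
    """Greedy wrap: break anywhere except at the prohibited positions."""
    lines = []
    start = 0
    while len(text) - start > width:
        cut = start + width
        # Back up until a permitted break position is found.
        while cut > start + 1 and not can_break(text, cut):
            cut -= 1
        if not can_break(text, cut):
            cut = start + width  # nothing permitted: force the break
        lines.append(text[start:cut])
        start = cut
    lines.append(text[start:])
    return lines


for line in wrap("あいう(えお)かき", 4):
    print(line)
```

A rule like this covers only the punctuation prohibitions; it does nothing about the complaints reported earlier in the thread that a particle or a single kana should not be stranded at the start of a line.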
interleaved ordering (was RE: Phoenician)
We do actually mix scripts. Hiragana and Katakana are interleaved. Mark And it might make sense to interleave (say) Thai and Lao in the default ordering. Or to interleave, in the default ordering, the Indic scripts covered by ISCII. Any peculiarities could be handled in tailorings. /kent k